| Title: | Assessment Tools for Regression Models with Discrete and Semicontinuous Outcomes |
|---|---|
| Description: | Provides assessment tools for regression models with discrete and semicontinuous outcomes proposed in Yang (2021) <https://doi.org/10.1080/10618600.2021.1910042>, Yang (2024) <https://doi.org/10.1080/10618600.2024.2303336>, Yang (2024) <https://doi.org/10.1093/biomtc/ujae007>, and Yang (2026) <https://doi.org/10.1002/cjs.70046>. It calculates the double probability integral transform (DPIT) residuals. It also constructsQQ plots of residuals the ordered curve for assessing mean structures, quasi-empirical distribution function for overall assessment, and a formal goodness-of-fit test. |
| Authors: | Lu Yang [aut], Jeonghwan Lee [cre, aut] |
| Maintainer: | Jeonghwan Lee <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 1.3.0 |
| Built: | 2026-05-21 09:02:01 UTC |
| Source: | https://github.com/jhlee1408/assessor |
This dataset provides annual statistics for Major League Baseball (MLB) players, including home run counts, at-bats, mean exit velocities, launch angles, quantile statistics of exit velocities and launch angles, and red zone metrics. It is intended for analyzing batted ball performance, with additional variables on the red zone, which is defined as balls in play with a launch angle between 20 and 35 degrees and an exit velocity of at least 95 mph.
bballHRbballHR
A data frame with the following columns:
Player's full name (character).
Player's unique identifier in the Lahman database (character).
Team abbreviation (character).
Season year (numeric).
Home runs hit during the season (integer).
At-bats during the season (integer).
Average exit velocity (mph) over the season (numeric).
Average launch angle (degrees) over the season (numeric).
Launch angle at the 75th percentile of the player's distribution (numeric).
Launch angle at the 70th percentile of the player's distribution (numeric).
Launch angle at the 65th percentile of the player's distribution (numeric).
Exit velocity at the 75th percentile of the player's distribution (numeric).
Exit velocity at the 80th percentile of the player's distribution (numeric).
Exit velocity at the 85th percentile of the player's distribution (numeric).
Seasonal count of batted balls in the red zone, defined as a launch angle between 20 and 35 degrees and an exit velocity greater than or equal to 95 mph (integer).
Proportion of batted balls that fall into the red zone (numeric).
Ballpark factor, indicating the effect of the player's home ballpark on offensive statistics (integer).
mean_exit_velo and mean_launch_angle represent the player's average exit velocities and launch angles, respectively, over the course of a season.
The launch_angle_x and exit_velo_x columns denote the upper -percentiles (e.g., 75th percentile) of the player's launch angle and exit velocity distributions for that year.
count_red_zone gives the number of balls in play that fall into the red zone, while prop_red_zone represents the proportion of balls in play in this category.
The Ballpark Factor (BPF) quantifies the influence of the player's home ballpark on offensive performance, with values above 100 indicating a hitter-friendly environment.
Player statistics: Lahman R Package
Batted ball data: Baseball Savant
Additional analysis: Patterns of Home Run Hitting in the Statcast Era by Jim Albert
data(bballHR) head(bballHR)data(bballHR) head(bballHR)
Calculates DPIT residuals for regression models with non-continuous outcomes.
In particular, model assumptions for GLMs with discrete outcomes (e.g., binary, Poisson, and negative binomial), ordinal
regression models, zero-inflated regression models, and semicontinuous outcome
models can be assessed using dpit().
dpit(model, plot=TRUE, scale="normal", line_args=list(), ...)dpit(model, plot=TRUE, scale="normal", line_args=list(), ...)
model |
A model object. |
plot |
A logical value indicating whether or not to return QQ-plot |
scale |
You can choose the scale of the residuals among |
line_args |
A named list of graphical parameters passed to
|
... |
Additional graphical arguments passed to
|
This function determines the appropriate computation based on the class of
model. The supported model objects and outcome types are listed below.
In addition to the class-based interface, the package also provides
distribution-specific DPIT calculators. If a fitted model comes from a
different class but has a supported outcome distribution, users can call
the corresponding distribution-based function directly.
For instance, for a regression model with Poisson outcomes,
one can use dpit to calculate the residuals if
the model is fit using glm function, or to use dpit_pois upon supplying fitted mean values.
Discrete outcomes
Zero-inflated discrete outcomes
zeroinfl with dist = "poisson"
(see dpit_zpois).
Semicontinuous outcomes
Tobit regression via tobit from AER or vglm from VGAM
(see dpit_tobit).
Tweedie regression via glm with a Tweedie family
(see dpit_tweedie).
Formulation for Discrete and Zero-Inflated Outcomes:
The DPIT residual for the th observation is defined as follows:
where
and refers to the fitted cumulative distribution function.
When scale="uniform", DPIT residuals should closely follow a uniform distribution, otherwise it implies model deficiency.
When scale="normal", it applies the normal quantile transformation to the DPIT residuals
The null pattern is the standard normal distribution in this case.
Formulation for Semicontinuous Outcomes:
The DPIT residuals for regression models with semi-continuous outcomes are
where is the fitted probability of zero, and is the fitted cumulative distribution function for the th observation. Furthermore,
where is the fitted cumulative distribution for the positive data.
DPIT residuals. If plot=TRUE, also produces a QQ plot.
L. Yang. Double probability integral transform residuals for regression models with discrete outcomes. Journal of Computational and Graphical Statistics, 33(3), pp.787–803, 2024.
L. Yang. Diagnostics for regression models with semicontinuous outcomes. Biometrics, 80(1), ujae007, 2024.
library(MASS) n <- 500 set.seed(1234) ## Negative Binomial example # Covariates x1 <- rnorm(n) x2 <- rbinom(n, 1, 0.7) ### Parameters beta0 <- -2 beta1 <- 2 beta2 <- 1 size1 <- 2 lambda1 <- exp(beta0 + beta1 * x1 + beta2 * x2) # generate outcomes y <- rnbinom(n, mu = lambda1, size = size1) # True model model1 <- glm.nb(y ~ x1 + x2) resid.nb1 <- dpit(model1, plot = TRUE, scale = "uniform") # Overdispersion model2 <- glm(y ~ x1 + x2, family = poisson(link = "log")) resid.nb2 <- dpit(model2, plot = TRUE, scale = "normal") ## Binary example n <- 500 set.seed(1234) # Covariates x1 <- rnorm(n, 1, 1) x2 <- rbinom(n, 1, 0.7) # Coefficients beta0 <- -5 beta1 <- 2 beta2 <- 1 beta3 <- 3 q1 <- 1 / (1 + exp(beta0 + beta1 * x1 + beta2 * x2 + beta3 * x1 * x2)) y1 <- rbinom(n, size = 1, prob = 1 - q1) # True model model01 <- glm(y1 ~ x1 * x2, family = binomial(link = "logit")) resid.bin1 <- dpit(model01, plot = TRUE) # Missing covariates model02 <- glm(y1 ~ x1, family = binomial(link = "logit")) resid.bin2 <- dpit(model02, plot = TRUE) ## Poisson example n <- 500 set.seed(1234) # Covariates x1 <- rnorm(n) x2 <- rbinom(n, 1, 0.7) # Coefficients beta0 <- -2 beta1 <- 2 beta2 <- 1 lambda1 <- exp(beta0 + beta1 * x1 + beta2 * x2) y <- rpois(n, lambda1) # True model poismodel1 <- glm(y ~ x1 + x2, family = poisson(link = "log")) resid.poi1 <- dpit(poismodel1, plot = TRUE) # Enlarge three outcomes y <- rpois(n, lambda1) + c(rep(0, (n - 3)), c(10, 15, 20)) poismodel2 <- glm(y ~ x1 + x2, family = poisson(link = "log")) resid.poi2 <- dpit(poismodel2, plot = TRUE) ## Ordinal example n <- 500 set.seed(1234) # Covariates x1 <- rnorm(n, mean = 2) # Coefficient beta1 <- 3 # True model p0 <- plogis(1, location = beta1 * x1) p1 <- plogis(4, location = beta1 * x1) - p0 p2 <- 1 - p0 - p1 genemult <- function(p) { rmultinom(1, size = 1, prob = c(p[1], p[2], p[3])) } test <- apply(cbind(p0, p1, p2), 1, genemult) y1 <- rep(0, n) y1[which(test[1, ] == 1)] <- 0 y1[which(test[2, ] == 1)] <- 1 y1[which(test[3, ] == 1)] <- 2 multimodel <- polr(as.factor(y1) ~ x1, method = "logistic") resid.ord1 <- dpit(multimodel, plot = TRUE) ## Non-Proportionality n <- 500 set.seed(1234) x1 <- rnorm(n, mean = 2) beta1 <- 3 beta2 <- 1 p0 <- plogis(1, location = beta1 * x1) p1 <- plogis(4, location = beta2 * x1) - p0 p2 <- 1 - p0 - p1 genemult <- function(p) { rmultinom(1, size = 1, prob = c(p[1], p[2], p[3])) } test <- apply(cbind(p0, p1, p2), 1, genemult) y1 <- rep(0, n) y1[which(test[1, ] == 1)] <- 0 y1[which(test[2, ] == 1)] <- 1 y1[which(test[3, ] == 1)] <- 2 multimodel <- polr(as.factor(y1) ~ x1, method = "logistic") resid.ord2 <- dpit(multimodel, plot = TRUE)library(MASS) n <- 500 set.seed(1234) ## Negative Binomial example # Covariates x1 <- rnorm(n) x2 <- rbinom(n, 1, 0.7) ### Parameters beta0 <- -2 beta1 <- 2 beta2 <- 1 size1 <- 2 lambda1 <- exp(beta0 + beta1 * x1 + beta2 * x2) # generate outcomes y <- rnbinom(n, mu = lambda1, size = size1) # True model model1 <- glm.nb(y ~ x1 + x2) resid.nb1 <- dpit(model1, plot = TRUE, scale = "uniform") # Overdispersion model2 <- glm(y ~ x1 + x2, family = poisson(link = "log")) resid.nb2 <- dpit(model2, plot = TRUE, scale = "normal") ## Binary example n <- 500 set.seed(1234) # Covariates x1 <- rnorm(n, 1, 1) x2 <- rbinom(n, 1, 0.7) # Coefficients beta0 <- -5 beta1 <- 2 beta2 <- 1 beta3 <- 3 q1 <- 1 / (1 + exp(beta0 + beta1 * x1 + beta2 * x2 + beta3 * x1 * x2)) y1 <- rbinom(n, size = 1, prob = 1 - q1) # True model model01 <- glm(y1 ~ x1 * x2, family = binomial(link = "logit")) resid.bin1 <- dpit(model01, plot = TRUE) # Missing covariates model02 <- glm(y1 ~ x1, family = binomial(link = "logit")) resid.bin2 <- dpit(model02, plot = TRUE) ## Poisson example n <- 500 set.seed(1234) # Covariates x1 <- rnorm(n) x2 <- rbinom(n, 1, 0.7) # Coefficients beta0 <- -2 beta1 <- 2 beta2 <- 1 lambda1 <- exp(beta0 + beta1 * x1 + beta2 * x2) y <- rpois(n, lambda1) # True model poismodel1 <- glm(y ~ x1 + x2, family = poisson(link = "log")) resid.poi1 <- dpit(poismodel1, plot = TRUE) # Enlarge three outcomes y <- rpois(n, lambda1) + c(rep(0, (n - 3)), c(10, 15, 20)) poismodel2 <- glm(y ~ x1 + x2, family = poisson(link = "log")) resid.poi2 <- dpit(poismodel2, plot = TRUE) ## Ordinal example n <- 500 set.seed(1234) # Covariates x1 <- rnorm(n, mean = 2) # Coefficient beta1 <- 3 # True model p0 <- plogis(1, location = beta1 * x1) p1 <- plogis(4, location = beta1 * x1) - p0 p2 <- 1 - p0 - p1 genemult <- function(p) { rmultinom(1, size = 1, prob = c(p[1], p[2], p[3])) } test <- apply(cbind(p0, p1, p2), 1, genemult) y1 <- rep(0, n) y1[which(test[1, ] == 1)] <- 0 y1[which(test[2, ] == 1)] <- 1 y1[which(test[3, ] == 1)] <- 2 multimodel <- polr(as.factor(y1) ~ x1, method = "logistic") resid.ord1 <- dpit(multimodel, plot = TRUE) ## Non-Proportionality n <- 500 set.seed(1234) x1 <- rnorm(n, mean = 2) beta1 <- 3 beta2 <- 1 p0 <- plogis(1, location = beta1 * x1) p1 <- plogis(4, location = beta2 * x1) - p0 p2 <- 1 - p0 - p1 genemult <- function(p) { rmultinom(1, size = 1, prob = c(p[1], p[2], p[3])) } test <- apply(cbind(p0, p1, p2), 1, genemult) y1 <- rep(0, n) y1[which(test[1, ] == 1)] <- 0 y1[which(test[2, ] == 1)] <- 1 y1[which(test[3, ] == 1)] <- 2 multimodel <- polr(as.factor(y1) ~ x1, method = "logistic") resid.ord2 <- dpit(multimodel, plot = TRUE)
Calculates DPIT residuals with model for semi-continuous outcomes.
dpit_2pm can be used either with model0 and model1 or with part0 and part1 as arguments.
dpit_2pm(model0, model1, y, part0, part1, plot=TRUE, scale = "normal", line_args= list(), ...)dpit_2pm(model0, model1, y, part0, part1, plot=TRUE, scale = "normal", line_args= list(), ...)
model0 |
Model object for 0 outcomes (e.g., logistic regression) |
model1 |
Model object for the continuous part (gamma regression) |
y |
Semicontinuous outcomes. |
part0 |
Alternative argument to |
part1 |
Alternative argument to |
plot |
A logical value indicating whether or not to return QQ-plot |
scale |
You can choose the scale of the residuals among |
line_args |
A named list of graphical parameters passed to
|
... |
Additional graphical arguments passed to
|
For formulation details on semicontinuous outcomes, see dpit.
In two-part models, the probability of zero can be modeled using a logistic regression, model0,
while the positive observations can be modeled using a gamma regression, model1.
Users can choose to use different models and supply the resulting probabilities of zero and probability integral transforms.
part0 should be the sequence of fitted probabilities of zeros .
part1 should be the probability integral transform of the positive part .
Note that the length of part1 is the number of positive values in y and can be shorter than part0.
Residuals. If plot=TRUE, also produces a QQ plot.
library(MASS) n <- 500 beta10 <- 1 beta11 <- -2 beta12 <- -1 beta13 <- -1 beta14 <- -1 beta15 <- -2 x11 <- rnorm(n) x12 <- rbinom(n, size = 1, prob = 0.4) p1 <- 1 / (1 + exp(-(beta10 + x11 * beta11 + x12 * beta12))) lambda1 <- exp(beta13 + beta14 * x11 + beta15 * x12) y2 <- rgamma(n, scale = lambda1 / 2, shape = 2) y <- rep(0, n) u <- runif(n, 0, 1) ind1 <- which(u >= p1) y[ind1] <- y2[ind1] # models as input mgamma <- glm(y[ind1] ~ x11[ind1] + x12[ind1], family = Gamma(link = "log")) m10 <- glm(y == 0 ~ x12 + x11, family = binomial(link = "logit")) resid.model <- dpit_2pm(model0 = m10, model1 = mgamma, y = y) # PIT as input cdfgamma <- pgamma(y[ind1], scale = mgamma$fitted.values * gamma.dispersion(mgamma), shape = 1 / gamma.dispersion(mgamma) ) p1f <- m10$fitted.values resid.pit <- dpit_2pm(y = y, part0 = p1f, part1 = cdfgamma)library(MASS) n <- 500 beta10 <- 1 beta11 <- -2 beta12 <- -1 beta13 <- -1 beta14 <- -1 beta15 <- -2 x11 <- rnorm(n) x12 <- rbinom(n, size = 1, prob = 0.4) p1 <- 1 / (1 + exp(-(beta10 + x11 * beta11 + x12 * beta12))) lambda1 <- exp(beta13 + beta14 * x11 + beta15 * x12) y2 <- rgamma(n, scale = lambda1 / 2, shape = 2) y <- rep(0, n) u <- runif(n, 0, 1) ind1 <- which(u >= p1) y[ind1] <- y2[ind1] # models as input mgamma <- glm(y[ind1] ~ x11[ind1] + x12[ind1], family = Gamma(link = "log")) m10 <- glm(y == 0 ~ x12 + x11, family = binomial(link = "logit")) resid.model <- dpit_2pm(model0 = m10, model1 = mgamma, y = y) # PIT as input cdfgamma <- pgamma(y[ind1], scale = mgamma$fitted.values * gamma.dispersion(mgamma), shape = 1 / gamma.dispersion(mgamma) ) p1f <- m10$fitted.values resid.pit <- dpit_2pm(y = y, part0 = p1f, part1 = cdfgamma)
Computes DPIT residuals for regression models with binary outcomes
using the observed responses (y) and their fitted distributional parameters(prob).
dpit_bin(y, prob, plot=TRUE, scale="normal", line_args=list(), ...)dpit_bin(y, prob, plot=TRUE, scale="normal", line_args=list(), ...)
y |
An observed outcome vector. |
prob |
A vector of fitted probabilities of one. |
plot |
A logical value indicating whether or not to return QQ-plot |
scale |
You can choose the scale of the residuals among |
line_args |
A named list of graphical parameters passed to
|
... |
Additional graphical arguments passed to
|
For formulation details on discrete outcomes, see dpit_pois.
DPIT residuals.
## Binary example n <- 500 set.seed(1234) # Covariates x1 <- rnorm(n, 1, 1) x2 <- rbinom(n, 1, 0.7) # Coefficients beta0 <- -5 beta1 <- 2 beta2 <- 1 beta3 <- 3 q1 <- 1 / (1 + exp(beta0 + beta1 * x1 + beta2 * x2 + beta3 * x1 * x2)) y1 <- rbinom(n, size = 1, prob = 1 - q1) # True model model01 <- glm(y1 ~ x1 * x2, family = binomial(link = "logit")) fitted1 <- fitted(model01) y1 <- model01$y resid.bin1 <- dpit_bin(y=y1, prob=fitted1) # Missing covariates model02 <- glm(y1 ~ x1, family = binomial(link = "logit")) y2 <- model02$y fitted2 <- fitted(model02) resid.bin2 <- dpit_bin(y=y2, prob=fitted2)## Binary example n <- 500 set.seed(1234) # Covariates x1 <- rnorm(n, 1, 1) x2 <- rbinom(n, 1, 0.7) # Coefficients beta0 <- -5 beta1 <- 2 beta2 <- 1 beta3 <- 3 q1 <- 1 / (1 + exp(beta0 + beta1 * x1 + beta2 * x2 + beta3 * x1 * x2)) y1 <- rbinom(n, size = 1, prob = 1 - q1) # True model model01 <- glm(y1 ~ x1 * x2, family = binomial(link = "logit")) fitted1 <- fitted(model01) y1 <- model01$y resid.bin1 <- dpit_bin(y=y1, prob=fitted1) # Missing covariates model02 <- glm(y1 ~ x1, family = binomial(link = "logit")) y2 <- model02$y fitted2 <- fitted(model02) resid.bin2 <- dpit_bin(y=y2, prob=fitted2)
Computes DPIT residuals for regression models with negative binomial
outcomes using the observed counts (y) and their fitted distributional
parameters (mu, size).
dpit_nb(y, mu, size, plot=TRUE, scale="normal", line_args=list(), ...)dpit_nb(y, mu, size, plot=TRUE, scale="normal", line_args=list(), ...)
y |
An observed outcome vector. |
mu |
A vector of fitted mean values. |
size |
A dispersion parameter of the negative binomial distribution. |
plot |
A logical value indicating whether or not to return QQ-plot |
scale |
You can choose the scale of the residuals among |
line_args |
A named list of graphical parameters passed to
|
... |
Additional graphical arguments passed to
|
For formulation details on discrete outcomes, see dpit.
DPIT residuals.
## Negative Binomial example library(MASS) n <- 500 x1 <- rnorm(n) x2 <- rbinom(n, 1, 0.7) ### Parameters beta0 <- -2 beta1 <- 2 beta2 <- 1 size1 <- 2 lambda1 <- exp(beta0 + beta1 * x1 + beta2 * x2) # generate outcomes y <- rnbinom(n, mu = lambda1, size = size1) # True model model1 <- glm.nb(y ~ x1 + x2) y1 <- model1$y fitted1 <- fitted(model1) size1 <- model1$theta resid.nb1 <- dpit_nb(y=y1, mu=fitted1, size=size1) # Overdispersion model2 <- glm(y ~ x1 + x2, family = poisson(link = "log")) y2 <- model2$y fitted2 <- fitted(model2) resid.nb2 <- dpit_pois(y=y2, mu=fitted2)## Negative Binomial example library(MASS) n <- 500 x1 <- rnorm(n) x2 <- rbinom(n, 1, 0.7) ### Parameters beta0 <- -2 beta1 <- 2 beta2 <- 1 size1 <- 2 lambda1 <- exp(beta0 + beta1 * x1 + beta2 * x2) # generate outcomes y <- rnbinom(n, mu = lambda1, size = size1) # True model model1 <- glm.nb(y ~ x1 + x2) y1 <- model1$y fitted1 <- fitted(model1) size1 <- model1$theta resid.nb1 <- dpit_nb(y=y1, mu=fitted1, size=size1) # Overdispersion model2 <- glm(y ~ x1 + x2, family = poisson(link = "log")) y2 <- model2$y fitted2 <- fitted(model2) resid.nb2 <- dpit_pois(y=y2, mu=fitted2)
Computes DPIT residuals for regression models with ordinal outcomes
using observed outcomes (y), ordinal outcome levels (level) and their fitted category
probabilities (fitprob).
dpit_ordi(y, level, fitprob, plot=TRUE, scale="normal", line_args=list(), ...)dpit_ordi(y, level, fitprob, plot=TRUE, scale="normal", line_args=list(), ...)
y |
An observed ordinal outcome vector. |
level |
The names of the response levels. For instance, c(0,1,2). |
fitprob |
A matrix of fitted category probabilities. Each row corresponds to an observation, and column j contains the fitted probability P(Y_i = j). |
plot |
A logical value indicating whether or not to return QQ-plot |
scale |
You can choose the scale of the residuals among |
line_args |
A named list of graphical parameters passed to
|
... |
Additional graphical arguments passed to
|
For formulation details on discrete outcomes, see dpit.
DPIT residuals.
## Ordinal example library(MASS) n <- 500 x1 <- rnorm(n, mean = 2) beta1 <- 3 # True model p0 <- plogis(1, location = beta1 * x1) p1 <- plogis(4, location = beta1 * x1) - p0 p2 <- 1 - p0 - p1 genemult <- function(p) { rmultinom(1, size = 1, prob = c(p[1], p[2], p[3])) } test <- apply(cbind(p0, p1, p2), 1, genemult) y1 <- rep(0, n) y1[which(test[1, ] == 1)] <- 0 y1[which(test[2, ] == 1)] <- 1 y1[which(test[3, ] == 1)] <- 2 multimodel <- polr(as.factor(y1) ~ x1, method = "logistic") y1 <- multimodel$model[,1] lev1 <- multimodel$lev fitprob1 <- fitted(multimodel) resid.ord <- dpit_ordi(y=y1, level=lev1, fitprob=fitprob1)## Ordinal example library(MASS) n <- 500 x1 <- rnorm(n, mean = 2) beta1 <- 3 # True model p0 <- plogis(1, location = beta1 * x1) p1 <- plogis(4, location = beta1 * x1) - p0 p2 <- 1 - p0 - p1 genemult <- function(p) { rmultinom(1, size = 1, prob = c(p[1], p[2], p[3])) } test <- apply(cbind(p0, p1, p2), 1, genemult) y1 <- rep(0, n) y1[which(test[1, ] == 1)] <- 0 y1[which(test[2, ] == 1)] <- 1 y1[which(test[3, ] == 1)] <- 2 multimodel <- polr(as.factor(y1) ~ x1, method = "logistic") y1 <- multimodel$model[,1] lev1 <- multimodel$lev fitprob1 <- fitted(multimodel) resid.ord <- dpit_ordi(y=y1, level=lev1, fitprob=fitprob1)
Computes DPIT residuals for Poisson outcomes regression using the observed counts (y) and their
corresponding fitted mean values (mu).
dpit_pois(y, mu, plot=TRUE, scale="normal", line_args=list(), ...)dpit_pois(y, mu, plot=TRUE, scale="normal", line_args=list(), ...)
y |
An observed outcome vector. |
mu |
A vector of fitted mean values. |
plot |
A logical value indicating whether or not to return QQ-plot |
scale |
You can choose the scale of the residuals among |
line_args |
A named list of graphical parameters passed to
|
... |
Additional graphical arguments passed to
|
For formulation details on discrete outcomes, see dpit.
DPIT residuals.
## Poisson example n <- 500 set.seed(1234) # Covariates x1 <- rnorm(n) x2 <- rbinom(n, 1, 0.7) # Coefficients beta0 <- -2 beta1 <- 2 beta2 <- 1 lambda1 <- exp(beta0 + beta1 * x1 + beta2 * x2) y <- rpois(n, lambda1) # True model poismodel <- glm(y ~ x1 + x2, family = poisson(link = "log")) y1 <- poismodel$y p1f <- fitted(poismodel) resid.poi <- dpit_pois(y=y1, mu=p1f)## Poisson example n <- 500 set.seed(1234) # Covariates x1 <- rnorm(n) x2 <- rbinom(n, 1, 0.7) # Coefficients beta0 <- -2 beta1 <- 2 beta2 <- 1 lambda1 <- exp(beta0 + beta1 * x1 + beta2 * x2) y <- rpois(n, lambda1) # True model poismodel <- glm(y ~ x1 + x2, family = poisson(link = "log")) y1 <- poismodel$y p1f <- fitted(poismodel) resid.poi <- dpit_pois(y=y1, mu=p1f)
Computes DPIT residuals for tobit regression models using the observed
responses (y) and their corresponding fitted distributional parameters (mu, sd).
dpit_tobit(y, mu, sd, plot=TRUE, scale="normal", line_args=list(), ...)dpit_tobit(y, mu, sd, plot=TRUE, scale="normal", line_args=list(), ...)
y |
An observed outcome vector. |
mu |
A vector of fitted mean values of latent variables. |
sd |
A standard deviation of latent variables. |
plot |
A logical value indicating whether or not to return QQ-plot |
scale |
You can choose the scale of the residuals among |
line_args |
A named list of graphical parameters passed to
|
... |
Additional graphical arguments passed to
|
For formulation details on semicontinuous outcomes, see dpit.
DPIT residuals.
## Tobit regression model library(VGAM) n <- 500 beta13 <- 1 beta14 <- -3 beta15 <- 3 set.seed(1234) x11 <- runif(n) x12 <- runif(n) lambda1 <- beta13 + beta14 * x11 + beta15 * x12 sd0 <- 0.3 yun <- rnorm(n, mean = lambda1, sd = sd0) y <- ifelse(yun >= 0, yun, 0) # Using VGAM package # True model fit1 <- vglm(formula = y ~ x11 + x12, tobit(Upper = Inf, Lower = 0, lmu = "identitylink")) # Missing covariate fit1miss <- vglm(formula = y ~ x11, tobit(Upper = Inf, Lower = 0, lmu = "identitylink")) resid.tobit1 <- dpit_tobit(y = y, mu = VGAM::fitted(fit1), sd = sd0) resid.tobit2 <- dpit_tobit(y = y, mu = VGAM::fitted(fit1miss), sd = sd0) # Using AER package library(AER) # True model fit2 <- tobit(y ~ x11 + x12, left = 0, right = Inf, dist = "gaussian") # Missing covariate fit2miss <- tobit(y ~ x11, left = 0, right = Inf, dist = "gaussian") resid.aer1 <- dpit_tobit(y = y, mu = fitted(fit2), sd = sd0) resid.aer2 <- dpit_tobit(y = y, mu = fitted(fit2miss), sd = sd0)## Tobit regression model library(VGAM) n <- 500 beta13 <- 1 beta14 <- -3 beta15 <- 3 set.seed(1234) x11 <- runif(n) x12 <- runif(n) lambda1 <- beta13 + beta14 * x11 + beta15 * x12 sd0 <- 0.3 yun <- rnorm(n, mean = lambda1, sd = sd0) y <- ifelse(yun >= 0, yun, 0) # Using VGAM package # True model fit1 <- vglm(formula = y ~ x11 + x12, tobit(Upper = Inf, Lower = 0, lmu = "identitylink")) # Missing covariate fit1miss <- vglm(formula = y ~ x11, tobit(Upper = Inf, Lower = 0, lmu = "identitylink")) resid.tobit1 <- dpit_tobit(y = y, mu = VGAM::fitted(fit1), sd = sd0) resid.tobit2 <- dpit_tobit(y = y, mu = VGAM::fitted(fit1miss), sd = sd0) # Using AER package library(AER) # True model fit2 <- tobit(y ~ x11 + x12, left = 0, right = Inf, dist = "gaussian") # Missing covariate fit2miss <- tobit(y ~ x11, left = 0, right = Inf, dist = "gaussian") resid.aer1 <- dpit_tobit(y = y, mu = fitted(fit2), sd = sd0) resid.aer2 <- dpit_tobit(y = y, mu = fitted(fit2miss), sd = sd0)
Computes DPIT residuals for Tweedie-distributed outcomes using the observed responses (y),
their fitted mean values (mu), the variance power parameter
(), and the dispersion parameter ().
dpit_tweedie(y, mu, xi, phi, plot=TRUE, scale="normal", line_args=list(), ...)dpit_tweedie(y, mu, xi, phi, plot=TRUE, scale="normal", line_args=list(), ...)
y |
Observed outcome vector. |
mu |
Vector of fitted mean values of each outcomes. |
xi |
Value of |
phi |
Dispersion parameter |
plot |
A logical value indicating whether or not to return QQ-plot The sample quantiles of the residuals are plotted against |
scale |
You can choose the scale of the residuals among |
line_args |
A named list of graphical parameters passed to
|
... |
Additional graphical arguments passed to
|
For formulation details on semicontinuous outcomes, see dpit.
DPIT residuals.
## Tweedie model library(tweedie) library(statmod) n <- 500 x11 <- rnorm(n) x12 <- rnorm(n) beta0 <- 5 beta1 <- 1 beta2 <- 1 lambda1 <- exp(beta0 + beta1 * x11 + beta2 * x12) y1 <- rtweedie(n, mu = lambda1, xi = 1.6, phi = 10) # Choose parameter p # True model model1 <- glm(y1 ~ x11 + x12, family = tweedie(var.power = 1.6, link.power = 0) ) y1 <- model1$y p.max <- get("p", envir = environment(model1$family$variance)) lambda1f <- model1$fitted.values phi1f <- summary(model1)$dis resid.tweedie <- dpit_tweedie(y= y1, mu=lambda1f, xi=p.max, phi=phi1f)## Tweedie model library(tweedie) library(statmod) n <- 500 x11 <- rnorm(n) x12 <- rnorm(n) beta0 <- 5 beta1 <- 1 beta2 <- 1 lambda1 <- exp(beta0 + beta1 * x11 + beta2 * x12) y1 <- rtweedie(n, mu = lambda1, xi = 1.6, phi = 10) # Choose parameter p # True model model1 <- glm(y1 ~ x11 + x12, family = tweedie(var.power = 1.6, link.power = 0) ) y1 <- model1$y p.max <- get("p", envir = environment(model1$family$variance)) lambda1f <- model1$fitted.values phi1f <- summary(model1)$dis resid.tweedie <- dpit_tweedie(y= y1, mu=lambda1f, xi=p.max, phi=phi1f)
Computes DPIT residuals for regression models with zero-inflated negative
binomial outcomes using the observed counts (y) and their fitted distributional
parameters (mu, pzero, size).
dpit_znb(y, mu, pzero, size, plot=TRUE, scale="normal", line_args=list(), ...)dpit_znb(y, mu, pzero, size, plot=TRUE, scale="normal", line_args=list(), ...)
y |
An observed outcome vector. |
mu |
A vector of fitted mean values for the count (non-zero) component. |
pzero |
A vector of fitted probabilities for the zero-inflation component. |
size |
A dispersion parameter of the negative binomial distribution. |
plot |
A logical value indicating whether or not to return QQ-plot |
scale |
You can choose the scale of the residuals among |
line_args |
A named list of graphical parameters passed to
|
... |
Additional graphical arguments passed to
|
For formulation details on discrete outcomes, see dpit.
DPIT residuals.
## Zero-Inflated Negative Binomial library(pscl) n <- 500 set.seed(1234) # Covariates x1 <- rnorm(n) x2 <- rbinom(n, 1, 0.7) # Coefficients beta0 <- -2 beta1 <- 2 beta2 <- 1 beta00 <- -2 beta10 <- 2 # NB dispersion (size = theta; larger => closer to Poisson) theta_true <- 1.2 # Mean of NB count part mu_true <- exp(beta0 + beta1 * x1 + beta2 * x2) # Excess zero probability (logit) p0 <- 1 / (1 + exp(-(beta00 + beta10 * x1))) ## simulate outcomes z <- rbinom(n, size = 1, prob = 1 - p0) # 1 => from NB, 0 => structural zero y1 <- rnbinom(n, size = theta_true, mu = mu_true) # NB count draw y <- ifelse(z == 0, 0, y1) ## True model modelzero1 <- zeroinfl(y ~ x1 + x2 | x1, dist = "negbin", link = "logit") y1 <- modelzero1$y mu1 <- stats::predict(modelzero1, type = "count") pzero1 <- stats::predict(modelzero1, type = "zero") theta1 <- modelzero1$theta resid.zero1 <- dpit_znb(y = y1, pzero = pzero1, mu = mu1, size = theta1) ## Ignoring zero-inflation: NB only modelzero2 <- MASS::glm.nb(y ~ x1 + x2) y2 <- modelzero2$y mu2 <- fitted(modelzero2) theta2 <- modelzero2$theta resid.zero2 <- dpit_nb(y = y2, mu = mu2, size = theta2)## Zero-Inflated Negative Binomial library(pscl) n <- 500 set.seed(1234) # Covariates x1 <- rnorm(n) x2 <- rbinom(n, 1, 0.7) # Coefficients beta0 <- -2 beta1 <- 2 beta2 <- 1 beta00 <- -2 beta10 <- 2 # NB dispersion (size = theta; larger => closer to Poisson) theta_true <- 1.2 # Mean of NB count part mu_true <- exp(beta0 + beta1 * x1 + beta2 * x2) # Excess zero probability (logit) p0 <- 1 / (1 + exp(-(beta00 + beta10 * x1))) ## simulate outcomes z <- rbinom(n, size = 1, prob = 1 - p0) # 1 => from NB, 0 => structural zero y1 <- rnbinom(n, size = theta_true, mu = mu_true) # NB count draw y <- ifelse(z == 0, 0, y1) ## True model modelzero1 <- zeroinfl(y ~ x1 + x2 | x1, dist = "negbin", link = "logit") y1 <- modelzero1$y mu1 <- stats::predict(modelzero1, type = "count") pzero1 <- stats::predict(modelzero1, type = "zero") theta1 <- modelzero1$theta resid.zero1 <- dpit_znb(y = y1, pzero = pzero1, mu = mu1, size = theta1) ## Ignoring zero-inflation: NB only modelzero2 <- MASS::glm.nb(y ~ x1 + x2) y2 <- modelzero2$y mu2 <- fitted(modelzero2) theta2 <- modelzero2$theta resid.zero2 <- dpit_nb(y = y2, mu = mu2, size = theta2)
Computes DPIT residuals for regression models with zero-inflated Poisson
outcomes using the observed counts(y) and their fitted distributional
parameters(mu, pzero).
dpit_zpois(y, mu, pzero, plot=TRUE, scale="normal", line_args=list(), ...)dpit_zpois(y, mu, pzero, plot=TRUE, scale="normal", line_args=list(), ...)
y |
An observed outcome vector. |
mu |
A vector of fitted mean values for the count (non-zero) component. |
pzero |
A vector of fitted probabilities for the zero-inflation component. |
plot |
A logical value indicating whether or not to return QQ-plot |
scale |
You can choose the scale of the residuals among |
line_args |
A named list of graphical parameters passed to
|
... |
Additional graphical arguments passed to
|
For formulation details on discrete outcomes, see dpit.
DPIT residuals.
## Zero-Inflated Poisson library(pscl) n <- 500 set.seed(1234) # Covariates x1 <- rnorm(n) x2 <- rbinom(n, 1, 0.7) # Coefficients beta0 <- -2 beta1 <- 2 beta2 <- 1 beta00 <- -2 beta10 <- 2 # Mean of Poisson part lambda1 <- exp(beta0 + beta1 * x1 + beta2 * x2) # Excess zero probability p0 <- 1 / (1 + exp(-(beta00 + beta10 * x1))) ## simulate outcomes y0 <- rbinom(n, size = 1, prob = 1 - p0) y1 <- rpois(n, lambda1) y <- ifelse(y0 == 0, 0, y1) ## True model modelzero1 <- zeroinfl(y ~ x1 + x2 | x1, dist = "poisson", link = "logit") y1 <- modelzero1$y mu1 <- stats::predict(modelzero1, type = "count") pzero1 <- stats::predict(modelzero1, type = "zero") resid.zero1 <- dpit_zpois(y= y1, pzero=pzero1, mu=mu1) ## Zero inflation modelzero2 <- glm(y ~ x1 + x2, family = poisson(link = "log")) y2 <- modelzero2$y mu2 <- fitted(modelzero2) resid.zero2 <- dpit_pois(y= y2, mu=mu2)## Zero-Inflated Poisson library(pscl) n <- 500 set.seed(1234) # Covariates x1 <- rnorm(n) x2 <- rbinom(n, 1, 0.7) # Coefficients beta0 <- -2 beta1 <- 2 beta2 <- 1 beta00 <- -2 beta10 <- 2 # Mean of Poisson part lambda1 <- exp(beta0 + beta1 * x1 + beta2 * x2) # Excess zero probability p0 <- 1 / (1 + exp(-(beta00 + beta10 * x1))) ## simulate outcomes y0 <- rbinom(n, size = 1, prob = 1 - p0) y1 <- rpois(n, lambda1) y <- ifelse(y0 == 0, 0, y1) ## True model modelzero1 <- zeroinfl(y ~ x1 + x2 | x1, dist = "poisson", link = "logit") y1 <- modelzero1$y mu1 <- stats::predict(modelzero1, type = "count") pzero1 <- stats::predict(modelzero1, type = "zero") resid.zero1 <- dpit_zpois(y= y1, pzero=pzero1, mu=mu1) ## Zero inflation modelzero2 <- glm(y ~ x1 + x2, family = poisson(link = "log")) y2 <- modelzero2$y mu2 <- fitted(modelzero2) resid.zero2 <- dpit_pois(y= y2, mu=mu2)
Goodness-of-fit test for discrete-outcome regression models.
Works with GLMs (Poisson, binomial/logistic, negative binomial),
ordinal outcome regression (MASS::polr), and
zero-inflated regressions (zero-inflated Poisson and negative binomial via pscl::zeroinfl()).
gof_disc(model, B=1e2, seed=NULL)gof_disc(model, B=1e2, seed=NULL)
model |
A fitted model object (e.g., from |
B |
Number of bootstrap samples. Default is 1e2. |
seed |
random seed for bootstrap. |
Let denote independent observations, and let
be the fitted model-based CDF.
It was shown in Yang (2025) that under the correctly specified model,
should be close to the identity function, where
The test statistic
measures the deviation of from the
identity function, with p-values obtained by bootstrap. This method
complements residual-based diagnostics by providing a formal check of model adequacy.
Test statistics and p-values
Yang L, Genest C, Neslehova J (2025). “A goodness-of-fit test for regression models with discrete outcomes.” Canadian Journal of Statistics
library(MASS) library(pscl) n <- 500 B <- 1000 beta1 <- 1; beta2 <- 1 beta0 <- -2; beta00 <- -2; beta10 <- 2 size1 <- 2 set.seed(1) x1 <- rnorm(n) x2 <- rbinom(n,1,0.7) lambda1 <- exp(beta0 + beta1 * x1 + beta2 * x2) p0 <- 1 / (1 + exp(-(beta00 + beta10 * x1))) y0 <- rbinom(n, size = 1, prob = 1 - p0) y1 <- rnegbin(n, mu=lambda1, theta=size1) y <- ifelse(y0 == 0, 0, y1) model1 <- zeroinfl(y ~ x1 + x2 | x1, dist = "negbin", link = "logit") gof_disc(model1)library(MASS) library(pscl) n <- 500 B <- 1000 beta1 <- 1; beta2 <- 1 beta0 <- -2; beta00 <- -2; beta10 <- 2 size1 <- 2 set.seed(1) x1 <- rnorm(n) x2 <- rbinom(n,1,0.7) lambda1 <- exp(beta0 + beta1 * x1 + beta2 * x2) p0 <- 1 / (1 + exp(-(beta00 + beta10 * x1))) y0 <- rbinom(n, size = 1, prob = 1 - p0) y1 <- rnegbin(n, mu=lambda1, theta=size1) y <- ifelse(y0 == 0, 0, y1) model1 <- zeroinfl(y ~ x1 + x2 | x1, dist = "negbin", link = "logit") gof_disc(model1)
Data from the Wisconsin Local Government Property Insurance Fund (LGPIF). The LGPIF was established to provide property insurance for local government entities that include counties, cities, towns, villages, school districts, fire departments, and other miscellaneous entities, and is administered by the Wisconsin Office of the Insurance Commissioner. Properties covered under this fund include government buildings, vehicles, and equipment.
LGPIFLGPIF
A data frame with 5677 rows and 41 variables:
PolicyNumPolicy number
YearPolicy year
ClaimBCTotal building and contents (BC) claims in the year
ClaimIMTotal inland marine (IM) claims in the year (contractor’s equipment)
ClaimPNTotal comprehensive claims from new motor vehicles in the year
ClaimPOTotal comprehensive claims from old motor vehicles in the year
ClaimCNTotal collision claims from new vehicles in the year
ClaimCOTotal collision claims from old vehicles in the year
TypeCityIndicator for city entity
TypeCountyIndicator for county entity
TypeMiscIndicator for miscellaneous entity
TypeSchoolIndicator for school entity
TypeTownIndicator for town entity
TypeVillageIndicator for village entity
IsRCIndicator for replacement cost (motor vehicles)
CoverageBCLog coverage amount for building and contents (in millions of dollars)
lnDeductBCLog deductible amount for building and contents
NoClaimCreditBCIndicator for no BC claims in prior year
yAvgBCAverage BC claim amount
FreqBCFrequency of BC claims
CoverageIMLog coverage amount for inland marine (in millions of dollars)
lnDeductIMLog deductible amount for inland marine
NoClaimCreditIMIndicator for no IM claims in prior year
yAvgIMAverage IM claim amount
FreqIMFrequency of IM claims
CoveragePNLog coverage amount for comprehensive new vehicles (in millions of dollars)
NoClaimCreditPNIndicator for no PN claims in prior year
yAvgPNAverage PN claim amount
FreqPNFrequency of PN claims
CoveragePOLog coverage amount for comprehensive old vehicles (in millions of dollars)
NoClaimCreditPOIndicator for no PO claims in prior year
yAvgPOAverage PO claim amount
FreqPOFrequency of PO claims
CoverageCNLog coverage amount for collision of new vehicles (in millions of dollars)
NoClaimCreditCNIndicator for no CN claims in prior year
yAvgCNAverage CN claim amount
FreqCNFrequency of CN claims
CoverageCOLog coverage amount for collision of old vehicles (in millions of dollars)
NoClaimCreditCOIndicator for no CO claims in prior year
yAvgCOAverage CO claim amount
FreqCOFrequency of CO claims
https://sites.google.com/a/wisc.edu/jed-frees/
Frees, E. W., Lee, G., & Yang, L. (2016). Multivariate frequency-severity regression models in insurance. Risks, 4(1), 4.
Healthcare expenditure data set.
MEPSMEPS
A data frame with 29784 rows and 29 variables:
EXPthe aggregate annual office based expenditure per participants, semicontinuous outcomes
AGEAge
GENDER1 if female
ASIAN1 if Asian
BLACK1 if Black
NORTHEAST1 if Northeast
MIDWEST1 if Midwest
SOUTH1 if South
USC1 if have usual source of care
COLLEGE1 if colleage or higher degrees
HIGHSCH1 if high school degree
MARRIED1 if married
WIDIVSEP1 if widowed or divorced or separated
FAMSIZEFamily Size
HINCOME1 if high income
MINCOME1 if middle income
LINCOME1 if low income
NPOOR1 if near poor
POOR1 if poor
FAIR1 if fair
GOOD1 if good
VGOOD1 if very good
MNHPOOR1 if poor or fair mental health
ANYLIMIT1 if any functional or activity limitation
unemployed1 if unemployed at the beginning of 2006
EDUCHEALTH1 if education, health and social services
PUBADMIN1 if public administration
insured1 if is insured at the beginning of the year 2006
MANAGEDCAREif enrolled in an HMO or a gatekeeper plan
http://www.meps.ahrq.gov/mepsweb/
Creates a plot to assess the mean structure of regression models. The plot compares the cumulative sum of the response variable and its hypothesized value. Deviation from the diagonal suggests the possibility that the mean structure of the model is incorrect.
ord_curve(model, thr, line_args=list(), ...)ord_curve(model, thr, line_args=list(), ...)
model |
Regression model object (e.g., |
thr |
Threshold variable (e.g., predictor, fitted values, or variable to be included as a covariate) |
line_args |
A named list of graphical parameters passed to
|
... |
Additional graphical arguments passed to
|
The ordered curve plots
against
where is the fitted mean, and is the threshold variable.
If the mean structure is correctly specified in the model,
and should be close to each other.
If the curve is distant from the diagonal, it suggests incorrectness in the mean structure.
Moreover, if the curve is above the diagonal, the summation of the response is larger than
the fitted mean, which implies that the mean is underestimated, and vice versa.
The role of thr (threshold variable ) is to determine the rule for accumulating and ,
for the ordered curve.
The candidate for thr could be any function of predictors such as a single predictor (e.g., x1),
a linear combination of predictor (e.g., x1+x2), or fitted values (e.g., fitted(model)).
It can also be a variable being considered to be included in the mean function.
If a variable leads to a large discrepancy between the ordered curve and the diagonal,
including this variable in the mean function should be considered.
For more details, see the reference paper.
x-axis:
y-axis:
which are defined in Details.
Yang, Lu. "Double Probability Integral Transform Residuals for Regression Models with Discrete Outcomes." arXiv preprint arXiv:2308.15596 (2023).
## Binary example of ordered curve n <- 500 set.seed(1234) x1 <- rnorm(n, 1, 1) x2 <- rbinom(n, 1, 0.7) beta0 <- -5 beta1 <- 2 beta2 <- 1 beta3 <- 3 q1 <- 1 / (1 + exp(beta0 + beta1 * x1 + beta2 * x2 + beta3 * x1 * x2)) y1 <- rbinom(n, size = 1, prob = 1 - q1) ## True Model model0 <- glm(y1 ~ x1 * x2, family = binomial(link = "logit")) ord_curve(model0, thr = model0$fitted.values) # set the threshold as fitted values ## Missing a covariate model1 <- glm(y1 ~ x1, family = binomial(link = "logit")) ord_curve(model1, thr = x2) # set the threshold as a covariate ## Poisson example of ordered curve n <- 500 set.seed(1234) x1 <- rnorm(n) x2 <- rnorm(n) beta0 <- 0 beta1 <- 2 beta2 <- 1 lambda1 <- exp(beta0 + beta1 * x1 + beta2 * x2) y <- rpois(n, lambda1) ## True Model poismodel1 <- glm(y ~ x1 + x2, family = poisson(link = "log")) ord_curve(poismodel1, thr = poismodel1$fitted.values) ## Missing a covariate poismodel2 <- glm(y ~ x1, family = poisson(link = "log")) ord_curve(poismodel2, thr = poismodel2$fitted.values) ord_curve(poismodel2, thr = x2)## Binary example of ordered curve n <- 500 set.seed(1234) x1 <- rnorm(n, 1, 1) x2 <- rbinom(n, 1, 0.7) beta0 <- -5 beta1 <- 2 beta2 <- 1 beta3 <- 3 q1 <- 1 / (1 + exp(beta0 + beta1 * x1 + beta2 * x2 + beta3 * x1 * x2)) y1 <- rbinom(n, size = 1, prob = 1 - q1) ## True Model model0 <- glm(y1 ~ x1 * x2, family = binomial(link = "logit")) ord_curve(model0, thr = model0$fitted.values) # set the threshold as fitted values ## Missing a covariate model1 <- glm(y1 ~ x1, family = binomial(link = "logit")) ord_curve(model1, thr = x2) # set the threshold as a covariate ## Poisson example of ordered curve n <- 500 set.seed(1234) x1 <- rnorm(n) x2 <- rnorm(n) beta0 <- 0 beta1 <- 2 beta2 <- 1 lambda1 <- exp(beta0 + beta1 * x1 + beta2 * x2) y <- rpois(n, lambda1) ## True Model poismodel1 <- glm(y ~ x1 + x2, family = poisson(link = "log")) ord_curve(poismodel1, thr = poismodel1$fitted.values) ## Missing a covariate poismodel2 <- glm(y ~ x1, family = poisson(link = "log")) ord_curve(poismodel2, thr = poismodel2$fitted.values) ord_curve(poismodel2, thr = x2)
Draw the quasi-empirical residual distribution functions for regression models with discrete outcomes.
Specifically, the model assumption of GLMs with binary, ordinal, Poisson, negative binomial,
zero-inlated Poisson, and zero-inflated negative binomial outcomes can be assessed using quasi_plot().
A plot far apart from the diagonal indicates lack of fit.
quasi_plot(model, line_args=list(), ...)quasi_plot(model, line_args=list(), ...)
model |
Model object (e.g., |
line_args |
A named list of graphical parameters passed to
|
... |
Additional graphical arguments passed to
|
The quasi-empirical residual distribution function is defined as follows:
where
is the bandwidth; and is a bounded, symmetric, and Lipschitz continuous kernel.
A plot of quasi-empirial residual distribution function against .
Lu Yang (2021). Assessment of Regression Models with Discrete Outcomes Using Quasi-Empirical Residual Distribution Functions, Journal of Computational and Graphical Statistics, 30(4), 1019-1035.
## Negative Binomial example library(MASS) # Covariates n <- 500 x1 <- rnorm(n) x2 <- rbinom(n, 1, 0.7) ### Parameters beta0 <- -2 beta1 <- 2 beta2 <- 1 size1 <- 2 lambda1 <- exp(beta0 + beta1 * x1 + beta2 * x2) # generate outcomes y <- rnbinom(n, mu = lambda1, size = size1) # True model model1 <- glm.nb(y ~ x1 + x2) resid.nb1 <- quasi_plot(model1) # Overdispersion model2 <- glm(y ~ x1 + x2, family = poisson(link = "log")) resid.nb2 <- quasi_plot(model2) ## Zero inflated Poisson example library(pscl) n <- 500 set.seed(1234) # Covariates x1 <- rnorm(n) x2 <- rbinom(n, 1, 0.7) # Coefficients beta0 <- -2 beta1 <- 2 beta2 <- 1 beta00 <- -2 beta10 <- 2 # Mean of Poisson part lambda1 <- exp(beta0 + beta1 * x1 + beta2 * x2) # Excess zero probability p0 <- 1 / (1 + exp(-(beta00 + beta10 * x1))) ## simulate outcomes y0 <- rbinom(n, size = 1, prob = 1 - p0) y1 <- rpois(n, lambda1) y <- ifelse(y0 == 0, 0, y1) ## True model modelzero1 <- zeroinfl(y ~ x1 + x2 | x1, dist = "poisson", link = "logit") resid.zero1 <- quasi_plot(modelzero1)## Negative Binomial example library(MASS) # Covariates n <- 500 x1 <- rnorm(n) x2 <- rbinom(n, 1, 0.7) ### Parameters beta0 <- -2 beta1 <- 2 beta2 <- 1 size1 <- 2 lambda1 <- exp(beta0 + beta1 * x1 + beta2 * x2) # generate outcomes y <- rnbinom(n, mu = lambda1, size = size1) # True model model1 <- glm.nb(y ~ x1 + x2) resid.nb1 <- quasi_plot(model1) # Overdispersion model2 <- glm(y ~ x1 + x2, family = poisson(link = "log")) resid.nb2 <- quasi_plot(model2) ## Zero inflated Poisson example library(pscl) n <- 500 set.seed(1234) # Covariates x1 <- rnorm(n) x2 <- rbinom(n, 1, 0.7) # Coefficients beta0 <- -2 beta1 <- 2 beta2 <- 1 beta00 <- -2 beta10 <- 2 # Mean of Poisson part lambda1 <- exp(beta0 + beta1 * x1 + beta2 * x2) # Excess zero probability p0 <- 1 / (1 + exp(-(beta00 + beta10 * x1))) ## simulate outcomes y0 <- rbinom(n, size = 1, prob = 1 - p0) y1 <- rpois(n, lambda1) y <- ifelse(y0 == 0, 0, y1) ## True model modelzero1 <- zeroinfl(y ~ x1 + x2 | x1, dist = "poisson", link = "logit") resid.zero1 <- quasi_plot(modelzero1)