Diagnostics for other GLMs

library(regressinator)
library(dplyr)
library(ggplot2)
library(patchwork)
library(broom)

In the vignette on logistic regression (vignette("logistic-regression-diagnostics")), we considered diagnostics for the most common type of GLM. But many of the same techniques can be applied to generalized linear models with other distributions and link functions.

To illustrate this, we’ll consider a bivariate Poisson generalized linear model:

pois_pop <- population(
  x1 = predictor("runif", min = -5, max = 15),
  x2 = predictor("runif", min = 0, max = 10),
  y = response(0.7 + 0.2 * x1 + x1^2 / 100 - 0.2 * x2,
               family = poisson(link = "log"))
)

pois_data <- sample_x(pois_pop, n = 100) |>
  sample_y()

fit <- glm(y ~ x1 + x2, data = pois_data, family = poisson)

In other words, the population relationship is \[ \begin{align*} Y \mid X = x &\sim \text{Poisson}(\mu(x)) \\ \mu(x) &= \exp\left(0.7 + 0.2 x_1 + \frac{x_1^2}{100} - 0.2 x_2\right), \end{align*} \] but we chose to fit a model that does not allow a quadratic term for \(x_1\).

We’ll consider the same diagnostics as we used for logistic regression, but consider the special problems for Poisson regression, illustrating what you must consider for each type of GLM.

Naive residual plots

Using the fitted model, we can produce plots of standardized residuals. Plotted against the fitted values (the linear predictor), they do indicate some kind of trend, but the plot is difficult to interpret, and it does not tell us which predictor is the problem.

# .fitted is the linear predictor, unless we set `type.predict = "response"` as
# an argument to augment()
augment(fit) |>
  ggplot(aes(x = .fitted, y = .std.resid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(x = "Fitted value", y = "Residual")
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Plots against the predictors suggest something is wrong with x1, but again, they are somewhat difficult to interpret:

augment_longer(fit) |>
  ggplot(aes(x = .predictor_value, y = .std.resid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  facet_wrap(vars(.predictor_name), scales = "free_x") +
  labs(x = "Predictor", y = "Residual")
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

In a lineup, we can compare the residual plots against simulated ones where the model is correct, making it at least clearer that the problem we observe is real:

model_lineup(fit, fn = augment_longer, nsim = 5) |>
  ggplot(aes(x = .predictor_value, y = .std.resid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  facet_grid(rows = vars(.sample), cols = vars(.predictor_name),
             scales = "free_x") +
  labs(x = "Predictor", y = "Residual")
#> decrypt("nsW7 Ykjk l3 gCPljlC3 45")
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

(Of course, in applications one should do the lineup before viewing the real residual plots, so one’s perception is not biased by foreknowledge of the true plot.)

Marginal model plots

For each predictor, we plot the predictor versus \(Y\). We plot the smoothed curve of fitted values (red) as well as a smoothed curve of response values (blue), on a log scale so the large \(Y\) values are not distracting:

augment_longer(fit, type.predict = "response") |>
  ggplot(aes(x = .predictor_value)) +
  geom_point(aes(y = y)) +
  geom_smooth(aes(y = .fitted), color = "red") +
  geom_smooth(aes(y = y)) +
  scale_y_log10() +
  facet_wrap(vars(.predictor_name), scales = "free_x") +
  labs(x = "Predictor", y = "Y")
#> Warning: Transformation introduced infinite values in continuous y-axis
#> Transformation introduced infinite values in continuous y-axis
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
#> Warning: Removed 40 rows containing non-finite values (`stat_smooth()`).

The red line is a smoothed version of \(\hat \mu(x)\) versus the predictors, while the blue line averages \(Y\) versus the predictors. Comparing the two lines helps us evaluate if the model is well-specified. We again have trouble with observations with \(Y = 0\), as on the log scale these are transformed to \(-\infty\). Nonetheless, the plots again point to a problem with x1.

Partial residuals

The partial residuals make the quadratic shape of the relationship somewhat clearer:

partial_residuals(fit) |>
  ggplot(aes(x = .predictor_value, y = .partial_resid)) +
  geom_point() +
  geom_smooth() +
  geom_line(aes(x = .predictor_value, y = .predictor_effect)) +
  facet_wrap(vars(.predictor_name), scales = "free") +
  labs(x = "Predictor", y = "Partial residual")
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

We can see curvature on the left-hand side of the plot for x1, while the plot for x2 appears (nearly) linear.

Binned residuals

Binned residuals bin the observations based on their predictor values, and average the residual value in each bin. This avoids the problem that individual residuals are hard to interpret because \(Y\) is only 0 or 1:

binned_residuals(fit) |>
  ggplot(aes(x = predictor_mean, y = resid_mean)) +
  facet_wrap(vars(predictor_name), scales = "free") +
  geom_point() +
  labs(x = "Predictor", y = "Residual mean")

This is comparable to the marginal model plots above: where the marginal model plots show a smoothed curve of fitted values and a smoothed curve of actual values, the binned residuals show the average residuals, which are actual values minus fitted values. We can think of the binned residual plot as showing the difference between the lines in the marginal model plot.

We can also bin by the fitted values of the model:

binned_residuals(fit, predictor = .fitted) |>
  ggplot(aes(x = predictor_mean, y = resid_mean)) +
  geom_point() +
  labs(x = "Fitted values", y = "Residual mean")

Limitations

As we can see, these graphical methods for detecting misspecification are each limited. The nature of GLMs, with their nonlinear link function and non-Normal conditional response distribution, makes useful diagnostics much harder to construct.

There are several factors limiting the diagnostics here. First, in the population we specified (pois_pop), most \(Y\) values are less than 10, and there are many zeroes. In this range, the conditional distribution of \(Y\) given \(X\) is asymmetric, since it is bounded below by zero, making plots hard to read; and we want to use log-scale plots so the relationship is linear, but the frequent presence of \(Y = 0\) makes these less useful.

Second, partial residuals for GLMs are most useful in domains where the link function is nearly linear. As noted by Cook and Croos-Dabrera (1998):

But if \(\mu(x)\) stays well away from its extremes so that the link function \(h\) is essentially a linear function of \(\mu\), and if \(\mathbb{E}[x_1 \mid x_2]\) is linear, then fitting a reasonable regression curve to the partial residual plot should yield a useful estimate of \(g\) within the class of GLMs.

That is, the range of \(x\) needs to be limited so that the link function is approximately linear. But here the range of \(X_1\) and \(X_2\) is large enough that the link is decidedly nonlinear. For instance, if we fix \(X_2 = 5\) (the middle of its marginal distribution) and vary \(X_1\) across its range in the population, the mean function is exponential:

ggplot() +
  geom_function(fun = function(x1) {
    exp(0.7 + 0.2 * x1 + x1^2 / 100 - 0.2 * 5)
  }) +
  xlim(-5, 10) +
  labs(x = "X1", y = "μ(x1, 5)")

If \(X_1\) were restricted to a smaller range, this curve would be more approximately linear, and the partial residual plots would give a better approximation of the true relationship.

In short, graphical diagnostics for GLMs are possible, and model lineups make it easier to distinguish when a strange pattern is a genuine problem and when it is an artifact of the model. But determining the nature of the misspecification and the appropriate changes to the model is much more difficult than in linear regression. No one diagnostic plot is a panacea.