Why is there little difference in glm fit using poisson and gaussian family for Poisson data

generalized-linear-model, poisson-regression

I have been puzzling over a toy regression problem with simulated Poisson distributed data and hoping someone more educated in statistics can help me gain some insight about the following observation.

Libraries used

library(tidyverse)
library(cowplot)
library(broom)
library(modelbased)
library(parameters)
library(ggbeeswarm)

Data generation

I simulated count values using rpois for two scenarios:

Counts of traffic accidents where lambda is a linear function of traffic volume.
Counts of traffic accidents where lambda is an exponential function of traffic volume.

# number of observations
n_obs = 10

# generate log dependent data
traffic_volume = log(c(1, 2, 4, 7, 10, 15))
log_data = tibble(
  volume=traffic_volume,
  lambda=exp(0.43*volume + 0.2)
) %>%
  rowwise %>%
  mutate(accident_counts = list(rpois(n_obs, lambda = lambda))) %>%
  mutate(observed_avg_accidents = mean(accident_counts)) 

# generate linear dependent data
linear_data = tibble(
  volume=traffic_volume,
  lambda = 0.43*volume + 0.2
) %>%
  rowwise %>%
  mutate(accident_counts = list(rpois(n_obs, lambda = lambda))) %>%
  mutate(observed_avg_accidents = mean(accident_counts))

Modelling

I fit two GLMs to each data set: one using the gaussian family and one using the poisson family. I used the "log" link for the log data and the "identity" link for the linear data.

# fit
proc_list = list(
  log=list(data=log_data, linker="log"),
  linear=list(data=linear_data, linker="identity")
)
models = map(proc_list, function(proc) {
  poisson_model <- glm(
    accident_counts ~ volume,
    data = proc$data %>% unnest(accident_counts),
    family = poisson(link=proc$linker)
  )
  gaussian_model = glm(
    accident_counts ~ volume,
    data = proc$data %>% unnest(accident_counts),
    family = gaussian(link=proc$linker),
    start=c(1, 1)
  )
  return(list("poisson"=poisson_model, "gaussian"=gaussian_model))
})

Results

> compare_models(unlist(models, recursive=FALSE))

Parameter    |        log.poisson |        log.gaussian |    linear.poisson |    linear.gaussian
------------------------------------------------------------------------------------------------
(Intercept)  | 0.01 (-0.40, 0.43) | -0.03 (-0.59, 0.53) | 0.39 (0.06, 0.73) | 0.43 (-0.08, 0.94)
volume       | 0.52 ( 0.32, 0.72) |  0.54 ( 0.30, 0.79) | 0.36 (0.13, 0.59) | 0.34 ( 0.05, 0.62)
------------------------------------------------------------------------------------------------
Observations |                 60 |                  60 |                60 |                 60

Visualization

# create the visualization grid and predict values
viz_grid = modelbased::visualisation_matrix(tibble(volume=traffic_volume)) %>% as_tibble
augmented = map_df(unlist(models, recursive=FALSE), function(.x) {
  augment(.x, newdata=viz_grid, type.predict="response")
}, .id="model")

# separate model and data labels
augmented = augmented %>%
  separate("model", c("data", "regression"), sep="\\.")

p = map_df(proc_list, ~.x$data, .id="data") %>%
  unnest(accident_counts) %>%
  ggplot(aes(volume, accident_counts)) +
  # geom_violin(adjust=1.5) +
  geom_quasirandom() +
  geom_point(
    data=~.x %>% distinct(data, volume, observed_avg_accidents),
    aes(volume, observed_avg_accidents),
    color="red"
  ) +
  geom_line(data=augmented, aes(volume, .fitted, color=regression)) +
  facet_wrap(~data, labeller=label_both) +
  theme_gray(base_size=16)
ggsave("temp.pdf", plot=p, width=8, height=4)

Question

Why is there essentially no difference in the fits regardless of family function? I purposely chose a small number of observations and relatively small lambda values hoping the fit using the gaussian family would break down. But this did not happen. If the data generation process here is truly Poisson, wouldn't the choice of family function affect the fit?

Best Answer

You’ve fit models to two different data sets.

For the Poisson regression, your true conditional expected values (lambda values) are given by $\exp(0.43x+0.2)$.

For the linear regression, your true conditional expected values are given by $0.43x+0.2$.

What you might be more interested in is fitting both models to the same data set, rather than using different $y$ variables.
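Fitting both families to the same counts could look like the sketch below. It is a minimal illustration in base R, not the asker's pipeline: the seed, the larger number of replicates, and the starting values are all assumptions chosen for stability, and the object names (`same_data`, `poisson_fit`, `gaussian_fit`) are hypothetical.

```r
# Sketch: fit both families to the SAME simulated counts (base R only).
set.seed(1)  # arbitrary seed, an assumption for reproducibility

n_obs  <- 100  # replicates per volume level (assumed; the question used 10)
volume <- rep(log(c(1, 2, 4, 7, 10, 15)), each = n_obs)
lambda <- exp(0.43 * volume + 0.2)  # same log-link mean as log_data
same_data <- data.frame(
  volume = volume,
  accident_counts = rpois(length(lambda), lambda = lambda)
)

poisson_fit <- glm(accident_counts ~ volume, data = same_data,
                   family = poisson(link = "log"))

# gaussian(link = "log") cannot build default starting values when any
# count is 0, so starting values are supplied (assumed values).
gaussian_fit <- glm(accident_counts ~ volume, data = same_data,
                    family = gaussian(link = "log"),
                    start = c(0, 0.5))

# Point estimates: both fits target the same mean function.
coef_table <- rbind(poisson = coef(poisson_fit), gaussian = coef(gaussian_fit))
print(coef_table)

# Standard errors: these come from different variance assumptions.
se_table <- rbind(
  poisson  = sqrt(diag(vcov(poisson_fit))),
  gaussian = sqrt(diag(vcov(gaussian_fit)))
)
print(se_table)
```

Comparing `coef_table` with `se_table` separates the two things the family choice could affect: the fitted curve and the uncertainty attached to it.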

As for why the Gaussian model achieves a good fit even though the outcome is Poisson-distributed rather than Gaussian: the point estimates of OLS-type linear models are rather robust to deviations from normality. If you bootstrap the residuals and compute the mean of each resample, the distribution of those means is likely to look approximately normal.
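A sketch of the reasoning behind the similar point estimates, using standard GLM notation that is not in the original answer ($\mathbf{x}_i = (1, x_i)^\top$, linear predictor $\eta_i = \mathbf{x}_i^\top \beta$, mean $\mu_i$, variance function $V$). The maximum-likelihood estimate of $\beta$ solves

$$\sum_i \frac{y_i - \mu_i}{V(\mu_i)} \, \frac{\partial \mu_i}{\partial \eta_i} \, \mathbf{x}_i = 0 .$$

With the log link, $\partial \mu_i / \partial \eta_i = \mu_i$, so the two families give

$$\text{Poisson } (V(\mu) = \mu): \quad \sum_i (y_i - \mu_i)\, \mathbf{x}_i = 0, \qquad \text{Gaussian } (V(\mu) = 1): \quad \sum_i (y_i - \mu_i)\, \mu_i \, \mathbf{x}_i = 0 .$$

Both sets of equations have expectation zero at the true coefficients whenever $E[y_i \mid x_i] = \mu_i$ is correctly specified, so both estimators target the same curve. They differ only in how the observations are weighted, which affects efficiency and the reported standard errors rather than what is being estimated.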

Related Solutions

Generalized Linear Model – Difference Between LM and GLM for Gaussian Family

While for the specific form of model mentioned in the body of the question (i.e. lm(y ~ x1 + x2) vs glm(y ~ x1 + x2, family=gaussian)), regression and GLMs are the same model, the title question asks something slightly more general:

Is there any difference between lm and glm for the gaussian family of glm?

To which the answer is "Yes!".

The reason they can differ is that you can also specify a link function in the GLM. This lets you fit particular forms of nonlinear relationship between $y$ (or rather its conditional mean) and the $x$-variables. You can do this in nls as well, but with glm there's no need for starting values, the convergence is sometimes better, and the syntax is a bit easier.

Compare, for example, these models (you have R so I assume you can run these yourself):

x1=c(56.1, 26.8, 23.9, 46.8, 34.8, 42.1, 22.9, 55.5, 56.1, 46.9, 26.7, 33.9, 
37.0, 57.6, 27.2, 25.7, 37.0, 44.4, 44.7, 67.2, 48.7, 20.4, 45.2, 22.4, 23.2, 
39.9, 51.3, 24.1, 56.3, 58.9, 62.2, 37.7, 36.0, 63.9, 62.5, 44.1, 46.9, 45.4, 
23.7, 36.5, 56.1, 69.6, 40.3, 26.2, 67.1, 33.8, 29.9, 25.7, 40.0, 27.5)

x2=c(12.29, 11.42, 13.59, 8.64, 12.77, 9.9, 13.2, 7.34, 10.67, 18.8, 9.84, 16.72, 
10.32, 13.67, 7.65, 9.44, 14.52, 8.24, 14.14, 17.2, 16.21, 6.01, 14.23, 15.63, 
10.83, 13.39, 10.5, 10.01, 13.56, 11.26, 4.8, 9.59, 11.87, 11, 12.02, 10.9, 9.5, 
10.63, 19.03, 16.71, 15.11, 7.22, 12.6, 15.35, 8.77, 9.81, 9.49, 15.82, 10.94, 6.53)

y = c(1.54, 0.81, 1.39, 1.09, 1.3, 1.16, 0.95, 1.29, 1.35, 1.86, 1.1, 0.96,
1.03, 1.8, 0.7, 0.88, 1.24, 0.94, 1.41, 2.13, 1.63, 0.78, 1.55, 1.5, 0.96, 
1.21, 1.4, 0.66, 1.55, 1.37, 1.19, 0.88, 0.97, 1.56, 1.51, 1.09, 1.23, 1.2, 
1.62, 1.52, 1.64, 1.77, 0.97, 1.12, 1.48, 0.83, 1.06, 1.1, 1.21, 0.75)

lm(y ~ x1 + x2)
glm(y ~ x1 + x2, family=gaussian) 
glm(y ~ x1 + x2, family=gaussian(link="log")) 
nls(y ~ exp(b0+b1*x1+b2*x2), start=list(b0=-1,b1=0.01,b2=0.1))

Note that the first pair are the same model ($y_i \sim N(\beta_0+\beta_1 x_{1i}+\beta_2 x_{2i},\sigma^2)\,$), and the second pair are the same model ($y_i \sim N(\exp(\beta_0+\beta_1 x_{1i}+\beta_2 x_{2i}),\sigma^2)\,$), and the fits are essentially the same within each pair.
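The within-pair agreement can be checked numerically. The sketch below uses simulated data rather than the vectors above, purely to keep it short; the seed, the simulated coefficients, and the nls starting values (placed at the simulated truth) are assumptions, and the `fit_*` names are hypothetical.

```r
# Sketch: compare coefficients within each pair of equivalent models.
set.seed(42)  # arbitrary seed (assumption)
n  <- 50
x1 <- runif(n, 20, 70)
x2 <- runif(n, 5, 20)
# Positive response with a log-linear mean and small Gaussian noise
y  <- exp(-1 + 0.01 * x1 + 0.05 * x2) + rnorm(n, sd = 0.05)

# Pair 1: identity link, same model fitted by lm and glm
fit_lm     <- lm(y ~ x1 + x2)
fit_glm_id <- glm(y ~ x1 + x2, family = gaussian)

# Pair 2: log link, same model fitted by glm and nls
fit_glm_log <- glm(y ~ x1 + x2, family = gaussian(link = "log"))
fit_nls     <- nls(y ~ exp(b0 + b1 * x1 + b2 * x2),
                   start = list(b0 = -1, b1 = 0.01, b2 = 0.05))

# Largest absolute coefficient difference within each pair
diff_pair1 <- max(abs(unname(coef(fit_lm)) - unname(coef(fit_glm_id))))
diff_pair2 <- max(abs(unname(coef(fit_glm_log)) - unname(coef(fit_nls))))
print(c(identity_pair = diff_pair1, log_pair = diff_pair2))
```

Both differences should be small relative to the size of the coefficients. Any gap in the log-link pair reflects the two optimizers' convergence tolerances rather than a difference in the model.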

So - in relation to the title question - you can fit a substantially wider variety of Gaussian models with a GLM than with regression.

Generalized Linear Model (GLM) – Using GLM: Gaussian, Poisson Vs Gamma

This answer elaborates on some discussion in comments on the answer from Nick Cox.

Your situation might be handled by a multi-category extension of binomial regression: ordinal regression. You model the probability of moving from one category to the next in a way that takes advantage of the ordering among the outcome categories.

This UCLA web page illustrates ordinal logistic regression, based on a "proportional odds" (PO) assumption for moving up the scale. I don't know whether that assumption will hold for your data, but the page does show how to evaluate it.

Also, as Frank Harrell points out in Section 13.3.3 of his Regression Modeling Strategies book, a PO model can sometimes work well even if the assumption isn't met. In this answer to a question on highly skewed data that take only a few values with clumping at one end, he says:

When the dependent variable Y has a beautiful distribution I still recommend it be modeled using a Y-transformation-invariant semiparametric ordinal regression model such as the proportional odds model. With your Y, the need for a semiparametric model is even greater. Semiparametric models handle arbitrary clumping of Y values, bimodality, floor effects, ceiling effects, and outliers. Such models are also very efficient.

The orm() function in Harrell's rms package allows for ordinal regression with link functions other than the logit, and Section 13.4 of his book shows how to implement a "continuation ratio" method that sometimes works better than a PO model. That provides you some flexibility in how to proceed.

With a PO model you can often model, without overfitting, almost as many parameters as you can with linear regression. Section 4.4 of Harrell's book and course notes provides an estimate of the effective sample size that takes the distribution of cases among categories into account. Your sample size of about 200 would be reduced to an effective sample size of about 180 on that basis, so you should be able to estimate about 12 regression coefficients.
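A proportional odds fit of the kind described above can be sketched as follows. This uses `polr()` from the MASS package as a stand-in, not the `orm()` function from rms that the answer mentions, and the data are simulated: the seed, sample size, cut-points, and variable names are all assumptions made for illustration.

```r
# Sketch: proportional odds (ordinal logistic) regression with MASS::polr.
library(MASS)  # recommended package shipped with standard R installs

set.seed(7)  # arbitrary seed (assumption)
n <- 200
x <- rnorm(n)

# Latent-variable construction: logistic noise plus fixed cut-points
# yields an ordered outcome that satisfies proportional odds by design.
latent <- 1.0 * x + rlogis(n)
y <- cut(latent, breaks = c(-Inf, -1, 1, Inf),
         labels = c("low", "mid", "high"), ordered_result = TRUE)
dat <- data.frame(x = x, y = y)

po_fit <- polr(y ~ x, data = dat, method = "logistic", Hess = TRUE)

print(coef(po_fit))  # slope on the log-odds scale
print(po_fit$zeta)   # estimated cut-points between adjacent categories
```

In `polr()`, a positive slope means that larger `x` shifts probability toward the higher categories, which matches how the latent variable was simulated here.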