# Why is there little difference in glm fit using poisson and gaussian family for Poisson data

generalized linear modelpoisson-regression

I have been puzzling over a toy regression problem with simulated Poisson distributed data and hoping someone more educated in statistics can help me gain some insight about the following observation.

# Libraries used

library(tidyverse)
library(cowplot)
library(broom)
library(modelbased)
library(parameters)
library(ggbeeswarm)


# Data generation

I simulated count values using rpois for two scenarios:

1. Counts of traffic accidents where lambda is linearly proportional to traffic volume.
2. Counts of traffic accidents when lambda scales as the exponent of traffic volume.
# number of observations
n_obs = 10

# generate log dependent data
traffic_volume = log(c(1, 2, 4, 7, 10, 15))
log_data = tibble(
volume=traffic_volume,
lambda=exp(0.43*volume + 0.2)
) %>%
rowwise %>%
mutate(accident_counts = list(rpois(n_obs, lambda = lambda))) %>%
mutate(observed_avg_accidents = mean(accident_counts))

# generate linear dependent data
linear_data = tibble(
volume=traffic_volume,
lambda = 0.43*volume + 0.2
) %>%
rowwise %>%
mutate(accident_counts = list(rpois(n_obs, lambda = lambda))) %>%
mutate(observed_avg_accidents = mean(accident_counts))


# Modelling

I fit two glms to each data set. One using the gaussian family and one using the poisson family. I used the "log" linker for log data and the "identity" linker for the linear data.

# fit
proc_list = list(
)
models = map(proc_list, function(proc) {
poisson_model <- glm(
accident_counts ~ volume,
data = proc$data %>% unnest(accident_counts), family = poisson(link=proc$linker),
)
gaussian_model = glm(
accident_counts ~ volume,
data = proc$data %>% unnest(accident_counts), family = gaussian(link=proc$linker),
start=c(1, 1)
)
return(list("poisson"=poisson_model, "gaussian"=gaussian_model))
})


# Results

> compare_models(unlist(models, recursive=FALSE))

Parameter    |        log.poisson |        log.gaussian |    linear.poisson |    linear.gaussian
------------------------------------------------------------------------------------------------
(Intercept)  | 0.01 (-0.40, 0.43) | -0.03 (-0.59, 0.53) | 0.39 (0.06, 0.73) | 0.43 (-0.08, 0.94)
volume       | 0.52 ( 0.32, 0.72) |  0.54 ( 0.30, 0.79) | 0.36 (0.13, 0.59) | 0.34 ( 0.05, 0.62)
------------------------------------------------------------------------------------------------
Observations |                 60 |                  60 |                60 |                 60


# Visualization

# create the visualization grid and predict values
viz_grid = modelbased::visualisation_matrix(tibble(volume=traffic_volume)) %>% as_tibble
augmented = map_df(unlist(models, recursive=FALSE), function(.x) {
augment(.x, newdata=viz_grid, type.predict="response")
}, .id="model")

# separate model and data labels
augmented = augmented %>%
separate("model", c("data", "regression"), sep="\\.")

p = map_df(proc_list, ~.x\$data, .id="data") %>%
unnest(accident_counts) %>%
ggplot(aes(volume, accident_counts)) +
geom_quasirandom() +
geom_point(
data=~.x %>% distinct(data, volume, observed_avg_accidents),
aes(volume, observed_avg_accidents),
color="red"
) +
geom_line(data=augmented, aes(volume, .fitted, color=regression)) +
facet_wrap(~data, labeller=label_both) +
theme_gray(base_size=16)
p %>% ggsave(file="temp.pdf", w=8, h=4)


# Question

Why is there essentially no difference in the fits regardless of family function? I purposely chose a small number of observations and relatively small lambda values hoping the fit using the gaussian family would break down. But this did not happen. If the data generation process here is truly Poisson, wouldn't the choice of family function affect the fit?

Youâ€™ve got models to two different data sets.

For the Poisson regression, your true conditional expected values (lambda values) are given by $$\exp(0.43x+0.2)$$.

For the linear regression, your true conditional expected values are given by $$0.43x+0.2$$.

What you might be more interested in is fitting both models to the same data set, rather than using different $$y$$ variables.

As far as why the OLS linear model achieves a good fit despite the outcome being Poisson-distributed instead of Gaussian, OLS linear models are rather robust to deviations from normality. If you bootstrap your residuals and calculate the mean, you are likely to find that the distribution looks rather normal.