What you are looking for might be a Generalized Linear Mixed Model, i.e. a Poisson model with a random intercept to start with.
To motivate the choice of a Generalized Linear Mixed Model, here is some background from Mixed Models: Theory and Applications by E. Demidenko (below, read "cluster" as a particular firm in your data):
Often data have a clustered (panel or tabular) structure. Classical
statistics assumes that observations are independent and identically
distributed (iid). Applied to clustered data, this assumption may lead
to false results. In contrast, the mixed effects model treats
clustered data adequately and assumes two sources of variation, within
cluster and between clusters. Two types of coefficients are
distinguished in the mixed model: population-averaged and cluster (or
subject) - specific. The former have the same meaning as in classical
statistics, but the latter are random and are estimated as posteriori
means.
and:
The Generalized Linear Mixed Model (GLMM) is an extension of the Generalized Linear Model (GLM) complicated by random effects.
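To make the "complicated by random effects" part concrete, here is a sketch of a Poisson GLMM with a random intercept for firm $i$ in year $t$. The symbols $y_{it}$, $x_{it}$, $\beta$, $b_i$ and $\sigma_b^2$ are my own notation, not taken from the book:

$$
y_{it} \mid b_i \sim \text{Poisson}(\mu_{it}), \qquad \log \mu_{it} = x_{it}^\top \beta + b_i, \qquad b_i \sim N(0, \sigma_b^2).
$$

Here $\beta$ holds the coefficients shared by all firms (the ones the quote calls population-averaged), and $b_i$ is the firm-specific deviation. Setting $\sigma_b^2 = 0$ recovers the ordinary Poisson GLM.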
I believe that you are interested in allowing for random effects for years by firm.
This can be obtained with the glmer function from the lme4 package in R:
library(lme4)  # provides glmer()

set.seed(1)
# Longitudinal data in a "long" format
data.sim <- data.frame(firmID = rep(1:120, times = 4),
                       year = c(rep(1, 120), rep(2, 120), rep(3, 120), rep(4, 120)),
                       X1 = rnorm(120*4),
                       X2 = runif(120*4),
                       res = rpois(120*4, lambda = 10))
head(data.sim)
#   firmID year         X1        X2 res
# 1      1    1 -0.6264538 0.3604340  11
# 2      2    1  0.1836433 0.4421617   7
# 3      3    1 -0.8356286 0.1257292   7
# 4      4    1  1.5952808 0.6243645   6
# 5      5    1  0.3295078 0.3024313   7
# 6      6    1 -0.8204684 0.2396372  12

# Fit a model; (year | firmID) gives each firm its own random intercept
# and its own random slope for year
model.glmer <- glmer(res ~ 1 + year + X1 + X2 + (year | firmID),
                     data = data.sim,
                     family = poisson(link = "log"))
summary(model.glmer)
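The simpler starting point mentioned at the top, a Poisson model with only a random intercept per firm, can be sketched as below. This is a minimal illustration, not a recommendation for your data: the simulated panel, the object names (sim, firm_effect, model.ri) and the chosen effect sizes are all assumptions of mine, and lme4 is assumed to be installed.

```r
library(lme4)  # assumed to be installed; provides glmer(), fixef(), VarCorr()

set.seed(1)
n_firms <- 120
n_years <- 4

# Hypothetical simulated panel: each firm gets its own baseline on the log scale
firm_effect <- rnorm(n_firms, mean = 0, sd = 0.5)
sim <- data.frame(
  firmID = rep(1:n_firms, times = n_years),
  year   = rep(1:n_years, each = n_firms)
)
sim$res <- rpois(nrow(sim),
                 lambda = exp(2 + 0.1 * sim$year + firm_effect[sim$firmID]))

# Random intercept only: firms differ in level but share a common year trend
model.ri <- glmer(res ~ 1 + year + (1 | firmID),
                  data = sim,
                  family = poisson(link = "log"))

# Fixed effects (on the log scale) and the estimated between-firm variation
print(fixef(model.ri))
print(VarCorr(model.ri))
```

If the between-firm standard deviation reported by VarCorr is clearly above zero, the firms really do differ in their baseline level, and only then is it worth asking whether a random slope for year is needed as well.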
Best Answer
For count data it is advisable (for reasons of interpretability of the estimated parameters) to use a generalized linear model (GLM) with a logarithmic link function; see my answer to "Goodness of fit and which model to choose linear regression or Poisson".
But the distribution family can be chosen in different ways. The reason for the advice you refer to, namely to use Poisson regression when the counts are not too large, is that with large counts there is usually overdispersion, that is, the variance is larger than the Poisson variance (which equals the mean).
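A quick way to see whether overdispersion is present is to fit the Poisson GLM and look at the Pearson dispersion statistic. The sketch below uses simulated data; the variable names and the negative binomial data-generating process are assumptions made purely for illustration.

```r
set.seed(42)
n <- 500
x <- runif(n)

# Hypothetical overdispersed counts: negative binomial with mean mu and
# variance mu + mu^2 / size, so the variance exceeds the mean
mu <- exp(1 + 2 * x)
y  <- rnbinom(n, mu = mu, size = 2)

fit.pois <- glm(y ~ x, family = poisson(link = "log"))

# Pearson dispersion statistic: close to 1 if the Poisson variance assumption
# holds, clearly above 1 under overdispersion
dispersion <- sum(residuals(fit.pois, type = "pearson")^2) / df.residual(fit.pois)
print(dispersion)
```

A value well above 1 suggests that the Poisson standard errors are too optimistic.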
That can be solved in various ways, such as using (in R terminology) a quasipoisson family, which can be good enough. Or you can use a negative binomial family. Or, especially if you only want predictions from the model and do not need interpretable parameters, use the traditional approach of an ordinary linear model (that is, identity link function) for the response $\sqrt{Y}$. The square-root transformation is (approximately) variance stabilizing for the Poisson (and quasi-Poisson) families. See "Why is the square root transformation recommended for count data?" for an explanation of this!
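The three options above can be sketched in R as follows. Again this is only an illustration on simulated data: the data frame d and all object names are my own assumptions, and glm.nb comes from the MASS package, which is assumed to be available (it ships with standard R installations).

```r
library(MASS)  # provides glm.nb()

set.seed(42)
n <- 500
d <- data.frame(x = runif(n))
# Hypothetical overdispersed counts (negative binomial, size = 2)
d$y <- rnbinom(n, mu = exp(1 + 2 * d$x), size = 2)

# Baseline Poisson fit, for comparison
fit.pois <- glm(y ~ x, data = d, family = poisson(link = "log"))

# 1. Quasi-Poisson: same coefficient estimates as Poisson, but the standard
#    errors are scaled by the square root of the estimated dispersion
fit.quasi <- glm(y ~ x, data = d, family = quasipoisson(link = "log"))
print(rbind(poisson = summary(fit.pois)$coefficients[, "Std. Error"],
            quasi   = summary(fit.quasi)$coefficients[, "Std. Error"]))

# 2. Negative binomial: estimates an extra parameter theta, with
#    variance mu + mu^2 / theta
fit.nb <- glm.nb(y ~ x, data = d)
print(fit.nb$theta)

# 3. Ordinary linear model for the square root of the response
fit.sqrt <- lm(sqrt(y) ~ x, data = d)
print(coef(fit.sqrt))

# Variance stabilization: for pure Poisson counts, var(sqrt(Y)) stays
# roughly at 1/4 whatever the mean is
lambdas  <- c(10, 50, 200)
var.sqrt <- sapply(lambdas, function(l) var(sqrt(rpois(1e5, l))))
print(round(var.sqrt, 3))
```

Note that the coefficients of the square-root model are on a different scale from those of the log-link models, which is why that route is mainly attractive when only predictions are needed.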
For more about overdispersion, see "Modelling a Poisson distribution with overdispersion" and "Comparing overdispersion distributions".