How to handle outliers in Poisson regression

Tags: outliers, poisson-distribution, poisson-regression, regression

Consider the following count data:

df <- data.frame(
  count = c(0, 1, 2, 3, 4, 5, 9, 20),
  freq  = c(1120, 42, 10, 5, 1, 1, 1, 1)
)

I want to use a quasi-Poisson regression model (see glm in R) to examine the relationship between the counts and a given matrix of covariates $X$.

My knowledge of the empirical context suggests that the single observation with a count of 20 is not fully comparable with the others. That is, I judge the observation to be an outlier, and I would like to drop it from the analysis.

Is there a formal way to support my decision? How can I empirically assess whether a particular observation is an outlier? And if it is, is it OK to simply drop the observation from the sample, or should I do something else?

Best Answer

Partially answered in comments:

There is a large literature on outliers (real, possible, imaginary) and what to do about them, and many threads here too. I'd assert that most statistically experienced people regard it as a very bad idea to drop data points on the basis of suspicion alone. In Poisson regression, the logarithmic link function may mean that apparent outliers are much less unusual than you guess. The simplest technique in cases of severe doubt is to run the analysis with and without this data point and see what difference it makes. We can't tell you what to think if there is a big difference. – Nick Cox
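The with-and-without comparison suggested above could be sketched roughly as follows. The covariate matrix $X$ is not shown in the question, so this sketch assumes an intercept-only model as a stand-in; with real covariates the formula would change, and the object names (y, fit_all, fit_drop, comparison) are purely illustrative.

```r
# Count data from the question, expanded to one row per observation
df <- data.frame(
  count = c(0, 1, 2, 3, 4, 5, 9, 20),
  freq  = c(1120, 42, 10, 5, 1, 1, 1, 1)
)
y <- rep(df$count, times = df$freq)

# Assumption: X is not given in the question, so an intercept-only
# quasi-Poisson model stands in for the real model here.
fit_all <- glm(y ~ 1, family = quasipoisson)

# Refit after dropping the single observation with a count of 20
y_drop   <- y[y != 20]
fit_drop <- glm(y_drop ~ 1, family = quasipoisson)

# Compare the fitted mean and the estimated dispersion of the two fits
comparison <- data.frame(
  model      = c("all data", "without count = 20"),
  n          = c(length(y), length(y_drop)),
  mean       = unname(c(exp(coef(fit_all)), exp(coef(fit_drop)))),
  dispersion = c(summary(fit_all)$dispersion, summary(fit_drop)$dispersion)
)
print(comparison)
```

If the coefficients of interest and their standard errors barely move between the two fits, the choice matters little; if they move a lot, both sets of results are probably worth reporting.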

If you are going by the Poisson distribution, all four observations above 3 might be considered quite unlikely given a marginal mean of about 0.1, and so would all count as outliers. It is possible, though, that the conditional mean estimate will explain the high observations, so they may not be outliers once you fit your model. – Andy W
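The marginal argument above can be checked with a few lines of R. This is only a rough sketch: it ignores the covariates entirely and treats all observations as draws from one Poisson distribution with the sample mean, which is exactly the simplification the comment warns about. The variable names are illustrative.

```r
# Count data from the question
df <- data.frame(
  count = c(0, 1, 2, 3, 4, 5, 9, 20),
  freq  = c(1120, 42, 10, 5, 1, 1, 1, 1)
)

n             <- sum(df$freq)                   # total number of observations
marginal_mean <- sum(df$count * df$freq) / n    # roughly 0.1

# P(Y > 3) under a Poisson distribution with the marginal mean
p_above_3 <- ppois(3, lambda = marginal_mean, lower.tail = FALSE)

# Expected versus observed number of observations above 3
expected_above_3 <- n * p_above_3
observed_above_3 <- sum(df$freq[df$count > 3])

print(c(mean = marginal_mean, expected = expected_above_3,
        observed = observed_above_3))
```

A large gap between the expected and observed numbers says only that a single marginal Poisson distribution fits poorly; it does not by itself show that the points remain unusual once covariates and overdispersion are taken into account.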
