Solved – Robust estimation of Poisson distribution

estimationgeneralized linear modelpoisson distributionrrobust

I have a set of numbers which are assumed to be coming from a Poisson distribution. The set has some outliers also and because of that, maximum likelihood estimates are badly affected. I heard that robust estimation procedures can help in such a situation. Can any one explain how to do this? I am not a statistics student.

I found that the glmrob function in R can be used for this. (I am quite new to R). But I could not figure out, how to use that despite reading the manual pages. In particular I am unable to understand how to get a forumula which is the first argument to the glmrob function.

Thanks.

Best Answer

@cardinal has telegraphed an answer in comments. Let's flesh it out. His point is that although general linear models (such as implemented by lm and, in this case, glmRob) appear intended to evaluate relationships among variables, they can be powerful tools for studying a single variable, too. The trick relies on the fact that regressing data against a constant is just another way of estimating its average value ("location").

As an example, generate some Poisson-distributed data:

set.seed(17)
x <- rpois(10, lambda=2)

In this case, R will produce the vector $(1,5,2,3,2,2,1,1,3,1)$ of values for x from a Poisson distribution of mean $2$. Estimate its location with glmRob:

library(robust)
glmrob(x ~ 1, family=poisson())

The response tells us the intercept is estimated at $0.7268$. Of course, anyone using a statistical method needs to know how it works: when you use generalized linear models with the Poisson family, the standard "link" function is the logarithm. This means the intercept is the logarithm of the estimated location. So we compute

exp(0.7268)

The result, $2.0685$, is comfortably close to $2$: the procedure seems to work. To see what it is doing, plot the data:

plot(x, ylim=c(0, max(x)))
abline(exp(0.7268), 0, col="red")

Plot with fitted line

The fitted line is purely horizontal and therefore estimates the middle of the vertical values: our data. That's all that's going on.

To check robustness, let's create a bad outlier by tacking a few zeros onto the first value of x:

x[1] <- 100

This time, for greater flexibility in post-processing, we will save the output of glmRob:

m <- glmrob(x ~ 1, family=poisson())

To obtain the estimated average we can request

exp(m$coefficients)

The value this time equals $2.496$: a little off, but not too far off, given that the average value of x (obtained as mean(x)) is $12$. That is the sense in which this procedure is "robust." More information can be obtained via

summary(m)

Its output shows us, among other things, that the weight associated with the outlying value of $100$ in x[1] is just $0.02179$, almost $0$, pinpointing the suspected outlier.

Related Question