Why is Ordinary Least Squares performing better than Poisson regression?

least-squares, poisson-regression, regression

I'm trying to fit a regression to explain the number of homicides in each district of a city.
Although I know that my data follows a Poisson distribution, I tried to fit an OLS like this:

$\log(y+1) = \alpha + \beta X + \epsilon$

Then, I also tried (of course!) a Poisson regression. The problem is that the OLS regression gives better results: its pseudo-$R^2$ is higher (0.71 vs. 0.57) and its RMSE is lower (3.8 vs. 8.88, standardized to the same units).

Why? Is this normal? What's wrong with using OLS regardless of the distribution of the data?
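In R terms, the two fits look roughly like this (a minimal sketch; y and X are placeholders for the homicide counts and the district features, and p is the data frame):

# OLS on the log-transformed counts (the +1 handles districts with zero homicides)
ols <- lm(log(y + 1) ~ X, data = p)
# Poisson GLM with the canonical log link
pois <- glm(y ~ X, family = poisson, data = p)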

Edit

Following the suggestions of kjetil b halvorsen and others, I fitted two models to the data: OLS and a negative binomial GLM (NB). I started with all the features I had, then recursively removed the non-significant features one by one.
The OLS model is

$\sqrt{\frac{\text{crime}}{\text{area}}} = \alpha + \beta X + \epsilon$

with weights = $area$.

# weighted OLS on the square-root crime rate, with district area as weights
summary(w <- lm(
  sqrt(num/area) ~ RNR_nres_non_daily + RNR_nres_daily + hType_mix_std +
    area_filtr + num_community_places + num_intersect + pop_rat_num +
    employed + emp_rat_pop + nden_daily + nden_non_daily + bld_rat_area +
    bor_rat_area + mdist_highways + mdist_parks,
  data = p, weights = area))

# back-transform: square the fitted sqrt-rate and multiply by area to get counts
error2 <- p$num - (predict(w, newdata = p[,-1:-2], type = "response")^2) * p$area
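(rmse is not a base-R function; I assume a helper along these lines:)

# root-mean-square error of a vector of residuals (assumed definition)
rmse <- function(e) sqrt(mean(e^2))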

rmse(error2)
[1] 80.64783

The NB model predicts the number of crimes, with the district's area as an offset.

library(MASS)  # glm.nb lives in MASS
summary(m3 <- glm.nb(
  num ~ LUM5_single + RNR_nres + mdist_daily + mdist_non_daily +
    hType_mix_std + ratio_daily_nondaily_area + area_filtr +
    num_community_places + employed + nden_daily + nden_non_daily +
    bld_rat_area + bor_rat_area + mdist_smallparks + mdist_highways +
    mdist_parks + offset(log(area)),
  data = p, maxit = 1000))

# NB predictions are already on the response (count) scale
error <- p$num - predict(m3, newdata = p[,-1:-2], type = "response")

rmse(error)
[1] 121.8714

OLS residuals:

[residual plot: OLS residuals]

NB residuals:

[residual plot: NB residuals]

So the RMSE is lower for the OLS, but its residuals do not look particularly Normal…
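A quick way to check that impression (a sketch using base R):

# normal QQ-plot of the standardized OLS residuals (rstandard accounts for the weights)
qqnorm(rstandard(w)); qqline(rstandard(w))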

Best Answer

I suspect that part of the problem may lie in your choice of performance metric. If you measure test performance using RMSE, then training the model to minimise the MSE matches the test criterion, giving the OLS model a hint as to what is considered important. If instead you measure test performance by the negative log-likelihood of the test set under a Poisson likelihood, you may find that the Poisson model works better (as might be expected). This may be a minor issue compared with the other issues raised, but it could be a useful sanity check.
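As a rough sketch of that check (assuming pred_ols and pred_nb hold the two models' predicted counts on a held-out set and y_test the observed counts; all three names are hypothetical):

# Poisson negative log-likelihood of observed counts y under mean predictions mu
pois_nll <- function(y, mu) -sum(dpois(y, lambda = pmax(mu, 1e-8), log = TRUE))

pois_nll(y_test, pred_ols)  # back-transformed OLS predictions
pois_nll(y_test, pred_nb)   # NB predictions on the response scale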
