# Solved – Regression with skewed data


I am trying to predict visit counts from demographics and service. The data are very skewed.

Histograms:

qq plots (left is log):

    m <- lm(d$Visits ~ d$Age + d$Gender + city + service)       # raw counts
    m <- lm(log(d$Visits) ~ d$Age + d$Gender + city + service)  # log-transformed counts


city and service are factor variables.

I get a low p-value (***) for all the variables, but I also get a low R-squared of .05. What should I do? Would another model work better, such as an exponential one?

Linear regression is not the right choice for your outcome, given:

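1. The outcome variable is far from normally distributed (strictly, the normality assumption concerns the residuals, but with counts this skewed they will not be normal either)
2. The outcome variable is limited in the values it can take on (counts cannot be negative, yet a linear model can predict negative values)
3. There appears to be a high frequency of cases with 0 visits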

## Limited dependent variable models for count data

The estimation strategies available to you are dictated by the "structure" of your outcome variable. If the outcome is limited in the values it can take on (i.e. if it is a limited dependent variable), you need a model whose predicted values fall within the possible range of the outcome. Linear regression is sometimes a good approximation for limited dependent variables (for example, as a stand-in for binary logit/probit), but often it is not. Enter Generalized Linear Models. In your case, because the outcome variable is count data, you have several choices:

1. Poisson model
2. Negative Binomial model
3. Zero Inflated Poisson (ZIP) model
4. Zero Inflated Negative Binomial (ZINB) model

Choice is usually empirically determined. I will briefly discuss choosing between these options below.
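As a compact reference, the four models can be summarised by the mean and variance they imply. The notation here ($\mu_i$, $x_i$, $\beta$, $\alpha$, $\pi_i$, $f$) is standard textbook notation and not taken from the question; $\theta$ is the parameter R reports for the Negative Binomial, and the NB2 variance form is assumed:

$$
\begin{aligned}
\text{Mean of the count part (log link):}\quad & \log \mu_i = x_i^\top \beta \;\Rightarrow\; \mu_i = \exp(x_i^\top \beta) > 0 \\
\text{Poisson:}\quad & \operatorname{Var}(Y_i \mid x_i) = \mu_i \\
\text{Negative Binomial:}\quad & \operatorname{Var}(Y_i \mid x_i) = \mu_i + \alpha \mu_i^2, \qquad \alpha = 1/\theta \\
\text{Zero-inflated (ZIP/ZINB):}\quad & P(Y_i = 0 \mid x_i) = \pi_i + (1 - \pi_i)\, f(0; \mu_i)
\end{aligned}
$$

The log link is what keeps predicted counts non-negative, $\alpha$ captures overdispersion, and $\pi_i$ is the probability that an observation is a structural zero, with $f$ the Poisson or Negative Binomial probability mass function.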

### Poisson vs. Negative Binomial

In general, Poisson is the workhorse of the four count models listed above. Its main limitation is the assumption that the conditional variance equals the conditional mean, which often does not hold. If your data are overdispersed (conditional variance > conditional mean), you will need to use the Negative Binomial model instead. Fortunately, when you run the Negative Binomial, the output usually includes a statistical test for the dispersion parameter. Many packages call this parameter "alpha" ($\alpha$); R reports "theta" ($\theta$) instead, which is its inverse ($\theta = 1/\alpha$), so the Poisson limit $\alpha = 0$ corresponds to $\theta \to \infty$. The null hypothesis in the choice between Poisson and Negative Binomial is $H_0: \alpha = 0$, and the alternative hypothesis is $H_1: \alpha > 0$. If the test rejects $H_0$, there is evidence of overdispersion, and you would choose Negative Binomial over Poisson. If it does not, present the Poisson results.
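A minimal sketch of this comparison in R, assuming simulated data in place of the poster's (the column names mirror the question, but every value and coefficient is invented). It fits both models and runs a likelihood-ratio test of the Poisson restriction by hand, which is one common way to carry out the test described above:

```r
library(MASS)  # glm.nb(); MASS ships with standard R installations

# Simulated stand-in data: all values and coefficients are made up.
set.seed(1)
n <- 2000
d <- data.frame(Age    = round(runif(n, 18, 80)),
                Gender = factor(sample(c("F", "M"), n, replace = TRUE)))
mu <- exp(0.2 + 0.01 * d$Age + 0.3 * (d$Gender == "M"))
d$Visits <- rnbinom(n, size = 1.5, mu = mu)  # overdispersed counts

m_pois <- glm(Visits ~ Age + Gender, family = poisson, data = d)
m_nb   <- glm.nb(Visits ~ Age + Gender, data = d)

# Likelihood-ratio test of the Poisson restriction against the NB model.
# The restriction lies on the boundary of the parameter space, so the
# usual chi-square(1) p-value is halved.
lr_stat <- as.numeric(2 * (logLik(m_nb) - logLik(m_pois)))
p_value <- pchisq(lr_stat, df = 1, lower.tail = FALSE) / 2

# theta as reported by glm.nb, its inverse, and the test result
c(theta = m_nb$theta, alpha = 1 / m_nb$theta, LR = lr_stat, p = p_value)
```

A small p-value here points to overdispersion and hence to the Negative Binomial model.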

### ZIP vs. ZINB

One potential complication is zero inflation, which might be an issue here. This is where the zero-inflated models ZIP and ZINB come in. With these models, you assume that the process generating the excess zeroes is separate from the process generating the counts. As before, ZINB is appropriate when the outcome has excess zeroes and is overdispersed, while ZIP is appropriate when the outcome has excess zeroes but the conditional mean equals the conditional variance. For the zero-inflated models, in addition to the covariates you listed above, you will need to think of variables that may have generated the excess zeroes you saw in the outcome.

Again, statistical tests come with the output of these models (sometimes you have to request them when you execute a command) that let you decide empirically which model is best for your data. There are two tests of interest: the first is the test of the dispersion parameter, and the second is the Vuong test, which tells you whether the excess zeroes are generated by a separate process (i.e. whether there is, indeed, zero inflation in the outcome).
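As an illustration of specifying separate covariates for the two processes, here is a sketch using `zeroinfl()` from the `pscl` package on simulated data. The data, the coefficients, and the choice of `Age` as the variable driving the excess zeroes are all assumptions made for the example, not anything known about the poster's data:

```r
library(pscl)  # zeroinfl(); install.packages("pscl") if it is not installed

# Simulated stand-in data: all values and coefficients are made up.
set.seed(2)
n <- 2000
d <- data.frame(Age    = round(runif(n, 18, 80)),
                Gender = factor(sample(c("F", "M"), n, replace = TRUE)))
mu     <- exp(1.0 + 0.01 * d$Age + 0.3 * (d$Gender == "M"))
p_zero <- plogis(1 - 0.04 * d$Age)           # chance of being a structural zero
never_visits <- rbinom(n, 1, p_zero) == 1
d$Visits <- ifelse(never_visits, 0, rnbinom(n, size = 1.5, mu = mu))

# Covariates before "|" model the counts; covariates after "|" model
# the probability of a structural zero (logit link by default).
m_zinb <- zeroinfl(Visits ~ Age + Gender | Age, data = d, dist = "negbin")
m_zip  <- zeroinfl(Visits ~ Age + Gender | Age, data = d, dist = "poisson")

summary(m_zinb)  # count coefficients, zero-inflation coefficients, Log(theta)
```

The two parts need not share covariates; which variables belong in the zero part is a substantive question about how the zeroes arise.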

In choosing between ZIP and ZINB, you again look at the test of the dispersion parameter, written here as $\alpha$ (R reports its inverse, $\theta = 1/\alpha$): $H_0: \alpha = 0$ (ZIP is a better fit) and $H_1: \alpha > 0$ (ZINB is a better fit). The Vuong test lets you decide between Poisson and ZIP, or between NB and ZINB. For the Vuong test, $H_0$: excess zeroes are not the result of a separate process (Poisson/NB is a better fit), and $H_1$: excess zeroes are the result of a separate process (ZIP/ZINB is a better fit).
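The NB vs. ZINB comparison could be run in R roughly as follows, again on simulated data with invented values. `vuong()` is from the `pscl` package; the AIC comparison is an addition of this sketch, shown as a commonly used supplement and not as part of the procedure described above:

```r
library(MASS)  # glm.nb()
library(pscl)  # zeroinfl(), vuong()

# Simulated stand-in data: all values and coefficients are made up.
set.seed(3)
n <- 2000
d <- data.frame(Age    = round(runif(n, 18, 80)),
                Gender = factor(sample(c("F", "M"), n, replace = TRUE)))
mu     <- exp(1.0 + 0.01 * d$Age + 0.3 * (d$Gender == "M"))
p_zero <- plogis(1 - 0.04 * d$Age)           # chance of being a structural zero
d$Visits <- ifelse(rbinom(n, 1, p_zero) == 1, 0,
                   rnbinom(n, size = 1.5, mu = mu))

m_nb   <- glm.nb(Visits ~ Age + Gender, data = d)
m_zinb <- zeroinfl(Visits ~ Age + Gender | Age, data = d, dist = "negbin")

# Vuong test of the plain NB model against ZINB; the printed output
# states which of the two models is preferred.
vuong(m_nb, m_zinb)

# Information criteria for the same pair (lower is better).
AIC(m_nb, m_zinb)
```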

Other users can comment on the "usual" workflow, but my approach is to visualise the data and go from there. In your case, I would probably start with ZINB and run both the test on the dispersion parameter and the Vuong test: the former tells you which of ZIP and ZINB fits better, and the latter tells you whether you need a zero-inflated model at all.

Finally, I do not use R, but the Data Analysis Examples pages from IDRE at UCLA can guide you in fitting these models.

[Edit by another user without enough reputation to comment: This paper explains why you should not use the Vuong test to compare a zero-inflation model and provides alternatives.]