Solved – Selecting regression model for a non-negative integer response

model selection, negative-binomial-distribution, poisson distribution, regression

I have a series of non-negative integers $y=(y_1,y_2,\ldots,y_n)$ and a regression model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$, where $x_1$ and $x_2$ are each $0$ or $1$, $x_1 x_2$ is their interaction, and $\beta_0, \ldots, \beta_3$ are the parameters we want to estimate. For example, the data look like

y    x1    x2    x1*x2
10   0     0     0
23   0     1     0
18   1     1     1
19   1     0     0
25   0     1     0
...

I want to estimate the $\beta_0$, $\beta_1$, $\beta_2$ and $\beta_3$ coefficients and perform a test to see if any coefficient is nonzero.

There are several different regression models that might be applied to this case:

  1. Simple linear regression: lm
  2. Poisson regression (when $y$ follows a Poisson distribution): glm with family = poisson
  3. Quasi-Poisson regression (when $y$ is over-dispersed, meaning $\text{var}(y) \gt \text{mean}(y)$): glm with family = quasipoisson
  4. Negative binomial regression (when $y$ is over-dispersed, $\text{var}(y) \gt \text{mean}(y)$): glm.nb, in the MASS package.

The questions I want to ask are:

  1. How should I select the model for this dataset? Is there any way to choose the right model based on some descriptive statistics of my dataset?
  2. How should I check and validate if the fitted selected model is right for my data?

Best Answer

Your model is fully saturated, because you have indicators for every possible combination of categories. As such you have correctly specified the conditional expectation.

Any maximum likelihood estimator based on a distribution in the linear exponential family is consistent for the regression coefficients when the conditional expectation is correctly specified, even if the rest of the distribution is misspecified. Therefore you can use Poisson or a number of other distributions.
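
To see why this works in the saturated case, it may help to write out the Poisson score equations. This is only a sketch, assuming a log link and the indicator coding from the question; the cell index $c$ (one of the four $(x_1, x_2)$ combinations) and the notation $\bar y_c$ for the sample mean in that cell are introduced here for illustration.

$$\sum_{i=1}^{n} (y_i - \mu_i)\, x_i = 0, \qquad \mu_i = \exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2}).$$

Because the four columns of the saturated design span the four cell indicators, these equations are equivalent to

$$\sum_{i \in c} (y_i - \mu_c) = 0 \quad \text{for each cell } c, \qquad \text{so} \quad \hat\mu_c = \bar y_c .$$

The fitted means are just the cell means, and that holds whether or not $y$ is actually Poisson; only $E[y \mid x_1, x_2] = \mu_c$ is needed.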

As whuber implies though, this problem reduces to estimating means and testing for their differences, which could conveniently be done in a regression framework.
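
As a rough illustration of that reduction, here is a minimal sketch in plain Python (not the R functions named in the question). It uses only the five rows shown in the question, since the remaining rows are elided, and the helper names `cell_means` and `saturated_coefficients` are made up for this example. It computes the cell means and reads the coefficients straight off them, on the identity scale (what a linear model would estimate) and on the log scale (what a Poisson GLM with log link would estimate, assuming every cell mean is positive).

```python
from collections import defaultdict
from math import exp, log

# The five rows shown in the question, as (y, x1, x2).
# The elided rows ("...") are not reproduced here.
rows = [(10, 0, 0), (23, 0, 1), (18, 1, 1), (19, 1, 0), (25, 0, 1)]


def cell_means(rows):
    """Average y within each (x1, x2) cell."""
    groups = defaultdict(list)
    for y, x1, x2 in rows:
        groups[(x1, x2)].append(y)
    return {cell: sum(ys) / len(ys) for cell, ys in groups.items()}


def saturated_coefficients(means, link=lambda m: m):
    """Coefficients of the saturated 2x2 model, expressed on the link scale."""
    g = {cell: link(m) for cell, m in means.items()}
    b0 = g[(0, 0)]                                       # baseline cell
    b1 = g[(1, 0)] - g[(0, 0)]                           # effect of x1 when x2 = 0
    b2 = g[(0, 1)] - g[(0, 0)]                           # effect of x2 when x1 = 0
    b3 = g[(1, 1)] - g[(1, 0)] - g[(0, 1)] + g[(0, 0)]   # interaction
    return b0, b1, b2, b3


means = cell_means(rows)
identity_betas = saturated_coefficients(means)       # identity link
log_betas = saturated_coefficients(means, link=log)  # log link

# Summing the log-scale coefficients and exponentiating recovers the (1, 1) cell mean.
fitted_11 = exp(sum(log_betas))

print(means)
print(identity_betas)
print(log_betas)
```

Either way the point estimates are functions of the four cell means alone; testing whether a coefficient is nonzero then amounts to testing a contrast among those means, which needs the full data rather than the five rows shown.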
