Solved – Which model for panel data with dependent variables from [0,1]

fixed-effects-model, generalized-linear-model, panel-data, r, random-effects-model

I'm stuck on a regression modelling problem. I have panel data in which the dependent variable is a probability. Below is an excerpt from my data; the complete panel covers more countries and years, but it is unbalanced. What I observe are the number of events and the number of trials, and the event probability is derived from those values (the estimate of this probability should be quite good, given the large number of trials). All independent variables are country-year specific.

     country  year  event_prob  events trials    x    x_lag2 ... more variables
  1   Cyprus  2008  0.03902140  11342  290661   4.60   4.13  ...
  2   Cyprus  2009  0.04586650  13482  293940   4.60   4.48  ...
  3   Cyprus  2010  0.05188398  15206  293077   4.60   4.60  ...
  4   Cyprus  2011  0.06433411  18505  287639   5.79   4.60  ...
  5  Estonia  2008  0.07872978  21686  275449   6.02   4.11  ...
  6  Estonia  2009  0.09516270  33599  353069  13.18   4.91  ...
  7  Estonia  2010  0.08645905  36180  418464   7.95   6.03  ...
  8  Estonia  2011  0.07731997  31590  408562   5.53  13.18  ...
  ...
165  USA  2011  0.06100000  9192822  150702000   2.73  3.27  ...

My goal is to use regression analysis to find out which variables significantly affect the event probability. In R terms, I'm looking for a model of the form event_prob ~ x + x_lag2 + ... .

The problem is as follows: event_prob has to lie between 0 and 1, so event_prob ~ x + x_lag2 + ... might not be the best idea. I was therefore thinking of using the logit transform of event_prob, so that logit(event_prob) ranges from $-\infty$ to $\infty$. My first idea was to use R's plm package, i.e. plm(logit(event_prob) ~ x + x_lag2, data, index = c("country", "year"), model = "random") or model = "within" (see the sketch below). Is that a reasonable approach, or am I violating some essential assumptions?
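Here is roughly what I have in mind (a minimal sketch, assuming the data frame is called pdat and that event_prob is strictly between 0 and 1, so the logit is finite):

    # Logit-transform the proportion, then fit fixed- and random-effects panel models
    library(plm)

    pdat$logit_prob <- qlogis(pdat$event_prob)   # log(p / (1 - p))

    fe <- plm(logit_prob ~ x + x_lag2, data = pdat,
              index = c("country", "year"), model = "within")   # fixed effects
    re <- plm(logit_prob ~ x + x_lag2, data = pdat,
              index = c("country", "year"), model = "random")   # random effects

    summary(fe)
    summary(re)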

I was also thinking of using panel generalized linear models from the pglm package (with the logit link function); however, since I don't observe the outcomes of the individual binary trials (only the total number of events and trials is known), I got stuck there. Maybe someone can help me figure out how to proceed.
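What I do know is that base R's glm() accepts a two-column matrix of successes and failures as a binomial response, so a pooled (non-panel) fit from the counts would look something like the sketch below; whether pglm accepts the same kind of response is exactly what I'm unsure about. Again, pdat is just an assumed name for the data frame.

    # Pooled binomial logit fitted directly from the counts (no panel structure yet)
    pooled <- glm(cbind(events, trials - events) ~ x + x_lag2,
                  family = binomial(link = "logit"), data = pdat)
    summary(pooled)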

Since I have panel data, I'd like to fit both a fixed-effects and a random-effects model and then apply the Hausman (1978) test to decide which model is more appropriate.
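If the two plm fits from the sketch above (fe and re) are sensible, I assume the test would be run with plm's phtest():

    # Hausman test comparing the fixed-effects and random-effects fits
    phtest(fe, re)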

Do my first attempts at modelling make sense? I'm really not sure how to address this problem correctly. I hope the description of my problem is detailed enough; if not, I'm happy to provide more details.

In terms of software, I'd prefer R. SAS and SPSS are also ok since my university has licences for them. I just don't have much experience with them.

Best Answer

Addressing unobserved heterogeneity with fixed effects in panel models for fractional response variables (or in nonlinear models in general) is not trivial because of the incidental parameter problem (for $N\rightarrow\infty$ with $T$ fixed); see for example Lancaster (2000) or this answer here at CrossValidated. If $T$ is small (and fixed), the fixed-effects estimator is inconsistent (and random effects probably rely strongly on the distributional assumptions), so you cannot simply compare a random-effects and a fixed-effects model via the Hausman test.

Proposals for panel models for fractional response variables can be found in Papke and Wooldridge (2008) or here.
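To make this concrete, below is a rough sketch in the spirit of the correlated-random-effects (Mundlak/Chamberlain) idea used there: a pooled fractional logit with time averages of the covariates and standard errors clustered by country. It is not the exact Papke and Wooldridge estimator (they work with a fractional probit), and the data frame and variable names (pdat, x, x_lag2) are simply taken from the question.

    # Rough sketch: pooled fractional logit with Mundlak-style time averages
    # and country-clustered standard errors
    library(sandwich)
    library(lmtest)

    # Country-level time averages of the time-varying covariates
    pdat$x_bar      <- ave(pdat$x,      pdat$country)
    pdat$x_lag2_bar <- ave(pdat$x_lag2, pdat$country)

    frac <- glm(event_prob ~ x + x_lag2 + x_bar + x_lag2_bar,
                family = quasibinomial(link = "logit"),
                weights = trials, data = pdat)

    # Inference with standard errors clustered at the country level
    coeftest(frac, vcov. = vcovCL(frac, cluster = ~ country))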
