I'm stuck with a regression modeling problem. I have panel data where the dependent variable is a probability. Below is an excerpt from my data. The complete panel covers more countries and years, but it is unbalanced. What I can observe is the number of events and the number of trials. The event probability was derived from those values (the estimate of this probability should be quite good, given the large number of trials). All independent variables are country-year specific.
    country  year  event_prob   events     trials      x  x_lag2  ... more variables
1   Cyprus   2008  0.03902140    11342     290661   4.60    4.13  ...
2   Cyprus   2009  0.04586650    13482     293940   4.60    4.48  ...
3   Cyprus   2010  0.05188398    15206     293077   4.60    4.60  ...
4   Cyprus   2011  0.06433411    18505     287639   5.79    4.60  ...
5   Estonia  2008  0.07872978    21686     275449   6.02    4.11  ...
6   Estonia  2009  0.09516270    33599     353069  13.18    4.91  ...
7   Estonia  2010  0.08645905    36180     418464   7.95    6.03  ...
8   Estonia  2011  0.07731997    31590     408562   5.53   13.18  ...
...
165 USA      2011  0.06100000  9192822  150702000   2.73    3.27  ...
My goal is to use regression analysis to find out which variables are significant for the event probability. In R terminology, I'm looking for a model of the form event_prob ~ x + x_lag2 + ... .
The problem is as follows: event_prob has to be between 0 and 1, hence using event_prob ~ x + x_lag2 + ... might not be the best idea. So I was thinking of using the logit transform of event_prob, such that logit(event_prob) ranges from $-\infty$ to $\infty$. The first idea was to use R's plm package, i.e. plm(logit(event_prob)~x+x_lag2,data,index=c("country","year"),model="random") or model="within" (see below). Is that a reasonable approach or am I violating some essential assumptions?
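To make this first idea concrete, here is a minimal base-R sketch using only the eight excerpt rows shown above. It is an illustration under assumptions, not the plm call itself: qlogis() stands in for logit() (base R has no function called logit), and country dummies in lm() stand in for model="within", since both give the same slope estimates for one-way country effects. The object names d, fit_lsdv and p_hat are made up for the example.

```r
# Excerpt rows from the table above (Cyprus and Estonia only).
d <- data.frame(
  country = rep(c("Cyprus", "Estonia"), each = 4),
  year    = rep(2008:2011, times = 2),
  events  = c(11342, 13482, 15206, 18505, 21686, 33599, 36180, 31590),
  trials  = c(290661, 293940, 293077, 287639, 275449, 353069, 418464, 408562),
  x       = c(4.60, 4.60, 4.60, 5.79, 6.02, 13.18, 7.95, 5.53),
  x_lag2  = c(4.13, 4.48, 4.60, 4.60, 4.11, 4.91, 6.03, 13.18)
)
d$event_prob <- d$events / d$trials

# qlogis() is base R's logit: log(p / (1 - p)).
d$logit_prob <- qlogis(d$event_prob)

# Least-squares dummy-variable fit: one intercept shift per country.
fit_lsdv <- lm(logit_prob ~ x + x_lag2 + factor(country), data = d)
print(coef(fit_lsdv))

# Back-transform fitted values to the probability scale with plogis().
p_hat <- plogis(fitted(fit_lsdv))
```

Note that the back-transformed fitted values always lie strictly between 0 and 1, but they are not unbiased estimates of the expected probability, because the logit transform is nonlinear.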
I was also thinking of using panel generalized linear models from the package pglm (with the logit link function). However, since I don't know the outcomes of the individual binary events (only the total number of events and trials is known), I got stuck there. Maybe someone can help me with how to proceed here.
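For what it's worth, individual outcomes are not needed for a logit-link model in plain glm(): the response can be given as a two-column matrix of successes and failures. The sketch below uses base R's glm(), not pglm, and again only the eight excerpt rows; it ignores the panel structure entirely, and the quasibinomial family is my own choice to allow for overdispersion given the very large trial counts. The names d and fit_grouped are made up for the example.

```r
# Excerpt rows from the table above (Cyprus and Estonia only).
d <- data.frame(
  country = rep(c("Cyprus", "Estonia"), each = 4),
  year    = rep(2008:2011, times = 2),
  events  = c(11342, 13482, 15206, 18505, 21686, 33599, 36180, 31590),
  trials  = c(290661, 293940, 293077, 287639, 275449, 353069, 418464, 408562),
  x       = c(4.60, 4.60, 4.60, 5.79, 6.02, 13.18, 7.95, 5.53),
  x_lag2  = c(4.13, 4.48, 4.60, 4.60, 4.11, 4.91, 6.03, 13.18)
)

# Grouped binomial response: cbind(successes, failures).
# quasibinomial keeps the logit mean model but estimates a dispersion
# parameter instead of fixing it at 1.
fit_grouped <- glm(
  cbind(events, trials - events) ~ x + x_lag2,
  family = quasibinomial(link = "logit"),
  data   = d
)
print(summary(fit_grouped)$coefficients)

# Fitted event probabilities, one per country-year.
p_hat_grouped <- fitted(fit_grouped)
```

The same cbind(events, trials - events) response form is also accepted by lme4::glmer() with family = binomial, which would be one way to add a random country intercept.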
Since I have panel data, I'd like to compute both fixed-effects and random-effects models and then apply the Hausman (1978) test to decide which model is more appropriate.
Do my first attempts at modeling make sense? I'm really not sure how to correctly address this problem. I hope the description of my problem is detailed enough. If not, I'm happy to provide more details.
In terms of software, I'd prefer R. SAS and SPSS are also ok since my university has licences for them. I just don't have much experience with them.
Best Answer
Addressing unobserved heterogeneity with fixed effects in panel models for fractional response variables (or in nonlinear models in general) is not trivial because of the incidental parameters problem (for $N\rightarrow\infty$ with $T$ fixed); see, for example, Lancaster (2000) or this answer here at CrossValidated. If $T$ is small (and fixed), the fixed-effects estimator is inconsistent, and random-effects estimates probably depend strongly on the distributional assumptions. So you cannot simply compare a random-effects and a fixed-effects model with a Hausman test.
Proposals for panel models for fractional response variables can be found in Papke and Wooldridge (2008) or here.
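As a rough illustration of the kind of estimator Papke and Wooldridge discuss, the sketch below fits a pooled fractional probit with a Mundlak-type correction: the country-level time average of a covariate is added as a regressor, and the model is estimated by quasi-likelihood with glm(). This is a simplified sketch under assumptions, not their exact procedure. It uses only the eight excerpt rows from the question, it adds the time average of x only (with just two countries in the excerpt, the averages of x and x_lag2 would be perfectly collinear), and it leaves out year dummies and cluster-robust standard errors. The names d, x_bar and fit_frac are made up for the example.

```r
# Excerpt rows from the question (Cyprus and Estonia only).
d <- data.frame(
  country = rep(c("Cyprus", "Estonia"), each = 4),
  year    = rep(2008:2011, times = 2),
  events  = c(11342, 13482, 15206, 18505, 21686, 33599, 36180, 31590),
  trials  = c(290661, 293940, 293077, 287639, 275449, 353069, 418464, 408562),
  x       = c(4.60, 4.60, 4.60, 5.79, 6.02, 13.18, 7.95, 5.53),
  x_lag2  = c(4.13, 4.48, 4.60, 4.60, 4.11, 4.91, 6.03, 13.18)
)
d$event_prob <- d$events / d$trials

# Mundlak-type term: time average of x within each country.
# ave() returns the group mean repeated for every row of the group.
d$x_bar <- ave(d$x, d$country)

# Pooled fractional probit by quasi-likelihood: the response is the
# fraction itself, and quasibinomial accepts non-integer responses.
fit_frac <- glm(
  event_prob ~ x + x_lag2 + x_bar,
  family = quasibinomial(link = "probit"),
  data   = d
)
print(coef(fit_frac))

# Fitted fractions stay inside (0, 1) by construction of the link.
p_hat_frac <- fitted(fit_frac)
```

On the full panel, the time averages of all time-varying covariates would be included, and inference should use standard errors clustered by country (for example via sandwich::vcovCL()), since the default glm() standard errors ignore within-country dependence.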