First differencing and within transformations like demeaning are not available in models like logit because, in nonlinear models, these tricks do not remove the unobserved fixed effects. Even if you had a smaller data set in which it was feasible to include N-1 individual dummies to estimate the fixed effects directly, this would lead to biased estimates unless the time dimension of your data is large. Elimination of the fixed effects in panel logit therefore works neither through differencing nor through demeaning; instead it conditions on a sufficient statistic for the fixed effect (the sum of each individual's outcomes over time), which is only possible due to the logit functional form. If you are interested in the details, have a look at these notes by Söderbom on PDF page 30 (an explanation of why demeaning/first differencing does not help in logit/probit) and page 42 (an introduction to the panel logit estimator).
Another problem is that xtlogit, and panel logit models in general, do not estimate the fixed effects directly, yet those are needed to calculate marginal effects. Without them it will be very awkward to interpret your coefficients, which might be disappointing after having run the model for hours and hours.
With such a large data set and the previously mentioned conceptual difficulties of FE panel logit, I would stick with the linear probability model. I hope this answer does not disappoint you, but there are many good reasons for giving such advice: the LPM is much faster, the coefficients can be interpreted straight away (this holds in particular if you have interaction effects in your model, because interaction coefficients cannot be read at face value in non-linear models!), the fixed effects are easily controlled for, and you can adjust the standard errors for autocorrelation and clustering without estimation times increasing beyond reason. A rough sketch of that approach is below. I hope this helps.
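To make the advice concrete, here is a minimal sketch, assuming Python with pandas and the linearmodels package (the original question is about Stata, so treat this only as an analogue); all file and variable names are hypothetical placeholders:

```python
# Minimal LPM sketch: entity fixed effects + entity-clustered standard errors.
# "panel.csv", "employed", "x1", "x2", "person_id", "year" are placeholders.
import pandas as pd
from linearmodels.panel import PanelOLS

df = pd.read_csv("panel.csv")                 # long-format panel data
df = df.set_index(["person_id", "year"])      # entity-time MultiIndex

# EntityEffects absorbs the individual fixed effects without creating dummies.
mod = PanelOLS.from_formula("employed ~ x1 + x2 + EntityEffects", data=df)
res = mod.fit(cov_type="clustered", cluster_entity=True)
print(res.summary)   # LPM coefficients read directly as changes in probability
```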
They deal with estimating different parameters but indeed share common features:
Nickell (Econometrica 1981) bias:
The time demeaning operation of fixed effects in a dynamic panel data model
$$
y_{it}=\alpha_i+\beta y_{it-1}+\epsilon_{it}
$$
leads to a transformed regression model
$$
y_{it}-y_{i\cdot}=\beta (y_{it-1}-y_{i\cdot-1})+(\epsilon_{it}-\epsilon_{i\cdot})
$$
where dots indicate time averages. Here, the error terms $(\epsilon_{it}-\epsilon_{i\cdot})$ and the regressors $(y_{it-1}-y_{i\cdot-1})$ are correlated even as $N\to\infty$, where $N$ is the number of units in the panel. This can be shown formally, but it essentially follows from the observation that $y_{i\cdot}$ contains future $y_{it}$, which are generated by past $y_{it}$, which, in turn, are generated by past $\epsilon_{it}$, which are contained in $\epsilon_{i\cdot}$.
Hence, even as $N\to\infty$, the FE estimator will not consistently estimate $\beta$.
Incidental parameter problem:
The classical Neyman and Scott (Econometrica 1948) case is an example that MLEs need not be consistent. Consider a random sample of size $N\equiv nr$, $$X_{11},\ldots,X_{1r},X_{21},\ldots,X_{2r},\ldots,X_{n1},\ldots,X_{nr},$$ where we have $n$ subsamples of size $r$, $X_{\alpha 1},\ldots,X_{\alpha r}$, $\alpha=1,\ldots,n$, which are distributed as $N(\theta_\alpha,\sigma^2)$. Hence, each subsample has a different mean $\theta_\alpha$, but a common variance $\sigma^2$.
It can be shown that the MLE for $\sigma^2$ is given by
$$
\hat{\sigma}^2=\frac{1}{rn}\sum_{\alpha=1}^n\sum_{j=1}^r(X_{\alpha j}-X_{\alpha \cdot})^2
$$
where $X_{\alpha\cdot}$ denotes the mean of subsample $\alpha$. Writing $S_\alpha^2=\frac{1}{r}\sum_{j=1}^r(X_{\alpha j}-X_{\alpha\cdot})^2$ for the ML variance of subsample $\alpha$, one may show that
$$\hat{\sigma}^2\overset{p}{\to}E(S_\alpha^2)=\frac{r-1}{r}\sigma^2\neq\sigma^2$$
Hence, the MLE is not consistent as $n\to\infty$.
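A quick simulation sketch in Python (illustrative only, not part of the original argument; uses numpy with arbitrary parameter values) reproduces the $(r-1)/r$ factor with many small subsamples:

```python
# Neyman-Scott illustration: the MLE of sigma^2 with many groups of size r
# converges to (r-1)/r * sigma^2, not sigma^2.
import numpy as np

rng = np.random.default_rng(0)
n, r, sigma2 = 100_000, 3, 2.0                    # many subsamples, each of size r

theta = rng.normal(size=(n, 1))                   # a different mean per subsample
X = theta + rng.normal(scale=np.sqrt(sigma2), size=(n, r))

# MLE: average squared deviation from the subsample means
sigma2_mle = ((X - X.mean(axis=1, keepdims=True)) ** 2).sum() / (n * r)

print("MLE of sigma^2:    ", round(sigma2_mle, 3))   # close to (r-1)/r * sigma^2
print("probability limit: ", (r - 1) / r * sigma2)   # 1.333...
print("true sigma^2:      ", sigma2)
```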
So the two are related through the fact that both the FE estimator and $\hat{\sigma}^2$ are inconsistent estimators, but both become consistent as the "other" dimension also goes to infinity: $r$ in the incidental parameter problem and $T$, the number of time series observations per panel unit, in the Nickell bias case.
Nickell shows the inconsistency to be approximately equal to
$$
-\frac{1+\beta}{T-1}
$$
for $T$ "reasonably" large.
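A small Monte Carlo sketch in Python (illustrative only, numpy with arbitrary parameter values) shows the within estimator's bias sitting close to this approximation:

```python
# Within (FE) estimation of a dynamic panel y_it = a_i + beta*y_{i,t-1} + e_it.
# With T fixed, the estimator is biased; the bias is close to -(1+beta)/(T-1).
import numpy as np

rng = np.random.default_rng(0)
N, T, beta, reps, burn = 2000, 10, 0.5, 200, 50

biases = []
for _ in range(reps):
    alpha = rng.normal(size=N)
    y = np.zeros((N, T + burn))
    for t in range(1, T + burn):
        y[:, t] = alpha + beta * y[:, t - 1] + rng.normal(size=N)
    y = y[:, burn:]                                # drop burn-in periods

    y_lag, y_cur = y[:, :-1], y[:, 1:]             # T-1 usable (lag, current) pairs
    x = y_lag - y_lag.mean(axis=1, keepdims=True)  # within transformation
    z = y_cur - y_cur.mean(axis=1, keepdims=True)
    beta_fe = (x * z).sum() / (x ** 2).sum()       # pooled within estimator
    biases.append(beta_fe - beta)

print("average FE bias:      ", round(float(np.mean(biases)), 3))
print("Nickell approximation:", round(-(1 + beta) / (T - 1), 3))
```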
Best Answer
In FE models of the type $$y_{it} = \alpha_i + \beta X_{it} + u_{it}$$ $\alpha_i$ is the incidental parameter because, theoretically speaking, it is of secondary importance: usually $\beta$ is the parameter of interest. Still, $\alpha_i$ matters because it carries useful information on the individual intercept.
Most panels are short, i.e., T is relatively small. In order to illustrate the incidental parameter problem I will disregard $\beta$ for simplicity, so the model is now: $$y_{it} = \alpha_i + u_{it}, \quad \quad u_{it}\sim \text{i.i.d. } N(0,\sigma^2)$$ Using the deviations-from-means (within) method we have $\hat{\alpha}_i = \bar{y}_i$ and $\hat{u}_{it} = y_{it}-\bar{y}_i$. Let's have a look at the estimate of $\sigma^2$: $$\hat{\sigma}^2 = \frac{1}{NT}\sum_i\sum_t (y_{it}-\bar{y}_i)^2 = \sigma^2\frac{\chi_{N(T-1)}^2}{NT} \overset p{\to} \sigma^2\frac{N(T-1)}{NT} = \sigma^2\frac{T-1}{T}$$
You can see that if T is "large" the factor $\frac{T-1}{T}$ is close to one, BUT if T is small (which is the case in most panels), then the estimate of $\sigma^2$ will be inconsistent. This is what makes the FE (ML) estimate of $\sigma^2$ inconsistent.
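A tiny simulation sketch (illustrative Python, not part of the original derivation; arbitrary parameter values) makes the $\frac{T-1}{T}$ factor visible and shows that dividing by $N(T-1)$ instead of $NT$ removes it:

```python
# With T fixed and N large, the within ML estimate of sigma^2 settles at
# sigma^2*(T-1)/T; dividing by N*(T-1) instead of N*T removes the bias.
import numpy as np

rng = np.random.default_rng(1)
N, T, sigma2 = 50_000, 4, 1.0

alpha = rng.normal(size=(N, 1))                          # individual intercepts
y = alpha + rng.normal(scale=np.sqrt(sigma2), size=(N, T))
ssr = ((y - y.mean(axis=1, keepdims=True)) ** 2).sum()   # within sum of squares

print("divide by N*T:     ", round(ssr / (N * T), 3))        # approx 0.75
print("divide by N*(T-1): ", round(ssr / (N * (T - 1)), 3))  # approx 1.0
```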
The reason $\beta$ is usually estimated consistently is that N is usually sufficiently large, so the desired asymptotic requirements are met.
Note that in spatial panels, for example, the situation is the opposite: T is usually considered large enough, but N is fixed, so the asymptotics come from T. Therefore, in spatial panels you need a large T!
Hope it helps somehow.