Econometrics – Relationship Between Random Effects Model and Mixed Models

Tags: econometrics, lme4-nlme, mixed-model, panel-data, plm

I used to think that "random effects model" in econometrics corresponds to a "mixed model with random intercept" outside of econometrics, but now I am not sure. Does it?

Econometrics uses terms like "fixed effects" and "random effects" somewhat differently from the literature on mixed models, and this causes notorious confusion. Let us consider a simple situation where $y$ linearly depends on $x$ but with a different intercept in different groups of measurements:

$$y_{it} = \beta x_{it} + u_i + \epsilon_{it}.$$

Here each unit/group $i$ is observed at different timepoints $t$. Econometricians call it "panel data".

  • In mixed models terminology, we can treat $u_i$ as a fixed effect or as a random effect (in this case, a random intercept). Treating it as fixed means fitting $\hat \beta$ and $\hat u_i$ to minimize squared error (i.e. running OLS regression with dummy group variables). Treating it as random means that we additionally assume that $u_i\sim\mathcal N(u_0,\sigma^2_u)$ and use maximum likelihood to fit $u_0$ and $\sigma^2_u$ instead of fitting each $u_i$ on its own. This leads to the "partial pooling" effect, where the estimates $\hat u_i$ get shrunk toward their mean $\hat u_0$ (see the small simulated illustration after this list).

    R formula when treating group as fixed:    y ~ x + group
    R formula when treating group as random:   y ~ x + (1|group)
    
  • In econometrics terminology, we can treat this whole model as a fixed effects model or as a random effects model. The first option is equivalent to the fixed effect above (but econometrics has its own way of estimating $\beta$ in this case, called "within" estimator). I used to think that the second option is equivalent to the random effect above; e.g. @JiebiaoWang in his highly upvoted answer to What is a difference between random effects-, fixed effects- and marginal model? says that

    In econometrics, the random-effects model may only refer to random intercept model as in biostatistics
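
To illustrate the partial pooling mentioned in the first bullet, here is a minimal sketch on hypothetical simulated toy data (not the paneldata used below): the per-group intercepts from lmer() come out shrunk toward their common mean relative to the OLS dummy-variable estimates.

    library(lme4)
    set.seed(1)
    d <- data.frame(group = rep(1:10, each = 5), x = rnorm(50))
    d$y <- 2 * d$x + rep(rnorm(10, sd = 3), each = 5) + rnorm(50)
    
    fixed  <- lm(y ~ x + as.factor(group) - 1, data = d)   # no pooling
    random <- lmer(y ~ x + (1 | group), data = d)          # partial pooling
    
    # per-group intercepts: OLS dummies vs. shrunken estimates (BLUPs)
    cbind(ols  = coef(fixed)[-1],
          blup = fixef(random)[1] + ranef(random)$group[, 1])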

Okay — let us test if this understanding is correct. Here is some random data generated by @ChristophHanck in his answer to What is the difference between fixed effect, random effect and mixed effect models? (I put the data here on pastebin for those who do not use R):

[figure: plot of the generated data]

@ChristophHanck does two fits using econometric approaches:

fe <- plm(stackY~stackX, data = paneldata, model = "within")
re <- plm(stackY~stackX, data = paneldata, model = "random")

The first one yields an estimate of $\beta$ equal to -1.0451, the second one yields 0.77031 (yes, positive!). I tried to reproduce this with lm and lmer:

l1 = lm(stackY ~ stackX + as.factor(unit), data = paneldata)
l2 = lmer(stackY ~ stackX + (1|as.factor(unit)), data = paneldata)

The first one yields -1.045, in perfect agreement with the within estimator above. Cool. But the second yields -1.026, which is miles away from the random effects estimator. Huh? What is going on? In fact, what is plm even doing when called with model = "random"?

Whatever it is doing, can one somehow understand it via the mixed models perspective?

And what is the intuition behind whatever it is doing? I read in a couple of econometrics sources that the random effects estimator is a weighted average of the fixed effects estimator and the "between" estimator, which is more or less the regression slope we get if we do not include group identity in the model at all (in this case, that estimate is strongly positive, around 4). E.g. @Andy writes here:

The random effects estimator then uses a matrix weighted average of the within and between variation of your data. […] This makes random effects more efficient[.]
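
For reference, the "between" estimator mentioned in the quote can be computed directly with plm on the same data (model = "between" regresses the group means of y on the group means of x):

    be <- plm(stackY ~ stackX, data = paneldata, model = "between")
    coef(be)   # the slope is strongly positive here, around 4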

Why? Why would we want this weighted average? And in particular, why would we want it instead of running a mixed model?

Best Answer

Summary: the "random-effects model" in econometrics and a "random intercept mixed model" are indeed the same model, but they are estimated in different ways. The econometrics way is to use FGLS (feasible generalized least squares), and the mixed model way is to use ML. There are different algorithms for doing FGLS, and some of them (on this dataset) produce results that are very close to ML.


1. Differences between estimation methods in plm

I will answer based on my own testing of plm(..., model = "random") and lmer(), using the data generated by @ChristophHanck.

According to the plm package manual, there are four options for random.method, the method of estimating the variance components in the random effects model. @amoeba used the default, swar (Swamy and Arora, 1972).

For random effects models, four estimators of the transformation parameter are available by setting random.method to one of "swar" (Swamy and Arora (1972)) (default), "amemiya" (Amemiya (1971)), "walhus" (Wallace and Hussain (1969)), or "nerlove" (Nerlove (1971)).

I tested all four options using the same data, getting an error for amemiya with the CRAN version of plm (the R-Forge version works; see the comment in the code below), and very different coefficient estimates for the variable stackX across methods. The estimates from random.method='nerlove' and 'amemiya' are nearly equivalent to that from lmer(): -1.029 and -1.025 vs -1.026. They are also not very different from that obtained in the "fixed-effects" model, -1.045.

# "amemiya" only works using the most recent version:
# install.packages("plm", repos="http://R-Forge.R-project.org")

re0 <- plm(stackY~stackX, data = paneldata, model = "random") #random.method='swar'
re1 <- plm(stackY~stackX, data = paneldata, model = "random",  random.method='amemiya')
re2 <- plm(stackY~stackX, data = paneldata, model = "random",  random.method='walhus')
re3 <- plm(stackY~stackX, data = paneldata, model = "random",  random.method='nerlove')
l2  <- lmer(stackY~stackX+(1|as.factor(unit)), data = paneldata)

coef(re0)     #    (Intercept)   stackX    18.3458553   0.7703073 
coef(re1)     #    (Intercept)   stackX    30.217721   -1.025186 
coef(re2)     #    (Intercept)   stackX    -1.15584     3.71973 
coef(re3)     #    (Intercept)   stackX    30.243678   -1.029111 
fixef(l2)     #    (Intercept)   stackX    30.226295   -1.026482 

Unfortunately I do not have time right now, but interested readers can find the four references to check their estimation procedures. It would be very helpful to figure out why they make such a difference. I expect that in some cases the plm estimation procedure, which runs lm() on transformed data, should be equivalent to the maximum likelihood procedure used in lmer().
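
To make this connection concrete, here is a minimal sketch (not plm's internal code) of the quasi-demeaning transformation that underlies the random effects GLS estimator: with $\theta = 1 - \sqrt{\sigma^2_\epsilon/(\sigma^2_\epsilon + T\,\sigma^2_u)}$, running OLS on $y_{it} - \theta\bar y_i$ against $x_{it} - \theta\bar x_i$ gives the GLS slope. For illustration I plug in the variance components estimated by lmer(), so the result should essentially reproduce the ML estimate:

    library(lme4)
    l2 <- lmer(stackY ~ stackX + (1 | as.factor(unit)), data = paneldata)
    vc     <- as.data.frame(VarCorr(l2))
    s2_u   <- vc$vcov[1]                   # random-intercept variance
    s2_eps <- vc$vcov[2]                   # residual variance
    Tn     <- mean(table(paneldata$unit))  # panel length (balanced panel assumed)
    theta  <- 1 - sqrt(s2_eps / (s2_eps + Tn * s2_u))
    
    # quasi-demean y and x within each unit, then run plain OLS
    qd <- function(v, g) v - theta * ave(v, g)
    coef(lm(qd(stackY, unit) ~ qd(stackX, unit), data = paneldata))

An FGLS routine such as plm's does the same thing, except that it plugs in variance components estimated by one of the four moment-based methods above instead of the ML estimates.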

2. Comparison between GLS and ML

The authors of the plm package compared the two approaches in Section 7 of their paper: Yves Croissant and Giovanni Millo, 2008, Panel Data Econometrics in R: The plm package.

Econometrics deal mostly with non-experimental data. Great emphasis is put on specification procedures and misspecification testing. Model specifications tend therefore to be very simple, while great attention is put on the issues of endogeneity of the regressors, dependence structures in the errors and robustness of the estimators under deviations from normality. The preferred approach is often semi- or non-parametric, and heteroskedasticity-consistent techniques are becoming standard practice both in estimation and testing.

For all these reasons, [...] panel model estimation in econometrics is mostly accomplished in the generalized least squares framework based on Aitken’s Theorem [...]. On the contrary, longitudinal data models in nlme and lme4 are estimated by (restricted or unrestricted) maximum likelihood. [...]

The econometric GLS approach has closed-form analytical solutions computable by standard linear algebra and, although the latter can sometimes get computationally heavy on the machine, the expressions for the estimators are usually rather simple. ML estimation of longitudinal models, on the contrary, is based on numerical optimization of nonlinear functions without closed-form solutions and is thus dependent on approximations and convergence criteria.


3. Update on mixed models

I appreciate that @ChristophHanck provided a thorough introduction to the four random.method options used in plm and explained why their estimates are so different. As requested by @amoeba, I will add some thoughts on mixed models (likelihood-based) and their connection with GLS.

The likelihood-based method usually assumes a distribution for both the random effect and the error term. A normal distribution is commonly assumed, but some studies assume a non-normal distribution. I will follow @ChristophHanck's notation for a random intercept model and allow unbalanced data, i.e., let $T=n_i$.

The model is \begin{equation} y_{it}= \boldsymbol x_{it}^{'}\boldsymbol\beta + \eta_i + \epsilon_{it}\qquad i=1,\ldots,m,\quad t=1,\ldots,n_i \end{equation} with $\eta_i \sim N(0,\sigma^2_\eta), \epsilon_{it} \sim N(0,\sigma^2_\epsilon)$.

For each $i$, $$\boldsymbol y_i \sim N(\boldsymbol X_{i}\boldsymbol\beta, \boldsymbol\Sigma_i), \qquad\boldsymbol\Sigma_i = \sigma^2_\eta \boldsymbol 1_{n_i} \boldsymbol 1_{n_i}^{'} + \sigma^2_\epsilon \boldsymbol I_{n_i}.$$ So the log-likelihood function is $$const -\frac{1}{2} \sum_i\mathrm{log}|\boldsymbol\Sigma_i| - \frac{1}{2} \sum_i(\boldsymbol y_i - \boldsymbol X_{i}\boldsymbol\beta)^{'}\boldsymbol\Sigma_i^{-1}(\boldsymbol y_i - \boldsymbol X_{i}\boldsymbol\beta).$$
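
As a concrete illustration, here is a minimal sketch that evaluates this log-likelihood (up to the constant). The helper and the list-of-groups data layout (y_list and X_list holding $\boldsymbol y_i$ and $\boldsymbol X_i$ for each unit) are hypothetical, purely for exposition:

    # evaluate the random-intercept log-likelihood (up to the constant)
    # for given beta, s2_eta, s2_eps
    loglik <- function(beta, s2_eta, s2_eps, y_list, X_list) {
      sum(mapply(function(y, X) {
        n     <- length(y)
        Sigma <- s2_eta * matrix(1, n, n) + s2_eps * diag(n)  # Sigma_i
        r     <- y - X %*% beta                               # residual vector
        as.numeric(-0.5 * (determinant(Sigma, logarithm = TRUE)$modulus +
                           t(r) %*% solve(Sigma, r)))
      }, y_list, X_list))
    }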

When all the variances are known, as shown in Laird and Ware (1982), the MLE is $$\hat{\boldsymbol\beta} = \left(\sum_i\boldsymbol X_i^{'} \boldsymbol\Sigma_i^{-1} \boldsymbol X_i \right)^{-1} \left(\sum_i \boldsymbol X_i^{'} \boldsymbol\Sigma_i^{-1} \boldsymbol y_i \right),$$ which is equivalent to the GLS estimator $\hat\beta_{RE}$ derived by @ChristophHanck (a code sketch of this closed form follows the list below). So the key difference is in the estimation of the variances. Given that there is no closed-form solution for the variance components, there are several approaches:

  • direct maximization of the log-likelihood function using optimization algorithms;
  • Expectation-Maximization (EM) algorithm: closed-form solutions exist, but the estimator for $\boldsymbol \beta$ involves empirical Bayesian estimates of the random intercept;
  • a combination of the above two, Expectation/Conditional Maximization Either (ECME) algorithm (Schafer, 1998; R package lmm). With a different parameterization, closed-form solutions for $\boldsymbol \beta$ (as above) and $\sigma^2_\epsilon$ exist. The solution for $\sigma^2_\epsilon$ can be written as $$\sigma^2_\epsilon = \frac{1}{\sum_i n_i}\sum_i(\boldsymbol y_i - \boldsymbol X_{i} \hat{\boldsymbol\beta})^{'}(\hat\xi \boldsymbol 1_{n_i} \boldsymbol 1_{n_i}^{'} + \boldsymbol I_{n_i})^{-1}(\boldsymbol y_i - \boldsymbol X_{i} \hat{\boldsymbol\beta}),$$ where $\xi$ is defined as $\sigma^2_\eta/\sigma^2_\epsilon$ and can be estimated in an EM framework.
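
For completeness, here is a sketch of the closed-form Laird-Ware/GLS estimate of $\boldsymbol\beta$ for given variance components, using the same hypothetical y_list/X_list layout as above:

    # closed-form GLS estimate of beta for known s2_eta, s2_eps
    gls_beta <- function(s2_eta, s2_eps, y_list, X_list) {
      p <- ncol(X_list[[1]])
      A <- matrix(0, p, p)   # accumulates sum_i X_i' Sigma_i^{-1} X_i
      b <- numeric(p)        # accumulates sum_i X_i' Sigma_i^{-1} y_i
      for (i in seq_along(y_list)) {
        n     <- length(y_list[[i]])
        Sigma <- s2_eta * matrix(1, n, n) + s2_eps * diag(n)
        A <- A + t(X_list[[i]]) %*% solve(Sigma, X_list[[i]])
        b <- b + t(X_list[[i]]) %*% solve(Sigma, y_list[[i]])
      }
      solve(A, b)
    }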

In summary, MLE makes distributional assumptions and is computed by an iterative algorithm. The key difference between MLE and GLS is in the estimation of the variances.

Croissant and Millo (2008) pointed out that

While under normality, homoskedasticity and no serial correlation of the errors OLS are also the maximum likelihood estimator, in all the other cases there are important differences.

In my opinion, regarding the distributional assumption: much like the trade-off between parametric and non-parametric approaches, MLE is more efficient when the assumption holds, while GLS is more robust when it does not.