Solved – Unified view on shrinkage: what is the relation (if any) between Stein’s paradox, ridge regression, and random effects in mixed models

mixed model, regression, regularization, ridge regression, steins-phenomenon

Consider the following three phenomena.

  1. Stein's paradox: given some data from a multivariate normal distribution in $\mathbb R^n, \: n\ge 3$, the sample mean is not a very good estimator of the true mean. One can obtain an estimate with lower mean squared error by shrinking all the coordinates of the sample mean towards zero [or towards their mean, or actually towards any value, if I understand correctly]. (A small simulation sketch follows this list.)

    NB: usually Stein's paradox is formulated by considering only a single data point from $\mathbb R^n$; please correct me if this distinction is crucial and my formulation above is not correct.

  2. Ridge regression: given some dependent variable $\mathbf y$ and some independent variables $\mathbf X$, the standard regression $\beta = (\mathbf X^\top \mathbf X)^{-1} \mathbf X^\top \mathbf y$ tends to overfit the data and lead to poor out-of-sample performance. One can often reduce overfitting by shrinking $\beta$ towards zero: $\beta = (\mathbf X^\top \mathbf X + \lambda \mathbf I)^{-1} \mathbf X^\top \mathbf y$.

  3. Random effects in multilevel/mixed models: given some dependent variable $y$ (e.g. student's height) that depends on some categorical predictors (e.g. school id and student's gender), one is often advised to treat some predictors as 'random', i.e. to suppose that the mean student height in each school comes from some underlying normal distribution. This results in shrinking the estimates of mean height per school towards the global mean.
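For concreteness, here is a minimal simulation sketch of item 1 in its single-observation form (see the NB above); the dimension, noise level, and true mean are arbitrary choices, and the plain (not positive-part) James–Stein estimator is used:

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma2, n_trials = 10, 1.0, 5000
theta = rng.normal(size=m)                    # arbitrary true mean vector

mse_mle, mse_js = 0.0, 0.0
for _ in range(n_trials):
    # one observation from N(theta, sigma^2 I)
    y = theta + rng.normal(scale=np.sqrt(sigma2), size=m)
    shrink = 1.0 - (m - 2) * sigma2 / np.sum(y**2)   # plain James-Stein factor
    mse_mle += np.sum((y - theta) ** 2)
    mse_js += np.sum((shrink * y - theta) ** 2)

print("MSE of raw observation (MLE):", mse_mle / n_trials)
print("MSE of James-Stein estimate :", mse_js / n_trials)   # consistently smaller for m >= 3
```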

I have a feeling that all of these are various aspects of the same "shrinkage" phenomenon, but I am not sure and certainly lack a good intuition about it. So my main question is: is there indeed a deep similarity between these three things, or is it only a superficial resemblance? What is the common theme here? What is the correct intuition about it?

In addition, here are some pieces of this puzzle that don't really fit together for me:

  • In ridge regression, $\beta$ is not shrunk uniformly; ridge shrinkage is actually related to the singular value decomposition of $\mathbf X$, with low-variance directions being shrunk more (see e.g. The Elements of Statistical Learning, Section 3.4.1). But the James–Stein estimator simply multiplies the sample mean by a single scaling factor. How do these fit together?

    Update: see James-Stein Estimator with unequal variances and e.g. here regarding variances of $\beta$ coefficients.

  • The sample mean is optimal (admissible) in dimensions one and two. Does that mean that when there are only one or two predictors in the regression model, ridge regression will always be worse than ordinary least squares? Actually, come to think of it, I cannot imagine a situation in 1D (i.e. simple, non-multiple regression) where ridge shrinkage would be beneficial…

    Update: No. See Under exactly what conditions is ridge regression able to provide an improvement over ordinary least squares regression?

  • On the other hand, the sample mean is always suboptimal (inadmissible) in dimension three and above. Does that mean that with three or more predictors ridge regression is always better than OLS, even if all the predictors are uncorrelated (orthogonal)? Usually ridge regression is motivated by multicollinearity and the need to "stabilize" the $(\mathbf X^\top \mathbf X)^{-1}$ term.

    Update: Yes! See the same thread as above. (A small simulation sketch of the orthogonal case follows this list.)

  • There are often heated discussions about whether various factors in ANOVA should be included as fixed or random effects. Shouldn't we, by the same logic, always treat a factor as random if it has more than two levels (or if there are more than two factors? now I am confused)?

    Update: ?
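As a quick illustration of the orthogonal case in the third bullet, here is a minimal simulation sketch with an orthonormal design ($\mathbf X^\top \mathbf X = \mathbf I$) and a small, hand-picked $\lambda$ (an assumption rather than a data-driven choice); even without any multicollinearity, ridge achieves a lower coefficient MSE than OLS:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 50, 5, 1.0
beta = np.array([2.0, -1.5, 1.0, 0.5, -2.0])     # arbitrary true coefficients

# Orthonormal design: X'X = I, so OLS and ridge have simple closed forms
X, _ = np.linalg.qr(rng.normal(size=(n, p)))

lam = 0.1                                        # fixed small penalty, chosen by hand
err_ols, err_ridge = 0.0, 0.0
for _ in range(5000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    b_ols = X.T @ y                              # OLS estimate, since X'X = I
    b_ridge = b_ols / (1 + lam)                  # ridge estimate, since X'X = I
    err_ols += np.sum((b_ols - beta) ** 2)
    err_ridge += np.sum((b_ridge - beta) ** 2)

print("OLS   coefficient MSE:", err_ols / 5000)
print("Ridge coefficient MSE:", err_ridge / 5000)   # smaller despite orthogonal predictors
```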


Update: I got some excellent answers, but none provides enough of a big picture, so I will leave the question "open". I can promise to award a bounty of at least 100 points to a new answer that surpasses the existing ones. I am mostly looking for a unifying view that could explain how the general phenomenon of shrinkage manifests itself in these various contexts and point out the principal differences between them.

Best Answer

Connection between James–Stein estimator and ridge regression

Let $\mathbf y$ be a length-$m$ vector of observations of $\boldsymbol \theta$, ${\mathbf y} \sim N({\boldsymbol \theta}, \sigma^2 I)$. The James–Stein estimator is $$\widehat{\boldsymbol \theta}_{JS} = \left( 1 - \frac{(m-2) \sigma^2}{\|{\mathbf y}\|^2} \right) {\mathbf y}.$$ In ridge-regression terms, we can estimate $\boldsymbol \theta$ via $\min_{\boldsymbol{\theta}} \|\mathbf{y}-\boldsymbol{\theta}\|^2 + \lambda\|\boldsymbol{\theta}\|^2,$ whose solution is $$\widehat{\boldsymbol \theta}_{\mathrm{ridge}} = \frac{1}{1+\lambda}\mathbf y.$$ The two estimators thus have the same multiplicative-shrinkage form; the difference is that the James–Stein estimator sets the shrinkage factor from (an estimate of) $\sigma^2$, whereas ridge regression determines $\lambda$, e.g. via cross-validation.
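As a quick sanity check of this correspondence (a sketch assuming $\sigma^2$ is known and the James–Stein factor stays positive), the two formulas give the identical estimate once $\lambda$ is set to $c/(1-c)$, where $c = (m-2)\sigma^2/\|\mathbf y\|^2$ is the James–Stein shrinkage amount:

```python
import numpy as np

rng = np.random.default_rng(1)
m, sigma2 = 10, 1.0
y = rng.normal(loc=2.0, size=m)            # one observation of N(theta, sigma^2 I)

c = (m - 2) * sigma2 / np.sum(y**2)        # James-Stein shrinkage amount
theta_js = (1 - c) * y

lam = c / (1 - c)                          # the lambda that makes ridge match James-Stein
theta_ridge = y / (1 + lam)                # argmin ||y - theta||^2 + lam * ||theta||^2

print(np.allclose(theta_js, theta_ridge))  # True: same estimator, different parametrization
```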

Connection between James–Stein estimator and random effects models

Let us discuss the mixed/random effects model used in genetics first. The model is $$\mathbf {y}=\mathbf {X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\theta}+\mathbf {e}, \quad \boldsymbol{\theta}\sim N(\mathbf{0},\sigma^2_{\theta} I), \quad \mathbf{e}\sim N(\mathbf{0},\sigma^2 I).$$ If there are no fixed effects and $\mathbf {Z}=I$, the model reduces to $$\mathbf {y}=\boldsymbol{\theta}+\mathbf {e}, \quad \boldsymbol{\theta}\sim N(\mathbf{0},\sigma^2_{\theta} I), \quad \mathbf{e}\sim N(\mathbf{0},\sigma^2 I),$$ which is exactly the setting of the James–Stein estimator, once we add an empirical-Bayes ingredient.
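To make the empirical-Bayes ingredient explicit, here is a sketch of the classical argument. Under this model the posterior mean of $\boldsymbol\theta$ given $\mathbf y$ is $$E[\boldsymbol\theta \mid \mathbf y] = \left(1 - \frac{\sigma^2}{\sigma^2+\sigma^2_{\theta}}\right) \mathbf y,$$ and since marginally $\|\mathbf y\|^2/(\sigma^2+\sigma^2_{\theta}) \sim \chi^2_m$, the quantity $(m-2)\,\sigma^2/\|\mathbf y\|^2$ is an unbiased estimate of the unknown factor $\sigma^2/(\sigma^2+\sigma^2_{\theta})$. Plugging this estimate into the posterior mean yields exactly $\widehat{\boldsymbol\theta}_{JS}$: the James–Stein estimator is the empirical-Bayes estimate in this random-effects model.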

Connection between random effects models and ridge regression

Focusing on the random-effects part of the model above, $$\mathbf {y}=\mathbf {Z}\boldsymbol{\theta}+\mathbf {e}, \quad \boldsymbol{\theta}\sim N(\mathbf{0},\sigma^2_{\theta} I), \quad \mathbf{e}\sim N(\mathbf{0},\sigma^2 I),$$ estimating $\boldsymbol\theta$ is equivalent to solving $$\min_{\boldsymbol{\theta}} \|\mathbf{y}-\mathbf {Z}\boldsymbol{\theta}\|^2 + \lambda\|\boldsymbol{\theta}\|^2$$ with $\lambda=\sigma^2/\sigma_{\theta}^2$. A proof can be found in Chapter 3 of Pattern Recognition and Machine Learning.
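A small numerical check of this equivalence (a sketch with made-up dimensions and variance components): the ridge solution with $\lambda=\sigma^2/\sigma_{\theta}^2$ coincides with the posterior mean (BLUP) of $\boldsymbol\theta$ computed from the marginal covariance of $\mathbf y$:

```python
import numpy as np

rng = np.random.default_rng(2)
m, p = 50, 8
sigma2_theta, sigma2 = 2.0, 1.0              # assumed variance components

Z = rng.normal(size=(m, p))
theta = rng.normal(scale=np.sqrt(sigma2_theta), size=p)
y = Z @ theta + rng.normal(scale=np.sqrt(sigma2), size=m)

# Ridge solution with lambda = sigma^2 / sigma_theta^2
lam = sigma2 / sigma2_theta
theta_ridge = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)

# Posterior mean / BLUP of theta under the random-effects model
V = sigma2_theta * Z @ Z.T + sigma2 * np.eye(m)     # marginal covariance of y
theta_blup = sigma2_theta * Z.T @ np.linalg.solve(V, y)

print(np.allclose(theta_ridge, theta_blup))          # True: ridge = BLUP here
```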

Connection between (multilevel) random effects models and that in genetics

In the random-effects model above, $\mathbf y$ is $m\times 1$ and $\mathbf Z$ is $m \times p$. If we vectorize $\mathbf Z$ into an $(mp)\times 1$ vector and repeat $\mathbf y$ correspondingly, we obtain a hierarchical/clustered structure: $p$ clusters, each with $m$ units. If we regress $\mathrm{vec}(\mathbf Z)$ on the repeated $\mathbf y$, we obtain the random effect of $Z$ on $y$ for each cluster, though this is somewhat like a reverse regression.


Acknowledgement: the first three points were largely learned from these two Chinese articles: 1, 2.
