Solved – What are the assumptions of ridge regression and how to test them

assumptions, regression, ridge regression

Consider the standard model for multiple regression $$Y=X\beta+\varepsilon$$ where $\varepsilon \sim \mathcal N(0, \sigma^2I_n)$, so normality, homoscedasticity and uncorrelatedness of errors all hold.

Suppose that we perform a ridge regression by adding the same small constant $k$ to all the diagonal elements of $X'X$:

$$\beta_\mathrm{ridge}=[X'X+kI]^{-1}X'Y$$

There are values of $k$ for which the ridge estimator has smaller mean squared error than the OLS estimator, even though $\beta_\mathrm{ridge}$ is a biased estimator of $\beta$. In practice, $k$ is chosen by cross-validation.
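To make this concrete, here is a minimal numpy sketch (the simulated data, the candidate grid of $k$ values, and the 5-fold scheme are all invented for illustration) that computes the closed-form ridge estimate $[X'X+kI]^{-1}X'Y$ and picks $k$ by cross-validation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: n observations, p predictors, with two nearly collinear columns.
n, p = 100, 5
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=n)      # induce collinearity
beta_true = np.array([1.0, -1.0, 0.5, 0.0, 2.0])
y = X @ beta_true + rng.normal(scale=1.0, size=n)

def ridge_estimate(X, y, k):
    """Closed-form ridge solution: (X'X + k I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + k * np.eye(X.shape[1]), X.T @ y)

def cv_mse(X, y, k, n_folds=5):
    """Average held-out squared error of the ridge fit for a given k (contiguous folds, i.i.d. data)."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    errs = []
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        b = ridge_estimate(X[train], y[train], k)
        errs.append(np.mean((y[test] - X[test] @ b) ** 2))
    return np.mean(errs)

k_grid = np.logspace(-3, 3, 25)                   # candidate penalties
k_best = min(k_grid, key=lambda k: cv_mse(X, y, k))
print("k chosen by CV:", k_best)
print("ridge coefficients:", ridge_estimate(X, y, k_best))
```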

Here is my question: what are the assumptions underlying the ridge model? To be more concrete,

  1. Are all the assumptions of ordinary least square (OLS) valid with ridge regression?

  2. If yes to question 1, how do we test homoscedasticity and lack of autocorrelation with a biased estimator of $\beta$?

  3. Is there any work on testing these or other OLS assumptions under ridge regression?

Best Answer

What is an assumption of a statistical procedure?

I am not a statistician and so this might be wrong, but I think the word "assumption" is often used quite informally and can refer to various things. To me, an "assumption" is, strictly speaking, something that only a theoretical result (theorem) can have.

When people talk about assumptions of linear regression (see here for an in-depth discussion), they are usually referring to the Gauss-Markov theorem, which says that under the assumptions of uncorrelated, equal-variance, zero-mean errors, the OLS estimate is BLUE, i.e. has minimum variance among all linear unbiased estimators. Outside the context of the Gauss-Markov theorem, it is not clear to me what a "regression assumption" would even mean.

Similarly, the assumptions of, say, a one-sample t-test refer to the conditions under which the $t$-statistic is $t$-distributed and hence the inference is valid. It is not called a "theorem", but it is a clear mathematical result: if the $n$ observations are drawn independently from a normal distribution, then the $t$-statistic will follow Student's $t$-distribution with $n-1$ degrees of freedom.
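As a quick illustration of what such a mathematical guarantee looks like, here is a small simulation sketch (sample size and replication count are arbitrary) comparing the empirical quantiles of the one-sample $t$-statistic on normal data with the Student $t(n-1)$ quantiles:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps = 10, 20000

# t-statistic of each simulated normal sample against the true mean 0.
samples = rng.normal(loc=0.0, scale=1.0, size=(reps, n))
t_stats = samples.mean(axis=1) / (samples.std(axis=1, ddof=1) / np.sqrt(n))

# Compare empirical quantiles with the theoretical t(n-1) quantiles.
qs = [0.025, 0.5, 0.975]
print("empirical :", np.quantile(t_stats, qs))
print("t(n-1)    :", stats.t.ppf(qs, df=n - 1))
```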

Assumptions of penalized regression techniques

Consider now any regularized regression technique: ridge regression, lasso, elastic net, principal components regression, partial least squares regression, etc. The whole point of these methods is to make a biased estimate of the regression parameters, in the hope of reducing the expected loss by exploiting the bias-variance trade-off.

All of these methods include one or several regularization parameters, and none of them has a definite rule for selecting the values of these parameters. The optimal value is usually found via some sort of cross-validation procedure, but there are various methods of cross-validation and they can yield somewhat different results. Moreover, it is not uncommon to invoke additional rules of thumb on top of cross-validation. As a result, the actual outcome $\hat \beta$ of any of these penalized regression methods is not fully determined by the method itself, but can depend on the analyst's choices.
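To see how much those choices can matter, here is a hedged sketch using scikit-learn's RidgeCV (my choice of tool; the data and candidate penalties are invented) that selects the penalty under two different cross-validation schemes, which need not agree:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(2)
n, p = 80, 10
X = rng.normal(size=(n, p))
X[:, 1:4] = X[:, [0]] + 0.05 * rng.normal(size=(n, 3))   # correlated block of predictors
y = X @ rng.normal(size=p) + rng.normal(size=n)

alphas = np.logspace(-3, 3, 50)

# Leave-one-out (generalized) cross-validation, the RidgeCV default.
loo = RidgeCV(alphas=alphas).fit(X, y)
# 5-fold cross-validation on the same data.
kfold = RidgeCV(alphas=alphas, cv=5).fit(X, y)

print("alpha by leave-one-out CV:", loo.alpha_)
print("alpha by 5-fold CV      :", kfold.alpha_)
```

Both are perfectly reasonable cross-validation schemes, yet they will often return different penalties and hence different $\hat \beta$.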

It is therefore not clear to me how there can be any theoretical optimality statement about $\hat \beta$, and so I am not sure that talking about "assumptions" (presence or absence thereof) of penalized methods such as ridge regression makes sense at all.

But what about the mathematical result that ridge regression always beats OLS?

Hoerl & Kennard (1970), in Ridge Regression: Biased Estimation for Nonorthogonal Problems, proved that there always exists a value of the regularization parameter $\lambda$ such that the ridge estimate of $\beta$ has strictly smaller expected loss than the OLS estimate. It is a surprising result (see here for some discussion), but it only proves the existence of such a $\lambda$, which will be dataset-dependent.

This result does not actually require any assumptions and is always true. Still, it would be strange to conclude from it alone that ridge regression "has no assumptions": it guarantees the existence of a good $\lambda$, not a way to find it for your data.
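A small Monte Carlo sketch (design, coefficients and noise level all invented) makes the existence claim tangible: the coefficient MSE of the ridge estimator as a function of $\lambda$ typically dips below the OLS value for some range of $\lambda$, and where that range lies depends on the data-generating setup:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 50, 8, 2.0
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)     # near-collinear columns
beta = rng.normal(size=p)

def mse_of_estimator(lam, reps=2000):
    """Monte Carlo estimate of E||beta_hat - beta||^2 for ridge with penalty lam (lam=0 is OLS)."""
    A = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)   # maps y to coefficient estimates
    errs = []
    for _ in range(reps):
        y = X @ beta + rng.normal(scale=sigma, size=n)
        errs.append(np.sum((A @ y - beta) ** 2))
    return np.mean(errs)

print("OLS   MSE:", mse_of_estimator(0.0))
for lam in [0.1, 1.0, 10.0, 100.0]:
    print(f"ridge MSE (lambda={lam}):", mse_of_estimator(lam))
```

Typically some of the printed ridge values fall below the OLS value, but which $\lambda$ wins changes if you change $\beta$, $\sigma$, or the design.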

Okay, but how do I know if I can apply ridge regression or not?

I would say that even if we cannot talk of assumptions, we can talk about rules of thumb. It is well known that ridge regression tends to be most useful in multiple regression with correlated predictors, where it tends to outperform OLS, often by a large margin. It will tend to outperform it even in the presence of heteroscedasticity, correlated errors, or whatever else. So the simple rule of thumb says: if you have multicollinear data, ridge regression with cross-validation is a good idea.
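As one concrete way to apply this rule of thumb, the sketch below checks how collinear a design matrix is via its condition number and pairwise correlations; the thresholds mentioned in the comment are rough conventions of mine, not part of the answer:

```python
import numpy as np

def collinearity_report(X):
    """Condition number and largest pairwise correlation of the standardized design matrix."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    cond = np.linalg.cond(Z)
    corr = np.corrcoef(Z, rowvar=False)
    np.fill_diagonal(corr, 0.0)
    return cond, np.abs(corr).max()

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 6))
X[:, 5] = X[:, 0] + 0.02 * rng.normal(size=200)   # strongly collinear pair

cond, max_corr = collinearity_report(X)
print("condition number:", cond, "max |pairwise correlation|:", max_corr)
# A large condition number (say, in the hundreds or more) or near-unit correlations
# suggest multicollinearity, i.e. the setting where ridge + cross-validation is the usual advice.
```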

There are probably other useful rules of thumb and tricks of the trade (such as what to do with gross outliers). But they are not assumptions.

Note that for OLS regression one needs some assumptions for the $p$-values to be valid. In contrast, it is tricky to obtain $p$-values in ridge regression at all. If this is done, it is done by bootstrapping or some similar approach, and again it would be hard to point to specific assumptions here because there are no mathematical guarantees.
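For completeness, here is a hedged sketch of the kind of bootstrap procedure alluded to here: resample observations, refit ridge with a fixed penalty (chosen however you like), and report percentile intervals for the coefficients. None of this carries the distributional guarantees behind OLS $p$-values.

```python
import numpy as np

def ridge_estimate(X, y, k):
    """Closed-form ridge solution: (X'X + k I)^{-1} X'y (same formula as above)."""
    return np.linalg.solve(X.T @ X + k * np.eye(X.shape[1]), X.T @ y)

def bootstrap_ridge_ci(X, y, k, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap intervals for ridge coefficients, resampling cases with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y)
    boots = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        boots[b] = ridge_estimate(X[idx], y[idx], k)
    lo, hi = np.quantile(boots, [(1 - level) / 2, 1 - (1 - level) / 2], axis=0)
    return lo, hi

# Example with simulated data (illustrative only).
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 0.0, -0.5, 2.0]) + rng.normal(size=100)
lo, hi = bootstrap_ridge_ci(X, y, k=1.0)
print(np.column_stack([lo, hi]))   # one (lower, upper) interval per coefficient
```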