Too long for a comment: OLS regression of x on y vs. y on x minimizes horizontal vs. vertical differences, respectively. A very good treatment of this, with additional references, can be found here: *What is the difference between linear regression on y with x and x with y?*
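To see the two regressions disagree in practice, here is a quick numeric sketch on simulated data (NumPy; the data-generating model is my own illustration, not from the linked question):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)  # noisy linear relation

# Slope of OLS regression of y on x (minimizes vertical differences):
b_yx = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Slope, in the (x, y) plane, of the line from regressing x on y
# (minimizes horizontal differences): invert the x-on-y slope.
b_xy = np.var(y, ddof=1) / np.cov(x, y)[0, 1]

print(b_yx, b_xy)  # the two fitted lines differ unless correlation is exactly ±1
```

The ratio of the two slopes is $1/r^2$, so the lines coincide only under perfect correlation.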

What I meant there is that a choice between untransformed vs. transformed data, and of the regression type, should be made prior to fitting the model. *E.g.* if the data are skewed (they were indeed) and do not satisfy the assumptions of OLS, this model should not be used, unless a certain transformation improves the data distribution. So a legitimate approach would be to understand the structure of your data, select the most appropriate transformation, then choose and justify the appropriate model(s), rather than trying all models and picking whichever gives a better (or, in their case, less bad :) result.

Although I don't think it is stated directly, I believe they used actual survival as x and back-predicted survival as y.

### What is an *assumption* of a statistical procedure?

I am not a statistician and so this might be wrong, but I think the word "assumption" is often used quite informally and can refer to various things. To me, an "assumption" is, strictly speaking, something that only a theoretical result (theorem) can have.

When people talk about assumptions of linear regression (see here for an in-depth discussion), they are usually referring to the Gauss-Markov theorem, which says that *under the assumptions* of uncorrelated, equal-variance, zero-mean errors, the OLS estimate is BLUE, i.e. unbiased and of minimum variance among all linear unbiased estimators. Outside the context of the Gauss-Markov theorem, it is not clear to me what a "regression assumption" would even mean.

Similarly, the assumptions of, say, a one-sample $t$-test refer to the conditions under which the $t$-statistic is $t$-distributed and hence the inference is valid. It is not called a "theorem", but it is a clear mathematical result: if the $n$ samples are i.i.d. normally distributed, then the $t$-statistic will follow Student's $t$-distribution with $n-1$ degrees of freedom.
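This claim is easy to check by simulation. A minimal sketch (NumPy + SciPy; the sample size and replication count are arbitrary choices of mine):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps = 10, 20000

# Draw many normal samples of size n and compute the one-sample t-statistic
samples = rng.normal(size=(reps, n))
t = samples.mean(axis=1) / (samples.std(axis=1, ddof=1) / np.sqrt(n))

# Under normality, t follows Student's t with n-1 df, so the two-sided
# rejection rate at the 5% critical value should be close to 0.05
crit = stats.t.ppf(0.975, df=n - 1)
frac = np.mean(np.abs(t) > crit)
print(frac)
```

If the samples were drawn from a strongly skewed distribution instead, the rejection rate would drift away from the nominal 5% — that is exactly what "violating the assumption" means here.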

### Assumptions of penalized regression techniques

Consider now any regularized regression technique: ridge regression, lasso, elastic net, principal components regression, partial least squares regression, etc. The whole point of these methods is to make a *biased* estimate of the regression parameters, in the hope of reducing the expected loss by exploiting the bias-variance trade-off.

All of these methods include one or several regularization parameters, and none of them has a definite rule for selecting the values of these parameters. The optimal value is usually found via some sort of cross-validation procedure, but there are various methods of cross-validation and they can yield somewhat different results. Moreover, it is not uncommon to invoke additional rules of thumb on top of cross-validation. As a result, the actual outcome $\hat \beta$ of any of these penalized regression methods is not fully defined by the method itself, but can depend on the analyst's choices.
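The analyst-dependence is easy to demonstrate: two perfectly reasonable cross-validation schemes can select different $\lambda$ and hence different $\hat \beta$. A sketch using scikit-learn (the dataset and CV splits are my own illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=0)
alphas = np.logspace(-3, 3, 50)

# Two defensible cross-validation schemes; neither is "the" correct one,
# and they need not agree on the selected regularization strength.
fit5 = RidgeCV(alphas=alphas, cv=KFold(5, shuffle=True, random_state=1)).fit(X, y)
fit10 = RidgeCV(alphas=alphas, cv=KFold(10, shuffle=True, random_state=2)).fit(X, y)

print(fit5.alpha_, fit10.alpha_)  # not guaranteed to coincide
```

Whichever scheme the analyst happens to pick determines $\hat \beta$ — which is the sense in which the estimator is not fully defined by the method alone.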

It is therefore not clear to me how there can be any theoretical optimality statement about $\hat \beta$, and so I am not sure that talking about "assumptions" (presence or absence thereof) of penalized methods such as ridge regression makes sense at all.

### But what about the mathematical result that ridge regression always beats OLS?

Hoerl & Kennard (1970), in *Ridge Regression: Biased Estimation for Nonorthogonal Problems*, proved that there *always* exists a value of the regularization parameter $\lambda$ such that the ridge regression estimate of $\beta$ has a strictly smaller expected loss than the OLS estimate. It is a surprising result (see here for some discussion), but it only proves the *existence* of such a $\lambda$, which will be dataset-dependent.

This result does not actually require any assumptions and is always true; even so, it would be strange to conclude from it that ridge regression does not have any assumptions.
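The Hoerl & Kennard phenomenon can be illustrated by Monte Carlo: for a fixed design and true $\beta$, estimate the expected loss $E\|\hat\beta_\lambda - \beta\|^2$ at a few values of $\lambda$, with $\lambda = 0$ being OLS (the simulation setup below is my own, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 30, 5, 2.0
beta = rng.normal(size=p)
X = rng.normal(size=(n, p))  # fixed design across replications
lambdas = [0.0, 0.1, 1.0, 10.0]

# Monte-Carlo estimate of E||beta_hat - beta||^2 for ridge at several lambdas;
# lambda = 0 corresponds to OLS. Hoerl & Kennard's theorem says some
# lambda > 0 must beat lambda = 0 in expected loss.
reps = 2000
losses = {lam: 0.0 for lam in lambdas}
for _ in range(reps):
    y = X @ beta + sigma * rng.normal(size=n)
    for lam in lambdas:
        bh = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
        losses[lam] += np.sum((bh - beta) ** 2) / reps

print({lam: round(v, 3) for lam, v in losses.items()})
```

On this simulated dataset, some positive $\lambda$ comes out below the $\lambda = 0$ loss — but which $\lambda$ that is depends on the data, which is exactly the practical catch.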

### Okay, but how do I know if I can apply ridge regression or not?

I would say that even if we cannot talk of assumptions, we can talk about *rules of thumb*. It is well known that ridge regression tends to be most useful in the case of multiple regression with correlated predictors, where it tends to outperform OLS, often by a large margin. It will tend to outperform it even in the presence of heteroscedasticity, correlated errors, or whatever else. So the simple rule of thumb says that if you have multicollinear data, ridge regression with cross-validation is a good idea.
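Here is a sketch of the multicollinear setting with scikit-learn (eight noisy copies of one underlying signal — an illustrative construction of mine, not a real dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 80
z = rng.normal(size=n)
# Eight strongly correlated predictors: all near-copies of the same signal
X = np.column_stack([z + 0.05 * rng.normal(size=n) for _ in range(8)])
y = z + rng.normal(size=n)

ols = LinearRegression().fit(X, y)
ridge = RidgeCV(alphas=np.logspace(-3, 3, 30)).fit(X, y)

# Ridge shrinks the wildly unstable OLS coefficients toward zero...
print(np.linalg.norm(ols.coef_), np.linalg.norm(ridge.coef_))

# ...and typically predicts better out of sample on such data
cv_ols = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_mean_squared_error").mean()
cv_ridge = cross_val_score(RidgeCV(alphas=np.logspace(-3, 3, 30)), X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
print(cv_ols, cv_ridge)
```

Note that nothing here is an "assumption" in the theorem sense — it is a heuristic about when the bias-variance trade-off is likely to pay off.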

There are probably other useful rules of thumb and tricks of the trade (such as, e.g., what to do with gross outliers). But they are not assumptions.

Note that for OLS regression one needs some assumptions for $p$-values to hold. In contrast, it is tricky to obtain $p$-values in ridge regression. If this is done at all, it is done by bootstrapping or some similar approach and again it would be hard to point at specific assumptions here because there are no mathematical guarantees.

## Best Answer

$R^2=1-\frac{SSE}{SST}$, where $SSE$ is the sum of squared errors (residuals, i.e. deviations from the regression line) and $SST$ is the sum of squared deviations from the mean of the dependent variable $Y$.

$MSE=\frac{SSE}{n-m}$, where $n$ is the sample size and $m$ is the number of parameters in the model (including intercept, if any).

$R^2$ is a standardized measure of degree of predictedness, or fit, in the sample. $MSE$ is the estimate of the variance of residuals, or non-fit, in the population. The two measures are clearly related, as seen in the most usual formula for adjusted $R^2$ (the estimate of $R^2$ for the population):

$$R_{adj}^2=1-(1-R^2)\frac{n-1}{n-m}=1-\frac{SSE/(n-m)}{SST/(n-1)}=1-\frac{MSE}{\sigma_y^2}.$$
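The identities above can be verified numerically — the two expressions for adjusted $R^2$ must agree exactly (a NumPy sketch on simulated data of my own construction):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 3  # sample size and number of parameters (including intercept)
x = rng.normal(size=(n, m - 1))
y = 1.0 + x @ np.array([2.0, -1.0]) + rng.normal(size=n)

# OLS fit with intercept; form SSE and SST as defined above
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
SSE = np.sum(resid**2)
SST = np.sum((y - y.mean()) ** 2)

R2 = 1 - SSE / SST
MSE = SSE / (n - m)          # estimate of residual variance
var_y = SST / (n - 1)        # estimate of Var(Y)
adj1 = 1 - (1 - R2) * (n - 1) / (n - m)
adj2 = 1 - MSE / var_y

print(R2, adj1, adj2)  # adj1 and adj2 agree up to floating-point error
```

This makes the relation concrete: adjusted $R^2$ is just one minus the ratio of the two variance estimates, $MSE/\hat\sigma_y^2$.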