Solved – Optimal Alpha value in Ridge Regression

regressionridge regression

I've tried searching for answers on this site, but I've not found a clear answer.
I have a dataset with around 9471 observations and 10 attributes. I wish to fit a Ridge Regression model, and figure out what the optimal Alpha I should finally end up using in the model. I could try a range of values from 0 to 10.

What metric should I use when I compare models of different alpha values? How do I know which is better, i.e which model has a more optimal alpha? How should I go about finding this optimal alpha?

Best Answer

As Richard mentions, you can use cross validation. Another option which does not require that you do a K-fold cross-validation is generalized cross validation. See e.g., the smooth.spline function in R or the mgcv package in R (and the book by Simon Wood). Particularly, see the H argument in the gam function in the mgcv package.

Related Solutions

Solved – Grid fineness and overfitting when tuning $\lambda$ in LASSO, ridge, elastic net

Can I say something about the propensity to overfit in (A) versus (B)?

Provided that both grids cover a sufficient range, grid fineness doesn't really have anything to do with overfitting in this problem (though a coarse grid might underfit if it skips over a profitable interval). It's not as if testing too many values will somehow change what out-of-sample looks like.* In the case of these penalized regressions, we definitely want to optimize our penalized likelihood function for values $\lambda$, and it doesn't matter how many values of $\lambda$ we test, because out-of-sample performance for a fixed data set and fixed partitioning is entirely deterministic. More to the point, the out-of-sample metric is not at all altered by how many values $\lambda$ you test. A coarser grid might mean that you skip over the absolute minimum in your out-of-sample metric, but finding the absolute minimum probably isn't desirable in the first place because hyperparameters tend to be poorly estimated, and finite sample properties mean that data limitations will be a source noise in that estimate that will overwhelm slight changes in the distance between adjacent grid points: the standard error of your estimate will tend to swamp differences in grid fineness.

If you're truely concerned that out-of-sample performance metric might be overly optimistic, you could adopt the 1 standard error rule, which picks the most regularized model within 1 standard error of the minimum. That way, you're being slightly more conservative and picking a less complex model.

Can I determine the optimal grid fineness? How?

The LARS algorithm does not a priori define which values of $\lambda$ to check; rather, $\lambda$ is changed continuously and the algorithm checks for values of $\lambda$ for which a coefficient goes from 0 to a nonzero value. Those values of $\lambda$ where a new coefficient is nonzero are retained, with the observation that coefficient paths are piecewise linear in the case of the lasso, so there's no loss of information by just storing off the knots in that case. LARS only works when coefficient paths are piecewise linear, though. The ridge penalty never shrinks a coefficient to precisely zero, so all of your coefficient paths are smooth and always nonzero; likewise elastic net regressions (excluding the case of elastic net regressions which are also lasso regressions).

But most people use GLMNET because it's often faster. In terms of determining what grid of $\lambda$ to search over, I recommend reading the GLMNET article "Regularization Paths for Generalized Linear Models via Coordinate Descent" by Jerome Friedman, Trevor Hastie, and Rob Tibshirani. In it, they develop a very efficient algorithm for estimating ridge, lasso and elastic net regressions. The algorithm checks for a value of $\lambda_\text{max}$ for which $\beta$ is the zero vector, and then identifies a minimum value $\lambda_\text{min}$ relative to $\lambda_\text{max}$. Finally, they generate a sequence of values between the two uniformly on the log scale. This grid is sufficient for most purposes, though it does omit the property that you will know precisely when a coefficient is estimated at a nonzero value. Warm starts are used to provide solutions much more quickly, and it supports many common GLMs.

*You might be thinking about this from the perspective of an artificial neural network, where early stopping is sometimes used to accomplish regularization, but that's an entirely unrelated problem (namely, that the optimization algorithm is prevented from reaching an optimum, so the model is forced to be less complex).

Solved – Naive Ridge Regression in R

First, very simply, I don't think your call to solve looks right, this is what I would expect

solve(t(X) %*% X + lambda.diag, t(X) %*% y)

Your code seems to be explicitly calculating a matrix inverse and then multiplying. This is mathematically correct, but computationally incorrect. It is always better to solve the system of equations. I've gotten in the habit of reading equations like $y = X^{-1}z$ as "solve the system of equations $Xy = z$ for $y$."

On a more mathematical note, you should not have to include an intercept term when fitting a ridge regression.

It is very important, when applying penalized methods, to standardize your data (as you point out with your comment about scale. It's also important to realize that penalties are not generally applied to the intercept term, as this would cause the model to violate the attractive property that the average predictions equal the average response (on the training data).

Together, these facts (centered data, no intercept penalty) imply that the intercept parameter estimate in a ridge regression is known a priori, it is zero.

The coefficient vector from ridge regression is the solution to the penalized optimization problem

$$ \beta = argmin \left( (y - X\beta)^t (y - X\beta) + \frac{1}{2}\sum_{j > 0} \beta_j^2 \right) $$

Taking a partial with respect to the intercept parameter

$$ \frac{\partial L}{\partial \beta_0} = \sum_{i=1}^{n} \left( y - \sum_{j=0}^q \beta_j x_{ij} \right) x_{i0} $$

But $x_{0i}$ are the entries in the model matrix corresponding to the intercept, so $x_{0i} = 1$ always. So we get

$$\sum_{i=1} y_i + \sum_{j=0}^q \beta_j \sum_{i=1}^n x_{ij} $$

The first term, with the sum over y, is zero because $y$ is centered (or not, a good check of understanding is to work out what happens if you don't center y). In the second term, each predictor is centered, so the sum over $i$ is zero for every predictor $j$ except the intercept. For the intercept, the second term $i$ sum comes out to $n$ (it's $1 + 1 + 1 + \cdots$). So this whole thing reduces to

$$ n \beta_0 $$

Setting this partial equal to zero, $n\beta_0 = 0$, we recover $\beta_0 = 0$, as expected.

So, you do not need to bind on an intercept term to your model matrix. Your function should either expect standardized data (and if you plan on making it public, it should check this is so), or standardize the data itself. Once this is done, the intercept is known to be zero. I'll leave it as an exersize to work out what the intercept should be when you translate the coefficients back to the un-normalized scale.

I still don't get results equivalent to lm.ridge though, but it might just be a question of translating the formula back into the original scales. However, I can't seem to work out how to do this. I thought it would just entail multiplying by the standard deviation of the response and adding the mean, as usual for standard scores, but either my function is still wrong or rescaling is more complex than I realize.

It's a bit more complicated, but not too bad if you are careful. Here's a place where I answered a very similar question:

GLMnet - “Unstandardizing” Linear Regression Coefficients

You may have to make a very simple change if you are not standardizing $y$.

Best Answer

Related Solutions

Solved – Grid fineness and overfitting when tuning $\lambda$ in LASSO, ridge, elastic net

Solved – Naive Ridge Regression in R

Related Question