Solved – Why Ridge regression increases rather than decreases the model's error

lasso, regression, regularization, ridge regression

I am trying to improve a linear regression with ridge regularization using the glmnet function. The problem is that, instead of decreasing after applying the ridge method, the error increases. This is strange, because I thought ridge regularization was meant to improve the fitted model and therefore reduce the prediction error! How do you explain this, please?

I share my code below so you can see exactly what I do.

First, this is the data set:

> str(DATABASE)
'data.frame':   1667 obs. of  28 variables:
 $ YEAR_SALES            : num  2 1 2 2 1 1 1 1 1 1 ...
 $ MONTH_SALES           : num  9 9 2 9 3 3 11 12 3 6 ...
 $ DAY_SALES             : num  13 3 10 23 12 10 26 4 18 9 ...
 $ HOURS_INS             : num  17 14 18 16 23 18 16 12 17 16 ...
 $ CREATION_YEAR_SALES   : num  2 1 2 2 2 1 1 2 1 1 ...
 $ CREATION_MONTH_SALES  : num  9 9 2 10 12 3 11 2 3 6 ...
 $ CREATION_DAY_SALES    : num  13 11 15 31 5 10 27 7 18 9 ...
 $ VALIDATION_YEAR_SALES : num  2 1 2 2 2 1 1 2 1 1 ...
 $ VALIDATION_MONTH_SALES: num  9 9 2 11 12 3 12 2 3 6 ...
 $ VALIDATION_DAY_SALES  : num  15 14 16 3 6 19 1 8 21 10 ...
 $ AGE_CUSTUMER          : num  32 37 23 32 44 33 29 30 56 48 ...
 $ MEAN_Sales            : num  0 71 50 0 0 83 0 25 23 35 ...
 $ NBR_GIFTS             : num  1 1 1 1 1 1 1 1 4 3 ...
 $ TYPE_PEAU             : num  2 3 4 2 2 3 2 2 2 2 ...
 $ SENSIBILITE           : num  3 3 3 2 1 3 3 2 2 2 ...
 $ IMPERFECTIONS         : num  2 3 2 1 3 2 2 1 2 1 ...
 $ BRILLANCE             : num  3 1 1 3 3 3 3 3 3 3 ...
 $ GRAIN_PEAU            : num  3 3 3 3 1 3 1 1 1 3 ...
 $ RIDES_VISAGE          : num  1 1 1 3 3 3 3 1 3 1 ...
 $ ALLERGIES             : num  1 1 1 1 1 1 1 1 1 1 ...
 $ MAINS                 : num  2 3 3 3 2 2 2 2 2 2 ...
 $ PEAU_CORPS            : num  1 2 2 1 1 1 1 1 1 1 ...
 $ INTERET_ALIM_NATURELLE: num  1 3 3 1 3 1 1 1 3 1 ...
 $ INTERET_ORIGINE_GEO   : num  1 2 1 1 3 1 3 1 1 3 ...
 $ INTERET_VACANCES      : num  2 3 1 2 1 2 1 1 2 3 ...
 $ INTERET_ENVIRONNEMENT : num  1 3 3 3 3 1 1 1 1 1 ...
 $ INTERET_COMPOSITION   : num  1 1 1 3 3 1 1 1 1 1 ...
 $ OUTCOME               : num  3 4 7 3 3 6 3 9 26 17 ...

Then I split the data and fit the linear regression model:

> set.seed(123)
> smp_size <- floor(0.75 * nrow(DATABASE))
> train_ind <- sample(seq_len(nrow(DATABASE)),size =smp_size)
> 
> train <- DATABASE[train_ind, ]
> test <- DATABASE[-train_ind, ]
> reg<-lm(OUTCOME~.-1,data=train)

Finally, I compute the prediction error:

> y.test<-test$OUTCOME
> NBR_Achat=predict(reg,newdata=test)
> round(sqrt(mean(((1-NBR_Achat/y.test)^2))),4)
[1] 0.4523

The above code covers the plain linear regression case. Now let's see what ridge regression gives:

> library(glmnet)
> y <- train$OUTCOME
> x <- as.matrix(train[,1:27])
> lambdas <- 10^seq(3, -2, by = -.1)
> fit <- glmnet(x, y, alpha = 0, lambda = lambdas)

To get the best model I use cross-validation. It seems that the best lambda equals 0.1:

> cv_fit <- cv.glmnet(x,y,alpha = 0,lambda=lambdas)
> plot(cv_fit)
> opt_lambda <- cv_fit$lambda.min
> opt_lambda
[1] 0.1

And finally, the prediction error is computed by the following code:

> x<-as.matrix(test[,1:27])
> y_predicted <- predict(cv_fit,s = opt_lambda,newx=x)
> y.test<-test$OUTCOME
> round(sqrt(mean(((1-y_predicted/y.test)^2))),4)
[1] 0.4605

What do you think about this?

Best Answer

In general, ridge regression won't necessarily improve the prediction error. Recall that the goal of regularization is to make a simpler model in order to avoid overfitting and thus predict better on an independent set. However, if overfitting is not a problem (for example, when there are many more samples than features), a more complex (less regularized) model might predict better. Models often predict better when they are more complex, not less, which is why things like neural networks, random forests and kernel methods exist.
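
To illustrate the point, here is a minimal R sketch on simulated data (not the asker's DATABASE): with many more observations than predictors and a genuinely linear signal, the cross-validated ridge fit and plain least squares predict about equally well, while forcing a large penalty clearly hurts.

library(glmnet)
set.seed(123)

# Purely illustrative data: n >> p and a truly linear signal
n <- 1500; p <- 25
X <- matrix(rnorm(n * p), n, p)
beta <- rnorm(p)
y <- as.vector(X %*% beta + rnorm(n))

train_ind <- sample(n, 0.75 * n)
Xtr <- X[train_ind, ]; ytr <- y[train_ind]
Xte <- X[-train_ind, ]; yte <- y[-train_ind]

# Ordinary least squares
ols <- lm(ytr ~ Xtr)
pred_ols <- cbind(1, Xte) %*% coef(ols)

# Ridge with lambda chosen by cross-validation
cv_ridge <- cv.glmnet(Xtr, ytr, alpha = 0)
pred_ridge <- predict(cv_ridge, newx = Xte, s = "lambda.min")

# Test RMSE: OLS and CV-tuned ridge are essentially tied here,
# while an over-regularized fit (s = 10) predicts worse
rmse <- function(truth, pred) sqrt(mean((truth - pred)^2))
rmse(yte, pred_ols)
rmse(yte, pred_ridge)
rmse(yte, predict(cv_ridge, newx = Xte, s = 10))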

The traditional way to improve prediction is to look carefully at your data and think about what assumptions your model makes. Linear regression assumes that all your variables have a linear effect and that there are no interactions between variables. So if some variable has a U-shaped effect on the outcome, or if variable A behaves differently for males and females, your model won't predict as well as it could.
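
Both assumptions can be relaxed inside lm() itself. The sketch below uses a few of the asker's columns purely to show the formula syntax; whether these particular terms are sensible for this data is an assumption that would have to be checked.

# Sketch: relax the linearity / no-interaction assumptions in lm()
reg2 <- lm(OUTCOME ~ . - AGE_CUSTUMER +     # drop the purely linear age term
             poly(AGE_CUSTUMER, 2) +        # allow a U-shaped effect of age
             MEAN_Sales:NBR_GIFTS,          # allow an interaction
           data = train)

pred2 <- predict(reg2, newdata = test)
round(sqrt(mean((1 - pred2 / test$OUTCOME)^2)), 4)  # same error metric as above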