Solved – Why is least squares performing as well as ridge regression when there is multicollinearity

Tags: least squares, MATLAB, multicollinearity, ridge regression

I am learning about ridge regression, so I am implementing it in MATLAB as practice. However, I am having trouble finding a data set on which ridge regression performs better than ordinary least squares.

Reading up, I've found that collinear data often benefits from regularization. However, when I implemented this in the code below, least squares performs just as well as ridge regression (the best lambda is on the order of 1e-10, i.e. almost no regularization at all!). MATLAB tells me that X is rank deficient (rank = 2) when I use the built-in function for least squares, yet it still performs well.

Does anyone know why it behaves this way? Is my data perhaps not collinear enough to show a real performance difference, or have I misunderstood something?

% Generate data: columns 2 and 3 are exact linear functions of column 1,
% so X is rank deficient (rank = 2).
clear;
Nt = 100;
X(:,1) = randn(Nt,1);
X(:,2) =  2*X(:,1) + 6;
X(:,3) = 12*X(:,2) + 16;
p = [0.74, 3, 4.5];
y = X*p' + randn(Nt,1);

% Least squares: backslash warns about the rank deficiency and returns a
% basic solution; pinv would return the minimum-norm one.
pLS = X\y;
%pLS = pinv(X'*X)*(X'*y);
nmseN = sum((X*pLS-y).^2)/length(y)/var(y);   % normalized MSE on the training data

% Tikhonov/ridge: grid search over lambda; NMSE is computed on the same
% data used for fitting.
lspace     = logspace(-10,-1,1000);
bestNMSE   = inf;
bestLambda = -1;
I = eye(size(X, 2));   % since I is the identity, I'*I is just I
for k = 1:length(lspace)
  prLS = pinv(X'*X + lspace(k)*I)*(X'*y);

  nmse = sum((X*prLS-y).^2)/length(y)/var(y);
  if nmse < bestNMSE
    bestNMSE   = nmse;
    bestLambda = lspace(k);
  end
end
prLS  = pinv(X'*X + bestLambda*I)*(X'*y);
nmseR = sum((X*prLS-y).^2)/length(y)/var(y);   % training NMSE again
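
For reference, the dependence can be checked directly; here the columns satisfy 88*X(:,2) = 32*X(:,1) + 6*X(:,3) exactly:

% Collinearity diagnostics, run after the script above:
rank(X)       % 2, matching the rank-deficiency warning from backslash
cond(X'*X)    % effectively infinite; the hallmark of severe multicollinearity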

Best Answer

To echo Cardinal's comment: OLS will always describe a given data set at least as well as any other linear method, because minimizing the in-sample squared error is, by definition, what OLS does. Your grid search scores each fit on the same data it was trained on, so it can never prefer ridge. The point of regularized regression is to improve prediction accuracy on new data. For an example of how regularization (and other techniques) can improve predictive accuracy, have a look at "An Introduction to Statistical Learning with Applications in R". In Chapter 6, where regularized regression is introduced, they use the Hitters data set to show various models that predict better than ordinary regression; labs 1 and 2 of that chapter work through the methods.
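
To make that concrete in your own setup, here is a quick sketch along those lines: score both estimators on a held-out half of the data, and perturb the collinear columns slightly so the training and test rows do not share one exact 2-D column space. The noise level, split, and lambda grid below are arbitrary choices.

% Sketch: the question's experiment, but scored on held-out data.
clear;
Nt = 100;
X(:,1) = randn(Nt,1);
X(:,2) =  2*X(:,1) + 6 + 0.01*randn(Nt,1);   % nearly, not exactly, collinear
X(:,3) = 12*X(:,2) + 16 + 0.01*randn(Nt,1);
p = [0.74, 3, 4.5];
y = X*p' + randn(Nt,1);

% 50/50 train/test split.
Xtr = X(1:Nt/2,:);      ytr = y(1:Nt/2);
Xte = X(Nt/2+1:end,:);  yte = y(Nt/2+1:end);

% OLS: fit on the training half, score on the test half.
pLS    = Xtr\ytr;
nmseLS = sum((Xte*pLS - yte).^2)/length(yte)/var(yte);

% Ridge over a lambda grid, also scored on the test half.
I        = eye(size(X,2));
lspace   = logspace(-6, 4, 200);   % wider grid: the best lambda need not be tiny
bestNMSE = inf;
for k = 1:length(lspace)
  prLS = (Xtr'*Xtr + lspace(k)*I) \ (Xtr'*ytr);
  nmse = sum((Xte*prLS - yte).^2)/length(yte)/var(yte);
  bestNMSE = min(bestNMSE, nmse);
end
[nmseLS, bestNMSE]   % ridge's best test NMSE is typically the smaller of the two

Picking lambda by its test error is optimistic (in practice you would cross-validate within the training half), but it is enough to show that, out of sample, the ridge path contains estimators that predict better than OLS; on the training data itself, OLS will always win.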