Solved – Preventing overfitting with Least Squares Linear Regression via QR decomposition

least squares, regression, regularization

I am trying to solve a linear regression problem in an automated fashion, but I am running into extremely large weights.

I have several thousand datasets and run a linear regression on each of them using the Apache Commons Math OLSMultipleLinearRegression class. In 90% of cases I get good results, but in the remaining 10% there appears to be overfitting, and in 0.1% that overfitting is horrendous (i.e. weights on the order of 10^30). When fitting via gradient descent I can add regularisation to deal with these issues; is there a similar method when solving via QR decomposition?

Currently my best idea is to run QR decomposition, then if the weights are too high re-run with gradient descent. Is there a better way?

Best Answer

Your specific application isn't described, so I can only point out a few things to check that could reduce some of your linear regression issues.

Test for collinearity or high correlation between your X (independent) variables. The closer your predictors are to being linearly independent, the more stable the regression will be; nearly collinear columns are a common cause of very large weights.
Standardize your data. Before running the regression, normalize each variable by subtracting its mean and dividing by its standard deviation. When your variables have vastly different scales, this at least makes the calculations more numerically stable.
Penalize large coefficients using LASSO or ridge regression. This involves adding a penalty term on the size of the weights to your loss function (and therefore to the gradient, if you use gradient descent). Ridge in particular keeps a closed-form least squares solution, so it does not require gradient descent. See the linked Generalized Linear Models page for more information. I realize the linked code is in Python, but the methodologies are described well.
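
To connect the last two points back to the QR question: the ridge problem min ||y - Xw||^2 + lam * ||w||^2 is equivalent to an ordinary least squares problem on augmented data, where sqrt(lam) times the identity matrix is stacked below X and zeros are appended to y. Any QR-based least squares solver can then be used unchanged. The following is a minimal sketch of that idea in Python with NumPy, not the asker's Java setup; the function names `standardize` and `ridge_via_qr`, the parameter `lam`, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0.0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    return (X - mean) / std

def ridge_via_qr(X, y, lam):
    """Solve min ||y - Xw||^2 + lam * ||w||^2 using a QR decomposition.

    Stacking sqrt(lam) * I below X and zeros below y turns the penalized
    problem into an ordinary least squares problem on the augmented data.
    """
    n_features = X.shape[1]
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(n_features)])
    y_aug = np.concatenate([y, np.zeros(n_features)])
    Q, R = np.linalg.qr(X_aug)  # reduced QR: R is n_features x n_features
    return np.linalg.solve(R, Q.T @ y_aug)

# Hypothetical toy data with two nearly collinear predictors.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 1e-8 * rng.normal(size=50)
X = standardize(np.column_stack([x1, x2]))
y = x1 + 0.1 * rng.normal(size=50)
y = y - y.mean()  # centering y removes the need for an intercept column

w_ridge = ridge_via_qr(X, y, lam=1.0)
print(w_ridge)  # both weights stay small despite the near-collinearity
```

The same augmented arrays could in principle be passed to an existing QR-based OLS routine fitted without an intercept, since the extra rows would otherwise distort the intercept estimate. The value of `lam` here is arbitrary; in practice it would be chosen by something like cross-validation.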
