Solved – Ridge Regression with Gradient Descent Converges to OLS estimates

numpy, python, regression, regularization, ridge regression

I'm implementing a homespun version of Ridge Regression with gradient descent, and to my surprise it always converges to the same coefficients as OLS, rather than to the closed-form Ridge Regression solution.

This is true regardless of what size alpha I'm using. I don't know if this is due to something wrong with how I'm setting up the problem, or something about how gradient descent works on different types of data.

The dataset I'm using is the Boston housing dataset, which is very small (506 rows) and has a number of correlated variables.

BASIC INTUITION:

I'd love to be corrected if my priors are inaccurate, but this is how I frame the problem.

The cost function for OLS regression:

$$J(w) = \frac{1}{2}\sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)^2$$

Where the output $\hat{y}^{(i)}$ is the dot product of the feature matrix and the weights for each column, plus the intercept.

The cost function for regression with L2 regularization (ie, Ridge Regression):

$$J(w) = \frac{1}{2}\sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)^2 + \alpha\sum_{j=1}^{p} w_j^2$$

Where alpha is the tuning parameter and the $w_j$ are the regression coefficients, squared and summed together.

The derivative of the above statement can be written like this:

$$\left(y - \hat{y}\right) + \alpha\sum_{j=1}^{p} w_j$$

We then take the result of this expression and multiply it by the learning rate, and adjust each weight appropriately.
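
In other words, each pass updates every weight along the lines of

$$w \;\leftarrow\; w + \eta \cdot \text{gradient},$$

where "gradient" is the expression above and $\eta$ is the learning rate.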

To test this out, I made my own Python class for Ridge Regression according to the above, and it looks like this:

RIDGE REGRESSION CODE:

import numpy as np

class RidgeRegression():
    def __init__(self, alpha=1, eta=.0001, random_state=None, n_iter=10000):
        self.eta            = eta
        self.random_state   = random_state
        self.n_iter         = n_iter
        self.alpha          = alpha
        self.w_             = []

    def output(self, X):
        return X.dot(self.w_[1:]) + self.w_[0]

    def fit(self, X, y):
        rgen                   = np.random.RandomState(self.random_state)  # random number generator
        self.w_                = rgen.normal(0, .01, size= 1 + X.shape[1]) # fill weights w/ near 0 values
        self.l2_regularization = self.alpha * self.w_[1:].dot(self.w_[1:]) # alpha * sum of squared weights
        self.l2_gradient       = self.alpha * np.sum(self.w_[1:])          # alpha * sum of weights

        for i in range(self.n_iter):
            gradient               = (y - self.output(X)) + self.l2_gradient   # the gradient
            self.w_[1:]           += (X.T.dot(gradient) * (self.eta))          # update each weight w.r.t. its gradient
            self.w_[0]            += (y - self.output(X)).sum() * self.eta     # update the intercept w.r.t. the gradient
            self.l2_regularization = self.alpha * self.w_[1:].dot(self.w_[1:]) # adjust regularization term to reflect new values of weights
            self.l2_gradient       = self.alpha * np.sum(self.w_[1:])          # update derivative of regularization term to be used in gradient

This looks right to me, but I'm not sure how to interpret the fact that the results always match OLS rather than what I'd get from the closed form of Ridge Regression.

EDIT:

Even if I set the value of alpha to something very large, like 1000, the results do not change.

CODE:

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

r = RidgeRegression(alpha=1000, n_iter=5000, eta=0.0001) # my algorithm
r.fit(X, y)
rreg = Ridge(alpha=1000)  # scikit learn's ridge algorithm
rreg.fit(X, y)
ols_coeffs = np.linalg.inv(X.T @ X) @ (X.T @ y) # ols algorithm
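
For the closed-form ridge numbers I'm using the standard textbook formula, something along these lines (same setup as the OLS line above, so no intercept column):

ridge_coeffs = np.linalg.inv(X.T @ X + 1000 * np.eye(X.shape[1])) @ (X.T @ y) # closed-form ridge, alpha=1000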

If we compare these, we can see the ridge coefficients in my homespun version match the OLS estimates almost exactly:

pd.DataFrame({
    'Variable'     : X.columns,
    'gd_Weight'    : r.w_[1:],     # coefficients for ridge regression w/ gd
    'cf_Weight'    : ridge_coeffs, # closed form version of ridge
    'ols_weight'   : ols_coeffs    # regular OLS, closed form
})

Which gives the following dataframe:

[Output: a table of gd_Weight, cf_Weight, and ols_weight for each variable; the gd_Weight column matches ols_weight almost exactly.]

They all have roughly the same intercept of 22.5.

Best Answer

I'm not very well versed in Python, but if I interpret your code correctly then X.T.dot(gradient) takes the gradient variable you computed on the previous line, and computes its dot product with $X$. This doesn't seem right since 'gradient' includes the "L2-gradient" which shouldn't be multiplied by $X$. Only the residual (y-self.output(X)) should be in that product. You want to add the L2-gradient only afterwards, and then multiply the result by eta.

Also, the L2-gradient shouldn't sum over the $w$'s (remember the gradient should be vector-valued, since you're taking the derivative of a scalar-valued function w.r.t. a vector). Together these errors probably explain the confusing results you're getting, although with a gradient that wrong I would have expected the output to just be garbage rather than the OLS solution, so there may be something subtle happening that I'm not seeing.

(Note that your mathematical expressions for the gradient also aren't quite right, but there you actually omitted the pre-multiplication of the residuals with $X^T$. And you also again have a sum over $w$'s in the L2-gradient which shouldn't be there.)
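
To make that concrete, here is a rough (untested) sketch of what I'd expect the inner loop of fit to look like, keeping your sign convention of stepping along y - output:

for i in range(self.n_iter):
    residual     = y - self.output(X)                            # vector of residuals
    grad_w       = X.T.dot(residual) - self.alpha * self.w_[1:]  # data term, then subtract the L2 penalty gradient (alpha * w; a factor of 2 depends on how you write the cost)
    self.w_[1:] += self.eta * grad_w                             # step the weights
    self.w_[0]  += self.eta * residual.sum()                     # step the intercept, which is not penalized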

Going forward, I would recommend first checking your implementation of each individual part of the cost function and gradient, and satisfying yourself that each is correct, before running gradient descent.
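
For example, assuming X and y are NumPy arrays, a quick finite-difference check along these lines (a rough sketch, since I'm not a Python expert) will tell you whether an analytic gradient matches the cost before you run any descent:

import numpy as np

def ridge_cost(w, b, X, y, alpha):
    resid = y - (X.dot(w) + b)
    return 0.5 * resid.dot(resid) + 0.5 * alpha * w.dot(w)

def ridge_grad_w(w, b, X, y, alpha):
    resid = y - (X.dot(w) + b)
    return -X.T.dot(resid) + alpha * w               # analytic gradient of the cost w.r.t. w

rng = np.random.RandomState(0)
w, b, eps = rng.normal(size=X.shape[1]), 0.0, 1e-6
g = ridge_grad_w(w, b, X, y, alpha=10.0)
for j in range(min(5, X.shape[1])):                  # spot-check a few coordinates
    w_hi, w_lo = w.copy(), w.copy()
    w_hi[j] += eps
    w_lo[j] -= eps
    num = (ridge_cost(w_hi, b, X, y, 10.0) - ridge_cost(w_lo, b, X, y, 10.0)) / (2 * eps)
    print(j, g[j], num)                              # the two numbers should agree to several decimals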
