Solved – Stochastic gradient descent for ridge regression when the regularization parameter is very large

gradient descent, regularization, stochastic gradient descent

As we know, the stochastic (per-sample) gradient of the ridge regression loss is:
$$
g = \frac{\partial L}{\partial \theta} = -X_i^T(y_i-X_i\theta)+2\lambda\theta
$$
where $X_i$ is the $i$th training sample and $y_i$ its target.
The update of $\theta$ is then:
$$
\theta^+ =\theta-\eta g
$$
with learning rate $\eta$.
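
In code, this single-sample update looks like the following (a minimal NumPy sketch; the function name and array shapes are just illustrative):

```python
import numpy as np

def sgd_ridge_step(theta, X_i, y_i, lam, eta):
    """One stochastic gradient step of the update above.

    theta : (d,) current parameter vector
    X_i   : (d,) one training sample
    y_i   : scalar target
    lam   : regularization parameter lambda
    eta   : learning rate
    """
    # g = -X_i^T (y_i - X_i theta) + 2 * lambda * theta
    g = -X_i * (y_i - X_i @ theta) + 2.0 * lam * theta
    return theta - eta * g
```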

My question is: if $\lambda$ is very large, the first term of the gradient, $-X_i^T(y_i-X_i\theta)$, becomes negligible, which means the loss cannot be optimized because $g$ no longer depends on the training sample. Am I wrong about this?
(For context: I used a Python package to fit ridge regression, and the regularization parameter $\lambda$ chosen on a validation set is a huge value. I then implemented stochastic gradient descent as a comparison, but my loss never gets down to the loss of the library model. In fact, with this huge $\lambda$ the loss does not decrease at all.)
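
Here is a minimal sketch of what I mean, with synthetic data and made-up values of $\lambda$ and $\eta$ (not my actual setup); with a constant learning rate the iterates diverge rather than converge:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)

lam, eta = 1e4, 0.01          # huge lambda, fixed learning rate (illustrative values)
theta = np.zeros(5)
for epoch in range(5):
    for i in rng.permutation(len(y)):
        g = -X[i] * (y[i] - X[i] @ theta) + 2.0 * lam * theta
        theta = theta - eta * g
    # the penalty term dominates: |1 - 2*eta*lam| = 199 > 1,
    # so ||theta|| quickly blows up to inf/nan
    print(epoch, np.linalg.norm(theta))
```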

Best Answer

The ridge regression routine in the Python package offers several solver options and is not using the same method as you. Your implementation is, I presume, the most basic form of gradient descent with a constant learning rate, i.e. you have no strategy for adapting the step size. In numerically sensitive cases like yours (very large values), this easily leads to different results. Note in particular that the regularization part of your update scales $\theta$ by $(1-2\eta\lambda)$ at every step; with a huge $\lambda$ and a fixed $\eta$ this factor can exceed 1 in magnitude, so the iterates oscillate or diverge and the loss never decreases. Library solvers, in general, are written by experienced researchers and developers and are far more stable under such numerical challenges.
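
For illustration only (this is not what the package actually does internally), here is a rough sketch of the contrast: a direct linear-algebra solve, which involves no learning rate at all, next to the same per-sample SGD with a simple decaying step size so that $\eta\lambda$ stays small. The schedule and the numbers below are placeholder choices, not a recommendation:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Direct minimizer of ||y - X theta||^2 + lam * ||theta||^2,
    the kind of linear-algebra solve a library solver can rely on."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def sgd_ridge(X, y, lam, eta0=0.01, epochs=200, seed=0):
    """Per-sample SGD as in the question, but with a decaying learning rate
    so that eta * lam stays small enough for the updates to contract."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            t += 1
            eta = eta0 / (1.0 + eta0 * lam * t)   # decaying step size
            g = -X[i] * (y[i] - X[i] @ theta) + 2.0 * lam * theta
            theta -= eta * g
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=200)

lam = 1e4
# Note: adding 2*lam*theta to every per-sample gradient applies the penalty
# once per sample, so the comparable closed-form penalty is 2 * n * lam.
print(ridge_closed_form(X, y, 2 * len(y) * lam))
print(sgd_ridge(X, y, lam))
```

With the decaying schedule the iterates stay bounded and end up near the heavily shrunken closed-form solution, whereas the constant-step version in the question diverges.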