Solved – Optimization of regularized least squares with gradient descent

Tags: gradient descent, MATLAB, regression, regularization

I have this regularized least square formula:
$$\sum\limits_{i=1}^N (\omega^T x_i - y_i)^2 + \lambda \left\|\omega\right\|^2$$

And its gradient with respect to $\omega_k$:
$$2 \sum\limits_{i=1}^N \left(\Big(\sum\limits_{j=1}^d x_{ij}\omega_j\Big)x_{ik} - x_{ik} y_i\right) + 2\lambda \omega_k$$
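
In vector form (with the $x_i^T$ stacked as the rows of $X$), this gradient can be written as
$$2X^T(X\omega - y) + 2\lambda\omega$$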

I want to use gradient descent to find the vector w. I am using MATLAB. I thought I would be able to make two loops and compute the w's, but my solution is very unstable and I need to use a very small learning rate a (a = 0.000000001) to avoid getting NaN values. I also thought the entries of w should shrink towards 0 when lambda is large, but that does not happen…
My data set is a matrix X (400×64) and a vector y (400×1). This is a two-class problem where y contains the class labels (+1 for class 1 and -1 for class 2).

Here is my MATLAB code:

function [ w ] = gradDecent( X, Y, a, lambda, iter )
% GRADIENT DESCENT

w = zeros(size(X(1,:)))';                 % initialize w as a d-by-1 zero vector

for it = 1:iter                           % for each iteration
    for k = 1:size(w,1)                   % for each coordinate of w
        s = 0;
        for i = 1:size(X,1)               % accumulate the data term of the k-th partial derivative
            s = s + (X(i,:)*w - Y(i))*X(i,k);
        end
        w(k) = w(k) - a*(2*s + 2*lambda*w(k));   % gradient step on the k-th coordinate
    end
end

Am I making some stupid mistakes?

Best Answer

  1. Instead of using a constant learning rate, set your learning rate to a reasonably large value (try $\alpha=1$) and then use "backoff" to shrink it as needed, using the following algorithm (a MATLAB sketch of this appears after the list):

    while $f(w^{(k)} - \alpha d) \gt f(w^{(k)}) $: $\alpha := \alpha/2$

    $w^{(k+1)} := w^{(k)} - \alpha d$

    where $f$ is your objective function, $d$ is your descent direction, and $\alpha$ is the learning rate.

  2. I don't think you've implemented the gradient correctly in your code. In particular, I believe that, as coded, the s term evaluates to $s=\sum_{i=1}^N ( x_i \omega - y_i x_i )$ when you really want $s=\sum\limits_{i=1}^N ((\sum\limits_{j=1}^d x_{ij}\omega_j)x_{ik} - x_{ik} y_i)$.

  3. Gradient descent probably isn't the best solution here. Gradient descent is slow: you shouldn't be surprised that it's taking a long time to converge, because it usually does. It gets the job done, but it's generally a slow option. Try playing with other optimization algorithms and see what happens; try Newton-Raphson, for instance (see the note after this list). I'm sure if you poke around the literature you'll be able to determine the "state-of-the-art" algorithm for optimizing the ridge regression loss function. I'd put money on it not being simple gradient descent.
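
For concreteness, here is a minimal MATLAB sketch of points 1 and 2 together: the vectorized gradient $2X^T(Xw - y) + 2\lambda w$ plugged into the backoff rule above. The function name, the reset to $\alpha=1$ at each iteration, and the small-step safeguard are illustrative choices, not something from the original question.

function w = ridgeGradDescentBackoff( X, Y, lambda, iter )
% Gradient descent with "backoff" on f(w) = ||X*w - Y||^2 + lambda*||w||^2

w = zeros(size(X,2), 1);                    % start at w = 0
f = @(w) sum((X*w - Y).^2) + lambda*(w'*w); % objective function

for it = 1:iter
    d = 2*X'*(X*w - Y) + 2*lambda*w;        % gradient; the descent direction is -d
    a = 1;                                  % reset the learning rate each iteration (illustrative choice)
    while f(w - a*d) > f(w)                 % backoff: halve until the step decreases f
        a = a/2;
        if a < 1e-12, break; end            % safeguard against an endless loop
    end
    w = w - a*d;                            % take the accepted step
end
end

On point 3: because the ridge objective is quadratic, a single Newton-Raphson step from any starting point already lands on the exact minimizer, which in MATLAB is simply w = (X'*X + lambda*eye(size(X,2))) \ (X'*Y); this makes a useful baseline to check any iterative solver against.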
