Solved – How to find parameters for ridge and lasso regularization when cost minimization does not converge

convergence, lasso, logistic, ridge regression

In the Stanford ML course, we were taught to find good values for the lambda parameters of ridge/lasso regularization by training the model for a range of lambda values, evaluating each resulting hypothesis on a cross-validation set, and picking the values that give the minimum CV error.
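
Roughly, that selection loop looks like this (just a sketch of the procedure, with hypothetical train() and cv_error() helpers, not my actual code):

# Sketch only: pick the lambda with the lowest cross-validation error.
# train() and cv_error() are hypothetical placeholders.
best_lambda, best_cv_error = None, float('inf')
for lam in [0.01, 0.03, 0.1, 0.3, 1, 3, 10]:
    theta = train(train_X, train_y, lam)      # minimize the regularized cost on the training set
    err = cv_error(theta, cv_X, cv_y)         # unregularized error on the CV set
    if err < best_cv_error:
        best_lambda, best_cv_error = lam, err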

The problem is, I am playing with a big data set which might not even be well suited for logistic regression (the Internet Ads set), and I can't use the method described above because, during optimization, the cost stays essentially flat (it changes only around the 8th decimal place between iterations) and doesn't seem to converge.

I need good regularization values for the optimization to converge, but I can't find those good values without the optimization converging.

Any suggestions? Should I move on to using SVM, or is this data set solvable with logistic regression?

NB: I am doing this for learning purposes, so I'm much more interested in explanations why my approach is bad than I am in black-box libraries which will give me a solution.

EDIT: Some relevant code snippets (occasionally pseudocode-ish, for clarity). The usual notations apply.

The function used to compute the cost:

import numpy

def sigmoid(z):
    return 1.0 / (1.0 + numpy.exp(-z))

class TooManyIterationsException(Exception):
    pass

def computeCost(theta, X, y):
    global iter  # call counter (note: this shadows the built-in iter())
    iter += 1
    if iter > 10:
        # Because the cost doesn't converge, I force an interruption
        # in order to jump to the next combination of lambda values.
        raise TooManyIterationsException(iter)
    m = y.size
    h = sigmoid(X.dot(theta.T))
    # Unregularized log-likelihood part of the cost.
    J = y.T.dot(numpy.log(h)) + (1.0 - y.T).dot(numpy.log(1.0 - h))
    J_reg2 = theta[1:] ** 2               # L2 (ridge) penalty, bias term excluded
    J_reg1 = numpy.abs(theta[1:])         # L1 (lasso) penalty: absolute values, bias excluded
    cost = (-1.0 / m) * J.sum() + LAMBDA2 * J_reg2.sum() + LAMBDA1 * J_reg1.sum()
    print "Cost: ", cost
    return cost
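
For clarity, the cost the code above is meant to minimize (with $n$ non-bias features and $\theta_0$ excluded from both penalties) is

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)}\log h_\theta(x^{(i)}) + \big(1-y^{(i)}\big)\log\big(1-h_\theta(x^{(i)})\big) \Big] + \lambda_2\sum_{j=1}^{n}\theta_j^2 + \lambda_1\sum_{j=1}^{n}\lvert\theta_j\rvert$$

where $h_\theta(x) = \mathrm{sigmoid}(\theta^T x)$. (Unlike the course convention, I am not scaling the penalty terms by $\tfrac{1}{2m}$.)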

Invoking scipy.optimize.fmin_bfgs:

initial_thetas = numpy.zeros((len(train_X[0]), 1))
myargs = (train_X, train_y)

for LAMBDA1 in [0.01, 0.02, 0.04, ..., 10]:
    for LAMBDA2 in [0.01, 0.02, 0.04, ..., 10]:
        try:
            iter = 0
            theta = scipy.optimize.fmin_bfgs(computeCost, x0=initial_thetas, args=myargs)
        except TooManyIterationsException as e:
            print '\n'  # give up on this (LAMBDA1, LAMBDA2) pair and move on
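
For reference, an analytic gradient could be passed to fmin_bfgs via fprime=, together with a callback to record the thetas at each iteration. This is only a sketch of what that would look like for the cost above (it assumes y is a flat 1-D 0/1 array, that X[:, 0] is the all-ones bias column, and it uses sign(theta) for the L1 part, which is only a subgradient at 0):

def computeGrad(theta, X, y):
    # Sketch: gradient of the regularized cost above.
    m = y.size
    h = sigmoid(X.dot(theta))
    grad = (1.0 / m) * X.T.dot(h - y)             # gradient of the log-loss part
    # Penalty terms; the bias theta[0] is left unpenalized.
    grad[1:] += 2.0 * LAMBDA2 * theta[1:] + LAMBDA1 * numpy.sign(theta[1:])
    return grad

thetas_seen = []                                  # one entry per BFGS iteration
theta = scipy.optimize.fmin_bfgs(computeCost, x0=initial_thetas.ravel(),
                                 fprime=computeGrad, args=myargs,
                                 callback=thetas_seen.append, maxiter=400)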

A typical output looks like this:
[screenshot: fmin_bfgs output, with the cost changing only around the 8th decimal place between iterations]

EDITED AGAIN: Evolution of thetas!
[screenshot: plot of the theta values across iterations]

Best Answer

Is the question "why is cost() so flat near [0 0 0 0] ?",
or "why does fmin_xx not find a minimum on a flat surface ?"
A couple of suggestions anyway:

1) look at cost() near x0, with a helper look( f, x0, h ) that evaluates f() at all corners of a cube of side 2h around x0, or at a random subset of corners if that's too many (see the sketch after the results below).
2) is [0 0 0 0] (all zeros) a reasonable starting point for the weights?
3) what happened to the first 3 (continuous) columns of X?
4) start with fmin(), a.k.a. Nelder-Mead, (at the best dim + 1 points from look()) before running fmin_bfgs, which is more powerful but harder to drive.
5) scikit-learn SGDClassifier got to

97.3 % correct  SGDClassifier  uciml/ad*
    X (1572, 1558) Xtest (787, 1558)  -- 2/3, 1/3 split
    centre 3  -- scale all feature columns to [-1, 1]
    sgditer 100  loss log  penalty l2
11 sec
Confusion matrix: 97.3 % correct = 766 / 787
True classes down, estimated across  / true class sizes
0:  652    8  /  660  99 %
1:   13  114  /  127  90 %
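
A rough sketch of what look() and the fmin() -> fmin_bfgs chain could look like (illustrative only; look() and its corner grid are just a convention here, and it assumes LAMBDA1/LAMBDA2 are set and the iteration cap inside computeCost is lifted while exploring):

def look(f, x0, h, ncorners=32, seed=0):
    # Evaluate f at (a random subset of) corners of the cube of side 2h around x0.
    rng = numpy.random.RandomState(seed)
    corners = numpy.sign(rng.uniform(-1, 1, size=(ncorners, len(x0))))   # random +/-1 patterns
    points = x0 + h * corners
    values = numpy.array([f(p) for p in points])
    order = numpy.argsort(values)
    return points[order], values[order]                                  # best corners first

points, values = look(lambda t: computeCost(t, train_X, train_y),
                      initial_thetas.ravel(), h=0.5)
# Derivative-free Nelder-Mead from the best corner first, then BFGS from its result.
theta_nm = scipy.optimize.fmin(computeCost, x0=points[0], args=myargs, maxiter=200)
theta = scipy.optimize.fmin_bfgs(computeCost, x0=theta_nm, args=myargs)

For reproducing the SGDClassifier run above, a sketch would be (not the exact script used; the train_X / test_X names and the MinMaxScaler choice are assumptions here, and newer scikit-learn versions spell the options loss='log_loss' and max_iter=100):

from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(-1, 1))      # scale all feature columns to [-1, 1]
X_train = scaler.fit_transform(train_X)
X_test = scaler.transform(test_X)

clf = SGDClassifier(loss='log', penalty='l2', n_iter=100)   # loss='log_loss', max_iter=100 on newer versions
clf.fit(X_train, train_y)
print(clf.score(X_test, test_y))                  # fraction correct on the held-out third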