Finally we were able to produce the same solution with both methods! The first issue is that glmnet solves the lasso problem as stated in the question, while lars uses a slightly different normalization in the objective function: it replaces $\frac{1}{2N}$ by $\frac{1}{2}$. Second, the two methods normalize the data differently, so normalization must be switched off when calling them.
To reproduce this, and verify that lars and glmnet compute the same lasso solutions, the following lines in the code above must be changed:
la <- lars(X,Y,intercept=TRUE, max.steps=1000, use.Gram=FALSE)
to
la <- lars(X,Y,intercept=TRUE, normalize=FALSE, max.steps=1000, use.Gram=FALSE)
and
glm2 <- glmnet(X,Y,family="gaussian",lambda=0.5*la$lambda,thresh=1e-16)
to
glm2 <- glmnet(X,Y,family="gaussian",lambda=1/nbSamples*la$lambda,standardize=FALSE,thresh=1e-16)
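Putting the two changes together, here is a minimal self-contained sketch; the simulated data (nbSamples, X, Y) is made up for illustration, since the original code above is not shown in full:

```r
library(lars)
library(glmnet)

# Simulated example data (stand-in for the question's X, Y)
set.seed(42)
nbSamples <- 100
X <- matrix(rnorm(nbSamples * 5), nbSamples, 5)
Y <- X %*% c(1, -2, 0, 0, 3) + rnorm(nbSamples)

# lars with its internal normalization switched off
la <- lars(X, Y, intercept = TRUE, normalize = FALSE,
           max.steps = 1000, use.Gram = FALSE)

# glmnet: rescale lars' lambdas by 1/nbSamples to account for the
# 1/(2N) factor in its objective, and switch off standardization
glm2 <- glmnet(X, Y, family = "gaussian",
               lambda = la$lambda / nbSamples,
               standardize = FALSE, thresh = 1e-16)

# Compare the coefficient paths at the lars breakpoints;
# row i of la$beta is the solution at la$lambda[i]
max_diff <- max(abs(la$beta[seq_along(la$lambda), ] - t(as.matrix(glm2$beta))))
max_diff
```

With both normalizations disabled and the lambdas rescaled, the maximal coefficient difference should be tiny, confirming the two solvers agree.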
Due to the class imbalance, you should have a look at the probabilities that your forest outputs. (I'm not familiar with the random forest R package, but I think there is an option, type="prob", in the predict function that will give you a matrix of class probabilities.)
I believe the next thing to do with these probabilities is to derive a ROC curve and see whether it performs better than the majority vote. If so, it means you should consider a 'soft' voting approach, optimising the threshold (based on the ROC curve) used to determine the predicted class (which is straightforward in the binary case), instead of a 'majority' voting one.
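A minimal sketch of that idea, assuming the randomForest and pROC packages; the imbalanced toy data and the Youden's-J threshold rule are my own choices for illustration, not from the question:

```r
library(randomForest)
library(pROC)

# Imbalanced toy data: positives are rare
set.seed(1)
n <- 500
x <- data.frame(a = rnorm(n), b = rnorm(n))
y <- factor(ifelse(x$a + rnorm(n) > 1.5, "pos", "neg"))

rf <- randomForest(x, y)

# Out-of-bag class-probability matrix; keep the positive-class column
prob <- predict(rf, type = "prob")[, "pos"]

# ROC curve from the probabilities, then pick a 'soft' threshold
# (here the one maximising Youden's J) instead of plain majority vote
roc_obj <- roc(response = y, predictor = prob, levels = c("neg", "pos"))
auc(roc_obj)
coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"))
```

The chosen threshold then replaces the default 0.5 cut-off when converting probabilities into class predictions.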
Best Answer
Before going deeper into the comparison, make sure that each of the two methods agrees with itself. You can check this by bootstrapping the entire variable-importance procedure a few times, then plotting the original variable importance for each variable against the importance estimated from a bootstrap sample.
The bootstrap involves taking samples of size $n$ with replacement from the original dataset of $n$ observations, and repeating the analysis. The repetitions have to be "from scratch." Here is what the process looks like in R:
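A sketch of that process, using randomForest's importance() as the stand-in variable-importance method and iris as a stand-in dataset (the answer does not specify either):

```r
library(randomForest)

set.seed(7)
data(iris)

# Variable importance on the original data
orig_imp <- importance(randomForest(Species ~ ., data = iris))[, 1]

# Repeat the whole analysis "from scratch" on bootstrap samples
B <- 20
boot_imp <- replicate(B, {
  idx <- sample(nrow(iris), replace = TRUE)   # n rows, with replacement
  importance(randomForest(Species ~ ., data = iris[idx, ]))[, 1]
})

# Original importance vs. mean bootstrap importance, per variable;
# points far from the diagonal flag unstable importance estimates
plot(orig_imp, rowMeans(boot_imp),
     xlab = "original importance", ylab = "bootstrap importance")
abline(0, 1)
```

If the points scatter widely around the diagonal, the method does not even agree with itself, and comparing it to the other method is premature.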