Solved – Why am I so unsuccessful at predicting with the Generalized Lasso (genlasso {genlasso})?

lasso · prediction · r · regularization

I am trying to use the generalized lasso function genlasso {genlasso} to incorporate an additional penalty matrix into the estimation process.

I started with a "hello world" example: I used a diagonal matrix as the penalty matrix. The predictions I got from the model are very far from what I expected:

  • predictions are negative whereas my response variable is positive.

A regular lasso model from glmnet gives me at least reasonable (positive) predictions, so my question is: what am I missing in the function call, or what do I not understand, such that I end up with such unintuitive predictions from the genlasso {genlasso} model?

Reproducible example.

Step 1. Simulate data based on covariance and mean estimates from my real data (correcting the simulated response for negative values).

# Read objects describing simulation structure 
sim.cov   <- dget(url("https://raw.githubusercontent.com/martakarass/so_q/master/genlasso/sim_cov"))
sim.means <- dget(url("https://raw.githubusercontent.com/martakarass/so_q/master/genlasso/sim_means"))

# Simulate X (explanatory variables) and Y (response) matrices 
# from multivariate normal
library(MASS)
set.seed(1)
sim.sample <- mvrnorm(n = 150, mu = sim.means, Sigma = sim.cov)
sim.sample.x <- sim.sample[, 1:66]
sim.sample.y <- sim.sample[, 67]

# Correct for negative response variable value 
sim.sample.y[sim.sample.y < 0]
# [1] -0.8409024 -1.8470264 -1.8503388
sim.sample.y[sim.sample.y < 0] <- abs(sim.sample.y[sim.sample.y < 0])
summary(sim.sample.y)
# Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
# 0.01549  7.39000 10.93000 10.72000 13.58000 24.77000  

Step 2. Simulate train and test subset indices

train.idc <- as.logical(rbinom(n = nrow(sim.sample), size = 1, prob = 0.8))
test.idc <- !train.idc

Step 3. Build a regular lasso model. Predict values using the best lambda value from cross-validation.

# LASSO
# Cross-validate on a train set 
library(glmnet)
cv.glmnet.out <- cv.glmnet(x = sim.sample.x[train.idc, ], 
                           y = sim.sample.y[train.idc], 
                           nfolds = 10, type.measure = "mse", alpha = 1)
lambda.cv <- cv.glmnet.out$lambda.min
model.lasso <- glmnet(sim.sample.x[train.idc, ], sim.sample.y[train.idc], 
                      alpha = 1)
model.lasso.predicted <- predict(model.lasso, s = lambda.cv, 
                                 newx = sim.sample.x[test.idc, ])

# Investigate predictions from a model
summary(as.vector(model.lasso.predicted))
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   3.412   9.235  10.850  10.960  12.760  17.950

Step 4. Build a generalized lasso model. For convenience, use the same best lambda as in the regular lasso model.

# LASSO `genlasso`: diagonal penalty matrix 
p <- ncol(sim.sample.x)
library(genlasso)
genlasso.diag.out <- genlasso(y = sim.sample.y[train.idc], 
                              X = sim.sample.x[train.idc, ], 
                              D = diag(1, p))
model.genlasso.diag.predicted <- predict(genlasso.diag.out, lambda=lambda.cv, 
                                         Xnew = sim.sample.x[test.idc, ])$fit

Investigate the predicted values, many of which are negative:

summary(as.vector(model.genlasso.diag.predicted))
#     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
# -27.2800  -6.0070   1.2580   0.3049   6.2230  24.4700 

Best Answer

There are two issues:

  1. glmnet includes an intercept by default, while genlasso does not. To disable the intercept in glmnet, use intercept=FALSE; this reduces performance significantly here. To instead add an intercept to genlasso, cbind a column of ones onto your X matrix and change D accordingly. If you choose to penalize the intercept, you may need to set the corresponding entry of the (p+1)-by-(p+1) D matrix more carefully than in the example code below. Setting penalize.intercept to FALSE in the code below leaves the intercept unpenalized by using a p-by-(p+1) penalty matrix D: an identity matrix with the row corresponding to the intercept removed.
  2. The lambda values for glmnet and genlasso are on different scales. glmnet optimizes $\frac{1}{2N}\sum_{i=1}^N(y_i-\beta_0-x_i^T\beta)^2+\lambda\left[\frac{1-\alpha}{2}\left\|\beta\right\|_2^2+\alpha\left\|\beta\right\|_1\right]$, while genlasso optimizes $\frac12\left\|y-X\beta\right\|_2^2+\lambda\left\|D\beta\right\|_1$. Both use the $L^1$ norm for the lasso penalty, but glmnet's error term is the squared $L^2$ norm divided by $2N$, while genlasso's divides by just $2$, so a $\lambda$ value from one must be rescaled before being applied to the other. (This is also an issue when reusing a CV-selected $\lambda$ to fit a model with a significantly different $N$, which the code below ignores.) We can either scale lambda.cv to the appropriate value for genlasso, or use cross-validation to select an appropriate value directly; the code below takes the latter approach. Note that glmnet provides two rules for selecting the lambda value from cross-validation: the value giving the minimum CV loss (lambda.min), and the 1-standard-error rule, which adds more regularization in exchange for a slightly higher CV loss (lambda.1se); the 1se rule produces better out-of-sample performance in this case.
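To see the rescaling concretely: with $D$ the identity and no intercept, multiplying glmnet's objective by $N$ leaves its minimizer unchanged, so the two problems share solutions when

$$\frac{1}{2N}\left\|y-X\beta\right\|_2^2+\lambda_{\text{glmnet}}\left\|\beta\right\|_1 \quad\Longleftrightarrow\quad \frac12\left\|y-X\beta\right\|_2^2+\underbrace{N\lambda_{\text{glmnet}}}_{\lambda_{\text{genlasso}}}\left\|\beta\right\|_1,$$

i.e. $\lambda_{\text{genlasso}} = N\,\lambda_{\text{glmnet}}$, where $N$ is the number of training observations.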

Code for glmnet without an intercept, genlasso with an intercept, and a CV-selected lambda:

set.seed(42) # set RNG seed for reproducibility 
n.train <- sum(train.idc) # number of training points
n.folds <- 10L # number of CV folds
foldid <- sample(rep_len(seq.int(n.folds), n.train)) # fold number for each training point
## (creates folds with roughly the same number of points)

train.x <- sim.sample.x[train.idc, ]
train.y <- sim.sample.y[train.idc]
test.x <- sim.sample.x[test.idc, ]
test.y <- sim.sample.y[test.idc]

## Call cv.glmnet specifying the foldid's:
cv.glmnet.fit <- cv.glmnet(x = train.x, 
                           y = train.y, 
                           foldid = foldid,
                           type.measure = "mse", alpha = 1)
## Choose the lambda with the minimum CV-estimated loss:
cv.glmnet.lambda.min.pred <- predict(cv.glmnet.fit,
                                     s = cv.glmnet.fit$lambda.min,
                                     newx = test.x)
## Using the 1-standard-error lambda selection rule:
cv.glmnet.lambda.1se.pred <- predict(cv.glmnet.fit,
                                     s = cv.glmnet.fit$lambda.1se,
                                     newx = test.x)
## with no intercept:
cv.glmnet0.fit <- cv.glmnet(x = train.x, 
                            y = train.y, 
                            foldid = foldid,
                            type.measure = "mse", alpha = 1,
                            intercept=FALSE)
cv.glmnet0.lambda.min.pred <- predict(cv.glmnet0.fit,
                                      s = cv.glmnet0.fit$lambda.min,
                                      newx = test.x)
cv.glmnet0.lambda.1se.pred <- predict(cv.glmnet0.fit,
                                      s = cv.glmnet0.fit$lambda.1se,
                                      newx = test.x)


## Get lambda sequence for genlasso on all training data:
D <- diag(1, p+1)
penalize.intercept <- FALSE
if (!penalize.intercept) D <- D[-1,]
genlasso.fit <- genlasso(y = train.y, 
                         X = cbind(1,train.x), 
                         D = D)
## Evaluate each lambda on each fold:
fold.lambda.losses <- tapply(seq_along(foldid), foldid, function(fold.indices) {
  fold.genlasso.fit <- genlasso(y = train.y[-fold.indices],
                                X = cbind(1,train.x[-fold.indices,]),
                                D = D)
  ## length(fold.indices)-by-length(genlasso.fit$lambda) matrix of
  ## predictions for this fold:
  fold.genlasso.preds <- predict(fold.genlasso.fit,
                                 lambda = genlasso.fit$lambda,
                                 Xnew = cbind(1, train.x[fold.indices,]))$fit
  lambda.losses <- colMeans((fold.genlasso.preds - train.y[fold.indices])^2)
  return (lambda.losses)
})
## CV loss for each lambda:
cv.lambda.losses <- colMeans(do.call(rbind, fold.lambda.losses))
cv.genlasso.lambda.min <- genlasso.fit$lambda[which.min(cv.lambda.losses)]
## Caution, this lambda may need rescaling based on the ratio of the full training set size to the fold training set sizes.

## Predict:
cv.genlasso.lambda.min.pred <- predict(genlasso.fit,
                                       lambda = cv.genlasso.lambda.min,
                                       Xnew = cbind(1,test.x))$fit
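
For completeness, the first approach (rescaling rather than re-cross-validating) can be sketched in a couple of lines. This is only an approximation here, since lambda.min was selected with an intercept while the exact factor-of-$N$ correspondence holds for the identity penalty without one; n.train, cv.glmnet.fit, and genlasso.fit are the objects defined above:

## Alternative sketch: rescale the glmnet lambda to genlasso's scale
## (the two objectives differ by a factor of n.train)
cv.genlasso.lambda.rescaled <- cv.glmnet.fit$lambda.min * n.train
cv.genlasso.rescaled.pred <- predict(genlasso.fit,
                                     lambda = cv.genlasso.lambda.rescaled,
                                     Xnew = cbind(1, test.x))$fit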

summary(as.vector(cv.glmnet.lambda.min.pred))
summary(as.vector(cv.glmnet.lambda.1se.pred))
summary(as.vector(cv.glmnet0.lambda.min.pred))
summary(as.vector(cv.glmnet0.lambda.1se.pred))
summary(as.vector(cv.genlasso.lambda.min.pred))

mean((cv.glmnet.lambda.min.pred-test.y)^2)
mean((cv.glmnet.lambda.1se.pred-test.y)^2)
mean((cv.glmnet0.lambda.min.pred-test.y)^2)
mean((cv.glmnet0.lambda.1se.pred-test.y)^2)
mean((cv.genlasso.lambda.min.pred-test.y)^2)