You can save yourself some coding effort, surprisingly enough, by simply doing the leave-one-out (LOO) cross-validation yourself:
# normalize the weights once, up front (see the numerical issue below)
data1$norm.wt <- data1$wt / sum(data1$wt)
# leave-one-out loop: refit without row j, then predict the probability for row j
loo <- rep(0, nrow(data1))
for (j in 1:nrow(data1))
{
  fj <- glm(y ~ x1 + x2, family = quasibinomial, weights = norm.wt, data = data1[-j, ])
  loo[j] <- predict(fj, data1[j, ], type = "response")
}
> print(loo, digits=4)
[1] 2.564e-02 2.220e-16 4.405e-01 2.128e-07 7.360e-01 5.360e-01 2.383e-02
[8] 2.064e-02 1.316e-01 2.174e-02 4.152e-01 1.895e-01 7.162e-01
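You can then summarize these held-out probabilities however you like; for instance (a minimal sketch, assuming data1$y is coded 0/1):
# leave-one-out Brier score and misclassification rate
mean((data1$y - loo)^2)
mean((loo > 0.5) != data1$y)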
Normalizing the weights to sum to one prevents a numerical problem (in this case) that results in your parameter estimates blowing up:
> summary(glm(y~x1+x2, family=quasibinomial, weights=wt, data=data1))

Call:
glm(formula = y ~ x1 + x2, family = quasibinomial, data = data1, 
    weights = wt)

Deviance Residuals: 
   Min     1Q Median     3Q    Max  
-394.9    0.0    0.0    0.0  164.4  

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept) -6.790e+15  2.354e+15  -2.885   0.0162 *
x1           2.004e+15  1.630e+15   1.230   0.2470  
x2           3.403e+15  1.393e+15   2.443   0.0347 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasibinomial family taken to be 2.030037e+18)

    Null deviance:  25460  on 12  degrees of freedom
Residual deviance: 324940  on 10  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 12
versus normalized weights:
> summary(glm(y~x1+x2, family=quasibinomial, weights=data1$wt/sum(data1$wt), data=data1))

Call:
glm(formula = y ~ x1 + x2, family = quasibinomial, data = data1, 
    weights = data1$wt/sum(data1$wt))

Deviance Residuals: 
     Min       1Q   Median       3Q      Max  
 -0.4273  -0.1004  -0.0444   0.1706   0.3879  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -8.038      6.978  -1.152    0.276
x1             1.631      3.177   0.513    0.619
x2             4.142      3.536   1.171    0.269

(Dispersion parameter for quasibinomial family taken to be 0.1440099)

    Null deviance: 1.30513  on 12  degrees of freedom
Residual deviance: 0.84517  on 10  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 6
You don't have to renormalize the weights at every step of the LOO loop; in effect they are renormalized anyway, since the weights are relative (see the sketch below).
This doesn't match the SAS probabilities, admittedly, but it seems to me it's what you're trying to do.
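To illustrate the point about relative weights (a sketch, using the data1 and norm.wt from above): multiplying all the weights by a positive constant leaves the fitted coefficients, and hence the predicted probabilities, unchanged; only the dispersion estimate changes.
# prior weights in a GLM are only defined up to a constant factor,
# so rescaling them does not change the coefficient estimates
f1 <- glm(y ~ x1 + x2, family = quasibinomial, weights = norm.wt,     data = data1[-1, ])
f2 <- glm(y ~ x1 + x2, family = quasibinomial, weights = norm.wt / 2, data = data1[-1, ])
all.equal(coef(f1), coef(f2))   # TRUE, up to numerical tolerance
all.equal(predict(f1, data1[1, ], type = "response"),
          predict(f2, data1[1, ], type = "response"))   # TRUE as well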
In Friedman, Hastie, and Tibshirani (2010), the deviance of a binomial model, for the purpose of cross-validation, is calculated as minus twice the log-likelihood on the left-out data (p. 17). Given that this is the paper cited in the documentation for glmnet (on pp. 2 and 5), that is probably the formula used in the package.
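As a concrete (hypothetical) illustration, applying that formula to the leave-one-out probabilities loo computed in the first answer, and again assuming data1$y is coded 0/1:
# cross-validated deviance: minus twice the log-likelihood on the left-out data
cv.dev <- -2 * sum(data1$y * log(loo) + (1 - data1$y) * log(1 - loo))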
And indeed, in the source code for the function cv.lognet, the deviance residuals for the response are calculated as
-2*((y==2)*log(predmat)+(y==1)*log(1-predmat))
where predmat is simply
predict(glmnet.object,x,lambda=lambda)
and is passed in from the enclosing cv.glmnet function. I used the source code available on the JStatSoft page for the paper, and I don't know how up-to-date that code is. The code for this package is surprisingly simple and readable; you can always check for yourself by typing glmnet:::cv.glmnet.
I was looking for the answer to the same question, and I took a look at the code (v1.8-2) at http://cran.r-project.org/web/packages/glmnet/index.html.
The file "cv.elnet.R" (in your case cv.lognet.R) contains the relevant pieces of code.
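The relevant lines look roughly like this (a paraphrase of the v1.8-2 sources, so variable names and exact layout may differ between releases):
# per-observation cross-validated loss (squared error in the elastic-net case)
cvraw <- (y - predmat)^2
# fold results are combined with a *weighted* mean, using the observation weights
cvm  <- apply(cvraw, 2, weighted.mean, w = weights, na.rm = TRUE)
cvsd <- sqrt(apply(scale(cvraw, cvm, FALSE)^2, 2,
                   weighted.mean, w = weights, na.rm = TRUE) / (N - 1))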
So the package does use the weights in the computation of the figures of merit, even on the held-out data. (You will find similar lines in cv.lognet.R as well.)