Solved – Cross-validation with sampling weights

cross-validation, logistic, r, sas

I am trying to cross-validate a logistic regression model with probability sampling weights (weights representing the number of subjects in the population). I am not sure how to handle the weights in each of the 'folds' (cross-validation steps). I don't think it is as simple as just leaving out the observations; I believe the weights need to be rescaled at each step.

SAS's proc surveylogistic has an option to output cross-validated (leave-one-out) predicted probabilities. Unfortunately, I cannot find any details in the documentation on how these are calculated. I would like to reproduce those probabilities in R, but so far I have not had success, and I am not sure whether my approach is correct.

I hope someone can recommend an appropriate method for doing the cross-validation with the sampling weights. If the results matched SAS's, that would be great too.

R code for leave-one-out cross-validated probabilities (currently produces an error):

library(bootstrap)
library(survey)

# Fit a design-weighted logistic regression on the training fold
fitLogistic = function(x, y){
  tmp = as.data.frame(cbind(y, x))
  dsn = svydesign(ids=~0, weights=wt, data=tmp)
  svyglm(y~x1+x2,
         data=tmp, family=quasibinomial, design=dsn)
}

# Predict the held-out fold and classify at a 0.5 cutoff
predict.logistic = function(fitLog, x){
  pred.logistic = predict(fitLog, newdata=x, type='response')
  print(pred.logistic)
  ifelse(pred.logistic >= .5, 1, 0)
}

CV_Res = crossval(x=data1[,-1], y=data1[,1], fitLogistic, predict.logistic, ngroup=13)

Sample Data Set:

y   x1  x2  wt
0   0   1   2479.223
1   0   1   374.7355
1   0   2   1953.4025
1   1   2   1914.0136
0   0   2   2162.8524
1   0   2   491.0571
0   0   1   1842.1192
0   0   1   400.8098
0   1   1   995.5307
0   0   1   955.6634
1   0   2   2260.7749
0   1   1   1707.6085
0   0   2   1969.9993
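
One way to read the sample data into the data1 data frame that the code assumes:

# Read the sample data into the data1 data frame used by the R code
data1 <- read.table(header=TRUE, text="
y x1 x2 wt
0 0 1 2479.223
1 0 1 374.7355
1 0 2 1953.4025
1 1 2 1914.0136
0 0 2 2162.8524
1 0 2 491.0571
0 0 1 1842.1192
0 0 1 400.8098
0 1 1 995.5307
0 0 1 955.6634
1 0 2 2260.7749
0 1 1 1707.6085
0 0 2 1969.9993
")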

SAS proc surveylogistic leave-one-out cross-validated probabilities for the sample data set:

.0072, 1, .884, .954, …

SAS Code:

proc surveylogistic;          /* no data= option: uses the most recently created data set */
  model y = x1 x2;
  weight wt;                  /* sampling weights */
  output out=a2 predprobs=x;  /* predprobs=x requests the cross-validated
                                 (leave-one-out) predicted probabilities */
run;

Best Answer

You can save yourself some coding effort, surprisingly enough, by simply doing the leave-one-out (LOO) cross-validation yourself:

# Normalize the weights so they sum to one (see below for why this matters)
data1$norm.wt <- data1$wt / sum(data1$wt)
lwo <- rep(0, nrow(data1))

# Leave-one-out loop: refit without observation j, then predict observation j
for (j in 1:nrow(data1))
{
  fj <- glm(y~x1+x2, family=quasibinomial, weights=norm.wt, data=data1[-j,])
  lwo[j] <- predict(fj, data1[j,], type="response")
}

> print(lwo, digits=4)
 [1] 2.564e-02 2.220e-16 4.405e-01 2.128e-07 7.360e-01 5.360e-01 2.383e-02
 [8] 2.064e-02 1.316e-01 2.174e-02 4.152e-01 1.895e-01 7.162e-01

Normalizing the weights to sum to one prevents a numerical problem (in this case) that results in your parameter estimates blowing up:

> summary(glm(y~x1+x2, family=quasibinomial, weights=wt, data=data1))

Call:
glm(formula = y ~ x1 + x2, family = quasibinomial, data = data1, 
    weights = wt)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-394.9     0.0     0.0     0.0   164.4  

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept) -6.790e+15  2.354e+15  -2.885   0.0162 *
x1           2.004e+15  1.630e+15   1.230   0.2470  
x2           3.403e+15  1.393e+15   2.443   0.0347 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for quasibinomial family taken to be 2.030037e+18)

    Null deviance:  25460  on 12  degrees of freedom
Residual deviance: 324940  on 10  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 12

versus normalized weights:

> summary(glm(y~x1+x2, family=quasibinomial, weights=data1$wt/sum(data1$wt), data=data1))

Call:
glm(formula = y ~ x1 + x2, family = quasibinomial, data = data1, 
    weights = data1$wt/sum(data1$wt))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.4273  -0.1004  -0.0444   0.1706   0.3879  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -8.038      6.978  -1.152    0.276
x1             1.631      3.177   0.513    0.619
x2             4.142      3.536   1.171    0.269

(Dispersion parameter for quasibinomial family taken to be 0.1440099)

    Null deviance: 1.30513  on 12  degrees of freedom
Residual deviance: 0.84517  on 10  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 6

You don't have to renormalize the weights at every step of the LOO loop; in effect they are renormalized anyway, since only the relative weights matter to the fit.
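
If you want to convince yourself of that, here is a quick check (reusing data1 and lwo from above) that renormalizes the weights within each fold instead; since rescaling all the weights by a constant leaves the quasibinomial estimating equations unchanged, the two sets of predictions should agree up to numerical tolerance:

# Renormalize within each fold and compare to the predictions above
lwo2 <- rep(0, nrow(data1))
for (j in 1:nrow(data1))
{
  train <- data1[-j,]
  train$norm.wt <- train$wt / sum(train$wt)  # fold-specific renormalization
  fj <- glm(y~x1+x2, family=quasibinomial, weights=norm.wt, data=train)
  lwo2[j] <- predict(fj, data1[j,], type="response")
}
all.equal(lwo, lwo2)  # should be TRUE: the weights only matter up to scale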

This doesn't match the SAS probabilities, admittedly, but it seems to me it's what you're trying to do.
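
If you would rather keep the design-based fit from the question, the same loop works with survey::svyglm; a sketch, noting that svydesign expects its weights as a formula (weights=~wt):

library(survey)

# Same leave-one-out loop, but building a survey design object per fold
lwo.svy <- rep(0, nrow(data1))
for (j in 1:nrow(data1))
{
  dsn <- svydesign(ids=~0, weights=~wt, data=data1[-j,])  # weights as a formula
  fj <- svyglm(y~x1+x2, family=quasibinomial, design=dsn)
  lwo.svy[j] <- as.numeric(predict(fj, newdata=data1[j,], type="response"))
}

The point predictions should coincide with the glm loop above, since svyglm solves the same weighted estimating equations (and it scales the weights internally, so it should also sidestep the blow-up shown earlier); the design mainly changes the standard errors.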
