I am trying to cross-validate a logistic regression model with probability sampling weights (each weight represents a number of subjects in the population). I am not sure how to handle the weights in each of the 'folds' (cross-validation steps). I don't think it is as simple as leaving out the observations; I believe the weights need to be rescaled at each step.
SAS has an option in proc surveylogistic to get cross-validated (leave-one-out) predicted probabilities. Unfortunately, I cannot find any details in the documentation on how these are calculated. I would like to reproduce those probabilities in R. So far I have not had success, and I am not sure whether my approach is correct.
I hope someone can recommend an appropriate method for doing the cross-validation with the sampling weights. If the results matched the SAS output, that would be great too.
R code for leave-one-out cross validated probabilities (produces error):
library(bootstrap)
library(survey)

fitLogistic = function(x, y) {
  tmp = as.data.frame(cbind(y, x))
  dsn = svydesign(ids = ~0, weights = wt, data = tmp)
  svyglm(y ~ x1 + x2,
         data = tmp, family = quasibinomial, design = dsn)
}

predict.logistic = function(fitLog, x) {
  pred.logistic = predict(fitLog, newdata = x, type = 'response')
  print(pred.logistic)
  ifelse(pred.logistic >= .5, 1, 0)
}

CV_Res = crossval(x = data1[, -1], y = data1[, 1], fitLogistic, predict.logistic, ngroup = 13)
Sample Data Set:
y x1 x2 wt
0 0 1 2479.223
1 0 1 374.7355
1 0 2 1953.4025
1 1 2 1914.0136
0 0 2 2162.8524
1 0 2 491.0571
0 0 1 1842.1192
0 0 1 400.8098
0 1 1 995.5307
0 0 1 955.6634
1 0 2 2260.7749
0 1 1 1707.6085
0 0 2 1969.9993
SAS proc surveylogistic leave-one-out cross validated probabilities for sample data set:
.0072, 1 .884, .954, …
SAS Code:
proc surveylogistic;
model y=x1 x2;
weight wt;
output out=a2 predprobs=x;
run;
Best Answer
You can save yourself some coding effort, surprisingly enough, by simply doing the leave-one-out (LOO) cross-validation yourself.
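A minimal sketch of such a hand-written loop, under stated assumptions: the sample data from the question is held in a data frame called data1, and base R's glm() with a quasibinomial family and a weights argument stands in for svyglm(). The names w, loo_prob and loo_class are made up for this illustration. Some folds may trigger glm() warnings about fitted probabilities of 0 or 1, since the sample is very small.

```r
# Sample data from the question
data1 <- data.frame(
  y  = c(0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0),
  x1 = c(0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0),
  x2 = c(1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 1, 2),
  wt = c(2479.223, 374.7355, 1953.4025, 1914.0136, 2162.8524, 491.0571,
         1842.1192, 400.8098, 995.5307, 955.6634, 2260.7749, 1707.6085,
         1969.9993)
)

# Normalize the sampling weights once so that they sum to one
data1$w <- data1$wt / sum(data1$wt)

n <- nrow(data1)
loo_prob <- numeric(n)

for (i in seq_len(n)) {
  # Fit on all observations except the i-th, using the normalized weights
  fit <- glm(y ~ x1 + x2, family = quasibinomial,
             data = data1[-i, ], weights = w)
  # Predict the probability for the held-out observation
  loo_prob[i] <- predict(fit, newdata = data1[i, ], type = "response")
}

# Classify at the 0.5 cutoff used in the question
loo_class <- ifelse(loo_prob >= 0.5, 1, 0)
print(round(loo_prob, 4))
```

Only the point predictions are used here, so the design-based standard errors that svyglm() would add are not needed for this step.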
Normalizing the weights to sum to one prevents a numerical problem (in this case) that makes the parameter estimates blow up when the raw weights are used instead of the normalized weights.
You don't have to renormalize the weights at every step of the LOO loop; in effect they are renormalized anyway, because the weights are relative.
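The point about relative weights can be checked on a single fold. The sketch below assumes the same data1 layout as in the question and again uses base R's glm() rather than svyglm(). Dropping observation 1 is an arbitrary choice, and w, w_fold, fit_global and fit_fold are illustrative names. It fits the same fold twice, once with weights normalized over the full sample and once with weights renormalized inside the fold, and prints the coefficients side by side.

```r
# Sample data from the question
data1 <- data.frame(
  y  = c(0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0),
  x1 = c(0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0),
  x2 = c(1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 1, 2),
  wt = c(2479.223, 374.7355, 1953.4025, 1914.0136, 2162.8524, 491.0571,
         1842.1192, 400.8098, 995.5307, 955.6634, 2260.7749, 1707.6085,
         1969.9993)
)

# Weights normalized once over the full sample
data1$w <- data1$wt / sum(data1$wt)

# One LOO fold: drop observation 1 (arbitrary choice for illustration)
fold <- data1[-1, ]

# Weights renormalized within the fold so that they sum to one again
fold$w_fold <- fold$w / sum(fold$w)

fit_global <- glm(y ~ x1 + x2, family = quasibinomial,
                  data = fold, weights = w)
fit_fold   <- glm(y ~ x1 + x2, family = quasibinomial,
                  data = fold, weights = w_fold)

# The two weight vectors differ only by a constant factor,
# so the coefficient estimates should agree up to numerical tolerance
print(cbind(global = coef(fit_global), fold = coef(fit_fold)))
```

Multiplying every weight by the same constant leaves the weighted estimating equations unchanged, which is why the per-fold rescaling has no effect on the fitted coefficients.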
This doesn't match the SAS probabilities, admittedly, but it seems to me it's what you're trying to do.