Solved – K-fold or hold-out cross validation for ridge regression using R

cross-validationpredictionrridge regression

I am working on cross-validation of prediction of my data with 200 subjects and 1000 variables. I am interested ridge regression as number of variables (I want to use) is greater than number of sample. So I want to use shrinkage estimators. The following is made up example data:

 #random population of 200 subjects with 1000 variables 
    M <- matrix(rep(0,200*100),200,1000)
    for (i in 1:200) {
    set.seed(i)
      M[i,] <- ifelse(runif(1000)<0.5,-1,1)
    }
    rownames(M) <- 1:200

    #random yvars 
    set.seed(1234)
    u <- rnorm(1000)
    g <- as.vector(crossprod(t(M),u))
    h2 <- 0.5 
    set.seed(234)
    y <- g + rnorm(200,mean=0,sd=sqrt((1-h2)/h2*var(g)))

    myd <- data.frame(y=y, M)
myd[1:10,1:10]

y X1 X2 X3 X4 X5 X6 X7 X8 X9
1   -7.443403 -1 -1  1  1 -1  1  1  1  1
2  -63.731438 -1  1  1 -1  1  1 -1  1 -1
3  -48.705165 -1  1 -1 -1  1  1 -1 -1  1
4   15.883502  1 -1 -1 -1  1 -1  1  1  1
5   19.087484 -1  1  1 -1 -1  1  1  1  1
6   44.066119  1  1 -1 -1  1  1  1  1  1
7  -26.871182  1 -1 -1 -1 -1  1 -1  1 -1
8  -63.120595 -1 -1  1  1 -1  1 -1  1  1
9   48.330940 -1 -1 -1 -1 -1 -1 -1 -1  1
10 -18.433047  1 -1 -1  1 -1 -1 -1 -1  1

I would like to do following for cross validation –

(1) split data into two halts – use first half as training and second half as test

(2) K-fold cross validation (say 10 fold or suggestion on any other appropriate fold for my case are welcome)

I can simply sample the data into two (gaining and test) and use them:

# using holdout (50% of the data) cross validation 
training.id <- sample(1:nrow(myd), round(nrow(myd)/2,0), replace = FALSE)
test.id <- setdiff(1:nrow(myd), training.id)

 myd_train <- myd[training.id,]
 myd_test  <- myd[test.id,]   

I am using lm.ridge from MASS R package.

library(MASS)
out.ridge=lm.ridge(y~., data=myd_train, lambda=seq(0, 100,0.001))
plot(out.ridge)
select(out.ridge)

lam=0.001
abline(v=lam)

out.ridge1 =lm.ridge(y~., data=myd_train, lambda=lam)
hist(out.ridge1$coef)
    out.ridge1$ym
hist(out.ridge1$xm)

I have two questions –

(1) How can I predict the test set and calculate accuracy (as correlation of predicted vs actual)?

(2) How can I perform K-fold validation? say 10-fold?

Best Answer

You can use caret package (vignnettes, paper ) for this type of things, which can wrap a number of machine learning models or you can use your own customized models . As you are interested in ridge regression here is just custom codes for ridge regression, you might want to adopt to your situation more precisely.

For simple split in data:

set.seed(107)
# stratified random split of the data
inTrain <- createDataPartition(y = myd$y, p = .5,list = FALSE)
training <- myd[ inTrain,]
testing <- myd[-inTrain,]

For K-fold validation and other type of CV including default boot

ridgeFit1 <- train(y ~ ., data = training,method = 'ridge', 
preProc = c("center", "scale"), metric = "ROC")
plot(ridgeFit1)

Here is discussion on how to use train function. Note the ridge method depends upon the package elasticnet functions ( and its dependency on lars, should or need to be installed). If not installed in the system will ask if you want to do so.

the type of resampling used, The simple bootstrap is used by default.To modify the resampling method, a trainControl function is used

The option method controls the type of resampling and defaults to "boot". Another method, "repeatedcv", is used to specify repeated K–fold cross–validation (and the argument repeats controls the number of repetitions). K is controlled by the number argument and defaults to 10.

 ctrl <- trainControl(method = "repeatedcv", repeats = 5)

 ridgeFit <- train(y ~ ., data = training,method = 'ridge',
preProc = c("center", "scale"),trControl = ctrl, metric = "ROC")

plot(ridgefit)

For predictions:

plsClasses <- predict(ridgeFit, newdata = testing)