I am working on cross-validation of prediction of my data with 200 subjects and 1000 variables. I am interested ridge regression as number of variables (I want to use) is greater than number of sample. So I want to use shrinkage estimators. The following is made up example data:
#random population of 200 subjects with 1000 variables
M <- matrix(rep(0,200*100),200,1000)
for (i in 1:200) {
set.seed(i)
M[i,] <- ifelse(runif(1000)<0.5,-1,1)
}
rownames(M) <- 1:200
#random yvars
set.seed(1234)
u <- rnorm(1000)
g <- as.vector(crossprod(t(M),u))
h2 <- 0.5
set.seed(234)
y <- g + rnorm(200,mean=0,sd=sqrt((1-h2)/h2*var(g)))
myd <- data.frame(y=y, M)
myd[1:10,1:10]
y X1 X2 X3 X4 X5 X6 X7 X8 X9
1 -7.443403 -1 -1 1 1 -1 1 1 1 1
2 -63.731438 -1 1 1 -1 1 1 -1 1 -1
3 -48.705165 -1 1 -1 -1 1 1 -1 -1 1
4 15.883502 1 -1 -1 -1 1 -1 1 1 1
5 19.087484 -1 1 1 -1 -1 1 1 1 1
6 44.066119 1 1 -1 -1 1 1 1 1 1
7 -26.871182 1 -1 -1 -1 -1 1 -1 1 -1
8 -63.120595 -1 -1 1 1 -1 1 -1 1 1
9 48.330940 -1 -1 -1 -1 -1 -1 -1 -1 1
10 -18.433047 1 -1 -1 1 -1 -1 -1 -1 1
I would like to do following for cross validation –
(1) split data into two halts – use first half as training and second half as test
(2) K-fold cross validation (say 10 fold or suggestion on any other appropriate fold for my case are welcome)
I can simply sample the data into two (gaining and test) and use them:
# using holdout (50% of the data) cross validation
training.id <- sample(1:nrow(myd), round(nrow(myd)/2,0), replace = FALSE)
test.id <- setdiff(1:nrow(myd), training.id)
myd_train <- myd[training.id,]
myd_test <- myd[test.id,]
I am using lm.ridge
from MASS
R package.
library(MASS)
out.ridge=lm.ridge(y~., data=myd_train, lambda=seq(0, 100,0.001))
plot(out.ridge)
select(out.ridge)
lam=0.001
abline(v=lam)
out.ridge1 =lm.ridge(y~., data=myd_train, lambda=lam)
hist(out.ridge1$coef)
out.ridge1$ym
hist(out.ridge1$xm)
I have two questions –
(1) How can I predict the test set and calculate accuracy (as correlation of predicted vs actual)?
(2) How can I perform K-fold validation? say 10-fold?
Best Answer
You can use
caret
package (vignnettes, paper ) for this type of things, which can wrap a number of machine learning models or you can use your own customized models . As you are interested in ridge regression here is just custom codes for ridge regression, you might want to adopt to your situation more precisely.For simple split in data:
For K-fold validation and other type of CV including default boot
Here is discussion on how to use
train
function. Note the ridge method depends upon the packageelasticnet
functions ( and its dependency onlars
, should or need to be installed). If not installed in the system will ask if you want to do so.the type of resampling used, The simple bootstrap is used by default.To modify the resampling method, a trainControl function is used
The option method controls the type of resampling and defaults to "boot". Another method, "repeatedcv", is used to specify repeated K–fold cross–validation (and the argument repeats controls the number of repetitions). K is controlled by the number argument and defaults to 10.
For predictions: