Solved – Leave-one-out cross validation output interpretation and ROC curve

cross-validation, logistic, r, roc

I have taken plenty of time to try and help myself, but I keep reaching dead ends.

I have a dataset consisting of body measurements collected from a bird species, together with the sex of each bird (known by molecular means). I built a logistic regression model (selected using AIC) to assess which measurements best explain the sex of the birds. My ultimate goal is an equation that others could use under field conditions to reliably predict the sex of the birds from as few body measurements as possible.

My final model includes four independent variables, namely "Culmen", "Head-bill", "Tarsus length", and "Wing length" (all continuous). I wish the model were a little more parsimonious, but all the variables appear to be important according to AIC. Because the model is intended as a prediction tool, I decided to validate it using a leave-one-out cross-validation approach. As part of my learning process, I first tried to complete the analyses (cross-validation and plotting) with only one explanatory variable, namely "Culmen".

The output of the cross-validation (package "boot" in R) yields two values (deltas), which are the cross-validated prediction errors: the first number is the raw leave-one-out cross-validation estimate, and the second is a bias-corrected version of it.

library(boot)    # for cv.glm

model.full <- glm(Sex ~ Culmen, data = my.data, family = binomial)
summary(model.full)

cv.glm(my.data, model.full, K = 114)    # K = n = 114, i.e. leave-one-out

$call
cv.glm(data = my.data, glmfit = model.full, K = 114)

$K
[1] 114

$delta
[1] 0.05941851 0.05937288

Q1. Could anyone explain what these two values represent and how to interpret them?
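From the boot documentation, I understand that cv.glm's default cost function is the mean squared difference between the observed response and the predicted probability, so with K = n the raw delta should be reproducible with a manual leave-one-out loop. Here is a minimal sketch of my understanding (assuming Sex is coded 0/1; if it is a factor it would need converting first):

# Manual leave-one-out loop: refit without bird i, predict its probability,
# and average the squared errors (cv.glm's default cost function).
n <- nrow(my.data)
loo.err <- numeric(n)
for (i in 1:n) {
  fit.i <- glm(Sex ~ Culmen, data = my.data[-i, ], family = binomial)
  p.i   <- predict(fit.i, newdata = my.data[i, , drop = FALSE], type = "response")
  loo.err[i] <- (my.data$Sex[i] - p.i)^2    # assumes Sex is coded 0/1
}
mean(loo.err)    # should match delta[1], the raw leave-one-out estimate

If I understand correctly, delta[1] (about 0.059 here) is therefore the leave-one-out Brier score of the Culmen-only model, and delta[2] is the same quantity after the bias adjustment.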

Following is the code as presented by Dr. Markus Müller (Calimo) in a similar, albeit not identical, post (https://stackoverflow.com/questions/20346568/feature-selection-cross-validation-but-how-to-make-roc-curves-in-r), which I tried to adapt to my data:

library(pROC)
# my.data is already loaded as a data frame

k <- 114                          # number of folds = number of rows (leave-one-out)
n <- dim(my.data)[1]
indices <- sample(rep(1:k, ceiling(n/k))[1:n])

all.response <- all.predictor <- aucs <- c()
for (i in 1:k) {
  test  <- my.data[indices == i, ]
  learn <- my.data[indices != i, ]
  model <- glm(Sex ~ Culmen, data = learn, family = binomial)
  model.pred <- predict(model, newdata = test)
  aucs <- c(aucs, roc(test$Sex, model.pred)$auc)
  all.response  <- c(all.response, test$outcome)
  all.predictor <- c(all.predictor, model.pred)
}

Error in roc.default(test$Sex, model.pred) : No case observation.

roc(all.response, all.predictor)

Error in roc.default(all.response, all.predictor) : No valid data provided.

mean(aucs)

Q2. What is the reason for the first error message? I guess the second error is related to the first and will be resolved once I find a solution to the first one.

I would very much appreciate any help!

Luciano

Best Answer

Include more data in your test set.

Sometimes algorithms such as neural networks produce classification output in which all the predicted labels are the same.

For example, suppose your actual labels are c(1, 0, 0, 1, 1, 0, 0, 1, 1, 1).
You might end up training your neural network (I mention neural networks specifically because I have run into this problem with them) in such a way that the predicted labels come out as c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1).

In such a case, your auc/roc functions show the above-mentioned error because there are no 0 labels in the data passed to them.
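In the leave-one-out setup above this is exactly the situation: each test fold holds a single bird, so one of the two sex levels has no observations when roc() is called per fold. A toy illustration with hypothetical values (the exact message, "No case observation." or "No control observation.", depends on which level is the one missing):

library(pROC)

# A one-observation test fold: only one of the two sex levels is present,
# so roc() cannot form a curve for that fold.
fold.sex  <- factor("M", levels = c("F", "M"))   # hypothetical single left-out bird
fold.pred <- 0.8                                 # hypothetical predicted value
roc(fold.sex, fold.pred)
# Error: one of the two classes has no observations in this fold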

Hope this helps!
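If the leave-one-out scheme is to be kept, a workaround in the spirit of the linked post is to drop the per-fold roc() call, collect the prediction for each left-out bird, and build a single ROC curve from the pooled values afterwards (note that all.response should collect test$Sex rather than test$outcome, which does not exist in my.data). A sketch, assuming my.data contains the columns Sex and Culmen:

library(pROC)

n <- nrow(my.data)
all.response <- all.predictor <- c()

for (i in 1:n) {
  test  <- my.data[i, , drop = FALSE]   # the single left-out bird
  learn <- my.data[-i, ]                # the remaining n - 1 birds
  model <- glm(Sex ~ Culmen, data = learn, family = binomial)
  # Predicted probability for the left-out bird; no per-fold roc() call,
  # because a one-observation fold contains only one sex.
  all.predictor <- c(all.predictor, predict(model, newdata = test, type = "response"))
  all.response  <- c(all.response, as.character(test$Sex))
}

# One ROC curve built from the pooled leave-one-out predictions
roc.loo <- roc(all.response, all.predictor)
roc.loo$auc
plot(roc.loo)

The pooled roc.loo$auc then gives a single cross-validated AUC for the Culmen-only model.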