Solved – Naive Bayes error with caret

caret, naive bayes

I want to predict a variable with Naive Bayes. I tried it with another variable from the same dataset and it worked perfectly, but it fails for the one I actually want to predict. The variable to predict contains values like "OL", "DL", etc.

library(caret)

train_control <- trainControl(method = "cv")
naiveModel <- train(as.factor(position) ~ groesse + koerpergewicht + bankdruecken + kniebeuge + maximalkraft_bb + maximalkraft_l + maximalkraft_r + schnellkraft_l + schnellkraft_r + sprint_10m + sprint_20m, data = data, trControl = train_control, method = "nb")

 Something is wrong; all the Accuracy metric values are missing:
    Accuracy       Kappa    
 Min.   : NA   Min.   : NA  
 1st Qu.: NA   1st Qu.: NA  
 Median : NA   Median : NA  
 Mean   :NaN   Mean   :NaN  
 3rd Qu.: NA   3rd Qu.: NA  
 Max.   : NA   Max.   : NA  
 NA's   :2     NA's   :2  

I uploaded my data dput: http://wikisend.com/download/745072/data_dput.txt

Best Answer

The problem is that your data is highly imbalanced. If you look at the distribution of position, you will notice that FS and TE each appear only once in your dataset. Because position is a factor, cross-validation still expects both levels in every fold, even when the training part of a fold contains no record for them. A standard deviation cannot be computed from fewer than two values (sd() returns NA in that case), which is most likely why the variance check behind this message fails. Hence you will see in your warnings:

stop("Zero variances for at least one class in variables:", : missing value where TRUE/FALSE needed
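To check this on your own data, you can count the rows per class and look at the per-class standard deviation of a predictor. The sketch below uses a small made-up data frame (toy, with invented values) in place of the real data, so only the pattern carries over, not the numbers.

```r
# Toy stand-in for the real data; `toy` and its values are invented for illustration.
toy <- data.frame(
  position = factor(c("OL", "OL", "OL", "DL", "DL", "FS", "TE")),
  groesse  = c(190, 193, 188, 185, 187, 180, 195)
)

# Count the rows per class; FS and TE each appear only once.
class_counts <- table(toy$position)
print(class_counts)

# Per-class standard deviation of one numeric predictor.
# sd() of a single value is NA, so the singleton classes yield NA.
class_sd <- tapply(toy$groesse, toy$position, sd)
print(class_sd)

# Names of the classes with fewer than two rows.
rare_classes <- names(class_counts)[class_counts < 2]
print(rare_classes)
```

On the real data, table(data$position) gives the same kind of overview.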

If you remove the records with these two values from your data, the model will be created (with some warning messages).

# Drop the rows for the two classes that appear only once
data <- subset(data, !position %in% c("FS", "TE"))
# Remove the now-unused factor levels so they are no longer expected
data$position <- droplevels(data$position)

train_control <- trainControl(method = "cv")
naiveModel <- train(position ~ groesse + koerpergewicht + bankdruecken + kniebeuge + maximalkraft_bb + maximalkraft_l + maximalkraft_r + schnellkraft_l + schnellkraft_r + sprint_10m + sprint_20m, data = data, trControl = train_control, method = "nb")
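If you would rather not hard-code the class names, the same filtering can be driven by the class counts. This is a minimal base R sketch on made-up data; toy and the threshold min_n are assumptions for illustration, and a higher threshold may be needed so that every class still has at least two rows in each training fold.

```r
# Toy stand-in for the real data; `toy` and its values are invented for illustration.
toy <- data.frame(
  position = factor(c("OL", "OL", "OL", "DL", "DL", "FS", "TE")),
  groesse  = c(190, 193, 188, 185, 187, 180, 195)
)

# Hypothetical minimum number of rows a class needs to be kept.
min_n <- 2

# Find the classes that meet the threshold.
counts <- table(toy$position)
keep <- names(counts)[counts >= min_n]

# Keep only rows of those classes, then drop the unused factor levels.
filtered <- toy[toy$position %in% keep, , drop = FALSE]
filtered$position <- droplevels(filtered$position)

print(table(filtered$position))
```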

To address the imbalance, you could use trainControl with sampling = "up", like this:

train_control <- trainControl(method="cv", sampling = "up")
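To see what up-sampling does in principle, here is a rough base R sketch that resamples the smaller classes with replacement until every class has as many rows as the largest one. It is not caret's implementation, and toy and its values are made up. Every added row is a copy of an existing one, so no new information about a rare class is gained.

```r
set.seed(1)  # make the random resampling reproducible

# Toy stand-in for the real data; `toy` and its values are invented for illustration.
toy <- data.frame(
  position = factor(c("OL", "OL", "OL", "DL", "DL")),
  groesse  = c(190, 193, 188, 185, 187)
)

# Size of the largest class; every class is grown to this size.
target_n <- max(table(toy$position))

# Row indices grouped by class.
rows_by_class <- split(seq_len(nrow(toy)), toy$position)

# For each class, keep all rows and add randomly chosen duplicates.
upsampled_rows <- unlist(lapply(rows_by_class, function(idx) {
  extra <- idx[sample.int(length(idx), target_n - length(idx), replace = TRUE)]
  c(idx, extra)
}))

upsampled <- toy[upsampled_rows, ]
print(table(upsampled$position))
```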

But it is better to see whether you can increase the number of records with the position values FS and TE.