Machine Learning – Understanding Correlated Cases and Cross Validation

autocorrelation, classification, cross-validation, machine learning

I'm posting to ask whether there is a cross-validation method for correlated data that is already well implemented in R. A quick search turns up techniques such as h-block cross-validation, hv-block cross-validation and leave-one-block-out (LOBO) cross-validation, but none of them is implemented in R as far as I know.

Mine is a simple classification problem (295 negatives and 247 positives), and I need a CV technique for a dataset that contains several well-defined blocks of variable size, each of which may contain both positive and negative cases. Within each block, some (but not all) of the 60 predictors of interest may be highly correlated between cases.
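To make the blocked setup concrete, here is a minimal sketch of block-wise k-fold splitting, written in Python only to show the logic (it is not an existing R implementation). The names `block_folds` and `train_test_splits` are hypothetical, and `block_ids` is assumed to be a list with one block label per case. Whole blocks are assigned to folds, so no block is ever split between training and test data.

```python
import random
from collections import defaultdict


def block_folds(block_ids, k, seed=0):
    """Assign whole blocks to k folds, roughly balancing fold sizes."""
    members = defaultdict(list)
    for idx, block in enumerate(block_ids):
        members[block].append(idx)
    if k > len(members):
        raise ValueError("k cannot exceed the number of blocks")

    rng = random.Random(seed)
    blocks = list(members)
    rng.shuffle(blocks)  # random order among blocks of equal size
    blocks.sort(key=lambda b: len(members[b]), reverse=True)  # stable sort

    folds = [[] for _ in range(k)]
    for block in blocks:
        # Greedy balancing: put the next-largest block in the smallest fold.
        min(folds, key=len).extend(members[block])
    return folds


def train_test_splits(block_ids, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each block is entirely on one side."""
    folds = block_folds(block_ids, k, seed)
    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(block_ids)) if i not in held_out]
        yield train, sorted(test)


if __name__ == "__main__":
    # Hypothetical block labels, one per case.
    example_blocks = ["geneA", "geneA", "geneB", "geneC", "geneC", "geneD"]
    for train, test in train_test_splits(example_blocks, k=3):
        print("train:", train, "test:", test)
```

Setting `k` equal to the number of distinct blocks puts exactly one block in each fold, which corresponds to leave-one-block-out. The sketch does not stratify by class, so with unbalanced blocks a fold could end up with few positives or negatives.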

One attempt to bypass the problem was to randomly pick one case from each block and then train the model. Unfortunately, this further shrinks my dataset to just 151 negatives and 113 positives, which makes the outcome of the model highly variable. Also, I'm getting some weird LOOCV and k-fold CV test errors that are below the training errors, using the AdaBoost algorithm, no matter what I do.
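For reference, the subsampling step described above can be sketched as follows. This is a Python illustration only, assuming a single random draw per block regardless of class; `one_case_per_block` and `block_ids` are hypothetical names.

```python
import random
from collections import defaultdict


def one_case_per_block(block_ids, seed=0):
    """Return the indices of one randomly chosen case from each block."""
    rng = random.Random(seed)
    members = defaultdict(list)
    for idx, block in enumerate(block_ids):
        members[block].append(idx)
    # One uniform draw per block; every other case in the block is discarded.
    return sorted(rng.choice(cases) for cases in members.values())


if __name__ == "__main__":
    # Hypothetical block labels, one per case.
    example_blocks = ["geneA", "geneA", "geneB", "geneC", "geneC", "geneC"]
    print(one_case_per_block(example_blocks, seed=1))
```

Changing `seed` gives a different subsample, which is one way to see how strongly the fitted model depends on the particular draw.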

I'm open to suggestions of all kinds.

EDIT: The problem refers to SNV mutations that are close to each other inside the same gene (block). The proximity between cases can be measured precisely in one dimension (distance in nucleotides). Predictors based on the surrounding context, like sequence conservation, will tend to be very similar for mutations that lie next to each other.
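Because the distance between cases is measurable, an h-block-style leave-one-out can be sketched as below. This is a Python illustration and an adaptation of the h-block idea to this setting, not a published implementation. It assumes that only cases in the same gene and within `h` nucleotides of the test case need to be dropped from training; `h_block_loo`, `genes`, `positions` and the value of `h` are all hypothetical.

```python
def h_block_loo(genes, positions, h):
    """Leave-one-out splits that drop same-gene neighbours within h nucleotides.

    genes[i] is the gene (block) of case i, positions[i] its nucleotide
    coordinate. Yields (train_idx, test_idx) for every case.
    """
    n = len(genes)
    for i in range(n):
        train = [
            j
            for j in range(n)
            if j != i
            and (genes[j] != genes[i] or abs(positions[j] - positions[i]) > h)
        ]
        yield train, i


if __name__ == "__main__":
    # Hypothetical data: three cases in one gene, two in another.
    genes = ["g1", "g1", "g1", "g2", "g2"]
    positions = [100, 105, 900, 100, 102]
    for train, test in h_block_loo(genes, positions, h=50):
        print("test:", test, "train:", train)
```

Choosing `h` is the hard part: it should be at least as large as the distance over which the context-based predictors stay similar, and a larger `h` removes more training cases per split.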

Best Answer

We have a paper in press that discusses this problem. AFAIK, there is no R package with sophisticated options for block cross-validation, but the paper has some code attached in the appendix that may be useful.

Roberts, D. R. et al. (2017). Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography, in press.

http://onlinelibrary.wiley.com/doi/10.1111/ecog.02881/abstract