Solved – Training AUC and CV AUC in Boosted Regression Tree

Tags: auc, boosting

My question concerns the difference between the training AUC and the cross validation (CV) AUC in boosted regression trees (BRTs) built using the gbm.step function in the dismo package. I am observing training AUC scores ~0.10-0.20 higher than the CV AUC scores.

Background and methods:

I am using the dismo and gbm packages to build boosted regression trees for a wildlife species distribution study. We put GPS collars on 10 individual animals and took readings every 4 hours for 1 year. From each animal's year of collar data we randomly selected 50 winter locations, 50 summer locations, and 50 "available" (pseudo-absence) locations from the animal's calculated home range. Habitat variables were sampled by a field crew at each location, and GIS-derived variables were also added to the data set.

The data from each animal were pooled, and population-level BRT models were fitted for each season using the gbm.step function. gbm.step assesses the optimal number of boosting trees using k-fold cross validation: the data are divided into 10 subsets, stratified by prevalence, and a BRT model of increasing size is fitted, with the residual deviance calculated at each step. After each fold is processed, the function calculates the average holdout residual deviance and its standard error, then identifies the number of trees at which the holdout deviance is minimized. Finally, it fits a model with that optimal number of trees and returns additional information from the cross validation process.
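For reference, a minimal sketch of such a gbm.step call. The data frame locs, its column indices, and the tuning values are hypothetical placeholders, not the settings actually used here:

```r
library(dismo)  # loads the gbm package as a dependency

# Hypothetical data frame 'locs': column 2 is the 1/0 used-vs-available
# response, columns 3-14 are the habitat and GIS predictors.
brt.summer <- gbm.step(data = locs,
                       gbm.x = 3:14,
                       gbm.y = 2,
                       family = "bernoulli",
                       tree.complexity = 3,    # illustrative value
                       learning.rate = 0.005,  # illustrative value
                       bag.fraction = 0.5,     # as described below
                       n.folds = 10)           # the default
```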

To determine the optimal model parameters for tree complexity (interaction depth) and learning rate (shrinkage) in gbm.step, we first fitted fixed models of 80,000 trees at 10 levels of tree complexity (1-10) and 7 learning rates (0.005-0.0001), using the gbm.fixed function in the dismo package and the predict.gbm function in the gbm package. Deviance was calculated from models fitted to 25 percent of the locations and predicted to the remaining 75 percent. The combination of tree complexity and learning rate that achieved the lowest predictive deviance at more than 1,000 trees was chosen for fitting the BRT models with gbm.step. The bag fraction was left as the default (0.5).
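A hedged sketch of that screening step, reusing the hypothetical locs data frame and column indices from above; the grid values are an illustrative subset, and calc.deviance is the dismo helper for predictive deviance:

```r
library(dismo)
library(gbm)

# Hypothetical 25/75 fit/predict split of the pooled locations
set.seed(1)
idx   <- sample(nrow(locs), size = round(0.25 * nrow(locs)))
train <- locs[idx, ]
test  <- locs[-idx, ]

grid <- expand.grid(tc = 1:10,
                    lr = c(0.005, 0.001, 0.0005, 0.0001),  # illustrative subset
                    deviance = NA)

for (i in seq_len(nrow(grid))) {
  fit <- gbm.fixed(data = train, gbm.x = 3:14, gbm.y = 2,
                   family = "bernoulli",
                   tree.complexity = grid$tc[i],
                   learning.rate = grid$lr[i],
                   n.trees = 80000)              # slow at 80,000 trees
  pred <- predict.gbm(fit, test, n.trees = 80000, type = "response")
  grid$deviance[i] <- calc.deviance(test[, 2], pred, calc.mean = TRUE)
}

grid[which.min(grid$deviance), ]  # candidate tc/lr combination
```

Note that this sketch scores only the final 80,000-tree fit; to apply the "lowest deviance at more than 1,000 trees" rule, predict at a sequence of n.trees values and inspect the deviance profile.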

The gbm.simplify function was then used to further reduce the models. The gbm.simplify function takes an initial BRT model and assesses the potential of removing predictors using k-fold cross validation. At each fold, the lowest contributing predictor is removed and the change in predictive deviance is computed relative to that obtained when using all predictors. At the conclusion of cross validation, the sequence of variable removal is identified for the full data set.

Lastly, reduced BRT models were built using gbm.step with the same model parameters as the base BRT models.
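A sketch of that simplification and refit, assuming the brt.summer object and tuning values from the earlier sketches:

```r
# Assess the drop sequence by k-fold CV (default n.folds = 10)
simp <- gbm.simplify(brt.summer, n.folds = 10)

# simp$pred.list[[k]] holds the predictor columns remaining after
# dropping k variables; k = 3 here is purely illustrative, so choose
# the k where the change in CV deviance bottoms out.
brt.summer.simp <- gbm.step(data = locs,
                            gbm.x = simp$pred.list[[3]],
                            gbm.y = 2,
                            family = "bernoulli",
                            tree.complexity = 3,
                            learning.rate = 0.005)
```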

Results and questions:

The summer model returned a training AUC of 0.86 and a CV AUC of 0.65. The winter model returned a training AUC of 0.93 and a CV AUC of 0.71.
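For anyone reproducing this, both statistics are returned on the fitted gbm.step object; to my knowledge the relevant slots in dismo are:

```r
brt.summer$self.statistics$discrimination     # training AUC (0.86 here)
brt.summer$cv.statistics$discrimination.mean  # mean CV AUC (0.65 here)
brt.summer$cv.statistics$discrimination.se    # its standard error
```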

In both cases, it looks as if the AUC for the training data set is excellent while the CV AUC score is poor to acceptable. I expect that there would be some difference in the training and CV AUC scores, but should this much of a difference be of concern? If not, how should I interpret and report these results? If it is of concern, what are some possible reasons for the differences and strategies I can take to fix them?

From texts and from reading other posts similar to this question, I have a few ideas about possible reasons for the difference, but I am not sure.

  1. The data are ordered by date, but I permuted the data row-wise before using gbm.fixed and predict.gbm. Also, from what I understand, gbm.step also randomizes the data.
  2. Could I have too few observations or too many variables? Could overfitting be an issue?
  3. All of the individual animals are pooled together; could differences in the preferences of individual animals be contributing?
  4. Could the number of CV folds in gbm.step or gbm.simplify be at play? I used the default number of folds (10) for both. Would reducing the number of folds help?

Best Answer

I expect that there would be some difference in the training and CV AUC scores, but should this much of a difference be of concern? If not, how should I interpret and report these results? If it is of concern, what are some possible reasons for the differences and strategies I can take to fix them?

You are overfitting the training data. The stark drop from the training AUC to the CV AUC shows that, given new data, your model would likely not perform as well as it does on the training data.

  1. The data are ordered by date, but I permuted the data row-wise before using gbm.fixed and predict.gbm. Also, from what I understand, gbm.step also randomizes the data.

Structured dependency in the data is something you should try to capture in the model. If date/time is important, you should find a way to include it; admittedly, this is ignored by most machine learning algorithms.
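A minimal sketch of one option, assuming a hypothetical Date column named date on the locs data frame: derive a day-of-year covariate and add it to the predictor set.

```r
# Day of year (0-365) as a simple seasonal covariate
locs$yday <- as.POSIXlt(locs$date)$yday

# Then include it among the predictors, e.g.
# gbm.x = c(3:14, which(names(locs) == "yday"))
```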

  2. Could I have too few observations or too many variables? Could overfitting be an issue?

Yes, your results are the definition of overfitting.

  3. All of the individual animals are pooled together; could differences in the preferences of individual animals be contributing?

It is possible, and it is another consideration for model development; one way to test it is sketched below.
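You can cross-validate by individual rather than by random fold; gbm.step accepts a user-supplied fold.vector. A sketch, assuming a hypothetical animal_id column on locs:

```r
# One CV fold per collared animal, so each holdout score measures
# prediction to an individual the model has never seen
folds <- as.integer(factor(locs$animal_id))

brt.loao <- gbm.step(data = locs,
                     gbm.x = 3:14, gbm.y = 2,
                     family = "bernoulli",
                     tree.complexity = 3,
                     learning.rate = 0.005,
                     n.folds = length(unique(folds)),  # 10 animals -> 10 folds
                     fold.vector = folds,
                     prev.stratify = FALSE)  # folds are fixed by animal
```

If the animal-blocked CV AUC roughly matches your current CV AUC, individual differences are probably not the main driver; if it drops further, pooling is costing the model generality.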

  4. Could the number of CV folds in gbm.step or gbm.simplify be at play?

Yes, read about the bias-variance trade-off. With fewer folds, each model is trained on a smaller share of the data, so the holdout estimates tend to be more pessimistic; reducing the number of folds is therefore unlikely to close the gap.