I expect that there would be some difference in the training and CV AUC scores, but should this much of a difference be of concern? If not, how should I interpret and report these results? If it is of concern, what are some possible reasons for the differences and strategies I can take to fix them?
You are overfitting the training data. The stark drop from training AUC to CV AUC shows that, given new data, your model would likely not perform as well as it does on the training data.
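To see the gap in one place: a minimal sketch, assuming the dismo package and a hypothetical data frame dat with the response in column 1 and predictors in columns 2:10 (the tuning values are placeholders, not your actual settings):

```r
library(dismo)

# Boosted regression trees via gbm.step; dat, the column indices, and the
# tuning values below are placeholders.
brt <- gbm.step(data = dat, gbm.x = 2:10, gbm.y = 1,
                family = "bernoulli", tree.complexity = 3,
                learning.rate = 0.005, bag.fraction = 0.75)

train_auc <- brt$self.statistics$discrimination    # AUC on the training data
cv_auc    <- brt$cv.statistics$discrimination.mean # mean AUC over the CV folds
train_auc - cv_auc  # a large positive gap is the overfitting signature
```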
- The data are ordered by date, but I permuted the data row-wise before using gbm.fixed and predict.gbm. Also, from what I understand, gbm.step also randomizes the data.
Structured dependency in the data is something that you should try to capture in the model. If date/time is important then you should find a way to include it. Admittedly this is ignored by most machine learning algorithms.
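One way to probe this is to validate across time rather than across shuffled rows. The sketch below assumes hypothetical column names y (a 0/1 response) and date: train on the earlier 80% of the data and score on the most recent 20%.

```r
# Time-ordered hold-out instead of random CV (y and date are assumed names).
dat <- dat[order(dat$date), ]
cut <- floor(0.8 * nrow(dat))
train <- dat[seq_len(cut), ]
test  <- dat[-seq_len(cut), ]

fit <- gbm::gbm(y ~ ., data = train[, setdiff(names(train), "date")],
                distribution = "bernoulli", n.trees = 1000,
                interaction.depth = 3, shrinkage = 0.005)
pred <- predict(fit, newdata = test, n.trees = 1000, type = "response")
pROC::auc(test$y, pred)  # far below the training AUC => the model is not
                         # generalising forward in time
```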
- Could I have too few observations or too many variables? Could overfitting be an issue?
Yes, your results are the definition of overfitting.
- All of the individual animals are pooled together; could differences in the preferences of each individual animal be contributing?
It is possible. Another consideration for model development.
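If you want to put a number on it, one option is leave-one-animal-out validation. The sketch below assumes a hypothetical animal id column alongside the y response used above.

```r
# Hold out one individual per fold, so the CV AUC measures performance on
# unseen animals rather than unseen rows (animal and y are assumed names).
aucs <- sapply(unique(dat$animal), function(a) {
  train <- dat[dat$animal != a, ]
  test  <- dat[dat$animal == a, ]
  fit <- gbm::gbm(y ~ ., data = train[, setdiff(names(train), "animal")],
                  distribution = "bernoulli", n.trees = 1000,
                  interaction.depth = 3, shrinkage = 0.005)
  pred <- predict(fit, newdata = test, n.trees = 1000, type = "response")
  as.numeric(pROC::auc(test$y, pred))
})
mean(aucs)  # much lower than the pooled CV AUC => strong individual effects
```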
- Could the number of CV folds in gbm.step or gbm.simplify be at play?
Yes, read about the bias-variance trade-off.
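A rough sensitivity check is to rerun gbm.step with different values of its n.folds argument (the data frame and column indices below are placeholders): fewer folds leave each fold-model less training data, which biases the CV estimate downward, while more folds make the estimate noisier.

```r
# Compare the CV AUC across fold counts in dismo::gbm.step.
for (k in c(3, 5, 10)) {
  brt <- dismo::gbm.step(data = dat, gbm.x = 2:10, gbm.y = 1,
                         family = "bernoulli", n.folds = k,
                         tree.complexity = 3, learning.rate = 0.005)
  cat(k, "folds: CV AUC =", brt$cv.statistics$discrimination.mean, "\n")
}
```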
It seems like you understand that you're able to have n levels, as opposed to n-1, because unlike in linear regression you don't need to worry about perfect collinearity.
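For a concrete look at the two codings, base R's model.matrix can produce either:

```r
x <- factor(c("a", "b", "c", "a"))

model.matrix(~ x)      # intercept plus n-1 dummies (regression-style coding)
model.matrix(~ x - 1)  # n dummies, one per level; fine for trees, where
                       # perfect collinearity is not a concern
```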
(I'm coming at this from an R perspective, but I assume it's the same in Python.) That depends on a couple of things, such as 1) which package you're using and 2) how many factor levels you have.
1) If you are using R's randomForest package and you have fewer than 33 factor levels, then you can go ahead and leave them in one feature if you want. That's because R's random forest implementation will check which factor levels should be on one side of the split and which on the other (e.g., 5 of your levels might be grouped together on the left side, and 7 might be grouped together on the right). If you split the categorical feature out into n dummies, the algorithm would not have this option at its disposal.
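A toy illustration: pass the factor straight in and randomForest treats it as one feature, free to send any subset of levels left or right at a split.

```r
library(randomForest)

# Toy data: a 12-level factor (well under the 32-level cap) plus a numeric.
set.seed(1)
d <- data.frame(cat = factor(sample(letters[1:12], 500, replace = TRUE)),
                num = rnorm(500))
d$y <- factor(ifelse(d$cat %in% c("a", "b", "c"), "pos", "neg"))

rf <- randomForest(y ~ cat + num, data = d)  # no manual dummies needed
rf
```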
Obviously, if the particular package you're using can't handle categorical features, then you'd just need to create n dummy variables.
2) As I alluded to above, R's random forest implementation can only handle 32 factor levels - if you have more than that then you either need to split your factors into smaller subsets, or create a dummy variable for each level.
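If you'd rather avoid dummies, one workaround (a base-R sketch; the toy factor and the "other" label are placeholders) is to collapse the rarest levels into a catch-all bucket:

```r
# Keep the 31 most frequent levels and pool the rest, so the factor fits
# within randomForest's 32-level limit.
collapse_rare <- function(f, max_levels = 32) {
  n_keep <- min(max_levels - 1, nlevels(f))
  keep <- names(sort(table(f), decreasing = TRUE))[seq_len(n_keep)]
  factor(ifelse(f %in% keep, as.character(f), "other"))
}

f <- factor(sample(paste0("lvl", 1:60), 1000, replace = TRUE))
nlevels(collapse_rare(f))  # at most 32
```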
Best Answer
First, you could pick a learner that does support categorical splits, such as the R gbm package (in contrast to xgboost).
Second, you could simply randomly enumerate the categories and treat the feature as numerical. This procedure works surprisingly well. So if you prefer xgboost, you may just be lazy and simply convert/coerce your data.frame of mixed factor (categorical) and numeric features into a numeric matrix and pass it to xgboost.
Third, one-hot encoding means each category gets a dummy variable that is either zero or one. This method only allows one-vs-all splits. I would try the first two options first.
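A sketch of the lazy coercion route (dat and a 0/1 label column y are placeholder names; data.matrix swaps each factor for its internal integer codes):

```r
library(xgboost)

X <- data.matrix(dat[, setdiff(names(dat), "y")])  # factors -> integer codes
bst <- xgboost(data = X, label = dat$y,
               objective = "binary:logistic", nrounds = 100)
```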
Sometimes a feature has a very large number of categories. It is often not that useful to simply plug such a feature into the model by any of these methods. It may be worth clustering the categories with k-means and/or cautiously binning them (into a few bins, to avoid over-fitting) by their naively expected target value.
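A sketch of that binning idea (placeholder columns dat$cat and dat$y); since it uses the target, the category means should be computed inside the CV folds to avoid leakage:

```r
cat_means <- tapply(dat$y, dat$cat, mean)  # naive expected target per level
km <- kmeans(cat_means, centers = 4)       # a handful of bins limits over-fitting
cluster_of <- setNames(km$cluster, names(cat_means))
dat$cat_binned <- factor(cluster_of[as.character(dat$cat)])
```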