Solved – Why is linear regression giving a better prediction than logistic in XGBoost

boosting, logistic, overfitting, r, regression

I am building a predictive model with 7 features. My target is binary. I have tried using XGBoost in R.

library(xgboost)
library(Metrics)  # provides rmse()

bst <- xgboost(data = as.matrix(trainSet[,predictors]),
               label = trainSet[,outcomeName], max.depth = 10,
               nrounds = 1000, objective = "reg:linear", verbose = 0)
pred <- predict(bst, as.matrix(testSet[,predictors]), outputmargin = TRUE)
rmse(as.numeric(testSet[,outcomeName]), as.numeric(pred))

Since my target is binary, I switched to logistic regression (the binary:logistic objective). But the prediction quality got much worse than with linear regression: misclassification is 23.4% with linear regression but jumps to 47% with logistic. Is it OK to use linear regression here, or am I overfitting?
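For reference, the logistic run is not shown in the question; it was presumably something like the sketch below (hypothetical, reconstructed from the description; bst_log is a name introduced here). Note that with binary:logistic, predict() returns probabilities unless outputmargin=TRUE, in which case it returns log-odds and a 0.5 cutoff would be wrong (the cutoff on the margin scale is 0).

# Hypothetical sketch of the logistic variant described above
bst_log <- xgboost(data = as.matrix(trainSet[,predictors]),
                   label = trainSet[,outcomeName], max.depth = 10,
                   nrounds = 1000, objective = "binary:logistic", verbose = 0)
prob <- predict(bst_log, as.matrix(testSet[,predictors]))  # probabilities in [0, 1]
mean(as.numeric(prob > 0.5) != testSet[,outcomeName])      # misclassification rate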

Best Answer

I think Matthew Drury had the answer: in the last line of your code you are computing RMSE, which is not the 0-1 (misclassification) loss you report on your test data. Evaluating the two objectives with different metrics could be one reason for the discrepancy.
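Concretely, the two objectives should be compared with the same metric. A minimal sketch of computing the 0-1 loss for both models (assuming 0/1 labels, bst fit with reg:linear as in the question, and bst_log fit with binary:logistic as sketched above):

Xtest <- as.matrix(testSet[,predictors])
ytest <- testSet[,outcomeName]

# reg:linear predicts on the raw response scale; threshold at 0.5 for 0/1 labels
err_lin <- mean(as.numeric(predict(bst, Xtest) > 0.5) != ytest)
# binary:logistic predicts probabilities; threshold at 0.5 as well
err_log <- mean(as.numeric(predict(bst_log, Xtest) > 0.5) != ytest)

c(linear = err_lin, logistic = err_log)  # same 0-1 loss for both models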

On the other hand, there is no way to answer this definitively with the information given, because we do not know what trainSet and testSet contain.

If you want to know whether you are overfitting, you can vary the number of boosting iterations (nrounds) and observe how performance on the training and testing sets changes as the number of iterations grows.
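One way to do this (a sketch, assuming a 1.x-era xgboost where xgb.train takes a watchlist) is to record train and test error at every iteration and inspect the learning curves:

dtrain <- xgb.DMatrix(as.matrix(trainSet[,predictors]),
                      label = trainSet[,outcomeName])
dtest  <- xgb.DMatrix(as.matrix(testSet[,predictors]),
                      label = testSet[,outcomeName])

# eval_metric = "error" records the misclassification rate each iteration
fit <- xgb.train(params = list(objective = "binary:logistic",
                               max_depth = 10, eval_metric = "error"),
                 data = dtrain, nrounds = 1000,
                 watchlist = list(train = dtrain, test = dtest),
                 verbose = 0)

# If test error bottoms out early and then climbs while train error keeps
# falling, the model is overfitting; pick nrounds near the test minimum.
elog <- fit$evaluation_log
plot(elog$iter, elog$test_error, type = "l", xlab = "iteration", ylab = "error")
lines(elog$iter, elog$train_error, lty = 2)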

Details can be found at the following link (the question title mentions SVM, but my answer there uses boosted trees as the example):

How to know if a learning curve from SVM model suffers from bias or variance?
