Solved – Should I choose Random Forest regressor or classifier

python, random forest

I fit a dataset with a binary target class using a random forest. In Python, I can do this with either RandomForestClassifier or RandomForestRegressor.

I can get the classification directly from RandomForestClassifier, or I could run RandomForestRegressor first and get back a set of estimated scores (continuous values). Then I can find a cutoff value to derive the predicted classes from those scores. Both methods can achieve the same goal (i.e. predict the classes for the test data).

Also I can observe that

randomforestclassifier.predict_proba(X_test)[:,1]

is different from

randomforestregressor.predict(X_test)

So I just want to confirm whether both methods are valid, and if so, which one is better for a random forest application?

Best Answer

Use the Classifier. No, they are not both valid.

First, I really encourage you to read up on the topic of regression vs. classification. Using ML without understanding it will give you wrong results that you won't notice. And that's quite dangerous... (it's a little bit like asking which way around you should hold your gun, or whether it doesn't matter)

Whether you use a classifier or a regressor depends only on the kind of problem you are solving. You have a binary classification problem, so use the classifier.

I could run randomforestregressor first and get back a set of estimated probabilities.

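NO. A regressor is not trained to give you probabilities. It treats the values you give it (in this case only 0 and 1) as a continuous target and tries to fit them. For regression in general, values above 1 or below 0 are perfectly valid outputs, because the model does not expect only two discrete values as output (that's called classification!) but continuous values. A random forest regressor averages training targets in its leaves, so with 0/1 targets its predictions happen to stay between 0 and 1, but it is still fitted with a regression criterion on a continuous target, not on class membership.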
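To make the difference concrete, here is a minimal sketch that fits both estimators on the same data and compares their scores. The synthetic dataset from `make_classification` and all parameter values are assumptions standing in for the asker's data, not something from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic binary data (an assumption, standing in for the asker's dataset).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit both estimators on exactly the same 0/1 targets.
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

clf_scores = clf.predict_proba(X_test)[:, 1]  # classifier's estimate for class 1
reg_scores = reg.predict(X_test)              # regressor's averaged 0/1 targets

print("max abs difference:", np.abs(clf_scores - reg_scores).max())
print("regressor output range:", reg_scores.min(), reg_scores.max())
```

In a forest, each tree's regression output is an average of training targets, so the regressor's scores stay within [0, 1] here. The two score vectors will usually still differ, because the two estimators use different split criteria and, in recent scikit-learn versions, different `max_features` defaults.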

If you want the "probabilities" (be aware that these are not necessarily well-calibrated probabilities) of a certain point belonging to a certain class, train a classifier (so it learns to classify the data) and then use .predict_proba(), which predicts those probabilities.
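As a rough illustration of what .predict_proba() returns, the sketch below fits a classifier on synthetic data (an assumption, as are all parameter values) and inspects the output layout: one column per class, in the order given by `classes_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary data with labels 0 and 1 (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
clf = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)

proba = clf.predict_proba(X[:5])  # shape (n_samples, n_classes)
print(clf.classes_)               # column order of predict_proba
print(proba)                      # each row sums to 1

# Look up the column for class 1 via classes_ rather than hard-coding it.
p_class1 = proba[:, list(clf.classes_).index(1)]
```

Looking up the column through `classes_` is a small safeguard: `[:, 1]` only means "class 1" when the labels happen to be sorted as 0, 1.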

Just to mention it here: .predict vs .predict_proba (for a classifier!)
.predict just takes the .predict_proba output and returns the class with the highest probability. For a binary problem this means 0 for everything below a threshold of 0.5 and 1 for everything above it.
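The relationship between .predict and .predict_proba can be checked directly. The sketch below uses synthetic data (an assumption) and also shows how one might apply a different cutoff; the value 0.3 is purely hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary data with labels 0 and 1 (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, random_state=2)
clf = RandomForestClassifier(n_estimators=50, random_state=2).fit(X, y)

proba = clf.predict_proba(X)
labels = clf.predict(X)

# Rebuild the labels by picking the most probable class for each row.
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]
print(np.array_equal(labels, labels_from_proba))

# A custom cutoff on the class-1 probability (0.3 is a hypothetical choice).
custom_threshold = 0.3
labels_custom = (proba[:, 1] >= custom_threshold).astype(int)
```

So if a different cutoff is what you are after, you can still get it from the classifier by thresholding its .predict_proba output yourself.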

Remark: sure, internally they are very much the same apart from the "last layer" etc.! Still, treat them (or rather the problems they are solving) as completely different!