Solved – Choosing a model for classification: decision tree or naive bayes

classification, machine learning, naive bayes

student

My goal is to predict churn for a monthly subscription model business. The data set has a small number of dimensions:

  • subscription id
  • start date (can infer "months as a subscriber")
  • price plan (4 variants)
  • marketing channel (how they found us e.g. Google, email, Facebook)
  • cancel date (the dependent variable. Blank if the subscriber never churned; if it has a value, it is the date the subscription churned)

There are 10k records.

Thinking back to class, the Naive Bayes model sounded pretty intuitive and I wanted to go that route. But then I read that NB works better where there are many variables, and I only have 5 (one of which will be the dependent variable).

Then I remembered decision trees. But this article says that decision trees are "not well suited for handling a continuous input variable such as time (whereas survival models are likely a better fit)". My variable "start date" might rule out a decision tree then, since the theory is that months as a paying customer may impact churn (in fact we know this from regular attrition/cohort analysis).

The goal is to predict whether or not an account will churn (yes/no), not the actual date of churn, so I might edit my data set to turn the variable "cancel date" into a "churned" yes/no field.

  1. Can I use Naive Bayes or a decision tree given my data set and goal? Is one more appropriate than the other? I'd welcome other model suggestions, but I'm taking baby steps with what I've learned in class.
  2. In the case of either model, how might I want to edit my data set? Currently I have a start date and, if they churned, a cancel date. I could therefore create a new field, "months as paying customer". Is this advised? (See the sketch after this list.)
  3. Do I need to change the field for the dependent variable? Either it is blank (not churned) or it has a date value (churned). Should I create a new field "churned" with values "yes" or "no"?
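
Roughly, this is the transformation I have in mind for questions 2 and 3. It is only a sketch in R: the column names (start_date, cancel_date, price_plan, marketing_channel) and the 30.44 days-per-month figure are my own assumptions.

```r
# Assumes `subs` is a data frame with character columns start_date and
# cancel_date, where cancel_date is blank when the account never churned.
subs$cancel_date[subs$cancel_date == ""] <- NA        # blank = never churned
subs$start_date  <- as.Date(subs$start_date)
subs$cancel_date <- as.Date(subs$cancel_date)

# Question 3: a binary dependent variable instead of a raw date
subs$churned <- factor(ifelse(is.na(subs$cancel_date), "no", "yes"))

# Question 2: months as a paying customer (today's date for accounts still active)
end_date <- subs$cancel_date
end_date[is.na(end_date)] <- Sys.Date()
subs$months_as_customer <-
  as.numeric(difftime(end_date, subs$start_date, units = "days")) / 30.44
```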

I realize this question is a little open ended. Any pointers of help to get me going would be much appreciated.

Best Answer

Don't make a choice based on beliefs: try both! Once you have developed a cross-validation framework, it is not very hard to feed it various models and pick the best one. Sometimes such a cross-validation framework already exists (caret in R, but there must be plenty of others!).
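
For instance, with the data frame sketched in your question (same assumed column names), a caret comparison of a decision tree and Naive Bayes could look roughly like this; method = "rpart" needs the rpart package and method = "nb" needs klaR.

```r
library(caret)

# Assumes `subs` has a two-level factor `churned` plus the predictors
# months_as_customer, price_plan and marketing_channel.
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

set.seed(42)
tree_fit <- train(churned ~ months_as_customer + price_plan + marketing_channel,
                  data = subs, method = "rpart", metric = "ROC", trControl = ctrl)

set.seed(42)
nb_fit <- train(churned ~ months_as_customer + price_plan + marketing_channel,
                data = subs, method = "nb", metric = "ROC", trControl = ctrl)

# Resampled ROC / sensitivity / specificity for both models, side by side
summary(resamples(list(decision_tree = tree_fit, naive_bayes = nb_fit)))
```

The resamples summary then shows which model does better under cross-validation, which answers question 1 empirically rather than by rule of thumb.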

The paper "Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?" (http://jmlr.org/papers/volume15/delgado14a/delgado14a.pdf) is a very good review of existing models, their implementations and their performance on various data sets. For example, you can find information about how model performance varies with the number of features.

But even though a model's performance depends a priori on the number of features, the number of observations (some models were developed to handle specific situations), and the type of observations, it is not obvious beforehand that one model will perform better than another.

IMHO, the only thing you should consider up front when selecting a model is the time it takes to train. Some training times become prohibitive with large data sets; for example, training a kernel SVM on 1M+ observations will effectively never finish. With 10k records and 5 features, though, you can train almost anything.
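
If you want to check, wrapping the training call in system.time() gives a quick read on the cost (same assumed column names as above):

```r
# Rough measure of how long one cross-validated fit takes on this data set
system.time(
  train(churned ~ months_as_customer + price_plan + marketing_channel,
        data = subs, method = "rpart",
        trControl = trainControl(method = "cv", number = 5))
)
```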

As for feature engineering, you should try every idea you have as well! For categorical variables, a first step is to encode them as dummy variables so that you end up with a numeric matrix. You might also want to remove rare factor levels, consider interactions... and keep watching the effect on predictive performance!
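
A minimal sketch of that encoding, assuming the same column names as above; base R's model.matrix() and caret's dummyVars() both do the job:

```r
# One-hot / dummy encoding of the categorical predictors -> numeric matrix
X <- model.matrix(~ price_plan + marketing_channel + months_as_customer,
                  data = subs)[, -1]          # drop the intercept column

# Equivalent with caret
dv <- dummyVars(~ price_plan + marketing_channel + months_as_customer, data = subs)
X  <- predict(dv, newdata = subs)
```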
