Solved – Using the Naive Bayes classifier in R with continuous variables

Tags: classification, naive-bayes, r

I am trying to predict a categorical variable (type of job, with three classes) from a dataset that consists mainly of continuous variables (years of education, salary, etc.), using the Naive Bayes classifier in the package 'klaR'. When I train the Naive Bayes classifier directly on the continuous variables, I get very poor predictions on an out-of-sample dataset. However, when I divide the continuous variables into categories (i.e. make them categorical), I get fairly good predictions. Is the problem that I am not correctly specifying that these are continuous variables? Discretizing loses information, so I would have expected worse results, not better ones.
My code has the following form:

m <- NaiveBayes(Job ~ ., data = JobDataTrain)   # train in sample

m_predict <- predict(m, JobDataTest)            # predict out of sample
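
To judge the predictions I compare them with the true classes in the test set, along these lines (assuming the test data also contains the Job column):

table(m_predict$class, JobDataTest$Job)   # predicted vs. actual job type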

Best Answer

The difference that you are seeing is likely a result of the fact that Naïve Bayes (NB) treats categorical and numerical variables quite differently. Explaining why needs a little notation.

Assume that we are trying to predict Type, which takes on values $\{t_1, t_2, \ldots, t_n\}$. For each variable $V$, NB needs an estimate of $P(V = v \mid Type = t_i)$. For categorical variables there is a simple way to get this: take all points in the training data with $Type = t_i$ and compute the proportion of them that have $V = v$. For continuous variables, NB makes another naïve assumption: for each $t_i$, the values of $V$ among the points with $Type = t_i$ are normally distributed. The mean and standard deviation of $V$ are computed from the points with $Type = t_i$, and the resulting normal density is used to estimate $P(V = v \mid Type = t_i)$. In both cases Bayes' law then turns these estimates into (something proportional to) $P(Type = t_i \mid V = v)$.
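
In code, the continuous case amounts to something like the sketch below. The variable name salary and all of the numbers are invented purely for illustration; this is roughly what klaR does internally, judging from the per-class means and standard deviations it stores (shown further down).

## Sketch of NB's Gaussian step for one continuous variable (illustrative data only)
set.seed(1)
salary <- c(rnorm(100, 30, 5), rnorm(100, 60, 8))   # fake training values
Type   <- factor(rep(c("t1", "t2"), each = 100))

mu    <- tapply(salary, Type, mean)     # per-class mean, as NB estimates it
sig   <- tapply(salary, Type, sd)       # per-class standard deviation
prior <- table(Type) / length(Type)     # class proportions

v <- 45                                 # a new observation
post <- prior * dnorm(v, mu, sig)       # prior times Gaussian density
post / sum(post)                        # estimate of P(Type = t_i | salary = v)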

Of course not all data is normally distributed, so if your continuous variable does not match that model well, this Gaussian approximation may provide bad estimates of the needed probabilities.

Here is an (artificial) example of the behavior that you saw.

### Response to: https://stats.stackexchange.com/q/215146/141956
library(klaR)
library(sm)

## One-dimensional data: each type is a mixture of two uniform "bumps"
set.seed(2017)
x = c(runif(200,0,1), runif(50,2,3), runif(50,4,5), runif(200,6,7))
Type = factor(c(rep(1,200), rep(2,50), rep(1,50), rep(2,200)))
df = data.frame(x, Type)
sm.density.compare(x, Type, lty=c(2,2))   # dashed density estimate for each type

[Figure: non-Gaussian densities of x for the two types]

For both types the distribution is non-Gaussian, but NB nevertheless uses a Gaussian approximation.

NB = NaiveBayes(Type ~ x, data=df)
table(predict(NB, df)$class, df$Type)   # rows = predicted, columns = actual
      1   2
  1 200  50
  2  50 200
NB$tables                               # per-class mean (first column) and sd (second column) of x
$x
      [,1]     [,2]
1 1.278952 1.640579
2 5.703749 1.628859

mean(x[Type==1])
[1] 1.278952
sd(x[Type==1])
[1] 1.640579

sm.density.compare(x, Type, lty=c(2,2))                                     # data densities (dashed)
lines(seq(-2,9,0.1), dnorm(seq(-2,9,0.1), 1.3, 1.63), col="red", lwd=2)     # NB's Gaussian for type 1
lines(seq(-2,9,0.1), dnorm(seq(-2,9,0.1), 5.7, 1.63), col="green", lwd=2)   # NB's Gaussian for type 2

[Figure: Gaussian approximations (solid) overlaid on the data densities (dashed)]

NB represents both types by Gaussians with standard deviation about 1.63 and means at about 1.3 and 5.7. The dashed red distribution is approximated by the bold red curve, and the dashed green distribution by the bold green curve. These approximations represent the data poorly, and they predict the wrong type for all of the points in the smaller bumps. The Gaussian distributions are simply not doing a good job of representing this data.
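
You can see this on a single point. Take $x = 4.5$, which lies in the smaller (Type 1) bump: under the two fitted Gaussians the Type 2 density is larger there, so with equal priors NB picks the wrong class.

dnorm(4.5, mean=1.278952, sd=1.640579)   # density under the Type 1 Gaussian, about 0.035
dnorm(4.5, mean=5.703749, sd=1.628859)   # density under the Type 2 Gaussian, about 0.19
predict(NB, data.frame(x=4.5))$class     # predicted as Type 2, even though points in (4,5) were generated as Type 1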

What if we discretize the data before applying NB?

## Discretize ##
DiscX = cut(x, breaks=0:7)                 # bin x into the unit intervals (0,1], (1,2], ..., (6,7]
Ddf = data.frame(DiscX, Type)
NB2 = NaiveBayes(Type ~ DiscX, data=Ddf)
table(predict(NB2, Ddf)$class, Ddf$Type)   # rows = predicted, columns = actual
      1   2
  1 250   0
  2   0 250

Now, it correctly classifies all of the points in the training data. In this case, the discretized form of the data captures the structure much better than the Gaussians.
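
Discretizing is not the only way around the Gaussian assumption. If I remember klaR's interface correctly, NaiveBayes also has a usekernel argument that replaces the per-class Gaussian with a kernel density estimate, which should track a multimodal shape like this one much better; a sketch:

NB3 = NaiveBayes(Type ~ x, data=df, usekernel=TRUE)   # kernel density instead of a Gaussian per class
table(predict(NB3, df)$class, df$Type)                # compare with the confusion matrices above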

However, I want to caution that just because your data is not Gaussian does not mean that Naïve Bayes will give a bad answer. In fact, NB can do surprisingly well, even on non-normally distributed data.