One point that may help your understanding:
If $x$ is normally distributed and $a$ and $b$ are constants, then $y=\frac{x-a}{b}$ is also normally distributed (but with a possibly different mean and variance).
Since the residuals are just the y values minus the estimated mean (standardized residuals are further divided by an estimate of the standard error), if the y values are normally distributed then so are the residuals, and vice versa. So when we talk about theory or assumptions it does not matter which one we refer to, because each implies the other.
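A quick simulation makes this concrete (a minimal sketch; the sample size, coefficients, and error standard deviation are arbitrary choices):

set.seed(42)
x <- runif(200)
y <- 3 + 2 * x + rnorm(200, sd = 0.5)   # y is normal around its conditional mean
fit <- lm(y ~ x)
qqnorm(resid(fit)); qqline(resid(fit))  # residuals inherit the normality of y
shapiro.test(resid(fit))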
So for the questions this leads to:
- Yes; both; either.
- No (however, the individual y-values will come from normal distributions with different means, which can make them look non-normal when grouped together).
- Normality of the residuals implies normality within the groups. However, it can be useful to examine the residuals or y-values by group in some cases (pooling may obscure non-normality that is obvious within a single group), or all together in other cases (there may be too few observations per group to judge, but all together you can tell).
- This depends on what you mean by "compare", how big your sample size is, and your feelings on "approximate". The normality assumption is only required for tests and intervals on the results; you can fit the model and describe the point estimates whether there is normality or not. The Central Limit Theorem says that if the sample size is large enough then the estimates will be approximately normal even if the residuals are not (see the sketch after this list).
- It depends on what question you are trying to answer and how much approximation you are happy with.
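To illustrate the Central Limit Theorem point above, here is a minimal simulation sketch (the skewed exponential errors and the sample sizes are arbitrary choices): even though the residuals are clearly non-normal, the sampling distribution of the slope estimate is close to normal.

set.seed(1)
slopes <- replicate(2000, {
  x <- runif(100)
  y <- 1 + 2 * x + (rexp(100) - 1)  # skewed, non-normal errors with mean 0
  coef(lm(y ~ x))[2]
})
hist(slopes, breaks = 40)  # approximately normal despite the skewed errors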
Another point that is important to understand (but is often glossed over when learning this material) is that there are two types of residuals here: the theoretical residuals, which are the differences between the observed values and the true theoretical model, and the observed residuals, which are the differences between the observed values and the fitted values from the estimated model. We assume that the theoretical residuals are iid normal. The observed residuals are not independent, not identically distributed, and not exactly normally distributed (but they do have a mean of 0). However, for practical purposes the observed residuals do estimate the theoretical residuals and are therefore still useful for diagnostics.
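A small sketch of this distinction (a simple linear model with arbitrary numbers): the observed residuals sum to exactly zero by construction, while the theoretical residuals do not, and the two track each other closely.

set.seed(7)
x <- runif(50)
eps <- rnorm(50)       # theoretical residuals: iid normal
y <- 1 + 2 * x + eps
fit <- lm(y ~ x)
sum(eps)               # not exactly 0
sum(resid(fit))        # 0 up to floating-point rounding, by construction
cor(eps, resid(fit))   # the observed residuals estimate the theoretical ones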
Let's start with an experiment. I am just duplicating the first column again and again in my data set.
library(e1071)  # provides naiveBayes()
data(HouseVotes84, package = "mlbench")

errors <- NULL
for (i in 1:50) {
  # Append yet another copy of V1 as a new column
  HouseVotes84[, ncol(HouseVotes84) + 1] <- HouseVotes84$V1
  model <- naiveBayes(Class ~ ., data = HouseVotes84[1:299, ])
  error <- sum(predict(model, HouseVotes84[300:400, ]) != HouseVotes84[300:400, ]$Class)
  errors <- c(errors, error)
}
plot(errors, type = 'l', xlab = 'Number of duplications of V1',
     ylab = 'Error on the test set')
For information, the data set looks like:
Class V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
1 republican n y n y y y n n n y <NA> y y y n y
2 republican n y n y y y n n n n n y y y n <NA>
3 democrat <NA> y y <NA> y y n n n n y n y y n n
4 democrat n y y n <NA> y n n n n y n y n n y
Indeed, the number of errors on the test set increases as the first column gets duplicated, and it seems to saturate at 32. Note that, keeping only the first two columns (Class and V1):
model <- naiveBayes(Class ~ ., data = HouseVotes84[1:299, 1:2])
error <- sum(predict(model, HouseVotes84[300:400, ]) != HouseVotes84[300:400, ]$Class)
The error is 31.
What actually went on?
It all boils down to the construction of the Naive Bayes classifier. Keeping Wikipedia's notation (https://en.wikipedia.org/wiki/Naive_Bayes_classifier):
$$p(C_k \vert x_1, \dots, x_n) = \frac{1}{Z} p(C_k) \prod_{i=1}^n p(x_i \vert C_k)$$
where $C_k$ is the event "the target belongs to class $k$", $x_i$ is the value of the $i$-th variable, and $Z$ is a normalizing constant that does not depend on $k$.
Classifying is just looking for the max of the above expression.
$$\hat k = \arg \max_l \, p(C_l \mid x_1, \dots, x_n)$$
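As a minimal illustration of this decision rule (all probabilities below are made up for two classes and three binary features, not estimated from the data):

# Hypothetical prior p(C_k) and conditionals p(x_i | C_k) for one observation x
prior <- c(republican = 0.45, democrat = 0.55)
cond  <- rbind(republican = c(0.8, 0.3, 0.6),
               democrat   = c(0.2, 0.7, 0.5))
scores    <- prior * apply(cond, 1, prod)  # p(C_k) * prod_i p(x_i | C_k)
posterior <- scores / sum(scores)          # dividing by Z
names(which.max(posterior))                # arg max over classes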
Taking logarithms, and replicating the first variable $M$ times (calling $\tilde x_M$ the new feature vector), we observe that, up to an additive constant that does not depend on $k$:
$$\log p(C_k \mid \tilde x_M) = \log p(C_k \mid x) + M \log(p(x_1 \vert C_k))$$
So, for $M$ large enough, the classification is done according to the first variable only.
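A toy sketch of this dominance effect (the log-probabilities are made-up numbers, not estimates from HouseVotes84): as $M$ grows, the term $M \log p(x_1 \mid C_k)$ overwhelms everything else and the prediction flips to the class favored by $x_1$ alone.

# Made-up log p(C_k | x) and log p(x_1 | C_k) for two classes
log_post <- c(republican = -1.2, democrat = -0.9)  # full model favors democrat
log_px1  <- c(republican = -0.3, democrat = -1.6)  # V1 alone favors republican
for (M in c(0, 1, 5, 20)) {
  scores <- log_post + M * log_px1
  cat("M =", M, "-> predicted class:", names(which.max(scores)), "\n")
}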
Best Answer
On its own, Naive Bayes does not assume the normal distribution. The heart of Naive Bayes is the heroic conditional independence assumption: given the class, each feature is independent of all the others,
$$p(x_i \mid x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n, C_k) = p(x_i \mid C_k)$$
Gaussian Naive Bayes additionally assumes that, within each class, every continuous feature follows a normal distribution, i.e. each class-conditional density $p(x_i \mid C_k)$ is modeled as a Gaussian.
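Concretely (standard Gaussian Naive Bayes, with the per-class mean $\mu_{ik}$ and variance $\sigma_{ik}^2$ estimated from the training data):
$$p(x_i \mid C_k) = \frac{1}{\sqrt{2\pi\sigma_{ik}^2}} \exp\left(-\frac{(x_i - \mu_{ik})^2}{2\sigma_{ik}^2}\right)$$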