People always say that naive Bayes is a linear model, but I am not able to understand why. Can anybody explain?
Solved – Why is naive Bayes a linear model
linear model, naive bayes
Related Solutions
I disagree with the explanation given in the other answer. SVM works well for high-dimensional problems with relatively few instances because it is well regularized. In this case, I suspect the problem is not the tool but rather how it is being used.
For SVM I use a dot kernel and other parameters are all in their defaults ...
This is why your SVM results are bad. If you do not tune the SVM parameters (probably the regularization constant $C$ in your case), the resulting classifier will likely be poor unless you happen to get lucky with the default value.
when I change the parameters and try again, I get the same result; Naive Bayes still outperforms SVM.
How do you change the parameters? What search method do you use? Do you just pick a value and hope for the best? Parameter search is important and must be done properly. Typically, optimal parameters are found through cross-validation.
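As an illustrative sketch (not part of the original answer), this is roughly what a cross-validated search over $C$ might look like in scikit-learn; the dataset and grid values are made-up placeholders:

```python
# Minimal sketch: tuning the SVM regularization parameter C by
# cross-validated grid search. The dataset and grid are placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=100, random_state=0)

param_grid = {"C": [0.01, 0.1, 1, 10, 100]}  # candidate values of C
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X, y)

print("best C:", search.best_params_["C"])
print("cross-validated accuracy:", search.best_score_)
```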
Note that Naive Bayes may well be better for your particular application. Just because SVM is known to work well on this type of problem does not mean that it always does.
A Bayesian model is defined in terms of a likelihood function (the probability of observing the data given the parameters) and priors (assumed distributions for the estimated parameters). The naive Bayes algorithm estimates the probabilities directly from the data, without assuming any prior distributions for them, so it is not Bayesian.
The algorithm estimates the joint probability of the target variable $y$ and the features $x_1, x_2, \dots, x_m$ and then classifies by choosing the class that is assigned the greatest probability. It is called naive because it makes the "naive" assumption that the features are conditionally independent given the class, so we can write
$$ p(x_1, x_2, \dots, x_m \mid y) = \prod_{j=1}^m p(x_j \mid y) $$
and then, using the properties of conditional probability, we can estimate the joint probability
$$ p(y, x_1, x_2, \dots, x_m) = p(x_1, x_2, \dots, x_m \mid y) \; p(y) $$
This may not sound like a big deal, but estimating the conditional probabilities $p(x_j \mid y)$ one variable at a time is much easier than estimating the joint probability $p(x_1, x_2, \dots, x_m \mid y)$ all at once.
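To make the classification step explicit (this just restates the procedure described above), the predicted class is the one that maximizes the joint probability:

$$ \hat y = \arg\max_{y} \; p(y) \prod_{j=1}^m p(x_j \mid y) $$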
What we need to estimate here are the conditional probabilities $p(x_j \mid y)$ and the marginal probabilities $p(y)$, and we use maximum likelihood for this. It is nicely explained on this blog and in The Naive Bayes Model, Maximum-Likelihood Estimation, and the EM Algorithm paper by Michael Collins. For a general introduction to maximum likelihood estimation, see the Maximum Likelihood Estimation (MLE) in layman terms thread.
If you are dealing with binary or categorical variables, they follow Bernoulli or categorical distributions, and the maximum likelihood estimators for the parameters of those distributions (the probabilities) are simply the empirical proportions. For example, if you observed a sequence of coin tosses $z_1, z_2, \dots, z_n$ that are independent and identically distributed according to a Bernoulli distribution with unknown probability of success $\theta$, and among those tosses you observed $k$ heads, then the "most likely" probability of observing heads given this data is $\hat \theta = \tfrac{k}{n}$. So if you previously observed that a third of the tosses were heads, your best guess is that a third of future tosses will also be heads. Here you can find the formal derivation. The same kind of computation is done for all the probabilities of interest, and then everything is multiplied (see the formulas above) to obtain the final result.
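As an illustrative sketch (not part of the original answer), these estimates really are just counting; the tiny binary dataset below is made up:

```python
# Minimal sketch of naive Bayes with binary features, estimated by
# empirical proportions (maximum likelihood). The data are made up.
X = [(1, 0), (1, 1), (0, 1), (0, 0), (1, 0)]  # two binary features
y = [1, 1, 0, 0, 1]                            # binary class labels

classes = set(y)
n = len(y)

# p(y): empirical class proportions
prior = {c: sum(1 for t in y if t == c) / n for c in classes}

# p(x_j = 1 | y = c): proportion of class-c rows with feature j equal to 1
cond = {}
for c in classes:
    rows = [x for x, t in zip(X, y) if t == c]
    cond[c] = [sum(r[j] for r in rows) / len(rows) for j in range(2)]

def predict(x):
    # choose the class maximizing p(y) * prod_j p(x_j | y)
    def joint(c):
        p = prior[c]
        for j, v in enumerate(x):
            p *= cond[c][j] if v == 1 else 1 - cond[c][j]
        return p
    return max(classes, key=joint)

print(predict((1, 0)))  # -> 1 for this toy data
```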
All this said, if you want, you can estimate the naive Bayes model in the Bayesian paradigm by assuming priors for the probabilities and maximizing the posterior probability rather than the likelihood.
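As a side note not in the original answer, the familiar add-one (Laplace) smoothing can be read exactly this way: with a uniform $\mathrm{Beta}(1,1)$ prior on $\theta$, the posterior mean after observing $k$ heads in $n$ tosses is

$$ \hat \theta = \frac{k+1}{n+2} $$

which is Laplace's rule of succession and coincides with add-one smoothing of the empirical counts.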
You can also find a detailed explanation of the naive Bayes algorithm, with multiple examples, on StackOverflow.
Best Answer
I don't see how Naive Bayes is a linear model. This gives one possible definition. In general, I would say a linear model would involve a linear combination of the parameters or of some transformation of the parameters.
Naive Bayes multiplies the probabilities from the different variables, and they are not really weighted in the standard sense. However, if you take the logarithm, the products become sums, so for some types of conditional probability distribution (CPD) you could get something that looks like a linear model. I don't think this holds for every CPD, though (but I may be wrong).
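To make that concrete (a standard derivation, not part of the original answer): with binary features $x_j \in \{0,1\}$ and Bernoulli CPDs $p(x_j = 1 \mid y) = \theta_{jy}$, the log-odds of a two-class naive Bayes model is

$$ \log \frac{p(y=1 \mid x)}{p(y=0 \mid x)} = \log \frac{p(y=1)}{p(y=0)} + \sum_{j=1}^m \left[ x_j \log \frac{\theta_{j1}}{\theta_{j0}} + (1 - x_j) \log \frac{1-\theta_{j1}}{1-\theta_{j0}} \right] = w_0 + \sum_{j=1}^m w_j x_j $$

which is linear in the features, so the decision boundary is a hyperplane, just as in logistic regression. For other CPDs (e.g. Gaussians with class-dependent variances) the boundary need not be linear.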