Solved – Linear discriminant analysis (Fisher) = Bayes

Tags: bayesian, discriminant analysis, filter, naive bayes

I'd like to ask a question. I am currently reading a book about mail filtering, and the two methods it covers, naïve Bayes and the Fisher method, are very similar in implementation. I am also writing a paper on Bayesian spam filtering methods and am curious whether the Fisher method counts as one of them or is something completely separate. Could someone elaborate on this?

@edit

The book I am using is "Programming Collective Intelligence: Building Smart Web 2.0 Applications". My copy is in Polish, so I can't give you the page, but it is chapter six, on document filtering. The first part of the chapter covers the naïve Bayesian classifier, and the second part, called the Fisher method, opens by saying (sorry if it's not 100% the same, I'm translating from Polish) that the Fisher method is an alternative to the naïve Bayesian classifier.

In a Bayesian filter we first calculate P(word | category), where the category is spam or ham, then P(document | category), and then use Bayes' theorem to get P(category | document). In the Fisher method they instead start from P(category | word).
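To make the comparison concrete, here is a rough Python sketch of the naïve Bayes side as I understand it. This is my own simplified code, not the book's: the class and method names are made up, and I use simple add-one smoothing for P(word | category), which may differ from what the book does.

```python
from collections import defaultdict
import math

class NaiveBayesFilter:
    """Sketch: estimate P(word | category) from counts, combine them into
    P(document | category), then apply Bayes' theorem for P(category | document)."""

    def __init__(self):
        self.word_counts = defaultdict(lambda: defaultdict(int))  # category -> word -> count
        self.doc_counts = defaultdict(int)                        # category -> number of documents

    def train(self, words, category):
        self.doc_counts[category] += 1
        for w in set(words):
            self.word_counts[category][w] += 1

    def p_word_given_cat(self, word, category):
        # P(word | category), with add-one smoothing so unseen words don't zero out the product
        return (self.word_counts[category][word] + 1) / (self.doc_counts[category] + 2)

    def log_p_cat_given_doc(self, words, category):
        # log P(document | category): sum of log P(word | category) (the "naive" independence step)
        log_likelihood = sum(math.log(self.p_word_given_cat(w, category)) for w in set(words))
        # Bayes' theorem: add the log prior P(category); the normalising constant is omitted
        log_prior = math.log(self.doc_counts[category] / sum(self.doc_counts.values()))
        return log_prior + log_likelihood
```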

To score a whole document, the naïve Bayesian classifier multiplies the P(word | category) values together. In the Fisher method they multiply the P(category | word) values, take the natural log of the product, multiply by -2, and compare that score against a chi-squared distribution.
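And here is the Fisher combination step as I read the book's description. Again this is my own sketch, not the book's code; the function names are mine, and the chi-squared tail uses the standard series expansion for even degrees of freedom.

```python
import math

def inv_chi2(chi, df):
    """Upper-tail probability of a chi-squared distribution with even df
    (series expansion; df must be an even integer)."""
    m = chi / 2.0
    total = term = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_prob(cat_given_word_probs):
    """Combine the per-word P(category | word) values Fisher-style:
    take -2 * ln(product of the probabilities) and look that score up
    in a chi-squared distribution with 2n degrees of freedom."""
    score = -2.0 * sum(math.log(p) for p in cat_given_word_probs)
    return inv_chi2(score, 2 * len(cat_given_word_probs))
```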

To me both methods look very similar. I want to write a short paper about Bayesian filtering methods, and I don't know whether "Bayesian filtering" means only naïve Bayes or covers the Fisher method as well.

Hope this helps, sorry for being a "newb" here 🙂

@edit2

Let me quote the book's description of the Fisher method:
"The Fisher method, named for R. A. Fisher, is an alternative method that's been shown to give very accurate results, particularly for spam filtering. This is the method used by SpamBayes, an Outlook plug-in written in Python. Unlike the naïve Bayesian filter, which uses the feature probabilities to create a whole document probability, the Fisher method calculates the probability of a category for each feature in the document, then combines the probabilities and tests to see if the set of probabilities is more or less likely than a random set. This method also returns a probability for each category that can be compared to the others. Although this is a more complex method, it is worth learning because it allows much greater flexibility when choosing cutoffs for categorization." from book I am using.

Best Answer

In my understanding, naive Bayes can be seen as a special case of the classification done in discriminant analysis when the predictors are assumed to be normally distributed. We seek, via Bayes' theorem,

\begin{align*} Pr(Y = k | X = x) =& \, \frac{\pi_k f_k(x)}{\sum_{l=1}^K \pi_l f_l(x)} \end{align*} and classify an observation into the class for which this probability is highest. The density of the predictors is assumed to be
\begin{align*} f_k(x) =& \, \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} e^{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1}(x - \mu_k)} \end{align*} in Fisher's linear DA, as $\Sigma$ is assumed constant across classes. In quadratic DA the covariances $\Sigma_k$ may vary across classes. $\pi_k$ denotes the prior class probabilities. The discriminant function becomes: \begin{align*} \delta_k(x) =& \, x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k \end{align*}
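As a quick illustration of how this rule is used, here is a small NumPy sketch of the LDA discriminant. The function names are mine, and the estimation of the $\mu_k$, the shared $\Sigma$, and the priors $\pi_k$ from training data is omitted.

```python
import numpy as np

def lda_discriminant(x, mu_k, Sigma_inv, pi_k):
    # delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 * mu_k^T Sigma^{-1} mu_k + log(pi_k)
    return x @ Sigma_inv @ mu_k - 0.5 * mu_k @ Sigma_inv @ mu_k + np.log(pi_k)

def classify(x, mus, Sigma, pis):
    # Assign x to the class with the largest discriminant value
    Sigma_inv = np.linalg.inv(Sigma)
    scores = [lda_discriminant(x, mu_k, Sigma_inv, pi_k) for mu_k, pi_k in zip(mus, pis)]
    return int(np.argmax(scores))
```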

With $f_k (x) = \prod_{j=1}^p f_{jk}(x_j)$ (conditional independence of predictors) in each class we get naive Bayes. For Gaussian predictors this means the $\Sigma_k$ are diagonal.
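In code, the naive Gaussian case then amounts to a product of univariate normal densities per class, i.e. a diagonal $\Sigma_k$. Again, just a sketch with made-up names; parameter estimation is omitted.

```python
import numpy as np

def gaussian_nb_log_posteriors(x, mus, variances, pis):
    """Unnormalised log P(Y = k | X = x) when f_k(x) is a product of univariate
    Gaussians, i.e. Sigma_k is diagonal with entries variances[k]."""
    log_posts = []
    for mu_k, var_k, pi_k in zip(mus, variances, pis):
        # sum of per-feature log densities = log of the product f_k(x)
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var_k) + (x - mu_k) ** 2 / var_k)
        log_posts.append(np.log(pi_k) + log_lik)
    return np.array(log_posts)  # argmax gives the predicted class
```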

I guess it's called naive because the predictors are typically not independent. But since this assumption means far fewer parameters have to be estimated in $\Sigma$ (or, a fortiori, in the $\Sigma_k$), especially when there are many predictors, it may still be a useful assumption because of the bias-variance tradeoff.