Bayesian Classifier – Implementing a Bayesian Classifier with Multivariate Normal Densities

Tags: machine-learning, naive-bayes

Suppose we have a Bayesian classifier with multivariate normal class-conditional densities. How do I find the error rate of the classifier when there are two classes?

I am using this:

When dimension $d = 1$:

$$P(x | \mu , \sigma^2) = N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$

In $d$ dimensions:

$$P(x | \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \cdot e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)}$$

where $\mu$ is in $\mathbb{R}^d$ and $\Sigma$ is a $d \times d$ variance-covariance matrix.
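The asker wants MATLAB, where (if the Statistics Toolbox is available) `mvnpdf` evaluates this density directly. As a hedged sketch of the $d$-dimensional formula above, here is an equivalent in Python/NumPy (the function name `mvn_pdf` is mine):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma) in d dimensions."""
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    Sigma = np.atleast_2d(np.asarray(Sigma, float))
    d = mu.size
    diff = x - mu
    # normalising constant (2*pi)^(d/2) * |Sigma|^(1/2)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    # quadratic form (x - mu)^T Sigma^{-1} (x - mu), via a solve rather
    # than an explicit inverse for numerical stability
    quad = diff @ np.linalg.solve(Sigma, diff)
    return float(np.exp(-0.5 * quad) / norm)
```

For $d = 1$ this reduces to the scalar formula above, and for a diagonal $\Sigma$ it factors into a product of univariate normal densities, which is a useful sanity check.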

What would be the algebra to get the error rate, or could you give an example? I would like to do this in MATLAB.

Let's say I have 2 classes and 2 attributes, with 20 examples:

features:
    1.8756    1.4236
    2.0677    1.3759
    0.8540    0.7782
    0.5651   -0.3511
    0.6103   -0.6901
   -0.1945    1.6438
   -0.2620    0.8022
    1.5326    0.0188
    0.3334    1.0578
    0.8535   -1.8545
    0.3066    1.9716
   -1.4424    0.4216
    0.3275   -0.2844
   -0.0079    2.8506
    0.0114    1.4001
   -0.4049   -0.3981
   -0.0913    2.2094
    0.3376   -1.0467
    0.3455    2.4960
    0.3232   -0.5614


and targets

     2
     2
     1
     1
     1
     2
     1
     1
     2
     1
     2
     1
     1
     2
     2
     1
     2
     1
     2
     1

What is the process to classify them and obtain the error rate for these examples, if we split the set into 15 examples for training and the remaining 5 for testing?

Best Answer

I would have thought the "best" (most intuitive) estimate of error would be the probability that the classification is incorrect, or, alternatively/equivalently, the odds against the class. Using the Wikipedia example of classifying a person as male or female, I would have thought the estimate of accuracy should be:

$$O(Male|evidence)=\frac{P(Male|evidence)}{P(Female|evidence)}$$

So you would report the data in terms of "the evidence gives odds O:1 in favour of this classification"

In a problem with more than 2 classes, you should report the "worst" odds ratio. That is, given classes $C_{i}\;\;(i=1,\dots,R+1)$, one (and only one) of which is assumed to be true, suppose you classify an observation as $C_{R+1}$; then you would report odds of:

$$O(C_{R+1}|evidence)=\frac{P(C_{R+1}|evidence)}{\max_{i\in\{1,\dots,R\}}P(C_{i}|evidence)}$$

So you would report the data in terms of "the evidence gives odds O:1 in favour of this classification against the next best alternative"
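As a small Python sketch of this "worst odds" rule (the function name and the example probabilities are mine, not from the answer), you just compare the winning posterior probability with the runner-up:

```python
def worst_odds(posterior):
    """posterior: list of P(C_i | evidence), one per class, summing to 1.

    Returns the index of the winning class and its odds against the
    next-best alternative.
    """
    best = max(range(len(posterior)), key=lambda i: posterior[i])
    runner_up = max(p for i, p in enumerate(posterior) if i != best)
    return best, posterior[best] / runner_up

# Hypothetical three-class posterior: class index 1 wins, and the odds
# against the next-best class (index 2) are 0.7 / 0.2 = 3.5 : 1
cls, odds = worst_odds([0.1, 0.7, 0.2])
```

With two classes this reduces to the plain odds ratio above, since the only alternative is automatically the "next best".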

EDIT/UPDATE: in response to the new part of the question, to put some numbers into the calculation, use the first 15 observations to "train" the classifier. Because you are dealing with normal distributions, you only need the sufficient statistics, which in this case are:

$$ \begin{pmatrix} \hat{\mu}_{1} & \hat{\mu}_{2} \\ \hat{\nu}_{1} & \hat{\nu}_{2} \\ \hat{\sigma}_{1} & \hat{\sigma}_{2} \\ \hat{\tau}_{1} & \hat{\tau}_{2} \end{pmatrix} = \begin{pmatrix} 0.3798 & 0.6275 \\ -0.1449 & 1.6748 \\ 0.8367 & 0.8685 \\ 0.8200 & 0.5451 \end{pmatrix} $$

Where the subscript denotes the class: $\mu$ and $\sigma$ denote the mean and (maximum-likelihood, i.e. divide-by-$n$) standard deviation of the first attribute, while $\nu$ and $\tau$ denote the mean and standard deviation of the second attribute. You could include correlation, but it is "safer" not to unless you know the correlations actually exist. As these are just numbers to me, I have no reason to suppose the attributes to be dependent, so I will not constrain them by forcing a dependence assumption.
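These numbers can be reproduced from the first 15 rows of the question's data. A Python/NumPy sketch (note the standard deviations are computed with `ddof=0`, i.e. dividing by $n$, which is what matches the matrix above; in MATLAB the analogous flag is the second argument of `std`):

```python
import numpy as np

# First 15 observations (training set) and their class labels, from the question
X = np.array([
    [1.8756, 1.4236], [2.0677, 1.3759], [0.8540, 0.7782], [0.5651, -0.3511],
    [0.6103, -0.6901], [-0.1945, 1.6438], [-0.2620, 0.8022], [1.5326, 0.0188],
    [0.3334, 1.0578], [0.8535, -1.8545], [0.3066, 1.9716], [-1.4424, 0.4216],
    [0.3275, -0.2844], [-0.0079, 2.8506], [0.0114, 1.4001]])
t = np.array([2, 2, 1, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 2, 2])

stats = {}
for c in (1, 2):
    Xc = X[t == c]
    # per-class means and maximum-likelihood standard deviations
    stats[c] = (Xc.mean(axis=0), Xc.std(axis=0, ddof=0))
```

Running this gives $(\hat{\mu}_1,\hat{\nu}_1)\approx(0.3798,-0.1449)$, $(\hat{\sigma}_1,\hat{\tau}_1)\approx(0.8367,0.8200)$, and likewise for class 2, agreeing with the matrix above.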

Now you need a decision rule in order to classify an observation into one class or the other. The one I was suggesting was to use the odds ratio. So we need to calculate the probability of belonging to class 1, given the training data (denoted by $D$), the prior information (denoted by $I$), and the sample to test (denoted by $y$):

$$p(C_{1}|y,D,I)=\frac{p(C_{1}|D,I)p(y|C_{1},D,I)}{p(y|D,I)} \rightarrow O(C_{1}|y,D,I)=\frac{p(C_{1}|D,I)p(y|C_{1},D,I)}{p(C_{2}|D,I)p(y|C_{2},D,I)}$$

Where $p(C_{1}|D,I) = \frac{8}{15}$, assuming complete initial ignorance (because I am ignorant prior to seeing the data). If you knew it was possible for both categories to occur prior to observing the training data, then the probability would be given by the rule of succession, $\frac{8+1}{15+2}=\frac{9}{17}$.

$p(y|C_{1},D,I)$ is the posterior predictive distribution for class 1, given by:

$$p(y|C_{1},D,I)=\int p(y|\mu_{1},\nu_{1},\sigma_{1},\tau_{1},I) p(\mu_{1},\nu_{1},\sigma_{1},\tau_{1}|D,I)d\mu_{1}d\nu_{1}d\sigma_{1}d\tau_{1}$$

Now I will just give the posterior predictive assuming complete ignorance. It is not hard to derive: use the prior $p(\mu_{1},\nu_{1},\sigma_{1},\tau_{1}|I)\propto\frac{1}{\sigma_{1}\tau_{1}}$ and do the necessary integrals. The result is a product of two Student t densities (denoted by $St(x|\mu,\sigma,df)$):

$$p(y|C_{1},D,I)=St(y_{1}|\hat{\mu}_{1},\hat{\sigma}_{1}\sqrt{\frac{8+1}{8-1}},8-1)St(y_{2}|\hat{\nu}_{1},\hat{\tau}_{1}\sqrt{\frac{8+1}{8-1}},8-1)$$

Where $y_j$ is the value of attribute $j$ for the new data point. This should make it fairly obvious how it would generalise to more than two attributes. Similarly, we have $p(C_{2}|D,I) = \frac{7}{15}$ assuming ignorance, or $\frac{8}{17}$ using the rule of succession. The posterior predictive is:

$$p(y|C_{2},D,I)=St(y_{1}|\hat{\mu}_{2},\hat{\sigma}_{2}\sqrt{\frac{7+1}{7-1}},7-1)St(y_{2}|\hat{\nu}_{2},\hat{\tau}_{2}\sqrt{\frac{7+1}{7-1}},7-1)$$
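Both posterior predictives are products of location-scale Student t densities, which many environments do not provide directly (standard library routines usually give only the standardized t). A small Python helper following the $St(x|\mu,\sigma,df)$ notation above (the name `st_pdf` is mine), which translates line-for-line to MATLAB:

```python
from math import gamma, pi, sqrt

def st_pdf(x, mu, sigma, df):
    """Location-scale Student t density St(x | mu, sigma, df)."""
    z = (x - mu) / sigma
    const = gamma((df + 1) / 2) / (gamma(df / 2) * sqrt(df * pi) * sigma)
    return const * (1 + z * z / df) ** (-(df + 1) / 2)
```

With $df=1$ and $\sigma=1$ this reduces to the Cauchy density $\frac{1}{\pi(1+x^2)}$, a handy sanity check; the class-1 predictive above is then `st_pdf(y1, mu1, sigma1*sqrt(9/7), 7)` and so on.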

And so the final odds ratio is given by:

$$O(C_{1}|y,D,I)=\frac{8}{7} \times \frac{St(y_{1}|\hat{\mu}_{1},\hat{\sigma}_{1}\sqrt{\frac{9}{7}},7)}{St(y_{1}|\hat{\mu}_{2},\hat{\sigma}_{2}\sqrt{\frac{8}{6}},6)} \times \frac{St(y_{2}|\hat{\nu}_{1},\hat{\tau}_{1}\sqrt{\frac{9}{7}},7)}{St(y_{2}|\hat{\nu}_{2},\hat{\tau}_{2}\sqrt{\frac{8}{6}},6)}$$

I think you will agree that this number is sensible by any criterion: it "goes in all the right directions", and it appropriately accounts for the uncertainty in estimating the parameters of the model. Inserting these densities gives:

$$\frac{8}{7} \times\frac{\hat{\sigma}_{2}\hat{\tau}_{2}}{\hat{\sigma}_{1}\hat{\tau}_{1}} \times\left[\frac{\frac{\Gamma(4)}{\Gamma(\frac{7}{2})\sqrt{9}}}{\frac{\Gamma(\frac{7}{2})}{\Gamma(3)\sqrt{8}}}\right]^2 \frac{ \left[1+\frac{1}{8}\left(\frac{y_{1}-\hat{\mu}_{2}}{\hat{\sigma}_{2}}\right)^2 \right]^{\frac{7}{2}} \left[1+\frac{1}{8}\left(\frac{y_{2}-\hat{\nu}_{2}}{\hat{\tau}_{2}}\right)^2 \right]^{\frac{7}{2}} }{ \left[1+\frac{1}{9}\left(\frac{y_{1}-\hat{\mu}_{1}}{\hat{\sigma}_{1}}\right)^2 \right]^{\frac{8}{2}} \left[1+\frac{1}{9}\left(\frac{y_{2}-\hat{\nu}_{1}}{\hat{\tau}_{1}}\right)^2 \right]^{\frac{8}{2}} } $$

Now, in order to make a decision, you need to think about the consequences of making a wrong classification: is it worse to classify a $1$ as a $2$ than to classify a $2$ as a $1$? If not, then the cut-off is simple: observations with $O(C_{1}|y,D,I)>1$ are classed as $1$, otherwise as $2$. The cut-off will slide up or down depending on which kind of error is more important to avoid. Although for this particular example, the classifier is so good that you hardly need to bother about this.
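Putting the pieces together, here is a hedged Python sketch of the whole rule (the question asks for MATLAB, but the translation is direct; the helper names are mine). It hard-codes the training statistics derived above and applies the odds cut-off $O>1$ to the five held-out observations. The exact odds you get depend on how the training statistics are rounded, but all five test points come out correctly classified, i.e. a test-set error rate of $0/5$:

```python
from math import gamma, pi, sqrt

def st_pdf(x, mu, sigma, df):
    """Location-scale Student t density St(x | mu, sigma, df)."""
    z = (x - mu) / sigma
    return (gamma((df + 1) / 2) / (gamma(df / 2) * sqrt(df * pi) * sigma)
            * (1 + z * z / df) ** (-(df + 1) / 2))

# Training statistics from the first 15 rows (ML means / standard deviations)
mu1, nu1, s1, t1 = 0.3798, -0.1449, 0.8367, 0.8200   # class 1, n1 = 8
mu2, nu2, s2, t2 = 0.6275, 1.6748, 0.8685, 0.5451    # class 2, n2 = 7
n1, n2 = 8, 7

def odds_class1(y1, y2):
    """O(C1 | y, D, I): prior odds times the posterior-predictive ratios."""
    prior = (n1 / 15) / (n2 / 15)
    r1 = (st_pdf(y1, mu1, s1 * sqrt((n1 + 1) / (n1 - 1)), n1 - 1)
          / st_pdf(y1, mu2, s2 * sqrt((n2 + 1) / (n2 - 1)), n2 - 1))
    r2 = (st_pdf(y2, nu1, t1 * sqrt((n1 + 1) / (n1 - 1)), n1 - 1)
          / st_pdf(y2, nu2, t2 * sqrt((n2 + 1) / (n2 - 1)), n2 - 1))
    return prior * r1 * r2

# The five held-out observations and their true classes
test_pts = [(-0.4049, -0.3981, 1), (-0.0913, 2.2094, 2), (0.3376, -1.0467, 1),
            (0.3455, 2.4960, 2), (0.3232, -0.5614, 1)]
for y1, y2, true in test_pts:
    o = odds_class1(y1, y2)
    pred = 1 if o > 1 else 2
    print(f"y=({y1:.4f},{y2:.4f})  odds={o:8.3f}  pred={pred}  true={true}")
```

The error rate on this split is then just the fraction of test points whose predicted class differs from the true one.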

The table below shows the odds for each of the testing observations. Interestingly, attribute 2 is what drives most of the changes in odds; attribute 1 does not appear to be as useful. In fact, you can see that attribute 1 actually introduces more uncertainty in identifying group 2. This is obvious when you note that the mean and variance of attribute 1 are basically the same for each group, while the mean and variance of attribute 2 are quite different between the two groups:

$$ \begin{array}{cc|cc|c} y_{1} & y_{2} & O(C_{1}|y_{1},y_{2},D,I) & O(C_{1}|y_{2},D,I) & \text{True Class} \\ \hline -0.4049 & -0.3981 & 47.0 & 37.0 & 1 \\ -0.0913 & 2.2094 & 0.109 & 0.089 & 2 \\ 0.3376 & -1.0467 & 102.4 & 93.5 & 1 \\ 0.3455 & 2.4960 & 0.104 & 0.095 & 2 \\ 0.3232 & -0.5614 & 54.6 & 49.7 & 1 \end{array} $$