If you're just playing around with this, may I make two recommendations?
- Start out with one of the classic datasets. This will let you compare your results with other published data. There are a lot floating around the internet (e.g., from the UCI Repository). Some are even built into matlab--you can load the Fisher Iris data set by running
load fisheriris
- If possible, consider a two-class problem first. Both the training and evaluation get more complicated with multiple classes.
That said, on to your specific questions.
1 and 2) I'm still not quite sure I follow your question. If you've got strings of As, Bs, and Cs and you want to classify them based on the number of As, Bs, and Cs, why not just count up the As, Bs, and Cs and use that? I guess you could feed those counts into a classifier if the counts are noisy or your data breaks ties in a systematic but unspecified way (e.g., AAABBBCC --> A, AABBBCCC --> C), but otherwise I'm not sure what it buys you.
Since I think I'm still missing something, maybe we should make sure we're on the same page about what a feature is? The Fisher Iris data set contains data from iris flowers. There are four features--the length and width of the flower's petal and of its sepal (the leaf-like part that covers the flower)--and three classes (three species of iris). The length and width features are continuous (like your proportions), but you could also have discrete or binary features too.
Coming up with clever feature representations is one of the harder parts of building a classifier system. It probably depends a lot on the domain and may take some trial and error.
2) kNN is actually pretty flexible. If your features are proportions (or other continuous values), then Euclidean distance is a perfectly reasonable choice. Other distance metrics might be reasonable too, depending on your specific application. One thing to keep in mind is that you want your feature dimensions to be on about the same scale. If feature1 ranges from 0-1 and feature2 ranges from 0-100000, then feature1 may be effectively ignored by the distance calculation.
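As a rough sketch of that rescaling (in Python rather than Matlab, with made-up data), z-scoring each feature column puts both dimensions on a comparable footing before computing distances:

```python
import numpy as np

# Hypothetical toy data: feature 1 ranges roughly 0-1,
# feature 2 ranges roughly 0-100000.
X = np.array([[0.2, 30000.0],
              [0.9, 31000.0],
              [0.5, 90000.0]])

# Z-score each column: subtract the column mean, divide by the
# column standard deviation.  After this, each feature has mean 0
# and standard deviation 1, so neither dominates Euclidean distance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
```

(You'd compute the means and standard deviations on the training set only and reuse them to scale the test set.)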
Actually, if you're comparing strings, have you thought about using an edit-distance function like Levenshtein Distance? That might be particularly appropriate if you have several "prototype" strings and you want to figure out which one best matches the (potentially noisy) input. You'd compute the edit distance between your string and the prototypes and pick the one with the lowest distance.
3) No, $k$ is the number of neighbors to consider; it doesn't have to be related to the number of classes. When classifying a new data point with 1-NN, the new point gets the class of the closest data point in your training set. For 5-NN, you assign the most common class of the 5 nearest data points, etc. To avoid ties, $k$ is usually odd for two-class classifiers; I would avoid $k=3$ if you have 3 classes, for the same reason.
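The majority-vote scheme described above can be sketched in a few lines (Python here for illustration; the toy points and labels are made up):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Label point x by majority vote among its k nearest training
    points, using Euclidean distance.  Ties in the vote are broken
    arbitrarily here (Counter keeps first-seen order)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy two-class data: three points near the origin (class A),
# three points near (5, 5) (class B).
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])
```

With this data, `knn_predict(X_train, y_train, np.array([0.2, 0.2]), k=3)` returns `"A"`.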
4) I don't know much about ROC analysis for multiclass problems. I would be tempted to do either pairwise comparisons (A vs B, A vs C, B vs C) or one-versus-all (A vs {BC}, B vs {AC}, C vs {AB}). There are a few threads about ROC surfaces; those might let you compare all the classes at once.
5) Again, for two-class problems, people typically compare the area under the curve: More area under the curve -> better classification. It looks like there is an analogous "volume under the ROC surface" (e.g., He and Frey, 2007) for multiclass problems.
Edit: Some answers to your new questions
A) Sure. That seems like the fairest way to compare them.
B) The matlab function's prototype looks like this (at least for the most recent version):
Class = knnclassify(Sample, Training, Group, k, distance, rule)
Sample
: Your test set (the data to be classified). Each row is a single example; the columns are the feature values. Sample(1,:) = your first example. Sample(2,:) = the 2nd, etc.
Training
: Training data. The columns have to be the same as Sample, though obviously you can have a different number of rows.
Group
: The class labels for the training set. Training(1,:) is from class Group(1), etc. These can be either strings or numbers (numbers are probably easier).
k
: Number of neighbors to consider. Say we've got a data point D that we want to classify. If k=1, then we find the closest point (see next parameter) to D and assign D the same class as that point. For larger $k$, we find the $k$ nearest points and then use them to assign the label. Suppose $k=3$ and the three nearest points to D are from class A, class B, and class A. Since the majority of the points are from class A, we assign D to class A too. Sometimes there are ties (e.g., if the three closest points were from class A, B, and C). The rule
parameter determines how those are broken.
distance
: Determines which distance metric to use. See the docs for options, but you probably want Euclidean distance, at least to start.
rule
: How ties are broken. This obviously only matters if $k>1$. Suppose you set $k=3$. If it's 'consensus', then the classifier doesn't classify examples where all three points are not from the same class (you get a nan or empty string, depending on how you arranged groups). If it's one of the other settings, you get the most common class of the three nearest points. If two or more classes are equally common, then 'random' picks one at random and 'nearest' favors the class of the closest point to break ties.
The classify() function is similar, but has a lot more options (see the docs). However, if you want a stock Naive Bayes classifier, I think something like
classify(sample, train, groups, 'diaglinear', 'empirical');
will work.
C) For ROC analysis, you generally don't provide the features! Instead, you need the true class label, the classifier's output, and a "score" value that gives an estimate of how confident the classifier is in its output. See the matlab function perfcurve for the two-class version; you'll have to roll your own, I think, for ROC surfaces.
The general term Naive Bayes refers to the strong independence assumptions in the model, rather than to the particular distribution of each feature. A Naive Bayes model assumes that each of the features it uses is conditionally independent of the others given the class. More formally, if I want to calculate the probability of observing features $f_1$ through $f_n$, given some class $c$, then under the Naive Bayes assumption the following holds:
$$ p(f_1,..., f_n|c) = \prod_{i=1}^n p(f_i|c)$$
This means that when I want to use a Naive Bayes model to classify a new example, the posterior probability is much simpler to work with:
$$ p(c|f_1,...,f_n) \propto p(c)p(f_1|c)...p(f_n|c) $$
Of course, these independence assumptions are rarely true, which may explain why some have referred to the model as the "Idiot Bayes" model. In practice, however, Naive Bayes models have performed surprisingly well, even on complex tasks where it is clear that the strong independence assumptions are false.
Up to this point we have said nothing about the distribution of each feature. In other words, we have left $p(f_i|c)$ undefined. The term Multinomial Naive Bayes simply lets us know that each $p(f_i|c)$ is a multinomial distribution, rather than some other distribution. This works well for data which can easily be turned into counts, such as word counts in text.
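To make the multinomial case concrete, here's a toy sketch (in Python; the training strings, class names, and alphabet are all made up) that classifies strings of As and Bs by their letter counts, using add-one (Laplace) smoothing and a uniform prior:

```python
import math
from collections import Counter

# Hypothetical training data: strings labeled by which letter dominates.
train = [("AAAB", "mostly_A"), ("AABA", "mostly_A"),
         ("BBBA", "mostly_B"), ("BABB", "mostly_B")]
alphabet = "AB"

# Per-class letter counts, i.e., the sufficient statistics for
# the multinomial p(letter | class).
counts = {c: Counter() for _, c in train}
for s, c in train:
    counts[c].update(s)

def log_likelihood(s, c):
    """log p(s | c) under the multinomial model, with add-one smoothing."""
    total = sum(counts[c].values()) + len(alphabet)
    return sum(math.log((counts[c][ch] + 1) / total) for ch in s)

def classify(s):
    # With a uniform prior, the prior term drops out of the argmax.
    return max(counts, key=lambda c: log_likelihood(s, c))
```

For example, `classify("AAAA")` returns `"mostly_A"`. Note that the log of the product $\prod_i p(f_i|c)$ becomes a sum of logs, which avoids numerical underflow for long strings.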
The distribution you had been using with your Naive Bayes classifier is a Gaussian p.d.f., so I guess you could call it a Gaussian Naive Bayes classifier.
In summary, "Naive Bayes classifier" is a general term referring to the conditional independence of the features in the model, while "Multinomial Naive Bayes classifier" is a specific instance of a Naive Bayes classifier that uses a multinomial distribution for each feature.
References:
Stuart J. Russell and Peter Norvig. 2003. Artificial Intelligence: A Modern Approach (2nd ed.). Pearson Education. See p. 499 for the reference to "idiot Bayes" as well as the general definition of the Naive Bayes model and its independence assumptions.
They are not the same, but in your case they could be used for the same purpose.
The optimal Bayes classifier is
$$ \DeclareMathOperator*{\argmax}{arg\,max} \argmax_{c \in C} p(c|X) $$
i.e., among all hypotheses, take the $c$ that maximizes the posterior probability. By Bayes' theorem,
$$ \underbrace{p(c|X)}_{\text{posterior}} \propto \underbrace{p(X|c)}_{\text{likelihood}} \underbrace{p(c)}_{\text{prior}} $$
but with a uniform prior (all $c$ are equally likely, so $p(c) \propto 1$) this reduces to the likelihood function:
$$ p(c|X) \propto p(X|c) $$
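A tiny numerical sketch of that reduction (in Python; the likelihood values for the three hypothetical classes are made up): under a uniform prior, the class that maximizes the unnormalized posterior is exactly the class that maximizes the likelihood.

```python
# Made-up likelihoods p(X | c) for three hypothetical classes.
likelihood = {"A": 0.02, "B": 0.05, "C": 0.01}
prior = {c: 1 / 3 for c in likelihood}  # uniform prior p(c)

# Unnormalized posterior p(c | X) ∝ p(X | c) p(c).
posterior_unnorm = {c: likelihood[c] * prior[c] for c in likelihood}

map_class = max(posterior_unnorm, key=posterior_unnorm.get)  # MAP choice
ml_class = max(likelihood, key=likelihood.get)               # ML choice
# With a uniform prior, map_class and ml_class coincide.
```

With a non-uniform prior, the two can of course disagree, which is the whole point of keeping the $p(c)$ term.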
The difference between maximizing the likelihood function and comparing likelihood ratios is that with a likelihood ratio you compare only two likelihoods, while in maximizing the likelihood you may consider multiple hypotheses. So if you have only two hypotheses, they will do essentially the same thing. However, imagine that you had many classes; in that case comparing each of them with all the others, pair by pair, would be a really inefficient way to go.
Notice that the likelihood ratio also serves a purpose beyond finding which of two models has the greater likelihood. The likelihood ratio can be used for hypothesis testing, and it tells you how much more (or less) likely one of the models is compared to the other. Moreover, you can do the same when comparing posterior distributions by using a Bayes factor in similar fashion.