I did a webinar titled "An Introduction to Classification with MATLAB". You can download the code and the dataset from the MATLAB Central File Exchange:
http://www.mathworks.com/matlabcentral/fileexchange/28770-introduction-to-classification
I'm attaching some code directly that might be helpful.
%% Use a Naive Bayes Classifier to develop a classification model
% Some of the features exhibit significant correlation; however, it's
% unclear whether the correlated features will be selected for our model
% Start with a Naive Bayes Classifier
% Use cvpartition to separate the dataset into a test set and a training set
% cvpartition stratifies by class label, so the class proportions are
% roughly the same in the test set and the training set
% Create a cvpartition object that defines the partition
c = cvpartition(Y,'holdout',.2);
% Create a training set
X_Train = X(training(c),:);
Y_Train = Y(training(c));
%% Train a Classifier using the Training Set
Bayes_Model = NaiveBayes.fit(X_Train, Y_Train, 'Distribution','kernel');
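% Note: in newer MATLAB releases, fitcnb replaces the older NaiveBayes.fit
% API used here.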
%% Evaluate Accuracy Using the Test Set
clc
% Generate a confusion matrix
Bayes_Predicted = Bayes_Model.predict(X(test(c),:));
[conf, classorder] = confusionmat(Y(test(c)), Bayes_Predicted);
conf
% Calculate what percentage of the Confusion Matrix is off the diagonal
Bayes_Error = 1 - trace(conf)/sum(conf(:))
%% Naive Bayes Classification using Forward Feature Selection
% Create a cvpartition object that defines the folds
c2 = cvpartition(Y,'kfold',10);
% Set options
opts = statset('display','iter');
fun = @(Xtrain,Ytrain,Xtest,Ytest)...
sum(Ytest~=predict(NaiveBayes.fit(Xtrain,Ytrain,'Distribution','kernel'),Xtest));
[fs,history] = sequentialfs(fun,X,Y,'cv',c2,'options',opts)
White_Wine.Properties.VarNames(fs)
Adding an illustration of how to calculate an ROC curve. Please note: this example uses a bagged decision tree rather than a Naive Bayes classifier.
%% Run Treebagger Using Sequential Feature Selection
tic
f = @(X,Y)oobError(TreeBagger(50,X,Y,'method','classification','oobpred','on'),'mode','ensemble');
opt = statset('display','iter');
[fs,history] = sequentialfs(f,X,Y,'options',opt,'cv','none');
toc
%% Evaluate the accuracy of the model using a performance curve
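% Note: Y_Test, Predicted, and Class_Score are assumed to come from an
% earlier TreeBagger predict step that is not shown in this snippet
% (predict returns labels and per-class scores).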
Test_Results = dataset(Y_Test, Predicted, Class_Score);
% perfcurve needs the true labels, the scores for the positive class,
% and the name of the positive class ('6' here)
[xVal,yVal,~,auc] = perfcurve(Test_Results.Y_Test, ...
Test_Results.Class_Score(:,4),'6');
plot(xVal,yVal)
xlabel('False positive rate'); ylabel('True positive rate')
I am simply copy-pasting the answers I got from Alexandre Passos on Metaoptimize. It would really help if someone here could add more to them.
- Any binary classifier can be used for multiclass with the 1-vs-all reduction, or the all-vs-all reduction (see the sketch after this list). This list seems to cover most
of the common multiclass algorithms.
- Logistic regression and SVMs are linear (kernelized SVMs are linear only in the kernel-induced feature space). Neural networks, decision trees, and kNN aren't
linear. Naive Bayes and discriminant analysis are linear. Random
forests aren't linear.
- Logistic regression can give you calibrated probabilities. So can many SVM implementations (though it requires slightly different
training). Neural networks can do that too, if using the right loss
(softmax). Decision trees and kNN can be probabilistic, though they are
not particularly well calibrated. Naive Bayes does not produce well-calibrated
probabilities, nor does discriminant analysis. I'm not
sure about random forests; it depends on the implementation, I think.
- All are deterministic except for neural networks and random forests.
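To make the 1-vs-all reduction concrete, here is a minimal MATLAB sketch. It assumes a recent Statistics Toolbox with fitcecoc, which wraps binary learners for multiclass problems; 'onevsall' is a documented coding scheme, and fisheriris ships with MATLAB.

load fisheriris                                   % meas: 150x4 features, species: labels
template = templateSVM('Standardize', true);      % binary SVM base learner
Mdl = fitcecoc(meas, species, 'Learners', template, 'Coding', 'onevsall');
cvMdl = crossval(Mdl, 'KFold', 10);               % 10-fold cross-validation
kfoldLoss(cvMdl)                                  % estimated misclassification rate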
Why do you want to compare different classification algorithms? Are
you trying to decide which one is the best in general, or just for one
application?
If the former, it's not worth doing, as most claims are rather
sketchy and there is no method which can give that kind of conclusion.
If the latter, it is well accepted that cross-validation, or comparing
performance on a fixed test set, gives you unbiased results. For
multiclass classification it is not always obvious which metric to
use, but things like accuracy, per-class precision/recall/F1,
per-class AUC, and the confusion matrix are commonly used.
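As a rough sketch of those multiclass metrics in MATLAB (yTrue and yPred are hypothetical vectors of true and predicted labels):

[conf, order] = confusionmat(yTrue, yPred);   % rows = true class, columns = predicted
tp = diag(conf);                              % correct predictions per class
precision = tp ./ sum(conf, 1)';              % column sums = predicted counts per class
recall    = tp ./ sum(conf, 2);               % row sums = actual counts per class
f1 = 2 * (precision .* recall) ./ (precision + recall);
accuracy = sum(tp) / sum(conf(:));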
Best Answer
If you're just playing around with this, may I make two recommendations?
load fisheriris
That said, on to your specific questions.
1 and 2) I'm still not quite sure I follow your question. If you've got strings of As, Bs, and Cs and you want to classify them based on the number of As, Bs, and Cs, why not just count up the As, Bs, and Cs and use that? I guess you could feed those counts into a classifier if the counts are noisy or your data breaks ties in a systematic but unspecified way (e.g., AAABBBCC --> A, AABBBCCC --> C), but otherwise I'm not sure what it buys you.
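For what it's worth, the count-up approach is a one-liner in MATLAB (the string here is hypothetical):

s = 'AAABBBCC';
counts = [sum(s == 'A'), sum(s == 'B'), sum(s == 'C')]   % -> [3 3 2]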
Since I think I'm still missing something, maybe we should make sure we're on the same page about what a feature is? The Fisher Iris data set contains data from iris flowers. There are four features, the length and width of the flower's petal and sepal (the leaf that covers the flower), and three classes (three species of flowers). The length and width features are continuous (like your proportions), but you could also have discrete or binary features too.
Coming up with clever feature representations is one of the harder parts of building a classifier system. It probably depends a lot on the domain and may take some trial and error.
2) kNN is actually pretty flexible. If your features are proportions (or other continuous values), then Euclidean distance is a perfectly reasonable choice. Other distance metrics might be reasonable too, depending on your specific application. One thing to keep in mind is that you want your feature dimensions to be on roughly the same scale. If feature1 ranges from 0-1 and feature2 ranges from 0-100000, then feature1 may be effectively ignored.
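The usual fix is to standardize the features before computing distances; a quick sketch, assuming a feature matrix X with one column per feature:

X_scaled = zscore(X);   % each column rescaled to mean 0, standard deviation 1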
Actually, if you're comparing strings, have you thought about using an edit-distance function like Levenshtein Distance? That might be particularly appropriate if you have several "prototype" strings and you want to figure out which one best matches the (potentially noisy) input. You'd compute the edit distance between your string and the prototypes and pick the one with the lowest distance.
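MATLAB doesn't ship a Levenshtein function here (I believe newer releases have editDistance in the Text Analytics Toolbox), but the classic dynamic program is short. A minimal sketch, with unit costs for insertion, deletion, and substitution:

function d = levenshtein(s, t)
% Edit distance between char arrays s and t via dynamic programming.
m = length(s); n = length(t);
D = zeros(m+1, n+1);
D(:,1) = 0:m;                 % cost of deleting all of s(1:i)
D(1,:) = 0:n;                 % cost of inserting all of t(1:j)
for i = 1:m
    for j = 1:n
        cost = double(s(i) ~= t(j));
        D(i+1,j+1) = min([D(i,j+1) + 1, ...   % delete s(i)
                          D(i+1,j) + 1, ...   % insert t(j)
                          D(i,j) + cost]);    % substitute (or match)
    end
end
d = D(m+1, n+1);
end

For example, levenshtein('AAABBBCC', 'AABBBCCC') returns 2.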
3) No, $k$ is the number of neighbors to consider; it doesn't have to be related to the number of classes. When classifying a new data point with 1-NN, the new point gets the class of the closest data point in your training set. For 5-NN, you assign the most common class of the 5 nearest data points, etc. To avoid ties, $k$ is usually odd for two-class classifiers; I would avoid $k=3$ if you have 3 classes, for the same reason.
4) I don't know much about ROC analysis for multiclass problems. I would be tempted to do either pairwise (A vs B, A vs C, B vs C) or one-versus-all (A vs {BC}, B vs {AC}, C vs {AB}) comparisons. There are a few threads about ROC surfaces; those might let you compare them all at once.
5) Again, for two-class problems, people typically compare the area under the curve: more area under the curve -> better classification. It looks like there is an analogous "volume under the ROC surface" (e.g., He and Frey, 2007) for multiclass problems.
Edit: Some answers to your new questions.
A) Sure. That seems like the fairest way to compare them.
B) The MATLAB function's prototype looks like this (at least for the most recent version):
Class = knnclassify(Sample, Training, Group, k, distance,rule)
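For instance, a hypothetical end-to-end call on the Fisher Iris data (knnclassify lives in the Bioinformatics Toolbox in this era of MATLAB; the variable names here are made up):

load fisheriris                              % meas: features, species: labels
cv = cvpartition(species, 'holdout', 0.3);   % 70/30 train/test split
predicted = knnclassify(meas(test(cv),:), meas(training(cv),:), ...
    species(training(cv)), 5, 'euclidean', 'nearest');

Parameter by parameter: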
Sample: Your test set (the data to be classified). Each row is a single example; the columns are the feature values. Sample(1,:) = your first example, Sample(2,:) = the 2nd, etc.

Training: Training data. The columns have to be the same as Sample, though obviously you can have a different number of rows.

Group: The class labels for the training set. Training(1,:) is from class Group(1), etc. Can either be strings or numbers (numbers is probably easier).

k: Number of neighbors to consider. Say we've got a data point D that we want to classify. If $k=1$, then we find the closest point (see the distance parameter) to D and assign D the same class as that point. For larger $k$, we find the $k$ nearest points and then use them to assign the label. Suppose $k=3$ and the three nearest points to D are from class A, class B, and class A. Since the majority of the points are from class A, we assign D to class A too. Sometimes there are ties (e.g., if the three closest points were from class A, B, and C); the rule parameter determines how those are broken.

distance: Determines which distance metric to use. See the docs for options, but you probably want Euclidean distance, at least to start.

rule: How ties are broken. This obviously only matters if $k>1$. Suppose you set $k=3$. If it's 'consensus', then the classifier doesn't classify examples where all three points are not from the same class (you get a NaN or an empty string, depending on how you arranged the groups). If it's one of the other settings, you get the most common class of the three nearest points. If two or more classes are equally common, then 'random' picks one at random and 'nearest' favors the class of the closest point to break ties.

The classify() function is similar, but has a lot more options (see the docs). However, if you want a stock Naive Bayes classifier, I think something like
classify(sample, train, groups, 'diaglinear', 'empirical');
will work.

C) For ROC analysis, you generally don't provide the features! Instead, you need the true class label, the classifier's output, and a "score" value that gives an estimate of how confident the classifier is in its output. See the MATLAB function perfcurve for the two-class version; you'll have to roll your own, I think, for ROC surfaces.
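A rough two-class sketch using fisheriris (the posterior column picked below is an assumption; check the group ordering in your own run):

load fisheriris
% Keep two of the three iris classes so the problem is binary.
idx = ~strcmp(species, 'setosa');
X2 = meas(idx,:);  Y2 = species(idx);
c2 = cvpartition(Y2, 'holdout', 0.3);
% classify returns posterior probabilities as its third output.
[pred, err, posterior] = classify(X2(test(c2),:), X2(training(c2),:), ...
    Y2(training(c2)), 'diaglinear');
% perfcurve wants true labels, scores for the positive class, and the
% positive class name.
[fp, tp, ~, auc] = perfcurve(Y2(test(c2)), posterior(:,2), 'virginica');
plot(fp, tp); xlabel('False positive rate'); ylabel('True positive rate')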