I did a webinar titled "An Introduction to Classification with MATLAB". You can download the code and the dataset from the MATLAB Central File Exchange:
http://www.mathworks.com/matlabcentral/fileexchange/28770-introduction-to-classification
I'm attaching some code directly that might be helpful.
%% Use a Naive Bayes Classifier to develop a classification model
% Some of the features exhibit significant correlation; however, it's
% unclear whether the correlated features will be selected for our model
% Start with a Naive Bayes Classifier
% Use cvpartition to separate the dataset into a test set and a training set
% cvpartition stratifies by class label, so the class proportions are
% roughly the same in the test set and the training set
% Create a cvpartition object that defines the partition
c = cvpartition(Y,'holdout',.2);
% Create a training set
X_Train = X(training(c),:);
Y_Train = Y(training(c));
%% Train a Classifier using the Training Set
Bayes_Model = NaiveBayes.fit(X_Train, Y_Train, 'Distribution','kernel');
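% Note: in newer MATLAB releases, fitcnb replaces the older NaiveBayes.fit
% API used here.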
%% Evaluate Accuracy Using the Test Set
clc
% Generate a confusion matrix
Bayes_Predicted = Bayes_Model.predict(X(test(c),:));
[conf, classorder] = confusionmat(Y(test(c)), Bayes_Predicted);
conf
% Calculate what percentage of the Confusion Matrix is off the diagonal
Bayes_Error = 1 - trace(conf)/sum(conf(:))
%% Naive Bayes Classification using Forward Feature Selection
% Create a cvpartition object that defines the folds
c2 = cvpartition(Y,'kfold',10);
% Set options
opts = statset('display','iter');
fun = @(Xtrain,Ytrain,Xtest,Ytest)...
sum(Ytest~=predict(NaiveBayes.fit(Xtrain,Ytrain,'Distribution','kernel'),Xtest));
[fs,history] = sequentialfs(fun,X,Y,'cv',c2,'options',opts)
White_Wine.Properties.VarNames(fs)
Adding an illustration of how to calculate an ROC curve. Please note: this example uses a bagged decision tree rather than a Naive Bayes classifier.
%% Run Treebagger Using Sequential Feature Selection
tic
f = @(X,Y)oobError(TreeBagger(50,X,Y,'method','classification','oobpred','on'),'mode','ensemble');
opt = statset('display','iter');
[fs,history] = sequentialfs(f,X,Y,'options',opt,'cv','none');
toc
%% Evaluate the accuracy of the model using a performance curve
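% Note: Y_Test, Predicted, and Class_Score are assumed to come from an
% earlier TreeBagger predict step that is not shown in this snippet
% (predict returns labels and per-class scores).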
Test_Results = dataset(Y_Test, Predicted, Class_Score);
% perfcurve needs the true labels, the scores for the positive class,
% and the name of the positive class ('6' here)
[xVal,yVal,~,auc] = perfcurve(Test_Results.Y_Test, ...
Test_Results.Class_Score(:,4),'6');
plot(xVal,yVal)
xlabel('False positive rate'); ylabel('True positive rate')
I am simply copy-pasting the answers I got from Alexandre Passos on Metaoptimize. It would really help if someone here could add more to them.
- Any binary classifier can be used for multiclass with the 1-vs-all reduction, or the all-vs-all reduction (see the sketch after this list). This list seems to cover most
of the common multiclass algorithms.
- Logistic regression and SVMs are linear (kernelized SVMs are linear only in the kernel-induced feature space). Neural networks, decision trees, and kNN aren't
linear. Naive Bayes and discriminant analysis are linear. Random
forests aren't linear.
- Logistic regression can give you calibrated probabilities. So can many SVM implementations (though it requires slightly different
training). Neural networks can do that too, if using the right loss
(softmax). Decision trees and kNN can be probabilistic, though they are
not particularly well calibrated. Naive Bayes does not produce well-calibrated
probabilities, nor does discriminant analysis. I'm not
sure about random forests; it depends on the implementation, I think.
- All are deterministic except for neural networks and random forests.
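To make the 1-vs-all reduction concrete, here is a minimal MATLAB sketch. It assumes a recent Statistics Toolbox with fitcecoc, which wraps binary learners for multiclass problems; 'onevsall' is a documented coding scheme, and fisheriris ships with MATLAB.

load fisheriris                                   % meas: 150x4 features, species: labels
template = templateSVM('Standardize', true);      % binary SVM base learner
Mdl = fitcecoc(meas, species, 'Learners', template, 'Coding', 'onevsall');
cvMdl = crossval(Mdl, 'KFold', 10);               % 10-fold cross-validation
kfoldLoss(cvMdl)                                  % estimated misclassification rate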
Why do you want to compare different classification algorithms? Are
you trying to decide which one is the best in general, or just for one
application?
If the former, it's not worth doing, as most claims are rather
sketchy and there is no method which can give that kind of conclusion.
If the latter, it is well accepted that cross-validation, or comparing
performance on a fixed test set, gives you unbiased results. For
multiclass classification it is not always obvious which metric to
use, but things like accuracy, per-class precision/recall/F1,
per-class AUC, and the confusion matrix are commonly used.
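As a rough sketch of those multiclass metrics in MATLAB (yTrue and yPred are hypothetical vectors of true and predicted labels):

[conf, order] = confusionmat(yTrue, yPred);   % rows = true class, columns = predicted
tp = diag(conf);                              % correct predictions per class
precision = tp ./ sum(conf, 1)';              % column sums = predicted counts per class
recall    = tp ./ sum(conf, 2);               % row sums = actual counts per class
f1 = 2 * (precision .* recall) ./ (precision + recall);
accuracy = sum(tp) / sum(conf(:));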
Best Answer
If you're just playing around with this, may I make two recommendations?
load fisheriris
That said, on to your specific questions.
1 and 2) I'm still not quite sure I follow your question. If you've got strings of As, Bs, and Cs and you want to classify them based on the number of As, Bs, and Cs, why not just count up the As, Bs, and Cs and use that? I guess you could feed those counts into a classifier if the counts are noisy or your data breaks ties in a systematic but unspecified way (e.g., AAABBBCC --> A, AABBBCCC --> C), but otherwise I'm not sure what it buys you.
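For what it's worth, the count-up approach is a one-liner in MATLAB (the string here is hypothetical):

s = 'AAABBBCC';
counts = [sum(s == 'A'), sum(s == 'B'), sum(s == 'C')]   % -> [3 3 2]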
Since I think I'm still missing something, maybe we should make sure we're on the same page about what a feature is? The Fisher Iris data set contains data from iris flowers. There are four features, the length and width of the flower's petal and sepal (the leaf that covers the flower), and three classes (three species of flowers). The length and width features are continuous (like your proportions), but you could also have discrete or binary features too.
Coming up with clever feature representations is one of the harder parts of building a classifier system. It probably depends a lot on the domain and may take some trial and error.
2) kNN is actually pretty flexible. If your features are proportions (or other continuous values), then Euclidean distance is a perfectly reasonable choice. Other distance metrics might be reasonable too, depending on your specific application. One thing to keep in mind is that you want your feature dimensions to be on roughly the same scale. If feature1 ranges from 0-1 and feature2 ranges from 0-100000, then feature1 may be effectively ignored.
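The usual fix is to standardize the features before computing distances; a quick sketch, assuming a feature matrix X with one column per feature:

X_scaled = zscore(X);   % each column rescaled to mean 0, standard deviation 1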
Actually, if you're comparing strings, have you thought about using an edit-distance function like Levenshtein Distance? That might be particularly appropriate if you have several "prototype" strings and you want to figure out which one best matches the (potentially noisy) input. You'd compute the edit distance between your string and the prototypes and pick the one with the lowest distance.
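MATLAB doesn't ship a Levenshtein function here (I believe newer releases have editDistance in the Text Analytics Toolbox), but the classic dynamic program is short. A minimal sketch, with unit costs for insertion, deletion, and substitution:

function d = levenshtein(s, t)
% Edit distance between char arrays s and t via dynamic programming.
m = length(s); n = length(t);
D = zeros(m+1, n+1);
D(:,1) = 0:m;                 % cost of deleting all of s(1:i)
D(1,:) = 0:n;                 % cost of inserting all of t(1:j)
for i = 1:m
    for j = 1:n
        cost = double(s(i) ~= t(j));
        D(i+1,j+1) = min([D(i,j+1) + 1, ...   % delete s(i)
                          D(i+1,j) + 1, ...   % insert t(j)
                          D(i,j) + cost]);    % substitute (or match)
    end
end
d = D(m+1, n+1);
end

For example, levenshtein('AAABBBCC', 'AABBBCCC') returns 2.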
3) No, $k$ is the number of neighbors to consider; it doesn't have to be related to the number of classes. When classifying a new data point with 1-NN, the new point gets the class of the closest data point in your training set. For 5-NN, you assign the most common class of the 5 nearest data points, etc. To avoid ties, $k$ is usually odd for two-class classifiers; I would avoid $k=3$ if you have 3 classes, for the same reason.
4) I don't know much about ROC analysis for multiclass problems. I would be tempted to do either pairwise (A vs B, A vs C, B vs C) or one-versus-all (A vs {BC}, B vs {AC}, C vs {AB}) comparisons. There are a few threads about ROC surfaces; those might let you compare them all at once.
5) Again, for two-class problems, people typically compare the area under the curve: more area under the curve -> better classification. It looks like there is an analogous "volume under the ROC surface" (e.g., He and Frey, 2007) for multiclass problems.
Edit: Some answers to your new questions.
A) Sure. That seems like the fairest way to compare them.
B) The MATLAB function's prototype looks like this (at least for the most recent version):
Class = knnclassify(Sample, Training, Group, k, distance,rule)
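For instance, a hypothetical end-to-end call on the Fisher Iris data (knnclassify lives in the Bioinformatics Toolbox in this era of MATLAB; the variable names here are made up):

load fisheriris                              % meas: features, species: labels
cv = cvpartition(species, 'holdout', 0.3);   % 70/30 train/test split
predicted = knnclassify(meas(test(cv),:), meas(training(cv),:), ...
    species(training(cv)), 5, 'euclidean', 'nearest');

Parameter by parameter: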
Sample: Your test set (the data to be classified). Each row is a single example; the columns are the feature values. Sample(1,:) = your first example, Sample(2,:) = the 2nd, etc.

Training: Training data. The columns have to be the same as Sample, though obviously you can have a different number of rows.

Group: The class labels for the training set. Training(1,:) is from class Group(1), etc. Can either be strings or numbers (numbers is probably easier).

k: Number of neighbors to consider. Say we've got a data point D that we want to classify. If $k=1$, then we find the closest point (see the distance parameter) to D and assign D the same class as that point. For larger $k$, we find the $k$ nearest points and then use them to assign the label. Suppose $k=3$ and the three nearest points to D are from class A, class B, and class A. Since the majority of the points are from class A, we assign D to class A too. Sometimes there are ties (e.g., if the three closest points were from class A, B, and C); the rule parameter determines how those are broken.

distance: Determines which distance metric to use. See the docs for options, but you probably want Euclidean distance, at least to start.

rule: How ties are broken. This obviously only matters if $k>1$. Suppose you set $k=3$. If it's 'consensus', then the classifier doesn't classify examples where all three points are not from the same class (you get a NaN or an empty string, depending on how you arranged the groups). If it's one of the other settings, you get the most common class of the three nearest points. If two or more classes are equally common, then 'random' picks one at random and 'nearest' favors the class of the closest point to break ties.

The classify() function is similar, but has a lot more options (see the docs). However, if you want a stock Naive Bayes classifier, I think something like
classify(sample, train, groups, 'diaglinear', 'empirical');
will work.

C) For ROC analysis, you generally don't provide the features! Instead, you need the true class label, the classifier's output, and a "score" value that gives an estimate of how confident the classifier is in its output. See the MATLAB function perfcurve for the two-class version; you'll have to roll your own, I think, for ROC surfaces.
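A rough two-class sketch using fisheriris (the posterior column picked below is an assumption; check the group ordering in your own run):

load fisheriris
% Keep two of the three iris classes so the problem is binary.
idx = ~strcmp(species, 'setosa');
X2 = meas(idx,:);  Y2 = species(idx);
c2 = cvpartition(Y2, 'holdout', 0.3);
% classify returns posterior probabilities as its third output.
[pred, err, posterior] = classify(X2(test(c2),:), X2(training(c2),:), ...
    Y2(training(c2)), 'diaglinear');
% perfcurve wants true labels, scores for the positive class, and the
% positive class name.
[fp, tp, ~, auc] = perfcurve(Y2(test(c2)), posterior(:,2), 'virginica');
plot(fp, tp); xlabel('False positive rate'); ylabel('True positive rate')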