Solved – How to find classifier success performance

classification, MATLAB, pattern recognition

EDIT
I am experimenting with pattern recognition techniques. To get a grip on them, I have tried to develop a simple classifier that categorizes strings into 3 classes, labeled A, B, and C, based on letter frequency counts. Each label indicates which letter dominates: if a string of fixed length L=8 contains more As than any other letter, it is classified under A, and so on. So these numerical results form the features. Is there code that plots and reports the success rate of classifiers when they all use the same sample and training sets? The data points are the features, which give the frequency count of letters in a string. I am interested in using k-NN, a Bayes classifier, and Principal Component Analysis (PCA). I am aware of the cross-validation function but have been unable to use it for this purpose. There are 3 classes, each containing 100 rows of single-column data, and an unknown sample of the same size.

  1. I just want to confirm: would all the other generic classes also have these same probabilistic numbers as their features? If so, how do I work with k-NN, since it computes the distance between the strings (in this case I guess it would simply be ASCII values or Euclidean distance), whereas my classifier computes the frequency count?
  2. What would be the features for k-NN, Bayes, and PCA?
  3. For k-NN, would k = number of classes - 1?
  4. I have plotted the ROC for my classifier. By design, a string is always classified under one of the classes, so each ROC curve represents one class, and there are 3 such curves. Is this approach OK?
  5. How do I proceed with a comparative ROC for all 4 classifiers?

Best Answer

If you're just playing around with this, may I make two recommendations?

  • Start out with one of the classic datasets. This will let you compare your results with other published results. There are a lot of them floating around the internet (e.g., in the UCI Repository). Some are even built into MATLAB: you can load the Fisher Iris data set by running load fisheriris.
  • If possible, consider a two-class problem first. Both the training and evaluation get more complicated with multiple classes.

That said, onto your specific questions.

1 and 2) I'm still not quite sure I follow your question. If you've got strings of As, Bs, and Cs and you want to classify them based on the number of As, Bs, and Cs, why not just count up the As, Bs, and Cs and use that? I guess you could feed those counts into a classifier if the counts are noisy or your data breaks ties in a systematic but unspecified way (e.g., AAABBBCC --> A, AABBBCCC --> C), but otherwise I'm not sure what it buys you.
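
To make the "just count them" idea concrete, here is a minimal Python sketch (Python rather than MATLAB, purely for illustration). The function names and the tie rule are my own assumptions; the question does not say how ties should be broken.

```python
from collections import Counter

ALPHABET = "ABC"  # label set taken from the question


def letter_proportions(s):
    """Return the fraction of A, B and C in the string, in that order."""
    counts = Counter(s)
    return [counts[ch] / len(s) for ch in ALPHABET]


def classify_by_count(s):
    """Label the string with its most frequent letter.

    Ties go to the letter listed first in ALPHABET. That rule is an
    arbitrary assumption for this sketch, not something the question states.
    """
    counts = Counter(s)
    return max(ALPHABET, key=lambda ch: counts[ch])


print(letter_proportions("AAABBBCC"))  # [0.375, 0.375, 0.25]
print(classify_by_count("AAABBBCC"))   # A (A and B tie, A is listed first)
print(classify_by_count("AABBBCCC"))   # B (B and C tie, B is listed first)
```

The proportions returned by letter_proportions are the kind of continuous feature vector you could hand to k-NN or a Bayes classifier if plain counting turns out not to be enough.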

Since I think I'm still missing something, maybe we should make sure we're on the same page about what a feature is? The Fisher Iris data set contains data from Iris flowers. There are four features: the length and width of the flower's petal and sepal (the leaf that covers the flower) and three classes (three species of flowers). The length and width features are continuous (like your proportions), but you could also have discrete or binary features too.

Coming up with clever feature representations is one of the harder parts of building a classifier system. It probably depends a lot on the domain and may take some trial and error.

2) kNN is actually pretty flexible. If your features are proportions (or other continuous values), then Euclidean distance is a perfectly reasonable choice. Other distance metrics might be reasonable too, depending on your specific application. One thing to keep in mind is that you want your feature dimensions to be on about the same scale. If feature1 ranges from 0-1 and feature2 ranges from 0-100000, then feature2 dominates the distance and feature1 may effectively be ignored.
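
The scaling issue is easy to see with a small Python sketch. The data below is hypothetical, and min-max scaling is just one common way to put features on a similar scale (z-scoring is another); treat it as an illustration, not a recommendation for your data.

```python
import math


def euclidean(p, q):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def min_max_scale(rows):
    """Rescale each column to [0, 1] using that column's min and max."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Hypothetical data: feature1 lies in [0, 1], feature2 in [0, 100000].
data = [
    [0.1, 50000.0],   # index 0: the query point
    [0.9, 50100.0],   # index 1: far away in feature1, close in feature2
    [0.1, 52000.0],   # index 2: identical in feature1, further in feature2
    [0.0, 0.0],       # index 3: sets the lower end of both ranges
    [1.0, 100000.0],  # index 4: sets the upper end of both ranges
]

candidates = range(1, len(data))
raw_near = min(candidates, key=lambda i: euclidean(data[0], data[i]))
scaled = min_max_scale(data)
scaled_near = min(candidates, key=lambda i: euclidean(scaled[0], scaled[i]))
print(raw_near, scaled_near)  # 1 2
```

On the raw values the nearest neighbour is picked almost entirely by feature2; after rescaling, the large difference in feature1 counts too and the nearest neighbour changes.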

Actually, if you're comparing strings, have you thought about using an edit-distance function like Levenshtein Distance? That might be particularly appropriate if you have several "prototype" strings and you want to figure out which one best matches the (potentially noisy) input. You'd compute the edit distance between your string and the prototypes and pick the one with the lowest distance.
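
A minimal Python sketch of that idea, assuming one made-up prototype string per class (the PROTOTYPES table and nearest_prototype helper are hypothetical names for illustration):

```python
def levenshtein(a, b):
    """Edit distance: fewest insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical prototypes, one per class.
PROTOTYPES = {"A": "AAAAAAAA", "B": "BBBBBBBB", "C": "CCCCCCCC"}


def nearest_prototype(s):
    """Return the label whose prototype has the smallest edit distance to s."""
    return min(PROTOTYPES, key=lambda label: levenshtein(s, PROTOTYPES[label]))


print(levenshtein("kitten", "sitting"))  # 3
print(nearest_prototype("AABAACAA"))     # A
```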

3) No, $k$ is the number of neighbors to consider; it doesn't have to be related to the number of classes. When classifying a new data point with 1-NN, the new point gets the class of the closest data point in your training set. For 5-NN, you're going to assign the most common class of the 5 nearest data points, etc. To avoid ties, $k$ is usually odd for two-class classifiers; I would avoid $k=3$ if you have 3 classes, for the same reason.

4) I don't know much about ROC analysis for multiclass problems. I would be tempted to do either pairwise (A vs B, A vs C, B vs C) or one-versus-all (A vs {BC}, B vs {AC}, C vs {AB}). There are a few threads about ROC surfaces; that might let you compare them all at once.

5) Again, for two-class problems, people typically compare the area under the curve: More area under the curve -> better classification. It looks like there is an analogous "volume under the ROC surface" (e.g., He and Frey, 2007) for multiclass problems.
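
For the one-versus-all route, the area under each class's curve can be computed directly from the scores, using the fact that AUC equals the probability that a randomly chosen positive example outscores a randomly chosen negative one. The Python sketch below uses made-up labels and scores; it is the ordinary two-class AUC applied once per class, not the "volume under the surface" measure.

```python
def auc(labels, scores, positive):
    """One-versus-all AUC for the class `positive`.

    Computed as the fraction of (positive, negative) pairs in which the
    positive example has the higher score; tied pairs count one half.
    """
    pos = [s for l, s in zip(labels, scores) if l == positive]
    neg = [s for l, s in zip(labels, scores) if l != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical true labels and per-class confidence scores for six examples.
true_labels = ["A", "A", "B", "B", "C", "C"]
class_scores = {
    "A": [0.75, 0.5, 0.25, 0.125, 0.125, 0.375],
    "B": [0.125, 0.375, 0.5, 0.25, 0.125, 0.5],
    "C": [0.125, 0.125, 0.25, 0.625, 0.75, 0.125],
}

for label in "ABC":
    print(label, auc(true_labels, class_scores[label], label))
# A 1.0
# B 0.6875
# C 0.625
```

To compare several classifiers, you would compute these per-class numbers for each classifier on the same test set and compare them class by class (or average them).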

Edit: Some answers to your new questions.

A) Sure. That seems like the fairest way to compare them.

B) The MATLAB function's prototype looks like this (at least for the most recent version): Class = knnclassify(Sample, Training, Group, k, distance, rule)

Sample: Your test set (the data to be classified). Each row is a single example; the columns are the feature values. Sample(1,:) = your first example. Sample(2,:) = the 2nd, etc.

Training: Training data. The columns have to be the same as Sample, though obviously you can have a different number of rows.

Group: The class label for the training set. Training(1,:) is from class Group(1), etc. Can either be strings or numbers (numbers is probably easier).

k: Number of neighbors to consider. Say we've got a data point D that we want to classify. If $k=1$, then we find the closest point (see next parameter) to D and assign D the same class as that point. For larger $k$, we find the $k$ nearest points, and then use them to assign the label. Suppose $k=3$ and the three nearest points to D are from class A, class B, and class A. Since the majority of the points are from class A, we assign D to class A too. Sometimes, there are ties (e.g., if the three closest points were from class A, B, and C). The rule parameter determines how those are broken.

distance: Determines which distance metric to use. See the docs for options, but you probably want Euclidean distance, at least to start.

rule: How ties are broken. This obviously only matters if $k>1$. Suppose you set $k=3$. If it's 'consensus', then the classifier only classifies an example when all three nearest points are from the same class; otherwise you get a NaN or empty string, depending on how you arranged groups. If it's one of the other settings, you get the most common class of the three nearest points. If two or more classes are equally common, then 'random' picks one at random and 'nearest' favors the class of the closest point to break ties.
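
The voting and tie-breaking behaviour described above can be sketched in a few lines of Python. This is my own illustration of the idea with hypothetical data, not a port of knnclassify: it classifies a single point, always uses Euclidean distance, and returns None where the description above mentions NaN or an empty string.

```python
import math
import random
from collections import Counter


def knn_classify(sample, training, group, k=1, rule="nearest", rng=None):
    """Classify one feature vector by a vote among its k nearest neighbours.

    rule='nearest'   : vote ties go to the tied class of the closest neighbour
    rule='random'    : vote ties are broken at random
    rule='consensus' : return None unless all k neighbours agree
    """
    order = sorted(range(len(training)),
                   key=lambda i: math.dist(sample, training[i]))
    neighbours = [group[i] for i in order[:k]]  # labels, closest first
    votes = Counter(neighbours)
    if rule == "consensus":
        return neighbours[0] if len(votes) == 1 else None
    top = max(votes.values())
    tied = [label for label in votes if votes[label] == top]
    if len(tied) == 1:
        return tied[0]
    if rule == "random":
        return (rng or random).choice(tied)
    # 'nearest': first label, walking outward, that belongs to a tied class
    return next(label for label in neighbours if label in tied)


# Hypothetical 2-D training data; distances from the origin are 1, 2, 3, 4.
training = [[1, 0], [0, 2], [-3, 0], [0, -4]]
group = ["A", "B", "C", "B"]
d = [0, 0]

print(knn_classify(d, training, group, k=1))                    # A
print(knn_classify(d, training, group, k=3, rule="nearest"))    # A (three-way tie)
print(knn_classify(d, training, group, k=3, rule="consensus"))  # None
print(knn_classify(d, training, group, k=4))                    # B (two votes)
```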

The classify() function is similar, but has a lot more options (see the docs). However, if you want a stock Naive Bayes classifier, I think something like classify(sample, train, groups, 'diaglinear', 'empirical'); will work.
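
If it helps to see what such a classifier does under the hood, here is a rough Python sketch of a Gaussian naive Bayes with one variance per feature shared across classes and priors taken from the class frequencies in the training data. That is my reading of what 'diaglinear' with 'empirical' amounts to; it is an assumption, not MATLAB's implementation. The names and data are hypothetical.

```python
import math
from collections import defaultdict


def fit_diag_linear(train, groups):
    """Estimate class means, pooled per-feature variances and empirical priors."""
    by_class = defaultdict(list)
    for row, g in zip(train, groups):
        by_class[g].append(row)
    n, d = len(train), len(train[0])
    means = {g: [sum(col) / len(rows) for col in zip(*rows)]
             for g, rows in by_class.items()}
    # Pooled within-class variance for each feature (shared by all classes).
    pooled = [0.0] * d
    for g, rows in by_class.items():
        for row in rows:
            for j in range(d):
                pooled[j] += (row[j] - means[g][j]) ** 2
    pooled = [v / (n - len(by_class)) for v in pooled]
    priors = {g: len(rows) / n for g, rows in by_class.items()}
    return means, pooled, priors


def predict(model, x):
    """Return the class with the highest log posterior (up to a constant)."""
    means, pooled, priors = model

    def score(g):
        # log prior + Gaussian log likelihood, class-independent terms dropped
        return math.log(priors[g]) - 0.5 * sum(
            (xj - mj) ** 2 / vj for xj, mj, vj in zip(x, means[g], pooled))

    return max(means, key=score)


# Hypothetical features: (proportion of A, proportion of B) per string.
train = [[0.75, 0.125], [0.625, 0.25],   # class A
         [0.125, 0.75], [0.25, 0.625],   # class B
         [0.125, 0.125], [0.25, 0.25]]   # class C
groups = ["A", "A", "B", "B", "C", "C"]

model = fit_diag_linear(train, groups)
print(predict(model, [0.5, 0.25]))     # A
print(predict(model, [0.125, 0.875]))  # B
print(predict(model, [0.25, 0.125]))   # C
```

The sketch assumes every feature varies within at least one class; otherwise a pooled variance would be zero and the division would fail.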

C) For ROC analysis, you generally don't provide the features! Instead, you need the true class label, the classifier's output, and a "score" value that estimates how confident the classifier is in its output. See the MATLAB function perfcurve for the two-class version; you'll have to roll your own, I think, for ROC surfaces.
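
To show why only labels and scores are needed, here is a minimal Python sketch that builds two-class ROC points by sweeping a threshold over the scores. It is an illustration with made-up data, not a replacement for perfcurve; for a 3-class problem you would call it once per class with "is this class" as the positive label.

```python
def roc_points(is_positive, scores):
    """Return (false positive rate, true positive rate) pairs.

    is_positive: list of booleans, True where the true class is the positive one
    scores: the classifier's confidence that each example is positive
    An example is predicted positive when its score is >= the threshold.
    """
    n_pos = sum(is_positive)
    n_neg = len(is_positive) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(is_positive, scores) if y and s >= t)
        fp = sum(1 for y, s in zip(is_positive, scores) if not y and s >= t)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical labels and scores for four test examples.
truth = [True, False, True, False]
conf = [0.9, 0.7, 0.6, 0.2]
print(roc_points(truth, conf))
# [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
```

Plotting the second coordinate against the first gives the ROC curve; drawing the curves of several classifiers on the same axes, computed on the same test set, gives the comparative plot.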