MATLAB: Binary classification

Hey all!
My question: Is it possible to use classification methods to determine if an unknown sample fits the distribution of known samples?
I have a known dataset that describes the parameter distribution of an object (various circles with properties such as circularity, area, perimeter, solidity, etc.). Rows are independent samples, and each column is a parameter. The problem is that I need a function that determines whether a new sample is a circle or not. From what I have seen of classification, you need to specify every class; there is no "everything else" class. What would be the best way to determine whether the new object is a circle or not (the circle is really just an example) and to get an error or confidence measure for the decision?
Regards,
Olivier

Best Answer

You might want to start here: http://en.wikipedia.org/wiki/One-class_classification (the first reference, a PhD thesis, gives an overview of methods).
No utilities in the official MATLAB release can be used for this right away, but it would be fairly easy to code some of the reviewed methods. For example, in ascending order of complexity:
  • Assume that predictors (columns) are uncorrelated and compute the distance between a new sample (row) and the mean of the training set (set of known samples). Compare with the reference distribution obtained by taking the distance between every row in the training set and the mean of all other rows.
  • Assume that the known samples come from a mixture of Gaussian distributions. Fit this mixture using gmdistribution from Statistics Toolbox. Compute the Mahalanobis distance between the new sample and every Gaussian component. Estimate the probability by assuming a chi-square distribution for the squared Mahalanobis distance, with as many degrees of freedom as there are predictors.
  • Find k nearest neighbors for every sample in the training set using knnsearch. Compute the distribution of the average distance between every sample and its k nearest neighbors. Find k nearest neighbors in the training set for the new sample and take the average of their distance values. Compare to the reference distribution.
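As a rough illustration of the first and simplest method in the list, here is a minimal sketch in base MATLAB. The data (Xtrain, xnew) are random placeholders, and both the column standardization and the 0.05 cutoff are my own assumptions rather than part of the method as described; adjust or drop them to suit your data.

```matlab
% Placeholder data (hypothetical): rows are samples, columns are predictors.
rng(0);                                   % reproducible placeholder data
Xtrain = randn(200, 4);                   % stand-in for the known samples
xnew   = [5 5 5 5];                       % stand-in for the unknown sample

% Assumption: standardize columns so that no single predictor dominates
% the Euclidean distance. Implicit expansion requires R2016b or later.
mu    = mean(Xtrain, 1);
sigma = std(Xtrain, 0, 1);
Z     = (Xtrain - mu) ./ sigma;
znew  = (xnew - mu) ./ sigma;

% Reference distribution: distance from each training row to the mean
% of all OTHER training rows (leave-one-out mean).
n       = size(Z, 1);
looMean = (sum(Z, 1) - Z) ./ (n - 1);     % row i = mean of rows other than i
refDist = sqrt(sum((Z - looMean).^2, 2)); % n-by-1 reference distances

% Distance from the new sample to the mean of the full training set.
newDist = sqrt(sum((znew - mean(Z, 1)).^2, 2));

% Empirical p-value: fraction of reference distances at least as large.
pValue    = mean(refDist >= newDist);
isOutlier = pValue < 0.05;                % illustrative threshold only
```

The empirical p-value doubles as a crude confidence measure: a small value means that few known samples lie as far from the mean as the new one does.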
And so on. If your training set is pure (all objects are indeed circles) and if your data are low-dimensional, you really have plenty of methods at your disposal. Without purity or in high dimensions, the problem can become substantially harder.
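The k-nearest-neighbor method from the list could be sketched along the following lines. It relies on knnsearch from Statistics Toolbox; the data, the choice k = 5, and the way the comparison is turned into an empirical p-value are illustrative assumptions on my part.

```matlab
% Placeholder data (hypothetical): rows are samples, columns are predictors.
rng(0);
Xtrain = randn(200, 4);                   % stand-in for the known samples
xnew   = [5 5 5 5];                       % stand-in for the unknown sample
k      = 5;                               % illustrative choice of k

% Reference distribution: average distance from each training sample to
% its k nearest neighbors. Query k+1 neighbors because each training row
% finds itself first, at distance 0, and drop that first column.
[~, D]  = knnsearch(Xtrain, Xtrain, 'K', k + 1);
refDist = mean(D(:, 2:end), 2);           % n-by-1 reference values

% Same statistic for the new sample against the training set.
[~, dNew] = knnsearch(Xtrain, xnew, 'K', k);
newDist   = mean(dNew, 2);

% Empirical p-value: fraction of training samples whose average
% neighbor distance is at least as large as that of the new sample.
pValue = mean(refDist >= newDist);
```

Unlike the distance-to-mean approach, this one does not assume that the known samples form a single compact cloud, which can help when the training set has several clusters.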