First of all, as @Marc Claesen already explained, semi-supervised classification is one of the techniques for the situation where you know that the classes are really distinct, but you are not certain which class a given case actually belongs to.
However, there are related situations as well, where the "reality" isn't that clear and the assumption of having really distinct classes is not met: borderline cases may be a "physical" reality (see below for papers about an application where we met such a condition).
There is one crucial assumption for semi-supervised classifiers that you need to make sure is met: the assumption that, in feature space, class boundaries come along with low sample density. This is referred to as the cluster assumption.
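To make the cluster assumption concrete, here is a minimal sketch (in Python with scikit-learn, purely as an illustrative substitute for Weka/R): a synthetic data set where the assumption holds, i.e. two dense clusters separated by a low-density gap, with most labels hidden.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

# Two interleaved half-moons: dense clusters separated by a low-density
# gap, i.e. a data set where the cluster assumption holds
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)

# Hide most labels; -1 marks "unlabelled" in sklearn's semi-supervised API
y = np.full_like(y_true, -1)
y[::10] = y_true[::10]  # keep only every 10th label

# Labels spread along the dense regions but not across the sparse gap
model = LabelSpreading(kernel='knn', n_neighbors=7).fit(X, y)
accuracy = (model.transduction_ == y_true).mean()
```

If the class boundary ran through a dense region instead, the labels would spread across it and this approach would fail, which is exactly the situation in the brain tumour application below.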
Even if the reality underlying your data has distinct classes, your data set may contain disproportionately many borderline cases: e.g. if your classification technique is targeted at classifying difficult cases while the clear and easy cases are not of interest, your training data will already reflect this situation.
only taking "certain" classifications for training? I fear that in this case, there will be more misclassifications because "border" cases are not covered.
I fully agree with you that excluding the borderline cases is often a bad idea:
by removing all difficult cases you end up with an artificially easy problem. IMHO it is even worse that excluding borderline cases usually does not stop at model training: the borderline cases are also excluded from testing, so the model is tested only on easy cases. That way you would not even realize that the model does not perform well on borderline cases.
Here are two papers we wrote about a problem that differs from yours in that, in our application, the reality can also have "mixed" classes (a more general version of your problem: uncertainty in the reference labels is covered as well).
- The application: brain tumour diagnostics. We used logistic regression; semi-supervised modeling was not appropriate, as we cannot assume low sample density at the class boundaries.
C. Beleites, K. Geiger, M. Kirsch, S. B. Sobottka, G. Schackert and R. Salzer: Raman spectroscopic grading of astrocytoma tissues: using soft reference information, Anal. Bioanal. Chem., 400 (2011), 2801 - 2816.
- Theory paper deriving a general framework for measuring the performance of the classifier for borderline cases.
C. Beleites, R. Salzer and V. Sergo: Validation of Soft Classification Models using Partial Class Memberships: An Extended Concept of Sensitivity & Co. applied to Grading of Astrocytoma Tissues, Chemom. Intell. Lab. Syst., 122 (2013), 12 - 22.
The links go to a project page of an R package I developed to do the performance calculations. There are further links to both the official web page and my manuscripts of the papers.
While I have not used Weka so far, I understand that an interface to R is available.
Practical considerations:
- While the copy-and-label-differently approach is straightforward, it does not work well with all classifiers and implementations in practice. E.g. AFAIK there is no way to tell libSVM's tuning by cross validation that all copies of each data point need to be kept in the same cross-validation fold; thus libSVM's tuning would probably yield a massively overfit model.
- Also for logistic regression, I found that many implementations did not allow the partial membership labels I needed.
- The implementation I used for the papers above is actually an ANN without hidden layer, using the logistic function as sigmoidal link (nnet::multinom).
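As an illustration of both points, here is a sketch in Python with scikit-learn (an assumed substitute for the R implementations above, with invented toy data): partial class memberships are emulated by duplicating a borderline row under both labels with fractional sample weights, and the copies are kept in the same cross-validation fold by giving them a common group id.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold

# Hypothetical 1-d data: the third sample is a borderline case with
# partial class memberships 0.3 (class 0) and 0.7 (class 1)
X = np.array([[0.0], [1.0], [0.45]])

# Copy-and-label-differently: duplicate the borderline row under both
# labels, weighting each copy by its partial membership
X_aug = np.vstack([X, X[[2]]])
y_aug = np.array([0, 1, 0, 1])
w_aug = np.array([1.0, 1.0, 0.3, 0.7])

clf = LogisticRegression().fit(X_aug, y_aug, sample_weight=w_aug)

# Keep both copies of the borderline point in the same CV fold by
# assigning them the same group id (indices 2 and 3 share group 2)
groups = np.array([0, 1, 2, 2])
folds = list(GroupKFold(n_splits=3).split(X_aug, y_aug, groups))
```

Without the grouping, a copy of the borderline point could end up in the training fold while its twin is in the test fold, which is exactly the leakage that makes naive tuning overfit.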
I think part of your confusion is about which types of variables a chi-squared test can compare. Wikipedia says the following about this:
It tests a null hypothesis stating that the frequency distribution of certain events observed in a sample is consistent with a particular theoretical distribution.
Thus it compares frequency distributions, also known as counts, i.e. non-negative numbers. The different frequency distributions are defined by the categorical variable; i.e. for each value of the categorical variable there needs to be a frequency distribution that can be compared to the other ones.
There are several ways to get such a frequency distribution. It might come from a second categorical variable, in which case the co-occurrences with the first categorical variable are counted to get a discrete frequency distribution. Another option is to use one or more numerical variables: for each value of the categorical variable, one can (e.g.) sum the values of the numerical variable. In fact, if the categorical variable is binarised, the former is a specific version of the latter.
Example
As an example look at these sets of variables:
x = ['mouse', 'cat', 'mouse', 'cat']
z = ['wild', 'domesticated', 'domesticated', 'domesticated']
The categorical variables x and z can be compared by counting the co-occurrences, and this is what happens in a chi-squared test:
'mouse' 'cat'
'wild' 1 0
'domesticated' 1 2
However, you can also binarise the values of 'x' and get the following variables:
x1 = [1, 0, 1, 0]
x2 = [0, 1, 0, 1]
z = ['wild', 'domesticated', 'domesticated', 'domesticated']
Counting the co-occurrences is now equal to summing the values of x1 and x2 for each value of z.
x1 x2
'wild' 1 0
'domesticated' 1 2
As you can see, a single categorical variable (x) and multiple numerical variables (x1 and x2) are equally well represented in the contingency table. Thus chi-squared tests can be applied to a categorical variable (the label in sklearn) combined with either another categorical variable or multiple numerical variables (the features in sklearn).
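To check this numerically, here is a small Python sketch (using numpy and scipy, my choice of tools) that builds the contingency table both ways, from the two categorical variables and from the binarised columns, and confirms the two routes coincide:

```python
import numpy as np
from scipy.stats import chi2_contingency

x = ['mouse', 'cat', 'mouse', 'cat']
z = ['wild', 'domesticated', 'domesticated', 'domesticated']

# Binarised version of x: one 0/1 indicator column per category
x1 = np.array([1, 0, 1, 0])  # 1 where x == 'mouse'
x2 = np.array([0, 1, 0, 1])  # 1 where x == 'cat'

# Way 1: count co-occurrences of the two categorical variables
table_counts = np.array(
    [[sum(xi == cx and zi == cz for xi, zi in zip(x, z))
      for cx in ('mouse', 'cat')]
     for cz in ('wild', 'domesticated')])

# Way 2: sum the numerical indicator columns per value of z
table_sums = np.array(
    [[x1[[zi == cz for zi in z]].sum(),
      x2[[zi == cz for zi in z]].sum()]
     for cz in ('wild', 'domesticated')])

# Both routes give the same table, hence the same chi-squared test
assert (table_counts == table_sums).all()
stat, p, dof, expected = chi2_contingency(table_counts)
```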
Best Answer
Scikit-learn only handles real numbers, I believe. So you need to do something like one-hot encoding, where n numerical dimensions are used to represent membership in the n categories. If you just pass in strings, they'll get cast to floats in unpredictable ways.
There are mathematical reasons why some methods (like SVM) need floats, i.e. they are only defined over the space of real numbers. Representing 3 categories as the values 1, 2, 3 in a single feature might work, but it may also yield suboptimal performance compared to one-hot encoding, since a split like (1, 3) vs (2) is difficult to pick up on unless the method can capture very nonlinear behavior like that.
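A short sketch of the two encodings with scikit-learn (the category names are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# One categorical feature with 3 categories
colors = np.array([['red'], ['green'], ['blue'], ['green']])

# Single numeric column: categories become arbitrary codes 0, 1, 2,
# imposing an ordering that does not exist in the data
ordinal = OrdinalEncoder().fit_transform(colors)

# One-hot: one 0/1 column per category, no artificial ordering
onehot = OneHotEncoder().fit_transform(colors).toarray()
```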
Other methods, like random forests, can be made to work directly on categorical values, i.e. during decision tree learning you can propose potential splits as different combinations of categories. For such methods it is often convenient to use ints to represent the categories, because an array of ints is much nicer to work with than an array of strings at a computational level. You can also do things like generating all possible combinations of n categories by looking at the bit values of an n-bit integer you are incrementing, which can be much faster and more memory-efficient than searching for splits over n floats.
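The bit trick mentioned above can be sketched as follows (a hypothetical helper, not any particular library's implementation):

```python
def category_splits(n_categories):
    """Yield each distinct (left, right) split of integer-coded categories.

    Bit i of the mask decides on which side category i goes. Iterating the
    mask only up to 2**(n-1) generates each unordered split exactly once
    (the highest category always stays on the right) and skips the trivial
    all-on-one-side split at mask 0.
    """
    for mask in range(1, 2 ** (n_categories - 1)):
        left = [c for c in range(n_categories) if (mask >> c) & 1]
        right = [c for c in range(n_categories) if not (mask >> c) & 1]
        yield left, right

splits = list(category_splits(3))
# yields ([0], [1, 2]), ([1], [0, 2]) and ([0, 1], [2])
```

For n categories this enumerates all 2**(n-1) - 1 binary splits, something that a float-threshold search on an arbitrary 1, 2, 3 coding cannot express.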