I think part of your confusion is about which types of variables a chi-squared test can compare. Wikipedia says the following about this:
It tests a null hypothesis stating that the frequency distribution of certain events observed in a sample is consistent with a particular theoretical distribution.
Thus it compares frequency distributions, i.e. counts, which are non-negative numbers. The different frequency distributions are defined by the categorical variable: for each value of the categorical variable there needs to be a frequency distribution that can be compared to the others.
There are several ways to obtain the frequency distribution. It might come from a second categorical variable, whose co-occurrences with the first categorical variable are counted to get a discrete frequency distribution. Another option is to use multiple numerical variables: for each value of the categorical variable, one can (e.g.) sum the values of each numerical variable. In fact, if the second categorical variable is binarised, the former is a special case of the latter.
Example
As an example, look at these variables:
x = ['mouse', 'cat', 'mouse', 'cat']
z = ['wild', 'domesticated', 'domesticated', 'domesticated']
The categorical variables x and z can be compared by counting their co-occurrences, which is what happens in a chi-squared test:
               'mouse'  'cat'
'wild'             1       0
'domesticated'     1       2
However, you can also binarise the values of x and get the following variables:
x1 = [1, 0, 1, 0]
x2 = [0, 1, 0, 1]
z = ['wild', 'domesticated', 'domesticated', 'domesticated']
Counting the co-occurrences is now equivalent to summing the values of x1 and x2 that correspond to each value of z.
               x1  x2
'wild'          1   0
'domesticated'  1   2
As you can see, a single categorical variable (x) and multiple numerical variables (x1 and x2) are represented equally in the contingency table. Thus chi-squared tests can be applied to a categorical variable (the label in sklearn) combined with either another categorical variable or multiple numerical variables (the features in sklearn).
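To make this concrete, here is a minimal sketch (assuming scipy and scikit-learn are installed) that builds both representations of the table above. Note that the two functions make slightly different default choices (e.g. scipy applies a continuity correction to 2x2 tables), so the statistics are comparable but not guaranteed to be identical:

import numpy as np
from scipy.stats import chi2_contingency
from sklearn.feature_selection import chi2

# Representation 1: contingency table counted from x and z
# (rows: 'wild', 'domesticated'; columns: 'mouse', 'cat')
observed = np.array([[1, 0],
                     [1, 2]])
stat, p, dof, expected = chi2_contingency(observed)

# Representation 2: binarised features x1, x2 plus the label z;
# sklearn's chi2 sums each feature per value of z, as described above
X = np.array([[1, 0],   # 'mouse'
              [0, 1],   # 'cat'
              [1, 0],   # 'mouse'
              [0, 1]])  # 'cat'
z = ['wild', 'domesticated', 'domesticated', 'domesticated']
stats, p_values = chi2(X, z)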
Let me try to answer this; I will edit the answer as I get more information. In general, scikit-learn does not provide classifiers that handle the multi-label classification problem very well. That's why I started scikit-multilearn, an extension of scikit-learn, and together with a lovely team of multi-label classification people around the world we are implementing more state-of-the-art methods for MLC.
First of all, the question is whether you need probabilities or just an estimate of how sure a classifier is; exact probabilities are not always cheap to obtain. I understand that you want to get probabilities P(A|X), P(B|X), etc. for a given instance X.
A. The simplest case: labels are independent, i.e. P(A and B|X) = P(A|X)P(B|X). If that is the case, you can use scikit-multilearn's Binary Relevance classifier's predict_proba.
Here's a simple example with SVC as the per-label probability estimator:
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.svm import SVC

# One SVC per label; probability=True enables probability estimates
# via Platt scaling
classifier = BinaryRelevance(
    classifier=SVC(probability=True),
    require_dense=[False, True]
)
classifier.fit(X_train, y_train)
probabilities = classifier.predict_proba(X_test)
This will estimate per-label probabilities and then renormalize them. Unfortunately, Binary Relevance may fail to detect a rise or fall in probability when a combination of labels is mutually or even totally dependent.
B. If your labels are not independent, you need to explore the data set and ask yourself what the level of co-dependence in your data is. There are several ways to handle dependencies. If you really expect total dependence, a Label Powerset approach may be better: each combination of labels is treated as a separate class and a probability is estimated per class. Note that this transformation is a hard one to perform, due to label imbalance and the underfitting nature of the Label Powerset transformation, so I've created a solution that divides the label space into interconnected subspaces - a data-driven approach to detect dependencies and split the problem into internally more dependent subproblems - see the data-driven approach to multi-label classification paper.
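For concreteness, a minimal Label Powerset sketch (GaussianNB is just an illustrative base classifier; any sklearn classifier with predict_proba works, and X_train, y_train are as before):

from skmultilearn.problem_transform import LabelPowerset
from sklearn.naive_bayes import GaussianNB

# Each observed label combination becomes one class of the base classifier
classifier = LabelPowerset(GaussianNB())
classifier.fit(X_train, y_train)
probabilities = classifier.predict_proba(X_test)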
An example of how to use it is here: http://scikit.ml/api/classify.html#ensemble-approaches - just use predict_proba instead of predict. Also, you might want to change the clusterer to:
clusterer = IGraphLabelCooccurenceClusterer('fastgreedy', weighted=True, include_self_edges=False)
so that the label partition is more granular. It detects clusters of co-occurring labels and then calculates joint distributions P(A1, ..., An|X) for labels A1, ..., An per cluster, i.e. it expects the clusters of labels to be independent of each other.
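Putting it together, a minimal sketch along the lines of the linked documentation (assuming igraph is installed; GaussianNB is again just a placeholder base classifier):

from skmultilearn.ensemble import LabelSpacePartitioningClassifier
from skmultilearn.cluster import IGraphLabelCooccurenceClusterer
from skmultilearn.problem_transform import LabelPowerset
from sklearn.naive_bayes import GaussianNB

# Partition the label space by co-occurrence clusters, then run
# Label Powerset within each cluster
clusterer = IGraphLabelCooccurenceClusterer('fastgreedy', weighted=True,
                                            include_self_edges=False)
classifier = LabelSpacePartitioningClassifier(LabelPowerset(GaussianNB()),
                                              clusterer)
classifier.fit(X_train, y_train)
probabilities = classifier.predict_proba(X_test)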
If you like it and use this method, please cite both the data-driven paper and the arXiv paper of scikit-multilearn; we can get funding to develop the library that way :) I'd also love to know how the method worked for you, so I can maybe improve it.
I find it easiest to just start with something, so if I were you I'd go ahead and check the approach from point A and see what level of result you're getting. Then I'd try the label space partitioning approach. I need to write a tutorial on how to use it to explore the relations; I will add this to my documentation todo list.
Best Answer
Using MLPClassifier you can do exactly what you suggested, that is, represent classes as integers from 0 to 27 (in the case of 28 classes). Here is an example with MLPClassifier and the MNIST dataset. You can use sklearn's LabelEncoder to transform data to such a format.
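For example, a minimal sketch using sklearn's digits dataset as a small stand-in for MNIST (the hyperparameters are illustrative only):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder

X, y = load_digits(return_X_y=True)  # 10 classes, labels already integers 0-9
# If your labels were strings, LabelEncoder would map them to 0..n_classes-1:
# y = LabelEncoder().fit_transform(string_labels)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))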
Although the above will solve your problem, I believe MLPClassifier actually transforms the numerical labels to one-hot vectors internally for the neural network. With lower-level neural network libraries you would have to do this yourself.
This is because in multi-class classification the last layer's activation is a softmax, which outputs a vector of n elements (n being the number of classes) with continuous values in (0, 1). This makes sense as an indication of the probability of observing a given class. To transform numerical labels to one-hot vectors with sklearn you can use LabelBinarizer. When we expect a neural network to predict a numerical value, we're really talking about regression, not classification.
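For instance, a minimal LabelBinarizer sketch:

from sklearn.preprocessing import LabelBinarizer

labels = [0, 2, 1, 2]
one_hot = LabelBinarizer().fit_transform(labels)
# one_hot is now:
# [[1 0 0]
#  [0 0 1]
#  [0 1 0]
#  [0 0 1]]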