Solved – How does Scikit Learn resolve ties in the KNN classification

classification, k nearest neighbour, scikit learn, self-study

I have a multi-class classification problem (5 classes) for which I'm using Scikit Learn's k nearest neighbour classifier. With more than two classes, choosing an odd number for k won't prevent classification ties.

So how does Scikit Learn resolve ties in k nearest neighbour classification? I can't seem to find this anywhere on the internet.

I need this for an exam assignment, so a quick answer, ideally with a source, would be much appreciated 🙂

Best Answer

From the documentation for KNeighborsClassifier:

Warning: Regarding the Nearest Neighbors algorithms, if it is found that two neighbors, neighbor k+1 and k, have identical distances but different labels, the results will depend on the ordering of the training data.
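
This warning concerns which points get selected as the k neighbors in the first place. A toy one-dimensional sketch (plain Python with a hypothetical helper, not sklearn's actual implementation) shows how a stable sort on distances makes that selection depend on the order of the training data when two candidates are equidistant:

```python
def k_nearest(train_points, query, k):
    # Hypothetical helper for illustration only, not sklearn's code.
    # Python's sort is stable: when two training points are equidistant
    # from the query, the one that appears earlier in the training data
    # wins the contested slot.
    dists = [abs(p - query) for p in train_points]
    order = sorted(range(len(train_points)), key=lambda i: dists[i])
    return order[:k]

train = [0.0, 2.0, -2.0]          # distances 0, 2, 2 from the query 0.0
print(k_nearest(train, 0.0, 2))   # -> [0, 1]: point 2.0 beats equidistant -2.0

train_swapped = [0.0, -2.0, 2.0]  # same points, different training order
print(k_nearest(train_swapped, 0.0, 2))  # -> [0, 1]: now index 1 is -2.0
```

Reordering the training data changes which of the equidistant points ends up in the neighbor set, exactly as the warning says.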

To see exactly what happens, we'll have to look at the source. There you can see that, in the unweighted case, KNeighborsClassifier.predict ends up calling scipy.stats.mode, whose documentation says:

Returns an array of the modal (most common) value in the passed array.

If there is more than one such value, only the smallest is returned.

So, in the case of a tie among the most frequent classes, the prediction will be whichever of the tied classes has the smallest label value.

Digging a little deeper, the neigh_ind array passed to mode is the result of calling the kneighbors method, which (though its documentation doesn't say so) appears to return neighbors sorted by distance. That ordering, together with the ordering of the training data, decides which equidistant points make the cut as the k neighbors in the first place, which is exactly the situation the warning above describes; but this behaviour isn't documented and I'm not 100% sure it always holds.
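
To make the unweighted prediction path concrete, here is a minimal plain-Python sketch: collect the labels of the k nearest neighbors, then take their mode. The names are hypothetical and this is not sklearn's code; the tie-break (in the scipy versions I checked, mode keeps the smallest of the tied values, since it scans the unique values in sorted order) is imitated with min:

```python
from collections import Counter

def knn_vote(neighbor_labels):
    # Sketch of an unweighted KNN vote, not sklearn's implementation.
    # Mimics scipy.stats.mode's tie-break: among the most common
    # labels, keep the smallest one.
    counts = Counter(neighbor_labels)
    top = max(counts.values())
    return min(label for label, c in counts.items() if c == top)

# 5 neighbors; classes 2 and 4 tie with two votes each:
print(knn_vote([4, 2, 4, 2, 0]))  # smallest tied label wins -> 2
```

Note that with weights='distance' the vote is weighted by inverse distance, so exact ties of this kind become much rarer in practice.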