Solved – Dealing with ties, weights and voting in kNN

k nearest neighbour, ties, weights

I am programming a kNN algorithm and would like to know the following:

Tie-breaks:

What happens if there is no clear winner in the majority vote? E.g. all k nearest neighbors are from different classes, or for k=4 there are 2 neighbors from class A and 2 neighbors from class B?
What happens if it is not possible to determine exactly k nearest neighbors because several neighbors are at the same distance? E.g. for the list of distances (x1;2), (x2;3.5), (x3;4.8), (x4;4.8), (x5;4.8), (x6;9.2) it would not be possible to determine the k=3 or k=4 nearest neighbors, because the 3rd to 5th neighbors all have the same distance.

Weights:

I read that it is good to weight the k nearest neighbors before selecting the winning class. How does that work? I.e. how are the neighbors weighted, and how is the class then determined?

Majority vote alternatives:

Are there rules/strategies for determining the winning class other than majority vote?

Best Answer

In my view, the ideal way to break a tie in k-nearest-neighbor voting is to decrease k by 1 until the tie is broken. This always works regardless of the vote weighting scheme, since a tie is impossible when k = 1. If you instead increased k, then depending on your weighting scheme and the number of categories, you could not guarantee that the tie would be broken.
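
The shrink-k rule can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function name `knn_vote` is made up here, the vote is unweighted, and the labels are assumed to be already sorted from nearest to farthest neighbor. It does not address neighbors that are tied in distance at the k-th position.

```python
from collections import Counter

def knn_vote(sorted_labels, k):
    """Majority vote over the k nearest labels, shrinking k until there is a unique winner.

    `sorted_labels` is assumed to be ordered from nearest to farthest neighbor.
    Returns the winning label and the value of k that was actually used.
    """
    if not sorted_labels or k < 1:
        raise ValueError("need at least one neighbor and k >= 1")
    while True:
        counts = Counter(sorted_labels[:k]).most_common()
        # Unique winner: only one class present, or the top count beats the runner-up.
        if len(counts) == 1 or counts[0][1] > counts[1][1]:
            return counts[0][0], k
        k -= 1  # tie: drop the farthest neighbor and vote again

# Two votes for A and two for B at k=4, so k is reduced to 3.
label, used_k = knn_vote(["A", "B", "B", "A"], k=4)
```

With k = 1 only one class can be present, so the loop is guaranteed to stop there at the latest.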

Related Solutions

Solved – Why is this nearest neighbors algorithm classifier implementation giving low accuracy

You should probably try to reduce the number of variables to a sensible set before trying to classify using nearest neighbors. Otherwise you'll fall victim to the curse of dimensionality, which is referenced in the Wikipedia article on $k$-nearest neighbors. You might also consider some sort of scaling of the variables so that no particular attribute has an undue influence on your classifications.
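
One common form of scaling is standardization (z-scores), where each variable is shifted to zero mean and divided by its standard deviation. The sketch below is one possible way to do this with the standard library only; the helper name `standardize` and the toy data are assumptions for illustration, not taken from the original code.

```python
from statistics import mean, pstdev

def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# The second attribute is on a much larger scale and would dominate raw distances.
data = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]]
scaled = standardize(data)
```

After scaling, both attributes contribute on a comparable scale to any distance computed between rows.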

Your Python code could also be simplified quite a bit. Instead of defining these functions you could use the inner product function from numpy:

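import math
import numpy as np

# Example feature vectors (illustrative values only)
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# inner product
inner = np.dot(a, b)

# cosine similarity: inner product divided by the product of the vector norms
cosine = np.dot(a, b) / math.sqrt(np.dot(a, a) * np.dot(b, b))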

Solved – Range of values for hyperparameters of the KNN

Two hyperparameters are K (i.e. the number of neighbors to consider) and the choice of distance function.

For K, you could iterate from 1 through $N$, i.e. the number of data points in your dataset. Whether choosing an extremely large K (close to $N$) is sensible is a different question, but $[1,N]$ is at least the range of all values that you can possibly test in an exhaustive search when fitting your hyperparameters.
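
As a rough illustration of such an exhaustive search, the sketch below scores every K from 1 to $N$ on a small held-out set. Everything here is assumed for the example: the helper names `predict` and `accuracy_per_k`, the toy 1-D data, and the use of a simple validation split with an unweighted majority vote.

```python
from collections import Counter

def predict(train, query, k):
    """Classify `query` by unweighted majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    # Ties in the vote are not handled specially in this sketch.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy_per_k(train, valid):
    """Return {k: validation accuracy} for every k from 1 to N = len(train)."""
    scores = {}
    for k in range(1, len(train) + 1):
        hits = sum(predict(train, x, k) == y for x, y in valid)
        scores[k] = hits / len(valid)
    return scores

# Toy 1-D data as (feature, label) pairs; purely illustrative.
train = [(1.0, "A"), (1.5, "A"), (2.0, "A"), (6.0, "B"), (6.5, "B")]
valid = [(1.2, "A"), (6.2, "B")]
scores = accuracy_per_k(train, valid)
```

In this toy example K = 1 classifies both validation points correctly, while K = N always predicts the majority class A and gets only one of the two right, which shows why very large K is rarely a good choice.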

For the distance function, you could test functions from different families. Two commonly mentioned ones are Euclidean distance and Manhattan distance, but here too you might consider more than just those two.
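
For reference, here is a minimal sketch of the two distances named above, plus the Minkowski distance, which contains both as special cases and is one example of a further candidate. The function names are chosen for this example only.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    """Minkowski distance; p=1 gives Manhattan and p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

d_euclid = euclidean((0, 0), (3, 4))
d_manhattan = manhattan((0, 0), (3, 4))
```

Treating `p` as an additional hyperparameter is one way to search over a whole family of distance functions rather than picking between two fixed options.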
