Solved – Is Tomek Link undersampling the same as Edited Nearest Neighbours with 1 neighbour

data-preprocessing, down-sample, resampling, unbalanced-classes

From what I've read, undersampling the majority class with Tomek Links should yield the same result as Edited Nearest Neighbours with 1 neighbour. However, when I tried both in the imbalanced-learn library, I got different outputs. Both methods are described in the documentation under 3.2.2 Cleaning under-sampling techniques.

This is what I tried:

Counter(y_train)
Out[45]: Counter({0: 91, 1: 26})

Using Tomek Links:

from imblearn.under_sampling import TomekLinks

X_res1, y_res1 = TomekLinks(ratio='all').fit_sample(X_train_std, y_train)
Counter(y_res1)
Out[44]: Counter({0: 88, 1: 23})

Using Edited Nearest Neighbours:

from imblearn.under_sampling import EditedNearestNeighbours

X_res1, y_res1 = EditedNearestNeighbours(n_neighbors=1).fit_sample(X_train_std, y_train)
Counter(y_res1)
Out[50]: Counter({0: 73, 1: 26})

Is my interpretation right or are there any mistakes in the code?

Best Answer

Eventually I found the answer myself. The nearest-neighbour relation is not necessarily reciprocal; in other words, if A is B's nearest neighbour, this doesn't imply that B is A's nearest neighbour.
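Non-reciprocity is easy to see with a toy example. Below is a minimal, dependency-free sketch (the 1-D points A, B, C and their positions are made up for illustration): A's nearest neighbour is B, but B's nearest neighbour is C.

```python
# Toy 1-D points (hypothetical data) showing that nearest
# neighbours need not be mutual.
points = {"A": 0.0, "B": 2.0, "C": 3.0}

def nearest(name):
    """Return the name of the point closest to `name` (excluding itself)."""
    return min((p for p in points if p != name),
               key=lambda p: abs(points[p] - points[name]))

print(nearest("A"))  # B is A's nearest neighbour (distance 2 vs 3)
print(nearest("B"))  # but B's nearest neighbour is C (distance 1), not A
```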

Edited Nearest Neighbours removes a sample (usually from the majority class) if its nearest neighbour belongs to the opposite class. Tomek Links, on the other hand, requires the two samples to be each other's nearest neighbours. In summary, Tomek Links uses a more restrictive condition, resulting in fewer samples being removed.
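The two removal rules can be contrasted on a toy labelled dataset. In this sketch (the three samples and their positions are made up, not taken from the question's data), ENN with 1 neighbour removes A because A's nearest neighbour B has the opposite class, while the Tomek Links rule keeps A because B's own nearest neighbour is C, so A and B are not mutual neighbours:

```python
# Minimal sketch (toy 1-D data) contrasting the two removal rules.
# Each sample: (name, position, class label)
samples = [("A", 0.0, 0), ("B", 2.0, 1), ("C", 3.0, 1)]

def nn(i):
    """Index of sample i's nearest neighbour (1-NN, excluding itself)."""
    return min((j for j in range(len(samples)) if j != i),
               key=lambda j: abs(samples[j][1] - samples[i][1]))

# ENN with 1 neighbour: remove i if its nearest neighbour has a different class.
enn_removed = {samples[i][0] for i in range(len(samples))
               if samples[nn(i)][2] != samples[i][2]}

# Tomek links: remove i only if i and nn(i) are mutual nearest
# neighbours AND belong to opposite classes.
tomek_removed = {samples[i][0] for i in range(len(samples))
                 if samples[nn(i)][2] != samples[i][2] and nn(nn(i)) == i}

print(enn_removed)    # {'A'}: A's 1-NN is B, which has the other class
print(tomek_removed)  # set(): B's 1-NN is C, so (A, B) is not a Tomek link
```

Because every Tomek link pair also fails the ENN test, everything Tomek Links removes would also be removed by 1-NN ENN, but not vice versa, which matches the counts in the question (88 vs 73 majority samples kept).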