How to Plot Decision Boundary of a k-Nearest Neighbor Classifier

data visualization, k nearest neighbour, r

I want to generate the plot described in the book "The Elements of Statistical Learning: Data Mining, Inference, and Prediction" (second edition; R package ElemStatLearn) by Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The plot is:

[Figure: the plot from the book]

How can I produce this exact graph in R? In particular, I am interested in the grid of points and the calculation used to show the boundary.

Best Answer

To reproduce this figure, you need to have the ElemStatLearn package installed on your system. The artificial dataset is available in that package as mixture.example, as pointed out by @StasK.

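library(ElemStatLearn)        # provides the mixture.example data
library(class)                # provides knn()
x <- mixture.example$x        # training inputs
g <- mixture.example$y        # class labels (0/1)
xnew <- mixture.example$xnew  # grid of points at which to predict
mod15 <- knn(x, xnew, g, k=15, prob=TRUE)
prob <- attr(mod15, "prob")               # vote share of the winning class
prob <- ifelse(mod15=="1", prob, 1-prob)  # convert to vote share of class 1
px1 <- mixture.example$px1    # grid coordinates along the first axis
px2 <- mixture.example$px2    # grid coordinates along the second axis
prob15 <- matrix(prob, length(px1), length(px2))
par(mar=rep(2,4))
# the decision boundary is the 0.5 contour of the class-1 vote share
contour(px1, px2, prob15, levels=0.5, labels="", xlab="", ylab="", main=
        "15-nearest neighbour", axes=FALSE)
points(x, col=ifelse(g==1, "coral", "cornflowerblue"))
gd <- expand.grid(x=px1, y=px2)
# colour every grid point by its predicted class
points(gd, pch=".", cex=1.2, col=ifelse(prob15>0.5, "coral", "cornflowerblue"))
box()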

All but the last three commands come from the online help for mixture.example. Note that expand.grid arranges its output with x varying fastest, which matches the column-major order of the prob15 matrix (of dimension 69x99), so the colours can be indexed directly from that matrix. After the ifelse() step, prob15 holds the proportion of votes for class 1 at each grid point (px1, px2).
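The ordering argument can be checked on a toy grid. This is a minimal sketch with made-up coordinates (the small px1 and px2 vectors here are illustrative stand-ins, not the values from mixture.example): it shows that row i + (j - 1) * length(px1) of the expand.grid output corresponds to entry [i, j] of a matrix filled column by column.

```r
# Toy stand-ins for the grid coordinates (hypothetical values)
px1 <- c(1, 2, 3)
px2 <- c(10, 20)

# expand.grid varies its first argument fastest
gd <- expand.grid(x = px1, y = px2)

# A matrix filled column by column with the row numbers of gd:
# m[i, j] is the row of gd holding the point (px1[i], px2[j])
m <- matrix(seq_len(nrow(gd)), length(px1), length(px2))

print(gd)
print(gd[m[2, 2], ])  # the grid point (px1[2], px2[2])
```

Because both objects share this ordering, a vector or matrix of per-point values can be passed straight to the col argument of points() without any reshaping.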


Related Solutions

Solved – Bayes decision boundary of Figure 2.5 in Elements of Statistical Learning

I asked the authors this question, and apparently they no longer have the code that created the data. So there is no real way to reconstruct the Bayes rule for this particular data set. Otherwise, it would be based on the ratio of the class densities, which would have been known for the Gaussian mixture distributions that the authors used to create the two classes.
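To illustrate what a density-ratio rule looks like when the mixture is known, here is a minimal sketch. Everything in it is an assumption made for illustration: the centres are invented, the components are taken to be equally weighted bivariate Gaussians with a common spherical variance, and the two classes are assumed to have equal prior probability. The helper names mixture_density and bayes_class are hypothetical, and the sketch does not reconstruct the book's boundary.

```r
# Density of an equally weighted mixture of spherical bivariate Gaussians.
# `centres` is a matrix with one centre per row; `variance` is the common
# per-coordinate variance (both are illustrative assumptions).
mixture_density <- function(pt, centres, variance) {
  d2 <- rowSums(sweep(centres, 2, pt)^2)  # squared distance to each centre
  mean(exp(-d2 / (2 * variance)) / (2 * pi * variance))
}

# Bayes rule with equal priors: pick class 1 when its density is larger,
# i.e. when the ratio of class densities exceeds 1.
bayes_class <- function(pt, centres0, centres1, variance = 0.2) {
  ratio <- mixture_density(pt, centres1, variance) /
           mixture_density(pt, centres0, variance)
  as.integer(ratio > 1)
}

# Hypothetical centres for the two classes
centres0 <- rbind(c(0, 0), c(1, 0))
centres1 <- rbind(c(3, 3), c(2, 3))

bayes_class(c(0, 0), centres0, centres1)  # near the class-0 centres
bayes_class(c(3, 3), centres0, centres1)  # near the class-1 centres
```

Evaluating bayes_class over a grid and drawing the contour where the ratio equals 1 would give the Bayes decision boundary for such a known mixture.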

Solved – k nearest neighbor with decision tree

1) Why will the performance of k-nn change if I change the representation of attribute X from m to cm?

k-nearest-neighbour classifiers use a distance measure, usually Euclidean distance, to decide the classification. Suppose you have this data set:

10m 10kg
11m 11kg

 >> sqrt((11-10)^2 + (11-10)^2) 
 ans =
 1.4142

If you change m to cm, you have the following data set:

1000cm 10kg
1100cm 11kg

Here the distance has changed, and it is now dominated by the first attribute:

 >> sqrt((1100-1000)^2 + (11-10)^2)
 ans =
 100.0050

If you do not want this behavior, you need to normalize your data.
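As a minimal sketch of what normalization buys you, the following standardizes each attribute to mean 0 and standard deviation 1 with scale() before computing distances. The heights and weights are made-up values for illustration, and z-score standardization is only one of several possible normalizations.

```r
# Hypothetical measurements: height in metres and weight in kg
height_m  <- c(1.60, 1.75, 1.82, 1.68)
weight_kg <- c(55, 80, 77, 62)

in_m  <- cbind(height = height_m,       weight = weight_kg)
in_cm <- cbind(height = height_m * 100, weight = weight_kg)

# Raw Euclidean distances depend on the unit chosen for height
d_raw_m  <- dist(in_m)
d_raw_cm <- dist(in_cm)

# After standardizing each column, the unit no longer matters
d_std_m  <- dist(scale(in_m))
d_std_cm <- dist(scale(in_cm))

print(d_raw_m)
print(d_raw_cm)
print(d_std_m)
print(d_std_cm)
```

The two standardized distance tables agree up to floating-point error, so a k-nn classifier run on standardized data gives the same answer whether height was recorded in m or cm.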

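2) Why will the performance of k-nn and decision tree not change if I multiply all the attributes by 20?

Here you rescale the data, but you rescale ALL attributes by the same factor. For k-nn, every distance is multiplied by 20, so the ordering of the neighbours, and hence the classification, does not change. A decision tree does not use distances at all: it splits on thresholds of individual attributes, and those thresholds simply scale by 20 as well, so the resulting partition of the data is the same.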
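Both invariances can be checked numerically. This is a rough sketch on simulated data: the helper nearest() is a hypothetical base-R function written for this illustration (not part of any package), and the split threshold 0.5 is arbitrary.

```r
set.seed(1)
train <- matrix(rnorm(40), ncol = 2)  # 20 simulated points, 2 attributes
query <- c(0.1, -0.2)                 # a point to classify

# Indices of the k training points closest to `query` (Euclidean distance)
nearest <- function(train, query, k) {
  d <- sqrt(rowSums(sweep(train, 2, query)^2))
  order(d)[seq_len(k)]
}

# k-nn: multiplying every attribute by 20 multiplies every distance by 20,
# so the same neighbours are found in the same order
nn_original <- nearest(train, query, k = 5)
nn_scaled   <- nearest(train * 20, query * 20, k = 5)

# Tree-style split: a threshold on one attribute scales along with the data,
# so the same points fall on each side of the split
split_original <- train[, 1] > 0.5
split_scaled   <- (train[, 1] * 20) > (0.5 * 20)

print(nn_original)
print(nn_scaled)
print(identical(split_original, split_scaled))
```

Since the neighbour sets and the split membership are unchanged, the predicted classes are unchanged too. Rescaling only one attribute, as in the m to cm example, does not preserve the neighbour ordering in general.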
