Solved – How to overcome the computational cost of the KNN algorithm

k nearest neighbour, machine learning

How can I overcome the computational cost of the KNN algorithm when the dataset is large?

Bear in mind that I am using my own function to build the KNN algorithm in R.

Is there anything I can do to the code to speed it up?

I have tried a dataset with 6,000 instances and it takes a long time.

Note

wdbc_n is my normalized data set.

Lbl is the class label of the data set.

0.1 is the p value for the Lp norm, as defined in my lpnorm function (not shown here).

# confusionMatrix() comes from the caret package.
library(caret)

# Predict the label of a single test point by 1-NN under the Lp norm.
fx1NN <- function(TrainS, TestS, TrainLabel, p) {
  n <- dim(TrainS)[1]
  Dist <- matrix(0, nrow = n, ncol = 1)
  for (i in 1:n) {
    Dist[i] <- lpnorm(as.matrix(TrainS[i, ] - TestS), p)   # distance to each training point
  }
  nearest <- which.min(Dist)
  predL <- TrainLabel[nearest]
  return(predL)
}

# Leave-one-out cross-validation of 1-NN over the whole dataset.
LOOCV1NN <- function(Data, Lbl, p) {
  N <- dim(Data)[1]
  predLOOCV <- matrix(0, nrow = N, ncol = 1)
  for (i in 1:N) {
    LOOCV_train <- Data[-i, ]
    LOOCV_test  <- Data[i, ]
    TLabel      <- Lbl[-i]
    predLOOCV[i] <- fx1NN(LOOCV_train, LOOCV_test, TLabel, p)
  }

  confM <- confusionMatrix(factor(predLOOCV), factor(Lbl))
  R <- as.matrix(confM)
  Recall    <- R[1, 1] / sum(R[, 1])
  Precision <- R[1, 1] / sum(R[1, ])
  F1score   <- 2 / ((1 / Recall) + (1 / Precision))
  CorClass  <- sum(predLOOCV == Lbl)   # number of correctly classified points
  LL <- list(confM, CorClass, F1score)
  return(LL)
}

LOOCV1NN(wdbc_n, Lbl, 0.1)

Best Answer

There's a large literature on speeding up nearest neighbor search, as well as numerous software libraries (you might consider using one of these instead of re-implementing from scratch). I'll describe some general classes of solutions below. Most of these methods aim to reduce the amount of computation by exploiting some kind of structure in the data.

Parallelization

One approach is to parallelize the computation (e.g. using a cluster, GPU, or multiple cores on a single machine). Rather than reducing the amount of computation, this strategy breaks the problem into multiple pieces that are solved simultaneously on different processing units. For example, Garcia et al. (2008) report large speedups by parallelizing brute force search using GPUs.
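For instance, each left-out point in the question's LOOCV loop is predicted independently, so the loop is embarrassingly parallel. A rough sketch using base R's parallel package, reusing the question's fx1NN, wdbc_n, and Lbl (mclapply forks processes, so on Windows you would substitute parLapply with a cluster):

library(parallel)

# Distribute the independent leave-one-out predictions across cores.
preds <- mclapply(seq_len(nrow(wdbc_n)), function(i) {
  fx1NN(wdbc_n[-i, ], wdbc_n[i, ], Lbl[-i], 0.1)   # the question's 1-NN function
}, mc.cores = max(1, detectCores() - 1))
predLOOCV <- unlist(preds)

This does not reduce the total number of distance computations, but the wall-clock time scales down roughly with the number of cores.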

Exact space partitioning methods

This class of methods aims to reduce the number of distance computations needed to find nearest neighbors. The data points are used to build a tree structure that hierarchically partitions the data space. The tree can efficiently identify some data points that cannot be nearest neighbors of a new query point, so distances to these points don't need to be computed. Many algorithms exist using different types of trees. Common variants include KD trees, metric/ball trees, and cover trees. For example, see Kibriya and Frank (2007). Tree-based methods can be used to compute exact solutions (i.e. nearest neighbors match those found by brute force search). This approach is efficient for low dimensional data, but performance can degrade in high dimensional settings. There are also tree-based methods for approximate nearest neighbor search (see below).
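As a hedged sketch in R, the FNN package provides kd-tree search (this is a suggestion of mine, not part of the question's code; note that FNN computes Euclidean distances, not the Lp quasi-norm with p = 0.1 used in the question):

library(FNN)

# Exact nearest training neighbour of the first point, leave-one-out style.
train <- as.matrix(wdbc_n[-1, ])
query <- as.matrix(wdbc_n[1, , drop = FALSE])
nn    <- get.knnx(data = train, query = query, k = 1, algorithm = "kd_tree")
pred  <- Lbl[-1][nn$nn.index[1, 1]]   # label of the nearest neighbour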

Approximate nearest neighbor search

Larger speedups can be achieved (particularly for high dimensional data) if one is willing to settle for approximate nearest neighbors rather than exact nearest neighbors. This is often sufficient in practice, especially for large datasets. Many different strategies have been proposed for approximate nearest neighbor search. See Muja and Lowe (2014) for a review.

Methods based on locality sensitive hashing (LSH) are particularly popular (e.g. see Andoni and Indyk 2006). The idea here is to use specialized hash functions that efficiently map data points to discrete buckets, such that similar points end up in the same bucket. One can search for neighbors among the subset of training points that are mapped to the same buckets as the query point. The hash functions should be tailored to the distance metric used to define nearest neighbors.
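As a rough illustration of the idea (not a production implementation), the sketch below uses the standard p-stable scheme for Euclidean distance: each point is projected onto random directions, the projections are quantized into width-w bins, and the concatenated bin indices form a bucket key. In practice several independent hash tables are used to boost recall; the helper name lsh_hash and all parameter values below are my own invention.

# Minimal p-stable LSH sketch for Euclidean distance.
lsh_hash <- function(X, W, b, w) {
  # X: n x d data, W: d x m random N(0,1) projections, b: offsets in [0, w), w: bin width
  H <- floor((as.matrix(X) %*% W + matrix(b, nrow(X), length(b), byrow = TRUE)) / w)
  apply(H, 1, paste, collapse = "_")   # one bucket key per point
}

set.seed(1)
d <- ncol(wdbc_n); m <- 8; w <- 1     # m hash functions, arbitrary bin width
W <- matrix(rnorm(d * m), d, m)
b <- runif(m, 0, w)

train_keys <- lsh_hash(wdbc_n, W, b, w)
query_key  <- lsh_hash(wdbc_n[1, , drop = FALSE], W, b, w)

# Only points sharing the query's bucket are compared exactly.
candidates <- which(train_keys == query_key)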

There are also approximate nearest neighbor search algorithms based on space-partitioning trees and nearest neighbor graphs. For example, see Muja and Lowe (2014) and Liu et al. (2005).
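A short sketch of approximate kd-tree search with the RANN package (again a suggestion of mine, and again restricted to Euclidean distance): setting eps > 0 allows the search to return neighbours whose distance is within a factor (1 + eps) of the true nearest distance, trading accuracy for speed.

library(RANN)

X   <- as.matrix(wdbc_n)
res <- nn2(data = X, query = X, k = 2, eps = 0.5)   # k = 2: the first hit is the point itself
approx_nn <- res$nn.idx[, 2]                        # approximate nearest neighbour of each point
pred      <- Lbl[approx_nn]                         # leave-one-out style 1-NN predictions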

Dimensionality reduction

Dimensionality reduction maps high dimensional data points to a lower dimensional space. Searching for neighbors in the lower dimensional space is faster because distance computations operate on fewer dimensions. Of course, one must take into account the computational cost of the mapping itself (which depends strongly on the method used). Note that nearest neighbor search in the low dimensional space is equivalent to using a different distance metric.

In some cases (depending on the method and structure of the data), it's possible to reduce the dimensionality such that nearest neighbors in the low dimensional space approximately match those in the high dimensional space. For supervised learning (e.g. KNN regression and classification), matching neighbors in the high dimensional space typically doesn't matter. Rather, the goal is to learn a function with good generalization performance. Dimensionality reduction can sometimes increase generalization performance by acting as a form of regularization, or by providing a space where distances are more meaningful to the problem.

There's a large literature on dimensionality reduction including linear, nonlinear, supervised, and unsupervised methods. PCA is often the first thing people try because it's a standard method, works well in many cases, and scales efficiently to large datasets. But, whether it (or another method) will work well depends on the problem.
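A minimal sketch of this idea using the question's own functions (the choice of 5 components is arbitrary and should be tuned, e.g. by the proportion of variance explained):

pca    <- prcomp(wdbc_n)          # wdbc_n is already normalized
scores <- pca$x[, 1:5]            # keep the first 5 principal components
LOOCV1NN(scores, Lbl, 0.1)        # distances now computed in 5 dimensions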

For an example application to nearest neighbor search, see Deegalla and Bostrom (2006), comparing PCA and random projections for nearest neighbor classification.

Feature selection

Feature selection methods retain a subset of the input features and discard the rest. Typically, labels/target values from the training set are used to select features relevant for solving a classification or regression problem. Similar to dimensionality reduction, feature selection speeds up nearest neighbor search by reducing the number of dimensions. It is not generally aimed toward preserving distances.
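A minimal filter-style sketch (my own illustration, not a specific package): rank features by absolute correlation with the numeric-coded label and keep the strongest ones before running the question's LOOCV. The cutoff of 10 features is arbitrary.

y      <- as.numeric(factor(Lbl))                          # numeric-coded class label
scores <- apply(wdbc_n, 2, function(col) abs(cor(col, y)))
keep   <- order(scores, decreasing = TRUE)[1:10]           # keep the 10 most correlated features
LOOCV1NN(wdbc_n[, keep], Lbl, 0.1)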

Nearest prototypes

The idea here is to compress the training set into a smaller number of representative points (called prototypes). This speeds up neighbor search by reducing the number of points to search over. It can also have a denoising effect for supervised learning problems. Garcia et al. (2012) review various approaches.
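A rough sketch of one simple option among those reviewed by Garcia et al. (2012): run k-means within each class and use the resulting centroids as prototypes (the helper make_prototypes and the value of 20 prototypes per class are my own choices):

make_prototypes <- function(Data, Lbl, per_class = 20) {
  protos <- NULL
  plab   <- NULL
  for (cl in unique(Lbl)) {
    km     <- kmeans(Data[Lbl == cl, , drop = FALSE], centers = per_class, nstart = 5)
    protos <- rbind(protos, km$centers)
    plab   <- c(plab, rep(cl, per_class))
  }
  list(x = protos, y = plab)
}

P <- make_prototypes(as.matrix(wdbc_n), Lbl, per_class = 20)

# Classify one point with the question's 1-NN function against the prototypes only.
fx1NN(P$x, as.matrix(wdbc_n)[1, ], P$y, 0.1)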

Combining methods

Many of the techniques described above could be combined to yield even further speedups. For example, Pan and Manocha (2011) and Gieseke et al. (2014) use GPUs to parallelize locality sensitive hashing and tree-based methods, respectively.

References

  • Andoni and Indyk (2006). Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions.

  • Deegalla and Bostrom (2006). Reducing high-dimensional data by principal component analysis vs. random projection for nearest neighbor classification.

  • Garcia et al. (2008). Fast k nearest neighbor search using GPU.

  • Garcia et al. (2012). Prototype selection for nearest neighbor classification: Taxonomy and empirical study.

  • Gieseke et al. (2014). Buffer kd trees: processing massive nearest neighbor queries on GPUs.

  • Kibriya and Frank (2007). An empirical comparison of exact nearest neighbour algorithms.

  • Liu et al. (2005). An investigation of practical approximate nearest neighbor algorithms.

  • Muja and Lowe (2014). Scalable nearest neighbor algorithms for high dimensional data.

  • Pan and Manocha (2011). Fast GPU-based locality sensitive hashing for k-nearest neighbor computation.
