A nearest-neighbor classifier assigns a new data point a class based on the classes of its k nearest neighbors. Suppose dataset A contains 10,000 data points and dataset B contains 1 million. The goal is to find the records in dataset B that most closely resemble the records in dataset A on a number of pre-selected attributes (features). Given a list of specific features that I'm interested in, calculating the distance between one record and all the other records and picking the records with the smallest distance would serve this purpose.
SAS has a couple of procedures that can do nearest-neighbor classification, such as PROC DISCRIM, which takes training data and classifies the test data, as shown below. In this case, how should the training data be defined, given that the purpose is just to find the records in dataset B that are most similar to each individual record in dataset A? Can I construct the training data by randomly taking 50% of dataset B and combining it with dataset A, and then use the remaining 50% of dataset B as test data?
proc discrim data=train
    method=npar k=5
    testdata=toscore
    testout=toscore_out
    ;
    class y;
    var x1-x10; /* a list of features to compare */
run;
Best Answer
Regarding PROC DISCRIM: judging from the PROC DISCRIM documentation, I don't see a way to get, for each observation in the test data, its closest neighbour (observation) in the training data. I've found a similar problem described here: https://stackoverflow.com/questions/19626326/k-nearest-neighbor-in-sas-how-to-get-the-neighbor-list-for-each-row. It is also unanswered.
I'd suggest using another procedure. Here is an example workaround using PROC MODECLUS. However, it needs further work, because for each record in B (the testing set) it loops through the whole of table A (the training set). It also makes no use of the y label from table A.
1. Example data
Find nearest neighbors of A in B. Put results to table C.
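To make the idea concrete, here is a minimal brute-force sketch. It is not the PROC MODECLUS workaround itself; it swaps in a plain PROC SQL cross join to compute Euclidean distances, then keeps the k closest B records for each A record. The toy data, the ID columns a_id and b_id, the two features x1 and x2, and the macro variable k are all assumptions made for illustration. A cross join of 10,000 by 1 million records would produce 10^10 pairs, so at full size this would likely need to be run in chunks of A, and the features would probably need standardizing first.

```sas
/* Toy data (hypothetical): A holds the reference records, B the candidate pool */
data A;
  input a_id x1 x2;
  datalines;
1 0.0 0.0
2 5.0 5.0
;
run;

data B;
  input b_id x1 x2;
  datalines;
101 0.1 0.2
102 4.8 5.1
103 9.0 9.0
104 0.5 0.4
;
run;

/* Euclidean distance for every (A, B) pair, sorted by A record and distance */
proc sql;
  create table pairs as
  select ta.a_id,
         tb.b_id,
         sqrt((ta.x1 - tb.x1)**2 + (ta.x2 - tb.x2)**2) as dist
  from A as ta, B as tb
  order by ta.a_id, dist;
quit;

/* Keep the k closest B records for each A record (k is an assumed setting) */
%let k = 2;

data C;
  set pairs;
  by a_id;
  if first.a_id then nbr_rank = 0;  /* restart the counter for each A record */
  nbr_rank + 1;                     /* sum statement: value is retained across rows */
  if nbr_rank <= &k;                /* subsetting IF keeps only the top k rows */
run;
```

With this toy data, table C should pair a_id 1 with b_id 101 and 104, and a_id 2 with b_id 102 and 103. The y label from table A is not used here either; if it is needed, it can be carried along by adding it to the select list.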