Solved – How to use nearest neighbor to find similar population based on list of features

k-nearest-neighbour, sas

Nearest-neighbor methods can classify a new data point based on the classes of its k nearest neighbors. Suppose dataset A contains 10,000 data points and dataset B contains 1 million (1 MM) data points. The goal is to find the records in dataset B that most closely resemble the records in dataset A on a number of pre-decided attributes (features). Given a list of specific features that I'm interested in, calculating the distance between one record and all the other records and picking the records with the smallest distance can serve this purpose.

SAS has a couple of procedures that can do nearest-neighbor classification, such as PROC DISCRIM, which takes training data and classifies the test data, as shown below. In this case, how should the training data be defined, given that the purpose is just to find the records in dataset B that look most like each individual record in dataset A? Can I construct the training data by randomly taking 50% of dataset B and combining it with dataset A, and use the remaining 50% of dataset B as test data?

    proc discrim data=train
        method=npar k=5
        testdata=toscore
        testout=toscore_out
        ;
        class y;
        var x1-x10; /* a list of features to compare */
    run;

Best Answer

Regarding PROC DISCRIM: judging from the PROC DISCRIM documentation, I don't see a way to get, for each observation in the test data, its closest neighbour (observation) in the training data. I found a similar problem described here: https://stackoverflow.com/questions/19626326/k-nearest-neighbor-in-sas-how-to-get-the-neighbor-list-for-each-row. It also has no answer.

I'd suggest using another procedure. Here is an example workaround using PROC MODECLUS. However, it needs further work, because for each record in B (the test set) it loops through the whole of table A (the training set). It also makes no use of the y label from table A.

  1. Example data

        data A; input a b c y; cards;
        1 1 1 1
        1 1 2 1
        2 2 2 2
        2 2 3 2
        4 4 4 3
        4 4 5 3
        ;
        run;
        data B; input a b c; cards;
        1 1 4
        2 2 3
        3 3 3
        3 3 4
        ;
        run;
  2. For each record in B, find its nearest neighbour in A. Put the results in table C.

        %macro findNN(A,B,C);
        /* get table sizes */
        %local i size1 size2; /* keep the loop counter and table sizes local to the macro */
            proc sql;
                select count(*) into :size1
                from &B.;
                select count(*) into :size2
                from &A.;
            quit;
        %let size2=%eval(&size2.+1);
        /* for each observation from table B*/
        %do i=1 %to &size1.;
            data AB;
                set &A. &B.(firstobs=&i. obs=&i.);
                keep a b c;
            run;
            /* find its nearest neighbour from table A */
            ods select Neighbor;
            proc modeclus data=AB method=1 k=2 Neighbor;  
                var a b c;
                ods output Neighbor=tableout;
            run;
            /* add the neighbor found to the table C */
            %if &i.=1 %then %do;
            data &C.;
                set tableout(where=(compress(id)="&size2.") keep=id nbor distance in=out);
                if out=1 then id=&i.;
            run;
            %end;
            %else %do;
            data &C.;
                set &C. tableout(where=(compress(id)="&size2.") keep=id nbor distance in=out);
                if out=1 then id=&i.;
            run;
            %end;
        %end;
        %mend findNN;
        %findNN(A,B,ANborB);
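
As a possible alternative to the macro loop, the distances can be computed directly with a cross join in PROC SQL. This is only a minimal sketch, not part of the original answer: it assumes the example tables A and B above, plain Euclidean distance on a, b and c, and features that are already on comparable scales. The names A_id, B_id, id_a, id_b, pairs and nearest are made up for illustration.

    /* Add row identifiers so each match can be traced back */
    /* (A_id, B_id, id_a, id_b are hypothetical names).     */
    data A_id; set A; id_a = _n_; run;
    data B_id; set B; id_b = _n_; run;

    proc sql;
        /* Euclidean distance between every B record and every A record */
        create table pairs as
        select tb.id_b,
               ta.id_a,
               ta.y,
               sqrt((ta.a - tb.a)**2 + (ta.b - tb.b)**2 + (ta.c - tb.c)**2) as distance
        from B_id as tb, A_id as ta;

        /* For each B record keep the A record(s) at the minimum distance; */
        /* ties produce more than one row per id_b.                        */
        create table nearest as
        select *
        from pairs
        group by id_b
        having distance = min(distance)
        order by id_b, id_a;
    quit;

Grouping by id_a instead of id_b would give, for each record in A, its closest record(s) in B, which is the direction asked for in the question. Note that the cross join materialises every pair, so for 10,000 x 1 MM records it would probably have to be run on B in chunks rather than in one step.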
    