When the data contains mixed types (numerical and categorical), neither Euclidean distance alone nor Hamming distance alone is enough.
So I have two approaches:
- Standardize all the numeric data with min-max scaling so every numeric feature lies in [0, 1], then use Euclidean distance alone.
- Calculate the Euclidean distance for the numeric features and the Hamming distance for the categorical features, then combine both distances (with weights).
My questions are:
1. Are my two approaches correct? If yes, which is better?
2. How can I combine the distances (i.e., choose the weight for each feature)?
3. Is there an implementation of the second approach in scikit-learn in Python?
Best Answer
In my opinion your first approach isn't enough, because of the fundamental difference between categorical and numerical features: min-max scaling puts the numeric features on a common scale, but it does not turn arbitrary category codes into values where Euclidean distance is meaningful. I don't have enough knowledge about a standardisation that would fix this, so I recommend treating the two types separately.
Your second proposition seems better, because you use an appropriate distance for each type of data and combine them to obtain a final result. There is a lot to discuss about how to weight them.
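On the scikit-learn part of the question: as far as I know there is no built-in mixed-type metric in scikit-learn, but the second approach is short enough to write yourself. Here is a minimal sketch; the function name, the index-list interface, and the equal default weights are my own choices for illustration, not from any library, and the weights in particular are something you would tune for your data:

```python
import numpy as np

def mixed_distance(x, y, num_idx, cat_idx, w_num=1.0, w_cat=1.0):
    """Weighted sum of Euclidean distance over the numeric features
    and normalized Hamming distance over the categorical features.

    num_idx / cat_idx are lists of feature positions; w_num / w_cat
    are assumed equal by default and must be tuned for your data.
    """
    x = np.asarray(x, dtype=object)
    y = np.asarray(y, dtype=object)
    # Euclidean distance on the numeric part
    num_x = x[num_idx].astype(float)
    num_y = y[num_idx].astype(float)
    d_num = np.sqrt(((num_x - num_y) ** 2).sum())
    # normalized Hamming: fraction of categorical features that differ
    d_cat = np.mean([x[i] != y[i] for i in cat_idx])
    return w_num * d_num + w_cat * d_cat

# example: two records with features [height, weight, color, shape],
# numeric features already min-max scaled to [0, 1]
a = [0.2, 0.8, "red", "round"]
b = [0.5, 0.4, "red", "square"]
d = mixed_distance(a, b, num_idx=[0, 1], cat_idx=[2, 3])
# d = sqrt(0.09 + 0.16) + 0.5 = 1.0
```

Normalizing the Hamming part (a fraction in [0, 1]) keeps it on a scale comparable to min-max-scaled numeric features, which makes the weights easier to reason about.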
I encourage you to read this very interesting paper on categorical data, in which many distance measures are inspected:
Depending on the case, some of them may be preferable to Hamming distance.