When the data contains mixed types (numerical and categorical), neither Euclidean distance alone nor Hamming distance alone is enough.
So I have two approaches:
- Standardize all the numeric data with min-max scaling so every numeric feature lies in [0, 1], then use Euclidean distance alone.
- Calculate the Euclidean distance for the numeric features and the Hamming distance for the categorical features, then combine both distances (with weights).
My questions are:
1. Are my two approaches correct? If yes, which is better?
2. How can I combine the distances (i.e., choose the weight for each feature)?
3. Is there an implementation of the second approach in scikit-learn in Python?
Best Answer
In my opinion your first approach isn't enough, because of the fundamental difference between categorical and numerical features: min-max scaling puts the numeric features on a common scale, but it does not turn arbitrary category codes into values where Euclidean distance is meaningful. I don't have enough knowledge about a standardisation that would fix this, so I recommend treating the two types separately.
Your second proposition seems better, because you use an appropriate distance for each type of data and combine them to obtain a final result. There is a lot to discuss about how to weight them.
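On the scikit-learn part of the question: as far as I know there is no built-in mixed-type metric in scikit-learn, but the second approach is short enough to write yourself. Here is a minimal sketch; the function name, the index-list interface, and the equal default weights are my own choices for illustration, not from any library, and the weights in particular are something you would tune for your data:

```python
import numpy as np

def mixed_distance(x, y, num_idx, cat_idx, w_num=1.0, w_cat=1.0):
    """Weighted sum of Euclidean distance over the numeric features
    and normalized Hamming distance over the categorical features.

    num_idx / cat_idx are lists of feature positions; w_num / w_cat
    are assumed equal by default and must be tuned for your data.
    """
    x = np.asarray(x, dtype=object)
    y = np.asarray(y, dtype=object)
    # Euclidean distance on the numeric part
    num_x = x[num_idx].astype(float)
    num_y = y[num_idx].astype(float)
    d_num = np.sqrt(((num_x - num_y) ** 2).sum())
    # normalized Hamming: fraction of categorical features that differ
    d_cat = np.mean([x[i] != y[i] for i in cat_idx])
    return w_num * d_num + w_cat * d_cat

# example: two records with features [height, weight, color, shape],
# numeric features already min-max scaled to [0, 1]
a = [0.2, 0.8, "red", "round"]
b = [0.5, 0.4, "red", "square"]
d = mixed_distance(a, b, num_idx=[0, 1], cat_idx=[2, 3])
# d = sqrt(0.09 + 0.16) + 0.5 = 1.0
```

Normalizing the Hamming part (a fraction in [0, 1]) keeps it on a scale comparable to min-max-scaled numeric features, which makes the weights easier to reason about.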
I encourage you to read this very interesting paper on categorical data, in which many distance measures are inspected:
Depending on the case, some of them may be preferable to Hamming distance.