I have a matrix X = [XCat XNum] where:
XCat is a matrix made of dummy variables resulting from encoding categorical variables
XNum is a matrix of continuous variables.
I want to apply a clustering algorithm that takes into account the categorical nature of part of the features in X. So I created a custom distance function that uses the Hamming distance for the encoded categorical variables (dummies) and L1 (cityblock) for the continuous variables. This is the function:
function D = MixDistance(XCat, XNum)
% Mixed categorical/numerical distance
% INPUT:
%   XCat = nObs x nCat matrix of dummy-encoded categorical features
%   XNum = nObs x nNum matrix of numerical features
% OUTPUT:
%   D = nObs x nObs matrix of pairwise distances

% Number of categorical and numerical features
nCat = size(XCat, 2);
nNum = size(XNum, 2);

% Compute distances, separately
DCat = pdist2(XCat, XCat, 'hamming');
DNum = pdist2(XNum, XNum, 'cityblock');

% Compute relative weight based on the number of categorical variables
wCat = nCat / (nCat + nNum);
D = wCat*DCat + (1 - wCat)*DNum;
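To make the weighting concrete, here is a minimal sketch on made-up toy data (the values are hypothetical, not my real data). It repeats the same computation inline, so it only assumes that pdist2 from the Statistics and Machine Learning Toolbox is available.

```matlab
% Hypothetical toy data: 3 observations, one categorical variable encoded
% as 2 dummy columns, plus one continuous variable.
XCat = [1 0; 0 1; 1 0];
XNum = [0; 1; 3];

% Same computation as in MixDistance, written inline.
nCat = size(XCat, 2);
nNum = size(XNum, 2);
DCat = pdist2(XCat, XCat, 'hamming');    % fraction of dummy columns that differ
DNum = pdist2(XNum, XNum, 'cityblock');  % sum of absolute differences
wCat = nCat / (nCat + nNum);             % 2/3 for this toy data
D = wCat * DCat + (1 - wCat) * DNum;     % 3-by-3 symmetric matrix
disp(D)
```

Observations 1 and 3 share the same category, so their distance comes only from the continuous part (1/3 * 3 = 1), while observations 2 and 3 differ in both parts (2/3 * 1 + 1/3 * 2 = 4/3). Note that the Hamming part is bounded by 1 while the cityblock part is not, so the scaling of the continuous columns affects the balance between the two terms.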
Now, one might be tempted to call kmedoids like this:
[IDX, C, SUMD, D, MIDX, INFO] = kmedoids(X,3,'distance', @MixDistance,'replicates',3);
but of course it doesn't work, as the function MixDistance needs XCat and XNum as inputs, not just X.
Also, because of the way function handles work, this doesn't work either:
[IDX, C, SUMD, D, MIDX, INFO] = kmedoids(X,3,'distance', MixDistance(XCat, XNum),'replicates',3);
Any ideas?
Alternatively, any suggestions on clustering when the data are mixed, that is, both categorical and continuous?
Best Answer
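One possible approach, sketched under assumptions: the documented contract for a custom 'Distance' handle in kmedoids is that it receives a single observation ZI (1-by-n) and a matrix ZJ (m-by-n), and returns an m-by-1 vector of distances. The second call in the question fails because MixDistance(XCat, XNum) is evaluated immediately and passes a matrix rather than a handle. An anonymous function can instead capture nCat and split the columns itself. The name mixDist and the toy data below are hypothetical, and the sketch assumes the dummy columns come first in X.

```matlab
% Hypothetical toy data: the first nCat columns of X are dummies,
% the remaining columns are continuous.
rng(0);
XCat = [repmat([1 0], 10, 1); repmat([0 1], 10, 1)];
XNum = [randn(10, 1); randn(10, 1) + 5];
X = [XCat XNum];

nCat = size(XCat, 2);
nNum = size(XNum, 2);
wCat = nCat / (nCat + nNum);

% Distance between one row ZI (1-by-n) and every row of ZJ (m-by-n),
% returned as an m-by-1 vector. nCat and wCat are captured by the handle.
mixDist = @(ZI, ZJ) ...
    wCat * pdist2(ZJ(:, 1:nCat), ZI(1:nCat), 'hamming') + ...
    (1 - wCat) * pdist2(ZJ(:, nCat+1:end), ZI(nCat+1:end), 'cityblock');

% Pass the handle itself; kmedoids calls it with rows of X as needed.
[IDX, C] = kmedoids(X, 2, 'Distance', mixDist, 'Replicates', 3);
```

The design choice here is to keep X as a single matrix and let the handle do the column split, so nothing about the kmedoids call itself has to change apart from the 'Distance' argument.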