Solved – Dimensionality reduction for high dimensional sparse data before clustering or spherical k-means

clustering, dimensionality reduction, distance, recommender-system, sparse

I am trying to build my first recommender system, in which I create a user feature space and then cluster the users into groups. To generate recommendations for a particular user, I first find the cluster to which the user belongs and then recommend the entities (items) in which his/her nearest neighbors showed interest. The data I am working with is high-dimensional and sparse. Before implementing this approach, I have a few questions whose answers might help me adopt a better one.

As my data is high-dimensional and sparse, should I apply dimensionality reduction and then cluster, or should I use an algorithm such as spherical k-means that works directly on sparse, high-dimensional data?
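For the first option, one common route is to reduce the sparse matrix with truncated SVD (which accepts sparse input directly, unlike PCA) and then run k-means; normalizing the reduced rows to unit length makes ordinary k-means behave approximately like spherical k-means (cosine geometry). The sketch below assumes scikit-learn is available; the matrix shape, density, component count, and cluster count are all illustrative placeholders, not recommendations.

```python
# Minimal sketch: TruncatedSVD on a sparse user-feature matrix,
# then k-means on length-normalized rows (approximates spherical
# k-means). All sizes/parameters below are illustrative assumptions.
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Fake sparse "users x features" matrix standing in for real data
X = sparse.random(200, 1000, density=0.01, random_state=0, format="csr")

# TruncatedSVD works directly on sparse input (no densifying needed)
svd = TruncatedSVD(n_components=20, random_state=0)
X_reduced = svd.fit_transform(X)

# Unit-norm rows => Euclidean k-means ~ clustering by cosine direction
X_unit = normalize(X_reduced)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_unit)
print(labels.shape)  # one cluster id per user
```

Whether this beats clustering the raw sparse vectors is an empirical question; it is worth evaluating both against your recommendation metric.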

How should I find the nearest neighbors after clustering the users? (Which distance measure should I use, given that I have read Euclidean distance is not a good measure for high-dimensional data?)
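One option for this step is to restrict the neighbor search to the user's own cluster and use cosine distance, which is typically better behaved than Euclidean distance for high-dimensional sparse data. The sketch below is a hypothetical illustration: the arrays `X_unit` and `labels` stand in for whatever representation and cluster assignments a prior step produced.

```python
# Sketch: cosine nearest neighbours restricted to the user's cluster.
# X_unit and labels are placeholder stand-ins for the output of an
# earlier reduction/clustering step; sizes are illustrative.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X_unit = rng.standard_normal((200, 20))   # user representations
labels = rng.integers(0, 5, size=200)     # cluster id per user

user = 0
members = np.flatnonzero(labels == labels[user])   # same-cluster users

# Brute-force cosine k-NN over the cluster members only
nn = NearestNeighbors(n_neighbors=6, metric="cosine").fit(X_unit[members])
dist, idx = nn.kneighbors(X_unit[user:user + 1])

# The closest hit is the user itself (distance 0), so drop it
neighbours = members[idx[0]][1:]
print(neighbours)
```

Items liked by `neighbours` (minus those the user has already seen) would then form the candidate recommendations.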

Best Answer

You're in a modeling space where there are no textbook answers, much less a "ground truth." In my view, the best solution is to develop and explore a logical combination of methodological options, settling on an approach that meets or exceeds some pre-agreed performance criteria.

That said, there is a plethora of possible distance functions to explore, including (for text data) the Jaccard index, the Dice coefficient, the discrete Hellinger distance, and so on. Mahalanobis distance is also worth exploring. It receives thorough and excellent treatment in this CV thread:
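To make the candidates above concrete, here is a small sketch computing Jaccard, Dice, cosine, and discrete Hellinger distances on toy vectors. The binary vectors and probability distributions are invented for illustration only; Dice and Hellinger are computed from their standard formulas rather than a library call.

```python
# Toy comparison of distance functions on binary interaction vectors
# (Jaccard, Dice, cosine) and discrete distributions (Hellinger).
# All vector values below are illustrative assumptions.
import numpy as np
from scipy.spatial.distance import jaccard, cosine

u = np.array([1, 1, 0, 1, 0], dtype=bool)   # items user A interacted with
v = np.array([1, 0, 0, 1, 1], dtype=bool)   # items user B interacted with

jac = jaccard(u, v)                          # 1 - |u & v| / |u | v|
inter = np.sum(u & v)
dice = 1 - 2 * inter / (u.sum() + v.sum())   # Dice dissimilarity, by formula
cos = cosine(u, v)                           # 1 - cosine similarity

# Discrete Hellinger distance between two probability distributions
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
hellinger = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

print(jac, dice, cos, hellinger)
```

Which of these works best depends on how the user vectors are encoded (binary interactions vs. counts vs. distributions), so it is worth benchmarking a few against your evaluation criterion.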

Bottom to top explanation of the Mahalanobis distance?
