Solved – Should one always perform SVD before doing KNN?

k-nearest-neighbour, recommender-system, sparse, svd

I am trying to perform collaborative filtering to recommend products to customers in the fashion industry. I am using the usual KNN approach to compute similarities among products.
I have seen people use SVD (Singular Value Decomposition) before collaborative filtering, but all of those examples seemed to deal with predicting movie ratings.

I want to know whether, in my case, it is suitable to use SVD (svd() in R) prior to collaborative filtering, and if so, whether I should replace zero/missing values with non-zero ones. The second point comes from the idea that plain SVD is not very useful when dealing with sparse data.

Best Answer

The problem with directly using the sparse, high-dimensional matrix for the nearest-neighbour computation is the curse of dimensionality, the general term for the problems that arise when working in many dimensions. In your case, a distance function computed over many highly correlated (low-information) dimensions will not give any contrast: all points end up roughly equally far apart, so the "nearest" neighbours are barely more similar than the rest.
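
To see that contrast loss concretely, here is a small illustration in R with toy random data (nothing from your application): as the number of dimensions grows, the nearest and farthest points from a reference point become almost equally far away.

```r
# Relative contrast (max dist - min dist) / min dist from one reference
# point, for random data of increasing dimensionality.
set.seed(1)
contrast <- sapply(c(2, 10, 100, 1000), function(p) {
  X <- matrix(runif(500 * p), nrow = 500)
  d <- as.matrix(dist(X))[1, -1]   # Euclidean distances from point 1
  (max(d) - min(d)) / min(d)
})
contrast  # shrinks towards 0 as the dimension grows
```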

Sparsity creeps in as well, which further degrades the usefulness of the raw data. SVD helps by reducing the matrix to the rank that carries most of the relevant structure: if your data is N x M but effectively contains only rank-r information, reducing its dimensionality with the SVD gives you the most relevant features in an N x r matrix. A similar effect can be achieved with other dimensionality-reduction techniques such as PCA.
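
A minimal sketch of this reduction in R, using the base svd() you mention; the matrix X and the rank r below are made up purely for illustration:

```r
# Toy user-by-product matrix (hypothetical data): rows = users, cols = products
set.seed(1)
X <- matrix(rpois(20 * 8, lambda = 1), nrow = 20, ncol = 8)

s <- svd(X)   # base R: X = U %*% diag(d) %*% t(V)
r <- 3        # target rank, e.g. chosen by inspecting the singular values s$d

# N x r representation of the users; product (item) embeddings are obtained
# analogously from s$v.
X_reduced <- s$u[, 1:r] %*% diag(s$d[1:r])
```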

So it is better to replace the zeros with the mean of the data: this imputation adds essentially no extra information in the pre-processing (the SVD will largely remove it anyway), and it avoids throwing away the other relevant information in the same samples.
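
A sketch of that imputation step in R, assuming (as is common) that a zero in the ratings matrix means "not rated" and using the per-product column mean as the fill value:

```r
# Replace the zero ("missing") entries of each column with the mean of the
# observed entries in that column, then run the SVD on the filled matrix.
impute_col_mean <- function(X) {
  X_filled <- X
  for (j in seq_len(ncol(X))) {
    observed <- X[, j] != 0
    fill <- if (any(observed)) mean(X[observed, j]) else 0
    X_filled[!observed, j] <- fill
  }
  X_filled
}

X_filled <- impute_col_mean(X)
s <- svd(X_filled)
```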

Then perform KNN (or another instance-based method) on the reduced representation to search for the related samples.
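
Putting it together, a sketch of an item-item KNN search on the reduced representation; the rank r, the number of neighbours k, and the query index are all placeholders:

```r
# Product embeddings from the truncated SVD: one row of length r per product.
r <- 3
items <- s$v[, 1:r] %*% diag(s$d[1:r])

k <- 5        # number of neighbours to return
query <- 1    # index of the product we want recommendations for

# Euclidean distances from the query product to every other product
d <- sqrt(rowSums(sweep(items, 2, items[query, ])^2))
neighbours <- order(d)[2:(k + 1)]   # drop the query product itself
neighbours
```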