SVD for Imputing Missing Values – A Step-by-Step Example

Tags: data-imputation, missing data, r, svd

I have read the great comments regarding how to deal with missing values before applying SVD, but I would like to know how it works with a simple example:

        Movie1 Movie2 Movie3
User1     5      NA     4
User2     2      5      5
User3     NA     3      4
User4     1      NA     5
User5     5      1      5

Given the matrix above, if I remove the rows that contain NA values, I will end up with only User2 and User5. This means that my U will be 2 × k. But to predict the missing values, U should be 5 × k, so that I can multiply it with the singular values and V.

Could someone fill in the missing values in the matrix above by first removing users with missing values and then applying SVD? Please provide a very simple explanation of the procedure you applied and make your answer practical (i.e. this number multiplied by that number gives this answer) rather than using too many math symbols.

I've read the following links:

stats.stackexchange.com/q/33142

stats.stackexchange.com/q/31096

stats.stackexchange.com/q/33103

Best Answer

SVD is only defined for complete matrices. So if you stick to plain SVD, you need to fill in the missing values beforehand (SVD is not an imputation algorithm per se). The errors you introduce will hopefully be cancelled out by the matrix-factorization approach (general assumption: the data is generated by a low-rank model).

Removing entire rows, as you propose, is a bad idea: it throws away every observed rating of User1, User3 and User4, and leaves no row of U for them. Even setting the missing values to zero would be better.

There are many imputation strategies, but in this case I would impute with the column mean (or maybe the row mean). This is basically the strategy recommended in your second link.
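To make the shape problem concrete, here is a small sketch (using NumPy, which is my choice and not something from the question) that drops the incomplete rows of the example matrix and runs SVD on what is left. The variable names are only illustrative.

```python
import numpy as np

# Rows are User1..User5, columns are Movie1..Movie3; NaN marks a missing rating.
ratings = np.array([
    [5, np.nan, 4],
    [2, 5, 5],
    [np.nan, 3, 4],
    [1, np.nan, 5],
    [5, 1, 5],
], dtype=float)

# Keep only the users who rated every movie (User2 and User5).
complete = ~np.isnan(ratings).any(axis=1)
kept = ratings[complete]

U, s, Vt = np.linalg.svd(kept, full_matrices=False)
print(kept.shape)  # (2, 3)
print(U.shape)     # (2, 2)
```

U has only two rows, one per remaining user, so there is nothing to multiply with the singular values and V for User1, User3 or User4.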

        Movie1 Movie2 Movie3
User1   5      NA     4
User2   2      5      5
User3   NA     3      4
User4   1      NA     5
User5   5      1      5

becomes (column mean, i.e. the average score of each movie; the Movie1 mean of 13/4 = 3.25 is rounded to 3 here)

        Movie1 Movie2 Movie3
User1   5      3      4
User2   2      5      5
User3   3      3      4
User4   1      3      5
User5   5      1      5
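The column-mean fill can be checked by hand or with a few lines of plain Python. This is only a sketch: the helper names `column_means` and `impute_column_mean` are made up for illustration, and `None` stands in for a missing rating.

```python
# Rows are User1..User5, columns are Movie1..Movie3; None marks a missing rating.
ratings = [
    [5, None, 4],
    [2, 5, 5],
    [None, 3, 4],
    [1, None, 5],
    [5, 1, 5],
]

def column_means(matrix):
    """Mean of the observed (non-None) entries in each column."""
    means = []
    for column in zip(*matrix):
        observed = [value for value in column if value is not None]
        means.append(sum(observed) / len(observed))
    return means

def impute_column_mean(matrix):
    """Replace each missing entry with the mean of its column."""
    means = column_means(matrix)
    return [
        [means[j] if value is None else value for j, value in enumerate(row)]
        for row in matrix
    ]

means = column_means(ratings)
filled = impute_column_mean(ratings)
print(means)      # [3.25, 3.0, 4.6]
print(filled[2])  # [3.25, 3, 4]
```

Movie1: (5 + 2 + 1 + 5) / 4 = 3.25, and Movie2: (5 + 3 + 1) / 3 = 3. The table above shows these as whole numbers; for the SVD step you can keep the unrounded 3.25.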

And one more remark: you should preprocess the data. At least subtract the mean from all values!
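Putting the pieces together, the impute + centre + SVD procedure might look like the sketch below. It assumes NumPy and a rank of k = 2, both of which are my choices for illustration; the right k depends on your data and would normally be picked by validation.

```python
import numpy as np

# Rows are User1..User5, columns are Movie1..Movie3; NaN marks a missing rating.
ratings = np.array([
    [5, np.nan, 4],
    [2, 5, 5],
    [np.nan, 3, 4],
    [1, np.nan, 5],
    [5, 1, 5],
], dtype=float)
missing = np.isnan(ratings)

# Step 1: fill each missing entry with the mean of its column (movie).
col_means = np.nanmean(ratings, axis=0)
filled = np.where(missing, col_means, ratings)

# Step 2: subtract the column means so every movie is centred at zero.
means = filled.mean(axis=0)
centered = filled - means

# Step 3: SVD of the complete, centred matrix; U is now 5 x 3.
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

# Step 4: keep the k largest singular values (k = 2 is an assumption)
# and multiply U * singular values * V back together.
k = 2
approx = (U[:, :k] * s[:k]) @ Vt[:k, :] + means

# Step 5: read the predictions off at the originally missing positions,
# in the order (User1, Movie2), (User3, Movie1), (User4, Movie2).
predictions = approx[missing]
print(predictions)
```

Each prediction is literally "row of U times singular values times column of V, plus the movie mean". With all three singular values kept, the product reproduces the imputed matrix exactly, so it is the truncation to k that changes the filled-in values.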

Have a look at this introduction. It mentions the impute + SVD approach and also talks about modelling the missing values more directly, although that relies on algorithms other than plain SVD.