Solved – Non-negative matrix factorization (NMF) on mixed data using 1-hot encoding

categorical-encoding, dimensionality-reduction, mixed-type-data, non-negative-matrix-factorization

From an interpretability standpoint, can I use NMF on one-hot encoded categorical data for dimensionality reduction? I have mixed-type data and was thinking about one-hot encoding the categorical features and min-max normalizing the numerical features.
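
For concreteness, here is a minimal sketch of the preprocessing I have in mind (hypothetical column names; uses scikit-learn's OneHotEncoder and MinMaxScaler, with `sparse_output` assuming scikit-learn >= 1.2):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy mixed-type data (hypothetical column names).
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],  # categorical
    "size":  [1.0, 3.5, 2.2, 0.7],             # numerical
})

# One-hot encode the categorical column, min-max scale the numerical one.
pre = ColumnTransformer([
    ("onehot", OneHotEncoder(sparse_output=False), ["color"]),
    ("minmax", MinMaxScaler(), ["size"]),
])

X = pre.fit_transform(df)  # entirely non-negative, so a candidate input to NMF
```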

I read that this approach is not good when using PCA (see the discussion on Stack Overflow), but NMF is based on a different principle than PCA.

Is this a valid approach?

Thank you very much in advance!

Jimmy

Best Answer

There is in fact a probabilistic interpretation of NMF, specifically when the objective function is the generalized KL divergence. The matrix entries are treated as realizations of Poisson random variables: when decomposing X into WH, the rate parameter of each entry X_ij is given by the dot product of the i-th row of W and the j-th column of H. This takes advantage of the fact that the sum of two independent Poisson random variables is also Poisson, with rate equal to the sum of the two rates, which is what makes the model well suited to count data.

The other commonly used objective function, the Frobenius norm, is generally poorly suited to count data, since in most applications counts are highly heteroskedastic. The Frobenius norm doesn't conform to intuition there: the contribution of a particular entry's error to the objective is on the order of the error squared, whereas with the generalized KL divergence the contribution is roughly proportional to e*log(e), where e is the error. This penalizes reconstruction error in a way that also takes into account the absolute size of the input.
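
To make the comparison concrete, here is a minimal sketch contrasting the two objectives in scikit-learn's NMF, where the approximation is rate_ij = (WH)_ij. Note that the generalized KL divergence requires the multiplicative-update solver:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.poisson(lam=3.0, size=(100, 20)).astype(float)  # toy count matrix

# Generalized KL divergence: only the multiplicative-update solver supports it.
nmf_kl = NMF(n_components=5, beta_loss="kullback-leibler", solver="mu",
             init="nndsvda", max_iter=500, random_state=0)
W_kl = nmf_kl.fit_transform(X)  # X is approximated by W_kl @ nmf_kl.components_

# Frobenius norm, the default objective, for comparison.
nmf_fro = NMF(n_components=5, beta_loss="frobenius",
              init="nndsvda", max_iter=500, random_state=0)
W_fro = nmf_fro.fit_transform(X)
```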

But that also means that the contribution of small errors on small entries is comparatively much smaller than for large counts. NMF with the generalized KL divergence may turn out to work just fine on a one-hot encoding, but that isn't something you should count on. You are likely to get a decomposition that is no better than if you had decomposed the numerical features alone, and quite possibly worse. NMF isn't really a good way to exploit the structure of categorical features, and keeping that structure explicit can be very helpful for downstream tasks where another method can make use of it. So I'd recommend trying NMF with each objective function on the numerical features only, and using the concatenation of the learned components and the categorical features.
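
A minimal sketch of that recommendation, with hypothetical column names: factorize only the (scaled) numerical block, then concatenate the learned components with the one-hot encoded categorical features for downstream use.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import NMF
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy mixed-type data (hypothetical column names).
df = pd.DataFrame({
    "size":   [1.0, 3.5, 2.2, 0.7, 2.9],
    "weight": [10.0, 25.0, 18.0, 6.0, 22.0],
    "height": [4.2, 9.1, 7.3, 2.8, 8.0],
    "color":  ["red", "blue", "red", "green", "blue"],
})

# Factorize only the numerical block; min-max scaling keeps it non-negative.
X_num = MinMaxScaler().fit_transform(df[["size", "weight", "height"]])
W = NMF(n_components=2, init="nndsvda", max_iter=500,
        random_state=0).fit_transform(X_num)

# Leave the categorical structure intact as one-hot indicators.
X_cat = OneHotEncoder(sparse_output=False).fit_transform(df[["color"]])

# Concatenated feature matrix for the downstream task.
X_combined = np.hstack([W, X_cat])
```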