Solved – Trying to understand non-negative matrix factorization (NMF)

machine learning, matrix decomposition, non-negative matrix factorization

I'm trying to understand how NMF is derived. I get the basic idea of NMF:
it tries to approximate the original matrix $V$ with $WH$, where $V$ is non-negative and $W,H$ are constrained to be non-negative.

Question 1

When I approach this problem, it seems intuitive that I should try to minimize the reconstruction error,
$$\operatorname{Err}(W,H) = \|V-WH\|\,,$$
over $W$ and $H$. But if $\operatorname{Err}(W,H)$ is the objective function, then I have a problem defining the constraints. How do I constrain $W,H$ to be non-negative?
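
Written out explicitly, I suppose the constrained problem would look something like

$$\min_{W,\,H}\ \|V - WH\| \quad \text{subject to}\quad W_{ik} \ge 0,\ H_{kj} \ge 0 \ \text{for all } i,j,k\,,$$

but I don't see how to actually enforce those inequality constraints during optimization.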

Question 2

The paper I'm reading doesn't take the reconstruction error as the objective function; instead, it takes the perspective that NMF reconstructs the original $V$ from $WH$ with some added Poisson noise (last two paragraphs of page 3).

This confuses me badly. Why Poisson? What about Gaussian, gamma, or beta?

Question 3

No matter what the objective function is, I think I can always solve the problem by gradient descent or Newton's method, can't I?

And is the reason the multiplicative update method was proposed that it is faster than gradient descent?

Best Answer

Question 3:

An important reason is that you want non-negative entries in your matrices. If you perform classical gradient descent on the objective function, you face two problems:

  • The objective function is non-convex in its variables jointly. It is convex when either factor matrix is held constant, and alternating optimisation of the factors will indeed make the objective function decrease monotonically, though this can be very slow. Classical gradient descent, on the other hand, has no convergence guarantee here: it may get caught in highly suboptimal local optima, or even diverge.
  • More importantly, gradient descent may drive your parameters negative if large enough steps are taken (see the sketch after this list). This also partly answers Question 1: there is a priori no straightforward way to impose the positivity constraint while optimising in the classical, unconstrained way.
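
To illustrate the second point, here is a toy example (my own, not from the paper): a single plain gradient step on the squared error, with a large step size, already pushes entries of the factors below zero.

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((6, 5))            # non-negative data
W = rng.random((6, 2))            # non-negative initial factors
H = rng.random((2, 5))

grad_W = -(V - W @ H) @ H.T       # gradient of 0.5 * ||V - WH||_F^2 w.r.t. W
grad_H = -W.T @ (V - W @ H)       # gradient w.r.t. H

lr = 5.0                          # deliberately large step size
W_new = W - lr * grad_W
H_new = H - lr * grad_H
print("negative entries in W after one step:", int((W_new < 0).sum()))
print("negative entries in H after one step:", int((H_new < 0).sum()))
```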

Also, the multiplicative update rules (as mentioned in the paper) make it possible to find a local optimum of the chosen objective function while the factors remain non-negative. I assume the local optimum is often good enough.
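
For concreteness, here is a minimal NumPy sketch of multiplicative updates for the squared-error objective (this is the well-known Lee–Seung style rule, not necessarily the exact algorithm in the paper you're reading). Non-negativity is preserved automatically because each factor is multiplied elementwise by a ratio of non-negative matrices:

```python
import numpy as np

def nmf_multiplicative(V, rank, n_iter=200, eps=1e-10, seed=0):
    """Minimal multiplicative-update NMF for min ||V - WH||_F^2 (a sketch)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank))       # random non-negative initialisation
    H = rng.random((rank, m))
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), elementwise ratio of non-negative terms
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        # W <- W * (V H^T) / (W H H^T)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# usage on a small non-negative matrix
V = np.abs(np.random.default_rng(1).normal(size=(20, 10)))
W, H = nmf_multiplicative(V, rank=3)
print("reconstruction error:", np.linalg.norm(V - W @ H))
```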

Question 2: using a Poisson model is convenient because the observations it generates are non-negative by construction (matching $V$), and the likelihood is simple. The update rule is compatible with the resulting objective function and is simple to implement.
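
To make the Poisson connection concrete (a sketch, assuming the model $V_{ij}\sim\operatorname{Poisson}\big((WH)_{ij}\big)$ with independent entries), the negative log-likelihood is

$$-\log p(V \mid W,H) = \sum_{i,j}\Big[(WH)_{ij} - V_{ij}\log (WH)_{ij} + \log V_{ij}!\Big]\,,$$

which, after dropping terms that depend only on $V$, equals (up to an additive constant) the generalized Kullback–Leibler divergence

$$D(V \,\|\, WH) = \sum_{i,j}\Big[V_{ij}\log\frac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij}\Big]\,.$$

So maximizing the Poisson likelihood is the same as minimizing this divergence. A Gaussian noise model would instead give back the squared reconstruction error, so the choice of noise model and the choice of objective function go hand in hand.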

Question 1: this is largely answered above. The multiplicative update rules ensure that non-negativity is preserved throughout training, so the constraint never needs to be imposed explicitly.
