NMF is a subspace learning method for obtaining low-dimensional representations. The learned representations are parts-based: if we use it for face recognition, for instance, different parts of the faces correspond to different basis vectors, unlike PCA, which produces eigenfaces that we can call \(\it{holistic}\) representations.

Let \(\mathbf{X} \in \mathbb{R}_ {\geq 0} ^{n \times m}\) be the non-negative matrix that we want to factorize, whose \(m\) columns are \(n\)-dimensional data vectors. We want to obtain two matrices, \(\mathbf{W} \in \mathbb{R}_ {\geq 0} ^{n \times r}\) and \(\mathbf{H} \in \mathbb{R}_ {\geq 0} ^{r \times m}\), such that \(\mathbf{X} \approx \mathbf{W}\mathbf{H}\), where \(r\) is the factorization rank, generally chosen smaller than \(n\) and \(m\).
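
To make the shapes concrete, here is a small toy setup in NumPy; the dimensions and the random non-negative data are arbitrary choices, just for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, r = 100, 50, 10         # data dimension, number of samples, factorization rank
X = rng.random((n, m))        # non-negative data matrix; each column is a data vector
W = rng.random((n, r))        # basis matrix (to be learned)
H = rng.random((r, m))        # weight matrix (to be learned)

print((W @ H).shape)          # (100, 50), the same shape as X
```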

So, in simple terms, we form our input matrix and learn two matrices: a basis matrix and a weight matrix. Each data vector is then approximated by a weighted sum of the basis vectors (the columns of \(\mathbf{W}\)), where the weights are the components of the corresponding column of \(\mathbf{H}\). To see this more explicitly, we can write the following:

\[x_ {ij} \approx \sum_ {k=1} ^r w_ {ik} h_ {kj}\]

and

\[\begin{bmatrix} x_ {11} & x_ {12} & x_ {13}\\ x_ {21} & x_ {22} & x_ {23}\\ x_ {31} & x_ {32} & x_ {33} \end{bmatrix} \approx \begin{bmatrix} w_ {11} & w_ {12}\\ w_ {21} & w_ {22}\\ w_ {31} & w_ {32} \end{bmatrix} \begin{bmatrix} h_ {11} & h_ {12} & h_ {13}\\ h_ {21} & h_ {22} & h_ {23} \end{bmatrix}\]

so we can write the approximation of the first data vector as follows:

\[\begin{bmatrix} x_ {11}\\ x_ {21}\\ x_ {31} \end{bmatrix} \approx \begin{bmatrix} w_ {11}\\ w_ {21}\\ w_ {31} \end{bmatrix} h_ {11} + \begin{bmatrix} w_ {12}\\ w_ {22}\\ w_ {32} \end{bmatrix} h_ {21}\]
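
In code, a column of the approximation is exactly this kind of weighted sum of the columns of \(\mathbf{W}\). A quick sanity check, reusing the toy matrices from the snippet above:

```python
# Column j of WH is a weighted sum of the columns of W,
# with the weights taken from column j of H.
j = 0
x_j_approx = sum(W[:, k] * H[k, j] for k in range(r))
assert np.allclose(x_j_approx, (W @ H)[:, j])
```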

We could optimize this with a gradient-based method, but to ensure non-negativity, Lee and Seung (2000) proposed the multiplicative update rules.

The Multiplicative Update Rule

Let’s define the cost function first: \(||\mathbf{X}-\mathbf{W}\mathbf{H}||^2\), the squared Frobenius norm of the reconstruction error. If we can find an exact factorization, the cost will be equal to zero, but that is generally not the case. More formally, we can state the optimization problem as follows:

\[\min ||\mathbf{X}-\mathbf{W}\mathbf{H}||^2 ~~ \text{w.r.t.} ~~ \mathbf{W} ~~ \text{and} ~~ \mathbf{H} ~~ \text{subject to} ~~ \mathbf{W}, \mathbf{H} \geq 0\]
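
In NumPy, the cost is just the squared Frobenius norm of the residual (reusing the toy matrices from above):

```python
def nmf_cost(X, W, H):
    """Squared Frobenius norm of the reconstruction error, ||X - WH||^2."""
    return np.linalg.norm(X - W @ H, ord="fro") ** 2

print(nmf_cost(X, W, H))      # large for random W and H; optimization should shrink it
```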

Our cost function is not convex in \(\mathbf{W}\) and \(\mathbf{H}\) together, but it is convex in \(\mathbf{W}\) alone and in \(\mathbf{H}\) alone, so we can iteratively optimize one of them while keeping the other fixed; this is a block coordinate descent scheme.

Okay, let’s start by finding the gradients of our cost function.

\[\frac{\partial ||\mathbf{X}-\mathbf{W}\mathbf{H}||^2}{\partial \mathbf{W}}=\frac{\partial}{\partial \mathbf{W}} \text{Tr}((\mathbf{X}-\mathbf{W}\mathbf{H})(\mathbf{X}-\mathbf{W}\mathbf{H})^T)=-2\mathbf{X}\mathbf{H}^T+2\mathbf{W}\mathbf{H}\mathbf{H}^T\] \[\frac{\partial ||\mathbf{X}-\mathbf{W}\mathbf{H}||^2}{\partial \mathbf{H}}=-2\mathbf{W}^T\mathbf{X}+2\mathbf{W}^T\mathbf{W}\mathbf{H}\]
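
A finite-difference check is a cheap way to convince ourselves that these expressions are right. Here I perturb a single entry of \(\mathbf{W}\) and compare the numerical derivative with the analytic gradient; this check is my own addition, reusing `nmf_cost` from the previous snippet:

```python
def grad_W(X, W, H):
    return -2 * X @ H.T + 2 * W @ H @ H.T

def grad_H(X, W, H):
    return -2 * W.T @ X + 2 * W.T @ W @ H

# Numerically differentiate the cost w.r.t. W[0, 0] and compare with the analytic gradient.
eps = 1e-6
E = np.zeros_like(W)
E[0, 0] = eps
numeric = (nmf_cost(X, W + E, H) - nmf_cost(X, W - E, H)) / (2 * eps)
print(numeric, grad_W(X, W, H)[0, 0])   # the two numbers should agree closely
```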

With these gradients, we can write the gradient descent update rules as follows:

\[\mathbf{W} \xleftarrow{} \mathbf{W} - \eta_W(-\mathbf{X}\mathbf{H}^T+\mathbf{W}\mathbf{H}\mathbf{H}^T)=\mathbf{W}+\eta_W(\mathbf{X}\mathbf{H}^T-\mathbf{W}\mathbf{H}\mathbf{H}^T)\] \[\mathbf{H} \xleftarrow{} \mathbf{H} - \eta_H(-\mathbf{W}^T\mathbf{X}+\mathbf{W}^T\mathbf{W}\mathbf{H})=\mathbf{H} + \eta_H(\mathbf{W}^T\mathbf{X}-\mathbf{W}^T\mathbf{W}\mathbf{H})\]
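
A single plain gradient descent step then looks like this; the scalar step sizes are arbitrary values picked for illustration, and note that nothing here prevents entries of \(\mathbf{W}\) or \(\mathbf{H}\) from going negative:

```python
eta_W = eta_H = 1e-4          # fixed scalar step sizes, chosen arbitrarily

W = W + eta_W * (X @ H.T - W @ H @ H.T)   # gradient step for W
H = H + eta_H * (W.T @ X - W.T @ W @ H)   # gradient step for H
```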

The subtractions in these update rules can produce negative entries, which violates our non-negativity constraint. Lee and Seung (2000) proposed adaptive, element-wise step sizes (the divisions below are element-wise):

\[\eta_W = \frac{\mathbf{W}}{\mathbf{W}\mathbf{H}\mathbf{H}^T}\] \[\eta_H = \frac{\mathbf{H}}{\mathbf{W}^T\mathbf{W}\mathbf{H}}\]

These step sizes cancel the subtractions in the update rules, so there is no risk of getting negative entries. We obtain the multiplicative update rules as follows:

\[\mathbf{W} \xleftarrow{} \mathbf{W}+\eta_W(\mathbf{X}\mathbf{H}^T-\mathbf{W}\mathbf{H}\mathbf{H}^T)=\mathbf{W}+\frac{\mathbf{W}}{\mathbf{W}\mathbf{H}\mathbf{H}^T}(\mathbf{X}\mathbf{H}^T-\mathbf{W}\mathbf{H}\mathbf{H}^T)=\mathbf{W}\frac{\mathbf{X}\mathbf{H}^T}{\mathbf{W}\mathbf{H}\mathbf{H}^T}\] \[\mathbf{H} \xleftarrow{} \mathbf{H} + \frac{\mathbf{H}}{\mathbf{W}^T\mathbf{W}\mathbf{H}}(\mathbf{W}^T\mathbf{X}-\mathbf{W}^T\mathbf{W}\mathbf{H})=\mathbf{H}\frac{\mathbf{W}^T\mathbf{X}}{\mathbf{W}^T\mathbf{W}\mathbf{H}}\]
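
Putting it all together, here is a minimal sketch of NMF with the multiplicative updates; it is a direct paraphrase of the rules above rather than a reference implementation. The products inside the numerators and denominators are matrix products, while the outer multiplications and divisions are element-wise, and the small constant in the denominators is my own guard against division by zero:

```python
def nmf_multiplicative(X, r, n_iter=200, eps=1e-10, seed=0):
    """Factorize a non-negative X (n x m) into W (n x r) and H (r x m)
    with the multiplicative update rules of Lee and Seung (2000)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, r))
    H = rng.random((r, m))
    for _ in range(n_iter):
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # update W with H fixed
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # update H with W fixed
    return W, H

W_hat, H_hat = nmf_multiplicative(X, r=10)
print(np.linalg.norm(X - W_hat @ H_hat) ** 2)  # much lower than for random W and H
```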

The convergence proof is given in the paper, and I am skipping it for now.

References

D. D. Lee and H. S. Seung, “Algorithms for Non-negative Matrix Factorization”, Advances in Neural Information Processing Systems 13 (NIPS 2000), 2000.