# Algorithms for nonnegative matrix and tensor factorizations: a unified view based on block coordinate descent framework

## Abstract

We review algorithms developed for nonnegative matrix factorization (NMF) and nonnegative tensor factorization (NTF) from a unified view based on the block coordinate descent (BCD) framework. NMF and NTF are low-rank approximation methods for matrices and tensors in which the low-rank factors are constrained to have only nonnegative elements. The nonnegativity constraints have been shown to enable natural interpretations and allow better solutions in numerous applications including text analysis, computer vision, and bioinformatics. However, the computation of NMF and NTF remains challenging and expensive due to the constraints. Numerous algorithmic approaches have been proposed to efficiently compute NMF and NTF. The BCD framework in constrained non-linear optimization readily explains the theoretical convergence properties of several efficient NMF and NTF algorithms, which are consistent with experimental observations reported in the literature. In addition, we discuss algorithms that do not fit in the BCD framework, contrasting them with those based on the BCD framework. With insights acquired from the unified perspective, we also propose efficient algorithms for updating NMF when there is a small change in the reduced dimension or in the data. The effectiveness of the proposed updating algorithms is validated experimentally with synthetic and real-world data sets.

## Keywords

Nonnegative matrix factorization, Nonnegative tensor factorization, Low-rank approximation, Block coordinate descent

## 1 Introduction

Nonnegative matrix factorization (NMF) is a dimension reduction and factor analysis method. Many dimension reduction techniques are closely related to the low-rank approximations of matrices, and NMF is special in that the low-rank factor matrices are constrained to have only nonnegative elements. The nonnegativity reflects the inherent representation of data in many application areas, and the resulting low-rank factors lead to physically natural interpretations [66]. NMF was first introduced by Paatero and Tapper [74] as positive matrix factorization and subsequently popularized by Lee and Seung [66]. Over the last decade, NMF has received enormous attention and has been successfully applied to a broad range of important problems in areas including text mining [77, 85], computer vision [47, 69], bioinformatics [10, 23, 52], spectral data analysis [76], and blind source separation [22] among many others.

Our first goal in this paper is to provide an overview of algorithms developed to solve (2) from a unifying perspective. Our review is organized based on the block coordinate descent (BCD) method in non-linear optimization, within which we show that most successful NMF algorithms and their convergence behavior can be explained. Among numerous algorithms studied for NMF, the most popular is the multiplicative updating rule by Lee and Seung [67]. This algorithm has the advantage of being simple and easy to implement, and it has contributed greatly to the popularity of NMF. However, slow convergence of the multiplicative updating rule has been pointed out [40, 71], and more efficient algorithms equipped with stronger theoretical convergence properties have been introduced. The efficient algorithms are based on either the alternating nonnegative least squares (ANLS) framework [53, 59, 71] or the hierarchical alternating least squares (HALS) method [19, 20]. We show that these methods can be derived using one common framework of the BCD method and then characterize some of the most promising NMF algorithms in Sect. 2. Algorithms for accelerating the BCD-based methods as well as algorithms that do not fit in the BCD framework are summarized in Sect. 3, where we explain how they differ from the BCD-based methods. In the ANLS method, the subproblems appear as nonnegativity constrained least squares (NLS) problems. Much research has been devoted to designing NMF algorithms based on efficient methods to solve the NLS subproblems [18, 42, 51, 53, 59, 71]. A review of many successful algorithms for the NLS subproblems is provided in Sect. 4 with discussion on their advantages and disadvantages.

Extending our discussion to low-rank approximations of tensors, we show that algorithms for some nonnegative tensor factorization (NTF) can similarly be elucidated based on the BCD framework. Tensors are mathematical objects for representing multidimensional arrays; vectors and matrices are first-order and second-order special cases of tensors, respectively. The canonical decomposition (CANDECOMP) [14] or the parallel factorization (PARAFAC) [43], which we denote by the CP decomposition, is one of the natural extensions of the singular value decomposition to higher order tensors. The CP decomposition with nonnegativity constraints imposed on the loading matrices [19, 21, 32, 54, 60, 84], which we denote by nonnegative CP (NCP), can be computed in a way that is similar to the NMF computation. We introduce details of the NCP decomposition and summarize its computation methods based on the BCD method in Sect. 5.

Lastly, in addition to providing a unified perspective, our review leads to the realizations of NMF in more dynamic environments. One common case arises when we have to compute NMF for several \(K\) values, which is often needed to determine a proper \(K\) value from data. Based on insights from the unified perspective, we propose an efficient algorithm for updating NMF when \(K\) varies. We show how this method can compute NMFs for a set of different \(K\) values with much less computational burden. Another case occurs when NMF needs to be updated efficiently for a data set which keeps changing due to the inclusion of new data or the removal of obsolete data. This often occurs when the matrices represent data from time-varying signals in computer vision [11] or text mining [13]. We propose an updating algorithm which takes advantage of the fact that most of the data in two consecutive time steps overlap, so that we do not have to compute NMF from scratch. Algorithms for these cases are discussed in Sect. 7, and their experimental validations are provided in Sect. 8.

Our discussion is focused on the algorithmic developments of NMF formulated as (2). In Sect. 9, we only briefly discuss other aspects of NMF and conclude the paper.

*Notations*: Notations used in this paper are as follows. A lowercase or an uppercase letter, such as \(x\) or \(X\), denotes a scalar; a boldface lowercase letter, such as \(\mathbf{x }\), denotes a vector; a boldface uppercase letter, such as \(\mathbf{X }\), denotes a matrix; and a boldface Euler script letter, such as \({\varvec{\mathcal{X }}}\), denotes a tensor of order three or higher. Indices typically run from \(1\) to the corresponding uppercase letter: For example, \(n\in \left\{ 1,\ldots ,N\right\} \). Elements of a sequence of vectors, matrices, or tensors are denoted by superscripts within parentheses, such as \(\mathbf{X }^{(1)},\ldots ,\mathbf{X }^{(N)}\), and the entire sequence is denoted by \(\left\{ \mathbf{X }^{(n)}\right\} \). When matrix \(\mathbf{X }\) is given, \(\left(\mathbf{X }\right)_{\cdot i}\) or \({\mathbf{x }}_{\cdot i}\) denotes its \(i\)th column, \(\left(\mathbf{X }\right)_{i\cdot }\) or \(\mathbf{x }_{i\cdot }\) denotes its \(i\)th row, and \(x_{ij}\) denotes its \((i,j)\)th element. For simplicity, we also let \(\mathbf{x }_{i}\) (without a dot) denote the \(i\)th column of \(\mathbf X \). The set of nonnegative real numbers is denoted by \(\mathbb{R }_{+}\), and \(\mathbf{X }\ge 0\) indicates that the elements of \(\mathbf X \) are nonnegative. The notation \([\mathbf{X }]_{+}\) denotes a matrix that is the same as \(\mathbf{X }\) except that all its negative elements are set to zero. A *nonnegative matrix* or a *nonnegative tensor* refers to a matrix or a tensor with only nonnegative elements. The null space of matrix \(\mathbf{X }\) is denoted by \(null(\mathbf{X })\). Operator \(\bigotimes \) denotes element-wise multiplication of vectors or matrices.

## 2 A unified view—BCD framework for NMF

The BCD method is a divide-and-conquer strategy that can be generally applied to non-linear optimization problems. It divides variables into several disjoint subgroups and iteratively minimizes the objective function with respect to the variables of one subgroup at a time. We first introduce the BCD framework and its convergence properties and then explain several NMF algorithms under the framework.

Also known as the *non-linear Gauss-Seidel* method [5], this algorithm updates one block at a time, always using the most recently updated values of the other blocks \(\mathbf{x }_{\tilde{m}},\tilde{m}\ne m\). This is important since it ensures that after each update the objective function value does not increase. For a sequence \(\left\{ \mathbf{x }^{(i)}\right\} \) where each \(\mathbf{x }^{(i)}\) is generated by the BCD method, the following property holds.

**Theorem 1**

The proof of this theorem for an arbitrary number of blocks is shown in Bertsekas [5], and the last statement regarding the two-block case is due to Grippo and Sciandrone [41]. For a non-convex optimization problem, most algorithms only guarantee the stationarity of a limit point [46, 71].

When applying the BCD method to a constrained non-linear programming problem, it is critical to wisely choose a partition of \(\mathcal X \), whose Cartesian product constitutes \(\mathcal X \). An important criterion is whether subproblems (5) for \(m=1,\ldots ,M\) are efficiently solvable: For example, if the solutions of subproblems appear in a closed form, each update can be computed fast. In addition, it is worth checking whether the solutions of subproblems depend on each other. The BCD method requires that the most recent values be used for each subproblem (5). When the solutions of subproblems depend on each other, they have to be computed sequentially to make use of the most recent values; if solutions for some blocks are independent of each other, however, they can be computed simultaneously. We discuss how different choices of partitions lead to different NMF algorithms. Three cases of partitions are shown in Fig. 1, and each case is discussed below.
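As a minimal illustration of the cyclic update with the most recent values, the following sketch applies a two-block BCD iteration to a small convex quadratic with nonnegativity constraints. The objective, the starting point, and the function names are illustrative assumptions and do not come from the paper; each block here is a scalar whose constrained minimizer is available in closed form.

```python
def f(x, y):
    # Toy objective (assumed for illustration): strictly convex in (x, y).
    return (x - 3.0) ** 2 + (y - 2.0) ** 2 + x * y


def bcd_two_blocks(x, y, sweeps=50):
    """Cyclic two-block BCD for min f(x, y) subject to x >= 0, y >= 0."""
    history = [f(x, y)]
    for _ in range(sweeps):
        # Block 1: exact minimizer over x >= 0 with y fixed
        # (df/dx = 2(x - 3) + y = 0, then projection onto x >= 0).
        x = max(0.0, 3.0 - y / 2.0)
        # Block 2: exact minimizer over y >= 0, using the freshly updated x.
        y = max(0.0, 2.0 - x / 2.0)
        history.append(f(x, y))
    return x, y, history


x_star, y_star, history = bcd_two_blocks(0.0, 0.0)
```

Because each block is minimized exactly, the recorded objective values never increase, and for this toy problem the iterates approach the stationary point \((8/3, 2/3)\).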

### 2.1 BCD with two matrix blocks—ANLS method

**Corollary 1**

If a minimum of each subproblem in (7) is attained at each step, every limit point of the sequence \( \left\{ \left({\mathbf{W }},\mathbf{H }\right)^{(i)}\right\} \) generated by the ANLS framework is a stationary point of (2).

Note that the minimum is not required to be unique for the convergence result to hold because the number of blocks is two [41]. Therefore, \(\mathbf{H }\) in (7a) or \({\mathbf{W }}\) in (7b) need not be of full column rank for the property in Corollary 1 to hold. On the other hand, some numerical methods for the NLS subproblems require the full rank conditions so that they return a solution that attains a minimum: See Sect. 4 as well as regularization methods in Sect. 2.4.
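The two-block alternation can be sketched as follows, assuming the usual Frobenius-norm formulation \(\min \Vert \mathbf{A }-\mathbf{W }\mathbf{H }^{T}\Vert _{F}^{2}\) with \(\mathbf{W }\in \mathbb{R }_{+}^{M\times K}\) and \(\mathbf{H }\in \mathbb{R }_{+}^{N\times K}\). The inner solver is a plain projected gradient iteration chosen only for brevity, so each subproblem is solved approximately rather than exactly; it is a stand-in, not one of the specific NLS solvers reviewed in Sect. 4, and all function names are hypothetical.

```python
import numpy as np


def nls_projected_gradient(B, C, X, iters=200):
    """Approximately solve min_{X >= 0} ||B X - C||_F^2, warm-started at X."""
    G = B.T @ B
    BtC = B.T @ C
    step = 1.0 / np.linalg.norm(G, 2)  # 1 / Lipschitz constant of the gradient
    for _ in range(iters):
        X = np.maximum(X - step * (G @ X - BtC), 0.0)
    return X


def anls(A, K, outer=30, seed=0):
    """Alternate between the W-subproblem and the H-subproblem."""
    rng = np.random.default_rng(seed)
    M, N = A.shape
    W = rng.random((M, K))
    H = rng.random((N, K))
    errors = []
    for _ in range(outer):
        # Block 1: fix H, solve min_{W >= 0} ||H W^T - A^T||_F.
        W = nls_projected_gradient(H, A.T, W.T).T
        # Block 2: fix W (just updated), solve min_{H >= 0} ||W H^T - A||_F.
        H = nls_projected_gradient(W, A, H.T).T
        errors.append(np.linalg.norm(A - W @ H.T))
    return W, H, errors


rng = np.random.default_rng(0)
A = rng.random((30, 4)) @ rng.random((4, 20))  # nonnegative, rank at most 4
W, H, errors = anls(A, K=4)
```

Since the projected gradient step with this step length never increases the subproblem objective, the residual norms in `errors` are non-increasing even though the subproblems are not solved to optimality.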

### 2.2 BCD with \(2K\) vector blocks—HALS/RRI method

**Theorem 2**

*Proof*

**Corollary 2**

If the columns of \({\mathbf{W }}\) and \(\mathbf{H }\) remain nonzero throughout all the iterations and the minimums in (13) are attained at each step, every limit point of the sequence \(\left\{ \left({\mathbf{W }},\mathbf{H }\right)^{(i)}\right\} \) generated by the HALS/RRI algorithm is a stationary point of (2).

In practice, a zero column could occur in \({\mathbf{W }}\) or \(\mathbf H \) during the HALS/RRI algorithm. This happens if \(\mathbf{h }_{k}\in null(\mathbf{R }_{k}), \mathbf{w }_{k}\in null(\mathbf{R }_{k}^{T}), \mathbf{R }_{k}\mathbf{h }_{k}\le 0\), or \(\mathbf{R }_{k}^{T}\mathbf{w }_{k}\le 0\). To prevent zero columns, a small positive number could be used for the maximum operator in (13): That is, \(\max (\cdot ,\epsilon )\) with a small positive number \(\epsilon \) such as \(10^{-16}\) is used instead of \(\max (\cdot ,0)\) [20, 35]. The HALS/RRI algorithm with this modification often shows faster convergence compared to other BCD methods or previously developed methods [37, 59]. See Sect. 3.1 for acceleration techniques for the HALS/RRI method and Sect. 6.2 for more discussion on experimental comparisons.
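A compact sketch of the column-wise update with the \(\max (\cdot ,\epsilon )\) safeguard is given below. It assumes the formulation \(\mathbf{A }\approx \mathbf{W }\mathbf{H }^{T}\) and the standard closed-form rank-one update; the sweep order (all columns of \(\mathbf{W }\), then all columns of \(\mathbf{H }\)) and the function name are assumptions made for illustration.

```python
import numpy as np


def hals(A, W, H, iters=100, eps=1e-16):
    """HALS/RRI-style sweeps for min ||A - W H^T||_F with W, H >= eps."""
    W, H = W.copy(), H.copy()
    errors = []
    for _ in range(iters):
        AH, HtH = A @ H, H.T @ H
        for k in range(W.shape[1]):
            # Closed-form minimizer for column w_k given all other columns,
            # floored at eps instead of 0 to avoid zero columns.
            W[:, k] = np.maximum(
                W[:, k] + (AH[:, k] - W @ HtH[:, k]) / HtH[k, k], eps
            )
        AtW, WtW = A.T @ W, W.T @ W
        for k in range(H.shape[1]):
            H[:, k] = np.maximum(
                H[:, k] + (AtW[:, k] - H @ WtW[:, k]) / WtW[k, k], eps
            )
        errors.append(np.linalg.norm(A - W @ H.T))
    return W, H, errors


rng = np.random.default_rng(0)
A = rng.random((30, 4)) @ rng.random((4, 20))
W, H, errors = hals(A, rng.random((30, 4)), rng.random((20, 4)))
```

Each column update costs only matrix-vector products with the precomputed \(\mathbf{A }\mathbf{H }\) and \(\mathbf{H }^{T}\mathbf{H }\) (or their transposed counterparts), which is one reason the method is cheap per iteration.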

### 2.3 BCD with \(K(M+N)\) scalar blocks

The updates of \(w_{mk}\) and \(h_{nk}\) are independent of all other elements in the same column. Therefore, it is possible to update all the elements in each column of \(\mathbf{W }\) (and \(\mathbf H \)) simultaneously. Once we organize the update of (18) column-wise, the result is the same as (14). That is, a particular arrangement of the BCD method with scalar blocks is equivalent to the BCD method with \(2K\) vector blocks. Accordingly, the HALS/RRI method can be derived by the BCD method either with vector blocks or with scalar blocks. On the other hand, it is not possible to simultaneously solve for the elements in each row of \(\mathbf{W }\) (or \(\mathbf H \)) because their solutions depend on each other. The convergence property of the scalar block case is similar to that of the vector block case.

**Corollary 3**

If the columns of \(\mathbf{W }\) and \(\mathbf H \) remain nonzero throughout all the iterations and if the minimums in (18) are attained at each step, every limit point of the sequence \(\left\{ \left(\mathbf{W },\mathbf H \right)^{(i)}\right\} \) generated by the BCD method with \(K(M+N)\) scalar blocks is a stationary point of (2).

The multiplicative updating rule also uses element-wise updating [67]. However, the multiplicative updating rule is different from the scalar block BCD method in the sense that its solutions are not optimal for subproblems (18). See Sect. 3.2 for more discussion.

### 2.4 BCD for some variants of NMF

*squares* of the \(l_{1}\)-norm of the columns of \(\mathbf H \). Alternatively, we can impose the \(l_{1}\)-norm based regularization without squaring: That is,

## 3 Acceleration and other approaches

### 3.1 Accelerated methods

The BCD methods described so far have been very successful for the NMF computation. In addition, several techniques to accelerate the methods have been proposed. Korattikara et al. [62] proposed a subsampling strategy to improve the two matrix block (i.e., ANLS) case. Their main idea is to start with a small factorization problem, which is obtained by random subsampling, and gradually increase the size of subsamples. Under the assumption of asymptotic normality, the decision whether to increase the size is made based on statistical hypothesis testing. Gillis and Glineur [38] proposed a multi-level approach, which also gradually increases the problem size based on a multi-grid representation. The method in [38] is applicable not only to the ANLS methods, but also to the HALS/RRI method and the multiplicative updating method.

Hsieh and Dhillon proposed a greedy coordinate descent method [48]. Unlike the HALS/RRI method, in which every element is updated exactly once per iteration, they selectively choose the elements whose update will lead to the largest decrease of the objective function. Although their method does not follow the BCD framework, they showed that every limit point generated by their method is a stationary point. Gillis and Glineur also proposed an acceleration scheme for the HALS/RRI and the multiplicative updating methods: Unlike the standard versions, their approach repeats updating the elements of \(\mathbf{W }\) several times before updating the elements of \(\mathbf H \) [37]. Noticeable improvements in the speed of convergence are reported.

### 3.2 Multiplicative updating rules

*multiplications* as

Lee and Seung [67] showed that under the multiplicative updating rule, the objective function in (2) is non-increasing. However, it is unknown whether it converges to a stationary point. Gonzalez and Zhang demonstrated the difficulty [40], and the slow convergence of multiplicative updates has been further reported in [53, 58, 59, 71]. To overcome this issue, Lin [70] proposed a modified update rule for which every limit point is stationary; note that, after this modification, the update rule becomes additive instead of multiplicative.

Since the values are updated only through multiplications, the elements of \(\mathbf{W }\) and \(\mathbf H \) obtained by the multiplicative updating rule typically remain nonzero. Hence, its solution matrices typically are denser than those from the BCD methods. The multiplicative updating rule breaks down if a zero value occurs in an element of the denominators in (27). To circumvent this difficulty, practical implementations often add a small number, such as \(10^{-16}\), to each element of the denominators.
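The safeguarded rule can be sketched as follows, assuming the standard Lee and Seung updates for \(\min \Vert \mathbf{A }-\mathbf{W }\mathbf{H }^{T}\Vert _{F}^{2}\), which is presumably what (27) expresses. The function name and the test matrix are hypothetical.

```python
import numpy as np


def multiplicative_update(A, W, H, iters=200, eps=1e-16):
    """Multiplicative updates with eps added to the denominators."""
    W, H = W.copy(), H.copy()
    errors = []
    for _ in range(iters):
        # Element-wise rescaling: entries change only through multiplication,
        # so strictly positive entries stay strictly positive.
        W *= (A @ H) / (W @ (H.T @ H) + eps)
        H *= (A.T @ W) / (H @ (W.T @ W) + eps)
        errors.append(np.linalg.norm(A - W @ H.T))
    return W, H, errors


rng = np.random.default_rng(0)
A = rng.random((30, 4)) @ rng.random((4, 20))
W, H, errors = multiplicative_update(A, rng.random((30, 4)), rng.random((20, 4)))
```

In this run every entry of the factors remains positive, which illustrates why the resulting matrices tend to be denser than those produced by methods that project onto the nonnegative orthant.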

### 3.3 Alternating least squares method

### 3.4 Successive rank one deflation

Some algorithms have been designed to compute NMF based on successive rank-one deflation. This approach is motivated by the fact that the singular value decomposition (SVD) can be computed through successive rank-one deflation. When considered for NMF, however, the rank-one deflation method has a few issues as we summarize below.

When it comes to NMF, a notable theoretical result about nonnegative matrices relates SVD and NMF when \(K=1\). The following theorem, which extends the Perron-Frobenius theorem [3, 45], is shown in Chapter 2 of Berman and Plemmons [3].

**Theorem 3**

For a nonnegative symmetric matrix \(\mathbf{A }\in \mathbb{R }_{+}^{N\times N}\), the eigenvalue of \(\mathbf{A }\) with the largest magnitude is nonnegative, and there exists a nonnegative eigenvector corresponding to the largest eigenvalue.

A direct consequence of Theorem 3 is the nonnegativity of the best rank-one approximation.

**Corollary 4**

Due to this difficulty, some variations of rank-one deflation have been investigated for NMF. Biggs et al. [6] proposed a rank-one reduction algorithm in which they look for a nonnegative submatrix that is close to a rank-one approximation. Once such a submatrix is identified, they compute the best rank-one approximation using the power method and ignore the residual. Gillis and Glineur [36] sought a nonnegative rank-one approximation under the constraint that the residual matrix remains element-wise nonnegative. Due to the constraints, however, the problem of finding the nonnegative rank-one approximation becomes more complicated and computationally expensive than the power iteration. Optimization properties such as convergence to a stationary point have not been shown for these modified rank-one reduction methods.

It is worth noting the difference between the HALS/RRI algorithm, described as the \(2K\) vector block case in Sect. 2.2, and the rank-one deflation method. These approaches are similar in that the rank-one approximation problem with nonnegativity constraints is solved in each step, filling in the \(k\)th columns of \(\mathbf{W }\) and \(\mathbf H \) with the solution for \(k=1,\ldots ,K\). In the rank-one deflation method, once the \(k\)th columns of \(\mathbf{W }\) and \(\mathbf H \) are computed, they are fixed and kept as a part of the final solution before the \((k+1)\)th columns are computed. On the other hand, the HALS/RRI algorithm updates all the columns through multiple iterations until a local minimum is achieved. This simultaneous searching for all \(2K\) vectors throughout the iterations is necessary to achieve an optimal solution of NMF, unlike in the case of SVD.
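A small numerical check makes the obstacle to plain deflation concrete. For an entrywise positive matrix, the leading singular vectors can be chosen nonnegative, so the best rank-one approximation is nonnegative; the residual after subtracting it, however, contains negative entries, so the next deflation step no longer operates on a nonnegative matrix. The random test matrix below is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 5))  # entrywise positive test matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
u, v = U[:, 0], Vt[0, :]
if u.sum() < 0:  # resolve the sign ambiguity of the SVD
    u, v = -u, -v

rank_one = s[0] * np.outer(u, v)  # best rank-one approximation of A
residual = A - rank_one           # what a deflation scheme would factor next

print("min of u, v:", u.min(), v.min())
print("min of residual:", residual.min())
```

The residual is orthogonal to the positive matrix \(\mathbf{u }\mathbf{v }^{T}\) in the trace inner product, so unless it vanishes it must have entries of both signs.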

## 4 Algorithms for the nonnegativity constrained least squares problems

### 4.1 Projected iterative methods

Many other methods have been developed. Merritt and Zhang [73] proposed an interior point gradient method, and Friedlander and Hatz [32] used a two-metric projected gradient method in their study on NTF. Zdunek and Cichocki [86] proposed a quasi-Newton method, but its lack of convergence was pointed out [51]. Zdunek and Cichocki [87] also studied the projected Landweber method and the projected sequential subspace method.

### 4.2 Active-set and active-set-like methods

*active* and *passive* sets, respectively. In the active-set method, so-called *working sets* are kept track of until the optimal active and passive sets are found. A rough pseudo-code for the active-set method is shown in Algorithm 1.
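For readers who prefer code to pseudo-code, here is a compact sketch in the spirit of the Lawson and Hanson active-set method for a single right-hand side, \(\min _{\mathbf{x }\ge 0}\Vert \mathbf{B }\mathbf{x }-\mathbf{c }\Vert _{2}\). It is written from the textbook description rather than transcribed from Algorithm 1, assumes \(\mathbf{B }\) has full column rank, and uses a simple iteration cap instead of careful handling of degenerate ties; the function name and tolerances are hypothetical.

```python
import numpy as np


def nnls_active_set(B, c, tol=1e-10, max_iter=None):
    """Active-set sketch for min_{x >= 0} ||B x - c||_2 (full-rank B assumed)."""
    n = B.shape[1]
    max_iter = 3 * n if max_iter is None else max_iter
    passive = np.zeros(n, dtype=bool)  # working passive set (free variables)
    x = np.zeros(n)
    w = B.T @ (c - B @ x)              # negative gradient
    for _ in range(max_iter):
        if passive.all() or w[~passive].max() <= tol:
            break                      # optimality conditions are satisfied
        # Free the active variable with the largest negative gradient entry.
        j = np.flatnonzero(~passive)[np.argmax(w[~passive])]
        passive[j] = True
        while passive.any():
            z = np.zeros(n)
            z[passive] = np.linalg.lstsq(B[:, passive], c, rcond=None)[0]
            if z[passive].min() > 0:
                x = z                  # feasible least squares solution
                break
            # Move toward z only as far as feasibility allows.
            blocking = passive & (z <= 0)
            denom = np.maximum(x[blocking] - z[blocking], np.finfo(float).tiny)
            alpha = np.min(x[blocking] / denom)
            x = x + alpha * (z - x)
            passive &= x > tol         # variables that reached zero turn active
            x[~passive] = 0.0
        w = B.T @ (c - B @ x)
    return x


x_small = nnls_active_set(np.eye(3), np.array([1.0, -2.0, 3.0]))
```

For the small example, the unconstrained solution has a negative second component, and the method returns the vector with that component fixed at zero. Note that only one variable enters the passive set per outer iteration, which is the behavior discussed in the scalability remarks of this section.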

Lawson and Hanson’s method has been a standard for the NLS problems, but applying it directly to NMF is very slow. When used for NMF, it can be accelerated in two different ways. The first approach is to use the QR decomposition to solve (41) or the Cholesky decomposition to solve the normal equations \(\left(\mathbf B _\mathcal{E }^{T}\mathbf B _\mathcal{E }\right)\mathbf z =\mathbf B _\mathcal{E }^{T}\mathbf c \) and have the Cholesky or QR factors updated by the Givens rotations [39]. The second approach, which was proposed by Bro and De Jong [9] and Van Benthem and Keenan [81], is to identify common computations in solving the NLS problems with multiple right-hand sides. More information and experimental comparisons of these two approaches are provided in [59].

The active-set methods possess the property that the objective function decreases after each iteration; however, maintaining this property often limits their scalability. A main computational burden of the active-set methods is in solving the unconstrained least squares problem (41); hence, the number of iterations required until termination considerably affects the computation cost. In order to achieve the monotonic decreasing property, typically only one variable is exchanged between working sets per iteration. As a result, when the number of unknowns is large, the number of iterations required for termination grows, slowing down the method. The block principal pivoting method developed by Kim and Park [58, 59] overcomes this limitation. Their method, which is based on the work of Judice and Pires [50], allows the exchanges of multiple variables between working sets. This method maintains neither the nonnegativity of intermediate vectors nor the monotonic decrease of the objective function, but it requires a smaller number of iterations until termination than the active-set methods. It is worth emphasizing that the grouping-based speed-up technique, which was earlier devised for the active-set method, is also effective with the block principal pivoting method for the NMF computation: For more details, see [59].

### 4.3 Discussion and other methods

A main difference between the projected iterative methods and the active-set-like methods for the NLS problems lies in their convergence or termination. In projected iterative methods, a sequence of tentative solutions is generated so that an optimal solution is approached in the limit. In practice, one has to somehow stop iterations and return the current estimate, which might be only an approximation of the solution. In the active-set and active-set-like methods, in contrast, there is no concept of a limit point. Tentative solutions are generated with a goal of finding the optimal active and passive set partitioning, which is guaranteed to be found in a finite number of iterations since there are only a finite number of possible active and passive set partitionings. Once the optimal active and passive sets are found, the methods terminate. These behaviors involve trade-offs. While the projected iterative methods may return an approximate solution after a small number of iterations, the active-set and active-set-like methods only return a solution after they terminate. After the termination, however, the solution from the active-set-like methods is exact up to numerical rounding errors, while the solution from the projected iterative methods might be an approximate one.
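The limiting behavior of a projected iterative method can be observed with a few lines of code. The sketch below runs a basic projected gradient iteration with a fixed step length on a random NLS problem and measures how far the iterate is from satisfying the optimality conditions after 5, 50, and 5000 iterations. The problem data, the step-length choice, and the function names are assumptions for illustration.

```python
import numpy as np


def nls_projected_gradient(B, c, iters):
    """Fixed-step projected gradient for min_{x >= 0} ||B x - c||_2^2."""
    G, Btc = B.T @ B, B.T @ c
    step = 1.0 / np.linalg.norm(G, 2)
    x = np.zeros(B.shape[1])
    for _ in range(iters):
        x = np.maximum(x - step * (G @ x - Btc), 0.0)
    return x


def kkt_violation(B, c, x):
    """Norm of the projected gradient: zero exactly at an optimal x >= 0."""
    g = B.T @ (B @ x - c)
    # Where x_i > 0 the gradient must vanish; where x_i = 0 it must be >= 0.
    return np.linalg.norm(np.where(x > 0, g, np.minimum(g, 0.0)))


rng = np.random.default_rng(1)
B = rng.standard_normal((20, 5))
c = rng.standard_normal(20)
violations = [
    kkt_violation(B, c, nls_projected_gradient(B, c, k)) for k in (5, 50, 5000)
]
```

Stopping early returns a usable but inexact estimate, whereas the violation only becomes negligible after many iterations; this is the trade-off against finite termination described above.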

Other approaches for solving the NLS problems can be considered as a subroutine for the NMF computation. Bellavia et al. [2] have studied an interior point Newton-like method, and Franc et al. [31] presented a sequential coordinate-wise method. Some observations about the NMF computation based on these methods as well as other methods are offered in Cichocki et al. [22]. Chu and Lin [18] proposed an algorithm based on low-dimensional polytope approximation: Their algorithm is motivated by a geometrical interpretation of NMF that data points are approximated by a simplicial cone [27].

Different conditions are required for the NLS algorithms to guarantee convergence or termination. The requirement of the projected gradient method [71] is mild as it only requires an appropriate selection of the step-length. Both the quasi-Newton method [51] and the interior point gradient method [73] require that matrix \(\mathbf B \) in (37) is of full column rank. The active-set method [53, 64] does not require the full-rank condition as long as a zero vector is used for initialization [28]. In the block principal pivoting method [58, 59], on the other hand, the full-rank condition is required. Since NMF is formulated as a lower rank approximation and \(K\) is typically much smaller than the rank of the input matrix, the ranks of both \(\mathbf{W }\) and \(\mathbf H \) in (7) typically remain full. When this condition is not likely to be satisfied, the Frobenius-norm regularization of Sect. 2.4 can be adopted to guarantee the full rank condition.
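The effect of Frobenius-norm regularization on the rank condition can be seen by rewriting the regularized subproblem as an ordinary NLS problem with stacked matrices: \(\Vert \mathbf{B }\mathbf{X }-\mathbf{C }\Vert _{F}^{2}+\alpha \Vert \mathbf{X }\Vert _{F}^{2}\) equals \(\Vert \mathbf{B }_{aug}\mathbf{X }-\mathbf{C }_{aug}\Vert _{F}^{2}\), where \(\mathbf{B }_{aug}\) appends \(\sqrt{\alpha }\,\mathbf{I }\) below \(\mathbf{B }\). The sketch below assumes this standard form of the regularizer; the helper name, \(\alpha \), and the rank-deficient example are hypothetical.

```python
import numpy as np


def regularized_nls_matrices(B, C, alpha):
    """Stacked matrices whose plain NLS problem equals the regularized one."""
    K = B.shape[1]
    B_aug = np.vstack([B, np.sqrt(alpha) * np.eye(K)])
    C_aug = np.vstack([C, np.zeros((K, C.shape[1]))])
    return B_aug, C_aug


# A rank-deficient factor: the second column is twice the first.
W = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
A = np.ones((3, 4))
B_aug, C_aug = regularized_nls_matrices(W, A, alpha=0.1)

print(np.linalg.matrix_rank(W), np.linalg.matrix_rank(B_aug))
```

Because the appended block is a positive multiple of the identity, the stacked matrix has full column rank for any \(\alpha >0\), regardless of the rank of the original factor.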

## 5 BCD framework for nonnegative CP

Our discussion on the low-rank factorizations of nonnegative matrices naturally extends to those of nonnegative tensors. In this section, we discuss nonnegative CANDECOMP/PARAFAC (NCP) and explain how it can be computed by the BCD framework. A few other decomposition models of higher order tensors have been studied, and interested readers are referred to [1, 61]. The organization of this section is similar to that of Sect. 2, and we will show that the NLS algorithms reviewed in Sect. 4 can also be used to factorize tensors.

*rank-one* tensor. Model (42) is known as CANDECOMP/PARAFAC (CP) [14, 43]: In the CP decomposition, \({\varvec{\mathcal{A }}}\) is represented as the sum of \(K\) rank-one tensors. The smallest integer \(K\) for which (42) holds as equality is called the *rank* of tensor \({\varvec{\mathcal{A }}}\). The CP decomposition reduces to a matrix decomposition if \(N=2\). The nonnegative CP decomposition is obtained by adding nonnegativity constraints to factor matrices \(\mathbf{H }^{(1)},\ldots ,\mathbf{H }^{(N)}\). A corresponding problem can be written as, for \({\varvec{\mathcal{A }}}\in \mathbb{R }_{+}^{M_{1}\times \cdots \times M_{N}}\),

**Mode-n matricization:** The mode-n matricization of \({\varvec{\mathcal{A }}}\in \mathbb{R }^{M_{1}\times \cdots \times M_{N}}\), denoted by \(\mathbf{A }^{<n>}\), is a matrix obtained by linearizing all the indices of tensor \({\varvec{\mathcal{A }}}\) except \(n\). Specifically, \(\mathbf{A }^{<n>}\) is a matrix of size \(M_{n}\times (\prod _{\tilde{n}=1,\tilde{n}\ne n}^{N}M_{\tilde{n}})\), and the \((m_{1},\ldots ,m_{N})\)th element of \({\varvec{\mathcal{A }}}\) is mapped to the \((m_{n},J)\)th element of \(\mathbf{A }^{<n>}\) where

**Mode-n fibers:** The fibers of higher-order tensors are vectors obtained by specifying all indices except one. Given a tensor \({\varvec{\mathcal{A }}}\in \mathbb{R }^{M_{1}\times \cdots \times M_{N}}\), a mode-n fiber denoted by \(\mathbf a _{m_{1}\ldots m_{n-1}:m_{n+1}\ldots m_{N}}\) is a vector of length \(M_{n}\) with all the elements having \(m_{1},\ldots ,m_{n-1}, m_{n+1},\ldots ,m_{N}\) as indices for the 1st\(,\ldots ,(n-1)\)th, \((n+1)\)th\(,\ldots ,N\)th orders. The columns and the rows of a matrix are the mode-1 and the mode-2 fibers, respectively.

**Mode-n product:** The mode-n product of a tensor \({\varvec{\mathcal{A }}}\in \mathbb{R }^{M_{1}\times \cdots \times M_{N}}\) and a matrix \(\mathbf U \in \mathbb{R }^{J\times M_{n}}\), denoted by \({\varvec{\mathcal{A }}}\times _{n}\mathbf U \), is a tensor obtained by multiplying each mode-n fiber of \({\varvec{\mathcal{A }}}\) by the matrix \(\mathbf U \). The result is a tensor of size \(M_{1}\times \cdots \times M_{n-1}\times J\times M_{n+1}\times \cdots \times M_{N}\) having elements as

**Khatri-Rao product:** The Khatri-Rao product of two matrices \(\mathbf{A }\in \mathbb{R }^{J_{1}\times L}\) and \(\mathbf B \in \mathbb{R }^{J_{2}\times L}\), denoted by \(\mathbf{A }\odot \mathbf B \in \mathbb{R }^{(J_{1}J_{2})\times L}\), is defined as

### 5.1 BCD with \(N\) matrix blocks

**Corollary 5**

If a unique solution exists for (48) and is attained for \(n=1,\ldots ,N\), then every limit point of the sequence \(\left\{ \left(\mathbf{H }^{(1)},\ldots ,\mathbf{H }^{(N)}\right)^{(i)}\right\} \) generated by the ANLS framework is a stationary point of (45).

In particular, if each \(\mathbf{B }^{(n)}\) is of full column rank, the subproblem has a unique solution. Algorithms for the NLS subproblems discussed in Sect. 4 can be used to solve (48).
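The reason matrix NLS solvers carry over to tensors is that, for a tensor in CP form, each mode-n matricization factors as the \(n\)th factor matrix times the transpose of a Khatri-Rao product of the remaining factors. The sketch below checks this identity numerically for a third-order tensor. It assumes the common convention in which lower-numbered modes vary fastest in the matricization and the Khatri-Rao product is taken over the remaining factors from the highest mode down; whether this ordering matches the paper's definition of \(\mathbf{B }^{(n)}\) exactly is an assumption, and the helper names are hypothetical.

```python
import numpy as np


def unfold(T, n):
    """Mode-n matricization with lower-numbered modes varying fastest."""
    return np.reshape(np.moveaxis(T, n, 0), (T.shape[n], -1), order="F")


def khatri_rao(A, B):
    """Column-wise Kronecker product of A (J1 x L) and B (J2 x L)."""
    return np.einsum("il,jl->ijl", A, B).reshape(-1, A.shape[1])


def cp_design_matrix(factors, n):
    """Khatri-Rao product of all factors except the n-th, highest mode first."""
    others = [factors[m] for m in reversed(range(len(factors))) if m != n]
    B = others[0]
    for F in others[1:]:
        B = khatri_rao(B, F)
    return B


rng = np.random.default_rng(0)
factors = [rng.random((3, 2)), rng.random((4, 2)), rng.random((5, 2))]
# Third-order tensor given as a sum of K = 2 rank-one tensors.
T = np.einsum("ir,jr,kr->ijk", *factors)

identity_holds = [
    np.allclose(unfold(T, n), factors[n] @ cp_design_matrix(factors, n).T)
    for n in range(3)
]
```

With the other factors fixed, fitting the \(n\)th factor is therefore an NLS problem whose coefficient matrix is the Khatri-Rao product and whose right-hand sides are the columns of the transposed matricization.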

### 5.2 BCD with \(KN\) vector blocks

**Corollary 6**

If a unique solution exists for (52) and is attained for \(n=1,\ldots ,N\) and for \(k=1,\ldots ,K\), every limit point of the sequence \(\left\{ \left(\mathbf{H }^{(1)},\ldots ,\mathbf{H }^{(N)}\right)^{(i)}\right\} \) generated by the vector-block BCD method is a stationary point of (45).

## 6 Implementation issues and comparisons

### 6.1 Stopping criterion

Iterative methods have to be equipped with a criterion for stopping iterations. In NMF or NTF, an ideal criterion would be to stop iterations after a local minimum of (2) or (45) is attained. In practice, a few alternatives have been used.
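One alternative that appears in the NMF literature, for instance alongside the projected gradient method [71], is to monitor the norm of the projected gradient and stop once it has dropped below a fraction of its initial value. The sketch below assumes the objective \(\frac{1}{2}\Vert \mathbf{A }-\mathbf{W }\mathbf{H }^{T}\Vert _{F}^{2}\); whether it coincides with the criteria the authors list is an assumption, and the function names and the tolerance are hypothetical.

```python
import numpy as np


def projected_gradient_norm(A, W, H):
    """Norm of the projected gradient of 0.5 * ||A - W H^T||_F^2."""
    grad_W = W @ (H.T @ H) - A @ H
    grad_H = H @ (W.T @ W) - A.T @ W
    # At a stationary point the gradient is zero where the variable is
    # positive and nonnegative where the variable is zero.
    pg_W = np.where(W > 0, grad_W, np.minimum(grad_W, 0.0))
    pg_H = np.where(H > 0, grad_H, np.minimum(grad_H, 0.0))
    return np.sqrt(np.sum(pg_W ** 2) + np.sum(pg_H ** 2))


def should_stop(delta, delta_initial, tol=1e-4):
    """Relative test: stop once the measure falls below tol times its start."""
    return delta <= tol * delta_initial


rng = np.random.default_rng(0)
W_true, H_true = rng.random((6, 2)), rng.random((5, 2))
A = W_true @ H_true.T

delta_exact = projected_gradient_norm(A, W_true, H_true)
delta_perturbed = projected_gradient_norm(A, W_true + 1.0, H_true)
```

The measure vanishes at the exact factorization and is strictly positive at the perturbed point, so a relative threshold on it gives a scale-aware test for approximate stationarity.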

### 6.2 Results of experimental comparisons

A number of papers have reported the results of experimental comparisons of NMF algorithms. A few papers have shown the slow convergence of Lee and Seung’s multiplicative updating rule and demonstrated the superiority of other algorithms published subsequently [40, 53, 71]. Comprehensive comparisons of several efficient algorithms for NMF were conducted in Kim and Park [59], where MATLAB implementations of the ANLS-based methods, the HALS/RRI method, the multiplicative updating rule, and a few others were compared. In their results, the slow convergence of the multiplicative updating was confirmed, and the ALS method in Sect. 3.3 was shown to fail to converge in many cases. Among all the methods tested, the HALS/RRI method and the ANLS/BPP method showed the fastest overall convergence.

Further comparisons are presented in Gillis and Glineur [37] and Hsieh and Dhillon [48] where the authors proposed acceleration methods for the HALS/RRI method. Their comparisons show that the HALS/RRI method or the accelerated versions converge the fastest among all methods tested. Korattikara et al. [62] demonstrated an effective approach to accelerate the ANLS/BPP method. Overall, the HALS/RRI method, the ANLS/BPP method, and their accelerated versions show the state-of-the-art performance in the experimental comparisons.

Comparison results of algorithms for NCP are provided in [60]. Interestingly, the ANLS/BPP method showed faster convergence than the HALS/RRI method in the tensor factorization case. Further investigations and experimental evaluations of the NCP algorithms are needed to fully explain these observations.

## 7 Efficient NMF updating: algorithms

In practice, we often need to update a factorization with a slightly modified condition or some additional data. We consider two scenarios where an existing factorization needs to be efficiently updated to a new factorization. Importantly, the unified view of the NMF algorithms presented in earlier sections provides useful insights when we choose algorithmic components for updating. Although we focus on the NMF updating here, similar updating schemes can be developed for NCP as well.

### 7.1 Updating NMF with an increased or decreased \(K\)

NMF algorithms discussed in Sects. 2 and 3 assume that \(K\), the reduced dimension, is provided as an input. In practical applications, however, prior knowledge of \(K\) might not be available, and a proper value for \(K\) has to be determined from data. To determine \(K\) from data, NMFs are typically computed for several different \(K\) values, and the best \(K\) is then chosen according to some criterion [10, 33, 49]. In this case, computing each NMF from scratch would be very expensive, and it is therefore desirable to develop an algorithm that efficiently updates an already computed factorization. We propose an algorithm for this task in this subsection.

Summarizing the two cases, an algorithm for updating NMF with an increased or decreased \(K\) value is presented in Algorithm 2. Note that the HALS/RRI method is chosen for Step 1: since the new entries appear as column blocks (see Fig. 2), the HALS/RRI method is an optimal choice. For Step 2, although any algorithm may be chosen, we have adopted the HALS/RRI method for our experimental evaluation in Sect. 8.1.

### 7.2 Updating NMF with incremental data

Deleting obsolete data is easier. If \(\mathbf{A }=[\varDelta \mathbf{A } \tilde{\mathbf{A }}]\) where \(\varDelta \mathbf{A }\in \mathbb{R }_{+}^{M\times \varDelta N}\) is to be discarded, we similarly divide \(\mathbf H _{old}\) as \(\mathbf H _{old}=\left[\begin{array}{c} \varDelta \mathbf H \\ \tilde{\mathbf{H }}_{old} \end{array}\right]\). We then use \(\mathbf{W }_{old}\) and \(\tilde{\mathbf{H }}_{old}\) to initialize \(\mathbf{W }_{new}\) and \(\mathbf H _{new}\) and execute an NMF algorithm to find a minimizer of (2).
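The deletion step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `drop_obsolete` and the argument `dN` (standing for \(\varDelta N\)) are hypothetical, and the code assumes the convention \(\mathbf{A}\approx \mathbf{W}\mathbf{H}^{T}\) with \(\mathbf{H}\) of size \(N\times K\), so that discarding columns of \(\mathbf{A}\) corresponds to discarding rows of \(\mathbf{H}_{old}\).

```python
import numpy as np

def drop_obsolete(A, W_old, H_old, dN):
    """Discard the first dN columns of A and the matching rows of H_old.

    Assumed convention: A ~= W @ H.T with H of shape (N, K).
    The returned factors are only initial values for a subsequent NMF run.
    """
    A_keep = A[:, dN:]               # retained data, A-tilde
    W_init = W_old.copy()            # W_old initializes W_new
    H_init = H_old[dN:, :].copy()    # H-tilde_old initializes H_new
    return A_keep, W_init, H_init

# Small random example (illustrative data only).
rng = np.random.default_rng(0)
A = rng.random((8, 6))
W_old, H_old = rng.random((8, 3)), rng.random((6, 3))
A_keep, W_init, H_init = drop_obsolete(A, W_old, H_old, dN=2)

# The squared Frobenius residual is a sum over columns, so the initial
# residual on the retained data cannot exceed the old residual on all data.
old_res = np.linalg.norm(A - W_old @ H_old.T) ** 2
new_res = np.linalg.norm(A_keep - W_init @ H_init.T) ** 2
```

Because the initial residual on the retained columns is never larger than the residual of the old factorization, any NMF algorithm started from `W_init` and `H_init` begins from a point at least as good as the old fit.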

## 8 Efficient NMF updating: experiments and applications

We experimentally validate the effectiveness of Algorithms 2 and 3 and show their applications. The computational efficiency was compared on dense and sparse synthetic matrices as well as on real-world data sets. All the experiments were executed with MATLAB on a Linux machine with a 2 GHz Intel quad-core processor and 4 GB of memory.

### 8.1 Comparisons of NMF updating methods for varying \(K\)

We compared Algorithm 2 with two alternative methods for updating NMF. The first method is to compute NMF with \(K=K_{2}\) from scratch using the HALS/RRI algorithm, which we denote as ‘recompute’ in our figures. The second method, denoted as ‘warm-restart’ in the figures, computes the new factorization as follows. If \(K_{2}>K_{1}\), it generates \(\mathbf{W }_{add}\in \mathbb{R }_{+}^{M\times (K_{2}-K_{1})}\) and \(\mathbf H _{add}\in \mathbb{R }_{+}^{N\times (K_{2}-K_{1})}\) using random entries to initialize \(\mathbf{W }_{new}\) and \(\mathbf H _{new}\) as in (64). If \(K_{2}<K_{1}\), it randomly selects \(K_{2}\) pairs of columns from \(\mathbf{W }_{old}\) and \(\mathbf H _{old}\) to initialize the new factors. Using these initializations, ‘warm-restart’ executes the HALS/RRI algorithm to finish the NMF computation.
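As a rough illustration of the ‘warm-restart’ baseline described above, the following NumPy sketch builds the initial factors and then refines them with plain HALS/RRI sweeps. It is a sketch under stated assumptions rather than the code used in the experiments: the function names, the iteration count, and the small constant `eps` guarding the division are hypothetical, and the initialization for \(K_{2}>K_{1}\) is assumed to be the column-wise concatenation \([\mathbf{W}_{old}\ \mathbf{W}_{add}]\) and \([\mathbf{H}_{old}\ \mathbf{H}_{add}]\).

```python
import numpy as np

def hals(A, W, H, n_iter=50, eps=1e-16):
    """Plain HALS/RRI sweeps for A ~= W @ H.T with W >= 0 and H >= 0."""
    W, H = W.copy(), H.copy()
    for _ in range(n_iter):
        AH, HtH = A @ H, H.T @ H          # fixed while the columns of W are updated
        for k in range(W.shape[1]):
            step = (AH[:, k] - W @ HtH[:, k]) / max(HtH[k, k], eps)
            W[:, k] = np.maximum(W[:, k] + step, 0.0)
        AtW, WtW = A.T @ W, W.T @ W       # fixed while the columns of H are updated
        for k in range(H.shape[1]):
            step = (AtW[:, k] - H @ WtW[:, k]) / max(WtW[k, k], eps)
            H[:, k] = np.maximum(H[:, k] + step, 0.0)
    return W, H

def warm_restart_init(W_old, H_old, K2, rng):
    """Initial factors for the 'warm-restart' baseline (hypothetical helper)."""
    K1 = W_old.shape[1]
    if K2 > K1:
        # Append random nonnegative columns to both factors.
        W_add = rng.random((W_old.shape[0], K2 - K1))
        H_add = rng.random((H_old.shape[0], K2 - K1))
        return np.hstack([W_old, W_add]), np.hstack([H_old, H_add])
    # Otherwise keep K2 randomly selected column pairs.
    keep = rng.choice(K1, size=K2, replace=False)
    return W_old[:, keep], H_old[:, keep]
```

Under the same assumptions, ‘recompute’ would correspond to calling `hals` with fully random factors of \(K_{2}\) columns, while ‘warm-restart’ calls `hals` on the output of `warm_restart_init`.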

We generated a synthetic sparse matrix of size \(3,000\times 3,000\), and the resulting matrix had 88.2 % sparsity.^{1}

In order to observe efficiency in updating, an NMF with \(K_{1}=60\) was first computed and stored. We then computed NMFs with \(K_{2}=50, 65\), and 80. The plots of relative error vs. execution time for all three methods are shown in Fig. 4. Our proposed method achieved faster convergence compared to ‘warm-restart’ and ‘recompute’, which sometimes required several times more computation to achieve the same accuracy as our method. The advantage of the proposed method can be seen in both dense and sparse cases.

We have also used four real-world data sets for our comparisons. From the Topic Detection and Tracking 2 (TDT2) text corpus,^{3} we selected 40 topics to create a sparse term-document matrix of size \(19,009\times 3,087\). From the 20 Newsgroups data set,^{4} a sparse term-document matrix of size \(7,845\times 11,269\) was obtained after removing keywords and documents with frequency less than 20. The AT&T facial image database^{5} produced a dense matrix of size \(10,304\times 400\). The images in the CMU PIE database^{6} were resized to \(64\times 64\) pixels, and we formed a dense matrix of size \(4,096\times 11,554\).^{7} We focused on the case when \(K\) increases, and the results are reported in Fig. 5. As with the synthetic data sets, our proposed method was shown to be the most efficient among the methods we tested.

### 8.2 Applications of NMF updating for varying \(K\)

Further applications of Algorithm 2 are shown in Fig. 6b, c. Figure 6b demonstrates a process of probing the approximation errors of NMF with various \(K\) values. With \(K=20,40,60\) and 80, we generated \(600\times 600\) synthetic dense matrices as described in Sect. 8.1. Then, we computed NMFs with Algorithm 2 for \(K\) values ranging from 10 to 160 with a step size 5. The relative objective function values with respect to \(K\) are shown in Fig. 6b. In each of the cases where \(K=20, 40, 60\), and 80, we were able to determine the correct \(K\) value by choosing a point where the relative error stopped decreasing significantly.
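The rule of choosing the point where the relative error stops decreasing significantly is not quantified in the text. One possible way to make it concrete is sketched below; the function `pick_k` and the threshold `tol` are assumptions introduced only for illustration, and the numbers in the example are made up rather than taken from Fig. 6b.

```python
import numpy as np

def pick_k(ks, rel_errors, tol=0.05):
    """Return the first K after which the error drop falls below tol * first error.

    `tol` is a hypothetical threshold, not a value reported in the paper.
    """
    errs = np.asarray(rel_errors, dtype=float)
    drops = errs[:-1] - errs[1:]                  # gain from moving to the next K
    flat = np.flatnonzero(drops < tol * errs[0])  # positions where the curve flattens
    return ks[flat[0]] if flat.size else ks[-1]

# Illustrative (made-up) error curve with a clear elbow at K = 20.
ks = [10, 15, 20, 25, 30]
rel_errors = [0.5, 0.3, 0.05, 0.049, 0.048]
chosen = pick_k(ks, rel_errors)
```

In practice the threshold would need to be tuned to the noise level of the data, so visual inspection of the error curve, as done in the text, remains a reasonable alternative.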

Figure 6c demonstrates the process of choosing \(K\) for a classification purpose. Using the \(10,304\times 400\) matrix from the AT&T facial image database, we computed NMF to generate a \(K\)-dimensional representation of each image, taken from each row of \(\mathbf H \). We then trained a nearest neighbor classifier using the reduced-dimensional representations [83]. To determine the best \(K\) value, we performed \(5\)-fold cross validation: each time, a data matrix of size \(10,304\times 320\) was used to compute \(\mathbf{W }\) and \(\mathbf H \), and the reduced-dimensional representations for the test data \(\tilde{\mathbf{A }}\) were obtained by solving an NLS problem, \(\min _{{\tilde{\mathbf{H }}}\ge 0}\left\Vert\tilde{\mathbf{A }}-\mathbf{W }\tilde{\mathbf{H }}\right\Vert_{F}^{2}\). Classification errors on both training and testing sets are shown in Fig. 6c. Five paths of training and testing errors are plotted using thin lines, and the averaged training and testing errors are plotted using thick lines. Based on the figure, we chose \(K=13\) since the testing error barely decreased beyond that point whereas the training error approached zero.
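The projection of test data onto a fixed basis can be sketched as follows. The text does not state which NLS solver was used, so this example swaps in a simple projected gradient iteration with a fixed step size; it also uses a plain 1-nearest-neighbor rule with Euclidean distance, which is a simplification of the classifier in [83]. The function names are hypothetical. Following the formula as written above, \(\tilde{\mathbf{H}}\) is stored here with one column per test sample, and \(\mathbf{W}\) is assumed to be nonzero.

```python
import numpy as np

def nls_nonneg(W, A_test, n_iter=500):
    """Approximately solve min_{H >= 0} ||A_test - W @ H||_F^2 by projected gradient."""
    WtW, WtA = W.T @ W, W.T @ A_test
    step = 1.0 / np.linalg.norm(WtW, 2)   # inverse of the largest eigenvalue of W^T W
    H = np.zeros((W.shape[1], A_test.shape[1]))
    for _ in range(n_iter):
        # Gradient step on 0.5 * ||A_test - W H||_F^2, then projection onto H >= 0.
        H = np.maximum(H - step * (WtW @ H - WtA), 0.0)
    return H

def nearest_neighbor_labels(H_train, y_train, H_test):
    """1-nearest-neighbor labels with Euclidean distance; columns are samples."""
    diff = H_test.T[:, None, :] - H_train.T[None, :, :]
    d2 = (diff ** 2).sum(axis=2)          # pairwise squared distances (test x train)
    return y_train[d2.argmin(axis=1)]
```

A cross-validation loop would call `nls_nonneg` once per fold with the \(\mathbf{W}\) computed from the training columns and then pass the resulting representations to `nearest_neighbor_labels`.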

### 8.3 Comparisons of NMF updating methods for incremental data

We also tested the effectiveness of Algorithm 3. We created a \(1,000\times 500\) dense matrix \(\mathbf{A }\) as described in Sect. 8.1 with \(K=100\). An initial NMF with \(K=80\) was computed and stored. Then, an additional data set of size \(1,000\times 10\), \(1,000\times 20\), or \(1,000\times 50\) was appended, and we computed the updated NMF with several methods as follows. In addition to Algorithm 3, we considered four alternative methods. A naive approach that computes the entire NMF from scratch is denoted as ‘recompute’. An approach that initializes a new coefficient matrix as \(\mathbf H _{new}=\left[\begin{array}{c} \mathbf H _{old}\\ \varDelta \mathbf H \end{array}\right]\), where \(\varDelta \mathbf H \) is generated with random entries, is denoted as ‘warm-restart’. The incremental NMF algorithm (INMF) [11] as well as the online NMF algorithm (ONMF) [13] were also included in the comparisons. Figure 7 shows the execution results, where our proposed method outperformed the other methods tested.

## 9 Conclusion and discussion

There are many other interesting aspects of NMF that are not covered in this paper. Depending on the probabilistic model of the underlying data, NMF can be formulated with various divergences. Formulations and algorithms based on Kullback-Leibler divergence [67, 79], Bregman divergence [24, 68], Itakura-Saito divergence [29], and Alpha and Beta divergences [21, 22] have been developed. For discussion on nonnegative rank as well as the geometric interpretation of NMF, see Lin and Chu [72], Gillis [34], and Donoho and Stodden [27]. NMF has also been studied from the Bayesian statistics point of view: see Schmidt et al. [78] and Zhong and Girolami [88]. In the data mining community, variants of NMF such as convex and semi-NMFs [25, 75], orthogonal tri-NMF [26], and group-sparse NMF [56] have been proposed, and using NMF for clustering has been shown to be successful [12, 57, 63]. For an overview on the use of NMF in bioinformatics, see Devarajan [23] and references therein. Cichocki et al.’s book [22] explains the use of NMF for signal processing. See Chu and Plemmons [17], Berry et al. [4], and Cichocki et al. [22] for earlier surveys on NMF. See also Ph.D. dissertations on NMF algorithms and applications [34, 44, 55].

## References

- 1. Acar, E., Yener, B.: Unsupervised multiway data analysis: a literature survey. IEEE Trans. Knowl. Data Eng. **21**(1), 6–20 (2009)
- 2. Bellavia, S., Macconi, M., Morini, B.: An interior point Newton-like method for non-negative least-squares problems with degenerate solution. Numer. Linear Algebra Appl. **13**(10), 825–846 (2006)
- 3. Berman, A., Plemmons, R.J.: Nonnegative matrices in the mathematical sciences. Society for Industrial and Applied Mathematics, Philadelphia (1994)
- 4. Berry, M., Browne, M., Langville, A., Pauca, V., Plemmons, R.: Algorithms and applications for approximate nonnegative matrix factorization. Comput. Stat. Data Anal. **52**(1), 155–173 (2007)
- 5. Bertsekas, D.P.: Nonlinear programming. Athena Scientific (1999)
- 6. Biggs, M., Ghodsi, A., Vavasis, S.: Nonnegative matrix factorization via rank-one downdate. In: Proceedings of the 25th International Conference on Machine Learning, pp. 64–71 (2008)
- 7. Birgin, E., Martínez, J., Raydan, M.: Nonmonotone spectral projected gradient methods on convex sets. SIAM J. Optim. **10**(4), 1196–1211 (2000)
- 8. Björck, Å.: Numerical Methods for Least Squares Problems. Society for Industrial and Applied Mathematics, Philadelphia (1996)
- 9. Bro, R., De Jong, S.: A fast non-negativity-constrained least squares algorithm. J. Chemom. **11**, 393–401 (1997)
- 10. Brunet, J., Tamayo, P., Golub, T., Mesirov, J.: Metagenes and molecular pattern discovery using matrix factorization. Proc. Natl. Acad. Sci. **101**(12), 4164–4169 (2004)
- 11. Bucak, S., Gunsel, B.: Video content representation by incremental non-negative matrix factorization. In: Proceedings of the 2007 IEEE International Conference on Image Processing (ICIP), vol. 2, pp. II-113–II-116 (2007)
- 12. Cai, D., He, X., Han, J., Huang, T.: Graph regularized nonnegative matrix factorization for data representation. IEEE Trans. Pattern Anal. Mach. Intell. **33**(8), 1548–1560 (2011)
- 13. Cao, B., Shen, D., Sun, J.T., Wang, X., Yang, Q., Chen, Z.: Detect and track latent factors with online nonnegative matrix factorization. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 2689–2694 (2007)
- 14. Carroll, J.D., Chang, J.J.: Analysis of individual differences in multidimensional scaling via an n-way generalization of “Eckart-Young” decomposition. Psychometrika **35**(3), 283–319 (1970)
- 15. Chen, D., Plemmons, R.J.: Nonnegativity constraints in numerical analysis. In: Proceedings of the Symposium on the Birth of Numerical Analysis, Leuven Belgium, pp. 109–140 (2009)
- 16. Chen, S.S., Donoho, D.L., Saunders, M.A.: Atomic decomposition by basis pursuit. SIAM Rev. **43**(1), 129–159 (2001)
- 17. Chu, M., Plemmons, R.: Nonnegative matrix factorization and applications. IMAGE: Bull. Int. Linear Algebra Soc. **34**, 2–7 (2005)
- 18. Chu, M.T., Lin, M.M.: Low-dimensional polytope approximation and its applications to nonnegative matrix factorization. SIAM J. Sci. Comput. **30**(3), 1131–1155 (2008)
- 19. Cichocki, A., Phan, A.H.: Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. **E92-A**(3), 708–721 (2009)
- 20. Cichocki, A., Zdunek, R., Amari, S.I.: Hierarchical ALS algorithms for nonnegative matrix and 3D tensor factorization. In: Lecture Notes in Computer Science, vol. 4666, pp. 169–176. Springer (2007)
- 21. Cichocki, A., Zdunek, R., Choi, S., Plemmons, R., Amari, S.-I.: Nonnegative tensor factorization using alpha and beta divergencies. In: Proceedings of the 32nd International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Honolulu, April 2007, vol. 3, pp. III-1393–III-1396 (2007)
- 22. Cichocki, A., Zdunek, R., Phan, A.H., Amari, S.I.: Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-Way Data Analysis and Blind Source Separation. Wiley, West Sussex (2009)
- 23. Devarajan, K.: Nonnegative matrix factorization: an analytical and interpretive tool in computational biology. PLoS Comput. Biol. **4**(7), e1000029 (2008)
- 24. Dhillon, I., Sra, S.: Generalized nonnegative matrix approximations with Bregman divergences. In: Advances in Neural Information Processing Systems 18, pp. 283–290. MIT Press (2006)
- 25. Ding, C., Li, T., Jordan, M.: Convex and semi-nonnegative matrix factorizations. IEEE Trans. Pattern Anal. Mach. Intell. **32**(1), 45–55 (2010)
- 26. Ding, C., Li, T., Peng, W., Park, H.: Orthogonal nonnegative matrix tri-factorizations for clustering. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 126–135 (2006)
- 27. Donoho, D., Stodden, V.: When does non-negative matrix factorization give a correct decomposition into parts? In: Advances in Neural Information Processing Systems 16. MIT Press (2004)
- 28. Drake, B., Kim, J., Mallick, M., Park, H.: Supervised Raman spectra estimation based on nonnegative rank deficient least squares. In: Proceedings of the 13th International Conference on Information Fusion, Edinburgh, UK (2010)
- 29. Févotte, C., Bertin, N., Durrieu, J.: Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis. Neural Comput. **21**(3), 793–830 (2009)
- 30. Figueiredo, M.A.T., Nowak, R.D., Wright, S.J.: Gradient projection for sparse reconstruction: application to compressed sensing and other inverse problems. IEEE J. Sel. Top. Signal Process. **1**(4), 586–597 (2007)
- 31. Franc, V., Hlavac, V., Navara, M.: Sequential coordinate-wise algorithm for the non-negative least squares problem. In: Proceedings of the 11th International Conference on Computer Analysis of Images and Patterns, pp. 407–414 (2005)
- 32. Friedlander, M.P., Hatz, K.: Computing nonnegative tensor factorizations. Comput. Optim. Appl. **23**(4), 631–647 (2008)
- 33. Frigyesi, A., Höglund, M.: Non-negative matrix factorization for the analysis of complex gene expression data: identification of clinically relevant tumor subtypes. Cancer Inform. **6**, 275–292 (2008)
- 34. Gillis, N.: Nonnegative matrix factorization complexity, algorithms and applications. Ph.D. thesis, Université catholique de Louvain (2011)
- 35. Gillis, N., Glineur, F.: Nonnegative factorization and the maximum edge biclique problem. CORE Discussion Paper 2008/64, Université catholique de Louvain (2008)
- 36. Gillis, N., Glineur, F.: Using underapproximations for sparse nonnegative matrix factorization. Pattern Recognit. **43**(4), 1676–1687 (2010)
- 37. Gillis, N., Glineur, F.: Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization. Neural Comput. **24**(4), 1085–1105 (2012)
- 38. Gillis, N., Glineur, F.: A multilevel approach for nonnegative matrix factorization. J. Comput. Appl. Math. **236**, 1708–1723 (2012)
- 39. Golub, G., Van Loan, C.: Matrix Computations. Johns Hopkins University Press, Baltimore (1996)
- 40. Gonzalez, E.F., Zhang, Y.: Accelerating the Lee-Seung algorithm for non-negative matrix factorization. Department of Computational and Applied Mathematics, Rice University, Technical report (2005)
- 41. Grippo, L., Sciandrone, M.: On the convergence of the block nonlinear Gauss-Seidel method under convex constraints. Oper. Res. Lett. **26**(3), 127–136 (2000)
- 42. Han, L., Neumann, M., Prasad, U.: Alternating projected Barzilai-Borwein methods for nonnegative matrix factorization. Electron. Trans. Numer. Anal. **36**, 54–82 (2009)
- 43. Harshman, R.A.: Foundations of the PARAFAC procedure: models and conditions for an “explanatory” multi-modal factor analysis. In: UCLA Working Papers in Phonetics, vol. 16, pp. 1–84 (1970)
- 44. Ho, N.D.: Nonnegative matrix factorization algorithms and applications. Ph.D. thesis, Univ. Catholique de Louvain (2008)
- 45. Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (1990)
- 46. Horst, R., Pardalos, P., Van Thoai, N.: Introduction to Global Optimization. Kluwer, Berlin (2000)
- 47. Hoyer, P.O.: Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res. **5**, 1457–1469 (2004)
- 48. Hsieh, C.J., Dhillon, I.S.: Fast coordinate descent methods with variable selection for non-negative matrix factorization. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1064–1072 (2011)
- 49. Hutchins, L., Murphy, S., Singh, P., Graber, J.: Position-dependent motif characterization using non-negative matrix factorization. Bioinformatics **24**(23), 2684–2690 (2008)
- 50. Júdice, J.J., Pires, F.M.: A block principal pivoting algorithm for large-scale strictly monotone linear complementarity problems. Comput. Oper. Res. **21**(5), 587–596 (1994)
- 51. Kim, D., Sra, S., Dhillon, I.S.: Fast Newton-type methods for the least squares nonnegative matrix approximation problem. In: Proceedings of the 2007 SIAM International Conference on Data Mining, pp. 343–354 (2007)
- 52. Kim, H., Park, H.: Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics **23**(12), 1495–1502 (2007)
- 53. Kim, H., Park, H.: Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM J. Matrix Anal. Appl. **30**(2), 713–730 (2008)
- 54. Kim, H., Park, H., Eldén, L.: Non-negative tensor factorization based on alternating large-scale non-negativity-constrained least squares. In: Proceedings of IEEE 7th International Conference on Bioinformatics and Bioengineering (BIBE07), vol. 2, pp. 1147–1151 (2007)
- 55. Kim, J.: Nonnegative Matrix and Tensor Factorizations, Least Squares Problems, and Applications. Ph.D. Thesis, Georgia Institute of Technology (2011)
- 56. Kim, J., Monteiro, R.D., Park, H.: Group Sparsity in Nonnegative Matrix Factorization. In: Proceedings of the 2012 SIAM International Conference on Data Mining, pp. 851–862 (2012)
- 57. Kim, J., Park, H.: Sparse nonnegative matrix factorization for clustering. Technical report, Georgia Institute of Technology GT-CSE-08-01 (2008)
- 58. Kim, J., Park, H.: Toward faster nonnegative matrix factorization: a new algorithm and comparisons. In: Proceedings of the 8th IEEE International Conference on Data Mining (ICDM), pp. 353–362 (2008)
- 59. Kim, J., Park, H.: Fast nonnegative matrix factorization: an active-set-like method and comparisons. SIAM J. Sci. Comput. **33**(6), 3261–3281 (2011)
- 60. Kim, J., Park, H.: Fast nonnegative tensor factorization with an active-set-like method. In: High-Performance Scientific Computing: Algorithms and Applications, pp. 311–326. Springer (2012)
- 61. Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. **51**(3), 455–500 (2009)
- 62. Korattikara, A., Boyles, L., Welling, M., Kim, J., Park, H.: Statistical optimization of non-negative matrix factorization. In: Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR: W&CP, vol. 15, pp. 128–136 (2011)
- 63. Kuang, D., Ding, C., Park, H.: Symmetric nonnegative matrix factorization for graph clustering. In: Proceedings of 2012 SIAM International Conference on Data Mining, pp. 106–117 (2012)
- 64. Lawson, C.L., Hanson, R.J.: Solving Least Squares Problems. Prentice Hall, New Jersey (1974)
- 65. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE **86**(11), 2278–2324 (1998)
- 66. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature **401**(6755), 788–791 (1999)
- 67. Lee, D.D., Seung, H.S.: Algorithms for non-negative matrix factorization. In: Advances in Neural Information Processing Systems 13, pp. 556–562. MIT Press (2001)
- 68. Li, L., Lebanon, G., Park, H.: Fast Bregman divergence NMF using Taylor expansion and coordinate descent. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 307–315 (2012)
- 69. Li, S.Z., Hou, X., Zhang, H., Cheng, Q.: Learning spatially localized, parts-based representation. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. I-207–I-212 (2001)
- 70. Lin, C.: On the convergence of multiplicative update algorithms for nonnegative matrix factorization. IEEE Trans. Neural Netw. **18**(6), 1589–1596 (2007)
- 71. Lin, C.J.: Projected gradient methods for nonnegative matrix factorization. Neural Comput. **19**(10), 2756–2779 (2007)
- 72. Lin, M.M., Chu, M.T.: On the nonnegative rank of Euclidean distance matrices. Linear Algebra Appl. **433**(3), 681–689 (2010)
- 73. Merritt, M., Zhang, Y.: Interior-point gradient method for large-scale totally nonnegative least squares problems. J. Optim. Theory Appl. **126**(1), 191–202 (2005)
- 74. Paatero, P., Tapper, U.: Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics **5**(1), 111–126 (1994)
- 75. Park, H., Kim, H.: One-sided non-negative matrix factorization and non-negative centroid dimension reduction for text classification. In: Proceedings of the 2006 Text Mining Workshop in the Tenth SIAM International Conference on Data Mining (2006)
- 76. Pauca, V.P., Piper, J., Plemmons, R.J.: Nonnegative matrix factorization for spectral data analysis. Linear Algebra Appl. **416**(1), 29–47 (2006)
- 77. Pauca, V.P., Shahnaz, F., Berry, M.W., Plemmons, R.J.: Text mining using non-negative matrix factorizations. In: Proceedings of the 2004 SIAM International Conference on Data Mining, pp. 452–456 (2004)
- 78. Schmidt, M.N., Winther, O., Hansen, L.K.: Bayesian non-negative matrix factorization. In: Proceedings of the 2009 International Conference on Independent Component Analysis and Signal Separation, Lecture Notes in Computer Science (LNCS), vol. 5441, pp. 540–547. Springer (2009)
- 79. Sra, S.: Block-iterative algorithms for non-negative matrix approximation. In: Proceedings of the 8th IEEE International Conference on Data Mining, pp. 1037–1042 (2008)
- 80. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodological) **58**(1), 267–288 (1996)
- 81. Van Benthem, M.H., Keenan, M.R.: Fast algorithm for the solution of large-scale non-negativity-constrained least squares problems. J. Chemom. **18**, 441–450 (2004)
- 82. Vavasis, S.A.: On the complexity of nonnegative matrix factorization. SIAM J. Optim. **20**(3), 1364–1377 (2009)
- 83. Weinberger, K., Saul, L.: Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res. **10**, 207–244 (2009)
- 84. Welling, M., Weber, M.: Positive tensor factorization. Pattern Recogn. Lett. **22**(12), 1255–1261 (2001)
- 85. Xu, W., Liu, X., Gong, Y.: Document clustering based on non-negative matrix factorization. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 267–273 (2003)
- 86. Zdunek, R., Cichocki, A.: Non-negative matrix factorization with quasi-Newton optimization. In: Proceedings of the Eighth International Conference on Artificial Intelligence and Soft Computing, pp. 870–879 (2006)
- 87. Zdunek, R., Cichocki, A.: Fast nonnegative matrix factorization algorithms using projected gradient approaches for large-scale problems. Comput. Intell. Neurosci. **2008**, 939567 (2008)
- 88. Zhong, M., Girolami, M.: Reversible jump MCMC for non-negative matrix factorization. In: Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR: W&CP, vol. 5, pp. 663–670 (2009)

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.