Orthogonal nonnegative matrix tri-factorization based on Tweedie distributions
Abstract
Orthogonal nonnegative matrix tri-factorization (ONMTF) is a biclustering method for a given nonnegative data matrix and has been applied to document-term clustering, collaborative filtering, and so on. Previously proposed ONMTF methods assume that the error distribution is normal. However, the normality assumption is not always appropriate for nonnegative data. In this paper, we propose three new ONMTF methods, which employ normal, Poisson, and compound Poisson error distributions, respectively. To develop the new methods, we adopt a k-means based algorithm rather than the multiplicative updating algorithm that was the main means of obtaining estimators in previous methods. A simulation study and an application involving document-term matrices demonstrate that our methods can outperform previous methods, in terms of both the goodness of clustering and the estimation of the factor matrix.
Keywords
Orthogonal nonnegative matrix tri-factorization · Biclustering · Tweedie family · Compound Poisson distribution · Spherical k-means
Mathematics Subject Classification
15A23 Factorization of matrices · 62H30 Classification and discrimination; cluster analysis · 68T10 Pattern recognition, speech recognition
1 Introduction
Nonnegative matrix factorization (NMF), which is a dimension reduction technique for decomposing a data matrix into two factor matrices, in both of which all entries are nonnegative, has been applied to many fields and extended to various forms (Lee and Seung 1999, 2001; Berry et al. 2007; Wang and Zhang 2013). One of the best-known extensions is orthogonal NMF (ONMF), which imposes column orthogonality on one side’s nonnegative factor matrix (Ding et al. 2006; Yoo and Choi 2008; Choi 2008; Yoo and Choi 2010a; Li et al. 2010; Pompili et al. 2014; Mirzal 2014; Kimura et al. 2014). Because a nonnegative column orthogonal matrix plays a role analogous to an indicator matrix in k-means clustering, and because ONMF in fact yields a sparse factor matrix, it has mainly been adopted for nearest-neighbor clustering tasks such as document and term clustering (Mauthner et al. 2010; Kim et al. 2011; Wang et al. 2016). Another extended version of NMF is nonnegative matrix tri-factorization (NMTF), which decomposes a nonnegative matrix into three nonnegative factor matrices. Because constraint-free NMTF is known to be generally equivalent to NMF, some constraints are often imposed. One popular constraint is column orthogonality of the left- and right-side nonnegative factor matrices, similar to ONMF. This NMTF is referred to as orthogonal NMTF (ONMTF) (Ding et al. 2006; Yoo and Choi 2010b, 2009). Owing to the relationship between column orthogonal nonnegative factor matrices and clustering mentioned above, ONMTF is considered to be a biclustering method. The objective of biclustering is to simultaneously detect row and column clusters of a data matrix (Govaert and Nadif 2013); the sample objects and the variables are thus classified from a single data matrix at the same time. It has been adopted for use in document-term clustering, collaborative filtering, etc. (Costa and Ortale 2014; Chen et al. 2009).
In ONMF and ONMTF, it is often assumed that the error is normal or Poisson distributed. However, in NMF, various algorithms have been proposed based on various error distributions, including generalized ones. Tweedie family distributions are well-known generalized distributions for NMF (Févotte and Idier 2011; Nakano et al. 2010). The family uses the index parameter \(\beta \) to identify the distribution, and includes normal (\(\beta =2\)), Poisson (\(\beta =1\)), compound Poisson (CP) (\(\beta \in (0,1)\)), gamma (\(\beta =0\)), and inverse normal (\(\beta =-1\)) as special cases (Jørgensen 1997; Dunn and Smyth 2001). The assumption of a Tweedie family distribution as the error distribution implies that the error criterion has the form of a \(\beta \)-divergence (Simsekli et al. 2013; Tan and Févotte 2013). \(\beta \)-divergence is a generalized divergence that includes Euclidean distance (\(\beta =2\)), KL-divergence (\(\beta =1\)), and Itakura–Saito divergence (\(\beta =0\)) (Cichocki and Amari 2010). One of the merits of Tweedie family distributions is their flexibility with real nonnegative data. The assumption of normality for nonnegative data means that the variance is the same whether the expected value is small or large. In contrast, under a Tweedie distribution assumption one can adjust the relationship between expected value and variance by changing \(\beta \), because the variance is proportional to a power of the expected value, as in \(V(y)=\phi E(y)^{2-\beta }\), where E(y) and V(y) are the expected value and variance, respectively, of the random variable y, and \(\phi \) is the dispersion parameter of the Tweedie family distribution. For example, if we assume a CP distribution (i.e., choose \(\beta \) in (0,1)), it is implicitly assumed that the variance will be large when the expected value is also large.
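As a reference point, the \(\beta \)-divergence with these three special cases can be written down directly. The following is a minimal sketch using the common NMF convention (function name and the \(1/(\beta (\beta -1))\) normalization are ours, not taken from the paper's equations):

```python
import numpy as np

def beta_divergence(y, x, beta):
    """Element-wise beta-divergence d_beta(y | x) for y >= 0, x > 0.

    Special cases: beta=2 -> half the squared Euclidean distance,
    beta=1 -> generalized KL-divergence, beta=0 -> Itakura-Saito divergence.
    """
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    if beta == 0:                         # Itakura-Saito divergence
        return y / x - np.log(y / x) - 1.0
    if beta == 1:                         # generalized KL-divergence
        return y * np.log(y / x) - y + x
    return (y**beta + (beta - 1) * x**beta
            - beta * y * x**(beta - 1)) / (beta * (beta - 1))

# beta = 2 reduces to half the squared Euclidean distance
assert np.isclose(beta_divergence(3.0, 1.0, 2), 0.5 * (3.0 - 1.0) ** 2)
```

The divergence vanishes exactly when \(y = x\) for every \(\beta \), which is what makes it usable as a fitting criterion.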
In fact, a CP distribution is defined as a Poisson mixture of gamma distributions; it is absolutely continuous on the positive axis and has a positive mass at zero. From the aspect of the generative model, a CP distributed random variable is the sum of n independent, identically gamma-distributed random variables, where the number of summands n is Poisson distributed. This generative assumption fits data that arise as a gross sum of nonnegative values and is associated with various types of real-world nonnegative data (e.g., precipitation, insurance, and purchase volume data) (Ohnishi and Dunn 2007; Smyth and Jørgensen 2002). From an NMF parameter estimation aspect, this assumption is related to robust estimation in the presence of outliers that have extremely large positive values (Li et al. 2017; Virtanen et al. 2015; Carabias-Orti et al. 2013; Weninger and Schuller 2012; Févotte et al. 2009; Virtanen 2007).
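The generative description above translates directly into a sampler. The sketch below (function and parameter names are illustrative) draws a CP variate as a Poisson-distributed number of gamma summands:

```python
import numpy as np

def rcompound_poisson(lam, shape, scale, size, rng):
    """Draw from a compound Poisson distribution: the sum of N i.i.d.
    Gamma(shape, scale) variables, with N ~ Poisson(lam).
    A draw is exactly zero whenever N = 0 (the point mass at zero)."""
    n = rng.poisson(lam, size)
    # the sum of n gamma variables is Gamma(n * shape, scale); zero when n == 0
    return np.where(n > 0, rng.gamma(np.maximum(n, 1) * shape, scale), 0.0)

rng = np.random.default_rng(0)
y = rcompound_poisson(lam=2.0, shape=3.0, scale=1.5, size=100_000, rng=rng)
# the mean is lam * shape * scale = 9.0, and the mass at zero is exp(-lam)
print(y.mean(), (y == 0).mean())
```

All draws are nonnegative, continuous on the positive axis, with a visible spike at zero, matching the description in the text.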
In this paper, we propose a new ONMTF method that generalizes the error distribution to a Tweedie family distribution. Our new method has two advantages. First, it facilitates interpretations from nonnegative data matrices by simultaneously detecting the clusters of row and column objects and the relationships among them. Second, the assumption of the error distribution is plausible with some nonnegative data, which leads to robust estimation against extremely large positive values.
One of the ways to develop the new method is to derive an iterative algorithm for estimating factor matrices using the same techniques as the previous ONMTF methods proposed by Ding et al. (2006) and Yoo and Choi (2010b). In both methods, the factor matrices are estimated using a multiplicative updating algorithm, in which they are iteratively updated by element-wise multiplication. However, this algorithm suffers from two drawbacks. First, column orthogonality is approximately (not precisely) obtained despite the column orthogonality constraints. Second, although the objective function value tends to be non-increasing in the early stages, it is not exactly monotonically non-increasing. These problems are caused by the difficulty of obtaining the optimal nonnegative column orthogonal factor matrices by constrained optimization. Mirzal (2014) pointed out the second drawback in the multiplicative algorithm for ONMF, and proposed a new convergent ONMF method using an additive updating algorithm; however, there is no guarantee that an orthogonal factor matrix will be obtained. Kimura et al. (2014) proposed a new ONMF method using a hierarchical alternating least-squares algorithm; this algorithm is faster than the previous multiplicative algorithms, but still has these two drawbacks. On the other hand, Pompili et al. (2014) proposed a k-means like method for ONMF, which exactly maintains orthogonality and ensures monotonicity for the objective function. Pompili et al. (2014) found that the optimization problem of ONMF is similar to that of spherical k-means introduced by Banerjee et al. (2003), and referred to this problem as a weighted variant of the spherical k-means (WVSK). This is an ONMF method, not ONMTF, and its error distribution is normal. Thus, we extend it to the ONMTF methods based on normal and Tweedie family distributions. Of course, our ONMTF method guarantees orthogonality and the monotonically non-increasing property of the objective function value.
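For intuition about the connection to spherical k-means, a plain (unweighted) spherical k-means iteration can be sketched as follows. This is only the textbook variant in the spirit of Banerjee et al. (2003), not the WVSK algorithm itself:

```python
import numpy as np

def spherical_kmeans(Y, k, n_iter=50, seed=0):
    """Plain spherical k-means sketch: points and centroids live on the unit
    sphere; assignment maximizes cosine similarity; a centroid is the
    re-normalized mean of its assigned points."""
    rng = np.random.default_rng(seed)
    X = Y / np.linalg.norm(Y, axis=1, keepdims=True)   # unit-length rows
    C = X[rng.choice(len(X), k, replace=False)]        # centroids from data
    for _ in range(n_iter):
        labels = (X @ C.T).argmax(axis=1)              # max cosine similarity
        for m in range(k):
            v = X[labels == m].sum(axis=0)
            if np.linalg.norm(v) > 0:                  # skip empty clusters
                C[m] = v / np.linalg.norm(v)
    return labels, C

# two well-separated nonnegative clusters of identical points
Y = np.vstack([np.tile([10.0, 1.0], (20, 1)), np.tile([1.0, 10.0], (20, 1))])
labels, C = spherical_kmeans(Y, k=2)
```

The monotonicity argument in the text relies on exactly this alternation between a hard assignment step and a normalized centroid update.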
Several biclustering methods based on k-means, such as double k-means, have been proposed in the literature (Vichi 2001; Wang et al. 2011; Van Mechelen et al. 2004). These methods consider a hard clustering problem: each object belongs to exactly one cluster, and membership is binary (0 or 1). Although we also consider a hard clustering problem, our proposed methods identify a degree of membership. In that sense, our methods are somewhat more flexible than double k-means. Li and Peng (2005) relaxed the hard double k-means method using orthogonally constrained factor matrices instead of indicator matrices. However, they did not impose nonnegativity constraints on these matrices. Moreover, these former double k-means approaches did not consider non-normal error distributions. The contribution of our work is the proposal of flexible hard biclustering approaches that consider the nonnegativity constraints and robust estimation with respect to extremely large positive values.
First, we introduce ONMTF-N, our new ONMTF method based on a normal error distribution; this method is an ONMTF version of the ONMF of Pompili et al. (2014). We then introduce two other new ONMTF methods (ONMTF-P and ONMTF-CP), which are based on Poisson and CP error distributions, respectively. The Poisson distribution is often used for integer data, e.g., count data, especially in the linear regression framework under the assumption that the expected value and the variance are equal; thereby, ONMTF-P can be applied to count data, such as a contingency table. On the other hand, ONMTF-CP can be applied to nonnegative data that are collected as a summation of nonnegative values, such as a matrix containing the purchase volumes of individual customers in a store. To the best of our knowledge, ONMTF based on a CP distribution has not been proposed in any previous study. Our two simulation studies demonstrate the increased consistency of ONMTF-N (compared to ONMTF based on the multiplicative algorithm) and the robustness of ONMTF-CP. In addition, we apply our ONMTF to document-term matrices to examine its goodness of clustering.
Because the previous methods are relatively inaccurate, as shown in a simulation described later, the estimates obtained by these methods can result in misinterpretations. Our methods are more reliable because their estimates are more accurate than those of the previous methods. Moreover, our methods, especially ONMTF-P and ONMTF-CP, are well suited to real-world data, for which a normal distribution often cannot be assumed; for example, purchase volume data contain a few extremely large positive values, indicating that a large volume was purchased by only a few people. In such situations, a right-tailed distribution is more appropriate than a normal distribution. Therefore, our methods can contribute significantly to solving problems encountered in real-world data.
The notation employed in this paper is as follows. A matrix is represented in uppercase (bold type), e.g., \({\varvec{M}}\); its i, j element is represented in lowercase, e.g., \(m_{ij}\). An element of a complicated matrix is expressed as \([\cdot ]_{ij}\). Further, \({\varvec{m}}_{i}\) and \({\varvec{m}}_{(j)}\) are column vectors with elements \(m_{ij}\) of the i-th row and the j-th column of \({\varvec{M}}\), respectively. We use the prime symbol and ‘\(-1\)’ to express a transpose matrix and inverse matrix, e.g., \({\varvec{M}}^{\prime }\) and \({\varvec{M}}^{-1}\), respectively. The trace and diagonal parts of a square matrix \({\varvec{M}}\) are denoted by \(\text {tr}({\varvec{M}})\) and \(\text {diag}({\varvec{M}})\), respectively. The Euclidean norm of a matrix or vector is represented as \(\Vert {\varvec{M}}\Vert =\sqrt{\text {tr}({\varvec{M}}^{\prime }{\varvec{M}})}\). \({\varvec{D}}_{{\varvec{M}}}\) is a diagonal matrix in which each diagonal element is \(\Vert {\varvec{m}}_{(j)}\Vert \), while \(\varDelta ({\varvec{M}})\) is the vector with the absolute values of elements in eigenvector \({\varvec{v}}\) with the largest eigenvalue of square matrix \({\varvec{M}}\): \({\varvec{M}}{\varvec{v}} = \lambda {\varvec{v}}\). Finally, \(\mathbb {R}^{n \times p}_{+}\) is a set of \(n \times p\) nonnegative matrices.
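Under these conventions, the two less standard pieces of notation, \({\varvec{D}}_{{\varvec{M}}}\) and \(\varDelta ({\varvec{M}})\), can be sketched as follows (for \(\varDelta \) we assume a symmetric input so that a real eigendecomposition applies; this assumption is ours):

```python
import numpy as np

def D(M):
    """D_M: diagonal matrix whose j-th diagonal entry is the Euclidean
    norm of the j-th column of M."""
    return np.diag(np.linalg.norm(M, axis=0))

def Delta(M):
    """Delta(M): absolute values of the eigenvector belonging to the largest
    eigenvalue of square M (assumed symmetric, so np.linalg.eigh applies;
    eigh returns eigenvalues in ascending order, hence the last column)."""
    _, V = np.linalg.eigh(M)
    return np.abs(V[:, -1])

# column norms of [[3,0],[4,0]] are 5 and 0
assert np.allclose(D(np.array([[3.0, 0.0], [4.0, 0.0]])),
                   [[5.0, 0.0], [0.0, 0.0]])
```

For \({\varvec{M}} = \begin{pmatrix}2&1\\1&2\end{pmatrix}\), the largest eigenvalue is 3 with eigenvector \(\pm (1,1)/\sqrt{2}\), so \(\varDelta ({\varvec{M}})\) is elementwise \(1/\sqrt{2}\).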
We introduce herein our biclustering problem as a motivation for our ONMTF models, which are presented in later sections, and discuss the connection between the biclustering problem and ONMTF. Let \(y_{ij}\) be the i, j element of a nonnegative data matrix \({\varvec{Y}} \in \mathbb {R}^{n \times p}_{+}\), and let \(R=\{ r_{1},\dots ,r_{n} \}\) and \(C=\{ c_{1},\dots ,c_{p} \}\) be the sets of the n row and p column objects, respectively. One of our aims is to detect a k-partition \(\mathcal {R}=\{ R_{1},\dots ,R_{k} \}\) of R and an \(\ell \)-partition \(\mathcal {C}=\{ C_{1},\dots ,C_{\ell } \}\) of C, where \(R_{m}\) and \(C_{q}\) are the m- and q-th classes of \(\mathcal {R}\) and \(\mathcal {C}\), respectively. Classes \(R_{m}\) and \(C_{q}\) are defined as sets of row and column objects, respectively, and \(\mathcal {R}\) and \(\mathcal {C}\) are sets of k and \(\ell \) disjoint non-empty classes that cover R and C, respectively. This definition implies that an object that belongs to one class does not belong to any other class (i.e., when \(r_{i} \in R_{m}\) then \(r_{i} \notin R_{m^{*}}\) for all \(m^{*} \ne m\), and when \(c_{j} \in C_{q}\) then \(c_{j} \notin C_{q^{*}}\) for all \(q^{*} \ne q\)). We refer to \(R_{m}\) as the “m-th row cluster” and \(C_{q}\) as the “q-th column cluster.” We also consider the degree of membership of each object in the cluster to which it belongs. Let \({\varvec{F}}=(f_{im})\) be the \(n \times k\) membership matrix of row objects such that \(f_{im}>0\) when \(r_{i} \in R_{m}\), while \(f_{im}=0\) when \(r_{i} \notin R_{m}\). Let \({\varvec{A}}=(a_{jq})\) be the \(p \times \ell \) membership matrix of column objects such that \(a_{jq}>0\) when \(c_{j} \in C_{q}\), while \(a_{jq}=0\) when \(c_{j} \notin C_{q}\).
Note that this definition leads to the orthogonality of \({\varvec{F}}\) and \({\varvec{A}}\): \({\varvec{f}}_{(m)}^{\prime }{\varvec{f}}_{(m^{*})}=0\) for all \(m^{*} \ne m\) and \({\varvec{a}}_{(q)}^{\prime }{\varvec{a}}_{(q^{*})}=0\) for all \(q^{*} \ne q\). If \({\varvec{f}}_{(m)}\) and \({\varvec{a}}_{(q)}\) have unit length for all m, q, the orthogonality becomes orthonormality: \({\varvec{F}}^{\prime }{\varvec{F}}={\varvec{I}}_{k}\) and \({\varvec{A}}^{\prime }{\varvec{A}}={\varvec{I}}_{\ell }\). We also consider a relationship between row and column clusters. Let \({\varvec{S}} = (s_{mq})\) be a \(k \times \ell \) matrix such that \(s_{mq} > 0\) for all entries. This study aims to estimate the best unknown parameters \(\theta =\{ \mathcal {R}, \mathcal {C}, {\varvec{F}}, {\varvec{A}}, {\varvec{S}}\}\) from the given \({\varvec{Y}}\). We consider the following approximation problem for this aim: the best \(\theta \) is obtained such that for all i, j there exist \(m: R_{m} \ni r_{i}\) and \(q: C_{q} \ni c_{j}\) such that \(y_{ij} \approx x_{ij} = f_{im}s_{mq}a_{jq}\). From the definition of the membership matrices \({\varvec{F}}\) and \({\varvec{A}}\), we can rewrite the approximation problem as \(y_{ij} \approx x_{ij} = \sum _{m=1}^{k}\sum _{q=1}^{\ell }f_{im}s_{mq}a_{jq}\). In matrix form, we can describe it as \({\varvec{Y}} \approx {\varvec{X}} = {\varvec{F}}{\varvec{S}}{\varvec{A}}^{\prime }\). Note that \({\varvec{F}}\) and \({\varvec{A}}\) are column orthogonal, and all entries of all matrices are nonnegative. Hence, the approximation problem is equivalent to the ONMTF problem. We consider herein that the numbers of classes k and \(\ell \) are chosen in advance.
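A tiny numerical illustration of the model part \({\varvec{X}} = {\varvec{F}}{\varvec{S}}{\varvec{A}}^{\prime }\) with orthonormal nonnegative membership matrices (the matrices below are made up for illustration only):

```python
import numpy as np

# Each row (resp. column) object has exactly one positive membership entry,
# and every column of F and A has unit length, so F'F = I_k and A'A = I_l.
F = np.array([[0.8, 0.0],
              [0.6, 0.0],
              [0.0, 0.6],
              [0.0, 0.8]])          # n=4 row objects, k=2 row clusters
A = np.array([[1.0, 0.0],
              [0.0, 0.6],
              [0.0, 0.8]])          # p=3 column objects, l=2 column clusters
S = np.array([[5.0, 1.0],
              [2.0, 7.0]])          # positive cluster-relationship matrix

X = F @ S @ A.T                     # the ONMTF model part Y ~ X = F S A'
assert np.allclose(F.T @ F, np.eye(2))   # orthonormality F'F = I_k
assert np.allclose(A.T @ A, np.eye(2))   # orthonormality A'A = I_l
# each x_ij equals f_im * s_mq * a_jq for the clusters containing r_i and c_j:
# row 0 lies in row cluster m=0, column 1 in column cluster q=1
assert np.isclose(X[0, 1], 0.8 * 1.0 * 0.6)
```

Because each row of \({\varvec{F}}\) and of \({\varvec{A}}\) has a single nonzero entry, the double sum over m and q collapses to the single term \(f_{im}s_{mq}a_{jq}\), exactly as stated in the text.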
2 Orthogonal NMTF based on normal error distribution
In this section, a new method for NMTF, namely, ONMTF-N, is introduced. This method is based on the WVSK algorithm proposed by Pompili et al. (2014), and it is the fundamental algorithm for the ONMTF methods described in the following sections, where the data are assumed to follow Poisson and CP distributions.
This optimization problem is formally derived from a maximum likelihood (ML) problem under the assumption of normality: \(y_{ij} \sim N(x_{ij},\sigma ^{2})\) for all i, j independently, where \(N(\mu ,\sigma ^{2})\) is a normal distribution with mean \(\mu \) and variance \(\sigma ^{2}\); this is why we named the method “ONMTF-N.” However, this assumption is incompatible with our setting, in which the given data matrix contains only nonnegative entries. To address this issue, we start from (2) to describe the ONMTF-N problem. The explanations of the other two methods, described later, start from the ML problem.
In some former double k-means approaches (e.g., Vichi (2001); Wang et al. (2011); Li and Peng (2005)), the middle factor matrix \({\varvec{S}}\) is updated such that \({\varvec{S}} = {\varvec{F}}^{\prime }{\varvec{Y}}{\varvec{A}}\) or \({\varvec{S}} = ({\varvec{F}}^{\prime }{\varvec{F}})^{-1}{\varvec{F}}^{\prime }{\varvec{Y}}{\varvec{A}}({\varvec{A}}^{\prime }{\varvec{A}})^{-1}\). We could utilize this update in our methods, and it might improve estimation accuracy or computation time. However, we adopt the two-step update described in Algorithm 2.1 to emphasize that ONMTF-N is an expansion of the WVSK (Pompili et al. 2014). This expansion provides the idea behind ONMTF-P and ONMTF-CP, introduced in the sections that follow. Approaches for updating \({\varvec{S}}\) within the algorithm will be investigated in the future.
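For reference, the closed-form update of \({\varvec{S}}\) quoted above from the double k-means literature can be sketched as follows (a hypothetical helper, not part of Algorithm 2.1):

```python
import numpy as np

def update_S(Y, F, A):
    """Closed-form update S = (F'F)^{-1} F'YA (A'A)^{-1} under normal error,
    as quoted from the double k-means literature (illustrative helper)."""
    return np.linalg.solve(F.T @ F, F.T @ Y @ A) @ np.linalg.inv(A.T @ A)

# With column-orthonormal F and A the update reduces to S = F'YA, and it
# exactly recovers S from a noise-free Y = F S A'.
F = np.array([[0.8, 0.0], [0.6, 0.0], [0.0, 0.6], [0.0, 0.8]])
A = np.array([[1.0, 0.0], [0.0, 0.6], [0.0, 0.8]])
S_true = np.array([[5.0, 1.0], [2.0, 7.0]])
Y = F @ S_true @ A.T
assert np.allclose(update_S(Y, F, A), S_true)
```

This is the least-squares solution for \({\varvec{S}}\) with \({\varvec{F}}\) and \({\varvec{A}}\) held fixed, which is why the double k-means papers use it as an alternating step.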
3 Orthogonal NMTF based on Poisson error distribution
In this section, we introduce a new ONMTF method, namely, ONMTF-P. It is a modified version of ONMTF-N described in Sect. 2, and it assumes that the data follow a Poisson distribution. Although a multiplicative updating algorithm for NMTF under this assumption was proposed by Yoo and Choi (2009), orthogonality constraints were not imposed on it. In contrast, our algorithm is based not on a multiplicative updating algorithm but on the WVSK algorithm, and the orthogonality constraints are imposed. Unfortunately, we provide only the model and the algorithm of ONMTF-P because of space limitations. For the derivation of the update equations, please see Abe H, Yadohisa H (2017), Supplementary material to “Orthogonal nonnegative matrix tri-factorization based on Tweedie distributions.”
4 Orthogonal NMTF based on CP error distribution
In this section, we introduce the other new method for ONMTF, namely, ONMTF-CP, in which the data are assumed to follow a CP distribution. This method is also based on the WVSK algorithm, as in the cases of ONMTF-N and ONMTF-P described in Sects. 2 and 3, respectively. ONMTF-CP has a hyperparameter \(\beta \) that determines the robustness of estimation against extremely large positive values. It is noteworthy that we derive a new auxiliary function for updating the middle factor matrix \({\varvec{S}}\).
Corollary 1
The bivariate function f(x, y) in (28) is concave if \((x,y)\in \mathbb {R}^{2}_{+}\) and \(\beta >0\).
Proof
4.1 Some issues
The proposed ONMTF methods introduced earlier are based on the k-means algorithm on both the row and column sides. In other words, these methods share the disadvantages of double k-means clustering: initialization, local minima, and empty clusters. As for initialization, we randomly assign each column object to a cluster in \(\mathcal {C}\) and use exponential random numbers as the initial values of \(a_{jq}\;(c_{j} \in C_{q})\) for all j and of all entries of \({\varvec{S}}\). Note that the initial \(\mathcal {R}\) and \({\varvec{F}}\) are not randomly given but are updated from the randomly initialized \(\mathcal {C}\), \({\varvec{A}}\), and \({\varvec{S}}\). Concerning local minima and empty clusters, it is recommended to run these algorithms with a sufficiently large number of random starts. We restart the update iteration from another random start if empty clusters occur, and we select the estimates with the smallest objective function value among those without empty clusters. We do not use k-means initialization schemes (e.g., Xue et al. (2008)) in our proposed methods for the following two reasons: (1) such an initialization itself requires an initialization to compute, and (2) it does not always lead to the best estimates; therefore, we use a large number of random starts.
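The initialization described above might be sketched as follows (function and variable names are illustrative, not taken from the paper):

```python
import numpy as np

def init_columns(p, ell, k, rng):
    """Random initialization sketch: assign each column object to one of the
    ell column clusters, then draw its positive membership value and all
    entries of S from an exponential distribution."""
    labels = rng.integers(ell, size=p)            # random column cluster c_j
    while len(set(labels.tolist())) < ell:        # redraw if a cluster is empty
        labels = rng.integers(ell, size=p)
    A = np.zeros((p, ell))
    A[np.arange(p), labels] = rng.exponential(size=p)  # a_jq > 0 iff c_j in C_q
    S = rng.exponential(size=(k, ell))            # all entries of S positive
    return labels, A, S

labels, A, S = init_columns(p=12, ell=3, k=4, rng=np.random.default_rng(1))
```

The resulting \({\varvec{A}}\) has exactly one positive entry per row, matching the membership structure defined earlier; \(\mathcal {R}\) and \({\varvec{F}}\) would then be obtained from the first update step rather than drawn randomly.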
5 Simulation studies
In this section, we describe two simulation studies. The first study compares ONMTF-N and previous ONMTF methods in terms of estimation accuracy. The second study analyzes the characteristics of the estimates given by ONMTF-N, ONMTF-P, and ONMTF-CP.
5.1 Simulation study 1: estimation accuracy of ONMTF-N
5.2 Simulation study 2: robustness of ONMTF-CP against extremely large positive values
We conducted another simulation study to demonstrate the characteristics of the estimates given by ONMTF-N, ONMTF-P, and ONMTF-CP. As mentioned in previous sections, \(y_{ij}\) is assumed to follow normal, Poisson, and CP distributions, respectively, in these three ONMTF methods. These distributions belong to the Tweedie family, which is described by (20), and the value of \(\beta \) determines the distribution: it is normal if \(\beta =2\), Poisson if \(\beta =1\), and CP if \(\beta \in (0,1)\). The index parameter \(\beta \) is related to robust estimation against extremely large positive values. Figure 2 shows the \(\beta \)-divergence, which is derived from the log-likelihood of Tweedie distributions, for various \(\beta \) values when \(y=10\) or \(y=100\). The \(\beta \)-divergence around a small y is larger than that around a large y, except when \(\beta =2\). This means that extremely large positive values in the data are downweighted in parameter estimation when \(\beta \) is smaller than 2. To examine these characteristics in ONMTF, we measure the estimation accuracy of ONMTF-N, ONMTF-P, and ONMTF-CP for synthetic data matrices generated using normal, Poisson, and CP distributions. The accuracy is calculated using the ARI between the true and estimated clusters of row and column objects.
We now explain how to generate a synthetic data matrix. First, we generate \(\tilde{\mathcal {R}}\), \(\tilde{{\varvec{F}}}^{*}\), \(\tilde{\mathcal {C}}\), \(\tilde{{\varvec{A}}}^{*}\), and \(\tilde{{\varvec{S}}}^{*}\) as in Sect. 5.1. Next, we generate each element of the synthetic data matrix \({\varvec{Y}}\) as a random number from \(y_{ij} \sim TW(x_{ij},\phi ,\tilde{\beta })\), where \(TW(x,\phi ,\beta )\) denotes a Tweedie distribution, and the mean \(x_{ij}\) is the corresponding i, j element of \({\varvec{X}} = \tilde{{\varvec{F}}}^{*}\tilde{{\varvec{S}}}^{*}\tilde{{\varvec{A}}}^{*\prime }\). Note that \(TW(x_{ij},\phi ,\tilde{\beta })\) is normal if \(\tilde{\beta }=2\) and Poisson if \(\tilde{\beta }=1\). Negative values of \(y_{ij}\) can be generated when \(\tilde{\beta }=2\); in this case, \(y_{ij}\) is converted to zero. The parameters for generating synthetic data are set as follows: \((n,p,k,\ell ) = (100,100,5,5)\), \(\phi = 2\), \(\tilde{\beta }=\{ 2,1,0.8,0.5,0.2 \}\), \(\mu =10\), \(\tau = np \times 10^{-7}\) (the threshold for stopping the algorithm), and \(\nu = 1000\) (the maximum number of iterative cycles). Note that the true numbers of row and column clusters and the estimated ones are the same as k and \(\ell \), respectively. We generate 100 synthetic data matrices for each of the five conditions. Then, from among the estimate candidates given by 20 executions, we select the best estimates, \(\hat{\mathcal {R}}\), for which the objective function value is minimized. We then calculate \(\text {ARI}(\tilde{\mathcal {R}},\hat{\mathcal {R}})\) for each ONMTF. We execute ONMTF-CP for three cases, \(\beta =\{0.2,0.5,0.8\}\), and refer to these procedures as ONMTF-CP2, ONMTF-CP5, and ONMTF-CP8, respectively. The results are shown in Fig. 3. Note that we do not show the ARI results for column clustering because they are very similar to those for row clustering.
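The accuracy measure used throughout is the ARI between two partitions. For completeness, a self-contained sketch of the standard contingency-table formula is:

```python
import numpy as np
from math import comb

def ari(labels_true, labels_pred):
    """Adjusted Rand index between two partitions, computed from the
    contingency table (standard formula, sketched here for completeness)."""
    t, p = np.asarray(labels_true), np.asarray(labels_pred)
    n = len(t)
    ct = {}                                   # contingency table counts n_ij
    for a, b in zip(t.tolist(), p.tolist()):
        ct[(a, b)] = ct.get((a, b), 0) + 1
    sum_ij = sum(comb(v, 2) for v in ct.values())
    sum_a = sum(comb(int(np.sum(t == u)), 2) for u in set(t.tolist()))
    sum_b = sum(comb(int(np.sum(p == u)), 2) for u in set(p.tolist()))
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

# identical partitions up to relabeling score 1; independent ones score ~0
assert np.isclose(ari([0, 0, 1, 1], [1, 1, 0, 0]), 1.0)
```

The ARI is invariant to label permutations, which is essential here because the estimated cluster indices carry no intrinsic order.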
When \(\tilde{\beta }=2\) (normal), ONMTF-N has the best accuracy, followed by ONMTF-P, ONMTF-CP8, ONMTF-CP5, and ONMTF-CP2, in that order. When \(\tilde{\beta }=0.5\), ONMTF-N has the worst accuracy; when \(\tilde{\beta }=0.2\), the accuracy deteriorates in the order of ONMTF-N, ONMTF-P, and ONMTF-CP8. Because more extremely large positive values are generated from a CP distribution with small \(\tilde{\beta }\) values, these results imply that ONMTF-N, ONMTF-P, and ONMTF-CP procedures with relatively larger \(\beta \) values do not fit a data matrix containing some extremely large positive values. This does not mean that an ONMTF-CP procedure with a small \(\beta \) value is the best in any case. It may be worse for a data matrix having a normal error, as shown in the case of \(\tilde{\beta }=2\) in Fig. 3. However, it may be preferable to use ONMTF-CP because the inaccuracy of ONMTF-CP with small \(\beta \) values is smaller than that of ONMTF-N for a data matrix containing some extremely large positive values.
6 Applications
List of methods for comparison in the application for document clustering
Method | Abbreviation | Paper |
---|---|---|
ONMTF by Ding et al. (2006) | Ding | Ding et al. (2006) |
ONMTF by Yoo and Choi (2010b) | Yoo | Yoo and Choi (2010b) |
Double K-means | DK | Vichi (2001) |
LP-FNMTF | LP | Wang et al. (2011) |
ONMTF-N | N | Proposed |
ONMTF-P | P | Proposed |
ONMTF-CP | CP | Proposed |
Graph modularity maximization | Mod | Ailem et al. (2016) |
SPKM | SP | Banerjee et al. (2003) |
WVSPKM | WV | Pompili et al. (2014) |
Stats of some text-word datasets in CLUTO
Data | Documents | Terms | Classes | Elements | Nonzero elements | Ratio of nonzero elements (%) | Total words |
---|---|---|---|---|---|---|---|
tr23 | 204 | 5832 | 6 | 1189728 | 78609 | 6.61 | 493387 |
tr12 | 313 | 5804 | 8 | 1816652 | 85640 | 4.71 | 311111 |
tr11 | 414 | 6429 | 9 | 2661606 | 116613 | 4.38 | 437143 |
re0 | 1504 | 2886 | 13 | 4340544 | 77808 | 1.79 | 128671 |
fbis | 2463 | 2000 | 17 | 4926000 | 393386 | 7.99 | 1063914 |
tr45 | 690 | 8261 | 10 | 5700090 | 193605 | 3.40 | 646537 |
re1 | 1657 | 3758 | 25 | 6227006 | 87328 | 1.40 | 142680 |
tr41 | 878 | 7454 | 10 | 6544612 | 171509 | 2.62 | 357606 |
tr31 | 927 | 10128 | 7 | 9388656 | 248903 | 2.65 | 892795 |
wap | 1560 | 8460 | 20 | 13197600 | 220482 | 1.67 | 337521 |
k1a | 2340 | 21839 | 20 | 51103260 | 349792 | 0.68 | 530374 |
k1b | 2340 | 21839 | 6 | 51103260 | 349792 | 0.68 | 530374 |
hitech | 2301 | 22498 | 6 | 51767898 | 346881 | 0.67 | 549664 |
The data matrices we used were obtained from the open data CLUTO^{1} website. Table 2 lists the selected data matrices and their statistics. The datasets in Table 2 are ordered by the number of elements. The tr11, tr12, tr23, tr31, tr41, and tr45 datasets are derived from the TREC^{2} collections. The true categories of the documents in the tr31 and tr41 datasets are obtained by particular queries. The re0 and re1 datasets are from the Reuters-21578 text categorization test collection, distribution 1.0.^{3} The fbis dataset is from the Foreign Broadcast Information Service data of TREC-5. The hitech dataset consists of San Jose Mercury newspaper articles about computers, electronics, health, medicine, research, and technology. The k1a, k1b, and wap datasets were used in the WebACE project (Boley et al. 1999) and contain web pages from various subject directories of Yahoo!.^{4} Datasets k1a and k1b contain the same documents, but their true labels are different.
ARI between the given and estimated clusters of the documents
data | Ding | Yoo | DK | LP | N | P | CP8 | CP5 | CP2 | Mod | SP | WV |
---|---|---|---|---|---|---|---|---|---|---|---|---|
tr23 | 0.02 | 0.24 | 0.22 | 0.09 | 0.21 | 0.14 | 0.06 | 0.02 | 0.02 | 0.13 | 0.28 | *0.22 |
tr12 | 0.59 | 0.60 | 0.18 | 0.13 | 0.57 | 0.12 | 0.04 | 0.01 | 0.01 | 0.12 | 0.36 | *0.58 |
tr11 | *0.60 | 0.63 | 0.24 | 0.29 | 0.69 | 0.13 | 0.01 | 0.01 | 0.00 | 0.26 | 0.52 | 0.59 |
re0 | *0.18 | 0.21 | 0.08 | 0.12 | 0.16 | 0.08 | 0.06 | 0.09 | 0.06 | 0.10 | 0.18 | 0.17 |
fbis | 0.31 | 0.29 | 0.24 | 0.20 | 0.32 | 0.36 | *0.36 | 0.32 | 0.17 | 0.48 | 0.36 | 0.33 |
tr45 | 0.42 | 0.32 | 0.22 | 0.10 | 0.64 | 0.44 | 0.32 | 0.03 | 0.02 | 0.38 | 0.62 | *0.54 |
re1 | 0.12 | 0.06 | 0.10 | *0.12 | 0.11 | 0.10 | 0.06 | 0.04 | 0.02 | 0.10 | 0.25 | 0.42 |
tr41 | 0.33 | 0.21 | 0.32 | 0.12 | 0.43 | 0.40 | 0.30 | 0.01 | 0.01 | 0.42 | 0.49 | *0.43 |
tr31 | *0.41 | 0.10 | 0.22 | 0.04 | 0.55 | 0.30 | 0.26 | 0.02 | 0.01 | 0.40 | 0.38 | 0.60 |
wap | *0.30 | 0.26 | 0.25 | 0.11 | 0.27 | 0.26 | 0.24 | 0.13 | 0.02 | 0.45 | 0.43 | 0.29 |
k1a | 0.33 | 0.44 | 0.29 | 0.14 | 0.21 | 0.25 | 0.16 | 0.01 | 0.01 | 0.39 | 0.32 | *0.34 |
k1b | 0.42 | 0.69 | 0.33 | 0.06 | 0.50 | 0.17 | 0.00 | 0.00 | 0.00 | 0.19 | 0.45 | *0.50 |
hitech | 0.20 | 0.00 | 0.07 | 0.01 | *0.12 | 0.08 | 0.00 | 0.00 | 0.00 | 0.11 | 0.26 | 0.12 |
Table 3 shows the ARI of the 12 methods for each dataset. Yoo’s method achieves the best ARI on four datasets (tr12, re0, k1a, and k1b), the largest number among the 12 methods. The second best performance is shown by SP, which achieves the best ARI on three datasets (tr23, tr41, and hitech) and high accuracy on many others. The third best performance is shown by N, Mod, and WV, each of which achieves the best ARI on two datasets. Although Yoo’s method seems to show the best performance in terms of the number of datasets on which it achieves the best ARI, it has poor ARI on some datasets, e.g., re1, tr41, tr31, and hitech. In fact, Yoo’s method outperforms N on no more than 5 of the 13 datasets. On the other hand, among the biclustering methods, N performs well on almost all datasets. However, it is poorer than the one-side clustering methods SP and WV. A possible reason is as follows: applying a one-side clustering approach to the document-term clustering task clusters the documents and the terms into the same number of clusters, whereas a biclustering approach clusters the two into different numbers of clusters. In the biclustering approach, the number of term clusters may be too small or too large for clustering the documents, and this mismatch could cause the poorer performance of N in comparison to the one-side clustering methods. We also observe the worse performance of P and the CP variants, which are robust methods, in comparison to the other methods. A likely cause of this result is their robustness against large positive values: we use a document-term matrix converted by the tf-idf conversion, which strongly weights terms that appear in only a few documents, and the entries for such terms take large positive values.
The effect of these weighted entries disappears when the robust ONMTFs are used, in contrast to an ONMTF based on the Euclidean distance, and interpretable clusters cannot be obtained. Note that good performance of the robust ONMTFs might be obtained using standardizations other than the tf-idf conversion; we will undertake this challenge in future work. DK and LP show poor results on most of the datasets. The accuracy of LP could be improved by selecting more appropriate hyperparameters for each dataset.
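Since the tf-idf conversion plays a central role in this discussion, here is a sketch of one common tf-idf variant (the paper does not specify which variant was used, so the exact weighting below is an assumption):

```python
import numpy as np

def tfidf(Y):
    """Plain tf-idf weighting of a document-term count matrix (rows are
    documents): term frequency times log(n_docs / document frequency).
    One common variant among several; guards avoid division by zero."""
    tf = Y / np.maximum(Y.sum(axis=1, keepdims=True), 1)   # term frequency
    df = (Y > 0).sum(axis=0)                               # document frequency
    idf = np.log(Y.shape[0] / np.maximum(df, 1))           # inverse doc freq
    return tf * idf

Y = np.array([[2, 0, 1],
              [0, 3, 1],
              [1, 1, 1]])
W = tfidf(Y)
# a term occurring in every document gets idf = log(3/3) = 0
assert np.allclose(W[:, 2], 0.0)
```

Terms concentrated in a few documents receive large positive weights under this scheme, which is exactly the property the robust ONMTFs discount.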
We also obtained results on computational time, degree of approximation (measured as the Euclidean distance at convergence), and convergence behavior. However, we do not report them herein because of space limitations. Please see Abe H, Yadohisa H (2017), Supplementary material to “Orthogonal nonnegative matrix tri-factorization based on Tweedie distributions,” for these results.
Terms of 10 clusters in the k1b dataset obtained using ONMTF-N
TC 1 | \(a_{j1}\) | TC 2 | \(a_{j2}\) | TC 3 | \(a_{j3}\) | TC 4 | \(a_{j4}\) | TC 5 | \(a_{j5}\) |
---|---|---|---|---|---|---|---|---|---|
Film | 0.26 | Cell | 0.25 | Box | 0.43 | Stock | 0.29 | Week | 0.48 |
Tv | 0.17 | Cancer | 0.24 | Million | 0.33 | Internet | 0.26 | Bestsell | 0.45 |
Hollywood | 0.15 | Risk | 0.24 | Weekend | 0.29 | Compani | 0.21 | Weekli | 0.35 |
cb | 0.14 | Studi | 0.22 | Offic | 0.27 | Dow | 0.21 | Hardcov | 0.32 |
Star | 0.14 | Research | 0.20 | Movi | 0.24 | Microsoft | 0.17 | Publish | 0.28 |
Diana | 0.12 | Patient | 0.19 | Gross | 0.24 | Percent | 0.16 | Paperback | 0.25 |
Fox | 0.12 | Women | 0.18 | Sept | 0.23 | Comput | 0.16 | Fiction | 0.14 |
Game | 0.12 | Diseas | 0.17 | Top | 0.22 | Busi | 0.14 | Mass | 0.11 |
Festiv | 0.12 | Heart | 0.16 | Chart | 0.21 | Market | 0.13 | Random | 0.10 |
Season | 0.11 | Drug | 0.14 | Exhibitor | 0.17 | Intel | 0.13 | Trade | 0.08 |
TC 6 | \(a_{j6}\) | TC 7 | \(a_{j7}\) | TC 8 | \(a_{j8}\) | TC 9 | \(a_{j9}\) | TC 10 | \(a_{j10}\) |
---|---|---|---|---|---|---|---|---|---|
Emmi | 0.74 | Report | 0.21 | Rate | 0.27 | Deal | 0.20 | York | 0.21 |
Win | 0.31 | Accord | 0.20 | Adult | 0.20 | Network | 0.19 | Unit | 0.18 |
Drama | 0.25 | People | 0.17 | Includ | 0.16 | Quote | 0.16 | Averag | 0.17 |
Comedi | 0.21 | American | 0.16 | Time | 0.16 | Am | 0.16 | Call | 0.17 |
Actor | 0.16 | Univers | 0.16 | Previou | 0.14 | Set | 0.16 | Loss | 0.17 |
Award | 0.16 | Develop | 0.15 | Home | 0.14 | Octob | 0.16 | Offer | 0.16 |
Franz | 0.15 | Death | 0.15 | Septemb | 0.14 | Wednesdai | 0.16 | System | 0.16 |
Actress | 0.14 | Author | 0.15 | Program | 0.14 | Mondai | 0.15 | Pm | 0.16 |
Sundai | 0.14 | Lead | 0.14 | Nation | 0.14 | Record | 0.15 | Gain | 0.15 |
Gillian | 0.13 | Surgeri | 0.13 | Fridai | 0.13 | Tuesdai | 0.14 | Provid | 0.14 |
Terms of 10 clusters in the k1b dataset obtained using Yoo’s method. The value to the right of each term is its value of \(a_{jq}\). All \(a_{jq}\) are standardized such that each column vector of \({\varvec{A}}\) has length 1. Only the terms with the top 10 values of \(a_{jq}\) in each cluster are shown
TC 1 | \(a_{j1}\) | TC 2 | \(a_{j2}\) | TC 3 | \(a_{j3}\) | TC 4 | \(a_{j4}\) | TC 5 | \(a_{j5}\) |
---|---|---|---|---|---|---|---|---|---|
Film | 0.24 | Risk | 0.30 | Game | 0.33 | Stock | 0.35 | Week | 0.49 |
Tv | 0.22 | Patient | 0.24 | Season | 0.17 | Compani | 0.24 | Bestsell | 0.44 |
Box | 0.17 | Studi | 0.21 | Marlin | 0.16 | Internet | 0.19 | Weekli | 0.35 |
Top | 0.16 | Heart | 0.20 | Pippen | 0.16 | Microsoft | 0.17 | Hardcov | 0.32 |
Star | 0.14 | Drug | 0.20 | Blackhawk | 0.16 | Percent | 0.17 | Publish | 0.27 |
Festiv | 0.12 | Women | 0.19 | Surgeri | 0.15 | Busi | 0.16 | Paperback | 0.25 |
Weekend | 0.12 | Infect | 0.17 | Indian | 0.15 | Industri | 0.15 | Fiction | 0.13 |
Music | 0.12 | Blood | 0.17 | Oriol | 0.15 | Financi | 0.14 | Mass | 0.11 |
Diana | 0.12 | Breast | 0.16 | Coach | 0.14 | Oct | 0.13 | Random | 0.09 |
Pictur | 0.11 | Increas | 0.14 | Nomo | 0.14 | Trad | 0.13 | Trade | 0.08 |
TC 6 | \(a_{j6}\) | TC 7 | \(a_{j7}\) | TC 8 | \(a_{j8}\) | TC 9 | \(a_{j9}\) | TC 10 | \(a_{j10}\) |
---|---|---|---|---|---|---|---|---|---|
Emmi | 0.74 | Million | 0.28 | Dow | 0.38 | Cancer | 0.41 | Home | 0.14 |
Win | 0.32 | Cb | 0.27 | Internet | 0.23 | Cell | 0.30 | Includ | 0.14 |
Drama | 0.25 | Hollywood | 0.23 | Quarter | 0.22 | Research | 0.29 | People | 0.14 |
Comedi | 0.21 | Fox | 0.19 | Intel | 0.19 | Gene | 0.21 | York | 0.12 |
Actor | 0.16 | Debut | 0.19 | Softwar | 0.19 | Brain | 0.20 | Call | 0.12 |
Franz | 0.15 | Premier | 0.19 | Chip | 0.18 | Diseas | 0.19 | Program | 0.12 |
Award | 0.15 | Film | 0.15 | Apple | 0.17 | Mutat | 0.14 | Clinton | 0.12 |
Actress | 0.14 | Deal | 0.13 | Quote | 0.17 | Studi | 0.14 | Lead | 0.11 |
Sundai | 0.13 | Award | 0.13 | Oper | 0.16 | Tumor | 0.10 | Receiv | 0.11 |
Gillian | 0.13 | Ticket | 0.13 | Technologi | 0.16 | Test | 0.10 | Previou | 0.11 |
Middle factor matrix \({\varvec{S}}\) of the k1b dataset
Cluster | ONMTF-N | | | | | | Yoo’s method | | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
 | DC1 | DC2 | DC3 | DC4 | DC5 | DC6 | DC1 | DC2 | DC3 | DC4 | DC5 | DC6 |
TC 1 | 1.76 | 0.08 | 0.17 | 0.12 | 0.05 | 0.11 | 1.53 | 0.01 | 0.04 | 0.05 | 0.01 | 0.00 |
TC 2 | 0.09 | 1.95 | 0.01 | 0.05 | 0.01 | 0.00 | 0.00 | 1.27 | 0.00 | 0.00 | 0.01 | 0.00 |
TC 3 | 0.28 | 0.04 | 1.57 | 0.11 | 0.04 | 0.03 | 0.03 | 0.00 | 1.59 | 0.02 | 0.00 | 0.00 |
TC 4 | 0.22 | 0.06 | 0.04 | 1.61 | 0.02 | 0.01 | 0.14 | 0.04 | 0.03 | 1.35 | 0.00 | 0.00 |
TC 5 | 0.16 | 0.06 | 0.01 | 0.05 | 1.75 | 0.02 | 0.02 | 0.01 | 0.00 | 0.01 | 1.80 | 0.00 |
TC 6 | 0.27 | 0.01 | 0.03 | 0.02 | 0.00 | 1.99 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 2.02 |
TC 7 | 0.36 | 0.81 | 0.04 | 0.19 | 0.02 | 0.03 | 0.93 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 |
TC 8 | 0.77 | 0.39 | 0.06 | 0.17 | 0.04 | 0.08 | 0.00 | 0.00 | 0.00 | 0.69 | 0.00 | 0.00 |
TC 9 | 0.85 | 0.11 | 0.06 | 0.58 | 0.02 | 0.06 | 0.01 | 1.17 | 0.00 | 0.01 | 0.00 | 0.00 |
TC 10 | 0.38 | 0.36 | 0.06 | 0.44 | 0.03 | 0.02 | 0.55 | 0.45 | 0.15 | 0.15 | 0.00 | 0.00 |
Our interpretation of the term clusters obtained by the two methods
Cluster | ONMTF-N | Yoo’s method |
---|---|---|
TC 1 | Cinema and television | Cinema, television, and box office |
TC 2 | Clinical | Clinical |
TC 3 | Box office | Sports |
TC 4 | Economics and technology | Economics and technology |
TC 5 | Book sales | Book sales |
TC 6 | Emmy awards | Emmy awards |
TC 7 | Words used for a research reference | Cinema and its profit |
TC 8 | (Assorted terms) | Internet technology |
TC 9 | (Assorted terms) | Biology |
TC 10 | (Assorted terms) | (Assorted terms) |
We now focus on the term clustering. Each term cluster obtained using the two methods can be interpreted as in Table 7. Both methods extract similar term clusters (e.g., TC 1, 2, 4, 5, and 6). However, TC 3, which is strongly related to DC 3 in both methods (see Table 6), contains different terms: TC 3 of ONMTF-N appears to represent the box-office movie chart, whereas that of Yoo’s method appears to represent sports. Although the details of the document clusters cannot be shown owing to space limitations, DC 3 of ONMTF-N in fact includes some “entertainment” documents, while that of Yoo’s method includes almost all of the “sports” documents. Because Yoo’s TC 1 includes some words related to the box office, documents related to the box-office chart can be integrated into DC 1 in Yoo’s estimates. On the other hand, ONMTF-N has no term cluster on sports; it indeed misclassifies the “sports” documents into DC 1, which includes many entertainment documents. Although this misclassification gives ONMTF-N a poorer ARI than Yoo’s method, ONMTF-N detects a meaningful cluster related to the box-office chart instead of “sports.” However, ONMTF-N also estimates document clusters that do not correspond to the real document labels, and it yields some assorted term clusters that are not related to the document clustering. The difference between the two methods can be explained by their estimation algorithms: ONMTF-N has less freedom than Yoo’s method in estimating the factor matrices, owing to their strict orthogonality. Indeed, its objective function value at convergence is larger than Yoo’s (241.8 for ONMTF-N versus 241.1 for Yoo’s method). This estimation problem can be mitigated by using more random starts, which is feasible thanks to the higher computational speed of ONMTF-N compared with Yoo’s method.
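The random-restart strategy mentioned above can be sketched generically. This is not the authors' algorithm: k-means is only a stand-in for the proposed updating rules, the k-means objective (inertia) plays the role of the ONMTF loss, and the degeneration check (a cluster capturing fewer than two points) is an illustrative assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def best_of_restarts(X, n_clusters, n_starts=20, seed=0):
    """Run several random starts and keep the one with the lowest objective.

    A start is discarded if a cluster degenerates (captures fewer than
    two points); k-means stands in for the proposed ONMTF updates.
    """
    best = None
    for s in range(n_starts):
        km = KMeans(n_clusters=n_clusters, n_init=1, random_state=seed + s).fit(X)
        sizes = np.bincount(km.labels_, minlength=n_clusters)
        if sizes.min() < 2:  # degenerate solution: try another start
            continue
        if best is None or km.inertia_ < best.inertia_:
            best = km
    return best

# Toy data with three well-separated groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in (0.0, 3.0, 6.0)])
model = best_of_restarts(X, n_clusters=3)
print(model.inertia_)
```

Keeping only non-degenerate solutions and selecting by the objective value is exactly the role more random starts would play for ONMTF-N, at the cost of additional runs.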
7 Conclusion
In this paper, we proposed a new method for ONMTF, namely, ONMTF-N, in which the objective function value is monotonically non-increasing and the orthogonality of the factor matrices is maintained. In addition, we proposed two other ONMTF methods, namely, ONMTF-P and ONMTF-CP. The main contributions of this paper are as follows. First, our simulation study and an application involving some document-term matrices indicated that ONMTF-N shows higher estimation accuracy than previous methods. Second, we derived a new auxiliary function to optimize the middle factor matrix in ONMTF-CP using an inequality of a bivariate concave function. Third, another simulation study indicated that ONMTF-CP may be robust against the effect of extremely large positive values.
NMFs with orthogonality constraints, including our methods, should be used with the trade-off between easy-to-understand estimates and underfitting in mind. An orthogonal constraint simplifies a factor matrix, thereby facilitating interpretation of the results. However, a factor matrix with such a simplified structure leads to a poorer approximation of the data matrix \({\varvec{Y}}\).
Two issues remain to be addressed in the future. First, cluster degeneration tends to occur in the three proposed methods. Therefore, further investigation is required to develop an approach for rapidly seeking alternative initial parameters, in order to avoid cluster degeneration. Second, it is necessary to develop an approach for estimating the best number of clusters for both row objects and column objects.
Acknowledgements
We would like to express our greatest appreciation to the editor and referees for their insightful comments, which have helped us significantly improve the paper.
References
- Ailem M, Role F, Nadif M (2016) Graph modularity maximization as an effective method for co-clustering text data. Knowl Based Syst 109:160–173
- Banerjee A, Dhillon I, Ghosh J, Sra S (2003) Generative model-based clustering of directional data. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 19–28
- Berry MW, Browne M, Langville AN, Pauca VP, Plemmons RJ (2007) Algorithms and applications for approximate nonnegative matrix factorization. Comput Stat Data Anal 52(1):155–173
- Boley D (1998) Hierarchical taxonomies using divisive partitioning. Technical Report TR-98-012, Department of Computer Science, University of Minnesota, Minneapolis
- Boley D, Gini M, Gross R, Han EHS, Hastings K, Karypis G, Kumar V, Mobasher B, Moore J (1999) Document categorization and query generation on the world wide web using WebACE. Artif Intell Rev 13(5–6):365–391
- Carabias-Orti JJ, Rodríguez-Serrano FJ, Vera-Candeas P, Cañadas-Quesada FJ, Ruiz-Reyes N (2013) Constrained non-negative sparse coding using learnt instrument templates for realtime music transcription. Eng Appl Artif Intell 26(7):1671–1680
- Chen G, Wang F, Zhang C (2009) Collaborative filtering using orthogonal nonnegative matrix tri-factorization. Inf Process Manag 45(3):368–379
- Choi S (2008) Algorithms for orthogonal nonnegative matrix factorization. In: 2008 IEEE international joint conference on neural networks (IJCNN 2008, IEEE world congress on computational intelligence), IEEE, pp 1828–1832
- Cichocki A, Amari S (2010) Families of alpha-, beta- and gamma-divergences: flexible and robust measures of similarities. Entropy 12(6):1532–1568
- Costa G, Ortale R (2014) XML document co-clustering via non-negative matrix tri-factorization. In: 2014 IEEE 26th international conference on tools with artificial intelligence (ICTAI), IEEE, pp 607–614
- Ding C, Li T, Peng W, Park H (2006) Orthogonal nonnegative matrix tri-factorizations for clustering. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 126–135
- Dunn PK, Smyth GK (2001) Tweedie family densities: methods of evaluation. In: Proceedings of the 16th international workshop on statistical modelling, Odense, Denmark, pp 2–6
- Févotte C, Idier J (2011) Algorithms for nonnegative matrix factorization with the \(\beta \)-divergence. Neural Comput 23(9):2421–2456
- Févotte C, Bertin N, Durrieu JL (2009) Nonnegative matrix factorization with the Itakura–Saito divergence: with application to music analysis. Neural Comput 21(3):793–830
- Govaert G, Nadif M (2013) Co-clustering. Wiley, Hoboken
- Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218
- Jørgensen B (1997) The theory of dispersion models. CRC Press, Boca Raton
- Kim Y, Kim TK, Kim Y, Yoo J, You S, Lee I, Carlson G, Hood L, Choi S, Hwang D (2011) Principal network analysis: identification of subnetworks representing major dynamics using gene expression data. Bioinformatics 27(3):391–398
- Kimura K, Tanaka Y, Kudo M (2014) A fast hierarchical alternating least squares algorithm for orthogonal nonnegative matrix factorization. In: Asian conference on machine learning (ACML)
- Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401(6755):788–791
- Lee DD, Seung HS (2001) Algorithms for non-negative matrix factorization. In: Advances in neural information processing systems, pp 556–562
- Li T, Peng W (2005) A clustering model based on matrix approximation with applications to cluster system log files. In: European conference on machine learning, Springer, pp 625–632
- Li Y, Zhang X, Sun M (2017) Robust non-negative matrix factorization with \(\beta \)-divergence for speech separation. ETRI J 39(1):21–29
- Li Z, Wu X, Peng H (2010) Nonnegative matrix factorization on orthogonal subspace. Pattern Recognit Lett 31(9):905–911
- Mauthner T, Kluckner S, Roth PM, Bischof H (2010) Efficient object detection using orthogonal NMF descriptor hierarchies. In: Goesele M, Roth S, Kuijper A, Schiele B, Schindler K (eds) Pattern recognition. Springer, pp 212–221
- Mirzal A (2014) A convergent algorithm for orthogonal nonnegative matrix factorization. J Comput Appl Math 260:149–166
- Nakano M, Kameoka H, Le Roux J, Kitano Y, Ono N, Sagayama S (2010) Convergence-guaranteed multiplicative algorithms for nonnegative matrix factorization with \(\beta \)-divergence. In: 2010 IEEE international workshop on machine learning for signal processing (MLSP), IEEE, pp 283–288
- Ohnishi T, Dunn PK (2007) Analysis of the rainfall data in Queensland using the Tweedie GLM. In: Proceedings of the 2007 Japanese joint statistical meeting, pp 18–18
- Pompili F, Gillis N, Absil PA, Glineur F (2014) Two algorithms for orthogonal nonnegative matrix factorization with application to clustering. Neurocomputing 141:15–25
- Simsekli U, Cemgil A, Yilmaz YK (2013) Learning the beta-divergence in Tweedie compound Poisson matrix factorization models. In: Proceedings of the 30th international conference on machine learning (ICML-13), pp 1409–1417
- Smyth GK, Jørgensen B (2002) Fitting Tweedie’s compound Poisson model to insurance claims data: dispersion modelling. Astin Bull 32(01):143–157
- Tan VY, Févotte C (2013) Automatic relevance determination in nonnegative matrix factorization with the \(\beta \)-divergence. IEEE Trans Pattern Anal Mach Intell 35(7):1592–1605
- Van Mechelen I, Bock HH, De Boeck P (2004) Two-mode clustering methods: a structured overview. Stat Methods Med Res 13(5):363–394
- Vichi M (2001) Double k-means clustering for simultaneous classification of objects and variables. In: Borra S, Rocci R, Vichi M, Schader M (eds) Advances in classification and data analysis. Springer, pp 43–52
- Virtanen T (2007) Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Trans Audio Speech Lang Process 15(3):1066–1074
- Virtanen T, Gemmeke JF, Raj B, Smaragdis P (2015) Compositional models for audio processing: uncovering the structure of sound mixtures. IEEE Signal Process Mag 32(2):125–144
- Wang F, Zhu H, Tan S, Shi H (2016) Orthogonal nonnegative matrix factorization based local hidden Markov model for multimode process monitoring. Chin J Chem Eng 24:856–860
- Wang H, Nie F, Huang H, Makedon F (2011) Fast nonnegative matrix tri-factorization for large-scale data co-clustering. In: IJCAI proceedings: international joint conference on artificial intelligence, vol 22, p 1553
- Wang YX, Zhang YJ (2013) Nonnegative matrix factorization: a comprehensive review. IEEE Trans Knowl Data Eng 25(6):1336–1353
- Weninger F, Schuller B (2012) Optimization and parallelization of monaural source separation algorithms in the openBliSSART toolkit. J Signal Process Syst 69(3):267–277
- Xue Y, Tong CS, Chen Y, Chen WS (2008) Clustering-based initialization for non-negative matrix factorization. Appl Math Comput 205(2):525–536
- Yoo J, Choi S (2008) Orthogonal nonnegative matrix factorization: multiplicative updates on Stiefel manifolds. In: Intelligent data engineering and automated learning–IDEAL 2008, Springer, pp 140–147
- Yoo J, Choi S (2009) Probabilistic matrix tri-factorization. In: Proceedings of the IEEE international conference on acoustics, speech, and signal processing (ICASSP), IEEE, pp 1553–1556
- Yoo J, Choi S (2010a) Nonnegative matrix factorization with orthogonality constraints. J Comput Sci Eng 4(2):97–109
- Yoo J, Choi S (2010b) Orthogonal nonnegative matrix tri-factorization for co-clustering: multiplicative updates on Stiefel manifolds. Inf Process Manag 46(5):559–570
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.