Abstract
Mining and exploring databases should provide users with knowledge and new insights. Tiles of data strive to unveil the true underlying structure and to distinguish valuable information from various kinds of noise. We propose a novel Boolean matrix factorization algorithm to solve the tiling problem, based on recent results from optimization theory. In contrast to existing work, the new algorithm minimizes the description length of the resulting factorization. This approach is well known for model selection and data compression, but not for finding suitable factorizations via numerical optimization. We demonstrate the superior robustness of the new approach in the presence of several kinds of noise and types of underlying structure. Moreover, our general framework can work with any cost measure that has a suitable real-valued relaxation; in particular, no convexity assumptions have to be met. The experimental results on synthetic data and image data show that the new method identifies interpretable patterns which almost always explain the data better than the competing algorithms.
Notes
\({{\mathrm{dom}}}(\phi )\) is the domain of \(\phi \)
References
Bauckhage C (2015) k-means clustering is matrix factorization. arXiv preprint arXiv:1512.07548
Bolte J, Sabach S, Teboulle M (2014) Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math Program 146(1–2):459–494
Cover T, Thomas J (2006) Elements of information theory. Wiley-Interscience, Hoboken
De Bie T (2011) Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Discov 23(3):407–446
Ding C, Li T, Peng W, Park H (2006) Orthogonal nonnegative matrix t-factorizations for clustering. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 126–135
Ding CH, He X, Simon HD (2005) On the equivalence of nonnegative matrix factorization and spectral clustering. In: Proceedings of the SIAM international conference on data mining (SDM), pp 606–610
Geerts F, Goethals B, Mielikäinen T (2004) Tiling databases. In: International conference on discovery science (DS), pp 278–289
Grünwald P (2007) The minimum description length principle. MIT Press, Cambridge
Hess S, Piatkowski N, Morik K (2014) Shrimp: descriptive patterns in a tree. In: Proceedings of the LWA workshops: KDML, IR and FGWM, pp 181–192
Jarrett K, Kavukcuoglu K, Ranzato M, LeCun Y (2009) What is the best multi-stage architecture for object recognition? In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 2146–2153
Karaev S, Miettinen P, Vreeken J (2015) Getting to know the unknown unknowns: destructive-noise resistant Boolean matrix factorization. In: Proceedings of the SIAM international conference on data mining (SDM), pp 325–333
Kontonasios KN, De Bie T (2010) An information-theoretic approach to finding informative noisy tiles in binary databases. In: Proceedings of the SIAM international conference on data mining (SDM), pp 153–164
Kuhn HW (1955) The Hungarian method for the assignment problem. Naval Res Logist Q 2(1–2):83–97
Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401(6755):788–791
Lee DD, Seung HS (2001) Algorithms for non-negative matrix factorization. In: Advances in neural information processing systems (NIPS), pp 556–562
Li M, Vitányi P (1997) An introduction to Kolmogorov complexity and its applications. Springer, Berlin
Li T (2005) A general model for clustering binary data. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 188–197
Li T, Ding C (2006) The relationships among various nonnegative matrix factorization methods for clustering. In: International conference on data mining (ICDM), pp 362–371
Lucchese C, Orlando S, Perego R (2010) Mining top-k patterns from binary datasets in presence of noise. In: Proceedings of the SIAM international conference on data mining (SDM), pp 165–176
Lucchese C, Orlando S, Perego R (2014) A unifying framework for mining approximate top-k binary patterns. Trans Knowl Data Eng 26(12):2900–2913
Miettinen P (2015) Generalized matrix factorizations as a unifying framework for pattern set mining: complexity beyond blocks. In: European conference on machine learning and principles and practice of knowledge discovery in databases (ECMLPKDD), pp 36–52
Miettinen P, Vreeken J (2014) MDL4BMF: minimum description length for Boolean matrix factorization. Trans Knowl Discov Data 8(4):18:1–18:31
Miettinen P, Mielikäinen T, Gionis A, Das G, Mannila H (2008) The discrete basis problem. Trans Knowl Data Eng 20(10):1348–1362
Paatero P, Tapper U (1994) Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5(2):111–126
Parikh N, Boyd S (2014) Proximal algorithms. Found Trends Optim 1(3):127–239
Rissanen J (1978) Modeling by shortest data description. Automatica 14:465–471
Siebes A, Kersten R (2011) A structure function for transaction data. In: Proceedings of the SIAM international conference on data mining (SDM), pp 558–569
Siebes A, Vreeken J, van Leeuwen M (2006) Item sets that compress. In: Proceedings of the SIAM international conference on data mining (SDM), pp 393–404
Smets K, Vreeken J (2012) Slim: directly mining descriptive patterns. In: Proceedings of the SIAM international conference on data mining (SDM), pp 236–247
Tatti N, Vreeken J (2012) Comparing apples and oranges: measuring differences between exploratory data mining results. Data Min Knowl Discov 25(2):173–207
van Leeuwen M, Siebes A (2008) StreamKrimp: detecting change in data streams. In: European conference on machine learning and principles and practice of knowledge discovery in databases (ECMLPKDD), pp 672–687
Vreeken J, Van Leeuwen M, Siebes A (2011) Krimp: mining itemsets that compress. Data Min Knowl Discov 23(1):169–214
Wang YX, Zhang YJ (2013) Nonnegative matrix factorization: a comprehensive review. Trans Knowl Data Eng 25(6):1336–1353
Xiang Y, Jin R, Fuhry D, Dragan FF (2011) Summarizing transactional databases with overlapped hyperrectangles. Data Min Knowl Discov 23(2):215–251
Zhang Z, Ding C, Li T, Zhang X (2007) Binary matrix factorization with applications. In: International conference on data mining (ICDM), pp 391–400
Zimek A, Vreeken J (2013) The blind men and the elephant: on meeting the problem of multiple truths in data from clustering and pattern mining perspectives. Mach Learn 98(1–2):121–155
Acknowledgements
Part of the work on this paper has been supported by Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876 “Providing Information by Resource-Constrained Analysis”, projects A1 and C1 (http://sfb876.tu-dortmund.de). Furthermore, we thank Jilles Vreeken and Sanjar Karaev for their support in the execution of experiments and their useful remarks.
Responsible editor: Johannes Fürnkranz.
Appendices
Appendix 1: Derivation of the proximal operator
Theorem 1 Let \(\alpha >0\) and \(\phi (X)=\sum _{i,j}\varLambda (X_{ij})\) for \(X\in {\mathbb {R}}^{m\times n}\). The proximal operator of \(\alpha \phi \) maps the matrix X to the matrix \({{\mathrm{prox}}}_{\alpha \phi }(X)=A\in [0,1]^{m\times n}\) defined by \(A_{ji}={{\mathrm{prox}}}_{\alpha \varLambda }(X_{ji})\), where for \(x\in {\mathbb {R}}\) it holds that
$$\begin{aligned} {{\mathrm{prox}}}_{\alpha \varLambda }(x)= {\left\{ \begin{array}{ll} \max \{0,\,x-2\alpha \} &{} \text {if } x\le 0.5,\\ \min \{1,\,x+2\alpha \} &{} \text {if } x>0.5. \end{array}\right. } \end{aligned}$$
Proof
Let \(\alpha >0\), \(X\in {\mathbb {R}}^{m\times n}\) for some \(m,n\in \mathbb {N}\), and \(A={{\mathrm{prox}}}_{\alpha \phi }(X)\). The function \(\phi \) is fully separable across all matrix entries. In this case, the proximal operator can be applied entry-wise to the composing scalar functions (Parikh and Boyd 2014), i.e., \(A_{ji}={{\mathrm{prox}}}_{\alpha \varLambda }(X_{ji})\). It remains to derive the proximal mapping of \(\varLambda \) as stated in the theorem.
The proximal operator reduces to Euclidean projection if the argument lies outside of the function’s domain (Parikh and Boyd 2014) and it follows that
For \(x\in [0,1]\) holds \(\varLambda (x)=-|1-2x|+1\) and
where g is obtained from the objective by multiplying and adding constants, so that the minimum can easily be found by completing the square.
The function g is a continuous piecewise quadratic function which attains its global minimum at the minimum of one of the two quadratic functions, i.e.,
A function value comparison in the intersecting domain \(x\in (0.5-2\alpha ,0.5+2\alpha ]\) yields that
\(\square \)
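The case analysis above can be checked numerically. The following sketch (our own illustration, not the authors' code; all names are invented) compares a closed-form proximal mapping, reconstructed from the case analysis in the proof for the tent regularizer \(\varLambda (x)=-|1-2x|+1\) on \([0,1]\), against a brute-force minimization on a fine grid:

```python
# Numerical check of the proximal operator of alpha * Lambda, where
# Lambda(x) = 1 - |1 - 2x| on [0, 1] and +infinity outside.

def prox_closed_form(x, alpha):
    """Closed-form prox as reconstructed from the case analysis."""
    if x <= 0.5:
        return max(0.0, x - 2 * alpha)
    return min(1.0, x + 2 * alpha)

def prox_brute_force(x, alpha, steps=20000):
    """argmin over u in [0,1] of 0.5*(u - x)^2 + alpha*(1 - |1 - 2u|), on a grid."""
    best_u, best_f = 0.0, float("inf")
    for k in range(steps + 1):
        u = k / steps
        f = 0.5 * (u - x) ** 2 + alpha * (1 - abs(1 - 2 * u))
        if f < best_f:
            best_u, best_f = u, f
    return best_u

# The two agree up to the grid resolution, including outside [0, 1],
# where the prox reduces to Euclidean projection.
for alpha in (0.05, 0.1, 0.3):
    for x in (-0.3, 0.0, 0.2, 0.4, 0.49, 0.55, 0.8, 1.2):
        assert abs(prox_closed_form(x, alpha) - prox_brute_force(x, alpha)) < 1e-3
```

Note how the test points deliberately include arguments outside the domain \([0,1]\), where the operator acts as a projection, and points in the intersecting domain around 0.5, where the function value comparison decides the branch.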
Appendix 2: Krimp’s encoding as matrix factorization
Lemma 1 Let D be a data matrix. For any code table CT and its cover function there exists a Boolean matrix factorization \(D=\theta (YX^T)+N\) such that non-singleton patterns in CT are mirrored in X and the cover function is reflected by Y. The description lengths correspond to each other, such that
where the functions returning the model and the data description size are given as
The probabilities \(p_s\) and \(p_{r+i}\) indicate the relative usage of non-singleton patterns \(X_{\cdot s}\) and singletons \(\{i\}\),
We denote with \(c\in {\mathbb {R}}_+^n\) the vector of standard code lengths for each item, i.e.,
Proof
Let D be a data matrix, \(CT=\{( X_\sigma ,C_\sigma )|1\le \sigma \le \tau \}\) a \(\tau \)-element code table and cover the cover function. Let r be the number of non-singleton patterns in CT and assume w.l.o.g. that CT is indexed such that these non-singleton patterns have an index \(1\le \sigma \le r\). We construct the pattern matrix \(X\in \{0,1\}^{n\times r}\) and usage matrix \(Y\in \{0,1\}^{m\times r}\) such that for \(1\le \sigma \le r\) it holds that
The Boolean product \(\theta (YX^T)\) indicates the entries of D which are covered by non-singleton patterns of CT. This implies that the ones in the noise matrix \(N=D-\theta (YX^T)\) are covered by singletons; it holds that
The usage of a non-singleton pattern \(X_\sigma \) is then computed as
and correspondingly it follows that \(usage(\{i\})=|N_{\cdot i}|\). The calculation of the probabilities \(p_\sigma \) for \(1\le \sigma \le r+n\) is directly obtained by inserting this usage calculation into the definition of the code-usage probabilities of Eq. (2). The functions \(f_{\mathsf {CT}}^M\) and \(f_{\mathsf {CT}}^D\) likewise follow from the definition of the description sizes \(L_{\mathsf {CT}}^M\) and \(L_{\mathsf {CT}}^D\). \(\square \)
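The construction in the proof can be illustrated on a toy dataset. The sketch below (our own example; the matrices are invented for illustration) builds X and Y for a single non-singleton pattern, derives the noise matrix \(N=D-\theta (YX^T)\), and recovers the usages of the pattern and of the singletons exactly as stated above:

```python
# Toy illustration of Krimp's encoding as a Boolean matrix factorization:
# D = theta(Y X^T) + N, where the ones of N are covered by singletons.

# Data matrix D: 3 transactions over 4 items.
D = [[1, 1, 1, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1]]

X = [[1], [1], [0], [0]]   # one non-singleton pattern {0, 1} as column X_.1
Y = [[1], [1], [0]]        # transactions 0 and 1 use that pattern

m, n, r = len(D), len(D[0]), len(X[0])

# Boolean product theta(Y X^T): entry (j, i) is 1 iff some pattern s
# with Y[j][s] = 1 contains item i.
P = [[int(any(Y[j][s] and X[i][s] for s in range(r))) for i in range(n)]
     for j in range(m)]

# Noise matrix N = D - theta(Y X^T); all remaining ones go to singletons.
N = [[D[j][i] - P[j][i] for i in range(n)] for j in range(m)]
assert all(N[j][i] in (0, 1) for j in range(m) for i in range(n))

# usage(X_.s) = |Y_.s| and usage({i}) = |N_.i|, as computed in the proof.
pattern_usage = [sum(Y[j][s] for j in range(m)) for s in range(r)]
singleton_usage = [sum(N[j][i] for j in range(m)) for i in range(n)]
print(pattern_usage, singleton_usage)
```

Here the pattern is used twice and the singletons for items 2 and 3 cover the ones that the pattern leaves unexplained; normalizing these usages by their total yields the code-usage probabilities \(p_\sigma \).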
Appendix 3: Bounding the description length of code tables
Lemma 1
Let \((a_s)\) be a finite sequence of r non-negative scalars such that \(S_r=\sum _{s=1}^ra_s>0\). The function \(g:[0,\infty )\rightarrow [0,\infty )\) defined by
is monotonically increasing in x.
Proof
W.l.o.g., let \(a_1,\ldots ,a_{r_0}>0\) and \(a_{r_0+1},\ldots ,a_r=0\) for some \(r_0\in \mathbb {N}\). We rewrite the function g as
and show that each of the subfunctions is monotonically increasing. The first subfunction is differentiable and its derivative is non-negative
The second subfunction is monotonically increasing, since for \(a_s=0\) and all \(x\ge 0\) it holds that
\(\square \)
Theorem 2 Given binary matrices X and Y and \(\mu = 1+\log (n)\), it holds that
Proof
We recall that the description size of the data is computed by
Applying the logarithmic properties, we rewrite the first sum
It follows from the monotonicity of g (Lemma 1) and the logarithm inequality (\(\log (1+x)\le x, \forall x\ge 0\)) that \(f_1\) is upper bounded by
The second term \(f_2\) can be transformed into
We now show that \(f_2(X,Y,D)\le |N|\log (n) +|Y|\). This inequality trivially holds if \(|N|=0\). Otherwise, we apply Jensen’s inequality to the concave logarithm function
and obtain
where the last equality again follows from the logarithm inequality. We derive the final inequality by
\(\square \)
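The two analytic facts this proof relies on, the logarithm inequality \(\log (1+x)\le x\) and Jensen's inequality for the concave logarithm, can be sanity-checked numerically. A quick sketch (our own check, not part of the proof):

```python
import math
import random

random.seed(0)

# Logarithm inequality: log(1 + x) <= x for all x >= 0.
for _ in range(1000):
    x = random.uniform(0.0, 100.0)
    assert math.log(1.0 + x) <= x + 1e-12

# Jensen's inequality for the concave logarithm:
# the mean of the logs is at most the log of the mean, for positive samples.
for _ in range(1000):
    a = [random.uniform(0.1, 10.0) for _ in range(5)]
    mean_of_logs = sum(math.log(v) for v in a) / len(a)
    assert mean_of_logs <= math.log(sum(a) / len(a)) + 1e-12
```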
Appendix 4: Calculating the Lipschitz moduli of \(\textsc {Primp}\)
We study the partial gradients of the regularization term used in Primp (Sect. 3.4)
The partial gradient with respect to X is constant and has a Lipschitz constant of zero. The partial gradient with respect to Y can be written as the sum
It follows from the triangle inequality that the gradient with respect to Y is Lipschitz continuous with modulus \(M_{\nabla _YG}(X)=M_A+M_B\), provided the functions A and B are Lipschitz continuous with moduli \(M_A\) and \(M_B\):
The one-dimensional function \(x\mapsto \log (x+\delta )\), \(x\in {\mathbb {R}}_+\), is Lipschitz continuous with modulus \(\delta ^{-1}\) for any \(\delta >0\). This follows from the mean value theorem and the bound
$$\begin{aligned} \frac{\mathrm {d}}{\mathrm {d}x}\log (x+\delta )=\frac{1}{x+\delta }\le \frac{1}{\delta } \end{aligned}$$
for all \(x\ge 0\). The following equations show that \(M_A=M_B=m\). For improved readability, we use the squared Lipschitz inequality, i.e.,
where Eq. (12) follows from the Lipschitz continuity of the logarithmic function as discussed above for \(\delta =1\) and Eq. (13) follows from the Cauchy-Schwarz inequality. Similar steps yield the Lipschitz modulus of B,
We conclude that the Lipschitz moduli of the gradients are given as
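The claimed modulus \(\delta ^{-1}\) for \(x\mapsto \log (x+\delta )\) can likewise be probed numerically. The sketch below (our own illustration) checks the Lipschitz inequality on random non-negative pairs and confirms that the bound is approached near zero, where the derivative \((x+\delta )^{-1}\) is largest:

```python
import math
import random

random.seed(1)

delta = 0.25
L = 1.0 / delta  # claimed Lipschitz modulus of x -> log(x + delta) on [0, inf)

# |log(x + delta) - log(y + delta)| <= L * |x - y| for all x, y >= 0.
for _ in range(10000):
    x, y = random.uniform(0.0, 50.0), random.uniform(0.0, 50.0)
    assert abs(math.log(x + delta) - math.log(y + delta)) <= L * abs(x - y) + 1e-12

# The modulus is tight: the difference quotient approaches L as x, y -> 0.
h = 1e-6
quotient = (math.log(h + delta) - math.log(delta)) / h
assert abs(quotient - L) < 1e-3
```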
Hess, S., Morik, K. & Piatkowski, N. The PRIMPING routine—Tiling through proximal alternating linearized minimization. Data Min Knowl Disc 31, 1090–1131 (2017). https://doi.org/10.1007/s10618-017-0508-z