Significance and Recovery of Block Structures in Binary Matrices with Noise

Sun, Xing; Nobel, Andrew

doi:10.1007/11776420_11


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4005)

Included in the following conference series:

International Conference on Computational Learning Theory


Abstract

Frequent itemset mining (FIM) is one of the core problems in the field of Data Mining and occupies a central place in its literature. One equivalent form of FIM can be stated as follows: given a rectangular data matrix with binary entries, find every submatrix of 1s having a minimum number of columns. This paper presents a theoretical analysis of several statistical questions related to this problem when noise is present. We begin by establishing several results concerning the extremal behavior of submatrices of ones in a binary matrix with random entries. These results provide simple significance bounds for the output of FIM algorithms. We then consider the noise sensitivity of FIM algorithms under a simple binary additive noise model, and show that, even at small noise levels, large blocks of 1s leave behind fragments of only logarithmic size. Thus such blocks cannot be directly recovered by FIM algorithms, which search for submatrices of all 1s. On the positive side, we show how, in the presence of noise, an error-tolerant criterion can recover a square submatrix of 1s against a background of 0s, even when the size of the target submatrix is very small.
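The setting described in the abstract can be illustrated with a small simulation. The Python sketch below is not taken from the paper: all function names are hypothetical, the noise model is assumed to flip each entry independently with a fixed probability, and the "error-tolerant criterion" is assumed here to be a simple threshold on the fraction of 1s, which may differ from the criterion the authors analyze. The last helper is a standard first-moment calculation (the expected number of k x k all-ones submatrices in an n x n matrix of i.i.d. Bernoulli(p) entries), offered only as one elementary way to judge whether a block of a given size is surprising, not as the paper's bound.

```python
import math
import random


def plant_block(n_rows, n_cols, block_rows, block_cols):
    """Return a 0/1 matrix (list of lists) with an all-ones block in the top-left corner."""
    return [[1 if (i < block_rows and j < block_cols) else 0
             for j in range(n_cols)]
            for i in range(n_rows)]


def add_noise(matrix, flip_prob, rng):
    """Assumed noise model: XOR every entry with an independent Bernoulli(flip_prob) bit."""
    return [[x ^ (1 if rng.random() < flip_prob else 0) for x in row]
            for row in matrix]


def is_all_ones(matrix, rows, cols):
    """Exact criterion: every entry of the submatrix rows x cols equals 1."""
    return all(matrix[i][j] == 1 for i in rows for j in cols)


def fraction_of_ones(matrix, rows, cols):
    """Fraction of entries equal to 1 in the submatrix rows x cols."""
    total = len(rows) * len(cols)
    return sum(matrix[i][j] for i in rows for j in cols) / total


def is_dense(matrix, rows, cols, tau):
    """Assumed error-tolerant criterion: at most a fraction tau of the entries are 0."""
    return fraction_of_ones(matrix, rows, cols) >= 1.0 - tau


def log_expected_count(n, k, p):
    """Natural log of C(n, k)^2 * p^(k*k).

    This is the expected number of k x k all-ones submatrices in an n x n
    matrix with i.i.d. Bernoulli(p) entries; by Markov's inequality it also
    upper-bounds the log-probability that at least one such submatrix exists.
    """
    log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return 2.0 * log_binom + k * k * math.log(p)


if __name__ == "__main__":
    rng = random.Random(0)
    clean = plant_block(200, 200, 40, 40)
    noisy = add_noise(clean, 0.1, rng)
    block = range(40)
    print("block still all ones:", is_all_ones(noisy, block, block))
    print("fraction of ones in block:", fraction_of_ones(noisy, block, block))
    print("block passes dense test (tau=0.2):", is_dense(noisy, block, block, 0.2))
    print("log expected count, n=200, k=40, p=0.1:", log_expected_count(200, 40, 0.1))
```

Under these assumptions, the exact all-ones test on the planted block is very likely to fail once any noise is added, while the density test still accepts it; this is the contrast the abstract draws between FIM algorithms and an error-tolerant criterion.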



Author information

Authors and Affiliations

Department of Statistics and Operations Research,
Xing Sun & Andrew Nobel
Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
Andrew Nobel


Editor information

Editors and Affiliations

ICREA and Department of Economics, Universitat Pompeu Fabra, Ramon Trias Fargas 25-27, 08005, Barcelona, Spain
Gábor Lugosi
Ruhr-Universität Bochum, Germany
Hans Ulrich Simon


Cite this paper

Sun, X., Nobel, A. (2006). Significance and Recovery of Block Structures in Binary Matrices with Noise. In: Lugosi, G., Simon, H.U. (eds) Learning Theory. COLT 2006. Lecture Notes in Computer Science, vol 4005. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11776420_11


DOI: https://doi.org/10.1007/11776420_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-35294-5
Online ISBN: 978-3-540-35296-9
eBook Packages: Computer Science, Computer Science (R0)
