Significance and Recovery of Block Structures in Binary Matrices with Noise

  • Conference paper
Learning Theory (COLT 2006)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4005)

Abstract

Frequent itemset mining (FIM) is one of the core problems in the field of Data Mining and occupies a central place in its literature. One equivalent form of FIM can be stated as follows: given a rectangular data matrix with binary entries, find every submatrix of 1s having a minimum number of columns. This paper presents a theoretical analysis of several statistical questions related to this problem when noise is present. We begin by establishing several results concerning the extremal behavior of submatrices of ones in a binary matrix with random entries. These results provide simple significance bounds for the output of FIM algorithms. We then consider the noise sensitivity of FIM algorithms under a simple binary additive noise model, and show that, even at small noise levels, large blocks of 1s leave behind fragments of only logarithmic size. Thus such blocks cannot be directly recovered by FIM algorithms, which search for submatrices of all 1s. On the positive side, we show how, in the presence of noise, an error-tolerant criterion can recover a square submatrix of 1s against a background of 0s, even when the size of the target submatrix is very small.
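
The noise sensitivity described above can be illustrated with a small simulation. The following is a minimal sketch, not the authors' method: it assumes the "binary additive noise" is an independent flip of each entry with probability p (addition modulo 2), plants a k × k block of 1s in a background of 0s, and uses brute-force enumeration of column subsets to find the largest all-1s square submatrix. The function names `plant_block`, `add_noise`, and `largest_all_ones_square` are hypothetical, and the brute-force search is only practical for very small matrices.

```python
import itertools
import random


def plant_block(n_rows, n_cols, k):
    """Return an n_rows x n_cols 0/1 matrix with a k x k block of 1s in the top-left corner."""
    return [[1 if (i < k and j < k) else 0 for j in range(n_cols)]
            for i in range(n_rows)]


def add_noise(matrix, p, rng):
    """Flip each entry independently with probability p (assumed noise model: XOR with Bernoulli(p))."""
    return [[x ^ (1 if rng.random() < p else 0) for x in row] for row in matrix]


def largest_all_ones_square(matrix):
    """Brute force: largest s such that some set of s columns is all 1s on at least s rows."""
    n_cols = len(matrix[0])
    best = 0
    for s in range(1, n_cols + 1):
        found = False
        for cols in itertools.combinations(range(n_cols), s):
            # Support = number of rows that are 1 on every chosen column.
            support = sum(all(row[j] for j in cols) for row in matrix)
            if support >= s:
                found = True
                break
        if not found:
            # An s x s block of 1s contains an (s-1) x (s-1) one, so larger sizes cannot exist.
            break
        best = s
    return best


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    clean = plant_block(12, 12, 8)
    noisy = add_noise(clean, 0.2, rng)
    print("largest all-1s square, clean:", largest_all_ones_square(clean))
    print("largest all-1s square, noisy:", largest_all_ones_square(noisy))
```

Without noise, the search returns the planted block size. With noise, the planted block is typically broken into smaller all-1s fragments, which is the qualitative effect the abstract describes; the exact fragment size depends on the seed and on p, and this toy example says nothing about the logarithmic rate proved in the paper.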

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sun, X., Nobel, A. (2006). Significance and Recovery of Block Structures in Binary Matrices with Noise. In: Lugosi, G., Simon, H.U. (eds) Learning Theory. COLT 2006. Lecture Notes in Computer Science, vol 4005. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11776420_11

  • DOI: https://doi.org/10.1007/11776420_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-35294-5

  • Online ISBN: 978-3-540-35296-9

  • eBook Packages: Computer Science, Computer Science (R0)
