
Data Mining and Knowledge Discovery, Volume 26, Issue 1, pp 130–173

Summarizing categorical data by clustering attributes

  • Michael Mampaey
  • Jilles Vreeken
Article

Abstract

For a book, its title and abstract provide a good first impression of what to expect from it. For a database, obtaining a good first impression is typically not so straightforward. While low-order statistics only provide very limited insight, downright mining the data rapidly provides too much detail for such a quick glance. In this paper we propose a middle ground, and introduce a parameter-free method for constructing high-quality descriptive summaries of binary and categorical data. Our approach builds a summary by clustering attributes that strongly correlate, and uses the Minimum Description Length principle to identify the best clustering—without requiring a distance measure between attributes. Besides providing a practical overview of which attributes interact most strongly, these summaries can also be used as surrogates for the data, and can easily be queried. Extensive experimentation shows that our method discovers high-quality results: correlated attributes are correctly grouped, which is verified both objectively and subjectively. Our models can also be employed as surrogates for the data; as an example of this we show that we can quickly and accurately query the estimated supports of frequent generalized itemsets.
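The idea of grouping attributes by description length can be illustrated with a minimal sketch. This is not the encoding or search procedure from the paper: the BIC-like model cost, the greedy bottom-up merging, and the names `cluster_cost` and `greedy_attribute_clustering` are all assumptions made for illustration. The sketch only shows the general principle that two attribute groups are merged when encoding them jointly takes fewer bits than encoding them separately, so no distance measure between attributes is needed.

```python
from collections import Counter
from itertools import combinations
from math import log2


def cluster_cost(rows, attrs):
    """Bits needed to encode the columns in `attrs` (illustrative, assumed encoding)."""
    n = len(rows)
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    # Data cost: n times the empirical entropy of the joint value combinations.
    data_bits = -sum(c * log2(c / n) for c in counts.values())
    # Model cost (assumption, BIC-like): 0.5 * log2(n) bits per free parameter
    # of a joint distribution over the full product domain of the attributes.
    domain = 1
    for a in attrs:
        domain *= len({row[a] for row in rows})
    model_bits = 0.5 * log2(n) * (domain - 1)
    return data_bits + model_bits


def greedy_attribute_clustering(rows):
    """Merge attribute groups bottom-up while the total encoded size decreases."""
    clusters = [(a,) for a in range(len(rows[0]))]
    costs = {c: cluster_cost(rows, c) for c in clusters}
    while len(clusters) > 1:
        best = None
        for c1, c2 in combinations(clusters, 2):
            merged = tuple(sorted(c1 + c2))
            merged_cost = cluster_cost(rows, merged)
            gain = costs[c1] + costs[c2] - merged_cost
            if best is None or gain > best[0]:
                best = (gain, c1, c2, merged, merged_cost)
        gain, c1, c2, merged, merged_cost = best
        if gain <= 0:
            break  # no merge shortens the description any further
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
        costs[merged] = merged_cost
    return sorted(clusters)


# Toy data (hypothetical): attributes 0 and 1 are identical, attribute 2 is independent.
rows = [(x, x, z) for x in (0, 1) for z in (0, 1)] * 25
print(greedy_attribute_clustering(rows))  # [(0, 1), (2,)]
```

In this toy data the two identical attributes end up in one group because their joint distribution costs about as many bits as either one alone, while adding the independent attribute would only increase the model cost.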

Keywords

Attribute clustering, MDL, Summarization, Categorical data

Copyright information

© The Author(s) 2011

Authors and Affiliations

  1. Advanced Database Research and Modelling, Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
