
Data Mining and Knowledge Discovery, Volume 23, Issue 3, pp 407–446

Maximum entropy models and subjective interestingness: an application to tiles in binary databases

  • Tijl De Bie
Article

Abstract

Recent research has highlighted the practical benefits of subjective interestingness measures, which quantify the novelty or unexpectedness of a pattern when contrasted with any prior information of the data miner (Silberschatz and Tuzhilin, Proceedings of the 1st ACM SIGKDD international conference on Knowledge discovery and data mining (KDD95), 1995; Geng and Hamilton, ACM Comput Surv 38(3):9, 2006). A key challenge here is the formalization of this prior information in a way that lends itself to the definition of a subjective interestingness measure that is both meaningful and practical. In this paper, we outline a general strategy for how this could be achieved, before working out the details for a use case that is important in its own right. Our general strategy is based on considering prior information as constraints on a probabilistic model representing the uncertainty about the data. More specifically, we represent the prior information by the maximum entropy (MaxEnt) distribution subject to these constraints. We briefly outline various measures that could subsequently be used to contrast patterns with this MaxEnt model, thus quantifying their subjective interestingness. We demonstrate this strategy for rectangular databases with knowledge of the row and column sums. This situation has been considered before using computation-intensive approaches based on swap randomizations, allowing for the computation of empirical p-values as interestingness measures (Gionis et al., ACM Trans Knowl Discov Data 1(3):14, 2007). We show how the MaxEnt model can be computed remarkably efficiently in this situation, and how it can be used for the same purpose as swap randomizations at a much lower computational cost. More importantly, being an explicitly represented distribution, the MaxEnt model can additionally be used to define analytically computable interestingness measures, as we demonstrate for tiles (Geerts et al., Proceedings of the 7th international conference on Discovery science (DS04), 2004) in binary databases.
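For intuition, the MaxEnt distribution over binary databases with given expected row and column sums factorizes into independent Bernoulli entries with P(x_ij = 1) = 1/(1 + exp(-(λ_i + μ_j))), a Rasch-type model, and the Lagrange multipliers λ, μ solve a convex dual problem. The sketch below, which uses plain gradient descent on that dual, is an illustrative reconstruction under these assumptions and not the paper's actual implementation; the function name and step size are arbitrary choices.

```python
import math

def maxent_binary(row_sums, col_sums, iters=5000, lr=0.2):
    """Fit the MaxEnt model for a binary matrix whose expected row and
    column sums are constrained to the given values.  Each entry is an
    independent Bernoulli with P(x_ij = 1) = sigmoid(lam[i] + mu[j]);
    the dual objective in (lam, mu) is convex, so gradient descent on
    the multipliers converges to the unique MaxEnt distribution.
    Returns the matrix of success probabilities p[i][j]."""
    m, n = len(row_sums), len(col_sums)
    lam = [0.0] * m
    mu = [0.0] * n
    for _ in range(iters):
        p = [[1.0 / (1.0 + math.exp(-(lam[i] + mu[j]))) for j in range(n)]
             for i in range(m)]
        # dual gradient = expected margins minus the target margins
        for i in range(m):
            lam[i] -= lr * (sum(p[i]) - row_sums[i])
        for j in range(n):
            mu[j] -= lr * (sum(p[i][j] for i in range(m)) - col_sums[j])
    return p

# A 3x4 binary database with row sums (3, 2, 1) and column sums (2, 2, 1, 1):
p = maxent_binary([3, 2, 1], [2, 2, 1, 1])
```

Because the model is an explicit product of Bernoulli distributions, quantities such as the probability (and hence a self-information-style surprisal) of any tile can be evaluated analytically from the fitted p[i][j], with no need for swap-randomized samples.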

Keywords

Maximum entropy principle · Subjective interestingness measures · Prior information · Rectangular databases · Swap randomizations


References

  1. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th international conference on Very large databases (VLDB94), pp 487–499
  2. Asuncion A, Newman D (2007) UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html
  3. Barabási AL, Albert R (1999) Emergence of scaling in random networks. Science 286(5439): 509–512
  4. Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge
  5. Brijs T, Swinnen G, Vanhoof K, Wets G (1999) Using association rules for product assortment decisions: a case study. In: Proceedings of the 5th ACM SIGKDD international conference on Knowledge discovery in databases (KDD99), pp 254–260
  6. Calders T (2008) Itemset frequency satisfiability: complexity and axiomatization. Theor Comput Sci 394(1-2): 84–111
  7. Chung F, Lu L (2004) The average distance in a random graph with given expected degrees. Int Math 1(1): 91–113
  8. Cover TM, Thomas JA (1991) Elements of information theory. Wiley-Interscience, Hoboken
  9. De Bie T (2009a) Explicit probabilistic models for databases and networks. Tech. Rep. 123931, arXiv:0906.5148v1, University of Bristol
  10. De Bie T (2009b) Finding interesting itemsets using a probabilistic model for binary databases. Tech. Rep. 123930, University of Bristol
  11. De Raedt L, Zimmermann A (2007) Constraint-based pattern set mining. In: Proceedings of the 2007 SIAM international conference on Data mining (SDM07), pp 237–248
  12. Gallo A, De Bie T, Cristianini N (2007) MINI: Mining informative non-redundant itemsets. In: Proceedings of the 11th European conference on Principles and practice of knowledge discovery in databases (PKDD07), pp 438–445
  13. Gallo A, Mammone A, De Bie T, Turchi M, Cristianini N (2009) From frequent itemsets to informative patterns. Tech. Rep. 123936, University of Bristol
  14. Geerts F, Goethals B, Mielikäinen T (2004) Tiling databases. In: Proceedings of the 7th international conference on Discovery science (DS04), pp 278–289
  15. Geng L, Hamilton HJ (2006) Interestingness measures for data mining: a survey. ACM Comput Surv 38(3): 9
  16. Gentle JE (2005) Elements of computational statistics. Springer, New York
  17. Gionis A, Mannila H, Seppänen JK (2004) Geometric and combinatorial tiles in 0-1 data. In: Principles and practice of knowledge discovery in databases (PKDD04), pp 173–184
  18. Gionis A, Mannila H, Mielikäinen T, Tsaparas P (2007) Assessing data mining results via swap randomization. ACM Trans Knowl Discov Data 1(3): 14
  19. Gull S, Skilling J (1984) Maximum entropy method in image processing. Communications, radar and signal processing. IEE Proc F 131(6): 646–659
  20. Hanhijärvi S, Ojala M, Vuokko N, Puolamäki K, Tatti N, Mannila H (2009) Tell me something I don't know: randomization strategies for iterative data mining. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD09), pp 379–388
  21. Jaroszewicz S, Simovici DA (2004) Interestingness of frequent itemsets using Bayesian networks as background knowledge. In: Proceedings of the 10th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD04), pp 178–186
  22. Jaynes E (1957) Information theory and statistical mechanics I. Phys Rev 106(4): 620–630
  23. Jaynes E (1982) On the rationale of maximum-entropy methods. Proc IEEE 70(9): 939–952
  24. Khuller S, Moss A, Naor J (1999) The budgeted maximum coverage problem. Inf Process Lett 70(1): 39–45
  25. Kontonasios KN, De Bie T (2010) An information-theoretic approach to finding informative noisy tiles in binary databases. In: Proceedings of the 2010 SIAM international conference on Data mining (SDM10), pp 153–164
  26. Lehmann E, Romano J (2005) Testing statistical hypotheses, 3rd edn. Springer, New York
  27. Mannila H (2008) Randomization techniques for data mining methods. In: Proceedings of the 12th East European conference on Advances in databases and information systems (ADBIS08), p 1
  28. Miettinen P, Mielikäinen T, Gionis A, Das G, Mannila H (2008) The discrete basis problem. IEEE Trans Knowl Data Eng 20(10): 1348–1362
  29. Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U (2002) Network motifs: simple building blocks of complex networks. Science 298(5594): 824–827
  30. Minoux M (1978) Accelerated greedy algorithms for maximizing submodular set functions. Optimization Techniques. Springer, Berlin, pp 234–243
  31. Newman M (2003) The structure and function of complex networks. SIAM Rev 45(2): 167–256
  32. Ojala M, Vuokko N, Kallio A, Haiminen N, Mannila H (2008) Randomization of real-valued matrices for assessing the significance of data mining results. In: Proceedings of the 2008 SIAM international conference on Data mining (SDM08), pp 494–505
  33. Padmanabhan B, Tuzhilin A (1998) A belief-driven method for discovering unexpected patterns. In: Proceedings of the 4th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD98), pp 94–100
  34. Padmanabhan B, Tuzhilin A (2000) Small is beautiful: discovering the minimal set of unexpected patterns. In: Proceedings of the 6th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD00), pp 54–63
  35. Pavlov D, Mannila H, Smyth P (2003) Beyond independence: probabilistic models for query approximation on binary transaction data. IEEE Trans Knowl Data Eng 15: 1409–1421
  36. Rasch G (1961) On general laws and the meaning of measurement in psychology. In: Proceedings of the fourth Berkeley symposium on Mathematical statistics and probability, vol IV, pp 321–333
  37. Robins G, Pattison P, Kalish Y, Lusher D (2007) An introduction to exponential random graph (p*) models for social networks. Soc Netw 29(2): 173–191
  38. Savinov A (2004) Mining dependence rules by finding largest itemset support quota. In: Proceedings of the 2004 ACM symposium on Applied computing, pp 525–529
  39. Shewchuk J (1994) An introduction to the conjugate gradient method without the agonizing pain. Tech. rep., CMU
  40. Siebes A, Vreeken J, van Leeuwen M (2006) Item sets that compress. In: Proceedings of the 2006 SIAM international conference on Data mining (SDM06), pp 393–404
  41. Silberschatz A, Tuzhilin A (1995) On subjective measures of interestingness in knowledge discovery. In: Proceedings of the 1st ACM SIGKDD international conference on Knowledge discovery and data mining (KDD95), pp 275–281
  42. Tatti N (2008) Maximum entropy based significance of itemsets. Knowl Inf Syst 17(1): 57–77
  43. Topsøe F (1979) Information-theoretical optimization techniques. Kybernetika 15(1): 8–27
  44. Tribus M (1961) Thermostatics and thermodynamics: an introduction to energy, information and states of matter, with engineering applications. Van Nostrand, Princeton
  45. Wainwright M, Jordan MI (2008) Graphical models, exponential families, and variational inference. Found Trends Mach Learn 1(1-2): 1–305
  46. Zaki M, Hsiao C (2002) CHARM: an efficient algorithm for closed itemset mining. In: Proceedings of the 2002 SIAM international conference on Data mining (SDM02), pp 457–473

Copyright information

© The Author(s) 2010

Authors and Affiliations

  1. Intelligent Systems Laboratory, University of Bristol, Bristol, UK
