Abstract
Data mining algorithms such as the Apriori method for finding frequent sets in sparse binary data can be used for efficient computation of a large number of summaries from huge data sets. The collection of frequent sets gives a collection of marginal frequencies about the underlying data set. Sometimes, we would like to use a collection of such marginal frequencies instead of the entire data set (e.g. when the original data is inaccessible for confidentiality reasons) to compute other interesting summaries. Using combinatorial arguments, we may obtain tight upper and lower bounds on the values of inferred summaries. In this paper, we consider a class of summaries wider than frequent sets, namely that of frequencies of arbitrary Boolean formulae. Given frequencies of a number of any different Boolean formulae, we consider the problem of finding tight bounds on the frequency of another arbitrary formula. We give a general formulation of the problem of bounding formula frequencies given some background information, and show how the bounds can be obtained by solving a linear programming problem. We illustrate the accuracy of the bounds by giving empirical results on real data sets.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A.I.: Fast discovery of association rules. In: Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (eds.) Advances in Knowledge Discovery and Data Mining, ch. 12, pp. 307–328. AAAI Press, Menlo Park (1996)
Boulicaut, J.-F., Bykowski, A., Rigotti, C.: Approximation of frequency queries by means of free-sets. In: Zighed, A.D.A., Komorowski, J., Żytkow, J.M. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp. 75–85. Springer, Heidelberg (2000)
Bykowski, A., Rigotti, C.: A condensed representation to find frequent patterns. In: Proc. of the 20th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 2001), Santa Barbara, CA, USA, May 2001, ACM, New York (2001)
Calders, T.: Deducing bounds on the frequency of itemsets. In: EDBT 2002 Workshop on Database Technologies for Data Mining (2002)
Calders, T., Goethals, B.: Mining all non-derivable frequent itemsets. In: Elomaa, T., Mannila, H., Toivonen, H. (eds.) PKDD 2002. LNCS (LNAI), vol. 2431, pp. 74–85. Springer, Heidelberg (2002)
Gouda, K., Zaki, M.J.: Efficiently mining maximal frequent itemsets. In: Proc. of the 2001 IEEE International Conference on Data Mining (ICDM 2001), San Jose, California, USA, pp. 163–170 (2001)
Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: Chen, W., Naughton, J., Bernstein, P.A. (eds.) 2000 ACM SIGMOD Intl. Conference on Management of Data, May 2000, pp. 1–12. ACM Press, New York (2000)
Karmarkar, N.: A new polynomial-time algorithm for linear programming. In: Proceedings of the sixteenth annual ACM symposium on Theory of computing, pp. 302–311 (1984)
Kreyszig, E.: Advanced Engineering Mathematics, 7th edn. John Wiley Inc., Chichester (1993)
Moore, A., Lee, M.S.: Cached sufficient statistics for efficient machine learning with large datasets. Journal of Artificial Intelligence Research 8, 67–91 (1998)
Matoušek, J., Sharir, M., Welzl, E.: A subexponential bound for linear programming. Algorithmica 16(4/5), 498–516 (1996)
Mannila, H., Toivonen, H.: Multiple uses of frequent sets and condensed representations: Extended abstract. In: Proc. of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD 1996), Portland, Oregon, USA, August 1996, pp. 189–194. AAAI Press, Menlo Park (1996)
Mannila, H., Toivonen, H., Verkamo, A.I.: Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery 1(3), 259–289 (1997)
Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Efficient mining of association rules using closed itemset lattices. Information Systems 24(1), 25–46 (1999)
Pavlov, D., Mannila, H., Smyth, P.: Probabilistic models for query approximation with large sparse binary datasets. In: Proc. of the 16th Conference in Uncertainty in Artificial Intelligence (UAI 2000), Stanford, California, USA (2000)
Toivonen, H.: Sampling large databases for association rules. In: Proc. of the 22th International Conference on Very Large Data Bases (VLDB 1996), Mumbai (Bombay), India, pp. 134–145 (1996)
Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery 1(2), 141–182 (1997)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Bykowski, A., Seppänen, J.K., Hollmén, J. (2004). Model-Independent Bounding of the Supports of Boolean Formulae in Binary Data. In: Meo, R., Lanzi, P.L., Klemettinen, M. (eds) Database Support for Data Mining Applications. Lecture Notes in Computer Science(), vol 2682. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-44497-8_12
Download citation
DOI: https://doi.org/10.1007/978-3-540-44497-8_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22479-2
Online ISBN: 978-3-540-44497-8
eBook Packages: Springer Book Archive