Abstract
It is well recognized that mining association rules in a very large database is usually time consuming because of the I/O overhead of scanning the disk-resident database. Sampling has been actively investigated in recent years as one technique for reducing this overhead. Each sampling method and algorithm proposed in the literature has its own merits and demerits in terms of effectiveness and efficiency, and none can claim to be the best. Which sampling method to use, and how large the sample should be for a given database, are key issues in sampling for particular data mining tasks. In this paper, a stratified sampling method based on transaction size is proposed, tested, and compared with simple random sampling for mining association rules. The work opens up the question of how to stratify datasets so that the sampling better suits the problem of association rule mining.
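The contrast between the two sampling schemes named in the abstract can be illustrated with a minimal sketch. The abstract does not give the stratification details, so everything here is an assumption made for illustration: one stratum per distinct transaction length, proportional allocation with at least one transaction per stratum, and the hypothetical function names `simple_random_sample`, `stratified_sample_by_size`, and `support`. It should not be read as the authors' algorithm.

```python
import random
from collections import defaultdict


def simple_random_sample(transactions, n, seed=0):
    """Draw n transactions uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(transactions, n)


def stratified_sample_by_size(transactions, fraction, seed=0):
    """Group transactions by length, then sample each group proportionally.

    Assumption (not from the paper): one stratum per distinct transaction
    length, and at least one transaction is drawn from every stratum.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for t in transactions:
        strata[len(t)].append(t)

    sample = []
    for size in sorted(strata):
        group = strata[size]
        # Proportional allocation, clamped to the range [1, len(group)].
        k = min(len(group), max(1, round(fraction * len(group))))
        sample.extend(rng.sample(group, k))
    return sample


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(items <= set(t) for t in transactions) / len(transactions)


if __name__ == "__main__":
    # Toy database: 10 transactions of length 1, 20 of length 2, 30 of length 3.
    db = [("a",)] * 10 + [("a", "b")] * 20 + [("a", "b", "c")] * 30
    strat = stratified_sample_by_size(db, 0.1)
    srs = simple_random_sample(db, len(strat))
    print("full db    support(a,b):", support({"a", "b"}, db))
    print("stratified support(a,b):", support({"a", "b"}, strat))
    print("random     support(a,b):", support({"a", "b"}, srs))
```

Under these assumptions the stratified sample preserves the distribution of transaction lengths in the database, whereas a simple random sample of the same size only matches it in expectation. Comparing itemset support in each sample against the full database is one simple way to judge sample quality.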
Copyright information
© 2005 International Federation for Information Processing
About this paper
Cite this paper
Li, Y., Gopalan, R.P. (2005). Stratified Sampling for Association Rules Mining. In: Li, D., Wang, B. (eds) Artificial Intelligence Applications and Innovations. AIAI 2005. IFIP — The International Federation for Information Processing, vol 187. Springer, Boston, MA. https://doi.org/10.1007/0-387-29295-0_9
DOI: https://doi.org/10.1007/0-387-29295-0_9
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-28318-0
Online ISBN: 978-0-387-29295-3
eBook Packages: Computer Science, Computer Science (R0)