Abstract
Frequent itemset mining is a fundamental data mining technique. A number of MapReduce-based frequent itemset mining methods have recently been proposed to overcome the data-size and speed limits of sequential mining methods. However, the existing MapReduce-based methods still scale poorly because of high workload skewness, large intermediate data, and large network communication overhead. In this paper, we propose BIGMiner, a fast and scalable MapReduce-based frequent itemset mining method. BIGMiner generates equal-sized sub-databases called transaction chunks and performs support counting using only transaction chunks and bitwise operations, without generating or shuffling intermediate data. As a result, BIGMiner achieves very high scalability, with no workload skewness, no intermediate data, and small network communication overhead. Through extensive experiments on large-scale datasets of up to 6.5 billion transactions, we show that BIGMiner consistently and significantly outperforms the state-of-the-art methods without any memory problems.
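The idea of counting support from equal-sized transaction chunks with bitwise operations can be illustrated with a small single-process sketch. It is not BIGMiner's implementation: the abstract does not specify the chunk layout or how work is distributed, so the per-item bitmap representation, the function names (`build_chunks`, `partial_support`, `support`), and the `chunk_size` parameter are all assumptions made for illustration, and no MapReduce machinery is involved.

```python
from typing import Dict, FrozenSet, Iterable, List

Transaction = FrozenSet[str]
# Assumed layout: one integer bitmap per item; bit i is set when the
# i-th transaction of the chunk contains that item.
Chunk = Dict[str, int]


def build_chunks(transactions: List[Transaction], chunk_size: int) -> List[Chunk]:
    """Split the database into chunks of `chunk_size` transactions.

    The last chunk is shorter if the database size is not a multiple
    of `chunk_size`.
    """
    chunks: List[Chunk] = []
    for start in range(0, len(transactions), chunk_size):
        chunk: Chunk = {}
        for offset, transaction in enumerate(transactions[start:start + chunk_size]):
            for item in transaction:
                chunk[item] = chunk.get(item, 0) | (1 << offset)
        chunks.append(chunk)
    return chunks


def partial_support(chunk: Chunk, itemset: Iterable[str]) -> int:
    """Count the transactions in one chunk that contain every item of `itemset`."""
    items = list(itemset)
    if not items:
        return 0
    bits = chunk.get(items[0], 0)
    for item in items[1:]:
        bits &= chunk.get(item, 0)  # bitwise AND intersects the transaction sets
    return bin(bits).count("1")     # population count = matching transactions


def support(chunks: List[Chunk], itemset: Iterable[str]) -> int:
    """Total support is the sum of the per-chunk partial supports."""
    items = list(itemset)
    return sum(partial_support(chunk, items) for chunk in chunks)


# Hypothetical toy database of six transactions, split into three chunks of two.
db = [
    frozenset({"a", "b", "c"}),
    frozenset({"a", "c"}),
    frozenset({"b", "c"}),
    frozenset({"a", "b"}),
    frozenset({"a", "b", "c"}),
    frozenset({"c"}),
]
chunks = build_chunks(db, chunk_size=2)
print(support(chunks, ["a", "c"]))       # 3
print(support(chunks, ["a", "b", "c"]))  # 2
```

In this sketch each chunk yields only a single integer per candidate itemset, which loosely mirrors the abstract's point that support counting needs no shuffled intermediate data; how BIGMiner actually aggregates partial counts is not stated in the abstract.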
Acknowledgements
This work was partly supported by an Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. R0190-15-2012, High Performance Big Data Analytics Platform Performance Acceleration Technologies Development; No. R7124-16-0004, Development of Intelligent Interaction Technology Based on Context Awareness and Human Intention Understanding).
Cite this article
Chon, KW., Kim, MS. BIGMiner: a fast and scalable distributed frequent pattern miner for big data. Cluster Comput 21, 1507–1520 (2018). https://doi.org/10.1007/s10586-018-1812-0
DOI: https://doi.org/10.1007/s10586-018-1812-0