BIGMiner: a fast and scalable distributed frequent pattern miner for big data

Cluster Computing

Abstract

Frequent itemset mining is a fundamental data mining technique. A number of MapReduce-based frequent itemset mining methods have recently been proposed to overcome the limits on data size and mining speed that constrain sequential methods. However, existing MapReduce-based methods still scale poorly because of high workload skewness, large intermediate data, and large network communication overhead. In this paper, we propose BIGMiner, a fast and scalable MapReduce-based frequent itemset mining method. BIGMiner generates equal-sized sub-databases called transaction chunks and performs support counting using only transaction chunks and bitwise operations, without generating or shuffling intermediate data. As a result, BIGMiner achieves very high scalability, with no workload skewness, no intermediate data, and small network communication overhead. Through extensive experiments on large-scale datasets of up to 6.5 billion transactions, we show that BIGMiner consistently and significantly outperforms the state-of-the-art methods without any memory problems.
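
The idea of counting support over equal-sized transaction chunks with bitwise operations can be illustrated with a minimal single-process sketch. This is not the authors' implementation, and it involves no MapReduce: it assumes each chunk stores one bitmap per item (bit i is set when the i-th transaction of the chunk contains the item), so that the support of an itemset within a chunk is the population count of the AND of its items' bitmaps. The names `build_chunks`, `partial_support`, and `support` are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, List, Sequence

# Assumed layout: a chunk maps each item to a bitmap stored as a Python int.
# Bit i of the bitmap is 1 if the i-th transaction of the chunk contains the item.
Chunk = Dict[str, int]


def build_chunks(transactions: Sequence[Iterable[str]], chunk_size: int) -> List[Chunk]:
    """Split the database into chunks of `chunk_size` transactions
    (the last chunk may be smaller) and build per-item bitmaps."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks: List[Chunk] = []
    for start in range(0, len(transactions), chunk_size):
        chunk: Chunk = {}
        for offset, transaction in enumerate(transactions[start:start + chunk_size]):
            for item in set(transaction):
                chunk[item] = chunk.get(item, 0) | (1 << offset)
        chunks.append(chunk)
    return chunks


def partial_support(chunk: Chunk, itemset: Iterable[str]) -> int:
    """Count the transactions in one chunk that contain every item of `itemset`."""
    items = list(itemset)
    if not items:
        raise ValueError("itemset must be non-empty")
    bits = chunk.get(items[0], 0)
    for item in items[1:]:
        bits &= chunk.get(item, 0)  # bitwise AND intersects the transaction sets
        if bits == 0:
            break  # no transaction in this chunk contains all items so far
    return bin(bits).count("1")  # population count


def support(chunks: List[Chunk], itemset: Iterable[str]) -> int:
    """Total support is the sum of the per-chunk partial supports."""
    return sum(partial_support(chunk, itemset) for chunk in chunks)
```

In this sketch each chunk can be processed independently, and only one integer per chunk and candidate itemset has to be combined, which is the property the abstract attributes to avoiding intermediate data. How BIGMiner distributes chunks and candidates across a cluster is not described in the abstract and is not modeled here.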

Acknowledgements

This work was partly supported by Institute for Information & communications Technology Promotion (IITP) grants funded by the Korea government (MSIT) (No. R0190-15-2012, High Performance Big Data Analytics Platform Performance Acceleration Technologies Development; No. R7124-16-0004, Development of Intelligent Interaction Technology Based on Context Awareness and Human Intention Understanding).

Author information

Corresponding author

Correspondence to Min-Soo Kim.

About this article

Cite this article

Chon, KW., Kim, MS. BIGMiner: a fast and scalable distributed frequent pattern miner for big data. Cluster Comput 21, 1507–1520 (2018). https://doi.org/10.1007/s10586-018-1812-0

  • DOI: https://doi.org/10.1007/s10586-018-1812-0
