BIGMiner: a fast and scalable distributed frequent pattern miner for big data

Abstract

Frequent itemset mining is a fundamental and widely used data mining technique. Recently, a number of MapReduce-based frequent itemset mining methods have been proposed to overcome the limits on data size and mining speed of sequential methods. However, the existing MapReduce-based methods still do not scale well because of high workload skewness, large intermediate data, and large network communication overhead. In this paper, we propose BIGMiner, a fast and scalable MapReduce-based frequent itemset mining method. BIGMiner generates equal-sized sub-databases called transaction chunks and performs support counting using only the transaction chunks and bitwise operations, without generating or shuffling intermediate data. As a result, BIGMiner achieves very high scalability: it has no workload skewness, produces no intermediate data, and incurs only small network communication overhead. Through extensive experiments on large-scale datasets of up to 6.5 billion transactions, we show that BIGMiner consistently and significantly outperforms the state-of-the-art methods without any memory problems.
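
The idea of counting support over equal-sized transaction chunks with bitwise operations can be illustrated with a small single-machine sketch. This is not the BIGMiner implementation and involves no MapReduce. The function names, the chunk size, and the use of Python integers as bit vectors are all assumptions made for illustration. Each chunk stores one bit vector per item, and the support of an itemset within a chunk is the number of set bits in the AND of its items' vectors.

```python
from typing import Dict, Iterable, List, Sequence, Set

Transaction = Set[str]


def split_into_chunks(transactions: Sequence[Transaction],
                      chunk_size: int) -> List[Sequence[Transaction]]:
    """Split the database into chunks of chunk_size transactions (last may be shorter)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [transactions[i:i + chunk_size]
            for i in range(0, len(transactions), chunk_size)]


def build_bitmaps(chunk: Sequence[Transaction]) -> Dict[str, int]:
    """Map each item to an int whose bit t is set if transaction t of the chunk has it."""
    bitmaps: Dict[str, int] = {}
    for t, transaction in enumerate(chunk):
        for item in transaction:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << t)
    return bitmaps


def chunk_support(bitmaps: Dict[str, int], itemset: Iterable[str],
                  chunk_len: int) -> int:
    """Count transactions in one chunk that contain every item of itemset."""
    mask = (1 << chunk_len) - 1          # start with all transactions selected
    for item in itemset:
        mask &= bitmaps.get(item, 0)     # keep only transactions that have the item
    return bin(mask).count("1")          # population count of the surviving bits


def count_support(transactions: Sequence[Transaction], itemset: Iterable[str],
                  chunk_size: int) -> int:
    """Total support: the sum of the independent per-chunk partial counts."""
    items = list(itemset)
    total = 0
    for chunk in split_into_chunks(transactions, chunk_size):
        total += chunk_support(build_bitmaps(chunk), items, len(chunk))
    return total


# Hypothetical toy database of five transactions.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]
print(count_support(transactions, {"a", "b"}, chunk_size=2))
```

Because each chunk is processed independently and contributes only a single partial count, the chunk size does not change the result. In this toy setting, that is also why no per-candidate intermediate data has to be exchanged between chunks.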

Keywords

Frequent pattern mining, Big data, Scalable algorithm, Distributed algorithm, MapReduce

Notes

Acknowledgements

This work was partly supported by Institute for Information & communications Technology Promotion (IITP) grants funded by the Korea government (MSIT) (No. R0190-15-2012, High Performance Big Data Analytics Platform Performance Acceleration Technologies Development; No. R7124-16-0004, Development of Intelligent Interaction Technology Based on Context Awareness and Human Intention Understanding).

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Information and Communication Engineering, Daegu Gyeongbuk Institute of Science & Technology (DGIST), Daegu, Korea
