A Caching-Based Parallel FP-Growth in Apache Spark

  • Zhicheng Cai (email author)
  • Xingyu Zhu
  • Yuehui Zheng
  • Duan Liu
  • Lei Xu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11336)


Association-rule-based recommendation is widespread in big data applications that need quick responses to improve user experience. Spark is a widely used distributed computing platform that accelerates the processing of large-scale distributed data. Developing appropriate distributed algorithms for Spark is essential to decrease the processing time of distributed recommendation. The existing FP-Growth in Spark is a popular parallel recommendation method, but it achieves its best performance only when the memory of the machines can accommodate all intermediate Resilient Distributed Datasets (RDDs). However, the memory of many practical data centers is still not large enough for large data sets. Therefore, in this paper, a caching-based parallel FP-Growth is proposed, which consists of an integer-based sorting and an RDD-caching strategy to improve efficiency. Experimental results show that the proposal decreases the execution time by 32.37% on average compared with the existing parallel FP-Growth in Spark. Furthermore, the impacts of some important parameters on the performance of the proposal are analyzed through numerous realistic experiments in Spark.
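The abstract does not spell out how the integer-based sorting works, so the following is only a minimal, single-machine sketch of one plausible reading: frequent items are replaced by integer ranks (0 for the most frequent), so that ordering the items of a transaction becomes a plain integer sort. The function names `build_rank_table` and `encode`, the `min_count` parameter, and the toy transactions are hypothetical and are not taken from the paper or from Spark.

```python
from collections import Counter


def build_rank_table(transactions, min_count):
    """Map each frequent item to an integer rank (0 = most frequent).

    Assumption: this is an illustrative encoding, not the paper's exact scheme.
    """
    # Count each item at most once per transaction.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = [(item, c) for item, c in counts.items() if c >= min_count]
    # Descending count; ties broken by item name so ranks are deterministic.
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    return {item: rank for rank, (item, _) in enumerate(frequent)}


def encode(transaction, rank):
    """Drop infrequent items and return the remaining ranks in ascending order."""
    return sorted(rank[item] for item in set(transaction) if item in rank)


# Hypothetical toy data set.
transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]

rank = build_rank_table(transactions, min_count=3)
encoded = [encode(t, rank) for t in transactions]
```

With this toy input, `rank` maps bread, diapers, milk, and beer to 0 through 3, and the fourth transaction encodes to `[0, 1, 2, 3]`. In a Spark setting, an encoded data set of this kind would be a natural candidate for caching because it is reused across later stages, but which RDDs the proposal actually caches is not stated in the abstract.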


Keywords: Spark; Parallel FP-Growth; Caching strategy



Zhicheng Cai is supported by the National Natural Science Foundation of China (Grant No. 61602243) and the Natural Science Foundation of Jiangsu Province (Grant No. BK20160846). Lei Xu is supported by the National Natural Science Foundation of China (Grant No. 61671244). Duan Liu is supported by the Postgraduate Research & Practice Innovation Program of Jiangsu Province.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Zhicheng Cai (1), email author
  • Xingyu Zhu (1)
  • Yuehui Zheng (1)
  • Duan Liu (1)
  • Lei Xu (1)

  1. School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
