Abstract
Initially, a number of frequent itemset mining (FIM) algorithms were designed on Hadoop MapReduce, a distributed big data processing framework. However, because of its heavy disk I/O, MapReduce has been found to be inefficient for such highly iterative algorithms. Spark, a more efficient distributed data processing framework, was therefore developed with in-memory computation and resilient distributed dataset (RDD) features to support iterative algorithms. Apriori-based and FP-Growth-based FIM algorithms have been designed on the Spark RDD framework, but an Eclat-based algorithm has not yet been explored. This paper proposes RDD-Eclat, a parallel Eclat algorithm on the Spark RDD framework, together with its five variants. The proposed algorithms are evaluated on various benchmark datasets, and the results show that RDD-Eclat outperforms Spark-based Apriori by many times. The experimental results also show that the proposed algorithms scale as the number of cores and the size of the dataset increase.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Singh, P., Singh, S., Mishra, P.K., Garg, R. (2020). RDD-Eclat: Approaches to Parallelize Eclat Algorithm on Spark RDD Framework. In: Smys, S., Senjyu, T., Lafata, P. (eds) Second International Conference on Computer Networks and Communication Technologies. ICCNCT 2019. Lecture Notes on Data Engineering and Communications Technologies, vol 44. Springer, Cham. https://doi.org/10.1007/978-3-030-37051-0_85
DOI: https://doi.org/10.1007/978-3-030-37051-0_85
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-37050-3
Online ISBN: 978-3-030-37051-0
eBook Packages: Engineering, Engineering (R0)