RDD-Eclat: Approaches to Parallelize Eclat Algorithm on Spark RDD Framework

Singh, Pankaj; Singh, Sudhakar; Mishra, P. K.; Garg, Rakhi

doi:10.1007/978-3-030-37051-0_85

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 44)

Included in the following conference series:

International Conference on Computer Networks and Inventive Communication Technologies

Abstract

A number of frequent itemset mining (FIM) algorithms were initially designed on Hadoop MapReduce, a distributed big data processing framework. However, because of its heavy disk I/O, MapReduce has been found inefficient for such highly iterative algorithms. Spark, a more efficient distributed data processing framework, was therefore developed with in-memory computation and resilient distributed dataset (RDD) features to support iterative algorithms. Apriori- and FP-Growth-based FIM algorithms have been designed on the Spark RDD framework, but an Eclat-based algorithm has not yet been explored. This paper proposes RDD-Eclat, a parallel Eclat algorithm on the Spark RDD framework, along with its five variants. The proposed algorithms are evaluated on various benchmark datasets, and the results show that RDD-Eclat outperforms Spark-based Apriori by many times. The experimental results also show that the proposed algorithms scale as the number of cores and the size of the dataset increase.
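
For readers unfamiliar with Eclat, the following is a minimal sequential sketch of the general technique: transactions are converted to a vertical layout (item to set of transaction ids), and the support of a larger itemset is obtained by intersecting those sets. It is an illustration only, not the RDD-Eclat algorithm or any of its five variants, which the abstract does not describe in detail. The function name `eclat`, its parameters, and the use of an absolute support count are assumptions made for this example.

```python
from collections import defaultdict


def eclat(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets.

    Hypothetical helper for illustration; min_support is an absolute count.
    """
    # Vertical layout: map each item to the set of transaction ids (tidset)
    # of the transactions that contain it.
    tidsets = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets[item].add(tid)

    frequent = {}

    def extend(prefix, candidates):
        # candidates: (item, tidset) pairs, each frequent together with prefix.
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix | {item}
            frequent[frozenset(itemset)] = len(tids)
            # Intersect only with later items so each itemset is built once.
            suffix = []
            for other, other_tids in candidates[i + 1:]:
                shared = tids & other_tids
                if len(shared) >= min_support:
                    suffix.append((other, shared))
            if suffix:
                extend(itemset, suffix)

    # Start from the frequent single items, in a fixed order.
    start = sorted(
        ((item, tids) for item, tids in tidsets.items()
         if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )
    extend(frozenset(), start)
    return frequent
```

Calling `eclat([{"a", "b"}, {"a", "c"}, {"a", "b", "c"}], 2)` would report the single items together with the pairs that occur in at least two transactions. Each recursive call works only on the tidsets that share one prefix, which is why the search is often described as decomposable into independent subproblems; how RDD-Eclat distributes that work across Spark partitions is a matter for the full paper.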

Author information

Authors and Affiliations

Department of Computer Science, Banaras Hindu University, Varanasi, India
Pankaj Singh & P. K. Mishra
Department of Electronics and Communication, University of Allahabad, Allahabad, India
Sudhakar Singh
Mahila Maha Vidyalaya, Banaras Hindu University, Varanasi, India
Rakhi Garg

Corresponding author

Correspondence to Sudhakar Singh.

Editor information

Editors and Affiliations

Department of Computer Science Engineering, RVS Technical Campus, Coimbatore, Tamil Nadu, India
S. Smys
University of the Ryukyus, Okinawa, Japan
Tomonobu Senjyu
Department of Telecommunication Engineering, Faculty of Electrical Engineering, Czech Technical University in Prague, Prague, Czech Republic
Pavel Lafata

About this paper

Cite this paper

Singh, P., Singh, S., Mishra, P.K., Garg, R. (2020). RDD-Eclat: Approaches to Parallelize Eclat Algorithm on Spark RDD Framework. In: Smys, S., Senjyu, T., Lafata, P. (eds) Second International Conference on Computer Networks and Communication Technologies. ICCNCT 2019. Lecture Notes on Data Engineering and Communications Technologies, vol 44. Springer, Cham. https://doi.org/10.1007/978-3-030-37051-0_85

DOI: https://doi.org/10.1007/978-3-030-37051-0_85
Published: 22 January 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-37050-3
Online ISBN: 978-3-030-37051-0
eBook Packages: Engineering, Engineering (R0)
