Abstract
Mining frequent itemsets in large datasets has received much attention in recent years, with many approaches relying on the MapReduce programming model. Many well-known FIM algorithms have been parallelized in MapReduce frameworks, such as Parallel Apriori, Parallel FP-Growth, and Dist-Eclat. However, most of this work focuses on work partitioning and/or load balancing, and the resulting algorithms are not extensible because they rely on assumptions about available memory. A challenge in designing parallel FIM algorithms is thus to guarantee that the data structures used during mining always fit in the local memory of the processing nodes throughout all computation steps.
In this paper, we propose MapFIM, a two-phase approach to frequent itemset mining in very large datasets that relies on both a MapReduce-based distributed Apriori method and a local in-memory method. In our approach, MapReduce is first used to generate, from the input dataset, prefix-projected databases that fit in local memory, benefiting from the Apriori principle. An optimized local in-memory mining process is then launched to generate all frequent itemsets from each prefix-projected database. Performance evaluation shows that MapFIM is more efficient and more extensible than existing MapReduce-based frequent itemset mining approaches.
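To make the projection step concrete, here is a minimal Python sketch of prefix-projection (not the authors' implementation: the function name and the set-based transaction representation are our own, and support-based pruning of candidate extensions is omitted):

```python
def project(transactions, prefix):
    """Project a transaction database onto a frequent prefix itemset.

    For each transaction containing the whole prefix, keep only the items
    that sort after the last prefix item (the possible extensions).
    """
    last = max(prefix)
    projected = []
    for t in transactions:
        if prefix <= t:  # the transaction contains every prefix item
            suffix = {i for i in t if i > last}
            if suffix:
                projected.append(suffix)
    return projected

db = [{1, 2, 3}, {1, 3, 4}, {2, 3, 4}, {1, 2, 4}]
print(project(db, frozenset({1})))  # [{2, 3}, {3, 4}, {2, 4}]
```

Each projected database is typically much smaller than the input, which is what lets a subsequent in-memory miner handle it on a single node.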
Notes
1. HDFS: Hadoop Distributed File System.
2. In our work, in order to generate each candidate exactly once, we use a prefix-based join operation. More precisely, given two sets of \((k-1)\)-itemsets \(\mathcal {L}_{k-1}\) and \(\mathcal {L}'_{k-1}\), their join is defined by: \(\mathcal {L}_{k-1} \bowtie \mathcal {L}'_{k-1} = \{(i_1, \dots , i_k) ~|~ (i_1, \dots , i_{k-2}, i_{k-1}) \in \mathcal {L}_{k-1} \wedge (i_1, \dots , i_{k-2}, i_{k}) \in \mathcal {L}'_{k-1} \wedge i_1 < \dots < i_{k-1} < i_{k} \}\).
3. In our configuration, there is no real performance difference between Hadoop 1.2.1 and Hadoop 2.7.3.
4. In our implementation, \(M_{reduce\_task}\) is around 300 MB.
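The prefix-based join of note 2 can be sketched in Python as follows. This is an illustrative sketch, not the paper's code: the function name is our own, and we self-join a single level \(\mathcal{L}_{k-1}\) with itself (the usual Apriori case), representing itemsets as sorted tuples:

```python
def prefix_join(level):
    """Prefix-based join: combine two frequent (k-1)-itemsets sharing
    their first k-2 items into a k-itemset candidate, exactly once.

    Itemsets are sorted tuples; requiring a[-1] < b[-1] enforces the
    i_1 < ... < i_{k-1} < i_k ordering, so each candidate is unique.
    """
    candidates = set()
    for a in level:
        for b in level:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.add(a + (b[-1],))
    return candidates

L2 = {(1, 2), (1, 3), (2, 3)}
print(sorted(prefix_join(L2)))  # [(1, 2, 3)]
```

Because the two joined itemsets must agree on their common prefix and differ only in their (ordered) last items, every k-itemset candidate is produced by exactly one pair.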
Acknowledgement
This work is partly supported by the GIRAFON project funded by Centre-Val de Loire.
© 2017 Springer International Publishing AG
Duong, KC., Bamha, M., Giacometti, A., Li, D., Soulet, A., Vrain, C. (2017). MapFIM: Memory Aware Parallelized Frequent Itemset Mining in Very Large Datasets. In: Benslimane, D., Damiani, E., Grosky, W., Hameurlain, A., Sheth, A., Wagner, R. (eds) Database and Expert Systems Applications. DEXA 2017. Lecture Notes in Computer Science, vol 10438. Springer, Cham. https://doi.org/10.1007/978-3-319-64468-4_36
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64467-7
Online ISBN: 978-3-319-64468-4
eBook Packages: Computer Science (R0)