
MapFIM: Memory Aware Parallelized Frequent Itemset Mining in Very Large Datasets

  • Conference paper
Database and Expert Systems Applications (DEXA 2017)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10438)


Abstract

Mining frequent itemsets in large datasets has received much attention in recent years, notably through approaches relying on MapReduce programming models. Many well-known FIM algorithms have been parallelized in MapReduce frameworks, yielding methods such as Parallel Apriori, Parallel FP-Growth and Dist-Eclat. However, most of these works focus on work partitioning and/or load balancing, and the resulting methods are not extensible because they rely on assumptions about the memory available on the processing nodes. A challenge in designing parallel FIM algorithms is thus to guarantee that the data structures used during mining always fit in the local memory of the processing nodes, at every computation step.

In this paper, we propose MapFIM, a two-phase approach to frequent itemset mining in very large datasets that relies both on a MapReduce-based distributed Apriori method and on a local in-memory method. In our approach, MapReduce is first used to generate, from the input dataset, prefix-projected databases that fit in local memory, exploiting the Apriori principle. An optimized local in-memory mining process is then launched to generate all frequent itemsets from each prefix-projected database. Performance evaluation shows that MapFIM is more efficient and more extensible than existing MapReduce-based frequent itemset mining approaches.
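For intuition only, the following Python fragment sketches this two-phase idea in a purely sequential setting: a stand-in for the distributed Apriori phase grows frequent prefixes until each prefix-projected database fits a memory budget, after which a local in-memory miner takes over. All names (mapfim_sketch, local_mine, memory_budget) are hypothetical assumptions; this is not the authors' MapReduce implementation.

```python
# Illustrative serial sketch of the two-phase idea (hypothetical names).
# Phase 1 stands in for the MapReduce-based distributed Apriori step;
# Phase 2 is the local in-memory mining of one prefix-projected database.
from collections import defaultdict

def item_counts(db):
    counts = defaultdict(int)
    for transaction in db:
        for item in transaction:
            counts[item] += 1
    return counts

def local_mine(db, prefix, min_support):
    """Phase 2: recursively mine all frequent itemsets extending `prefix` from a
    prefix-projected database assumed to fit in local memory."""
    results = []
    for item, count in sorted(item_counts(db).items()):
        if count < min_support:
            continue
        itemset = prefix + (item,)
        results.append((itemset, count))
        # Keep only larger items so every itemset is enumerated exactly once.
        projected = [[i for i in t if i > item] for t in db if item in t]
        results.extend(local_mine(projected, itemset, min_support))
    return results

def mapfim_sketch(transactions, min_support, memory_budget):
    """Phase 1 (simulated sequentially): extend frequent prefixes one item at a
    time, Apriori-style, until each prefix-projected database is small enough,
    then switch to Phase 2. `memory_budget` is a stand-in for the per-node
    memory check."""
    results = []
    frontier = [((), [sorted(set(t)) for t in transactions])]
    while frontier:
        prefix, db = frontier.pop()
        if sum(len(t) for t in db) <= memory_budget:  # fits in local memory
            results.extend(local_mine(db, prefix, min_support))
            continue
        for item, count in sorted(item_counts(db).items()):
            if count < min_support:
                continue
            itemset = prefix + (item,)
            results.append((itemset, count))
            projected = [[i for i in t if i > item] for t in db if item in t]
            frontier.append((itemset, projected))
    return results

if __name__ == "__main__":
    data = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
    print(mapfim_sketch(data, min_support=3, memory_budget=6))
```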


Notes

  1. HDFS: Hadoop Distributed File System.

  2. In our work, in order to generate each candidate once, we use a prefix-based join operation. More precisely, given two sets of \((k-1)\)-itemsets \(\mathcal{L}_{k-1}\) and \(\mathcal{L}'_{k-1}\), their join is defined by \(\mathcal{L}_{k-1} \bowtie \mathcal{L}'_{k-1} = \{(i_1, \dots, i_k) \mid (i_1, \dots, i_{k-2}, i_{k-1}) \in \mathcal{L}_{k-1} \wedge (i_1, \dots, i_{k-2}, i_{k}) \in \mathcal{L}'_{k-1} \wedge i_1 < \dots < i_{k-1} < i_{k}\}\). A small code sketch of this join is given after these notes.

  3. In our configuration, there is no significant performance difference between Hadoop 1.2.1 and Hadoop 2.7.3.

  4. In our implementation, \(M_{reduce\_task}\) is around 300 MB.
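As a purely illustrative sketch of the prefix-based join defined in footnote 2 (not the paper's implementation), the snippet below represents itemsets as sorted tuples; the function name prefix_join is hypothetical.

```python
# Sketch of the prefix-based join from footnote 2: two (k-1)-itemsets sharing
# their first k-2 items, with ordered last items, produce one k-candidate,
# so every candidate is generated exactly once.

def prefix_join(L, L_prime):
    candidates = set()
    for a in L:
        for b in L_prime:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.add(a + (b[-1],))
    return candidates

# Example: self-joining the frequent 2-itemsets yields the 3-itemset candidates.
L2 = {(1, 2), (1, 3), (2, 3)}
print(sorted(prefix_join(L2, L2)))  # -> [(1, 2, 3)]
```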


Acknowledgement

This work is partly supported by the GIRAFON project funded by Centre-Val de Loire.

Author information


Correspondence to Khanh-Chuong Duong.



Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Duong, K.-C., Bamha, M., Giacometti, A., Li, D., Soulet, A., Vrain, C. (2017). MapFIM: Memory Aware Parallelized Frequent Itemset Mining in Very Large Datasets. In: Benslimane, D., Damiani, E., Grosky, W., Hameurlain, A., Sheth, A., Wagner, R. (eds.) Database and Expert Systems Applications. DEXA 2017. Lecture Notes in Computer Science, vol. 10438. Springer, Cham. https://doi.org/10.1007/978-3-319-64468-4_36

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-64468-4_36

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-64467-7

  • Online ISBN: 978-3-319-64468-4

  • eBook Packages: Computer Science, Computer Science (R0)
