Distributed synthesized association mining for big transactional data

Abstract

Transactional databases are growing rapidly. Partitioning this data and storing it in a distributed manner is an effective approach to storage and retrieval. Mining such distributed data with minimal dependence between sub-problems is a crucial task. Finding frequent itemsets and the corresponding association rules is challenging when results must be aggregated across a distributed environment. To address these challenges, we propose a distributed frequent itemset generation and association rule mining algorithm based on the MapReduce programming model. The proposed scheme generates frequent itemsets and mines association rules using a synthesized distributed technique. The rules are mined in a distributed manner, and weights are then assigned to the subsets of data and to the association rules. The rules generated in a distributed manner are then combined using a weighted approach. This paper presents a novel MapReduce-based synthesis approach that works well over distributed storage of large amounts of data.
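The general idea in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' algorithm: the function names (`map_count`, `reduce_counts`, `frequent_itemsets`, `synthesize_support`), the cap on itemset size, and the choice to weight each data subset by its share of transactions are all assumptions made for illustration, since the abstract does not specify how the weights are computed.

```python
from collections import Counter
from itertools import combinations


def map_count(partition, max_size=2):
    """Map step (illustrative): count itemsets of up to max_size items in one partition."""
    counts = Counter()
    for transaction in partition:
        items = sorted(set(transaction))  # canonical order so each itemset has one key
        for k in range(1, max_size + 1):
            for itemset in combinations(items, k):
                counts[itemset] += 1
    return counts


def reduce_counts(partial_counts):
    """Reduce step (illustrative): sum the per-partition counts key by key."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


def frequent_itemsets(partitions, min_support, max_size=2):
    """Return {itemset: global support} for itemsets meeting min_support."""
    n = sum(len(p) for p in partitions)
    total = reduce_counts(map_count(p, max_size) for p in partitions)
    return {s: c / n for s, c in total.items() if c / n >= min_support}


def synthesize_support(partitions, itemset):
    """Weighted synthesis (assumed weighting): each partition's local support
    is weighted by that partition's share of all transactions."""
    n = sum(len(p) for p in partitions)
    target = set(itemset)
    synthesized = 0.0
    for p in partitions:
        if not p:
            continue  # an empty partition contributes nothing
        local_support = sum(target <= set(t) for t in p) / len(p)
        synthesized += (len(p) / n) * local_support
    return synthesized


# Hypothetical toy data: two partitions of a transactional database.
partitions = [
    [["bread", "milk"], ["bread", "butter"], ["milk"]],
    [["bread", "milk", "butter"], ["bread", "milk"]],
]
freq = frequent_itemsets(partitions, min_support=0.6)
print(sorted(freq))
print(synthesize_support(partitions, ("bread", "milk")))
```

With size-proportional weights the synthesized support coincides with the global support, so the weighting only changes the result when it reflects something other than partition size, for example how many rules a subset contributes.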

Author information

Corresponding author

Correspondence to Amrit Pal.

About this article

Cite this article

Pal, A., Kumar, M. Distributed synthesized association mining for big transactional data. Sādhanā 45, 169 (2020). https://doi.org/10.1007/s12046-020-01380-8

Keywords

  • Big Data
  • HDFS
  • MapReduce
  • Apriori
  • frequent itemset
  • association rule