Parallel Prediction Algorithms for Heterogeneous Data: A Case Study with Real-Time Big Datasets

  • Y. V. Lokeswari
  • Shomona Gracia Jacob
  • Rajavel Ramadoss
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 750)

Abstract

Parallel data mining algorithms are widely used to discover hidden knowledge in varied, unrelated data. They offer advantages such as shorter training time, shorter execution time, and lower memory requirements. Executing them in a distributed environment raises several issues, however. The data must be partitioned among processors so that data dependency, communication overhead, and disk I/O cost are minimal, synchronization is handled properly, and the workload is balanced across the nodes. A few of these issues can be resolved by executing parallel data mining algorithms on Hadoop MapReduce, an Apache framework. Hadoop MapReduce provides improved performance together with reduced communication cost, execution time, training time, and I/O access. This paper proposes a novel framework that aims to enhance these advantages in terms of scalability by increasing the number of nodes in the Hadoop cluster and analyzing the performance of classification algorithms such as K-Nearest Neighbor, Naïve Bayes, and Decision Tree. This parallel framework could be extended to other fields of biotechnology where prediction on large datasets is essential.
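
As a rough illustration of the map-and-reduce pattern the abstract refers to, the sketch below trains a categorical Naïve Bayes classifier from per-partition counts. It is not the authors' framework and does not use Hadoop: it is plain Python run in a single process, and every name in it (partition, map_counts, reduce_counts, train, predict) is hypothetical. The round-robin split and the Laplace smoothing are assumptions made for the example.

```python
from collections import Counter
from functools import reduce
from math import log


def partition(rows, n_parts):
    """Split rows round-robin so each partition gets a similar share of the work."""
    return [rows[i::n_parts] for i in range(n_parts)]


def map_counts(rows):
    """Map step: count labels and (feature index, value, label) triples locally."""
    counts = Counter()
    for features, label in rows:
        counts[("class", label)] += 1
        for i, value in enumerate(features):
            counts[("feat", i, value, label)] += 1
    return counts


def reduce_counts(a, b):
    """Reduce step: merge two partial count tables by summing matching keys."""
    return a + b


def train(rows, n_parts=2):
    """Build the global count table from independently counted partitions."""
    # In a real cluster each map_counts call would run on a different node.
    partials = [map_counts(part) for part in partition(rows, n_parts)]
    return reduce(reduce_counts, partials, Counter())


def predict(counts, features):
    """Return the label with the highest smoothed log-posterior."""
    labels = [k[1] for k in counts if k[0] == "class"]
    total = sum(counts[("class", c)] for c in labels)

    def score(c):
        n_c = counts[("class", c)]
        s = log(n_c / total)
        for i, v in enumerate(features):
            # Number of distinct values seen for feature i (assumed vocabulary).
            n_values = len({k[2] for k in counts if k[0] == "feat" and k[1] == i})
            # Laplace smoothing keeps unseen (value, label) pairs from zeroing the score.
            s += log((counts[("feat", i, v, c)] + 1) / (n_c + n_values))
        return s

    return max(labels, key=score)
```

Because the reduce step only sums counts, the merged table does not depend on how many partitions were used. That property is what lets this kind of classifier scale by adding nodes, with only the small count tables needing to be communicated.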

Keywords

Data mining algorithms, Hadoop map reduce, Scalability, Big data

Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  • Y. V. Lokeswari (1), email author
  • Shomona Gracia Jacob (1)
  • Rajavel Ramadoss (2)
  1. Department of CSE, Sri Sivasubramaniya Nadar College of Engineering, Chennai, India
  2. Department of ECE, Sri Sivasubramaniya Nadar College of Engineering, Chennai, India