A Parallel Implementation of Information Gain Using Hive in Conjunction with MapReduce for Continuous Features

  • Conference paper
  • In: Trends and Applications in Knowledge Discovery and Data Mining (PAKDD 2018)
  • Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11154)

Abstract

Finding efficient ways to compute Information Gain is becoming ever more important as we enter the Big Data era, where data volume and dimensionality are growing at alarming rates. When machine learning algorithms are burdened with high-dimensional data containing redundant features, Information Gain becomes crucial for feature selection. Information Gain is also often used as a precursory step in building decision trees, text classifiers, support vector machines, etc. Given the very large volume of today’s data, classic algorithms like Information Gain need to be parallelized efficiently. In this paper, we present a parallel implementation of Information Gain for continuous features in the MapReduce environment, using MapReduce in conjunction with Hive. In our approach, Hive calculates the counts and the parent entropy, and a map-only job completes the Information Gain calculation. Our approach achieved improved run times because the MapReduce jobs were designed to leverage the Hadoop cluster efficiently.
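To make the quantity being parallelized concrete, the following is a minimal single-machine sketch of Information Gain for a continuous feature split at a threshold. It is an illustration of the underlying formula, IG(S, A) = H(S) − Σ (|S_v|/|S|) H(S_v), not the authors' distributed Hive/MapReduce implementation; all function names here are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) = -sum_c p_c * log2(p_c) over the class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain_continuous(values, labels, threshold):
    """Information gain of splitting a continuous feature at `threshold`:
    IG = H(S) - |S_le|/|S| * H(S_le) - |S_gt|/|S| * H(S_gt)."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

# Example: a perfectly separating threshold yields IG equal to the parent
# entropy (1 bit here), since both child partitions are pure.
values = [1.0, 2.0, 3.0, 4.0]
labels = ["a", "a", "b", "b"]
print(info_gain_continuous(values, labels, 2.5))  # 1.0
```

In the paper's pipeline, the per-partition counts feeding these entropy terms come from Hive aggregate queries, and the final subtraction is done in a map-only job.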



Author information

Corresponding author: Sikha Bagui.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Bagui, S., John, S.K., Baggs, J.P., Bagui, S. (2018). A Parallel Implementation of Information Gain Using Hive in Conjunction with MapReduce for Continuous Features. In: Ganji, M., Rashidi, L., Fung, B., Wang, C. (eds.) Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science, vol. 11154. Springer, Cham. https://doi.org/10.1007/978-3-030-04503-6_28

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-04503-6_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-04502-9

  • Online ISBN: 978-3-030-04503-6

  • eBook Packages: Computer Science; Computer Science (R0)
