A Parallel Implementation of Information Gain Using Hive in Conjunction with MapReduce for Continuous Features

  • Sikha Bagui
  • Sharon K. John
  • John P. Baggs
  • Subhash Bagui
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11154)


Finding efficient ways to compute Information Gain is becoming even more important as we enter the Big Data era, where data volume and dimensionality are increasing at alarming rates. When machine learning algorithms are over-burdened with high-dimensional data containing redundant features, information gain becomes crucial for feature selection. Information gain is also often used as a precursory step in creating decision trees, text classifiers, support vector machines, etc. Due to the very large volume of today's data, there is a need to efficiently parallelize classic algorithms like Information Gain. In this paper, we present a parallel implementation of Information Gain for continuous features in the MapReduce environment, using MapReduce in conjunction with Hive. In our approach, Hive is used to calculate the class counts and the parent entropy, and a map-only job completes the Information Gain calculations. Our approach achieved gains in run times because the MapReduce jobs were carefully designed to leverage the Hadoop cluster efficiently.


Keywords: Hadoop · Hive · Information Gain · Parallel implementation · Feature selection
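The paper itself details the Hive queries and the map-only MapReduce job; as a minimal single-machine sketch of the quantity being parallelized, the snippet below computes the information gain of a threshold split on a continuous feature (parent entropy minus the size-weighted entropy of the two partitions). The function names `entropy` and `information_gain` are illustrative, not from the paper.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Information gain of splitting a continuous feature at `threshold`:
    parent entropy minus the size-weighted entropy of the two partitions."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted
```

In the paper's distributed setting, the per-class counts feeding these entropy terms come from Hive aggregations rather than in-memory lists; for example, a perfectly separating threshold yields a gain equal to the parent entropy, as in `information_gain([1, 2, 3, 4], ['a', 'a', 'b', 'b'], 2.5)`, which returns 1.0.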



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Sikha Bagui (1)
  • Sharon K. John (1)
  • John P. Baggs (1)
  • Subhash Bagui (1)

  1. University of West Florida, Pensacola, USA
