A Parallel Implementation of Information Gain Using Hive in Conjunction with MapReduce for Continuous Features

  • Conference paper
  • In: Trends and Applications in Knowledge Discovery and Data Mining (PAKDD 2018)
  • Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11154)

Abstract

Finding efficient ways to compute Information Gain is becoming ever more important as we enter the Big Data era, where data volume and dimensionality are growing at alarming rates. When machine learning algorithms are burdened with high-dimensional data containing redundant features, Information Gain becomes crucial for feature selection. Information Gain is also often used as a precursory step in building decision trees, text classifiers, support vector machines, etc. Given the very large volume of today’s data, classic algorithms like Information Gain need to be parallelized efficiently. In this paper, we present a parallel implementation of Information Gain for continuous features in the MapReduce environment, using MapReduce in conjunction with Hive. In our approach, Hive calculates the counts and the parent entropy, and a map-only job completes the Information Gain calculation. Our approach achieved improved run times because the MapReduce jobs were designed to leverage the Hadoop cluster efficiently.
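To make the quantity being parallelized concrete, the following is a minimal single-machine sketch of Information Gain for a continuous feature split at a threshold. It is an illustration of the underlying formula, IG(S, A) = H(S) − Σ (|S_v|/|S|) H(S_v), not the authors' distributed Hive/MapReduce implementation; all function names here are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) = -sum_c p_c * log2(p_c) over the class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain_continuous(values, labels, threshold):
    """Information gain of splitting a continuous feature at `threshold`:
    IG = H(S) - |S_le|/|S| * H(S_le) - |S_gt|/|S| * H(S_gt)."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

# Example: a perfectly separating threshold yields IG equal to the parent
# entropy (1 bit here), since both child partitions are pure.
values = [1.0, 2.0, 3.0, 4.0]
labels = ["a", "a", "b", "b"]
print(info_gain_continuous(values, labels, 2.5))  # 1.0
```

In the paper's pipeline, the per-partition counts feeding these entropy terms come from Hive aggregate queries, and the final subtraction is done in a map-only job.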



Author information

Corresponding author: Sikha Bagui.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Bagui, S., John, S.K., Baggs, J.P., Bagui, S. (2018). A Parallel Implementation of Information Gain Using Hive in Conjunction with MapReduce for Continuous Features. In: Ganji, M., Rashidi, L., Fung, B., Wang, C. (eds.) Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science, vol. 11154. Springer, Cham. https://doi.org/10.1007/978-3-030-04503-6_28

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-04503-6_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-04502-9

  • Online ISBN: 978-3-030-04503-6

  • eBook Packages: Computer Science; Computer Science (R0)
