Abstract
Finding efficient ways to compute Information Gain is becoming increasingly important as we enter the Big Data era, where both data volume and dimensionality are growing at alarming rates. When machine learning algorithms are over-burdened with high-dimensional data containing redundant features, information gain becomes crucial for feature selection. Information gain is also often used as a preliminary step in building decision trees, text classifiers, support vector machines, and other models. Given the very large volume of today's data, classic algorithms like Information Gain need to be parallelized efficiently. In this paper, we present a parallel implementation of Information Gain for continuous features in the MapReduce environment, using MapReduce in conjunction with Hive. In our approach, Hive is used to calculate the counts and the parent entropy, and a map-only job completes the Information Gain calculations. Our approach demonstrated gains in run times because the MapReduce jobs were carefully designed to leverage the Hadoop cluster efficiently.
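The quantity being parallelized can be illustrated with a small single-machine sketch. This is not the paper's Hive/MapReduce implementation; it is a minimal serial version of the underlying calculation: parent entropy minus the weighted entropy of the partitions produced by splitting a continuous feature at a candidate threshold. All function names here are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels, threshold):
    """Information gain from splitting a continuous feature at `threshold`:
    parent entropy minus the size-weighted entropy of the two partitions."""
    parent = entropy(labels)
    left = [y for x, y in zip(feature_values, labels) if x <= threshold]
    right = [y for x, y in zip(feature_values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return parent - weighted

# Example: a feature that separates two classes perfectly at 0.5
x = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
y = ["a", "a", "a", "b", "b", "b"]
print(information_gain(x, y, 0.5))  # → 1.0 (pure children, maximal gain)
```

In the paper's approach, the class counts per partition and the parent entropy are produced by Hive aggregate queries over the distributed data, and a map-only MapReduce job then combines those counts into the final gain values, analogous to the last two lines of `information_gain` above.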
© 2018 Springer Nature Switzerland AG
Cite this paper
Bagui, S., John, S.K., Baggs, J.P., Bagui, S. (2018). A Parallel Implementation of Information Gain Using Hive in Conjunction with MapReduce for Continuous Features. In: Ganji, M., Rashidi, L., Fung, B., Wang, C. (eds) Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science, vol 11154. Springer, Cham. https://doi.org/10.1007/978-3-030-04503-6_28
Print ISBN: 978-3-030-04502-9
Online ISBN: 978-3-030-04503-6