Abstract
Feature selection is an important research topic in machine learning and pattern recognition. It reduces dimensionality, removes irrelevant data, increases learning accuracy, and makes results easier to interpret. In recent years, however, datasets in many applications have grown in both the number of instances and the number of features. Classical feature selection methods become impractical on such large-scale datasets because of their high computational cost. Parallel feature selection is an effective way to improve computational speed, and MapReduce is an efficient distributed computing model for large-scale data mining problems. In this paper, a parallel feature selection method based on the MapReduce model is proposed. The large-scale dataset is partitioned into sub-datasets, and feature selection is performed on each sub-dataset at a separate computational node. The features selected on the nodes are then combined into one feature vector in the Reduce job. The parallel feature selection method is scalable, and its efficiency is illustrated through example analysis.
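The partition, Map, and Reduce steps described in the abstract can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the method of the paper: the abstract does not name the selection criterion, so the sketch scores each feature by its absolute Pearson correlation with the label and keeps the top k per sub-dataset; the round-robin partitioning, the union in the Reduce step, and all function names are hypothetical; and the built-in `map` and `functools.reduce` stand in for a real MapReduce runtime.

```python
from functools import reduce
from math import sqrt


def partition(rows, n_parts):
    """Split the dataset into n_parts sub-datasets (round-robin; an assumption)."""
    return [rows[i::n_parts] for i in range(n_parts)]


def abs_pearson(xs, ys):
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / sqrt(var_x * var_y)


def map_select(sub_rows, k):
    """Map task: score every feature on one sub-dataset, keep the top k indices.

    Each row is (feature_0, ..., feature_m, label). The scoring rule is an
    assumed stand-in for whatever criterion a real implementation uses.
    """
    labels = [row[-1] for row in sub_rows]
    n_features = len(sub_rows[0]) - 1
    scores = [
        (abs_pearson([row[j] for row in sub_rows], labels), j)
        for j in range(n_features)
    ]
    # Highest score first; ties are broken by the lower feature index.
    top = sorted(scores, key=lambda s: (-s[0], s[1]))[:k]
    return {j for _, j in top}


def reduce_combine(selected_a, selected_b):
    """Reduce task: merge the feature sets chosen on two nodes (set union)."""
    return selected_a | selected_b


def parallel_feature_selection(rows, n_parts, k):
    """Partition, run the Map task on each part, then combine in Reduce."""
    parts = partition(rows, n_parts)
    selected_per_part = map(lambda part: map_select(part, k), parts)
    return sorted(reduce(reduce_combine, selected_per_part, set()))


# Toy data: feature 0 tracks the label, feature 1 is constant,
# feature 2 is uncorrelated with the label within each partition.
rows = [
    (0.1, 1.0, 3.0, 0),
    (0.2, 1.0, 1.0, 0),
    (0.9, 1.0, 2.0, 1),
    (0.8, 1.0, 4.0, 1),
    (0.0, 1.0, 2.0, 0),
    (0.3, 1.0, 4.0, 0),
    (1.0, 1.0, 3.0, 1),
    (0.7, 1.0, 1.0, 1),
]
print(parallel_feature_selection(rows, n_parts=2, k=1))  # prints [0]
```

Because each Map task sees only its own sub-dataset, the tasks are independent and can run on separate nodes; the Reduce step only merges small sets of feature indices, which is why a scheme of this shape scales with the number of partitions.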
Acknowledgments
This work is partially supported by the National Youth Science Foundation (No. 61004115), the National Science Foundation (No. 61272433), and the Provincial Fund for Nature project (No. ZR2010FQ018).
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Sun, Z. (2014). Parallel Feature Selection Based on MapReduce. In: Wong, W.E., Zhu, T. (eds) Computer Engineering and Networking. Lecture Notes in Electrical Engineering, vol 277. Springer, Cham. https://doi.org/10.1007/978-3-319-01766-2_35
DOI: https://doi.org/10.1007/978-3-319-01766-2_35
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-01765-5
Online ISBN: 978-3-319-01766-2
eBook Packages: Engineering, Engineering (R0)