Parallel Feature Selection Based on MapReduce

Sun, Zhanquan

doi:10.1007/978-3-319-01766-2_35

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 277)

Abstract

Feature selection is an important research topic in machine learning and pattern recognition. It is effective in reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving the comprehensibility of results. In recent years, however, datasets in many applications have grown in both the number of instances and the number of features. Classical feature selection methods become impractical on such large-scale datasets because of their high computational cost. Parallel feature selection is an efficient way to improve computational speed, and MapReduce is an efficient distributed computing model for large-scale data mining problems. This paper proposes a parallel feature selection method based on the MapReduce model. The large-scale dataset is partitioned into sub-datasets, and feature selection is performed on each computational node. The selected feature variables are then combined into one feature vector in the Reduce job. The method is scalable, and its efficiency is illustrated through an example analysis.
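
The partition, per-node selection, and Reduce-side combination described in the abstract can be sketched as follows. This is a minimal single-process illustration, not the chapter's implementation: the abstract does not state the selection criterion, so absolute Pearson correlation with the label is used here purely as a stand-in, the combination step is assumed to be a set union, and every function name (`partition`, `map_select`, `reduce_combine`, `parallel_feature_selection`) is hypothetical. A real deployment would run the map step on separate nodes of a MapReduce runtime.

```python
from functools import reduce
from statistics import mean


def abs_correlation(xs, ys):
    """Absolute Pearson correlation; 0.0 when either input is constant.

    Assumption: this score is a stand-in, the abstract names no criterion.
    """
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5


def partition(rows, n_parts):
    """Split (features, label) rows into n_parts sub-datasets, round-robin."""
    return [rows[i::n_parts] for i in range(n_parts)]


def map_select(sub_rows, k):
    """Map step: indices of the k best-scoring features on one sub-dataset."""
    labels = [label for _, label in sub_rows]
    n_features = len(sub_rows[0][0])
    scores = [
        abs_correlation([features[j] for features, _ in sub_rows], labels)
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def reduce_combine(selected_a, selected_b):
    """Reduce step (assumed): merge per-node selections by set union."""
    return selected_a | selected_b


def parallel_feature_selection(rows, n_parts, k):
    """Partition the data, select features per part, combine the results."""
    parts = partition(rows, n_parts)
    per_node = map(lambda part: map_select(part, k), parts)
    return sorted(reduce(reduce_combine, per_node, set()))
```

Calling `parallel_feature_selection(rows, n_parts=2, k=1)` on a list of `(features, label)` pairs returns the sorted indices of the features kept by at least one partition. Because each partition is scored independently, the map step needs no communication between nodes, which is what makes this kind of scheme easy to scale out.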

Acknowledgments

This work is partially supported by the National Youth Science Foundation (No. 61004115), the National Science Foundation (No. 61272433), and the Provincial Fund for Nature project (No. ZR2010FQ018).

Author information

Authors and Affiliations

Shandong Provincial Key Laboratory of Computer Network, Shandong Computer Science Center, Jinan, Shandong, 250014, China
Zhanquan Sun

Corresponding author

Correspondence to Zhanquan Sun.

Editor information

Editors and Affiliations

University of Texas at Dallas, Richardson, Texas, USA
W. Eric Wong
Chinese Academy of Sciences, Beijing, People’s Republic of China
Tingshao Zhu

Cite this paper

Sun, Z. (2014). Parallel Feature Selection Based on MapReduce. In: Wong, W.E., Zhu, T. (eds) Computer Engineering and Networking. Lecture Notes in Electrical Engineering, vol 277. Springer, Cham. https://doi.org/10.1007/978-3-319-01766-2_35

DOI: https://doi.org/10.1007/978-3-319-01766-2_35
Published: 05 December 2013
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-01765-5
Online ISBN: 978-3-319-01766-2
eBook Packages: Engineering, Engineering (R0)
