Parallel Feature Selection Based on MapReduce

  • Conference paper
Computer Engineering and Networking

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 277)

Abstract

Feature selection is an important research topic in machine learning and pattern recognition. It is effective in reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving result comprehensibility. In recent years, however, datasets in many applications have grown in both the number of instances and the number of features. Classical feature selection methods break down on large-scale datasets because of their high computational cost. Parallel feature selection is an efficient way to improve computational speed, and MapReduce is an efficient distributed computing model for large-scale data mining problems. In this paper, a parallel feature selection method based on the MapReduce model is proposed. The large-scale dataset is partitioned into sub-datasets, feature selection is performed on each computational node, and the selected feature variables are combined into one feature vector in the Reduce job. The parallel feature selection method is scalable. The efficiency of the method is illustrated through example analysis.
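The partition, per-node selection, and Reduce-side combination described in the abstract can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the paper's implementation: the abstract does not state the selection criterion or the combination rule, so the sketch assumes a simple filter (absolute Pearson correlation between each feature and the label, keeping the top `k` per sub-dataset) and combines the per-node results by set union. All function names (`partition`, `map_select`, `reduce_combine`, `parallel_feature_selection`) are hypothetical, and the map step runs sequentially here where a real MapReduce runtime would distribute it across nodes.

```python
from functools import reduce
from math import sqrt


def partition(rows, n_parts):
    """Split the dataset into n_parts sub-datasets (round-robin)."""
    return [rows[i::n_parts] for i in range(n_parts)]


def abs_correlation(xs, ys):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / sqrt(var_x * var_y)


def map_select(sub_rows, k):
    """Map task: score every feature on one sub-dataset, keep the top k.

    Each row is a (features, label) pair. The scoring rule is an
    assumption made for this sketch, not the criterion from the paper.
    """
    features = [row[0] for row in sub_rows]
    labels = [row[1] for row in sub_rows]
    n_features = len(features[0])
    scores = [
        abs_correlation([f[j] for f in features], labels)
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def reduce_combine(selected_a, selected_b):
    """Reduce task: merge two sets of selected feature indices (union)."""
    return selected_a | selected_b


def parallel_feature_selection(rows, n_parts, k):
    """Partition, select per sub-dataset, then combine in the reduce step."""
    sub_datasets = partition(rows, n_parts)
    per_node = (map_select(sub, k) for sub in sub_datasets)
    return sorted(reduce(reduce_combine, per_node, set()))
```

Calling `parallel_feature_selection(rows, n_parts=2, k=1)` on a list of `(features, label)` pairs returns a sorted list of feature indices. Because each map task only sees its own sub-dataset, the work per node shrinks as `n_parts` grows, which is the scalability property the abstract refers to; union is one of several plausible combination rules (voting or score averaging are others).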

Acknowledgments

This work is partially supported by the National Youth Science Foundation (No. 61004115), the National Science Foundation (No. 61272433), and the Provincial Fund for Nature project (No. ZR2010FQ018).

Author information

Corresponding author

Correspondence to Zhanquan Sun.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Sun, Z. (2014). Parallel Feature Selection Based on MapReduce. In: Wong, W.E., Zhu, T. (eds) Computer Engineering and Networking. Lecture Notes in Electrical Engineering, vol 277. Springer, Cham. https://doi.org/10.1007/978-3-319-01766-2_35

  • DOI: https://doi.org/10.1007/978-3-319-01766-2_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-01765-5

  • Online ISBN: 978-3-319-01766-2

  • eBook Packages: Engineering, Engineering (R0)
