Evaluating Feature Selection Robustness on High-Dimensional Data

  • Conference paper
Hybrid Artificial Intelligent Systems (HAIS 2018)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10870)

Abstract

With the explosive growth of high-dimensional data, feature selection has become a crucial step in machine learning tasks. Though most of the available works focus on devising selection strategies that are effective in identifying small subsets of predictive features, recent research has also highlighted the importance of investigating the robustness of the selection process with respect to sample variation. In the presence of a high number of features, indeed, the selection outcome can be very sensitive to any perturbation in the set of training records, which limits the interpretability of the results and their subsequent exploitation in real-world applications. This study aims to provide more insight into this critical issue by analysing the robustness of some state-of-the-art selection methods, for different levels of data perturbation and different cardinalities of the selected feature subsets. Furthermore, we explore the extent to which the adoption of an ensemble selection strategy can make these algorithms more robust, without compromising their predictive performance. The results on five high-dimensional datasets, representative of different domains, are presented and discussed.
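
As a rough illustration of the evaluation protocol the abstract describes (a sketch for clarity, not the paper's code or datasets), the following Python snippet perturbs a synthetic training set by subsampling, selects a fixed-size feature subset on each perturbed sample, and quantifies robustness as the average pairwise consistency of the selected subsets. The univariate ranker (scikit-learn's mutual_info_classif), the subset cardinality k, the simple rank-aggregation ensemble, and Kuncheva's consistency index as the stability measure are all illustrative assumptions rather than the specific methods evaluated in the paper.

```python
# Minimal sketch of a robustness-evaluation protocol (illustrative only):
# subsample the training data several times, select the top-k features on
# each subsample, and measure how consistent the selected subsets are.

import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif


def kuncheva_consistency(subsets, n_features):
    """Average pairwise Kuncheva consistency index over equal-size subsets."""
    k = len(next(iter(subsets)))            # shared subset cardinality
    vals = []
    for a, b in combinations(subsets, 2):
        r = len(a & b)                      # observed overlap
        expected = k * k / n_features       # overlap expected by chance
        vals.append((r - expected) / (k - expected))
    return float(np.mean(vals))


def select_top_k(X, y, k):
    """Rank features with a univariate score (stand-in ranker), keep top k."""
    scores = mutual_info_classif(X, y, random_state=0)
    return set(np.argsort(scores)[::-1][:k]), scores


def ensemble_select(X, y, k, n_bags=10, frac=0.9, rng=None):
    """Simple ensemble selector: aggregate feature ranks over subsamples."""
    rng = rng if rng is not None else np.random.default_rng(0)
    agg = np.zeros(X.shape[1])
    for _ in range(n_bags):
        idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
        _, scores = select_top_k(X[idx], y[idx], k)
        agg += np.argsort(np.argsort(scores))   # accumulate rank positions
    return set(np.argsort(agg)[::-1][:k])


if __name__ == "__main__":
    # Synthetic stand-in for a high-dimensional dataset (not the paper's data).
    X, y = make_classification(n_samples=200, n_features=300,
                               n_informative=20, random_state=42)
    rng = np.random.default_rng(42)
    k, n_perturbations = 20, 8
    single_runs, ensemble_runs = [], []
    for _ in range(n_perturbations):        # perturb by 90% subsampling
        idx = rng.choice(len(y), size=int(0.9 * len(y)), replace=False)
        single_runs.append(select_top_k(X[idx], y[idx], k)[0])
        ensemble_runs.append(ensemble_select(X[idx], y[idx], k, rng=rng))
    print("single selector consistency  :",
          round(kuncheva_consistency(single_runs, X.shape[1]), 3))
    print("ensemble selector consistency:",
          round(kuncheva_consistency(ensemble_runs, X.shape[1]), 3))
```

Under this kind of protocol, a higher consistency value indicates that the selected subsets overlap more across perturbed samples than would be expected by chance, so the comparison between the single and the aggregated selector gives a first, informal sense of the robustness gain an ensemble strategy may provide.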

Acknowledgments

This research was supported by Sardinia Regional Government, within the projects “DomuSafe” (L.R. 7/2007, annualità 2015, CRP 69) and “EmILIE” (L.R. 7/2007, annualità 2016, CUP F72F16003030002).

Author information

Corresponding author

Correspondence to Barbara Pes.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Pes, B. (2018). Evaluating Feature Selection Robustness on High-Dimensional Data. In: de Cos Juez, F., et al. Hybrid Artificial Intelligent Systems. HAIS 2018. Lecture Notes in Computer Science (LNAI), vol. 10870. Springer, Cham. https://doi.org/10.1007/978-3-319-92639-1_20

  • DOI: https://doi.org/10.1007/978-3-319-92639-1_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-92638-4

  • Online ISBN: 978-3-319-92639-1

  • eBook Packages: Computer Science, Computer Science (R0)
