Stability of Feature Selection Methods: A Study of Metrics Across Different Gene Expression Datasets

  • Conference paper
Bioinformatics and Biomedical Engineering (IWBBIO 2020)

Abstract

Analysis of gene expression data often requires selecting a subset of genes (features), and many feature selection (FS) methods have been devised for this purpose. However, different FS methods often generate different lists of features for the same dataset, and users then have to choose which list to use. One way to support this choice is to apply stability metrics to the generated lists and select a list on that basis. The aim of this study is to investigate the behavior of stability metrics applied to feature subsets generated by FS methods. The experiments compare several stability metrics across a range of gene expression datasets, FS methods, and expected numbers of features. The stability metrics were used to compare five FS methods (SVM, SAM, ReliefF, RFE + RF and LIMMA) on gene expression datasets from the EBI repository. Results show that the studied stability metrics display a high degree of variability. The reason for this is not yet clear and is under further investigation. The final objective of the research, defining how to select an FS method, is ongoing work whose partial findings are reported herein.
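As an illustration of what a stability metric computes, the sketch below scores a set of feature lists by their mean pairwise Jaccard similarity. This is an assumption chosen for illustration: the abstract does not name the metrics studied, and the function names (`jaccard`, `mean_pairwise_jaccard`) and gene identifiers are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two feature lists, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty selections are treated as identical (a convention).
        return 1.0
    return len(a & b) / len(union)


def mean_pairwise_jaccard(feature_lists):
    """Average Jaccard similarity over all pairs of feature lists.

    Values near 1 mean the selected subsets are nearly identical;
    values near 0 mean they barely overlap.
    """
    pairs = list(combinations(feature_lists, 2))
    if not pairs:
        raise ValueError("need at least two feature lists")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical gene lists, e.g. from one FS method run on three resamples.
runs = [
    ["g1", "g2", "g3", "g4"],
    ["g1", "g2", "g3", "g5"],
    ["g1", "g2", "g6", "g7"],
]
score = mean_pairwise_jaccard(runs)
print(f"{score:.3f}")  # prints 0.422
```

A score of this kind depends on the number of selected features and on how the lists were produced, which is one reason different metrics can disagree on the same lists.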

Author information

Corresponding author

Correspondence to Zahra Mungloo-Dilmohamud.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Mungloo-Dilmohamud, Z., Jaufeerally-Fakim, Y., Peña-Reyes, C. (2020). Stability of Feature Selection Methods: A Study of Metrics Across Different Gene Expression Datasets. In: Rojas, I., Valenzuela, O., Rojas, F., Herrera, L., Ortuño, F. (eds) Bioinformatics and Biomedical Engineering. IWBBIO 2020. Lecture Notes in Computer Science, vol 12108. Springer, Cham. https://doi.org/10.1007/978-3-030-45385-5_59

  • DOI: https://doi.org/10.1007/978-3-030-45385-5_59

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-45384-8

  • Online ISBN: 978-3-030-45385-5

  • eBook Packages: Computer Science, Computer Science (R0)
