Stability of Feature Selection Methods: A Study of Metrics Across Different Gene Expression Datasets

  • Zahra Mungloo-Dilmohamud
  • Yasmina Jaufeerally-Fakim
  • Carlos Peña-Reyes
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12108)

Abstract

Analysis of gene-expression data often requires selecting a subset of genes (features), and many feature selection (FS) methods have been devised for this purpose. However, FS methods often generate different lists of features for the same dataset, and users then have to choose which list to use. One approach to support this choice is to apply stability metrics to the generated lists and to select lists on that basis. The aim of this study is to investigate the behavior of stability metrics applied to feature subsets generated by FS methods. The experiments explore a range of gene expression datasets, FS methods, and expected numbers of features in order to compare several stability metrics. The stability metrics were used to compare five feature selection methods (SVM, SAM, ReliefF, RFE + RF and LIMMA) on gene expression datasets from the EBI repository. Results show that the studied stability metrics display a high degree of variability. The reason for this is not yet clear and is being investigated further. The final objective of the research, namely to define how to select an FS method, is ongoing work whose partial findings are reported herein.
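
To make the idea of scoring feature lists for stability concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not name the metrics studied, so the mean pairwise Jaccard index is used here purely as an illustrative assumption; the function names and the gene identifiers are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two feature subsets (assumed metric)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty subsets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


def mean_pairwise_stability(subsets):
    """Average Jaccard similarity over all pairs of feature subsets."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two feature subsets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical gene lists, e.g. from one FS method run on three resamples.
runs = [
    ["g1", "g2", "g3", "g4"],
    ["g1", "g2", "g3", "g5"],
    ["g1", "g2", "g6", "g7"],
]
score = mean_pairwise_stability(runs)
print(f"mean pairwise Jaccard stability: {score:.3f}")
```

A score near 1 would indicate that the runs select nearly the same genes, while a score near 0 would indicate little overlap. Other stability metrics may rank the same lists differently, which is the kind of variability the study examines.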

Keywords

Stability, Stability metrics, FS methods, Gene expression data

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. University of Mauritius, Reduit, Mauritius
  2. School of Business and Engineering Vaud (HEIG-VD), Swiss Institute of Bioinformatics (SIB), CI4CB, Computational Intelligence for Computational Biology Group, University of Applied Sciences Western Switzerland (HES-SO), Yverdon-les-Bains, Switzerland
