Feature Selection Applied to Microarray Data

Part of the book series: Methods in Molecular Biology (MIMB, volume 1986)

Abstract

Microarray data typically have a very large number of features (on the order of thousands), while the number of examples is usually fewer than 100. In microarray classification, this poses a challenge for machine learning methods, which can overfit and therefore perform worse. A common solution is to apply a dimensionality reduction technique before classification in order to reduce the number of features. This chapter focuses on one of the best-known dimensionality reduction techniques: feature selection. We will see how feature selection can help improve classification accuracy in several microarray data scenarios.
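To make the idea concrete, the following minimal sketch shows one simple form of feature selection, a univariate filter, applied to data with the shape described above (60 examples, 2000 features). It is an illustration only and is not taken from the chapter: the synthetic data, the signal-to-noise score, the helper names (`make_toy_data`, `signal_to_noise`, `select_top_k`), and the choice of k = 10 are all assumptions made for this example.

```python
import random
import statistics


def make_toy_data(n_samples=60, n_features=2000, n_informative=5,
                  shift=4.0, seed=0):
    """Synthetic two-class data (an assumption, not real microarray data).

    Only the first `n_informative` features differ between classes;
    every other feature is pure Gaussian noise.
    """
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2  # alternate classes so both have 30 examples
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        if label == 1:
            for j in range(n_informative):
                row[j] += shift
        X.append(row)
        y.append(label)
    return X, y


def signal_to_noise(X, y):
    """Score each feature by |mean_0 - mean_1| / (std_0 + std_1)."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        spread = statistics.stdev(a) + statistics.stdev(b)
        gap = abs(statistics.mean(a) - statistics.mean(b))
        scores.append(gap / spread if spread > 0 else 0.0)
    return scores


def select_top_k(scores, k):
    """Return the indices of the k highest-scoring features."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


X, y = make_toy_data()
scores = signal_to_noise(X, y)
selected = select_top_k(scores, k=10)

# Keep only the selected columns; a classifier would be trained on X_reduced.
X_reduced = [[row[j] for j in selected] for row in X]
print(len(X[0]), "->", len(X_reduced[0]), "features")
```

In a real evaluation the scores should be computed on the training portion of each cross-validation fold only; selecting features on the full dataset before splitting leaks label information and tends to overstate accuracy.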

Author information

Corresponding author

Correspondence to Verónica Bolón-Canedo.

Copyright information

© 2019 Springer Science+Business Media, LLC, part of Springer Nature

About this protocol

Cite this protocol

Alonso-Betanzos, A., Bolón-Canedo, V., Morán-Fernández, L., Seijo-Pardo, B. (2019). Feature Selection Applied to Microarray Data. In: Bolón-Canedo, V., Alonso-Betanzos, A. (eds) Microarray Bioinformatics. Methods in Molecular Biology, vol 1986. Humana, New York, NY. https://doi.org/10.1007/978-1-4939-9442-7_6

  • DOI: https://doi.org/10.1007/978-1-4939-9442-7_6

  • Publisher Name: Humana, New York, NY

  • Print ISBN: 978-1-4939-9441-0

  • Online ISBN: 978-1-4939-9442-7

  • eBook Packages: Springer Protocols
