
A Review of Feature Reduction Methods for QSAR-Based Toxicity Prediction

  • Chapter

Part of the book series: Challenges and Advances in Computational Chemistry and Physics (COCH, volume 30)

Abstract

Thousands of molecular descriptors (1D to 4D) can be generated and used as features to model quantitative structure–activity or structure–toxicity relationships (QSAR or QSTR) for chemical toxicity prediction. Models built on such feature sets often suffer from the “curse of dimensionality”, a problem that arises in machine learning when the number of features is large relative to the number of training samples. Here we discuss methods for eliminating redundant and irrelevant features to improve prediction performance, increase interpretability, and reduce computational cost. Several feature selection and feature extraction methods are summarized along with their strengths and shortcomings. We also highlight commonly overlooked challenges, such as algorithm instability and selection bias, and offer possible solutions.
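To make the idea of eliminating redundant and irrelevant features concrete, the following is a minimal, illustrative sketch of two simple filter steps — dropping zero-variance descriptors and one member of each highly correlated pair. The descriptor matrix, the 0.95 correlation cutoff, and the column indices are synthetic assumptions for illustration only, not data or thresholds from this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical descriptor matrix: 100 molecules x 8 descriptors.
X = rng.normal(size=(100, 8))
X[:, 3] = 0.0  # a constant (zero-variance) descriptor
X[:, 5] = 2.0 * X[:, 1] + rng.normal(scale=0.01, size=100)  # near-duplicate of column 1

# Step 1: drop low-variance descriptors.
idx = np.flatnonzero(X.var(axis=0) > 1e-8)

# Step 2: among the survivors, drop one member of each highly correlated pair.
corr = np.abs(np.corrcoef(X[:, idx], rowvar=False))
drop = set()
for i in range(len(idx)):
    for j in range(i + 1, len(idx)):
        if j not in drop and corr[i, j] > 0.95:
            drop.add(j)

selected = [int(idx[k]) for k in range(len(idx)) if k not in drop]
print(selected)  # -> [0, 1, 2, 4, 6, 7] (columns 3 and 5 removed)
```

Such filters are cheap and model-agnostic, which is why they are commonly applied before the more expensive wrapper or embedded methods surveyed in the chapter.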


Abbreviations

1D: One-dimensional
2D: Two-dimensional
3D: Three-dimensional
4D: Four-dimensional
ACO: Ant colony optimization
ECFP: Extended-connectivity fingerprints
GA: Genetic algorithm
KPCA: Kernel principal component analysis
LASSO: Least absolute shrinkage and selection operator
LDA: Linear discriminant analysis
LOOCV: Leave-one-out cross-validation
MACCS: Molecular access system
MDS: Multidimensional scaling
PCA: Principal component analysis
PSO: Particle swarm optimization
QSAR: Quantitative structure–activity relationship
QSTR: Quantitative structure–toxicity relationship
RFE: Recursive feature elimination
SA: Simulated annealing
SAR: Structure–activity relationship
SFFS: Sequential floating forward selection
SFS: Sequential forward selection
STR: Structure–toxicity relationship
SVM: Support vector machine
Tox21: Toxicology in the twenty-first century
t-SNE: t-distributed stochastic neighbor embedding
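Several of the abbreviations above (PCA, KPCA, LDA, t-SNE, MDS) denote feature extraction methods, which construct a small set of new features as combinations of the original descriptors rather than selecting a subset of them. As a minimal, illustrative sketch with synthetic data (not from the chapter), PCA can be obtained from the singular value decomposition of the centered descriptor matrix:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical descriptor matrix: 200 molecules x 10 descriptors.
X = rng.normal(size=(200, 10))

# Center, then take the SVD; the rows of Vt are the principal axes.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 3                               # keep the top 3 components
scores = Xc @ Vt[:k].T              # projected data, shape (200, 3)
explained = S[:k] ** 2 / np.sum(S ** 2)  # fraction of variance per component
print(scores.shape, explained)
```

Unlike feature selection, the extracted components are linear mixtures of all descriptors, which is the interpretability trade-off the abstract alludes to.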


Author information

Correspondence to Chaoyang Zhang.


Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter


Cite this chapter

Idakwo, G., Luttrell IV, J., Chen, M., Hong, H., Gong, P., Zhang, C. (2019). A Review of Feature Reduction Methods for QSAR-Based Toxicity Prediction. In: Hong, H. (ed) Advances in Computational Toxicology. Challenges and Advances in Computational Chemistry and Physics, vol 30. Springer, Cham. https://doi.org/10.1007/978-3-030-16443-0_7
