FS4RVDD: A Feature Selection Algorithm for Random Variables with Discrete Distribution

Part of the Communications in Computer and Information Science book series (CCIS, volume 855)

Abstract

Feature selection is a crucial step in inferring regression and classification models for QSPR (Quantitative Structure–Property Relationship) studies in Cheminformatics. A particularly complex case of QSPR modelling arises in Polymer Informatics, because the features under analysis require the management of uncertainty. This paper presents a novel feature selection method for this special QSPR scenario. The proposed methodology assumes that each feature is characterized by a probability distribution of values associated with the polydispersity of the polymers included in the training dataset. The algorithm has two sequential steps: a ranking of the features, generated by correlation analysis, and an iterative subset reduction, obtained by feature redundancy analysis. A prototype of the algorithm has been implemented as a proof of concept. Its performance has been evaluated on synthetic datasets of different sizes, varying the cardinality of the selected feature subsets. These preliminary results suggest that the chosen mathematical representation and the proposed method are suitable for managing the uncertainty inherent to polymerization. Nevertheless, this research is work in progress, and additional experiments should be conducted in order to assess the actual benefits and limitations of the methodology.
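The two-step structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state which correlation measure or redundancy criterion FS4RVDD uses, so the sketch assumes a moment-based Pearson correlation, treats the values of two features within a sample as independent, and replaces the iterative reduction with a simple greedy filter. All names (`moments`, `correlation`, `rank_features`, `reduce_subset`, `max_redundancy`) and the toy data are hypothetical.

```python
from math import sqrt

# A feature value for one sample is a discrete distribution:
# a list of (value, probability) pairs whose probabilities sum to 1.

def moments(dist):
    """First and second raw moments of one discrete distribution."""
    m1 = sum(v * p for v, p in dist)
    m2 = sum(v * v * p for v, p in dist)
    return m1, m2

def correlation(col_a, col_b):
    """Pearson correlation of two columns of per-sample distributions.

    Assumption (not from the paper): within a sample, the two variables
    are independent, so E[AB] for that sample is E[A] * E[B]. The
    within-sample spread still enters through the variances.
    """
    n = len(col_a)
    a1, a2 = zip(*(moments(d) for d in col_a))
    b1, b2 = zip(*(moments(d) for d in col_b))
    mean_a, mean_b = sum(a1) / n, sum(b1) / n
    var_a = sum(a2) / n - mean_a ** 2
    var_b = sum(b2) / n - mean_b ** 2
    if var_a <= 1e-12 or var_b <= 1e-12:
        return 0.0  # a constant column carries no correlation signal
    cov = sum(x * y for x, y in zip(a1, b1)) / n - mean_a * mean_b
    return cov / sqrt(var_a * var_b)

def rank_features(data, target):
    """Step 1: rank features by |correlation| with the target."""
    target_col = [[(y, 1.0)] for y in target]  # target as point masses
    scores = {name: abs(correlation(col, target_col))
              for name, col in data.items()}
    return sorted(scores, key=scores.get, reverse=True)

def reduce_subset(data, ranking, k, max_redundancy=0.9):
    """Step 2 (greedy stand-in): keep up to k features, skipping any
    whose |correlation| with an already kept feature is too high."""
    selected = []
    for name in ranking:
        if len(selected) == k:
            break
        if all(abs(correlation(data[name], data[s])) < max_redundancy
               for s in selected):
            selected.append(name)
    return selected

# Toy data: four samples, three features.
target = [1.0, 2.0, 3.0, 4.0]
data = {
    "a": [[(i - 0.5, 0.5), (i + 0.5, 0.5)] for i in (1, 2, 3, 4)],  # spread around i
    "b": [[(2.0 * i, 1.0)] for i in (1, 2, 3, 4)],                  # exact, 2 * i
    "c": [[(2.0, 1.0)], [(1.0, 1.0)], [(2.0, 1.0)], [(1.0, 1.0)]],  # unrelated
}
ranking = rank_features(data, target)
selected = reduce_subset(data, ranking, k=2)
print(ranking, selected)
```

In this toy case feature "a" tracks the target as closely as "b" on average, but its within-sample spread lowers its correlation, so it ranks second and is then discarded as redundant with "b". This is one way uncertainty in the feature values can influence selection; the paper's own criteria may differ.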

Keywords

  • Feature selection
  • QSPR
  • Polymer informatics

Acknowledgments

This work is kindly supported by CONICET, grant PIP 112-2012-0100471 and UNS, grants PGI 24/N042 and PGI 24/ZM17.

Author information

Corresponding author

Correspondence to Ignacio Ponzoni.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Cravero, F., Schustik, S., Martínez, M.J., Díaz, M.F., Ponzoni, I. (2018). FS4RVDD: A Feature Selection Algorithm for Random Variables with Discrete Distribution. In: Medina, J., Ojeda-Aciego, M., Verdegay, J., Perfilieva, I., Bouchon-Meunier, B., Yager, R. (eds) Information Processing and Management of Uncertainty in Knowledge-Based Systems. Applications. IPMU 2018. Communications in Computer and Information Science, vol 855. Springer, Cham. https://doi.org/10.1007/978-3-319-91479-4_18

  • DOI: https://doi.org/10.1007/978-3-319-91479-4_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-91478-7

  • Online ISBN: 978-3-319-91479-4

  • eBook Packages: Computer Science, Computer Science (R0)