Abstract
Feature selection is a crucial step when inferring regression and classification models for QSPR (Quantitative Structure–Property Relationship) studies in cheminformatics. A particularly complex case of QSPR modelling occurs in polymer informatics, because the features under analysis carry uncertainty that must be managed. This paper presents a novel feature selection method for this QSPR scenario. The proposed methodology assumes that each feature is characterized by a probability distribution of values associated with the polydispersity of the polymers included in the training dataset. The algorithm has two sequential steps: a ranking of the features, generated by correlation analysis, and an iterative subset reduction, obtained by feature redundancy analysis. A prototype of the algorithm has been implemented as a proof of concept. Its performance has been evaluated on synthetic datasets of different sizes, varying the cardinality of the selected feature subsets. These preliminary results suggest that the chosen mathematical representation and the proposed method are suitable for managing the uncertainty inherent to polymerization. Nevertheless, this research is work in progress, and additional experiments are needed to assess the actual benefits and limitations of the methodology.
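The two-step structure described in the abstract (rank, then reduce by redundancy) can be illustrated with a minimal sketch. This is not the FS4RVDD implementation: the abstract does not specify the correlation or redundancy measures, so the sketch assumes each discrete distribution is collapsed to its expected value and that Pearson correlation is used for both steps. The names `expected_value`, `pearson`, `select_features`, and the `redundancy_threshold` parameter are hypothetical.

```python
from math import sqrt


def expected_value(dist):
    """Mean of a discrete distribution given as [(value, probability), ...]."""
    return sum(v * p for v, p in dist)


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def select_features(data, target, k, redundancy_threshold=0.9):
    """Return up to k feature names.

    data maps a feature name to one discrete distribution per sample.
    ASSUMPTION: distributions are summarized by their expected value;
    the original method may use the full distribution instead.
    """
    means = {name: [expected_value(d) for d in dists]
             for name, dists in data.items()}

    # Step 1: rank features by absolute correlation with the target property.
    ranked = sorted(means,
                    key=lambda f: abs(pearson(means[f], target)),
                    reverse=True)

    # Step 2: walk the ranking and skip any feature that is too strongly
    # correlated with a feature already kept (greedy redundancy filter).
    selected = []
    for f in ranked:
        if len(selected) == k:
            break
        if all(abs(pearson(means[f], means[g])) < redundancy_threshold
               for g in selected):
            selected.append(f)
    return selected
```

Collapsing each distribution to its mean discards exactly the spread that polydispersity introduces, so a closer reading of the method would replace `pearson` with a measure defined on the distributions themselves.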
Keywords
- Feature selection
- QSPR
- Polymer informatics
Acknowledgments
This work is kindly supported by CONICET, grant PIP 112-2012-0100471 and UNS, grants PGI 24/N042 and PGI 24/ZM17.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Cravero, F., Schustik, S., Martínez, M.J., Díaz, M.F., Ponzoni, I. (2018). FS4RVDD: A Feature Selection Algorithm for Random Variables with Discrete Distribution. In: Medina, J., Ojeda-Aciego, M., Verdegay, J., Perfilieva, I., Bouchon-Meunier, B., Yager, R. (eds) Information Processing and Management of Uncertainty in Knowledge-Based Systems. Applications. IPMU 2018. Communications in Computer and Information Science, vol 855. Springer, Cham. https://doi.org/10.1007/978-3-319-91479-4_18
DOI: https://doi.org/10.1007/978-3-319-91479-4_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-91478-7
Online ISBN: 978-3-319-91479-4
eBook Packages: Computer Science, Computer Science (R0)