Abstract
Bioinformatics is an interdisciplinary field that combines computer science, statistics, mathematics, and engineering to analyze and interpret biological data. In systems biology, analytical methods such as mass spectrometry and DNA sequencing generate large amounts of data, and advanced statistical and bioinformatics tools are urgently needed to analyze them. In this study, machine-learning methods using the H2O package in R were proposed to classify 341 volatile organic compounds (VOCs) based on their molecular structure. The molecules were represented by nine types of molecular fingerprints, including one newly proposed fingerprint (COMBINE), and 72 classification models were generated to predict the biological activities of VOCs using four machine-learning methods: deep neural network (DNN), gradient boosting machine (GBM), random forest (RF), and generalized linear model (GLM). The models were evaluated on an external validation set containing 120 VOCs from other sources. The best classification model was obtained with the COMBINE fingerprint trained with the GBM method, which achieved a predictive accuracy of 94.4% and a mean-squared error (MSE) of 0.3952804. We found that the combination of molecular fingerprints and machine-learning methods can be used to predict the biological activities of VOCs. We recommend the COMBINE fingerprint trained with the GBM method for classifying VOCs. The GBM method has an advantage in terms of computational speed and requires fewer parameters to be optimized compared with the other machine-learning methods.
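The abstract reports both a predictive accuracy and an MSE for a classifier evaluated on an external validation set. As a rough illustration of how the two metrics can be computed from predicted class probabilities, here is a minimal Python sketch. It is not the H2O R code used in the study. It assumes that MSE is the mean of (1 - p)^2, where p is the probability the model assigns to the true class, which is one common convention for classifiers; the labels and probabilities below are made up for illustration.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], probs: Sequence[Sequence[float]]) -> float:
    """Fraction of samples whose highest-probability class equals the true class."""
    hits = 0
    for label, row in zip(y_true, probs):
        predicted = max(range(len(row)), key=row.__getitem__)
        hits += int(predicted == label)
    return hits / len(y_true)


def classification_mse(y_true: Sequence[int], probs: Sequence[Sequence[float]]) -> float:
    """Mean of (1 - p_true)^2, with p_true the probability given to the true class.

    Assumption: this convention is chosen for illustration only.
    """
    squared_errors = [(1.0 - row[label]) ** 2 for label, row in zip(y_true, probs)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical validation data: 4 compounds, 3 activity classes.
y_true = [0, 1, 2, 1]
probs = [
    [0.9, 0.05, 0.05],
    [0.2, 0.7, 0.1],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],  # highest probability is class 0, so this one is misclassified
]

acc = accuracy(y_true, probs)
mse = classification_mse(y_true, probs)
print(f"accuracy={acc:.3f}  mse={mse:.4f}")
```

Under this convention a model can be accurate and still have a noticeable MSE when its correct predictions are made with low confidence, so the two numbers are best read together.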
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Abdullah, A.A., Kanaya, S. (2019). Machine Learning Using H2O R Package: An Application in Bioinformatics. In: Kor, LK., Ahmad, AR., Idrus, Z., Mansor, K. (eds) Proceedings of the Third International Conference on Computing, Mathematics and Statistics (iCMS2017). Springer, Singapore. https://doi.org/10.1007/978-981-13-7279-7_46
DOI: https://doi.org/10.1007/978-981-13-7279-7_46
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-7278-0
Online ISBN: 978-981-13-7279-7
eBook Packages: Computer Science, Computer Science (R0)