Abstract

Bioinformatics is an interdisciplinary field that combines computer science, statistics, mathematics, and engineering to analyze and interpret biological data. In systems biology, analytical methods such as mass spectrometry and DNA sequencing generate large amounts of data, and advanced statistical and bioinformatics tools are urgently needed to analyze them. In this study, machine-learning methods from the H2O package in R were applied to classify 341 volatile organic compounds (VOCs) based on their molecular structure. The molecules were represented by nine types of molecular fingerprints, including one newly proposed fingerprint (COMBINE), and 72 classification models were generated to predict the biological activities of VOCs using four machine-learning methods: deep neural network (DNN), gradient boosting machine (GBM), random forest (RF), and generalized linear model (GLM). The models were evaluated on an external validation set containing 120 VOCs from other sources. The best classification model was obtained with the COMBINE fingerprint trained with GBM, with a predictive accuracy of 94.4% and a mean squared error (MSE) of 0.3952804. These results indicate that the combination of molecular fingerprints and machine-learning methods can be used to predict the biological activities of VOCs. We recommend the COMBINE fingerprint trained with GBM for classifying VOCs. GBM has an advantage in terms of computational speed and requires fewer parameters to optimize compared with the other machine-learning methods.
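
As a rough illustration of the two evaluation figures quoted above, the sketch below computes predictive accuracy and MSE for a small validation set. It is not the authors' code and does not use H2O: the binary labels, the predicted probabilities, the 0.5 threshold, and the function names are all assumptions made for illustration, and H2O's own MSE definition for multi-class models may differ.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], probs: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of compounds whose thresholded prediction matches the label."""
    if len(labels) != len(probs) or len(labels) == 0:
        raise ValueError("labels and probs must be non-empty and equal in length")
    hits = sum(1 for y, p in zip(labels, probs) if (p >= threshold) == bool(y))
    return hits / len(labels)


def mean_squared_error(labels: Sequence[int], probs: Sequence[float]) -> float:
    """Mean of squared differences between 0/1 labels and predicted probabilities."""
    if len(labels) != len(probs) or len(labels) == 0:
        raise ValueError("labels and probs must be non-empty and equal in length")
    return sum((y - p) ** 2 for y, p in zip(labels, probs)) / len(labels)


# Hypothetical external validation data (not taken from the paper):
# 1 = active, 0 = inactive, with a model's predicted probability of "active".
labels = [1, 0, 1, 1, 0]
probs = [0.9, 0.2, 0.6, 0.4, 0.1]

print("accuracy:", accuracy(labels, probs))
print("MSE:", mean_squared_error(labels, probs))
```

Accuracy only counts thresholded hits, whereas MSE also penalizes correct but unconfident predictions, which is why the two measures are reported separately.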

Author information

Corresponding author

Correspondence to Azian Azamimi Abdullah.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Abdullah, A.A., Kanaya, S. (2019). Machine Learning Using H2O R Package: An Application in Bioinformatics. In: Kor, LK., Ahmad, AR., Idrus, Z., Mansor, K. (eds) Proceedings of the Third International Conference on Computing, Mathematics and Statistics (iCMS2017). Springer, Singapore. https://doi.org/10.1007/978-981-13-7279-7_46

  • DOI: https://doi.org/10.1007/978-981-13-7279-7_46

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-7278-0

  • Online ISBN: 978-981-13-7279-7

  • eBook Packages: Computer Science, Computer Science (R0)
