Machine Learning Using H2O R Package: An Application in Bioinformatics

  • Azian Azamimi Abdullah
  • Shigehiko Kanaya
Conference paper


Bioinformatics is an interdisciplinary field that combines computer science, statistics, mathematics, and engineering to analyze and interpret biological data. In systems biology, analytical methods such as mass spectrometry and DNA sequencing generate large amounts of data, and advanced statistical and bioinformatics tools are urgently needed to analyze them. In this study, machine-learning methods from the H2O package in R were proposed to classify 341 volatile organic compounds (VOCs) based on their molecular structure. The molecules were represented by nine types of molecular fingerprints, including one newly proposed fingerprint (COMBINE), and 72 classification models were generated to predict the biological activities of VOCs using four machine-learning methods: deep neural network (DNN), gradient boosting machine (GBM), random forest (RF), and generalized linear model (GLM). The models were evaluated on an external validation set containing 120 VOCs from other sources. The best classification model used the COMBINE fingerprint trained with the GBM method, with a predictive accuracy of 94.4% and a mean-squared error (MSE) of 0.3952804. These results show that combining molecular fingerprints with machine-learning methods can be used to predict the biological activities of VOCs. We recommend the COMBINE fingerprint trained with the GBM method for classifying VOCs. GBM has an advantage in computational speed and requires fewer parameters to optimize compared with the other machine-learning methods.
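
The two evaluation figures quoted above, predictive accuracy and MSE, can be illustrated with a minimal sketch. This is not the authors' H2O code. It assumes that each model outputs one probability per class and that MSE is the mean of the squared difference between 1 and the probability assigned to the true class, which is one common convention for classifiers; the abstract does not state which definition was used. The function name `evaluate`, the class names, and the toy data are hypothetical.

```python
def evaluate(true_labels, predicted_probs):
    """Return (accuracy, mse) for class-probability predictions.

    true_labels: list of class names, one per compound.
    predicted_probs: list of dicts mapping class name -> predicted probability.
    MSE convention (an assumption): mean of (1 - p_true_class) ** 2.
    """
    if not true_labels or len(true_labels) != len(predicted_probs):
        raise ValueError("inputs must be non-empty and of equal length")

    correct = 0
    squared_error = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        # The predicted class is the one with the highest probability.
        predicted = max(probs, key=probs.get)
        if predicted == label:
            correct += 1
        # Penalise the probability mass missing from the true class.
        squared_error += (1.0 - probs.get(label, 0.0)) ** 2

    n = len(true_labels)
    return correct / n, squared_error / n


# Hypothetical toy validation set: four compounds, two activity classes.
labels = ["class_a", "class_b", "class_a", "class_b"]
probs = [
    {"class_a": 0.9, "class_b": 0.1},
    {"class_a": 0.2, "class_b": 0.8},
    {"class_a": 0.4, "class_b": 0.6},  # misclassified
    {"class_a": 0.3, "class_b": 0.7},
]

accuracy, mse = evaluate(labels, probs)
print(f"accuracy={accuracy:.3f}, mse={mse:.3f}")
```

Under this convention, accuracy only counts whether the top-ranked class is right, while MSE also reflects how confident the model was, so a model can score high on accuracy and still have a noticeable MSE.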



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. Biomedical Electronic Engineering Programme, School of Mechatronic Engineering, Universiti Malaysia Perlis, Pauh Putra Campus, Arau, Malaysia
  2. Computational Systems Biology Laboratory, Graduate School of Information Science, Nara Institute of Science and Technology, Ikoma, Japan
