Machine Learning Using H2O R Package: An Application in Bioinformatics
Bioinformatics is an interdisciplinary field that combines computer science, statistics, mathematics, and engineering to analyze and interpret biological data. In systems biology, analytical methods such as mass spectrometry and DNA sequencing generate large amounts of data, and advanced statistical and bioinformatics tools are urgently needed to analyze them. In this study, we propose machine-learning methods, implemented with the H2O package in R, to classify 341 volatile organic compounds (VOCs) based on their molecular structure. The molecules were represented by nine types of molecular fingerprints, including one newly proposed fingerprint (COMBINE). From these, 72 classification models were generated to predict the biological activities of VOCs using four machine-learning methods: deep neural network (DNN), gradient boosting machine (GBM), random forest (RF), and generalized linear model (GLM). The models were evaluated on an external validation set containing 120 VOCs from other sources. The best classification model was obtained with the COMBINE fingerprint trained with the GBM method, which reached a predictive accuracy of 94.4% and a mean-squared error (MSE) of 0.3952804. We found that the combination of molecular fingerprints and machine-learning methods can be used to predict the biological activities of VOCs. We recommend the COMBINE fingerprint trained with the GBM method for classifying VOCs. The GBM method has an advantage in computational speed and requires fewer parameters to optimize than the other machine-learning methods.
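The two evaluation figures quoted above, predictive accuracy and MSE, can be illustrated with a minimal sketch. It is written in plain Python rather than with the H2O R API, and it assumes a binary activity label, a 0.5 decision threshold, and an MSE taken between the predicted probability and the 0/1 label; the abstract does not state how either figure was computed, so these choices and the toy data are hypothetical.

```python
def accuracy(labels, probs, threshold=0.5):
    """Fraction of compounds whose thresholded prediction matches the label.

    The 0.5 threshold is an assumption for illustration only.
    """
    predicted = [1 if p >= threshold else 0 for p in probs]
    correct = sum(1 for y, y_hat in zip(labels, predicted) if y == y_hat)
    return correct / len(labels)


def mean_squared_error(labels, probs):
    """Mean squared difference between the predicted probability and the 0/1 label."""
    return sum((y - p) ** 2 for y, p in zip(labels, probs)) / len(labels)


# Hypothetical external validation set: true activity labels and the
# predicted probabilities of the "active" class from some trained model.
labels = [1, 0, 1, 1, 0]
probs = [0.9, 0.2, 0.6, 0.4, 0.1]

print("accuracy:", accuracy(labels, probs))
print("MSE:", mean_squared_error(labels, probs))
```

Accuracy only counts whether the thresholded prediction is right, while MSE also penalizes predictions that are correct but not confident, so the two figures can rank models differently.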