Random Forest for Bioinformatics

Chapter

Abstract

Modern biology increasingly relies on machine learning techniques to analyze large-scale, complex biological data. In bioinformatics, the Random Forest (RF) [6] technique is a popular choice: it builds an ensemble of decision trees and incorporates feature selection and feature interactions naturally into the learning process. It is nonparametric, interpretable, and efficient, and it achieves high prediction accuracy on many types of data. Recent work in computational biology has made growing use of RF, owing to its advantages in handling small sample sizes, high-dimensional feature spaces, and complex data structures.
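As a rough illustration of the idea summarized above (bootstrap-resampled decision trees, a random subset of candidate features at each split, and a majority vote), the following minimal sketch uses only the Python standard library. It is not the chapter's implementation: the function names, the parameters `n_trees`, `mtry`, and `max_depth`, the synthetic data, and the split-frequency count used as a crude stand-in for a variable importance measure are all assumptions made for illustration.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y, features):
    """Return the (feature, threshold) that most reduces weighted Gini, or None."""
    best, best_score = None, gini(y)
    n = len(y)
    for f in features:
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold: midpoint of neighbouring values
            left = [y[i] for i in range(n) if X[i][f] <= t]
            right = [y[i] for i in range(n) if X[i][f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best


def grow_tree(X, y, mtry, rng, depth=0, max_depth=5):
    """Grow one tree; each node considers only `mtry` randomly chosen features.

    Leaves are class labels; internal nodes are (feature, threshold, left, right).
    Labels must not be tuples, since tuples mark internal nodes here.
    """
    majority = Counter(y).most_common(1)[0][0]
    if len(set(y)) == 1 or depth == max_depth:
        return majority
    split = best_split(X, y, rng.sample(range(len(X[0])), mtry))
    if split is None:
        return majority
    f, t = split
    left = [i for i in range(len(y)) if X[i][f] <= t]
    right = [i for i in range(len(y)) if X[i][f] > t]
    return (f, t,
            grow_tree([X[i] for i in left], [y[i] for i in left],
                      mtry, rng, depth + 1, max_depth),
            grow_tree([X[i] for i in right], [y[i] for i in right],
                      mtry, rng, depth + 1, max_depth))


def predict_tree(node, row):
    """Follow the splits from the root until a leaf label is reached."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def fit_forest(X, y, n_trees=25, mtry=None, seed=0):
    """Fit `n_trees` trees, each on a bootstrap sample of the rows."""
    rng = random.Random(seed)
    mtry = mtry or max(1, int(len(X[0]) ** 0.5))  # common default: sqrt(#features)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(y)) for _ in range(len(y))]  # sample with replacement
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], mtry, rng))
    return forest


def predict_forest(forest, row):
    """Majority vote over the trees."""
    return Counter(predict_tree(tree, row) for tree in forest).most_common(1)[0][0]


def split_counts(node, counts):
    """Count how often each feature is used in a split (a crude importance proxy)."""
    if isinstance(node, tuple):
        counts[node[0]] += 1
        split_counts(node[2], counts)
        split_counts(node[3], counts)


# Synthetic "small n, large p" data: 20 samples, 50 features, of which only the
# first three carry class signal (an assumption made purely for this demo).
rng = random.Random(1)
y = [i % 2 for i in range(20)]
X = [[rng.gauss(2.0 * label if f < 3 else 0.0, 1.0) for f in range(50)] for label in y]

forest = fit_forest(X, y, n_trees=50, seed=0)
counts = Counter()
for tree in forest:
    split_counts(tree, counts)
print(predict_forest(forest, [2.0] * 3 + [0.0] * 47))
print(counts.most_common(5))
```

The split-frequency count is only a stand-in; the permutation-based and Gini-based importance measures discussed in the literature cited below are defined differently and are generally preferred in practice.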

References

  1. Altmann, A., Toloşi, L., Sander, O., Lengauer, T.: Permutation importance: a corrected feature importance measure. Bioinformatics 26(10), 1340 (2010)
  2. Amaratunga, D., Cabrera, J., Lee, Y.: Enriched random forests. Bioinformatics 24(18), 2010 (2008)
  3. Bao, L., Zhou, M., Cui, Y.: nsSNPAnalyzer: identifying disease-associated nonsynonymous single nucleotide polymorphisms. Nucleic Acids Research 33(suppl 2), W480 (2005)
  4. Barenboim, M., Masso, M., Vaisman, I., Jamison, D.: Statistical geometry based prediction of nonsynonymous SNP functional effects using random forest and neuro-fuzzy classifiers. Proteins: Structure, Function, and Bioinformatics 71(4), 1930–1939 (2008)
  5. Barrett, J., Cairns, D.: Application of the random forest classification method to peaks detected from mass spectrometric proteomic profiles of cancer patients and controls. Statistical Applications in Genetics and Molecular Biology 7(2), 4 (2008)
  6. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001). DOI 10.1023/A:1010933404324
  7. Bureau, A., Dupuis, J., Falls, K., Lunetta, K.L., Hayward, B., Keith, T.P., Van Eerdewegh, P.: Identifying SNPs predictive of phenotype using random forests. Genet Epidemiol 28(2), 171–82 (2005). DOI 10.1002/gepi.20041
  8. Chen, X., Jeong, J.: Sequence-based prediction of protein interaction sites with an integrative method. Bioinformatics 25(5), 585 (2009)
  9. Chen, X., Liu, C.T., Zhang, M., Zhang, H.: A forest-based approach to identifying gene and gene–gene interactions. Proc Natl Acad Sci USA 104(49), 19199–19203 (2007). DOI 10.1073/pnas.0709868104
  10. Chen, X., Liu, M.: Prediction of protein–protein interactions using random decision forest framework. Bioinformatics 21(24), 4394 (2005)
  11. Chen, X., Wang, M., Zhang, H.: The use of classification trees for bioinformatics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1(1), 55–63 (2011)
  12. Cummings, M., Myers, D.: Simple statistical models predict C-to-U edited sites in plant mitochondrial RNA. BMC Bioinformatics 5(1), 132 (2004)
  13. Cummings, M., Segal, M.: Few amino acid positions in rpoB are associated with most of the rifampin resistance in Mycobacterium tuberculosis. BMC Bioinformatics 5(1), 137 (2004)
  14. Cutler, D., Edwards Jr, T., Beard, K., Cutler, A., Hess, K., Gibson, J., Lawler, J.: Random forests for classification in ecology. Ecology 88(11), 2783–2792 (2007)
  15. Diaz-Uriarte, R., de Andrés, S.: Variable selection from random forests: application to gene expression data. arXiv preprint q-bio/0503025 (2005)
  16. Dybowski, J.N., Heider, D., Hoffmann, D.: Prediction of co-receptor usage of HIV-1 from genotype. PLoS Comput Biol 6(4), e1000743 (2010). DOI 10.1371/journal.pcbi.1000743
  17. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63, 3–42 (2006)
  18. Geurts, P., Fillet, M., De Seny, D., Meuwis, M., Malaise, M., Merville, M., Wehenkel, L.: Proteomic mass spectra classification using decision tree based ensemble methods. Bioinformatics 21(14), 3138 (2005)
  19. Hamby, S., Hirst, J.: Prediction of glycosylation sites using random forests. BMC Bioinformatics 9(1), 500 (2008)
  20. Hanselmann, M., Köthe, U., Kirchner, M., Renard, B., Amstalden, E., Glunde, K., Heeren, R., Hamprecht, F.: Toward digital staining using imaging mass spectrometry and random forests. Journal of Proteome Research 8(7), 3558–3567 (2009)
  21. Hothorn, T., Hornik, K., Zeileis, A.: Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics 15(3), 651–674 (2006)
  22. Izmirlian, G.: Application of the random forest classification algorithm to a SELDI-TOF proteomics study in the setting of a cancer prevention trial. Annals of the New York Academy of Sciences 1020(1), 154–174 (2004)
  23. Karpievitch, Y., Hill, E., Leclerc, A., Dabney, A., Almeida, J.: An introspective comparison of random forest-based classifiers for the analysis of cluster-correlated data by way of RF++. PLoS ONE 4(9), e7087 (2009)
  24. Kirchner, M., Timm, W., Fong, P., Wangemann, P., Steen, H.: Non-linear classification for on-the-fly fractional mass filtering and targeted precursor fragmentation in mass spectrometry experiments. Bioinformatics 26(6), 791 (2010)
  25. Kruglyak, L., Nickerson, D.A.: Variation is the spice of life. Nat Genet 27(3), 234–6 (2001). DOI 10.1038/85776
  26. Lee, J., Lee, J., Park, M., Song, S.: An extensive comparison of recent classification tools applied to microarray data. Computational Statistics & Data Analysis 48(4), 869–885 (2005)
  27. Lin, N., Wu, B., Jansen, R., Gerstein, M., Zhao, H.: Information assessment on predicting protein–protein interactions. BMC Bioinformatics 5(1), 154 (2004)
  28. Lunetta, K., Hayward, L., Segal, J., Van Eerdewegh, P.: Screening large-scale association study data: exploiting interactions using random forests. BMC Genetics 5(1), 32 (2004)
  29. Ma, Y., Ding, Z., Qian, Y., Shi, X., Castranova, V., Harner, E., Guo, L.: Predicting cancer drug response by proteomic profiling. Clinical Cancer Research 12(15), 4583 (2006)
  30. Meng, Y., Yu, Y., Cupples, L., Farrer, L., Lunetta, K.: Performance of random forest when SNPs are in linkage disequilibrium. BMC Bioinformatics 10(1), 78 (2009)
  31. Menze, B., Kelm, B., Masuch, R., Himmelreich, U., Bachert, P., Petrich, W., Hamprecht, F.: A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinformatics 10(1), 213 (2009)
  32. Moore, J., Asselbergs, F., Williams, S.: Bioinformatics challenges for genome-wide association studies. Bioinformatics 26(4), 445 (2010)
  33. Qi, Y., Bar-Joseph, Z., Klein-Seetharaman, J.: Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins: Structure, Function, and Bioinformatics 63(3), 490–500 (2006)
  34. Qi, Y., Dhiman, H., Bhola, N., Budyak, I., Kar, S., Man, D., Dutta, A., Tirupula, K., Carr, B., Grandis, J., et al.: Systematic prediction of human membrane receptor interactions. Proteomics 9(23), 5243–5255 (2009)
  35. Qi, Y., Klein-Seetharaman, J., Bar-Joseph, Z.: Random forest similarity for protein–protein interaction prediction from multiple sources. In: Proceedings of the Pacific Symposium on Biocomputing (2005)
  36. Riddick, G., Song, H., Ahn, S., Walling, J., Borges-Rivera, D., Zhang, W., Fine, H.: Predicting in vitro drug sensitivity using random forests. Bioinformatics 27(2), 220 (2011)
  37. Saeys, Y., Inza, I., Larrañaga, P.: A review of feature selection techniques in bioinformatics. Bioinformatics 23(19), 2507 (2007)
  38. Segal, M.R.: Machine learning benchmarks and random forest regression. Technical Report, Center for Bioinformatics & Molecular Biostatistics, University of California, San Francisco (2004)
  39. Statnikov, A., Wang, L., Aliferis, C.: A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics 9(1), 319 (2008)
  40. Strobl, C., Boulesteix, A., Kneib, T., Augustin, T., Zeileis, A.: Conditional variable importance for random forests. BMC Bioinformatics 9(1), 307 (2008)
  41. Strobl, C., Boulesteix, A., Zeileis, A., Hothorn, T.: Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics 8(1), 25 (2007)
  42. Svetnik, V., Liaw, A., Tong, C., Culberson, J.C., Sheridan, R.P., Feuston, B.P.: Random forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inf Comput Sci 43(6), 1947–58 (2003). DOI 10.1021/ci034160g
  43. Tastan, O., Qi, Y., Carbonell, J., Klein-Seetharaman, J.: Prediction of interactions between HIV-1 and human proteins by information integration. In: Pac Symp Biocomput, vol. 516 (2009)
  44. Wang, M., Chen, X., Zhang, H.: Maximal conditional chi-square importance in random forests. Bioinformatics 26(6), 831 (2010)
  45. Wang, W.Y.S., Barratt, B.J., Clayton, D.G., Todd, J.A.: Genome-wide association studies: theoretical and practical concerns. Nat Rev Genet 6(2), 109–18 (2005). DOI 10.1038/nrg1522
  46. Wu, X., Wu, Z., Li, K.: Identification of differential gene expression for microarray data using recursive random forest. Chin Med J 121(24), 2492–2496 (2008)
  47. Yang, P., Hwa Yang, Y., Zhou, B., Zomaya, Y., et al.: A review of ensemble methods in bioinformatics. Current Bioinformatics 5(4), 296–308 (2010)
  48. Zhang, H., Yu, C., Singer, B.: Cell and tumor classification using gene expression data: construction of forests. Proceedings of the National Academy of Sciences 100(7), 4168 (2003)

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. Machine Learning Department, NEC Labs America, Princeton, USA
  2. Princeton, USA
