Core Statistical Methods for Chemogenomic Data

  • Christin Rakers
Part of the Methods in Molecular Biology book series (MIMB, volume 1825)


Chemogenomic modeling involves the construction of algorithmic or statistical models for prediction on new input data and is often based on noisy, multidescriptor data. A deeper understanding of such data through statistical analysis can underpin informed study design and increase the information gained from prediction results and model performance. This chapter introduces basic statistical concepts and provides step-by-step instructions for exploring and visualizing chemogenomic data with R, an open-source software environment for statistical computing. Directions are provided for essential techniques such as calculating correlations, hypothesis testing, and clustering.

Key words

Chemogenomic data, Normality, Correlation, Clustering, Feature importance, Hypothesis testing



Support from the Japan Society for the Promotion of Science (JSPS) in conjunction with the Alexander von Humboldt Foundation (AvH) is gratefully acknowledged.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Graduate School of Pharmaceutical Sciences, Kyoto University, Yoshida-shimoadachicho, Sakyo-ku, Kyoto, Japan
  2. Graduate School of Science, Nagoya University, Nagoya, Japan
