Abstract
Chemogenomic modeling involves the construction of algorithmic or statistical models for prediction on new input data and is often based on noisy, multidescriptor data. A deeper understanding of such data through statistical analysis can underpin informed study design and increase the information gained from prediction results and model performance. This chapter introduces basic statistical concepts and provides step-by-step instructions for exploring and visualizing chemogenomic data with R, an open-source software environment for statistics. Directions for essential techniques such as calculating correlations, hypothesis testing, and clustering are provided.
Acknowledgments
Support from the Japan Society for the Promotion of Science (JSPS) in conjunction with the Alexander von Humboldt Foundation (AvH) is gratefully acknowledged.
Copyright information
© 2018 Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Rakers, C. (2018). Core Statistical Methods for Chemogenomic Data. In: Brown, J. (eds) Computational Chemogenomics. Methods in Molecular Biology, vol 1825. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-8639-2_7
DOI: https://doi.org/10.1007/978-1-4939-8639-2_7
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-8638-5
Online ISBN: 978-1-4939-8639-2
eBook Packages: Springer Protocols