Core Statistical Methods for Chemogenomic Data

Rakers, Christin

doi:10.1007/978-1-4939-8639-2_7

Part of the book series: Methods in Molecular Biology (MIMB, volume 1825)

Abstract

Chemogenomic modeling involves constructing algorithmic or statistical models that make predictions on new input data, and it often relies on noisy, multidescriptor data. A deeper understanding of such data through statistical analysis can underpin informed study design and increase the information gained from prediction results and model performance. This chapter introduces basic statistical concepts and provides step-by-step instructions for exploring and visualizing chemogenomic data using R, an open-source software environment for statistics. It also gives directions for essential techniques such as calculating correlations, hypothesis testing, and clustering.

Acknowledgments

Support from the Japan Society for the Promotion of Science (JSPS) in conjunction with the Alexander von Humboldt Foundation (AvH) is gratefully acknowledged.

Author information

Authors and Affiliations

Graduate School of Pharmaceutical Sciences, Yoshida-shimoadachicho, Kyoto University, Sakyo-ku, Kyoto, Japan
Christin Rakers
Graduate School of Science, Nagoya University, Nagoya, Japan
Christin Rakers

Corresponding author

Correspondence to Christin Rakers.

Editor information

Editors and Affiliations

Life Science Informatics Research Unit, Laboratory of Molecular Biosciences, Kyoto University Graduate School of Medicine, Kyoto, Japan
J.B. Brown

About this protocol

Cite this protocol

Rakers, C. (2018). Core Statistical Methods for Chemogenomic Data. In: Brown, J. (ed.) Computational Chemogenomics. Methods in Molecular Biology, vol 1825. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-8639-2_7

DOI: https://doi.org/10.1007/978-1-4939-8639-2_7
Published: 18 October 2018
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-8638-5
Online ISBN: 978-1-4939-8639-2
eBook Packages: Springer Protocols
