Core Statistical Methods for Chemogenomic Data

  • Protocol
Computational Chemogenomics

Part of the book series: Methods in Molecular Biology (MIMB, volume 1825)

Abstract

Chemogenomic modeling involves constructing algorithmic or statistical models that make predictions on new input data, and it is often based on noisy, multidescriptor data. A deeper understanding of such data through statistical analysis can support informed study design and increase the information gained from prediction results and model performance. This chapter introduces basic statistical concepts and provides step-by-step instructions for exploring and visualizing chemogenomic data with R, an open-source software environment for statistics. It gives directions for carrying out essential techniques such as calculating correlations, hypothesis testing, and clustering.

Acknowledgments

Support from the Japan Society for the Promotion of Science (JSPS) in conjunction with the Alexander von Humboldt Foundation (AvH) is gratefully acknowledged.

Author information

Corresponding author

Correspondence to Christin Rakers.

Copyright information

© 2018 Springer Science+Business Media, LLC, part of Springer Nature

About this protocol

Cite this protocol

Rakers, C. (2018). Core Statistical Methods for Chemogenomic Data. In: Brown, J. (eds) Computational Chemogenomics. Methods in Molecular Biology, vol 1825. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-8639-2_7

  • DOI: https://doi.org/10.1007/978-1-4939-8639-2_7

  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-8638-5

  • Online ISBN: 978-1-4939-8639-2

  • eBook Packages: Springer Protocols
