R-Programming for Genome-Wide Data Analysis

  • Arunima Shilpi
  • Shraddha Dubey
Chapter

Abstract

R is a programming language and environment for statistical analysis and graphics. It is freely available and is distributed as precompiled binaries for operating systems such as Linux, macOS, and Windows. R is interactive and includes many predefined functions for hypothesis testing on a given dataset. It provides a set of operators for handling vectors, lists, arrays, and matrices. R supports coherent, integrated data analysis and can be extended with additional libraries that contain functions for biological datasets. This chapter gives an overview of R functions used in genome-wide association studies (GWAS) and elaborates on the basic functions, with examples, as a guideline for GWAS analysis. It also discusses genome-wide differential expression analysis in cancer datasets, along with some of the important packages used in data processing for quality control, annotation, visualization, and workflow.
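
As a rough illustration of the data structures and built-in hypothesis tests mentioned in the abstract, the following minimal sketch uses only base R. All values are hypothetical toy data invented for this example and do not come from the chapter; the variable names (`genotype`, `counts`, `result`) are likewise assumptions.

```r
# Hypothetical toy data: none of these values come from the chapter.

# Vector: minor-allele counts (0, 1 or 2) for one SNP in eight samples
genotype <- c(0, 1, 2, 1, 0, 2, 1, 1)

# Vectorised arithmetic: allele frequency = allele copies / (2 * samples)
maf <- sum(genotype) / (2 * length(genotype))

# Matrix: genotype counts (columns) for cases and controls (rows)
counts <- matrix(c(30, 50, 20,
                   45, 45, 10),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(status = c("case", "control"),
                                 genotype = c("AA", "Aa", "aa")))

# Built-in hypothesis test: chi-squared test of association
# between genotype and case/control status
test <- chisq.test(counts)

# List: bundle heterogeneous results in a single object
result <- list(maf       = maf,
               statistic = unname(test$statistic),
               df        = unname(test$parameter),
               p.value   = test$p.value)
print(result)
```

A real GWAS would repeat a test like this across many markers and correct for multiple testing; this sketch covers a single SNP only.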

Keywords

  • Genome-wide studies
  • R-programming
  • Data types
  • Functions
  • Bioconductor packages

Notes

Acknowledgements

Authors Arunima Shilpi and Shraddha Dubey are thankful to the Department of Bioinformatics at NIT Rourkela and Maulana Azad National Institute of Technology, Bhopal, respectively, for providing financial support and dry lab facilities.

Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  • Arunima Shilpi (1)
  • Shraddha Dubey (2)
  1. Department of Life Science, National Institute of Technology, Rourkela, India
  2. Department of Bioinformatics, Maulana Azad National Institute of Technology, Bhopal, India