Analysis of Genotyping-by-Sequencing (GBS) Data

  • Sateesh Kagale
  • Chushin Koh
  • Wayne E. Clarke
  • Venkatesh Bollina
  • Isobel A. P. Parkin
  • Andrew G. Sharpe
Part of the Methods in Molecular Biology book series (MIMB, volume 1374)

Abstract

The development of genotyping-by-sequencing (GBS) to rapidly detect nucleotide variation at the whole-genome level, in many individuals simultaneously, has provided a transformative genetic profiling technique. GBS can be carried out in species with or without reference genome sequences and yields huge amounts of potentially informative data. One limitation of the approach is the paucity of tools to transform the raw data into a format that can be easily interrogated at the genetic level. In this chapter we describe bioinformatics tools developed to address this shortfall, together with experimental design considerations to fully leverage the power of GBS for genetic analysis.

Key words

Genotyping; Genotyping-by-sequencing; GBS; RAD-seq; Next generation sequencing; Genetic variation; Single nucleotide polymorphism; InDels; Reduced representation sequencing; Trimmomatic; Bowtie; SAMtools; GATK; Demultiplexing; Read mapping; UnifiedGenotyper; HaplotypeCaller; Minor allele frequency; Imputation; Haplotype

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Sateesh Kagale (1)
  • Chushin Koh (1)
  • Wayne E. Clarke (2)
  • Venkatesh Bollina (2)
  • Isobel A. P. Parkin (2)
  • Andrew G. Sharpe (1; corresponding author)

  1. National Research Council Canada, Saskatoon, Canada
  2. Agriculture and Agri-Food Canada, Saskatoon, Canada
