Bioinformatics Analysis of Whole Exome Sequencing Data

Part of the book series: Methods in Molecular Biology (MIMB, volume 1881)

Abstract

This chapter presents a step-by-step protocol for identifying somatic SNPs and small indels in next-generation sequencing data from tumor samples and matched normal samples. The workflow is largely based on the Broad Institute’s “Best Practices” guidelines and uses their Genome Analysis Toolkit (GATK) platform. Variants are annotated with population allele frequencies from resources such as GnomAD, clinical interpretations from curated resources such as ClinVar, and effect predictions from dbNSFP, using VCFtools, SnpEff, and SnpSift.
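
As a rough illustration of how a population allele frequency annotation might be used downstream, the following Python sketch keeps only annotated VCF records that are rare or absent in a population resource. It is not taken from the chapter: the INFO key `AF`, the 0.01 cutoff, and the helper names `parse_info` and `is_rare` are assumptions chosen for the example, and the records are made-up placeholders.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'AF=0.25;DB' into a dict.

    Flag entries without '=' (e.g. 'DB') are stored as True.
    """
    info = {}
    if info_field in (".", ""):
        return info
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        info[key] = value if value else True
    return info


def is_rare(vcf_line, af_key="AF", max_af=0.01):
    """Return True if the record has no population frequency or one <= max_af.

    af_key and max_af are hypothetical defaults, not values from the chapter.
    """
    columns = vcf_line.rstrip("\n").split("\t")
    info = parse_info(columns[7])  # INFO is the eighth VCF column
    raw = info.get(af_key)
    if raw is None or raw is True or raw == ".":
        return True  # variant not seen in the population resource
    # Multi-allelic sites carry comma-separated values; use the largest one.
    freqs = [float(x) for x in raw.split(",") if x != "."]
    return max(freqs, default=0.0) <= max_af


# Made-up example records (tab-separated VCF body lines).
records = [
    "chr1\t12345\t.\tA\tT\t.\tPASS\tAF=0.25",
    "chr2\t67890\t.\tG\tC\t.\tPASS\tAF=0.0001",
    "chr3\t13579\t.\tC\tCT\t.\tPASS\tDP=80",
]
rare = [r for r in records if is_rare(r)]
```

Treating a missing frequency as rare is a design choice for this sketch: a variant that is absent from a population database is more plausibly somatic than one that is common in the general population.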

Acknowledgments

The authors would like to thank the institutions, developers, and documenters of the informatics tools used in this chapter’s workflows. Genomics and disease research in general benefits hourly from the availability of tools such as Bioconda, BWA, GATK, HaplotypeCaller, Mutect2, Samtools, SnpEff, VarScan, and VCFtools, as well as public resources such as ClinVar and GnomAD.

Corresponding author

Correspondence to Peter J. Ulintz.

Copyright information

© 2019 Springer Science+Business Media, LLC, part of Springer Nature

About this protocol

Cite this protocol

Ulintz, P.J., Wu, W., Gates, C.M. (2019). Bioinformatics Analysis of Whole Exome Sequencing Data. In: Malek, S. (eds) Chronic Lymphocytic Leukemia. Methods in Molecular Biology, vol 1881. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-8876-1_21

  • DOI: https://doi.org/10.1007/978-1-4939-8876-1_21

  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-8875-4

  • Online ISBN: 978-1-4939-8876-1

  • eBook Packages: Springer Protocols
