Mining SNPs from DNA Sequence Data; Computational Approaches to SNP Discovery and Analysis

Single Nucleotide Polymorphisms

Part of the book series: Methods in Molecular Biology™ (MIMB, volume 578)

Abstract

Single nucleotide polymorphisms (SNPs) are the most abundant form of genetic variation and are the basis for most molecular markers. Before these SNPs can be used for direct sequence-based SNP detection or in a derived SNP assay, they need to be identified. For those regions or species where no validated SNPs are available in the public databases, a good alternative is to mine them from DNA sequences. Aligning multiple sequence fragments that originate from different genotypes and represent the same region of the genome allows sequence variants to be discovered. The corresponding nucleotide mismatches are likely to be SNPs or insertions/deletions. A large amount of sequence data suitable for mining (both expressed sequence tags and genomic sequences) is available in the public databases and is free to use, so large-scale sequencing of one's own is not required. However, with the appearance of the next-generation sequencing machines (Roche GS/454, Illumina GA/Solexa, SOLiD), high-throughput sequencing is becoming widely available. This allows polymorphic genotypes to be sequenced in specific target regions, followed by SNP identification. In this chapter we discuss the bioinformatics tools required to analyze DNA sequence data for SNP mining. A general approach for the consecutive steps in the mining process is described and commonly used SNP discovery pipelines are presented.
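
The core idea in the abstract, scanning aligned fragments from different genotypes for nucleotide mismatches, can be illustrated with a minimal Python sketch. It is not taken from the chapter or from any of the pipelines it presents. It assumes the reads are already aligned to equal length with "-" marking gaps, and the function name find_variants, the min_allele_count threshold, and the toy reads are all hypothetical choices made for illustration.

```python
from collections import Counter


def find_variants(aligned_reads, min_allele_count=2):
    """Report alignment columns that carry two or more supported alleles.

    aligned_reads: equal-length strings, already aligned, '-' marks a gap.
    min_allele_count: hypothetical redundancy filter; an allele seen in
    fewer reads than this is treated as a likely sequencing error.
    """
    if not aligned_reads:
        return []
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    variants = []
    for pos in range(length):
        counts = Counter(read[pos].upper() for read in aligned_reads)
        counts.pop("N", None)  # ignore ambiguous base calls
        supported = {a: n for a, n in counts.items() if n >= min_allele_count}
        if len(supported) < 2:
            continue  # monomorphic column, or minor allele seen too rarely
        kind = "indel" if "-" in supported else "SNP"
        variants.append((pos, kind, supported))
    return variants


# Toy alignment of five reads covering the same region (illustrative data).
reads = [
    "ACGTTAGC-A",
    "ACGTTAGC-A",
    "ACGCTAGCTA",
    "ACGCTAGCTA",
    "ACGTTAGCTA",
]

for pos, kind, alleles in find_variants(reads):
    print(pos, kind, alleles)
```

Requiring each allele to appear in at least two reads is only one simple way to separate candidate polymorphisms from sequencing errors. Real pipelines typically also weigh base quality scores and guard against paralogous sequences, which this sketch leaves out.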

Acknowledgements

We thank Jifeng Tang, John Smith, Mike Cariaso, and Michiel J.T. van Eijk for their comments on the manuscript. The AFLP® and CRoPS® technologies are covered by patents and patent applications owned by Keygene NV. AFLP and CRoPS are registered trademarks of Keygene NV. Other (registered) trademarks are the property of the respective owners.

Copyright information

© 2009 Humana Press, a part of Springer Science+Business Media, LLC 2003

About this protocol

Cite this protocol

van Oeveren, J., Janssen, A. (2009). Mining SNPs from DNA Sequence Data; Computational Approaches to SNP Discovery and Analysis. In: Komar, A. (ed.) Single Nucleotide Polymorphisms. Methods in Molecular Biology™, vol 578. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-60327-411-1_4

  • DOI: https://doi.org/10.1007/978-1-60327-411-1_4

  • Publisher Name: Humana Press, Totowa, NJ

  • Print ISBN: 978-1-60327-410-4

  • Online ISBN: 978-1-60327-411-1

  • eBook Packages: Springer Protocols
