Mining SNPs from DNA Sequence Data; Computational Approaches to SNP Discovery and Analysis

van Oeveren, Jan; Janssen, Antoine

doi:10.1007/978-1-60327-411-1_4

Part of the book series: Methods in Molecular Biology™ (MIMB, volume 578)

Abstract

Single nucleotide polymorphisms (SNPs) are the most abundant form of genetic variation and are the basis for most molecular markers. Before these SNPs can be used for direct sequence-based SNP detection or in a derived SNP assay, they need to be identified. For those regions or species where no validated SNPs are available in the public databases, a good alternative is to mine them from DNA sequences. The alignment of multiple sequence fragments originating from different genotypes representing the same region on the genome will allow for the discovery of sequence variants. The corresponding nucleotide mismatches are likely to be SNPs or insertions/deletions. A large amount of sequence data to be mined is present in the public databases (both expressed sequence tags and genomic sequences) and is free to use without having to do large-scale sequencing oneself. However, with the appearance of the next-generation sequencing machines (Roche GS/454, Illumina GA/Solexa, SOLiD), high-throughput sequencing is becoming widely available. This will allow for the sequencing of polymorphic genotypes on specific target areas and consequent SNP identification. In this paper we discuss the bioinformatics tools required to analyze DNA sequence data for SNP mining. A general approach for the consecutive steps in the mining process is described and commonly used SNP discovery pipelines are presented.
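The core idea in the abstract, scanning aligned fragments from different genotypes for nucleotide mismatches, can be illustrated with a minimal sketch. It is not any of the pipelines the chapter presents. It assumes the reads are already aligned to equal length, and the function name `candidate_snps`, the parameter `min_minor_count`, and the example read names are hypothetical. Gap characters are skipped, so this sketch reports substitutions only, not insertions/deletions.

```python
from collections import Counter


def candidate_snps(aligned, min_minor_count=2):
    """Return (column, allele_counts) for alignment columns with >= 2 supported alleles.

    `aligned` maps a read name to an already-aligned sequence. All sequences
    must have equal length; '-' (gap) and 'N' (unknown) are ignored.
    """
    seqs = list(aligned.values())
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for col, bases in enumerate(zip(*seqs)):
        counts = Counter(b.upper() for b in bases if b.upper() in "ACGT")
        # Require each allele to be seen min_minor_count times, so that a
        # single mismatching read (a likely sequencing error) is not reported.
        supported = {b: n for b, n in counts.items() if n >= min_minor_count}
        if len(supported) >= 2:
            hits.append((col, supported))
    return hits


# Hypothetical toy alignment: two reads each from two genotypes.
reads = {
    "genotypeA_read1": "ACGTTAGC",
    "genotypeA_read2": "ACGTTAGC",
    "genotypeB_read1": "ACGCTAGC",
    "genotypeB_read2": "ACGCTAGA",
}
print(candidate_snps(reads))  # [(3, {'T': 2, 'C': 2})]
```

Column 3 (0-based) is reported because both alleles are supported by two reads. The lone `A` in the last column is filtered out as a probable sequencing error. Real pipelines would additionally weigh base quality scores and separate true alleles from paralogous sequences.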

Acknowledgements

We thank Jifeng Tang, John Smith, Mike Cariaso, and Michiel J.T. van Eijk for their comments on the manuscript. The AFLP® and CRoPS® technologies are covered by patents and patent applications owned by Keygene NV. AFLP and CRoPS are registered trademarks of Keygene NV. Other (registered) trademarks are the property of the respective owners.

Author information

Authors and Affiliations

Division of Bioinformatics, Keygene NV, Wageningen, The Netherlands
Jan van Oeveren
Division of Bioinformatics, Keygene NV, Wageningen, The Netherlands
Antoine Janssen

Editor information

Editors and Affiliations

Center for Gene Regulation in, Cleveland State University, Euclid Ave. 2121, Cleveland, 44115, U.S.A.
Anton A. Komar

About this protocol

Cite this protocol

van Oeveren, J., Janssen, A. (2009). Mining SNPs from DNA Sequence Data; Computational Approaches to SNP Discovery and Analysis. In: Komar, A. (eds) Single Nucleotide Polymorphisms. Methods in Molecular Biology™, vol 578. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-60327-411-1_4

DOI: https://doi.org/10.1007/978-1-60327-411-1_4
Published: 05 August 2009
Publisher Name: Humana Press, Totowa, NJ
Print ISBN: 978-1-60327-410-4
Online ISBN: 978-1-60327-411-1
eBook Packages: Springer Protocols
