Abstract
Single nucleotide polymorphisms (SNPs) are the most abundant form of genetic variation and are the basis for most molecular markers. Before SNPs can be used for direct sequence-based detection or in a derived SNP assay, they need to be identified. For regions or species for which no validated SNPs are available in the public databases, a good alternative is to mine them from DNA sequences. Aligning multiple sequence fragments that originate from different genotypes and represent the same region of the genome allows sequence variants to be discovered: the nucleotide mismatches in such an alignment are likely to be SNPs or insertions/deletions. A large amount of sequence data (both expressed sequence tags and genomic sequences) is present in the public databases and can be mined freely, without the need for large-scale sequencing of one's own. However, with the arrival of next-generation sequencing machines (Roche GS/454, Illumina GA/Solexa, SOLiD), high-throughput sequencing is becoming widely available. This will allow polymorphic genotypes to be sequenced at specific target areas, followed by SNP identification. In this paper we discuss the bioinformatics tools required to analyze DNA sequence data for SNP mining. A general approach for the consecutive steps in the mining process is described, and commonly used SNP discovery pipelines are presented.
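The core idea in the abstract, that mismatches between aligned fragments from different genotypes are candidate SNPs or insertions/deletions, can be illustrated with a minimal sketch. It is not one of the pipelines discussed in the chapter. The function name `candidate_variants`, the `min_minor_count` redundancy threshold, and the toy reads are all assumptions made for illustration, and the input is assumed to be already aligned (equal-length strings with `-` marking gaps).

```python
from collections import Counter


def candidate_variants(aligned_reads, min_minor_count=2):
    """Scan the columns of a gapped alignment and report candidate variants.

    aligned_reads: equal-length strings over A/C/G/T, with '-' for gaps.
    A column is reported when at least two different alleles are each
    observed min_minor_count times or more. This redundancy filter is an
    assumption of the sketch, meant to reduce the impact of sequencing errors.
    """
    if not aligned_reads:
        return []
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    variants = []
    for pos in range(length):
        counts = Counter(read[pos].upper() for read in aligned_reads)
        counts.pop("N", None)  # ignore ambiguous base calls
        # Keep only alleles that are supported by enough reads.
        supported = {a: n for a, n in counts.items() if n >= min_minor_count}
        if len(supported) >= 2:
            kind = "indel" if "-" in supported else "SNP"
            variants.append((pos, kind, dict(sorted(supported.items()))))
    return variants


# Hypothetical toy alignment: five reads covering the same region.
reads = [
    "ACGTTGCA-T",
    "ACGTTGCA-T",
    "ACGCTGCAGT",
    "ACGCTGCAGT",
    "ACGTTGCAGT",
]

for pos, kind, alleles in candidate_variants(reads):
    print(pos, kind, alleles)
```

On this toy input the sketch reports a candidate SNP (C/T) at column 3 and a candidate indel at column 8, using 0-based positions. Real data would additionally call for base quality scores, paralog filtering, and a proper aligner or assembler upstream.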
Acknowledgements
We thank Jifeng Tang, John Smith, Mike Cariaso, and Michiel J.T. van Eijk for their comments on the manuscript. The AFLP® and CRoPS® technologies are covered by patents and patent applications owned by Keygene NV. AFLP and CRoPS are registered trademarks of Keygene NV. Other (registered) trademarks are the property of the respective owners.
Copyright information
© 2009 Humana Press, a part of Springer Science+Business Media, LLC 2003
About this protocol
Cite this protocol
van Oeveren, J., Janssen, A. (2009). Mining SNPs from DNA Sequence Data; Computational Approaches to SNP Discovery and Analysis. In: Komar, A. (ed.) Single Nucleotide Polymorphisms. Methods in Molecular Biology™, vol 578. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-60327-411-1_4
DOI: https://doi.org/10.1007/978-1-60327-411-1_4
Publisher Name: Humana Press, Totowa, NJ
Print ISBN: 978-1-60327-410-4
Online ISBN: 978-1-60327-411-1
eBook Packages: Springer Protocols