Abstract
Mapping reads to a reference genome is a routine yet computationally intensive task in research based on high-throughput sequencing. In recent years, reads from the Illumina platform have become longer and their quality scores higher. According to our calculation, this allows a perfect k-mer seed match, subject to reasonable specificity, for almost all reads when a close reference genome is available. Our second observation is that the majority of reads contain at most one short INDEL polymorphism. Based on these observations, we propose a fast mapping approach, referred to as “SEME”, which has two core steps: first, it scans a read sequentially in a specific order for a k-mer exact-match seed; next, it extends the alignment on both sides of the seed, allowing at most one short INDEL on each side, using a novel method called the “auto-match function”. We decompose the evaluation of sensitivity and specificity into two parts, corresponding to the seed step and the extension step, and the composite result provides an approximate overall reliability estimate for each mapping. We compare SEME with some existing mapping methods on several data sets, and SEME shows better performance in terms of both running time and mapping rates.
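The two-step idea in the abstract (exact k-mer seed, then extension with at most one short INDEL) can be illustrated with a minimal Python sketch. This is not the SEME implementation: the abstract does not describe the scan order or the “auto-match function”, so the sketch assumes a plain left-to-right scan and substitutes a brute-force search over INDEL positions. All function names, the `max_indel` and `gap_cost` parameters, and the scoring rule are hypothetical choices made for illustration. Only extension to the right of the seed is shown; the left side could be handled the same way on reversed strings.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def find_seed(read, index, k):
    """Scan the read left to right (an assumed order) and return
    (read_offset, ref_pos) for the first k-mer with exactly one exact hit."""
    for offset in range(len(read) - k + 1):
        hits = index.get(read[offset:offset + k], ())
        if len(hits) == 1:
            return offset, hits[0]
    return None


def mismatches(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def score(tail, reference, ref_start, pos, shift):
    """Mismatch count when one INDEL of length |shift| is placed at tail[pos].
    shift > 0: bases missing from the read (deletion);
    shift < 0: extra bases in the read (insertion); shift == 0: no INDEL.
    Returns None if the alignment would run off the end of the reference."""
    if shift >= 0:
        read_right, ref_right_start = tail[pos:], ref_start + pos + shift
    else:
        read_right, ref_right_start = tail[pos - shift:], ref_start + pos
    ref_left = reference[ref_start:ref_start + pos]
    ref_right = reference[ref_right_start:ref_right_start + len(read_right)]
    if len(ref_left) < pos or len(ref_right) < len(read_right):
        return None
    return mismatches(tail[:pos], ref_left) + mismatches(read_right, ref_right)


def extend_right(tail, reference, ref_start, max_indel=3, gap_cost=1):
    """Brute-force stand-in for the extension step: try every placement of at
    most one INDEL of length <= max_indel and keep the cheapest alignment.
    Returns (mismatch_count, indel_pos, indel_len), or None if nothing fits;
    indel_pos is None when the best alignment has no INDEL."""
    n = len(tail)
    best = None
    for shift in range(-max_indel, max_indel + 1):
        # With no INDEL there is a single candidate; otherwise try each
        # interior position that leaves at least one base on both sides.
        positions = [0] if shift == 0 else range(1, n - max(0, -shift))
        for pos in positions:
            mm = score(tail, reference, ref_start, pos, shift)
            if mm is None:
                continue
            cost = mm + (gap_cost if shift else 0)
            if best is None or cost < best[0]:
                best = (cost, mm, pos if shift else None, shift)
    return None if best is None else best[1:]
```

To map a read with this sketch, build the index once with `build_kmer_index(reference, k)`, call `find_seed(read, index, k)` to get `(offset, ref_pos)`, and then call `extend_right(read[offset + k:], reference, ref_pos + k)`. The brute-force search costs time proportional to the tail length squared times `max_indel`, which is acceptable for illustration but is exactly the kind of cost a dedicated extension method would aim to avoid.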
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chen, S., Wang, A., Li, L.M. (2013). SEME: A Fast Mapper of Illumina Sequencing Reads with Statistical Evaluation. In: Deng, M., Jiang, R., Sun, F., Zhang, X. (eds) Research in Computational Molecular Biology. RECOMB 2013. Lecture Notes in Computer Science(), vol 7821. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37195-0_2
DOI: https://doi.org/10.1007/978-3-642-37195-0_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37194-3
Online ISBN: 978-3-642-37195-0
eBook Packages: Computer Science (R0)