SEME: A Fast Mapper of Illumina Sequencing Reads with Statistical Evaluation

Chen, Shijian; Wang, Anqi; Li, Lei M.

doi:10.1007/978-3-642-37195-0_2

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 7821)

Included in the following conference series:

Annual International Conference on Research in Computational Molecular Biology

Abstract

Mapping reads to a reference genome is a routine yet computationally intensive task in research based on high-throughput sequencing. In recent years, sequencing reads from the Illumina platform have become longer and their quality scores higher. According to our calculation, this allows a perfect k-mer seed match for almost all reads when a close reference genome is available, subject to reasonable specificity. Our second observation is that the majority of reads contain at most one short INDEL polymorphism. Based on these observations, we propose a fast mapping approach, referred to as “SEME”, which has two core steps: first, it scans a read sequentially in a specific order for a k-mer exact-match seed; next, it extends the alignment on both sides, allowing at most one short INDEL on each side, using a novel method called the “auto-match function”. We decompose the evaluation of sensitivity and specificity into two parts corresponding to the seed and extension steps, and the composite result provides an approximate overall reliability estimate for each mapping. We compare SEME with several existing mapping methods on several data sets, and SEME shows better performance in terms of both running time and mapping rates.
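
The two core steps can be illustrated with a minimal Python sketch. It is not the SEME implementation: the abstract does not describe the "auto-match function" or the specific scan order, so the extension step here is a brute-force enumeration of a single indel, and the seed scan simply tries non-overlapping k-mers from left to right. The function names and the parameters `max_indel`, `max_mismatch`, and `indel_cost` are assumptions introduced for illustration only.

```python
from collections import defaultdict


def build_index(ref, k):
    """Map each k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index


def mismatches(a, b):
    """Count differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def extend(read_part, ref_part, max_indel=3, max_mismatch=2, indel_cost=2):
    """Brute-force extension away from the seed with at most one indel.

    Stand-in for the paper's auto-match function (an assumption, not SEME's
    method). Returns (mismatches, indel_size, indel_pos) or None, where
    indel_size > 0 means extra bases in the read and < 0 means bases
    missing from the read; indel_pos is counted outward from the seed.
    """
    n = len(read_part)
    candidates = []
    if len(ref_part) >= n:  # ungapped extension
        candidates.append((mismatches(read_part, ref_part[:n]), 0, None))
    for d in range(1, max_indel + 1):
        for p in range(1, n):
            # Insertion: read_part[p:p+d] has no partner in the reference.
            if p + d < n and len(ref_part) >= n - d:
                mm = (mismatches(read_part[:p], ref_part[:p])
                      + mismatches(read_part[p + d:], ref_part[p:n - d]))
                candidates.append((mm, d, p))
            # Deletion: ref_part[p:p+d] is skipped by the read.
            if len(ref_part) >= n + d:
                mm = (mismatches(read_part[:p], ref_part[:p])
                      + mismatches(read_part[p:], ref_part[p + d:n + d]))
                candidates.append((mm, -d, p))
    candidates = [c for c in candidates if c[0] <= max_mismatch]
    if not candidates:
        return None
    # Lowest cost wins; min() keeps the earliest candidate on ties.
    return min(candidates, key=lambda c: c[0] + (indel_cost if c[1] else 0))


def map_read(read, ref, index, k, max_indel=3, max_mismatch=2):
    """Find an exact k-mer seed, then extend it to the left and right."""
    for pos in range(0, len(read) - k + 1, k):  # assumed scan order
        for rstart in index.get(read[pos:pos + k], ()):
            # The left side is reversed so both sides grow away from the seed.
            left_read = read[:pos][::-1]
            left_ref = ref[max(0, rstart - pos - max_indel):rstart][::-1]
            right_read = read[pos + k:]
            right_ref = ref[rstart + k:rstart + k + len(right_read) + max_indel]
            left = extend(left_read, left_ref, max_indel, max_mismatch)
            right = extend(right_read, right_ref, max_indel, max_mismatch)
            if left is not None and right is not None:
                return {"seed_read_pos": pos, "seed_ref_pos": rstart,
                        "left": left, "right": right}
    return None
```

The `indel_cost` term is a design choice of this sketch: without it, a mismatch at the far end of a read could be explained equally well by a spurious indel, so an ungapped extension is preferred unless the indel removes at least that many mismatches.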

Author information

Authors and Affiliations

NCMIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China
Shijian Chen, Anqi Wang & Lei M. Li
Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA, 90089, USA
Lei M. Li

Editor information

Editors and Affiliations

School of Mathematics, Peking University, Beijing, P.R. China
Minghua Deng
Bioinformatics Division, TNLIST/Department of Automation, Tsinghua University, 100084, Beijing, P.R. China
Rui Jiang
Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
Fengzhu Sun
Department of Automation, Tsinghua University, P.O. Box, 100084, Beijing, China
Xuegong Zhang

About this paper

Cite this paper

Chen, S., Wang, A., Li, L.M. (2013). SEME: A Fast Mapper of Illumina Sequencing Reads with Statistical Evaluation. In: Deng, M., Jiang, R., Sun, F., Zhang, X. (eds) Research in Computational Molecular Biology. RECOMB 2013. Lecture Notes in Computer Science, vol 7821. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37195-0_2

DOI: https://doi.org/10.1007/978-3-642-37195-0_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37194-3
Online ISBN: 978-3-642-37195-0
eBook Packages: Computer Science, Computer Science (R0)