An Efficient Filtration Method Based on Variable-Length Seeds for Sequence Alignment

Guo, Ruidong; Cheng, Haoyu; Xu, Yun

doi:10.1007/978-981-10-6442-5_19

Part of the book series: Communications in Computer and Information Science (CCIS, volume 729)

Included in the following conference series:

International Symposium on Parallel Architecture, Algorithm and Programming

Abstract

With the rapid development of next-generation sequencing (NGS) platforms, billions of reads are produced quickly. Finding all mapping locations of these reads in the reference genome is not only a bioinformatics problem but also a large-scale computation problem. Existing all-mappers usually work in two steps: filtration and verification. The filtration step discards some incorrect locations and generates candidates. In the verification step, the read is aligned to the reference sequence at each candidate to determine whether that candidate is a true mapping location. Statistics indicate that the verification step accounts for most of the total mapping time, so fewer candidates lead to less mapping time. Our strategies improve the filtration step to decrease the number of candidates.
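The two-step pipeline can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes fixed-length seeds, a plain hash index, and Hamming distance (mismatches only), whereas real all-mappers such as Bitmapper verify under edit distance. The names `build_index`, `hamming_within`, and `map_read` are hypothetical.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def hamming_within(a, b, e):
    """True if equal-length strings a and b differ in at most e positions."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= e


def map_read(read, reference, index, k, e):
    """Return (mapping_locations, number_of_candidates) for one read.

    Filtration: split the read into e + 1 non-overlapping seeds of length k.
    With at most e mismatches, at least one seed must match exactly
    (pigeonhole principle), so exact seed hits yield the candidates.
    Verification: compare the read with the reference at each candidate.
    """
    assert len(read) >= (e + 1) * k, "read too short for e + 1 seeds"
    candidates = set()
    for s in range(e + 1):
        offset = s * k
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset  # implied start of the whole read
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    hits = sorted(c for c in candidates
                  if hamming_within(read, reference[c:c + len(read)], e))
    return hits, len(candidates)
```

Every candidate costs one verification, so the second return value is the quantity that a better filtration step tries to reduce.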

We propose a dynamic programming strategy and two heuristic strategies and integrate them into the filtration step. These strategies are applied in the state-of-the-art all-mapper Bitmapper. Experimental results show that, compared with other advanced all-mappers, our method achieves significant progress.
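The abstract does not give the details of the dynamic programming strategy, so the following Python sketch shows only the general idea of choosing variable-length seeds by dynamic programming, in the spirit of optimal seed selection. It assumes that the read is partitioned into contiguous seeds, that the cost of a seed is its number of exact occurrences in the reference, and that occurrences are counted by naive scanning rather than by an index. The names `count_occurrences` and `select_seeds` and the parameters `num_seeds` and `min_len` are hypothetical.

```python
def count_occurrences(reference, seed):
    """Count (possibly overlapping) exact occurrences of seed in reference."""
    count, start = 0, reference.find(seed)
    while start != -1:
        count += 1
        start = reference.find(seed, start + 1)
    return count


def select_seeds(read, reference, num_seeds, min_len):
    """Partition read into num_seeds contiguous seeds with minimum total hits.

    best[j][i] is the minimum total occurrence count when the first i bases
    of the read are split into exactly j seeds, each of length >= min_len.
    """
    n = len(read)
    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(num_seeds + 1)]
    cut = [[-1] * (n + 1) for _ in range(num_seeds + 1)]
    best[0][0] = 0
    for j in range(1, num_seeds + 1):
        for i in range(j * min_len, n + 1):
            # p is where the j-th seed starts; it covers read[p:i].
            for p in range((j - 1) * min_len, i - min_len + 1):
                if best[j - 1][p] == inf:
                    continue
                cost = best[j - 1][p] + count_occurrences(reference, read[p:i])
                if cost < best[j][i]:
                    best[j][i], cut[j][i] = cost, p
    if best[num_seeds][n] == inf:
        raise ValueError("read too short for the requested seeds")
    # Walk the stored cut points backwards to recover the chosen seeds.
    seeds, i = [], n
    for j in range(num_seeds, 0, -1):
        p = cut[j][i]
        seeds.append((p, read[p:i]))
        i = p
    return best[num_seeds][n], seeds[::-1]
```

For a tolerance of e errors, `num_seeds` would be e + 1. The returned total is the number of seed hits that filtration would pass on as candidates, so a lower total means less verification work than an equal-length split that ignores seed frequencies.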

Acknowledgment

This work was supported by the National Natural Science Foundation of China under Grant No. 61672480 and by the Program for Excellent Graduate Students in Collaborative Innovation Center of High Performance Computing.

Author information

Authors and Affiliations

School of Computer Science, Key Laboratory on High Performance Computing, University of Science and Technology of China, Anhui, 230027, China
Ruidong Guo, Haoyu Cheng & Yun Xu
Collaborative Innovation Center of High Performance Computing, National University of Defense Technology, Changsha, 410073, China
Ruidong Guo, Haoyu Cheng & Yun Xu

Corresponding author

Correspondence to Yun Xu.

Editor information

Editors and Affiliations

Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu, China
Guoliang Chen
Sun Yat-sen University, Guangzhou, Guangdong, China
Hong Shen
Hainan University, Haikou, Hainan, China
Mingrui Chen

Cite this paper

Guo, R., Cheng, H., Xu, Y. (2017). An Efficient Filtration Method Based on Variable-Length Seeds for Sequence Alignment. In: Chen, G., Shen, H., Chen, M. (eds) Parallel Architecture, Algorithm and Programming. PAAP 2017. Communications in Computer and Information Science, vol 729. Springer, Singapore. https://doi.org/10.1007/978-981-10-6442-5_19

DOI: https://doi.org/10.1007/978-981-10-6442-5_19
Published: 06 October 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6441-8
Online ISBN: 978-981-10-6442-5
eBook Packages: Computer Science, Computer Science (R0)
