Abstract
With the rapid development of next-generation sequencing (NGS) platforms, billions of reads can be produced quickly. Finding all mapping locations of these reads in a reference genome is not only a bioinformatics problem but also a large-scale computation problem. Existing all-mappers usually work in two steps, filtration and verification. The filtration step discards some incorrect locations and generates candidates. In the verification step, each candidate is aligned against the reference sequence to determine whether it is a true mapping location. Statistics indicate that the verification step accounts for most of the total mapping time; that is, fewer candidates lead to less mapping time. Our strategies improve the filtration step to reduce the number of candidates.
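The two-step scheme described above can be illustrated with a minimal Python sketch. It is not the implementation used by Bitmapper or by this paper: it assumes a plain k-mer hash index, fixed-length seeds chosen by the pigeonhole principle (a read with at most `max_errors` edits must contain at least one of `max_errors + 1` non-overlapping seeds exactly), and a simple quadratic edit-distance check for verification. The function names `build_index`, `semi_global_distance`, and `map_read` are hypothetical.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def semi_global_distance(read, window):
    """Minimum edit distance between the read and any substring of window."""
    prev = [0] * (len(window) + 1)  # row 0: the alignment may start anywhere
    for i, rc in enumerate(read, 1):
        cur = [i] + [0] * len(window)
        for j, wc in enumerate(window, 1):
            cur[j] = min(prev[j - 1] + (rc != wc),  # match or substitution
                         prev[j] + 1,               # read base not in window
                         cur[j - 1] + 1)            # window base not in read
        prev = cur
    return min(prev)  # the alignment may end anywhere


def map_read(read, reference, index, k, max_errors):
    """Return (candidates, hits): filtration output and verified locations."""
    # Filtration: look up max_errors + 1 non-overlapping seeds of length k.
    # Assumes (max_errors + 1) * k <= len(read).
    candidates = set()
    for s in range(max_errors + 1):
        offset = s * k
        seed = read[offset:offset + k]
        for pos in index.get(seed, ()):
            candidates.add(pos - offset)  # implied start of the read

    # Verification: align the read against a window around each candidate.
    hits = []
    for start in sorted(candidates):
        lo = max(0, start - max_errors)
        hi = min(len(reference), start + len(read) + max_errors)
        if semi_global_distance(read, reference[lo:hi]) <= max_errors:
            hits.append(start)
    return candidates, hits


reference = "GATTACAGGCTTACGATCCGATAGCTTAGGCA"
read = "GCATACGATCCG"  # reference[8:20] with one substituted base
candidates, hits = map_read(read, reference, build_index(reference, 6), 6, 1)
```

Every candidate costs one verification, so the size of `candidates` is what a better filtration step would shrink.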
We propose one dynamic programming strategy and two heuristic strategies and integrate them into the filtration step. These strategies are applied to the state-of-the-art all-mapper Bitmapper. Experimental results show that our method achieves a significant improvement over other advanced all-mappers.
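The abstract does not spell out the dynamic programming strategy, so the following Python sketch only illustrates the general idea of choosing variable-length seeds by dynamic programming; it should not be read as the authors' algorithm. It assumes the read is split into `num_seeds` consecutive seeds of at least `min_len` bases each, and that the goal is to minimise the total number of exact seed occurrences in the reference (that is, the number of candidates). The names `count_occurrences` and `best_partition` are hypothetical, and the naive occurrence counting stands in for the index lookup a real mapper would use.

```python
def count_occurrences(reference, seed):
    """Count (possibly overlapping) exact occurrences of seed in reference."""
    count, pos = 0, reference.find(seed)
    while pos != -1:
        count += 1
        pos = reference.find(seed, pos + 1)
    return count


def best_partition(read, reference, num_seeds, min_len):
    """Split read into num_seeds consecutive seeds minimising total occurrences.

    Returns (total_occurrences, [(start, end), ...]) with half-open intervals.
    """
    n = len(read)
    inf = float("inf")
    # best[s][i]: minimum total occurrences when read[:i] is cut into s seeds
    best = [[inf] * (n + 1) for _ in range(num_seeds + 1)]
    cut = [[-1] * (n + 1) for _ in range(num_seeds + 1)]
    best[0][0] = 0
    for s in range(1, num_seeds + 1):
        for i in range(s * min_len, n + 1):
            # the last seed is read[j:i]; the first s - 1 seeds cover read[:j]
            for j in range((s - 1) * min_len, i - min_len + 1):
                if best[s - 1][j] == inf:
                    continue
                cost = best[s - 1][j] + count_occurrences(reference, read[j:i])
                if cost < best[s][i]:
                    best[s][i], cut[s][i] = cost, j
    if best[num_seeds][n] == inf:
        raise ValueError("read is too short for the requested seeds")

    # Walk the stored cut points backwards to recover the seed boundaries.
    seeds, i = [], n
    for s in range(num_seeds, 0, -1):
        j = cut[s][i]
        seeds.append((j, i))
        i = j
    return best[num_seeds][n], seeds[::-1]


reference = "ACACACACGTTGCA"
read = "ACACGTTG"
total, seeds = best_partition(read, reference, num_seeds=2, min_len=2)
equal_split = (count_occurrences(reference, read[:4])
               + count_occurrences(reference, read[4:]))
```

In this toy input the repetitive prefix makes an equal-length split expensive, while moving the cut point by one base yields fewer candidates, which is the effect variable-length seeds aim for.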
Acknowledgment
This work was supported by the National Natural Science Foundation of China under Grant No. 61672480 and by the Program for Excellent Graduate Students in Collaborative Innovation Center of High Performance Computing.
Copyright information
© 2017 Springer Nature Singapore Pte Ltd
About this paper
Cite this paper
Guo, R., Cheng, H., Xu, Y. (2017). An Efficient Filtration Method Based on Variable-Length Seeds for Sequence Alignment. In: Chen, G., Shen, H., Chen, M. (eds) Parallel Architecture, Algorithm and Programming. PAAP 2017. Communications in Computer and Information Science, vol 729. Springer, Singapore. https://doi.org/10.1007/978-981-10-6442-5_19
DOI: https://doi.org/10.1007/978-981-10-6442-5_19
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6441-8
Online ISBN: 978-981-10-6442-5
eBook Packages: Computer Science, Computer Science (R0)