A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases
Emerging single-molecule sequencing technologies from Pacific Biosciences and Oxford Nanopore have revived interest in long read mapping algorithms. Alignment-based seed-and-extend methods demonstrate good accuracy, but face limited scalability, while faster alignment-free methods typically trade decreased precision for efficiency. In this paper, we combine a fast approximate read mapping algorithm based on minimizers with a novel MinHash identity estimation technique to achieve both scalability and precision. In contrast to prior methods, we develop a mathematical framework that defines the types of mapping targets we uncover, establish probabilistic estimates of p-value and sensitivity, and demonstrate tolerance for alignment error rates up to 20%. With this framework, our algorithm automatically adapts to different minimum length and identity requirements and provides both positional and identity estimates for each mapping reported. For mapping human PacBio reads to the hg38 reference, our method is 290x faster than BWA-MEM with a lower memory footprint and recall rate of 96%. We further demonstrate the scalability of our method by mapping noisy PacBio reads (each \(\ge 5\) kbp in length) to the complete NCBI RefSeq database containing 838 Gbp of sequence and \(> 60,000\) genomes.
KeywordsLong read mapping Jaccard MinHash Winnowing Minimizers Sketching Nanopore PacBio
This research was supported in part by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health, and the U.S. National Science Foundation under IIS-1416259.
- 4.Broder, A.Z.: On the resemblance and containment of documents. In: Proceedings of Compression and Complexity of Sequences 1997, pp. 21–29. IEEE (1997)Google Scholar
- 13.Li, H.: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arxiv preprint arXiv:1303.3997 (2013)
- 14.Li, H.: Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, btw152 (2016)Google Scholar
- 17.Loman, N.J.: Nanopore R9 rapid run data release (2016). https://goo.gl/UlHVtL. Accessed 8 Sept 2016
- 20.Pacific Biosciences: Human microbiome mock community shotgun sequencing data (2014). https://goo.gl/kjRcLb. Accessed 8 Sept 2016
- 21.Popic, V., Batzoglou, S.: Privacy-preserving read mapping using locality sensitive hashing and secure kmer voting. bioRxiv, 046920 (2016)Google Scholar
- 25.Schleimer, S., Wilkerson, D.S., Aiken, A.: Winnowing: local algorithms for document fingerprinting. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 76–85. ACM (2003)Google Scholar
- 26.Smith, K.C.: Sliding window minimum implementations (2016). https://goo.gl/8RC54b. Accessed 8 Sept 2016