Efficient Construction of a Complete Index for Pan-Genomics Read Alignment

Kuhnle, Alan; Mun, Taher; Boucher, Christina; Gagie, Travis; Langmead, Ben; Manzini, Giovanni

doi:10.1007/978-3-030-17083-7_10

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 11467)

Included in the following conference series:

International Conference on Research in Computational Molecular Biology

Abstract

While short read aligners, which predominantly use the FM-index, can easily index one or a few human genomes, they do not scale well to indexing databases containing thousands of genomes. To understand why, it helps to examine the two main components of the FM-index in more detail: first, a rank data structure over the Burrows-Wheeler Transform (${\mathsf{BWT}}$) of the string, which allows us to find the interval in the string’s suffix array (${\mathsf{SA}}$) containing pointers to the starting positions of occurrences of a given pattern; second, a sample of the ${\mathsf{SA}}$ that, when used with the rank data structure, allows us to access the ${\mathsf{SA}}$. The rank data structure can be kept small even for large genomic databases by run-length compressing the ${\mathsf{BWT}}$, but until recently there was no known means of keeping the ${\mathsf{SA}}$ sample small without greatly slowing down access to the ${\mathsf{SA}}$. Now that Gagie et al. (SODA 2018) have defined an ${\mathsf{SA}}$ sample that takes about the same space as the run-length compressed ${\mathsf{BWT}}$, we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018 we showed how to build the ${\mathsf{BWT}}$ of large genomic databases efficiently (WABI 2018), but the problem of building Gagie et al.’s ${\mathsf{SA}}$ sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the ${\mathsf{SA}}$ sample and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method to indexing partial and whole human genomes and show that it improves over Bowtie with respect to both memory and time.

Availability: The implementations of our methods can be found at https://gitlab.com/manzai/Big-BWT (BWT and SA sample construction) and at https://github.com/alshai/r-index (indexing).
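The two FM-index components named in the abstract can be illustrated with a toy sketch. It is not the authors' implementation and does not reflect the Big-BWT or r-index code: the suffix array is built naively, rank is a linear scan, and the full suffix array stands in for the SA sample. All function names and the example string are assumptions chosen for illustration.

```python
def suffix_array(text):
    # Naive construction by sorting suffixes; suitable for a toy example only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text, sa):
    # BWT[i] is the character that cyclically precedes the i-th smallest suffix.
    return "".join(text[i - 1] for i in sa)


def rank(bwt, c, i):
    # Number of occurrences of c in bwt[0:i] (0-indexed, linear scan).
    return bwt[:i].count(c)


def backward_search(bwt, pattern):
    # C[c] = number of characters in the text that are strictly smaller than c.
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    lo, hi = 0, len(bwt)  # half-open suffix array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in C:
            return 0, 0
        lo = C[ch] + rank(bwt, ch, lo)
        hi = C[ch] + rank(bwt, ch, hi)
        if lo >= hi:
            return 0, 0  # pattern does not occur
    return lo, hi


def count_runs(bwt):
    # Number of maximal runs of equal characters; run-length compression
    # stores one entry per run rather than one per character.
    return sum(1 for i, ch in enumerate(bwt) if i == 0 or ch != bwt[i - 1])


text = "GATTACAGATTACA$"  # hypothetical example; '$' is the unique smallest symbol
sa = suffix_array(text)
bwt = bwt_from_sa(text, sa)
lo, hi = backward_search(bwt, "TTA")
positions = sorted(sa[lo:hi])  # a real index would use an SA sample here
print(positions, count_runs(bwt), len(bwt))
```

The search itself touches only the BWT through rank queries; the suffix array is needed only in the last step, to turn the interval into text positions. That last step is what an SA sample replaces in a space-efficient index.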

A. Kuhnle and T. Mun—Equal contribution, ordered alphabetically.

C. Boucher, T. Gagie, B. Langmead and G. Manzini—Equal contribution, ordered alphabetically.

Notes

1. With the ${\mathsf{SA}}$ sample of Gagie et al. [11], this index is termed the r-index.
2. Given a sequence (string) S[1, n] over an alphabet $\varSigma = \{1,\ldots ,\sigma \}$, a character $c \in \varSigma$, and an integer i, $\textsf{rank}_c(S,i)$ is the number of times that c appears in S[1, i].
3. "Sampled" means that only some fraction of the entries of the suffix array are stored.
4. For technical reasons, the string S must terminate with w copies of the lexicographically least symbol $\$$.
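
The rank definition in Note 2 can be made concrete with a minimal sketch. It assumes a plain prefix-count table, which answers queries in constant time but uses far more space than the compressed structures the paper is concerned with. The names `build_rank_table` and `rank` and the example sequence are hypothetical.

```python
def build_rank_table(S, sigma):
    # table[c][i] = number of occurrences of c in S[1..i] (1-indexed as in Note 2),
    # with table[c][0] = 0 for the empty prefix.
    n = len(S)
    table = {c: [0] * (n + 1) for c in range(1, sigma + 1)}
    for i, ch in enumerate(S, start=1):
        for c in table:
            table[c][i] = table[c][i - 1] + (1 if ch == c else 0)
    return table


def rank(table, c, i):
    # rank_c(S, i): number of times c appears in S[1..i].
    return table[c][i]


S = [2, 1, 2, 3, 2, 1]  # sequence over the alphabet {1, 2, 3}
table = build_rank_table(S, sigma=3)
print(rank(table, 2, 5))  # character 2 appears at positions 1, 3 and 5
```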

References

Bannai, H., Gagie, T., I, T.: Online LZ77 parsing and matching statistics with RLBWTs. In: Proceedings of the 29th Annual Symposium on Combinatorial Pattern Matching, (CPM), vol. 105, pp. 7:1–7:12 (2018)
Boucher, C., Gagie, T., Kuhnle, A., Manzini, G.: Prefix-free parsing for building big BWTs. In: Proceedings of 18th International Workshop on Algorithms in Bioinformatics, WABI, vol. 113, pp. 2:1–2:16 (2018)
Burrows, M., Wheeler, D.J.: A block sorting lossless data compression algorithm. Technical report 124. Digital Equipment Corporation (1994)
The 1000 Genomes Project Consortium: A global reference for human genetic variation. Nature 526(7571), 68–74 (2015)
Danek, A., Deorowicz, S., Grabowski, S.: Indexes of large genome collections on a PC. PLoS ONE 9(10), e109384 (2014)
Deorowicz, S., Kokot, M., Grabowski, S., Debudaj-Grabysz, A.: KMC 2: fast and resource-frugal k-mer counting. Bioinformatics 31(10), 1569–1576 (2015)
Garrison, E., et al.: Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. 36(9), 875–879 (2018)
Ferrada, H., Gagie, T., Hirvola, T., Puglisi, S.J.: Hybrid indexes for repetitive datasets. Philos. Trans. Roy. Soc. A: Math. Phys. Eng. Sci. 372(2016), 1–9 (2014)
Ferrada, H., Kempa, D., Puglisi, S.J.: Hybrid indexing revisited. In: Proceedings of the 21st Algorithm Engineering and Experiments, ALENEX, pp. 1–8 (2018)
Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proceedings of the 41st Annual Symposium on Foundations of Computer Science, FOCS, pp. 390–398 (2000)
Gagie, T., Navarro, G., Prezza, N.: Optimal-time text indexing in BWT-runs bounded space. In: Proceedings of the 29th Annual Symposium on Discrete Algorithms, SODA, pp. 1459–1477 (2018)
Gagie, T., Puglisi, S.J.: Searching and indexing genomic databases via kernelization. Front. Bioeng. Biotechnol. 3, 10–13 (2015)
Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics 25, 1754–1760 (2009)
Huang, L., Popic, V., Batzoglou, S.: Short read alignment with populations of genomes. Bioinformatics 29(13), i361–i370 (2013)
Jain, M., et al.: Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36(4), 338–345 (2018)
Jeong-Sun, S., et al.: De novo assembly and phasing of a Korean human genome. Nature 538(7624), 243–247 (2016)
Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Parallel external memory suffix sorting. In: Cicalese, F., Porat, E., Vaccaro, U. (eds.) CPM 2015. LNCS, vol. 9133, pp. 329–342. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-19929-0_28
Langmead, B., Salzberg, S.L.: Fast gapped-read alignment with Bowtie 2. Nat. Methods 9(4), 357 (2012)
Langmead, B., Trapnell, C., Pop, M., Salzberg, S.L.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2008)
Levy, S., et al.: The diploid genome sequence of an individual human. PLoS Biol. 5(10), e254 (2007)
Li, R., et al.: SOAP2: an improved tool for short read alignment. Bioinformatics 25(15), 1966–1967 (2009)
Li, H.: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997 [q-bio], March 2013
Maciuca, S., del Ojo Elias, C., McVean, G., Iqbal, Z.: A natural encoding of genetic variation in a Burrows-Wheeler Transform to enable mapping and genome inference. In: Frith, M., Storm Pedersen, C.N. (eds.) WABI 2016. LNCS, vol. 9838, pp. 222–233. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-43681-4_18
Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol. 17(3), 281–308 (2010)
Peng, Y., Leung, H.C.M., Yiu, S.M., Chin, F.Y.L.: IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28(11), 1420–1428 (2012)
Policriti, A., Prezza, N.: LZ77 computation based on the run-length encoded BWT. Algorithmica 80(7), 1986–2011 (2018)
Schneeberger, K., et al.: Simultaneous alignment of short reads against multiple genomes. Genome Biol. 10(9), R98 (2009)
Shi, L., et al.: Long-read sequencing and de novo assembly of a Chinese genome. Nat. Commun. 7, 12065 (2016)
Sirén, J., Välimäki, N., Mäkinen, V.: Indexing graphs for path queries with applications in genome research. IEEE/ACM Trans. Comput. Biol. Bioinform. 11(2), 375–388 (2014)
Steinberg, K.M., et al.: Single haplotype assembly of the human genome from a hydatidiform mole. Genome Res. 24, 2066–2076 (2014)
Stevens, E.L., et al.: The public health impact of a publically available, environmental database of microbial genomes. Front. Microbiol. 8, 808 (2017)
Valenzuela, D., Norri, T., Välimäki, N., Pitkänen, E., Mäkinen, V.: Towards pan-genome read alignment to improve variation calling. BMC Genomics 19(2), 87 (2018)
Valenzuela, D., Mäkinen, V.: CHIC: a short read aligner for pan-genomic references. Technical report, biorxiv.org (2017)
Wandelt, S., Starlinger, J., Bux, M., Leser, U.: RCSI: scalable similarity search in thousand(s) of genomes. Proc. VLDB Endow. 6(13), 1534–1545 (2013)

Acknowledgements

AK and CB were supported by the National Institute of Allergy and Infectious Diseases of the National Institutes of Health (1R01AI141810-01) and NSF-IIS (1618814). TM and BL were supported by the National Institutes of Health (R01GM118568) and NSF-IIS (1349906). TG was supported by FONDECYT grant 1171058 "Compression-aware algorithmics". GM was partially supported by PRIN grant 201534HNXC and by INdAM-GNCS Project 2019 "Innovative methods for the solution of medical and biological big data".

Author information

Authors and Affiliations

Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
Alan Kuhnle & Christina Boucher
Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
Taher Mun & Ben Langmead
School of Computer Science and Telecommunications, Universidad Diego Portales and CeBiB, Santiago, Chile
Travis Gagie
Department of Science and Technological Innovation, University of Eastern Piedmont, Alessandria, Italy
Giovanni Manzini

Corresponding author

Correspondence to Alan Kuhnle.

Editor information

Editors and Affiliations

Tufts University, Cambridge, MA, USA
Lenore J. Cowen

About this paper

Cite this paper

Kuhnle, A., Mun, T., Boucher, C., Gagie, T., Langmead, B., Manzini, G. (2019). Efficient Construction of a Complete Index for Pan-Genomics Read Alignment. In: Cowen, L. (eds) Research in Computational Molecular Biology. RECOMB 2019. Lecture Notes in Computer Science, vol 11467. Springer, Cham. https://doi.org/10.1007/978-3-030-17083-7_10

DOI: https://doi.org/10.1007/978-3-030-17083-7_10
Published: 02 April 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-17082-0
Online ISBN: 978-3-030-17083-7
eBook Packages: Computer Science (R0)
