Abstract
In this article we present practical and theoretical improvements to the computation of the Burrows-Wheeler similarity distribution for all pairs of strings in a collection. Our algorithms take advantage of the Burrows-Wheeler transform (BWT) computed for the concatenation of all strings, instead of the pairwise construction of BWTs performed by the straightforward approach, and use compressed data structures that allow reductions of running time while still keeping a small memory footprint, as shown by a set of experiments with real datasets.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
References
Baeza-Yates, R.A., Ribeiro-Neto, B.A.: Modern Information Retrieval: The Concepts and Technology Behind Search, 2nd edn. Pearson Education Ltd., Harlow (2011)
Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical report, Digital SRC Research Report (1994)
Fischer, J., Heun, V.: Theoretical and practical improvements on the RMQ-problem, with applications to LCA and LCE. In: Lewenstein, M., Valiente, G. (eds.) CPM 2006. LNCS, vol. 4009, pp. 36–48. Springer, Heidelberg (2006). https://doi.org/10.1007/11780441_5
Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 326–337. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07959-2_28
Gonnet, G.H., Baeza-Yates, R.A., Snider, T.: New indices for text: pat trees and pat arrays. In: Information Retrieval, pp. 66–82. Prentice-Hall Inc., Upper Saddle River (1992)
Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proceedings of ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 841–850. ACM/SIAM (2003)
Louza, F.A., Gog, S., Telles, G.P.: Inducing enhanced suffix arrays for string collections. Theor. Comput. Sci. 678, 22–39 (2017)
Manber, U., Myers, E.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)
Mantaci, S., Restivo, A., Rosone, G., Sciortino, M.: An extension of the burrows wheeler transform and applications to sequence comparison and data compression. In: Apostolico, A., Crochemore, M., Park, K. (eds.) CPM 2005. LNCS, vol. 3537, pp. 178–189. Springer, Heidelberg (2005). https://doi.org/10.1007/11496656_16
Mantaci, S., Restivo, A., Rosone, G., Sciortino, M.: A new combinatorial approach to sequence comparison. Theory Comput. Syst. 42(3), 411–429 (2008)
Mantaci, S., Restivo, A., Sciortino, M.: Distance measures for biological sequences: some recent approaches. Int. J. Approx. Reason. 47(1), 109–124 (2008)
Munro, J.I.: Tables. In: Chandru, V., Vinay, V. (eds.) FSTTCS 1996. LNCS, vol. 1180, pp. 37–42. Springer, Heidelberg (1996). https://doi.org/10.1007/3-540-62034-6_35
Munro, J.I., Nekrich, Y., Vitter, J.S.: Fast construction of wavelet trees. Theor. Comput. Sci. 638, 91–97 (2016)
Muthukrishnan, S.: Efficient algorithms for document retrieval problems. In: Proceedings ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 657–666. ACM/SIAM (2002)
Nong, G.: Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Trans. Inform. Syst. 31(3), 1–15 (2013)
Ohlebusch, E.: Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch Verlag (2013)
Okanohara, D., Sadakane, K.: Practical entropy-compressed rank/select dictionary. In: Proceedings of Workshop on Algorithm Engineering and Experimentation (ALENEX). SIAM (2007)
Okanohara, D., Sadakane, K.: A linear-time burrows-wheeler transform using induced sorting. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds.) SPIRE 2009. LNCS, vol. 5721, pp. 90–101. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-03784-9_9
Paiva, J.G., Florian, L., Pedrini, H., Telles, G., Minghim, R.: Improved similarity trees and their application to visual data classification. IEEE Trans. Vis. Comput. Graph. 17(12), 2459–2468 (2011)
Puglisi, S.J., Smyth, W.F., Turpin, A.H.: A taxonomy of suffix array construction algorithms. ACM Comp. Surv. 39(2), 1–31 (2007)
Yang, L., Zhang, X., Wang, T.: The Burrows-Wheeler similarity distribution between biological sequences based on Burrows-Wheeler transform. J. Theor. Biol. 262(4), 742–749 (2010)
Acknowledgments
The authors thank Prof. Nalvo Almeida for granting access to the machine used for the experiments.
Funding. FAL was supported by the grant \(\#\)2017/09105-0 from the São Paulo Research Foundation (FAPESP). GPT acknowledges the support of CNPq grants 425340/2016-3 and 310685/2015-0.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Louza, F.A., Telles, G.P., Gog, S., Zhao, L. (2018). Computing Burrows-Wheeler Similarity Distributions for String Collections. In: Gagie, T., Moffat, A., Navarro, G., Cuadros-Vargas, E. (eds) String Processing and Information Retrieval. SPIRE 2018. Lecture Notes in Computer Science(), vol 11147. Springer, Cham. https://doi.org/10.1007/978-3-030-00479-8_23
Download citation
DOI: https://doi.org/10.1007/978-3-030-00479-8_23
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00478-1
Online ISBN: 978-3-030-00479-8
eBook Packages: Computer ScienceComputer Science (R0)