Skip to main content

Computing Burrows-Wheeler Similarity Distributions for String Collections

  • Conference paper
  • First Online:
String Processing and Information Retrieval (SPIRE 2018)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 11147))

Included in the following conference series:

Abstract

In this article we present practical and theoretical improvements to the computation of the Burrows-Wheeler similarity distribution for all pairs of strings in a collection. Our algorithms take advantage of the Burrows-Wheeler transform (BWT) computed for the concatenation of all strings, instead of the pairwise construction of BWTs performed by the straightforward approach, and use compressed data structures that allow reductions of running time while still keeping a small memory footprint, as shown by a set of experiments with real datasets.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    https://github.com/simongog/sdsl-lite.

  2. 2.

    https://github.com/felipelouza/gsa-is/.

  3. 3.

    http://gage.cbcb.umd.edu/data/index.html.

  4. 4.

    http://www.ebi.ac.uk/uniprot/download-center/.

  5. 5.

    http://www.uni-ulm.de/in/theo/research/seqana.html.

  6. 6.

    http://algo2.iti.kit.edu/gog/projects/ALENEX15/collections/ENWIKIBIG/.

  7. 7.

    http://panthema.net/2013/malloc_count.

References

  1. Baeza-Yates, R.A., Ribeiro-Neto, B.A.: Modern Information Retrieval: The Concepts and Technology Behind Search, 2nd edn. Pearson Education Ltd., Harlow (2011)

    Google Scholar 

  2. Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical report, Digital SRC Research Report (1994)

    Google Scholar 

  3. Fischer, J., Heun, V.: Theoretical and practical improvements on the RMQ-problem, with applications to LCA and LCE. In: Lewenstein, M., Valiente, G. (eds.) CPM 2006. LNCS, vol. 4009, pp. 36–48. Springer, Heidelberg (2006). https://doi.org/10.1007/11780441_5

    Chapter  Google Scholar 

  4. Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 326–337. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07959-2_28

    Chapter  Google Scholar 

  5. Gonnet, G.H., Baeza-Yates, R.A., Snider, T.: New indices for text: pat trees and pat arrays. In: Information Retrieval, pp. 66–82. Prentice-Hall Inc., Upper Saddle River (1992)

    Google Scholar 

  6. Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proceedings of ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 841–850. ACM/SIAM (2003)

    Google Scholar 

  7. Louza, F.A., Gog, S., Telles, G.P.: Inducing enhanced suffix arrays for string collections. Theor. Comput. Sci. 678, 22–39 (2017)

    Article  MathSciNet  Google Scholar 

  8. Manber, U., Myers, E.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)

    Article  MathSciNet  Google Scholar 

  9. Mantaci, S., Restivo, A., Rosone, G., Sciortino, M.: An extension of the burrows wheeler transform and applications to sequence comparison and data compression. In: Apostolico, A., Crochemore, M., Park, K. (eds.) CPM 2005. LNCS, vol. 3537, pp. 178–189. Springer, Heidelberg (2005). https://doi.org/10.1007/11496656_16

    Chapter  MATH  Google Scholar 

  10. Mantaci, S., Restivo, A., Rosone, G., Sciortino, M.: A new combinatorial approach to sequence comparison. Theory Comput. Syst. 42(3), 411–429 (2008)

    Article  MathSciNet  Google Scholar 

  11. Mantaci, S., Restivo, A., Sciortino, M.: Distance measures for biological sequences: some recent approaches. Int. J. Approx. Reason. 47(1), 109–124 (2008)

    Article  MathSciNet  Google Scholar 

  12. Munro, J.I.: Tables. In: Chandru, V., Vinay, V. (eds.) FSTTCS 1996. LNCS, vol. 1180, pp. 37–42. Springer, Heidelberg (1996). https://doi.org/10.1007/3-540-62034-6_35

    Chapter  Google Scholar 

  13. Munro, J.I., Nekrich, Y., Vitter, J.S.: Fast construction of wavelet trees. Theor. Comput. Sci. 638, 91–97 (2016)

    Article  MathSciNet  Google Scholar 

  14. Muthukrishnan, S.: Efficient algorithms for document retrieval problems. In: Proceedings ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 657–666. ACM/SIAM (2002)

    Google Scholar 

  15. Nong, G.: Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Trans. Inform. Syst. 31(3), 1–15 (2013)

    Article  MathSciNet  Google Scholar 

  16. Ohlebusch, E.: Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch Verlag (2013)

    Google Scholar 

  17. Okanohara, D., Sadakane, K.: Practical entropy-compressed rank/select dictionary. In: Proceedings of Workshop on Algorithm Engineering and Experimentation (ALENEX). SIAM (2007)

    Google Scholar 

  18. Okanohara, D., Sadakane, K.: A linear-time burrows-wheeler transform using induced sorting. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds.) SPIRE 2009. LNCS, vol. 5721, pp. 90–101. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-03784-9_9

    Chapter  Google Scholar 

  19. Paiva, J.G., Florian, L., Pedrini, H., Telles, G., Minghim, R.: Improved similarity trees and their application to visual data classification. IEEE Trans. Vis. Comput. Graph. 17(12), 2459–2468 (2011)

    Article  Google Scholar 

  20. Puglisi, S.J., Smyth, W.F., Turpin, A.H.: A taxonomy of suffix array construction algorithms. ACM Comp. Surv. 39(2), 1–31 (2007)

    Article  Google Scholar 

  21. Yang, L., Zhang, X., Wang, T.: The Burrows-Wheeler similarity distribution between biological sequences based on Burrows-Wheeler transform. J. Theor. Biol. 262(4), 742–749 (2010)

    Article  MathSciNet  Google Scholar 

Download references

Acknowledgments

The authors thank Prof. Nalvo Almeida for granting access to the machine used for the experiments.

Funding. FAL was supported by the grant \(\#\)2017/09105-0 from the São Paulo Research Foundation (FAPESP). GPT acknowledges the support of CNPq grants 425340/2016-3 and 310685/2015-0.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Felipe A. Louza .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Louza, F.A., Telles, G.P., Gog, S., Zhao, L. (2018). Computing Burrows-Wheeler Similarity Distributions for String Collections. In: Gagie, T., Moffat, A., Navarro, G., Cuadros-Vargas, E. (eds) String Processing and Information Retrieval. SPIRE 2018. Lecture Notes in Computer Science(), vol 11147. Springer, Cham. https://doi.org/10.1007/978-3-030-00479-8_23

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-00479-8_23

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00478-1

  • Online ISBN: 978-3-030-00479-8

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics