Space-Efficient Detection of Unusual Words

Belazzougui, Djamal; Cunial, Fabio

doi:10.1007/978-3-319-23826-5_22

Djamal Belazzougui & Fabio Cunial

Conference paper
First Online: 01 January 2015

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9309)

Abstract

Detecting all the strings that occur in a text more frequently or less frequently than expected according to an IID or a Markov model is a basic problem in string mining, yet current algorithms are based on data structures that are either space-inefficient or incur large slowdowns, and current implementations cannot scale to genomes or metagenomes in practice. In this paper we engineer an algorithm based on the suffix tree of a string to use just a small data structure built on the Burrows-Wheeler transform, and a stack of \(O(\sigma^2 \log^2 n)\) bits, where \(n\) is the length of the string and \(\sigma\) is the size of the alphabet. The size of the stack is \(o(n)\) except for very large values of \(\sigma\). We further improve the algorithm by removing its time dependency on \(\sigma\), by reporting only a subset of the maximal repeats and of the minimal rare words of the string, and by detecting and scoring candidate under-represented strings that do not occur in the string. Our algorithms are practical and work directly on the BWT, thus they can be immediately applied to a number of existing datasets that are available in this form, returning this string mining problem to a manageable scale.
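
As a rough illustration of what it means to score a word against an IID model, the Python sketch below counts every substring of a fixed length k in a text and compares each observed count with the count expected when symbols are drawn independently with their empirical frequencies. It is a naive baseline that stores all k-mers explicitly, not the BWT-based algorithm of the paper. The function name `iid_scores`, the fixed length `k`, and the binomial variance approximation (which ignores self-overlaps of a word) are assumptions introduced here for illustration only.

```python
from collections import Counter
from math import sqrt


def iid_scores(text, k):
    """Return {word: (observed, expected, z)} for every k-mer occurring in text.

    Hypothetical helper: a naive sketch of IID-based scoring, not the
    space-efficient method described in the abstract.
    """
    n = len(text)
    if k < 1 or k > n:
        raise ValueError("k must be between 1 and len(text)")

    # Symbol probabilities estimated from the text itself (IID assumption).
    freq = Counter(text)
    prob = {c: freq[c] / n for c in freq}

    # Number of starting positions of a length-k window.
    positions = n - k + 1
    observed = Counter(text[i:i + k] for i in range(positions))

    scores = {}
    for word, count in observed.items():
        # Probability of the word under the IID model.
        p = 1.0
        for c in word:
            p *= prob[c]
        expected = positions * p
        # Assumed binomial approximation of the variance; an exact
        # treatment would also account for the word overlapping itself.
        variance = positions * p * (1.0 - p)
        z = (count - expected) / sqrt(variance) if variance > 0 else 0.0
        scores[word] = (count, expected, z)
    return scores
```

A word with a large positive z would be a candidate over-represented word and one with a large negative z a candidate under-represented word. Strings that never occur in the text, which the abstract also mentions, are not enumerated by this sketch.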

This work was partially supported by the Academy of Finland under grant 284598 (Center of Excellence in Cancer Genetics Research).

Author information

Authors and Affiliations

Department of Computer Science, University of Helsinki, Helsinki, Finland
Djamal Belazzougui
Helsinki Institute for Information Technology, Helsinki, Finland
Djamal Belazzougui
Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
Fabio Cunial

Corresponding author

Correspondence to Fabio Cunial.

Editor information

Editors and Affiliations

King's College London, London, United Kingdom
Costas Iliopoulos
University of Helsinki, Helsinki, Finland
Simon Puglisi
University College London, London, United Kingdom
Emine Yilmaz

About this paper

Cite this paper

Belazzougui, D., Cunial, F. (2015). Space-Efficient Detection of Unusual Words. In: Iliopoulos, C., Puglisi, S., Yilmaz, E. (eds) String Processing and Information Retrieval. SPIRE 2015. Lecture Notes in Computer Science, vol 9309. Springer, Cham. https://doi.org/10.1007/978-3-319-23826-5_22

DOI: https://doi.org/10.1007/978-3-319-23826-5_22
Published: 05 September 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23825-8
Online ISBN: 978-3-319-23826-5
eBook Packages: Computer Science, Computer Science (R0)
