Exact and Efficient Computation of the Expected Number of Missing and Common Words in Random Texts

Rahmann, Sven; Rivals, Eric

doi:10.1007/3-540-45123-4_31

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1848)

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching

Abstract

The number of missing words (NMW) of length q in a text, and the number of common words (NCW) of two texts are useful text statistics. Knowing the distribution of the NMW in a random text is essential for the construction of so-called monkey tests for pseudorandom number generators. Knowledge of the distribution of the NCW of two independent random texts is useful for the average case analysis of a family of fast pattern matching algorithms, namely those which use a technique called q-gram filtration. Despite these important applications, we are not aware of any exact studies of these text statistics. We propose an efficient method to compute their expected values exactly. The difficulty of the computation lies in the strong dependence of successive words, as they overlap by (q - 1) characters. Our method is based on the enumeration of all string autocorrelations of length q, i.e., of the ways a word of length q can overlap itself. For this, we present the first efficient algorithm. Furthermore, by assuming the words are independent, we obtain very simple approximation formulas, which are shown to be surprisingly good when compared to the exact values.
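The two ingredients named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the enumeration below is a brute-force pass over all words, feasible only for small q, whereas the paper presents an efficient method. The function names are hypothetical. The approximation assumes a uniform i.i.d. text of length n over an alphabet of size sigma and treats the n - q + 1 word positions as independent, which gives sigma^q (1 - sigma^(-q))^(n - q + 1) as an estimate of the expected NMW; whether this matches the paper's own approximation formula term for term is an assumption.

```python
from itertools import product


def autocorrelation(word: str) -> str:
    """Return the autocorrelation of `word` as a bit string.

    Bit i is '1' when the word shifted right by i positions agrees with
    itself on the overlap, i.e. word[i:] == word[:len(word) - i].
    """
    q = len(word)
    return "".join("1" if word[i:] == word[: q - i] else "0" for i in range(q))


def distinct_autocorrelations(alphabet: str, q: int) -> set[str]:
    """Brute-force enumeration over all len(alphabet)**q words (small q only)."""
    return {autocorrelation("".join(w)) for w in product(alphabet, repeat=q)}


def approx_expected_missing(sigma: int, q: int, n: int) -> float:
    """Independence approximation of the expected number of missing words.

    Assumes uniform i.i.d. characters and pretends that the n - q + 1
    overlapping words are independent draws from the sigma**q possible words.
    """
    positions = n - q + 1
    return sigma**q * (1.0 - sigma ** (-q)) ** positions
```

For example, `autocorrelation("abab")` is `"1010"`: the word overlaps itself at shifts 0 and 2 only. The bit at shift 0 is always set, since every word matches itself without any shift.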

Author information

Authors and Affiliations

Theoretische Bioinformatik (TBI), Deutsches Krebsforschungszentrum (DKFZ), Im Neuenheimer Feld 280, D-69120, Heidelberg, Germany
Sven Rahmann
L.I.R.M.M, 161 rue Ada, F-34392, Montpellier Cedex 5, France
Eric Rivals

Editor information

Editors and Affiliations

Dipartimento di Matematica ed Applicazioni, Università di Palermo, Via Archirafi 34, 90123, Palermo, Italy
Raffaele Giancarlo
Centre de recherches mathématiques, Université de Montréal, CP 6128, succursale Centre-Ville, Montréal, Québec, Canada, H3C 3J7
David Sankoff

About this paper

Cite this paper

Rahmann, S., Rivals, E. (2000). Exact and Efficient Computation of the Expected Number of Missing and Common Words in Random Texts. In: Giancarlo, R., Sankoff, D. (eds) Combinatorial Pattern Matching. CPM 2000. Lecture Notes in Computer Science, vol 1848. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45123-4_31

DOI: https://doi.org/10.1007/3-540-45123-4_31
Published: 07 November 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67633-1
Online ISBN: 978-3-540-45123-5
eBook Packages: Springer Book Archive
