Exact and Efficient Computation of the Expected Number of Missing and Common Words in Random Texts

  • Sven Rahmann
  • Eric Rivals
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1848)


The number of missing words (NMW) of length q in a text, and the number of common words (NCW) of two texts are useful text statistics. Knowing the distribution of the NMW in a random text is essential for the construction of so-called monkey tests for pseudorandom number generators. Knowledge of the distribution of the NCW of two independent random texts is useful for the average case analysis of a family of fast pattern matching algorithms, namely those which use a technique called q-gram filtration. Despite these important applications, we are not aware of any exact studies of these text statistics. We propose an efficient method to compute their expected values exactly. The difficulty of the computation lies in the strong dependence of successive words, as they overlap by (q - 1) characters. Our method is based on the enumeration of all string autocorrelations of length q, i.e., of the ways a word of length q can overlap itself. For this, we present the first efficient algorithm. Furthermore, by assuming the words are independent, we obtain very simple approximation formulas, which are shown to be surprisingly good when compared to the exact values.
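
The two ingredients named in the abstract can be illustrated with a minimal sketch. It assumes a uniform i.i.d. text over an alphabet of size sigma, and all function names are hypothetical. The enumeration is plain brute force over all words, not the efficient algorithm the paper presents. The approximation is the classical occupancy-style estimate obtained by treating the n - q + 1 word positions as independent, and may differ in detail from the paper's formulas.

```python
from itertools import product
from math import exp


def autocorrelation(word):
    """Return the autocorrelation of `word` as a tuple of 0/1 flags.

    Entry i is 1 exactly when the word shifted right by i positions agrees
    with itself on the overlap, i.e. word[i:] == word[:len(word) - i].
    """
    q = len(word)
    return tuple(int(word[i:] == word[:q - i]) for i in range(q))


def distinct_autocorrelations(q, alphabet):
    """Brute force: collect the autocorrelations of all len(alphabet)**q words."""
    return {autocorrelation("".join(w)) for w in product(alphabet, repeat=q)}


def missing_words_independent(q, n, sigma):
    """Approximate expected NMW, assuming the n - q + 1 words are independent."""
    words = sigma ** q          # number of possible words of length q
    positions = n - q + 1       # number of word positions in a text of length n
    return words * (1.0 - 1.0 / words) ** positions


def missing_words_exponential(q, n, sigma):
    """Exponential form of the same independence approximation."""
    words = sigma ** q
    positions = n - q + 1
    return words * exp(-positions / words)
```

For example, `autocorrelation("abab")` evaluates to `(1, 0, 1, 0)`: the word overlaps itself at shifts 0 and 2 only. The brute-force enumeration costs time exponential in q, so it is only practical for small q and serves merely to make the definition concrete.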


Keywords: Common Word; Exponential Approximation; Alphabet Size; Random Text; Monkey Test
These keywords were added by machine and not by the authors.





Copyright information

© Springer-Verlag Berlin Heidelberg 2000

Authors and Affiliations

  • Sven Rahmann (1)
  • Eric Rivals (2)
  1. Theoretische Bioinformatik (TBI), Deutsches Krebsforschungszentrum (DKFZ), Heidelberg, Germany
  2. L.I.R.M.M., Montpellier Cedex 5, France
