Exact and Efficient Computation of the Expected Number of Missing and Common Words in Random Texts

  • Conference paper
Combinatorial Pattern Matching (CPM 2000)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1848)

Abstract

The number of missing words (NMW) of length q in a text, and the number of common words (NCW) of two texts are useful text statistics. Knowing the distribution of the NMW in a random text is essential for the construction of so-called monkey tests for pseudorandom number generators. Knowledge of the distribution of the NCW of two independent random texts is useful for the average case analysis of a family of fast pattern matching algorithms, namely those which use a technique called q-gram filtration. Despite these important applications, we are not aware of any exact studies of these text statistics. We propose an efficient method to compute their expected values exactly. The difficulty of the computation lies in the strong dependence of successive words, as they overlap by (q - 1) characters. Our method is based on the enumeration of all string autocorrelations of length q, i.e., of the ways a word of length q can overlap itself. For this, we present the first efficient algorithm. Furthermore, by assuming the words are independent, we obtain very simple approximation formulas, which are shown to be surprisingly good when compared to the exact values.
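To make the two central notions of the abstract concrete, the following is a minimal sketch, not the authors' algorithm. It assumes the usual definition of a string autocorrelation (bit i is set when the word matches itself after a shift by i positions), enumerates autocorrelations by brute force over a two-letter alphabet rather than with the efficient method the paper announces, and uses the classical occupancy (urn-model) expression as the independence approximation for the expected NMW. The function names and the parameters `n` (text length), `q` (word length), and `sigma` (alphabet size) are chosen here for illustration, and the approximation may differ in form from the formulas given in the paper.

```python
from itertools import product


def autocorrelation(word):
    """Return the autocorrelation of `word` as a tuple of bits.

    Bit i is 1 iff the word overlaps itself when shifted by i positions,
    i.e. word[i:] == word[:len(word) - i]. Bit 0 is always 1.
    """
    q = len(word)
    return tuple(int(word[i:] == word[:q - i]) for i in range(q))


def autocorrelation_populations(q, alphabet="ab"):
    """Brute-force enumeration (exponential in q, illustration only).

    Maps each autocorrelation of length q to the number of words over
    `alphabet` that have it.
    """
    counts = {}
    for letters in product(alphabet, repeat=q):
        c = autocorrelation("".join(letters))
        counts[c] = counts.get(c, 0) + 1
    return counts


def approx_expected_missing(n, q, sigma):
    """Independence approximation of the expected number of missing words.

    Assumption: the n - q + 1 overlapping windows of the text are treated
    as independent uniform draws from the sigma**q possible words.
    """
    words = sigma ** q
    positions = n - q + 1
    return words * (1.0 - 1.0 / words) ** positions
```

For example, `autocorrelation("abab")` yields `(1, 0, 1, 0)`, because the word overlaps itself only at shifts 0 and 2. The independence approximation ignores exactly this overlap structure, which is why an exact computation has to account for each autocorrelation and the number of words sharing it.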

Copyright information

© 2000 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Rahmann, S., Rivals, E. (2000). Exact and Efficient Computation of the Expected Number of Missing and Common Words in Random Texts. In: Giancarlo, R., Sankoff, D. (eds) Combinatorial Pattern Matching. CPM 2000. Lecture Notes in Computer Science, vol 1848. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45123-4_31

  • DOI: https://doi.org/10.1007/3-540-45123-4_31

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-67633-1

  • Online ISBN: 978-3-540-45123-5

  • eBook Packages: Springer Book Archive
