Generalized Pattern Matching Statistics
In pattern matching algorithms, a characteristic parameter is the number of occurrences of a given pattern in a random text of length n generated by a source. We generalize the pattern matching problem in two ways. First, we work with a generalized notion of pattern that encompasses classical patterns as well as “hidden patterns”. Second, we consider a fairly general probabilistic model of sources that may exhibit a high degree of correlation. Such sources are built from dynamical systems and are called dynamical sources. We determine the mean and the variance of the number of occurrences in this generalized pattern matching problem, and establish a concentration property for its distribution. These results are obtained via combinatorics, formal language techniques, and methods of analytic combinatorics based on generating operators and generating functions. The generating operators come from the dynamical system framework and themselves produce the generating functions. The motivation for studying this problem comes from the search for a reliable threshold for intrusion detection, from textual data processing applications, and from molecular biology.
Keywords: Generate Operator, Intrusion Detection, Pattern Match, Regular Language, Dynamical Source
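To make the counted quantity concrete, the sketch below treats the simplest kind of hidden pattern, read here as a word occurring in the text as a subsequence with no constraint on the gaps between its letters. It also assumes a memoryless source (independent symbols with fixed probabilities), which is far simpler than the dynamical sources considered in the paper; under that assumption the mean follows from linearity of expectation. The function names and the `probs` dictionary are illustrative and do not come from the paper.

```python
from math import comb, prod


def count_hidden_occurrences(text: str, pattern: str) -> int:
    """Count the ways `pattern` occurs in `text` as a subsequence."""
    # ways[j] = number of ways to match the first j pattern symbols
    # using the text symbols read so far.
    ways = [1] + [0] * len(pattern)
    for symbol in text:
        # Scan j from right to left so that one text symbol is never
        # used for two different pattern positions in the same match.
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == symbol:
                ways[j] += ways[j - 1]
    return ways[-1]


def expected_hidden_occurrences(n: int, pattern: str, probs: dict) -> float:
    """Mean number of occurrences in a random text of length n.

    Assumption: memoryless source, where `probs` maps each symbol to
    its probability. Each of the C(n, m) position sets matches the
    pattern with probability prod(probs[c] for c in pattern).
    """
    m = len(pattern)
    return comb(n, m) * prod(probs[c] for c in pattern)
```

For example, `count_hidden_occurrences("abab", "ab")` counts the position pairs (1,2), (1,4), and (3,4). The variance and the concentration property stated in the abstract need the finer analysis of the paper and are not reproduced by this sketch.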