Abstract
We consider the sequence comparison problem, also known as the “hidden pattern” problem, where one searches for a given subsequence in a text (rather than a string understood as a sequence of consecutive symbols). A characteristic parameter is the number of occurrences of a given pattern w of length m as a subsequence in a random text of length n generated by a memoryless source. Valid occurrences may be defined with or without constraints on the spacings between letters of the pattern. We determine the mean and the variance of the number of occurrences, and establish a Gaussian limit law. These results are obtained via combinatorics on words, formal language techniques, and methods of analytic combinatorics based on generating functions and convergence of moments. The motivation for studying this problem comes from an attempt to find a reliable threshold for intrusion detection, from textual data processing applications, and from molecular biology.
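As an illustration of the quantity studied, the following minimal sketch counts occurrences of a pattern as a subsequence in the unconstrained case (no restriction on spacings) and compares the exact mean over all texts of length n with the elementary closed form C(n, m) · p(w_1) ⋯ p(w_m), which follows from linearity of expectation under a memoryless source. The function names and the example letter probabilities are assumptions made for this illustration; they are not taken from the paper, and the sketch covers neither the constrained case nor the variance.

```python
from itertools import product
from math import comb, prod


def count_hidden_occurrences(pattern: str, text: str) -> int:
    """Count occurrences of `pattern` as a subsequence of `text` (unconstrained gaps)."""
    # ways[j] = number of ways to match the first j pattern letters in the text read so far
    ways = [1] + [0] * len(pattern)
    for symbol in text:
        # Scan j from right to left so one text symbol is never used twice in an occurrence
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == symbol:
                ways[j] += ways[j - 1]
    return ways[len(pattern)]


def expected_occurrences(pattern: str, n: int, probs: dict) -> float:
    """Mean count for a memoryless source: C(n, m) position sets, each matching
    with probability equal to the product of the pattern's letter probabilities."""
    return comb(n, len(pattern)) * prod(probs[c] for c in pattern)


def exact_mean_by_enumeration(pattern: str, n: int, probs: dict) -> float:
    """Brute-force mean over all texts of length n (only feasible for small n)."""
    total = 0.0
    for letters in product(probs, repeat=n):
        text = "".join(letters)
        weight = prod(probs[c] for c in text)  # probability of this text
        total += weight * count_hidden_occurrences(pattern, text)
    return total


if __name__ == "__main__":
    probs = {"a": 0.3, "b": 0.7}  # hypothetical letter probabilities
    print(count_hidden_occurrences("ab", "aabb"))  # 4
    print(expected_occurrences("ab", 4, probs))
    print(exact_mean_by_enumeration("ab", 4, probs))
```

The dynamic program runs in O(nm) time, whereas the enumeration is exponential in n and serves only as a sanity check on the closed-form mean.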
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Flajolet, P., Guivarc’h, Y., Szpankowski, W., Vallée, B. (2001). Hidden Pattern Statistics. In: Orejas, F., Spirakis, P.G., van Leeuwen, J. (eds) Automata, Languages and Programming. ICALP 2001. Lecture Notes in Computer Science, vol 2076. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48224-5_13
DOI: https://doi.org/10.1007/3-540-48224-5_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42287-7
Online ISBN: 978-3-540-48224-6
eBook Packages: Springer Book Archive