Abstract
An efficient data structure for representing the full index of a set of strings is the factor automaton, the minimal deterministic automaton representing the set of all factors or substrings of these strings. This paper presents a novel analysis of the size of the factor automaton of an automaton, that is the minimal deterministic automaton accepting the set of factors of a finite set of strings, itself represented by a finite automaton. It shows that the factor automaton of a set of strings U has at most 2|Q| − 2 states, where Q is the number of nodes of a prefix-tree representing the strings in U, a bound that significantly improves over 2 ||U||− 1, the bound given by Blumer et al. (1987), where ||U|| is the sum of the lengths of all strings in U. It also gives novel and general bounds for the size of the factor automaton of an automaton as a function of the size of the original automaton and the maximal length of a suffix shared by the strings it accepts. Our analysis suggests that the use of factor automata of automata can be practical for large-scale applications, a fact that is further supported by the results of our experiments applying factor automata to a music identification task with more than 15,000 songs.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Gusfield, D.: Algorithms on Strings, Trees, and Sequences. Cambridge University Press, Cambridge, UK (1997)
Crochemore, M., Rytter, W.: Jewels of Stringology. World Scientific, Singapore (2002)
Allauzen, C., Mohri, M., Saraclar, M.: General Indexation of Weighted Automata – Application to Spoken Utterance Retrieval. In: Proceedings of the Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval (HLT/NAACL 2004), Boston, Massachusetts, pp. 33–40 (2004)
Weinstein, E., Moreno, P.: Music Identification with Weighted Finite-State Transducers. In: Proceedings of ICASSP 2007, Honolulu, Hawaii (2007)
Blumer, A., Blumer, J., Haussler, D., Ehrenfeucht, A., Chen, M., Seiferas, J.: The smallest automaton recognizing the subwords of a text. Theoretical Computer Science 40, 31–55 (1985)
Crochemore, M.: Transducers and repetitions. Theoretical Computer Science 45, 63–86 (1986)
Blumer, A., Blumer, J., Ehrenfeucht, A., Haussler, D., McConnell, R.: Complete inverted files for efficient text retrieval and analysis. Journal of the ACM 34, 578–589 (1987)
Mohri, M.: Finite-state transducers in language and speech processing. Computational Linguistics 23, 269–311 (1997)
Mohri, M.: Statistical Natural Language Processing. In: Lothaire, M. (ed.) Applied Combinatorics on Words, Cambridge University Press, Cambridge (2005)
Author information
Authors and Affiliations
Editor information
Rights and permissions
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mohri, M., Moreno, P., Weinstein, E. (2007). Factor Automata of Automata and Applications. In: Holub, J., Žďárek, J. (eds) Implementation and Application of Automata. CIAA 2007. Lecture Notes in Computer Science, vol 4783. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76336-9_17
Download citation
DOI: https://doi.org/10.1007/978-3-540-76336-9_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-76335-2
Online ISBN: 978-3-540-76336-9
eBook Packages: Computer ScienceComputer Science (R0)