Abstract
In this paper we show that clustering alphabet symbols before probabilistic deterministic finite automaton (PDFA) inference reduces perplexity on new data. This result is especially important in real tasks, such as spoken language interfaces, in which data sparseness is a significant issue. We describe the application of the ALERGIA algorithm, combined with an independent clustering technique, to the Air Travel Information System (ATIS) task. A 25% reduction in perplexity was obtained. This result outperforms a trigram model under the same simple smoothing scheme.
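As a rough illustration of the two notions the abstract relies on, the sketch below computes perplexity from per-symbol log probabilities and relabels symbols with their cluster before any model is inferred. It is not the authors' implementation: the function names, the class map, and the example sentence are hypothetical, and the abstract does not say how the clusters are built or how the model is trained.

```python
import math


def perplexity(log2_probs):
    """Perplexity = 2 ** (-mean log2 probability per symbol)."""
    if not log2_probs:
        raise ValueError("need at least one symbol probability")
    return 2 ** (-sum(log2_probs) / len(log2_probs))


def map_to_classes(sentence, symbol_to_class):
    """Replace each symbol by its class label.

    Symbols missing from the (hypothetical) class map keep their own label.
    """
    return [symbol_to_class.get(symbol, symbol) for symbol in sentence]


# Hypothetical class map: two city names share one class label.
symbol_to_class = {"boston": "CITY", "denver": "CITY"}
clustered = map_to_classes(["fly", "to", "boston"], symbol_to_class)

# A model that gives every symbol probability 1/4 has perplexity 4.
uniform_ppl = perplexity([math.log2(0.25)] * 8)
```

Relabeling shrinks the alphabet seen during inference, so each remaining symbol is observed more often. That is the sense in which clustering addresses data sparseness.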
Copyright information
© 1998 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Dupont, P., Chase, L. (1998). Using symbol clustering to improve probabilistic automaton inference. In: Honavar, V., Slutzki, G. (eds) Grammatical Inference. ICGI 1998. Lecture Notes in Computer Science, vol 1433. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0054079
DOI: https://doi.org/10.1007/BFb0054079
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-64776-8
Online ISBN: 978-3-540-68707-8
eBook Packages: Springer Book Archive