Abstract
This paper presents a semi-automatic method for statistical language modeling. The method addresses the structure-learning problem of the linguistic classes prediction model (LCPM) in class-dependent N-grams that support multiple linguistic classes per word. The structure of the LCPM is designed within the Factorial Language Model framework by combining a knowledge-based approach with a data-driven technique. First, simple linguistic knowledge is used to define a set of linguistic features appropriate to the application and to sketch the main structure of the LCPM. Next, an automatic algorithm, grounded in information-theoretic measures, selects the relevant factors associated with the chosen features and establishes the definitive structure of the LCPM. This approach is based on the so-called Buried Markov Models [1]. Although only preliminary results have been obtained, they give strong confidence in the method's ability to learn from data LCPM structures that accurately represent the application's true dependencies and also favor training robustness.
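The data-driven step ranks candidate factors by how much information they carry about the class to be predicted. The paper's actual selection criterion builds on Buried Markov Models and is not reproduced here; as an illustration of the underlying information-theoretic idea only, the following Python sketch scores hypothetical factor streams by their estimated mutual information with the class stream and keeps the top k. All names and data in this sketch are invented for the example.

```python
from collections import Counter
import math


def mutual_information(pairs):
    """Estimate I(F; C) in bits from a list of (factor, class) observations."""
    n = len(pairs)
    joint = Counter(pairs)                 # counts of (factor, class) pairs
    pf = Counter(f for f, _ in pairs)      # marginal factor counts
    pc = Counter(c for _, c in pairs)      # marginal class counts
    mi = 0.0
    for (f, c), n_fc in joint.items():
        p_fc = n_fc / n
        # p(f,c) * log2( p(f,c) / (p(f) p(c)) ), with counts substituted in
        mi += p_fc * math.log2(p_fc * n * n / (pf[f] * pc[c]))
    return mi


def select_factors(data, candidate_factors, k):
    """Rank candidate factor streams by MI with the class stream; keep top k."""
    scored = sorted(
        candidate_factors,
        key=lambda name: mutual_information(list(zip(data[name], data["class"]))),
        reverse=True,
    )
    return scored[:k]


# Toy corpus: "pos" is perfectly informative about the class,
# "len" is statistically independent of it.
data = {
    "class": ["N", "V", "N", "V", "N", "V", "N", "V"],
    "pos":   ["n", "v", "n", "v", "n", "v", "n", "v"],
    "len":   ["a", "a", "b", "b", "a", "a", "b", "b"],
}
print(select_factors(data, ["pos", "len"], 1))  # -> ['pos']
```

In this toy run the perfectly correlated factor yields I(F; C) = 1 bit and the independent one yields 0 bits, so the informative factor is retained. A criterion in the spirit of the paper would additionally account for redundancy between factors and conditioning context, which this plain MI ranking ignores.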
Notes
1. The lighter notation \(f^{1:K}\) is used to denote the feature set \(\{f^1, f^2, \dots , f^K\}\). The same convention is used from now on, with \(f_{i:j}^{m:n}\) denoting the set of factors \(f_{\tau }^{\nu }\), \(\tau =i,\dots ,j\), \(\nu =m,\dots ,n\), where \(f_{\tau }^{\nu }\) is the factor corresponding to the linguistic feature \(f^{\nu }\) at time \(\tau \).
References
1. Bilmes, J.: Natural statistical models for automatic speech recognition. Ph.D. thesis, U.C. Berkeley, Department of EECS, CS Division (1999)
2. Bilmes, J.: Natural statistical models for automatic speech recognition. Technical report, International Computer Science Institute, October 1999
3. Bilmes, J., Kirchhoff, K.: Factored language models and generalized parallel backoff. In: Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 4–6, May 2003. https://www.aclweb.org/anthology/N03-2002
4. Federico, M.: Language models. Presented at the Fourth Machine Translation Marathon - Open Source Tools for Machine Translation, January 2010
5. Federico, M., Cettolo, M.: Efficient handling of n-gram language models for statistical machine translation. In: Proceedings of the Second Workshop on Statistical Machine Translation, pp. 88–95. Association for Computational Linguistics, Prague, June 2007. https://www.aclweb.org/anthology/W07-0712
6. Kirchhoff, K., Bilmes, J., Duh, K.: Factored language model tutorial. Technical report, University of Washington, Department of EE, February 2008
7. Kirchhoff, K., Yang, M.: Improved language modeling for statistical machine translation. In: Proceedings of the ACL Workshop on Building and Using Parallel Texts, ParaText 2005, pp. 125–128. Association for Computational Linguistics, Stroudsburg (2005). http://dl.acm.org/citation.cfm?id=1654449.1654476
8. Mateus, M., Brito, A., Duarte, I., Faria, I.: Gramática da Língua Portuguesa. Editorial Caminho, Alfragide, Portugal (2003)
9. Peng, H., Long, F., Ding, C.: Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27, 1226–1238 (2005)
10. Santos, D., Rocha, P.: Evaluating CETEMPúblico, a free resource for Portuguese. In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pp. 450–457. Association for Computational Linguistics, Stroudsburg, July 2001. http://www.linguateca.pt/superb/busca_publ.pl?idi=1141122240
© 2019 Springer Nature Switzerland AG
Cite this paper
Pera, V. (2019). A Semi-automatic Structure Learning Method for Language Modeling. In: Ekštein, K. (eds) Text, Speech, and Dialogue. TSD 2019. Lecture Notes in Computer Science(), vol 11697. Springer, Cham. https://doi.org/10.1007/978-3-030-27947-9_14
Print ISBN: 978-3-030-27946-2
Online ISBN: 978-3-030-27947-9