Abstract
This paper explores morpho-syntactic ambiguities in French to develop a strategy for part-of-speech disambiguation that a) reflects the complexity of French as an inflected language, b) optimizes the estimation of probabilities, and c) allows the user flexibility in choosing a tagset. The problem in extracting lexical probabilities from a limited training corpus is that the statistical model may not represent the use of a particular word in a particular context. In a highly inflected language, this problem is particularly serious, since a word can be tagged with a large number of parts of speech. Because sufficient training data is lacking, we argue against estimating lexical probabilities to disambiguate parts of speech in unrestricted texts. Instead, we use the strength of contextual probabilities along with a feature we call “genotype”, the set of tags associated with a word. Using this knowledge, we have built a part-of-speech tagger that combines linguistic and statistical approaches: ambiguities are first resolved in context by linguistic rules, and n-gram probabilities, estimated on parts of speech only, are then used to disambiguate the remaining ambiguous tags.
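To make the idea concrete, here is a minimal sketch in Python of tagging from genotypes and tag n-grams alone, with no lexical probabilities. It is an illustration under stated assumptions, not the tagger described in the chapter: the lexicon, the tag names, the bigram values, and the helper names `genotype` and `tag` are all hypothetical, the linguistic-rule stage is omitted, and a simple bigram Viterbi search stands in for whatever n-gram model the authors estimate.

```python
from math import log

# Hypothetical toy lexicon: each word maps to its genotype,
# i.e. the set of tags the word can carry (values invented for illustration).
GENOTYPES = {
    "il": ("PRON",),
    "la": ("DET", "PRON"),
    "porte": ("NOUN", "VERB"),
    "ferme": ("ADJ", "NOUN", "VERB"),
}

# Hypothetical tag-bigram probabilities P(tag | previous tag).
# These numbers are made up; a real system would estimate them from a corpus.
BIGRAM = {
    ("<s>", "DET"): 0.6, ("<s>", "PRON"): 0.2,
    ("DET", "NOUN"): 0.7, ("DET", "ADJ"): 0.2,
    ("PRON", "VERB"): 0.6, ("PRON", "PRON"): 0.3,
    ("PRON", "DET"): 0.05, ("PRON", "NOUN"): 0.01,
}
FLOOR = 1e-4  # probability assigned to tag bigrams missing from the table


def genotype(word):
    """Return the candidate tags for a word (assumed default: NOUN)."""
    return GENOTYPES.get(word.lower(), ("NOUN",))


def tag(words):
    """Pick the most probable tag sequence using tag bigrams only.

    The genotype restricts which tags are considered for each word;
    no P(word | tag) term is used anywhere.
    """
    best = {"<s>": (0.0, [])}  # tag -> (log score, best path ending in tag)
    for word in words:
        nxt = {}
        for t in genotype(word):
            score, path = max(
                ((s + log(BIGRAM.get((prev, t), FLOOR)), p)
                 for prev, (s, p) in best.items()),
                key=lambda cand: cand[0],
            )
            nxt[t] = (score, path + [t])
        best = nxt
    return max(best.values(), key=lambda cand: cand[0])[1]


print(tag(["la", "porte"]))        # ['DET', 'NOUN']
print(tag(["il", "la", "porte"]))  # ['PRON', 'PRON', 'VERB']
```

With these invented numbers, the same two ambiguous words "la porte" receive different tags depending only on the surrounding tag context, which is the effect the abstract attributes to contextual probabilities.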
Copyright information
© 1999 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Tzoukermann, E., Radev, D., Gale, W. (1999). Tagging French without Lexical Probabilities — Combining Linguistic Knowledge and Statistical Learning. In: Armstrong, S., Church, K., Isabelle, P., Manzi, S., Tzoukermann, E., Yarowsky, D. (eds) Natural Language Processing Using Very Large Corpora. Text, Speech and Language Technology, vol 11. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2390-9_4
DOI: https://doi.org/10.1007/978-94-017-2390-9_4
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-5349-7
Online ISBN: 978-94-017-2390-9
eBook Packages: Springer Book Archive