Tagging French without Lexical Probabilities — Combining Linguistic Knowledge and Statistical Learning

Tzoukermann, E.; Radev, D.; Gale, W.

doi:10.1007/978-94-017-2390-9_4

Part of the book series: Text, Speech and Language Technology (TLTB, volume 11)

Abstract

This paper explores morpho-syntactic ambiguities in French in order to develop a strategy for part-of-speech disambiguation that a) reflects the complexity of French as an inflected language, b) optimizes the estimation of probabilities, and c) allows the user flexibility in choosing a tagset. The problem with extracting lexical probabilities from a limited training corpus is that the statistical model may not represent the use of a particular word in a particular context. In a highly inflected language, this problem is particularly serious, since a word can be tagged with a large number of parts of speech. Because sufficient training data is lacking, we argue against estimating lexical probabilities to disambiguate parts of speech in unrestricted texts. Instead, we rely on the strength of contextual probabilities along with a feature we call "genotype", the set of tags associated with a word. Using this knowledge, we have built a part-of-speech tagger that combines linguistic and statistical approaches: contextual information is disambiguated by linguistic rules, and n-gram probabilities over parts of speech only are then estimated to disambiguate the remaining ambiguous tags.
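
As a rough illustration of the idea in the abstract, the sketch below tags a sentence using only tag-to-tag (contextual) probabilities, with each word's candidates restricted to its genotype. It is not the authors' system: the lexicon, the tag names, the training sequences, the choice of bigrams, and the add-one smoothing are all assumptions made for the example, and the rule-based disambiguation step mentioned in the abstract is omitted entirely.

```python
from collections import Counter

# Hypothetical toy lexicon: word -> genotype (the set of tags the word can carry).
LEXICON = {
    "le": frozenset({"DET", "PRON"}),
    "la": frozenset({"DET", "PRON", "NOUN"}),
    "porte": frozenset({"NOUN", "VERB"}),
    "ferme": frozenset({"ADJ", "NOUN", "VERB"}),
    "chat": frozenset({"NOUN"}),
    "dort": frozenset({"VERB"}),
}

# Hypothetical hand-tagged training data. Only the tag sequences are used,
# because no lexical probability P(word | tag) is ever estimated.
TAGGED = [
    ["DET", "NOUN", "VERB"],
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["PRON", "VERB"],
    ["DET", "ADJ", "NOUN", "VERB"],
    ["DET", "NOUN", "ADJ"],
]

START = "<s>"  # sentence-start pseudo-tag


def train_bigrams(sequences):
    """Count tag bigrams and how often each tag occurs as a left context."""
    bigrams, contexts, tags = Counter(), Counter(), set()
    for seq in sequences:
        prev = START
        for tag in seq:
            bigrams[(prev, tag)] += 1
            contexts[prev] += 1
            tags.add(tag)
            prev = tag
    return bigrams, contexts, tags


def make_prob(bigrams, contexts, tags):
    """Return P(tag | prev) with add-one smoothing (an assumed choice)."""
    size = len(tags)

    def prob(prev, tag):
        return (bigrams[(prev, tag)] + 1) / (contexts[prev] + size)

    return prob


def tag_sentence(words, lexicon, prob):
    """Viterbi search over tags, limited to each word's genotype."""
    best = {START: (1.0, [])}  # last tag -> (score, best path ending in it)
    for word in words:
        step = {}
        for tag in sorted(lexicon[word]):
            score, path = max(
                ((s * prob(prev, tag), p) for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            step[tag] = (score, path + [tag])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


bigrams, contexts, tags = train_bigrams(TAGGED)
prob = make_prob(bigrams, contexts, tags)
print(tag_sentence(["la", "porte", "ferme"], LEXICON, prob))
```

With this toy data, "la porte ferme" comes out as DET NOUN VERB. The point of the design is that a word contributes only its genotype, so ambiguous words are resolved by the surrounding tag context rather than by word-specific frequencies that a small corpus cannot estimate reliably.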

Editor information

Editors and Affiliations

ISSCO, University of Geneva, Switzerland
Susan Armstrong & Sandra Manzi
AT&T Labs-Research, USA
Kenneth Church
Xerox Research Centre Europe, France
Pierre Isabelle
Bell Laboratories, Lucent, USA
Evelyne Tzoukermann
Johns Hopkins University, Baltimore, Maryland, USA
David Yarowsky

About this chapter

Cite this chapter

Tzoukermann, E., Radev, D., Gale, W. (1999). Tagging French without Lexical Probabilities — Combining Linguistic Knowledge and Statistical Learning. In: Armstrong, S., Church, K., Isabelle, P., Manzi, S., Tzoukermann, E., Yarowsky, D. (eds) Natural Language Processing Using Very Large Corpora. Text, Speech and Language Technology, vol 11. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2390-9_4

DOI: https://doi.org/10.1007/978-94-017-2390-9_4
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-5349-7
Online ISBN: 978-94-017-2390-9
eBook Packages: Springer Book Archive
