Tagging French without Lexical Probabilities — Combining Linguistic Knowledge and Statistical Learning

  • Chapter
Natural Language Processing Using Very Large Corpora

Part of the book series: Text, Speech and Language Technology (TLTB, volume 11)

Abstract

This paper explores morpho-syntactic ambiguities in French to develop a strategy for part-of-speech disambiguation that a) reflects the complexity of French as an inflected language, b) optimizes the estimation of probabilities, and c) allows the user flexibility in choosing a tagset. The problem in extracting lexical probabilities from a limited training corpus is that the statistical model may not represent the use of a particular word in a particular context. In a highly inflected language, this problem is particularly serious, since a word can be tagged with a large number of parts of speech. Because sufficient training data is lacking, we argue against estimating lexical probabilities to disambiguate parts of speech in unrestricted texts. Instead, we use the strength of contextual probabilities along with a feature we call “genotype”, the set of tags associated with a word. Using this knowledge, we have built a part-of-speech tagger that combines linguistic and statistical approaches: contextual information is first disambiguated by linguistic rules, and n-gram probabilities over parts of speech only are then estimated to disambiguate the remaining ambiguous tags.
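The statistical step described in the abstract can be illustrated with a minimal sketch. It assumes a bigram model over tags and a Viterbi search restricted to each word's genotype; no lexical probability P(word | tag) is used. The lexicon, tag names, probability values, and the function name `tag` are all hypothetical and chosen for illustration only. The linguistic-rule step is omitted, and the chapter's actual model may differ.

```python
import math

# Hypothetical toy lexicon: each word maps to its genotype,
# i.e. the set of tags the word can take.
GENOTYPES = {
    "le": frozenset({"DET", "PRON"}),
    "la": frozenset({"DET", "PRON", "NOUN"}),
    "porte": frozenset({"NOUN", "VERB"}),
    "ferme": frozenset({"ADJ", "NOUN", "VERB"}),
    "chat": frozenset({"NOUN"}),
}

# Hypothetical contextual probabilities P(tag | previous tag).
# The values are invented, not estimated from any corpus.
BIGRAM = {
    ("<s>", "DET"): 0.6, ("<s>", "PRON"): 0.3, ("<s>", "NOUN"): 0.1,
    ("DET", "NOUN"): 0.8, ("DET", "ADJ"): 0.2,
    ("PRON", "VERB"): 0.9, ("PRON", "NOUN"): 0.1,
    ("NOUN", "VERB"): 0.6, ("NOUN", "ADJ"): 0.3, ("NOUN", "NOUN"): 0.1,
    ("VERB", "DET"): 0.5, ("VERB", "ADJ"): 0.2, ("VERB", "NOUN"): 0.1,
}
FLOOR = 1e-6  # small probability for tag bigrams missing from the table


def tag(words):
    """Return the most probable tag sequence under the tag bigram model.

    Candidate tags for each word come only from its genotype; the score
    of a sequence is the sum of log P(tag | previous tag).
    """
    # best[t] = (log-probability, tag path) of the best sequence ending in t
    best = {"<s>": (0.0, [])}
    for word in words:
        new_best = {}
        for t in sorted(GENOTYPES[word]):
            candidates = [
                (score + math.log(BIGRAM.get((prev, t), FLOOR)), path + [t])
                for prev, (score, path) in best.items()
            ]
            new_best[t] = max(candidates, key=lambda c: c[0])
        best = new_best
    return max(best.values(), key=lambda c: c[0])[1]


if __name__ == "__main__":
    print(tag(["le", "chat"]))
    print(tag(["la", "porte", "ferme"]))
```

Because the genotype limits the candidate tags at each position, the search never considers a tag the lexicon rules out, and the choice among the remaining tags depends only on the tag context.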

Copyright information

© 1999 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Tzoukermann, E., Radev, D., Gale, W. (1999). Tagging French without Lexical Probabilities — Combining Linguistic Knowledge and Statistical Learning. In: Armstrong, S., Church, K., Isabelle, P., Manzi, S., Tzoukermann, E., Yarowsky, D. (eds) Natural Language Processing Using Very Large Corpora. Text, Speech and Language Technology, vol 11. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2390-9_4

  • DOI: https://doi.org/10.1007/978-94-017-2390-9_4

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-90-481-5349-7

  • Online ISBN: 978-94-017-2390-9

  • eBook Packages: Springer Book Archive
