A Comparison of Corpus-Based Techniques for Restoring Accents in Spanish and French Text

  • D. Yarowsky
Part of the Text, Speech and Language Technology book series (TLTB, volume 11)


This chapter explores and compares three corpus-based techniques for lexical ambiguity resolution, focusing on the problem of restoring missing accents to Spanish and French text. Many of the ambiguities created by missing accents are differences in part of speech; hence one of the methods considered is an N-gram tagger using Viterbi decoding, of the kind found in stochastic part-of-speech taggers. A second technique, Bayesian classification, has been successfully applied to word-sense disambiguation and is well suited to some of the semantic ambiguities that arise from missing accents. The third approach, based on decision lists, combines the strengths of the other two methods, incorporating both local syntactic patterns and more distant collocational evidence, and outperforms both. Accent restoration is particularly well suited to demonstrating and testing these algorithms because it requires the resolution of both semantic and syntactic ambiguity and offers an objective ground truth for automatic evaluation. The problem is also a practical one with immediate application.
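To make the decision-list idea concrete, the following is a minimal sketch rather than the chapter's actual implementation. It assumes a binary choice between two accent patterns, three hypothetical feature types (`prev`, `next`, and `near` for words within a small window), additive smoothing with an arbitrary `alpha`, and rules ranked by the absolute log-ratio of smoothed counts. The function names and the tiny training set are invented purely for illustration.

```python
import math
from collections import Counter, defaultdict


def features(tokens, i, window=3):
    """Collect local (adjacent-word) and wider-window features for tokens[i]."""
    feats = set()
    if i > 0:
        feats.add(("prev", tokens[i - 1]))
    if i + 1 < len(tokens):
        feats.add(("next", tokens[i + 1]))
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            feats.add(("near", tokens[j]))
    return feats


def train(examples, labels, alpha=0.1):
    """Build a list of (strength, feature, label) rules, strongest first.

    examples: iterable of (tokens, index, label); labels: the two candidates.
    alpha is an assumed smoothing constant, not a value from the chapter.
    """
    a, b = labels
    counts = defaultdict(Counter)
    label_totals = Counter()
    for tokens, i, label in examples:
        label_totals[label] += 1
        for f in sorted(features(tokens, i)):  # sorted for a stable rule order
            counts[f][label] += 1
    rules = []
    for f, c in counts.items():
        score = math.log((c[a] + alpha) / (c[b] + alpha))
        if score != 0:  # evenly split evidence carries no information
            rules.append((abs(score), f, a if score > 0 else b))
    rules.sort(key=lambda r: r[0], reverse=True)
    default = label_totals.most_common(1)[0][0]
    return rules, default


def classify(rules, default, tokens, i):
    """Return the label of the single strongest rule that matches the context."""
    feats = features(tokens, i)
    for _, f, label in rules:
        if f in feats:
            return label
    return default


# Invented toy data: de-accented Spanish "esta" (demonstrative) vs "está" (verb).
TRAIN = [
    ("esta casa es grande".split(), 0, "esta"),
    ("esta semana no puedo".split(), 0, "esta"),
    ("compre esta mesa ayer".split(), 1, "esta"),
    ("la casa esta vacia".split(), 2, "está"),
    ("el libro esta en la mesa".split(), 2, "está"),
    ("maria esta cansada".split(), 1, "está"),
]
rules, default = train(TRAIN, ("esta", "está"))

print(classify(rules, default, "mi hermana esta en casa".split(), 2))
print(classify(rules, default, "esta semana hay clase".split(), 0))
```

Under these assumptions, only the strongest matching rule decides each case; weaker evidence is not summed. Keeping adjacent-word and window features in one ranked list is how this sketch mirrors the combination of local patterns and more distant collocational evidence described above.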


Keywords: Ambiguity Resolution; Function Word; Computational Linguistics; Word Class; Semantic Ambiguity





Copyright information

© Springer Science+Business Media Dordrecht 1999
