Spanish Diacritic Error Detection and Restoration—A Survey

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9561)

Abstract

In this paper we address the problem of diacritic error detection and restoration, that is, the task of identifying and correcting missing accents in text. In particular, we evaluate the performance of a simple technique based on a part-of-speech tagger, comparing it with other established methods for error detection and restoration: unigram frequency, decision lists, discriminative classifiers, a machine-translation-based method, and grapheme-based approaches. In languages such as Spanish (the focus here), diacritics play a key role in disambiguation, and our results show that a straightforward modification to an n-gram tagger achieves good performance in diacritic error identification without resorting to any specialized machinery. Our method should be applicable to any language in which diacritics are distributed comparably and play a similar disambiguating role.
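To make the simplest of the comparison methods concrete, the following is a minimal sketch of a unigram-frequency baseline: each diacritic-stripped word form is mapped to its most frequent diacritized form in a training corpus. It is an illustration under stated assumptions, not the implementation evaluated in the paper. The function names, the toy corpus, and the decision to treat only acute accents and the dieresis as removable (leaving ñ untouched) are all assumptions made for this example.

```python
from collections import Counter, defaultdict

# Assumption: only acute accents and the dieresis count as removable marks;
# ñ is treated as a separate letter and left alone.
_STRIP = str.maketrans("áéíóúüÁÉÍÓÚÜ", "aeiouuAEIOUU")


def strip_diacritics(word):
    """Return the word with accented vowels replaced by their plain forms."""
    return word.translate(_STRIP)


def train_unigram(tokens):
    """Map each stripped form to its most frequent diacritized form."""
    counts = defaultdict(Counter)
    for tok in tokens:
        counts[strip_diacritics(tok)][tok] += 1
    return {key: forms.most_common(1)[0][0] for key, forms in counts.items()}


def restore(tokens, model):
    """Replace each token with the model's preferred form; unseen words pass through."""
    return [model.get(strip_diacritics(tok), tok) for tok in tokens]


def detect_errors(tokens, model):
    """Return the positions where the token differs from the restored form."""
    restored = restore(tokens, model)
    return [i for i, (tok, fixed) in enumerate(zip(tokens, restored)) if tok != fixed]


# Toy corpus (hypothetical): "el" occurs three times, "él" once.
corpus = "el libro y el perro y el gato él dijo que está aquí".split()
model = train_unigram(corpus)
restored = restore(["el", "dijo", "que", "esta", "aqui"], model)
flagged = detect_errors(["el", "libro", "esta", "aqui"], model)
```

Because the lookup ignores context, the pronoun in "el dijo" is left unaccented here: the determiner "el" is the more frequent form in the toy corpus. Ambiguities of this kind are where context-sensitive methods such as a tagger have room to improve on the baseline.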

Notes

  1. The suffix-based approach described by Yarowsky [4, 5] is not considered here because morphological form merely serves as a proxy for part-of-speech category, the primary variable of the current POS-tag approach.

  2. The widely adopted EAGLES tagset for Spanish is used for part-of-speech annotation (see Freeling: http://nlp.lsi.upc.edu/freeling/doc/tagsets/tagset-es.html [last accessed September 20, 2015]).

  3. We assume here that the n-gram tagger being trained has some suffix-based mechanism for guessing parts of speech for unknown words, as is generally the case with better-performing taggers such as TnT [14] or HunPos [15].

  4. The default options train a trigram tagger and, alongside it, a separate suffix tree that helps tag previously unseen words. This has great bearing on the ability to generalize to errors in unseen words, as many competing diacritizations are found on the suffixes of words, as in the -ara/-ará subjunctive vs. future example given above. Unlike standard trigram taggers, which condition POS-word pairs only on the previous two tags, the HunPos tagger also conditions its tagging choice on the previous word.
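The role of suffixes described in notes 3 and 4 can be illustrated with a deliberately simplified guesser that assigns an unseen word the most frequent tag of its longest known suffix. This is a sketch only: it is not the suffix model used by TnT or HunPos, the function names are hypothetical, and the tags `V-FUT` and `V-SUBJ` are made-up labels for this example rather than EAGLES tags.

```python
from collections import Counter, defaultdict


def train_suffix_guesser(tagged_words, max_len=3):
    """Count, for every word-final suffix up to max_len characters, the tags seen with it."""
    table = defaultdict(Counter)
    for word, tag in tagged_words:
        for n in range(1, min(max_len, len(word)) + 1):
            table[word[-n:]][tag] += 1
    return table


def guess_tag(word, table, max_len=3, default="UNK"):
    """Return the most frequent tag of the longest suffix of `word` found in the table."""
    for n in range(min(max_len, len(word)), 0, -1):
        suffix = word[-n:]
        if suffix in table:
            return table[suffix].most_common(1)[0][0]
    return default


# Toy training data with hypothetical tags (not EAGLES).
training = [
    ("cantará", "V-FUT"),
    ("hablará", "V-FUT"),
    ("cantara", "V-SUBJ"),
    ("hablara", "V-SUBJ"),
]
table = train_suffix_guesser(training)
future_guess = guess_tag("caminará", table)       # unseen word ending in -ará
subjunctive_guess = guess_tag("caminara", table)  # unseen word ending in -ara
```

Neither "caminará" nor "caminara" appears in the training data, yet the three-character suffix alone separates the two readings, which is why a suffix-based mechanism matters for words the tagger has never seen.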

References

  1. Tufiş, D., Ceauşu, A.: DIAC+: a professional diacritics recovering system. In: Proceedings of the Sixth International Language Resources and Evaluation (LREC) (2008)

  2. Ungurean, C., Burileanu, D., Popescu, V., Negrescu, C., Dervis, A.: Automatic diacritic restoration for a TTS-based e-mail reader application. Bull. Ser. C 70, 3–12 (2008)

  3. Paredes, F.: La ortografía en las encuestas de disponibilidad léxica. Reale 11, 75–97 (1999)

  4. Yarowsky, D.: Decision lists for lexical ambiguity resolution: application to accent restoration in Spanish and French. In: Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics, pp. 88–95 (1994)

  5. Yarowsky, D.: A comparison of corpus-based techniques for restoring accents in Spanish and French text. In: Armstrong, S., Church, K., Isabelle, P., Manzi, S., Tzoukermann, E., Yarowsky, D. (eds.) Natural Language Processing Using Very Large Corpora. Text, Speech and Language Technology, vol. 11, pp. 99–120. Springer, Netherlands (1999)

  6. Scannell, K.P.: The Crúbadán project: corpus building for under-resourced languages. In: Building and Exploring Web Corpora: Proceedings of the 3rd Web as Corpus Workshop, Incorporating Cleaneval, p. 5 (2007)

  7. Mihalcea, R.F.: Diacritics restoration: learning from letters versus learning from words. In: Gelbukh, A. (ed.) CICLing 2002. LNCS, vol. 2276, pp. 339–348. Springer, Heidelberg (2002)

  8. De Pauw, G., Wagacha, P.W., de Schryver, G.-M.: Automatic diacritic restoration for resource-scarce languages. In: Matoušek, V., Mautner, P. (eds.) TSD 2007. LNCS (LNAI), vol. 4629, pp. 170–179. Springer, Heidelberg (2007)

  9. Novák, A., Siklósi, B.: Automatic diacritics restoration for Hungarian. In: EMNLP 2015, pp. 2286–2291 (2015)

  10. Hulden, M., Silfverberg, M., Francom, J.: Finite state applications with Javascript. In: Proceedings of the 19th Nordic Conference of Computational Linguistics, pp. 441–446 (2013)

  11. Roth, R., Rambow, O., Habash, N., Diab, M., Rudin, C.: Arabic morphological tagging, diacritization, and lemmatization using lexeme models and feature ranking. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pp. 117–120. Association for Computational Linguistics (2008)

  12. Trung, N.M., Nhan, N.Q., Phuong, N.H.: Vietnamese diacritics restoration as sequential tagging. In: 2012 IEEE RIVF International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), pp. 1–6. IEEE (2012)

  13. Simard, M., Deslauriers, A.: Real-time automatic insertion of accents in French text. Nat. Lang. Eng. 7(02), 143–165 (2001)

  14. Brants, T.: TnT: a statistical part-of-speech tagger. In: Proceedings of the Sixth Conference on Applied Natural Language Processing, pp. 224–231. Association for Computational Linguistics (2000)

  15. Halácsy, P., Kornai, A., Oravecz, C.: HunPos: an open source trigram tagger. In: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 209–212. Association for Computational Linguistics (2007)

  16. Scannell, K.P.: Statistical unicodification of African languages. Lang. Resour. Eval. 45(3), 375–386 (2011)

  17. Wagacha, P., De Pauw, G., Githinji, P.: A grapheme-based approach for accent restoration in Gikuyu. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation, pp. 1937–1940 (2006)

  18. Freund, Y., Schapire, R.E.: Large margin classification using the Perceptron algorithm. Mach. Learn. 37(3), 277–296 (1999)

  19. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E.: Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 177–180. Association for Computational Linguistics, Prague (2007)

  20. Stolcke, A., Zheng, J., Wang, W., Abrash, V.: SRILM at sixteen: update and outlook. In: Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop, p. 5 (2011)

  21. Mendonça, Â., Jaquette, D., Graff, D., DiPersio, D.: Spanish Gigaword Third Edition LDC2011T12 (2011). https://catalog.ldc.upenn.edu/LDC2011T12

  22. Francom, J., Hulden, M., Ussishkin, A.: ACTIV-ES: a comparable, cross-dialect corpus of “everyday” Spanish from Argentina, Mexico and Spain. In: The Ninth International Conference on Language Resources and Evaluation, pp. 1733–1737 (2014)

  23. Taulé, M., Martí, M.A., Recasens, M.: AnCora: multilevel annotated corpora for Catalan and Spanish. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC-2008) (2008)

Author information

Corresponding author

Correspondence to Jerid Francom.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Hulden, M., Francom, J. (2016). Spanish Diacritic Error Detection and Restoration—A Survey. In: Vetulani, Z., Uszkoreit, H., Kubis, M. (eds) Human Language Technology. Challenges for Computer Science and Linguistics. LTC 2013. Lecture Notes in Computer Science, vol 9561. Springer, Cham. https://doi.org/10.1007/978-3-319-43808-5_22

  • DOI: https://doi.org/10.1007/978-3-319-43808-5_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-43807-8

  • Online ISBN: 978-3-319-43808-5

  • eBook Packages: Computer Science (R0)
