Abstract
In this paper we address the problem of diacritic error detection and restoration, the task of identifying and correcting missing accents in text. In particular, we evaluate a simple technique based on a part-of-speech tagger, comparing it to other established methods for error detection and restoration: unigram frequency, decision lists, discriminative classifiers, a machine-translation-based method, and grapheme-based approaches. In languages such as Spanish (the focus here), diacritics play a key role in disambiguation, and our results show that a straightforward modification to an n-gram tagger achieves good performance in diacritic error identification without resorting to any specialized machinery. Our method should be applicable to any language in which diacritics are distributed comparably and play a similar disambiguating role.
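As a point of reference for the unigram frequency baseline named above, the following is a minimal sketch, not the implementation evaluated in the paper. It assumes whitespace-tokenized training text with correct diacritics, and it strips all combining marks, including the tilde of ñ, which is a simplification for Spanish. The function names `strip_diacritics`, `train_unigram`, and `restore` are hypothetical.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(word):
    """Remove combining marks, e.g. 'más' -> 'mas' (assumption: ñ is stripped too)."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def train_unigram(tokens):
    """Map each stripped form to its most frequent diacritized form in the corpus."""
    counts = defaultdict(Counter)
    for tok in tokens:
        counts[strip_diacritics(tok)][tok] += 1
    return {key: forms.most_common(1)[0][0] for key, forms in counts.items()}


def restore(tokens, model):
    """Replace each token with the most frequent form; unseen tokens pass through."""
    return [model.get(strip_diacritics(tok), tok) for tok in tokens]


# Toy corpus, invented for illustration only.
corpus = ["él", "canta", "el", "el", "más", "mas", "más"]
model = train_unigram(corpus)
restored = restore(["mas", "el", "cantara"], model)
```

Because the choice ignores context, a baseline of this kind always picks the same form for an ambiguous pair such as el/él, which is the limitation that context-sensitive methods aim to overcome.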
Notes
- 1.
- 2. The widely adopted EAGLES tagset for Spanish is used for part-of-speech annotation (see Freeling: http://nlp.lsi.upc.edu/freeling/doc/tagsets/tagset-es.html [last accessed September 20, 2015]).
- 3.
- 4. The default options train a trigram tagger together with a separate suffix tree that helps tag previously unseen words. This has a direct bearing on the ability to generalize to errors in unseen words, since many competing diacritizations occur in word suffixes, as in the -ara/-ará subjunctive vs. future example given above. Unlike standard trigram taggers, which condition POS-word pairs only on the previous two tags, the HunPos tagger also conditions its tagging choice on the previous word.
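The role of suffixes in tagging unseen words, as described in the note above, can be illustrated with a minimal sketch. This is not HunPos's actual suffix guesser, which combines smoothed probabilities across suffix lengths; it simply backs off from the longest known suffix to shorter ones. The tag labels `FUT`, `SUBJ`, and `NOUN` are placeholders rather than EAGLES tags, and the function names and training pairs are invented for illustration.

```python
from collections import Counter, defaultdict


def train_suffix_guesser(tagged_words, max_len=3):
    """Count tags for every word-final substring of up to max_len characters."""
    table = defaultdict(Counter)
    for word, tag in tagged_words:
        for n in range(1, min(max_len, len(word)) + 1):
            table[word[-n:]][tag] += 1
    return table


def guess_tag(word, table, max_len=3):
    """Return the most frequent tag of the longest suffix seen in training."""
    for n in range(min(max_len, len(word)), 0, -1):
        suffix = word[-n:]
        if suffix in table:
            return table[suffix].most_common(1)[0][0]
    return None  # no suffix of this word was seen in training


# Toy training data with placeholder tags (hypothetical, not EAGLES).
training = [
    ("cantará", "FUT"),
    ("hablará", "FUT"),
    ("cantara", "SUBJ"),
    ("hablara", "SUBJ"),
    ("casa", "NOUN"),
]
table = train_suffix_guesser(training)
future_guess = guess_tag("llegará", table)
subjunctive_guess = guess_tag("llegara", table)
```

Neither "llegará" nor "llegara" appears in the training data, yet the suffixes -ará and -ara are enough to separate the two readings in this toy setting.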
References
Tufiş, D., Ceauşu, A.: DIAC+: a professional diacritics recovering system. In: Proceedings of the Sixth International Language Resources and Evaluation (LREC) (2008)
Ungurean, C., Burileanu, D., Popescu, V., Negrescu, C., Dervis, A.: Automatic diacritic restoration for a TTS-based e-mail reader application. Bull. Ser. C 70, 3–12 (2008)
Paredes, F.: La ortografía en las encuestas de disponibilidad léxica. Reale 11, 75–97 (1999)
Yarowsky, D.: Decision lists for lexical ambiguity resolution: application to accent restoration in Spanish and French. In: Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics, pp. 88–95 (1994)
Yarowsky, D.: A comparison of corpus-based techniques for restoring accents in Spanish and French text. In: Armstrong, S., Church, K., Isabelle, P., Manzi, S., Tzoukermann, E., Yarowsky, D. (eds.) Natural Language Processing Using Very Large Corpora. Text, Speech and Language Technology, vol. 11, pp. 99–120. Springer, Netherlands (1999)
Scannell, K.P.: The Crúbadán project: corpus building for under-resourced languages. In: Building and Exploring Web Corpora: Proceedings of the 3rd Web as Corpus Workshop, Incorporating Cleaneval, p. 5 (2007)
Mihalcea, R.F.: Diacritics restoration: learning from letters versus learning from words. In: Gelbukh, A. (ed.) CICLing 2002. LNCS, vol. 2276, pp. 339–348. Springer, Heidelberg (2002)
De Pauw, G., Wagacha, P.W., de Schryver, G.-M.: Automatic diacritic restoration for resource-scarce languages. In: Matoušek, V., Mautner, P. (eds.) TSD 2007. LNCS (LNAI), vol. 4629, pp. 170–179. Springer, Heidelberg (2007)
Novák, A., Siklósi, B.: Automatic diacritics restoration for Hungarian. In: EMNLP 2015, pp. 2286–2291 (2015)
Hulden, M., Silfverberg, M., Francom, J.: Finite state applications with Javascript. In: Proceedings of the 19th Nordic Conference of Computational Linguistics, pp. 441–446 (2013)
Roth, R., Rambow, O., Habash, N., Diab, M., Rudin, C.: Arabic morphological tagging, diacritization, and lemmatization using lexeme models and feature ranking. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pp. 117–120. Association for Computational Linguistics (2008)
Trung, N.M., Nhan, N.Q., Phuong, N.H.: Vietnamese diacritics restoration as sequential tagging. In: 2012 IEEE RIVF International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), pp. 1–6. IEEE (2012)
Simard, M., Deslauriers, A.: Real-time automatic insertion of accents in French text. Nat. Lang. Eng. 7(02), 143–165 (2001)
Brants, T.: TnT: a statistical part-of-speech tagger. In: Proceedings of the Sixth Conference on Applied Natural Language Processing, pp. 224–231. Association for Computational Linguistics (2000)
Halácsy, P., Kornai, A., Oravecz, C.: HunPos: an open source trigram tagger. In: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 209–212. Association for Computational Linguistics (2007)
Scannell, K.P.: Statistical unicodification of African languages. Lang. Resour. Eval. 45(3), 375–386 (2011)
Wagacha, P., De Pauw, G., Githinji, P.: A grapheme-based approach for accent restoration in Gikuyu. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation, pp. 1937–1940 (2006)
Freund, Y., Schapire, R.E.: Large margin classification using the Perceptron algorithm. Mach. Learn. 37(3), 277–296 (1999)
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E.: Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 177–180. Association for Computational Linguistics, Prague (2007)
Stolcke, A., Zheng, J., Wang, W., Abrash, V.: SRILM at sixteen: update and outlook. In: Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop, p. 5 (2011)
Mendonça, Â., Jaquette, D., Graff, D., DiPersio, D.: Spanish Gigaword Third Edition LDC2011T12 (2011). https://catalog.ldc.upenn.edu/LDC2011T12
Francom, J., Hulden, M., Ussishkin, A.: ACTIV-ES: a comparable, cross-dialect corpus of “everyday” Spanish from Argentina, Mexico and Spain. In: The Ninth International Conference on Language Resources and Evaluation, pp. 1733–1737 (2014)
Taulé, M., Martí, M.A., Recasens, M.: AnCora: multilevel annotated corpora for Catalan and Spanish. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC-2008) (2008)
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Hulden, M., Francom, J. (2016). Spanish Diacritic Error Detection and Restoration—A Survey. In: Vetulani, Z., Uszkoreit, H., Kubis, M. (eds) Human Language Technology. Challenges for Computer Science and Linguistics. LTC 2013. Lecture Notes in Computer Science, vol 9561. Springer, Cham. https://doi.org/10.1007/978-3-319-43808-5_22
DOI: https://doi.org/10.1007/978-3-319-43808-5_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43807-8
Online ISBN: 978-3-319-43808-5
eBook Packages: Computer Science, Computer Science (R0)