Spanish Diacritic Error Detection and Restoration—A Survey

Hulden, Mans; Francom, Jerid

doi:10.1007/978-3-319-43808-5_22

Conference paper
First Online: 30 July 2016

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9561)

Abstract

In this paper we address the problem of diacritic error detection and restoration, that is, the task of identifying and correcting missing accents in text. In particular, we evaluate the performance of a simple technique based on a part-of-speech tagger, comparing it to other established methods for error detection/restoration: unigram frequency, decision lists, discriminative classifiers, a machine-translation-based method, and grapheme-based approaches. In languages such as Spanish (the focus here), diacritics play a key role in disambiguation, and results show that a straightforward modification to an n-gram tagger can be used to achieve good performance in diacritic error identification without resorting to any specialized machinery. Our method should be applicable to any language where diacritics distribute comparably and perform similar roles of disambiguation.
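To make the simplest of the comparison methods concrete, the following is a minimal sketch of a unigram-frequency restorer. It is not the authors' implementation: the function names, the toy training sentence, and the choice to strip only the acute accent and the diaeresis (keeping ñ as a separate letter) are assumptions made for illustration.

```python
import unicodedata
from collections import Counter, defaultdict

# Assumption: only the acute accent and the diaeresis count as restorable
# diacritics; the tilde of "ñ" is kept because it marks a distinct letter.
STRIPPED_MARKS = {"\u0301", "\u0308"}


def strip_diacritics(word):
    """Return the word without acute accents or diaereses."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = "".join(ch for ch in decomposed if ch not in STRIPPED_MARKS)
    return unicodedata.normalize("NFC", kept)


def train_unigram_restorer(tokens):
    """Map each stripped form to its most frequent diacritized form."""
    counts = defaultdict(Counter)
    for tok in tokens:
        counts[strip_diacritics(tok)][tok] += 1
    return {base: forms.most_common(1)[0][0] for base, forms in counts.items()}


def restore(tokens, table):
    """Replace each token by the most frequent form seen in training."""
    return [table.get(strip_diacritics(tok), tok) for tok in tokens]


# Hypothetical toy corpus; a real system would train on a large corpus.
training = "él dijo que el libro está aquí y él está bien".split()
table = train_unigram_restorer(training)
print(restore(["el", "libro", "esta", "aqui"], table))
```

In this toy corpus the pronoun "él" is more frequent than the article "el", so the article in "el libro" is wrongly restored as "él". This illustrates why a context-free frequency baseline cannot resolve ambiguous pairs and why context-sensitive methods such as tagging are evaluated against it.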

Notes

1. The suffix-based approach described by Yarowsky [4, 5] is not considered here, as morphological form merely serves as a proxy for part-of-speech category, the primary variable of the current POS-tag approach.
2. The widely adopted EAGLES tagset for Spanish is used for part-of-speech annotation (see Freeling: http://nlp.lsi.upc.edu/freeling/doc/tagsets/tagset-es.html [last accessed September 20, 2015]).
3. We assume here that the n-gram tagger being trained has some suffix-based mechanism for guessing parts of speech for unknown words, as is generally the case with better-performing taggers such as TnT [14] or HunPos [15].
4. The default options train a trigram tagger that also trains a separate suffix tree to help tag previously unseen words. This has great bearing on the ability to generalize to errors in unseen words, as many competing diacritizations are found on the suffixes of words, as in the -ara/-ará subjunctive vs. future example given above. The HunPos tagger, unlike standard trigram taggers, which condition POS-word pairs only on the previous two tags, also conditions its tagging choice on the previous word.
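The role of suffix-based guessing in notes 3 and 4 can be illustrated with a rough sketch. This is a simplified longest-matching-suffix guesser, not the suffix tree used by TnT or HunPos, which to our understanding smooth over several suffix lengths. The tag labels "FUT", "SUBJ" and "NOUN", the training pairs, and the function names are hypothetical and are not EAGLES tags.

```python
from collections import Counter, defaultdict


def train_suffix_guesser(tagged_words, max_suffix_len=4):
    """Count which tags occur with each word-final suffix up to a given length."""
    suffix_tags = defaultdict(Counter)
    for word, tag in tagged_words:
        for n in range(1, min(max_suffix_len, len(word)) + 1):
            suffix_tags[word[-n:]][tag] += 1
    return suffix_tags


def guess_tag(word, suffix_tags, max_suffix_len=4):
    """Guess a tag for an unseen word from its longest known suffix."""
    for n in range(min(max_suffix_len, len(word)), 0, -1):
        tags = suffix_tags.get(word[-n:])
        if tags:
            return tags.most_common(1)[0][0]
    return None  # no suffix of the word was seen in training


# Hypothetical training pairs with illustrative coarse tags.
training = [
    ("cantará", "FUT"), ("hablará", "FUT"),
    ("cantara", "SUBJ"), ("hablara", "SUBJ"),
    ("casa", "NOUN"),
]
guesser = train_suffix_guesser(training)
print(guess_tag("caminará", guesser))  # unseen word ending in -ará
print(guess_tag("caminara", guesser))  # unseen word ending in -ara
```

Neither "caminará" nor "caminara" occurs in the training pairs, yet the suffixes -ará and -ara are enough to separate the two readings. This is the kind of generalization to unseen words that note 4 attributes to the suffix mechanism.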

References

1. Tufiş, D., Ceauşu, A.: DIAC+: a professional diacritics recovering system. In: Proceedings of the Sixth International Language Resources and Evaluation (LREC) (2008)
2. Ungurean, C., Burileanu, D., Popescu, V., Negrescu, C., Dervis, A.: Automatic diacritic restoration for a TTS-based e-mail reader application. Bull. Ser. C 70, 3–12 (2008)
3. Paredes, F.: La ortografía en las encuestas de disponibilidad léxica. Reale 11, 75–97 (1999)
4. Yarowsky, D.: Decision lists for lexical ambiguity resolution: application to accent restoration in Spanish and French. In: Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics, pp. 88–95 (1994)
5. Yarowsky, D.: A comparison of corpus-based techniques for restoring accents in Spanish and French text. In: Armstrong, S., Church, K., Isabelle, P., Manzi, S., Tzoukermann, E., Yarowsky, D. (eds.) Natural Language Processing Using Very Large Corpora. Text, Speech and Language Technology, vol. 11, pp. 99–120. Springer, Netherlands (1999)
6. Scannell, K.P.: The Crúbadán project: corpus building for under-resourced languages. In: Building and Exploring Web Corpora: Proceedings of the 3rd Web as Corpus Workshop, Incorporating Cleaneval, p. 5 (2007)
7. Mihalcea, R.F.: Diacritics restoration: learning from letters versus learning from words. In: Gelbukh, A. (ed.) CICLing 2002. LNCS, vol. 2276, pp. 339–348. Springer, Heidelberg (2002)
8. De Pauw, G., Wagacha, P.W., de Schryver, G.-M.: Automatic Diacritic Restoration for Resource-Scarce Languages. In: Matoušek, V., Mautner, P. (eds.) TSD 2007. LNCS (LNAI), vol. 4629, pp. 170–179. Springer, Heidelberg (2007)
9. Novák, A., Siklósi, B.: Automatic Diacritics Restoration for Hungarian. In: EMNLP 2015, pp. 2286–2291 (2015)
10. Hulden, M., Silfverberg, M., Francom, J.: Finite state applications with Javascript. In: Proceedings of the 19th Nordic Conference of Computational Linguistics, pp. 441–446 (2013)
11. Roth, R., Rambow, O., Habash, N., Diab, M., Rudin, C.: Arabic morphological tagging, diacritization, and lemmatization using lexeme models and feature ranking. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pp. 117–120. Association for Computational Linguistics (2008)
12. Trung, N.M., Nhan, N.Q., Phuong, N.H.: Vietnamese diacritics restoration as sequential tagging. In: 2012 IEEE RIVF International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), pp. 1–6. IEEE (2012)
13. Simard, M., Deslauriers, A.: Real-time automatic insertion of accents in French text. Nat. Lang. Eng. 7(02), 143–165 (2001)
14. Brants, T.: TnT: a statistical part-of-speech tagger. In: Proceedings of the Sixth Conference on Applied Natural Language Processing, pp. 224–231. Association for Computational Linguistics (2000)
15. Halácsy, P., Kornai, A., Oravecz, C.: HunPos: an open source trigram tagger. In: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 209–212. Association for Computational Linguistics (2007)
16. Scannell, K.P.: Statistical unicodification of African languages. Lang. Resour. Eval. 45(3), 375–386 (2011)
17. Wagacha, P., De Pauw, G., Githinji, P.: A grapheme-based approach for accent restoration in Gikuyu. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation, pp. 1937–1940 (2006)
18. Freund, Y., Schapire, R.E.: Large margin classification using the Perceptron algorithm. Mach. Learn. 37(3), 277–296 (1999)
19. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E.: Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 177–180. Association for Computational Linguistics, Prague (2007)
20. Stolcke, A., Zheng, J., Wang, W., Abrash, V.: SRILM at sixteen: update and outlook. In: Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop, p. 5 (2011)
21. Mendonça, Â., Jaquette, D., Graff, D., DiPersio, D.: Spanish Gigaword Third Edition LDC2011T12 (2011). https://catalog.ldc.upenn.edu/LDC2011T12
22. Francom, J., Hulden, M., Ussishkin, A.: ACTIV-ES: a comparable, cross-dialect corpus of “everyday” Spanish from Argentina, Mexico and Spain. In: The Ninth International Conference on Language Resources and Evaluation, pp. 1733–1737 (2014)
23. Taulé, M., Martí, M.A., Recasens, M.: AnCora: multilevel annotated corpora for Catalan and Spanish. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC-2008) (2008)

Author information

Authors and Affiliations

University of Colorado, Boulder, CO, 80303, USA
Mans Hulden
Wake Forest University, Winston-Salem, NC, 27109, USA
Jerid Francom

Corresponding author

Correspondence to Jerid Francom.

Editor information

Editors and Affiliations

Adam Mickiewicz University, Poznań, Poland
Zygmunt Vetulani
Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI GmbH), Saarbrücken, Saarland, Germany
Hans Uszkoreit
Adam Mickiewicz University, Poznań, Poland
Marek Kubis

About this paper

Cite this paper

Hulden, M., Francom, J. (2016). Spanish Diacritic Error Detection and Restoration—A Survey. In: Vetulani, Z., Uszkoreit, H., Kubis, M. (eds) Human Language Technology. Challenges for Computer Science and Linguistics. LTC 2013. Lecture Notes in Computer Science, vol 9561. Springer, Cham. https://doi.org/10.1007/978-3-319-43808-5_22

Published: 30 July 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43807-8
Online ISBN: 978-3-319-43808-5
eBook Packages: Computer Science (R0)