Abstract
Brazilian Portuguese (BP) and European Portuguese (EP) each have dedicated NLP resources and tools for many tasks. It is generally agreed that applying them to the variant they were not designed for causes a drop in performance; however, very little research has measured this drop. We evaluated a POS tagger in a cross-variant setting under multiple combinations of word embeddings, training corpora and test corpora, and found that (i) BP is easier to tag than EP, (ii) word embeddings increase tagger performance significantly, but not enough to close the accuracy gap in a cross-variant setting, and (iii) embeddings generated from a corpus containing both variants are useful in cross-variant scenarios. While we cannot generalize observations from POS tagging to every NLP task, this is an important first step toward such evaluations.
Notes
1. The texts explored here predate the Spelling Agreement of the Portuguese language.
2. More information about the workshops is available at http://ttg.uni-saarland.de/lt4vardial2015/.
3. Many of these OOV words are common in both variants, but since the Bosque corpus is very small, they only appear in one of the halves.
4. Available at http://www.linguateca.pt/cetempublico/.
5. The Bosque corpus is composed of sentences from CETENFolha and CETEMPúblico. We removed all those sentences to avoid any overlap with the labeled corpus.
6. Note that their BP and EP corpora are not the same as ours.
References
Afonso, S., Bick, E., Haber, R., Santos, D.: Floresta sintá(c)tica: a treebank for Portuguese. In: Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002). pp. 1698–1703 (2002)
Aluísio, S.M., Pelizzoni, J.M., Marchi, A.R., de Oliveira, L., Manenti, R., Marquiafável, V.: An account of the challenge of tagging a reference corpus for Brazilian Portuguese. In: Mamede, N.J., Baptista, J., Trancoso, I., Nunes, M.G.V. (eds.) PROPOR 2003. LNCS, vol. 2721, pp. 110–117. Springer, Heidelberg (2003)
Bick, E.: The Parsing System PALAVRAS: Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Ph.D. thesis, Aarhus University (2000)
Branco, A., Carvalheiro, C., Costa, F., Castro, S., Silva, J., Martins, C., Ramos, J.: DeepBankPT and companion Portuguese treebanks in a multilingual collection of treebanks aligned with the Penn Treebank. In: Baptista, J., Mamede, N., Candeias, S., Paraboni, I., Pardo, T.A.S., Volpe Nunes, M.G. (eds.) PROPOR 2014. LNCS, vol. 8775, pp. 207–213. Springer, Heidelberg (2014)
Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
Fonseca, E.R., Rosa, J.L.G., Aluísio, S.M.: Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese. J. Braz. Comput. Soc. 21(2), 1–14 (2015)
Garcia, M., Gamallo, P., Gayo, I., Cruz, M.A.P.: PoS-tagging the web in Portuguese. National varieties, text typologies and spelling systems. Procesamiento Lenguaje Nat. 53, 95–101 (2014)
Hamdi, A., Nasr, A., Habash, N., Gala, N.: POS-tagging of Tunisian dialect using standard Arabic resources and tools. In: Proceedings of the Second Workshop on Arabic Natural Language Processing. pp. 59–68 (2015)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: Proceedings of the ICLR Workshop (2013)
Rocha, P., Santos, D.: CETEMPúblico: Um corpus de grandes dimensões de linguagem jornalística portuguesa. In: Actas do V Encontro para o processamento computacional da língua portuguesa escrita e falada. pp. 131–140 (2000)
Scarton, C., Sanches Duran, M., Aluísio, S.M.: Using cross-linguistic knowledge to build VerbNet-style lexicons: results for a (Brazilian) Portuguese VerbNet. In: Baptista, J., Mamede, N., Candeias, S., Paraboni, I., Pardo, T.A.S., Volpe Nunes, M.G. (eds.) PROPOR 2014. LNCS, vol. 8775, pp. 149–160. Springer, Heidelberg (2014)
Tseng, H., Jurafsky, D., Manning, C.: Morphological features help POS tagging of unknown words across language varieties. In: Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing. pp. 32–39 (2005)
Vergez-Couret, M., Urieli, A.: Pos-tagging different varieties of Occitan with single-dialect resources. In: Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects. pp. 21–29 (2014)
Acknowledgments
This work was funded by grant #2013/22973-0 of the São Paulo Research Funding Agency (FAPESP).
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Fonseca, E.R., Aluísio, S.M. (2016). Improving POS Tagging Across Portuguese Variants with Word Embeddings. In: Silva, J., Ribeiro, R., Quaresma, P., Adami, A., Branco, A. (eds) Computational Processing of the Portuguese Language. PROPOR 2016. Lecture Notes in Computer Science, vol 9727. Springer, Cham. https://doi.org/10.1007/978-3-319-41552-9_22
DOI: https://doi.org/10.1007/978-3-319-41552-9_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41551-2
Online ISBN: 978-3-319-41552-9
eBook Packages: Computer Science (R0)