Improving POS Tagging Across Portuguese Variants with Word Embeddings

Fonseca, Erick Rocha; Aluísio, Sandra Maria

doi:10.1007/978-3-319-41552-9_22

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9727)

Included in the following conference series:

International Conference on Computational Processing of the Portuguese Language

Abstract

Brazilian Portuguese (BP) and European Portuguese (EP) each have their own NLP resources and tools for many tasks. It is generally agreed that applying them to the variant they were not built for results in a performance drop; however, very little research has measured it. We evaluated a POS tagger in a cross-variant setting under multiple combinations of word embeddings, training corpora and test corpora, and found that (i) BP is easier to tag than EP, (ii) word embeddings increase tagger performance significantly, but not enough to close the accuracy gap in a cross-variant setting, and (iii) embeddings generated from a corpus containing both variants are useful in cross-variant scenarios. While we cannot generalize observations from POS tagging to every NLP task, this is an important first step toward such evaluations.
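The evaluation design summarized in the abstract can be pictured as a grid over embeddings, training variant and test variant. The following is only a minimal sketch under stated assumptions: the configuration labels (`"none"`, `"BP"`, `"EP"`, `"BP+EP"`) and the helper names `accuracy` and `experiment_grid` are hypothetical and are not taken from the paper, which may have used a different set of combinations. It does not train a tagger; it only enumerates the runs and shows how token-level accuracy would be computed.

```python
from itertools import product

# Hypothetical labels: the abstract does not list the exact configurations.
EMBEDDINGS = ["none", "BP", "EP", "BP+EP"]
VARIANTS = ["BP", "EP"]


def accuracy(gold_tags, predicted_tags):
    """Token-level accuracy: fraction of positions where the two tag lists agree."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences must have the same length")
    if not gold_tags:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


def experiment_grid():
    """Yield every (embeddings, train, test) combination.

    A run is flagged as cross-variant when the training and test
    variants differ (for example, train on BP and test on EP).
    """
    for emb, train, test in product(EMBEDDINGS, VARIANTS, VARIANTS):
        yield {
            "embeddings": emb,
            "train": train,
            "test": test,
            "cross_variant": train != test,
        }


if __name__ == "__main__":
    for config in experiment_grid():
        print(config)
```

Comparing the accuracy of a cross-variant run with the matching in-variant run (same embeddings, same test variant) would give the kind of accuracy gap the abstract refers to.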

Notes

1. The texts explored here predate the Spelling Agreement of the Portuguese language coming into effect.
2. More information about the workshops is available at http://ttg.uni-saarland.de/lt4vardial2015/.
3. Many of these OOV words are common in both variants, but since the Bosque corpus is very small, they only appear in one of the halves.
4. Available at http://www.linguateca.pt/cetempublico/.
5. The Bosque corpus is composed of sentences from CETENFolha and CETEMPúblico. We removed all those sentences to avoid any overlap with the labeled corpus.
6. Recall that their BP and EP corpora are not the same as ours.

Acknowledgments

This work was funded by grant #2013/22973-0 of the São Paulo Research Funding Agency (FAPESP).

Author information

Authors and Affiliations

ICMC – University of São Paulo, São Carlos, Brazil
Erick Rocha Fonseca & Sandra Maria Aluísio

Corresponding author

Correspondence to Erick Rocha Fonseca.

Editor information

Editors and Affiliations

Universidade de Lisboa, Portugal
João Silva
ISCTE-IUL, Lisbon, Portugal
Ricardo Ribeiro
Universidade de Évora, Évora, Portugal
Paulo Quaresma
Universidade de Caxias do Sul, Caxias do Sul, Brazil
André Adami
Universidade de Lisboa, Lisboa, Portugal
António Branco

About this paper

Cite this paper

Fonseca, E.R., Aluísio, S.M. (2016). Improving POS Tagging Across Portuguese Variants with Word Embeddings. In: Silva, J., Ribeiro, R., Quaresma, P., Adami, A., Branco, A. (eds) Computational Processing of the Portuguese Language. PROPOR 2016. Lecture Notes in Computer Science, vol 9727. Springer, Cham. https://doi.org/10.1007/978-3-319-41552-9_22

DOI: https://doi.org/10.1007/978-3-319-41552-9_22
Published: 21 June 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41551-2
Online ISBN: 978-3-319-41552-9
eBook Packages: Computer Science, Computer Science (R0)
