SIMPLEX-PB: A Lexical Simplification Database and Benchmark for Portuguese

Hartmann, Nathan S.; Paetzold, Gustavo H.; Aluísio, Sandra M.

doi:10.1007/978-3-319-99722-3_28


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11122)

Included in the following conference series:

International Conference on Computational Processing of the Portuguese Language


Abstract

Lexical Simplification (LS) replaces words or expressions with synonyms that a larger number of people can understand. The task usually has a target audience in mind, such as children or low-literacy readers. In recent years the field has seen considerable activity, especially for English, but also for other languages such as Japanese and for multilingual and cross-lingual scenarios. Few works, however, have children as their target audience. In Brazil, the Programa Nacional do Livro Didático (PNLD) is an initiative with a broad impact on education: it selects, acquires, and distributes free textbooks to students in public elementary schools. In this scenario, adapting the complexity of a text to a student's reading ability is a determining factor in the student's progress and in whether he or she reaches the level of reading comprehension expected for that school year. Yet no lexical simplification resources for Portuguese have been publicly available so far, so developing such material is both urgent and welcome. This work compiles SIMPLEX-PB, the first available corpus of lexical simplification for Brazilian Portuguese. We also make available a benchmark that evaluates the best-known LS methods on our dataset.

Notes

1. Our corpus and benchmark are available at github.com/nathanshartmann/simplex.

Acknowledgments

We thank Larissa Pícoli, Livia Cucatto and Magali Duran for annotating our corpus. This work was supported by FAPESP proc. 2016/00500-1.

Author information

Authors and Affiliations

Institute of Mathematics and Computer Science, University of São Paulo, São Paulo, Brazil
Nathan S. Hartmann & Sandra M. Aluísio
Department of Computer Science, University of Sheffield, Sheffield, UK
Gustavo H. Paetzold

Corresponding author

Correspondence to Nathan S. Hartmann.

Editor information

Editors and Affiliations

Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, Brazil
Aline Villavicencio
Instituto de Informática - UFRGS, Porto Alegre, Brazil
Viviane Moreira
INESC-ID, Lisbon, Portugal
Alberto Abad
UFSCAR, Sao Carlos, Brazil
Helena Caseli
Centro Singular de Investigación en Tecnoloxías, Universidade de Santiago de Compostela, Santiago de Compostela, La Coruña, Spain
Pablo Gamallo
Université de Toulon, Parc Scientifique Technologique Luminy, Marseille, France
Carlos Ramisch
Centro de Informática e Sistemas, Universidade de Coimbra, Coimbra, Portugal
Hugo Gonçalo Oliveira
Federal University of Technology, Dois Vizinhos, Paraná, Brazil
Gustavo Henrique Paetzold

About this paper

Cite this paper

Hartmann, N.S., Paetzold, G.H., Aluísio, S.M. (2018). SIMPLEX-PB: A Lexical Simplification Database and Benchmark for Portuguese. In: Villavicencio, A., et al. Computational Processing of the Portuguese Language. PROPOR 2018. Lecture Notes in Computer Science (LNAI), vol 11122. Springer, Cham. https://doi.org/10.1007/978-3-319-99722-3_28

DOI: https://doi.org/10.1007/978-3-319-99722-3_28
Published: 26 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99721-6
Online ISBN: 978-3-319-99722-3
eBook Packages: Computer Science, Computer Science (R0)
