
OctaNLP: A Benchmark for Evaluating Multitask Generalization of Transformer-Based Pre-trained Language Models

  • Conference paper
  • WITS 2020

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 745)


Abstract

Over the last decade, deep learning based Natural Language Processing (NLP) models have achieved remarkable performance on the majority of NLP tasks, especially in machine translation, question answering, and dialogue. NLP language models shifted from uncontextualized vector space models such as word2vec and GloVe in 2013 and 2014, to contextualized LSTM-based models such as ELMo and ULMFiT in 2018, to contextualized transformer-based models such as BERT. Transformer-based language models are trained to perform very well on individual NLP tasks. However, when applied to many tasks simultaneously, their performance drops considerably. In this paper, we overview NLP evaluation metrics, multitask benchmarks, and recent transformer-based language models. We discuss the limitations of current multitask benchmarks, and we propose our octaNLP benchmark for comparing the generalization capabilities of transformer-based pre-trained language models across multiple downstream NLP tasks simultaneously.
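Multitask benchmarks of the kind the abstract discusses typically reduce per-task metrics to a single leaderboard number. As a minimal sketch of that idea, assuming a GLUE-style macro-average (the task names and scores below are hypothetical, and octaNLP's own scoring scheme is defined in the paper itself):

```python
# Sketch: macro-averaging per-task scores into one multitask benchmark score,
# as GLUE-style leaderboards do. Task names and values are illustrative only.

def multitask_score(task_scores):
    """Average each task's primary metric into a single benchmark score."""
    if not task_scores:
        raise ValueError("no task scores provided")
    return sum(task_scores.values()) / len(task_scores)

scores = {
    "sentiment": 0.92,  # e.g. accuracy on an SST-2-style task
    "nli": 0.85,        # natural language inference accuracy
    "qa": 0.78,         # question-answering F1
}
print(round(multitask_score(scores), 4))  # 0.85
```

A plain macro-average weights every task equally, which is exactly why a model that excels on one task but degrades on others scores poorly overall, the multitask generalization gap the paper targets.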


Notes

  1. https://gluebenchmark.com
  2. https://super.gluebenchmark.com



Author information

Correspondence to Zakaria Kaddari.


Copyright information

© 2022 Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Kaddari, Z., Mellah, Y., Berrich, J., Belkasmi, M.G., Bouchentouf, T. (2022). OctaNLP: A Benchmark for Evaluating Multitask Generalization of Transformer-Based Pre-trained Language Models. In: Bennani, S., Lakhrissi, Y., Khaissidi, G., Mansouri, A., Khamlichi, Y. (eds) WITS 2020. Lecture Notes in Electrical Engineering, vol 745. Springer, Singapore. https://doi.org/10.1007/978-981-33-6893-4_21

Download citation

  • DOI: https://doi.org/10.1007/978-981-33-6893-4_21

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-33-6892-7

  • Online ISBN: 978-981-33-6893-4

  • eBook Packages: Engineering, Engineering (R0)
