
OctaNLP: A Benchmark for Evaluating Multitask Generalization of Transformer-Based Pre-trained Language Models

  • Conference paper
  • WITS 2020

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 745)


Abstract

Over the last decade, deep learning based Natural Language Processing (NLP) models have achieved remarkable performance on the majority of NLP tasks, especially in machine translation, question answering, and dialogue. NLP language models shifted from uncontextualized vector space models such as word2vec and GloVe in 2013 and 2014, to contextualized LSTM-based models such as ELMo and ULMFiT in 2018, to contextualized transformer-based models such as BERT. Transformer-based language models are trained to perform very well on individual NLP tasks. However, when applied to many tasks simultaneously, their performance drops considerably. In this paper, we overview NLP evaluation metrics, multitask benchmarks, and recent transformer-based language models. We discuss the limitations of current multitask benchmarks, and we propose our octaNLP benchmark for comparing the generalization capabilities of transformer-based pre-trained language models across multiple downstream NLP tasks simultaneously.
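Multitask benchmarks of the kind the abstract discusses typically reduce per-task metrics to a single leaderboard number. As a minimal sketch of that idea, assuming a GLUE-style macro-average (the task names and scores below are hypothetical, and octaNLP's own scoring scheme is defined in the paper itself):

```python
# Sketch: macro-averaging per-task scores into one multitask benchmark score,
# as GLUE-style leaderboards do. Task names and values are illustrative only.

def multitask_score(task_scores):
    """Average each task's primary metric into a single benchmark score."""
    if not task_scores:
        raise ValueError("no task scores provided")
    return sum(task_scores.values()) / len(task_scores)

scores = {
    "sentiment": 0.92,  # e.g. accuracy on an SST-2-style task
    "nli": 0.85,        # natural language inference accuracy
    "qa": 0.78,         # question-answering F1
}
print(round(multitask_score(scores), 4))  # 0.85
```

A plain macro-average weights every task equally, which is exactly why a model that excels on one task but degrades on others scores poorly overall, the multitask generalization gap the paper targets.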


Notes

  1. https://gluebenchmark.com
  2. https://super.gluebenchmark.com



Author information

Correspondence to Zakaria Kaddari.


Copyright information

© 2022 Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Kaddari, Z., Mellah, Y., Berrich, J., Belkasmi, M.G., Bouchentouf, T. (2022). OctaNLP: A Benchmark for Evaluating Multitask Generalization of Transformer-Based Pre-trained Language Models. In: Bennani, S., Lakhrissi, Y., Khaissidi, G., Mansouri, A., Khamlichi, Y. (eds) WITS 2020. Lecture Notes in Electrical Engineering, vol 745. Springer, Singapore. https://doi.org/10.1007/978-981-33-6893-4_21

Download citation

  • DOI: https://doi.org/10.1007/978-981-33-6893-4_21

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-33-6892-7

  • Online ISBN: 978-981-33-6893-4

  • eBook Packages: Engineering, Engineering (R0)
