
Introducing Shared Tasks to NLG: The TUNA Shared Task Evaluation Challenges

Chapter in: Empirical Methods in Natural Language Generation (EACL 2009, ENLG 2009)

Abstract

Shared Task Evaluation Challenges (STECs) have only recently begun in the field of NLG. The TUNA STECs, which focused on Referring Expression Generation (REG), have been part of this development since its inception. This chapter looks back on the experience of organising the three TUNA Challenges, which came to an end in 2009. We discuss how the STECs yielded a substantial body of research on the REG problem and opened new avenues for future research. Our main focus, however, is on the role of different evaluation methods in assessing the output quality of REG algorithms, and on the relationships among these methods.




Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Gatt, A., Belz, A. (2010). Introducing Shared Tasks to NLG: The TUNA Shared Task Evaluation Challenges. In: Krahmer, E., Theune, M. (eds) Empirical Methods in Natural Language Generation. EACL ENLG 2009. Lecture Notes in Computer Science, vol 5790. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15573-4_14


  • DOI: https://doi.org/10.1007/978-3-642-15573-4_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15572-7

  • Online ISBN: 978-3-642-15573-4

  • eBook Packages: Computer Science, Computer Science (R0)
