Language Resources and Evaluation, Volume 52, Issue 1, pp 101–148

The challenging task of summary evaluation: an overview

Original Paper

Abstract

Evaluation is crucial in the research and development of automatic summarization applications, in order to determine the appropriateness of a summary based on different criteria, such as the content it contains, and the way it is presented. To perform an adequate evaluation is of great relevance to ensure that automatic summaries can be useful for the context and/or application they are generated for. To this end, researchers must be aware of the evaluation metrics, approaches, and datasets that are available, in order to decide which of them would be the most suitable to use, or to be able to propose new ones, overcoming the possible limitations that existing methods may present. In this article, a critical and historical analysis of evaluation metrics, methods, and datasets for automatic summarization systems is presented, where the strengths and weaknesses of evaluation efforts are discussed and the major challenges to solve are identified. Therefore, a clear up-to-date overview of the evolution and progress of summarization evaluation is provided, giving the reader useful insights into the past, present and latest trends in the automatic evaluation of summaries.

Keywords

Text summarization; Evaluation; Content evaluation; Readability; Task-based evaluation

Notes

Acknowledgements

This research is partially funded by the European Commission under the Seventh (FP7-2007-2013) Framework Programme for Research and Technological Development through the SAM (FP7-611312) project; by the Spanish Government through the projects VoxPopuli (TIN2013-47090-C3-1-P), Vemodalen (TIN2015-71785-R), RESCATA (TIN2015-65100-R) and REDES (TIN2015-65136-C2-2-R); by the Generalitat Valenciana through the project DIIM2.0 (PROMETEOII/2014/001); and by the Universidad Nacional de Educación a Distancia through the project “Modelado y síntesis automática de opiniones de usuario en redes sociales” (2014-001-UNED-PROY).

References

  1. Aker, A., El-Haj, M., Albakour, M.-D., & Kruschwitz, U. (2012a). Assessing crowdsourcing quality through objective tasks. In Proceedings of the eighth international conference on language resources and evaluation (LREC-2012). European Language Resources Association (ELRA), Istanbul, Turkey (pp. 1456–61).Google Scholar
  2. Aker, A., Fan, X., Sanderson, M., & Gaizauskas, R. (2012b). Investigating summarization techniques for geo-tagged image indexing. In Advances in information retrieval: 34th European conference on information retrieval (ECIR), Barcelona, Spain (pp. 472–75).Google Scholar
  3. Aker, A., & Gaizauskas, R. (2010). Model summaries for location-related images. In Proceedings of the 7th language resources and evaluation conference.Google Scholar
  4. Alhindi, A., Kruschwitz, U., & Fox, C. (2013). A pilot study on using profile-based summarisation for interactive search assistance. In P. Serdyukov, P. Braslavski, S. Kuznetsov, J. Kamps, S. Rger, E. Agichtein, I. Segalovich & E. Yilmaz, E. (Eds.), Advances in information retrieval. Vol. 7814 of Lecture Notes in Computer Science, Springer, Berlin (pp. 672–75). doi: 10.1007/978-3-642-36973-5_57.
  5. Amigo, E., Gonzalo, J., Peinado, V., Peñas, A., & Verdejo, F. (2004). An empirical study of information synthesis task. In Proceedings of the 42nd meeting of the association for computational linguistics (ACL’04), Main Volume, Barcelona, Spain (pp. 207–14).Google Scholar
  6. Balikas, G., Krithara, A., Partalas, I., & Paliouras, G. (2015). Bioasq: A challenge on large-scale biomedical semantic indexing and question-answering. In Multimodal retrieval in the medical domain, Workshop at ECIR.Google Scholar
  7. Balikas, G., Partalas, I., Kosmopoulos, A., Petridis, S., Malakasiotis, P., & Pavlopoulos, I., et al. (2013). Bioasq evaluation framework specifications. Project deliverable D4.1. http://bioasq.org/sites/default/files/PublicDocuments/BioASQ_D4.1-EvaluationFrameworkSpecification_final.pdf.
  8. Bamman, D., O’Connor, B., & Smith, N. A. (2013). Learning latent personas of film characters. In ACL (1). The Association for Computer Linguistics (pp. 352–61).Google Scholar
  9. Banko, M., & Vanderwende, L. (2004). Using n-grams to understand the nature of summaries. In Proceedings of HLT-NAACL 2004: Short Papers. HLT-NAACL-Short ’04. Association for Computational Linguistics, Stroudsburg, PA, USA (pp. 1–4). http://dl.acm.org/citation.cfm?id=1613984.1613985.
  10. Barzilay, R., & Lapata, M. (2005). Modeling local coherence: An entity-based approach. In Proceedings of the 43rd annual meeting of the association for computational linguistics (ACL’05) (pp. 141–48).Google Scholar
  11. Barzilay, R., & Lapata, M. (2008). Modeling local coherence: An entity-based approach. Computational Linguistics, 34(1), 1–34.CrossRefGoogle Scholar
  12. Berlanga Llavori, R., Ramírez Cruz, Y., & Gil García, R. (2012). A framework for obtaining structurally complex condensed representations of document sets in the biomedical domain. Procesamiento del Lenguaje Natural, 49, 21–8.Google Scholar
  13. Branny, E. (2007). Automatic summary evaluation based on text grammars. Journal of Digital Information, 8(3), 1–6.Google Scholar
  14. Cabrera-Diego, L. A., Torres-Moreno, J., & Durette, B. (2016). Evaluating multiple summaries without human models: A first experiment with a trivergent model. In Natural language processing and information systems—21st international conference on applications of natural language to information systems, NLDB 2016, Salford, UK, June 22–24, 2016, Proceedings (pp. 91–101).Google Scholar
  15. Callison-Burch, C. (2009). Fast, cheap, and creative: evaluating translation quality using Amazon’s mechanical turk. In Proceedings of the 2009 conference on empirical methods in natural language processing (pp. 286–95).Google Scholar
  16. Carletta, J., Ashby, S., Bourban, S., Flynn, M., Guillemot, M., Hain, T., Kadlec, J., Karaiskos, V., Kraaij, W., Kronenthal, M., Lathoud, G., Lincoln, M., Lisowska, A., McCowan, I., Post, W., Reidsma, D., & Wellner, P. (2005). The AMI meeting corpus. In L. P. J. J. Noldus, F. Grieco, L. W. S. Loijens & P. H. Zimmerman (Eds.), Proceedings of the measuring behavior 2005 symposium on “annotating and measuring meeting behavior”.Google Scholar
  17. Chen, P., & Verma, R. (2006). A query-based medical information summarization system using ontology knowledge. In Proceedings of the IEEE symposium on computer-based medical systems (pp. 37–42).Google Scholar
  18. Christensen, J., Mausam, S. S., Soderland, S., & Etzioni, O. (2013). Towards coherent multi-document summarization. In Proceedings of the 2013 conference of the North American chapter of the association for computational linguistics: Human language technologies. Association for Computational Linguistics, Atlanta, Georgia (pp. 1163–1173). http://www.aclweb.org/anthology/N13-1136.
  19. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12, 2493–2537.Google Scholar
  20. Conroy, J. M., & Dang, H. T. (2008a). Mind the gap: Dangers of divorcing evaluations of summary content from linguistic quality. In Proceedings of the 22nd international conference on computational linguistics (Coling 2008). Coling 2008 Organizing Committee, Manchester, UK (pp. 145–52).Google Scholar
  21. Conroy, J. M., & Dang, H. T. (2008b). Mind the gap: Dangers of divorcing evaluations of summary content from linguistic quality. In Proceedings of the 22nd international conference on computational linguistics—Volume 1. COLING ’08. Association for Computational Linguistics, Stroudsburg, PA, USA (pp. 145–152). http://dl.acm.org/citation.cfm?id=1599081.1599100.
  22. Conroy, J. M., Schlesinger, J. D., Kubina, J., Rankel, P. A., & O’Leary, D. P. (2011). CLASSY 2011 at TAC: Guided and multi-lingual summaries and evaluation metrics. In Proceedings of the 2011 text analysis conference (TAC 2011).Google Scholar
  23. Conroy, J. M., Schlesinger, J. D., Rankel, P. A., & O’Leary, D. P. (2010). Guiding CLASSY toward more responsive summaries. In Proceedings of the 2010 text analysis conference (TAC 2010).Google Scholar
  24. Dalianis, H., & Hassel, M. (2001). Development of a Swedish corpus for evaluating summarizers and other IR-tools. Technical report TRITA-NAP0112, IPLab-188, NADA, KTH.Google Scholar
  25. Dang, H. T. (2005). Overview of DUC 2005. In Proceedings of the document understanding conference (DUC).Google Scholar
  26. Dang, H. T. (2006). Overview of DUC 2006. In Proceedings of the document understanding conference (DUC).Google Scholar
  27. Donaway, R. L., Drummey, K. W., & Mather, L. A. (2000). A comparison of rankings produced by summarization evaluation measures. In Proceedings of NAACL-ANLP 2000 workshop on automatic summarization (pp. 69–78).Google Scholar
  28. Dong, Z., & Dong, Q. (2003). HowNet—A hybrid language and knowledge resource. In Proceedings of natural language processing and knowledge engineering conference (pp. 820–24).Google Scholar
  29. Edmundson, H. P. (1969). New methods in automatic extracting. Journal of the ACM (JACM), 16(2), 264–85.CrossRefGoogle Scholar
  30. El-Haj, M., Kruschwitz, U., & Fox, C. (2010). Using mechanical turk to create a corpus of arabic summaries. In Proceedings of the seventh conference on international language resources and evaluation, Valletta, Malta.Google Scholar
  31. Elhadad, N., Kan, M. Y., Klavans, J. L., & McKeown, K. R. (2005). Customization in a unified framework for summarizing medical literature. Artificial Intelligence in Medicine, 33(2), 179–198. doi: 10.1016/j.artmed.2004.07.018.CrossRefGoogle Scholar
  32. Ellouze, S., Jaoua, M., & Belguith, L. H. (2016). Automatic evaluation of a summary’s linguistic quality. In Natural language processing and information systems—21st international conference on applications of natural language to information systems, NLDB 2016, Salford, UK, June 22–24, 2016, Proceedings (pp. 392–400).Google Scholar
  33. Ellouze, S., Jaoua, M., & Hadrich Belguith, L. (2017). Machine learning approach to evaluate multilingual summaries. In Proceedings of the MultiLing 2017 workshop on summarization and summary evaluation across source types and genres. Association for Computational Linguistics (pp. 47–54).Google Scholar
  34. Feng, D., Besana, S., & Zajac, R. (2009). Acquiring high quality non-expert knowledge from on-demand workforce. In Proceedings of the 2009 workshop on the people’s web meets NLP: Collaboratively constructed semantic resources. People’s Web ’09. Association for Computational Linguistics, Morristown, NJ, USA (pp. 51–6). http://portal.acm.org/citation.cfm?id=1699765.1699773.
  35. Field, D., Pulman, S., Van Labeke, N., Whitelock, D., & Richardson, J. (2013). Did I really mean that? Applying automatic summarisation techniques to formative feedback. In Proceedings of the international conference recent advances in natural language processing RANLP 2013. INCOMA Ltd. Shoumen, BULGARIA, Hissar, Bulgaria (pp. 277–84). http://www.aclweb.org/anthology/R13-1036.
  36. Fiori, A. (2014). Innovative document summarization techniques: Revolutionizing knowledge understanding: Revolutionizing knowledge understanding. In Advances in data mining and database management: IGI Global.Google Scholar
  37. Fiszman, M., Demner-Fushman, D., Kilicoglu, H., & Rindflesch, T. C. (2009). Automatic summarization of medline citations for evidence-based medical treatment: A topic-oriented evaluation. Journal of Biomedical Informatics, 42(5), 801–813. doi: 10.1016/j.jbi.2008.10.002.
  38. Gambhir, M., & Gupta, V. (2017). Recent automatic text summarization techniques: A survey. Artificial Intelligence Review, 47(1), 1–66. doi: 10.1007/s10462-016-9475-9.CrossRefGoogle Scholar
  39. Giannakopoulos, G., Conroy, J., Kubina, J., Rankel, P. A., Lloret, E., Steinberger, J., Litvak, M., & Favre, B. (2017). Multiling 2017 overview. In Proceedings of the MultiLing 2017 workshop on summarization and summary evaluation across source types and genres. Association for Computational Linguistics, Valencia, Spain (pp. 1–6). http://www.aclweb.org/anthology/W17-1001.
  40. Giannakopoulos, G., & Karkaletsis, V. (2011a). AutoSummENG and MeMoG in evaluating guided summaries. In Proceedings of the 2011 text analysis conference (TAC 2011).Google Scholar
  41. Giannakopoulos, G., & Karkaletsis, V. (2011b). Autosummeng and memog in evaluating guided summaries. In Proceedings of the text analysis conference (TAC 2011), Gaithersburg, Maryland, USA.Google Scholar
  42. Giannakopoulos, G., & Karkaletsis, V. (2013). Together we stand npower-ed. In Proceedings of CICLing 2013, Karlovasi, Samos, Greece.Google Scholar
  43. Giannakopoulos, G., Karkaletsis, V., Vouros, G., & Stamatopoulos, P. (2008). Summarization system evaluation revisited: N-gram graphs. ACM Transactions on Speech and Language Processing, 5(3), 1–39.CrossRefGoogle Scholar
  44. Gillick, D., & Liu, Y. (2010). Non-expert evaluation of summarization systems is Risky. In Proceedings of the NAACL HLT 2010 workshop on creating speech and language data with Amazon’s mechanical turk.Google Scholar
  45. Grosz, B. J., Weinstein, S., & Joshi, A. K. (1995). Centering: A framework for modeling the local coherence of discourse. Computational Linguistics, 21(2), 203–25.Google Scholar
  46. Hand, T. (1997). A proposal for task-based evaluation of text summarization systems. In Proceedings of the association for computational linguistics conference, Madrid, Spain (pp. 31–38).Google Scholar
  47. Harnly, A., Nenkova, A., Passonneau, R. J., & Rambow, O. (2015). Automatation of summary evaluation by the pyramid method. In Proceedings of the international conference recent advances in natural language processing (RANLP), Borovets, Bulgaria (pp. 226–232).Google Scholar
  48. Hasler, L. (2008). Centering theory for evaluation of coherence in computer-aided summaries. In Proceedings of the sixth international conference on language resources and evaluation.Google Scholar
  49. Hasler, L., Orăsan, C., & Mitkov, R. (2003). Building better corpora for summarization. In Proceedings of corpus linguistics 2003, Lancaster, UK (pp. 309–19).Google Scholar
  50. Hassel, M. (2004). Evaluation of automatic text summarization: A practical implementation.Google Scholar
  51. He, T., Chen, J., Ma, L., Gui, Z., Li, F., Shao, W., & Wang, Q. (2008). ROUGE-C: A fully automated evaluation method for multi-document summarization, Granular Computing, 2008. GrC 2008. In IEEE international conference on (pp. 269–74).Google Scholar
  52. Hong, K., Conroy, J., Favre, B., Kulesza, A., Lin, H., & Nenkova, A. (2014). A repository of state of the art and competitive baseline summaries for generic news summarization. In N. C. C. Chair, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the ninth international conference on language resources and evaluation (LREC’14). European Language Resources Association (ELRA), Reykjavik, Iceland.Google Scholar
  53. Hovy, E. (2005). The Oxford handbook of computational linguistics. Oxford University Press, Ch. Text Summarization (pp. 583–98).Google Scholar
  54. Hovy, E., Lin, C.-Y., Zhou, L., & Fukumoto, J. (2006). Automated summarization evaluation with basic elements. In Proceedings of the 5th international conference on language resources and evaluation.Google Scholar
  55. Jimeno-Yepes, A. J., Plaza, L., Mork, J. G., Aronson, A. R., & Díaz, A. (2013). MeSH indexing based on automatically generated summaries. BMC Bioinformatics, 14, 208.CrossRefGoogle Scholar
  56. Jing, H., Barzilay, R., McKeown, K. & Elhadad, M. (1998). Summarization evaluation methods: Experiments and analysis. In AAAI symposium on intelligent summarization (pp. 51–9).Google Scholar
  57. Kabadjov, M., Steinberger, J., Barker, E., Kruschwitz, U., & Poesio, M. (2015). Onforums: The shared task on online forum summarisation at multiling’15. In Proceedings of the 7th forum for information retrieval evaluation, FIRE ’15. ACM, New York, NY, USA (pp. 21–26). doi: 10.1145/2838706.2838709.
  58. Katragadda, R. (2010). GEMS: Generative modeling for evaluation of summaries. In Proceedings of the 11th international conference on computational linguistics and intelligent text processing (pp. 724–35).Google Scholar
  59. Khan, A., Salim, N., & Kumar, Y. J. (2015). A framework for multi-document abstractive summarization based on semantic role labelling. Applied Soft Computing, 30, 737–747.CrossRefGoogle Scholar
  60. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., & et al. (2007). Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the acl on interactive poster and demonstration sessions. Association for Computational Linguistics (pp. 177–80).Google Scholar
  61. Kupiec, J., Pedersen, J., & Chen, F., (1995). A trainable document summarizer. In Proceedings of the 18th annual international ACM SIGIR conference on research and development in information retrieval. ACM (pp. 68–73).Google Scholar
  62. Labeke, N. V., Whitelock, D., Field, D., Pulman, S., & Richardson, J. (2013a). What is my essay really saying? Using extractive summarization to motivate reflection and redrafting. In Proceedings of the workshops at the 16th international conference on artificial intelligence in education AIED 2013, Memphis, USA, July 9–13. Vol. 1009 of CEUR workshop proceedings. CEUR-WS.org.Google Scholar
  63. Labeke, N. V., Whitelock, D., Field, D., Pulman, S., & Richardson, J. T. E. (2013b). OpenEssayist: extractive summarisation and formative assessment of free-text essays. In 1st international workshop on discourse-centric learning analytics. A pre conference workshop at LAK13. http://oro.open.ac.uk/37548/.
  64. Lapata, M., & Barzilay, R., (2005a). Automatic evaluation of text coherence: Models and representations. In Proceedings of the 19th international joint conference on artificial intelligence, Edinburgh (pp. 1085–1090).Google Scholar
  65. Lapata, M., & Barzilay, R. (2005b). Automatic evaluation of text coherence: Models and representations. In Proceedings of the 19th international joint conference on artificial intelligence. IJCAI’05. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (pp. 1085–1090). http://dl.acm.org/citation.cfm?id=1642293.1642467.
  66. Lin, C.-Y. (2001). Summary evaluation environment. http://www.isi.edu/~cyl/SEE.
  67. Lin, C.-Y. (2004a). Rouge: A package for automatic evaluation of summaries. In Text summarization branches out: Proceedings of the ACL-04 workshop. Association for Computational Linguistics, Barcelona, Spain (pp. 74–81).Google Scholar
  68. Lin, C.-Y. (2004b). ROUGE: A package for automatic evaluation of summaries. In Proceedings of association of computational linguistics text summarization workshop (pp. 74–81).Google Scholar
  69. Lin, C.-Y., & Hovy, E. (2002). Manual and automatic evaluation of summaries. In Proceedings of the workshop on automatic summarization post conference workshop of ACL-02 (DUC 2002).Google Scholar
  70. Lin, C.-Y., & Hovy, E. (2003). Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 conference of the North American chapter of the association for computational linguistics on human language technology—Volume 1. NAACL ’03. Association for Computational Linguistics, Stroudsburg, PA, USA (pp. 71–78). doi: 10.3115/1073445.1073465
  71. Lin, Z., Liu, C., Ng, H. T., & Kan, M.-Y. (2012). Combining coherence models and machine translation evaluation metrics for summarization evaluation. In Proceedings of the 50th annual meeting of the association for computational linguistics: Long papers—Volume 1. Association for Computational Linguistics (pp. 1006–1014).Google Scholar
  72. Liseth, A. (2004). En evaluering av NorSum en automatisk tekstsammenfatter for norsk. Hovedfagsoppgave. Technical report: Universitetet i Bergen, Seksjon for lingvistiske fag.Google Scholar
  73. Liu, F., & Liu, Y. (2008a). Correlation between rouge and human evaluation of extractive meeting summaries. In Proceedings of the 46th annual meeting of the association for computational linguistics on human language technologies: Short papers. HLT-Short ’08. Association for Computational Linguistics, Stroudsburg, PA, USA (pp. 201–204). http://dl.acm.org/citation.cfm?id=1557690.1557747
  74. Liu, F., & Liu, Y. (2008b). Correlation between ROUGE and human evaluation of extractive meeting summaries. In Proceedings of the 46th annual meeting of the association of computational linguistics: Human language technologies, short papers (pp. 201–4).Google Scholar
  75. Lloret, E., Llorens, H., Moreda, P., Saquete, E., & Palomar, M. (2011). Text summarization contribution to semantic question answering: New approaches for finding answers on the web. International Journal of Intelligent Systems, 26(12), 1125–52.CrossRefGoogle Scholar
  76. Lloret, E., & Palomar, M. (2012). Text summarisation in progress: A literature review. Artificial Intelligence Review, 37(1), 1–41. doi: 10.1007/s10462-011-9216-z.CrossRefGoogle Scholar
  77. Lloret, E., Plaza, L., & Aker, A. (2013). Analyzing the capabilities of crowdsourcing services for text summarization. Language Resources and Evaluation, 47(2), 337–69. doi: 10.1007/s10579-012-9198-8.CrossRefGoogle Scholar
  78. Louis, A., & Nenkova, A. (2008). Automatic summary evaluation without human models. In Proceedings of the text analysing conference, (TAC 2008).Google Scholar
  79. Louis, A., & Nenkova, A. (2009a). Automatically evaluating content selection in summarization without human models. In Proceedings of the 2009 conference on empirical methods in natural language processing: Volume 1. Association for Computational Linguistics (pp. 306–314).Google Scholar
  80. Louis, A., & Nenkova, A. (2009b). Predicting summary quality using limited human input. In Proceedings of the 2009 text analysis conference (TAC 2009).Google Scholar
  81. Mani, I. (2001). Automatic summarization (Vol. 3). Amsterdam: John Benjamins Publishing Company.CrossRefGoogle Scholar
  82. Mani, I., House, D., Klein, G., Hirschman, L., Firmin, T., & Sundheim, B. (1999). The TIPSTER SUMMAC text summarization evaluation. In Proceedings of the ninth conference on European chapter of the association for computational linguistics. Association for Computational Linguistics (pp. 77–85).Google Scholar
  83. Marcu, D. (1997). From discourse structures to text summaries. In Proceedings of the ACL. Vol. 97 (pp. 82–88).Google Scholar
  84. Martschat, S., & Markert, K. (2017). Improving rouge for timeline summarization. In Proceedings of the 15th conference of the european chapter of the association for computational linguistics: Volume 2, short papers. Association for Computational Linguistics (pp. 285–290).Google Scholar
  85. Mason, W., & Watts, D. J. (2010). Financial incentives and the “performance of crowds”. ACM SigKDD Explorations Newsletter, 11, 100–8.CrossRefGoogle Scholar
  86. McKeown, K., Barzilay, R., Evans, D., Hatzivassiloglou, V., Kan, M. Y., Schiffman, B., & Teufel, S. (2001). Columbia multi-document summarisation: Approach and evaluation. In Proceedings of the DUC 2001.Google Scholar
  87. McKeown, K., Passonneau, R., Elson, D., Nenkova, A., & Hirschberg, J. (2005). Do summaries help? A task-based evaluation of multi-document summarization. In 28th annual ACM SIGIR conference on research and development in information retrieval, ACM, Salvador, Brazil (pp. 210–17).Google Scholar
  88. Nenkova, A. (2006). Summarization evaluation for text and speech: Issues and approaches. In INTERSPEECH-2006, paper 2079-Wed1WeS.1.Google Scholar
  89. Nenkova, A., & McKeown, K. (2011). Automatic summarization. Foundations and Trends in Information Retrieval, 5(2–3), 103–233. doi: 10.1561/1500000015.CrossRefGoogle Scholar
  90. Nenkova, A., & Passonneau, R. (2004). Evaluating content selection in summarization: The pyramid method. In HLT-NAACL 2004: Main Proceedings (pp. 145–52). Association for Computational Linguistics, Boston, Massachusetts, USA.Google Scholar
  91. Nenkova, A., Passonneau, R., & McKeown, K. (2007). The pyramid method: Incorporating human content selection variation in summarization evaluation. ACM Transactions on Speech and Language Processing (TSLP), 4(2), 2–23.Google Scholar
  92. Ng, J.-P., & Abrecht, V. (2015a). Better summarization evaluation with word embeddings for rouge. In Proceedings of the 2015 conference on empirical methods in natural language processing. Association for Computational Linguistics, Lisbon, Portugal (pp. 1925–1930).Google Scholar
  93. Ng, J.-P., & Abrecht, V. (2015b). Better summarization evaluation with word embeddings for rouge. In Proceedings of the 2015 conference on empirical methods in natural language processing. Association for Computational Linguistics, Lisbon, Portugal (pp. 1925–1930). http://aclweb.org/anthology/D15-1222.
  94. Ono, K., Sumita, K., & Miike, S. (1994). Abstract generation based on rhetorical structure extraction. In Proceedings of the 15th conference on Computational linguistics—Volume 1. Association for Computational Linguistics (pp. 344–48).Google Scholar
  95. Over, P., & Liggett, W. (2002). Introduction to DUC: An intrinsic evaluation of generic news text summarization systems. In Proceedings of DUC 2002.Google Scholar
  96. Owczarzak, K. (2009). DEPEVAL(summ): Dependency-based evaluation for automatic summaries. In Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP (pp. 190–98).Google Scholar
  97. Owczarzak, K., Conroy, J. M., Dang, H. T., & Nenkova, A. (2012a). An assessment of the accuracy of automatic evaluation in summarization. In Proceedings of workshop on evaluation metrics and system comparison for automatic summarization. Association for Computational Linguistics, Stroudsburg, PA, USA (pp. 1–9).Google Scholar
  98. Owczarzak, K., Conroy, J. M., Dang, H. T., & Nenkova, A. (2012b). An assessment of the accuracy of automatic evaluation in summarization. In Proceedings of workshop on evaluation metrics and system comparison for automatic summarization. Association for Computational Linguistics, Stroudsburg, PA, USA (pp. 1–9). http://dl.acm.org/citation.cfm?id=2391258.2391259.
  99. Owczarzak, K., & Dang, H. T. (2011). Overview of the TAC 2011 summarization track: Guided task and AESOP task. In Proceedings of the text analysis conference (TAC).Google Scholar
  100. Paice, C. D. (1990). Constructing literature abstracts by computer: Techniques and prospects. Information Processing & Management, 26(1), 171–86.CrossRefGoogle Scholar
  101. Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of 40th annual meeting of the association for computational linguistics (pp. 311–318).Google Scholar
  102. Passonneau, R. J., Chen, E., Guo, W., & Perin, D. (2013). Automated pyramid scoring of summaries using distributional semantics. In Proceedings of the 51st annual meeting of the association for computational linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Sofia, Bulgaria (pp. 143–147).Google Scholar
  103. Passonneau, R. J., Nenkova, A., McKeown, K., & Sigelman, S. (2005). Applying the pyramid method in DUC 2005. In Proceedings of the document understanding conference (DUC 05), Vancouver, BC, Canada.Google Scholar
  104. Perea-Ortega, J. M., Lloret, E., Ureña López, A., & Palomar, M. (2013). Application of text summarization techniques to the geographical information retrieval task. Expert Systems with Applications, 40(8), 2966–74. doi: 10.1016/j.eswa.2012.12.012.CrossRefGoogle Scholar
  105. Pitler, E., Louis, A., & Nenkova, A. (2010). Automatic evaluation of linguistic quality in multi-document summarization. In Proceedings of the 48th annual meeting of the association for computational linguistics. ACL ’10. Association for Computational Linguistics, Stroudsburg, PA, USA (pp. 544–554). http://dl.acm.org/citation.cfm?id=1858681.1858737.
  106. Pitler, E., & Nenkova, A. (2008). Revisiting readability: A unified framework for predicting text quality. In Proceedings of the 2008 conference on empirical methods in natural language processing (pp. 186–95).Google Scholar
  107. Plaza, L. (2014). Comparing different knowledge sources for the automatic summarization of biomedical literature. Journal of Biomedical Informatics, 52, 319–328, special Section: Methods in clinical research informatics. http://www.sciencedirect.com/science/article/pii/S1532046414001610.
  108. Plaza, L., Stevenson, M., & Díaz, A. (2010). Improving summarization of biomedical documents using word sense disambiguation. In Proceedings of the 2010 workshop on biomedical natural language processing. BioNLP ’10. Association for Computational Linguistics, Stroudsburg, PA, USA (pp. 55–63). http://dl.acm.org/citation.cfm?id=1869961.1869968.
  109. Radev, D. R. (2001). Experiments in single and multidocument summarization using mead. In First document understanding conference (DUC 2001).Google Scholar
  110. Radev, D. R., & Tam, D. (2003). Summarization evaluation using relative utility. In CIKM ’03: Proceedings of the 12th international conference on information and knowledge management (pp. 508–11).Google Scholar
  111. Rankel, P., Conroy, J. M., Slud, E. V., & O’Leary, D. P. (2011). Ranking human and machine summarization systems. In Proceedings of the conference on empirical methods in natural language processing. EMNLP ’11. Association for Computational Linguistics, Stroudsburg, PA, USA (pp. 467–473).Google Scholar
  112. Rankel, P. A., Conroy, J. M., Dang, H. T., & Nenkova, A. (2013). A decade of automatic content evaluation of news summaries: Reassessing the state of the art. In Proceedings of the 51st annual meeting of the association for computational linguistics, ACL 2013, 4–9 August 2013, Sofia, Bulgaria, Volume 2: Short Papers (pp. 131–136).Google Scholar
  113. Rankel, P. A., Conroy, J. M., & Schlesinger, J. D. (2012). Better metrics to automatically predict the quality of a text summary. Algorithms, 5(4), 398. http://www.mdpi.com/1999-4893/5/4/398.
  114. Reeve, L. H., Han, H., & Brooks, A. D. (2007). The use of domain-specific concepts in biomedical text summarization. Information Processing & Management, 43(6), 1765–1776, text summarization. http://www.sciencedirect.com/science/article/pii/S030645730700074X.
  115. Resnik, P. (1995). Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of international joint conferences on artificial intelligence (IJCAI), Montreal, Canada (pp. 448–53).Google Scholar
  116. Saggion, H., & Lapalme, G. (2000). Selective analysis for automatic abstracting: Evaluating indicativeness and acceptability. In Proceedings of content-based multimedia information access (pp. 747–64).Google Scholar
  117. Saggion, H., & Szasz, S. (2012). The CONCISUS corpus of event summaries. In N. Calzolari, K. Choukri, T. Declerck, M. U. Dogan, B. Maegaard, J. Mariani, J. Odijk & S. Piperidis (Eds.), LREC. European Language Resources Association (ELRA) (pp. 2031–37).Google Scholar
  118. Saggion, H., Teufel, S., Radev, D., & Lam, W. (2002). Meta-evaluation of summaries in a cross-lingual environment using content-based metrics. In Proceedings of the 19th international conference on Computational linguistics (pp. 1–7).Google Scholar
  119. Saggion, H., Torres-Moreno, J., da Cunha, I., SanJuan, E., & Velázquez-Morales, P. (2010). Multilingual summarization evaluation without human models. In COLING 2010, 23rd international conference on computational linguistics, posters volume, 23–27 August 2010, Beijing, China (pp. 1059–1067).Google Scholar
  120. Salton, G., Singhal, A., Mitra, M., & Buckley, C. (1997). Automatic text structuring and summarization. Information Processing & Management, 33, 193–207.
  121. Schlesinger, J. D., O’Leary, D. P., & Conroy, J. M. (2008). Arabic/English multi-document summarization with CLASSY—The past and the future, Springer, Berlin (pp. 568–581). doi: 10.1007/978-3-540-78135-6_49.
  122. Schluter, N. (2017). The limits of automatic summarisation according to ROUGE. In Proceedings of the 15th conference of the European chapter of the association for computational linguistics: Volume 2, short papers. Association for Computational Linguistics (pp. 41–45).
  123. Sjöbergh, J. (2007). Older versions of the ROUGEeval summarization evaluation system were easier to fool. Information Processing & Management, 43(6), 1500–5.
  124. Smith, C., Danielsson, H., & Jönsson, A. (2012). A more cohesive summarizer. In COLING 2012, 24th international conference on computational linguistics, proceedings of the conference: Posters, 8–15 December 2012, Mumbai, India (pp. 1161–1170).
  125. Spärck Jones, K. (2007). Automatic summarising: The state of the art. Information Processing & Management, 43(6), 1449–1481. doi: 10.1016/j.ipm.2007.03.009.
  126. Spärck Jones, K., & Galliers, J. (1996). Evaluating natural language processing systems (an analysis and review). In Lecture Notes in Computer Science, Springer.
  127. Steinberger, J., Kabadjov, M., Pouliquen, B., Steinberger, R., & Poesio, M. (2009). WB-JRC-UT’s participation in TAC 2009: Update summarization and AESOP tasks. In Proceedings of the 2009 text analysis conference (TAC 2009).
  128. Tang, J., & Sanderson, M. (2010). Evaluation and user preference study on spatial diversity. In Proceedings of the 32nd European conference on information retrieval (ECIR).
  129. Teufel, S. (2001). Task-based evaluation of summary quality: Describing relationships between scientific papers. In Workshop automatic summarization, NAACL (pp. 12–21).
  130. Teufel, S., & van Halteren, H. (2004). Evaluating information content by factoid analysis: Human annotation and stability. In Proceedings of the conference on empirical methods in natural language processing (pp. 419–26).
  131. Tombros, A., & Sanderson, M. (1998). Advantages of query biased summaries in information retrieval. In Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval, ACM, New York, NY, USA (pp. 2–10).
  132. Torres-Moreno, J. (2011). Résumé automatique de documents. Recherche d’information et web. Hermes Science Publications. https://books.google.es/books?id=9HeLsuRFRJMC.
  133. Torres-Moreno, J. (2014). Automatic Text Summarization. Cognitive science and knowledge management series. Wiley. https://books.google.es/books?id=aPHsBQAAQBAJ.
  134. Torres-Moreno, J., Saggion, H., da Cunha, I., SanJuan, E., & Velázquez-Morales, P. (2010a). Summary evaluation with and without references. Polibits: Research Journal on Computer Science and Computer Engineering with Applications, 42, 13–19.
  135. Torres-Moreno, J., Saggion, H., da Cunha, I., Velázquez-Morales, P., & SanJuan, E. (2010b). Evaluation automatique de résumés avec et sans références. In TALN’10, Montréal, Canada.
  136. Tratz, S., & Hovy, E. (2008). Summarization evaluation using transformed basic elements. In Proceedings of the 1st text analysis conference.
  137. Turchi, M., Steinberger, J., Kabadjov, M., & Steinberger, R. (2010). Using parallel corpora for multilingual (multi-document) summarisation evaluation. In Multilingual and multimodal information access evaluation. Vol. 6360 of Lecture Notes in Computer Science (pp. 52–63).
  138. Ulrich, J., Murray, G., & Carenini, G. (2008). A publicly available annotated corpus for supervised email summarization. In AAAI08 EMAIL Workshop, AAAI, Chicago, USA.
  139. Vadlapudi, R., & Katragadda, R. (2010a). Quantitative evaluation of grammaticality of summaries. In Proceedings of the 11th international conference on computational linguistics and intelligent text processing, CICLing 2010, Iasi, Romania (pp. 736–47).
  140. Vadlapudi, R., & Katragadda, R. (2010b). On automated evaluation of readability of summaries: Capturing grammaticality, focus, structure and coherence. In Proceedings of the NAACL HLT 2010 student research workshop. HLT-SRWS ’10. Association for Computational Linguistics, Stroudsburg, PA, USA (pp. 7–12). http://dl.acm.org/citation.cfm?id=1858146.1858148.
  141. Vadlapudi, R., & Katragadda, R. (2010c). On automated evaluation of readability of summaries: Capturing grammaticality, focus, structure and coherence. In Proceedings of the NAACL HLT 2010 student research workshop (pp. 7–12).
  142. Van Dijk, T. (1972). Some aspects of text grammars. A study in theoretical linguistics and poetics. Paris, Mouton: The Hague.
  143. Voorhees, E. (2003). Overview of the TREC 2003 question answering track. In Proceedings of the twelfth text retrieval conference (TREC).
  144. Wang, C., Long, L., & Li, L. (2008). HowNet based evaluation for Chinese text summarization. In Proceedings of the international conference on natural language processing and software engineering (pp. 82–7).
  145. Wang, X., Evanini, K., & Zechner, K. (2013). Coherence modeling for the automated assessment of spontaneous spoken responses. In Proceedings of the 2013 conference of the North American chapter of the association for computational linguistics: Human language technologies. Association for Computational Linguistics, Atlanta, Georgia (pp. 814–819). http://www.aclweb.org/anthology/N13-1101.
  146. Wu, M., Wilkinson, R., & Paris, C. (2004). An evaluation on query-biased summarisation for the question answering task. In Proceedings of the Australasian language technology workshop 2004, Sydney, Australia (pp. 32–8). http://www.aclweb.org/anthology/U/U04/U04-1005.
  147. Yin, W., & Schütze, H. (2015). Discriminative phrase embedding for paraphrase identification. In Proceedings of the 2015 conference of the North American chapter of the association for computational linguistics: Human language technologies. Association for Computational Linguistics, Denver, Colorado (pp. 1368–1373).
  148. Zhou, L., Lin, C.-Y., Munteanu, D. S., & Hovy, E. (2006). ParaEval: Using paraphrases to evaluate summaries automatically. In Proceedings of the human language technology/North American association of computational linguistics conference (pp. 447–54).
  149. Zhu, X., & Cimino, J. J. (2013). Clinicians’ evaluation of computer-assisted medication summarization of electronic medical records. Computers in Biology and Medicine, 59, 221–231.

Copyright information

© Springer Science+Business Media B.V. 2017

Authors and Affiliations

  1. Universidad de Alicante, Alicante, Spain
  2. IR & NLP UNED, Madrid, Spain
  3. University of Duisburg-Essen, Duisburg, Germany