
Language Resources and Evaluation, Volume 47, Issue 2, pp 337–369

Analyzing the capabilities of crowdsourcing services for text summarization

  • Elena Lloret
  • Laura Plaza
  • Ahmet Aker
Original Paper

Abstract

This paper presents a detailed analysis of the use of crowdsourcing services for the text summarization task in the tourist domain. In particular, our aim is to retrieve relevant information about a place or an object pictured in an image in order to provide a short summary that would be of great help to a tourist. To tackle this task, we propose a broad set of experiments using crowdsourcing services that could serve as a reference for others who also want to rely on crowdsourcing. From the analysis carried out through our experimental setup and the results obtained, we conclude that, although crowdsourcing services were not well suited to simply gathering gold-standard summaries (as shown by the results of experiments 1, 2 and 4), the encouraging results of the third and sixth experiments lead us to believe that they can be successfully employed for identifying patterns in how humans generate summaries, and for validating and checking other tasks. Furthermore, this analysis serves as a guideline for the types of experiments that may or may not work when using crowdsourcing in the context of text summarization.
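As a rough illustration (not taken from the paper itself) of how an image-based summarization task of this kind can be posted to a crowdsourcing service, the sketch below creates a single Mechanical Turk HIT with boto3. The image URL, reward, number of workers, and form layout are illustrative assumptions rather than the authors' actual task design.

```python
# Minimal sketch, assuming an AWS account with MTurk access; values are illustrative.
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

QUESTION_XML = """<?xml version="1.0" encoding="UTF-8"?>
<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <!DOCTYPE html>
    <html>
      <body>
        <form name="mturk_form" method="post" action="https://www.mturk.com/mturk/externalSubmit">
          <!-- in a real HIT, a small script copies the assignmentId URL parameter into this field -->
          <input type="hidden" name="assignmentId" value="" />
          <p>Look at the pictured place and write a short summary (about 100 words)
             with information a tourist would find useful.</p>
          <img src="https://example.org/images/landmark_001.jpg" width="400" />
          <textarea name="summary" rows="8" cols="80"></textarea>
          <input type="submit" value="Submit" />
        </form>
      </body>
    </html>
  ]]></HTMLContent>
  <FrameHeight>600</FrameHeight>
</HTMLQuestion>"""

hit = mturk.create_hit(
    Title="Write a short tourist summary for the pictured landmark",
    Description="Write about 100 words of useful tourist information about the place in the image.",
    Keywords="summarization, tourism, writing",
    Reward="0.50",                       # USD per assignment (assumed value)
    MaxAssignments=5,                    # several workers per image, so summaries can be compared
    LifetimeInSeconds=7 * 24 * 3600,     # HIT available for one week
    AssignmentDurationInSeconds=30 * 60, # 30 minutes per worker
    Question=QUESTION_XML,
)
print("HIT created:", hit["HIT"]["HITId"])
```

Requesting several assignments per image is one simple way to cope with the quality issues discussed in the paper, since redundant summaries can later be cross-checked or filtered.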

Keywords

Information retrieval · Text summarization · Crowdsourcing services · Crowdflower · Mechanical Turk

Notes

Acknowledgments

This work was supported by the EU-funded TRIPOD project (IST-FP6-045335) and by the Spanish Government through the FPU program and the projects TIN2009-14659-C03-01, TSI 020312-2009-44, and TIN2009-13391-C04-01; and by Conselleria d’Educació–Generalitat Valenciana (grant no. PROMETEO/2009/119 and grant no. ACOMP/2010/286).


Copyright information

© Springer Science+Business Media B.V. 2012

Authors and Affiliations

  1. University of Alicante, Alicante, Spain
  2. Universidad Complutense de Madrid, Madrid, Spain
  3. University of Sheffield, Sheffield, UK