Crowdsourcing Named Entity Recognition and Entity Linking Corpora

Handbook of Linguistic Annotation

Abstract

This chapter describes our experience with crowdsourcing a corpus containing named entity annotations and their linking to DBpedia. The corpus consists of around 10,000 tweets and is still growing, as new social media content is added. We first define the methodological framework for crowdsourcing entity-annotated corpora, which combines expert-based and paid-for crowdsourcing. We then present the infrastructural support and reusable components of the GATE Crowdsourcing plugin. Next, we discuss in detail the process of crowdsourcing named entity annotations and their DBpedia grounding, including annotation schemas, annotation interfaces, and inter-annotator agreement. Where judgements differed and needed adjudication, we mostly relied on experts, in order to ensure a high-quality gold standard.
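
As a rough illustration of the adjudication step mentioned in the abstract, the sketch below accepts a crowd label when enough contributors agree and routes the remaining units to experts. It is a minimal sketch under stated assumptions, not the chapter's pipeline or the GATE Crowdsourcing plugin's API: the function names, the data layout (a dict from unit id to a list of labels), and the 0.6 agreement threshold are all hypothetical.

```python
from collections import Counter


def majority_label(judgements):
    """Return (most frequent label, share of judgements that chose it)."""
    if not judgements:
        return None, 0.0
    label, votes = Counter(judgements).most_common(1)[0]
    return label, votes / len(judgements)


def split_for_adjudication(units, min_agreement=0.6):
    """Split units into crowd-accepted labels and units needing expert review.

    `units` maps a unit id to the list of labels given by crowd contributors.
    `min_agreement` is a hypothetical threshold; keeping it above 0.5 means
    that tied units are always sent to experts.
    """
    accepted, to_experts = {}, []
    for unit_id, judgements in units.items():
        label, agreement = majority_label(judgements)
        if label is not None and agreement >= min_agreement:
            accepted[unit_id] = label
        else:
            to_experts.append(unit_id)
    return accepted, to_experts


# Hypothetical example data: three tweets, three judgements each.
example_units = {
    "t1": ["PER", "PER", "PER"],
    "t2": ["LOC", "ORG", "PER"],
    "t3": ["ORG", "ORG", "LOC"],
}
accepted, to_experts = split_for_adjudication(example_units)
```

With the example data, `t1` and `t3` are accepted by the crowd majority, while `t2` has no label reaching the threshold and would go to an expert.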

Notes

  1. http://www.clips.uantwerpen.be/conll2003/ner/.

  2. A corpus of 12 245 tweets with entity annotations was created by [24], but this is not shared due to Microsoft policy and the system is not available either.

  3. http://www.ucomp.eu.

  4. http://www.crowdflower.com.

  5. The resulting median pay for trusted contributors on entity recognition was USD 11.37/hr, an ethical rate of pay considering that the majority of crowdsource workers rely on it for income.

  6. Prefaced at http://www.bbc.co.uk/editorialguidelines/page/guidance-language-full.

  7. There is a 1-to-1 mapping between each DBpedia URI and the corresponding Wikipedia page, which makes it possible to treat Wikipedia as a large corpus, human annotated with DBpedia URIs.
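
The 1-to-1 mapping in note 7 can be illustrated with a small conversion helper. This is a minimal sketch that assumes the English DBpedia resource namespace and the English Wikipedia article prefix; the function names are hypothetical, and differences in percent-encoding between the two sites are ignored.

```python
# Assumed prefixes: English DBpedia resources and English Wikipedia articles.
DBPEDIA_PREFIX = "http://dbpedia.org/resource/"
WIKIPEDIA_PREFIX = "https://en.wikipedia.org/wiki/"


def dbpedia_to_wikipedia(uri):
    """Map a DBpedia resource URI to the matching Wikipedia page URL."""
    if not uri.startswith(DBPEDIA_PREFIX):
        raise ValueError(f"not a DBpedia resource URI: {uri}")
    return WIKIPEDIA_PREFIX + uri[len(DBPEDIA_PREFIX):]


def wikipedia_to_dbpedia(url):
    """Map a Wikipedia page URL back to the matching DBpedia resource URI."""
    if not url.startswith(WIKIPEDIA_PREFIX):
        raise ValueError(f"not an English Wikipedia page URL: {url}")
    return DBPEDIA_PREFIX + url[len(WIKIPEDIA_PREFIX):]


# Both resources share the same page title, so the mapping is a prefix swap.
page = dbpedia_to_wikipedia("http://dbpedia.org/resource/Sheffield")
```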

References

  1. ACE: Annotation Guidelines for Event Detection and Characterization (EDC) (Feb 2004), available at http://www.ldc.upenn.edu/Projects/ACE/

  2. Amigó, E., Corujo, A., Gonzalo, J., Meij, E., Rijke, M.d.: Overview of RepLab 2012: evaluating online reputation management systems. In: CLEF 2012 Labs and Workshop Notebook Papers (2012)

  3. Artstein, R., Poesio, M.: Kappa³ = Alpha (or Beta). Technical Report CSM-437, Department of Computer Science, University of Essex, Colchester, UK (2005)

  4. Biewald, L.: Massive multiplayer human computation for fun, money, and survival. In: Current Trends in Web Engineering, pp. 171–176. Springer, Berlin (2012)

  5. Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hellmann, S.: DBpedia - a crystallization point for the web of data. J. Web Semant. Sci. Serv. Agents Worldw. Web 7, 154–165 (2009)

  6. Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: a collaboratively created graph database for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 1247–1250. ACM (2008)

  7. Bontcheva, K., Cunningham, H., Roberts, I., Roberts, A., Tablan, V., Aswani, N., Gorrell, G.: GATE teamware: a web-based, collaborative text annotation framework. Lang. Resour. Eval. 47, 1007–1029 (2013)

  8. Bontcheva, K., Derczynski, L., Funk, A., Greenwood, M.A., Maynard, D., Aswani, N.: TwitIE: an open-source information extraction pipeline for microblog text. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing. Association for Computational Linguistics (2013)

  9. Bontcheva, K., Roberts, I., Derczynski, L., Rout, D.: The GATE crowdsourcing plugin: crowdsourcing annotated corpora made easy. In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Association for Computational Linguistics (2014)

  10. Chinchor, N.A.: Overview of MUC-7/MET-2. In: Proceedings of the 7th Message Understanding Conference (MUC7) (Apr 1998), available at http://www.muc.saic.com/proceedings/muc_7_toc.html

  11. Cunningham, H.: JAPE: a Java Annotation Patterns Engine. Research Memorandum CS–99–06, Department of Computer Science, University of Sheffield (May 1999)

  12. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: GATE: an architecture for development of robust HLT applications. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 7–12 July 2002. pp. 168–175. ACL ’02, Association for Computational Linguistics, Stroudsburg, PA, USA (2002), http://gate.ac.uk/sale/acl02/acl-main.pdf

  13. Derczynski, L., Maynard, D., Aswani, N., Bontcheva, K.: Microblog-Genre noise and impact on semantic annotation accuracy. In: Proceedings of the 24th ACM Conference on Hypertext and Social Media. ACM (2013)

  14. Derczynski, L., Maynard, D., Rizzo, G., van Erp, M., Gorrell, G., Troncy, R., Bontcheva, K.: Analysis of named entity recognition and linking for tweets. Inf. Process. Manag. 51, 32–49 (2015)

  15. Finin, T., Murnane, W., Karandikar, A., Keller, N., Martineau, J., Dredze, M.: Annotating named entities in Twitter data with crowdsourcing. In: Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk. pp. 80–88 (2010)

  16. Hoffmann, L.: Crowd control. Commun. ACM 52(3), 16–17 (2009)

  17. Ide, N., Bonhomme, P., Romary, L.: XCES: An XML-based standard for linguistic corpora. In: Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000), 30 May – 2 Jun 2000. pp. 825–830. Athens, Greece (2000), http://www.lrec-conf.org/proceedings/lrec2000/pdf/172.pdf

  18. Ji, H., Grishman, R., Dang, H.T., Griffitt, K., Ellis, J.: Overview of the TAC 2010 knowledge base population track. In: Proceedings of the Third Text Analysis Conference (2010)

  19. Khanna, S., Ratan, A., Davis, J., Thies, W.: Evaluating and improving the usability of Mechanical Turk for low-income workers in India. In: Proceedings of the First ACM Symposium on Computing for Development. ACM (2010)

  20. Krug, S.: Don’t Make Me Think: A Common Sense Approach to Web Usability. Pearson Education, New York (2009)

  21. Lampos, V., Preotiuc-Pietro, D., Cohn, T.: A user-centric model of voting intention from social media. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. pp. 993–1003. Association for Computational Linguistics (2013)

  22. Laws, F., Scheible, C., Schütze, H.: Active learning with Amazon Mechanical Turk. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. pp. 1546–1556 (2011)

  23. Lawson, N., Eustice, K., Perkowitz, M., Yetisgen-Yildiz, M.: Annotating large email datasets for named entity recognition with Mechanical Turk. In: Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk. pp. 71–79 (2010)

  24. Liu, X., Zhou, M., Wei, F., Fu, Z., Zhou, X.: Joint inference of named entity recognition and normalization for tweets. In: Proceedings of the Association for Computational Linguistics. pp. 526–535 (2012)

  25. Meij, E., Weerkamp, W., de Rijke, M.: Adding semantics to microblog posts. In: Proceedings of the Fifth International Conference on Web Search and Data Mining (WSDM) (2012)

  26. Mendes, P.N., Jakob, M., García-Silva, A., Bizer, C.: DBpedia spotlight: shedding light on the web of documents. In: Proceedings of the 7th International Conference on Semantic Systems (I-Semantics) (2011)

  27. Milne, D., Witten, I.H.: Learning to link with Wikipedia. In: Proceedings of the 17th Conference on Information and Knowledge Management (CIKM). pp. 509–518 (2008)

  28. Poesio, M., Kruschwitz, U., Chamberlain, J., Robaldo, L., Ducceschi, L.: Phrase detectives: utilizing collective intelligence for internet-scale language resource creation. Trans. Interact. Intell. Syst. 3(1) (2013)

  29. Preotiuc-Pietro, D., Samangooei, S., Cohn, T., Gibbins, N., Niranjan, M.: Trendminer: an architecture for real time analysis of social media text. In: Proceedings of the workshop on Real-Time Analysis and Mining of Social Streams (2012)

  30. Ramanath, R., Choudhury, M., Bali, K., Roy, R.S.: Crowd prefers the middle path: a new IAA metric for crowdsourcing reveals turker biases in query segmentation. In: Proceedings of the annual conference of the Association for Computational Linguistics, vol. 1, pp. 1713–1722 (2013)

  31. Rao, D., McNamee, P., Dredze, M.: Entity linking: finding extracted entities in a knowledge base. In: Multi-source, Multi-lingual Information Extraction and Summarization. Springer, Berlin (2013)

  32. Ritter, A., Clark, S., Mausam, Etzioni, O.: Named entity recognition in tweets: an experimental study. In: Proceedings of Empirical Methods for Natural Language Processing (EMNLP). Edinburgh, UK (2011)

  33. Rout, D., Preotiuc-Pietro, D., Bontcheva, K., Cohn, T.: Where’s @wally? a classification approach to geolocating users based on their social ties. In: Proceedings of the 24th ACM Conference on Hypertext and Social Media (2013)

  34. Sabou, M., Bontcheva, K., Derczynski, L., Scharl, A.: Corpus annotation through crowdsourcing: towards best practice guidelines. In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC14). pp. 859–866 (2014)

  35. Shen, W., Wang, J., Luo, P., Wang, M.: LINDEN: linking named entities with knowledge base via semantic knowledge. In: Proceedings of the 21st Conference on World Wide Web. pp. 449–458 (2012)

  36. Suchanek, F.M., Kasneci, G., Weikum, G.: Yago: a core of semantic knowledge. In: Proceedings of the 16th international conference on World Wide Web. pp. 697–706. ACM (2007)

  37. Tjong Kim Sang, E.F., Meulder, F.D.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Proceedings of CoNLL-2003. pp. 142–147. Edmonton, Canada (2003)

  38. Voyer, R., Nygaard, V., Fitzgerald, W., Copperman, H.: A hybrid model for annotating named entity training corpora. In: Proceedings of the fourth linguistic annotation workshop. pp. 243–246. Association for Computational Linguistics (2010)

Acknowledgements

Special thanks to Niraj Aswani for implementing the initial entity linking prototype in CrowdFlower, as well as to Marta Sabou, Arno Scharl, and other uComp project members for their feedback on the task and user interface designs. Many thanks also to Johann Petrak and Genevieve Gorrell for their help with the automatic candidate generation for entity linking. We are particularly grateful to all researchers at the Sheffield NLP group and members of the TrendMiner and uComp projects who helped create the gold data units. This research was supported by EPSRC EP/K017896/1, FWF 1097-N23, and ANR-12-CHRI-0003-03, in the framework of the CHIST-ERA ERA-NET (uComp project), as well as by the UK Engineering and Physical Sciences Research Council (grant EP/I004327/1).

Author information

Corresponding author

Correspondence to Kalina Bontcheva.

Copyright information

© 2017 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Bontcheva, K., Derczynski, L., Roberts, I. (2017). Crowdsourcing Named Entity Recognition and Entity Linking Corpora. In: Ide, N., Pustejovsky, J. (eds) Handbook of Linguistic Annotation. Springer, Dordrecht. https://doi.org/10.1007/978-94-024-0881-2_32

  • DOI: https://doi.org/10.1007/978-94-024-0881-2_32

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-94-024-0879-9

  • Online ISBN: 978-94-024-0881-2

  • eBook Packages: Social Sciences (R0)
