Reducing Human Effort in Named Entity Corpus Construction Based on Ensemble Learning and Annotation Categorization

  • Conference paper
Natural Language Understanding and Intelligent Applications (ICCPOL 2016, NLPCC 2016)

Abstract

Annotated named entity corpora play a significant role in many natural language processing applications. However, human annotation is time-consuming and costly. In this paper, we propose a high-recall pre-annotator that combines multiple existing named entity taggers through ensemble learning, reducing the number of annotations that humans have to add. In addition, annotations are categorized into normal annotations and candidate annotations according to their estimated confidence, which reduces both the number of human corrective actions and the total annotation time. The experimental results show that our approach outperforms the baseline methods in reducing annotation time without loss in annotation performance (in terms of F-measure).
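The two ideas in the abstract, pooling the output of several taggers for recall and splitting the pooled annotations by confidence, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function name `pre_annotate`, the tagger names, the span format, the 0.5 threshold, and above all the confidence estimate, which here is simply the fraction of taggers that proposed a span. The abstract does not state how confidence is estimated, so this vote fraction is a stand-in and not the paper's method.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# A span is (start_token, end_token_exclusive, entity_type).
Span = Tuple[int, int, str]


def pre_annotate(
    tagger_outputs: Dict[str, List[Span]],
    threshold: float = 0.5,
) -> Tuple[List[Span], List[Span]]:
    """Pool spans from all taggers and split them by vote-based confidence.

    Confidence is the fraction of taggers that proposed the exact span.
    This is an illustrative assumption, not the estimator used in the paper.
    """
    votes: Dict[Span, int] = defaultdict(int)
    for spans in tagger_outputs.values():
        for span in set(spans):  # count each tagger at most once per span
            votes[span] += 1

    n_taggers = len(tagger_outputs)
    normal: List[Span] = []
    candidate: List[Span] = []
    for span in sorted(votes):
        confidence = votes[span] / n_taggers
        (normal if confidence >= threshold else candidate).append(span)
    return normal, candidate


# Hypothetical outputs of three taggers on one sentence.
outputs = {
    "tagger_a": [(0, 2, "PER"), (5, 6, "LOC")],
    "tagger_b": [(0, 2, "PER")],
    "tagger_c": [(0, 2, "PER"), (8, 9, "ORG")],
}
normal, candidate = pre_annotate(outputs)
print(normal)     # spans proposed by at least half of the taggers
print(candidate)  # lower-confidence spans, kept so that recall stays high
```

Taking the union of all proposed spans favors recall, so the annotator adds fewer annotations by hand, while the normal/candidate split lets low-confidence spans be shown differently from those that are likely correct.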

Notes

  1. http://58.192.114.226/assistedNER.

  2. http://nlp.stanford.edu/software/CRF-NER.shtml (version 3.6.0).

  3. http://cogcomp.cs.illinois.edu/page/software_view/NETagger (version 2.8.8).

  4. http://balie.sourceforge.net (version 1.8.1).

  5. http://opennlp.apache.org/index.html (version 1.6.0).

  6. http://gate.ac.uk/ (version 8.1).

  7. http://alias-i.com/lingpipe/ (version 4.1.0).

Acknowledgement

This work is partially funded by the National Science Foundation of China under Grants 61170165, 61602260, and 61502095. We would like to thank the anonymous reviewers for their helpful comments.

Author information

Corresponding author

Correspondence to Zhiqiang Gao.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Lu, T., Zhu, M., Gao, Z. (2016). Reducing Human Effort in Named Entity Corpus Construction Based on Ensemble Learning and Annotation Categorization. In: Lin, CY., Xue, N., Zhao, D., Huang, X., Feng, Y. (eds) Natural Language Understanding and Intelligent Applications. ICCPOL 2016, NLPCC 2016. Lecture Notes in Computer Science, vol 10102. Springer, Cham. https://doi.org/10.1007/978-3-319-50496-4_22

  • DOI: https://doi.org/10.1007/978-3-319-50496-4_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-50495-7

  • Online ISBN: 978-3-319-50496-4

  • eBook Packages: Computer Science, Computer Science (R0)
