The Effect of Corpora Size on Performance of Named Entity Recognition

Liaghat, Zeinab

doi:10.1007/978-3-319-60255-4_8

Chapter
First Online: 23 August 2017

Part of the book series: Studies in Big Data (SBD, volume 27)

Abstract

The amount of online text available is growing continuously and has reached hundreds of billions of words. Much research has used this data to improve results on a range of problems. Yet algorithms are still optimized, tested, and compared after training on corpora of one million words or fewer. Most research focuses on the accuracy of the results these algorithms produce, often overlooking their running time or the cost of running them. The main goal of this paper is to show the effect that large data has on the running time and performance of these algorithms in Natural Language Processing. To this end, three Named Entity Recognition (NER) tools were selected. We evaluated the trade-off between quality and running time, and the effect of increasing data size on performance, across a variety of tools in the NER domain. The results show that the existing tools cannot cope with increasing data size. Moreover, as data size increases, quality improves but performance degrades, which renders the existing tools inefficient. Optimizing these tools allows large data sizes to be processed; unfortunately, latency remains high.

Author information

Authors and Affiliations

Web Research Group, DTIC, Universitat Pompeu Fabra, Barcelona, Spain
Zeinab Liaghat

Corresponding author

Correspondence to Zeinab Liaghat.

Editor information

Editors and Affiliations

Department of Electrical & Computer Engineering, University of Calgary, Calgary, Alberta, Canada
Mohammad Moshirpour
Department of Electrical & Computer Engineering, University of Calgary, Calgary, Alberta, Canada
Behrouz Far
Department of Electrical & Computer Engineering, University of Calgary, Calgary, Alberta, Canada
Reda Alhajj

About this chapter

Cite this chapter

Liaghat, Z. (2018). The Effect of Corpora Size on Performance of Named Entity Recognition. In: Moshirpour, M., Far, B., Alhajj, R. (eds) Highlighting the Importance of Big Data Management and Analysis for Various Applications. Studies in Big Data, vol 27. Springer, Cham. https://doi.org/10.1007/978-3-319-60255-4_8

DOI: https://doi.org/10.1007/978-3-319-60255-4_8
Published: 23 August 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-60254-7
Online ISBN: 978-3-319-60255-4
eBook Packages: Engineering, Engineering (R0)
