Abstract
The amount of online text available is growing continuously and has reached hundreds of billions of words. Much research has used this data to improve results on a variety of problems, yet algorithms are still optimized, tested, and compared after training on corpora of one million words or fewer. Most research focuses on the accuracy of the results these algorithms produce and often overlooks their running time and the cost of running them. The main goal of this paper is to show the effect that large data has on the running time and performance of such algorithms in Natural Language Processing. To this end, three Named Entity Recognition (NER) tools were selected to cover the best available variety of tools in the NER domain. We evaluated the trade-off between quality and running time, and the effect of increasing data size on performance. The results show that the existing tools cannot cope with increasing data size: as the data grows, quality increases but performance decreases, rendering the existing tools inefficient. Optimizing these tools allows large data sizes to be processed; unfortunately, latency remains high.
Copyright information
© 2018 Springer International Publishing AG
About this chapter
Cite this chapter
Liaghat, Z. (2018). The Effect of Corpora Size on Performance of Named Entity Recognition. In: Moshirpour, M., Far, B., Alhajj, R. (eds) Highlighting the Importance of Big Data Management and Analysis for Various Applications. Studies in Big Data, vol 27. Springer, Cham. https://doi.org/10.1007/978-3-319-60255-4_8
DOI: https://doi.org/10.1007/978-3-319-60255-4_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-60254-7
Online ISBN: 978-3-319-60255-4
eBook Packages: Engineering, Engineering (R0)