The Effect of Corpora Size on Performance of Named Entity Recognition

  • Zeinab Liaghat
Part of the Studies in Big Data book series (SBD, volume 27)


The amount of online text available is growing continuously and has reached hundreds of billions of words. Much research has used this data to improve results on a range of problems. Yet algorithms are still continuously optimized, tested, and compared after training on corpora of one million words or fewer. Most research focuses on the accuracy of the results these algorithms produce and often overlooks their running time or the cost of running them. The main goal of this paper is to show the effect that large data has on the running time and performance of Natural Language Processing algorithms. To this end, three Named Entity Recognition (NER) tools were selected. We evaluated the trade-off between quality and running time, and the effect of increasing data size on performance, across a varied selection of tools in the NER domain. The results show that the existing tools cannot cope with increasing data size. Moreover, as data size increases, quality improves but performance degrades, which renders the existing tools inefficient. Optimizing these tools makes it possible to process large data sizes; unfortunately, latency remains high.
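As a rough illustration of the kind of measurement described above, the following sketch times a tagger on growing slices of a corpus and reports token-level F1 for each slice. It is a minimal sketch under stated assumptions, not the setup used in the chapter: `toy_tagger`, `f1_score`, and `benchmark` are hypothetical names, the capitalization rule stands in for a real NER tool, and only tagging time is measured (a fuller study would also time training).

```python
import time
from typing import Callable, List, Set, Tuple

# (sentence index, token index) of a token marked as part of an entity
Entity = Tuple[int, int]


def toy_tagger(sentence: List[str]) -> List[bool]:
    """Hypothetical stand-in for a real NER tool: flags capitalized tokens."""
    return [tok[:1].isupper() for tok in sentence]


def f1_score(predicted: Set[Entity], gold: Set[Entity]) -> float:
    """Token-level F1 between predicted and gold entity positions."""
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def benchmark(
    tagger: Callable[[List[str]], List[bool]],
    corpus: List[List[str]],
    gold: Set[Entity],
    sizes: List[int],
) -> List[Tuple[int, float, float]]:
    """Return (size, seconds, F1) for the first `size` sentences of the corpus."""
    results = []
    for n in sizes:
        subset = corpus[:n]
        start = time.perf_counter()
        predicted = {
            (i, j)
            for i, sent in enumerate(subset)
            for j, flag in enumerate(tagger(sent))
            if flag
        }
        elapsed = time.perf_counter() - start
        # Score only against gold annotations that fall inside the slice.
        gold_subset = {(i, j) for (i, j) in gold if i < n}
        results.append((n, elapsed, f1_score(predicted, gold_subset)))
    return results


if __name__ == "__main__":
    # Tiny made-up corpus, for illustration only.
    corpus = [["Alice", "visited", "Paris"], ["the", "cat", "sat"]]
    gold = {(0, 0), (0, 2)}
    for size, seconds, f1 in benchmark(toy_tagger, corpus, gold, [1, 2]):
        print(f"{size} sentences: {seconds:.6f} s, F1 = {f1:.2f}")
```

Replacing `toy_tagger` with a wrapper around an actual tool and passing progressively larger `sizes` would give one quality and running-time pair per corpus size.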


Big data, Machine learning, Named entity recognition



Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. Web Research Group, DTIC, Universitat Pompeu Fabra, Barcelona, Spain
