Evaluating the Impact of OCR Errors on Topic Modeling

  • Stephen Mutuvi
  • Antoine Doucet
  • Moses Odeo
  • Adam Jatowt
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11279)


Historical documents pose a challenge for character recognition for several reasons: font disparities across different materials, a lack of orthographic standards (so that the same word may be spelled in different ways), material quality, and the unavailability of lexicons of known historical spelling variants. As a result, optical character recognition (OCR) of those documents often yields unsatisfactory accuracy, rendering the digital material only partially discoverable and the data it holds difficult to process. In this paper, we explore the impact of OCR errors on the identification of topics from a corpus of historical OCRed documents. Based on experiments performed on OCR text corpora, we observe that OCR noise negatively impacts the stability and coherence of topics generated by topic modeling algorithms, and we quantify the strength of this impact.
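
As a rough illustration of what a loss of topic stability under OCR noise could look like, the sketch below corrupts text with random character substitutions and compares the top-term lists of two topic models by their average best-match Jaccard overlap. It is not the procedure used in the paper: the function names, the substitution noise model, and the greedy topic matching are all assumptions made for illustration.

```python
import random


def add_ocr_noise(text, error_rate, seed=0):
    """Hypothetical noise model: each alphabetic character is replaced by a
    random lowercase letter with probability error_rate."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    return "".join(
        rng.choice(letters) if ch.isalpha() and rng.random() < error_rate else ch
        for ch in text
    )


def jaccard(terms_a, terms_b):
    """Jaccard similarity between two lists of top terms."""
    a, b = set(terms_a), set(terms_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


def descriptor_agreement(topics_a, topics_b):
    """Average Jaccard overlap after greedily pairing each topic in topics_a
    with its most similar unused topic in topics_b.
    Assumes both models have the same number of topics."""
    remaining = list(topics_b)
    total = 0.0
    for terms in topics_a:
        best = max(remaining, key=lambda other: jaccard(terms, other))
        total += jaccard(terms, best)
        remaining.remove(best)
    return total / len(topics_a)


# Made-up top terms from a model on clean text and one on noisy text.
clean_topics = [["war", "army", "general"], ["cotton", "price", "market"]]
noisy_topics = [["cotton", "price", "rnarket"], ["war", "arrny", "general"]]

agreement = descriptor_agreement(clean_topics, noisy_topics)
```

In this toy input, one misrecognized term per topic ("arrny", "rnarket") is enough to halve the agreement score relative to a comparison of the clean topics with themselves.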


Topic modeling, Topic coherence, Text mining, Topic stability



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Stephen Mutuvi (1)
  • Antoine Doucet (2), corresponding author
  • Moses Odeo (1)
  • Adam Jatowt (3)

  1. Multimedia University Kenya, Nairobi, Kenya
  2. La Rochelle University, La Rochelle, France
  3. Kyoto University, Kyoto, Japan
