Rejection Threshold Estimation for an Unknown Language Model in an OCR Task

  • Joaquim Arlandis
  • Juan-Carlos Perez-Cortes
  • J. Ramon Navarro-Cerdan
  • Rafael Llobet
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6218)


In an OCR post-processing task, a language model is used to find the best transformation of the OCR hypothesis into a string compatible with the language. The cost of this transformation is used as a confidence value to reject the strings that are less likely to be correct, and the error rate of the accepted strings should be strictly controlled by the user. In this work, the expected error rate distribution of an unknown language model is estimated from a training set composed of known language models. This means that after building a new language model, the user should be able to automatically “fix” the expected error rate at an acceptable level instead of having to deal with an arbitrary threshold.
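The relation between a cost threshold and the error rate of the accepted strings can be illustrated with a minimal sketch. This is not the regression-based estimator proposed in the paper: it assumes a small labelled validation set is available, which is exactly what an unknown language model lacks. The function names, the `(cost, is_correct)` sample layout, and the toy values are all hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical sample layout: (transformation cost, whether the corrected string was right).
Sample = Tuple[float, bool]


def error_rate_at_threshold(samples: List[Sample], threshold: float) -> Optional[float]:
    """Error rate among the strings accepted at a given cost threshold.

    A string is accepted when its cost is <= threshold.
    Returns None when no string is accepted.
    """
    accepted = [ok for cost, ok in samples if cost <= threshold]
    if not accepted:
        return None
    return sum(1 for ok in accepted if not ok) / len(accepted)


def largest_threshold_for_error(samples: List[Sample], max_error: float) -> Optional[float]:
    """Largest observed cost whose accepted-string error rate stays <= max_error."""
    best = None
    for threshold in sorted({cost for cost, _ in samples}):
        rate = error_rate_at_threshold(samples, threshold)
        if rate is not None and rate <= max_error:
            best = threshold
    return best


# Toy validation data (invented for illustration only).
samples = [(0.1, True), (0.4, True), (0.9, False), (1.2, True), (2.0, False), (3.5, False)]

# The user fixes an acceptable error rate; the threshold follows from it.
threshold = largest_threshold_for_error(samples, max_error=0.25)
print(threshold)
```

With these toy values, asking for at most 25% errors accepts every string with cost up to 1.2. The point of the sketch is the direction of the mapping: the user states an error rate and the threshold is derived from it, rather than the user picking an arbitrary threshold.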


Error rate; rejection threshold; language model; error-correcting parsing; OCR post-processing; regression model



Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Joaquim Arlandis (1)
  • Juan-Carlos Perez-Cortes (1)
  • J. Ramon Navarro-Cerdan (1)
  • Rafael Llobet (1)
  1. Instituto Tecnológico de Informática, Universitat Politècnica de València, València, Spain
