Multimodal Output Combination for Transcribing Historical Handwritten Documents

  • Emilio Granell
  • Carlos-D. Martínez-Hinarejos
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9256)


Transcription of digitalised historical documents is an interesting task in the document analysis area. It can be carried out by applying Handwritten Text Recognition (HTR) to the digitalised pages or Automatic Speech Recognition (ASR) to a dictation of their contents. A further option is to use both systems in a multimodal combination to obtain a draft transcription, since combining the outputs of different recognition systems generally improves recognition accuracy. In this work, we present a new combination method based on Confusion Networks and evaluate its effectiveness on the transcription of a Spanish historical book. Results for unimodal combination, with different optical (for HTR) and acoustic (for ASR) models, and for multimodal combination show relative reductions in Word Error Rate and Character Error Rate of \(14.3\%\) and \(16.6\%\), respectively, over the HTR baseline.
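
The error metrics quoted in the abstract can be illustrated with a minimal sketch. It assumes the usual definitions: Word and Character Error Rate as the edit distance between reference and hypothesis divided by the reference length, and relative reduction as the baseline error minus the new error, divided by the baseline error. The function names and the toy strings are hypothetical and do not come from the paper; the authors' actual scoring tool is not stated here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def relative_reduction(baseline, improved):
    """Fraction of the baseline error removed by the improved system."""
    if baseline <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Toy strings for illustration only; not data from the paper.
    reference = "el libro antiguo"
    hypothesis = "el libro antigua"
    print(word_error_rate(reference, hypothesis))
    print(char_error_rate(reference, hypothesis))
```

Under these definitions, a relative reduction is computed from two absolute error rates, so the percentages in the abstract describe how much of the HTR baseline error is removed, not the absolute error that remains.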


Document analysis and transcription; Handwritten text recognition; Automatic speech recognition; Confusion Networks combination; Recognition outputs combination





Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Pattern Recognition and Human Language Technology Research Center, Universitat Politècnica de València, Valencia, Spain
