Image-Sensitive Language Modeling for Automatic Speech Recognition

  • Kata Naszádi
  • Youssef Oualil
  • Dietrich Klakow
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11132)


Language models in a speech recognizer typically use only the previous words as context and are therefore insensitive to context from the real world. This paper explores the benefits of introducing the visual modality as context information for automatic speech recognition. We use neural multimodal language models to rescore the recognition results of utterances that describe visual scenes, and provide a comprehensive analysis of how much the language model improves when the image is added to the conditioning set. The image is introduced into a purely text-based RNN-LM using three different composition methods. Our experiments show that using the visual modality helps the recognition process, yielding a \(7.8\%\) relative improvement, but it can also hurt the results by overfitting to the visual input.


Keywords: Multimodal speech recognition · Multimodal language model
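The rescoring idea above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes one plausible composition method (early fusion, i.e. projecting a precomputed image feature vector and concatenating it with each word embedding before the recurrent step), whereas the paper compares three composition methods whose details are not given in this excerpt. All names, dimensions, and the random weights are illustrative; a real model would be trained on image–caption pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: vocabulary, word embedding, image feature, hidden state.
VOCAB, EMB, IMG, HID = 20, 8, 6, 10

# Randomly initialised parameters (a trained model would learn these).
E = rng.normal(scale=0.1, size=(VOCAB, EMB))     # word embedding table
P = rng.normal(scale=0.1, size=(IMG, EMB))       # image-to-embedding projection
Wx = rng.normal(scale=0.1, size=(2 * EMB, HID))  # input-to-hidden weights
Wh = rng.normal(scale=0.1, size=(HID, HID))      # hidden-to-hidden weights
Wo = rng.normal(scale=0.1, size=(HID, VOCAB))    # hidden-to-output weights

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def score_sentence(word_ids, img_feat):
    """Log-probability of a word sequence under the image-conditioned RNN-LM.

    Early fusion: the projected image vector is concatenated with every
    word embedding, so each recurrent step sees both modalities.
    """
    v = img_feat @ P                       # project image into embedding space
    h = np.zeros(HID)
    logp = 0.0
    for prev, nxt in zip(word_ids[:-1], word_ids[1:]):
        x = np.concatenate([E[prev], v])   # fuse text and image input
        h = np.tanh(x @ Wx + h @ Wh)       # simple (Elman) recurrent update
        probs = softmax(h @ Wo)            # distribution over the next word
        logp += np.log(probs[nxt])
    return logp

# Rescoring: pick the recognizer hypothesis the multimodal LM prefers.
img = rng.normal(size=IMG)
hypotheses = [[1, 4, 2, 7, 3], [1, 4, 5, 7, 3]]
best = max(hypotheses, key=lambda h: score_sentence(h, img))
print(best)
```

In practice the LM score would be interpolated with the recognizer's acoustic score before picking the best entry of the n-best list, and the image features would come from a pretrained CNN rather than random noise.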



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Kata Naszádi (1)
  • Youssef Oualil (2)
  • Dietrich Klakow (2)
  1. Amazon, Aachen, Germany
  2. Spoken Language Systems (LSV), Saarland Informatics Campus, Saarland University, Saarbrücken, Germany
