Spotting words in silent speech videos: a retrieval-based approach

  • Abhishek Jha
  • Vinay P. Namboodiri
  • C. V. Jawahar
Special Issue Paper


Our goal is to spot words in silent speech videos, where the speaker's lip motion is clearly visible but the audio is absent, without explicitly recognizing the spoken words. Existing work in this domain has mainly focused on recognizing a fixed set of words in word-segmented lip videos, which limits the applicability of the learned model due to its restricted vocabulary and its strong dependence on the model's recognition performance. Our contribution is twofold: (1) we develop a pipeline for recognition-free retrieval and compare its performance against recognition-based retrieval on a large-scale dataset and on a separate set of out-of-vocabulary words; (2) we introduce a query expansion technique using pseudo-relevance feedback and propose a novel re-ranking method based on maximizing the correlation between spatiotemporal landmarks of the query and those of the top retrieval candidates. Our word spotting method achieves 35% higher mean average precision than the recognition-based method on the large-scale LRW dataset. We also demonstrate the utility of the method by spotting words in a popular speech video ("The Great Dictator" by Charlie Chaplin), showing that word retrieval can potentially be used to understand what was spoken in silent movies. Finally, we compare our model against ASR in a noisy environment and analyze how the performance of the underlying lip-reader and the input video quality affect the proposed word spotting pipeline.
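The query expansion step described above can be sketched as a standard pseudo-relevance-feedback loop: retrieve with the original query, treat the top-k results as pseudo-relevant, and blend their embeddings back into the query before re-ranking. This is a minimal illustration only; it assumes video clips are represented by fixed-length embeddings compared with cosine similarity, and the function name, the mixing weight `alpha`, and `k` are all illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def expand_query(query_emb, gallery_embs, k=5, alpha=0.5):
    """Pseudo-relevance feedback: blend the query embedding with the
    mean embedding of its top-k retrieved neighbours, then re-rank.
    `alpha` and `k` are illustrative hyperparameters."""
    # L2-normalize so dot products become cosine similarities
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    # first-pass retrieval: rank gallery clips by similarity to the query
    sims = g @ q
    topk = np.argsort(-sims)[:k]
    # expanded query: mix of the original query and the pseudo-relevant mean
    expanded = alpha * q + (1 - alpha) * g[topk].mean(axis=0)
    expanded /= np.linalg.norm(expanded)
    # second-pass retrieval with the expanded query
    return np.argsort(-(g @ expanded))
```

A retrieved clip whose embedding is close to several pseudo-relevant neighbours is pulled up the ranking, which is the intuition behind the reported precision gains.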


Keyword spotting · Lip-reading · Visual speech recognition · Recognition-free retrieval



This work is partly supported by Alexa Graduate Fellowship from Amazon.



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. CVIT, IIIT Hyderabad, Hyderabad, India
  2. Department of Computer Science, IIT Kanpur, Kanpur, India
