Text Caption Generation Based on Lip Movement of Speaker in Video Using Neural Network

  • Dipti Pawade
  • Avani Sakhapara
  • Chaitya Shah (corresponding author)
  • Jigar Wala
  • Ankitmani Tripathi
  • Bhavikk Shah
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 1046)


In this era of e-learning, a system that generates text captions for videos would be a great help to deaf people. Most automatic caption generation systems are based on audio-to-text conversion, so their accuracy degrades as the noise in the video's audio increases. We therefore propose a system that generates captions from the lip movement of the person speaking in the video. Using a facial landmark detector, we extract the facial features of the lip region from the frames of the video. These features are fed to a three-dimensional convolutional neural network (3D CNN) to produce the text output for the corresponding frames. The system is trained and tested on the GRID dataset.
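The two stages named above, locating the lip region from facial landmarks and applying a convolution that spans time as well as space, can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: it assumes landmarks in the 68-point iBUG 300-W ordering (mouth points at indices 48 to 67), and the helper names `lip_box`, `crop`, and `conv3d_valid` are hypothetical.

```python
# Illustrative sketch, not the authors' implementation.
# Assumption: landmarks follow the 68-point iBUG 300-W ordering,
# in which indices 48..67 (0-based) outline the mouth.
MOUTH_IDX = range(48, 68)


def lip_box(landmarks, pad=2):
    """Return (x0, y0, x1, y1), an inclusive box around the mouth points."""
    xs = [landmarks[i][0] for i in MOUTH_IDX]
    ys = [landmarks[i][1] for i in MOUTH_IDX]
    return (min(xs) - pad, min(ys) - pad, max(xs) + pad, max(ys) + pad)


def crop(frame, box):
    """Crop a grayscale frame (list of pixel rows) to an inclusive box."""
    x0, y0, x1, y1 = box
    x0, y0 = max(x0, 0), max(y0, 0)  # clamp to the frame's top-left corner
    return [row[x0:x1 + 1] for row in frame[y0:y1 + 1]]


def conv3d_valid(clip, kernel):
    """Naive 3D cross-correlation with no padding and stride 1.

    clip[t][y][x] is a stack of lip crops; kernel[dt][dy][dx] spans
    time as well as space, so each output value mixes adjacent frames.
    """
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                row.append(sum(
                    clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                    for dt in range(kt)
                    for dy in range(kh)
                    for dx in range(kw)
                ))
            plane.append(row)
        out.append(plane)
    return out
```

In a real pipeline the landmarks would come from a trained detector and the convolution from a deep learning framework with learned kernels; the sketch only shows why a 3D kernel can capture lip motion across consecutive frames, which a per-frame 2D kernel cannot.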


Visual speech processing · Neural network · Facial landmark detector · Lip reading



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  • Dipti Pawade (1)
  • Avani Sakhapara (1)
  • Chaitya Shah (1), corresponding author
  • Jigar Wala (1)
  • Ankitmani Tripathi (1)
  • Bhavikk Shah (1)

  1. Department of IT, K.J. Somaiya College of Engineering, Mumbai, India
