Routine Statistical Framework to Speculate Kannada Lip Reading

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 1192)


This paper presents a statistics-based approach to predicting the lip movements of a speaker. A spoken Kannada word is identified by analysing the shape of the lips at every instant of time. The system learns to predict lip shapes from observed movement: the lips are annotated and tracked across frames, and shape recognition is synchronised with time by extracting statistical descriptors of the lip shape from every frame of the video. These descriptors, such as the mean, variance, and standard deviation of the tracked shape, give the system a compact representation of each lip configuration. The recognised shapes are then mapped to classes of Kannada words, providing a first step towards Kannada lip reading. With this statistical feature extraction and classification pipeline, the system achieves an overall accuracy of 40.21%.
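The per-frame statistical description outlined in the abstract can be sketched as follows. This is an illustrative example only, not the authors' implementation: the landmark layout, the choice of mean/variance/standard deviation over x and y coordinates, and the averaging over frames are assumptions made for the sketch.

```python
import numpy as np

def frame_statistics(lip_points):
    """Summarise one frame's lip contour (an N x 2 array of landmark
    coordinates) with simple statistics: the mean, variance, and
    standard deviation of the x and y coordinates."""
    pts = np.asarray(lip_points, dtype=float)
    return np.concatenate([
        pts.mean(axis=0),  # mean x, mean y
        pts.var(axis=0),   # variance of x, variance of y
        pts.std(axis=0),   # std of x, std of y
    ])

def video_feature_vector(frames):
    """Stack the per-frame statistics for a clip and average them over
    time, yielding one fixed-length descriptor per video clip that a
    standard classifier can map to a Kannada word class."""
    per_frame = np.stack([frame_statistics(f) for f in frames])
    return per_frame.mean(axis=0)
```

A fixed-length descriptor of this kind can then be fed to any off-the-shelf classifier; the paper's reported 40.21% accuracy refers to its own feature set and classifier, not to this sketch.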


Statistical feature extraction · Shape classification · Lip tracking · Lip shape recognition · Mapping · Kannada lip reading



Copyright information

© Springer Nature Singapore Pte Ltd. 2020

Authors and Affiliations

  1. Department of IS and Engineering, NIE Institute of Technology, Mysore, India
  2. Department of CS and Engineering, Government Engineering College, Chamarajanagara, India
  3. Department of Engineering, Jayachamaraja College of Engineering, JSS Science and Technology University, Mysore, India
