Appearance and shape-based hybrid visual feature extraction: toward audio–visual automatic speech recognition


Audio–visual automatic speech recognition (AV-ASR) is an emerging field of research, but it still lacks effective visual features for visual speech recognition. Visual features are mainly categorized as shape-based or appearance-based. Exploiting the complementary information embedded in shape and appearance features, this paper proposes a new set of hybrid visual features that leads to a better visual speech recognition system. Pseudo-Zernike moments (PZM) are computed as the shape-based visual features, while Local Binary Patterns on Three Orthogonal Planes (LBP-TOP) and the Discrete Cosine Transform (DCT) provide the appearance-based features. The proposed method thus captures both global and local visual information, and the objective of the proposed system is to embed all of this visual information into a compact feature set. For audio speech recognition, the proposed system uses Mel-frequency cepstral coefficients (MFCC). We also propose a hybrid classification method to carry out all the AV-ASR experiments: an artificial neural network (ANN), a multiclass support vector machine (SVM), and a naive Bayes (NB) classifier are combined. It is shown that the proposed AV-ASR system with the hybrid classifier significantly improves the recognition rate.
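The appearance-feature and classifier-combination ideas above can be illustrated with a minimal sketch. This is not the authors' implementation: the ROI size, the number of retained DCT coefficients (`k`), and the classifier hyperparameters are all assumptions, and the hybrid classifier is sketched here as a simple majority vote using scikit-learn's `VotingClassifier`.

```python
import numpy as np
from scipy.fftpack import dct
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import VotingClassifier

def dct_features(roi, k=8):
    """Appearance features: 2-D DCT of a mouth ROI, keeping the
    low-frequency top-left k x k block of coefficients (k is an
    illustrative choice, not the paper's setting)."""
    c = dct(dct(roi, axis=0, norm="ortho"), axis=1, norm="ortho")
    return c[:k, :k].ravel()

# Hybrid classification by majority vote over ANN, SVM and naive Bayes
hybrid = VotingClassifier(
    estimators=[
        ("ann", MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)),
        ("svm", SVC(kernel="rbf", random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="hard",  # each classifier casts one vote per sample
)

# Toy usage: random 32x32 "mouth ROIs" standing in for two word classes
rng = np.random.default_rng(0)
X = np.array([dct_features(rng.random((32, 32))) for _ in range(40)])
y = np.array([0, 1] * 20)
hybrid.fit(X, y)
print(hybrid.predict(X[:2]).shape)  # (2,)
```

In practice the PZM shape features and LBP-TOP features would be concatenated with the DCT block before classification; the voting scheme is one plausible reading of "classifier hybridization", not a claim about the exact combination rule used in the paper.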





Author information



Corresponding author

Correspondence to Saswati Debnath.



Cite this article

Debnath, S., Roy, P. Appearance and shape-based hybrid visual feature extraction: toward audio–visual automatic speech recognition. SIViP 15, 25–32 (2021).


Keywords

  • AV-ASR
  • Appearance and shape-based hybrid visual speech features
  • DCT
  • PZM
  • Hybrid classifier (classifier combination)