Unified System for Visual Speech Recognition and Speaker Identification

  • Conference paper
Advanced Concepts for Intelligent Vision Systems (ACIVS 2015)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 9386)

Abstract

This paper proposes a unified system for both visual speech recognition and speaker identification. The system can handle both image and depth data when they are available. It consists of four consecutive steps: 3D face pose tracking, mouth region extraction, feature computation, and classification with a Support Vector Machine. The system is evaluated experimentally on three public datasets: MIRACL-VC1, OuluVS, and CUAVE. On the one hand, the visual speech recognition module achieves up to 96 % and 79.2 % in the speaker-dependent and speaker-independent settings, respectively. On the other hand, speaker identification achieves a recognition rate of up to 98.9 %. Additionally, the results demonstrate the importance of depth data in resolving the subject dependency issue.
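
The distinction between the speaker-dependent and speaker-independent settings can be illustrated with a minimal sketch of how evaluation folds might be built. This is not the protocol used in the paper, which the abstract does not describe. The function names, the `speaker`/`word`/`take` fields, and the toy metadata are all hypothetical, and the sketch assumes a leave-one-speaker-out scheme for the speaker-independent case.

```python
from collections import defaultdict


def speaker_independent_folds(samples):
    """Leave-one-speaker-out: the test speaker never appears in training."""
    speakers = sorted({s["speaker"] for s in samples})
    for held_out in speakers:
        train = [s for s in samples if s["speaker"] != held_out]
        test = [s for s in samples if s["speaker"] == held_out]
        yield held_out, train, test


def speaker_dependent_folds(samples, n_train_takes=2):
    """Per speaker: train on that speaker's first takes, test on the rest."""
    by_speaker = defaultdict(list)
    for s in samples:
        by_speaker[s["speaker"]].append(s)
    for speaker in sorted(by_speaker):
        own = by_speaker[speaker]
        train = [s for s in own if s["take"] < n_train_takes]
        test = [s for s in own if s["take"] >= n_train_takes]
        yield speaker, train, test


# Toy metadata only (hypothetical): 3 speakers x 2 words x 3 takes.
samples = [
    {"speaker": spk, "word": word, "take": take}
    for spk in ("s1", "s2", "s3")
    for word in ("hello", "stop")
    for take in range(3)
]

si_folds = list(speaker_independent_folds(samples))
sd_folds = list(speaker_dependent_folds(samples))
```

In the speaker-independent folds the classifier is tested on a speaker it has never seen, which is generally the harder setting and is consistent with the lower figure reported above. Libraries such as scikit-learn offer an equivalent grouping splitter (`LeaveOneGroupOut`), which could replace the hand-written generator.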

Corresponding author

Correspondence to Ahmed Rekik.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Rekik, A., Ben-Hamadou, A., Mahdi, W. (2015). Unified System for Visual Speech Recognition and Speaker Identification. In: Battiato, S., Blanc-Talon, J., Gallo, G., Philips, W., Popescu, D., Scheunders, P. (eds) Advanced Concepts for Intelligent Vision Systems. ACIVS 2015. Lecture Notes in Computer Science, vol 9386. Springer, Cham. https://doi.org/10.1007/978-3-319-25903-1_33

  • DOI: https://doi.org/10.1007/978-3-319-25903-1_33

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-25902-4

  • Online ISBN: 978-3-319-25903-1

  • eBook Packages: Computer Science, Computer Science (R0)
