Abstract
In noisy or otherwise adverse conditions, consistently high speaker identification accuracy is difficult to attain from the speech signal alone; the visual component, which can complement the audio information, is therefore of particular interest. In this paper, we model the asynchronous correlation between the audio and visual modalities rather than assuming tight synchrony. The apparent asynchrony between the two modalities is modeled with a Dynamic Bayesian Network (DBN) with asynchronous articulatory features, in three ways: (1) there are three hidden state variables, each representing one articulatory feature; (2) the degree of asynchrony among the articulatory features is controlled by a probability distribution; (3) the audio and video observations depend on all three hidden state variables. A multi-level hybrid fusion scheme is then explored to combine model-level and decision-level fusion. Experimental results on an audio-visual bimodal corpus demonstrate the effectiveness of the method.
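As a rough illustration of point (2), the sketch below shows one way a probability distribution might control the degree of asynchrony among three hidden articulatory feature variables. It is not taken from the chapter: the table `ASYNC_PROB`, its values, the pairwise-product weighting, and the function names `async_weight` and `joint_state_prior` are all assumptions made for illustration only.

```python
from itertools import product

# Hypothetical asynchrony distribution: probability that two articulatory
# feature streams are d sub-state positions apart. Values are illustrative
# assumptions, not figures from the chapter.
ASYNC_PROB = {0: 0.7, 1: 0.25, 2: 0.05}


def async_weight(indices, async_prob=ASYNC_PROB):
    """Multiply the asynchrony probabilities of every pair of streams."""
    weight = 1.0
    n = len(indices)
    for a in range(n):
        for b in range(a + 1, n):
            # Offsets not listed in the table are treated as impossible.
            weight *= async_prob.get(abs(indices[a] - indices[b]), 0.0)
    return weight


def joint_state_prior(num_positions, num_features=3):
    """Normalised prior over joint positions of the hidden feature variables."""
    weights = {
        state: async_weight(state)
        for state in product(range(num_positions), repeat=num_features)
    }
    total = sum(weights.values())
    # Keep only the joint states that the asynchrony table allows.
    return {s: w / total for s, w in weights.items() if w > 0.0}


prior = joint_state_prior(num_positions=4)
```

Under these assumed values, fully synchronous joint states such as `(1, 1, 1)` receive the largest prior mass, while states whose streams drift more than two positions apart are excluded. Changing `ASYNC_PROB` loosens or tightens the permitted asynchrony.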
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chen, Y. (2011). Research on Audio-Visual Asynchronous Correlation for Speaker Identification Based on DBN. In: Zeng, D. (eds) Future Intelligent Information Systems. Lecture Notes in Electrical Engineering, vol 86. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19706-2_3
DOI: https://doi.org/10.1007/978-3-642-19706-2_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19705-5
Online ISBN: 978-3-642-19706-2
eBook Packages: Engineering (R0)