Abstract
We present a new approach to modeling and processing multimedia data. This approach is based on graphical models that combine audio and video variables. We demonstrate it by developing a new algorithm for tracking a moving object in a cluttered, noisy scene using two microphones and a camera. Our model uses unobserved variables to describe the data in terms of the process that generates them. It is therefore able to capture and exploit the statistical structure of the audio and video data separately, as well as their mutual dependencies. Model parameters are learned from data via an EM algorithm, and automatic calibration is performed as part of this procedure. Tracking is done by Bayesian inference of the object location from data. We demonstrate successful performance on multimedia clips captured in real world scenarios using off-the-shelf equipment.
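Below is a minimal, runnable sketch of the idea the abstract describes, under strong simplifying assumptions that are not from the paper: a 1-D discretised position grid, a linear map from object position to inter-microphone time delay, Gaussian noise, and synthetic observations standing in for real audio-video frames. It illustrates the calibrate-by-EM, track-by-Bayesian-inference loop; it is not the authors' model or implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
W = 120                                   # discretised horizontal positions (assumed grid)
grid = np.arange(W, dtype=float)

# Synthetic data standing in for real frames: a latent position x_t generates an
# audio cue (inter-microphone time delay) and a video cue (noisy position estimate).
true_a, true_b = 0.05, -3.0               # unknown calibration: delay = a*x + b
T = 400
x_true = rng.integers(0, W, size=T).astype(float)
delay = true_a * x_true + true_b + rng.normal(0.0, 0.2, size=T)
video = x_true + rng.normal(0.0, 4.0, size=T)

# EM: learn the audio-video calibration (a, b) and the noise variances from data.
a, b, var_a, var_v = 0.0, 0.0, 1.0, 100.0
for _ in range(50):
    # E-step: posterior over the latent position for every frame (shape T x W).
    log_p = (-(delay[:, None] - (a * grid + b)) ** 2 / (2 * var_a)
             - (video[:, None] - grid) ** 2 / (2 * var_v))
    log_p -= log_p.max(axis=1, keepdims=True)
    post = np.exp(log_p)
    post /= post.sum(axis=1, keepdims=True)

    # M-step: re-estimate parameters from expected sufficient statistics.
    Ex, Ex2 = post @ grid, post @ grid ** 2          # E[x_t], E[x_t^2]
    a = (np.mean(delay * Ex) - delay.mean() * Ex.mean()) / (Ex2.mean() - Ex.mean() ** 2)
    b = delay.mean() - a * Ex.mean()
    var_a = np.sum((delay[:, None] - (a * grid + b)) ** 2 * post) / T
    var_v = np.sum((video[:, None] - grid) ** 2 * post) / T

# Tracking: the posterior mean position from the final E-step.
x_hat = post @ grid
print(f"learned calibration a={a:.3f}, b={b:.2f} (true {true_a}, {true_b})")
print(f"mean tracking error: {np.abs(x_hat - x_true).mean():.2f} grid units")
```

The full model in the paper captures richer statistical structure in the audio and video signals; this toy version only shows how automatic calibration can fall out of the EM updates while tracking reduces to posterior inference over the object location.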
© 2002 Springer-Verlag Berlin Heidelberg
Beal, M.J., Attias, H., Jojic, N. (2002). Audio-Video Sensor Fusion with Probabilistic Graphical Models. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds) Computer Vision — ECCV 2002. ECCV 2002. Lecture Notes in Computer Science, vol 2350. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-47969-4_49