Audio-Video Sensor Fusion with Probabilistic Graphical Models

  • Matthew J. Beal
  • Hagai Attias
  • Nebojsa Jojic
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2350)


We present a new approach to modeling and processing multimedia data. This approach is based on graphical models that combine audio and video variables. We demonstrate it by developing a new algorithm for tracking a moving object in a cluttered, noisy scene using two microphones and a camera. Our model uses unobserved variables to describe the data in terms of the process that generates them. It is therefore able to capture and exploit the statistical structure of the audio and video data separately, as well as their mutual dependencies. Model parameters are learned from data via an EM algorithm, and automatic calibration is performed as part of this procedure. Tracking is done by Bayesian inference of the object location from data. We demonstrate successful performance on multimedia clips captured in real world scenarios using off-the-shelf equipment.
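The tracking approach couples an audio cue (the time delay between the two microphone signals) with video evidence through a shared object-location variable, with the audio-to-position calibration learned during EM. The sketch below is a rough illustration of that idea, not the paper's actual generative model or learning procedure: it estimates the inter-microphone delay by cross-correlation and fuses it with a video log-likelihood over candidate horizontal positions through an assumed linear calibration x = a·τ + b. All function names, parameters, and values here are illustrative assumptions.

```python
import numpy as np

def estimate_delay(left, right, max_lag):
    """Estimate the inter-microphone time delay in samples by picking the
    lag (within +/- max_lag) that maximizes the cross-correlation.
    Stands in for probabilistic inference over the delay variable."""
    corr = np.correlate(left, right, mode="full")
    mid = len(right) - 1                      # index of zero lag in 'full' output
    window = corr[mid - max_lag : mid + max_lag + 1]
    return int(np.argmax(window)) - max_lag

def fuse(positions, delay, video_loglik, a, b, sigma):
    """Combine audio and video evidence into a posterior over candidate
    horizontal positions. Audio evidence is a Gaussian log-likelihood
    centered on the position implied by the delay through a hypothetical
    linear calibration x = a*delay + b (in the paper, calibration
    parameters are learned from data by EM)."""
    audio_loglik = -0.5 * ((positions - (a * delay + b)) / sigma) ** 2
    post = np.exp(audio_loglik + video_loglik)
    return post / post.sum()

# Example: white noise delayed by 4 samples between the two microphones.
rng = np.random.default_rng(0)
sig = rng.standard_normal(2000)
print(estimate_delay(np.roll(sig, 4), sig, max_lag=16))   # -> 4
```

With an uninformative (flat) video term, the fused posterior peaks at the position the audio calibration implies; a peaked video log-likelihood would pull the estimate toward the visual evidence, which is the point of combining the two modalities.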





Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Matthew J. Beal (1, 2)
  • Hagai Attias (1)
  • Nebojsa Jojic (1)
  1. Microsoft Research, Redmond, USA
  2. Gatsby Computational Neuroscience Unit, University College London, London, UK
