An Audio-Visual Particle Filter for Speaker Tracking on the CLEAR’06 Evaluation Dataset

  • Kai Nickel
  • Tobias Gehrig
  • Hazim K. Ekenel
  • John McDonough
  • Rainer Stiefelhagen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4122)


We present an approach for tracking a lecturer during the course of his speech. We use features from multiple cameras and microphones, and process them in a joint particle filter framework. The filter performs sampled projections of 3D location hypotheses and scores them using features from both audio and video. On the video side, the features are based on foreground segmentation, multi-view face detection and upper body detection. On the audio side, the time delays of arrival between pairs of microphones are estimated with a generalized cross correlation function. In the CLEAR’06 evaluation, the system yielded a tracking accuracy (MOTA) of 71% for video-only, 55% for audio-only and 90% for combined audio-visual tracking.
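The audio feature described above can be illustrated with a minimal sketch. It assumes the PHAT weighting as one common choice of generalized cross correlation (the abstract does not say which weighting is used), a speed of sound of 343 m/s, and hypothetical function names `gcc_phat` and `expected_tdoa`. It is not the authors' implementation. The second function shows how a 3D location hypothesis could be mapped to the time delay it implies for one microphone pair, which is the quantity a particle would be scored against.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value for room temperature


def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay (in seconds) of `sig` relative to `ref`.

    Uses generalized cross correlation with PHAT weighting (an assumption).
    A positive result means `sig` arrives later than `ref`.
    """
    n = len(sig) + len(ref)  # zero-pad so the correlation does not wrap around
    spec_sig = np.fft.rfft(sig, n=n)
    spec_ref = np.fft.rfft(ref, n=n)
    cross = spec_sig * np.conj(spec_ref)
    cross /= np.abs(cross) + 1e-12  # PHAT: keep phase, discard magnitude
    cc = np.fft.irfft(cross, n=n)

    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))
    # Reorder so that index i corresponds to lag (i - max_shift).
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = int(np.argmax(cc)) - max_shift
    return lag / fs


def expected_tdoa(position, mic_a, mic_b):
    """Time delay implied by a 3D hypothesis for the microphone pair (a, b)."""
    position, mic_a, mic_b = (np.asarray(v, dtype=float)
                              for v in (position, mic_a, mic_b))
    dist_a = np.linalg.norm(position - mic_a)
    dist_b = np.linalg.norm(position - mic_b)
    return (dist_a - dist_b) / SPEED_OF_SOUND


# Synthetic check: white noise delayed by 12 samples at 16 kHz.
fs = 16000
rng = np.random.default_rng(0)
ref = rng.standard_normal(4096)
sig = np.concatenate((np.zeros(12), ref))[:4096]
tau = gcc_phat(sig, ref, fs, max_tau=0.005)
```

In a particle filter of the kind the abstract outlines, each hypothesis would be scored by how well `expected_tdoa` for that position agrees with the correlation peak for each microphone pair. The exact scoring function is not given here and is left open.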


Keywords: Face Detection, Integral Image, Microphone Array, Audio Feature, Video Feature





Copyright information

© Springer Berlin Heidelberg 2007

Authors and Affiliations

  • Kai Nickel (1)
  • Tobias Gehrig (1)
  • Hazim K. Ekenel (1)
  • John McDonough (1)
  • Rainer Stiefelhagen (1)
  1. Interactive Systems Labs - University of Karlsruhe, Am Fasanengarten 5, 76131 Karlsruhe, Germany
