Audio-Visual Clustering for 3D Speaker Localization

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 5237)

Abstract

We address the problem of localizing individuals in a scene containing several people engaged in a multi-speaker conversation. We use a human-like configuration of sensors (binaural and binocular) to gather both auditory and visual observations. We show that the localization problem can be recast as clustering the audio-visual observations into coherent groups. We propose a probabilistic generative model that captures the relations between audio and visual observations, mapping the data to a representation of the common 3D scene space via a pair of Gaussian mixture models. Inference is performed by a variant of the Expectation-Maximization (EM) algorithm, which provides cooperative estimates of both the activity (speaking or not) and the 3D position of each speaker.
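The paper's full model couples audio and visual observations through a pair of Gaussian mixture models in a common 3D scene space. As a rough illustration of the clustering machinery only, the sketch below runs plain EM on an isotropic Gaussian mixture over points assumed to be already mapped into that common space; the `em_gmm` function and its deterministic initialization are assumptions for this example, not the authors' implementation.

```python
import numpy as np

def em_gmm(X, K, n_iter=50):
    """Illustrative EM for an isotropic Gaussian mixture: cluster the N
    observations (rows of X) into K components. Returns the (N, K)
    responsibility matrix and the (K, D) component means."""
    N, D = X.shape
    # Deterministic initialisation: means spread evenly over the data rows
    means = X[np.linspace(0, N - 1, K).astype(int)].copy()
    var = np.full(K, X.var())        # one isotropic variance per component
    pi = np.full(K, 1.0 / K)         # mixing proportions
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point
        d2 = ((X[:, None, :] - means[None]) ** 2).sum(-1)            # (N, K)
        logp = np.log(pi) - 0.5 * D * np.log(2 * np.pi * var) - d2 / (2 * var)
        logp -= logp.max(axis=1, keepdims=True)                      # stabilise
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate proportions, means, and variances
        Nk = r.sum(axis=0)
        pi = Nk / N
        means = (r.T @ X) / Nk[:, None]
        d2 = ((X[:, None, :] - means[None]) ** 2).sum(-1)
        var = (r * d2).sum(axis=0) / (D * Nk)
    return r, means
```

With, say, two simulated speakers emitting observations around distinct 3D positions, the argmax of each row of the responsibility matrix recovers which speaker generated each observation; the paper's actual model additionally ties the audio and visual mixtures together and infers speaking activity, which this sketch omits.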




Editor information

Andrei Popescu-Belis, Rainer Stiefelhagen


Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Khalidov, V., Forbes, F., Hansard, M., Arnaud, E., Horaud, R. (2008). Audio-Visual Clustering for 3D Speaker Localization. In: Popescu-Belis, A., Stiefelhagen, R. (eds) Machine Learning for Multimodal Interaction. MLMI 2008. Lecture Notes in Computer Science, vol 5237. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85853-9_8


  • DOI: https://doi.org/10.1007/978-3-540-85853-9_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-85852-2

  • Online ISBN: 978-3-540-85853-9

  • eBook Packages: Computer Science (R0)
