Abstract
Assessing the quality of a speaker localization or tracking algorithm on a few short examples is difficult, especially when ground truth is absent or not well defined. One step towards systematic performance evaluation of such algorithms is to provide time-continuous speaker location annotation over a series of real recordings, covering various test cases. Areas of interest include audio, video and audio-visual speaker localization and tracking. The desired location annotation can be either 2-dimensional (image plane) or 3-dimensional (physical space). This paper motivates and describes a corpus of audio-visual data called “AV16.3”, along with a method for 3-D location annotation based on calibrated cameras. “16.3” stands for 16 microphones and 3 cameras, recorded in a fully synchronized manner, in a meeting room. Part of this corpus has already been successfully used to report research results.
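The abstract mentions 3-D location annotation derived from calibrated cameras. The paper's exact annotation pipeline is not described here, but the core geometric step such a method relies on is triangulation: given each camera's 3×4 projection matrix and the speaker's 2-D image coordinates in two or more views, the 3-D position can be recovered. The sketch below is an illustrative linear (DLT) triangulation for two views, not the authors' implementation; the function name and the toy projection matrices are assumptions for the example.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3-D point from two calibrated views.

    P1, P2 : 3x4 camera projection matrices (from calibration).
    x1, x2 : (u, v) image coordinates of the same point in each view.
    Returns the 3-D point as a length-3 array.
    """
    # Each view contributes two rows of the homogeneous system A @ X = 0,
    # obtained by eliminating the unknown projective depth.
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector with the smallest
    # singular value, i.e. the (approximate) null space of A.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # de-homogenize

# Toy example: two ideal cameras separated by a baseline along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 2.0])           # ground-truth 3-D point
x1 = (0.25, 0.1)                              # its projection in camera 1
x2 = (-0.25, 0.1)                             # its projection in camera 2
X_est = triangulate(P1, P2, x1, x2)
```

With more than two cameras, as in this corpus, additional row pairs are stacked into A and the same least-squares solution applies; calibration toolboxes such as those cited by the paper supply the projection matrices.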
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
Cite this paper
Lathoud, G., Odobez, JM., Gatica-Perez, D. (2005). AV16.3: An Audio-Visual Corpus for Speaker Localization and Tracking. In: Bengio, S., Bourlard, H. (eds) Machine Learning for Multimodal Interaction. MLMI 2004. Lecture Notes in Computer Science, vol 3361. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30568-2_16
DOI: https://doi.org/10.1007/978-3-540-30568-2_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24509-4
Online ISBN: 978-3-540-30568-2