Abstract
We present a multimodal method for the automatic synchronization of audio-visual recordings captured with a set of independent cameras. The proposed method jointly processes data from the audio and video channels to estimate inter-camera delays, which are then used to temporally align the recordings. Our approach consists of three main steps. First, we extract from each recording temporally sharp audio-visual events. These events are short and characterized by an audio onset occurring jointly with a well-localized spatio-temporal change in the video data. Next, we estimate the inter-camera delays by assessing the co-occurrence of the events across the recordings. Finally, we use a cross-validation procedure that combines the results for all camera pairs and aligns the recordings on a global timeline. An important feature of the proposed method is the estimation of a confidence level on the results, which allows us to automatically reject recordings that are not reliable enough for alignment. Results show that our method outperforms state-of-the-art approaches based on audio-only or video-only analysis, with both fixed and hand-held moving cameras.
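The delay-estimation step can be illustrated with a simplified sketch: represent each recording's detected events as a binary onset train and pick the lag that maximizes event co-occurrence between a camera pair. This is not the paper's exact procedure (the event detector, co-occurrence measure, and cross-validation are omitted); the function name and inputs below are hypothetical.

```python
import numpy as np

def estimate_delay(events_a, events_b, timeline_len, max_lag):
    """Estimate the delay (in frames) between two recordings by
    scoring the co-occurrence of their binary event-onset trains.

    events_a, events_b: frame indices of detected audio-visual events
    (hypothetical inputs; the paper's event detector is not reproduced).
    Returns (best_lag, score): shifting recording B by best_lag frames
    maximizes the number of co-occurring events with recording A.
    """
    sig_a = np.zeros(timeline_len)
    sig_b = np.zeros(timeline_len)
    sig_a[np.asarray(events_a)] = 1.0
    sig_b[np.asarray(events_b)] = 1.0

    best_lag, best_score = 0, -1.0
    for lag in range(-max_lag, max_lag + 1):
        # Count events that coincide when B is shifted by `lag` frames.
        # np.roll wraps around; acceptable here since max_lag is small
        # relative to the timeline and events lie away from the borders.
        score = float(np.dot(sig_a, np.roll(sig_b, lag)))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score

# Events in camera B occur 5 frames later than in camera A,
# so B must be shifted back by 5 frames to align with A.
lag, score = estimate_delay([10, 40, 80], [15, 45, 85], 200, 20)
```

In the actual method, such pairwise estimates are combined across all camera pairs, and the consistency of the pairwise delays yields the confidence level used to reject unreliable recordings.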
Notes
t_V denotes the discrete temporal coordinate of the video signal (in frames) and t_A corresponds to the discrete temporal coordinate of the audio signal (in samples).
Note that with M = 3 cameras the proposed method can detect that there is an unrelated recording, but it cannot identify which recording is actually unrelated.
The constraints of the method in [13] make it inapplicable to our dataset, since it requires a minimum of 3 cameras, of which two must be static.
References
Adobe Premiere Pro http://www.adobe.com/products/premiere.html. Accessed 26 Aug 2013
Caspi Y, Irani M (2002) Spatio-temporal alignment of sequences. IEEE Trans Pattern Anal Mach Intell 24:1409–1424
Cremer M, Cook R (2009) Machine-assisted editing of user-generated content. In: Proceedings of the SPIE-IS&T electronic imaging, vol 7254
Daniyal F, Taj M, Cavallaro A (2010) Content and task-based view selection from multiple video streams. Multimed Tools Appl 46:235–258
EU, FP7 project APIDIS (ICT-216023) http://www.apidis.org/Dataset/. Accessed 26 Aug 2013
Final Cut Pro http://www.apple.com/finalcutpro/. Accessed 26 Aug 2013
Fritsch J, Kleinehagenbrock M, Lang S, Fink GA, Sagerer G (2004) Audiovisual person tracking with a mobile robot. In: International conference intelligent autonomous systems
Guggenberger M, Lux M, Böszörményi L (2012) AudioAlign - synchronization of A/V-streams based on audio data. In: Proceedings of the IEEE international symposium on multimedia, pp 382–383
Jiang W, Cotton C, Chang SF, Ellis D, Loui AC (2010) Audio-visual atoms for generic video concept classification. ACM Trans Multimed Comput Commun Appl 6(3):1–19
Kennedy LS, Naaman M (2009) Less talk, more rock: automated organization of community-contributed collections of concert videos. In: Proceedings of the ACM WWW
Kidron E, Schechner YY, Elad M (2007) Cross-modal localization via sparsity. IEEE Trans Sig Process 55(4):1390–1404
Laptev I, Belongie SJ, Pérez P, Wills J (2005) Periodic motion detection and segmentation via approximate sequence alignment. In: ICCV
Lei C, Yang YH (2006) Tri-focal tensor-based multiple video synchronization with subframe optimization. IEEE Trans Image Process 15(9):2473–2480
Llagostera Casanovas A, Monaci G, Vandergheynst P, Gribonval R (2010) Blind audio-visual source separation based on sparse redundant representations. IEEE Trans Multimed 12(5):358–371
Padua FL, Carceroni RL, Santos GA, Kutulakos KN (2010) Linear sequence-to-sequence alignment. IEEE Trans Pattern Anal Mach Intell 32:304–320
Potamianos G, Neti C, Gravier G, Garg A, Senior AW (2003) Recent advances in the automatic recognition of audiovisual speech. Proc IEEE 91:1306–1326
Recommendation ITU-R BT.1359-1 (1998) Relative timing of sound and vision for broadcasting
Shrestha P, Barbieri M, Weda H, Sekulovski D (2010) Synchronization of multiple camera videos using audio-visual features. IEEE Trans Multimed 12:79–92
Sodoyer D, Girin L, Jutten C, Schwartz JL (2004) Developing an audio-visual speech source separation algorithm. Speech Comm 44(1–4):113–125
Stein G (1999) Tracking from multiple view points: self-calibration of space and time. In: CVPR
Sumby WH, Pollack I (1954) Visual contribution to speech intelligibility in noise. J Acoust Soc Am 26(2):212–215
Summerfield Q (1987) Some preliminaries to a comprehensive account of audio-visual speech perception. In: Hearing by eye: the psychology of lipreading. Lawrence Erlbaum Associates, pp 3–51
Ukrainitz Y, Irani M (2006) Aligning sequences and actions by maximizing space-time correlations. In: ECCV
Vroomen J, Keetels M (2010) Perception of intersensory synchrony: a tutorial review. Atten Percept Psychophys 72(4):871–884
Wedge D, Huynh D, Kovesi P (2007) Using space-time interest points for video sequence synchronization. In: Proceedings of the IAPR conference machine vision applications
Whitehead A, Laganiere R, Bose P (2005) Temporal synchronization of video sequences in theory and in practice. In: Proceedings of the IEEE workshop motion and video computing
Yan J, Pollefeys M (2004) Video synchronization via space-time interest point distribution. In: Proceedings of the advanced concepts for intelligent vision systems
Additional information
A. Llagostera Casanovas contributed to this work while at Queen Mary University of London, UK. She was supported by the Swiss National Science Foundation under the prospective researcher fellowship PBELP2-137724. A. Cavallaro acknowledges the support of the UK Engineering and Physical Sciences Research Council (EPSRC), under grant EP/K007491/1.
Cite this article
Llagostera Casanovas, A., Cavallaro, A. Audio-visual events for multi-camera synchronization. Multimed Tools Appl 74, 1317–1340 (2015). https://doi.org/10.1007/s11042-014-1872-y