Abstract
We present a multimodal method for the automatic synchronization of audio-visual recordings captured with a set of independent cameras. The proposed method jointly processes data from the audio and video channels to estimate inter-camera delays, which are then used to temporally align the recordings. Our approach consists of three main steps. First, we extract from each recording temporally sharp audio-visual events. These events are short and characterized by an audio onset occurring jointly with a well-localized spatio-temporal change in the video data. Next, we estimate the inter-camera delays by assessing the co-occurrence of the events across the recordings. Finally, we use a cross-validation procedure that combines the results for all camera pairs and aligns the recordings on a global timeline. An important feature of the proposed method is the estimation of a confidence level on the results, which allows us to automatically reject recordings that are not reliable enough for alignment. Results show that our method outperforms state-of-the-art approaches based on audio-only or video-only analysis, with both fixed and hand-held moving cameras.
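The delay-estimation step can be illustrated with a simplified sketch: represent each recording's detected events as a binary onset train and pick the lag that maximizes event co-occurrence between a camera pair. This is not the paper's exact procedure (the event detector, co-occurrence measure, and cross-validation are omitted); the function name and inputs below are hypothetical.

```python
import numpy as np

def estimate_delay(events_a, events_b, timeline_len, max_lag):
    """Estimate the delay (in frames) between two recordings by
    scoring the co-occurrence of their binary event-onset trains.

    events_a, events_b: frame indices of detected audio-visual events
    (hypothetical inputs; the paper's event detector is not reproduced).
    Returns (best_lag, score): shifting recording B by best_lag frames
    maximizes the number of co-occurring events with recording A.
    """
    sig_a = np.zeros(timeline_len)
    sig_b = np.zeros(timeline_len)
    sig_a[np.asarray(events_a)] = 1.0
    sig_b[np.asarray(events_b)] = 1.0

    best_lag, best_score = 0, -1.0
    for lag in range(-max_lag, max_lag + 1):
        # Count events that coincide when B is shifted by `lag` frames.
        # np.roll wraps around; acceptable here since max_lag is small
        # relative to the timeline and events lie away from the borders.
        score = float(np.dot(sig_a, np.roll(sig_b, lag)))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score

# Events in camera B occur 5 frames later than in camera A,
# so B must be shifted back by 5 frames to align with A.
lag, score = estimate_delay([10, 40, 80], [15, 45, 85], 200, 20)
```

In the actual method, such pairwise estimates are combined across all camera pairs, and the consistency of the pairwise delays yields the confidence level used to reject unreliable recordings.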
Notes
t_V denotes the discrete temporal coordinate of the video signal (in frames) and t_A corresponds to the discrete temporal coordinate of the audio signal (in samples).
Note that with M = 3 cameras the proposed method can detect that there is an unrelated recording, but it cannot identify which recording is actually unrelated.
The constraints of the method in [13] make it inapplicable to our dataset, since it requires a minimum of 3 cameras, of which two must be static.
References
Adobe Premiere Pro http://www.adobe.com/products/premiere.html. Accessed 26 Aug 2013
Caspi Y, Irani M (2002) Spatio-temporal alignment of sequences. IEEE Trans Pattern Anal Mach Intell 24:1409–1424
Cremer M, Cook R (2009) Machine-assisted editing of user-generated content. In: Proceedings of the SPIE-IS&T electronic imaging, vol 7254
Daniyal F, Taj M, Cavallaro A (2010) Content and task-based view selection from multiple video streams. Multimed Tools Appl 46:235–258
EU, FP7 project APIDIS (ICT-216023) http://www.apidis.org/Dataset/. Accessed 26 Aug 2013
Final Cut Pro http://www.apple.com/finalcutpro/. Accessed 26 Aug 2013
Fritsch J, Kleinehagenbrock M, Lang S, Fink GA, Sagerer G (2004) Audiovisual person tracking with a mobile robot. In: International conference intelligent autonomous systems
Guggenberger M, Lux M, Böszörményi L (2012) AudioAlign - synchronization of A/V-streams based on audio data. In: Proceedings of the IEEE international symposium on multimedia, pp 382–383
Jiang W, Cotton C, Chang SF, Ellis D, Loui AC (2010) Audio-visual atoms for generic video concept classification. ACM Trans Multimed Comput Commun Appl 6(3):1–19
Kennedy LS, Naaman M (2009) Less talk, more rock: automated organization of community-contributed collections of concert videos. In: Proceedings of the ACM WWW
Kidron E, Schechner YY, Elad M (2007) Cross-modal localization via sparsity. IEEE Trans Sig Process 55(4):1390–1404
Laptev I, Belongie SJ, Pérez P, Wills J (2005) Periodic motion detection and segmentation via approximate sequence alignment. In: ICCV
Lei C, Yang YH (2006) Tri-focal tensor-based multiple video synchronization with subframe optimization. IEEE Trans Image Process 15(9):2473–2480
Llagostera Casanovas A, Monaci G, Vandergheynst P, Gribonval R (2010) Blind audio-visual source separation based on sparse redundant representations. IEEE Trans Multimed 12(5):358–371
Padua FL, Carceroni RL, Santos GA, Kutulakos KN (2010) Linear sequence-to-sequence alignment. IEEE Trans Pattern Anal Mach Intell 32:304–320
Potamianos G, Neti C, Gravier G, Garg A, Senior AW (2003) Recent advances in the automatic recognition of audiovisual speech. Proc IEEE 91:1306–1326
Recommendation ITU-R BT.1359-1 (1998) Relative timing of sound and vision for broadcasting
Shrestha P, Barbieri M, Weda H, Sekulovski D (2010) Synchronization of multiple camera videos using audio-visual features. IEEE Trans Multimed 12:79–92
Sodoyer D, Girin L, Jutten C, Schwartz JL (2004) Developing an audio-visual speech source separation algorithm. Speech Comm 44(1–4):113–125
Stein G (1999) Tracking from multiple view points: self-calibration of space and time. In: CVPR
Sumby WH, Pollack I (1954) Visual contribution to speech intelligibility in noise. J Acoust Soc Am 26(2):212–215
Summerfield Q (1987) Some preliminaries to a comprehensive account of audio-visual speech perception. In: Hearing by eye: the psychology of lipreading. Lawrence Erlbaum Associates, pp 3–51
Ukrainitz Y, Irani M (2006) Aligning sequences and actions by maximizing space-time correlations. In: ECCV
Vroomen J, Keetels M (2010) Perception of intersensory synchrony: a tutorial review. Atten Percept Psychophys 72(4):871–884
Wedge D, Huynh D, Kovesi P (2007) Using space-time interest points for video sequence synchronization. In: Proceedings of the IAPR conference machine vision applications
Whitehead A, Laganiere R, Bose P (2005) Temporal synchronization of video sequences in theory and in practice. In: Proceedings of the IEEE workshop motion and video computing
Yan J, Pollefeys M (2004) Video synchronization via space-time interest point distribution. In: Proceedings of the advanced concepts for intelligent vision systems
Additional information
A. Llagostera Casanovas contributed to this work while at Queen Mary University of London, UK. She was supported by the Swiss National Science Foundation under the prospective researcher fellowship PBELP2-137724. A. Cavallaro acknowledges the support of the UK Engineering and Physical Sciences Research Council (EPSRC), under grant EP/K007491/1.
Cite this article
Llagostera Casanovas, A., Cavallaro, A. Audio-visual events for multi-camera synchronization. Multimed Tools Appl 74, 1317–1340 (2015). https://doi.org/10.1007/s11042-014-1872-y