Audio-visual events for multi-camera synchronization

Multimedia Tools and Applications

Abstract

We present a multimodal method for the automatic synchronization of audio-visual recordings captured with a set of independent cameras. The proposed method jointly processes data from the audio and video channels to estimate the inter-camera delays that are used to temporally align the recordings. Our approach is composed of three main steps. First, we extract from each recording temporally sharp audio-visual events: short events characterized by an audio onset occurring jointly with a well-localized spatio-temporal change in the video data. Then, we estimate the inter-camera delays by assessing the co-occurrence of the events across the recordings. Finally, we use a cross-validation procedure that combines the results from all camera pairs and aligns the recordings on a global timeline. An important feature of the proposed method is the estimation of a confidence level for the results, which allows us to automatically reject recordings that are not reliable enough for alignment. Results show that our method outperforms state-of-the-art approaches based on audio-only or video-only analysis, with both fixed and hand-held moving cameras.
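As an illustration of the delay-estimation step, the sketch below scores candidate inter-camera lags by counting co-occurring audio-visual events between two recordings. It is a minimal, hypothetical rendering, not the implementation evaluated in the paper: the binary event-train representation, the function names, and the peak-ratio confidence are simplifying assumptions.

```python
import numpy as np

def estimate_delay(events_a, events_b, n_frames, max_lag):
    """Estimate the delay (in video frames) between two recordings by
    counting co-occurrences of their audio-visual events at each lag.

    events_a, events_b : detected event times (frame indices) per recording
    n_frames           : recording length, in frames
    max_lag            : largest |delay| considered, in frames
    """
    # Binary event trains: 1 at frames where an audio-visual event occurs.
    train_a = np.zeros(n_frames)
    train_b = np.zeros(n_frames)
    train_a[np.asarray(events_a, dtype=int)] = 1.0
    train_b[np.asarray(events_b, dtype=int)] = 1.0

    def overlap(lag):
        # Number of co-occurring events when B is shifted by `lag` frames.
        if lag >= 0:
            return float(np.dot(train_a[lag:], train_b[:n_frames - lag]))
        return float(np.dot(train_a[:n_frames + lag], train_b[-lag:]))

    lags = np.arange(-max_lag, max_lag + 1)
    scores = np.array([overlap(int(k)) for k in lags])

    best = int(np.argmax(scores))
    # Crude stand-in for the paper's confidence measure: how sharply the
    # best lag stands out. A flat profile suggests an unreliable pair.
    confidence = scores[best] / (scores.sum() + 1e-9)
    return int(lags[best]), confidence
```

In the full method, such pairwise estimates are computed for all camera pairs and cross-validated on a global timeline, and pairs with low confidence are rejected before alignment.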

Notes

  1. t_V denotes the discrete temporal coordinate of the video signal (in frames) and t_A corresponds to the discrete temporal coordinate of the audio signal (in samples).

  2. Note that with M = 3 cameras the proposed method can detect that there is an unrelated recording, but not which recording is actually unrelated (see the consistency sketch after these notes).

  3. http://www.eecs.qmul.ac.uk/%7Eandrea/synchro.html

  4. The constraints of the method in [13] make it inapplicable to our dataset, since it requires a minimum of three cameras, of which two must be static.
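As a hypothetical illustration of the consistency check alluded to in note 2 (not the paper's actual criterion), the pairwise delays of three recordings should close the loop d12 + d23 = d13; a broken loop reveals that some recording is unrelated, but any of the three pairwise estimates could be the faulty one.

```python
def loop_is_consistent(d12, d23, d13, tol=1):
    """Check whether three pairwise delays (in frames) close the loop.

    With M = 3 cameras, d12 + d23 should equal d13 up to `tol` frames.
    A violation shows that some recording is unrelated, but not which:
    any of the three pairwise estimates could be the inconsistent one.
    """
    return abs((d12 + d23) - d13) <= tol

# Delays that close the loop vs. a set corrupted by an unrelated recording.
print(loop_is_consistent(40, -15, 25))  # True  -> mutually consistent
print(loop_is_consistent(40, -15, 90))  # False -> one recording is off
```

With a fourth camera, the offending recording could be singled out by checking which subset of pairwise delays remains mutually consistent.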

References

  1. Adobe Premiere Pro http://www.adobe.com/products/premiere.html. Accessed 26 Aug 2013

  2. Caspi Y, Irani M (2002) Spatio-temporal alignment of sequences. IEEE Trans Pattern Anal Mach Intell 24:1409–1424

  3. Cremer M, Cook R (2009) Machine-assisted editing of user-generated content. In: Proceedings of the SPIE-IS&T electronic imaging, vol 7254

  4. Daniyal F, Taj M, Cavallaro A (2010) Content and task-based view selection from multiple video streams. Multimed Tools Appl 46:235–258

  5. EU, FP7 project APIDIS (ICT-216023) http://www.apidis.org/Dataset/. Accessed 26 Aug 2013

  6. Final Cut Pro http://www.apple.com/finalcutpro/. Accessed 26 Aug 2013

  7. Fritsch J, Kleinehagenbrock M, Lang S, Fink GA, Sagerer G (2004) Audiovisual person tracking with a mobile robot. In: International conference intelligent autonomous systems

  8. Guggenberger M, Lux M, Boszormenyi L (2012) AudioAlign - synchronization of A/V-streams based on audio data. In: Proceedings of the IEEE international symposium on multimedia, pp 382–383

  9. Jiang W, Cotton C, Chang SF, Ellis D, Loui AC (2010) Audio-visual atoms for generic video concept classification. ACM Trans Multimed Comput Commun Appl 6(3):1–19

  10. Kennedy LS, Naaman M (2009) Less talk, more rock: automated organization of community-contributed collections of concert videos. In: Proceedings of the ACM WWW

  11. Kidron E, Schechner YY, Elad M (2007) Cross-modal localization via sparsity. IEEE Trans Sig Process 55(4):1390–1404

  12. Laptev I, Belongie SJ, Pérez P, Wills J (2005) Periodic motion detection and segmentation via approximate sequence alignment. In: ICCV

  13. Lei C, Yang YH (2006) Tri-focal tensor-based multiple video synchronization with subframe optimization. IEEE Trans Image Process 15(9):2473–2480

  14. Llagostera Casanovas A, Monaci G, Vandergheynst P, Gribonval R (2010) Blind audio-visual source separation based on sparse redundant representations. IEEE Trans Multimed 12(5):358–371

  15. Padua FL, Carceroni RL, Santos GA, Kutulakos KN (2010) Linear sequence-to-sequence alignment. IEEE Trans Pattern Anal Mach Intell 32:304–320

  16. Potamianos G, Neti C, Gravier G, Garg A, Senior AW (2003) Recent advances in the automatic recognition of audiovisual speech. Proc IEEE 91:1306–1326

  17. ITU-R (1998) Recommendation BT.1359-1: relative timing of sound and vision for broadcasting

  18. Shrestha P, Barbieri M, Weda H, Sekulovski D (2010) Synchronization of multiple camera videos using audio-visual features. IEEE Trans Multimed 12:79–92

  19. Sodoyer D, Girin L, Jutten C, Schwartz JL (2004) Developing an audio-visual speech source separation algorithm. Speech Comm 44(1–4):113–125

  20. Stein G (1999) Tracking from multiple view points: self-calibration of space and time. In: CVPR

  21. Sumby WH, Pollack I (1954) Visual contribution to speech intelligibility in noise. J Acoust Soc Am 26(2):212–215

  22. Summerfield Q (1987) Some preliminaries to a comprehensive account of audio-visual speech perception. In: Hearing by eye: the psychology of lipreading. Lawrence Erlbaum Associates, pp 3–51

  23. Ukrainitz Y, Irani M (2006) Aligning sequences and actions by maximizing space-time correlations. In: ECCV

  24. Vroomen J, Keetels M (2010) Perception of intersensory synchrony: a tutorial review. Atten Percept Psychophys 72(4):871–884

  25. Wedge D, Huynh D, Kovesi P (2007) Using space-time interest points for video sequence synchronization. In: Proceedings of the IAPR conference machine vision applications

  26. Whitehead A, Laganiere R, Bose P (2005) Temporal synchronization of video sequences in theory and in practice. In: Proceedings of the IEEE workshop motion and video computing

  27. Yan J, Pollefeys M (2004) Video synchronization via space-time interest point distribution. In: Proceedings of the advanced concepts for intelligent vision systems

Author information

Corresponding author

Correspondence to Anna Llagostera Casanovas.

Additional information

A. Llagostera Casanovas contributed to this work while at Queen Mary University of London, UK. She was supported by the Swiss National Science Foundation under the prospective researcher fellowship PBELP2-137724. A. Cavallaro acknowledges the support of the UK Engineering and Physical Sciences Research Council (EPSRC), under grant EP/K007491/1.

About this article

Cite this article

Llagostera Casanovas, A., Cavallaro, A. Audio-visual events for multi-camera synchronization. Multimed Tools Appl 74, 1317–1340 (2015). https://doi.org/10.1007/s11042-014-1872-y
