ActionSnapping: Motion-Based Video Synchronization

  • Jean-Charles Bazin
  • Alexander Sorkine-Hornung
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9909)


Video synchronization is a fundamental step for many applications in computer vision, ranging from video morphing to motion analysis. We present a novel method for synchronizing action videos in which a similar action is performed by different people at different times and locations, with different local speed changes, e.g., in sports such as weightlifting, baseball pitching, or dance. Our approach extends the popular “snapping” tool of video editing software and allows users to automatically snap action videos together in a timeline based on their content. Since the action can take place at different locations, existing appearance-based methods are not appropriate. Our approach instead leverages motion information and computes a nonlinear synchronization of the input videos to establish frame-to-frame temporal correspondences. We demonstrate that our approach can be applied to video synchronization, video annotation, and action snapshots. Our approach has been successfully evaluated with ground truth data and a user study.
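To make the idea of a nonlinear synchronization concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each frame has already been summarized by a numeric motion descriptor (how such descriptors are computed is not covered here), builds a cost matrix between the two videos, and recovers a monotonic frame-to-frame path with classic dynamic time warping. The function names `cost_matrix` and `align` and the choice of squared Euclidean distance are illustrative assumptions; the paper's actual features and path solver may differ.

```python
from math import inf


def cost_matrix(a, b):
    """Squared Euclidean distance between every pair of per-frame descriptors.

    `a` and `b` are lists of equal-length numeric vectors (hypothetical
    motion descriptors, one per frame).
    """
    return [[sum((x - y) ** 2 for x, y in zip(fa, fb)) for fb in b] for fa in a]


def align(cost):
    """Return a minimum-cost monotonic path of (i, j) frame pairs.

    Plain dynamic time warping: each step advances one frame in the
    first video, the second video, or both.
    """
    n, m = len(cost), len(cost[0])
    acc = [[inf] * m for _ in range(n)]
    acc[0][0] = cost[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = min(
                acc[i - 1][j] if i > 0 else inf,
                acc[i][j - 1] if j > 0 else inf,
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            acc[i][j] = cost[i][j] + best

    # Backtrack from the last frame pair to the first.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return path


# Toy example: the same one-dimensional "motion" performed slowly and quickly.
slow = [[0.0], [0.0], [1.0], [1.0], [2.0]]
fast = [[0.0], [1.0], [2.0]]
print(align(cost_matrix(slow, fast)))
```

In the toy example the slow video lingers on each pose for two frames, so the recovered path pairs several slow frames with a single fast frame. This is the kind of local speed change that a single global time offset cannot express.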


Action Recognition · Cost Matrix · Input Video · Point Trajectory · Video Annotation
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



We are very grateful to World Dance New York for giving us permission to use their YouTube videos.



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. Disney Research, Zurich, Switzerland
