Repetition Estimation
Abstract
Visual repetition is ubiquitous in our world. It appears in human activity (sports, cooking), animal behavior (a bee's waggle dance), natural phenomena (leaves in the wind) and in urban environments (flashing lights). Estimating visual repetition from realistic video is challenging as periodic motion is rarely perfectly static and stationary. To better deal with realistic video, we relax the static and stationary assumptions often made by existing work. Our spatiotemporal filtering approach, established on the theory of periodic motion, effectively handles a wide variety of appearances and requires no learning. Starting from motion in 3D we derive three periodic motion types by decomposition of the motion field into its fundamental components. In addition, three temporal motion continuities emerge from the field's temporal dynamics. For the 2D perception of 3D motion we consider the viewpoint relative to the motion; what follows are 18 cases of recurrent motion perception. To estimate repetition under all circumstances, our theory implies constructing a mixture of differential motion maps: \(\mathbf{F}\), \(\nabla \mathbf{F}\), \(\nabla \cdot \mathbf{F}\) and \(\nabla \times \mathbf{F}\). We temporally convolve the motion maps with wavelet filters to estimate repetitive dynamics. Our method is able to spatially segment repetitive motion directly from the temporal filter responses densely computed over the motion maps. For experimental verification of our claims, we use our novel dataset for repetition estimation, better reflecting reality with non-static and non-stationary repetitive motion. On the task of repetition counting, we obtain favorable results compared to a deep learning alternative.
Keywords
Video analysis · Motion · Periodicity · Repetition counting · Wavelet transform · Motion segmentation
1 Introduction
Visual repetitive motion is common in our everyday experience as it appears in sports, music-making, cooking and other daily activities. In natural scenes, it appears as leaves in the wind, waves in the sea or the drumming of a woodpecker, whereas our encounters with visual repetition in urban environments include blinking lights, the spinning of wind turbines or a waving pedestrian. In this work we reconsider the theory of periodic motion and propose a method for estimating repetition in real-world video.
To understand the origin and appearance of visual repetition we rethink the theory of periodic motion inspired by existing work (Pogalin et al. 2008; Davis et al. 2000). We follow a differential geometric approach, starting from the divergence, gradient and curl components of the 3D flow field. From the decomposition of the motion field and its temporal dynamics, we derive three motion types and three motion continuities to arrive at \(3\times 3\) fundamental cases of intrinsic periodicity in 3D. For the 2D perception of 3D intrinsic periodicity, the observer’s viewpoint can be somewhere in the continuous range between two viewpoint extremes. Finally, we arrive at 18 fundamental cases for the 2D perception of 3D intrinsic periodic motion.
Estimating repetition in practice remains challenging. First and foremost, repetition appears in many forms due to its diversity of motion types and motion continuities (Fig. 1). Sources of variation in motion appearance include the action class, origin of motion and the observer's viewpoint. Moreover, the motion appearance is often non-static due to a moving camera or as the observed phenomenon develops over time. In practice, repetitions are rarely perfectly periodic but rather non-stationary. Existing literature (Levy and Wolf 2015; Pogalin et al. 2008) generally assumes static and stationary repetitive motion. As reality is more complex, we here address the challenges involved with non-static and non-stationary repetition by proposing a novel method for estimating repetition in real-world video.
To deal with the diverse and possibly non-static motion appearance in realistic video, our theory implies representing the video with a mixture of first-order differential motion maps. For non-stationary temporal dynamics the fixed-period Fourier transform (Cutler and Davis 2000; Pogalin et al. 2008) is not suitable. Instead, we handle complex temporal dynamics by decomposing the motion into a time-frequency distribution using the continuous wavelet transform. To increase robustness and to be able to handle camera motion, we combine the wavelet power of all motion representations. Finally, we alleviate the need for explicit tracking (Pogalin et al. 2008) or motion segmentation (Runia et al. 2018) by segmenting repetitive motion directly from the wavelet power. On the task of repetition counting, our method performs well on an existing video dataset and our novel QUVA Repetition dataset, which emphasizes more realistic video.

We rethink the theory of periodic motion to arrive at a classification of periodic motion. Starting from the 3D motion field induced by an object periodically moving through space, we decompose the motion into three elementary components: divergence, curl and shear. From the motion field decomposition and the field’s temporal dynamics, we identify 9 fundamental cases of periodic motion in 3D. For the 2D perception of 3D periodic motion we consider the observer’s viewpoint relative to the motion. Two viewpoint extremes are identified, from which 18 cases of 2D repetitive appearance emerge.

Our spatiotemporal filtering method addresses the wide variety of repetitive appearances and effectively handles non-stationary motion. Specifically, diversity in motion appearance is handled by representing video as six differential motion maps that emerge from the theory. To identify the repetitive dynamics in the possibly non-stationary video, we use the continuous wavelet transform to produce a time-frequency distribution densely over the video. Directly from the wavelet responses we localize the repetitive motion and determine the repetitive contents.

Extending beyond the video dataset of Levy and Wolf (2015), we propose a new dataset for repetition estimation that is more realistic and challenging in terms of non-static and non-stationary videos. To encourage further research on video repetition, we will make the dataset and source code available for download.
2 Related Work
2.1 Repetition Estimation
The seminal work of Cutler and Davis (2000) uses normalized autocorrelation to obtain similarity matrices and proceeds by repetition estimation using Fourier analysis. Pogalin et al. (2008) estimate the frequency of motion in video by tracking an object, performing principal component analysis over the tracked regions and also employing the Fourier-based periodogram. From the spectral decomposition, the dominant frequencies can be identified by peak detection and non-trivial separation of fundamental and harmonic frequencies. While Fourier-based methods provide a good estimate of strongly periodic motion, they are neither suitable nor intended to deal with more realistic non-stationary repetition, see the accelerating rower in Fig. 2.
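The Fourier-based estimate can be sketched as follows. This is an illustration of periodogram peak detection, not a reimplementation of Pogalin et al. (2008); the function name and signature are ours:

```python
import numpy as np
from scipy.signal import periodogram

def dominant_frequency(signal, fps):
    """Fourier-based frequency estimate: peak of the periodogram,
    skipping the DC component. Returns a single global frequency in Hz."""
    freqs, power = periodogram(signal, fs=fps)
    return freqs[1:][np.argmax(power[1:])]
```

Such a global estimate is exact for a stationary sinusoid, but by construction it cannot track a cycle length that changes over the video.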
While strongly periodic motion has received serious attention, less effort has been devoted to non-stationary repetition in video. Briassouli and Ahuja (2007) use the Short-Time Fourier Transform for estimating the time-varying spectral components in video to distinguish multiple periodically moving objects. The filtering-based approach of Burghouts and Geusebroek (2006) uses a time-causal filter bank from Koenderink (1988) to detect quasi-periodic motion in video. Their method works online and shows good results when the filter response frequencies are tuned correctly. In this work, we employ the continuous wavelet transform over multiple temporal scales to estimate repetition in complex video.
The deep learning method of Levy and Wolf (2015) is different from all other work but resembles our work in counting-based evaluation over a large video dataset. The general idea is to train a convolutional neural network for predicting the motion period in short video clips. As training data is not available, the network is optimized on synthetic video sequences in which moving squares exhibit periodic motion of four motion types from Pogalin et al. (2008). At test time, the method takes a stack of video frames, performs explicit motion localization to obtain a region of interest and then classifies the motion period by forwarding the frame crops through the network. The system is evaluated on the task of repetition counting and shows near-perfect performance on their YTSegments dataset. The 100 videos are a good initial benchmark, but as the majority of videos have a static viewpoint and exhibit stationary periodic motion, we propose a new dataset. Our dataset better reflects reality by including more non-static and non-stationary examples.
Increased video complexity in terms of motion appearance, scene complexity and camera motion demands intricate spatiotemporal localization of salient motion. While many methods for periodic motion analysis incorporate some form of tracking or motion segmentation (Polana and Nelson 1997; Pogalin et al. 2008; Levy and Wolf 2015), few approaches specifically address the challenge of repetitive motion segmentation. Goldenberg et al. (2005) estimate the repetitive foreground motion to leverage its center-of-mass trajectory for classifying human behavior. More closely related is the work of Lindeberg (2017) in which scale selection over space and time leads to an effective temporal scale map. Inspired by this, we perform spatial segmentation of repetitive motion directly from the spectral power maps obtained through the continuous wavelet transform. This is appealing, as it connects localization to the temporal dynamics rather than relying on decoupled localization by state-of-the-art motion segmentation, e.g. Tokmakov et al. (2017).
2.2 Categorization of Motion Types
In real-world video, periodic motion emerges in a wide variety of appearances (see Fig. 3). We reconsider the theory of periodic motion by proposing a classification of fundamental periodic motion types starting from the 3D motion field tied to a moving object. Using first-order differential analysis, we decompose the motion field into its primitive components. The work of Koenderink and van Doorn (1975) delivered inspiration for our theoretical derivation of repetitive motion types from the flow field. The decomposition is related to the Helmholtz–Hodge decomposition (Abraham et al. 1988) and to the eigenvalue analysis of the flow field's Jacobian matrix, which find use in flow field topology for fluid dynamics and electrodynamics. Although our work is similar in its differential decomposition of the motion field, we use it to reach a novel classification of periodic motion patterns. We use the insights for establishing our repetition estimation method.
Although not directly related to our work, first-order differential geometric motion representations have been used extensively as spatiotemporal video descriptors. Klaser et al. (2008) propose a spatial multi-scale motion descriptor based on first-order differential motion and use integral videos for efficient computation. Along similar lines, MoSIFT (Chen and Hauptmann 2009) uses spatial interest points and enforces sufficient temporal dynamics to eliminate candidate points. In terms of motion descriptors, our work bears resemblance to the Divergence–Curl–Shear descriptor proposed by Jain et al. (2013). Their favorable action classification results associated with the differential-based descriptor support our findings for periodic motion estimation.
3 Repetitive Motion
Visual repetition is defined as a reoccurring pattern over space or time in the 3D world. In this work, we focus on temporally repetitive motion rather than spatially repetitive patterns such as a texture. Consequently, the 3D motion field induced by a moving object is the right starting point for our theoretical analysis.
3.1 Motion Field Decomposition
3.2 Intrinsic Periodic Motion in 3D
3.2.1 Motion Types
3.2.2 Motion Continuities
3.2.3 Categorization of Periodic Motion
The intrinsic periodicity in 3D does not cover all perceived recurrence in an image sequence. For the trivial cases of constant translation and constant expansion in 3D, perceived recurrence will appear when a repetitive chain of objects (a conveyor belt) or a repetitive appearance on the object (the texture on a car tire) is aligned with the motion. In such cases, the recurrence will also be observed in the field of view. For constant rotation, the restriction is that the appearance cannot be constant over the surface, as otherwise no motion, let alone recurrent motion, would be observed. In the rotational case, any rotational symmetry in appearance will induce a higher-order recurrence as a multiplication of the symmetry and the rotational speed.
For the purpose of periodic motion, the nine cases are organized in a \(3\times 3\) Cartesian table of basic motion type times motion continuity, see Fig. 5a. Corresponding examples of these nine cases are given in Fig. 5b. This is the list of fundamental cases, where a mixture of types is permitted. In practice, some cases are ubiquitous, while for others it is hard to find examples at all.
3.3 Visual Recurrence in 2D
3.4 Nonstatic Repetition
Relative motion between the moving object and the observer adds another dimension of complexity. In particular, with recurrent motion: (1) the camera may move because it is mounted on the moving object itself, (2) the camera may follow the target of interest, or (3) the camera may move independently of the object's motion. For the first two cases, the camera motion reflects the periodic dynamics of the object's motion. The repetitive flow may then lie outside the object, but it still displays a complementary pattern in the flow field.
In the first case, the periodically moving camera will produce a global repetitive flow field, as opposed to the local repetitive flow when the object itself is moving. The third case particularly demands the removal of the camera motion prior to the repetitive motion analysis. In practice, this situation occurs frequently. Therefore, particular attention needs to be paid to camera motion independent of the target's motion. When the viewpoint changes from frontal to side view due to camera motion, the analysis will inevitably be hard. Figure 6 illustrates the dramatic changes in the flow field when the camera changes from one extreme viewpoint (side) to the other (frontal), or vice versa. Our method handles such appearance changes by simultaneously using multiple motion representations and summing their temporal filter responses.
3.5 Nonstationary Repetition
4 Method
In this section we present our method for estimating repetition in video. The method takes as input a sequence of RGB frames and outputs a frequency distribution densely computed over space and time. Subsequently, the spectral power distribution, which we obtain from the continuous wavelet transform, is used for repetition counting, motion segmentation or other frequency-based measurements. We target the general case in which moving objects may exhibit non-stationary periodicity or have a non-static appearance due to camera motion or repetition superposed on translation. Our method, summarized in Fig. 8, comprises motion estimation and two consecutive filtering steps: first we spatially filter the motion fields to arrive at first-order differential geometric motion maps, and then we determine the video's repetitive contents by applying the continuous wavelet transform densely over the motion maps. Task-dependent post-processing steps may give the desired output; here we focus on repetition counting as it enables straightforward evaluation of our method in the presence of non-stationary repetitions.
4.1 Differential Geometric Motion Maps
Figure 9 displays an example frame with four of the six motion maps (the remaining two are omitted here). The six motion maps represent the video for each moment in time and address the diversity in repetitive motion. In our experiments, we will evaluate the individual and joint representative power associated with the motion maps. A priori it is unknown which motion type we are dealing with; we return to this later by combining the temporal responses of all motion maps.
4.2 Dense Temporal Filtering
So far we have only considered spatial filtering to obtain the motion maps for a moment in time. Here we include time and proceed by temporal filtering of the motion maps to estimate the video's repetitive motion. This is where the current method diverges from our previous work. In Runia et al. (2018), we relied on the same motion maps but performed max-pooling over the foreground motion segmentation obtained separately from Papazoglou and Ferrari (2013). The max-pooled values over time construct a one-dimensional signal acting as a surrogate for the dynamics in a particular motion map. Spectral decomposition of each of the signals led to six (possibly contrasting) time-frequency estimates. To select the most discriminative representation, we employed a self-quality assessment based on the spectral power in the signals.
We found two problems with this approach: (1) the decoupled motion segmentation may not be optimal for estimating repetitive motion dynamics, and (2) max-pooling over the foreground motion mask discards most information and is unable to deal with multiple moving parts. We here address these problems by dense temporal filtering over all locations in the motion map instead of operating on the max-pooled signals. Spatially dense estimation of the local spectral power enables us to localize regions likely containing repetitive motion. The temporal filtering can be implemented in several ways, for example as a Fourier transform through temporal convolution. To handle non-stationary video dynamics, we perform the continuous wavelet transform by convolution to obtain a time-varying spectral decomposition.
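Dense temporal filtering can be sketched as below: each pixel's time series in a motion map is convolved with a bank of Morlet wavelets. We assume a (T, H, W) motion map layout and use the L2-normalized Morlet of Torrence and Compo (1998); the function names are ours, and a practical implementation would batch this on the GPU as described later:

```python
import numpy as np

def morlet(t, scale, w0=6.0):
    """Complex Morlet wavelet, L2-normalized per scale."""
    x = t / scale
    return (np.pi ** -0.25) * np.exp(1j * w0 * x - 0.5 * x ** 2) / np.sqrt(scale)

def dense_wavelet_power(motion_map, scales, fps=30.0):
    """motion_map: (T, H, W) array of one differential motion map.
    Returns the wavelet power, shape (len(scales), T, H, W)."""
    T, H, W = motion_map.shape
    t = (np.arange(T) - T // 2) / fps          # wavelet support centered in time
    flat = motion_map.reshape(T, H * W)
    power = np.empty((len(scales), T, H, W))
    for i, s in enumerate(scales):
        kernel = np.conj(morlet(t, s))[::-1]   # correlation via convolution
        resp = np.array([np.convolve(flat[:, p], kernel, mode='same')
                         for p in range(H * W)]).T
        power[i] = (np.abs(resp) ** 2).reshape(T, H, W)
    return power
```

A pixel oscillating at a given period responds most strongly at the matching temporal scale, which is the basis for the power and scale maps used below.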
4.3 Continuous Wavelet Transform
4.4 Combining Spectral Power Maps
We compute the time-localized frequency estimates by temporal convolution densely over the six individual motion representations. For each representation this produces a time-varying maximum power map and scale map. The power map contains the spatial distribution of maximum wavelet power over all temporal scales; the scale map holds the temporal scales corresponding to the wavelets with maximum power. What remains is combining the wavelet responses from all motion representations.
Rather than selecting the single most discriminative representation (Runia et al. 2018), we combine the spectral power maps by summation on a per-frame basis. To illustrate this, we visualize four (out of six) individual power maps and their combined response in Fig. 11. Summation of the spectral power maps has a number of attractive properties. Most importantly, the motion maps with the strongest repetitive appearance will contribute most to the final power map, whereas weakly periodic motion maps will have a negligible contribution. This effectively serves as a dynamic selection of the most discriminative motion representation. Moreover, as the spectral power is time-localized, the relative contribution per motion representation will evolve over time. This is appealing because motion appearance can be non-static in realistic video due to camera motion or gradual change in motion type.
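The reduction over temporal scales and the summation over representations can be sketched as follows; the (S, T, H, W) array layout and function names are our assumptions:

```python
import numpy as np

def reduce_over_scales(power, scales):
    """power: (S, T, H, W) wavelet power. The max over temporal scales
    gives the power map; the argmax scale gives the scale map."""
    idx = np.argmax(power, axis=0)                          # (T, H, W)
    power_map = np.take_along_axis(power, idx[None], axis=0)[0]
    scale_map = np.asarray(scales)[idx]
    return power_map, scale_map

def combine_representations(power_maps):
    """Sum the per-representation power maps: strongly repetitive
    representations dominate, weakly periodic ones contribute little."""
    return np.sum(np.stack(power_maps, axis=0), axis=0)
```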
4.5 Spatial Segmentation
The combined wavelet power map gives a time-varying spatial distribution of spectral power over all motion representations, whereas the corresponding effective scale map relates to the temporal scale with maximum spectral power. We propose to use the spatial distribution of spectral power for segmentation of the regions with the strongest repetitive appearance. Subsequently, we use the scale map to infer the dominant temporal scale (related to the motion frequency) over the localized region.
The spatial segmentation of repetitive motion is performed in a straightforward manner. For a moment in time, we simply mean-threshold the combined wavelet power map to obtain a binary segmentation mask associated with regions containing significant spectral power. More precisely, the wavelet-based motion segmentation will attend to regions in which the maximum spectral power over all temporal scales is significant. Figure 9 (bottom row) illustrates this by displaying the combined power map and corresponding scale map. In general, performing motion segmentation directly from the spatial distribution of spectral power is appealing as it couples the localization to the subsequent frequency measurements. Our experiments will verify this claim and compare against specialized motion segmentation methods. We note that our segmentation method leaves the door open for multiple repetitively moving objects, whereas most state-of-the-art segmentation methods assume a single dominant foreground motion (Tokmakov et al. 2017).
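The per-frame mean-thresholding amounts to a one-line operation (function name is ours):

```python
import numpy as np

def mean_threshold_mask(power_map):
    """Binary segmentation for one frame: keep pixels whose combined
    wavelet power exceeds the mean power of the frame."""
    return power_map > power_map.mean()
```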
4.6 Repetition Counting
To obtain an instantaneous frequency estimate of the salient motion, we median-pool the temporal wavelet scales over the segmentation mask. Median-pooling is preferred over mean-pooling as it is relatively robust to outliers and produces a better estimate of the dominant frequency. The corresponding temporal wavelet scale is then converted to an instantaneous frequency using Eq. 18. For a moment in time, this delivers a frequency estimate for the salient repetitive motion. Counting the number of repetitions follows from temporal integration of the consecutive frequency measurements, with the temporal sampling spacing inferred from the video's frame rate.
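The counting step can be sketched as below. Since Eq. 18 is not reproduced in this excerpt, we substitute the standard Morlet scale-to-Fourier-period relation from Torrence and Compo (1998) for the scale-to-frequency conversion; all names are ours:

```python
import numpy as np

def count_repetitions(scale_maps, masks, fps, w0=6.0):
    """scale_maps, masks: per-frame lists of (H, W) scale maps and boolean
    segmentation masks. Median-pool the scale inside the mask, convert it to
    an instantaneous frequency, and integrate over time."""
    lam = 4 * np.pi / (w0 + np.sqrt(2 + w0 ** 2))  # Morlet: period = lam * scale
    dt = 1.0 / fps
    count = 0.0
    for smap, mask in zip(scale_maps, masks):
        s = np.median(smap[mask]) if mask.any() else np.median(smap)
        freq = 1.0 / (lam * s)                     # instantaneous frequency (Hz)
        count += freq * dt                         # temporal integration
    return count
```

For stationary motion the pooled scale is constant and the count reduces to frequency times duration; for non-stationary motion the per-frame frequencies vary and the integral tracks them.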
We emphasize our method's ability to count the number of cycles in non-stationary video. For a stationary periodic signal, the median-pooled temporal scales will be constant over time, while non-stationary motion produces time-varying frequency estimates. Although the videos considered in our experiments are temporally segmented, the time-localized wavelet responses could also be used for temporal localization of repetitive actions. Moreover, although the current approach performs median-pooling over the motion segmentation mask, the spatial distribution of wavelet power also enables the identification of multiple periodically moving parts.
5 Experiments
We perform experiments to show the effectiveness of our method on the task of counting repetitions in video. Prior to evaluating our full method, we demonstrate the strength of the continuous wavelet transform for estimating repetition in nonstationary signals, show the need for diversified motion maps to deal with the wide variety in motion appearance, and investigate our method’s ability to handle dynamic viewpoints. Before discussing the actual experiments, we introduce the video datasets for testing, give implementation details and specify our counting evaluation metrics.
5.1 Datasets and Evaluation
The main experiments consider two video datasets: the existing YTSegments dataset and our new QUVA Repetition dataset, both collected for the purpose of evaluating repetition estimation in video. The two real-world datasets contain only a single dominant repetitive motion for ease of evaluation. Additionally, we perform a controlled experiment on viewpoint invariance with synthetic video that we generated through 3D modeling in Blender.
YTSegments Dataset For the purpose of evaluating repetition counting in video, Levy and Wolf (2015) introduced a new video benchmark. The 100 videos downloaded from YouTube are purely for evaluation purposes as training of their network is performed with synthesized videos. A wide range of actions appears in the videos: several sports, cooking and animal movement. Each video is temporally segmented such that only the repetitive action is covered. The clips are annotated with a total repetition count. While the dataset serves as a good initial benchmark for repetition estimation, it is limited in terms of cycle length variation (non-stationarity), motion appearances and camera motion. As our goal is to evaluate our method on more realistic video, we introduce a new video dataset that is more challenging in terms of non-stationarity, motion appearance, camera motion and background clutter.
QUVA Repetition Dataset In Runia et al. (2018) we introduced a more realistic video benchmark for repetition estimation. The QUVA Repetition dataset consists of 100 videos displaying a wide variety of repetitive video dynamics, including various kinds of sports, music-making, cooking, grooming, construction and animal behavior. The videos are collected from YouTube with emphasis on creating a diverse collection of videos suitable for evaluating our method's ability to deal with non-stationary motion, camera motion and significant evolution of motion appearance over the course of a video.
Table 1 Dataset statistics of YTSegments and QUVA Repetition

                          YTSegments          QUVA Repetition
Number of videos          100                 100
Duration min/max (s)      2.1/68.9            2.5/64.2
Duration avg. (s)         \(14.9 \pm 9.8\)    \(17.6 \pm 13.3\)
Count avg. ± SD           \(10.8 \pm 6.5\)    \(12.5 \pm 10.4\)
Count min/max             4/51                4/63
Cycle length variation    0.22                0.36
Camera motion             21                  53
Superposed translation    7                   27
The characteristics of both datasets are reported in Table 1. It is apparent that our videos have more variability in cycle length, motion appearance, camera motion and background clutter. The increased difficulty in both appearance and temporal dynamics gives a more realistic benchmark for repetition estimation in the wild. Figure 12 displays a number of examples from both datasets. The project page^{1} contains the dataset download link and several video previews.
5.2 Implementation Details
Motion Segmentation Complex videos with background clutter or camera motion demand segmentation of the foreground motion prior to further analysis. Although our method directly performs localization from the densely computed wavelet power, we also evaluate with state-of-the-art motion segmentation methods. The fast video segmentation method of Papazoglou and Ferrari (2013) is chosen as the classical approach and was also used in Runia et al. (2018). This approach separates foreground objects from the background in a video by combining motion boundaries followed by segmentation refinement. We also evaluate the more recent deep learning based method of Tokmakov et al. (2017). The method trains a two-stream convolutional neural network with a long short-term memory (LSTM) module to capture the evolution over time. The network parameters are optimized using the large FlyingThings3D dataset (Mayer et al. 2016). The motion masks produced by the trained network are refined with a conditional random field. For both methods we use the official implementations made available by the authors. While both methods generally attain excellent segmentations, we observed that segmentation fails completely for some more difficult frames (either all or no pixels selected as foreground). To remedy incorrect segmentation masks we reuse the segmentation of the previous frame if the fraction of foreground pixels is less than 1% of the entire frame.
Differential Geometric Motion Maps To compute the motion maps we perform spatial filtering with first-order Gaussian kernels. The filtering is implemented in PyTorch and runs in large batches on the GPU to accelerate computation. Spatial convolution is performed with \(\sigma = 4\) for all experiments. We also evaluated \(\sigma = \{2,8,16\}\) but found only minor variation in performance. In practice, a combination of multiple spatial scales may produce the best results. Once the spatial first-order derivatives \(\nabla _x F_x, \nabla _y F_x, \nabla _x F_y\) and \(\nabla _y F_y\) have been obtained through convolution, the differential motion maps are computed as specified in Sect. 4.1.
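A minimal CPU sketch of this filtering step, using SciPy's Gaussian derivative filters instead of the paper's batched PyTorch convolutions: the exact set of six motion maps follows Eq. (13), which is not reproduced in this excerpt, so we show only the divergence, curl and shear combinations of the first-order derivatives; the dictionary keys are our naming:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def differential_motion_maps(Fx, Fy, sigma=4.0):
    """First-order Gaussian derivative filtering of the flow field (Fx, Fy),
    followed by the differential combinations. Axis 0 is y, axis 1 is x."""
    dFx_dx = gaussian_filter(Fx, sigma, order=(0, 1))  # d(Fx)/dx
    dFx_dy = gaussian_filter(Fx, sigma, order=(1, 0))  # d(Fx)/dy
    dFy_dx = gaussian_filter(Fy, sigma, order=(0, 1))  # d(Fy)/dx
    dFy_dy = gaussian_filter(Fy, sigma, order=(1, 0))  # d(Fy)/dy
    return {
        'div':    dFx_dx + dFy_dy,   # divergence
        'curl':   dFy_dx - dFx_dy,   # curl (vorticity)
        'shear1': dFx_dx - dFy_dy,   # shear components (our naming)
        'shear2': dFy_dx + dFx_dy,
    }
```

On a purely radial field (Fx = x, Fy = y) the divergence is 2 and the curl vanishes, which makes for a quick sanity check.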
Continuous Wavelet Transform We use the continuous wavelet filtering implementation as outlined in Torrence and Compo (1998). In comparison to the previous version of our work, we now also perform temporal filtering on the GPU^{2}, resulting in a considerable speed-up. This enables us to apply the wavelet transform in large batches over all spatial locations in the video. As previously mentioned, we use a Morlet wavelet (\(\omega _0 = 6\)) with logarithmic scales (\(\delta j = 0.125\), \(s_0 = 2\delta t\)). We limit the range of J, corresponding to a minimum of four repetitions, by setting \(s_{\min }\) and \(s_{\max }\) accordingly in (16) and (17). Depending on the video length, there are typically between 50 and 60 temporal scale levels. When the compute budget is tight, computational efficiency can be improved by pruning the filter bank with scale selection, for example using the maximum response of a Laplacian filter (Lindeberg 2017). Alternatively, learning could be employed to infer the relationship between motion speed and relevant wavelet scale levels to prune the filter bank.
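The logarithmic scale construction can be sketched as below. Since Eqs. (16) and (17) are not reproduced in this excerpt, the cap on the largest scale is approximated via the Morlet scale-to-period relation of Torrence and Compo (1998); the function name and this approximation are ours:

```python
import numpy as np

def wavelet_scales(n_frames, fps, dj=0.125, min_repetitions=4, w0=6.0):
    """Logarithmically spaced Morlet scales s_j = s0 * 2**(j * dj),
    with s0 = 2*dt and the largest scale capped so that at least
    `min_repetitions` cycles fit in the clip."""
    dt = 1.0 / fps
    s0 = 2 * dt
    max_period = (n_frames * dt) / min_repetitions
    lam = 4 * np.pi / (w0 + np.sqrt(2 + w0 ** 2))  # Fourier period per unit scale
    s_max = max_period / lam
    J = int(np.floor(np.log2(s_max / s0) / dj))
    return s0 * 2 ** (dj * np.arange(J + 1))
```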
Repetition Counting The instantaneous frequency estimates are obtained from the dense wavelet power by pooling over the motion foreground mask. As detailed in Sect. 4.6, the frequencies are integrated over time to arrive at a final repetition count. To remove frequency estimate outliers inconsistent with adjacent frames, we apply a median filter of 9 timesteps (frames) to enforce local smoothness. This gives a slight improvement on both video datasets. The final count predictions are not rounded; hence, evaluation metrics may be slightly off due to incomplete cycles.
Reimplementation of Baselines We compare our method against two existing works for repetition estimation. The method of Pogalin et al. (2008) is chosen to represent the class of Fourierbased methods. Our reimplementation uses a more recent object tracker (Henriques et al. 2012) but is identical otherwise. The tracker is initialized by manually drawing a box on the first frame. Converting the frequency to a count is trivial using the video length and frame rate. Additionally, we compare with the deep learning method of Levy and Wolf (2015) using their publicly available code and pretrained model without any modifications.
5.3 Temporal Filtering: Fourier Versus Wavelets
Results From the results in Fig. 15 it is clear that wavelet-based counting outperforms the periodogram on idealized signals. As expected, we observe that Fourier-based measurements generally fail on videos with significant cycle length variation as they give a single global frequency prediction. Wavelets naturally handle non-stationary repetition and are less sensitive to cycle length variability. We also tried adding a substantial amount of Gaussian noise (\(\sigma = 0.5\)) to the signals; this had only a minor negative effect on both methods (data not shown). This controlled experiment shows the effectiveness of wavelets for repetition estimation, assuming a clear signal can be distilled from the videos.
5.4 Viewpoint Invariance
Setup The theory of repetition considers two viewpoint extremes (Fig. 6). In this experiment we evaluate our method's ability to handle a continuous transition from one viewpoint extreme to the other. The designated mechanism for this is the use of multiple motion representations and the summation of their spectral power obtained from the continuous wavelet transform. To test this, we set up a controlled experiment in which we synthesize a video clip from 3D modeled data in Blender. This enables full control over the object's motion and the viewpoint. Specifically, we build a simple 3D scene containing a ball periodically bouncing on the floor, as displayed in the top row of Fig. 16. Initially, the camera captures the bouncing ball from the side view, but after a number of full motion cycles, the camera smoothly transitions to the frontal view (case 3 to case 6 in Fig. 6). We record the median-pooled vertical flow and divergence over the foreground region to obtain two time-varying signals. The spectral power for both signals is individually estimated using the continuous wavelet transform, after which we combine the power by summation. In addition to the synthetic experiment, we also include the result of a real-world video with significant dynamic viewpoint change (previously shown in Fig. 7).
Results Figures 16 and 17 plot the two median-pooled flow signals and their joint wavelet power obtained by summation. Initially, while the moving object is captured from the side view, the vertical flow is best measurable. Upon the viewpoint transition, the vertical flow vanishes while the divergent flow becomes dominant. As a result of the camera motion, the spectral power of each individual signal gives a strong response for only the first or second half of the video. However, the summation of the spectra gives a clear measurement over the complete video, as is apparent from the combined wavelet power spectrum. This illustrates our method’s ability to handle viewpoint changes by combining the wavelet power contained in multiple motion representations. By summation of the spectra, the best measurable motion representation naturally gives the largest contribution to the combined power. Therefore, this mechanism acts as a replacement of the global representation selection used in Runia et al. (2018) by dynamically leveraging the information in all representations.
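A minimal numerical sketch of this summation mechanism follows; the two toy signals stand in for the median-pooled vertical flow and divergence, and the repetition frequency and Morlet parameter are illustrative assumptions rather than values from the experiment.

```python
import numpy as np

fs, f_rep = 50.0, 1.0                     # frame rate and repetition frequency (Hz)
t = np.arange(0, 20, 1 / fs)
side_view = t < 10                        # camera transitions halfway through
flow_y = np.sin(2 * np.pi * f_rep * t) * side_view    # vertical flow: side view only
div = np.sin(2 * np.pi * f_rep * t) * ~side_view      # divergence: frontal view only

def wavelet_power(sig, f0, w=6.0):
    """Temporal wavelet power of sig at frequency f0 (L1-normalised Morlet)."""
    sigma = w / (2 * np.pi * f0)
    tt = np.arange(-4 * sigma, 4 * sigma, 1 / fs)
    psi = np.exp(2j * np.pi * f0 * tt - tt**2 / (2 * sigma**2))
    psi /= np.abs(psi).sum()
    return np.abs(np.convolve(sig, psi, mode="same"))**2

# Either representation alone covers only half the video; their sum covers all of it.
p_sum = wavelet_power(flow_y, f_rep) + wavelet_power(div, f_rep)
```

Each individual power signal collapses in the half of the video where its representation is not measurable, while the summed power stays strong throughout, which is the viewpoint-invariance argument in miniature.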
5.5 Diversity in Motion Maps
Setup As wavelets prove to be effective for repetition estimation and multiple representations show their value on a synthetic video, we now assess the value of a diversified video representation on real videos of our QUVA Repetition dataset. We hypothesize that, due to the high variability in motion pattern and viewpoint, no single representation is sufficient on its own, but their joint diversity is effective. To test this, we perform repetition counting over all individual motion maps listed in Eq. (13). Instead of summing the wavelet power over all representations, we test the performance of the six motion representations individually. For each representation we densely compute the wavelet power and count the number of repetitions as outlined in the method section. For a fair comparison, we exclude our motion segmentation mechanism based on wavelet power and instead use the motion segmentation proposed by Papazoglou and Ferrari (2013). Again, we evaluate repetition counting on our QUVA Repetition dataset. To obtain a lower bound on the error, we also select the best representation per video in an oracle fashion.
Value of diversity in six motion maps for videos from QUVA Repetition

Representation | MAE \(\downarrow \) | OBOA \(\uparrow \) | # Selected
\({\varvec{\nabla }}{\varvec{\cdot }} \mathbf {F}\) | \(77.8 \pm 90.8\) | 0.21 | 10
\({\varvec{\nabla }}{\varvec{\times }} \mathbf {F}\) | \(53.0 \pm 65.5\) | 0.32 | 11
\(\nabla _x F_x\) | \(58.1 \pm 63.5\) | 0.29 | 15
\(\nabla _y F_y\) | \(59.5 \pm 68.4\) | 0.31 | 9
\(F_x\) | \(49.6 \pm 48.0\) | 0.35 | 25
\(F_y\) | \(42.0 \pm 45.3\) | 0.43 | 30
Oracle best | \(24.1 \pm 33.5\) | 0.63 | 100
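The six motion maps in the table are first-order differentials of the dense flow field. A sketch with finite differences follows; the \((H, W, 2)\) array layout and the use of `np.gradient` are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def motion_maps(F):
    """Six differential motion maps from a dense flow field F of shape (H, W, 2)."""
    Fx, Fy = F[..., 0], F[..., 1]
    dFx_dy, dFx_dx = np.gradient(Fx)   # np.gradient: axis 0 = rows (y), axis 1 = cols (x)
    dFy_dy, dFy_dx = np.gradient(Fy)
    return {
        "Fx": Fx, "Fy": Fy,                 # raw flow components
        "grad_x Fx": dFx_dx,                # directional gradients
        "grad_y Fy": dFy_dy,
        "div": dFx_dx + dFy_dy,             # divergence: expansion/contraction
        "curl": dFy_dx - dFx_dy,            # scalar curl: rotation
    }

# Example: the pure expansion field F = (x, y) has div = 2 and curl = 0 everywhere.
y, x = np.mgrid[0:16, 0:16].astype(float)
maps = motion_maps(np.stack([x, y], axis=-1))
```

On linear fields the finite differences are exact, so the sketch can be checked against the analytic divergence and curl of expansion and rotation fields.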
5.6 Video Acceleration Sensitivity
Setup In this experiment, we examine our method’s sensitivity to acceleration by artificially speeding up videos. Starting from the YTSegments dataset, in which most videos exhibit strong periodic motion, we induce significant non-stationarity by artificially accelerating the videos halfway. More precisely, we modify the videos such that after the midpoint frame the speed is doubled by dropping every second frame. What follows are 100 videos with a \(2\times \) acceleration starting halfway. We compare against the deep learning method of Levy and Wolf (2015), which handles non-stationarity by running the period-predicting convolutional neural network in sliding-window fashion over the video. Fourier-based analysis was left out as it inevitably fails on this task.
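The speed-up manipulation itself is a one-liner. A sketch on a list of frames follows; representing the video as a Python list is an assumption for illustration.

```python
def accelerate_halfway(frames):
    """2x speed-up after the midpoint: keep the first half of the frames
    unchanged, then drop every second frame of the remainder."""
    mid = len(frames) // 2
    return frames[:mid] + frames[mid::2]

# 10 frames -> first 5 kept, then every other frame of the remainder.
print(accelerate_halfway(list(range(10))))  # [0, 1, 2, 3, 4, 5, 7, 9]
```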
5.7 Motion Segmentation
Setup In this experiment we investigate the effectiveness of the motion segmentations obtained directly from the wavelet power for repetition estimation. We visually compare the motion segmentations and test whether replacing our localization mechanism with a state-of-the-art motion segmentation method improves repetition estimation performance. We keep the method identical except for the segmentation method used to obtain the motion mask. In addition to our wavelet-based motion segmentation for obtaining the discriminative motion mask, we compare our method’s performance without any localization (full-frame), with the video segmentation method of Papazoglou and Ferrari (2013), and with the deep learning approach of Tokmakov et al. (2017).
Results We visually compare the three motion segmentation methods in Fig. 20. For most videos, our method is able to localize the repetitive motion. As the emphasis of our work is on repetition estimation, where the segmentation masks are a byproduct, the state-of-the-art methods specifically devoted to foreground motion segmentation naturally produce the visually best results and the lowest intersection-over-union error with respect to the ground-truth mask. Our intention is to obtain a motion mask best suitable for repetition estimation, which does not necessarily overlap with the foreground motion. By thresholding the wavelet power maps, our method emphasizes regions with the most discriminative repetitive motion. This is best recognizable from the bottom two rows, where the motion segmentation includes background regions that periodically change due to the motion. If maximum intersection-over-union overlap with respect to the ground-truth foreground motion mask is desired, we observe a number of failure cases. For the rower (bottom row), the periodicity contained in the movement of the paddles yields a significantly stronger wavelet response than the body itself; hence the body is excluded from the segmentation mask due to mean-thresholding of the wavelet power. In case of football keep-ups (third row), the dominant repetitive motion is the football moving up and down, but the actor also rotates around their own axis, which is not revealed in the static images. However, the oscillating ball dominates the scene, and for this reason our segmentation mask should not include the actor’s torso. The threshold is currently fixed to the mean wavelet power; setting it higher or choosing it adaptively could improve the segmentation masks.
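The mask construction described here reduces to thresholding the per-pixel temporal wavelet power at its frame-wide mean. A toy sketch, with a synthetic power map standing in for the densely computed wavelet power:

```python
import numpy as np

def repetition_mask(power):
    """Binary motion mask: keep pixels whose temporal wavelet power exceeds
    the mean power over the frame (the fixed mean threshold discussed above)."""
    return power > power.mean()

# Synthetic power map: a strongly repetitive 20x20 patch on a weak background.
power = np.full((64, 64), 0.05)
power[20:40, 20:40] = 1.0
mask = repetition_mask(power)    # selects exactly the strong patch
```

As the rower example illustrates, this favours the most discriminative repetitive region rather than the full foreground: any region whose power sits below the mean is dropped, even if it belongs to the moving object.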
5.8 Comparison to the State-of-the-Art
Setup In this experiment, we perform a full comparison on the task of repetition counting for both video datasets. We compare against the Fourier-based method of Pogalin et al. (2008) and the deep learning approach of Levy and Wolf (2015).
Repetition counting results of our method with different motion segmentation mechanisms. Left pair of columns: YTSegments; right pair: QUVA Repetition.

Motion segmentation method | MAE \(\downarrow \) | OBOA \(\uparrow \) | MAE \(\downarrow \) | OBOA \(\uparrow \)
Full-frame | \(46.0 \pm 67.2\) | 0.28 | \(60.8 \pm 49.4\) | 0.22
Papazoglou and Ferrari (2013) | \(13.1 \pm 20.3\) | 0.78 | \(42.6 \pm 49.2\) | 0.44
Tokmakov et al. (2017) | \(21.6 \pm 57.2\) | 0.76 | \(38.9 \pm 39.2\) | 0.42
Differential geometry (this paper) | \(\varvec{9.4 \pm 17.4}\) | 0.89 | \(\varvec{26.1 \pm 39.6}\) | 0.62
Comparison with the state-of-the-art on repetition counting for the YTSegments and our QUVA Repetition datasets. Left pair of columns: YTSegments; right pair: QUVA Repetition.

Method | MAE \(\downarrow \) | OBOA \(\uparrow \) | MAE \(\downarrow \) | OBOA \(\uparrow \)
Pogalin et al. (2008) | \(21.9 \pm 30.1\) | 0.68 | \(38.5 \pm 37.6\) | 0.49
Levy and Wolf (2015) | \(\mathbf {6.5 \pm \phantom {0}9.2}\) | 0.90 | \(48.2 \pm 61.5\) | 0.45
This paper | \(9.4 \pm 17.4\) | 0.89 | \(\mathbf {26.1 \pm 39.6}\) | 0.62
The results change dramatically when considering our challenging QUVA Repetition dataset; notably, the deep learning approach of Levy and Wolf (2015) now performs worst, with an MAE of 48.2. This could be explained by the fact that their network only considers four motion types during training, or by the convolutional network’s fixed temporal input dimension, which constrains the effective motion periods (ranging from 0.2 to 2.33 seconds). Dealing with motion periods outside this range most likely requires retraining the network. The Fourier-based method of Pogalin et al. (2008) scores an MAE of 38.5, whereas we obtain an average error of 26.1. On the YTSegments dataset our simplified method slightly improves over the MAE of \(10.3 \pm 19.8\) reported in Runia et al. (2018), while giving comparable results to the previously reported MAE of \(23.2 \pm 34.4\) on the QUVA Repetition dataset. The Fourier-based and deep learning-based approaches are unable to effectively handle the increased non-stationarity and motion complexity found in our challenging video dataset. The method proposed here improves the ability to handle such difficult videos without relying on explicit motion segmentation methods.
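For reference, the two metrics in the tables can be computed as follows. We assume, consistent with the reported magnitudes, that MAE is the mean absolute count error relative to the ground truth (in percent, with its standard deviation) and that OBOA is the fraction of videos whose predicted count is within one repetition of the ground truth.

```python
import numpy as np

def count_metrics(pred, true):
    """MAE (mean and std of the relative count error, in percent) and
    off-by-one accuracy (OBOA) for repetition counting."""
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    err = np.abs(pred - true) / true            # per-video relative count error
    oboa = np.mean(np.abs(pred - true) <= 1)    # within one repetition
    return 100 * err.mean(), 100 * err.std(), oboa

# Example: one exact count and one undercount of 2 on a 10-repetition video.
mae_mean, mae_std, oboa = count_metrics([10, 8], [10, 10])
```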
Sensitivity of our method with respect to different optical flow methods. Left pair of columns: YTSegments; right pair: QUVA Repetition.

Optical flow method | MAE \(\downarrow \) | OBOA \(\uparrow \) | MAE \(\downarrow \) | OBOA \(\uparrow \)
TV-L\(^1\) | \(9.8 \pm 17.9\) | 0.89 | \(26.5 \pm 67.5\) | 0.67
EpicFlow | \(9.7 \pm 17.9\) | 0.88 | \(30.8 \pm 38.2\) | 0.55
FlowNet 2.0 | \(\mathbf {9.4 \pm 17.4}\) | 0.89 | \(\mathbf {26.1 \pm 39.6}\) | 0.62
To gain a better understanding of our method’s characteristics, we study success and failure cases. We observe that our wavelet-based motion segmentation struggles with scenes containing dynamic textures such as sand or water (e.g. Fig. 12, bottom row). Based on our analysis, we believe the reason for this is twofold: (1) for such regions, motion estimation using optical flow is difficult (Adelson 2001); and (2) dynamic textures produce repetitive visual dynamics, resulting in a strong wavelet response over their entire surface. Consequently, motion segmentation by mean-thresholding of the spectral power will inevitably fail, and subsequent measurements over the foreground motion mask will be incorrect as well. For such videos, we observe an enormous overcount as the frequency estimates correspond to the high-frequency rippling water. The error associated with these videos explains the limited improvement over our previous method (Runia et al. 2018), which relied on Papazoglou and Ferrari (2013) for motion segmentation and is therefore less prone to such segmentation failures. To remedy the problem of coarse and inaccurate segmentation masks, a post-processing step (e.g. a conditional random field) is likely to improve the overall segmentation quality.
We also observe that all methods make a common mistake: overcounting videos by a factor of two. What these videos have in common is that one full cycle contains the exact same motion first with one arm (or leg) and then with the other (e.g. walking lunges or swimming front crawl). As the perceived motion is almost identical for both limbs, the estimated temporal dynamics are twice as fast. Again, the significant overestimate of the motion frequency produces a large count error for all methods. Solving this problem is not easy, as the repetition estimates in those cases are essentially also correct; however, the human annotators define the salient motion as a full cycle involving both limbs.
6 Conclusion
We have categorized 3D intrinsic periodic motion as translation, rotation or expansion depending on the first-order differential decomposition of the motion field. Additionally, we distinguish three periodic motion continuities: constant, intermittent and oscillatory motion. For the 2D perception of 3D periodicity, the camera will be somewhere in the continuous range between two viewpoint extremes. What follows are 18 fundamentally different cases of repetitive motion appearance in 2D. The practical challenges associated with repetition estimation are the wide variety in motion appearance, non-stationary temporal dynamics and camera motion. Our method addresses all these challenges by computing a diversified motion representation, employing the continuous wavelet transform and combining the power spectra of all representations to support viewpoint invariance. Whereas related work explicitly localizes the foreground motion, our method performs repetitive motion segmentation directly from the wavelet power maps, resulting in a simplified approach. We verify our claims by improving the state-of-the-art on the task of repetition counting on our challenging new video dataset. The method requires no training and only a small number of hyperparameters, which are fixed throughout the paper. We envision applications beyond repetition estimation, as the wavelet power and scale maps can support localization of low- and high-frequency regions, suitable for region pruning or action classification.
References
Abraham, R., Marsden, J. E., & Ratiu, T. (1988). Manifolds, tensor analysis, and applications (Vol. 75). Berlin: Springer.
Adelson, E. H. (2001). On seeing stuff: The perception of materials by humans and machines. In Human vision and electronic imaging (Vol. 4299). International Society for Optics and Photonics.
Albu, A. B., Bergevin, R., & Quirion, S. (2008). Generic temporal segmentation of cyclic human motion. Pattern Recognition, 41(1), 6–21.
Azy, O., & Ahuja, N. (2008). Segmentation of periodically moving objects. In Proceedings of the IEEE international conference on pattern recognition (pp. 1–4).
Belongie, S., & Wills, J. (2006). Structure from periodic motion. In W. James MacLean (Ed.), Spatial coherence for visual motion analysis. Berlin, Heidelberg: Springer.
Briassouli, A., & Ahuja, N. (2007). Extraction and analysis of multiple periodic motions in video sequences. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(7), 1244–1261.
Burghouts, G. J., & Geusebroek, J. M. (2006). Quasi-periodic spatiotemporal filtering. IEEE Transactions on Image Processing, 15(6), 1572–1582.
Chen, M. Y., & Hauptmann, A. (2009). MoSIFT: Recognizing human actions in surveillance videos. Technical Report CMU-CS-09-161, Carnegie Mellon University.
Chetverikov, D., & Fazekas, S. (2006). On motion periodicity of dynamic textures. In Proceedings of the British machine vision conference (pp. 167–176).
Cutler, R., & Davis, L. S. (2000). Robust real-time periodic motion detection, analysis, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 781–796.
Davis, J., Bobick, A., & Richards, W. (2000). Categorical representation and recognition of oscillatory motion patterns. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1, 628–635.
Goldenberg, R., Kimmel, R., Rivlin, E., & Rudzsky, M. (2005). Behavior classification by eigendecomposition of periodic motions. Pattern Recognition, 38(7), 1033–1043.
Grossmann, A., & Morlet, J. (1984). Decomposition of Hardy functions into square integrable wavelets of constant shape. SIAM Journal on Mathematical Analysis, 15(4), 723–736.
Henriques, J. F., Caseiro, R., Martins, P., & Batista, J. (2012). Exploiting the circulant structure of tracking-by-detection with kernels. In Proceedings of the European conference on computer vision.
Huang, S., Ying, X., Rong, J., Shang, Z., & Zha, H. (2016). Camera calibration from periodic motion of a pedestrian. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., & Brox, T. (2017). FlowNet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Jain, M., Jegou, H., & Bouthemy, P. (2013). Better exploiting motion for better action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2555–2562).
Johansson, G. (1973). Visual perception of biological motion and a model for its analysis. Perception & Psychophysics, 14, 201–211.
Klaser, A., Marszałek, M., & Schmid, C. (2008). A spatio-temporal descriptor based on 3D-gradients. In Proceedings of the British machine vision conference (pp. 275–1).
Koenderink, J., & van Doorn, A. (1975). Invariant properties of the motion parallax field due to the movement of rigid bodies relative to an observer. Optica Acta: International Journal of Optics, 9, 773–791.
Koenderink, J. J. (1988). Scale-time. Biological Cybernetics, 58(3), 159–162.
Laptev, I., Belongie, S. J., Perez, P., & Wills, J. (2005). Periodic motion detection and segmentation via approximate sequence alignment. Proceedings of the IEEE International Conference on Computer Vision, 1, 816–823.
Levy, O., & Wolf, L. (2015). Live repetition counting. In Proceedings of the IEEE international conference on computer vision.
Li, X., Li, H., Joo, H., Liu, Y., & Sheikh, Y. (2018). Structure from recurrent motion: From rigidity to recurrency. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3032–3040).
Lindeberg, T. (2017). Dense scale selection over space, time and space-time. Journal on Imaging Sciences, 11(1), 438–451.
Liu, F., & Picard, R. W. (1998). Finding periodicity in space and time. In Proceedings of the IEEE international conference on computer vision (pp. 376–383).
Lu, C., & Ferrier, N. J. (2004). Repetitive motion analysis: Segmentation and event classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2), 258–263.
Mayer, N., Ilg, E., Hausser, P., Fischer, P., Cremers, D., Dosovitskiy, A., & Brox, T. (2016). A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4040–4048).
Papazoglou, A., & Ferrari, V. (2013). Fast object segmentation in unconstrained video. In Proceedings of the IEEE international conference on computer vision (pp. 1777–1784).
Pogalin, E., Smeulders, A. W. M., & Thean, A. H. C. (2008). Visual quasi-periodicity. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Polana, R., & Nelson, R. C. (1997). Detection and recognition of periodic, nonrigid motion. International Journal of Computer Vision, 23(3), 261–282.
Ran, Y., Weiss, I., Zheng, Q., & Davis, L. S. (2007). Pedestrian detection via periodic motion analysis. International Journal of Computer Vision, 71(2), 143–160.
Revaud, J., Weinzaepfel, P., Harchaoui, Z., & Schmid, C. (2015). EpicFlow: Edge-preserving interpolation of correspondences for optical flow. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Runia, T. F. H., Snoek, C. G. M., & Smeulders, A. W. M. (2018). Real-world repetition estimation by div, grad and curl. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Sarel, B., & Irani, M. (2005). Separating transparent layers of repetitive dynamic behaviors. In Proceedings of the IEEE international conference on computer vision.
Thangali, A., & Sclaroff, S. (2005). Periodic motion detection and estimation via space-time sampling. In Proceedings of the IEEE workshops on application of computer vision.
Tokmakov, P., Alahari, K., & Schmid, C. (2017). Learning motion patterns in videos. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Torrence, C., & Compo, G. P. (1998). A practical guide to wavelet analysis. Bulletin of the American Meteorological Society, 79(1), 61–78.
Tralie, C. J., & Perea, J. A. (2018). (Quasi)periodicity quantification in video data, using topology. SIAM Journal on Imaging Sciences, 11(2), 1049–1077.
Tsai, P. S., Shah, M., Keiter, K., & Kasparis, T. (1994). Cyclic motion detection for motion based recognition. Pattern Recognition, 27(12), 1591–1603.
Zach, C., Pock, T., & Bischof, H. (2007). A duality based approach for real-time TV-L1 optical flow. In Pattern recognition, LNCS (Vol. 4713, pp. 214–223). Berlin: Springer.
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.