1 Introduction

We focus on human hand-tremor frequency estimation from videos captured with common consumer RGB cameras. The problem is of considerable importance in medical applications, where it aids medical personnel in monitoring motor-disorder patients and diagnosing tremors [2, 13, 25, 34]. Traditionally, clinical practice uses body-worn accelerometers, which offer excellent measurements, yet are intrusive, slow to set up, and measure only a single location per accelerometer. Replacing accelerometers with a common RGB camera yields a non-intrusive method of measuring full-body tremors, a strong advantage in clinical practice.

In the context of tremor analysis, existing approaches require the use of specialized sensors [6, 8, 14, 15], which makes it difficult to apply these methods in practice. Moreover, the targeted application of these approaches is either the higher-level tremor diagnosis problem [14, 26] or tremor/no-tremor classification [28]. We propose to estimate human hand-tremor frequency from RGB videos, and compare against ground truth accelerometer data.

The main challenge, when performing human tremor frequency estimation, is the current lack of openly available realistic datasets. Existing work on human tremor analysis either evaluates using in-house data that is not publicly available [6, 14, 37], or on simulated tremor data where no ground truth tremor statistics are provided [28]. This limits the assessment of human tremor analysis methods and, thus, its progress. An open evaluation dataset is needed.

In this work: (i) we evaluate the frequency of human hand-tremors from RGB videos and we analyze two possible approaches: (i.a) a Lagrangian approach that focuses on the motion of the hand in the image plane, and estimates tremors over the hand positions; (i.b) an Eulerian approach that aligns the hand position over a temporal window, by tracking it, and subsequently uses the image information over time as extracted from intensity values and phase-images, to perform a windowed Fourier analysis at every hand pixel; (ii) we bring forth the TIM-Tremor dataset, containing: 55 RGB patient videos, together with associated ground-truth accelerometer recordings on the most affected hand, as well as aligned depth-data; (iii) we analyze two variants of the Lagrangian approach and two variants of the Eulerian approach and evaluate them numerically on our proposed TIM-Tremor human tremor dataset.

2 Related Work

2.1 Motion Analysis

Periodic Motion. The work in [22] performs action recognition by using space-time repetitive motion templates. Similar to using templates, in [5] a self-similarity measure relying on time-frequency analysis is used for action recognition. The work in [12] performs a spectral decomposition of moving objects to encode periodic motions for object recognition, while [30] performs eigen decomposition and describes periodic motion by the circularity or toroidality of an associated geometric space. Following a similar trend, in [19] complex motion is decomposed into a sequence of simple linear dynamic models for motion categorization. The work in [24] focuses on pedestrian detection through periodic movement analysis. Similar to us, the work in [28] performs tremor analysis; however, in [28] videos are classified into tremor/no-tremor using optical flow features and an SVM. In the recent work in [29] a CNN is used for discriminating between Parkinson patients and non-Parkinson patients, using wrist-worn sensors. In this work we also focus on periodic motion analysis; however, our end goal is tremor frequency estimation rather than action recognition, object tracking or recognition.

Differently, in [18, 27, 33] deep network architectures are trained for counting action repetitions. These actions must be clearly visible and recognizable in the camera view for the deep network architectures to work, while we focus on tremors which are subtle motions.

The work most similar to ours is [31, 32], where tremor frequency is measured from pixel intensities in the video. However, these methods assume that the location of the body part at which the tremor is measured is known in advance. Moreover, the frequency is estimated over intensity values rather than over the detected hand location over time, or over image phase-information over time, as we propose here. The authors do not publicly provide either code or data, which makes it impossible for us to compare with their approach.

Subtle Motion. Small motion, difficult to see with the naked eye, can be magnified [17, 35] through a complex steerable pyramid. In the more realistic case, when the subtle motion is combined with a large motion, follow-up work can magnify subtle motions such as tremors in the presence of large object motion such as walking [9, 39]. Video frequency analysis has also been employed for estimating the properties of physical materials [7]. We also employ signal analysis in the Fourier domain; however, rather than magnifying the subtle motion or estimating material properties, we estimate the frequency of the subtle tremor motion. The works in [14, 15] use specialized sensors or a digital light-processing projector, and a high frame-rate camera to detect small vibrations. Unlike [14, 15], we do not employ specific cameras or expensive sensors; we estimate the tremor frequency from common RGB videos.

2.2 Human Body Pose Estimation

Works such as [3, 21] perform body pose estimation over multiple people, using deep networks. In [3, 4, 23, 36] cascaded prediction or iterative optimizations are used for body pose estimation. We use the method in [36] for estimating where to measure the tremors. We opt for [36] due to its ease of use and robustness. In this work we use the MPII Human Body Pose dataset [1] for training the human body pose estimation models.

3 Hand-Tremor Frequency Estimation

We start by localizing the affected hand. Subsequently, we consider two methods for hand-tremor frequency estimation: (a) Lagrangian hand-tremor frequency estimation, and (b) Eulerian hand-tremor frequency estimation.

3.1 Hand Location Estimation

The first step in estimating human hand-tremors is localizing the affected hand. For this, we use the robust human body pose estimation proposed in [36]. This method provides us with a hand location per frame, \((x_i, y_i)\). We perform the tremor analysis on shorter temporal windows of the video, w(t).

Fig. 1. Lagrangian hand-tremor estimation is based on frequency estimation of the (x, y) coordinates. We detect the hand position \((x_i, y_i)_{i \in w(t)}\) at every frame i over a temporal window w(t). The hand motion is characterized by a large motion, depicted by the black line, and a small motion, depicted by the red dotted line. We smooth this information over time, using a Kalman tracker, to obtain the smooth coordinates of the hand. (Color figure online)

3.2 (a) Lagrangian Hand-Tremor Frequency Estimation

Figure 1 depicts the idea behind the Lagrangian hand-tremor frequency estimation. We start by detecting the hand locations \((x_i, y_i)_{i \in w(t)}\) over the temporal window w(t). The hand motion is typically characterized by a combination of two motions: a large hand trajectory motion, depicted through the continuous black line, and a small motion corresponding to the tremor, depicted in Fig. 1 by the dotted red line. We first apply a Kalman-filter tracker [38] to the initial hand locations, detected by the pose estimation algorithm [36]. This step smooths the hand trajectory, to obtain the large hand motion. We subsequently subtract this smooth trajectory from the original hand locations to retain only the x and y locations of the small hand motion, corresponding to the tremor. Thereafter, we apply the windowed Fourier transform over these corrected locations. This provides us with a PSD (Power Spectral Density) function. We use the frequency at which the PSD attains its maximum as the estimated hand-tremor frequency.
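The Lagrangian steps above can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the constant-velocity motion model, the noise parameters `q` and `r`, the initial covariance, the Hann taper and all function names are hypothetical choices.

```python
import numpy as np

def kalman_smooth_1d(z, q=1e-3, r=1.0):
    """Constant-velocity Kalman filter over one coordinate (frame step = 1).
    q and r are hypothetical process/measurement noise levels."""
    F = np.array([[1.0, 1.0], [0.0, 1.0]])        # position += velocity
    H = np.array([[1.0, 0.0]])                    # only the position is observed
    Q = q * np.array([[0.25, 0.5], [0.5, 1.0]])   # white-acceleration process noise
    R = np.array([[r]])
    x = np.array([z[0], 0.0])                     # initial state: first detection, zero velocity
    P = np.eye(2) * 10.0                          # assumed initial uncertainty
    out = np.empty(len(z))
    for i, zi in enumerate(z):
        x = F @ x                                 # predict
        P = F @ P @ F.T + Q
        S = H @ P @ H.T + R                       # update
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (np.array([zi]) - H @ x)
        P = (np.eye(2) - K @ H) @ P
        out[i] = x[0]
    return out

def lagrangian_tremor_frequency(xy, fs):
    """Peak PSD frequency of the hand positions after removing the smooth trajectory."""
    xy = np.asarray(xy, dtype=float)              # shape (frames, 2)
    n = len(xy)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    psd = np.zeros(len(freqs))
    for k in range(2):                            # x and y coordinates
        residual = xy[:, k] - kalman_smooth_1d(xy[:, k])   # keep the small motion only
        residual -= residual.mean()
        psd += np.abs(np.fft.rfft(residual * np.hanning(n))) ** 2
    peak = int(np.argmax(psd[1:])) + 1            # skip the DC bin
    return freqs[peak]

# Synthetic usage: slow drift plus a 5 Hz tremor sampled at 30 frames per second.
t = np.arange(300) / 30.0
x = 200 + 15 * t + 5 * np.sin(2 * np.pi * 0.2 * t) + 2 * np.sin(2 * np.pi * 5 * t)
y = 100 + 1.5 * np.sin(2 * np.pi * 5 * t + 0.7)
print(lagrangian_tremor_frequency(np.stack([x, y], axis=1), fs=30.0))
```

Summing the PSDs of the two coordinates is one simple way to combine them; the text does not specify how the x and y spectra are merged.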

Fig. 2. Eulerian hand-tremor estimation is based on frequency estimation in images. (1) The first step is the same as in the Lagrangian approach illustrated in Fig. 1: detecting a Kalman-filtered smoothed hand position at every frame over a temporal window w(t). (2) We crop image windows around the smoothed hand locations. Each such cropped image is transformed into a phase-pyramid with 4 orientations and 3 scales using a steerable filter bank. (3) For every pixel, in every phase-image over the temporal window w(t) we estimate a PSD (Power Spectral Density). We accumulate these over the pixels in one phase-image, to obtain one PSD per phase-image. (4) We select the most informative phase-image PSD and use it to estimate the tremor-frequency.

3.3 (b) Eulerian Hand-Tremor Frequency Estimation

Figure 2 illustrates the Eulerian frequency estimation. The first step is the same as in Fig. 1, where the hand locations are detected using the pose estimation method in [36], and subsequently, we smooth the trajectory given by these hand detections using a Kalman tracker. This gives us the smooth trajectory of the hand over time, in the video. We crop image windows around the temporally smoothed locations of the hand in the video—along the black line depicted in Fig. 2.(1). For each such image crop, we extract local motion information encoded as phase over different scales and orientations. Thereafter, we compute the frequency of the hand-tremor by using the most informative phase-image. Figure 2 depicts these individual steps.
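The cropping step can be illustrated with a short NumPy sketch. The crop size is not stated in the text, so the `half` parameter below, the clamping at the image border and the function name are assumptions made for illustration only.

```python
import numpy as np

def crop_hand_windows(frames, centers, half=32):
    """Crop a (2*half x 2*half) window around each smoothed hand location.
    frames: array of shape (T, H, W) or (T, H, W, 3); centers: T pairs (x, y).
    Windows are shifted inward when the hand is close to the image border."""
    frames = np.asarray(frames)
    height, width = frames.shape[1], frames.shape[2]
    crops = []
    for frame, (cx, cy) in zip(frames, centers):
        x0 = int(np.clip(int(round(float(cx))) - half, 0, width - 2 * half))
        y0 = int(np.clip(int(round(float(cy))) - half, 0, height - 2 * half))
        crops.append(frame[y0:y0 + 2 * half, x0:x0 + 2 * half])
    return np.stack(crops)
```

Because the crops follow the smoothed trajectory rather than the raw detections, the tremor itself remains visible inside the cropped window while the large hand motion is compensated.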

Phase-Image Computation. Works such as [10, 20, 35, 39] claim that the phase responses over time contain descriptive information regarding the motion present in the image. In [35] the use of complex steerable filters [11] is proposed for extracting local motion information. Given an input image I(x, y) and a set of complex steerable filters of the form: \(G_\sigma ^\theta + i H_\sigma ^\theta \), where \(i=\sqrt{-1}\), \(\sigma \) defines the scale of the filter, and \(\theta \) the orientation, we obtain a complex steerable pyramid by convolving the image with this set of filters:

$$\begin{aligned} ( G_\sigma ^\theta + i H_\sigma ^\theta ) \circledast I(x,y) = A_\sigma ^\theta (x,y) e^{ i \phi _\sigma ^\theta (x,y)}, \end{aligned}$$
(1)

where \(\circledast \) denotes the convolution operation, \(A_\sigma ^\theta (x,y)\) is the resulting amplitude for scale \(\sigma \) and orientation \(\theta \), and \(\phi _\sigma ^\theta (x,y)\) is the corresponding phase information. To obtain a phase-image, we set the amplitude to 1 and apply the inverse transformation [11] to reconstruct the image. Examples of phase-images are depicted in Fig. 2.(2). We use 4 orientations: \(\theta \in \{0, \frac{\pi }{4}, \frac{\pi }{2}, \frac{3 \pi }{4}\}\) and 3 scales: \(\sigma \in \{1.0, 0.5, 0.25\}\), giving rise to 12 phase-images. In addition to the 12 phase-images, we add the grayscale version of the cropped hand-image. Therefore, we have in total 13 images, which we merge into a single image with 13 channels, over which we estimate the hand-tremor frequency.
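To make the amplitude/phase decomposition of Eq. (1) concrete, the sketch below builds a 13-channel image from a grayscale crop. It is a simplified stand-in and not the complex steerable pyramid of [11]: each filter is an oriented Gaussian band-pass defined in the Fourier domain, the mapping from scale to centre frequency (`base_freq`) is a hypothetical choice, and the phase angle \(\phi \) is returned directly instead of an image reconstructed from the unit-amplitude coefficients.

```python
import numpy as np

THETAS = (0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)   # 4 orientations
SCALES = (1.0, 0.5, 0.25)                              # 3 scales

def phase_images(img, thetas=THETAS, scales=SCALES, base_freq=0.25):
    """Local phase per orientation and scale, shape (H, W, len(scales) * len(thetas))."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    fy = np.fft.fftfreq(h)[:, None]                    # vertical frequencies (cycles/pixel)
    fx = np.fft.fftfreq(w)[None, :]                    # horizontal frequencies
    spectrum = np.fft.fft2(img)
    out = []
    for s in scales:
        f0 = base_freq * s                             # assumed centre frequency for this scale
        for th in thetas:
            cx, cy = f0 * np.cos(th), f0 * np.sin(th)
            # (Approximately) one-sided band: the response is complex, A * exp(i * phi).
            band = np.exp(-((fx - cx) ** 2 + (fy - cy) ** 2) / (2 * (f0 / 2) ** 2))
            response = np.fft.ifft2(spectrum * band)
            out.append(np.angle(response))             # discard amplitude, keep phase
    return np.stack(out, axis=-1)

def build_channels(gray_crop):
    """Stack the 12 phase channels with the grayscale crop: shape (H, W, 13)."""
    gray = np.asarray(gray_crop, dtype=float)
    return np.concatenate([phase_images(gray), gray[..., None]], axis=-1)
```

Discarding the amplitude is what makes the representation sensitive to small displacements: a sub-pixel shift of the hand changes the local phase even when the intensity change is small.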

Hand-Tremor Frequency Estimation. We filter each one of the 13 input channels over time with a 4\(^{th}\)-order Butterworth band-pass filter. This eliminates noisy frequencies that cannot correspond to a natural human tremor.
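A temporal band-pass of this kind can be sketched with SciPy as follows. The pass-band is not given in the text, so the 2 to 12 Hz limits are placeholder values only; the zero-phase `sosfiltfilt` call and the reading of "4th-order" as the `N` argument of `scipy.signal.butter` are likewise assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_over_time(stack, fs, low_hz=2.0, high_hz=12.0, order=4):
    """Band-pass every pixel of every channel along the time axis.
    stack: array of shape (T, H, W, C); fs: frame rate in Hz.
    low_hz and high_hz are hypothetical cut-offs, not values from the text."""
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    # Forward-backward filtering gives zero phase distortion (it also squares the gain).
    return sosfiltfilt(sos, np.asarray(stack, dtype=float), axis=0)
```

Second-order sections are used because they are numerically better behaved than the transfer-function form for band-pass designs. The upper cut-off must stay below the Nyquist frequency, which is 15 Hz for a 30 frames-per-second video.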

To reduce the effect of the considered temporal window, w(t), we use an adjustable Tukey window with the parameter \(\alpha \) set to \(\frac{f_s}{N - 1}\), where \(f_s\) is the sampling rate and N is the total number of frames in w(t). This ensures that the video signal over time is processed in a consistent manner while allowing for adjustable temporal window sizes, w(t).
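As a worked example of this setting, a window of N = 60 frames at an assumed frame rate of 30 Hz gives \(\alpha = 30/59 \approx 0.51\). Since the tapered part of a Tukey window covers \(\alpha (N-1)/2\) samples at each end, this choice tapers roughly \(f_s/2\) frames, i.e. about half a second, on either side regardless of N. The sketch below uses `scipy.signal.windows.tukey`; the clamp of \(\alpha \) to 1 for very short windows and the function name are additions for illustration.

```python
import numpy as np
from scipy.signal.windows import tukey

def tukey_for_window(n_frames, fs):
    """Tukey window with alpha = fs / (N - 1), as described in the text.
    alpha is clamped to 1 (an assumption) so that very short windows
    fall back to a fully tapered, Hann-shaped window."""
    alpha = min(1.0, fs / (n_frames - 1))
    return alpha, tukey(n_frames, alpha)

alpha, window = tukey_for_window(60, 30.0)
print(round(alpha, 3), window.shape)
```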

Fig. 3. (a) We record motor-disorder patients in 21 tasks. Each task may elicit a tremor. (b) Short explanation of what each task involves.

Within each temporal window, w(t), we estimate a PSD function, over every input channel, at every pixel location. For an input channel, c, we estimate the final PSD, \(\mathcal {P}_{w(t)}^c (f)\), by averaging spatially the PSDs over the pixels in that channel. We repeat this process for all 13 channels, giving rise to 13 PSD functions.
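The per-pixel estimation followed by spatial averaging can be sketched as below. The periodogram normalisation and the function name are assumptions; the `window` argument stands for the temporal taper applied to the window w(t).

```python
import numpy as np

def channel_psds(stack, fs, window=None):
    """One PSD per channel for a single temporal window.
    stack: array of shape (T, H, W, C). Returns (freqs, psd) with psd of shape (F, C)."""
    stack = np.asarray(stack, dtype=float)
    n = stack.shape[0]
    if window is None:
        window = np.ones(n)                              # rectangular window by default
    windowed = stack * window[:, None, None, None]       # taper every pixel over time
    power = np.abs(np.fft.rfft(windowed, axis=0)) ** 2   # per-pixel power spectrum
    power /= fs * np.sum(window ** 2)                    # periodogram scaling (assumed)
    psd = power.mean(axis=(1, 2))                        # average over the pixels of a channel
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs, psd
```

Averaging the power spectra, rather than the signals, keeps the tremor peak even when different pixels oscillate with different phases.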

In [5] the power spectrum is considered to be periodic at a certain frequency, f, if the PSD response at that frequency is at least a few standard deviations away from the mean PSD response. This is indicative of how noisy the PSD function is. We use this same criterion to pick the most informative image channel; this is the channel over which we estimate the final hand-tremor frequency. We define for each channel a score, \(\mathcal {S}^c(f)\):

$$\begin{aligned} \mathcal {S}^c(f) = \frac{1}{\mid w(t) \mid }\sum _{w(t)}\left( \mathcal {P}_{w(t)}^c (f) - \mu _{\mathcal {P}_{w(t)}^c} - k \sigma _{\mathcal {P}_{w(t)}^c} \right) , \end{aligned}$$
(2)

where \(\mid w(t) \mid \) is the number of temporal windows per video, \(\mu _{\mathcal {P}_{w(t)}^c}\) represents the mean of the PSD response, and \(\sigma _{\mathcal {P}_{w(t)}^c}\) denotes its standard deviation, while k is an adjustable parameter. We set \(k = 3\) in our experiments.

The final predicted frequency over the 13 channels becomes:

$$\begin{aligned} f^* = \mathop {\text {arg max}}\limits _f ( \max _c \mathcal {S}^c (f) ). \end{aligned}$$
(3)
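Equations (2) and (3) translate into a few array operations. The sketch below assumes that the mean and standard deviation of each PSD are taken over the frequency axis, separately per window and channel; the array layout and the function name are illustrative.

```python
import numpy as np

def select_frequency(psd, freqs, k=3.0):
    """Score of Eq. (2) and arg max of Eq. (3).
    psd: array of shape (n_windows, n_channels, n_freqs); freqs: the n_freqs bin centres.
    Returns the predicted frequency and the index of the selected channel."""
    psd = np.asarray(psd, dtype=float)
    mu = psd.mean(axis=2, keepdims=True)              # mean PSD response per window/channel
    sigma = psd.std(axis=2, keepdims=True)            # its standard deviation
    score = (psd - mu - k * sigma).mean(axis=0)       # Eq. (2): average over windows
    best_per_freq = score.max(axis=0)                 # inner max over channels
    f_idx = int(np.argmax(best_per_freq))             # Eq. (3): arg max over frequencies
    c_idx = int(np.argmax(score[:, f_idx]))
    return freqs[f_idx], c_idx
```

A channel with a flat, noisy spectrum has a negative score everywhere, so a channel with one sharp, consistent peak wins the outer maximum even if its absolute power is lower.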

4 Experiments

We test the considered frequency estimation approaches on our tremor patient dataset, TIM-Tremor, containing a multitude of tasks. The anonymized TIM-Tremor patient data can be found at https://doi.org/10.4121/uuid:522d14ed-3019-4206-b49e-a4e674b6440a.

Fig. 4. Examples from the recording setup together with the predicted body joint locations using [36]. We use this to obtain the location of the hand where we estimate the tremor frequency.

Fig. 5. (a) Total number of videos recorded per task. (b) Average frequency and standard deviation for all tasks across all 55 patient recordings.

4.1 Patient Data Evaluation

Data Description. We recorded the TIM-Tremor dataset, in which 55 patients are videotaped sitting in a chair and performing a multitude of tasks. The data is recorded with a Kinect\(^{TM}\) v2 device, and it consists of short RGB videos of resolution \(1920 \times 1080\) px, and associated depth video recordings of \(512 \times 424\) px using a 16-bit encoding, as well as depth videos aligned with the RGB videos following the method in [16]. To reduce the storage requirements, we rescale the video resolution to \(960 \times 540\) px. The ground truth tremor frequency is measured on the wrist of the most affected hand (left or right). On this hand, during the recording, we position an accelerometer. The accelerometer recordings are included in the dataset. The hand on which the accelerometer is positioned is annotated in the dataset for each patient. Thus, for each patient and each performed task, we provide a set of recorded videos of approximately 1 min each, together with a corresponding aligned depth map video, and the ground truth accelerometer recording from the most affected hand.

Data collection occurred in parallel to the standard tremor clinical evaluation. The standard tremor evaluation consists of a set of 21 tasks, which are illustrated in Fig. 3.(a) and described in Fig. 3.(b). The tasks vary with respect to the adopted posture: e.g. arm supported by the arm rest, or held outstretched in front of the patient, the amount of motion involved: e.g. rest – no motion, or touching the top of the nose – intention-oriented motion, as well as the focus of attention: e.g. distraction by mental task. Changes in tremor frequency between these tasks are analyzed by the medical expert to classify the tremor. For example, certain types of tremor are present across most or all tasks (e.g. “Parkinsonian tremor”), while other types of tremor may only occur when performing a specific task (e.g. “postural tremor” occurs only when a patient maintains a specific posture such as Thumbs_up), while other tremors may show considerable variation in tremor frequency between tasks (e.g. “functional tremor”).

Figure 4 displays a few examples of the recording setup together with the estimated joint locations using [36]. In Fig. 5 we show the total number of videos recorded for each task, and the average hand tremor frequency, as estimated by the accelerometer, together with the standard deviation, computed across all 55 patients. The average tremor frequency is around 5 Hz, which is common in tremor disorders such as Parkinson's disease and dystonia.

Experimental Evaluation. We estimate the body pose in the videos using the method in [36], pretrained on the MPII dataset [1]. We apply the method at every frame. We use a temporal window, w(t), of 60 frames for frequency estimation. Unless stated otherwise, we evaluate our method in terms of MAE (Mean Absolute Error) with respect to the ground truth frequency detected by the accelerometer. We only evaluate on video segments in which a periodic tremor has been detected, using the accelerometer data.

4.2 Exp. 1: Design Choices

In this experiment we test individual choices in the considered Lagrangian and Eulerian approaches. For the Lagrangian approach we test in Exp 1.1 if removing the smooth trajectory, corresponding to the large motion of the hand, helps the frequency estimation. For the Eulerian approach, in Exp 1.2 we test the added value of computing hand-tremor frequency over the phase information, rather than using only the intensity values of the image.

Table 1. Exp 1.1: MAE when comparing the Lagrangian method with trajectory smoothing by using the Kalman tracker—Lag_with_smooth, versus not using trajectory smoothing, Lag_no_smooth. Lag_with_smooth performs slightly better than the default Lagrangian method, Lag_no_smooth. We highlight in bold the better performing method (lower is better).

Exp 1.1: The Need of Trajectory Smoothing. We experimentally compare two variants of the Lagrangian frequency estimation. The Lag_no_smooth variant uses raw hand trajectory points as computed by the pose estimation algorithm. The Lag_with_smooth variant removes the large motion of the hand obtained by subtracting the output of a Kalman tracker, which in effect retains only the small motions. The MAE numbers in Table 1 show that removing the large motion by using the Kalman tracker is beneficial to the overall performance. This is explained by the fact that subtracting the trajectory returned by the Kalman tracker from the original hand trajectory works as a data detrending step. This allows for the frequency to be estimated only over the small tremor motion.

Table 2. Exp 1.2: MAE showing the added value of the phase information. We compare the Euler_gray—Eulerian frequency estimation over grayscale hand-images, with Euler_phase – Eulerian frequency estimation over 12 phase-images and 1 grayscale image. Adding the 12 extra phase-images is beneficial for the frequency estimation. We highlight in bold the better performing method (lower is better).

Exp 1.2: The Added Value of Using Phase-Images. For both considered Eulerian approaches we stabilize the trajectory along which we measure the tremor by using the Kalman tracker, and subsequently perform the frequency estimation over the complete hand window. In Table 2 we test the added value of using phase information for frequency estimation. We compare two variants. The Euler_gray variant estimates the frequency over the pixels of grayscale hand-images, obtained by cropping the hand location along the smoothed trajectory of the hand. The Euler_phase variant adds the 12 phase channels as detailed in Sect. 3.3. The phase channels allow Euler_phase to capture the small motion corresponding to the tremor more precisely, because the phase is effective for describing motion. The MAE numbers in Table 2 validate that adding the phase information is beneficial for the hand-tremor frequency estimation.

Fig. 6. Exp 2: We report accuracy on all recorded tasks, over the 55 patient recordings (higher is better). We consider the hand-tremor frequency to be correctly estimated for a task if the MAE (Mean Absolute Error) for that task is lower than 1 Hz. We plot as a dotted green line the total number of videos recorded for each task, on which we have detected a periodic tremor. For each of our considered methods we show the number of videos for which the frequency was correctly estimated. On average the Eulerian methods perform better than the Lagrangian methods. (Color figure online)

4.3 Exp 2: Eulerian Versus Lagrangian Tremor Frequency Estimation

In Fig. 6 we display the accuracy of our proposed frequency estimation methods over the complete set of 55 patient recordings, for all tasks. We show as a dotted green line the number of videos per task where a periodic tremor was detected, according to the accelerometer data. In the corresponding color, we show the number of videos in which we have correctly estimated the hand-tremor frequency, for each frequency estimation method: Euler_phase is the Eulerian method using 12 phase-channels and 1 grayscale channel; Euler_gray is the Eulerian method on image intensity information only; Lag_no_smooth is the Lagrangian method without Kalman trajectory smoothing; Lag_with_smooth is the Lagrangian method with Kalman trajectory smoothing. We consider an estimated tremor frequency to be correct if the MAE between the accelerometer frequency and the one estimated by the method is lower than 1 Hz.
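The evaluation criterion can be written down in a few lines of plain Python. The data layout (one list of per-window estimates and one list of accelerometer frequencies per video) and the function names are assumptions for illustration; the numbers in the usage lines are made up and are not results.

```python
def mean_absolute_error(estimated_hz, reference_hz):
    """MAE between per-window frequency estimates and accelerometer frequencies."""
    pairs = list(zip(estimated_hz, reference_hz))
    return sum(abs(est - ref) for est, ref in pairs) / len(pairs)

def count_correct(videos, tol_hz=1.0):
    """Number of videos whose MAE is below the tolerance (1 Hz in the text).
    videos: iterable of (estimated_hz, reference_hz) pairs, one per video."""
    return sum(1 for est, ref in videos if mean_absolute_error(est, ref) < tol_hz)

# Illustrative values only.
videos = [
    ([5.0, 5.5, 4.5], [5.0, 5.0, 5.0]),   # small errors: counted as correct
    ([9.0, 8.5, 9.5], [5.0, 5.0, 5.0]),   # large errors: counted as incorrect
]
print(count_correct(videos))
```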

Fig. 7. Exp 2: The MAE per patient, for the two tasks where the Eulerian methods performed better than the Lagrangian methods (lower is better). To avoid over-cluttering the image, we plot only the best performing Lagrangian method, Lag_with_smooth (Lagrangian method with Kalman trajectory smoothing), and the best performing Eulerian method, Euler_phase (Eulerian method using 12 phase channels and 1 grayscale channel). The Lagrangian method makes large frequency estimation mistakes on a few patient videos, while the Eulerian method is more precise on some of the patient videos. (Note: for certain patients the task has not been recorded or no stable frequency, according to the accelerometer, has been found.)

Figure 6 shows that on average the Eulerian frequency estimation methods are more precise than the Lagrangian methods. The gain of using the Eulerian approaches is especially clear for the Weight task and the Hands_in_pronation task. Figure 7 displays the MAE scores per patient for these two tasks. To avoid over-cluttering the image, we only show the best Lagrangian method: Lag_with_smooth, Lagrangian with Kalman trajectory smoothing, and the best Eulerian method: Euler_phase, Eulerian over 12 phase channels and 1 grayscale channel. The Eulerian method gives more precise frequency estimates for some of the patient recordings, while for others it performs similarly to the Lagrangian method. These two tasks are not characterized by large hand motion. The gain of the Eulerian method over the Lagrangian is explained by the Eulerian method better describing the subtle changes in image information over time at the hand location. Therefore, the Eulerian method more accurately captures the tremor in tasks that do not involve large hand motion, but exhibit small motion.

5 Conclusions

We consider the task of hand-tremor frequency estimation from RGB videos. We propose two different approaches for measuring human hand-tremor frequencies: (a) Lagrangian hand-tremor frequency estimation, using the trajectory of the hand motion in the image plane throughout the video, to assess the hand-tremor frequency; and (b) Eulerian hand-tremor frequency estimation, which measures the change in the image information over time, at the location of the hand in the image plane. We experimentally evaluate two variants of each approach on our proposed TIM-Tremor dataset containing 55 patient recordings performing a multitude of tasks. From our experimental analysis we learned that the Eulerian approaches are more accurate on average than the Lagrangian methods, with the difference being substantial on tasks in which there is a limited amount of large hand motion, but where a small hand-tremor motion is present.