
1 Introduction

Unmanned aerial vehicles (UAVs) can be deployed in a variety of applications such as search and rescue, situational awareness, surveillance and police pursuit by leveraging their mobility and operational simplicity. In some situations, a UAV’s ability to recognize the commanding actions of the human operator and to take responsive actions is desirable. Such scenarios might include a firefighter commanding a drone to scan a particular area, a lifeguard directing a drone to monitor a drifting kayaker, or more user-friendly video and photo shooting capabilities. Whether for offline gesture recognition from aerial videos or for equipping UAVs with gesture recognition capabilities, a substantial amount of training data is necessary. However, the majority of the video action recognition datasets consist of ground videos recorded from stationary or dynamic cameras [15].

Several video datasets recorded from moving and stationary aerial cameras have been published in recent years [6, 15]. They were recorded under different camera and platform settings and have limitations with respect to the wide range of human action recognition tasks demanded today, so aerial action recognition is still far from mature. In general, the existing aerial video action datasets lack the detailed human body shapes needed by state-of-the-art action recognition algorithms, many of which depend on accurate analysis of human body joints or the body frame. It is difficult to use the existing aerial datasets for aerial action or gesture recognition due to one or more of the following reasons: (i) severe perspective distortion – a camera elevation angle close to \(90^\circ \) results in a severely distorted body shape with a large head and shoulders and most of the other body parts occluded; (ii) low resolution makes it difficult to retrieve human body and texture details; (iii) motion blur caused by rapid variations of the elevation and pan angles or by the movement of the platform; and (iv) camera vibration caused by the engine or the rotors of the UAV.

We introduce a gesture dataset recorded from a low-altitude, slow-flying mobile platform. The dataset was created with the intention of capturing full human body details from a relatively low altitude in a way that preserves the maximum detail of the body position. Our dataset is suitable for research involving search and rescue, situational awareness, surveillance, and general action recognition. We assume that in most practical missions, the UAV operator or an autonomous UAV follows these general rules: (i) it does not fly so low that it poses a danger to civilians, ground-based structures, or itself; (ii) it does not fly so high or so fast that it loses too much detail in the images it captures; (iii) it hovers to capture the details of an interesting scene; and (iv) it records human subjects from a viewpoint that causes minimum perspective distortion and captures maximum body detail. Our dataset was created following these guidelines and represents 13 command gesture classes. The gestures were selected from general aircraft handling and helicopter handling signals [32]. All the videos were recorded at high-definition (HD) resolution, enabling the gesture videos to be used in general gesture recognition and gesture-based autonomous system control research. To our knowledge, this is the first dataset presenting gestures captured from a moving aerial camera in an outdoor setting.

2 Related Work

A complete list and description of recently published action recognition datasets is available in [6, 15], and gesture recognition datasets can be found in [21, 25]. Here, we discuss some selected studies related to our work.

Detecting human action from an aerial view is more challenging than from a fronto-parallel view. Created by Oh et al. [18], the large-scale VIRAT dataset contains about 550 videos, recorded from static and moving cameras covering 23 event types over 29 h. The VIRAT ground dataset has been recorded from stationary aerial cameras (e.g., overhead mounted surveillance cameras) at multiple locations with resolutions of 1080 \(\times \) 1920 and 720 \(\times \) 1280. Both aerial and ground-based datasets have been recorded in uncontrolled and cluttered backgrounds. However, in the VIRAT aerial dataset, the low resolution of 480 \(\times \) 720 precludes retrieval of rich activity information from relatively small human subjects.

A 4K-resolution video dataset called Okutama-Action was introduced in [1] for concurrent action detection by multiple subjects. The videos were recorded in a relatively clutter-free baseball field using two UAVs. The dataset covers 12 actions recorded under abrupt camera movements, at altitudes from 10 to 45 m, and from different view angles. However, the camera elevation angle of 90\(^\circ \) causes severe perspective distortion and self-occlusion in the videos.

Other notable aerial action datasets are UCF Aerial Action [30], UCF-ARG [31] and Mini-drone [2]. UCF Aerial Action and UCF-ARG were recorded using an R/C-controlled blimp and a helium balloon, respectively. Both datasets contain similar action classes; however, UCF Aerial Action is a single-view dataset while UCF-ARG is a multi-view dataset recorded from aerial, rooftop and ground cameras. The Mini-drone dataset was developed as a surveillance dataset to evaluate different aspects and definitions of privacy. It was recorded in a car park using a drone flying at a low altitude, and the actions are categorized as normal, suspicious and illicit behaviors.

Gesture recognition has been studied extensively in recent years [21, 25]. However, the gesture-based UAV control studies available in the literature are mostly limited to indoor environments or static gestures [10, 16, 19], restricting their applicability to real-world scenarios. The datasets used for these works were mostly recorded indoors using RGB-D images [13, 24, 27] or RGB images [5, 17]. An aircraft handling signal dataset similar to ours in terms of gesture classes is available in [28]. It has been created using VICON cameras and a stereo camera with a static indoor background. However, these gesture datasets cannot be used in aerial gesture studies. We selected some gesture classes from [28] when creating our dataset.

3 Preparing the Dataset

This section discusses the collection process of the dataset, the types of gestures recorded in the dataset, and the usefulness of the dataset for vision-related research purposes.

3.1 Data Collection

The data was collected from a rotorcraft UAV (3DR Solo) in slow, low-altitude flight over an unsettled road located in the middle of a wheat field. For video recording, we used a GoPro Hero 4 Black camera with an anti-fisheye replacement lens (5.4 mm, 10 MP, IR CUT) and a 3-axis Solo gimbal. We provide the videos in HD (\(1920\times 1080\)) format at 25 fps. The gestures were recorded on two separate days. The participants were asked to perform the gestures in a selected section of the road. A total of 13 gestures were recorded while the UAV was hovering in front of the subject. In these videos, the subject is roughly in the middle of the frame and performs each gesture five to ten times.

When recording the gestures, the UAV sometimes drifted from its initial hovering position due to wind gusts. This adds random camera motion to the videos, making them closer to practical scenarios.

3.2 Gesture Selection

The gestures were selected from the general aircraft handling signals and helicopter handling signals available in the Aircraft Signals NATOPS manual [32, Ch. 2–3]. The selected 13 gestures are shown in Fig. 1. When selecting the gestures, we avoided aircraft- and helicopter-specific gestures. The gestures were selected to meet the following criteria: (i) they should be easily identifiable from a moving platform, (ii) they need to be crisp enough to be differentiated from one another, (iii) they need to be simple enough to be repeated by an untrained individual, (iv) they should be applicable to basic UAV navigation control, and (v) the selected gestures should be a mixture of static and dynamic gestures to enable other possible applications such as taking “selfies”.

Fig. 1. The selected thirteen gestures are shown with one selected image from each gesture. The arrows indicate the hand movement directions. The amber color markers roughly designate the start and end positions of the palm for one repetition. The Hover and Land gestures are static gestures.

3.3 Variations in Data

The actors who participated in this dataset are not professionals in aircraft handling signals. They were shown how to perform a particular gesture by another person standing in front of them, and then asked to do the same towards the UAV. Therefore, each actor performed the gestures slightly differently. There are rich variations in the recorded gestures in terms of phase, orientation, camera movement and the body shape of the actors. In some videos, the skin color of the actor is close to the background color. These variations create a challenging dataset for gesture recognition and also make it more representative of real-world situations.

The dataset was recorded on two separate days and involved a total of eight participants. Two participants performed the same gestures on both days. For a particular gesture performed by a participant in the two settings, the two videos differ significantly in background, clothing, camera-to-subject distance and natural variations in hand movement. Because of these visual variations, we treat the two appearances of each repeating participant as separate actors and consider the total number of actors to be 10.

3.4 Dataset Annotations

We used an extended version of the online video annotation tool VATIC [33] to annotate the videos. Thirteen body joints are annotated in 37151 frames, namely the ankles, knees, hip joints, wrists, elbows, shoulders and the head. Two annotated images are shown in Fig. 2. Each annotation also includes the gesture class, the subject identity and a bounding box. The bounding box is created by adding a margin to the minimum and maximum coordinates of the joint annotations in both the x and y directions.

Fig. 2. Examples of body joint annotations. The image on the left is from the Move to left class, whereas the image on the right is from the Wave off class.
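
To make the bounding-box construction concrete, the following is a minimal sketch of the procedure described above; the joint coordinate format, the margin value, and the clamping to the frame borders are illustrative assumptions rather than the exact settings used for the dataset.

```python
# Minimal sketch: take the minimum and maximum x and y coordinates of the
# annotated joints and add a margin. The margin value and the clamping to the
# frame borders are illustrative assumptions.
from typing import List, Tuple

def bbox_from_joints(joints: List[Tuple[float, float]],
                     margin: float = 10.0,
                     frame_w: int = 1920,
                     frame_h: int = 1080) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) enclosing all joints plus a margin."""
    xs = [x for x, _ in joints]
    ys = [y for _, y in joints]
    x_min = max(min(xs) - margin, 0.0)
    y_min = max(min(ys) - margin, 0.0)
    x_max = min(max(xs) + margin, float(frame_w - 1))
    y_max = min(max(ys) + margin, float(frame_h - 1))
    return x_min, y_min, x_max, y_max
```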

3.5 Dataset Summary

The dataset contains a total of 37151 frames distributed over 119 video clips recorded at 25 fps and \(1920 \times 1080\) resolution. All the frames are annotated with gesture classes and body joints. There are 10 actors in the dataset, and they perform 5–10 repetitions of each gesture. Each gesture clip lasts about 12.5 s on average. A summary of the dataset is given in Table 1. The total clip length (blue bars) and mean clip length (amber bars) for each class are shown in Fig. 3.
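
As a rough consistency check derived only from the totals reported above, the frame count and frame rate imply

$$\begin{aligned} \text{total duration}&\approx \frac{37151\ \text{frames}}{25\ \text{fps}} \approx 1486\ \text{s} \approx 24.8\ \text{min},\\ \text{mean clip length}&\approx \frac{1486\ \text{s}}{119\ \text{clips}} \approx 12.5\ \text{s}, \end{aligned}$$

which is consistent with the per-class statistics in Fig. 3 and the total duration reported in the conclusion.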

In Table 2, we compare our dataset with eight recently published video datasets. These datasets have helped to advance research in action recognition, gesture recognition, event recognition and object tracking. The closest dataset in terms of class types and purpose is the NATOPS aircraft signals dataset, which was created using 24 selected gestures.

Fig. 3. The total clip length (blue) and the mean clip length (amber) are shown in the same graph in seconds. Note that the former is one order of magnitude higher than the latter.

Table 1. A summary of the dataset.
Table 2. Comparison with recently published video datasets.

4 Experimental Results

We performed an experiment on the dataset using Pose-based Convolutional Neural Network (P-CNN) descriptors [9]. A P-CNN descriptor aggregates motion and appearance information along tracks of human body parts (right hand, left hand, upper body and full body). The P-CNN descriptor was originally introduced for action recognition; since our dataset contains gestures performed with full-body poses, P-CNN is also a suitable method for full-body gesture recognition. In P-CNN, the body-part patches of the input image are extracted using the estimated human pose and the corresponding body parts. For body joint estimation, we used the state-of-the-art OpenPose [4] pose estimator, which is an extension of Convolutional Pose Machines [34]. As in the original P-CNN implementation, the optical flow for each consecutive pair of frames was computed using the method of Brox et al. [3].
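
The sketch below illustrates how body-part patches can be cropped from the 2D joints returned by a pose estimator such as OpenPose; it is not the original P-CNN implementation, and the joint names, part definitions, and fixed crop size are assumptions made for illustration.

```python
# Minimal sketch (not the original P-CNN code) of cropping body-part patches
# from 2D joints estimated by a pose estimator. Joint names, part definitions
# and the fixed crop size are illustrative assumptions.
import numpy as np

PARTS = {
    "right_hand": ["right_wrist"],
    "left_hand":  ["left_wrist"],
    "upper_body": ["head", "left_shoulder", "right_shoulder",
                   "left_elbow", "right_elbow", "left_wrist", "right_wrist"],
    "full_body":  None,  # None means: use every detected joint
}

def crop_part(frame: np.ndarray, joints: dict, part: str, size: int = 128) -> np.ndarray:
    """Crop a square patch centred on the joints defining `part`.

    frame:  H x W x C image (RGB or optical flow) as a NumPy array.
    joints: mapping from joint name to (x, y) pixel coordinates.
    """
    names = PARTS[part] or list(joints.keys())
    pts = np.array([joints[n] for n in names if n in joints], dtype=float)
    cx, cy = pts.mean(axis=0)
    h, w = frame.shape[:2]
    half = size // 2
    x0, x1 = int(max(cx - half, 0)), int(min(cx + half, w))
    y0, y1 = int(max(cy - half, 0)), int(min(cy + half, h))
    return frame[y0:y1, x0:x1]
```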

Fig. 4. The P-CNN feature descriptor [9]: the steps shown in the diagram correspond to an example P-CNN computation for the left-hand body part.

A diagram showing P-CNN feature extraction is given in Fig. 4. For each body part and for the full image, appearance (RGB) and optical flow patches are extracted and their CNN features are computed using two pre-trained networks. For appearance patches, the publicly available “VGG-f” network [7] is used, whereas for optical flow patches, the motion network from Gkioxari and Malik’s Action Tube implementation [12] is used. Static and dynamic features are separately aggregated over time to obtain a static video descriptor \(v_{stat}\) and a dynamic video descriptor \(v_{dyn}\), respectively. The static features are (i) the distances between body joints, (ii) the orientations of the vectors connecting pairs of joints, and (iii) the inner angles spanned by vectors connecting all triplets of joints. The dynamic features are computed from the trajectories of body joints. We select the Min and Max aggregation schemes because of their higher accuracy compared with other schemes when used with P-CNN [9] on the JHMDB dataset [14] for action recognition. The Min and Max aggregation schemes compute, respectively, the minimum and maximum values of each descriptor dimension over all video frames. The static and dynamic video descriptors can be defined as

$$\begin{aligned} v_{stat}&= [m_1, \ldots , m_k, M_1, \ldots , M_k]^\top , \qquad &(1)\\ v_{dyn}&= [\varDelta m_1, \ldots , \varDelta m_k, \varDelta M_1, \ldots , \varDelta M_k]^\top , \qquad &(2) \end{aligned}$$

where m and M correspond to the minimum and maximum values of each video descriptor dimension \(1,\ldots ,k\), and \(\varDelta \) represents temporal differences of the video descriptors. The aggregated features (\(v_{stat}\) and \(v_{dyn}\)) are normalized and concatenated over the body parts to obtain the appearance features \(v_{app}\) and the flow features \(v_{of}\). The final P-CNN descriptor is obtained by concatenating \(v_{app}\) and \(v_{of}\).
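
A minimal sketch of Eqs. (1) and (2) is given below, assuming the per-frame features of one body part and one stream (appearance or flow) are stacked in a (frames \(\times\) k) array; the temporal offset used for the differences is an assumed parameter, and the normalization and concatenation over body parts and streams described above are omitted.

```python
# Sketch of the Min/Max aggregation in Eqs. (1) and (2) for one body part and
# one stream. frame_feats is assumed to be a (num_frames, k) array of
# per-frame CNN features; dt is an assumed temporal offset for the
# difference (dynamic) features.
import numpy as np

def video_descriptor(frame_feats: np.ndarray, dt: int = 4) -> np.ndarray:
    """Return the concatenation [v_stat; v_dyn] for one part and one stream."""
    # Eq. (1): per-dimension minima m_1..m_k and maxima M_1..M_k over frames.
    v_stat = np.concatenate([frame_feats.min(axis=0), frame_feats.max(axis=0)])
    # Temporal differences of the per-frame features.
    diffs = frame_feats[dt:] - frame_feats[:-dt]
    # Eq. (2): min/max aggregation of the temporal differences.
    v_dyn = np.concatenate([diffs.min(axis=0), diffs.max(axis=0)])
    return np.concatenate([v_stat, v_dyn])
```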

The evaluation metric selected for the experiment is accuracy, calculated from the scores returned by the action classifiers. There are three training and testing splits for the UAV-GESTURE dataset. In Table 3, the mean accuracy over these splits is compared with the evaluation results reported in [9] for the JHMDB [14] and MPII Cooking [23] datasets. For the JHMDB and MPII Cooking datasets, the poses are estimated using the pose estimator described in [8]. However, we use OpenPose [4] for UAV-GESTURE because OpenPose has been used as the body joint detector in notable pose-based action recognition studies [11, 20, 35] and reportedly has the best performance [4].
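
For clarity, the metric can be sketched as follows, assuming the classifiers return one score per gesture class for each test video; the array shapes are assumptions, and this is an illustrative sketch rather than the exact evaluation script.

```python
# Illustrative sketch of the evaluation: per-split accuracy computed from the
# classifier scores (predicted class = argmax of the scores), then averaged
# over the three train/test splits. Array shapes are assumptions.
import numpy as np

def split_accuracy(scores: np.ndarray, labels: np.ndarray) -> float:
    """scores: (num_test_videos, num_classes); labels: (num_test_videos,)."""
    return float((scores.argmax(axis=1) == labels).mean())

def mean_accuracy(splits) -> float:
    """splits: iterable of (scores, labels) pairs, one pair per split."""
    return float(np.mean([split_accuracy(s, l) for s, l in splits]))
```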

Table 3. The best reported P-CNN action recognition results for different datasets.

5 Conclusion

We presented a gesture dataset recorded by a hovering UAV. The dataset contains 119 HD videos lasting a total of 24.78 min. The dataset was prepared using 13 gestures selected from the set of general aircraft handling and helicopter handling signals. The gestures were recorded from 10 participants in an outdoor setting. The rich variation in body size, camera motion, and phase makes our dataset challenging for gesture recognition. The dataset is annotated with human body joints and gesture classes to extend its applicability to a wider research community. We evaluated this new dataset using P-CNN descriptors and reported an overall baseline action recognition accuracy of 91.9%. This dataset is useful for research involving gesture-based control of unmanned aerial and ground vehicles, situational awareness, general gesture recognition, and general action recognition. The UAV-GESTURE dataset is available at https://asankagp.github.io/uavgesture/.