1 Introduction

Compared to single-person human pose estimation, where human candidates are cropped and centered in the image patch, multi-person human pose estimation is more realistic and challenging. Existing methods can be classified into top-down and bottom-up approaches. The top-down approach [8, 16] relies on a detection module to obtain human candidates and then applies a single-person pose estimator to each candidate to locate its keypoints. The bottom-up approach [2, 9, 14, 18], on the other hand, detects keypoints of all potential human candidates in the image and then assembles these keypoints into limbs of individual persons using various data association techniques. The advantage of bottom-up approaches is their excellent trade-off between estimation accuracy and computational cost, because the cost is invariant to the number of human candidates in the image. The main advantage of top-down approaches, in contrast, is that they decompose the task into two comparatively easier sub-tasks, i.e., object detection and single-person pose estimation. The object detector excels at finding hard (usually small) candidates, so the pose estimator can operate on a focused regression space and perform better.

Pose tracking is the task of estimating human keypoints and assigning unique instance-level ids to them across frames in a video. In videos with multiple people, accurate trajectory estimation of human keypoints is useful for human action recognition and human interaction understanding. PoseTrack [12] and ArtTrack [11] first introduced the multi-person pose tracking challenge and proposed a graph partitioning formulation, which transforms the pose tracking problem into a minimum cost multi-cut problem. However, hand-crafted graphical models are not scalable to long and unseen clips. Another line of research follows the top-down approach [6, 19, 20]: multi-person pose estimation is performed on each frame, and the resulting poses are linked across frames based on appearance similarities and temporal adjacencies. A naive solution is to apply multi-target object tracking to the human detection candidates across frames and then estimate human poses within each human tubelet. While this is feasible, it neglects the unique attributes of keypoints: compared to the tracked bounding boxes alone, keypoints can provide helpful cues for tracking both the boxes and the keypoints themselves. The tracker of 3D Mask R-CNN [6] simplifies the pose tracking problem to a maximum weight bipartite matching problem and solves it with a greedy or Hungarian algorithm. PoseFlow [20] further takes motion and pose information into account to address the issue of occasionally truncated human candidates.

2 Our Approach

We follow the top-down approach for pose tracking, i.e., we perform human candidate detection, single-person pose estimation, and pose tracking step by step. The details of each module are described below.

2.1 Detection Module

We adopt state-of-the-art object detectors trained on the ImageNet and COCO datasets. Specifically, we use pre-trained models from Deformable ConvNets [5]. In order to increase the recall rate of human candidates, we conduct experiments on the validation sets of both PoseTrack 2017 [1] and PoseTrack 2018 to choose the best object detector. First, we infer ground truth bounding boxes of human candidates from the annotated keypoints, because the PoseTrack 2017 dataset does not provide bounding box positions in its annotations. Specifically, we take the bounding box spanned by the minimum and maximum coordinates of the 15 keypoints and then enlarge it by 20% both horizontally and vertically. Even though ground truth bounding boxes are given in the PoseTrack 2018 dataset, we infer a more consistent version from the ground truth keypoint locations in the same way. These inferred ground truth bounding boxes are used to train the pose estimator.
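As a concrete illustration, the box inference described above can be sketched as follows. This is a minimal sketch under two assumptions of ours: the 20% enlargement is split evenly between the two sides of each dimension, and keypoints are stored as a (K, 3) array of (x, y, visibility), which is not necessarily the exact PoseTrack annotation format.

```python
import numpy as np

def keypoints_to_bbox(keypoints, enlarge=0.2):
    """Infer a person bounding box from annotated keypoints.

    keypoints: (K, 3) array of (x, y, visibility); only visible points are used.
    Returns (x_min, y_min, x_max, y_max) enlarged by `enlarge` in each dimension.
    """
    visible = keypoints[keypoints[:, 2] > 0, :2]
    if visible.size == 0:
        return None
    x_min, y_min = visible.min(axis=0)
    x_max, y_max = visible.max(axis=0)
    w, h = x_max - x_min, y_max - y_min
    # Enlarge the tight box by 20% horizontally and vertically
    # (assumed here to be 10% on each side).
    x_min -= 0.5 * enlarge * w
    x_max += 0.5 * enlarge * w
    y_min -= 0.5 * enlarge * h
    y_max += 0.5 * enlarge * h
    return np.array([x_min, y_min, x_max, y_max])
```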

For the object detectors, we compare the deformable convolution versions of the R-FCN network [4] and the FPN network [13], both with a ResNet101 backbone [10]. The FPN feature extractor is attached to a Fast R-CNN [7] head for detection. We compare the detection results against the ground truth in terms of precision and recall on the PoseTrack 2017 validation set. In order to eliminate redundant candidates, we drop candidates whose detection confidence falls below a threshold. Table 1 reports the precision and recall of both detectors for various drop thresholds. On the PoseTrack 2018 validation set, the FPN network performs better as well. Therefore, we choose the FPN network as our human candidate detector.
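For reference, a minimal sketch of the precision/recall computation against ground truth boxes, using greedy IoU matching at the 0.4 threshold from Table 1, is given below; the greedy matching order and the helper names are our own assumptions rather than the evaluation code used for the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_recall(detections, scores, gt_boxes, score_thresh, iou_thresh=0.4):
    """Keep detections above score_thresh, greedily match them to GT at IoU >= iou_thresh."""
    kept = [d for d, s in zip(detections, scores) if s >= score_thresh]
    matched_gt, tp = set(), 0
    for det in kept:
        best_j, best_iou = -1, iou_thresh
        for j, gt in enumerate(gt_boxes):
            if j in matched_gt:
                continue
            o = iou(det, gt)
            if o >= best_iou:
                best_j, best_iou = j, o
        if best_j >= 0:
            matched_gt.add(best_j)
            tp += 1
    precision = tp / max(len(kept), 1)
    recall = tp / max(len(gt_boxes), 1)
    return precision, recall
```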

Table 1. Precision-Recall on PoseTrack 2017 validation set. A bounding box is counted as correct if its IoU with the GT is above a certain threshold, which is set to 0.4 for all experiments.
Table 2. Comparison of single-frame pose estimation results using various detectors on PoseTrack 2017 validation set.
Table 3. Comparison of multi-frame pose tracking results using various detectors on PoseTrack 2017 validation set.

The upper bound for detection is the ground truth bounding box locations. In order to measure the gap between ideal detections and ours, we feed the ground truth bounding boxes into the subsequent pose estimation and tracking modules and compare the resulting performance with that obtained using our detector on the validation set. As shown in Table 2, pose estimation performs around 7% better with ground truth detections; as shown in Table 3, pose tracking performs around 6% better.

With ResNet152 as the backbone, and with detectors trained solely on the human class, e.g., on the CrowdHuman [17] dataset, we believe the detection module could yield better results. For the challenge, we simply adopt the deformable FPN with ResNet101 and use its pre-trained model for simplicity.

2.2 Pose Estimation Module

For the single-person pose estimator, we adopt Cascaded Pyramid Networks (CPN) [3] with slight modifications. We first train the CPN network on the merged PoseTrack 2018 and COCO dataset for 260 epochs, and then finetune it solely on the PoseTrack 2018 training set for 40 epochs in order to improve the regression of the head keypoints. The COCO dataset does not annotate the bottom-head and top-head positions, so we infer these keypoints by rough interpolation from the annotated keypoints. We find that finetuning on the PoseTrack dataset refines the prediction of head keypoints. During finetuning, we use online hard keypoint mining, focusing only on the losses of the 7 hardest keypoints out of the total 15 keypoints.
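A minimal sketch of the online hard keypoint mining step is shown below, assuming a per-joint MSE heatmap loss in PyTorch; the exact loss formulation in CPN may differ from this simplified version.

```python
import torch

def ohkm_loss(pred_heatmaps, gt_heatmaps, top_k=7):
    """Online hard keypoint mining: keep only the top-k per-joint losses.

    pred_heatmaps, gt_heatmaps: tensors of shape (N, K, H, W), K = 15 joints.
    """
    # Per-joint MSE, averaged over the spatial dimensions: shape (N, K).
    per_joint = ((pred_heatmaps - gt_heatmaps) ** 2).mean(dim=(2, 3))
    # For each sample, keep the top_k hardest (largest-loss) joints.
    topk_vals, _ = per_joint.topk(top_k, dim=1)
    return topk_vals.mean()
```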

In our implementation, we perform non-maximum suppression (NMS) on the bounding boxes in the detection phase and run pose estimation on all candidates returned by the detection module. For each candidate, we post-process the predicted heatmaps with cross-heatmap pose NMS [15] to obtain more accurate keypoint locations. We do not perform flip testing, although it might bring a slight improvement. During testing, we use an ensemble of two models, from epochs 291 and 293, and observe a slight performance boost from the ensemble. On the validation sets of both PoseTrack 2017 and PoseTrack 2018, the epoch-291 model predicts shoulders and hips better than the epoch-293 model, whereas the epoch-293 model performs better on end limbs such as ankles and wrists. We test two ensemble modes: (1) Average and (2) Expert. As shown in Table 4, the expert mode, which takes the shoulder/hip predictions from the former model and the end-limb predictions from the latter, performs consistently better on both the PoseTrack 2017 and PoseTrack 2018 validation sets. Both modes outperform plain single-model testing on the pose estimation task.
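To make the two ensemble modes concrete, a possible heatmap-level combination is sketched below. The joint indices are illustrative placeholders rather than the official PoseTrack ordering, and averaging the joints outside the two expert groups is our assumption, since the text only specifies where the shoulder/hip and end-limb predictions come from.

```python
import numpy as np

# Assumed joint indices for illustration only (not the official PoseTrack layout).
SHOULDER_HIP = [2, 3, 8, 9]     # shoulders and hips
END_LIMBS = [6, 7, 12, 13]      # wrists and ankles

def ensemble_heatmaps(hm_a, hm_b, mode="expert"):
    """Combine heatmaps from two models, each of shape (K, H, W).

    'average' averages all joints; 'expert' takes shoulders/hips from model A,
    end limbs from model B, and averages the remaining joints (an assumption).
    """
    if mode == "average":
        return 0.5 * (hm_a + hm_b)
    out = 0.5 * (hm_a + hm_b)
    out[SHOULDER_HIP] = hm_a[SHOULDER_HIP]
    out[END_LIMBS] = hm_b[END_LIMBS]
    return out
```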

Table 4. Comparison of single-frame pose estimation results with different ensemble modes on PoseTrack 2017 validation set.
Fig. 1. Our modular system for pose tracking. From top to bottom: we perform human candidate detection, pose estimation, and pose tracking sequentially.

2.3 Pose Tracking Module

We adopt a flow-based pose tracker [20], in which pose flows are built by associating poses that belong to the same person across frames. We start the tracking process from the first frame in which human candidates are detected. To prevent IDs from being assigned to persons who have already left the visible image area, an ID is only kept for a limited number of frames and is discarded afterwards. The pose tracking task is evaluated with MOTA, a strict metric that penalizes mismatches, false positives, and misses. To obtain higher MOTA results, we need to drop keypoints with lower confidence scores, sacrificing the recall rate of correct keypoints. We find the MOTA criterion quite sensitive to the drop threshold of keypoints, as shown in Table 5. The overall pipeline is illustrated in Fig. 1.
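The actual pose-flow construction follows [20]; as a simplified, hypothetical illustration of the frame-to-frame association step, the sketch below uses Hungarian matching with a crude pose similarity. The similarity definition and all names here are our own assumptions, not the PoseFlow implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pose_similarity(pose_a, pose_b):
    """Crude similarity between two poses of shape (K, 3) = (x, y, score):
    a Gaussian of keypoint distances, normalized by the rough scale of pose_a."""
    valid = (pose_a[:, 2] > 0) & (pose_b[:, 2] > 0)
    if not valid.any():
        return 0.0
    span = np.ptp(pose_a[valid, :2], axis=0).max() + 1e-6  # rough person scale
    d = np.linalg.norm(pose_a[valid, :2] - pose_b[valid, :2], axis=1) / span
    return float(np.exp(-d).mean())

def associate(prev_poses, curr_poses, min_sim=0.3):
    """Hungarian matching between consecutive frames; returns (prev_idx, curr_idx) pairs."""
    if not prev_poses or not curr_poses:
        return []
    cost = np.array([[1.0 - pose_similarity(p, c) for c in curr_poses]
                     for p in prev_poses])
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= min_sim]
```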

Table 5. Sensitivity analysis on how the drop thresholds of keypoints affect the performance in AP and MOTA. Performed on PoseTrack 2018 validation set.

Considering the distinct difficulties of different keypoints, e.g., shoulders are easier to localize than ankles, the confidence distribution is unlikely to be uniform across joints. Dropping keypoints solely based on the confidence estimated by the pose estimator may therefore not be an ideal strategy for pose tracking. We collect statistics on the drop rates of keypoints from different joints, as shown in Table 6. From left to right, the keypoints become increasingly difficult to estimate, as reflected by their respective preservation rates; the least and most difficult joints are the shoulders and the ankles, respectively. In other words, the pose estimator is most confident about the shoulders and least confident about the ankles. An adaptive keypoint pruner may help increase MOTA while maintaining a high recall rate.
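One possible form of such an adaptive pruner is sketched below: it calibrates a separate drop threshold per joint from validation-set confidence statistics, so that easy joints (shoulders) receive a higher absolute cutoff than hard joints (ankles). The quantile-based calibration is our assumption and not a method evaluated in the paper.

```python
import numpy as np

def per_joint_thresholds(all_scores, keep_quantile=0.6):
    """Estimate a per-joint drop threshold from validation-set confidences.

    all_scores: (N, K) matrix of keypoint confidences for N predicted poses.
    Keeping the same quantile for every joint yields joint-specific cutoffs
    instead of a single global one.
    """
    return np.quantile(all_scores, 1.0 - keep_quantile, axis=0)

def prune(pose, thresholds):
    """Zero out keypoints whose confidence falls below their joint-specific threshold."""
    pose = pose.copy()
    drop = pose[:, 2] < thresholds
    pose[drop, 2] = 0.0
    return pose
```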

Table 6. Statistical analysis of the drop rates of keypoints under different drop thresholds, performed on the PoseTrack 2018 validation set. The numbers indicate the percentage of keypoints kept after pruning.

3 Challenge Results

Our final performance on the partial test set of PoseTrack 2018 is given in Tables 7 and 8.

Table 7. Our single-frame pose estimation results on PoseTrack 2018 partial test set
Table 8. Our multi-frame pose tracking results on PoseTrack 2018 partial test set

4 Conclusion

In this paper, we build a modular system that aims to reach state-of-the-art performance in human pose estimation and tracking. The system consists of three modules, which conduct human candidate detection, pose estimation, and pose tracking, respectively. We have analyzed each module with ablation studies on various models and configurations and discussed their pros and cons. We present the performance of our system in the pose estimation and pose tracking challenges of PoseTrack 2018.