1 Introduction

The rapid recent progress in computer vision has allowed the community to move beyond classic tasks such as bounding box-level face and body detection towards more detailed visual understanding of people in unconstrained environments. In this work we tackle in a unified manner the tasks of multi-person detection, 2-D pose estimation, and instance segmentation. Given a potentially cluttered and crowded ‘in-the-wild’ image, our goal is to identify every person instance, localize its facial and body keypoints, and estimate its instance segmentation mask. A host of computer vision applications such as smart photo editing, person and activity recognition, virtual or augmented reality, and robotics can benefit from progress in these challenging tasks.

There are two main approaches for tackling multi-person detection, pose estimation and segmentation. The top-down approach starts by identifying and roughly localizing individual person instances by means of a bounding box object detector, followed by single-person pose estimation or binary foreground/ background segmentation in the region inside the bounding box. By contrast, the bottom-up approach starts by localizing identity-free semantic entities (individual keypoint proposals or semantic person segmentation labels, respectively), followed by grouping them into person instances. In this paper, we adopt the latter approach. We develop a box-free fully convolutional system whose computational cost is essentially independent of the number of people present in the scene and only depends on the cost of the CNN feature extraction backbone.

In particular, our approach first predicts all keypoints for every person in the image in a fully convolutional way. We also learn to predict the relative displacement between each pair of keypoints, proposing a novel recurrent scheme which greatly improves the accuracy of long-range predictions. Once we have localized the keypoints, we use a greedy decoding process to group them into instances. Our approach starts from the most confident detection, as opposed to always starting from a distinguished landmark such as the nose, so it works well even in clutter.

In addition to predicting the sparse keypoints, our system also predicts dense instance segmentation masks for each person. For this purpose, we train our network to predict instance-agnostic semantic person segmentation maps. For every person pixel we also predict offset vectors to each of the K keypoints of the corresponding person instance. The corresponding vector fields can be thought of as a geometric embedding representation and induce basins of attraction around each person instance, leading to an efficient association algorithm: For each pixel \(x_i\), we predict the locations of all K keypoints for the corresponding person that \(x_i\) belongs to; we then compare this to all candidate detected people j (in terms of average keypoint distance), weighted by the keypoint detection probability; if this distance is low enough, we assign pixel i to person j.

We train our model on the standard COCO keypoint dataset [1], which annotates multiple people with 12 body and 5 facial keypoints. We significantly outperform the best previous bottom-up approach to keypoint localization [2], improving the keypoint AP from 0.655 to 0.687. In addition, we are the first bottom-up method to report competitive results on the person class for the COCO instance segmentation task. We get a mask AP of 0.417, outperforming the strong top-down FCIS method of [3], which gets 0.386. Furthermore, our method is very simple and hence fast, since it does not require any second-stage box-based refinement or clustering algorithm. We believe it will therefore be quite useful for a variety of applications, especially since it lends itself to deployment in mobile phones.

2 Related Work

2.1 Pose Estimation

Prior to the recent trend towards deep convolutional networks [4, 5], early successful models for human pose estimation centered on inference mechanisms over part-based graphical models [6, 7], representing a person by a collection of configurable parts. Following this work, many methods have been proposed to develop tractable inference algorithms for solving the energy minimization problems that capture rich dependencies among body parts [8,9,10,11,12,13,14,15,16]. While the forward inference mechanism of this work differs from these early DPM-based models, we similarly propose a bottom-up approach for grouping part detections into person instances.

Recently, models based on modern large-scale convolutional networks have achieved state-of-the-art performance on both single-person pose estimation [17,18,19,20,21,22,23,24,25,26] and multi-person pose estimation [27,28,29,30,31,32,33,34]. Broadly speaking, there are two main approaches to pose estimation in the literature: top-down (person first) and bottom-up (parts first). Examples of the former include G-RMI [33], CFN [35], RMPE [36], Mask R-CNN [34], and CPN [37]. These methods all predict keypoint locations within person bounding boxes obtained by a person detector (e.g., Fast-RCNN [38], Faster-RCNN [39] or R-FCN [40]).

In the bottom-up approach, body parts are detected first and then grouped into human instances. Pishchulin et al. [27], Insafutdinov et al. [28, 29], and Iqbal et al. [30] formulate the problem of multi-person pose estimation as part grouping and labeling via a Linear Program. Cao et al. [32] combine a unary joint detector modified from [31] with a part affinity field and greedily generate person instance proposals. Newell et al. [2] propose associative embedding to identify keypoint detections from the same person.

2.2 Instance Segmentation

Approaches to instance segmentation can also be categorized into the same two paradigms: top-down and bottom-up.

Top-down methods exploit state-of-the-art detection models to either classify mask proposals [41,42,43,44,45,46,47] or obtain mask segmentation results by refining bounding box proposals [3, 34, 48,49,50,51].

Ours is a bottom-up approach, in which we associate pixel-level predictions to each object instance. Many recent models propose similar forms of instance-level bottom-up clustering. For instance, Liang et al. use a proposal-free network [52] to cluster semantic segmentation results into instance segmentations. Uhrig et al. [53] first predict each pixel’s direction towards its instance center and then employ template matching to decode and cluster the instance segmentation result. Zhang et al. [54, 55] predict instance ID by encoding the object depth ordering within a patch and use this depth ordering to cluster instances. Wu et al. [56] use a prediction network followed by a Hough transform-like approach to cluster the predictions into instances. In this work, we similarly perform Hough voting over multiple predictions. In a slightly different formulation, Liu et al. [57] segment dense multi-scale patches and aggregate the localized patch results into complete object instances. Levinkov et al. [58] formulate the instance segmentation problem as a combinatorial optimization problem that consists of graph decomposition and node labeling and propose efficient local search algorithms to iteratively refine an initial solution. InstanceCut [59] and the work of [60] propose to predict object boundaries to separate instances. [2, 61, 62] group pixel predictions that have similar values in the learned embedding space to obtain instance segmentation results. Bai and Urtasun [63] propose a Watershed Transform Network which produces an energy map where object instances are represented as basins. Liu et al. [64] propose the Sequential Grouping Network which decomposes the instance segmentation problem into several sub-grouping problems.

3 Methods

Figure 1 gives an overview of our system, which we describe in detail next.

3.1 Person Detection and Pose Estimation

We develop a box-free bottom-up approach for person detection and pose estimation. It consists of two sequential steps: detection of K keypoints, followed by grouping them into person instances. We train our network in a supervised fashion, using the ground truth annotations of the \(K = 17\) face and body parts in the COCO dataset.

Fig. 1. Our PersonLab system consists of a CNN model that predicts: (1) keypoint heatmaps, (2) short-range offsets, (3) mid-range pairwise offsets, (4) person segmentation maps, and (5) long-range offsets. The first three predictions are used by the Pose Estimation Module in order to detect human poses, while the latter two, along with the human pose detections, are used by the Instance Segmentation Module in order to predict person instance segmentation masks.

Keypoint Detection. The goal of this stage is to detect, in an instance-agnostic fashion, all visible keypoints belonging to any person in the image.

For this purpose, we follow the hybrid classification and regression approach of [33], adapting it to our multi-person setting. We produce heatmaps (one channel per keypoint) and offsets (two channels per keypoint for displacements in the horizontal and vertical directions). Let \(x_i\) be a 2-D position in the image, where \(i = 1, \dots , N\) indexes the image positions and N is the number of pixels. Let \(\mathcal {D}_R(y) = \{x: \Vert x - y\Vert \le R\}\) be a disk of radius R centered around y. Also let \(y_{j,k}\) be the 2-D position of the k-th keypoint of the j-th person instance, with \(j = 1, \dots , M\), where M is the number of person instances in the image.

For every keypoint type \(k = 1, \dots , K\), we set up a binary classification task as follows. We predict a heatmap \(p_k(x)\) such that \(p_k(x) = 1\) if \(x \in \mathcal {D}_R(y_{j,k})\) for any person instance j, otherwise \(p_k(x) = 0\). We thus have K independent dense binary classification tasks, one for each keypoint type. Each amounts to predicting a disk of radius R around a specific keypoint type of any person in the image. The disk radius value is set to \(R=32\) pixels for all experiments reported in this paper and is independent of the person instance scale. We have deliberately opted for a disk radius which does not scale with the instance size in order to equally weigh all person instances in the classification loss. During training, we compute the heatmap loss as the average logistic loss over image positions and we back-propagate across the full image, only excluding areas that contain people that have not been fully annotated with keypoints (person crowd areas and small-scale person segments in the COCO dataset).

In addition to the heatmaps, we also predict short-range offset vectors \(S_k(x)\) whose purpose is to improve the keypoint localization accuracy. At each position x within the keypoint disks and for each keypoint type k, the short-range 2-D offset vector \(S_k(x) = y_{j,k} - x\) points from the image position x to the k-th keypoint of the closest person instance j, as illustrated in Fig. 1. We generate K such vector fields, solving a 2-D regression problem at each image position and keypoint independently. During training, we penalize the short-range offset prediction errors with the \(L_1\) loss, averaging and back-propagating the errors only at the positions \(x \in \mathcal {D}_R(y_{j,k})\) in the keypoint disks. We divide the errors in the short-range offsets (and all other regression tasks described in the paper) by the radius \(R=32\) pixels in order to normalize them and make their dynamic range commensurate with the heatmap classification loss.
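
To make the construction of these dense training targets concrete, the following NumPy sketch builds both the binary disk heatmaps and the short-range offset fields for a toy example. The array shapes, function name, and annotations are illustrative assumptions, not our actual implementation.

```python
import numpy as np

R = 32  # disk radius in pixels, as in the paper

def make_targets(keypoints, height, width):
    """keypoints: array of shape (M, K, 2) holding (row, col) positions
    of the K keypoints for each of M person instances."""
    M, K, _ = keypoints.shape
    heatmap = np.zeros((height, width, K), dtype=np.float32)
    short_offsets = np.zeros((height, width, K, 2), dtype=np.float32)
    rows, cols = np.mgrid[0:height, 0:width]
    for k in range(K):
        # Distance from every pixel to the k-th keypoint of every instance.
        d = np.stack([np.hypot(rows - y, cols - x)
                      for (y, x) in keypoints[:, k]], axis=0)  # (M, H, W)
        nearest = d.argmin(axis=0)          # closest instance per pixel
        inside = d.min(axis=0) <= R         # binary disk target p_k
        heatmap[..., k] = inside
        # Offsets point from the pixel to the keypoint of the closest
        # instance; the L1 loss is only computed where `inside` is True.
        yk = keypoints[nearest, k]          # (H, W, 2)
        short_offsets[..., k, 0] = (yk[..., 0] - rows) * inside
        short_offsets[..., k, 1] = (yk[..., 1] - cols) * inside
    return heatmap, short_offsets

# Toy example: two instances with K = 3 keypoints each.
kps = np.array([[[40, 40], [80, 60], [120, 50]],
                [[45, 160], [85, 180], [125, 170]]], dtype=np.float32)
hm, so = make_targets(kps, 200, 240)
print(hm.shape, so.shape)  # (200, 240, 3) (200, 240, 3, 2)
```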

We aggregate the heatmap and short-range offsets via Hough voting into 2-D Hough score maps \(h_k(x), k=1, \dots , K\), using independent Hough accumulators for each keypoint type. Each image position casts a vote to each keypoint channel k with weight equal to its activation probability,

$$\begin{aligned} h_k(x) = \frac{1}{\pi R^2}\sum _{i=1:N} p_k(x_i) B(x_i + S_k(x_i) - x) \,, \end{aligned}$$
(1)

where \(B(\cdot )\) denotes the bilinear interpolation kernel. The resulting highly localized Hough score maps \(h_k(x)\) are illustrated in Fig. 1.
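
The Hough voting of Eq. 1 amounts to a scatter operation: every pixel casts a vote of weight \(p_k(x_i)\) at its regressed position \(x_i + S_k(x_i)\), splatted over the four nearest accumulator cells with bilinear weights. Below is a minimal NumPy sketch for a single keypoint channel; names and shapes are illustrative assumptions.

```python
import numpy as np

def hough_score(prob, offsets, R=32):
    """prob: (H, W) heatmap p_k; offsets: (H, W, 2) short-range offsets S_k."""
    H, W = prob.shape
    acc = np.zeros((H, W), dtype=np.float32)
    rows, cols = np.mgrid[0:H, 0:W]
    vy = rows + offsets[..., 0]   # vote positions x_i + S_k(x_i)
    vx = cols + offsets[..., 1]
    y0, x0 = np.floor(vy).astype(int), np.floor(vx).astype(int)
    fy, fx = vy - y0, vx - x0
    # Splat each vote onto its four surrounding cells (bilinear kernel B).
    for dy, dx, w in [(0, 0, (1 - fy) * (1 - fx)), (0, 1, (1 - fy) * fx),
                      (1, 0, fy * (1 - fx)), (1, 1, fy * fx)]:
        yy, xx = y0 + dy, x0 + dx
        ok = (yy >= 0) & (yy < H) & (xx >= 0) & (xx < W)
        np.add.at(acc, (yy[ok], xx[ok]), (prob * w)[ok])
    return acc / (np.pi * R ** 2)
```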

Fig. 2. Mid-range offsets. (a) Initial mid-range offsets: starting around the RightElbow keypoint, they point towards the RightShoulder keypoint. (b) Mid-range offset refinement using the short-range offsets. (c) Mid-range offsets after refinement.

Grouping Keypoints into Person Detection Instances.

Mid-Range Pairwise Offsets. The local maxima in the score maps \(h_k(x)\) serve as candidate positions for person keypoints, yet they carry no information about instance association. When multiple person instances are present in the image, we need a mechanism to “connect the dots” and group together the keypoints belonging to each individual instance. For this purpose, we add to our network a separate pairwise mid-range 2-D offset field output \(M_{k,l}(x)\) designed to connect pairs of keypoints. We compute \(2(K-1)\) such offset fields, one for each directed edge connecting pairs (k, l) of keypoints which are adjacent to each other in a tree-structured kinematic graph of the person, see Figs. 1 and 2. Specifically, the supervised training target for the pairwise offset field from the k-th to the l-th keypoint is given by \(M_{k,l}(x) = (y_{j,l} - x) I(x \in \mathcal {D}_R(y_{j,k}))\), since its purpose is to allow us to move from the k-th to the l-th keypoint of the same person instance j. During training, this target regression vector is only defined if both keypoints are present in the training example. We compute the average \(L_1\) loss of the regression prediction errors over the source keypoint disks \(x \in \mathcal {D}_R(y_{j,k})\) and back-propagate through the network.

Recurrent Offset Refinement. Particularly for large person instances, the edges of the kinematic graph connect pairs of keypoints such as RightElbow and RightShoulder which may be several hundred pixels away in the image, making it hard to generate accurate regressions. We have successfully addressed this important issue by recurrently refining the mid-range pairwise offsets using the more accurate short-range offsets, specifically:

$$\begin{aligned} M_{k,l}(x) \leftarrow x' + S_{l}(x') \,, \mathrm {where} \,\, x' = M_{k,l}(x) \,, \end{aligned}$$
(2)

as illustrated in Fig. 2. We repeat this refinement step twice in our experiments. We employ bilinear interpolation to sample the short-range offset field at the intermediate position \(x'\) and back-propagate the errors through it along both the mid-range and short-range input offset branches. We perform offset refinement at the resolution of the CNN output activations (before upsampling to the original image resolution), making the process very fast. The offset refinement process drastically decreases the mid-range regression errors, as illustrated in Fig. 2. This is a key novelty in our method, which greatly facilitates grouping and significantly improves results compared to previous papers [28, 32] which also employ pairwise displacements to associate keypoints.
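
A sketch of this refinement step follows, reading the stored offsets as vectors anchored at each pixel (so the pointed-to position is \(x' = x + M_{k,l}(x)\) and the refined offset is taken relative to x). The bilinear sampler and function names are illustrative assumptions.

```python
import numpy as np

def bilinear_sample(field, y, x):
    """field: (H, W, 2); y, x: float arrays of query positions."""
    H, W, _ = field.shape
    y = np.clip(y, 0, H - 1.001); x = np.clip(x, 0, W - 1.001)
    y0, x0 = np.floor(y).astype(int), np.floor(x).astype(int)
    fy, fx = (y - y0)[..., None], (x - x0)[..., None]
    return (field[y0, x0] * (1 - fy) * (1 - fx)
            + field[y0, x0 + 1] * (1 - fy) * fx
            + field[y0 + 1, x0] * fy * (1 - fx)
            + field[y0 + 1, x0 + 1] * fy * fx)

def refine(mid, short, steps=2):
    """mid, short: (H, W, 2) offset fields M_{k,l} and S_l."""
    H, W, _ = mid.shape
    rows, cols = np.mgrid[0:H, 0:W]
    base = np.stack([rows, cols], axis=-1).astype(np.float32)
    for _ in range(steps):  # two refinement steps, as in the paper
        tgt = base + mid                  # x' = x + M(x)
        # New offset lands on x' + S_l(x'), expressed relative to x.
        mid = tgt + bilinear_sample(short, tgt[..., 0], tgt[..., 1]) - base
    return mid
```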

Fast Greedy Decoding. We have developed an extremely fast greedy decoding algorithm to group keypoints into detected person instances. We first create a priority queue, shared across all K keypoint types, in which we insert the position \(x_i\) and keypoint type k of all local maxima in the Hough score maps \(h_k(x)\) which have score above a threshold value (set to 0.01 in all reported experiments). These points serve as candidate seeds for starting a detection instance. We then pop elements out of the queue in descending score order. At each iteration, if the position \(x_i\) of the current candidate detection seed of type k is within a disk \(\mathcal {D}_r(y_{j',k})\) of the corresponding keypoint of previously detected person instances \(j'\), then we reject it; for this we use a non-maximum suppression radius of \(r=10\) pixels. Otherwise, we start a new detection instance j with the k-th keypoint at position \(y_{j,k} = x_i\) serving as seed. We then follow the mid-range displacement vectors along the edges of the kinematic person graph to greedily connect pairs (k, l) of adjacent keypoints, setting \(y_{j,l} = y_{j,k} + M_{k,l}(y_{j,k})\).
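
The decoding loop can be summarized with a standard priority queue. In the condensed sketch below, `peaks`, `edges`, and the offset sampler `mid` are hypothetical stand-ins for the Hough local maxima, the directed kinematic tree edges, and the refined mid-range offset fields.

```python
import heapq
import numpy as np

def decode(peaks, edges, mid, K, score_thresh=0.01, nms_r=10.0):
    """peaks: list of (score, (row, col), keypoint_type) local maxima;
    edges: directed (k, l) pairs of the kinematic tree;
    mid(k, l, pos): samples the refined mid-range offset M_{k,l} at pos."""
    queue = [(-s, tuple(pos), k) for s, pos, k in peaks if s > score_thresh]
    heapq.heapify(queue)                   # pop highest score first
    instances = []                         # each: (K, 2) array, NaN = unset
    while queue:
        neg_s, pos, k = heapq.heappop(queue)
        pos = np.asarray(pos, dtype=np.float32)
        # Reject seeds within r pixels of the same keypoint type of an
        # already-decoded instance (non-maximum suppression).
        if any(not np.isnan(inst[k, 0])
               and np.linalg.norm(pos - inst[k]) <= nms_r
               for inst in instances):
            continue
        inst = np.full((K, 2), np.nan, dtype=np.float32)
        inst[k] = pos
        # Walk the kinematic tree from the seed, following mid-range
        # offsets to place the remaining keypoints.
        frontier = [k]
        while frontier:
            src = frontier.pop()
            for (a, b) in edges:
                if a == src and np.isnan(inst[b, 0]):
                    inst[b] = inst[a] + mid(a, b, inst[a])
                    frontier.append(b)
        instances.append(inst)
    return instances
```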

It is worth noting that our decoding algorithm does not treat any keypoint type preferentially, in contrast to other techniques that always use the same keypoint type (e.g., Torso or Nose) as seed for generating detections. Although we have empirically observed that the majority of detections in frontal-facing person instances start from the more easily localizable facial keypoints, our approach can also robustly handle cases where a large portion of the person is occluded.

Keypoint- and Instance-Level Detection Scoring. We have experimented with different methods to assign a keypoint- and instance-level score to the detections generated by our greedy decoding algorithm. Our first keypoint-level scoring method follows [33] and assigns to each keypoint a confidence score \(s_{j,k} = h_k(y_{j,k})\). A drawback of this approach is that the well-localizable facial keypoints typically receive much higher scores than poorly localizable keypoints like the hip or knee. Our second approach attempts to calibrate the scores of the different keypoint types. It is motivated by the object keypoint similarity (OKS) evaluation metric used in the COCO keypoints task [1], which uses different accuracy thresholds \(\kappa _k\) to penalize localization errors for different keypoint types.

Fig. 3. Long-range offsets defined in the person segmentation mask. (a) Estimated person segmentation map. (b) Initial long-range offsets for the Nose destination keypoint: each pixel in the foreground of the person segmentation mask points towards the Nose keypoint of the instance it belongs to. (c) Long-range offsets after refinement with the short-range offsets.

Specifically, consider a detected person instance j with keypoint coordinates \(y_{j,k}\). Let \(\lambda _j\) be the square root of the area of the bounding box tightly containing all detected keypoints of the j-th person instance. We define the Expected-OKS score for the k-th keypoint by

$$\begin{aligned} s_{j,k} = E\{OKS_{j,k}\} = p_k(y_{j,k}) \int _{x \in \mathcal {D}_R(y_{j,k})} \hat{h}_k(x) \exp \left( -\frac{(x - y_{j,k})^2}{2 \lambda _j^2 \kappa _k^2} \right) dx \,, \end{aligned}$$
(3)

where \(\hat{h}_k(x)\) is the Hough score normalized in \(\mathcal {D}_R(y_{j,k})\). The expected OKS keypoint-level score is the product of our confidence that the keypoint is present, times the OKS localization accuracy confidence, given the keypoint’s presence.
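
In practice, the integral in Eq. 3 can be evaluated over the discrete score map. A sketch of this computation for one keypoint follows; variable names are illustrative assumptions.

```python
import numpy as np

def expected_oks(hough, prob_at_kp, y, lam, kappa, R=32):
    """hough: (H, W) score map h_k; prob_at_kp: p_k(y_{j,k});
    y: (row, col) keypoint position; lam: instance scale lambda_j;
    kappa: per-type OKS constant kappa_k."""
    H, W = hough.shape
    rows, cols = np.mgrid[0:H, 0:W]
    d2 = (rows - y[0]) ** 2 + (cols - y[1]) ** 2
    disk = d2 <= R ** 2
    h_hat = hough * disk
    h_hat = h_hat / max(h_hat.sum(), 1e-9)   # normalize within the disk
    oks = np.exp(-d2 / (2.0 * lam ** 2 * kappa ** 2))
    # Presence confidence times expected OKS localization accuracy.
    return prob_at_kp * (h_hat * oks).sum()
```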

We use the average of the keypoint scores as the instance-level score \(s_j^h = (1/K) \sum _k s_{j,k}\), followed by non-maximum suppression (NMS). We have experimented both with hard OKS-based NMS [33] and with a soft-NMS scheme adapted for the keypoints task from [65], where we use as final instance-level score the sum of the scores of the keypoints that have not already been claimed by higher scoring instances, normalized by the total number of keypoints:

$$\begin{aligned} s_j = (1/K) \sum _{k=1:K} s_{j,k} [\Vert y_{j,k} - y_{j',k}\Vert > r, \text {for every} \,\, j' < j] \,, \end{aligned}$$
(4)

where \(r=10\) is the NMS radius. In the main paper we report results with the best-performing Expected-OKS scoring and soft-NMS; ablation experiments are included in the supplementary material.
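
The following sketch illustrates Eq. 4, under the assumption that arrays `kps` and `scores` hold the decoded keypoint positions and their Expected-OKS scores for all instances.

```python
import numpy as np

def soft_nms_scores(kps, scores, r=10.0):
    """kps: (J, K, 2) keypoint positions; scores: (J, K) keypoint scores."""
    J, K, _ = kps.shape
    order = np.argsort(-scores.mean(axis=1))    # best instances first
    final = np.zeros(J, dtype=np.float32)
    claimed = []                                # higher-scoring instances
    for j in order:
        # A keypoint counts only if no higher-scoring instance has already
        # claimed the same keypoint type within radius r.
        keep = np.ones(K, dtype=bool)
        for jp in claimed:
            keep &= np.linalg.norm(kps[j] - kps[jp], axis=1) > r
        final[j] = (scores[j] * keep).sum() / K
        claimed.append(j)
    return final
```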

3.2 Instance-Level Person Segmentation

Given the set of keypoint-level person instance detections, the task of our method’s segmentation stage is to identify pixels that belong to people (recognition) and associate them with the detected person instances (grouping). We describe next the respective semantic segmentation and association modules, illustrated in Fig. 4.

Fig. 4. From semantic to instance segmentation: (a) Image; (b) person segmentation; (c) basins of attraction defined by the long-range offsets to the Nose keypoint; (d) instance segmentation masks.

Semantic Person Segmentation. We treat semantic person segmentation in the standard fully-convolutional fashion [66, 67]. We use a simple semantic segmentation head consisting of a single \(1\times 1\) convolutional layer that performs dense logistic regression and compute at each image pixel \(x_i\) the probability \(p_S(x_i)\) that it belongs to at least one person. During training, we compute and backpropagate the average of the logistic loss over all image regions that have been annotated with person segmentation maps (in the case of COCO we exclude the crowd person areas).

Associating Segments with Instances Via Geometric Embeddings. The task of this module is to associate each person pixel identified by the semantic segmentation module with the keypoint-level detections produced by the person detection and pose estimation module.

Similar to [2, 61, 62], we follow the embedding-based approach for this task. In this framework, one computes an embedding vector G(x) at each pixel location, followed by clustering to obtain the final object instances. In previous works, the representation is typically learned by computing pairs of embedding vectors at different image positions and using a loss function designed to attract the two embedding vectors if they both come from the same object instance and repel them if they come from different person instances. This typically leads to embedding representations which are difficult to interpret and involves solving a hard learning problem which requires careful selection of the loss function and tuning several hyper-parameters such as the pair sampling protocol.

Here, we opt instead for a considerably simpler, geometric approach. At each image position x inside the segmentation mask of an annotated person instance j with 2-D keypoint positions \(y_{j,k}, k=1,\dots ,K\), we define the long-range offset vector \(L_k(x) = y_{j,k} - x\) which points from the image position x to the position of the k-th keypoint of the corresponding instance j. (This is very similar to the short-range prediction task, except that the dynamic range is different, since we require the network to predict from any pixel inside the person, not just from inside a disk near the keypoint. Thus these are like two “specialist” networks. Performance is worse when we use the same network for both kinds of tasks.) We compute K such 2-D vector fields, one for each keypoint type. During training, we penalize the long-range offset regression errors using the \(L_1\) loss, averaging and back-propagating the errors only at image positions x which belong to a single person object instance. We ignore background areas, crowd regions, and pixels which are covered by two or more person masks.

The long-range prediction task is challenging, especially for large object instances that may cover the whole image. As in Sect. 3.1, we recurrently refine the long-range offsets, twice by themselves and then twice by the short-range offsets:

$$\begin{aligned} L_k(x) \leftarrow x' + L_k(x') \,, x' = L_k(x) \,\,\, \mathrm {and} \,\,\, L_k(x) \leftarrow x' + S_k(x') \,, x' = L_k(x) \,, \end{aligned}$$
(5)

back-propagating through the bilinear warping function during training. As with the mid-range offset refinement in Eq. 2, recurrent long-range offset refinement dramatically improves the long-range offset prediction accuracy.

In Fig. 3 we illustrate the long-range offsets corresponding to the Nose keypoint as computed by our trained CNN for an example image. We see that the long-range vector field effectively partitions the image plane into basins of attraction for each person instance. This motivates us to define as embedding representation for our instance association task the \(2 \cdot K\) dimensional vector \(G(x) = (G_k(x))_{k=1, \dots , K}\) with components \(G_k(x)= x + L_k(x)\).

Our proposed embedding vector has a very simple geometric interpretation: At each image position \(x_i\) semantically recognized as a person instance, the embedding \(G(x_i)\) represents our local estimate for the absolute position of every keypoint of the person instance it belongs to, i.e., it represents the predicted shape of the person. This naturally suggests shape metrics as candidates for computing distances in our proposed embedding space. In particular, in order to decide if the person pixel \(x_i\) belongs to the j-th person instance, we compute the embedding distance metric

$$\begin{aligned} D_{i,j} = \frac{1}{\sum _k p_k(y_{j,k})} \sum _{k=1}^K p_k(y_{j,k}) \frac{1}{\lambda _j}\Vert G_k(x_i) - y_{j,k}\Vert \,, \end{aligned}$$
(6)

where \(y_{j,k}\) is the position of the k-th detected keypoint in the j-th instance and \(p_k(y_{j,k})\) is the probability that it is present. Weighing the errors by the keypoint presence probability allows us to discount discrepancies in the two shapes due to missing keypoints. Normalizing the errors by the detected instance scale \(\lambda _j\) allows us to compute a scale invariant metric. We set \(\lambda _j\) equal to the square root of the area of the bounding box tightly containing all detected keypoints of the j-th person instance. We emphasize that because we only need to compute the distance metric between the \(N_S\) person pixels and the M person instances, our algorithm is very fast in practice, having complexity \(\mathcal {O}(N_S \cdot M)\) instead of the \(\mathcal {O}(N_S^2)\) of standard embedding-based segmentation techniques which, at least in principle, require computation of embedding vector distances for all pixel pairs.

To produce the final instance segmentation result: (1) We find all positions \(x_i\) marked as person in the semantic segmentation map, i.e. those pixels that have semantic segmentation probability \(p_S(x_i) \ge 0.5\). (2) We associate each person pixel \(x_i\) with every detected person instance j for which the embedding distance metric satisfies \(D_{i,j} \le t\); we set the relative distance threshold \(t=0.25\) for all reported experiments. It is important to note that the pixel-instance assignment is non-exclusive: Each person pixel may be associated with more than one detected person instance (which is particularly important when doing soft-NMS in the detection stage) or it may remain an orphan (e.g., a small false positive region produced by the segmentation module). We use the same instance-level score produced by the previous person detection and pose estimation stage to also evaluate on the COCO segmentation task and obtain average precision performance numbers.
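
The two steps above can be summarized in a short sketch of the association procedure of Eq. 6. The input shapes are illustrative assumptions, and we assume all K keypoints of each detection have been decoded.

```python
import numpy as np

def associate(seg_prob, G, det_kps, det_probs, t=0.25):
    """seg_prob: (H, W) person probability p_S; G: (H, W, K, 2) embedding
    G_k(x) = x + L_k(x); det_kps: (J, K, 2) detected keypoints y_{j,k};
    det_probs: (J, K) keypoint presence probabilities p_k(y_{j,k})."""
    H, W, K, _ = G.shape
    person = seg_prob >= 0.5                 # step (1): person pixels
    px = G[person]                           # (Np, K, 2) predicted shapes
    masks = np.zeros((len(det_kps), H, W), dtype=bool)
    for j, (y, p) in enumerate(zip(det_kps, det_probs)):
        side = y.max(axis=0) - y.min(axis=0)
        lam = max(np.sqrt(side[0] * side[1]), 1e-3)  # sqrt of bbox area
        err = np.linalg.norm(px - y, axis=2)         # (Np, K)
        D = (p * err / lam).sum(axis=1) / max(p.sum(), 1e-9)   # Eq. (6)
        m = np.zeros((H, W), dtype=bool)
        m[person] = D <= t                   # step (2): non-exclusive
        masks[j] = m
    return masks
```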

3.3 Imputing Missing Keypoint Annotations

The standard COCO dataset does not contain keypoint annotations in the training set for the small person instances, and ignores them during model evaluation. However, it contains segmentation annotations and evaluates mask predictions for those small instances. Since our geometric embeddings require keypoint annotations for training, we have run the single-person pose estimator of [33] (trained on COCO data alone) on the COCO training set, using image crops around the ground truth box annotations of those small person instances, to impute the missing keypoint annotations. We treat the imputed keypoints as regular annotations during PersonLab model training. Naturally, this missing keypoint imputation step is particularly important for our COCO instance segmentation performance on small person instances. We emphasize that, unlike [68], we do not use any data beyond the COCO train split images and annotations in this process. Data distillation on additional images as described in [68] may yield further improvements.

4 Experimental Evaluation

4.1 Experimental Setup

Dataset and Tasks. We evaluate the proposed PersonLab system on the standard COCO keypoints task [1] and on COCO instance segmentation [69] for the person class alone. For all reported results we only use COCO data for model training (in addition to Imagenet pretraining). Our train set is the subset of the 2017 COCO training set images that contain people (64115 images). Our val set coincides with the 2017 COCO validation set (5000 images). We only use train for training and evaluate on either val or the test-dev split (20288 images).

Model Training Details. We report experimental results with models that use either ResNet-101 or ResNet-152 CNN backbones [70] pretrained on the Imagenet classification task [71]. We discard the last Imagenet classification layer and add \(1\times 1\) convolutional layers for each of our model-specific layers. During model training, we randomly resize a square box tightly containing the full image by a uniform random scale factor between 0.5 and 1.5, randomly translate it along the horizontal and vertical directions, and left-right flip it with probability 0.5. We sample and resize the image crop contained under the resulting perturbed box to an \(801\times 801\) image that we feed into the network. We use a batch size of 8 images distributed across 8 Nvidia Tesla P100 GPUs in a single machine and perform synchronous training for 1M steps with stochastic gradient descent with constant learning rate equal to 1e-3, momentum value set to 0.9, and Polyak-Ruppert model parameter averaging. We employ batch normalization [72] but fix the statistics of the ResNet activations to their Imagenet values. Our ResNet CNN network backbones have nominal output stride (i.e., ratio of the input image size to the output activations size) equal to 32 but we reduce it to 16 during training and 8 during evaluation using atrous convolution [67]. During training we also make model predictions using as features activations from a layer in the middle of the network, which we have empirically observed to accelerate training. To balance the different loss terms we use weights equal to (4, 2, 1, 1/4, 1/8) for the heatmap, segmentation, short-range, mid-range, and long-range offset losses in our model. For evaluation we report both single-scale results (image resized to have larger side 1401 pixels) and multi-scale results (pyramid with images having larger side 601, 1201, 1801, 2401 pixels). We have implemented our system in Tensorflow [73]. All reported numbers have been obtained with a single model without ensembling.
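
As a small illustration of the stated loss balancing, the sketch below combines the five terms with the weights (4, 2, 1, 1/4, 1/8); the individual loss values are hypothetical placeholders for the logistic and normalized \(L_1\) losses of Sect. 3.

```python
# Weights for the five loss terms, in the order stated above.
LOSS_WEIGHTS = {
    'heatmap': 4.0, 'segmentation': 2.0,
    'short_range': 1.0, 'mid_range': 0.25, 'long_range': 0.125,
}

def total_loss(losses):
    """losses: dict mapping each task name to its scalar loss value."""
    return sum(LOSS_WEIGHTS[name] * value for name, value in losses.items())

# Toy values only, for illustration.
print(total_loss({'heatmap': 0.3, 'segmentation': 0.2, 'short_range': 0.5,
                  'mid_range': 0.8, 'long_range': 1.1}))
```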

Table 1. Performance on the COCO keypoints test-dev split.

4.2 COCO Person Keypoints Evaluation

Table 1 shows our system’s person keypoints performance on COCO test-dev. Our single-scale inference result is already better than the results of the CMU-Pose [32] and Associative Embedding [2] bottom-up methods, even when they perform multi-scale inference and refine their results with a single-person pose estimation system applied on top of their bottom-up detection proposals. Our results also outperform top-down methods like Mask-RCNN [34] and G-RMI [33]. Our best result with 0.687 AP is attained with a ResNet-152 based model and multi-scale inference. Our result is still behind the winners of the 2017 keypoints challenge (Megvii) [37] with 0.730 AP, but they used a carefully tuned two-stage, top-down model that also builds on a significantly more powerful CNN backbone.

Table 2. Performance on COCO segmentation (Person category) test-dev split. Our person-only results have been obtained with 20 proposals per image. The person category FCIS eval results have been communicated by the authors of [3].
Table 3. Performance on COCO Segmentation (Person category) val split. The Mask-RCNN [34] person results have been produced by the ResNet-101-FPN version of their publicly shared model (which achieves 0.359 AP across all COCO classes).

4.3 COCO Person Instance Segmentation Evaluation

Tables 2 and 3 show our person instance segmentation results on COCO test-dev and val, respectively. We use the small-instance missing keypoint imputation technique of Sect. 3.3 for the reported instance segmentation experiments, which significantly increases our performance for small objects. Our results without missing keypoint imputation are shown in the supplementary material.

Our method only produces segmentation results for the person class, since our system is keypoint-based and thus cannot be applied to the other COCO classes. The standard COCO instance segmentation evaluation allows for a maximum of 100 proposals per image for all 80 COCO classes. For a fair comparison with previous works, we report test-dev results of our method with a maximum of 20 person proposals per image, which is the convention also adopted in the standard COCO person keypoints evaluation protocol. For reference, we also report the val results of our best model when allowed to produce 100 proposals.

We compare our system with the person category results of top-down instance segmentation methods. As shown in Table 2, our method on the test-dev split outperforms FCIS [3] in both single-scale and multi-scale inference settings. As shown in Table 3, our performance on the val split is similar to that of Mask-RCNN [34] on medium and large person instances, but worse on small person instances. However, we emphasize that our method is the first box-free, bottom-up instance segmentation method to report experiments on the COCO instance segmentation task.

Fig. 5. Visualization on COCO val images. The last row shows some failure cases: missed keypoint detection, false positive keypoint detection, and missed segmentation.

4.4 Qualitative Results

In Fig. 5 we show representative person pose and instance segmentation results on COCO val images produced by our model with single-scale inference.

5 Conclusions

We have developed a bottom-up model which jointly addresses the problems of person detection, pose estimation, and instance segmentation using a unified part-based modeling approach. We have demonstrated the effectiveness of the proposed method on the challenging COCO person keypoint and instance segmentation tasks. A key limitation of the proposed method is its reliance on keypoint-level annotations for training on the instance segmentation task. In the future, we plan to explore ways to overcome this limitation, via weakly supervised part discovery.