1 Introduction

Visual fashion analysis has drawn much attention recently, due to its wide spectrum of applications such as clothes recognition [1–3], retrieval [3–5], and recommendation [6, 7]. It is a challenging task because of the large variations present in clothing items, such as changes of pose, scale, and appearance. To reduce these variations, existing works tackled the problem by looking for informative regions, i.e. detecting clothing bounding boxes [1, 2] or human joints [8, 9]. We go beyond the above by studying a more discriminative representation, the fashion landmark: a key point located at a functional region of clothes, for example the neckline or the cuff.

This work addresses fashion landmark detection, or fashion alignment, in the wild. Different from human pose estimation, which detects human joints such as the neck and elbows as shown in Fig. 1(a.1), fashion alignment localizes fashion landmarks as shown in (a.2). These landmarks facilitate fashion analysis in the sense that they not only implicitly capture bounding boxes of clothes, but also indicate their functional regions, which better distinguish the design, pattern, and category of the clothes. Therefore, features extracted from these landmarks are more discriminative than those extracted from human joints. For example, when searching for a dress with ‘V-neck and fringed-hem’, it is more desirable to extract features from the collar and hemline.

Fig. 1. Comparisons of fashion landmarks and human joints: (a.1) sample annotations for human joints, (a.2) sample annotations for fashion landmarks, (a.3–4) typical deformation and scale variations present in clothing items, (b) spatial variances of human joints (in blue) and fashion landmarks (in green), and (c) appearance variances of human joints (in blue) and fashion landmarks (in green). (Color figure online)

To fully benchmark the task of fashion landmark detection, we select a large subset of images from the DeepFashion database [3] to constitute a fashion landmark dataset (FLD). These images have large pose and scale variations. With FLD, we show that fashion landmark detection in clothes images is a more challenging task than human joint detection in three aspects. First, clothes undergo non-rigid deformations and scale variations as shown in Fig. 1(a.3–4), whereas human joints usually exhibit only rigid deformations. Second, fashion landmarks exhibit much larger spatial variances than human joints, as illustrated in Fig. 1(b), where we plot the positions of the landmarks and the corresponding human joints in the test set of the FLD dataset. For instance, the positions of ‘left sleeve’ are more diverse than those of the ‘left elbow’ in both the vertical and horizontal directions. Third, the local regions of fashion landmarks also have larger appearance variances than those of human joints. As shown in Fig. 1(c), we average the patches centered at the fashion landmarks and human joints respectively, yielding several visual comparisons. The patterns of the mean patches of human joints are still recognizable, but those of the mean patches of fashion landmarks are not.

To tackle the above challenges, we propose a deep fashion alignment (DFA) framework, which cascades three deep convolutional networks (CNNs) for landmark estimation. It has three appealing properties. First, to ensure the CNNs have high discriminative power, unlike existing works [10–14] that only estimated landmark positions, we train the cascaded CNNs to predict both the landmark positions and pseudo-labels, which encode the similarities between training samples to boost estimation accuracy. In each stage of the network cascade, the pseudo-label scheme is carefully designed to reduce the different variations present in fashion images. Second, instead of training multiple networks for each body part as previous works did [10, 15], the DFA framework trains CNNs on the full image, significantly reducing computation. Third, the DFA framework introduces an auto-routing strategy to partition challenging and easy samples, such that different samples can be handled by different branches of CNNs.

Extensive experiments demonstrate the effectiveness of the proposed method, as well as its ability to generalize to pose estimation. Fashion landmarks are also compared to clothing bounding boxes and human joints in two applications, fashion attribute prediction and clothes retrieval, showing that fashion landmarks are a more discriminative representation for understanding fashion images.

1.1 Related Work

Visual Fashion Understanding. Visual fashion understanding has been a long-pursued topic due to the many human-centric applications it enables. Recent advances include predicting semantic attributes [3, 8, 9, 16, 17], clothes recognition and retrieval [2–4, 18–20], and fashion trend discovery [6, 7, 21]. To better capture discriminative information in fashion items, previous works have explored the use of the full image [2], general object proposals [2], bounding boxes [9, 17], and even masks [18, 22–24]. However, these representations either lack sufficient discriminative ability or are too expensive to obtain. To overcome these drawbacks, we introduce the problem of clothes alignment in this work, which is a necessary step toward robust fashion recognition.

Human Pose Estimation. We further convert the problem of clothes alignment into fashion landmark estimation. Though there is no prior work on fashion landmarks, approaches from similar fields (e.g. human pose estimation [10, 11, 25–28]) serve as good candidates to explore. Recently, deep learning [10–12, 29] has shown great advantages in locating human joints, and there are generally two directions. The first direction [10, 13, 30] utilizes the power of cascading for iterative error correction. DeepPose [10] employs a divide-and-conquer strategy and designs a cascaded deep regression framework at the part level, while Iterative Error Feedback [13] emphasizes the target scheduling in each stage. The second direction [11, 12, 31, 32], on the other hand, focuses on explicit modeling of landmark relationships using graphical models. Chen et al. [11] proposed combining a CNN and a structural SVM [33] to model the tree-like relationships among landmarks, while Tompson et al. [12] plugged a Markov Random Field (MRF) [34] into a CNN for joint training. Our framework attempts to absorb the advantages of both directions: the cascading and auto-routing mechanisms enable both stage-wise and branch-wise variation reduction, while pseudo-labels encode multi-level sample relationships that depict typical global and local landmark configurations.

Fig. 2. Illustration of the Fashion Landmark Dataset (FLD): (a) sample images and annotations for different types of clothing items, including upper/lower/full-body clothes, (b) sample images and annotations for different subsets, including normal/medium/large poses and medium/large scales, (c) quantitative data distributions.

2 Fashion Landmark Dataset (FLD)

To benchmark fashion landmark detection, we select a subset of images with large pose and scale variations from the DeepFashion database [3] to constitute FLD. We label and refine the landmark annotations in FLD to make sure each image is correctly labeled with 8 fashion landmarks along with their visibility. Overall, FLD contains more than 120K images. Sample images and annotations are shown in Fig. 2(a). To characterize the properties of FLD, we divide the images into five subsets according to the positions and visibility of their ground truth landmarks: normal/medium/large poses and medium/large zoom-ins. The ‘normal’ subset contains images with frontal pose and no cut-off landmarks. The ‘medium’ and ‘large’ pose subsets contain images with side or back views, while the ‘medium’ and ‘large’ zoom-in subsets contain clothing items with more than one or more than three cut-off landmarks, respectively. Sample images of the five subsets are illustrated in Fig. 2(b) and their statistics are shown in Fig. 2(c), which shows that FLD contains substantial percentages of images with large poses and scales.

3 Our Approach

Fashion landmarks exhibit large variations in both the spatial and appearance domains (see Fig. 1(b, c)). Figure 2(c) further shows that more than \(30\,\%\) of the images have large pose and zoom-in variations. To account for these challenges, we propose a deep fashion alignment (DFA) framework, shown in Fig. 3(a), which consists of three stages, where each stage refines the predictions of the previous one. Unlike existing representative models for human joint prediction, such as DeepPose [10] shown in (b), which train multiple networks for localized parts in each stage, the proposed DFA framework operates on the full image and achieves superior performance with much lower computation.

Fig. 3. Pipelines of (a) Deep Fashion Alignment (DFA) and (b) DeepPose [10]. DFA leverages pseudo-labels and an auto-routing mechanism to reduce the large variations present in fashion images, at much lower computational cost.

Framework Overview. As shown in Fig. 3(a), DFA has three stages, each employing VGG-16 [35] as the network architecture. In the first stage, DFA takes the raw image I as input and predicts rough landmark positions, denoted as \(\hat{l^1}\), as well as pseudo-labels, denoted as \(\hat{f^1}\), which represent landmark configurations such as clothing categories and poses. In the second stage, both the input image I and the stage-1 predictions \(\hat{l^1}\) are fed in. The network is required to predict landmark offsets, denoted \(\hat{\delta l^2}\), and pseudo-labels \(\hat{f^2}\) that represent local landmark offsets. The landmark prediction of stage-2 is computed as \(\hat{l^2} = \hat{l^1} + \hat{\delta l^2}\). The third stage has two CNNs as two branches with identical input and output. Similar to the second stage, each CNN takes image I as input and learns to estimate landmark offsets \(\hat{\delta l^3}\) and pseudo-labels \(\hat{f^3}\), which contain information about contextual landmark offsets. In stage-3, each image is passed through one of the two branches; the branch is selected by the pseudo-labels \(\hat{f^2}\) predicted in stage-2. The final prediction is computed as \(\hat{l^3} = \hat{l^2} + \hat{\delta l^3}\).
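The data flow above can be summarized in a few lines of code. Below is a minimal sketch of the forward pass, not the authors' implementation: each `stage*` module stands for a VGG-16 backbone with the corresponding regression heads, and `route` stands for the auto-routing rule described later; all names and tensor shapes are our assumptions.

```python
import torch

def dfa_forward(image, stage1, stage2, stage3_easy, stage3_hard, route):
    # Stage 1: regress rough landmark positions and pseudo-labels from the raw image.
    l1, f1 = stage1(image)                        # l1: (B, 16) for 8 (x, y) landmarks

    # Stage 2: condition on the stage-1 estimate and predict corrective offsets.
    dl2, f2 = stage2(image, l1)
    l2 = l1 + dl2                                 # refined stage-2 estimate

    # Stage 3: route each sample to the 'easy' or 'hard' branch based on f2.
    easy = route(f2)                              # boolean mask of shape (B,)
    dl3 = torch.empty_like(l2)
    dl3[easy], _ = stage3_easy(image[easy], l2[easy])
    dl3[~easy], _ = stage3_hard(image[~easy], l2[~easy])
    return l2 + dl3                               # final landmark estimate
```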

Network Cascade. Cascading [36] has proven to be an effective technique for sequentially reducing variations in pose estimation [10]. Here, we build the DFA system by cascading three CNNs. The first CNN directly regresses landmark positions and visibility from input images, aided by pseudo-labels in the space of landmark configurations. These pseudo-labels are obtained by clustering absolute landmark positions and indicate different clothing categories and poses, as shown in Fig. 4 stage 1. For example, clusters 1 and 4 represent ‘short T-shirt in front view’ and ‘long coat in side view’, respectively. The second CNN takes both the input image and the stage-1 predictions as input, and estimates the offsets needed to correct the predictions of the first stage. In this case, we are learning in the space of landmark offsets, so the pseudo-labels generated here represent typical error patterns and their magnitudes, as shown in Fig. 4 stage 2. For instance, cluster 1 represents corrections to be made in the vertical direction, while cluster 2 suggests corrections in the horizontal direction. In stage-3, we partition the data into two subsets according to the error patterns predicted in the second stage, as shown in Fig. 4 stage 3, where branch one deals with ‘easy’ samples such as frontal T-shirts with different sleeve lengths, while branch two accounts for ‘hard’ samples such as selfie poses (cluster 1), back views (cluster 2), and large zoom-ins (cluster 3).

Fig. 4. Visualization of the pseudo-labels obtained for each stage. Pseudo-labels in stage-1 indicate clothing categories and poses; pseudo-labels in stage-2 represent typical error patterns; pseudo-labels in stage-3 partition samples into ‘easy’ and ‘hard’ groups.

Pseudo-Label. In DFA, each training sample is associated with a pseudo-label that reflects its relationship to the other samples. Let the ground truth positions of fashion landmarks be denoted as l, where \(l_{i}\) specifies the pixel locations of the landmarks of sample i. The pseudo-label \(f \in \mathbb {R}^{K\times 1}\) of a sample with representation s in the clustering space is a K-dimensional vector whose k-th entry, \(k=1\ldots K\), is calculated as

$$\begin{aligned} f\left( k\right) = \exp \left( -\frac{dist\left( s, C_{k}\right) }{T}\right) , \end{aligned}$$
(1)

where \(dist(\cdot ,\cdot )\) is a distance measure and \(C_{k}\), \(k=1\ldots K\), are cluster centers obtained by running the k-means algorithm on the space of landmark coordinates (stage-1) or landmark offsets (stages 2 and 3); the representation s lives in the same space. T is a temperature parameter that softens the pseudo-labels. We adopt \(K = 20\) for all three stages.

Here we explain the pseudo-label used in each of the three stages. Cluster centers \(C^{1}_{k}\) in stage-1 are obtained in the landmark configuration space of \(l_{i}\), where \(l_{i}\) denotes the ground truth landmark positions of sample i. The pseudo-label \(f^{1}_{i}\) of sample i in stage-1 can then be written as \(f^{1}_{i}\left( k\right) = \exp \left( -\frac{\Vert l_{i} - C^{1}_{k}\Vert _{2}}{T}\right) \). Stage-1 yields a landmark estimate \(\hat{l^{1}_{i}}\) for sample i. In stage-2, we first define the landmark offset \(\delta l^{2}_{i} = \hat{l^{1}_{i}} - l_{i}\), which is the correction that should be made to the stage-1 estimate. Cluster centers \(C^{2}_{k}\) in stage-2 are obtained in the landmark offset space of \(\delta l^{2}_{i}\). Similarly, the pseudo-label \(f^{2}_{i}\) of sample i in stage-2 can be written as \(f^{2}_{i}\left( k\right) = \exp \left( -\frac{\Vert \delta l^{2}_{i} - C^{2}_{k}\Vert _{2}}{T}\right) \). Since the outer product \(\otimes \) of an offset with itself captures the correlations between different fashion landmarks (e.g. ‘left collar’ vs. ‘left sleeve’), we further include this contextual information in the stage-3 pseudo-labels \(f^{3}\). To make the results of the outer product comparable, we convert them into vectors by stacking columns, denoted \(lin\left( \cdot \right) \). The landmark offset of stage-3 is defined as \(\delta l^{3}_{i} = \hat{l^{2}_{i}} - l_{i}\), where \(\hat{l^{2}_{i}} = \hat{l^{1}_{i}} + \hat{\delta l^{2}_{i}}\) is the estimate made by stage-2. Cluster centers \(C^{3}_{k}\) in stage-3 are thus obtained in the contextual offset space \(\delta _{context} l^{3}_{i} = lin\left( \delta l^{3}_{i} \otimes \delta l^{3}_{i}\right) \). Similarly, the pseudo-label \(f^{3}_{i}\) of sample i in stage-3 can be written as \(f^{3}_{i}\left( k\right) = \exp \left( -\frac{\Vert \delta _{context} l^{3}_{i} - C^{3}_{k}\Vert _{2}}{T}\right) \). The pseudo-labels used in each stage are summarized in Table 1.
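For concreteness, the following is a minimal sketch of this construction on toy data. The use of scikit-learn's k-means, the synthetic estimates, and all variable names are our assumptions; note that since \(\delta l^{3}_{i} \otimes \delta l^{3}_{i}\) is symmetric, row-major flattening is equivalent to column stacking.

```python
import numpy as np
from sklearn.cluster import KMeans

def pseudo_labels(samples, K=20, T=20.0):
    """samples: (N, D) points in the relevant clustering space.
    Returns (N, K) soft pseudo-labels per Eq. (1)."""
    centers = KMeans(n_clusters=K).fit(samples).cluster_centers_
    dists = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-dists / T)

# Toy data: N samples with 8 landmarks = 16 coordinates each.
N = 1000
l = np.random.rand(N, 16)                     # ground-truth positions
l1_hat = l + 0.05 * np.random.randn(N, 16)    # simulated stage-1 estimates
l2_hat = l + 0.02 * np.random.randn(N, 16)    # simulated stage-2 estimates

f1 = pseudo_labels(l)                         # stage-1: absolute positions
f2 = pseudo_labels(l1_hat - l)                # stage-2: local offsets
dl3 = l2_hat - l                              # stage-3 residual offsets
ctx = np.einsum('ni,nj->nij', dl3, dl3).reshape(N, -1)  # lin(dl3 (x) dl3)
f3 = pseudo_labels(ctx)                       # stage-3: contextual offsets
```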

Table 1. Summary of pseudo-labels used in each stage. \(l_{i}\) is the ground truth landmark positions. \(\hat{l^{1}_{i}}\) and \(\hat{l^{2}_{i}}\) are the landmark estimates from stage-1 and stage-2, respectively. \(\otimes \) is the outer product and \(lin\left( \cdot \right) \) is the linearization (column-stacking) operation.

Auto-Routing. Another important building block of DFA is the auto-routing mechanism. It is built upon the fact that the pseudo-labels \(\hat{f^{2}}\) estimated in stage-2 reflect the error pattern of each sample. We first associate each cluster center with an average error magnitude \(e\left( C^{2}_{k}\right) , k = 1\ldots K\), obtained by averaging the errors of the training samples in each cluster. Then, we define the error function \(G\left( \cdot \right) \) over the stage-2 pseudo-label \(\hat{f^{2}_{i}}\) of sample i: \(G\left( \hat{f^{2}_{i}}\right) = \sum _{k=1}^{K} e\left( C^{2}_{k}\right) \cdot \hat{f^{2}_{i}}\left( k\right) \). The routing function \(r_{i}\) for sample i is then formulated as

$$\begin{aligned} r_{i} = \mathbf 1 \left( G\left( \hat{f^{2}_{i}}\right) < \epsilon \right) , \end{aligned}$$
(2)

where \(\mathbf 1 \left( \cdot \right) \) is the indicator function and \(\epsilon \) is the error threshold for auto-routing; we set \(\epsilon = 0.3\) empirically. A sample i with \(r_{i} = 1\) goes through branch 1 in stage-3, while a sample with \(r_{i} = 0\) goes through branch 2.
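A minimal sketch of this rule, with names of our own choosing, is:

```python
import numpy as np

def route(f2_hat, e, eps=0.3):
    """f2_hat: (N, K) predicted stage-2 pseudo-labels; e: (K,) average error
    magnitude e(C2_k) of the training samples in each cluster.
    Returns a boolean mask: True -> branch 1 ('easy'), False -> branch 2 ('hard')."""
    G = f2_hat @ e          # G(f2_hat) = sum_k e(C2_k) * f2_hat(k)
    return G < eps          # Eq. (2): r = 1(G < eps)
```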

Training. Each stage of DFA is trained with multiple loss functions: a landmark position loss \(L_{positions}\), a visibility prediction loss \(L_{visibility}\), and a pseudo-label approximation loss \(L_{labels}\). The overall loss function \(L_{overall}\) is

$$\begin{aligned} L_{overall} = L_{positions}(l, \hat{l}) + \alpha (t)L_{visibility}(v, \hat{v}) + \beta (t)L_{labels}(f, \hat{f}), \end{aligned}$$
(3)

where \(\hat{l}\), \(\hat{v}\) and \(\hat{f}\) are the predicted landmark positions, visibility, and pseudo-labels, respectively. We employ the Euclidean loss for \(L_{positions}\) and \(L_{labels}\), and the multinomial logistic loss for \(L_{visibility}\). \(\alpha \left( t\right) \) and \(\beta \left( t\right) \) are balancing weights. All VGG-16 networks are pre-trained on ImageNet [37] and the entire DFA cascade is fine-tuned by stochastic gradient descent with back-propagation.

The proper scheduling of \(\alpha \left( t\right) \) and \(\beta \left( t\right) \) is very important for network performance. If they are too large, they disturb the training of landmark positions; if they are too small, the training procedure cannot benefit from this auxiliary information. Similar to [38], we design a piecewise adjustment strategy for \(\alpha \left( t\right) \) and \(\beta \left( t\right) \) during training,

$$\begin{aligned} \alpha \left( t\right) = \left\{ \begin{array}{ll} \alpha &{}~~~t< t_{1}, \\ \frac{t_{2}-t}{t_{2}-t_{1}}\alpha &{}~~~t_{1} \le t < t_{2}, \\ 0 &{}~~~t_{2} \le t, \end{array} \right. \end{aligned}$$
(4)

where \(t_1= 2000\) iterations and \(t_2= 4000\) iterations in our implementation; the auxiliary weight thus decays linearly to zero between \(t_1\) and \(t_2\). The adjustment of \(\beta \left( t\right) \) takes a similar form.
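Reading Eq. (4) as a linear decay of the auxiliary weights, a minimal sketch of the schedule is given below; the base weight values `alpha0` and `beta0` are placeholders not specified in the text.

```python
def aux_weight(t, base, t1=2000, t2=4000):
    """Piecewise schedule of Eq. (4): constant, then linear decay, then zero,
    so late training focuses purely on landmark regression."""
    if t < t1:
        return base
    if t < t2:
        return base * (t2 - t) / (t2 - t1)
    return 0.0

# Overall loss of Eq. (3) at iteration t (the per-term losses assumed defined):
# loss = L_positions + aux_weight(t, alpha0) * L_visibility \
#        + aux_weight(t, beta0) * L_labels
```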

Computations. For a three-stage cascade predicting 8 fashion landmarks, DeepPose needs to train 17 VGG-16 models in total (one holistic network in stage-1 plus one network per landmark in each of stages 2 and 3, i.e. \(1 + 8 + 8 = 17\)), while DFA trains only three VGG-16 models. Our approach thus reduces training cost by more than a factor of five.

4 Experiments

This section presents evaluation and analytical results of fashion landmark detection, as well as two applications including clothing attribute prediction and clothes retrieval.

Experimental Settings. For each clothing category, we randomly select 5K images for validation and another 5K images for testing; the remaining \(30K\!\!\sim \!\!40K\) images are used for training. We employ two metrics to evaluate fashion landmark detection: normalized error (NE) and the percentage of detected landmarks (PDL) [10]. NE is the \(\ell _{2}\) distance between predicted and ground truth landmarks in the normalized coordinate space (i.e. divided by the width/height of the image), while PDL is the percentage of landmarks detected within a given distance threshold. Smaller NE values and higher PDL values indicate better results.
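A minimal sketch of the two metrics as defined above, assuming fixed image dimensions (function names are ours):

```python
import numpy as np

def normalized_error(pred, gt, width, height):
    """pred, gt: (N, L, 2) landmark coordinates in pixels. NE is the mean
    l2 distance after normalizing x by image width and y by image height."""
    scale = np.array([width, height], dtype=float)
    return np.linalg.norm((pred - gt) / scale, axis=2).mean()

def pdl(pred, gt, threshold_px=15):
    """Fraction of landmarks within `threshold_px` pixels of ground truth."""
    return (np.linalg.norm(pred - gt, axis=2) <= threshold_px).mean()
```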

Competing Methods. Since this work is the first study of fashion landmark detection, direct comparisons are difficult to find. Nevertheless, to fully demonstrate the effectiveness of DFA, we compare it with two deep models, DeepPose [10] and Image Dependent Pairwise Relations (IDPR) [11], which achieved top results in human pose estimation. They are representative methods that explored network cascades and graphical models, respectively: DeepPose designed a cascaded deep regression framework over human body parts, while IDPR combined a CNN with a structural SVM [33] to model the relations among landmarks. For a fair comparison, we replace the backbone networks of DeepPose and IDPR with VGG-16 and carefully train them using the same data and protocol as DFA.

4.1 Ablation Study

We demonstrate the merits of each component in DFA.

Effectiveness of Network Cascade. Table 2 lists NE across the three stages, from which we make two observations. First, in stage one, training DFA with both landmark positions and visibility (denoted ‘direct regression’) outperforms training with landmark positions only (denoted ‘- visibility’), showing that visibility helps landmark detection because it indicates variations of pose and scale (e.g. zoom-in). Second, the cascaded networks gradually reduce localization errors on all fashion landmarks from stage one to stage three. By predicting corrections over the previous stage, DFA decomposes a complex mapping problem into several simpler subspace regressions. For example, Fig. 8(e–g) shows qualitative stage-wise landmark detection results of DFA on different clothing items: stage-1 gives rough predictions under shape constraints, while stage-2 and stage-3 refine the results by referring to local and contextual correction patterns.

Table 2. Ablation study of DFA on different fashion landmarks, measured in normalized error (NE). ‘abs.’ indicates pseudo-labels obtained from absolute landmark positions; ‘offset’ indicates pseudo-labels obtained from local landmark offsets; ‘c. offset’ indicates pseudo-labels obtained from contextual landmark offsets. For an image size of \(224\times 224\), the best prediction is achieved in stage-3 using contextual offsets as pseudo-labels, whose errors in pixels are \(0.048\times 224=10.752\), 10.752, 20.384, 19.936, 15.904, and 16.128, respectively. \(T = 20\) was found empirically in the experiments.
Table 3. Ablation study of DFA on different evaluation subsets. NE is used here.

Effectiveness of Different Pseudo-Labels. Within each stage, we explore different choices of pseudo-labels, designed to represent landmark configurations, local landmark offsets, and contextual landmark offsets for the three stages, respectively. Table 3 shows that pseudo-labels lead to substantial gains over direct regression, especially in the first stage. Next, we justify the form of pseudo-label adopted for each stage. In stage-1, we find that using soft labels, denoted ‘\(+p.~labels~(T=20)\)’, instead of hard labels, denoted ‘\(+p.~labels~(T=1)\)’, results in better performance, because soft labels are more informative in identifying the landmark configuration of a sample. In stage-2, pseudo-labels generated from landmark offsets are superior to those generated from absolute landmark positions, since offsets provide more guidance on the local corrections to be predicted. In stage-3, including contextual landmark offsets achieves further gains, due to the fact that the landmark corrections to be made are generally correlated.

Effectiveness of Auto-Routing. Finally, we show that auto-routing is an effective way to handle data with different correction difficulties. From Table 2 stage-3, we can see that auto-routing (denoted ‘+auto-routing’) provides more benefit than averaging the predictions of two branches trained on all data (denoted ‘+two-branch’). Further inspecting the stage-wise performance on each evaluation subset in Table 3, we observe that the auto-routing mechanism improves performance on the medium/large zoom-in subsets, showing that the routing function lets one of the stage-3 branches focus on difficult samples.

4.2 Benchmarking

To illustrate the effectiveness of DFA, we compare it with state-of-the-art human pose estimation methods, DeepPose [10] and IDPR [11], and analyze the strengths and weaknesses of each method on fashion landmark detection.

Landmark Types. Figure 5 (first row) shows the detection rates for different fashion landmarks, from which we make three observations. First, on the ‘hem’ landmarks, DeepPose performs better when the distance threshold is small, while IDPR catches up when the threshold is large, because DeepPose is a part-based method that can locally refine results for easy cases. Second, collars are the easiest landmarks to detect while sleeves are the hardest. Third, DFA consistently outperforms or matches both DeepPose and IDPR on all fashion landmarks, showing that the pseudo-label and auto-routing mechanisms enable robust fashion landmark detection.

Fig. 5. Performance of fashion landmark detection on different fashion landmarks (first row) and different clothing types (second row). [px] denotes pixels. The percentage of detected landmarks (PDL) is used.

Clothing Types. Figure 5 (second row) shows the detection rates for different clothing types. Again, DFA outperforms all other methods at all distance thresholds. We make two additional observations. First, DFA (stage-1) already achieves results on full-body and lower-body clothes comparable to IDPR and DeepPose (stage-3). Second, upper-body clothes pose the greatest challenge for fashion landmark detection, partly due to the wide variety of clothing sub-categories they contain.

Difficulty Levels. Figure 6 shows the detection rates on the different evaluation subsets, with the distance threshold fixed at 15 pixels. We make two observations. First, fashion landmark detection is a challenging task: even on the normal-pose subset, the detection rate is only just above \(70\,\%\), so more powerful models still need to be developed. Second, DFA has the largest advantage on the medium pose/zoom-in subsets, as pseudo-labels provide effective shape constraints for hard cases. Note also that DFA (stage-3) requires much less computation than DeepPose (stage-3).

Fig. 6. Performance of fashion landmark detection on different evaluation subsets. The distance threshold is fixed at 15 pixels. PDL is used.

For a \(300\times 300\) image, DFA takes around 100 ms to detect the full set of fashion landmarks on a single GTX Titan X GPU, whereas DeepPose needs nearly 650 ms in the same setting. Our framework therefore has large potential in real-life applications. Visual results of fashion landmark detection by different methods are given in Fig. 8.

4.3 Generalization of DFA

To test the generalization ability of the proposed framework, we further apply DFA to a related task, human pose estimation, as reported in Table 4. DFA is trained and evaluated on the LSP dataset [40], following the protocol of [11].

First, we compare the DFA system with other state-of-the-art pose estimation methods. Without much adaptation, DFA achieves 74.4 mean strict PCP, with 87, 91, 70, 56, 81, and 76 for ‘torso’, ‘head’, ‘u.arms’, ‘l.arms’, ‘u.legs’ and ‘l.legs’, respectively. It shows comparable results to [11] and outperforms several recent works [10, 30, 32, 39], indicating that DFA is a general approach to structural prediction problems beyond fashion landmark detection.

Table 4. Comparison of strict PCP results on the LSP dataset. DFA shows good generalization ability to human pose estimation.

Then, we show that the pseudo-label and auto-routing schemes of DFA generalize to improve existing pose estimation methods, such as IDPR [11], whose DCNN achieved 75 mean strict PCP. We add pseudo-labels to this DCNN and include auto-routing in the cascaded predictions, keeping the training and evaluation of the graphical model unchanged. Pseudo-labels lift the result to 77 mean strict PCP and auto-routing adds another 1.6 points, demonstrating that pseudo-labels and auto-routing are effective techniques complementary to current methods.

4.4 Applications

Finally, we show that fashion landmarks can facilitate clothing attribute prediction and clothes retrieval. We employ a subset of the DeepFashion dataset [3], which contains 10K images, 50 clothing attributes, and corresponding image pairs (i.e. pairs of images containing the same clothing item). We compare fashion landmarks with different localization schemes, including the full image, the bounding box (bbox) of the clothing item, and human-body joints, where fashion landmarks are detected by DFA, human joints are obtained using the code of [11], and bounding boxes are manually annotated. For both attribute recognition and clothes retrieval, we use off-the-shelf CNN features as described in [35].

Attribute Prediction. We train a multi-layer perceptron (MLP) on the extracted CNN features to predict all 50 attributes. Following [41], we employ the top-k recall rate as the evaluation criterion, obtained by ranking the classification scores and counting how many ground truth attributes are found among the top-k predicted attributes. Overall, the average top-5 recall rates over the 50 attributes for ‘full image’, ‘bbox’, ‘human joints’, and ‘fashion landmarks’ are \(27\,\%\), \(53\,\%\), \(65\,\%\) and \(73\,\%\), respectively, showing that fashion landmarks are the most effective representation for attribute prediction of fashion items. Figure 7(a) shows the top-5 recall rates of ten representative attributes, e.g. ‘stripe’, ‘long-sleeve’, and ‘V-neck’. Fashion landmarks outperform all other localization schemes on every attribute, especially part-based attributes such as ‘zip-up’ and ‘shoulder-straps’.
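A minimal sketch of the top-k recall rate as we read it above (names are ours):

```python
import numpy as np

def top_k_recall(scores, labels, k=5):
    """scores: (N, A) attribute classification scores; labels: (N, A) binary
    ground truth. Counts ground-truth attributes found among the k
    highest-scoring predictions, averaged over images."""
    topk = np.argsort(-scores, axis=1)[:, :k]              # top-k attribute indices
    hits = np.take_along_axis(labels, topk, axis=1).sum(axis=1)
    return (hits / np.maximum(labels.sum(axis=1), 1)).mean()  # guard empty labels
```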

Fig. 7. Experimental results on clothing attribute prediction and clothes retrieval using features extracted from ‘full image’, ‘bbox’, ‘human joints’, and ‘fashion landmarks’: (a) top-5 recall rates of clothing attributes and (b) top-k clothes retrieval accuracy.

Fig. 8. Visual results of fashion landmark detection by different methods: (a) Ground Truth, (b) IDPR, (c) DeepPose (stage 1), (d) DeepPose (full model), (e) DFA (stage 1), (f) DFA (stage 2) and (g) DFA (full model).

Clothes Retrieval. We adopt the \(\ell _{2}\) distance between extracted CNN features for clothes retrieval, and measure performance with the top-k retrieval accuracy: given a query image, a retrieval is considered correct if the exact clothing item is found among the top-k retrieved gallery images. As shown in Fig. 7(b), the top-20 retrieval accuracies for ‘full image’, ‘bbox’, ‘human joints’, and ‘fashion landmarks’ are \(25\,\%\), \(40\,\%\), \(45\,\%\), and \(51\,\%\), respectively. At \(k=1\) and \(k=5\), features extracted around fashion landmarks still perform better than the other alternatives, demonstrating that fashion landmarks provide more discriminative information beyond traditional paradigms.
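A minimal sketch of this retrieval protocol under our reading, with hypothetical names and item ids identifying the same clothing item across images:

```python
import numpy as np

def top_k_retrieval_accuracy(q_feat, g_feat, q_id, g_id, k=20):
    """A query counts as correct if an image of the same clothing item
    (matching id) appears among its k nearest gallery neighbors under l2."""
    d = np.linalg.norm(q_feat[:, None, :] - g_feat[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :k]        # indices of k nearest gallery images
    correct = [(g_id[nn[i]] == q_id[i]).any() for i in range(len(q_id))]
    return float(np.mean(correct))
```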

5 Conclusions

This paper introduced fashion landmark detection, an important step towards robust fashion recognition. To benchmark the task, we built a large-scale fashion landmark dataset (FLD). With FLD, we proposed a deep fashion alignment (DFA) network for robust fashion landmark detection, which leverages pseudo-labels and an auto-routing mechanism to reduce the large variations present in fashion images. Extensive experiments showed the effectiveness of its components as well as the generalization ability of DFA. To demonstrate the usefulness of fashion landmarks, we evaluated them in two fashion applications, clothing attribute prediction and clothes retrieval. Experiments revealed that fashion landmarks are a more discriminative representation than clothing bounding boxes and human joints for fashion-related tasks, which we hope will facilitate future research.