
1 Introduction

Attributes are mid-level semantic properties of visual objects that can be shared across different object categories. Typically, semantic attributes are nameable and are annotated according to a user-defined ontology. Attribute learning has been studied extensively in recent years [1–9]. Existing approaches vary drastically depending on the objective of learning attributes. Specifically, attribute learning methods have been developed for three objectives: (1) attribute prediction for image search [10], where each image is indexed by a list of predicted attributes and can thus be searched by text queries; (2) learning a mid-level representation from low-level features for object recognition, typically at the fine-grained [6] or instance level [11]; (3) zero-shot learning, where given an attribute ‘prototype’ [12], unseen classes can be recognised by comparing the prototypes with the predicted attributes.

Earlier attribute learning works often learned a binary classifier for each attribute separately and independently, ignoring the correlations that exist among attributes, e.g., ‘female’ and ‘long-hair’ are correlated. This has been rectified by recent approaches [4–9] which learn multiple attributes jointly together with the object class labels so as to exploit their correlations. However, all these joint modelling approaches focus on the semantic user-defined attributes only, ignoring the facts that (1) semantic attributes are often not exhaustively defined, and (2) there are other shareable but not nameable/semantic properties. We call these shareable but undefined properties latent attributes and argue that they should also be modelled jointly with the user-defined semantic attributes and object class labels.

Fig. 1. (a) Given only three user-defined attributes as representation, the two people are mis-matched. (b) When complemented by latent attributes, the representation is more discriminative and solving the person re-identification problem becomes easier.

Jointly learning semantic and latent attributes is important for both attribute prediction and object recognition, for two reasons. First, the latent attributes can also be discriminative and thus useful for object recognition. For example, Fig. 1 shows that a limited list of user-defined semantic attributes is often inadequate for instance-level object recognition such as person re-identification [13]. However, when a set of complementary and discriminative latent attributes is learned to augment the user-defined semantic attributes, recognition becomes easier. Second, even if predicting the user-defined attributes is the only goal, discovering and learning the latent attributes is still useful: it ensures that shareable properties not covered by the user-defined attributes are accounted for by the model rather than acting as distractors that corrupt the learned semantic attribute predictors. Furthermore, by modelling latent attributes together with class labels, we can identify two types of latent attributes: those that are related to class labels and thus potentially useful for object recognition, and those that are not. We call the former discriminative latent attributes (D-LA) and the latter background latent attributes (B-LA); the latter could literally be object background that might appear in any object class.

Fig. 2. Our framework for joint learning of the user-defined-attribute-correlated (UDAC), discriminative latent attribute (D-LA), and background latent attribute (B-LA) dictionary subspaces.

To jointly learn both types of latent attributes as well as the semantic attributes, together with their correlations with the class labels, we propose a novel dictionary learning model with dictionary decomposition. Dictionary learning is naturally suited for learning a low-dimensional subspace corresponding to the latent attribute space: because the same dictionary is shared by all object classes, it automatically discovers shareable properties. More importantly, the learned dictionary can easily be decomposed into multiple parts, with different parts subject to different correlations with the available object annotations. Specifically, the learned dictionary subspace is decomposed into three parts: (1) the D-LA dictionary subspace part, which is subject to the label correlation constraint so as to make sure that it is discriminative; (2) the B-LA dictionary subspace part, which only helps data reconstruction and is subject to no constraint; and (3) the user-defined-attribute-correlated (UDAC) dictionary subspace part, which is correlated to the user-defined attribute annotations. Note that in our framework, the user-defined attributes are learned through the latent attribute space. This is because a dictionary learning model aims to reconstruct the original signal using all dictionary atoms together, which enforces the three types of learned attributes to be complementary to each other. Figure 2 illustrates the proposed dictionary learning framework.

2 Related Work

Learning Semantic Attributes. Earlier works on semantic attribute learning [13, 14] consider predicting each attribute as a binary classification problem and solve them independently. Later works [4, 5, 7–9] realised that there exist correlations between different attributes, as well as between attributes and class labels, and proposed to learn different attributes jointly together with the class labels. For example, a unified multiplicative framework is proposed in [7] which projects images and category/class information into a shared feature space, where the disentangled latent factors are multiplied for attribute prediction. In [9, 15], the semantic attributes are learned by incorporating class label information. Our model also learns user-defined semantic attributes and class labels jointly. Different from existing joint attribute modelling works, we additionally model discriminative latent attributes and background latent attributes, both to improve the learning of the user-defined semantic attributes and to make the learned attribute-based representation more discriminative for the object classification task.

Learning Latent Attributes. Learning discriminative latent attributes has been exploited before [6, 16–21]. However, in these works, the latent attributes are not learned jointly with the user-defined ones and thus are not necessarily complementary to them. Compared to the few exceptions which learn them jointly [22, 23], there is a significant difference: by using an additive dictionary, we aim to reconstruct the original feature representation; we thus devise a third type of attributes, background latent attributes (B-LA), to explicitly account for the non-discriminative part of the representation (e.g. scene background, or what a person looks like in general) that is useless for the targeted task but has to be learned to avoid corrupting the other two types of useful attributes. Experimental results demonstrate clearly the importance of learning all three jointly. This novel concept can also be applied to existing joint attribute learning models.

Attribute-Based Person Re-identification. Semantic attributes have been exploited as a mid-level representation for matching people across non-overlapping camera views, i.e., the person re-identification (Re-ID) problem [13, 24–26]. However, these attribute-based Re-ID representations are not competitive on the benchmark datasets. This is because (1) the user-defined attribute representations have very low dimensions (dozens, vs. tens of thousands for the typical low-level feature based representations used by the state-of-the-art Re-ID methods [27]); and (2) no latent attributes are exploited. Recently, user-defined attributes and low-level features were modelled jointly in [28] in a multi-task learning framework to learn a discriminative representation for Re-ID. However, the user-defined attributes are predicted independently and no latent attributes are used. In contrast, our model is flexible in that discriminative latent attributes can still be learned when no annotation of user-defined attributes is available. Another relevant work is [11], which deploys a generative model to transfer attribute annotations from auxiliary data (fashion clothing) to the target data (surveillance video). Again, as a generative model, it is weak in learning discriminative representations.

Dictionary Learning. Beyond attribute learning, dictionary learning [29, 30] has been studied widely as a method for learning a low-dimensional subspace. Originally designed for unsupervised learning, it has been extended to supervised learning for tasks such as face verification/recognition [31] and person Re-ID [32–34]. Our model is related to these dictionary-learning-based Re-ID models in that all of them learn discriminative latent attributes through the learned dictionary subspace. However, only our model is able to additionally learn user-defined attributes and background latent attributes for better representation learning.

Contributions. Our contributions are as follows: (1) A unified framework for learning both user-defined semantic attributes and discriminative latent attributes is proposed. (2) We further develop a novel dictionary learning model which decomposes the learned dictionary subspace into three parts corresponding to the semantic, discriminative latent and background latent attributes respectively; an efficient optimisation algorithm is also formulated. Extensive experiments are carried out on benchmark attribute prediction and person Re-ID datasets, and the results show that the proposed unified framework generates state-of-the-art results on both tasks.

3 Methodology

3.1 Formulation

Assume that a set of training data is given, labelled with user-defined (semantic) attributes and object classes. In this paper, we focus on the problem of learning user-defined semantic and latent attributes jointly by dictionary learning. Specifically, the learned dictionary is decomposed into the following three parts (see Fig. 2): (1) \(D^u\), the user-defined-attribute-correlated (UDAC) dictionary subspace part, which is correlated to the user-defined attribute annotations; (2) \(D^d\), the discriminative latent attribute (D-LA) dictionary subspace part, which is correlated to the class labels and thus useful for the given classification/recognition task; and (3) \(D^b\), the background latent attribute (B-LA) dictionary subspace part, which captures all the residual information in the training data that is uncorrelated to either the user-defined attributes or the class labels and is thus learned without any supervision.

Formally, we assume \(Y \in \mathbb {R}^{m \times n}\) is a data matrix where each column \(y_i\) is an m-dimensional feature vector representing the \(i^{th}\) object’s appearance, and n denotes the number of training samples. A is a \(p \times n\) matrix where each column \(a_i \in {\{0,1\}}^p\) indicates the absence or presence of each of the p binary user-defined attributes. The proposed method is formulated as:

$$\begin{aligned} \begin{aligned} \left[ D^u,D^d,D^b,W\right] =&\arg \min \left\| Y - D^u X^u-D^d X^d \right\| _F^2+\left\| Y - D^u X^u-D^d X^d-D^b X^b \right\| _F^2\\&+ \, \alpha \sum \limits _{i,j= 1}^{n}{m_{i,j}}{{\left\| {x_{i}^d - x_{j}^d} \right\| }^2} +\beta ^2 \left\| X^u-WA\right\| _F^2.\\&s.t.~\left\| d^u_i \right\| _2^2 \le 1,~\left\| d^d_i \right\| _2^2 \le 1,~\left\| d^b_i \right\| _2^2 \le 1,~\left\| w_i \right\| _2^2 \le 1~\forall i, \end{aligned} \end{aligned}$$
(1)

where matrices \(X^u\), \(X^d\) and \(X^b\) are the codes/coefficients corresponding to dictionaries \(D^u\), \(D^d\) and \(D^b\) respectively; W builds the correspondence between the codes obtained using \(D^u\) and the user-defined attribute annotation matrix A; \(d^u_i, d^d_{i}, d^b_{i}\) and \(w_i\) are the \(i^{th}\) columns of \(D^u\), \(D^d\), \(D^b\) and W respectively; \(x^d_{i}\) is the \(i^{th}\) column of \(X^d\); \(\alpha \) and \(\beta \) are free parameters controlling the strengths of the two regularisation terms explained below; and M is an affinity matrix encoding the class relationships (same/different class) among training samples. Specifically, \(m_{i,j}=1\) if \(x^d_{i}\) and \(x^d_{j}\) are of the same class, and \(m_{i,j}=0\) otherwise. The third term can be rewritten using the graph Laplacian matrix as:

$$\begin{aligned} \small \sum \limits _{i,j= 1}^{n} m_{i,j}{\left\| x^d_{i}-x^d_{j} \right\| ^2}=2\,\mathrm{Tr}(X^d L {X^d}'), \end{aligned}$$
(2)

where \(L=Q-M\) and Q is a diagonal matrix whose diagonal elements are the row sums of M (the sum over the ordered index pairs (i, j) counts each unordered pair twice, hence the factor 2). There are four terms of three categories in the cost function, which are now explained in detail (a small numerical check of (2) follows this list):

  1.

    The first two terms are reconstruction errors that ensure the learned dictionaries encode the data matrix Y well. Note that the two reconstruction error terms are stepwise ordered: minimising the first term encourages \(D^u\) and \(D^d\) to encode as much of Y as possible, while minimising the second term makes \(D^b\) encode the residual part of Y that cannot be coded by \(D^u\) and \(D^d\). This stepwise, two-term formulation is important to prevent the B-LA dictionary \(D^b\) from dominating the reconstruction, which would lead to trivial solutions for \(D^u\) and \(D^d\).

  2.

    The third term is a graph Laplacian regularisation term which encourages the projections of the columns of Y in the D-LA subspace, i.e., the columns of \(X^d\), to be close to each other if the corresponding data points belong to the same class. The goal of this term is thus to make the D-LA subspace parametrised by \(D^d\) more discriminative (class-dependent).

  3.

    The last term is the constraint for learning the UDAC subspace part \(D^u\). Note that we establish a linear mapping W between the projections in that subspace, \(X^u\), and the user-defined attribute annotations A, rather than simply setting them to be equal (\(X^u=A\)), because the user-defined attributes are not necessarily additive. As explained earlier, modelling the user-defined attributes via the same dictionary subspace makes the other two types of learned latent attributes complementary to the user-defined attributes.
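The Laplacian identity in (2) is easy to verify numerically. Below is a minimal NumPy sketch (our illustration, not the authors' code) that builds M, Q and L from toy class labels and checks the identity against the pairwise sum:

```python
import numpy as np

rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 1, 2])                       # n = 5 toy samples
M = (labels[:, None] == labels[None, :]).astype(float)   # m_ij = 1 iff same class
Q = np.diag(M.sum(axis=1))                               # row sums on the diagonal
L = Q - M                                                # graph Laplacian

Xd = rng.standard_normal((8, labels.size))               # toy codes X^d
lhs = sum(M[i, j] * np.sum((Xd[:, i] - Xd[:, j]) ** 2)
          for i in range(labels.size) for j in range(labels.size))
assert np.isclose(lhs, 2.0 * np.trace(Xd @ L @ Xd.T))    # Eq. (2)
```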

3.2 Optimisation

Here we detail how the optimisation problem in (1) is solved. The problem is divided into the following subproblems:

  1.

    Computing codes \(X^u\). Given fixed \(D^u\), \(D^d\), \(D^b\), W, \(X^d\) and \(X^b\), the coding problem of estimating \(X^u\) becomes:

    $$\begin{aligned} \small \begin{aligned} \min \left\| {\tilde{Y}}- {\tilde{D}} X^u \right\| _F^2, \end{aligned} \end{aligned}$$
    (3)

    where

    $$\begin{aligned} \small \begin{aligned}&\tilde{Y}=\left[ \begin{array}{l} { Y-D^dX^d}\\ {Y-D^dX^d-D^bX^b}\\ {\beta W A} \end{array} \right] ,{\,} \tilde{D}=\left[ \begin{array}{l} { D^u}\\ {D^u}\\ {\beta I} \end{array} \right] , \end{aligned} \end{aligned}$$

    and I is the identity matrix. Setting the derivative of (3) to zero gives the analytical solution of \(X^u\):

    $$\begin{aligned} \small \begin{aligned} X^u=\left( {\tilde{D}}'{\tilde{D}}\right) ^{-1}{\tilde{D}}'\tilde{Y}. \end{aligned} \end{aligned}$$
    (4)
  2.

    Computing codes \(X^d\). Given the other variables fixed, the coding problem of \(X^d\) becomes:

    $$\begin{aligned} \small \begin{aligned} \min \left\| {\tilde{Y}} - {\tilde{D}} X^d \right\| _F^2+2\alpha \,\mathrm{Tr}({X^d} L {X^d}'), \end{aligned} \end{aligned}$$
    (5)

    where

    $$\begin{aligned} \small \begin{aligned}&\tilde{Y}=\left[ \begin{array}{l} { Y-D^u X^u}\\ {Y-D^u X^u- D^b X^b} \end{array} \right] ,{\,} \tilde{D}=\left[ \begin{array}{l} { D^d}\\ {D^d}\end{array} \right] . \end{aligned} \end{aligned}$$

    and the analytical solution of \(x^d_{i}\) (the \(i^{th}\) column of \(X^d\)) is:

    $$\begin{aligned} \small \begin{aligned} x^d_{i} =\left( {\tilde{D}}'{\tilde{D}}+2\alpha l_{i,i}I\right) ^{-1}\left( {\tilde{D}}'{\tilde{y}}_{i}-2\alpha \sum \limits _{k\ne i} x^d_{k} l_{k,i}\right) , \end{aligned} \end{aligned}$$
    (6)

    where \(l_{k,i}\) is the \((k,i)\) element of L and \({\tilde{y}}_{i}\) is the \(i^{th}\) column of \({\tilde{Y}}\).

  3.

    Computing codes \(X^b\). With the other variables fixed, \(X^b\) is solved by:

    $$\begin{aligned} \small \begin{aligned} \min \left\| {Y - D^u X^u-D^d X^d - D^b X^b} \right\| _F^2. \end{aligned} \end{aligned}$$
    (7)

    Setting the derivative of (7) to zero gives the analytical solution of \(X^b\):

    $$\begin{aligned} \small \begin{aligned}&X^b=\left( {D^b}'D^b\right) ^{-1}{D^b}'\left( Y - D^u X^u-D^d X^d\right) .\\ \end{aligned} \end{aligned}$$
    (8)
  4.

    Updating dictionaries. First, when \(D^b\), \(X^u\), \(X^d\) and \(X^b\) are given, \(D^u\) and \(D^d\) are estimated by the following optimisation problem:

    $$\begin{aligned} \small \begin{aligned} \min \left\| {{\mathcal {Y}} - \mathcal {D} \mathcal {X}} \right\| _F^2, ~s.t.~\left\| d^u_i \right\| _2^2 \le 1,~\left\| d^d_{i} \right\| _2^2 \le 1, \end{aligned} \end{aligned}$$
    (9)

    where

    $$\begin{aligned} \small \begin{aligned} \mathcal {D}=[D^u, D^d],~ \mathcal {Y}=[Y, Y-D^b X^b],~\mathcal {X}=\left[ {\begin{array}{*{20}{c}} {{X^u}}&{}{{X^u}}\\ {{X^d}}&{}{{X^d}} \end{array}} \right] . \end{aligned} \end{aligned}$$
    (10)

    (9) can be optimised with the Lagrange dual. Thus, the analytical solution of \(\mathcal {D}\) is: \(\mathcal {D}=\left( \mathcal {Y} {\mathcal {X}}' \right) \left( \mathcal {X} {\mathcal {X}}'+\Lambda \right) ^{-1}\), where \(\Lambda \) is a diagonal matrix constructed from all the dual variables. Second, we fix other variables and solve \(D^b\) with the following objective function:

    $$\begin{aligned} \small \begin{aligned} \min \left\| {Y-D^u X^u-D^d X^d - D^b X^b} \right\| _F^2, ~~&s.t.~\left\| d^b_{i} \right\| _2^2 \le 1 (\forall i), \end{aligned} \end{aligned}$$
    (11)

    (11) can be solved similarly to (9).

  5.

    Updating W. Similar to the dictionary updating procedure in Step 4, we fix other variables and solve W by:

    $$\begin{aligned} \small \begin{aligned} \min \left\| X^u-WA \right\| _F^2, ~~&s.t.~\left\| w_{i} \right\| _2^2 \le 1 (\forall i). \end{aligned} \end{aligned}$$
    (12)

    (12) can be optimised using the Lagrange dual. The analytical solution of W is: \(W=\left( {X^u} {A}' \right) \left( A A'+\Lambda \right) ^{-1}\), where \(\Lambda \) is a diagonal matrix constructed from all the dual variables.

Algorithm 1 summarises the whole optimisation procedure. In practice, we found that it always converges within a small number (\({<}50\)) of iterations in our experiments.

Algorithm 1. Joint learning of \(D^u\), \(D^d\), \(D^b\) and W by alternating Steps 1–5 until convergence.
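To make the alternating procedure concrete, here is a minimal NumPy sketch of how Steps 1–5 could be implemented. It is our reading of Sect. 3.2, not the authors' Matlab code: the column-norm constraints are approximated by a fixed ridge term plus column renormalisation in place of the exact Lagrange-dual solve, and all variable names are ours.

```python
import numpy as np

def learn_dictionaries(Y, A, labels, k_u=100, k_d=100, k_b=100,
                       alpha=1.0, beta=1.0, lam=1e-3, n_iter=50):
    """Sketch of Algorithm 1: alternating minimisation of Eq. (1)."""
    m, n = Y.shape
    p = A.shape[0]
    rng = np.random.default_rng(0)

    # Enforce ||column||_2 <= 1 by rescaling (a heuristic stand-in for the
    # exact Lagrange-dual treatment of the norm constraints).
    def norm_cols(D):
        return D / np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1.0)

    Du = norm_cols(rng.standard_normal((m, k_u)))
    Dd = norm_cols(rng.standard_normal((m, k_d)))
    Db = norm_cols(rng.standard_normal((m, k_b)))
    W = norm_cols(rng.standard_normal((k_u, p)))
    Xd = rng.standard_normal((k_d, n))
    Xb = rng.standard_normal((k_b, n))
    M = (labels[:, None] == labels[None, :]).astype(float)
    L = np.diag(M.sum(axis=1)) - M                    # Laplacian of Eq. (2)
    w_eig, V = np.linalg.eigh(L)                      # used in Step 2

    for _ in range(n_iter):
        # Step 1: X^u via Eq. (4), stacking the two residuals and beta*W*A.
        Yt = np.vstack([Y - Dd @ Xd, Y - Dd @ Xd - Db @ Xb, beta * (W @ A)])
        Dt = np.vstack([Du, Du, beta * np.eye(k_u)])
        Xu = np.linalg.solve(Dt.T @ Dt, Dt.T @ Yt)
        # Step 2: X^d. Instead of the column-wise updates of Eq. (6), we solve
        # the stationarity condition G X + 2*alpha*X L = D~' Y~ jointly by
        # diagonalising L (equivalent at the optimum).
        Yt = np.vstack([Y - Du @ Xu, Y - Du @ Xu - Db @ Xb])
        Dt = np.vstack([Dd, Dd])
        G = Dt.T @ Dt + lam * np.eye(k_d)             # small ridge for stability
        B = Dt.T @ Yt @ V
        Xd = np.column_stack(
            [np.linalg.solve(G + 2 * alpha * we * np.eye(k_d), B[:, i])
             for i, we in enumerate(w_eig)]) @ V.T
        # Step 3: X^b via Eq. (8).
        R = Y - Du @ Xu - Dd @ Xd
        Xb = np.linalg.solve(Db.T @ Db + lam * np.eye(k_b), Db.T @ R)
        # Step 4: D^u, D^d via the closed form for Eq. (9), then D^b for (11);
        # lam*I plays the role of the dual-variable matrix Lambda.
        Ycal = np.hstack([Y, Y - Db @ Xb])
        Xcal = np.block([[Xu, Xu], [Xd, Xd]])
        Dud = (Ycal @ Xcal.T) @ np.linalg.inv(Xcal @ Xcal.T + lam * np.eye(k_u + k_d))
        Du, Dd = norm_cols(Dud[:, :k_u]), norm_cols(Dud[:, k_u:])
        R = Y - Du @ Xu - Dd @ Xd
        Db = norm_cols((R @ Xb.T) @ np.linalg.inv(Xb @ Xb.T + lam * np.eye(k_b)))
        # Step 5: W via the closed form for Eq. (12).
        W = norm_cols((Xu @ A.T) @ np.linalg.inv(A @ A.T + lam * np.eye(p)))
    return Du, Dd, Db, W
```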

3.3 Application to Person Re-ID

In the person Re-ID problem, we assume that the training images are represented by some feature representation Y, and labelled with identities encoded in the matrix M and a set of user-defined attributes A. Once the three dictionaries are learned from the training set as described above, each test image y can be encoded as \(\left[ x^u,x^d,x^b\right] \) via \(D^u\), \(D^d\) and \(D^b\) respectively. The encoding problem is formulated as:

$$\begin{aligned} \small \begin{aligned} \left[ x^u,x^d,x^b\right] =\arg \min \left\| {y- {D^u}{x^u} -D^d{x^d}- D^b{x^b}} \right\| _2^2+\gamma \left\| x^u\right\| _2^2+\gamma \left\| x^d\right\| _2^2+\gamma \left\| x^b\right\| _2^2, \end{aligned} \end{aligned}$$
(13)

where \(x^u\), \(x^d\) and \(x^b\) are the projections of y in the UDAC, D-LA and B-LA parts of the learned dictionary subspace respectively, and \(\gamma \) weights the regularisation terms. (13) can be solved easily as a linear system. After obtaining \(x^u\), the user-defined attribute vector a can be predicted via the linear mapping W:

$$\begin{aligned} \small \begin{aligned} a=\arg \min \left\| x^u- W a\right\| _2^2+\gamma \left\| a\right\| _2^2. \end{aligned} \end{aligned}$$
(14)

Now the test sample y can be represented by the predicted user-defined attributes a and the D-LA code \(x^d\). Treating the predicted attributes simply as features, Re-ID can be performed by score-level fusion: the cosine distances between the a vectors and between the \(x^d\) vectors of a probe sample and a gallery sample are computed and then combined.
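A minimal sketch of this test-time pipeline follows, under the same assumptions as the sketch above; the equal fusion weight is our illustrative choice, as the paper does not specify how the two scores are combined:

```python
import numpy as np

def encode(y, Du, Dd, Db, gamma=0.1):
    """Eq. (13): ridge-regularised coding of a test image y."""
    D = np.hstack([Du, Dd, Db])
    x = np.linalg.solve(D.T @ D + gamma * np.eye(D.shape[1]), D.T @ y)
    ku, kd = Du.shape[1], Dd.shape[1]
    return x[:ku], x[ku:ku + kd], x[ku + kd:]         # x^u, x^d, x^b

def predict_attributes(xu, W, gamma=0.1):
    """Eq. (14): a = argmin ||x^u - W a||^2 + gamma ||a||^2 (closed form)."""
    p = W.shape[1]
    return np.linalg.solve(W.T @ W + gamma * np.eye(p), W.T @ xu)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def match_score(y_probe, y_gallery, Du, Dd, Db, W, w_fuse=0.5):
    """Score-level fusion of the attribute and D-LA similarities; the
    fusion weight w_fuse is illustrative, not specified in the paper."""
    xu_p, xd_p, _ = encode(y_probe, Du, Dd, Db)
    xu_g, xd_g, _ = encode(y_gallery, Du, Dd, Db)
    a_p = predict_attributes(xu_p, W)
    a_g = predict_attributes(xu_g, W)
    return w_fuse * cosine(a_p, a_g) + (1 - w_fuse) * cosine(xd_p, xd_g)
```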

Note that the proposed method can still work without the user-defined attribute annotations A in the training data. In this case, \(D^u\), W and \(X^u\) will be dropped and (1) becomes:

$$\begin{aligned} \begin{aligned} \left[ D^d,D^b\right] =&\arg \min \left\| Y-D^d X^d \right\| _F^2+\left\| Y-D^d X^d-D^b X^b \right\| _F^2+\alpha \sum \limits _{i,j= 1}^{n}{m_{i,j}}{{\left\| {x_{i}^d - x_{j}^d} \right\| }^2}, \\&s.t.~\left\| d^d_{i} \right\| _2^2 \le 1,~\left\| d^b_{i} \right\| _2^2 \le 1,~\forall i, \end{aligned} \end{aligned}$$
(15)

(15) can be solved as a special case of (1). Consequently, the test sample y is represented only by its D-LA \(x^d\), which can be obtained by solving an optimisation problem similar to (13).

3.4 Application to User-Defined Attribute Prediction

In this task, our only goal is to predict the user-defined attributes; hence a separate D-LA dictionary \(D^d\) is unnecessary, and \(D^b\) alone can be used to explain any information that cannot be explained by \(D^u\). Consequently, \(D^d\), \(X^d\) and the graph Laplacian regularisation term are removed from (1), and the dictionary learning problem becomes:

$$\begin{aligned} \begin{aligned} \left[ D^u,D^b,W\right] =&\arg \min \left\| Y - D^u X^u \right\| _F^2+\left\| Y - D^u X^u-D^b X^b \right\| _F^2+\beta ^2 \left\| X^u-WA\right\| _F^2 \\&s.t.~\left\| d^u_i \right\| _2^2 \le 1,~\left\| d^b_{i} \right\| _2^2 \le 1,~\left\| w_{i} \right\| _2^2 \le 1~\forall i. \end{aligned} \end{aligned}$$
(16)

It can also be solved as a special case of (1) with a similar solver as described in Sect. 3.2. Once the model is learned using a training set, a test sample y can be encoded with \(D^u\) and \(D^b\) by solving an optimisation problem similar to (13). Finally, the user-defined attribute vector a is predicted via (14).

4 Experiments

The proposed attribute learning model is evaluated on three tasks: attribute-based person re-identification (Re-ID), user-defined attribute prediction, and zero-shot learning.

4.1 Person Re-ID

For this task, our attribute learning model is used to learn a discriminative mid-level representation for matching people across camera views.

Datasets. Four widely used benchmark datasets are chosen for person Re-ID. VIPeR [35] contains 1,264 images of 632 individuals from two distinct camera views (two images per individual), featuring large viewpoint changes and varying illumination conditions. All individuals are randomly divided into two equal-sized subsets for training and testing respectively, with no overlap in identity between the two subsets. This random partition process is repeated 10 times and the averaged performance is reported. For fair comparison, we use the same data splits as in [36]. PRID [37] consists of images extracted from multiple person trajectories recorded by two static surveillance cameras. Camera view A contains 385 individuals and camera view B contains 749, of which 200 appear in both views. The single-shot version of the dataset is used in our experiments, with the same data splits as in [36]. In each split, 100 people with one image from each view are randomly chosen from the 200 present in both camera views for the training set, while the remaining 100 individuals of view A form the probe set and the remaining 649 of view B form the gallery set. Experiments are repeated over the 10 splits. iLIDS [38] has 476 images of 119 individuals captured in an airport terminal from three cameras with distinct viewpoints. It contains heavy occlusions caused by a large number of people and luggage. As in [39], the 119 identities are randomly divided into two equal halves, one for training and the other for testing; the reported results are averaged over 10 trials. Market-1501 [40] is the biggest Re-ID benchmark dataset to date, containing 32,668 detected person images of 1,501 identities. Each identity is captured by at most six and at least two cameras. We use the fixed training and test sets provided in [40], under both the single-query and multi-query evaluation settings.

Attribute Annotation. The training sets of all four datasets have labels indicating the identities of the people. In addition, a total of 105 user-defined attributes have been annotated on each training image of VIPeR, PRID and iLIDS as in [14]. We remove the user-defined attributes that appear only rarely in each dataset, leaving 85, 56 and 73 attributes for VIPeR, PRID and iLIDS respectively. Note that attribute annotation is unavailable for Market-1501. As mentioned in Sect. 3.3, our model works both with and without the user-defined attributes. For fair comparison with existing methods that do not use additional attribute annotations, we report results of our model both with and without user-defined attributes.

Features and Evaluation Metric. The low-level feature representation of [36] is employed in our experiments: colour histogram, HOG and LBP features are concatenated into a 5,138-dimensional vector. For the evaluation metric, we compute Cumulated Matching Characteristics (CMC) curves. Due to space constraints, as well as for easier comparison with published results, we report the cumulated matching accuracy at selected ranks in tables rather than plotting the full CMC curves. The only exception is the Market-1501 dataset: since there are on average 14.8 cross-camera ground-truth matches for each query, we additionally report mean average precision (mAP) as in [40].
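For reference, a minimal sketch of the CMC computation in the single-shot setting (one true match per probe); this follows the standard definition rather than code from the paper:

```python
import numpy as np

def cmc(dist, probe_ids, gallery_ids, ranks=(1, 5, 10, 20)):
    """dist: (n_probe, n_gallery) distance matrix; each probe is assumed to
    have exactly one true match in the gallery (single-shot setting)."""
    order = np.argsort(dist, axis=1)                  # gallery sorted per probe
    hit_rank = np.array([
        int(np.where(gallery_ids[order[i]] == probe_ids[i])[0][0])
        for i in range(len(probe_ids))])              # rank of the true match
    return {k: float(np.mean(hit_rank < k)) for k in ranks}
```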

Parameter Settings. On the VIPeR, PRID and iLIDS datasets, the sizes of \(D^u\), \(D^d\) and \(D^b\) are all set to 100. We found that the performance of our model is insensitive to the dictionary size when it is between 100 and 200. The size of \(D^d\) is increased to 400 for Market-1501 because Market-1501 is much bigger than the other three datasets. The other free parameters of our model, \(\alpha \) and \(\beta \) in (1) and \(\gamma \) in (13), are set by four-fold cross-validation.

Competitors. State-of-the-art Re-ID methods from six categories are selected for comparison. (1) Unsupervised: BoW features [40] based on Colour Names (CN) alone, or in combination with Hue-Saturation histograms (HS), used with the \(l_2\) distance. (2) Distance metric learning based methods: RPLM [41], Mid-level Filter [42], LADF [43] and Similarity Learning [44]. (3) Kernel-based discriminative subspace learning methods: MFA [39], kLFDA [39], kCCA [36], XQDA [27] and MLAPG [45]. (4) Deep learning based: Improved Deep [46]. (5) Feature fusion based: Metric Ensembles [47]; note that this method fuses more than one kind of feature, which is known to benefit all methods. (6) Attribute-based: aMTL [28]. This is the most relevant to ours as it also utilises the user-defined attributes. Note that aMTL requires multiple images of each person for training; its authors therefore apply data augmentation to generate more training samples on VIPeR and use the multi-shot setting of PRID rather than the single-shot one adopted by most other methods including ours. Furthermore, different from our model, aMTL cannot work without user-defined attributes. For fair comparison, we use the same features and the same training-test splits for the compared methods whenever possible (i.e. when the code is available, we use the same features as ours). Three versions of our model are evaluated: “Ours_L” means only latent attributes are learned as the representation, i.e., the user-defined attribute annotation is not used, as is the case for most other compared methods; “Ours_U” means only user-defined attributes are used to represent a person; “Ours_All” means both the user-defined and latent attributes are used.

Table 1. Comparative results on four benchmark Re-ID datasets. ‘*’ means we compare these methods with the same features using the author-provided code. ‘-’ means no reported result is available.

Comparative Results. From the results in Table 1 we have the following key findings: (1) Even without using the additional attribute annotation, our method Ours_L outperforms all compared methods, particularly at low ranks. (2) When user-defined attributes are available, the results of Ours_U are very poor, showing that the user-defined attributes alone cannot represent a person discriminatively without latent attributes, because the user-defined attribute representations have very low dimensions, as explained. Ours_All outperforms Ours_L and Ours_U on all datasets, which shows that the learned user-defined attributes and discriminative latent attributes are indeed complementary. (3) Compared to the alternative attribute-based Re-ID model aMTL, our model (Ours_All) is clearly better. In particular, the proposed method outperforms aMTL by a large margin even though aMTL uses more training data on PRID. In addition, aMTL can only be applied when user-defined attribute annotations exist, whilst our model has no such restriction.

4.2 User-Defined Attribute Prediction

Datasets and Settings. Three widely used benchmark datasets are chosen for this experiment. AwA is composed of 30,475 images from 50 animal categories, and each category is annotated with 85 attributes. Following [2, 3, 9], we divide the dataset into two parts: 40 classes (24,295 images) for training and 10 classes (6,180 images) for testing. For fair comparison with the state-of-the-art methods, we adopt the same 4096-dimensional DeCAF deep features [48] provided by [3]. CUB contains 11,788 images of 200 bird classes, and each category is annotated with 312 attributes. We split the dataset following [8, 9] to facilitate direct comparison with the state-of-the-art methods (150 classes for training and the remaining 50 classes for testing), and extract the same 4096-D DeCAF features as in [9]. PETA comprises 10 publicly available small-scale person image datasets totalling 19,000 images, each labelled with 105 attributes. For fair comparison with [14], we follow the same setting and randomly select 9,500 images for training, 1,900 for validation and 7,600 for testing; this is repeated 10 times and the average result is reported. As in [14], the same low-level colour and texture features are extracted, and prediction results on the same selected 35 attributes are evaluated.

Competitors and Evaluation Metrics. Six state-of-the-art attribute learning approaches are compared: Direct Attribute Prediction (DAP) [2, 3], Indirect Attribute Prediction (IAP) [2, 3], Attribute Label Embedding (ALE) [8], Class-Specific Hypergraph based Attribute Predictor (CSHAP) [9], “Two birds, One stone” (TbOs) [15] and Markov Random Field graph (MRF) [14]. For direct comparison with the reported results in the literature, attribute prediction performance is measured by the mean area under the ROC curve (mAUC) on AwA and CUB, and by mean classification accuracy (mACC) on PETA.
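Both metrics are standard; for concreteness, here is a sketch of mAUC (the per-attribute ROC AUC averaged over attributes), assuming real-valued prediction scores and binary ground truth:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_auc(scores, gt):
    """scores, gt: (n_images, n_attributes); gt is binary. mAUC averages the
    per-attribute ROC AUC, skipping attributes with only one label present."""
    aucs = [roc_auc_score(gt[:, j], scores[:, j])
            for j in range(gt.shape[1]) if len(np.unique(gt[:, j])) == 2]
    return float(np.mean(aucs))
```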

Table 2. Comparative results on (a) predicting user-defined attributes and (b) zero-shot learning. “*” means same feature are used and “-” means no reported results.

Comparative Results. The user-defined attribute prediction performance is reported in Table 2(a). The results show that the proposed method achieves state-of-the-art performance on CUB and PETA. In particular, on CUB its mAUC is 6% higher than that of the nearest competitor, CSHAP. However, it is slightly inferior to CSHAP on AwA.

4.3 Zero-Shot Learning

Since images from different classes may share common attributes, images from unseen classes can be recognised based on transferred attribute concepts; this is referred to as zero-shot learning [3]. Specifically, the user-defined attribute predictors learned from the seen classes are used to classify images from the unseen classes.
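The paper does not spell out the classification rule; a common choice, sketched below, is nearest-prototype matching between the predicted attribute vector from (14) and each unseen class's attribute prototype (the cosine similarity is our illustrative choice):

```python
import numpy as np

def zero_shot_classify(a_pred, prototypes):
    """a_pred: (n_images, p) predicted attribute vectors from Eq. (14);
    prototypes: (n_unseen, p) per-class attribute annotations. Returns the
    index of the most similar unseen-class prototype for each image."""
    a = a_pred / (np.linalg.norm(a_pred, axis=1, keepdims=True) + 1e-12)
    P = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-12)
    return np.argmax(a @ P.T, axis=1)
```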

Datasets and Settings. Two benchmark datasets, AwA and CUB, are used in this experiment. For AwA, 40 classes are chosen as seen classes for training and the remaining 10 as unseen classes for testing; CUB is split into 150 classes for training and 50 for testing. For both datasets, we use MatConvNet [52] with the “imagenet-vgg-verydeep-19” pretrained model [53] to extract a 4096-dimensional CNN feature vector for each image (or bounding box). The train-test splits and features are the same as in [49–51].

Comparative Results. In this experiment, we compare our method with several state-of-the-art methods, reporting image classification accuracy. As shown in Table 2(b), our method performs significantly better than the state-of-the-art approaches on both datasets.

4.4 Further Evaluations

Contributions of Model Components. There are two key components in the proposed model (see (1)): (a) the two types of latent attributes, D-LA (\(D^d\)) and B-LA (\(D^b\)), are learned together with the user-defined attributes; and (b) instead of learning the user-defined attributes directly as part of the dictionary subspace, we model a linear transformation (W) from the user-defined attribute annotations A to the UDAC dictionary subspace (\(D^u\)). To evaluate the effectiveness of these components, we compare our full model (Ours_full) with various stripped-down versions. The results in Table 3 show clearly that all these components contribute positively to the final performance of the model.

Table 3. Evaluation of the contributions of different model components for (a) user-defined attribute (att) prediction on AwA, (b) person Re-ID on VIPeR and (c) zero-shot learning (zsl) on AwA. Note that \(D^d\) is not used for user-defined attribute prediction and zero-shot learning; there is thus no result under ‘Without \(D^d\)’ for AwA.

Running Cost. All algorithms are implemented in Matlab and run on a server with 2.0 GHz CPU cores and 128 GB of memory. For person Re-ID on the VIPeR dataset, our model takes 28.29 seconds to train and 0.35 seconds to match 312 images against 312 images. For predicting user-defined attributes on AwA, it takes 2,377 seconds to train and 0.33 seconds to predict 85 user-defined attributes on 6,180 images. Being a linear model, it is thus extremely efficient at test time.

5 Conclusions

We have proposed a novel attribute learning model which learns user-defined semantic attributes jointly with discriminative and background latent attributes. The model is based on dictionary learning with dictionary decomposition, and an efficient algorithm is formulated to solve the resultant optimisation problem. Extensive experiments show that the proposed attribute learning method produces state-of-the-art results on attribute prediction, attribute-based person re-identification and zero-shot learning.