1 Introduction

Although deep learning [32], which depends on training with large-scale labeled data, is widely used for visual recognition [31], recognizing visual objects “in the wild” remains a daunting challenge. In specific applications it is impossible to collect data for every class when training a deep model, so the training classes (seen classes) and the testing classes (unseen classes) are often disjoint. The main idea of zero-shot learning (ZSL) is to handle this problem by transferring a model through the redundant relevance of semantic descriptions. To recognize unseen classes from seen classes, ZSL faces two challenges [3]. One is how to use semantic information to construct the relationship between unseen classes and seen classes; the other is how to find the compatibility among all kinds of information so as to obtain the optimal discriminative characteristics for unseen classes.

ZSL can bridge the gap between different domains and recognize objects of unseen classes through semantic embeddings of class labels. These semantic embeddings can come from vision (attributes [11]) or language information (text [25]), obtained by manual annotation, machine learning [29] or data mining [5]. In terms of the transformation relationship between embeddings, recent ZSL methods mainly fall into linear embedding, nonlinear embedding and similarity embedding. Linear embedding [1, 2, 7, 13, 24] applies a linear transformation between embedding spaces to learn the relevance between unseen class objects and class labels. Nonlinear embedding [23, 25, 28] realizes a nonlinear mapping of the embedding space to build the compatibility function or classifier, which can be learned by deep networks [14, 30]. Similarity embedding [3, 9, 15, 19, 33] builds the classifier from similarity metrics, mostly through structure learning or class-wise similarities. In our approach, the similarity metric is extended from semantic space to image space. We attempt to find the relationship between the similarities (the manifold structures in the different spaces) to constrain the compatibility function, and further capture the positive structure propagation to significantly improve the classification of unseen objects.

In this paper, our motivation is inspired by structure fusion [16,17,18], which we use to deal with the two challenges jointly. The intrinsic manifold structure is crucial for object classification. In practice, however, we can only obtain observation data of the manifold structure, and each kind of observation represents a different aspect of the intrinsic manifold structure. To recover or approximate the intrinsic structure, we can fuse the various manifold structures obtained from observation data. Based on this idea, we try to capture the different manifold structures in image and semantic space to improve the recognition performance on unseen classes in ZSL. We therefore construct the compatibility function for predicting the labels of unseen classes by building the manifold structure of image classes. In addition, we attempt to find the relevance between the manifold structure of semantic classes and that of image classes in model space, in order to encode the influence of negative and positive transfer and obtain a better compatibility function for classifying unseen class objects. Model space, which corresponds to visual appearances, is the joint projection space of semantic space and image space, and can preserve their respective manifold structures. Figure 1 illustrates the idea of the proposed method conceptually. The proposed structure fusion and propagation (SFP) method considers not only the semantic and image structures but also positive structure propagation to improve the classification of unseen objects, while SynC [3] focuses only on the manifold structure in semantic space for combining the base classifiers in ZSL.

Fig. 1. The illustration of structure fusion and propagation for zero-shot learning. Phantom object classes (whose coordinates in the model space are optimized to achieve the best performance of the resulting model for the real object classes in discriminative tasks [3]) and real object classes correspond to all classes in model space.

Our main contribution is a novel idea with two aspects, which recovers or approximates the intrinsic manifold structure from seen classes to unseen classes by fusing the manifold structures of different spaces, in order to handle the challenging recognition of unseen classes. Specifically, one aspect constructs the projected manifold structure for real and phantom classes in model space; the other constrains the compatibility function and the relationship between the manifold structures for positive structure propagation.

2 Structure Fusion and Propagation

In ZSL, we have a training data set \(\mathscr {D}=\{(x_{n}\in R^{D},y_{n})\}_{n=1}^{N}\), in which \(x_{n}\) is the image representation (it can be extracted by a deep model; the details are described in Table 1) and \(y_{n}(n=1,...,N)\) is the class label in the seen class set \(\mathscr {S}=\{s|s=1,...,S\}\). We denote the unseen class set as \(\mathscr {U}=\{u|u=S+1,...,S+U\}\). \(a_{c}\in R^{D}\) is the linear transformation vector of class c.

2.1 Classification Model and Manifold Structure

We construct a pair-wise linear classifier [3] in the visual image feature space, and assign an estimated label \(\hat{y}\) to a feature x by the following formula.

$$\begin{aligned} \begin{aligned} \hat{y}=\arg \max _{c}a_{c}^{T}x, \end{aligned} \end{aligned}$$
(1)

Here, \(a_{c}\in R^{D}\) is not only the transformation vector applied to the feature x, but also the representation of class c in the model space. In other words, the above formula describes the pair-wise linear relation between the feature space and the class label space, characterizing the class representation in the model.
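The decision rule in formula (1) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `predict_labels`, the row-wise layout of the class vectors, and the toy numbers are hypothetical and do not come from the paper.

```python
import numpy as np


def predict_labels(A, X):
    """Apply formula (1): for each feature x, return argmax_c a_c^T x.

    A: (C, D) array whose rows are the class vectors a_c (assumed layout).
    X: (N, D) array whose rows are image features x.
    """
    scores = X @ A.T                 # scores[n, c] = a_c^T x_n
    return np.argmax(scores, axis=1)


# Hypothetical toy example with C = 3 classes and D = 2 feature dimensions.
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
X = np.array([[2.0, 0.5], [0.1, 3.0], [-1.0, -2.0]])
labels = predict_labels(A, X)
```

Each row of `A` plays both roles described above: it scores a feature and it stands for the class in model space.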

To measure the manifold structure, we compute the similarity of the related representations in a homogeneous space, which has the same scale and metric. To this end, we build one bipartite graph between unseen classes and seen classes in semantic space and another in image space (this space includes all image representations). In these bipartite graphs, nodes correspond to unseen classes or seen classes, and weighted edges connect unseen classes with seen classes. Because we focus on the transfer relation between unseen classes and seen classes, no connection exists within the unseen classes or within the seen classes. Let \(G_{b}{<}V_{b},E_{b}{>}\) denote the manifold structure of semantic classes. Here, \(V_{b}=V_{bs}\bigcup V_{bu}\) and \(\emptyset =V_{bs}\bigcap V_{bu}\). \(E_{b}\) includes the connections between \(V_{bs}\) (the seen class set in semantic space) and \(V_{bu}\) (the unseen class set in semantic space). Similarly, let \(G_{x}{<}V_{x},E_{x}{>}\) denote the manifold structure of image classes. Here, \(V_{x}=V_{xs}\bigcup V_{xu}\) and \(\emptyset =V_{xs}\bigcap V_{xu}\). \(E_{x}\) includes the connections between \(V_{xs}\) (the seen class set in image space) and \(V_{xu}\) (the unseen class set in image space). The similarities in semantic and image space are therefore regarded as the weights between nodes, which are defined as follows.

$$\begin{aligned} w_{su}^{(b)}=\frac{\exp (-d(b_{s},b_{u}))}{\sum _{u'=1}^{U}\exp (-d(b_{s},b_{u'}))},\quad w_{su}^{(x)}=\frac{\exp (-d(x_{s},x_{u}))}{\sum _{u'=1}^{U}\exp (-d(x_{s},x_{u'}))}, \end{aligned}$$
(2)

Here, \(b_{s}\) and \(x_{s}\) are respectively the semantic and image representation (the details are described in Table 1) of the seen class s, while \(b_{u}\) and \(x_{u}\) are respectively the semantic and image representation of the unseen class u. \(w_{su}^{(b)}\) and \(w_{su}^{(x)}\) are respectively the weight (the similarity) between the seen class s and the unseen class u in semantic and image representation space. \(d(b_{s},b_{u})\) and \(d(x_{s},x_{u})\) are respectively the distance metric [3] of each space, defined as follows.

$$\begin{aligned} d(b_{s},b_{u})=(b_{s}-b_{u})^{T}\varSigma _{b}^{-1}(b_{s}-b_{u}),d(x_{s},x_{u})=(x_{s}-x_{u})^{T}\varSigma _{x}^{-1}(x_{s}-x_{u}), \end{aligned}$$
(3)

Here, \(\varSigma _{b}=\sigma _{b}I\) is learned from the semantic representation by cross-validation: we alternately divide the training class set into two parts in accordance with the proportion between the training class set and the test class set; one part is used to learn the model and the other to validate it. We search \(\sigma _{b}\) over the range from \(2^{-5}\) to \(2^{5}\) and select the value corresponding to the best result. \(\varSigma _{x}=\sigma _{x}I\) is learned from the image representation by cross-validation with the same procedure as for \(\sigma _{b}\). The difference between image space and semantic space is that \(x_{u}\) is not determined in advance because the classes are unseen, while \(x_{s}\) can be obtained from the training data by computing the mean of the seen class. Using the class center as the representation is simple and convenient to compute, and it reasonably preserves the basic characteristics of the image representation in accordance with the distribution of the class. \(x_{u}\) is obtained by pre-classification of the unseen classes (detailed in the next section).
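As a minimal sketch of formulas (2) and (3), the weights between seen and unseen classes can be computed as a row-wise softmax over negative scaled squared distances, and \(x_{s}\) as a class mean. The function names, the array layout and the toy data are assumptions made for illustration. In particular, the unseen representations are simply given here, whereas the paper obtains \(x_{u}\) by pre-classification.

```python
import numpy as np


def class_means(X, y, classes):
    """Mean feature vector of each class, used here as the class representation x_s."""
    return np.stack([X[y == c].mean(axis=0) for c in classes])


def bipartite_weights(seen, unseen, sigma):
    """Weights of formula (2) with the distance of formula (3), Sigma = sigma * I.

    seen:   (S, dim) seen-class representations (b_s or x_s).
    unseen: (U, dim) unseen-class representations (b_u or x_u).
    Returns an (S, U) matrix whose rows sum to 1.
    """
    diff = seen[:, None, :] - unseen[None, :, :]
    dist = (diff ** 2).sum(axis=2) / sigma          # (s - u)^T Sigma^{-1} (s - u)
    # Subtracting the row minimum leaves the normalised weights unchanged
    # but avoids underflow when all distances are large.
    dist = dist - dist.min(axis=1, keepdims=True)
    e = np.exp(-dist)
    return e / e.sum(axis=1, keepdims=True)


# Hypothetical toy data: three samples from two seen classes, two unseen classes.
X = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]])
y = np.array([0, 0, 1])
x_seen = class_means(X, y, classes=[0, 1])
x_unseen = np.array([[1.0, 2.0], [4.0, 6.0]])
W_x = bipartite_weights(x_seen, x_unseen, sigma=1.0)
```

The same `bipartite_weights` call applies to the semantic representations with \(\sigma _{b}\) in place of \(\sigma _{x}\).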

In (1), \(a_{c}\) is the transformation vector and also the class representation in model space. In (2), \(b_{s}\) and \(b_{u}\) are the class representations in semantic space, while \(x_{s}\) and \(x_{u}\) are the class representations in image space. We construct the link among these spaces through \(v_{s}\) and \(v_{u}\), which are the phantom classes of seen and unseen classes in model space, respectively. To preserve the manifold structure of the two bipartite graphs and align the image, semantic and model spaces, we formulate an optimization problem that minimizes the distortion error, defined as follows.

$$\begin{aligned} \begin{aligned} (a_{c},v_{u},\varvec{\beta })=\arg \min _{a_{c},v_{u},\varvec{\beta }}&\Vert a_{c}-\sum _{u=1}^{U}\varvec{\beta }^{T}\left[ \begin{matrix} w_{su}^{(x)}&{}w_{su}^{(b)}\\ \end{matrix} \right] ^{T}v_{u}-\sum _{s=1}^{S}\varvec{\gamma }^{T}\left[ \begin{matrix} w_{ss}^{(x)}&{}w_{ss}^{(b)}\\ \end{matrix} \right] ^{T}v_{s}\Vert _{2}^{2},\\ s.t. \quad&\varvec{\beta }^{T}\varvec{\mathbf {1}}=1,\varvec{\gamma }^{T}\varvec{\mathbf {1}}=1,0\le \beta _{i}\le 1,0\le \gamma _{i}\le 1 \quad (i=1,2) \end{aligned} \end{aligned}$$
(4)

Here, \(\varvec{\beta }=\left[ \begin{matrix} \beta _{1} &{}\beta _{2}\\ \end{matrix} \right] ^{T}\), \(\varvec{\gamma }=\left[ \begin{matrix} \gamma _{1} &{}\gamma _{2}\\ \end{matrix} \right] ^{T}\), and \(\varvec{\mathbf {1}}=\left[ \begin{matrix} 1 &{}1\\ \end{matrix} \right] ^{T}\). Because no connection exists within the unseen classes or within the seen classes in the two bipartite graphs, \(w_{ss}^{(b)}=0\) and \(w_{ss}^{(x)}=0\). The analytical solution of (4) then gives the relation between \(a_{c}\) and \(v_{u}\).

$$\begin{aligned} \begin{aligned} a_{c}=&\sum _{u=1}^{U}\varvec{\beta }^{T}\left[ \begin{matrix} w_{su}^{(x)}&{}w_{su}^{(b)}\\ \end{matrix} \right] ^{T}v_{u},\\ s.t. \quad&\varvec{\beta }^{T}\varvec{\mathbf {1}}=1,0\le \beta _{i}\le 1 \quad (i=1,2) \end{aligned} \end{aligned}$$
(5)

This relation holds for all \(c\in \{1,2,...,S+U\}\).
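Formula (5) says that each class vector is a combination of the phantom classes, with coefficients that fuse the image and semantic weights through \(\varvec{\beta }\). The following sketch shows one way to compute it. The function name `class_vectors`, the matrix layout and the toy values are assumptions for illustration only.

```python
import numpy as np


def class_vectors(W_x, W_b, V, beta):
    """Formula (5): a_c = sum_u (beta_1 * w_cu^(x) + beta_2 * w_cu^(b)) * v_u.

    W_x, W_b: (C, U) image-space and semantic-space weight matrices.
    V:        (U, D) phantom classes v_u as rows (assumed layout).
    beta:     length-2 weights with beta_1 + beta_2 = 1 and 0 <= beta_i <= 1.
    Returns a (C, D) matrix whose rows are the class vectors a_c.
    """
    beta = np.asarray(beta, dtype=float)
    if beta.shape != (2,) or not np.isclose(beta.sum(), 1.0):
        raise ValueError("beta must have two entries that sum to 1")
    if np.any(beta < 0.0) or np.any(beta > 1.0):
        raise ValueError("each beta_i must lie in [0, 1]")
    fused = beta[0] * W_x + beta[1] * W_b   # fused manifold structure
    return fused @ V


# Hypothetical toy values: C = 2 classes, U = 2 phantom classes, D = 2.
W_x = np.array([[1.0, 0.0], [0.5, 0.5]])
W_b = np.array([[0.0, 1.0], [0.5, 0.5]])
V = np.eye(2)
A = class_vectors(W_x, W_b, V, beta=[0.25, 0.75])
```

With \(\beta _{1}=0\) the class vectors depend only on the semantic structure, and with \(\beta _{1}=1\) only on the image structure.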

2.2 Phantom Classes and Structure Relation Learning

To obtain the phantom classes \(v_{u}(u=1,...,U)\) and the weight coefficient vector \(\varvec{\beta }\) of the manifold structures, we further reformulate the optimization as a one-versus-other classifier [3].

$$\begin{aligned} \begin{aligned} (v_{1},...,v_{U},\varvec{\beta })=&\arg \min _{v_{1},...,v_{U},\varvec{\beta }}\sum _{c=1}^{S}\sum _{n=1}^{N}\ell (x_{n},\mathbb {I}_{y_{n},c},a_{c})\\&+\frac{\lambda }{2}\sum _{c=1}^{S}\Vert a_{c}\Vert _{2}^{2}+\frac{\gamma }{2}\Vert \beta _{1}W^{x}-\beta _{2}W^{b}\Vert _{2}^{2},\\&s.t. \quad a_{c}=\sum _{u=1}^{U}\varvec{\beta }^{T}\left[ \begin{matrix} w_{su}^{(x)}&{}w_{su}^{(b)}\\ \end{matrix} \right] ^{T}v_{u},\\&\varvec{\beta }^{T}\varvec{\mathbf {1}}=1,0\le \beta _{i}\le 1 \quad (i=1,2) \end{aligned} \end{aligned}$$
(6)

Here, \(w_{su}^{(x)}\) is an element of the matrix \(W^{x}\), and \(w_{su}^{(b)}\) is an element of the matrix \(W^{b}\). The first term of formula (6) is the squared hinge loss, defined as \(\ell (x_{n},\mathbb {I}_{y_{n},c}, a_{c})=\max (0,1-\mathbb {I}_{y_{n},c}a_{c}^{T}x_{n})^{2}\), where \(\mathbb {I}_{y_{n},c}\in \{-1,1\}\) indicates whether or not \(y_{n}=c\). The second term of formula (6) is a regularization term on \(a_{c}\), which avoids over-fitting of the pair-wise linear classifier that models the relationship between the class label and the image representation. The third term of formula (6) constrains the similarity of the manifold structures to prevent negative structure propagation in image space. Formula (6) can be minimized with respect to \(\{v_{u}\}_{u=1}^{U} \) and \(\varvec{\beta }\) by alternating optimization, solving a quadratic programming problem at each step.
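To make the three terms of formula (6) concrete, the sketch below evaluates the objective for given phantom classes and \(\varvec{\beta }\). It makes several assumptions that are not taken from the paper: the loss is squared, following the text's description of the first term as a squared hinge loss; the norm of the matrix difference in the third term is read as a Frobenius norm; and the \(\varvec{\beta }\) step uses a simple grid search over \(\beta _{1}\in [0,1]\) as a stand-in for the quadratic program. All function names and toy values are hypothetical.

```python
import numpy as np


def sfp_objective(V, beta, X, y, W_x, W_b, lam, gamma):
    """Value of formula (6) for phantom classes V (U, D) and weights beta (2,).

    X: (N, D) seen-class features, y: (N,) labels in {0, ..., S-1},
    W_x, W_b: (S, U) similarity matrices of image and semantic space.
    """
    A = (beta[0] * W_x + beta[1] * W_b) @ V            # constraint: rows are a_c
    n_seen = A.shape[0]
    # signs[n, c] is +1 if y_n = c and -1 otherwise (one-versus-other labels).
    signs = np.where(y[:, None] == np.arange(n_seen)[None, :], 1.0, -1.0)
    hinge = np.maximum(0.0, 1.0 - signs * (X @ A.T))
    loss = np.sum(hinge ** 2)                          # squared hinge (assumption)
    reg = 0.5 * lam * np.sum(A ** 2)
    # Frobenius norm of beta_1 W^x - beta_2 W^b (assumed reading of the norm).
    struct = 0.5 * gamma * np.sum((beta[0] * W_x - beta[1] * W_b) ** 2)
    return loss + reg + struct


def best_beta_on_grid(V, X, y, W_x, W_b, lam, gamma, num=101):
    """Stand-in for the beta step: search beta_1 on a grid, with beta_2 = 1 - beta_1."""
    grid = np.linspace(0.0, 1.0, num)
    values = [sfp_objective(V, np.array([b, 1.0 - b]), X, y, W_x, W_b, lam, gamma)
              for b in grid]
    b = grid[int(np.argmin(values))]
    return np.array([b, 1.0 - b])


# Hypothetical toy problem: S = 2 seen classes, U = 2 phantom classes, D = 2.
V = np.eye(2)
W_x = np.eye(2)
W_b = np.eye(2)
X = np.array([[2.0, 0.0], [0.0, 2.0]])
y = np.array([0, 1])
value = sfp_objective(V, np.array([0.5, 0.5]), X, y, W_x, W_b, lam=0.1, gamma=1.0)
beta_star = best_beta_on_grid(V, X, y, W_x, W_b, lam=0.1, gamma=1.0)
```

Because \(\varvec{\beta }\) has two entries that sum to 1, the \(\varvec{\beta }\) step is one-dimensional, which is why a grid search is enough for a sketch. A full implementation would alternate this step with a solver for the phantom classes.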

To depict the whole process of the structure fusion and propagation mechanism, we show the pseudo code of the proposed SFP algorithm in Algorithm 1.

Algorithm 1. Pseudocode of the proposed SFP algorithm.

2.3 Complexity Analysis

Formula (6) is solved by alternating quadratic programming, whose complexity has two parts. In the first part, when \(\varvec{\beta }\) is fixed, formula (6) is a quadratic programming problem in \(\{v_{u}\}_{u=1}^{U} \), whose worst-case complexity is \(O(U^{3})\). In the second part, when \(\{v_{u}\}_{u=1}^{U} \) is fixed, formula (6) is a quadratic programming problem in \(\varvec{\beta }\), whose worst-case complexity is \(O(k^{3})\), where k is the dimension of \(\varvec{\beta }\) (k = 2 for the \(\varvec{\beta }\) defined in (4)). If the proposed SFP algorithm needs P iterations, its complexity is \(O(PU^{3}+Pk^{3})\).

3 Experiment

3.1 Datasets

To evaluate the proposed SFP algorithm, we carry out experiments on four challenging datasets: Animals with Attributes (AwA) [12], CUB-200-2011 Birds (CUB) [27], Stanford Dogs (Dogs) [4], and SUN Attribute (SUN) [21]. These datasets can be used for fine-grained recognition (CUB and Dogs) or non-fine-grained recognition (AwA and SUN) in ZSL. In semantic space, AwA and CUB are each described by att [6], w2v [20], glo [22] and hie [1], while Dogs is represented by w2v [20], glo [22] and hie [1]. SUN is described only by att [6]. Table 1 provides the statistics and the extracted features for these datasets. In addition, for convenient comparison with the state-of-the-art methods, we adopt the image features provided by [1].

Table 1. Datasets statistics and the extracted feature in experiments.

3.2 Comparison with the Baseline Methods

In this paper, three methods serve as baselines for comparison with the proposed SFP method because they also mine semantic structure. The first is structured joint embedding (SJE) [1], which builds a bilinear compatibility function that takes the structured output space into account to predict the labels of unseen classes. The second is the latent embedding model (LatEm) [28], which constructs a pair-wise bilinear (nonlinear) compatibility function, with the number of models chosen by model selection, to recognize unseen classes. The third is synthesized classifiers (SynC) [3], which builds a nonlinear compatibility function from the manifold structure in semantic space by combining the base classifiers in ZSL. Table 2 shows that structure fusion and propagation (the proposed SFP method) greatly outperforms the other three methods.

3.3 Classification and Validation Protocols

Classification accuracy is the average of the per-class accuracies over all test classes in each dataset. The learned model involves four parameters: \(\lambda \) and \(\gamma \) in formula (6), and \(\sigma _{b}\) and \(\sigma _{x}\) in formula (3). We alternately divide the training class set into two parts in accordance with the proportion between the training class set and the test class set. One part is used to learn the model, and the other to validate it. First, we set \(\sigma _{b}\) and \(\sigma _{x}\) to 1, and select by cross-validation the \(\gamma \) and \(\lambda \) that give the best result, each searched from \(2^{-24}\) to \(2^{-9}\). Second, we select by cross-validation the \(\sigma _{b}\) and \(\sigma _{x}\) that give the best result, each searched from \(2^{-5}\) to \(2^{5}\).
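The evaluation protocol can be sketched as follows. The per-class accuracy follows the definition above. The class-wise split and the parameter grids rest on assumptions: the split keeps the learn-to-validate ratio equal to the train-to-test class ratio, which is one reading of the text, and the grids use integer exponents, which the paper does not state. Function names and the toy numbers are hypothetical.

```python
import numpy as np


def per_class_top1_accuracy(y_true, y_pred):
    """Average of the Top-1 accuracies computed separately for each test class."""
    classes = np.unique(y_true)
    return float(np.mean([np.mean(y_pred[y_true == c] == c) for c in classes]))


def split_seen_classes(classes, n_train_classes, n_test_classes, rng):
    """Split seen classes into a learning part and a validation part.

    The validation share mirrors the share of test classes among all classes
    (an assumed reading of the protocol).
    """
    classes = np.asarray(classes)
    val_fraction = n_test_classes / (n_train_classes + n_test_classes)
    n_val = max(1, int(round(len(classes) * val_fraction)))
    perm = rng.permutation(classes)
    return perm[n_val:], perm[:n_val]


# Candidate values with integer exponents (assumed step size).
lambda_grid = 2.0 ** np.arange(-24, -8)   # 2^-24, ..., 2^-9
gamma_grid = 2.0 ** np.arange(-24, -8)    # 2^-24, ..., 2^-9
sigma_grid = 2.0 ** np.arange(-5, 6)      # 2^-5, ..., 2^5

# Toy check: plain accuracy would be 0.75, the per-class average is lower.
y_true = np.array([0, 0, 0, 1])
y_pred = np.array([0, 0, 0, 0])
acc = per_class_top1_accuracy(y_true, y_pred)

# Toy split with 40 seen classes and a 40:10 train-to-test class ratio.
learn, val = split_seen_classes(np.arange(40), 40, 10, np.random.default_rng(0))
```

Averaging per class rather than per sample keeps classes with many test images from dominating the reported accuracy.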

Table 2. Comparison of the SFP method with SJE [1], LatEm [28] and SynC [3] in each semantic space. Average per-class Top-1 accuracy (%) of unseen classes is reported on AwA, based on the same data configurations, images and semantic features. w: the fusion includes att, w2v, glo and hie; w/o: the fusion contains w2v, glo and hie.

3.4 Structure Fusion and Propagation with the Iteration

The proposed SFP method rests on three ideas. First, the manifold structure of images is considered when constructing the compatibility function between the class label and the visual feature. Second, the relationship between the multiple manifold structures is found in order to boost the influence of the positive structure. Last and most important, the positive structure is propagated and the multiple manifold structures are fused by iterative computation. Therefore, we carry out an experiment on AwA to evaluate the effect of the iteration on the evolution of the structure. The recognition accuracy indicates how well the class manifold structure is approximated. In other words, a higher recognition accuracy corresponds to a reconstructed manifold structure that is closer to the intrinsic manifold structure of the classes. Figure 2 shows how the recognition accuracy changes with the iteration. In the beginning, the recognition accuracy increases rapidly with the iteration, and then it reaches a stable state. This means that structure fusion and propagation with the iteration improves the recognition accuracy and finally reaches the best state.

Fig. 2. Average per-class Top-1 accuracy (%) of unseen classes on AwA versus the number of structure fusion and propagation iterations. w: the fusion includes att, w2v, glo and hie; w/o: the fusion contains w2v, glo and hie.

3.5 Comparison with the State of the Art

In terms of how the image data of unseen classes are used in testing, ZSL methods can be divided into two categories: inductive ZSL and transductive ZSL. Inductive ZSL methods process unseen samples serially without considering the underlying manifold structure of the unseen samples [1, 3, 28, 33], while transductive ZSL methods usually use the manifold structure of unseen samples to improve ZSL performance [8, 10, 15]. SFP finds the structure of unseen classes in image feature space to enhance the transfer model between seen and unseen classes, so SFP is a transductive ZSL method. For a fair comparison, we use deep image features based on GoogLeNet [26] for all compared methods, which include our method, one transductive ZSL method (DMaP [15]), and three inductive ZSL methods (SJE [1], LatEm [28] and SynC [3]). To the best of our knowledge, these are state-of-the-art methods for ZSL. Table 3 shows their results for ZSL on three benchmark datasets. SFP outperforms the state-of-the-art methods in most cases, the exception being DMaP on CUB. DMaP focuses on the consistency of the manifold structure between the semantic representation and the image feature, and can better distinguish fine-grained classes. SFP exploits the complementarity of the manifold structures of the semantic representation and the image feature, and better recognizes coarse-grained classes. Therefore, integrating the two ideas is expected to further improve ZSL performance in future work.

Table 3. Comparison of the SFP method with state-of-the-art methods for ZSL. Average per-class Top-1 accuracy (%) of unseen classes is reported based on the same data configurations. ‘+’ indicates the fusion operation.

3.6 Experimental Result Analysis

From the above experiments, we can attain the following observations.

  • The semantic descriptions contribute differently to classifying unseen classes. The supervised attributes tend to achieve better recognition performance than the unsupervised semantic representations (w2v, glo and hie) on AwA and CUB. Among the unsupervised semantic representations, w2v and glo achieve better recognition accuracy than hie on AwA and CUB, but hie is superior to w2v and glo on Dogs. This is mainly due to the flexibility and uncertainty of semantic representations obtained in an unsupervised way.

  • SFP performs better than the other three methods, which are SJE, LatEm and SynC. However, the improvement differs across datasets. It is obvious on AwA, Dogs and SUN, and slight on CUB. The main reason is that data differences affect whether the positive structure can be propagated effectively during the optimization.

  • SFP emphasizes the complementarity of the different manifold structures, while DMaP focuses on their consistency. Therefore, SFP is superior to DMaP on AwA and Dogs, where structure complementarity plays the important role in learning the transfer model, and DMaP is better than SFP on CUB, where structure consistency is the key point for classifying unseen classes.

  • SFP performs better with positive structure fusion and propagation. SFP has shown great promise in the above experiments because it considers multiple manifold structures and alternates the optimization between the weight computation and the manifold structure estimation for ZSL.

  • The proposed fusion method attains better performance than the non-fusion method because the fused structures appropriately complement each other. The fusion variants (w or w/o) always perform better than the non-fusion settings on AwA, CUB and Dogs.

4 Conclusion

We have proposed a new ZSL method called structure fusion and propagation (SFP). This method not only directly models the relevance among the manifold structures in semantic and image space, but also dynamically propagates the positive structure through cross iteration. Specifically, the proposed SFP method includes four main parts. First, a nonlinear model constructs the mapping between the class label and the visual image representation. Second, a graph describes the relevance between seen classes and unseen classes in semantic or image space. Third, a loss function expresses the constraint relationship among the multiple manifold structures to balance the structure dependence. Last, structure fusion and propagation is implemented by cross iteration between solving for the phantom classes and solving for the weights. To evaluate the proposed SFP, we carry out experiments on AwA, CUB, Dogs and SUN. Experimental results show that SFP obtains promising results for ZSL.