1 Introduction

As neural networks become ubiquitous, there is an increasing need to understand and interpret their learned representations [25, 27]. In the context of convolutional neural networks (CNNs), methods have been developed to explain predictions and latent activations in terms of heat maps highlighting the image regions which caused them [31, 37].

In this paper, we present Deep Feature Factorization (DFF), which exploits non-negative matrix factorization (NMF) [22] applied to activations of a deep CNN layer to find semantic correspondences across images. These correspondences reflect semantic similarity as indicated by clusters in a deep CNN layer feature space. In this way, we allow the CNN to show us which image regions it ‘thinks’ are similar or related across a set of images as well as within a single image. Given a CNN, our approach to semantic concept discovery is unsupervised, requiring only a set of input images to produce correspondences. Unlike previous approaches [2, 11], we do not require annotated data to detect semantic features. We use annotated data for evaluation only.

We show that when using a deep CNN trained to perform ImageNet classification [30], applying DFF allows us to obtain heat maps that correspond to semantic concepts. Specifically, here we use DFF to localize objects or object parts, such as the head or torso of an animal. We also find that parts form a hierarchy in feature space, e.g., the activations cluster for the concept body contains a sub-cluster for limbs, which in turn can be broken down to arms and legs. Interestingly, such meaningful decompositions are also found for object classes never seen before by the CNN.

In addition to giving an insight into the knowledge stored in neural activations, the heat maps produced by DFF can be used to perform co-localization or co-segmentation of objects and object parts. Unlike approaches that delineate the common object across an image set, our method is also able to retrieve distinct parts within the common object. Since we use a pre-trained CNN to accomplish this, we refer to our method as performing weakly-supervised co-segmentation.

Our main contribution is introducing Deep Feature Factorization as a method for semantic concept discovery, which can be used both to gain insight into the representations learned by a CNN, as well as to localize objects and object parts within images. We report results on several datasets and CNN architectures, showing the usefulness of our method across a variety of settings.

Fig. 1. What in this picture is the same as in the other pictures? Our method, Deep Feature Factorization (DFF), allows us to see how a deep CNN trained for image classification would answer this question. (a) Pyramids, animals, and people correspond across images. (b) Monument parts match with each other.

2 Related Work

2.1 Localization with CNN Activations

Methods for the interpretation of hidden activations of deep neural networks, and in particular of CNNs, have recently gained significant interest [25]. Similar to DFF, methods have been proposed to localize objects within an image by means of heat maps [31, 37].

In these works [31, 37], localization is achieved by computing the importance of convolutional feature maps with respect to a particular output unit. These methods can therefore be seen as supervised, since the resulting heat maps are associated with a designated output unit, which corresponds to an object class from a predefined set. With DFF, however, heat maps are not associated with an output unit or object class. Instead, DFF heat maps capture common activation patterns in the input, which additionally allows us to localize objects never seen before by the CNN, and for which there is no relevant output unit.

2.2 CNN Features as Part Detectors

The ability of DFF to localize parts stems from the CNN’s ability to distinguish parts in the first place. In Gonzalez-Garcia et al. [11] and Bau et al. [2], the authors attempt to detect learned part-detectors in CNN features, to see if such detectors emerge even when the CNN is trained with object-level labels. They do this by measuring the overlap between feature map activations and ground truth labels from a part-level segmentation dataset. The availability of ground truth is essential to their analysis, yielding a catalog of CNN units that sufficiently correspond to labels in the dataset.

We confirm their observations that part detectors do indeed emerge in CNNs. However, as opposed to these previous methods, our NMF-based approach does not rely on ground truth labels to find the parts in the input. We use labeled data for evaluation only.

2.3 Non-negative Matrix Factorization

Non-negative matrix factorization (NMF) has been used to analyze data from various domains, such as audio source separation [12], document clustering [36], and face recognition [13].

There has been work extending NMF to multiple layers [6], implementing NMF using neural networks [9] and using NMF approximations as input to a neural network [34]. However, to the best of our knowledge, the application of NMF to the activations of a pre-trained neural network, as is done in DFF, has not been previously proposed.

Fig. 2. An illustration of Deep Feature Factorization. We extract features from a deep CNN and view them as a matrix. We apply NMF to the feature matrix and reshape the resulting k factors into k heat maps. See Sect. 3 for a detailed explanation. Shown: Statue of Liberty subset from iCoseg with \(k=3\).

3 Method

3.1 CNN Feature Space

In the context of CNNs, an input image \(\mathcal {I}\) is seen as a tensor of dimension \(h_\mathcal {I}\times w_\mathcal {I}\times c_\mathcal {I}\), where the first two dimensions are the height and the width of the image, respectively, and the third dimension is the number of color channels, e.g., 3 for RGB. Viewed this way, the first two dimensions of \(\mathcal {I}\) can be seen as a spatial grid, with the last dimension being a \(c_\mathcal {I}\)-dimensional feature representation of a particular spatial position. For an RGB image, this feature corresponds to color.

As the image gets processed layer by layer, the hidden activation at the \(\ell \)th layer of the CNN is a tensor we denote \(\mathcal {A}_\mathcal {I}^\ell \) of dimension \(h_\ell \times w_\ell \times c_\ell \). Notice that generally \(h_\ell< h_\mathcal {I},~w_\ell < w_\mathcal {I}\) due to pooling operations commonly used in CNN pipelines. The number of channels \(c_\ell \) is user-defined as part of the network architecture, and in deep layers is often on the order of 256 or 512.

The tensor \(\mathcal {A}_\mathcal {I}^\ell \) is also called a feature map since it has a spatial interpretation similar to that of the original image \(\mathcal {I}\): the first two dimensions represent a spatial grid, where each position corresponds to a patch of pixels in \(\mathcal {I}\), and the last dimension forms a \(c_\ell \)-dimensional representation of the patch. The intuition behind deep learning suggests that the deeper layer \(\ell \) is, the more abstract and semantically meaningful are the \(c_\ell \)-dimensional features [3].

Since a feature map represents multiple patches (depending on the size of image \(\mathcal {I}\)), we view them as points inhabiting the same \(c_\ell \)-dimensional space, which we refer to as the CNN feature space. Having potentially many points in that space, we can apply various methods to find directions that are ‘interesting’.
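To make this concrete, here is a minimal sketch of extracting such deep activations, assuming PyTorch and a pretrained torchvision VGG-19 (the paper's own pipeline may differ); the slice index used to reach the ReLU after conv5_4 reflects torchvision's layer ordering and is an assumption worth verifying:

```python
import torch
from torchvision import models

# Minimal sketch (assumes PyTorch + torchvision; not the authors' exact code).
# vgg19().features is a Sequential; slicing up to index 36 keeps everything
# through the ReLU after conv5_4 -- an assumption about torchvision's ordering.
vgg = models.vgg19(weights="IMAGENET1K_V1").eval()
extractor = vgg.features[:36]

with torch.no_grad():
    image = torch.rand(1, 3, 224, 224)   # stand-in for a preprocessed image
    acts = extractor(image)              # (1, c_l, h_l, w_l); here c_l = 512
print(acts.shape)                        # e.g. torch.Size([1, 512, 14, 14])
```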

3.2 Matrix Factorization

Matrix factorization algorithms have been used for data interpretation for decades. For a data matrix A, these methods retrieve an approximation of the form:

$$\begin{aligned} A&\approx \hat{A} = HW \\ \text {s.t. } A,~\hat{A}\in&\mathcal {R}^{n\times m},~ H\in \mathcal {R}^{n\times k},~W\in \mathcal {R}^{k\times m} \nonumber \end{aligned}$$
(1)

where \(\hat{A}\) is a low-rank matrix of a user-defined rank k. A data point, i.e., a row of A, is explained as a weighted combination of the factors which form the rows of W.

A classical method for dimensionality reduction is principal component analysis (PCA) [18]. PCA finds an optimal k-rank approximation (in the \(\ell ^2\) sense) by solving the following objective:

$$\begin{aligned} \text {PCA}(A, k) = \mathop {\hbox {argmin}}\limits _{\hat{A}_k} \Vert A-\hat{A}_k \Vert _F^2, \quad \text {subject to } \hat{A}_k = AV_kV_k^\top ,~V_k^\top V_k = I_k, \end{aligned}$$
(2)

where \(\Vert .\Vert _F\) denotes the Frobenius norm and \(V_k\in \mathcal {R}^{m\times k}\). For the form of Eq. (1), we set \(H=AV_k,~W=V_k^\top \). Note that the PCA solution generally contains negative values, which means the combination of PCA factors (i.e., principal components) leads to the canceling out of positive and negative entries. This cancellation makes intuitive interpretation of individual factors difficult.

On the other hand, when the data A is non-negative, one can perform non-negative matrix factorization (NMF):

$$\begin{aligned} \text {NMF}(A, k) = \mathop {\hbox {argmin}}\limits _{\hat{A}_k} \Vert A-\hat{A}_k \Vert _F^2, \quad \text {subject to } \hat{A}_k = HW,~ \forall i,j:~ H_{ij}, W_{ij} \ge 0, \end{aligned}$$
(3)

where \(H\in \mathcal {R}^{n\times k}\) and \(W\in \mathcal {R}^{k\times m}\) enforce the dimensionality reduction to rank k. Capturing the structure in A while forcing combinations of factors to be additive results in factors that lend themselves to interpretation [22].
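As a concrete illustration of Eq. (3), the following sketch runs an off-the-shelf NMF solver on a toy non-negative matrix. Using scikit-learn is our assumption here, chosen because its NMF minimizes the same Frobenius objective; note that its naming is transposed relative to ours: fit_transform returns our H, and components_ is our W.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
A = rng.random((1000, 512))              # toy non-negative data matrix

k = 4
nmf = NMF(n_components=k, init="nndsvd", max_iter=500)
H = nmf.fit_transform(A)                 # (n, k): per-row coefficients
W = nmf.components_                      # (k, m): non-negative factors

assert H.min() >= 0 and W.min() >= 0
print(np.linalg.norm(A - H @ W, "fro"))  # Frobenius reconstruction error
```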

3.3 Non-negative Matrix Factorization on CNN Activations

Many modern CNNs make use of the rectified linear activation function, \(\max (x, 0)\), due to its desirable gradient properties. An obvious property of this function is that it results in non-negative activations. NMF is thus naturally applicable in this case.

Recall the activation tensor for image \(\mathcal {I}\) and layer \(\ell \):

$$\begin{aligned} \mathcal {A}_\mathcal {I}^\ell \in \mathbb {R}^{h\times w\times c} \end{aligned}$$
(4)

where \(\mathbb {R}\) refers to the set of non-negative real numbers. To apply matrix factorization, we partially flatten \(\mathcal {A}\) into a matrix whose first dimension is the product of h and w:

$$\begin{aligned} A_\mathcal {I}^\ell \in \mathbb {R}^{(h\cdot w)\times c} \end{aligned}$$
(5)

Note that the matrix \(A_\mathcal {I}^\ell \) is effectively a ‘bag of features’ in the sense that the spatial arrangement has been lost, i.e., the rows of \(A_\mathcal {I}^\ell \) can be permuted without affecting the result of factorization. We can naturally extend factorization to a set of n images, by vertically concatenating their features together:

$$\begin{aligned} A = \begin{bmatrix} A_1^\ell \\ \vdots \\ A_n^\ell \end{bmatrix} \in \mathbb {R}^{(n\cdot h\cdot w)\times c} \end{aligned}$$
(6)

For ease of notation we assume all images are of equal size; in practice there is no such limitation, as images in the set may be of any size. By applying NMF to A we obtain the two matrices from Eq. (1), \(H\in \mathbb {R}^{(n\cdot h\cdot w)\times k}\) and \(W\in \mathbb {R}^{k\times c}\).
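A sketch of the flattening and concatenation in Eqs. (5) and (6), assuming PyTorch tensors in the usual (batch, channels, height, width) layout (the function name is ours, for illustration):

```python
import torch

def flatten_activations(act_list):
    """Eqs. (5)-(6): flatten each (1, c, h_i, w_i) activation tensor so that
    rows are spatial positions, then stack all images into one matrix A."""
    rows = []
    for acts in act_list:
        c = acts.shape[1]                               # same layer => same c
        rows.append(acts.permute(0, 2, 3, 1).reshape(-1, c))
    return torch.cat(rows, dim=0)                       # (sum_i h_i*w_i, c)
```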

3.4 Interpreting NMF Factors

The result returned by the NMF consists of k factors, which we will call DFF factors, where k is the predefined rank of the approximation.

The W Matrix. Each row \(W_j\) (\(1\le j\le k\)) forms a c-dimensional vector in the CNN feature space. Since NMF can be seen as performing clustering [8], we view a factor \(W_j\) as the centroid of an activation cluster, which we show corresponds to a coherent object or object part.

The H Matrix. The matrix H has as many rows as the activation matrix A, one corresponding to every spatial position in every image. Each row \(H_i\) holds coefficients for the weighted sum of the k factors in W, to best approximate the c-dimensional \(A_i\).

Each column \(H_j\) (\(1\le j\le k\)) can be reshaped into n heat maps of dimension \(h\times w\), which highlight regions in each image that correspond to the factor \(W_j\). These heat maps have the same (typically low) spatial dimensions as the CNN layer that produced the activations. To match the size of the heat map with the input image, we upsample it with bilinear interpolation.
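For equally sized images, the inverse reshaping might look as follows (a sketch; F.interpolate stands in for the bilinear upsampling described above, and the helper name is ours):

```python
import torch
import torch.nn.functional as F

def factor_heatmaps(H, k, h, w, image_size):
    """Reshape the columns of H into per-image heat maps and bilinearly
    upsample them to the input resolution. Assumes n equally sized images."""
    n = H.shape[0] // (h * w)
    maps = torch.as_tensor(H, dtype=torch.float32).reshape(n, h, w, k)
    maps = maps.permute(0, 3, 1, 2)                     # (n, k, h, w)
    return F.interpolate(maps, size=image_size,
                         mode="bilinear", align_corners=False)
```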

4 Experiments

In this section we first show that DFF can produce a hierarchical decomposition into semantic parts, even for sets of very few images (Sect. 4.3). We then move on to larger-scale, realistic datasets where we show that DFF can perform state-of-the-art weakly-supervised object co-localization and co-segmentation, in addition to part co-segmentation (Sects. 4.4 and 4.5).

4.1 Implementation Details

NMF. NMF optimization with multiplicative updates [23] relies on dense matrix multiplications, and can thus benefit from fast GPU operations. Using an NVIDIA Titan X, our implementation of NMF can process over 6K images of size \(224\times 224\) at once with \(k=5\), and requires less than a millisecond per image. Our code is available online.
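The updates in question reduce to matrix products; the following is a minimal sketch of multiplicative-update NMF [23] in PyTorch (our illustration, not the released implementation):

```python
import torch

def nmf_mu(A, k, n_iter=200, eps=1e-7):
    """Multiplicative-update NMF [23] for min ||A - HW||_F^2 s.t. H, W >= 0.
    All operations are dense matrix products, so they run on the GPU if A does."""
    n, m = A.shape
    H = torch.rand(n, k, device=A.device)
    W = torch.rand(k, m, device=A.device)
    for _ in range(n_iter):
        H *= (A @ W.T) / (H @ (W @ W.T) + eps)   # update H with W fixed
        W *= (H.T @ A) / ((H.T @ H) @ W + eps)   # update W with H fixed
    return H, W
```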

Neural Network Models. We consider five network architectures in our experiments, namely VGG-16 and VGG-19 [32], with and without batch-normalization [17], as well as ResNet-101 [16]. We use the publicly available models from [26].

4.2 Segmentation and Localization Methods

In addition to the insight it gives into CNN feature space, DFF has utility for several tasks whose names differ subtly but importantly:

  • Segmentation vs. Localization is the difference between predicting pixel-wise binary masks and predicting bounding boxes, respectively.

  • Segmentation vs. co-segmentation is the distinction between segmenting a single image into regions and jointly segmenting multiple images, thereby producing a correspondence between regions in different images (e.g., cats in all images belong to the same segment).

  • Object co-segmentation vs. Part co-segmentation. Given a set of images representing a common object, the former performs binary background-foreground separation where the foreground segment encompasses the entirety of the common object (e.g., cat). The latter, however, produces k segments, each corresponding to a part of the common object (e.g., cat head, cat legs, etc.).

When applying DFF with \(k=1\), we can compare our results against object co-segmentation (background-foreground separation) methods and object co-localization methods.

In Sect. 4.3 we compare DFF against three state-of-the-art co-segmentation methods. The supervised method of Vicente et al. [33] chooses among multiple segmentation proposals per image by learning a regressor to predict, for pairs of images, the overlap between their proposals and the ground truth. The input to the regressor includes per-image features as well as pairwise features. The methods of Rubio et al. [29] and Rubinstein et al. [28] are unsupervised and rely on a Markov random field formulation, where the unary features are based on surface image features and various saliency heuristics. For the pairwise terms, the former method uses a per-image segmentation into regions, followed by region matching across images; the latter uses a dense pairwise correspondence term between images based on local image gradients.

In Sect. 4.4 we compare against several state-of-the-art object co-localization methods. Most of these methods operate by selecting the best of a set of object proposals, produced by a pre-trained CNN [24] or an object-saliency heuristic [5, 19]. The authors of [21] present a method for unsupervised object co-localization that, like ours, also makes use of CNN activations. Their approach is to apply k-means clustering to globally max-pooled activations, with the intent of clustering all highly active CNN filters together. Their method therefore produces a single heat map, which is appropriate for object co-segmentation, but cannot be extended to part co-segmentation.

When \(k>1\), we use DFF to perform part co-segmentation. Since we have not come across examples of part co-segmentation in the literature, we compare against a method for supervised part segmentation, namely Wang et al. [35] (Table 3 in Sect. 4.5). Their method relies on a compositional model with strong explicit priors w.r.t. part size, hierarchy, and symmetry. We also show results for two baseline methods described in [35]: PartBB+ObjSeg, where segmentation masks are produced by intersecting part bounding boxes [4] with whole-object segmentation masks [14], and PartMask+ObjSeg, which is similar but replaces the bounding boxes with the best of 10 pre-learned part masks.

4.3 Experiments on iCoseg

Dataset. The iCoseg dataset [1] is a popular benchmark for co-segmentation methods. It consists of 38 sets of images, where each image is annotated with a pixel-wise mask encompassing the main object common to the set. Images within a set are uniform in that they were all taken on a single occasion, depicting the same objects. The challenge of this dataset lies in the significant variability with respect to viewpoint, illumination, and object deformation.

We chose five sets and further labeled them with pixel-wise object-part masks (see Table 1). This process involved partitioning the given ground truth mask into sub-parts. We also annotated common background objects, e.g., camel in the Pyramids set (see Fig. 1). Our part-annotation for iCoseg is available online. The number of images in these sets ranges from as few as 5 up to 41. When comparing against [33] and [29] in Table 1, we used the subset of iCoseg used in those papers.

Part Co-segmentation. For each set in iCoseg, we obtained activations from the deepest convolutional layer of VGG-19 (conv5_4), and applied NMF to these activations with increasing values of k. The resulting heat maps can be seen in Figs. 1 and 3.

Qualitatively, we see a clear correspondence between DFF factors and coherent object parts; however, the heat maps are coarse. Due to the low resolution of deep CNN activations, and hence of the heat maps, we get blobs that do not perfectly align with the underlying region of interest. We therefore also report results with a post-processing step that refines the heat maps, described below.

We notice that when \(k=1\), the single DFF factor corresponds to a whole object, encompassing multiple object-parts. This, however, is not guaranteed, since it is possible that for a set of images, setting \(k=1\) will highlight the background rather than the foreground. Nonetheless, as we increase k, we get a decomposition of the object or scene into individual parts. This behavior reveals a hierarchical structure in the clusters formed in CNN feature space.

For instance, in Fig. 3(a), we can see that \(k=1\) encompasses most of the gymnast’s body, \(k=2\) distinguishes her midsection from her limbs, \(k=3\) adds a finer distinction between arms and legs, and finally \(k=4\) adds a new factor that localizes the beam. This observation also indicates that the CNN has learned a representation that ‘explains’ these concepts with invariance to pose, e.g., the leg positions in the 2nd, 3rd, and 4th columns.

A similar decomposition into legs, torso, back, and head can be seen for the elephants in Fig. 3(b). This shows that we can localize different objects and parts even when they are all common across the image set. Interestingly, the decompositions shown in Fig. 1 exhibit similarly high semantic quality in spite of their dissimilarity to the ImageNet training data, as neither pyramids nor the Taj Mahal are included as class labels in that dataset. We also note that since some of the given sets contain as few as 5 images (Fig. 1(b) comprises the whole set), our method does not require many images to find meaningful structure.

Fig. 3. Example DFF heat maps for images of two sets from iCoseg. Each row shows a separate factorization where the number of DFF factors k is incremented. Different colors correspond to the heat maps of the k different factors. DFF factors correspond well to distinct object parts. This figure visualizes the data in Table 1, where heat map color corresponds with row color. (Best viewed electronically with a color display.)

Object and Part Co-segmentation. We operationalize DFF to perform co-segmentation. To do so, we first have to annotate the factors as corresponding to specific ground-truth parts. This can be done manually (as in Table 3) or automatically given ground truth, as described below. We report the intersection-over-union (IoU) score of each factor with its associated parts in Table 1.

Since the heat maps are of low resolution, we refine them with post-processing. We define a dense conditional random field (CRF) over the heat maps, using filter-based mean field approximate inference [20] with guided filtering [15] for the pairwise term and the bilinearly upsampled DFF heat maps as unary terms. We refer to DFF with this post-processing as ‘DFF-CRF’.

Each heat map is converted to a binary mask using a thresholding procedure. For a specific DFF factor f (\(1\le f\le k\)), let \(\{H(f, 1),\cdots , H(f,n)\}\) be the set of n heat maps associated with the n input images. The value of a pixel in the binary map B(f, i) of factor f and image i is 0 if its intensity is lower than the 75th percentile of entries in the set of heat maps \(\{H(f,j) \mid 1\le j\le n\}\), and 1 otherwise.
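A sketch of this binarization, assuming the heat maps are stacked into a NumPy array (the array layout and function name are our assumptions):

```python
import numpy as np

def binarize(heatmaps, q=75):
    """Threshold heat maps of shape (n, k, h, w): each factor's threshold is
    the q-th percentile over all n of its maps, pooled across the image set."""
    masks = np.zeros(heatmaps.shape, dtype=np.uint8)
    for f in range(heatmaps.shape[1]):
        th = np.percentile(heatmaps[:, f], q)
        masks[:, f] = heatmaps[:, f] >= th
    return masks
```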

We associate parts with factors by considering how well a part is covered by a factor’s binary masks. We define the coverage of part p by factor f as:

$$\begin{aligned} Cov_{f,p} = \frac{|\sum _i B(f,i) \bigcap P(p,i)|}{|\sum _i P(p,i)|} \end{aligned}$$
(7)

The coverage is the percentage of pixels belonging to p that are set to 1 in the binary maps \(\{B(f,i) \mid 1\le i\le n\}\). We associate the part p with factor f when \(Cov_{f,p}>Cov_{\text {th}}\). We experimentally set the threshold \(Cov_{\text {th}}=0.5\).

Finally, we measure the IoU between a DFF factor f and its m associated ground-truth parts \(\{p^{(f)}_1,\cdots ,p^{(f)}_m\}\) similarly to [2], specifically by considering the dataset-wide IoU:

$$\begin{aligned}&P_f(i) = \bigcup _{j=1}^m P(p^{(f)}_j, i) \end{aligned}$$
(8)
$$\begin{aligned}&IoU_{f,p} = \frac{|\sum _i B(f,i) \bigcap P_f(i)|}{|\sum _i B(f,i) \bigcup P_f(i)|} \end{aligned}$$
(9)
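In code, Eqs. (7)-(9) might be computed as follows (a sketch; it assumes the ground-truth part masks have been rasterized to the same resolution as the binary maps, and the function names are ours):

```python
import numpy as np

def coverage(B_f, P_p):
    """Eq. (7): fraction of part-p pixels covered by factor f.
    B_f, P_p: boolean arrays of shape (n, h, w) over the image set."""
    return np.logical_and(B_f, P_p).sum() / P_p.sum()

def dataset_iou(B_f, P_f):
    """Eqs. (8)-(9): dataset-wide IoU between a factor's binary maps and the
    per-image union of its associated ground-truth part masks."""
    inter = np.logical_and(B_f, P_f).sum()
    union = np.logical_or(B_f, P_f).sum()
    return inter / union
```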
Table 1. Object and part discovery and segmentation on five iCoseg image sets. Part labels are automatically assigned to DFF factors, and are shown with their corresponding IoU scores. Our results show that clusters in CNN feature space correspond to coherent parts. Moreover, they indicate the presence of a cluster hierarchy in CNN feature space, where part-clusters can be seen as sub-clusters within object-clusters (see Figs. 1, 2 and 3 for visual comparison; row color corresponds with heat map color). With \(k=1\), DFF can be used to perform object co-segmentation, which we compare against state-of-the-art methods. With \(k>1\), DFF can be used to perform part co-segmentation, which current co-segmentation methods are not able to do.

In the top of Table 1 we report results for object co-segmentation (\(k=1\)) and show that our method is comparable with the supervised approach of [33] and domain-specific methods of [28, 29].

The bottom of Table 1 shows the labels and IoU-scores for part co-segmentation on the five image sets of iCoseg that we have annotated. These scores correspond to the visualizations of Figs. 1 and 3 and confirm what we observe qualitatively.

We can characterize the quality of a factorization as the average IoU of each factor with its single best-matching part (excluding the background). In Fig. 4(a) we show the average IoU for different layers of VGG-19 on iCoseg as the value of k increases. The variance shown is due to repeated trials with different NMF initializations. There is a clear gap between convolutional blocks; performance within a block does not strictly follow the linear order of layers.

We also see that the optimal value for k is between 3 and 5. While this naturally varies with the network, layer, and data batch, another deciding factor is the resolution of the part ground truth. As k increases, DFF heat maps become more localized, highlighting regions that are beyond the granularity of the ground truth annotation, e.g., a pair of factors that separates leg into ankle and thigh. In Fig. 4(b) we show that DFF performs similarly within the VGG family of models. For ResNet-101, however, the average IoU is distinctly lower.

4.4 Object Co-Localization on PASCAL VOC 2007

Dataset. PASCAL VOC 2007 has been commonly used to evaluate whole object co-localization methods. Images in this dataset often comprise several objects of multiple classes from various viewpoints, making it a challenging benchmark. As in previous work [5, 19, 21], we use the trainval set for evaluation and filter out images that only contain objects which are marked as difficult or truncated. The final set has 20 image sets (one per class), with 69 to 2008 images each.

Fig. 4. Average IoU score for DFF on iCoseg for (a) different VGG-19 layers and (b) the deepest convolutional layer of several CNN architectures. Expectedly, the convolutional blocks show a clear difference in matching up with semantic parts, as deeper features capture more semantic concepts. The optimal value for k is data dependent but is usually below 5. We also see that DFF performance is relatively uniform across the VGG family of models.

Evaluation. The task of co-localization involves fitting a bounding box around the common object in a set of images. With \(k=1\), we expect DFF to retrieve a heat map which localizes that object.

As described in the previous section, after optionally filtering DFF heat maps using a CRF, we convert the heat maps to binary segmentation masks. We follow [31] and extract a single bounding box per heat map by fitting a box around the largest connected component in the binary map.
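A sketch of this box-fitting step, using scipy.ndimage for connected components (SciPy and the helper name are our assumptions):

```python
import numpy as np
from scipy import ndimage

def box_from_mask(mask):
    """Fit a half-open bounding box (x0, y0, x1, y1) around the largest
    connected component of a binary mask; returns None if the mask is empty."""
    labels, n = ndimage.label(mask)
    if n == 0:
        return None
    sizes = ndimage.sum(mask, labels, index=range(1, n + 1))
    ys, xs = np.where(labels == 1 + int(np.argmax(sizes)))
    return xs.min(), ys.min(), xs.max() + 1, ys.max() + 1
```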

We report the standard CorLoc score [7] of our localization. The CorLoc score is defined as the percentage of predicted bounding boxes for which there exists a matching ground truth bounding box. Two bounding boxes are deemed matching if their IoU score exceeds 0.5.
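CorLoc itself reduces to a box-IoU check; a sketch follows (box format matching the helper above, names ours):

```python
def box_iou(a, b):
    """IoU of two half-open boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def corloc(pred_boxes, gt_boxes_per_image):
    """Fraction of images whose predicted box matches some ground-truth box
    with IoU > 0.5, as in the CorLoc measure [7]."""
    hits = sum(any(box_iou(p, g) > 0.5 for g in gts)
               for p, gts in zip(pred_boxes, gt_boxes_per_image))
    return hits / len(pred_boxes)
```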

The results of our method are shown in Table 2, along with previous methods (described in Sect. 4.2). Our method compares favorably against previous approaches. For instance, we improve co-localization for the class dog by 16% CorLoc, and we achieve better co-localization on average, in spite of our approach being simpler and more general.

Table 2. Co-localization results for PASCAL VOC 2007 with DFF \(k=1\). Numbers indicate CorLoc scores. Overall, we exceed the state-of-the-art approaches using a much simpler method.
Table 3. Avg. IoU (%) for three fully supervised methods reported in [35] (see Sect. 4.2 for details) and for our weakly-supervised DFF approach. Despite not using hand-crafted features, DFF compares favorably to these approaches, and is not specific to these two image classes. We semi-automatically mapped DFF factors (\(k=3\)) to their appropriate part labels by examining the heat maps of only five images, out of approximately 140. This illustrates the usefulness of DFF co-segmentation for fast semi-automatic labeling. See the visualization of cow heat maps in Fig. 5.
Fig. 5. Example DFF heat maps for images of six classes from PASCAL-Parts with \(k=3\). For each class we show four images that were successfully decomposed into parts, and a failure case on the right. DFF manages to retrieve interpretable decompositions in spite of the great variation in the data. In addition to the DFF factors for cow from Table 3, visualized here are the factors which appear in Table 4, where heat map colors correspond to row colors.

Table 4. IoU of DFF heat maps with PASCAL-Parts segmentation masks. Each DFF factor is automatically labeled with part labels as in Sect. 4.3. Higher values of k allow DFF to localize finer regions across the image set, some of which go beyond the resolution of the ground truth part annotation. Figure 5 visualizes the results for \(k=3\) (row color corresponds to heat map color).

4.5 Part Co-segmentation in PASCAL-Parts

Dataset. The PASCAL-Part dataset [4] is an extension of PASCAL VOC 2010 [10] which has been further annotated with part-level segmentation masks and bounding boxes. The dataset decomposes 16 object classes into fine-grained parts, such as bird-beak and bird-tail. After filtering out images containing objects marked as difficult and truncated, the final set consists of 16 image sets with 104 to 675 images each.

Evaluation. In Table 3 we report results for the two classes, cow and horse, which are also part-segmented by Wang et al. as described in Sect. 4.2. Since their method relies on strong explicit priors w.r.t. part size, hierarchy, and symmetry, and its explicit objective is to perform part segmentation, their results serve as an upper bound to ours. Nonetheless, we compare favorably to their results and even surpass them in one case, despite our method not using any hand-crafted features or supervised training.

For this experiment, our strategy for mapping DFF factors (\(k=3\)) to their appropriate part labels was semi-automatic labeling: we qualitatively examined the heat maps of only five images, out of approximately 140, and labeled factors as corresponding to the labels shown in Table 3.

In Table 4 we give IoU results for five additional classes from PASCAL-Parts, which have been automatically mapped to parts as in Sect. 4.3. In Fig. 5 we visualize these DFF heat maps for \(k=3\), as well as for cow from Table 3. When comparing the heat maps against their corresponding IoU scores, several interesting conclusions can be made. For instance, in the case of motorbike, the first and third factors for \(k=3\) in Table 4 both seem to correspond to wheel. The visualization in Fig. 5(e) reveals that these factors in fact sub-segment the wheel into top and bottom, which is beyond the resolution of the ground truth data.

We can also see that while the first factor of the class aeroplane (Fig. 5(a)) consistently localizes airplane wheels, it does not achieve high IoU due to the coarseness of the heat map.

Returning to Table 4, when \(k=4\), a factor emerges that localizes instances of the class person, which occur in 60% of motorbike images. This again shows that while most co-localization methods only describe objects that are common across the image set, our DFF approach is able to find distinctions within the set of common objects.

5 Conclusions

In this paper, we have presented Deep Feature Factorization (DFF), a method that is able to locate semantic concepts in individual images and across image sets. We have shown that DFF can reveal interesting structures in CNN feature space, such as hierarchical clusters which correspond to a part-based decomposition at various levels of granularity.

We have also shown that DFF is useful for co-segmentation and co-localization, achieving results on challenging benchmarks which are on par with state-of-the-art methods, and that it can be used for semi-automatic image labeling. Unlike previous approaches, DFF can perform part co-segmentation as well, making fine distinctions within the common object, e.g., matching head to head and torso to torso.