
1 Introduction

Watching and sharing videos on social media has become an integral part of everyday life. We are often intrigued by the textual description of a video and attempt to fast-forward to the segments of interest without watching the entire video. However, these textual descriptions usually do not specify the exact segment of the video associated with a particular description. For example, someone describing a movie clip as “head-on collision between cars while Chris Cooper is driving” provides neither the time-stamps for the collision or driving events nor the spatial locations of the cars or Chris Cooper. Such descriptions are referred to as ‘weak labels’. For efficient video navigation and consumption, it is important to automatically determine the spatio-temporal locations of these concepts (such as ‘collision’ or ‘cars’). However, it is prohibitively expensive to train concept-specific models for all concepts of interest in advance and use them for localization. This shortcoming has triggered a great amount of interest in jointly learning concept-specific classification models and localizing concepts from multiple weakly labeled images [1–3] or videos [4, 5].

Video descriptions include concepts which may refer to persons, objects, scenes and/or actions and thus a typical description is a combination of heterogeneous concepts. In the running example, extracted heterogeneous concepts are ‘car’ (object), ‘head-on collision’ (action), ‘Chris Cooper’ (person) and ‘driving’ (action). Learning classifiers for these heterogeneous concepts along with localization is an extremely challenging task because: (a) the classifiers for different kinds of concepts are required to be learned simultaneously, e.g., a face classifier, an object classifier, an action classifier etc., and (b) the learning model must take into account the spatio-temporal location constraints imposed by the descriptions while learning these classifiers. For example, the concepts ‘head-on collision’ and ‘cars’ should spatio-temporally co-occur at least once and there should be at least one car in the video.

Recently there has been growing interest in jointly learning concept classifiers from weak labels [1, 5]. Bojanowski et al. [5] proposed a discriminative clustering framework to jointly learn person and action models from movies using weak supervision provided by the movie scripts. Since weak labels are extracted from scripts, each label can be associated with a particular shot in the movie, which may last only a few seconds, i.e., the labels are well localized, and that makes the overall learning easier. However, in real-world videos, one does not have access to such shot-level labels but only to video-level labels. Therefore, in our work, we do not assume the availability of such well-localized labels, and tackle the more general problem of learning concepts from weaker video-level labels. The framework in [5], when extended to long videos, does not give satisfactory results (see Sect. 4). Such techniques, which are based on a linear mapping from features to labels and model the background using only a single latent factor, are usually inadequate to capture all the inter-class and intra-class variations. Shi et al. [1] jointly learn object and attribute classifiers from images using a weakly supervised Indian Buffet Process (IBP). Note that IBP [6, 7] allows observed features to be explained by a countably infinite number of latent factors. However, the framework in [1] is not designed to handle heterogeneous concepts and location constraints, which leads to a significant degradation in performance (Sect. 4.3). [8] and [9] propose IBP-based cross-modal categorization/query image retrieval models which learn semantically meaningful abstract features from multimodal (image, speech and text) data. However, these unsupervised approaches do not incorporate any location constraints, which naturally arise in the weakly supervised setting with heterogeneous labels.

We propose a novel Bayesian Non-parametric (BNP) approach called WSC-SIIBP (Weakly Supervised, Constrained&Stacked Integrative IBP) to jointly learn heterogeneous concept classifiers and localize these concepts in videos. BNP models are a class of Bayesian models where the hidden structure that may have generated the observed data is not assumed to be fixed. Instead, a framework is provided that allows the complexity of the model to increase as more data is observed [10]. Specifically, we propose:

  1. A novel generalization of IBP which, for the first time, incorporates weakly supervised spatio-temporal location constraints and heterogeneous concepts in an integrated framework.

  2. Posterior inference of the WSC-SIIBP model using a mean-field approximation.

We assume that the weak video labels come in the form of tuples: in the running example, the extracted heterogeneous concept tuples are ({car, head-on collision}, {Chris Cooper, driving}). We perform experiments on two video datasets (a) the Casablanca movie dataset [5] and (b) the A2D dataset [11]. We show that the proposed approach WSC-SIIBP outperforms several state-of-the-art methods for heterogeneous concept classification and localization in a weakly supervised setting. For example, WSC-SIIBP leads to a relative improvement of 7 %, 5 % and 24 % on person, action and pairwise classification accuracies, respectively, over the most competitive baselines on the Casablanca dataset. Similarly, the relative improvement on localization accuracy is 9 % over the next best approach on the A2D dataset.

Fig. 1. Pipeline of WSC-SIIBP. Multiple videos with heterogeneous weak labels are provided as input, and localization and classification of the concepts are performed in these videos.

2 Related Work

In this section, we discuss relevant prior work in two broad categories.

Weakly Supervised Learning: Localizing concepts and learning classifiers from weakly annotated data is an active research topic. Researchers have learned models for various concepts from weakly labeled videos using Multi-Instance Learning (MIL) [12, 13] for human action recognition [14], visual tracking [15] etc. Cour et al. [16] use a novel convex formulation to learn face classifiers from movies and TV series using multimodal features obtained from finely aligned screenplay, speech and video data. In [4, 17], the authors propose discriminative clustering approaches for aligning videos with temporally ordered text descriptions or predefined tags and in the process also learn action classifiers. In our approach, we consider weak labels which are neither ordered nor aligned to any specific video segment. [18] proposes a method for learning object class detectors from real-world web videos known to contain only the target class by formulating the problem as a domain adaptation task. [19] learns weakly supervised object/action classifiers using a latent-SVM formulation where the objects or actions are localized in training images/videos using latent variables. We note that both [18, 19] consider only a single weak label per video and, unlike our approach, do not jointly learn the heterogeneous concepts. The authors in [20, 21] use dialogues, scene and character identification to find an optimal mapping between a book and movie shots using shortest path or CRF approaches. However, these approaches neither jointly model heterogeneous concepts nor spatio-temporally localize them. Although [22] proposes a discriminative clustering model for coreference resolution in videos, only faces are considered in their experiments.

Heterogeneous concept learning: There are prior works on automatic image [23–26] and video [27–29] caption generation, where models are trained on pairs of image/video and text that contain heterogeneous concept descriptions to predict captions for novel images/videos. While most of these approaches rely on deep learning methods to learn a mapping between an image/video and the corresponding text description, [25] uses MIL to learn visual concept detectors (spatial localization in images) for nouns, verbs and adjectives. However, none of these approaches spatio-temporally localize points of interest in videos. Perhaps the available video datasets are not large enough to train such a weakly supervised deep learning model.

To the best of our knowledge there is no prior work that jointly classifies and localizes heterogeneous concepts in weakly supervised videos.

3 WSC-SIIBP: Model and Algorithm

In this section, we describe the details of WSC-SIIBP (see Fig. 1 for the pipeline). We first introduce notation and motivate our approach in Sects. 3.1 and 3.2, respectively. This is followed by Sect. 3.3, where we introduce the stacked non-parametric graphical model, IBP, and its corresponding posterior computation. In Sects. 3.4 and 3.5, we formulate an extension of the stacked IBP model which generalizes to heterogeneous concepts and incorporates the constraints obtained from weak labels. In Sect. 3.6, we briefly describe the inference procedure using a truncated mean-field variational approximation and summarize the entire algorithm. Finally, we discuss how one can classify and localize concepts in new test videos using WSC-SIIBP.

3.1 Notation

Assume we are given a set of weakly labeled videos denoted by \(\varvec{\Lambda } = \left\{ (i, \varGamma ^{(i)})\right\} \), where i indicates a video and \(\varGamma ^{(i)}\) denotes the heterogeneous weak labels corresponding to the i-th video. Although the proposed approach can be used for any number of heterogeneous concepts, for readability, we restrict ourselves to two concepts and call them subjects and actions. We also have a closed set of class labels for these heterogeneous concepts: for subjects \(\mathcal {S} = (s_1,\dots ,s_{K_s})\) and for actions \(\mathcal {A} = (a_1,\dots ,a_{K_a})\). Let \(K_s = |\mathcal {S}|\), \(K_a = |\mathcal {A}|\), \(\varGamma ^{(i)} = \left\{ (s_l,a_l): s_l\in \mathcal {S} \cup \emptyset , a_l \in \mathcal {A} \cup \emptyset , 1\le l \le |\varGamma ^{(i)}|\right\} \), \(\emptyset \) indicate that the corresponding subject or action class label is not present and \(M = |\varvec{\Lambda }|\) represents the number of videos. The video-level annotation simply indicates that the paired concepts \(\varGamma ^{(i)}\) can occur anywhere in the video and at multiple locations.

Assume that \(N_i\) spatio-temporal tracks are extracted from each video i where each track j is represented as an aggregation of multiple local features, \(\mathbf {x}^{(i)}_j\). The spatio-temporal tracks could be face tracks, 3-D object proposals or action proposals (see Sect. 4.1 for more details). We associate the \(j^{th}\) track in video i to an infinite binary latent coefficient vector \(\mathbf {z}_{j}^{(i)}\) [1, 6]. Each video i is represented by a bag of spatio-temporal tracks \( \mathbf {X}^{(i)} = \{\mathbf {x}^{(i)}_j, j = 1,\dots ,N_i\}\). Similarly, \( \mathbf {Z}^{(i)} = \{\mathbf {z}^{(i)}_j, j = 1,\dots ,N_i\}\).

3.2 Motivation

Our objective is to learn (a) a mapping between each of the \(N_i\) tracks in video i and the labels in \(\varGamma ^{(i)}\) and (b) the appearance model for each label identity, such that the tracks from new test videos can be classified. To achieve these objectives, it is important for any model to discover the latent factors that can explain similar tracks across a set of videos with a particular label. In general, the number of latent factors is not known a priori and must be inferred from the data. In a Bayesian framework, IBP treats this number as a random variable that can grow with new observations, thus letting the model effectively explain the unbounded complexity in the data. Specifically, IBP defines a prior distribution over an equivalence class of binary matrices with a bounded number of rows (indicating spatio-temporal tracks) and infinitely many columns (indicating latent coefficients). To achieve our goals, we build on IBP and introduce the WSC-SIIBP model, which can effectively learn the latent factors corresponding to each heterogeneous concept and utilize prior location constraints to reduce the ambiguity in learning through the knowledge of other latent coefficients.

3.3 Indian Buffet Process (IBP)

The spatio-temporal tracks in the videos \( \varvec{\Lambda }\) are obtained from an underlying generative process. Specifically, we consider a stacked IBP model [1] as described below.

  • For each latent factor \(k \in 1 \dots \infty \):

    1. Draw an appearance distribution with mean \(\mathbf {a}_{k} \thicksim \mathcal {N}(0,\sigma _A^2\mathbf {I})\).

  • For each video \(i \in 1\dots M\):

    1. Draw a sequence of i.i.d. random variables, \(v^{(i)}_1, v^{(i)}_2 \dots \thicksim \) Beta\((\alpha ,1)\).

    2. Construct the prior on the latent factors, \(\pi _k^{(i)} = \prod _{t=1}^k v_t^{(i)}\), \(\forall k \in 1\dots \infty \).

    3. For the \(j^{th}\) subject track in the \(i^{th}\) video, where \(j\in 1 \dots N_i\):

      (a) Sample the state of each latent factor, \(z_{jk}^{(i)} \thicksim \) Bern\((\pi _k^{(i)})\).

      (b) Sample the track appearance, \(\mathbf {x}_j^{(i)} \thicksim \mathcal {N}\left( \mathbf {z}_{j}^{(i)}\mathbf {A},\sigma _n^2\mathbf {I}\right) \).

where \(\alpha \) is the prior controlling the sparsity of latent factors, and \(\sigma _A^2\) and \(\sigma _n^2\) are the prior appearance and noise variances, respectively, shared across all factors. Each \(\mathbf {a}_{k}\) forms the \(k^{th}\) row of \(\mathbf {A}\), and the value of the latent coefficient \(z_{jk}^{(i)}\) indicates whether the data \(\mathbf {x}_j^{(i)}\) contains the \(k^{th}\) latent factor or not. In the above model, we have used the stick-breaking construction [30] to generate the \(\pi _k^{(i)}\)s.
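To make the generative process above concrete, the following minimal sketch samples synthetic tracks from a truncated version of the stick-breaking construction. It is an illustration only, assuming a finite truncation level K_max and arbitrary hyperparameter values; it is not part of the WSC-SIIBP inference procedure.

```python
import numpy as np

def sample_stacked_ibp(M, N, D, alpha=2.0, sigma_A=1.0, sigma_n=0.1, K_max=50):
    """Draw tracks from a truncated stick-breaking stacked IBP (illustrative sketch)."""
    A = np.random.normal(0.0, sigma_A, size=(K_max, D))        # latent-factor appearance means a_k
    X, Z = [], []
    for i in range(M):
        v = np.random.beta(alpha, 1.0, size=K_max)             # v_1, v_2, ... ~ Beta(alpha, 1)
        pi = np.cumprod(v)                                     # pi_k = prod_{t<=k} v_t
        Zi = (np.random.rand(N, K_max) < pi).astype(float)     # z_jk ~ Bern(pi_k)
        Xi = Zi @ A + np.random.normal(0.0, sigma_n, size=(N, D))  # x_j ~ N(z_j A, sigma_n^2 I)
        X.append(Xi)
        Z.append(Zi)
    return X, Z, A
```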

\({\underline{\mathbf{Posterior}}}\): Now, we describe how the posterior is obtained for the above graphical model. Let \(\mathbf {Y} = \left\{ \mathbf {\pi }^{(1)}\dots \mathbf {\pi }^{(M)},\mathbf {Z}^{(1)}\dots \mathbf {Z}^{(M)},\mathbf {A}\right\} \) and \(\varvec{\Theta } = \left\{ \alpha , \sigma _A^2,\sigma _n^2\right\} \) denote hidden variables and prior parameters, respectively. \(\mathbf {X}\) denotes the concatenation of all the spatio-temporal tracks in all M videos, \(\left\{ \mathbf {X}^{(1)}\dots \mathbf {X}^{(M)}\right\} \). Given prior distribution \(\varPsi (\mathbf {Y} | \varvec{\Theta })\) and likelihood function \(p(\mathbf {x}^{(i)}_j | \mathbf {Y},\varvec{\Theta })\), the posterior probability is given by,

$$\begin{aligned} {\small \begin{aligned}&p(\mathbf {Y} | \mathbf {X},\varvec{\Theta }) = \frac{\varPsi (\mathbf {Y} | \varvec{\Theta }) \prod _{i=1}^M \prod _{j=1}^{N_i}p(\mathbf {X}^{(i)}_j | \mathbf {Y},\varvec{\Theta })}{p(\mathbf {X} | \varvec{\Theta })} \\&\varPsi (\mathbf {Y} | \varvec{\Theta }) = \prod _{k=1}^\infty \left( \prod _{i=1}^M p(\pi _k^{(i)} | \alpha ) \prod _{j=1}^{N_i} p(z_{jk}^{(i)} | \pi _k^{(i)} )\right) p(\mathbf {a}_{k.} | \sigma _A^2). \end{aligned} } \end{aligned}$$
(1)

where \(p(\mathbf {X} | \varvec{\Theta })\) is the marginal likelihood. For simplicity, we denote \(p(\mathbf {Y} | \mathbf {X},\varvec{\Theta })\) as \(q(\mathbf {Y})\). Apart from the significance of inferring \(\mathbf {Z}^{(i)}\) for identifying track-level labels, inferring prior \(\pi _k^{(i)}\) for each video helps to identify video-level labels, while the inference of appearance model \(\mathbf {A}\) will be used to classify new test samples (see Sect. 3.6). Thus, learning in our model requires computing the full posterior distribution over \(\mathbf {Y}\).

\({\underline{\mathbf{Regularized\, posterior}}}\): We note that it is difficult to infer the regularized posterior distributions using (1). However, it is known [31, 32] that the posterior distribution in (1) can also be obtained as the solution \(q(\mathbf {Y})\) of the following optimization problem,

$$\begin{aligned} {\small \begin{aligned} \min _{q(\mathbf {Y})} \quad&\text {KL}\left( q(\mathbf {Y}) || \varPsi (\mathbf {Y}|\varvec{\Theta })\right) - \sum _{i=1}^M \sum _{j=1}^{N_i} \int \log p(\mathbf {x}^{(i)}_j | \mathbf {Y}, \varvec{\Theta }) q(\mathbf {Y}) d\mathbf {Y} \,\, s.t. \quad q(\mathbf {Y}) \in P_{prob} \end{aligned} } \end{aligned}$$
(2)

where \(\text {KL}(.)\) denotes the Kullback–Leibler divergence and \(P_{prob}\) is the probability simplex. As we will see later, this procedure enables us to learn the posterior distribution using a constrained optimization framework.
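To see why the minimizer of (2) recovers the posterior in (1), note (a standard identity, stated here for completeness) that adding and subtracting the log marginal likelihood rewrites the objective as a single KL divergence,

$$\begin{aligned} \text {KL}\left( q(\mathbf {Y}) || \varPsi (\mathbf {Y}|\varvec{\Theta })\right) - \sum _{i=1}^M \sum _{j=1}^{N_i} \int \log p(\mathbf {x}^{(i)}_j | \mathbf {Y}, \varvec{\Theta }) q(\mathbf {Y}) d\mathbf {Y} = \text {KL}\left( q(\mathbf {Y}) || p(\mathbf {Y} | \mathbf {X},\varvec{\Theta })\right) - \log p(\mathbf {X} | \varvec{\Theta }), \end{aligned}$$

so the unconstrained optimum over \(P_{prob}\) is exactly \(q(\mathbf {Y}) = p(\mathbf {Y} | \mathbf {X},\varvec{\Theta })\). The benefit of the variational form is that additional constraints on \(q(\mathbf {Y})\) can now be imposed directly in the optimization.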

3.4 Integrative IBP

Our objective is to model heterogeneous concepts (such as subjects and actions) using a graphical model. However, the IBP model described above cannot handle multiple concepts, because it is highly unlikely that the subject and the action features can be explained by the same statistical model. Hence, we propose an extension of stacked IBP for heterogeneous concepts, where different concept types are modeled using different appearance models.

Let the subject and action features corresponding to the spatio-temporal track j in video i be denoted by \(\mathbf {x^s}^{(i)}_j\) and \(\mathbf {x^a}^{(i)}_j\), respectively, each having a different dimension \(D^e\) (\(e \in \{s,a\}\)). Unlike the IBP model, \(\mathbf {x^s}^{(i)}_j\) and \(\mathbf {x^a}^{(i)}_j\) are now represented using two different Gaussian noise models, \(\mathcal {N}(\mathbf {z}^{(i)}_{j}\mathbf {A}^s, \sigma _{ns}^2\mathbf {I})\) and \(\mathcal {N}(\mathbf {z}^{(i)}_{j}\mathbf {A}^a, \sigma _{na}^2\mathbf {I})\) respectively, where \(\sigma _{ne}^2\) denotes the prior noise variance and \(\mathbf {A}^e\) are \(K \times D^e\) matrices (K \(\rightarrow \infty \)). The means of the subject and action appearance models for each latent factor are also sampled independently from Gaussian distributions with different variances \(\sigma _{Ae}^2\). The new posterior probability is given by,

$$\begin{aligned} \begin{aligned} \tilde{q}(\mathbf {Y})&= \frac{\varPsi (\mathbf {Y} | \varvec{\Theta }) \prod _{i=1}^M \prod _{j=1}^{N_i}\prod _{e\in \{s,a\}}p(\mathbf {x^e}^{(i)}_j | \mathbf {Z},\mathbf {A}^e, \varvec{\Theta })}{p(\mathbf {X} | \varvec{\Theta })} \\ {\varPsi }(\mathbf {Y} | \varvec{\Theta })&= \prod _{k=1}^\infty \left( \prod _{i=1}^M p(\pi _k^{(i)} | \alpha ) \prod _{j=1}^{N_i} p(z_{jk}^{(i)} | \pi _k^{(i)} )\right) \prod _{{e\in \{s,a\}}} p(\mathbf {a}^e_{k} | \sigma _{Ae}^2\mathbf {I}). \end{aligned} \end{aligned}$$
(3)

3.5 Integrative IBP with Constraints

Although the graphical model described above is capable of handling heterogeneous features, the location constraints inferred from the weak labels still need to be incorporated into the graphical model. As motivated in Sect. 1, the concepts ‘head-on collision’ and ‘cars’ should spatio-temporally co-occur at least once and there should be at least one car in the full video. Imposing these location constraints in the inference algorithm can lead to more accurate parameter estimation of the graphical model and faster convergence of the inference procedure. These constraints can be generalized as follows,

  1. Every label tuple in \(\varGamma ^{(i)}\) is associated with at least one spatio-temporal track (i.e., the event occurs in the video).

  2. Spatio-temporal tracks should be assigned a label only from the list of weak labels assigned to the video. Concepts present in the video but not in the label set are subsumed in the background models.

Ideally, in the case of noiseless labels, these constraints should be strictly followed. However, we assume that real-world labels could be noisy and noise is independent of the videos. Hence, we allow constraints to be violated but penalize the violations using additional slack variables.

We associate the first \(K_s\) and the following \(K_a\) latent factors (the rows of \(\mathbf {A}\)) to the subject and action classes in \(\mathcal {S}\) and \(\mathcal {A}\) respectively. The inferred values of their corresponding latent coefficients in \(\mathbf {z}^{(i)}_{j}\) are used to determine the presence/absence of the associated concept in a particular spatio-temporal track. The remaining unbounded number of latent factors are used to explain away the background tracks from unknown action and subject classes in a video. With these assignments, we enforce the following constraints on latent factors which are sufficient to satisfy the conditions mentioned earlier.

To satisfy 1, we introduce the following constraints, \(\forall i \in 1\dots M\),

$$\begin{aligned}&\sum _{j=1}^{N_i} z_{js}^{(i)} z_{ja}^{(i)} \ge 1 - \xi _{(s,a)}^{(i)}, \quad \forall (s,a)\in \varGamma ^{(i)}, \end{aligned}$$
(4)
$$\begin{aligned}&\sum _{j=1}^{N_i} z_{js}^{(i)} \ge 1 - \xi _{(s,\emptyset )}^{(i)}, \quad \forall (s,\emptyset )\in \varGamma ^{(i)}, \end{aligned}$$
(5)
$$\begin{aligned}&\sum _{j=1}^{N_i} z_{ja}^{(i)} \ge 1 - \xi _{(\emptyset ,a)}^{(i)}, \quad \forall (\emptyset ,a)\in \varGamma ^{(i)}, \end{aligned}$$
(6)

where \(\xi \) is the slack variable, \(z_{js}\) and \(z_{ja}\) are the latent factor coefficients corresponding to subject class s and action class a respectively.

To satisfy 2, we use the following constraints, \(\forall i \in 1\dots M\) and \(\forall j \in 1\dots N_i\),

$$\begin{aligned} z_{js}^{(i)}&= 0, \text {if } \not \exists (s,\emptyset )\in \varGamma ^{(i)} \text { and } \not \exists (s,a) \in \varGamma ^{(i)}, \forall a \in \mathcal {A},\end{aligned}$$
(7)
$$\begin{aligned} z_{ja}^{(i)}&= 0, \text {if } \not \exists (\emptyset ,a)\in \varGamma ^{(i)} \text { and } \not \exists (s,a) \in \varGamma ^{(i)}, \forall s \in \mathcal {S}. \end{aligned}$$
(8)

The constraints defined in (4)–(8) have been used in the context of discriminative clustering [5, 22]. However, our model is the first to use these constraints in a Bayesian setup. In their simplest form, they can be enforced using a point estimate of \(\mathbf {z}\), e.g., via MAP estimation. However, \(\mathbf {Z}^{(i)}\) is defined over the entire probability space. To enforce the above constraints in a Bayesian framework, we need to account for the uncertainty in \(\mathbf {Z}^{(i)}\). Following [33, 34], we define effective constraints as expectations of the original constraints in (4)–(8), where the expectation is computed w.r.t. the posterior distribution in (3) (see supplementary material for the expectation constraints). The proposed graphical model, incorporating heterogeneous concepts as well as the location constraints provided by the weak labels, is shown in Fig. 2.
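As a concrete illustration of how these constraints act on the expected latent coefficients, the sketch below computes, for a single video, the indicator mask implementing (7)–(8) and the slack of the expectation versions of (4)–(6). The factor layout (the first \(K_s\) factors for subjects, the next \(K_a\) for actions) follows the text, but the function and variable names are our own, and the exact expectation constraints are those given in the supplementary material.

```python
import numpy as np

def constraint_terms(nu, gamma, K_s, K_a):
    """nu: (N_i, K) expected coefficients E[z_jk] for one video.
    gamma: list of (s, a) pairs with s in [0, K_s) or None, a in [0, K_a) or None."""
    N, K = nu.shape
    # (7)-(8): zero out labeled-concept factors absent from the video-level labels
    L = np.ones(K)
    allowed_s = {s for s, _ in gamma if s is not None}
    allowed_a = {a for _, a in gamma if a is not None}
    for k in range(K_s):
        if k not in allowed_s:
            L[k] = 0.0
    for k in range(K_a):
        if k not in allowed_a:
            L[K_s + k] = 0.0
    # Expectation versions of (4)-(6): slack = max(0, 1 - expected count)
    slack = {}
    for (s, a) in gamma:
        if s is not None and a is not None:
            val = np.sum(nu[:, s] * nu[:, K_s + a])   # expected co-occurrence, Eq. (4)
        elif s is not None:
            val = np.sum(nu[:, s])                    # Eq. (5)
        else:
            val = np.sum(nu[:, K_s + a])              # Eq. (6)
        slack[(s, a)] = max(0.0, 1.0 - val)
    return L, slack
```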

Fig. 2. WSC-SIIBP graphical model using two heterogeneous concepts, subjects and actions. Each video (described by video-level labels L) is independently modeled using latent factor prior \(\pi \) and contains \(N_i\) tracks. Each track is represented using subject and action features \(X_s\) and \(X_a\) respectively, which are modeled using Gaussian appearance models \(A_s\) and \(A_a\). z are the binary latent variables indicating the presence or absence of the latent factors in each track. c denotes the set of location constraints extracted from the video labels.

We restrict the search space for the posterior distribution in Eq. (3) by using the expectation constraints. In order to obtain the regularized posterior distribution of the proposed model, we solve the following optimization problem under these expectation constraints,

$$\begin{aligned} {\small \begin{aligned} \min _{\tilde{q}(\mathbf {Y}), \mathbf {\xi }^{(i)}} \,\,\text {KL}\left( \tilde{q}(\mathbf {Y}) || \tilde{\varPsi }(\mathbf {Y}|\varvec{\Theta })\right)&- \sum _{i=1}^M \sum _{j=1}^{N_i} \int \left( \sum _{e\in \{s,a\}}\log p\left( \mathbf {X^e}^{(i)}_j | \mathbf {Y}, \varvec{\Theta }\right) \right) \tilde{q}(\mathbf {Y}) d\mathbf {Y} \\&+ C \sum _{i=1}^M \sum _{J \in \varGamma ^{(i)}} \xi ^{(i)}_J \quad s.t. \quad \tilde{q}(\mathbf {Y}) \in P_{prob} \end{aligned} } \end{aligned}$$
(9)

3.6 Learning and Inference

Note that the variational inference for the true posterior \(\tilde{q}(\mathbf {Y})\) (in Eq. (3)) is intractable over the general space of probability functions. To make our problem easier to solve, we employ a truncated mean-field variational approximation [30] to the desired posterior \(\tilde{q}(\mathbf {Y})\), such that the search space \(P_{prob}\) is restricted to the following tractable parametrized family of distributions,

$$\begin{aligned} {\small \begin{aligned} \tilde{w}(\mathbf {Y}) =&\prod _{i=1}^M \left( \prod _{k=1}^{K_{max}} p(v_k^{(i)} | \tau _{k1}^{(i)}, \tau _{k2}^{(i)}) \prod _{j=1}^{N_i} p(z_{jk}^{(i)} | \nu _{jk}^{(i)}) \right) \prod _{k=1}^{K_{max}} \prod _{{\,\,\,\,\,\,e\in \{s,a\}}} p(\mathbf {a}^e_{k}|\mathbf {\Phi }_k^e, \sigma _{ke}^2\mathbf {I}). \end{aligned} } \end{aligned}$$
(10)

where \(p(v_k^{(i)} | \tau _{k1}^{(i)}, \tau _{k2}^{(i)}) = \text {Beta}(v_k^{(i)}; \tau _{k1}^{(i)}, \tau _{k2}^{(i)})\), \(p(z_{jk}^{(i)} | \nu _{jk}^{(i)}) = \text {Bern}(z_{jk}^{(i)}; \nu _{jk}^{(i)})\) and \(p(\mathbf {a}^e_{k}|\mathbf {\Phi }_k^e, \sigma _{ke}^2\mathbf {I}) = \mathcal {N}(\mathbf {a}^e_{k}; \mathbf {\Phi }_k^e, \sigma _{ke}^2\mathbf {I})\). In Eq. (10), we note that all the latent variables are modeled independently of all other variables, hence simplifying the inference procedure. The truncated stick breaking process of \(\pi _k^{(i)}\)’s is bounded at \(K_{max}\), wherein \(\pi _k = 0\) for \(k > K_{max} \gg K_s + K_a + K_{bg}\). \(K_{bg}\) indicates the number of latent factors chosen to explain background tracks.

The optimization problem in Eq. (9) is solved using the posterior distribution from Eq. (10). We obtain the parameters (see supplementary material for details) \(\sigma _{ke}^2\), \(\mathbf {\Phi ^e}_k\), \(\tau _{k1}^{(i)}\), \(\tau _{k2}^{(i)}\) and \(\nu _{jk}^{(i)}\) for the optimal posterior distribution \(\tilde{q}(\mathbf {Y})\) using iterative update rules as summarized in Algorithm 1. We note that this algorithm is similar to other IBP learning algorithms [1, 30]. The complexity of Algorithm 1 is \(\mathcal {O}(MN_{max}D_{max}K_{max})\), the same as [1]. The mean of the binary latent coefficient \(z_{jk}\), denoted by \(\nu _{jk}\), has an update rule which leads to several interesting observations:

$$\begin{aligned} \nu _{jk}^{(i)}&= \frac{L_k^{(i)}}{1+e^{-\zeta _{jk}^{(i)}}}. \end{aligned}$$
(11)
$$\begin{aligned} {\small \begin{aligned} \zeta _{jk}^{(i)}&= \sum _{t=1}^k \left( \varPsi (\tau _{t1}^{(i)}) - \varPsi (\tau _{t1}^{(i)} + \tau _{t2}^{(i)})\right) - \mathcal {L}_k - \sum _{e\in \{s,a\}}\frac{1}{2\sigma _{ne}^2}\left( D^e\sigma _{ke}^2 + \mathbf {\Phi ^e}_k\mathbf {\Phi ^e}_k^T\right) \\&+ \sum _{e\in \{s,a\}} \frac{1}{\sigma _{ne}^2} \mathbf {\Phi ^e}_k\left( \mathbf {x^e}_j^{(i)} - \sum _{l\ne k} \nu _{jl}^{(i)}\mathbf {\Phi ^e}_l\right) ^T + C\underbrace{\sum _{\begin{array}{c} J\in \varGamma ^{(i)} \\ J=(k,a) \end{array}} \mathbb {I}_{\left\{ \sum _{l=1}^{N_i}\nu _{lk}^{(i)}\nu _{la}^{(i)}< 1\right\} } \nu _{ja}^{(i)}}_\text {(i)} \\&+ C\overbrace{\sum _{\begin{array}{c} J\in \varGamma ^{(i)} \\ J=(s,k) \end{array}} \mathbb {I}_{\left\{ \sum _{l=1}^{N_i}\nu _{ls}^{(i)}\nu _{lk}^{(i)}< 1\right\} } \nu _{js}^{(i)}}^\text {(ii)} + C\overbrace{\mathbb {I}_{\left\{ \sum _{l=1}^{N_i}\nu _{lk}^{(i)} < 1, k \le K_a + K_s\right\} }}^\text {(iii)}. \end{aligned} } \end{aligned}$$
(12)

where \(\varPsi (.)\) is the digamma function, \(\mathbb {I}\) is an indicator function, \(L_k^{(i)}\) is an indicator variable and \(\mathcal {L}_k \) is a lower bound for \(\mathbb {E}_{\tilde{w}}[\log (1 - \prod _{t=1}^k v_t^{(i)})]\). \(L_k^{(i)}\) indicates whether a concept (action/subject) k is part of the \(i^{th}\) video label set \(\varGamma ^{(i)}\) or not. If \(L_k^{(i)} = 0\), all the corresponding binary latent coefficients \(z_{jk}^{(i)}, j = \{1,\dots ,N_i\}\), are forced to 0, which is equivalent to enforcing the constraints in Eqs. (7) and (8). Note that the value of \(\nu _{jk}^{(i)}\) increases with \(\zeta _{jk}^{(i)}\). The terms (i)–(iii) in the update rule for \(\zeta _{jk}^{(i)}\) (Eq. (12)), which arise from the location constraints in Eq. (4)–(6), act as coupling terms between the \(\nu _{je}^{(i)}\)’s. For example, for any action concept, term (ii) suggests that if the location constraints are not yet satisfied, better localization of all the coupled subject concepts (a high value of \(\nu _{js}^{(i)}\)) will drive up the value of \(\zeta _{ja}^{(i)}\). This implies that strong localization of one concept can lead to better localization of other concepts.
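To make the role of the coupling terms explicit, here is a sketch of a single coordinate update of \(\nu _{jk}^{(i)}\) following Eq. (11)–(12), focusing on terms (i)–(iii). The digamma, lower-bound and appearance terms on the first two lines of (12) are assumed to be precomputed in zeta_base; the factor layout and names are ours, so treat this as an illustrative approximation rather than the exact implementation.

```python
import numpy as np

def update_nu_jk(zeta_base, nu, gamma, L_k, j, k, K_s, K_a, C):
    """One update of nu[j, k]; nu is (N_i, K), gamma lists (s, a) label pairs, L_k is L_k^{(i)}."""
    zeta = zeta_base
    for (s, a) in gamma:
        # term (i): k is the subject factor of a labeled pair (k, a) whose constraint is unmet
        if s == k and a is not None and np.sum(nu[:, s] * nu[:, K_s + a]) < 1.0:
            zeta += C * nu[j, K_s + a]
        # term (ii): k is the action factor of a labeled pair (s, k) whose constraint is unmet
        if a is not None and K_s + a == k and s is not None and np.sum(nu[:, s] * nu[:, k]) < 1.0:
            zeta += C * nu[j, s]
    # term (iii): a labeled-concept factor that is not yet localized anywhere in the video
    if k < K_s + K_a and np.sum(nu[:, k]) < 1.0:
        zeta += C
    return L_k / (1.0 + np.exp(-zeta))                 # Eq. (11)
```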

The hyperparameters \(\sigma _{ne}^2\) and \(\sigma _{Ae}^2\) can be set a priori or estimated from data. Similar to the maximization step of the EM algorithm, their empirical estimates can easily be obtained by maximizing the expected log-likelihood (see supplementary material).

Algorithm 1. Iterative update rules for WSC-SIIBP.

Given the input features \(\mathbf {X_s}\) and \(\mathbf {X_a}\), the inferred latent coefficients \(\nu _{je}^{(i)}\) estimate the presence/absence of the associated classes in a video. One can classify each spatio-temporal track by estimating the track-level label using \(L^*_j = \arg \max _{k} \nu _{jk}\), where the maximization is over the latent coefficients corresponding to either the subject or the action concepts, depending upon which label we are interested in extracting. For the concept localization task in a video with label pair (s, a), the best track in the video is selected using \(j^* = \arg \max _{j} \nu _{js}\times \nu _{ja}\).
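For instance, under the same factor layout (subject factors first, then action factors), these track-level decisions reduce to simple argmax operations over the inferred \(\nu \) values; a minimal sketch:

```python
import numpy as np

def classify_tracks(nu, K_s, K_a):
    """Track-level labels: argmax over the subject and action latent coefficients."""
    subjects = np.argmax(nu[:, :K_s], axis=1)              # best subject class per track
    actions = np.argmax(nu[:, K_s:K_s + K_a], axis=1)      # best action class per track
    return subjects, actions

def localize_pair(nu, s, a, K_s):
    """Best track for a video-level label pair (s, a): argmax_j nu_js * nu_ja."""
    return int(np.argmax(nu[:, s] * nu[:, K_s + a]))
```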

Test Inference: Although the above formulation is proposed for concept classification and localization in a given set of videos (transductive setting), the same algorithm can also be applied to unseen test videos. The latent coefficients for the tracks of test videos can be learned alongside the training data, except that the parameters \(\sigma _{ke}^2\), \(\mathbf {\Phi ^e}_{k}\), \(\sigma _{Ae}^2\) and \(\sigma _{ne}^2\) are updated only using training data. In the case of free annotation, i.e., absence of labels for test video i, we run the proposed approach by setting \(L^{(i)}_k = 1\) in Eq. (11), indicating that the tracks in video i can belong to any of the classes in \(\mathcal {S}\) or \(\mathcal {A}\) (i.e., no constraints as defined by (4)–(8) are enforced).

4 Experimental Results

In this section, we present an evaluation of WSC-SIIBP on two real-world datasets, the Casablanca movie dataset and the A2D dataset, which contain typical ‘in-the-wild’ videos with weak labels on heterogeneous concepts.

4.1 Datasets

\({\underline{\mathbf{Casablanca\, dataset}}}\): This dataset, introduced in [5], has 19 persons (movie actors) and three action classes (sitdown, walking, background). The heterogeneous concepts used in this dataset are persons and actions. The Casablanca movie is divided into shorter segments of duration either 60 or 120 s. We manually annotate all the tracks in each video segment which may contain multiple persons and actions. Given a video segment and the corresponding video-level labels (extracted from all ground truth track labels), our algorithm maps each of these labels to one or more tracks in that segment, i.e., converts the weak labels to strong labels. Our main objective of evaluation on this dataset is to compare the performance of various algorithms in classifying tracks from videos of varying length.

For our setting, we consider face and action as the two heterogeneous concepts and thus it is required to extract the face and the corresponding action track features. We extract 1094 facial tracks from the full 102 min Casablanca video. The face tracks are extracted by running the multi-view face detector from [35] in every frame and associating detections across frames using point tracks [36]. We follow [37] to generate the face track feature representations: Dense rootSIFT features are extracted for each face in the track followed by PCA and video-level Fisher vector encoding. The action tracks corresponding to 1094 facial tracks are obtained by extrapolating the face bounding-boxes using linear transformation [5]. For action features, we compute Fisher vector encoding on dense trajectories [38] extracted from each action track.
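For readers unfamiliar with this encoding, a simplified Fisher vector encoder over PCA-reduced local descriptors might look like the sketch below (gradients with respect to the GMM means only, followed by power and L2 normalization). The actual pipelines of [37, 38] include further components, and the variable names and parameter values here are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Simplified FV: gradients w.r.t. the GMM means only, then power + L2 normalization."""
    q = gmm.predict_proba(descriptors)                         # (N, K) soft assignments
    n = descriptors.shape[0]
    parts = []
    for k in range(gmm.n_components):
        diff = (descriptors - gmm.means_[k]) / np.sqrt(gmm.covariances_[k])  # diagonal covariances
        parts.append((q[:, k:k + 1] * diff).sum(axis=0) / (n * np.sqrt(gmm.weights_[k])))
    fv = np.concatenate(parts)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                     # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                   # L2 normalization

# Hypothetical usage on local descriptors pooled over a track:
# pca = PCA(n_components=64).fit(all_train_descriptors)
# gmm = GaussianMixture(n_components=64, covariance_type='diag').fit(pca.transform(all_train_descriptors))
# track_feature = fisher_vector(pca.transform(track_descriptors), gmm)
```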

On average, each 60 s segment contains 11 face-action tracks and 4 face-action annotations, while each 120 s segment contains 21 tracks and 6 annotations. Note that our experimental setup is more difficult than the setting considered in [5], where the Casablanca movie is divided into numerous bags based on the movie script, with each segment lasting 31 s on average and containing only 6.27 face-action tracks.

\({\underline{\mathbf{A2D\, dataset}}}\): This dataset [11] contains 3782 YouTube videos (on average 7–10 s. long) covering seven objects (bird, car etc.) performing one of nine actions (fly, jump etc.). The heterogeneous concepts considered are objects and actions. This dataset provides the bounding box annotations for every video label pair of object and action. Using the A2D dataset, we aim to analyze the track localization performance on weakly labeled videos as well as the track classification accuracy on a held-out test dataset.

We use the method proposed in [39] to generate spatio-temporal object track proposals. For computational efficiency, we consider only 10 tracks per video and use the ImageNet-pretrained VGG CNN-M network [40] to generate the object feature representation. We extract convolutional layer conv-4 and conv-5 features for each track image, followed by PCA and video-level Fisher vector encoding. In this dataset, the corresponding action tracks are kept the same as the object tracks (proposals), and the action features are extracted using the same approach as for the Casablanca dataset.

4.2 Baselines

We compare WSC-SIIBP to several state-of-the-art approaches using the same features.

  1. WS-DC [5]: This approach uses weak constraints similar to (4)–(6), but in a discriminative setup where the constraints are incorporated in a biconvex optimization framework.

  2. WS-SIBP [1]: This is a weakly supervised stacked IBP model which does not consider an integrative framework for heterogeneous data and only enforces constraints equivalent to (7)–(8). For each spatio-temporal track, the features extracted for the heterogeneous concepts are concatenated when using this approach.

  3. WS-S / WS-A: This is similar to WS-SIBP except that, instead of concatenating features from multiple concepts, the concepts are treated independently in two different IBPs. WS-S (WS-A) is used to model only the person/object (action) features.

  4. WS-SIIBP: This model integrates WS-SIBP with heterogeneous concepts.

  5. WSC-SIBP: This model is similar to WS-SIBP but, unlike WS-SIBP, additionally enforces the location constraints obtained from weak labels.

Implementation details: For each dataset, the Fisher encoded features are PCA-reduced to an appropriate dimension \(D^{e}\). We select the best feature length and other algorithm-specific hyper-parameters for each algorithm using cross-validation on a small set of input videos. For the IBP based models, the cross-validation ranges for the hyper-parameters (in start : step : stop notation) are \(K_{max} := K_a + K_s : 10 : K_a + K_s + 100\), \(\alpha := 3K_{max}: 10 : 4K_{max}\) and \(C := 0 : 0.5 : 5\). For all IBP based models, the parameters \(D^{e}\), \(\alpha \), \(K_{max}\) and C are set to 32, 100, 30 and 0.5, respectively, for the Casablanca dataset and to 128, 160, 50 and 5, respectively, for the A2D dataset. For WS-DC, \(D^{e}\) is set to 1024.

Fig. 3. Comparison of results for the Casablanca movie dataset. (a) Classification accuracy for 60 s segments. (b) Recall for background vs. non-background classes (60 s, person). (c) Recall for background vs. non-background classes (60 s, action). (d) Classification accuracy for 120 s segments. (e) Recall for background vs. non-background classes (120 s, person). (f) Recall for background vs. non-background classes (120 s, action). (g), (h) Mean average precision for 60 s and 120 s segments. (i) Classification accuracy obtained with and without constraints (7) and (8).

4.3 Results on Casablanca

The track-level classification performance is compared in Fig. 3. From Figs. 3a and d, it can be seen that WSC-SIIBP significantly outperforms the other methods for person and action classification in almost all scenarios. For instance, on the 120 s video segments, person classification improves by 4 % (a relative improvement of 7 %) compared to the most competitive approach, WS-SIIBP. We also compare pairwise label accuracy to gain insight into the importance of the constraints in Eq. (4)–(6). For any given track with non-background person and action labels, the classification is assumed to be correct only if both the person and action labels are correctly assigned. Even in this scenario, WSC-SIIBP performs 8.1 % better (a 24 % relative improvement) than the most competitive baseline. Since we combine the heterogeneous concepts along with location constraints in an integrated framework, WSC-SIIBP outperforms all other baselines. The weak results of WS-DC in pairwise classification, though surprising, can be attributed to its action classification results, which are significantly biased towards one particular action, ‘sitdown’ (Fig. 3d; note that WS-DC performs very poorly on ‘walking’ classification). Indeed, it should be noted that nearly 40 % and 89 % of the person and action labels, respectively, belong to the background class. Thus, for a fair evaluation of both background and non-background classes, we also plot the recall of the background class against the recall of the non-background classes for person and action classification in Fig. 3b, c, e and f. These curves were obtained by simultaneously computing recall for the background and non-background classes over a range of threshold values on the score \(\nu \). The mean average precision (mAP) of WSC-SIIBP along with all other baselines is plotted in Fig. 3g and h. The mAP values also clearly demonstrate the effectiveness of the proposed approach. From the performance of WS-SIIBP (integrative concepts, no constraints) and WSC-SIBP (no integrative concepts, constraints) in Fig. 3a and d, it is clear that the improvement in performance of WSC-SIIBP can be attributed to both the addition of integrative concepts and the location constraints.

Effect of constraints (7), (8): We note that, regardless of other differences, every weakly supervised IBP model considered here enforces constraints (7), (8). However, these constraints are not part of the original WS-DC. To make a fair comparison between WS-DC and WSC-SIIBP, we analyze the effect of these constraints in Fig. 3i. Although these additional constraints improve the WS-DC performance, they do not surpass the performance of WSC-SIIBP. Further, we observe that these constraints improve the performance of all the weakly supervised IBP models.

4.4 Results on A2D

First, we evaluate localization performance on the full A2D dataset. We experiment with 37,820 tracks extracted from 3,782 videos with around 5000 weak labels. For every given object-action label pair, our algorithm selects the best track from the corresponding video using the approach outlined in Sect. 3.6. The localization accuracy is measured by computing the average IoU (Intersection over Union) of the selected track (3-D bounding box) with the ground truth bounding box. The class-wise IoU accuracy and the mean IoU accuracy over all classes are tabulated in Tables 1 and 2 respectively. On this task, WSC-SIIBP leads to a relative improvement of 9 % over the next best baseline. We also evaluate how accurately the extracted object proposals match the ground truth bounding boxes to estimate an upper bound on the localization accuracy (referred to as Upper Bound in Tables 1 and 2). In this case, the track maximizing the average IoU with the ground truth annotation is selected and the corresponding IoU is reported. We plot the correct localization accuracy at varying IoU thresholds in Fig. 4a, which also shows the effectiveness of the proposed approach. Figures 4b and c show some qualitative track localization results obtained using the proposed approach on selected frames.
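For reference, the average-IoU measure can be computed frame by frame over a spatio-temporal track, e.g. as in the sketch below. This helper is our own; the exact A2D evaluation protocol (e.g. which frames carry ground truth annotations) may differ.

```python
import numpy as np

def box_iou(b1, b2):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area(b1) + area(b2) - inter
    return inter / union if union > 0 else 0.0

def track_iou(track, gt):
    """Average IoU of a track against ground truth; both map frame index -> (x1, y1, x2, y2)."""
    empty = (0.0, 0.0, 0.0, 0.0)
    frames = sorted(set(track) | set(gt))
    return float(np.mean([box_iou(track.get(f, empty), gt.get(f, empty)) for f in frames]))
```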

\({\underline{\mathbf{Test\, Inference}}}\): We evaluate the classification performance on held-out test samples using the same train/test partition as in [11]. We consider two setups for the evaluation, (a) using video-level labels for the test samples and (b) free annotation where no test video labels are provided. The proposed approach is compared with GT-SVM, which is a fully supervised linear SVM that uses ground truth bounding boxes and their corresponding strong labels during training. The results are tabulated in Table 3. Note that the performance of WSC-SIIBP is close to that of the fully supervised setup.

Fig. 4. (a) Correct localization accuracy at various IoU thresholds. (b) and (c) Qualitative results: green boxes show the concept localization obtained using our proposed approach.

Table 1. Per class mean IoU on A2D dataset.
Table 2. Average IoU comparison with other approaches on A2D dataset.
Table 3. mAP classification test accuracy on A2D dataset.

5 Conclusion

We developed a Bayesian non-parametric approach that integrates the Indian Buffet Process with heterogeneous concepts and spatio-temporal location constraints arising from weak labels. We report experimental results on two recent datasets containing heterogeneous concepts such as persons, objects and actions, and show that our approach outperforms the best state-of-the-art methods. In future work, we will extend the WSC-SIIBP model to additionally localize audio concepts from speech input and develop an end-to-end deep neural network for joint feature learning and Bayesian inference.