1 Introduction

The problem of object tracking is ubiquitous in computer vision. While many object tracking methods are available, multiple-person tracking remains extremely challenging [1]. In addition to the difficulties related to single-object tracking (occlusions, self-occlusions, visual appearance variability, unpredictable temporal behavior, etc.), tracking a varying and unknown number of objects makes the problem more challenging for the following reasons: (i) the observations delivered by detectors need to be associated with the objects being tracked, which includes discarding detection errors, (ii) the number of objects is not known in advance and hence must be estimated and updated over time, (iii) mutual occlusions (not present in single-object tracking scenarios) must be robustly handled, and (iv) the number of objects varies over time and one has to deal with hidden states of varying dimensionality, from zero when there is no visible object to a large number when many objects are detected. Note that in this case, if a Bayesian setting is considered, as is often the case, an exact recursive filtering solution is intractable.

Several multiple-person tracking methods have been proposed within the trans-dimensional Markov chain model [2], where the dimensionality of the state space is treated as a state variable. This allows tracking a variable number of objects by jointly estimating the number of objects and their states. This framework was exploited in [3–5] for tracking a varying number of objects. The main drawback is that the states are inferred by means of reversible-jump Markov chain Monte Carlo sampling, which is computationally expensive [6]. The random finite set framework proposed in [7–9] is also very popular: the targets are modeled as realizations of a random finite set composed of an unknown number of elements. Because an exact solution to this model is computationally intensive, an approximation known as the probability hypothesis density (PHD) filter was proposed [10]. Further sampling-based approximations of random-set based filters were subsequently proposed, e.g. [11–13]. These were exploited in [14] for tracking a time-varying number of active speakers using auditory cues and in [15] for multiple-target tracking using visual observations. Recently, conditional random fields have been introduced to address multiple-target tracking [16–18]. In this case, tracking is cast into an energy minimization problem. In radar tracking, popular multiple-target tracking methods are joint probabilistic data association (JPDA) and multiple hypothesis filters [19].

An interesting and less investigated framework for multiple-target tracking is the variational Bayesian class of models for tracking an unknown and varying number of persons. Although variational models are very popular in machine learning, their use for object tracking has been limited to tracking a fixed number of targets [20]. Variational Bayes methods approximate the joint a posteriori distribution of the complete set of latent variables by a separable distribution [21, 22]. In an online tracking scenario, where only past and current observations are available, this leads to approximating the filtering distribution. An interesting aspect of variational methods is that they yield closed-form expressions for the posterior distributions of the hidden variables and for the model parameters, thus enabling an intrinsically efficient filtering procedure implemented via a variational EM (VEM) algorithm. In this paper, we derive a variational Bayesian formulation for multiple-person tracking, and present results on the MOT 2016 challenge dataset [23]. The proposed method extends [24] in several respects: (i) the assignment variables are included in the filtering equation and therefore the state variables and the assignment variables are jointly inferred, (ii) a temporal window is incorporated in the visibility process, leading to a tracker that is more robust to misdetections, and (iii) a death process allows old tracks to be forgotten and thus opens the door to large-scale processing, as needed in many realistic situations. Finally, a full evaluation of the proposed tracker on the MOT 2016 challenge dataset assesses its performance against other state-of-the-art methods in a principled and systematic way. Examples of results obtained with our method and Matlab code are publicly available (see Footnote 1).

The remainder of this paper is organized as follows. Section 2 details the proposed Bayesian model and a variational solution is presented in Sect. 3. In Sect. 4, we describe the birth, visibility and death processes that handle an unknown and varying number of persons. Section 5 presents benchmarking results. Finally, Sect. 6 draws conclusions.

2 Variational Multiple-Person Tracking

We start by introducing our notations. Vectors and matrices are in bold, \(\mathbf {A}, \mathbf {a}\), and scalars are in italic, \(A, a\). In general, random variables are denoted with upper-case letters, e.g. \(\mathbf {A}\) and \(A\), and their realizations with lower-case letters, e.g. \(\mathbf {a}\) and \(a\).

Let N be the maximum number of persons. A track \(n \le N\) at time t is associated with the binary existence variable \(e_{tn}\), taking the value \(e_{tn}=1\) if the person has already been seen and \(e_{tn}=0\) otherwise. The vectorization of the existence variables at time t is denoted by \(\mathbf {e}_t=(e_{t1},...,e_{tN})\) and their sum, namely the effective number of tracked persons at t, is denoted by \(N_t = \sum _{n=1}^N e_{tn}\). The existence variables are assumed to be observed in Sects. 2 and 3; their inference, grounded in a stochastic birth process, is discussed in Sect. 4.

The kinematic state of person n is a random vector \(\mathbf {X}_{tn}=({\mathbf {L}_{tn}^{\top }},{\mathbf {U}_{tn}^{\top }})^\top \in \mathbb {R}^6\), where \(\mathbf {L}_{tn}\in \mathbb {R}^4\) is the person location and size, i.e., 2D image position, width and height, and \(\mathbf {U}_{tn}\in \mathbb {R}^2\) is the person velocity in the image plane. The multiple-person state random vector is denoted by \(\mathbf {X}_t=(\mathbf {X}_{t1}^\top ,\ldots ,\mathbf {X}_{tN}^\top )^\top \in \mathbb {R}^{6N}\).

Fig. 1. Examples of detected persons from the MOT 2016 dataset.

We assume the existence of a person detector, providing \(K_t\) localization observations at each time t. The k-th localization observation delivered by the detector at time t is denoted by \(\mathbf {y}_{tk}\in \mathbb {R}^4\), and represents the location (2D position, width, height) of a person, e.g. Figure 1. The set of observations at time t is denoted by \(\mathbf {y}_t=\{\mathbf {y}_{tk}\}_{k=1}^{K_t}\). Associated to \(\mathbf {y}_{tk}\), there is a photometric description of the person appearance, denoted by \(\mathbf {h}_{tk}\). This photometric observation is extracted from the bounding box of \(\mathbf {y}_{tk}\). Altogether, the localization and photometric observations constitute the observations \(\mathbf {o}_{tk}=(\mathbf {y}_{tk},\mathbf {h}_{tk})\) used by our tracker. Definitions analogous to \(\mathbf {y}_t\) hold for \(\mathbf {h}_t=\{\mathbf {h}_{tk}\}_{k=1}^{K_t}\) and \(\mathbf {o}_t=\{\mathbf {o}_{tk}\}_{k=1}^{K_t}\). The probability of a set of random variables is written as \(p(\mathbf {o}_t) = p(\mathbf {o}_{t1},\ldots ,\mathbf {o}_{tK_t})\).

We also define an observation-to-person assignment (hidden) variable \(Z_{tk}\), associated with each observation \(\mathbf {o}_{tk}\). \(Z_{tk}=n, n \in \{1 \ldots N\}\) means that \(\mathbf {o}_{tk}\) is associated to person n. It is common that a detection corresponds to some clutter instead of a person. We cope with these false detections by defining a clutter target. In practice, the index \(n=0\) is assigned to this clutter target, which is always visible, i.e. \(e_{t0}=1\) for all t. Hence, the set of possible values for \(Z_{tk}\) is extended to \(\{0\}\cup \{1 \ldots N\}\), and \(Z_{tk}=0\) means that observation \(\mathbf {o}_{tk}\) has been generated by clutter and not by a person. The practical consequence of adding a clutter track is that the observations assigned to it play no role in the estimation of the parameters of the other tracks, thus leading to an estimation robust to outliers.

2.1 The Online Tracking Model

The online multiple-person tracking problem is cast into the estimation of the filtering distribution of the hidden variables given the causal observations \(p(\mathbf {Z}_t,\mathbf {Z}_{t-1},\mathbf {X}_t,\mathbf {X}_{t-1}|\mathbf {o}_{1:t}, \mathbf {e}_{1:t})\), where \(\mathbf {o}_{1:t}= \{\mathbf {o}_1, \dots , \mathbf {o}_t \}\). Importantly, we assume that the observations at time t only depend on the hidden and visibility variables at time t. The filtering distribution can be written as:

$$\begin{aligned}&p(\mathbf {Z}_t,\mathbf {Z}_{t-1},\mathbf {X}_t,\mathbf {X}_{t-1}|\mathbf {o}_{1:t}, \mathbf {e}_{1:t}) = \nonumber \\&\quad \frac{p(\mathbf {o}_t|\mathbf {Z}_t,\mathbf {X}_t,\mathbf {e}_t)p(\mathbf {Z}_t,\mathbf {X}_t|\mathbf {Z}_{t-1},\mathbf {X}_{t-1},\mathbf {e}_t)p(\mathbf {X}_{t-1},\mathbf {Z}_{t-1}|\mathbf {o}_{1:t-1}, \mathbf {e}_{1:t})}{p(\mathbf {o}_t|\mathbf {o}_{1:t-1}, \mathbf {e}_{1:t})}. \end{aligned}$$
(1)

The denominator of (1) only involves observed variables and therefore its evaluation is not necessary as long as one can normalize the expression arising from the numerator. Hence we focus on the two terms of the latter, namely the observation model \(p(\mathbf {o}_t|\mathbf {Z}_t,\mathbf {X}_t,\mathbf {e}_t)\) and the dynamic distribution \(p(\mathbf {Z}_t,\mathbf {X}_t|\mathbf {Z}_{t-1},\mathbf {X}_{t-1},\mathbf {e}_t)\).

The Observation Model. The joint observations are assumed to be independent and identically distributed:

$$\begin{aligned} p(\mathbf {o}_t|\mathbf {Z}_t,\mathbf {X}_t,\mathbf {e}_t) = \prod _{k=1}^{K_t} p(\mathbf {o}_{tk}|Z_{tk},\mathbf {X}_t,\mathbf {e}_t). \end{aligned}$$
(2)

In addition, we make the reasonable assumption that, while localization observations depend both on the assignment variable and kinematic state, the appearance observations only depend on the assignment variable, that is the person identity, but not on his/her kinematic state. We also assume the localization and appearance observations to be independent given the hidden variables. Consequently, the observation likelihood of a single joint observation can be factorized as:

$$\begin{aligned} p(\mathbf {o}_{tk}|Z_{tk},\mathbf {X}_t,\mathbf {e}_t)= & {} p(\mathbf {y}_{tk},\mathbf {h}_{tk}|Z_{tk},\mathbf {X}_t,\mathbf {e}_t)\nonumber \\= & {} p(\mathbf {y}_{tk}|Z_{tk},\mathbf {X}_t,\mathbf {e}_t)p(\mathbf {h}_{tk}|Z_{tk},\mathbf {e}_t). \end{aligned}$$
(3)

The localization observation model is defined depending on whether the observation is generated by clutter or by a person:

  • If the observation is generated from clutter, namely \(Z_{tk} =0\), the variable \(\mathbf {y}_{tk}\) follows a uniform distribution with probability density function \(u(\mathbf {y}_{tk})\);

  • If the observation is generated by person n, namely \(Z_{tk} =n\), the variable \(\mathbf {y}_{tk}\) follows a Gaussian distribution with mean \(\mathbf {P}\mathbf {X}_{tn}\) and covariance \(\mathbf {\Sigma }\): \(\mathbf {y}_{tk}\sim g(\mathbf {y}_{tk};\mathbf {P}\mathbf {X}_{tn},\mathbf {\Sigma })\).

The linear operator \(\mathbf {P}\) maps the kinematic state vectors onto the space of observations. For example, when \(\mathbf {X}_{tn}\) represents the full-body kinematic state (full-body localization and velocity) and \(\mathbf {y}_{tk}\) represents the full-body localization observation, \(\mathbf {P}\) is a projection which, when applied to a state vector, only retains the localization components of the state vector. Finally, the full observation model is compactly defined by the following, where \(\delta _{ij}\) stands for the Kronecker delta:

$$\begin{aligned} p(\mathbf {y}_{tk}|Z_{tk} =n,\mathbf {X}_{t},\mathbf {e}_t)= u(\mathbf {y}_{tk})^{1-e_{tn}} \left( u(\mathbf {y}_{tk})^{\delta _{0n}} g(\mathbf {y}_{tk}; \; \mathbf {P}\mathbf {X}_{tn},\mathbf {\Sigma }) ^{1-\delta _{0n}}\right) ^{e_{tn}}. \end{aligned}$$
(4)

The appearance observation model is also defined depending on whether the observation is clutter or not. When the observation is generated by clutter, it follows a uniform distribution with density function \(u(\mathbf {h}_{tk})\). When the observation is generated by person n, it follows a Bhattacharyya distribution with density defined by

$$\begin{aligned} b(\mathbf {h}_{tk};\mathbf {h}_n)=\frac{1}{W_\lambda } \exp (-\lambda d_B(\mathbf {h}_{tk},\mathbf {h}_n) ), \end{aligned}$$

where \(\lambda \) is a positive skewness parameter, \(d_B(\cdot )\) is the Bhattacharyya distance between histograms, \(\mathbf {h}_n\) is the reference appearance model of person n, and \(W_\lambda \) is the normalization constant. This gives the following compact appearance observation model:

$$\begin{aligned} p(\mathbf {h}_{tk}|Z_{tk}=n,\mathbf {X}_t,\mathbf {e}_t) = u(\mathbf {h}_{tk})^{1-e_{tn}} (u(\mathbf {h}_{tk})^{\delta _{0n}} b(\mathbf {h}_{tk};\mathbf {h}_n)^{1-\delta _{0n}} )^{e_{tn}}. \end{aligned}$$
(5)
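As an illustration of the appearance term, the following minimal sketch evaluates the unnormalized score \(\exp (-\lambda d_B)\) between two histograms. Several things here are assumptions and not taken from the paper: the function names, the value of \(\lambda \), and the particular variant of the Bhattacharyya distance (the square root of one minus the Bhattacharyya coefficient). The normalization constant \(W_\lambda \) is omitted.

```python
import numpy as np

def bhattacharyya_distance(h1, h2):
    """Distance between two histograms; both are normalized to sum to 1 first."""
    p = np.asarray(h1, dtype=float)
    q = np.asarray(h2, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    bc = np.sum(np.sqrt(p * q))            # Bhattacharyya coefficient in [0, 1]
    return np.sqrt(max(0.0, 1.0 - bc))     # assumed variant: sqrt(1 - BC)

def appearance_score(h_obs, h_ref, lam=5.0):
    """Unnormalized b(h_obs; h_ref) = exp(-lam * d_B); W_lambda is left out."""
    return np.exp(-lam * bhattacharyya_distance(h_obs, h_ref))

# Identical histograms score close to 1, disjoint ones score exp(-lam).
same = appearance_score([2, 3, 5], [2, 3, 5])
disjoint = appearance_score([1, 0], [0, 1])
```

Because the score is only used in ratios in the assignment step, dropping \(W_\lambda \) is harmless as long as the same \(\lambda \) is used for all persons.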

The Dynamic Distribution. Here we make two assumptions. Firstly, at each time instance, the assignment variables do not depend on the previous assignments, so the dynamic distribution factorizes into the observation-to-person prior distribution and the predictive distribution. Secondly, the kinematic state dynamics follow a first-order Markov chain, meaning that the state \(\mathbf {X}_t\) only depends on the state \(\mathbf {X}_{t-1}\):

$$\begin{aligned} p(\mathbf {Z}_t,\mathbf {X}_t|\mathbf {Z}_{t-1},\mathbf {X}_{t-1},\mathbf {e}_t) = p(\mathbf {Z}_t|\mathbf {e}_t)p(\mathbf {X}_{t} | \mathbf {X}_{t-1}, \mathbf {e}_{t}). \end{aligned}$$
(6)

The Observation-to-Person Prior Distribution. The joint distribution of the assignment variables can be factorized as:

$$\begin{aligned} p(\mathbf {Z}_t|\mathbf {e}_t)= \prod _{k=1}^{K_t} p(Z_{tk}|\mathbf {e}_t). \end{aligned}$$
(7)

When observations are not yet available, given existence variables \(\mathbf {e}_t\), the assignment variables \(Z_{tk}\) are assumed to follow multinomial distributions defined as:

$$\begin{aligned} p(Z_{tk} =n|\mathbf {e}_t) = e_{tn}a_{tn} \quad \text {with}\quad \sum _{n=0}^Ne_{tn}a_{tn} = 1. \end{aligned}$$
(8)

Because \(e_{tn}\) takes the value 1 only for actual persons, the probability to assign an observation to a non-existing person is null. When person n is visible, \(a_{tn}\) represents the probability of observation \(\mathbf {y}_{tk}\) to be generated from person n.

The Predictive Distribution. The kinematic state predictive distribution \(p(\mathbf {X}_t | \mathbf {X}_{t-1}, \mathbf {e}_t)\) represents the probability distribution of the kinematic state at time t given the kinematic state at time \(t-1\) and the existence variables. The predictive distribution is mainly driven by the dynamics of the persons' kinematic states, which are modeled assuming that the person locations do not influence each other's dynamics, meaning that there is one first-order Markov chain for each person. Formally, this can be written as:

$$\begin{aligned} p(\mathbf {X}_t | \mathbf {X}_{t-1}, \mathbf {e}_t) = \prod _{n=1}^N p(\mathbf {X}_{tn} | \mathbf {X}_{t-1,n}, e_{tn}). \end{aligned}$$
(9)

For the model to be complete, \(p(\mathbf {X}_{tn} | \mathbf {X}_{t-1,n}, e_{tn})\) needs to be defined. The temporal evolution of the kinematic state \(\mathbf {X}_{tn}\) is defined as:

$$\begin{aligned} p(\mathbf {X}_{tn} = \mathbf {x}_{tn} |\mathbf {X}_{t-1,n} =\mathbf {x}_{t-1,n},e_{tn}) = u(\mathbf {x}_{tn})^{1-e_{tn}} g(\mathbf {x}_{tn}; \; \mathbf {D}\mathbf {x}_{t-1,n},\mathbf {\Lambda }_n) ^{e_{tn}}, \end{aligned}$$
(10)

where \(u(\mathbf {x}_{tn})\) is a uniform distribution over the motion state space, g is a Gaussian probability density function, \(\mathbf {D}\) represents the dynamics transition operator, and \(\mathbf {\Lambda }_n\) is a covariance matrix accounting for uncertainties on the state dynamics. The transition operator is defined as:

$$\begin{aligned} \mathbf {D}= \left( \begin{array}{cc} \mathbf {I}_{4\times 4} &{} \begin{array}{c} \mathbf {I}_{2\times 2} \\ \mathbf {0}_{2\times 2} \\ \end{array}\\ \mathbf {0}_{2\times 4} &{} \mathbf {I}_{2\times 2} \\ \end{array} \right) \end{aligned}$$

In other words, the state of person n follows a Gaussian with mean vector \(\mathbf {D}\mathbf {X}_{t-1,n}\) and covariance matrix \(\mathbf {\Lambda }_n\) if the person exists, and a uniform distribution otherwise. The complete set of parameters of the proposed model is denoted with \(\mathbf {\Theta }=\big ( \{\mathbf {\Sigma }\},\;\{\mathbf {\Lambda }_n\}_{n=1}^N,\mathbf {A}_{1:t}\big )\), with \(\mathbf {A}_{t}=\{a_{tn}\}_{n=0}^{N}\).
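To make the transition operator concrete, the sketch below builds \(\mathbf {D}\) and a localization projection \(\mathbf {P}\) with NumPy and applies them to one state vector. The state ordering (x, y, width, height, vx, vy) and the numerical values are assumptions chosen for illustration.

```python
import numpy as np

# Assumed state layout: (x, y, width, height, vx, vy).
D = np.eye(6)
D[0, 4] = 1.0   # x <- x + vx
D[1, 5] = 1.0   # y <- y + vy; width, height and velocity are carried over

# P retains the four localization components of the state.
P = np.hstack([np.eye(4), np.zeros((4, 2))])

x_prev = np.array([100.0, 50.0, 20.0, 60.0, 3.0, -1.0])
x_pred = D @ x_prev     # mean of the Gaussian in (10)
y_pred = P @ x_pred     # expected localization observation
```

With this layout, \(\mathbf {D}\) encodes a constant-velocity model for the position and a constant model for the bounding-box size.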

3 Variational Bayesian Inference

Because of the combinatorial nature of the observation-to-person assignment problem, a direct optimization of the filtering distribution (1) with respect to the hidden variables is intractable. We propose to overcome this problem via a variational Bayesian inference method. The principle of this family of methods is to approximate the intractable filtering distribution \(p(\mathbf {Z}_t,\mathbf {Z}_{t-1},\mathbf {X}_t,\mathbf {X}_{t-1}|\mathbf {o}_{1:t},\mathbf {e}_{1:t})\) by a separable distribution, e.g. \(q(\mathbf {Z}_t)\prod _{n=0}^Nq(\mathbf {X}_{tn})\). According to the variational Bayesian formulation [21, 22], given the observations and the parameters at the previous iteration \(\mathbf {\Theta }^\circ \), the optimal approximation has the following general expression:

$$\begin{aligned} \log q(\mathbf {Z}_{t})&=\mathbf {E}_{q(\mathbf {X}_t)q(\mathbf {X}_{t-1})q(\mathbf {Z}_{t-1})}\left\{ \log \widetilde{P}\right\} , \end{aligned}$$
(11)
$$\begin{aligned} \log q(\mathbf {Z}_{t-1})&=\mathbf {E}_{q(\mathbf {X}_t)q(\mathbf {X}_{t-1})q(\mathbf {Z}_{t})}\left\{ \log \widetilde{P}\right\} , \end{aligned}$$
(12)
$$\begin{aligned} \log q(\mathbf {X}_{tn})&= \mathbf {E}_{ q(\mathbf {Z}_t)q(\mathbf {Z}_{t-1}) q(\mathbf {X}_{t-1,n}) \prod _{m\ne n}q(\mathbf {X}_{tm})} \left\{ \log \widetilde{P}\right\} , \end{aligned}$$
(13)
$$\begin{aligned} \log q(\mathbf {X}_{t-1,n})&=\mathbf {E}_{ q(\mathbf {Z}_t)q(\mathbf {Z}_{t-1}) q(\mathbf {X}_{t,n}) \prod _{m\ne n}q(\mathbf {X}_{t-1,m})}\left\{ \log \widetilde{P}\right\} , \end{aligned}$$
(14)

where, for simplicity, we used the notation \(\widetilde{P}=p(\mathbf {Z}_t,\mathbf {Z}_{t-1},\mathbf {X}_t,\mathbf {X}_{t-1}|\mathbf {o}_{1:t},\mathbf {e}_{1:t},\mathbf {\Theta }^\circ )\). In our particular case, when these equations are put together with the probabilistic model defined in (2), (6) and (9), the expression of \(q(\mathbf {Z}_t)\) is factorized further into:

$$\begin{aligned} \log q(Z_{tk})&=\mathbf {E}_{q(\mathbf {X}_t)q(\mathbf {X}_{t-1})q(\mathbf {Z}_{t-1})}\left\{ \log \widetilde{P} \right\} , \end{aligned}$$
(15)

Note that this equation leads to a finer factorization than the one we initially imposed. This behavior is typical of variational Bayes methods in which a very mild separability assumption can lead to a much finer factorization when combined with priors over hidden states and latent variables, i.e. (2), (6) and (9). The final factorization writes:

$$\begin{aligned} p(\mathbf {Z}_t,\mathbf {Z}_{t-1},\mathbf {X}_t,\mathbf {X}_{t-1}|\mathbf {o}_{1:t},\mathbf {e}_{1:t})\approx \prod _{k=1}^{K_t} q(Z_{tk}) \prod _{k=1}^{K_{t-1}} q(Z_{t-1,k}) \prod _{n=0}^{N} q(\mathbf {X}_{tn}) q(\mathbf {X}_{t-1,n}). \end{aligned}$$
(16)

Once the posterior distribution over the hidden variables is computed (see below), the optimal parameters are estimated using \(\hat{\mathbf {\Theta }} =\arg \max _{\mathbf {\Theta }} J(\mathbf {\Theta },\mathbf {\Theta }^\circ )\) with \(J\) defined as:

$$\begin{aligned} J(\mathbf {\Theta },\mathbf {\Theta }^\circ )=\mathbf {E}_{q(\mathbf {Z},\mathbf {X})}\left\{ \log p(\mathbf {Z}_t,\mathbf {Z}_{t-1},\mathbf {X}_t,\mathbf {X}_{t-1},\mathbf {o}_{1:t}| \mathbf {e}_{1:t},\mathbf {\Theta },\mathbf {\Theta }^\circ )\right\} . \end{aligned}$$
(17)

3.1 E-Z-Step

The estimation of \(q(Z_{tk})\) is carried out by developing the expectation (15) which yields the following formula:

$$\begin{aligned} q(Z_{tk}=n) = \alpha _{tkn} = \frac{e_{tn} \epsilon _{tkn} a_{tn}}{\sum _{m=0}^N e_{tm} \epsilon _{tkm} a_{tm}}, \end{aligned}$$
(18)

and \(\epsilon _{tkn}\) is defined as:

$$\begin{aligned} \epsilon _{tkn} = \left\{ \begin{array}{ll} u(\mathbf {y}_{tk}) u(\mathbf {h}_{tk}) &{} n=0, \\ g(\mathbf {y}_{tk};\mathbf {P}\varvec{\mu }_{tn},\mathbf {\Sigma }) e^{-\frac{1}{2} \text {Tr}\left( \mathbf {P}^{\top } \left( \mathbf {\Sigma }\right) ^{-1} \mathbf {P}\mathbf {\Gamma }_{tn}\right) } b(\mathbf {h}_{tk};\mathbf {h}_n) &{} n \ne 0, \end{array}\right. \end{aligned}$$
(19)

where \(\text {Tr}(\cdot )\) is the trace operator and \(\varvec{\mu }_{tn}\) and \(\mathbf {\Gamma }_{tn}\) are defined by (21) and (22) below. Intuitively, this approximation shows that the assignment of an observation to a person is based on spatial proximity between the observation localization and the person localization, and the similarity between the observation’s appearance and the person’s reference appearance.
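The assignment update can be sketched as follows: the function evaluates \(\epsilon _{tkn}\) for clutter and for each existing person, weights it by the prior \(a_{tn}\), and normalizes over n. This is a minimal illustration under stated assumptions, not the authors' implementation; the function names, array layouts, the precomputed appearance scores `B`, and all numerical values are hypothetical.

```python
import numpy as np

def gaussian_pdf(y, mean, cov):
    """Multivariate Gaussian density g(y; mean, cov)."""
    d = y - mean
    norm = np.sqrt((2.0 * np.pi) ** len(y) * np.linalg.det(cov))
    return np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / norm

def e_z_step(Y, B, mu, Gamma, P, Sigma, a, e, u_y, u_h):
    """
    Y: (K, 4) localization observations; B: (K, N) appearance scores b(h_k; h_n).
    mu: (N, 6) and Gamma: (N, 6, 6) are the current state means and covariances.
    a, e: (N+1,) priors and existence flags, index 0 being clutter (e[0] = 1).
    u_y, u_h: values of the uniform densities. Returns alpha with shape (K, N+1).
    """
    K, N = B.shape
    Sigma_inv = np.linalg.inv(Sigma)
    eps = np.zeros((K, N + 1))
    eps[:, 0] = u_y * u_h                      # clutter term
    for n in range(1, N + 1):
        if e[n] == 0:
            continue                           # non-existing persons get zero weight
        penalty = np.exp(-0.5 * np.trace(P.T @ Sigma_inv @ P @ Gamma[n - 1]))
        for k in range(K):
            g = gaussian_pdf(Y[k], P @ mu[n - 1], Sigma)
            eps[k, n] = g * penalty * B[k, n - 1]
    w = e * a * eps                            # broadcast over observations
    return w / w.sum(axis=1, keepdims=True)

# Two well-separated persons and one observation near each of them.
P = np.hstack([np.eye(4), np.zeros((4, 2))])
mu = np.array([[0.0, 0.0, 10.0, 20.0, 0.0, 0.0],
               [100.0, 100.0, 10.0, 20.0, 0.0, 0.0]])
Gamma = np.stack([0.1 * np.eye(6), 0.1 * np.eye(6)])
Y = np.array([[1.0, 0.0, 10.0, 20.0], [99.0, 101.0, 10.0, 20.0]])
alpha = e_z_step(Y, np.ones((2, 2)), mu, Gamma, P, 4.0 * np.eye(4),
                 a=np.array([0.2, 0.4, 0.4]), e=np.array([1.0, 1.0, 1.0]),
                 u_y=1e-8, u_h=1.0)
```

In this toy configuration each observation is assigned almost entirely to the nearest person, and the clutter column keeps the normalization well defined even when all Gaussian terms underflow.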

3.2 E-X-Step

The estimation of \(q(\mathbf {X}_{tn})\) is derived from (13). As for the previous posterior distribution, developing the expectation boils down to the following formula:

$$\begin{aligned} q(\mathbf {X}_{tn}) = u(\mathbf {X}_{tn})^{1-e_{tn}} g(\mathbf {X}_{tn};\varvec{\mu }_{tn},\mathbf {\Gamma }_{tn} )^{e_{tn}}, \end{aligned}$$
(20)

where the mean vector \(\varvec{\mu }_{tn}\) and the covariance matrix \(\mathbf {\Gamma }_{tn}\) are given by:

$$\begin{aligned} \mathbf {\Gamma }_{tn}&= \left( \sum _{k=1}^{K_t} \alpha _{tkn} \left( \mathbf {P}^{\top } \left( \mathbf {\Sigma }\right) ^{-1} \mathbf {P}\right) + \mathbf {\Lambda }_n^{-1} \right) ^{-1}, \end{aligned}$$
(21)
$$\begin{aligned} \varvec{\mu }_{tn}&= \mathbf {\Gamma }_{tn}\Big ( \sum _{k=1}^{K_t} \alpha _{tkn} \mathbf {P}^{\top } \left( \mathbf {\Sigma }\right) ^{-1} \mathbf {y}_{tk} + \mathbf {\Lambda }_n^{-1}\mathbf {D}\varvec{\mu }_{t-1,n} \Big ). \end{aligned}$$
(22)

Similarly, for the estimation of the distribution

$$\begin{aligned} q(\mathbf {X}_{t-1,n}) = u(\mathbf {X}_{t-1,n})^{1-e_{tn}} g(\mathbf {X}_{t-1,n};\widehat{\varvec{\mu }}_{t-1,n},\widehat{\mathbf {\Gamma }}_{t-1,n} )^{e_{tn}}, \end{aligned}$$
(23)

the mean and covariance are:

$$\begin{aligned} \widehat{\mathbf {\Gamma }}_{t-1,n}&= \Big ( \mathbf {D}^\top \mathbf {\Lambda }_n^{-1} \mathbf {D}+ \mathbf {\Gamma }_{t-1,n}^{-1} \Big )^{-1}, \end{aligned}$$
(24)
$$\begin{aligned} \widehat{\varvec{\mu }}_{t-1,n}&= \widehat{\mathbf {\Gamma }}_{t-1,n}\Big ( \mathbf {D}^\top \mathbf {\Lambda }_n^{-1} \varvec{\mu }_{t,n} + \mathbf {\Gamma }_{t-1,n}^{-1} \varvec{\mu }_{t-1,n} \Big ). \end{aligned}$$
(25)

We note that the variational approximation of the kinematic-state distribution is reminiscent of the Kalman filter solution of a linear dynamical system, with one main difference: in our formulation, (21) and (22), the means and covariances are computed by weighting the observations with \(\alpha _{tkn}\).
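The state update in (21) and (22) can be sketched for a single person as below. The function name, argument layout and toy values are assumptions made for illustration; the code simply transcribes the two formulas, with the observations weighted by the responsibilities of that person.

```python
import numpy as np

def e_x_step(Y, alpha_n, mu_prev, P, Sigma, Lambda, D):
    """
    Y: (K, 4) observations; alpha_n: (K,) responsibilities of person n.
    mu_prev: (6,) previous mean. Returns (mu, Gamma) as in (22) and (21).
    """
    Sigma_inv = np.linalg.inv(Sigma)
    Lambda_inv = np.linalg.inv(Lambda)
    # (21): precision = total responsibility * P^T Sigma^-1 P + Lambda^-1
    Gamma = np.linalg.inv(alpha_n.sum() * P.T @ Sigma_inv @ P + Lambda_inv)
    # (22): responsibility-weighted observations plus the dynamic prediction
    weighted_obs = alpha_n @ Y
    mu = Gamma @ (P.T @ Sigma_inv @ weighted_obs + Lambda_inv @ D @ mu_prev)
    return mu, Gamma

P = np.hstack([np.eye(4), np.zeros((4, 2))])
D = np.eye(6)
D[0, 4] = D[1, 5] = 1.0

# One observation fully assigned to the person, unit covariances.
mu_obs, Gamma_obs = e_x_step(np.array([[10.0, 0.0, 0.0, 0.0]]), np.array([1.0]),
                             np.zeros(6), P, np.eye(4), np.eye(6), D)

# No observation assigned: the update reduces to the prediction D mu_prev.
mu_pred, Gamma_pred = e_x_step(np.array([[10.0, 0.0, 0.0, 0.0]]), np.array([0.0]),
                               np.array([1.0, 2.0, 3.0, 4.0, 1.0, 1.0]),
                               P, np.eye(4), np.eye(6), D)
```

When all responsibilities vanish, the posterior falls back to the dynamic prediction with covariance \(\mathbf {\Lambda }_n\), which is the behavior one expects for an unobserved person.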

3.3 M-Step

Once the posterior distribution of the hidden variables is estimated, the optimal parameter values can be estimated via maximization of \(J\) defined in (17). Concerning the parameters of the a priori observation-to-object assignment \(\mathbf {A}_t\) we compute:

$$\begin{aligned} J(a_{tn}) = \sum _{k=1}^{K_t} e_{tn}\alpha _{tkn}\log (e_{tn}a_{tn}) \quad \text {s.t.} \quad \sum _{n=0}^N e_{tn}a_{tn} = 1, \end{aligned}$$
(26)

and we trivially obtain:

$$\begin{aligned} a_{tn} = \frac{e_{tn}\sum _{k=1}^{K_t}\alpha _{tkn}}{ \sum _{m=0}^N e_{tm}\sum _{k=1}^{K_t}\alpha _{tkm}}. \end{aligned}$$
(27)

The observation covariance \(\mathbf {\Sigma }\) and the state covariances \(\mathbf {\Lambda }_n\) can be estimated during the M-step. However, in our current implementation estimates for \(\mathbf {\Sigma }\) and \(\mathbf {\Lambda }_n\) are instantaneous, i.e., they are obtained only from the observations at time t (see the experimental section for details).
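The closed-form update (27) of the assignment priors amounts to normalizing the summed responsibilities of the existing tracks. A minimal sketch, with a hypothetical function name and toy values:

```python
import numpy as np

def update_assignment_priors(alpha, e):
    """
    alpha: (K, N+1) responsibilities from the E-Z step, column 0 being clutter.
    e: (N+1,) existence flags with e[0] = 1. Returns a_t as in (27).
    """
    counts = e * alpha.sum(axis=0)   # expected number of observations per track
    return counts / counts.sum()

alpha = np.array([[0.1, 0.9, 0.0],
                  [0.3, 0.7, 0.0]])
a_t = update_assignment_priors(alpha, np.array([1.0, 1.0, 0.0]))
```

Here the clutter track receives prior 0.2, the first person 0.8, and the non-existing second person 0.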

4 Person-Birth, -Visibility and -Death Processes

Tracking a time-varying number of targets requires procedures to create tracks when new targets enter the scene and to delete tracks when the corresponding targets leave the visual scene. In this paper, we propose a statistical-test based birth process that creates new tracks and a hidden Markov model (HMM) based visibility process that handles disappearing targets. So far, we assumed that the existence variables \(e_{tn}\) were given. In this section we present the inference model for the existence variables, based on the stochastic birth process.

4.1 Birth Process

The principle of the person birth process is to search for consistent trajectories in the history of observations associated to clutter. Intuitively, two hypotheses are confronted, namely: (i) the considered observation sequence is generated by a person not being tracked and (ii) the considered observation sequence is generated by clutter.

The model of the hypothesis “the considered observation sequence is generated by a person not being tracked” is based on the observation and dynamic models defined in (4) and (10). If there is a not-yet-tracked person n generating the considered observation sequence \(\{\mathbf {y}_{t-L,k_L},\ldots ,\mathbf {y}_{t,k_0}\}\) (Footnote 2), then the observation likelihood is \(p(\mathbf {y}_{t-l,k_l}|\mathbf {x}_{t-l,n}) = g(\mathbf {y}_{t-l,k_l};\mathbf {P}\mathbf {x}_{t-l,n}, \mathbf {\Sigma })\) and the person trajectory is governed by the dynamical model \(p(\mathbf {x}_{t,n}|\mathbf {x}_{t-1,n}) = g(\mathbf {x}_{t,n};\mathbf {D}\mathbf {x}_{t-1,n},\mathbf {\Lambda }_n)\). Since there is no prior knowledge about the starting point of the track, we assume a “flat” Gaussian distribution over \(\mathbf {x}_{t-L,n}\), namely \(p_b(\mathbf {x}_{t-L,n})=g(\mathbf {x}_{t-L,n};\mathbf {m}_b,\mathbf {\Gamma }_b)\), which is approximately equivalent to a uniform distribution over the image. Consequently, the joint observation distribution writes:

$$\begin{aligned} \tau _0&=p (\mathbf {y}_{t,k_0},\ldots ,\mathbf {y}_{t-L,k_L}) \nonumber \\&= \int p(\mathbf {y}_{t,k_0},\ldots ,\mathbf {y}_{t-L,k_L},\mathbf {x}_{t:t-L,n})d\mathbf {x}_{t:t-L,n} \nonumber \\&= \int \prod _{l=0}^L p(\mathbf {y}_{t-l,k_l}|\mathbf {x}_{t-l,n})\times \prod _{l=0}^{L-1} p(\mathbf {x}_{t-l,n}|\mathbf {x}_{t-l-1,n})\times p_b(\mathbf {x}_{t-L,n}) d\mathbf {x}_{t:t-L,n}, \end{aligned}$$
(28)

which can be seen as the marginal of a multivariate Gaussian distribution. Therefore, the joint observation distribution \(p(\mathbf {y}_{t,k_0},\mathbf {y}_{t-1,k_1},\ldots ,\mathbf {y}_{t-L,k_L})\) is also Gaussian and can be explicitly computed.

The model of “the considered observation sequence is generated by clutter” hypothesis is based on the observation model given in (4). When the considered observation sequence \(\{\mathbf {y}_{t,k_0},\ldots ,\mathbf {y}_{t-L,k_L}\}\) is generated by clutter, observations are independent and identically uniformly distributed. In this case, the joint observation likelihood is

$$\begin{aligned} \tau _1=p (\mathbf {y}_{t,k_0},\ldots ,\mathbf {y}_{t-L,k_L}) =\prod _{l=0}^L u(\mathbf {y}_{t-l,k_l}). \end{aligned}$$
(29)

Finally, our birth process is as follows: for all \(\mathbf {y}_{t,k_0}\) such that \(\tau _0>\tau _1\), a new person is added by setting \(e_{tn}=1, q(\mathbf {x}_{t,n};\varvec{\mu }_{t,n},\mathbf {\Gamma }_{t,n})\) with \(\varvec{\mu }_{t,n} = [\mathbf {y}_{t,k_0}^\top ,\mathbf {0}_2^\top ]^\top \), and \(\mathbf {\Gamma }_{tn}\) is set to the value of a birth covariance matrix (see (20)). Also, the reference appearance model for the new person is defined as \(\mathbf {h}_{t,n}=\mathbf {h}_{t,k_0}\).
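The birth test can be illustrated numerically. One standard way to evaluate the Gaussian marginal \(\tau _0\) of a linear-Gaussian chain is the prediction-error decomposition of a Kalman filter; this is our choice for the sketch and not necessarily how the authors compute it. The function names, the flat prior parameters, the covariances and the uniform density are all assumed values.

```python
import numpy as np

def log_gauss(y, mean, cov):
    """Log-density of a multivariate Gaussian."""
    d = y - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(y) * np.log(2.0 * np.pi) + logdet + d @ np.linalg.solve(cov, d))

def log_tau0(Ys, D, P, Sigma, Lambda, m_b, Gamma_b):
    """Log-marginal of observations Ys (oldest first) under the person hypothesis."""
    m, G = m_b, Gamma_b
    total = 0.0
    for i, y in enumerate(Ys):
        if i > 0:                              # propagate through the dynamics (10)
            m = D @ m
            G = D @ G @ D.T + Lambda
        S = P @ G @ P.T + Sigma                # predicted observation covariance
        total += log_gauss(y, P @ m, S)
        gain = G @ P.T @ np.linalg.inv(S)      # condition the state on y
        m = m + gain @ (y - P @ m)
        G = G - gain @ S @ gain.T
    return total

def log_tau1(n_obs, log_u):
    """Log-likelihood of n_obs independent uniformly distributed clutter observations."""
    return n_obs * log_u

P = np.hstack([np.eye(4), np.zeros((4, 2))])
D = np.eye(6)
D[0, 4] = D[1, 5] = 1.0
m_b = np.array([320.0, 240.0, 20.0, 60.0, 0.0, 0.0])       # assumed flat prior
Gamma_b = np.diag([1e5, 1e5, 1e3, 1e3, 1e2, 1e2])
log_u = -np.log(640.0 * 480.0 * 100.0 * 100.0)             # assumed uniform density

smooth = np.array([[100.0, 100.0, 20.0, 60.0],
                   [102.0, 100.0, 20.0, 60.0],
                   [104.0, 100.0, 20.0, 60.0]])
jumpy = np.array([[100.0, 100.0, 20.0, 60.0],
                  [400.0, 300.0, 20.0, 60.0],
                  [50.0, 500.0, 20.0, 60.0]])
args = (D, P, 4.0 * np.eye(4), np.eye(6), m_b, Gamma_b)
birth_smooth = log_tau0(smooth, *args) > log_tau1(3, log_u)
birth_jumpy = log_tau0(jumpy, *args) > log_tau1(3, log_u)
```

With these values a smoothly moving sequence passes the test while a sequence that jumps across the image is left to clutter; the outcome depends on the assumed covariances and uniform density.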

4.2 Visibility Process

A tracked person is said to be visible at time t whenever there are observations associated with that person; otherwise the person is considered not visible. Instead of deleting tracks, as is classical for death processes, our model labels tracks without associated observations as sleeping. In this way, we keep the possibility of awakening such sleeping tracks whenever their reference appearance closely matches an observed appearance.

We denote the n-th person visibility (binary) variable by \(V_{tn}\), meaning that the person is visible at time t if \(V_{tn}=1\) and 0 otherwise. We assume the existence of a transition model for the hidden visibility variable \(V_{tn}\). More precisely, the visibility state temporal evolution is governed by the transition matrix, \(p(V_{tn} =j|V_{t-1,n}=i)=\pi _v^{\delta _{ij}} (1-\pi _v)^{1-\delta _{ij}}\), where \(\pi _v\) is the probability to remain in the same state. To enforce temporal smoothness, the probability to remain in the same state is taken higher than the probability to switch to another state.

The goal now is to estimate the visibility of all the persons. For this purpose we define the visibility observations as \(\nu _{tn}=e_{tn} a_{tn}\), being 0 when no observation is associated to person n. In practice, we need to filter the visibility state variables \(V_{tn}\) using the visibility observations \(\nu _{tn}\). In other words, we need to estimate the filtering distribution \(p(V_{tn}|\nu _{1:t,n},e_{1:t,n})\) which can be written as:

$$\begin{aligned} p(V_{tn}&=v_{tn}|\nu _{1:t,n},e_{1:t,n})= \nonumber \\&\frac{p(\nu _{tn}|v_{tn},e_{tn})\sum _{v_{t-1,n}} p(v_{tn}|v_{t-1,n}) p(v_{t-1,n}|\nu _{1:t-1,n},e_{1:t-1,n})}{p(\nu _{tn}|\nu _{1:t-1,n},e_{1:t,n})}, \end{aligned}$$
(30)

where the denominator corresponds to summing the numerator over \(v_{tn}\). In order to fully specify the model, we define the visibility observation likelihood as:

$$\begin{aligned} p(\nu _{tn}|v_{tn},e_{tn})=(1-\text {exp}(-\lambda \nu _{tn}))^{v_{tn}}(\text {exp}(-\lambda \nu _{tn}))^{1-v_{tn}}. \end{aligned}$$
(31)

Intuitively, when \(\nu _{tn}\) is high, the likelihood is large if \(v_{tn}=1\) (person is visible). The opposite behavior is found when \(\nu _{tn}\) is small. Importantly, at each frame, because the visibility state is a binary variable, its filtering distribution can be straightforwardly computed. We found this rather intuitive strategy to be somewhat unstable over time, even when taking the Markov dependency into account. This is why we enriched the visibility observations to span multiple frames, \(\nu _{tn}=\sum _{l=0}^{L}e_{t+l,n}a_{t+l,n}\), so that if \(v_{tn}=1\), the likelihood is large when \(\nu _{tn}\) is high, i.e. when the target is visible in one or more neighboring frames. This is the counterpart of the hypothesis test spanning several frames used in the birth process.
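Since the visibility state is binary, one filtering step reduces to a two-term prediction followed by a reweighting. The sketch below assumes a likelihood that favors the visible state when \(\nu _{tn}\) is large, as described in the text; the function name and the values of \(\pi _v\) and \(\lambda \) are assumptions.

```python
import numpy as np

def visibility_filter_step(p_prev_visible, nu, pi_v=0.8, lam=10.0):
    """One filtering step for a single track; returns p(V_t = 1 | nu_{1:t})."""
    # Prediction with the symmetric transition matrix (stay with probability pi_v).
    pred_vis = pi_v * p_prev_visible + (1.0 - pi_v) * (1.0 - p_prev_visible)
    pred_hid = 1.0 - pred_vis
    # Assumed likelihood: large nu supports "visible", small nu supports "not visible".
    lik_vis = 1.0 - np.exp(-lam * nu)
    lik_hid = np.exp(-lam * nu)
    num_vis = lik_vis * pred_vis
    num_hid = lik_hid * pred_hid
    return num_vis / (num_vis + num_hid)   # the denominator sums over both states

p_seen = visibility_filter_step(0.5, nu=1.0)     # observations assigned to the track
p_unseen = visibility_filter_step(0.5, nu=0.0)   # no observation assigned
```

Applying the step frame after frame yields the filtered visibility probability of each track, which can then be thresholded to label tracks as visible or sleeping.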

4.3 Death Process

The idea of the person-visibility process arises from encouraging track consistency when a target disappears and then reappears in the field of view. However, a tracker that remembers all the tracks that have been previously seen is hardly scalable. Indeed, the memory resources required by a system that remembers all previous appearance templates grow indefinitely with new appearances. Therefore, one must discard old information to facilitate the scalability of the method to large datasets containing sequences with several dozens of different people involved. In addition to alleviating the memory requirements, this also reduces the computational complexity of the tracker. This is the motivation for including a death process in the proposed variational framework. Intuitively, one would like to discard those tracks that have not been seen during several frames. In practice, we found that discarding those tracks that are not visible for ten consecutive frames yields a good trade-off between complexity, resource demand and performance. Setting this parameter for a different dataset should not be difficult, since its interpretation is straightforward.
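The death rule described above can be sketched with a per-track counter of consecutive invisible frames. The data structures and names are hypothetical; only the threshold of ten frames comes from the text.

```python
def apply_death_process(counters, visible, max_invisible=10):
    """
    counters: dict mapping track id -> number of consecutive invisible frames.
    visible: dict mapping track id -> bool for the current frame.
    Returns the updated counters of the tracks that are kept.
    """
    kept = {}
    for track_id, count in counters.items():
        # Reset the counter when the track is visible, otherwise increment it.
        count = 0 if visible.get(track_id, False) else count + 1
        if count < max_invisible:
            kept[track_id] = count
    return kept

# Track 1 reaches ten invisible frames and is discarded; track 2 is seen again.
remaining = apply_death_process({1: 9, 2: 3}, {1: False, 2: True})
```

Discarding a track would also free its reference appearance model, which is what bounds the memory footprint on long sequences.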

5 Experiments, Performance Evaluation, and Benchmark

We evaluated the performance of the proposed variational multiple-person tracker on the MOT 2016 challenge dataset [23]. This dataset is composed of seven training videos and seven test videos. Importantly, we use the detections that are provided with the dataset. Because multiple-person tracking intrinsically implies track creation (birth), deletion (death), target identity maintenance, and localization, evaluating multiple-person tracking techniques is a non-trivial task. Many metrics have been proposed, e.g. [25–28].

We adopt the metrics used by the MOT 2016 benchmark, namely those of [27]. The main tracking measures are: the multiple-object tracking accuracy (MOTA), which combines false positives (FP), missed targets (FN), and identity switches (ID); the multiple-object tracking precision (MOTP), which measures the alignment of the tracker output bounding boxes with the ground truth; the number of false alarms per frame (FAF); the ratio of mostly tracked trajectories (MT); the ratio of mostly lost trajectories (ML); and the number of track fragmentations (Frag).
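For reference, the way MOTA combines the three error types can be sketched with its standard CLEAR MOT definition, MOTA = 1 - (FN + FP + ID) / GT, where GT is the total number of ground-truth boxes over all frames. The function name and the counts below are illustrative and are not results from this paper.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple-object tracking accuracy from counts summed over all frames.

    num_gt is the total number of ground-truth boxes in the sequence.
    The value can be negative when the errors outnumber the ground truth.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt


# Illustrative counts only.
score = mota(false_negatives=30, false_positives=10, id_switches=5, num_gt=100)
```

Because missed targets enter the numerator directly, a tracker that misses many detections obtains a low MOTA even when the boxes it does output are well aligned.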

Figure 2 shows sample images of all test videos. They comprise three sequences recorded with static cameras (MOT16-01, MOT16-03 and MOT16-08), which contain very crowded scenes and are thus very challenging, and four sequences with large camera motions, both translations and rotations, which make the data even more difficult to process.

Fig. 2.

Sample images from the MOT16 test sequences.

As explained above, we use the public pedestrian detections provided within the MOT16 challenge. These static detections are complemented in two different ways. First, we extract velocity observations by means of a simple optical-flow based algorithm that looks for the most similar region in the next frame within a neighborhood of the original detection. Therefore, the observation operator P is the identity matrix, projecting the entire state variable into the observation space. Second, the appearance feature vector is the concatenation of joint color histograms of three regions of the torso in HSV space.
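The velocity observation can be illustrated with a small block-matching sketch. The similarity measure is not specified in the text, so the sum of absolute differences used here is an assumption, as are the function name, the grayscale nested-list image format and the search radius.

```python
def best_match_offset(frame, next_frame, box, radius):
    """Return the offset (dx, dy) of the most similar region in next_frame.

    frame, next_frame: 2-D lists of grayscale intensities of equal size.
    box: (x, y, w, h) of the detection in `frame`.
    radius: search radius in pixels around the original position.
    Similarity is the sum of absolute differences (an assumption).
    """
    x, y, w, h = box
    rows, cols = len(next_frame), len(next_frame[0])
    best, best_cost = (0, 0), float("inf")
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            nx, ny = x + dx, y + dy
            if nx < 0 or ny < 0 or nx + w > cols or ny + h > rows:
                continue  # candidate window leaves the image
            cost = sum(
                abs(frame[y + j][x + i] - next_frame[ny + j][nx + i])
                for j in range(h)
                for i in range(w)
            )
            if cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best


# Toy example: a 2x2 patch moves by 2 pixels right and 1 pixel down.
frame = [[0] * 6 for _ in range(6)]
next_frame = [[0] * 6 for _ in range(6)]
patch = [[5, 6], [7, 8]]
for j in range(2):
    for i in range(2):
        frame[1 + j][1 + i] = patch[j][i]
        next_frame[2 + j][3 + i] = patch[j][i]

velocity = best_match_offset(frame, next_frame, box=(1, 1, 2, 2), radius=2)
```

The returned offset, in pixels per frame, plays the role of the velocity observation that complements the static detection.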

The proposed variational model is governed by several parameters. Aiming at an algorithm that is dataset-independent and that offers a good trade-off between flexibility and performance, we set the observation covariance matrix \(\mathbf {\Sigma }\) and the state covariance matrix \(\mathbf {\Lambda }_n\) automatically from the detections. More precisely, both matrices are constrained to be diagonal; for \(\mathbf {\Sigma }\), the variances of the horizontal position, of the width, and of the horizontal speed are 1/3, 1/3 and 1/6 of the detected width, respectively. The variances of the vertical quantities are built analogously from the detected height. The rationale behind this choice is that we consider that the true detection lies, more or less, within the width and height of the detected bounding box. Regarding \(\mathbf {\Lambda }_n\), the diagonal entries are 1, 1 and 1/2 of the detected width, and the vertical quantities are defined analogously. Furthermore, in order to eliminate arbitrary false detections, we set \(L=5\) in the birth process. Finally, for sequences in which the size of the bounding boxes is roughly constant, we discarded those detections that were too large or too small.
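The construction of the two diagonal covariance matrices can be sketched as follows. The ordering of the state components (horizontal position, vertical position, width, height, horizontal speed, vertical speed) is an assumption made for this illustration, as are the function names; only the scaling factors come from the text.

```python
def diagonal(entries):
    """Build a square diagonal matrix as a list of lists."""
    n = len(entries)
    return [[entries[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


def covariances_from_detection(width, height):
    """Return (Sigma, Lambda_n) derived from one detected bounding box.

    Assumed component order: x, y, width, height, x-speed, y-speed.
    Horizontal quantities scale with the detected width, vertical ones
    with the detected height.
    """
    sigma = diagonal([width / 3, height / 3,
                      width / 3, height / 3,
                      width / 6, height / 6])
    lam = diagonal([width, height,
                    width, height,
                    width / 2, height / 2])
    return sigma, lam


# Example with a hypothetical 30x60 pixel detection.
sigma, lam = covariances_from_detection(width=30.0, height=60.0)
```

Since both matrices follow the box size, larger detections automatically receive larger variances, without any per-dataset tuning.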

Fig. 3.

Sample results on several sequences of the MOT16 dataset. Red bounding boxes represent the tracking results, and the number inside each box is the track index. (Color figure online)

Examples of the tracking results for all test sequences except MOT16-07 are shown in Fig. 3, while six frames from MOT16-07 are shown in Fig. 4. In all figures, the red boxes represent our tracking results and the numbers within the boxes are the track indices. Generally speaking, the variational model is crucial to properly associate detections with trajectories, while the birth and visibility processes play a role when tracked objects appear and disappear. The sequence in Fig. 4 contains 54 tracks recorded by a moving camera over 500 frames. It is a very challenging tracking task, not only because the density of pedestrians is quite high, but also because significant camera motion makes the person trajectories both rough and discontinuous. One drawback of the proposed approach is that partially consistent false detections can lead to the creation of a false track, and hence to tracking a nonexistent pedestrian. On the positive side, the main advantage of the proposed model is that the probabilistic combination of the dynamic and appearance models can decrease the probability of switching the identities of two tracks.

Fig. 4.

Sample results on the sequence MOT16-07, encoded as in the previous figure. (Color figure online)

Table 1. Evaluation of the proposed multiple-person tracking method with different features on the seven sequences of the MOT16 test dataset.

Table 1 reports the performance of the proposed algorithm, referred to as OVBT (online variational Bayesian tracker), over the seven test sequences of the MOT 2016 challenge. The results obtained with OVBT are available on the MOT 2016 webpage (see footnote 3). One can notice that our method provides high precision (MOTP) but low accuracy (MOTA), meaning that some tracks were missed (mostly due to misdetections). This is consistent with a rather low MT measure. This behavior was more extreme when the visibility process did not include any observation aggregation over time. Indeed, we observed that considering multiple observations within the visibility process leads to better performance (for all sequences and almost all measures).

6 Conclusions

We proposed a variational Bayesian solution to the multiple-target tracking problem. In the literature, other solutions have been proposed for the same problem, based either on sampling, such as MCMC, or on random finite sets, such as the PHD filter. Comparisons with other state-of-the-art methods are available in [24].

The main goal of our study was to benchmark the model on the MOT Challenge 2016. Implementation issues of the tracker were discussed, as well as its strengths and weaknesses regarding the absolute performance on the test sequences and the relative performance compared with other participants in the MOT Challenge.

The presented model is free from magic parameters, since these are automatically derived from the data. Moreover, the proposed temporal aggregation for the visibility process appears to be an excellent complement to the variational Bayes EM algorithm. In the near future, we plan to derive self-paced learning strategies within this variational framework able to automatically assess which detections must be used for tracking and which should not be utilized.