1 Introduction

Visual tracking is one of the most important components of many computer vision applications, such as surveillance, human-computer interaction, medical imaging and robotics [1]. Numerous methods have been presented for robust visual tracking. Despite reasonably good results from these approaches, common challenges remain when tracking objects in complex scenes, e.g., when objects undergo significant pose changes or other severe deformations, possibly accompanied by occlusions or intersections with other objects. To address these problems, researchers have proposed a wide range of appearance models for tracking [2]. Roughly speaking, these models fall into two types: discriminative models [5, 9–13, 18, 20, 21] and generative models [3, 4, 6–8, 14, 15, 19].

Recently, multiple kernel learning (MKL) [22, 27] has been applied to computer vision tasks such as object classification [23, 24] and object detection [25, 26]. MKL methods aim to compute an optimal combination of weighted kernels in the supervised learning paradigm. Rather than using a single kernel, MKL algorithms fuse different features and kernels in an optimal setting, which improves the discriminative power of multiple features.

Motivated by MKL, we propose a novel patch-based tracking method based on two-stage multiple kernel learning. Patch-based methods utilize the local information of the object and can handle partial occlusion and deformation to some extent. However, these trackers may drift because they do not consider the different importance of each patch when occlusion happens. In this work, we combine the patch-based method with MKL and present a patch-based tracking approach with two-stage MKL. In the first stage, each object patch is represented with multiple features. Unlike simple feature combination, we use MKL to obtain the optimal combination of multiple features and kernels, which assigns different weights to the features according to their discriminative power. In the second stage, we apply MKL to make full use of the multiple patches of the target. This stage automatically assigns different weights to the object patches according to their importance, which improves the discriminative power of the object patches as a whole. Within the Bayesian framework, we achieve visual tracking by constructing a classifier, and the candidate with the maximum likelihood is selected as the tracking result. In addition, an effective update method helps the proposed tracker adapt to changes in object appearance.

The rest of this paper is organized as follows. Section 2 briefly reviews related work. Section 3 describes the multiple kernel learning method. The proposed two-stage multiple kernel learning is given in Sect. 4. Section 5 describes our tracking method. Experimental results are shown in Sect. 6, and Sect. 7 concludes this paper.

2 Related Work

General tracking approaches can be categorized into either discriminative or generative models [2]. Discriminative methods regard tracking as a classification problem that aims to best separate the object from the ever-changing background. These methods employ both foreground and background information. Avidan [18] proposes an ensemble tracker that treats tracking as a pixel-based binary classification problem. This method can distinguish the target from the background; however, the pixel-based representation requires more computational resources, which limits its performance. In [10], Grabner et al. present an online boosting tracker to update discriminative features, and in [20] a semi-online method is proposed to handle the drifting problem. Kalal et al. [13] introduce a P-N learning algorithm to learn effective features from positive and negative samples for object tracking. This method is nevertheless prone to drift when the object appearance varies. Fan et al. [15] suggest a weighted P-N learning algorithm and combine it with a part-based framework for visual tracking, which improves the robustness of the tracker in the presence of occlusion. Babenko et al. [9] use multiple instance learning (MIL) for visual tracking, which can alleviate drift to some extent. However, the MIL tracker may select a less important positive sample because it does not consider sample importance in its learning process. In [21], Zhang et al. propose online weighted multiple instance learning (WMIL), which assigns weights to different samples when training the classifier. In [12], Zhang et al. propose a compressive tracker with an appearance model based on features extracted in the compressed domain. This tracker easily drifts or even fails because it lacks an effective updating strategy in the presence of appearance variations.

In contrast, generative models formulate the tracking problem as a search for the regions most similar to the object. These methods are based on either subspace models or templates. To handle appearance variations caused by illumination or deformation, the appearance model is updated dynamically. In [3], the incremental visual tracking method provides an online approach for efficiently learning and updating a low-dimensional PCA subspace representation of the object. However, this PCA subspace representation is sensitive to partial occlusion. Adam et al. [4] present a fragment-based template model for visual tracking. This method estimates the target from the voting map of each part, obtained by comparing its histogram with the templates. Nevertheless, a static template that assigns equal importance to each fragment lowers the performance of the tracker. Mei et al. [6] apply sparse representation to visual tracking, which can resist occlusion to some degree. However, this method is prone to drift because it does not have any update strategy. Jia et al. [8] propose a local structural sparse appearance model for object tracking. This method adopts an online update mechanism to help the tracker adapt to appearance changes. Kwon et al. [14] decompose the appearance model into multiple basic observation models to cover a wide range of illumination and deformation.

Recently, MKL has been widely used in image classification, object detection and recognition. In [23], Yang et al. present a group-sensitive MKL for object categorization. Jawanpuria et al. [24] utilize MKL for non-linear feature selection and apply it to classification. Vedaldi et al. [25] propose a novel three-stage classifier with MKL, which combines linear, quasi-linear, and non-linear kernel SVMs. Zhang et al. [26] propose an E2LSH-based clustering algorithm that combines the advantages of nonlinear multiple kernel combination methods, and use it for object detection.

The work most closely related to ours is [28], in which a multiple kernel boosting method with affinity constraints is proposed. This method boosts the multiple kernel learning process, thereby facilitating robust visual tracking in complex scenes effectively and efficiently. However, their method does not adopt a patch-based representation and hence may be sensitive to partial occlusion. In our work, we segment the object into multiple patches and combine them with a two-stage MKL method. Consequently, the proposed tracker is more adaptive to appearance variations.

3 Multiple Kernel Learning

The support vector machine (SVM) has been successfully applied to numerous classification and regression tasks. One of the most important problems in these tasks is choosing an appropriate data representation. In SVM-based approaches, the data representation is implicitly selected by the kernel function \(K(x,x_{i})\), where \(K(\cdot ,\cdot )\) is a function associated with a reproducing kernel Hilbert space [28]. Nevertheless, in some cases it is difficult to select a good kernel function for the training set of a single SVM classifier. To address this issue, the MKL algorithm was proposed. MKL is an extension of the kernel learning method. By using different types of kernels to represent different properties of the samples (e.g., feature and metric), MKL provides a unified framework for model combination and selection. One of the most popular multiple kernel learning methods is SimpleMKL (SMKL) [29], in which the kernel function is defined as a convex linear combination of kernels

$$\begin{aligned} K(x,x_{i}) = \sum \limits _{m=1}^{M}{\beta _{m}K_{m}(x,x_{i})}, \; \sum \limits _{m=1}^{M}{\beta _{m}} = 1,\beta _{m}\ge {0} \end{aligned}$$
(1)

where \(K_{m}(x,x_{i})\) denotes the \(m^{th}\) kernel and \(\beta _{m}\) is the corresponding weight. SMKL aims to simultaneously obtain the support vectors, support vector coefficients and kernel weights by solving the following constrained optimization problem

$$\begin{aligned} \mathop {\min }\limits _{\beta }{J(\beta )} \;\;\;\; s.t. \;\; \sum \limits _{m=1}^{M}{\beta _{m}} = 1,\beta _{m}\ge {0} \end{aligned}$$
(2)

where

$$\begin{aligned} \begin{aligned} J(\beta ) = \mathop {\min }\limits _{\lbrace {f}\rbrace ,b,\xi }{\frac{1}{2}\sum \limits _{m}{\frac{1}{\beta _{m}}||{f_{m}}||_{\mathcal {H}_{m}}^{2}}} + C\sum \limits _{i}{\xi _{i}} \\ s.t. \;\; y_{i}\sum \limits _{m}{f_{m}(x_{i})} + y_{i}b\ge {1-\xi _{i}}, \xi _{i}\ge {0}, \forall {i} \end{aligned} \end{aligned}$$
(3)

where \(x_{i}\) denotes the \(i^{th}\) training sample, \(y_{i}\) is the class label of the \(i^{th}\) sample, \(\xi _{i}\) and C represent its slack variable and the penalty factor for the slack variables respectively, \(\mathcal {H}_{m}\) denotes the reproducing kernel Hilbert space (RKHS), and each function \(f_{m}\) belongs to a different RKHS \(\mathcal {H}_{m}\) associated with a kernel \(K_{m}\). Formulation (3) can be solved by the reduced gradient method [29], which differentiates the dual function of Eq. (3) with respect to \(\beta _{m}\)

$$\begin{aligned} \frac{\partial J}{\partial \beta _{m}}=-\frac{1}{2}\sum \limits _{i,j}{\alpha _{i}\alpha _{j}y_{i}y_{j}K_{m}(x_{i},x_{j})}, \forall {m} \end{aligned}$$
(4)

where \(\alpha _{i}\) represents the dual coefficient of \(x_{i}\). Then the decision function for binary classification is defined as

$$\begin{aligned} F(x)=\sum \limits _{i}{\alpha _{i}y_{i}}\sum \limits _{m}{\beta _{m}K_{m}(x,x_{i})}+b \end{aligned}$$
(5)
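
As a rough illustration of Eqs. (1), (4) and (5), the following Python sketch evaluates a convex combination of two kernels, the resulting decision function, and the gradient with respect to one kernel weight. It is not a SimpleMKL solver: the dual coefficients, kernel weights, the choice of linear and RBF kernels, and all function names are toy assumptions made only for this example.

```python
import math


def linear_kernel(x, z):
    """Linear kernel: inner product of two vectors."""
    return sum(a * b for a, b in zip(x, z))


def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel; gamma is an illustrative bandwidth, not a value from the paper."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def combined_kernel(x, z, kernels, beta):
    """Eq. (1): convex combination of base kernels with weights beta."""
    if any(b < 0 for b in beta) or abs(sum(beta) - 1.0) > 1e-9:
        raise ValueError("kernel weights must be non-negative and sum to 1")
    return sum(b * k(x, z) for b, k in zip(beta, kernels))


def decision_function(x, support, alpha, labels, kernels, beta, bias):
    """Eq. (5): F(x) = sum_i alpha_i y_i sum_m beta_m K_m(x, x_i) + b."""
    return bias + sum(
        a * y * combined_kernel(x, xi, kernels, beta)
        for a, y, xi in zip(alpha, labels, support)
    )


def kernel_weight_gradient(support, alpha, labels, kernel):
    """Eq. (4): dJ/dbeta_m = -1/2 sum_{i,j} alpha_i alpha_j y_i y_j K_m(x_i, x_j)."""
    total = 0.0
    for ai, yi, xi in zip(alpha, labels, support):
        for aj, yj, xj in zip(alpha, labels, support):
            total += ai * aj * yi * yj * kernel(xi, xj)
    return -0.5 * total


# Toy values (assumed, not learned): two training samples with opposite labels.
KERNELS = [linear_kernel, rbf_kernel]
support = [[1.0, 0.0], [-1.0, 0.0]]
labels = [1, -1]
alpha = [0.5, 0.5]
beta = [0.5, 0.5]

score = decision_function([2.0, 0.0], support, alpha, labels, KERNELS, beta, bias=0.0)
grad_linear = kernel_weight_gradient(support, alpha, labels, linear_kernel)
```

In a reduced gradient scheme, such gradients would be used to move the weights \(\beta _{m}\) while keeping them on the simplex; that projection step is omitted here.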

4 Patch-Based Two-Stage MKL

4.1 Object Segmentation

In this paper, we use multiple patches to represent the target, which exploits the local information of the object and can handle partial occlusion and deformation to some extent. Different from [4], we adopt an overlapping sliding-window segmentation strategy, as shown in Fig. 1. After segmentation, we obtain a patch set \(\mathcal {P}=\lbrace {p_{1},p_{2},\cdots ,p_{P}}\rbrace \), where \(p_{i}\) is the \(i^{th}\) patch and P is the number of patches.

Fig. 1. Illustration of the overlapping sliding-window segmentation. Image (a) is the object, image (b) shows the segmentation method and image (c) is the set of object patches.
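
The overlapping sliding-window segmentation can be sketched as follows. The patch size and stride are not specified in the text, so the values used here (16 × 16 patches with a stride of 8 on a 32 × 32 object) and the function name are purely illustrative assumptions.

```python
def sliding_window_patches(width, height, patch_w, patch_h, stride_x, stride_y):
    """Return (x, y, w, h) boxes of overlapping patches inside a width x height region.

    Patches overlap whenever the stride is smaller than the patch size.
    """
    if stride_x <= 0 or stride_y <= 0:
        raise ValueError("strides must be positive")
    if patch_w > width or patch_h > height:
        raise ValueError("patch must fit inside the object region")
    boxes = []
    for y in range(0, height - patch_h + 1, stride_y):
        for x in range(0, width - patch_w + 1, stride_x):
            boxes.append((x, y, patch_w, patch_h))
    return boxes


# Assumed sizes: a 32x32 object, 16x16 patches, stride 8 (50% overlap).
patches = sliding_window_patches(32, 32, 16, 16, 8, 8)
```

With these assumed values the window stops at offsets 0, 8 and 16 along each axis, giving P = 9 patches.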

4.2 First-Stage Multiple Kernel Learning

In the first stage, we use multiple features (e.g., HSI histogram, HoG [16] and LBP [17] descriptors) to represent each object patch and apply MKL to find the optimal combination of these features. The \(i^{th}\) patch can be represented by the feature set \(\lbrace {f_{i,1},f_{i,2},\cdots ,f_{i,D}}\rbrace \), where \(f_{i,j}\) denotes the \(j^{th}(j=1,2,\cdots ,D)\) feature and D is the number of features. Our goal is to find a strategy to integrate these multiple features to maximize the overall discriminative power. MKL has shown its potential for integrating multiple features in recent research. Therefore, for the \(i^{th}\) patch, the output margin of the first-stage MKL classifier can be written as

$$\begin{aligned} F_{i}^{'}(f_{i})=\sum \limits _{l=1}^{L}{\alpha _{l}^{'}y_{l}^{'}}\sum \limits _{d=1}^{D}{\gamma _{i,d}K_{d}(f_{i},f_{i,l})}+b_{i}^{'} \end{aligned}$$
(6)

where \(F_{i}^{'}(\cdot )\) denotes the classification function for the \(i^{th}\) patch, \(K_{d}(\cdot ,\cdot )\) represents the \(d^{th}\) kernel for the \(d^{th}\) feature, L is the number of training samples, D stands for the number of features and \(\gamma _{i,d}\) weights the discriminative power of the \(d^{th}\) feature. Note that in the first stage, MKL is only used to obtain the weight of each feature. Figure 2 gives a simple illustration of how we use MKL to obtain the weights of the multiple features for each patch.

Fig. 2. We first collect the training samples for the \(i^{th}\) patch in (a), and extract D features for it in (b). The MKL in (c) is then used to obtain the weights of the multiple features for patch i, as shown in (d).

With the weights of the different features, we can obtain the optimal combination of multiple features for each patch. For the \(i^{th}\) patch, we define \(\mathcal {F}_{i}\) as its combined feature

$$\begin{aligned} \mathcal {F}_{i} = [\gamma _{i,1}f_{i,1},\gamma _{i,2}f_{i,2},\cdots ,\gamma _{i,D}f_{i,D}],\; i=1,2,\cdots ,P \end{aligned}$$
(7)

4.3 Second-Stage Multiple Kernel Learning

In the second stage, we apply MKL to assign different weights to the object patches according to their importance. In Sect. 4.1, the target is represented by a patch set \(\mathcal {P}=\lbrace {p_{1},p_{2},\cdots ,p_{P}}\rbrace \) in which \(p_{i}\) denotes the \(i^{th}(i=1,2,\cdots ,P)\) patch, associated with a combined feature \(\mathcal {F}_{i}\). Our goal is to use MKL to find an optimal combination of the patches, in which the coefficient of each patch stands for the corresponding weight. Therefore, for the target, the output margin of the MKL classifier can be written as follows

$$\begin{aligned} F^{*}(\mathcal {F})=\sum \limits _{q=1}^{N}{\alpha _{q}^{*}y_{q}^{*}}\sum \limits _{i=1}^{P}{\delta _{i}K_{i}(\mathcal {F},\mathcal {F}_{q})}+b^{*} \end{aligned}$$
(8)

where \(F^{*}(\cdot )\) denotes the decision function, \(K_{i}(\cdot ,\cdot )\) represents the \(i^{th}\) kernel for the \(i^{th}\) patch, N is the number of training samples, P stands for the number of patches and \(\delta _{i}\) weights the discriminative power of the \(i^{th}\) patch. The process of weighting patches is shown in Fig. 3.

Fig. 3. To start with, we compute the combined features in (b) for all the training patches in (a). Then MKL in (c) is used to obtain the weight of each patch, as shown in (d).

After obtaining the weight of each patch, we can represent the object with a feature vector as follows

$$\begin{aligned} H=[\delta _{1}\mathcal {F}_{1},\delta _{2}\mathcal {F}_{2},\cdots ,\delta _{P}\mathcal {F}_{P}] \end{aligned}$$
(9)

where H denotes the feature of the target, \(\delta _{i}\) and \(\mathcal {F}_{i}\) are the weight and combined feature for the \(i^{th}\) patch.
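
A minimal sketch of how Eqs. (7) and (9) compose is given below. The feature values and the weights \(\gamma _{i,d}\) and \(\delta _{i}\) are toy numbers chosen for readability (in the method they come from the two MKL stages), and the function names are hypothetical.

```python
def combine_features(features, gamma):
    """Eq. (7): concatenate the D feature vectors of one patch, each scaled by its weight."""
    if len(features) != len(gamma):
        raise ValueError("one weight per feature is required")
    combined = []
    for weight, feature in zip(gamma, features):
        combined.extend(weight * value for value in feature)
    return combined


def object_feature(patch_features, gammas, delta):
    """Eq. (9): concatenate the combined patch features, each scaled by its patch weight."""
    if not (len(patch_features) == len(gammas) == len(delta)):
        raise ValueError("one weight set per patch is required")
    H = []
    for patch_weight, features, gamma in zip(delta, patch_features, gammas):
        F_i = combine_features(features, gamma)
        H.extend(patch_weight * value for value in F_i)
    return H


# Toy input: P = 2 patches, D = 2 features per patch (all values assumed).
patch_features = [
    [[1.0, 2.0], [4.0]],   # patch 1: feature 1 has two bins, feature 2 has one
    [[0.0, 2.0], [8.0]],   # patch 2
]
gammas = [[0.5, 0.5], [0.25, 0.75]]  # feature weights per patch (first stage)
delta = [0.5, 0.5]                   # patch weights (second stage)

H = object_feature(patch_features, gammas, delta)
```

Because both stages only rescale and concatenate, the length of H is the sum of the feature lengths over all patches.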

4.4 Classifier

In this section, a classifier is constructed to discriminate the object from the background. In the initial frame, we randomly sample bounding boxes around the tracked target as positive samples and far away from the target as negative samples. By controlling the distance from the tracked object, the negative samples contain pure background, so that they differ from the target as much as possible. We use sets \(S^{+}=\lbrace {s_{1}^{+},s_{2}^{+},\cdots ,s_{N^{+}}^{+}}\rbrace \) and \(S^{-}=\lbrace {s_{1}^{-},s_{2}^{-},\cdots ,s_{N^{-}}^{-}}\rbrace \) to denote the positive samples and the negative samples, where \(N^{+}\) and \(N^{-}\) are the numbers of positive and negative samples. Each sample can be represented by a feature vector via Eq. (9) through the two-stage MKL. Therefore, we use sets \(\mathcal {H}^{+}=\lbrace {H_{1}^{+},H_{2}^{+},\cdots ,H_{N^{+}}^{+}}\rbrace \) and \(\mathcal {H}^{-}=\lbrace {H_{1}^{-},H_{2}^{-},\cdots ,H_{N^{-}}^{-}}\rbrace \) to represent the features of positive and negative samples. With these features, we build an SVM classifier G using LIBSVM [30]. For a new sample s associated with the feature \(H_{s}\), its classification error is denoted by \(G(H_{s})\). The smaller the classification error is, the more likely the sample belongs to the object.
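
The sampling of positive and negative boxes by distance from the target can be sketched as follows. The radii, the sample counts and the helper name are assumptions made for illustration; the text does not state the actual distances used.

```python
import math
import random


def sample_boxes(center, size, n, r_min, r_max, rng):
    """Draw n boxes of a fixed size whose centers lie at a distance in [r_min, r_max]
    from the given center. Returns (x, y, w, h) tuples with (x, y) the top-left corner.
    """
    cx, cy = center
    w, h = size
    boxes = []
    for _ in range(n):
        radius = rng.uniform(r_min, r_max)
        angle = rng.uniform(0.0, 2.0 * math.pi)
        x = cx + radius * math.cos(angle)
        y = cy + radius * math.sin(angle)
        boxes.append((x - w / 2.0, y - h / 2.0, w, h))
    return boxes


rng = random.Random(0)          # fixed seed so the sketch is repeatable
target_center = (100.0, 80.0)   # assumed target location in pixels
target_size = (40, 60)          # assumed target width and height

# Positives stay close to the target; negatives are pushed away from it.
positives = sample_boxes(target_center, target_size, 10, 0.0, 4.0, rng)
negatives = sample_boxes(target_center, target_size, 10, 30.0, 60.0, rng)
```

Increasing the inner radius of the negative ring trades fewer ambiguous samples against less coverage of the background near the target.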

5 The Proposed Tracking Method

5.1 Tracking Formulation

Our tracker is implemented within the Bayesian framework. Given the observation set of the target \(Y^{t}=\lbrace {y_{1},y_{2},\cdots ,y_{t}}\rbrace \) up to frame t, we obtain the estimate \(\widehat{X}_{t}\) by computing the maximum a posteriori estimate via

$$\begin{aligned} \widehat{X}_{t} = \mathop {\arg \max }\limits _{X_{t}^{i}}{p(X_{t}^{i}\vert {Y^{t}})} \end{aligned}$$
(10)

where \(X_{t}^{i}\) denotes the \(i^{th}\) sample of the state \(X_{t}\). The posterior probability \(p(X_{t}^{i}\vert {Y^{t}})\) can be obtained recursively by Bayes' theorem via

$$\begin{aligned} p(X_{t}\vert {Y^{t}})\varpropto {p(y_{t}\vert {X_{t}})\int {p(X_{t}\vert {X_{t-1}})p(X_{t-1}\vert {Y^{t-1}})}d{X_{t-1}}} \end{aligned}$$
(11)

where \(p(X_{t}\vert {X_{t-1}})\) and \(p(y_{t}\vert {X_{t}})\) represent the dynamic model and the observation model respectively.

The dynamic model indicates the temporal correlation of the target state between consecutive frames. We apply an affine transformation to model the target motion between two consecutive frames within the particle filter framework. The state transition can be formulated as

$$\begin{aligned} p(X_{t}\vert {X_{t-1}})=\mathcal {N}(X_{t};X_{t-1},\varPsi ) \end{aligned}$$
(12)

where \(\varPsi \) is a diagonal covariance matrix whose elements are the variances of the affine parameters. The observation model \(p(y_{t}\vert {X_{t}})\) represents the probability of the observation \(y_{t}\) given the state \(X_{t}\). In this paper, the observation model is defined as

$$\begin{aligned} p(y_{t}\vert {X_{t}})\varpropto {1-G(X_{t})} \end{aligned}$$
(13)

where \(G(X_{t})\) is the classification error of the candidate at state \(X_{t}\). Through the Bayesian framework, we select the candidate sample with the smallest classification error as the tracking result.
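
One tracking step under Eqs. (12) and (13) can be sketched as below. For brevity the state is reduced to a 2-D translation rather than the full set of affine parameters, and `toy_error` is a hypothetical stand-in for the classifier error G; both are assumptions of this example only.

```python
import random


def propagate(state, sigmas, rng):
    """Eq. (12): draw X_t ~ N(X_{t-1}, Psi) with diagonal Psi = diag(sigma_k^2)."""
    return [rng.gauss(value, sigma) for value, sigma in zip(state, sigmas)]


def track_step(prev_state, sigmas, classification_error, n_particles, rng):
    """Sample candidates around the previous state and keep the most likely one."""
    candidates = [propagate(prev_state, sigmas, rng) for _ in range(n_particles)]
    # Eq. (13): the likelihood is proportional to 1 - G(X_t).
    likelihoods = [1.0 - classification_error(c) for c in candidates]
    best = max(range(n_particles), key=likelihoods.__getitem__)
    return candidates[best], likelihoods[best]


def toy_error(state):
    """Hypothetical stand-in for G: error grows with the L1 distance to (12, 7),
    clipped to [0, 1]. A real tracker would query the trained classifier here."""
    distance = abs(state[0] - 12.0) + abs(state[1] - 7.0)
    return min(1.0, distance / 20.0)


rng = random.Random(1)
state, likelihood = track_step([10.0, 5.0], [3.0, 3.0], toy_error, 300, rng)
```

Since every candidate is drawn from the same previous estimate, choosing the largest likelihood coincides with choosing the smallest classification error.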

5.2 Online Update

Because the target appearance varies, updating is essential. In this paper, an effective mechanism is proposed to update the classifier G. To start with, we maintain a set \(\varPhi \). In each frame, after locating the target, we randomly sample bounding boxes around the tracked target as positive samples and far away from the target as negative samples. These samples are collected as a group and added to the set \(\varPhi \). When the set size v reaches a threshold V, we use the set \(\varPhi \) to update the weights (both the feature weights and the patch weights). Then we extract the features of each sample in \(\varPhi \), retrain the classifier G on them, and finally empty \(\varPhi \). However, if the tracking result determined by our tracker has a classification error greater than a threshold E, it may contain significant noise and is therefore unreliable. In this case, we skip the frame to avoid introducing noise into \(\varPhi \).
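
The bookkeeping of the set \(\varPhi \) can be sketched as a small buffer. The class and method names are hypothetical, the per-frame sample groups are replaced by placeholder tuples, and the thresholds V = 5 and E = 0.5 are example values; retraining itself is left out.

```python
class UpdateBuffer:
    """Accumulates per-frame sample groups (the set Phi) until a retraining is due."""

    def __init__(self, capacity, error_threshold):
        self.capacity = capacity                # threshold V on the set size
        self.error_threshold = error_threshold  # threshold E on the classification error
        self.groups = []                        # current content of Phi

    def add_frame(self, samples, tracking_error):
        """Add one frame's samples. Returns the accumulated groups when the buffer
        is full (and empties it), otherwise returns None."""
        if tracking_error > self.error_threshold:
            return None  # unreliable tracking result: skip this frame
        self.groups.append(samples)
        if len(self.groups) < self.capacity:
            return None
        ready, self.groups = self.groups, []
        return ready


buffer = UpdateBuffer(capacity=5, error_threshold=0.5)
errors = [0.1, 0.7, 0.2, 0.3, 0.2, 0.1, 0.9]  # assumed per-frame classification errors
batches = [buffer.add_frame(("frame", i), e) for i, e in enumerate(errors)]
```

In this toy run frames 1 and 6 are skipped because their error exceeds E, and the fifth accepted frame (index 5) triggers the update with the five accumulated groups.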

So far, we have introduced the overall procedure of the proposed tracking algorithm as shown in Algorithm 1.

Algorithm 1. Overall procedure of the proposed tracking algorithm.

6 Experiments

To evaluate the performance of our tracking algorithm, we test our method on nine challenging image sequences and compare it with eight state-of-the-art trackers: Frag tracking [4], TLD tracking [13], \(\ell _{1}\) tracking [6], IVT tracking [3], MIL tracking [9], CT tracking [12], OAB tracking [10] and SPT tracking [5]. Some representative results are displayed in this section.

The proposed algorithm is implemented in MATLAB and runs at 1.6 frames per second on a 3.2 GHz Intel E3-1225 v3 Core PC with 8 GB memory. We use three features (HSI histogram, HoG and LBP descriptors) and four types of kernels (linear, polynomial, RBF and sigmoid) to represent the target. The parameters of the proposed tracker are fixed in all experiments. The number of particles in the Bayesian framework is set between 300 and 500. The number of training frames N is 4, and the size of the set \(\varPhi \) is set to 5. The classification error threshold E is set between 0.4 and 0.6.

6.1 Quantitative Comparison

We evaluate the above-mentioned trackers using the center location error and the overlapping rate [31]; the comparison results are shown in Tables 1 and 2. Figure 4 shows the center location error of the trackers on the nine test sequences. Overall, the tracker proposed in this paper outperforms the compared algorithms.
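
For reference, the two metrics can be computed per frame as sketched below. The overlapping rate is assumed here to be the common intersection-over-union ratio of the tracked and ground-truth boxes; the box format (x, y, w, h) and the function names are choices made for this example.

```python
import math


def center_location_error(box_a, box_b):
    """Euclidean distance in pixels between the centers of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return math.hypot((ax + aw / 2.0) - (bx + bw / 2.0),
                      (ay + ah / 2.0) - (by + bh / 2.0))


def overlap_rate(box_a, box_b):
    """Intersection area divided by union area of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


tracked = (10.0, 10.0, 20.0, 20.0)       # assumed tracker output
ground_truth = (20.0, 10.0, 20.0, 20.0)  # assumed annotation

cle = center_location_error(tracked, ground_truth)
iou = overlap_rate(tracked, ground_truth)
```

For these two boxes the centers are 10 pixels apart and the boxes share half of their width, so the overlap is 200 / 600 = 1/3.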

Table 1. Center location errors (in pixels). The best and second-best results are marked with distinct fonts.
Table 2. Overlapping rate. The best and second-best results are marked with distinct fonts.
Fig. 4. Quantitative evaluation in terms of center location error (in pixels).

Fig. 5. Screenshots of some sample tracking results.

6.2 Qualitative Comparison

Motion Blur: Fig. 5 shows experimental results on two challenging sequences (Deer and Lemming). Because the target undergoes fast and abrupt motion, the images are prone to blur, which causes drift. The proposed approach performs better than the other algorithms on these sequences. When motion blur happens, our tracker can still effectively represent the target appearance. In addition, our updating mechanism can resist motion blur to some degree. Hence our tracker is not undermined by the abrupt movement.

Deformation: Deformation is a challenge for a tracker because the template features change completely when deformation occurs. As shown in Fig. 5, MIL, CT, IVT, OAB, TLD and \(\ell _{1}\) do not perform well on the sequences Bolt and Jogging. In contrast, Frag and SPT achieve relatively better tracking results on these sequences, because part-based trackers are less sensitive to structural variation than holistic appearance models. However, their lack of an effective updating strategy still easily causes drift or even failure. Our tracker has an obvious advantage in handling structural deformation, even at high frequency: some local patches of the target remain the same in the presence of the deformation, and with the help of the effective updating mechanism, our tracking method robustly adapts to the deformation.

Background Clutter: The sequences Cup and Basketball in Fig. 5 are challenging because the background is cluttered and the target undergoes scale variation. Our tracker performs well on these sequences, as the target can be differentiated from the cluttered background using our two-stage MKL method. In addition, the updating scheme is robust to the complex background.

7 Conclusion

In this paper, a novel patch-based tracking algorithm is proposed using two-stage multiple kernel learning. Our method automatically assigns different weights to the object patches according to their importance, which improves the discriminative power of the object patches as a whole. Experiments on challenging image sequences demonstrate that our method performs favorably against several state-of-the-art methods.