1 Introduction

Object tracking plays an important role in computer vision and has been widely used in surveillance, intelligent transportation control, medical imaging, military simulation [1], and other fields. Even though tracking problems have been studied for decades and reasonably good results have been achieved, it is still challenging to track general objects accurately in dynamic environments due to various factors, including noise, occlusion, background clutter, illumination changes, fast motion, and variations in pose and scale. At present, most state-of-the-art object tracking methods can be categorized into two types: generative models and discriminative models [2].

Generative methods take the candidate with the best compatibility with the appearance model as the tracked object. For example, Ross et al. proposed an incremental subspace model to adapt to object appearance variation [3]. Wang et al. put forward a multi-feature fusion object model guided by color features, in which accurate tracking is achieved through the principle of spatial consistency [4]. Wang et al. also proposed a probability continuous outlier model to cope with partial occlusion via a holistic object template [5]. Discriminative methods instead address object tracking as a binary classification separating the object from the background. For example, Kalal et al. first utilized structured unlabeled data with an online semi-supervised learning method [6], and subsequently proposed tracking-learning-detection (TLD) for object tracking in long sequences [7]. Babenko et al. formulated object tracking as online multiple instance learning [8]. Generally speaking, generative methods capture more accurate object characteristics but have high computational complexity. Discriminative methods can obtain better tracking accuracy but have to process a large number of training samples: the object needs to be retrained whenever its appearance changes, and inadequate training samples easily cause tracking failure. In addition, some methods combine both generative and discriminative models [9,10,11] to obtain more desirable results.

Recently, sparse representation based methods [9,10,11,12,13] have shown promising results in various tests. The object is represented as a linear combination of a few templates, which helps remove the influence of partial occlusion, illumination changes, and other factors. However, this kind of method relies on solving an \( \ell_{1} \) minimization with a large computational load, and the sparse code must be obtained through complex optimization. Therefore, in this paper a multi-modality dictionary is built to simplify the sparse coding, and we then follow the idea of combining generative and discriminative models to achieve object tracking.

The remainder of this paper is organized as follows. In Sect. 2, the particle filter and the object representation related to our work are reviewed. Section 3 introduces the details of the proposed tracking method. Experimental results and analysis are presented in Sect. 4, and we conclude the paper in Sect. 5.

2 Preliminary

2.1 Particle Filter

The particle filter serves as the tracking framework in this paper: the object state in the next frame is estimated from the observation probability of particles at the current frame [14]. Suppose \( Y_{t} = [y_{1}, \ldots, y_{t}] \) are the observed images at frames 1 to \( t \), and \( x_{t} \) is the state variable describing the object motion parameters at frame \( t \); it follows the probability distribution:

$$ p(x_{t} \mid Y_{t}) \propto p(y_{t} \mid x_{t}) \int p(x_{t} \mid x_{t-1}) \, p(x_{t-1} \mid Y_{t-1}) \, dx_{t-1} $$
(1)

where \( p(x_{t} \mid x_{t-1}) \) is the state transition distribution and \( p(y_{t} \mid x_{t}) \) estimates the likelihood of observing \( y_{t} \) at state \( x_{t} \). Particles are sampled from a Gaussian distribution centered at the position of the previous tracking result. Since the number of particles affects tracking efficiency, irrelevant particles need to be filtered out to reduce redundancy.
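As an illustration, the Gaussian propagation step can be sketched as below; the state parameterization \( (c_x, c_y, \text{scale}) \) and the noise scales are our assumptions, not values taken from the paper.

```python
import numpy as np

def sample_particles(prev_state, n_particles=300, sigma=(4.0, 4.0, 0.01)):
    """Propagate particles by a Gaussian random walk around the previous
    tracking result, i.e. draw from the transition p(x_t | x_{t-1}).

    prev_state: (cx, cy, scale) of the previous result -- the exact motion
    parameterization is an assumption made for this sketch.
    """
    prev = np.asarray(prev_state, dtype=float)
    noise = np.random.randn(n_particles, prev.size) * np.asarray(sigma)
    return prev + noise  # one row per sampled particle state
```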

2.2 Object Representation

Liao et al. proposed the Local Maximal Occurrence (LOMO) feature [15] to deal with the inconsistent appearance of a target across different cameras; it is an effective handcrafted feature that has proven competitive with deep-learning features in recent years. LOMO analyzes the horizontal occurrence of local features and maximizes the occurrence to obtain a representation that is stable against viewpoint changes. Specifically, the Retinex algorithm is first applied to produce a color image consistent with human observation of the scene; then an HSV color histogram is used to extract color features from the Retinex image; finally, the Scale Invariant Local Ternary Pattern (SILTP) descriptor [16] is applied to achieve invariance to intensity scale changes and robustness to image noise.
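To make the "maximal occurrence" idea concrete, here is a much-simplified sketch: HSV histograms over overlapping patches, max-pooled along each horizontal strip. It omits the Retinex preprocessing, the SILTP channel [16], the multi-scale pyramid, and the log transform of the full descriptor, so it is an illustration rather than a faithful LOMO implementation.

```python
import numpy as np
from matplotlib.colors import rgb_to_hsv

def lomo_like_feature(img_rgb, patch=10, stride=5, bins=8):
    """Simplified LOMO-style descriptor: joint HSV histograms over
    overlapping local patches, max-pooled within each horizontal strip."""
    hsv = rgb_to_hsv(img_rgb.astype(float) / 255.0)   # H x W x 3, values in [0, 1]
    H, W, _ = hsv.shape
    feat = []
    for top in range(0, H - patch + 1, stride):        # one strip per row band
        strip_hists = []
        for left in range(0, W - patch + 1, stride):
            p = hsv[top:top + patch, left:left + patch].reshape(-1, 3)
            hist, _ = np.histogramdd(p, bins=(bins,) * 3, range=[(0, 1)] * 3)
            strip_hists.append(hist.ravel())
        # maximal occurrence: elementwise max over all patches at the same
        # height, which gives stability against horizontal viewpoint changes
        feat.append(np.max(strip_hists, axis=0))
    return np.concatenate(feat)
```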

Since the challenging problems in object tracking and person re-identification are essentially the same, and the validity of the LOMO feature has been verified, we adopt the LOMO feature as the object feature in this paper.

3 Proposed Method

In this paper, object tracking is regarded as a dictionary learning problem. By properly constructing a multi-modality dictionary that describes the object precisely, the complex optimization can be simplified. The proposed tracking method is presented in Algorithm 1.


3.1 Multi-modality Dictionary Building and Updating

In general, sparse representation based tracking methods use an over-complete dictionary to encode the object. Sparse code learning involves two problems: sparse coding, i.e., computing the coefficients that represent the object with the learned dictionary, and dictionary learning, i.e., constructing the dictionary itself [17]. Under the sparsity assumption, a candidate object \( x_{i} \) can be represented as a linear combination of the templates in dictionary \( D \) with a sparse coefficient vector \( \alpha_{i} \). The sparse code \( \alpha_{i} \in {\mathbb{R}}^{n + m} \) corresponding to \( x_{i} \) is calculated by

$$ \min_{\alpha_{i}} \frac{1}{2} \left\| x_{i} - D\alpha_{i} \right\|_{2}^{2} + \lambda \left\| \alpha_{i} \right\|_{1} $$
(2)

where the over-complete dictionary \( D = \left[ D_{p}, D_{n} \right] \in {\mathbb{R}}^{d \times (n + m)} \) is formed by the foreground dictionary \( D_{p} \in {\mathbb{R}}^{d \times n} \) and the background dictionary \( D_{n} \in {\mathbb{R}}^{d \times m} \). The above problem is actually Lasso regression [18], which can be solved by LARS [19] to obtain the sparse code \( \alpha_{i} \) of \( x_{i} \).
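For reference, Eq. (2) can be solved with the LARS-based Lasso solver in scikit-learn. The sketch below is one possible implementation; note that scikit-learn scales the data-fit term by the number of rows, so the regularization constant must be rescaled accordingly.

```python
import numpy as np
from sklearn.linear_model import LassoLars

def sparse_code(x_i, D, lam=0.01):
    """Solve Eq. (2) via LARS [19] using scikit-learn's LassoLars.

    scikit-learn minimizes (1 / (2 * d)) * ||x - D a||_2^2 + alpha * ||a||_1,
    so its alpha corresponds to lambda / d for a d-dimensional observation.
    """
    d = D.shape[0]
    model = LassoLars(alpha=lam / d, fit_intercept=False)
    model.fit(D, x_i)      # D: d x (n + m) dictionary, x_i: d-vector
    return model.coef_     # sparse code alpha_i in R^(n + m)
```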

However, the method above not only requires a large number of templates to build the over-complete dictionary, but also makes the tracking process more complicated. In fact, for an object in the ideal state without severe external influences, a small number of dictionary templates can distinguish the object from the background well, while for an object whose appearance has changed, more dictionary templates will introduce many errors. So if a dictionary suitable for the current object can be obtained in real time, there is no need to build an over-complete dictionary and go through a complex optimization process.

In this paper, the object dictionary is formed by short-term templates \( D_{s} \) and long-term templates \( D_{l} \); the templates of the background dictionary \( D_{n} \) are selected randomly from the non-object area of the video image. The multi-modality dictionary \( D \) is therefore built from these two parts, that is \( D = \left[ D^{S}, D^{L} \right] \), where \( D^{S} = \left[ D_{s}, D_{n} \right] \in {\mathbb{R}}^{d \times (m_{s} + n_{s})} \), \( D^{L} = \left[ D_{l}, D_{n} \right] \in {\mathbb{R}}^{d \times (m_{l} + n_{l})} \), and each dictionary template is represented by its observed pixel values. More specifically, \( D_{s} \) is initialized with transformed object templates from the first frame, i.e., the current object shifted by 1–2 pixels along four directions (up, down, left, and right). \( D_{l} \) is initialized with the clustering center of \( D_{s} \), that is, \( D_{l} = \frac{1}{l_{s}} \sum\limits_{i = 1}^{l_{s}} D_{s}^{(i)} \), where \( l_{s} \) is the number of templates in \( D_{s} \); here \( l_{s} = 9 \). Note that the observation vector of each template is constrained so that its columns have \( \ell_{2} \)-norm at most 1.
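A minimal initialization sketch under these definitions; the `crop` helper that extracts and vectorizes a bounding-box region is hypothetical, introduced only for illustration.

```python
import numpy as np

def crop(frame, bbox):
    """Hypothetical helper: vectorize the pixels inside bbox = (x, y, w, h)."""
    x, y, w, h = bbox
    return frame[y:y + h, x:x + w].reshape(-1)

def init_dictionaries(frame, bbox, shifts=(1, 2)):
    """Initialize D_s from shifted copies of the first-frame object and D_l
    from their mean (the initial cluster center)."""
    x, y, w, h = bbox
    templates = [crop(frame, (x, y, w, h))]                   # original object
    for s in shifts:                                          # 1-2 pixel shifts ...
        for dx, dy in ((s, 0), (-s, 0), (0, s), (0, -s)):     # ... in 4 directions
            templates.append(crop(frame, (x + dx, y + dy, w, h)))
    D_s = np.stack(templates, axis=1).astype(float)           # d x l_s, l_s = 9
    norms = np.maximum(np.linalg.norm(D_s, axis=0), 1.0)
    D_s /= norms                                              # columns with l2-norm <= 1
    D_l = D_s.mean(axis=1, keepdims=True)                     # clustering center of D_s
    return D_s, D_l
```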

For the multi-modality dictionary, when the object appearance changes little, the short-term dictionary distinguishes the object from the background effectively, and the long-term dictionary reduces error accumulation; when the object appearance changes greatly, the short-term dictionary keeps tracking the object continuously, and the long-term dictionary prevents the loss of correctly sampled objects. Thus, the combination of the two modality dictionaries better balances the adaptability and robustness of \( \ell_{1} \) trackers, which makes updating the multi-modality dictionary crucial. \( D_{s} \) is trained and updated using the candidates sampled in the previous frame. \( D_{l} \) is trained and updated using the accurate results from all previous frames: following the principle of K-means clustering, the category that the current object belongs to is identified by calculating the Euclidean distance between the current object and the clustering centers of the long-term dictionary, as shown in Eq. (3). The long-term dictionary is represented by the cluster center of each category, which reduces the amount of computation effectively.

$$ \begin{cases} x^{t} \in D_{l}^{c^{i}}, & d\left( f(x^{t}), f(D_{l}^{c^{i}}) \right) \in \left[ 0, d_{\max} + \Delta \right] \\ x^{t} \in D_{l}^{c^{new}}, & d\left( f(x^{t}), f(D_{l}^{c^{i}}) \right) > d_{\max} + \Delta \end{cases} $$
(3)

where \( D_{l}^{c^{i}} \) denotes an existing category of the long-term dictionary, \( D_{l}^{c^{new}} \) denotes a new category, \( d(\cdot) \) is the Euclidean distance, \( f(\cdot) \) is the corresponding LOMO feature, \( d_{\max} \) is the maximum Euclidean distance \( d(D_{s}, D_{s}^{'}) \) between templates in the initialized short-term dictionary \( D_{s} \), and \( \Delta \) is a variable margin.
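One plausible reading of Eq. (3) in code: assign the current result to the nearest long-term category, or open a new category when no center is close enough. The nearest-center rule is our assumption; the paper only specifies the distance test.

```python
import numpy as np

def assign_to_cluster(x_feat, centers, d_max, delta):
    """Assign the current object to a long-term dictionary category (Eq. 3).

    x_feat:  LOMO feature f(x^t) of the current tracking result
    centers: mutable list of cluster-center features f(D_l^{c_i})
    d_max:   max pairwise template distance in the initial short-term dictionary
    delta:   variable slack added to d_max
    """
    dists = [np.linalg.norm(x_feat - c) for c in centers]  # Euclidean distances
    i = int(np.argmin(dists))
    if dists[i] <= d_max + delta:
        return i                    # existing category c_i
    centers.append(x_feat.copy())   # open a new category c_new
    return len(centers) - 1
```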

3.2 Tracking Based on Multi-modality Dictionary

The proposed method is based on the particle filter framework, and all sampled particles are expressed as \( Y = \left\{ y_{1}, y_{2}, \ldots, y_{N} \right\} \in {\mathbb{R}}^{d \times N} \). Irrelevant particles are then filtered out by a distance constraint: the distance between the center coordinate of a sampled object \( p(y_{i}^{t}) \) and the center coordinate of the previous frame's tracking result \( p(x_{t-1}) \) must satisfy \( \left\| p(y_{i}^{t}) - p(x_{t-1}) \right\|_{2} \le \max(w, h) \), where \( w \) and \( h \) are the width and height of the bounding box of the previous tracking result, respectively. The remaining candidate samples are expressed as \( X = \left\{ x^{i} \mid i \in \left[ 1, q \right] \right\} \in {\mathbb{R}}^{d \times q} \; (q \ll N) \).
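This distance constraint translates directly into code; a minimal sketch, assuming particles are represented by their center coordinates:

```python
import numpy as np

def filter_particles(particles, prev_center, w, h):
    """Keep only particles whose center lies within max(w, h) of the previous
    tracking result, discarding irrelevant samples before sparse coding.

    particles:   N x 2 array of sampled center coordinates p(y_i^t)
    prev_center: center p(x_{t-1}) of the previous tracking result
    """
    dists = np.linalg.norm(particles - np.asarray(prev_center), axis=1)
    keep = dists <= max(w, h)
    return particles[keep]          # the q surviving candidates (q << N)
```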

Assuming that the multi-modality dictionary adapts well to object appearance changes, the cost function between the ideal tracking result and the templates of the object dictionary should be minimal. The cost functions of the short-term and long-term dictionaries are expressed as Eqs. (4) and (5), respectively. The best coefficients are then solved with the LARS method [19].

$$ l_{S}\left( x^{i}, D^{S} \right) = \min_{\alpha_{s}^{i}} \frac{1}{2} \left\| x^{i} - D^{S} \alpha_{s}^{i} \right\|_{2}^{2} + \lambda \left\| \alpha_{s}^{i} \right\|_{1} $$
(4)
$$ l_{L}\left( x^{i}, D^{L} \right) = \min_{\alpha_{l}^{i}} \frac{1}{2} \left\| x^{i} - D^{L} \alpha_{l}^{i} \right\|_{2}^{2} + \lambda \left\| \alpha_{l}^{i} \right\|_{1} $$
(5)

Generally, an image observation of a “good” object candidate is effectively represented by the object templates rather than the background templates, leading to a sparse representation; likewise, an image observation of a “bad” object candidate is more sparsely represented by the dictionary of background templates. Therefore, for an ideal sampled object, the difference between the \( \ell_{1} \)-norms of the coefficients over the object templates and over the background templates should be large. The candidate tracking results \( R \) are then formed from the first \( p \) samples satisfying this condition, as shown in Eqs. (6) and (7).

$$ R = \left[ R_{S}, R_{L} \right] = \left[ I_{S}^{i} \big|_{p}, \; I_{L}^{i} \big|_{p} \right], \quad i \in \left[ 1, q \right] $$
(6)
$$ I_{S}^{i} = \max\left( \left\| \alpha_{s^{+}}^{i} \right\|_{1} - \left\| \alpha_{s^{-}}^{i} \right\|_{1} \right), \qquad I_{L}^{i} = \max\left( \left\| \alpha_{l^{+}}^{i} \right\|_{1} - \left\| \alpha_{l^{-}}^{i} \right\|_{1} \right) $$
(7)

where \( \alpha_{s^{+}}^{i} \) and \( \alpha_{s^{-}}^{i} \) are the coefficients of the object templates and background templates in the short-term dictionary, and \( \alpha_{l^{+}}^{i} \) and \( \alpha_{l^{-}}^{i} \) are the corresponding coefficients in the long-term dictionary.
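A sketch of the ranking step in Eqs. (6) and (7), applied once to the short-term codes and once to the long-term codes; the default value of \( p \) is a placeholder.

```python
import numpy as np

def select_candidates(alphas, n_pos, p=5):
    """Rank candidates by the l1-norm gap between object and background
    coefficients (Eq. 7) and keep the top p (Eq. 6).

    alphas: q x (n_pos + n_neg) matrix of sparse codes, one row per candidate
    n_pos:  number of object (foreground) templates in the dictionary
    """
    gap = (np.abs(alphas[:, :n_pos]).sum(axis=1)      # ||alpha_+||_1
           - np.abs(alphas[:, n_pos:]).sum(axis=1))   # ||alpha_-||_1
    order = np.argsort(gap)[::-1]                     # largest gap first
    return order[:p]                                  # indices of the p best
```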

Eventually, an observation likelihood function is built for each candidate tracking result, and the candidate with the highest similarity is taken as the final tracking result \( \hat{x} \), as shown in Eqs. (8) and (9).

$$ \hat{x} = \arg\max_{j} \left( \omega_{s} \cdot s_{s} + \omega_{l} \cdot s_{l} \right) $$
(8)
$$ s_{s} = sim\left( f(R_{S}^{j}), f(D^{S}) \right), \quad s_{l} = sim\left( f(R_{L}^{j}), f(D^{L}) \right), \quad j \in \left[ 1, p \right] $$
(9)

where \( sim(\cdot) \) denotes the similarity computed from the Bhattacharyya distance, \( f(\cdot) \) is the LOMO feature of the corresponding image area, and \( \omega_{s} = s_{s} / (s_{s} + s_{l}) \) and \( \omega_{l} = s_{l} / (s_{s} + s_{l}) \) are the weights.
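A sketch of the final selection in Eqs. (8) and (9). The pairing of short-term and long-term candidates by index and the histogram normalization inside the Bhattacharyya similarity are our assumptions for this illustration.

```python
import numpy as np

def bhattacharyya_sim(f1, f2):
    """Similarity from the Bhattacharyya coefficient of two normalized
    histogram-like feature vectors; 1 means identical distributions."""
    p = f1 / (f1.sum() + 1e-12)
    q = f2 / (f2.sum() + 1e-12)
    return np.sqrt(p * q).sum()

def final_result(cand_feats_s, cand_feats_l, dict_feat_s, dict_feat_l):
    """Pick the candidate maximizing the weighted similarity of Eq. (8),
    with weights omega = s / (s_s + s_l) from Eq. (9)."""
    scores = []
    for fs, fl in zip(cand_feats_s, cand_feats_l):    # LOMO features of R_S, R_L
        s_s = bhattacharyya_sim(fs, dict_feat_s)
        s_l = bhattacharyya_sim(fl, dict_feat_l)
        w_s, w_l = s_s / (s_s + s_l), s_l / (s_s + s_l)
        scores.append(w_s * s_s + w_l * s_l)
    return int(np.argmax(scores))                     # index of x_hat
```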

4 Experiments and Analysis

The test video sequences (FaceOcc1, Walking and Fish) are selected from the object tracking benchmark [1]. For comparison, we test four state-of-the-art methods on the same video sequences: IVT [3], L1APG [13], TLD [7] and SP [5]. The code for all these trackers is publicly available, and we keep the parameter settings provided by the authors for all test sequences. We use the center error \( (error) \) and the overlap rate \( (overlap) \) to evaluate the tracking performance of each method. \( error \) is the Euclidean distance between the center coordinate produced by a tracking method and the ground truth, so a smaller value means a more accurately tracked position. \( overlap \) is the overlap ratio between the method's tracking window and the ideal tracking window, so a larger value means a better-fitting window. Figure 1 shows some representative results of the test sequences.
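For concreteness, a sketch of the two metrics under the standard benchmark definitions (center-distance error and intersection-over-union overlap), which we assume the paper follows:

```python
import numpy as np

def center_error(bb, gt):
    """Euclidean distance between predicted and ground-truth box centers.
    Boxes are (x, y, w, h)."""
    c  = np.array([bb[0] + bb[2] / 2.0, bb[1] + bb[3] / 2.0])
    cg = np.array([gt[0] + gt[2] / 2.0, gt[1] + gt[3] / 2.0])
    return np.linalg.norm(c - cg)

def overlap_rate(bb, gt):
    """Intersection-over-union of predicted and ground-truth boxes."""
    x1, y1 = max(bb[0], gt[0]), max(bb[1], gt[1])
    x2 = min(bb[0] + bb[2], gt[0] + gt[2])
    y2 = min(bb[1] + bb[3], gt[1] + gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = bb[2] * bb[3] + gt[2] * gt[3] - inter
    return inter / union
```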

Fig. 1. Some representative results of test sequences. (a) FaceOcc1; (b) Walking; (c) Fish.

Tracking error plots and tracking overlap plots for all test sequences are shown in Figs. 2 and 3. The main tracking problem in FaceOcc1 is that the object is heavily occluded for long periods. TLD fails when the object is largely occluded, but it recovers once the object returns to its normal state. IVT, L1APG, SP and our method all maintain low tracking error and a high overlap rate, among which our method performs best. The main tracking problems in Walking are partial occlusion and object scale variation. When the object scale becomes small, the object cannot be clearly distinguished from the background, so TLD and SP lose the object. IVT, L1APG and our method track the object continuously, but the bounding box of IVT is too large to fit the object size; our method performs best and L1APG second best. The main tracking problem in Fish is illumination change. As the object is affected by illumination and camera shake, IVT and L1APG start to drift. TLD tracks the object roughly, but its bounding box is small. SP and our method show promising performance; SP is the best tracker, with only a slight difference between SP and our method.

Fig. 2. Tracking error plots for all test sequences. (a) FaceOcc1; (b) Walking; (c) Fish.

Fig. 3. Tracking overlap plots for all test sequences. (a) FaceOcc1; (b) Walking; (c) Fish.

Table 1 shows the mean tracking error and mean tracking overlap rate, in which bold fonts indicate the best performance and italic underlined fonts the second best. From these data, we conclude that the proposed method performs well under occlusion, illumination changes, object scale variation, and similar challenges.

Table 1. The mean of tracking error and tracking overlap rate.

5 Conclusions

In this paper, an object tracking method based on a multi-modality dictionary is proposed, which addresses object tracking as the problem of learning a dictionary that represents the object accurately. Under the particle filter framework, a multi-modality dictionary is built and updated by clustering, which allows candidate tracking results to be obtained easily by comparing the coefficient differences with respect to the multi-modality dictionary. The final tracking result is then determined by precisely evaluating the observation likelihood function with the LOMO feature. Experiments on benchmark videos show that the proposed method is robust against occlusion, illumination changes and background interference.