1 Introduction

Approaches to modeling object appearance for tracking fall mainly into two classes: static and adaptive. Static models assume that the changes of the object appearance are limited and known in advance [1]. Under this assumption, unexpected changes of the object appearance clearly cannot be tracked. Adaptive methods address this drawback by updating the object model during tracking [2]. These approaches assume that every update is correct; under this underlying assumption, model errors accumulate over time and every incorrect update causes drift. The drift problem has been addressed by the introduction of so-called visual constraints [3]. Even though this approach demonstrated increased robustness and accuracy, its performance was tested only on videos where the object stayed in the field of view. In scenarios where an object moves in and out of the frame, object re-detection is essential. Object detection has been extensively studied [4], and a range of ready-to-use object detectors are available [5] that enable tracking-by-detection. Apart from expensive off-line training, the disadvantage of tracking-by-detection is that all objects share the same model, so their identities cannot be distinguished. An object tracking algorithm that splits the object model into three parts with different lifespans has been proposed to solve this problem [6]. This makes the tracker suitable for low-frame-rate videos, but the longest period for which the face can disappear from the camera view is limited. Another class of approaches to face tracking was developed as part of automatic character annotation in video [7]. These systems can handle the scenario considered in this paper, but they were designed for off-line processing, and adapting them to real-time tracking is not straightforward.

An approach called Tracking-Learning-Detection (TLD) has been designed for long-term tracking of arbitrary objects in unconstrained environments [8]; the learning part of TLD was analyzed in [9]. The object is tracked and simultaneously learned in order to build a detector that supports the tracker once it fails. The detector is built upon the information from the first frame as well as the information provided by the tracker. Beyond this, the detector is built upon the gray-scale distribution, i.e. a model of object appearance. This means that, given a moving object, these approaches discard the information carried by the object's motion, i.e. the variation of the foreground, and may therefore end up tracking an object far away from the location where the foreground actually varies. Besides this problem, shadow removal was not considered in this detector. Moreover, the tracker does not support non-linear multi-mode tracking, i.e. it cannot adapt well to sudden appearance changes, long-lasting occlusions, etc.

This work makes three contributions. First, a combined detector, consisting of background subtraction and an object appearance model-based detector, is proposed to solve problems such as linking, overlapping and false detections. Second, a non-linear multi-mode tracker coupled with the combined detector is used to handle sudden appearance changes, long-lasting occlusions, etc.; the tracker is the particle filter with spline resampling and global transition proposed in [12]. Third, person re-identification is used to assign consistent identities in the context of multi-target tracking.

2 The Proposed Method

In this section, we first propose a combined detector consisting of background subtraction and an object appearance model-based detector. We then describe a non-linear multi-mode tracker coupled with the combined detector.

2.1 Combined Detector

Background Subtraction. Background subtraction involves calculating a reference image, subtracting each new frame from this image and thresholding the result. This yields a binary segmentation of the image that highlights the regions occupied by non-stationary objects. Here, a color or gray-scale video frame is compared with a background model to determine whether individual pixels belong to the background or the foreground.

Given a series of gray-scale or color video frames, methods based on background mixture models compute the foreground mask using Gaussian mixture models (GMM). In our framework, an adaptive background mixture model is used to compute the foreground mask. This allows the system to learn faster and more accurately, and to adapt effectively to changing environments.
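For illustration, a minimal sketch of this step using OpenCV's GMM-based background subtractor is given below; the file name and the parameter values are assumptions for illustration, not the settings used in our experiments.

```python
import cv2

# Minimal sketch: OpenCV's MOG2 subtractor maintains an adaptive Gaussian
# mixture per pixel; parameter values here are illustrative assumptions.
subtractor = cv2.createBackgroundSubtractorMOG2(
    history=500,         # frames used to adapt the background model
    varThreshold=16,     # squared Mahalanobis threshold for foreground
    detectShadows=False  # shadows are handled by our own two-level filter
)

cap = cv2.VideoCapture("input.avi")  # hypothetical input sequence
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Non-zero mask pixels mark non-stationary (foreground) regions.
    fg_mask = subtractor.apply(frame)
cap.release()
```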

Shadow Removal Based on Level-Thresholding. The strategy of our proposed person tracking framework is to first perform shadow removal with a strong threshold, even if some foreground pixels are removed at the same time, and then to perform shadow removal with a weaker threshold in bounded local regions. The first step prevents background pixels from surviving shadow removal. With the combination of the above two filters alone, some foreground pixels are still lost, typically at the contact parts between silhouettes; the main reasons for losing foreground pixels are that the background texture is weak and that the pixels lie in the penumbra. The second step recovers the foreground pixels lost in the first step. In the first step, the intensity ratio between the background and the current frame is calculated by Eq. (1), and a pixel is retained as foreground when it satisfies Eq. (2).

$$\begin{aligned} \left\{ \begin{aligned} E_{r}(i,j)= \frac{\min (B_{r}(i,j),Cur_{r}(i,j))}{\max (B_{r}(i,j),Cur_{r}(i,j))}\\ E_{g}(i,j)= \frac{\min (B_{g}(i,j),Cur_{g}(i,j))}{\max (B_{g}(i,j),Cur_{g}(i,j))}\\ E_{b}(i,j)= \frac{\min (B_{b}(i,j),Cur_{b}(i,j))}{\max (B_{b}(i,j),Cur_{b}(i,j))}\\ \end{aligned}\right. \end{aligned}$$
(1)
$$\begin{aligned} M(i,j)= \left\{ \begin{array}{ll} 1,&{}E_{r}(i,j)<T_{1}\ \text {and}\ E_{g}(i,j)<T_{1}\ \text {and}\ E_{b}(i,j)<T_{1}\\ 0,&{}\text {otherwise} \end{array}\right. \end{aligned}$$
(2)

where \(E_{r}(i,j)\), \(E_{g}(i,j)\) and \(E_{b}(i,j)\) are the intensity ratio (difference) images for the three channels; \(B_{r}(i,j)\), \(B_{g}(i,j)\) and \(B_{b}(i,j)\) are the background images; \(Cur_{r}(i,j)\), \(Cur_{g}(i,j)\) and \(Cur_{b}(i,j)\) are the current frames; \(M(i,j)\) is the binary mask; and \(T_{1}\) is the threshold for the first level of shadow removal. In the binary mask, pixels with a value of 1 correspond to the foreground and pixels with a value of 0 to the background. Morphological operations are then performed on the resulting binary mask to remove noisy pixels and to fill the holes in the remaining blobs. In the second step, a weaker threshold is used to perform shadow removal in bounded local regions and to recover the foreground pixels lost in the first step.

$$\begin{aligned} M(i_{b},j_{b})= \left\{ \begin{array}{ll} 1,&{}E_{r}(i_{b},j_{b})<T_{2}\ \text {and}\ E_{g}(i_{b},j_{b})<T_{2}\ \text {and}\ E_{b}(i_{b},j_{b})<T_{2}\\ 0,&{}\text {otherwise} \end{array}\right. \end{aligned}$$
(3)

where \(T_{2}\) is the threshold for the second level of shadow removal, and \(i_{b}\) and \(j_{b}\) are pixel coordinates in the bounded local regions. Morphological operations are then performed on this binary mask to remove noisy pixels and to fill the holes in the remaining blobs.
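The two-level thresholding of Eqs. (1)-(3) can be sketched as follows; the threshold values \(T_{1}\), \(T_{2}\) and the use of foreground bounding boxes as the bounded local regions are illustrative assumptions.

```python
import cv2
import numpy as np

def ratio_image(background, frame):
    """Per-channel intensity ratio E of Eq. (1); values lie in (0, 1]."""
    b = background.astype(np.float32) + 1e-6  # avoid division by zero
    f = frame.astype(np.float32) + 1e-6
    return np.minimum(b, f) / np.maximum(b, f)

def level_mask(background, frame, threshold):
    """Binary mask of Eqs. (2)/(3): 1 where all three channel ratios < T."""
    e = ratio_image(background, frame)
    return np.all(e < threshold, axis=2).astype(np.uint8)

def two_level_shadow_removal(background, frame, fg_mask,
                             t1=0.6, t2=0.8, boxes=()):
    fg = (fg_mask > 0).astype(np.uint8)
    # First level: strong threshold T1 over the whole frame (Eq. (2)).
    mask = fg & level_mask(background, frame, t1)
    kernel = np.ones((3, 3), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # remove noise
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # fill holes
    # Second level: weaker threshold T2 inside bounded local regions
    # (Eq. (3)) to recover foreground pixels lost in the first level.
    for (x, y, w, h) in boxes:
        local = level_mask(background[y:y+h, x:x+w], frame[y:y+h, x:x+w], t2)
        mask[y:y+h, x:x+w] |= fg[y:y+h, x:x+w] & local
    return cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
```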

Combined Detector with the Object Appearance Model-Based Detector. This component detects people in an input image using Histogram of Oriented Gradients (HOG) features and a trained Support Vector Machine (SVM) classifier; it detects unoccluded people in an upright position.

Local object appearance and shape can often be characterized rather well by the distribution of local intensity gradients or edge directions, even without precise knowledge of the corresponding gradient or edge positions. This is implemented by dividing the image window into small spatial regions ("cells") and accumulating, for each cell, a local 1-D histogram of gradient directions or edge orientations over its pixels. The representation is formed by the combined histogram entries. For better invariance to illumination, shadowing, etc., the local responses are contrast-normalized before use. This is done by accumulating a measure of local histogram "energy" over somewhat larger spatial regions ("blocks") and using the results to normalize all of the cells in the block. We refer to the normalized descriptor blocks as Histogram of Oriented Gradients (HOG) descriptors. The human detection chain tiles the detection window with a dense (in fact, overlapping) grid of HOG descriptors and feeds the combined feature vector to an SVM-based window classifier.

One-class SVM has been widely used for outlier detection; only positive samples are used in training. Its basic idea is to describe the data in feature space by a hypersphere that encloses most of the data. The problem is formulated as the following objective function:

$$\begin{aligned} \min _{R\in \mathbb {R},\, \xi \in \mathbb {R}^{l},\, c\in F}R^2+\frac{1}{vl}\sum _{i}\xi _{i}, \end{aligned}$$
(4)
$$\begin{aligned} \Vert \varPhi (X_{i})-c\Vert ^{2}\le R^2+\xi _{i}, \quad \forall i\in \{1,\ldots ,l\}:\xi _{i}\ge 0 \end{aligned}$$
(5)

where \(\varPhi (X_{i})\) is the multi-dimensional feature vector of training sample \(X_{i}\), l is the number of training samples, R and c are the radius and center of the hypersphere, and \(v\in [0,1]\) is a trade-off parameter. The goal of the optimization is to keep the hypersphere as small as possible while enclosing most of the training data. The problem can be solved in its dual form by QP optimization methods, and the decision function is:

$$\begin{aligned} f(X)=R^2-\Vert \varPhi (X)-c\Vert ^{2}, \end{aligned}$$
(6)

where \(\Vert \varPhi (X)-c\Vert ^{2}=k(X,X)-2\sum _{i}\alpha _{i}k(X_{i},X)+\sum _{i,j}\alpha _{i}\alpha _{j}k(X_{i},X_{j})\), and \(\alpha _{i}\) and \(\alpha _{j}\) are the parameters for each constraint in the dual problem. In our task, we use the radial basis function (RBF) kernel \(k(X,Y)=\exp \{-\Vert X-Y\Vert ^{2}/2\sigma ^{2}\}\) in the one-class SVM to deal with high-dimensional, non-linear, multi-mode distributions. The decision function of the kernel one-class SVM captures the density and modality of the feature distribution well.
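For illustration, a minimal sketch using scikit-learn is given below. Note that scikit-learn's OneClassSVM implements the hyperplane formulation of the one-class SVM, which is equivalent to the hypersphere formulation above when an RBF kernel is used; the values of v (nu) and \(\sigma \) and the random training data are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Hypothetical training set: feature vectors of positive person samples.
X_train = np.random.rand(200, 3780)

# nu plays the role of the trade-off parameter v in Eq. (4); with an RBF
# kernel, gamma = 1 / (2 * sigma^2) reproduces k(X, Y) from the text.
sigma = 1.0
ocsvm = OneClassSVM(kernel="rbf", nu=0.1, gamma=1.0 / (2.0 * sigma ** 2))
ocsvm.fit(X_train)

# decision_function is positive inside the learned support region and
# negative outside, matching the sign behaviour of f(X) in Eq. (6).
scores = ocsvm.decision_function(np.random.rand(5, 3780))
inliers = scores > 0
```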

The model size is specified as either \(128\times 64\) or \(96\times 48\) pixels, i.e. the image size used for training. The training images include background pixels around the person, so the actual size of a detected person is smaller than the training image size.
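For illustration, the dense descriptor grid over a \(128\times 64\) detection window can be computed with OpenCV as follows; the cell, block, and stride sizes are the common defaults, assumed here for illustration.

```python
import cv2
import numpy as np

# HOG over a 64x128 detection window: 8x8-pixel cells, 9 orientation bins,
# and 16x16-pixel blocks (2x2 cells) with block-level contrast normalization.
hog = cv2.HOGDescriptor((64, 128), (16, 16), (8, 8), (8, 8), 9)

window = np.zeros((128, 64), dtype=np.uint8)  # placeholder image patch
feature = hog.compute(window)  # combined feature vector of the dense grid
# 7 x 15 overlapping blocks x 4 cells x 9 bins = 3780 dimensions here.
```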

$$\begin{aligned} M(i,j)= \left\{ \begin{array}{ll} 1,&{}\text {if a person is detected}\\ 0,&{}\text {otherwise} \end{array}\right. \end{aligned}$$
(7)

where \(M(i,j)\) is the binary mask. Morphological operations are then performed on the resulting binary mask to remove noisy pixels.
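An illustrative sketch of producing this mask with OpenCV's pre-trained \(128\times 64\) pedestrian detector is given below; note that this detector uses a two-class linear SVM and is only a stand-in for the one-class SVM detector described above.

```python
import cv2
import numpy as np

# Pre-trained 64x128 pedestrian model with a *linear* SVM, used here as a
# stand-in for the one-class SVM detector described in the text.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def person_mask(frame):
    """Binary mask of Eq. (7): 1 inside detected person boxes, 0 elsewhere."""
    mask = np.zeros(frame.shape[:2], dtype=np.uint8)
    boxes, _weights = hog.detectMultiScale(frame, winStride=(8, 8))
    for (x, y, w, h) in boxes:
        # Training windows include background margin, so the person is
        # somewhat smaller than the returned box.
        mask[y:y + h, x:x + w] = 1
    kernel = np.ones((3, 3), np.uint8)
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)  # remove noise
```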

2.2 Non-linear Multi-mode Tracker with Combined Detector

Many trackers are built upon a model of object appearance. Given a moving object, such trackers discard the information carried by the object's motion, i.e. the variation of the foreground, and may therefore track an object far away from the location where the foreground actually varies. Our detector combines the information from the gray-scale distribution with the variation of the foreground, which makes the tracker search for the object at a location consistent with the real foreground variation. After the particle filter (PF) has run, its result is compared with that of background subtraction: we re-check whether an object detected by background subtraction exists at the object region estimated by the PF, and how much the two regions overlap. If they coincide or overlap, the PF result is corrected to the background-subtraction result; if not, there are two proposals. If the confidence of the PF result is insufficient, it is canceled completely; otherwise a new object region is added to the background-subtraction result to compensate for its errors. Indeed, when objects overlap heavily, a real object region may be classified as background owing to the incompleteness of the person detector; such regions are recovered by the object appearance model-based detector.

$$\begin{aligned} M(i,j)= \left\{ \begin{array}{ll} 1,&{}\text {if the PF confidence is large, or the regions coincide or overlap}\\ 0,&{}\text {otherwise} \end{array}\right. \end{aligned}$$
(8)

where \(M(i,j)\) is the binary mask.

Finally, the PF is run on extended regions around each bounding box obtained by background subtraction, i.e. each extended region is larger than the region bounded by background subtraction. All coordinates for the PF are computed relative to the corresponding extended region.
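A sketch of this consistency check is given below; the use of intersection-over-union as the overlap measure and the threshold values are our illustrative assumptions, not the exact criterion of the proposed framework.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def fuse(pf_box, pf_confidence, bs_boxes,
         overlap_thresh=0.3, conf_thresh=0.5):
    """Consistency check between the PF estimate and background subtraction."""
    for bs_box in bs_boxes:
        if iou(pf_box, bs_box) > overlap_thresh:
            # Consistent: correct the PF result to the subtraction result.
            return bs_boxes, bs_box
    if pf_confidence < conf_thresh:
        # Inconsistent and low confidence: cancel the PF proposal.
        return bs_boxes, None
    # Inconsistent but confident: keep both proposals, adding the PF region
    # to compensate background-subtraction errors (e.g. heavy occlusion).
    return bs_boxes + [pf_box], pf_box
```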

3 Experimental Results

We evaluate the performance of our proposed tracking framework. The experiments are aimed at a relative evaluation of the proposed approach against the previous approach. A more sophisticated feature descriptor could, of course, be used in this tracking scenario to obtain higher tracking accuracy; since we focus on the relative comparison, however, no other feature descriptors are used. We test our proposed approach and the previous approach on several challenging image sequences.

The performance evaluation consists of two parts. The first part evaluates the proposed tracking framework on image sequences given a single example of a specific object. The second part compares the proposed framework with the previous framework in the context of tracking multiple objects.

All experiments are conducted on an Intel\(\circledR \) Core(TM) i5 2.40 GHz PC with 4 GB memory (1.32 GHz). The real image sequences are available at http://www.ces.clemson.edu/~stb/research/headtracker and in the MATLAB (2013a) distribution.

The first set of experiments uses a widely used real video sequence containing 500 frames with a resolution of \(128\times 96\) pixels. Figure 1 shows the comparison of tracking results on the real color video, the absolute error for every frame, and the error histograms for the two approaches. The red box indicates the result of our approach, and the blue box that of TLD. The results show that our proposed approach is also robust on real video sequences; the tracker based on the proposed approach is the better of the two trackers.

Fig. 1. Comparison of tracking results on a real color video. The red box indicates the result of our approach, the blue box that of TLD. Shown are frames 0, 3, 22, 98 (top); 117, 126, 135, 188 (middle); and 427, 457, 471, 500 (bottom) (Color figure online).

Next, the second set of experiments uses a color image sequence containing the motion of multiple objects. This sequence consists of \(480\times 360\)-pixel color images and contains variations such as model distortion, occlusion, the appearance of multiple objects, and noise. The purpose of this experiment is to evaluate the robustness of our proposed approach in the multiple-object tracking configuration. The result is shown in Fig. 2. It can be seen clearly that our tracking framework handles occlusion and multiple-object tracking better than TLD.

Fig. 2. Comparison of tracking results on a color video with multiple moving objects; TLD (a) and our proposed approach (b). Shown are frames 30, 43, 45 (top); and 48, 54, 65 (bottom) (Color figure online).

The above experimental results show that our proposed tracking framework is more robust and accurate than the compared approaches. We also emphasize that our approach may obtain competitive tracking results on other image datasets.

4 Conclusion

In this paper, we first proposed a combined detector consisting of background subtraction and an object appearance model-based detector, used to solve problems such as linking, overlapping and false detections. We then proposed a non-linear multi-mode tracker coupled with the combined detector to handle sudden appearance changes, long-lasting occlusions, etc. Finally, we tested the proposed person tracking framework in single-object and multi-object tracking scenarios. Future work will aim at extending the proposed approach with more sophisticated feature descriptors and similarity measures such as the geogram [13], SOG [14] and AIBS [15].