Introduction

In the past decades, detection and tracking of video objects has always been a major task in the computer vision field [1,2,3]. As one subset of video object tracking, pedestrian detection and tracking has drawn massive research attention and been applied to many applications such as visual surveillance [4,5,6,7,8], driver-assistance systems [9,10,11], human activity recognition [12,13,14], and others [15, 16]. For pedestrian detection and tracking, visible cameras and thermal imagery are two popular sources of image modalities, though not necessarily used in a combined solution [17,18,19]. However, both visible and thermal images have their own advantages and disadvantages. Visible images show detailed color information, but they suffer from lighting variations, cluttered backgrounds, and artificial appearances such as shadows. Since the object is detected by its temperature and radiated heat, a thermal image can eliminate the influence of color and illumination changes on the objects’ appearance [20] in any weather condition, both day and night. However, in some cases, e.g., occlusions, the thermal camera may fail to detect the object properly. In Fig. 1, there are three pedestrian templates; the one in the yellow rectangle is detected well in both the visible and thermal images since it has high contrast to the background in the visible domain and human temperature in the thermal domain. The template in the red rectangle has a compact shape in the thermal image, whereas in the visible image it can only be identified coarsely because the person’s clothing is similar in color to the background. The one in the green rectangle can be seen in the visible image but hardly observed in the corresponding thermal image. This is because thermography can only directly measure surface temperatures, so it does not work well when the object is (partially) occluded. Moreover, it will respond to any object (e.g., the windows and cars in Fig. 1) with a notable surface temperature.

Fig. 1
figure 1

Visible image of a scene (left) and thermal image of the same scene (right)

Background subtraction plays an important role in object detection. Due to its significance, a large number of background subtraction algorithms have been proposed in recent years. Andrew et al. [21] proposed a single-camera statistical segmentation algorithm where a combination of statistical background image estimation and Bayesian-based segmentation is used to achieve foreground detection. Domenico and Luca [22] proposed a fast background subtraction method based on a clustering algorithm with a condition-based mechanism. Zhao et al. [23] proposed a background modeling method for motion detection in dynamic scenes based on a type-2 fuzzy Gaussian mixture model [24] and Markov random field (MRF) [25]. In [26], the authors introduced a background subtraction framework based on texture features; in addition, color cues are clustered by a codebook scheme in order to refine the texture-based detection. Pierre-Luc points out in [27] that most background subtraction methods do not pay attention to the spatial or spatiotemporal relationships of the analyzed pixels and also suffer in complexity, computational cost, and versatility. Therefore, he proposed a spatiotemporal-based background subtraction algorithm which has proven to be low-cost and highly efficient. He also proposed another algorithm using spatiotemporal feature descriptors in [28] in order to build an adaptive and flexible model rather than tuning parameters for each scenario to obtain optimal performance. In [29], a background subtraction model based on independent component analysis and principal component analysis is proposed to detect multiple moving objects in complex outdoor scenes such as bad weather or dynamic backgrounds. In [30], based on the assumption that moving objects are usually small and sparse, a collaborative low-rank and sparse separation model is proposed to robustly detect moving objects of different sizes. However, background regions with a color/intensity similar to the foreground may be mistakenly detected as foreground. In Wang et al. [31], a coarse-to-fine pedestrian detection method is proposed for visual surveillance, which can solve the problem of detecting small pedestrians. By using pan-tilt-zoom control, it also helps to achieve real-time tracking, though the performance depends on specific sensor settings.

However, due to the lack of cognitive knowledge, some of these methods achieve good objective performance but unsatisfactory subjective performance (detailed in “Experimental Results”). Besides, existing approaches mainly rely on color images for pedestrian detection and tracking, using different features such as color and texture for modeling. In our paper, thermal images are also used, which have neither color nor texture information but only intensity. Unlike color images, thermal images are robust to any weather or illumination condition, though they are sensitive to surface temperature. As a result, it is necessary to find a new path to process both visible and thermal images based on their characteristics. Inspired by several multi-modality image fusion approaches [32,33,34], where color and infrared images are integrated for saliency-based image fusion [32, 34] and image registration [33], the fusion of the two image modalities (RGB and thermal) offers new insights through the supplementary information they provide. This has proven to be a success in determining the refined foreground map by fusing the visible and thermal binary maps. By combining cognitive models from different levels and aspects, we propose a generic model for effective detection and tracking of pedestrians from color and thermal videos.

In our proposed approach, different levels of cognitive models are integrated for effective detection and tracking of pedestrians from color and thermal videos. These include color- and intensity-based cognitive models of human visual perception for robust background estimation and foreground detection, cognitive models of object priors for shape-constrained morphological filtering in determining the refined foreground maps, and a cognitive model of motion for motion-consistency-constrained mean shift in extracting single persons from a group. By systematically integrating these cognitive models, an effective model is developed and shown to perform best when benchmarked against several state-of-the-art techniques. It is believed the proposed approach can also be applied in other areas of object detection and tracking, e.g., medical imaging, for improved performance.

The main contributions in this paper can be highlighted in the following three aspects:

  • As color and intensity information plays an important role in the cognitive models of human visual perception, an adaptive Gaussian mixture model is proposed to measure the distribution of such information in multi-modality images (color and thermal) before deriving the estimated background for foreground detection.

  • Based on the prior knowledge of the human objects to be detected, shape constraints are fused in combination with morphological filtering for determining the refined foreground maps.

  • Inspired by cognitive model of motion, motion consistency is applied in a constrained mean-shift scheme for the extraction of single persons from a group.

The rest of the paper is organized as follows: The “Overview of the Proposed System” illustrates the framework of the proposed method. The “Foreground Detection” describes the foreground detection approach. The “Object Tracking” elaborates the object tracking method. Experimental results are presented and discussed in the “Experimental Results.” Finally, some concluding remarks and future work are summarized in the “Conclusion.”

Overview of the Proposed System

In this paper, we propose a two-stage background subtraction procedure based on human cognitive knowledge of both visible and thermal images for fusion-based pedestrian detection; the five modules involved are shown in Fig. 2. In the first stage, we predict the background model by computing the median of randomly selected frames in the video (module 1) and apply an adaptive threshold to detect the binary foreground map, followed by knowledge-based morphological refinement (module 2). In the second stage, we use the results from module 1 as prior frames and employ a learning-based adaptive Gaussian mixture model to estimate the background model and generate the binary foreground map (module 3). Then the initial and Gaussian-based foreground maps of both visible and thermal images are refined by shape-constrained morphological filtering and further fused together to obtain the final foreground map (module 4). In the performance evaluation (module 5), the proposed background subtraction method is compared against a number of state-of-the-art methods on widely used, publicly available video sequences. Commonly used evaluation criteria such as precision, recall, and F measure are used for quantitative assessment. In addition, we also propose a constrained mean-shift tracking method that can cope with scale changes and identify individual pedestrian templates within a pedestrian group more efficiently (detailed in “Object Tracking”). Furthermore, the performance of object tracking is evaluated by qualitative assessment. Detailed results are reported in the “Experimental Results.”

Fig. 2
figure 2

Proposed framework with five modules

Foreground Detection

In this section, a two-stage foreground detection method is applied to both visible and thermal images. Eventually, the desired foreground map is obtained by fusing the foreground detection results of the two image types with a cognition-based morphological process.

Random Median Background Subtraction

To capture the initial region of pedestrians in the visible and thermal images, we first estimate the background model by computing a median map (Fig. 2, module 1) over N frames randomly selected from the video sequence. The initial background subtraction process for each visible or thermal frame is defined as:

$$ \mathrm{BS}_{\mathrm{ini}}\left(x,y\right)=\left|I\left(x,y\right)-I_{\mathrm{med}}\left(x,y\right)\right| $$
(1)

After that, we binarize \( \mathrm{BS}_{\mathrm{ini}} \) with an adaptive threshold, i.e., Otsu’s method [35], to get a binary image \( I_{bi} \) with coarse human body regions (Fig. 2, module 2). However, \( I_{bi} \) contains many ambiguous contents, and some objects that should be detected as a whole are fractured. Therefore, a cognition-based morphological refinement is applied here to filter insignificant regions and integrate the potential objects. Since the objects we want to detect are pedestrians, we can assume based on our cognition that the shape of a pedestrian is roughly an ellipse or a rectangle whose major axis is usually longer than its minor axis. Therefore, in our morphological refinement, we define a rectangle-shaped structuring element to connect separated regions into whole objects. The width and height of the rectangle are defined as 2n + 1 and \( 2\mathrm{n}+3\ \left(n\in {\mathrm{Z}}_0^{+}\right) \), respectively. Here, we set n to 1. Furthermore, as the pedestrians in the video are not negligibly small, we remove noise regions whose area is below an empirical threshold T. From Fig. 3, we can see that in the refinement result \( I_r \), the small noise regions have been removed and every object has been integrated.
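For illustration, this stage can be sketched in a few lines of Python with OpenCV. The number of sampled frames, the empirical area threshold T, and the choice of the current frame are assumptions made for the example rather than the exact settings of our implementation.

```python
import cv2
import numpy as np

def initial_foreground(frames, n=1, area_thresh=50, num_median=20):
    """Random-median background subtraction with shape-based morphological refinement.

    frames: list of grayscale frames (visible intensity or thermal), uint8.
    n: controls the rectangular structuring element (width 2n+1, height 2n+3).
    area_thresh: empirical threshold T used to discard small noise regions.
    """
    # Module 1: median of randomly selected frames as the estimated background.
    idx = np.random.choice(len(frames), size=min(num_median, len(frames)), replace=False)
    bg_median = np.median(np.stack([frames[i] for i in idx]), axis=0).astype(np.uint8)

    # Eq. (1): absolute difference between the current frame and the median background.
    cur = frames[-1]
    bs_ini = cv2.absdiff(cur, bg_median)

    # Module 2: Otsu thresholding gives the coarse binary map I_bi.
    _, i_bi = cv2.threshold(bs_ini, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Cognition-based refinement: connect fragmented body parts with a
    # rectangle taller than it is wide (pedestrians are roughly upright).
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2 * n + 1, 2 * n + 3))
    i_r = cv2.morphologyEx(i_bi, cv2.MORPH_CLOSE, kernel)

    # Remove connected components smaller than the empirical area threshold.
    num, labels, stats, _ = cv2.connectedComponentsWithStats(i_r)
    for k in range(1, num):
        if stats[k, cv2.CC_STAT_AREA] < area_thresh:
            i_r[labels == k] = 0
    return i_r, bg_median
```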

Fig. 3
figure 3

The refined initial background subtraction results of visible (left) and thermal (right) images

Adaptive Mixture Background Subtraction

Although the random median background subtraction module can detect some potential objects, it still produces many false alarms because it does not analyze scene changes, lighting changes, moving objects, etc. Therefore, a learning-based background mixture model is employed here to estimate the foreground map in a real scene. For a particular surface under particular lighting, a single Gaussian per pixel is sufficient to represent the pixel value. In practice, however, multiple surfaces appear and the lighting conditions change. Thus, in order to fit the real-world situation and our human cognition, multiple adaptive Gaussians are necessary. In this paper, we model each pixel by a mixture of K Gaussian distributions. The probability of an observed pixel \( X_t \) at time t can be written as:

$$ P\left({X}_t\right)=\sum \limits_{i=1}^K{\omega}_{i,t}\ast \eta \left({X}_t;{\mu}_{i,t},{\Sigma}_{i,t}\right) $$
(2)

where \( \omega_{i,t} \) is the weight of the i-th Gaussian in the mixture at time t, and \( \mu_{i,t} \) and \( {\Sigma}_{i,t}={\sigma}_i^2 I \) are the mean and covariance of the i-th Gaussian in the mixture at time t. \( \eta(\ast) \) is the normal density of the i-th Gaussian component:

$$ \eta \left({X}_t;{\mu}_{i,t};{\Sigma}_{i,t}\right)=\frac{1}{{\left(2\pi \right)}^{\frac{D}{2}}{\left|{\Sigma}_{i,t}\right|}^{\frac{1}{2}}}{e}^{-\frac{1}{2}{\left({X}_t-{\mu}_{i,t}\right)}^T{\Sigma_{i,t}}^{-1}\left({X}_t-{\mu}_{i,t}\right)} $$
(3)

The first B distributions are chosen as the background model

$$ B={\mathrm{argmin}}_b\left(\sum \limits_{i=1}^b{\omega}_i>T\right) $$
(4)

T is the minimum portion of the data that should be accounted for by the background. A newly observed pixel value \( X_t \) is considered foreground if it is more than 2.5 standard deviations away from all of the B background distributions. The first Gaussian component that matches the new observed pixel value is then updated by the following process:

$$ {\omega}_{i,t}=\left(1-\alpha \right){\omega}_{i,t-1}+\alpha \hat{p}\left({\omega}_{i,t}|{X}_t\right) $$
(5)
$$ {\mu}_{i,t}=\left(1-\rho \right){\mu}_{i,t-1}+\rho {X}_t $$
(6)
$$ {\Sigma}_{i,t}=\left(1-\rho \right){\Sigma}_{i,t-1}+\rho \left({X}_t-{\mu}_{i,t}\right){\left({X}_t-{\mu}_{i,t}\right)}^T $$
(7)
$$ \rho =\alpha \eta \left({X}_t;{\mu}_{i,t},{\Sigma}_{i,t}\right) $$
(8)
$$ \hat{p}\left({\omega}_{i,t}|{X}_t\right)=\left\{\begin{array}{ll}1, & \text{if } {\omega}_{i,t} \text{ corresponds to the first matched Gaussian component}\\ 0, & \text{otherwise}\end{array}\right. $$
(9)

where α is the learning rate.

In addition, we use ten random median background subtraction results to initialize the parameters \( \omega_{i,t} \), \( \mu_{i,t} \), and \( \Sigma_{i,t} \) for better performance. After the adaptive background mixture model has been applied, we obtain the foreground maps \( {I_a}^{vis} \) and \( {I_a}^{thm} \) of the visible and thermal images (Fig. 2, module 3).
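As a minimal, single-pixel sketch of the update rules in Eqs. (2)–(9), the following Python class maintains a grayscale mixture and reports whether each new observation is foreground. The matching rule (best-weighted matched component) and the re-initialization values for unmatched observations are illustrative assumptions, not the exact settings of our implementation.

```python
import numpy as np

class PixelGMM:
    """Per-pixel adaptive Gaussian mixture (grayscale sketch of Eqs. (2)-(9))."""

    def __init__(self, K=5, alpha=0.002, T=0.7, init_var=36.0):
        self.K, self.alpha, self.T = K, alpha, T
        self.w = np.full(K, 1.0 / K)        # omega_{i,t}
        self.mu = np.linspace(0, 255, K)    # mu_{i,t}
        self.var = np.full(K, init_var)     # sigma_i^2

    def update(self, x):
        """Update the mixture with observation x and return True if x is foreground."""
        # Sort components by w/sigma and pick the first B explaining portion T (Eq. 4).
        order = np.argsort(-self.w / np.sqrt(self.var))
        csum = np.cumsum(self.w[order])
        B = np.searchsorted(csum, self.T) + 1
        bg_ids = order[:B]

        # x matches a component if it lies within 2.5 standard deviations of its mean.
        d = np.abs(x - self.mu)
        matched = np.where(d < 2.5 * np.sqrt(self.var))[0]
        is_fg = not any(i in bg_ids for i in matched)

        if matched.size:
            i = matched[np.argmax(self.w[matched])]            # best matched component
            rho = self.alpha * self._pdf(x, i)                 # Eq. (8)
            self.w = (1 - self.alpha) * self.w                 # Eq. (5), p_hat = 0 otherwise
            self.w[i] += self.alpha                            # Eq. (5), p_hat = 1 for the match
            self.mu[i] = (1 - rho) * self.mu[i] + rho * x      # Eq. (6)
            self.var[i] = (1 - rho) * self.var[i] + rho * (x - self.mu[i]) ** 2  # Eq. (7)
        else:
            # No match: replace the weakest component with a new, wide Gaussian centred on x.
            i = np.argmin(self.w)
            self.mu[i], self.var[i], self.w[i] = x, 900.0, 0.05
        self.w /= self.w.sum()
        return is_fg

    def _pdf(self, x, i):
        # Eq. (3) for the one-dimensional (grayscale) case.
        return np.exp(-0.5 * (x - self.mu[i]) ** 2 / self.var[i]) / np.sqrt(2 * np.pi * self.var[i])
```

In practice, the same per-pixel update is vectorized over the whole frame (or an off-the-shelf mixture-of-Gaussians background subtractor is used) to produce the binary maps consumed by the fusion stage.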

Fusion Strategy

In order to generate the final foreground map and make the fusion result close to human perception, we apply a shape-constrained morphological refinement to the results from the previous stages and integrate them together. For \( {I_r}^{vis} \), \( {I_a}^{vis} \), \( {I_r}^{thm} \), and \( {I_a}^{thm} \), we define a function D(∙) that dilates all the potential objects with a shape-based structuring element. Here we set n = 0 because we only want to smooth the edge of each object and close the small gaps between object pieces. By doing so, the shape of the object becomes continuous, which matches human perception. Then the final foreground map (Fig. 2, module 4) can be built by the following fusion strategy:

$$ {I}_{vis}=\left({I_a}^{vis}\cap D\left({I_r}^{vis}\right)\right)\cup \left({I_r}^{vis}\cap D\left({I_a}^{vis}\right)\right) $$
(10)
$$ {I}_{thm}=\left({I_a}^{thm}\cap D\left({I_r}^{thm}\right)\right)\cup \left({I_r}^{thm}\cap D\left({I_a}^{thm}\right)\right) $$
(11)
$$ {I}_{final}=\left({I}_{vis}\cap D\left({I}_{thm}\right)\right)\cup \left({I}_{thm}\cap D\left({I}_{vis}\right)\right) $$
(12)
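Assuming the four maps are binary uint8 images of the same size, Eqs. (10)–(12) translate almost directly into OpenCV morphology and bitwise operations, as in the sketch below.

```python
import cv2

def shape_dilate(mask, n=0):
    """D(.): dilation with the rectangle-shaped structuring element (width 2n+1, height 2n+3)."""
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2 * n + 1, 2 * n + 3))
    return cv2.dilate(mask, kernel)

def fuse_pair(i_a, i_r, n=0):
    """Eqs. (10)-(11): keep pixels of one map that fall inside the dilated other map."""
    return cv2.bitwise_or(cv2.bitwise_and(i_a, shape_dilate(i_r, n)),
                          cv2.bitwise_and(i_r, shape_dilate(i_a, n)))

def final_foreground(i_a_vis, i_r_vis, i_a_thm, i_r_thm, n=0):
    i_vis = fuse_pair(i_a_vis, i_r_vis, n)   # Eq. (10)
    i_thm = fuse_pair(i_a_thm, i_r_thm, n)   # Eq. (11)
    return cv2.bitwise_or(cv2.bitwise_and(i_vis, shape_dilate(i_thm, n)),   # Eq. (12)
                          cv2.bitwise_and(i_thm, shape_dilate(i_vis, n)))
```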

Object Tracking

For any two consecutive frames, if the later frame has fewer objects than the former frame, there are only two possible situations. The first is that one or more objects in the former frame have left the scene in the later frame; the other is that some individual objects in the later frame are detected as a whole in the foreground detection stage due to the inevitable overlap and occlusion problem. Figure 4 (left and middle) shows the detection detail of two adjacent frames where three pedestrian templates should be detected in both frames; for the frame in the middle, however, the object detection method considers the left two templates as one object because they are close to each other. Therefore, in this section, an improved mean-shift method is proposed to track the individual objects in the second situation.

Fig. 4
figure 4

Initial detection result of frame 1 (left) and frame 2 (middle), and updated detection result of frame 2 (right) after mean-shift tracking progress

The conventional mean-shift method [36] has two main drawbacks. The first is that it tracks the object mostly based on color and texture features and does not take much account of the spatial relationship of the object. Therefore, if the object has a color similar to the surrounding background, the tracker will probably locate the object in the background region in the following frame. The second concerns the similarity computation for two probability density functions (PDFs). In [36], the distance between two PDFs is defined as

$$ d\left(y\right)=\sqrt{1-\rho \left[\hat{p}\left(y\right),\hat{q}\right]} $$
(13)
$$ \rho \left[\hat{p}\left(y\right),\hat{q}\right]=\sum \limits_{u=1}^{m}\sqrt{{\hat{p}}_u\left(y\right)\,{\hat{q}}_u} $$
(14)

where ρ[∙] is the Bhattacharyya coefficient, \( \hat{q}={\left\{{\hat{q}}_u\right\}}_{u=1\dots m} \) (with \( \sum_{u=1}^{m}{\hat{q}}_u=1 \)) is the discrete density estimated from the m-bin histogram of the object model, and \( \hat{p}\left(y\right)={\left\{{\hat{p}}_u\left(y\right)\right\}}_{u=1\dots m} \) (with \( \sum_{u=1}^{m}{\hat{p}}_u\left(y\right)=1 \)) is estimated at a given location y from the m-bin histogram of the object candidate. However, \( \hat{q} \) does not change with time, which does not fit human cognition because the surroundings of the object cannot always stay the same in a real scene. On the other hand, an unchanging \( \hat{q} \) also increases the convergence cost because it takes more time to match the object candidate and the object model against a different background.
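For reference, the Bhattacharyya-based distance of Eqs. (13) and (14) can be computed from two normalized m-bin intensity histograms as follows; the bin count and the grayscale assumption are illustrative choices for the example.

```python
import numpy as np

def normalized_hist(patch, m=16):
    """m-bin intensity histogram normalized so that its entries sum to 1."""
    hist, _ = np.histogram(patch, bins=m, range=(0, 256))
    return hist / max(hist.sum(), 1)

def bhattacharyya_distance(q_hat, p_hat):
    """Eqs. (13)-(14): d(y) = sqrt(1 - sum_u sqrt(p_u(y) * q_u))."""
    rho = np.sum(np.sqrt(p_hat * q_hat))
    return np.sqrt(max(1.0 - rho, 0.0))
```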

To overcome the two problems mentioned above, we propose a constrained mean-shift method with two improvements. Firstly, the object model is updated in each frame in order to obtain a real-time \( \hat{q} \). Thus, the size of \( \hat{q} \) changes with the scale of the object. Meanwhile, pedestrians usually move slowly, which means their surrounding background changes little between adjacent frames. In this case, \( \hat{p}\left(y\right) \) can be quickly matched with \( \hat{q} \) in each frame. Secondly, we limit the shift range using the spatial information of the objects in adjacent frames. Let \( F_{n-1} \) and \( F_n \) denote frame n − 1 and frame n, \( R_{n-1}^{i} \) region i in \( F_{n-1} \), and \( R_{n}^{j} \) region j in \( F_n \); \( X_{n-1}^{i,1}, X_{n-1}^{i,2}, Y_{n-1}^{i,1}, \) and \( Y_{n-1}^{i,2} \) are the location elements of \( R_{n-1}^{i} \), and \( X_{n}^{j,1}, X_{n}^{j,2}, Y_{n}^{j,1}, \) and \( Y_{n}^{j,2} \) are the location elements of \( R_{n}^{j} \). After the location of the \( R_{n-1}^{i} \) candidate in \( F_n \) (expressed as \( X_{n}^{i,1}, X_{n}^{i,2}, Y_{n}^{i,1}, \) and \( Y_{n}^{i,2} \)) is determined by the conventional mean-shift algorithm in every iteration, we further refine this location by displacement terms denoted \( \lambda_x \) and \( \lambda_y \).

Let \( {\lambda}_{x}^{i,1}={X}_{n}^{j,1}-{X}_{n}^{i,1},\ {\lambda}_{x}^{i,2}={X}_{n}^{j,2}-{X}_{n}^{i,2},\ {\lambda}_{y}^{i,1}={Y}_{n}^{j,1}-{Y}_{n}^{i,1},\ \mathrm{and}\ {\lambda}_{y}^{i,2}={Y}_{n}^{j,2}-{Y}_{n}^{i,2} \) be the displacement terms; then the new position of the object is determined as follows:

$$ \left\{\begin{array}{l}{X}_{n}^{i}={X}_{n}^{i}+{\lambda}_{x}^{i,1},\ \mathrm{if}\ {\lambda}_{x}^{i,1}>0\\ {}{X}_{n}^{i}={X}_{n}^{i}+{\lambda}_{x}^{i,2},\ \mathrm{if}\ {\lambda}_{x}^{i,2}<0\end{array}\right. $$
(15)
$$ \left\{\begin{array}{l}{Y}_{n}^{i}={Y}_{n}^{i}+{\lambda}_{y}^{i,1},\ \mathrm{if}\ {\lambda}_{y}^{i,1}>0\\ {}{Y}_{n}^{i}={Y}_{n}^{i}+{\lambda}_{y}^{i,2},\ \mathrm{if}\ {\lambda}_{y}^{i,2}<0\end{array}\right. $$
(16)

As can be seen from Fig. 4, regions 1 and 2 in frame 1 are two individual object models, and the corresponding object candidates should be limited to region 3 in frame 2. In this case, the object group in frame 2 can be tracked separately as regions 4 and 5 (shown in the right image of Fig. 4).
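One way to read the refinement in Eqs. (15) and (16) is as shifting the whole candidate box back inside the detected region; the sketch below uses an (x1, x2, y1, y2) bounding-box convention of our own choosing for the location elements.

```python
def constrain_to_region(candidate, detected):
    """Refine a mean-shift candidate box against the detected region (Eqs. (15)-(16)).

    candidate: (x1, x2, y1, y2) location of region i's candidate in frame n,
               as returned by a conventional mean-shift iteration.
    detected:  (x1, x2, y1, y2) location of region j detected in frame n.
    Returns the refined candidate box.
    """
    cx1, cx2, cy1, cy2 = candidate
    dx1, dx2, dy1, dy2 = detected

    lam_x1, lam_x2 = dx1 - cx1, dx2 - cx2   # lambda_x^{i,1}, lambda_x^{i,2}
    lam_y1, lam_y2 = dy1 - cy1, dy2 - cy2   # lambda_y^{i,1}, lambda_y^{i,2}

    # Eq. (15): shift right if the candidate sticks out on the left, left if on the right.
    if lam_x1 > 0:
        cx1, cx2 = cx1 + lam_x1, cx2 + lam_x1
    elif lam_x2 < 0:
        cx1, cx2 = cx1 + lam_x2, cx2 + lam_x2

    # Eq. (16): the same refinement along the vertical axis.
    if lam_y1 > 0:
        cy1, cy2 = cy1 + lam_y1, cy2 + lam_y1
    elif lam_y2 < 0:
        cy1, cy2 = cy1 + lam_y2, cy2 + lam_y2

    return (cx1, cx2, cy1, cy2)
```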

Experimental Results

Dataset Description and Evaluation Criteria

To evaluate the performance of our foreground detection and object tracking methods, a publicly available database, the 03 OSU Color-Thermal Database from OTCBVS, is employed here. Thermal sequences are captured by a Raytheon PalmIR 250D thermal sensor and color sequences by a Sony TRV87 Handycam color sensor. All frames in both sequences have a spatial resolution of 320 × 240 pixels. The numbers of frames in the video sequences are 2107 (Sequence-1), 1201 (Sequence-2), 3399 (Sequence-3), 3011 (Sequence-4), 4061 (Sequence-5), and 3303 (Sequence-6), respectively. Figure 5 shows some visible and thermal frames and the results of our foreground detection method. For our foreground detection method, we perform both qualitative (Fig. 6) and quantitative (Table 4) analyses against six state-of-the-art methods, i.e., GMG [21], IMBS [22], LOBSTER [27], MultiCue [26], SuBSENSE [28], and T2FMRF [23], on some manually segmented silhouettes. For our object tracking method, we conduct comprehensive qualitative experiments on all video sequences (Fig. 7).

Fig. 5
figure 5

Visual results of proposed foreground detection algorithm. a Sequence-1. b Sequence-2. c Sequence-3. d Sequence-4. e Sequence-5. f Sequence-6

Fig. 6
figure 6

Visual comparison. a Original images. b Ground truth. c–j Saliency maps generated by different methods

Fig. 7
figure 7

Visual tracking results of the proposed approach across different images and scenarios. a–f Sequence-1–6

For quantitative performance assessment of the proposed foreground detection algorithm, several commonly used metrics are adopted in our experiments, including precision, recall, and F measure. The precision value P and recall value R are determined by \( P=\frac{T_p}{T_p+{F}_p},\ R=\frac{T_p}{T_p+{F}_n} \), where T p , F p , and F n refer, respectively, to the number of correctly detected foreground pixels of the pedestrians, incorrectly detected foreground pixels (false alarms), and incorrectly detected background pixels (i.e., missed pixels of the object). These three numbers are calculated by comparing the binary masks of the detected image and the ground truth. Since the database does not provide ground truth, we manually segmented the pedestrian regions in 53 frames from Sequence-1. The F measure is defined by \( {F}_{\mathrm{measure}}=\frac{2\bullet P\bullet R}{P+R} \).
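A straightforward implementation of these pixel-level metrics, assuming binary NumPy masks in which nonzero values denote foreground, is given below.

```python
import numpy as np

def detection_scores(detected_mask, gt_mask):
    """Precision, recall, and F measure from binary foreground masks."""
    detected = detected_mask.astype(bool)
    gt = gt_mask.astype(bool)
    tp = np.logical_and(detected, gt).sum()    # correctly detected foreground pixels
    fp = np.logical_and(detected, ~gt).sum()   # false alarms
    fn = np.logical_and(~detected, gt).sum()   # missed foreground pixels
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_measure
```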

Key Parameter Selection

In this paper, we carefully choose the key parameters by investigating their effect on performance. The influence of several key parameters in the proposed approach is discussed as follows. For the adaptive Gaussian mixture model, the key parameters are the learning rate α, the background portion threshold T, and the number of Gaussian distributions K. Tables 1, 2, and 3 summarize the performance when each of these three parameters is varied. From Table 1, we can see that the precision slightly increases with a rising learning rate, while the recall shows the opposite trend. As the learning rate decides how many recent frames are used for training, the larger the learning rate, the fewer recent frames are used. Generally, with fewer recent frames used to predict the background, the local information is more detailed. Conversely, more recent frames make the background model more global and robust to local inconsistency. Hence, the learning rate can be neither too large nor too small, and it is set to 0.002 (500 recent frames) in this paper based on our practical measurement [37]. From Table 2, we find that the precision grows with increasing T, yet the recall again shows the opposite tendency. The reason is that when the portion of background is increased, some foreground regions or noise may be considered background. Although this somewhat increases the precision, the recall drops sharply. Conversely, if the portion of background is too small, many noisy regions are considered foreground; as a result, the precision degrades significantly due to the growth of false alarms, while the recall does not increase much. Though T was set to 0.6 in [37], we empirically choose T = 0.7 for its better performance. Table 3 shows that the number of Gaussian distributions does not affect the performance much as long as it is larger than 2. Therefore, we set K = 5 as suggested in [38].

Table 1 Key parameter α analysis
Table 2 Key parameter T analysis
Table 3 Key parameter K analysis

Assessment of Foreground Detection Method

To evaluate the quality of the extracted foreground map, we compare our proposed method with six state-of-the-art methods in terms of precision, recall, and F measure, with the results shown in Table 4. For a fair comparison, instead of just comparing our fusion result with the other methods’ results on visible images, we apply the same fusion strategy to each method, where I vis and I thm are generated by those methods on the visible and thermal images, respectively. From Table 4, we can see that the precision of the proposed foreground detection is comparable with GMG [21] and LOBSTER [27], while both the recall and F measure of our method outperform the other methods. IMBS, MultiCue, and T2FMRF yield poor performance because their algorithms do not take much account of scene changes. Although these methods work well on some indoor and outdoor data, those data do not contain much lighting change. In the 03 OSU Color-Thermal Database from OTCBVS, however, clouds cast large shadows on the ground and the lighting of the scene changes over time. GMG, LOBSTER, and SuBSENSE achieve similar performance and are very comparable with our proposed method.

Table 4 Comparison of precision, recall, and F measure values

However, these methods are mainly designed for object detection in small scenes, where objects are usually larger than the pedestrians in a surveillance system. Therefore, these methods can detect pedestrians at close or middle range but not at long range from the camera. In addition, affected by lighting changes and weather conditions, some details are lost. As can be seen in the visible images in Fig. 6, some pedestrians’ shapes produced by GMG are not well integrated or are fractured, e.g., the left person in the first image is split into two regions, and some pedestrians far away from the camera cannot be detected by SuBSENSE, e.g., in the fifth and sixth images. Hence, these methods have good quantitative results, but their qualitative results do not fit human cognition. In contrast, our foreground detection result is generated by a two-stage background subtraction procedure and a fusion strategy in which cognition-based knowledge is applied to refine the procedure and guide the fusion.

Although our proposed method yields the best performance in terms of F measure, there is still room for further improvement. As seen, our proposed method produces a high recall value but a relatively low precision value, just like the other methods. There are two main reasons, i.e., missed detections and inaccurate ground truth mapping. The missed detections are mainly due to failures in detecting objects dressed in colors similar to the background or located behind obstacles. This could possibly be improved by introducing certain post-processing such as back-tracking, though it remains challenging for small objects that are frequently grouped together. This also explains the limited accuracy of the ground truth, as in some cases the silhouettes of the pedestrians can hardly be defined accurately even manually.

Assessment of Object Tracking Method

To validate the performance of the proposed object tracking approach, all video sequences are used in our experiments. In Fig. 7, detection and tracking results from these sequences are given to illustrate the extracted/tracked objects using their bounding boxes. As can be seen, the proposed method gives reliable pedestrian detection and tracking results under various conditions, including occlusions, illumination changes, and scale changes. When the pedestrians are separate, we detect them very well with properly scaled bounding boxes. We can also identify people even when they overlap, as in the first and third images in Fig. 7d. In addition, the method copes with occlusions caused by trees or walls, as in the second and third images in Fig. 7d, the first image in Fig. 7e, the second and third images in Fig. 7f, the sixth image in Fig. 7b, and the sixth image in Fig. 7c. When an object is leaving the scene, as in the third image in Fig. 7e and the fourth and sixth images in Fig. 7f, we can still locate the object and track its motion.

However, some failure cases, such as the third and fourth images in Fig. 7a and the second image in Fig. 7c, still exist in our tracking results. There are two main reasons. The first is that some pedestrians walk together as a group from the beginning to the end of the sequence; therefore, our tracking system always considers the pedestrian group as a single object. The second is that if one pedestrian leaves a group of pedestrians and joins another group, the tracking system cannot extract that pedestrian’s own color, texture, and spatial features. As a result, the mean-shift method may fail to track in such a context.

Conclusion

In this paper, we proposed a cognitive model that fuses visible and thermal images for pedestrian detection, along with an improved mean-shift method for tracking pedestrians in videos. There are three key components in this model, i.e., foreground detection, fusion-based object refinement, and object tracking. By estimating the background model followed by a two-stage background subtraction, foreground objects can be successfully detected. A shape-constrained morphological-filtering-based fusion strategy helps to further refine the detected foreground objects. Finally, prediction-based forward and backward tracking is found particularly useful for separating overlapped or occluded objects and is robust to scale changes. However, if a certain pedestrian in a group cannot be detected individually from the beginning to the end, the tracking system will fail to estimate that pedestrian’s own track and will only estimate the track of the group instead. In future work, we will introduce deep learning models to further enhance the foreground detection performance and improve the tracking procedure in order to precisely estimate the objects’ tracks even in challenging situations.