Motion detection and tracking technology is one of the core subjects in the field of computer vision and has wide practical value in night-vision research. Traditional learning-based detection and tracking algorithms require many samples and complex models, which are difficult to implement, and their robustness in complex scenes is weak. This chapter introduces a series of non-learning methods for infrared small-target detection and for motion detection and tracking, based on imaging spatial structure, which are robust to complex scenes.

7.1 Target Detection and Tracking Methods

7.1.1 Investigation of Infrared Small-Target Detection

Infrared small-target detection is an important issue in computer vision and in military fields. With the development of infrared imaging technology, infrared sensors now produce high-resolution images, which facilitate object detection. However, the task still faces challenges owing to the complexity of scenes. Various methods have been proposed for infrared small-target detection, such as the morphological top-hat filter of Drummond (1993). Yang et al. (2004) designed an adaptive Butterworth high-pass filter for infrared small objects. Gu et al. (2010) proposed a kernel-based nonparametric regression method for background prediction and small-target detection. Bae et al. (2012) used an edge-directional 2D least-mean-square filter for small and dim infrared target detection. Li et al. (2014) proposed a novel infrared small-target detection method in the compressive domain. Despite these methods, a few common issues endure, related to the complexity of patterns in infrared images, and these methods often fail to detect small objects when the scenes become more challenging. Aiming to solve this problem, a novel infrared small-target detection method using sparse errors and structure differences is introduced. The method exploits sparse errors and structure differences between target and background regions, and a simple fusion method is then applied to generate the final detection result (see Fig. 7.1).

Fig. 7.1
figure 1

Pipeline of infrared small-target detection

7.1.2 Moving Object Detection Based on Non-learning

Accompanied by the increasing number of night-vision sensors, numerous videos are being produced. Thus, it is essential to detect targets and motion automatically, which plays an important role in intelligence operations. Over the last 20 years, several methods have been proposed based on cascade classification (Viola and Jones 2001) and Bayes’ rule (Muesera et al. 1999). To determine the parameters of their classifiers, traditional detection methods (Sanchez et al. 2007; Shen and Fan 2010) require extensive learning, which leads to low real-time performance. Accordingly, methods without learning have gradually been put forward.

In 2007, Takeda et al. (2007) proposed classic kernel regression to recover the high-frequency information of images, which is used for denoising. In 2009, Milanfar (2009) studied the adaptive kernel regression method to remove noise, enhance image details and detect targets. In the same year, Seo and Milanfar (2009a, b) made further efforts and proposed the locally adaptive regression kernel (LARK) method, a new non-parametric method for detecting targets. A few years later, Seo and Milanfar (2009a, b, 2011a, b, c) improved the robustness of the regression kernel in several aspects. However, the matching algorithm presented by Seo and Milanfar (2011a, b, c) (hereafter called the ‘Seo algorithm’) is not suitable for non-compact targets, such as human actions. The overall template, including background, was matched against the test video, which limited the choice of test video scenes: recognition accuracy relies on the background similarity between the template and the test video. Only when the background of a test video is quite similar to the template are the results satisfactory. Conversely, when the view angle changes or the scene becomes complex, the outcomes are disappointing. Wang (2012) amplified template images and divided them into many parts to detect human faces. The template only contains a face, which gives us the inspiration to remove the background. Furthermore, when actions are partially obscured by the landscape, matching with the overall template fails to recognise them.

Inspired by the trend towards detection in video streaming, a spatiotemporal multiscale statistical matching (SMSM) model, based on a new weighted 3D Gaussian difference LARK (GLARK), was introduced, which recognises actions in video more efficiently.

7.1.3 Research on Target Tracking Technology

Target tracking is a key technology of scene perception. Target information, such as moving trajectory, target position and velocity, is the basis for subsequent video image processing, such as target recognition and behavioural analysis.

Many target tracking algorithms have been proposed in recent years (Ming 2016; Rong et al. 2010). The continuously adaptive mean shift (CAMSHIFT) tracking algorithm proposed by Bradski was based on colour information. It effectively solved the problem of target deformation and size scaling and consumed little time. However, it was not good at tracking fast-moving targets or targets in complex backgrounds. Several improved CAMSHIFT tracking algorithms (Hong et al. 2006; Chun and Shi 2015; Xing et al. 2012; Ran 2012) then improved the stability of tracking. However, these algorithms needed distinctive target colours, high-quality images and simple backgrounds. For complex background scenes, there are now many popular online learning-based tracking algorithms. Compressive tracking and its improved algorithm (Kai 2012; Chong 2014) showed good real-time performance and good robustness to target occlusion and appearance changes, but their fixed tracking window could not adapt to scale changes, leading to tracking drift. The visual tracking algorithm based on spatiotemporal context information and its improvement (Zhang et al. 2013; Guo 2016) both required little time. However, because they rely on simple features to obtain the statistical relevance of the target to the surrounding area, these algorithms fail to track when the target moves too fast or is occluded. Additionally, because of low contrast, lack of colour information and small grayscale dynamic range, infrared image tracking has become a hot topic in tracking research. The classical mean shift tracking algorithm of Zhang et al. (2006) used grey information for real-time tracking. However, it was vulnerable to interference from backgrounds with similar grey information and failed to track when the target size changed. Meng (2015) proposed an improved mean shift algorithm with a weighted kernel histogram and brightness–distance space. It successfully tracked rigid infrared targets but could not track a variety of non-rigid targets, such as humans and animals. The mean shift algorithm with speeded-up robust features (SURF), proposed by Zhang et al. (2011), solved the tracking problem of target scale change in the ideal state. However, it could not track small or weakly textured targets because too few feature points could be extracted or matched.

In summary, these tracking algorithms cannot perform target tracking well in a complex background. A tracking model based on global LARK feature matching and the CAMSHIFT algorithm is thus introduced in Sect. 7.3.1. Because the LARK feature is not sensitive to the grey value of each point or to the specific object in the picture, but is more sensitive to changes of grey gradient and graphic structure (Seo and Milanfar 2009a, b, 2011a, b, c; Wang 2012), it is possible to distinguish between rigid or compact targets and background areas by combining SSIM maps and colour probability distributions. To track non-compact infrared targets, we also propose a local LARK feature statistical matching method. LARK features can describe the essential characteristics of weak-edge targets well in an infrared image. Whereas the overall structure of the same target is dissimilar in different forms, there are some similarities in local fine structure (Luo et al. 2015). Thus, global matching is transformed into local matching statistical analysis in Sect. 7.3.2. Combined with infrared image characteristics and microstructure similarity statistical analyses, the introduced method can well distinguish between backgrounds and infrared targets in different forms.

7.2 Infrared Small Object Detection Using Sparse Error and Structure Difference

7.2.1 Framework of Object Detection

Sparse Error. There exist appearance divergences between target and background regions. Generally, the boundary regions of the image are considered background regions. Background templates are constructed from these image boundaries, and the entire image is reconstructed by sparse appearance error modelling, each image patch being represented by bases learnt from a set of infrared image patches. First, a given infrared image is taken as input, and SLIC (Achanta et al. 2012) is used to segment the input image into multiple uniform regions. Each segment is described by \( p = \left\{ {x,y,lu,g_{x} ,g_{y} } \right\} \), where lu is the luminance information (Wang et al. 2015), \( g_{x} \) and \( g_{y} \) are the gradient information, and x and y indicate the coordinates of a pixel. The entire infrared image is represented as \( P = \left[ {p_{1} ,p_{2} , \ldots ,p_{N} } \right] \), where N is the number of segments. The boundary segments, d, are extracted from P as bases, and the background template set, \( D = \left[ {d_{1} ,d_{2} ,d_{3} , \ldots ,d_{M} } \right] \), is constructed, where M is the number of image boundary segments. Given the background templates, the sparse error between object and background regions can be computed; reconstructed from the same bases, object and background regions should differ greatly. Then, each image segment is encoded as

$$ a_{i} = \arg \min_{{a_{i} }} \left\| {p_{i} - Da_{i} } \right\|_{2}^{2} + \lambda \left\| {diag(w_{i} )a_{i} } \right\|_{ 1} , $$
(7.1)

where \( \uplambda \) is the regularisation parameter and \( w_{i} \) is used to represent the weight for segment \( d_{i} \), computed as

$$ w_{i} = \frac{1}{{H(d_{i} )}}\sum\limits_{{j \in H(d_{i} )}} {\exp (\frac{{\left\| {d_{j} - d_{i} } \right\|^{2} }}{{2\sigma^{2} }})} , $$
(7.2)

where \( H(d_{i} ) \) denotes the number of \( d_{i} \)’s neighbours. This weight computes the similarity between segment \( d_{i} \) and its surrounding segments. A large \( w_{i} \) suppresses the corresponding entry of \( a_{i} \), forcing it towards zero, whereas a small \( w_{i} \) allows it to be nonzero. The weight for a background template should be proportional to the similarity between segments in the image boundaries.

Then the regulated reconstruction error for each segment is computed as

$$ E_{i} = \left\| {p_{i} - Da_{i} } \right\|_{2}^{2} . $$
(7.3)

Coarse regions of interest can be easily estimated from the regulated sparse errors: segments with large reconstruction errors correspond to target regions, which differ from background regions.
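As a minimal sketch of this step (not the authors' implementation), the weighted \( \ell_{1} \) problem in Eq. (7.1) can be reduced to an ordinary lasso by rescaling the dictionary columns, after which the reconstruction error of Eq. (7.3) follows directly. The arrays P, D and w below are hypothetical stand-ins for the segment descriptors, boundary templates and weights.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_errors(P, D, w, lam=0.01):
    """P: (f, N) segment descriptors; D: (f, M) boundary templates; w: (M,) weights.
    Returns one regulated reconstruction error per segment (Eq. 7.3)."""
    D_scaled = D / w[np.newaxis, :]          # absorb diag(w) into the dictionary
    errors = np.zeros(P.shape[1])
    for i in range(P.shape[1]):
        # min ||p - D a||^2 + lam * ||diag(w) a||_1 (up to sklearn's scaling of lam)
        lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
        lasso.fit(D_scaled, P[:, i])
        a = lasso.coef_ / w                  # undo the rescaling
        errors[i] = np.sum((P[:, i] - D @ a) ** 2)
    return errors
```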

Structure Difference

Objects often have structure information different from that of background regions. In this section, structure differences between object and background regions are exploited via region covariances. Let F denote the feature image extracted from the input image, I, as \( F =\Gamma (I) \), where \( \Gamma \) denotes a mapping function that extracts a k-dimensional feature vector from each pixel in I. A region, \( q_{u} \), inside F can be represented by a \( k \times k \) covariance matrix, \( C_{{q_{u} }} \)

$$ C_{{q_{u} }} = \frac{1}{n - 1}\sum\limits_{i = 1}^{n} {(q_{i} - \mu )(q_{i} - \mu )^{T} } , $$
(7.4)

where \( q_{i} ,i = 1, \ldots ,n \) denotes the k-dimensional feature vectors inside region \( q_{u} \), and \( \mu \) denotes the mean of these feature vectors. In this section, \( k = 5 \) features (i.e. \( x,y,lu,g_{x} ,g_{y} \)) are used to build the region feature. The structure difference map is computed based on two different covariances:

$$ G(q_{u} ,q_{v} ) = \psi (C_{{q_{u} }} ,C_{{q_{v} }} ), $$
(7.5)

where \( \psi (C_{{q_{u} }} ,C_{{q_{v} }} ) \) is used to compute the similarity between two covariances (Karacan et al. 2013).

This covariance matrix can better capture local image structure data and can effectively estimate the structure differences between objects and background regions. Thus, object regions have higher values of G than background regions.
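The following is a rough sketch of computing a region covariance descriptor and one plausible choice of \( \psi \) (a Gaussian of the log generalised-eigenvalue distance between covariances); the exact similarity used by Karacan et al. (2013) may differ, and the function names are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def region_covariance(feats):
    """feats: (n, k) feature vectors (x, y, lu, gx, gy) of one region (Eq. 7.4)."""
    return np.cov(feats, rowvar=False)

def covariance_similarity(C_u, C_v, eps=1e-6):
    """Similarity of two region covariances via their generalised eigenvalues."""
    k = C_u.shape[0]
    lam = eigh(C_u + eps * np.eye(k), C_v + eps * np.eye(k), eigvals_only=True)
    dist = np.sqrt(np.sum(np.log(lam) ** 2))   # Foerstner-style distance
    return np.exp(-dist)                       # larger value = more similar structure
```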

Fusing Error and Structure

The two maps obtained from these two different views are fused into a final map via a linear combination whose weights are learnt:

$$ S = \sum\limits_{t = 1}^{T} {\beta_{t} S^{t} } , $$
(7.6)

where \( \left\{ {\beta_{t} } \right\} = {\text{argmin}}\sum {\left\| {S - \sum\nolimits_{t} {\beta_{t} S^{t} } } \right\|}_{F}^{2} \), T denotes the number of maps and \( S^{t} \) denotes either the error map or the structure difference map. The weights are learnt using a least-squares estimator; the problem is solved with the conditional random field solution of Liu et al. (2010). Finally, as in the previous method of Li et al. (2014), the final target of interest is extracted by a threshold method, defined as

$$ S^{\prime}(x,y) = \left\{ {\begin{array}{*{20}l} 1 \hfill & {S(x,y) \ge \gamma S_{\hbox{max} } } \hfill \\ 0 \hfill & {\text{others}} \hfill \\ \end{array} ,} \right. $$
(7.7)

where \( S_{ \hbox{max} } \) is the maximum value of S and \( \gamma = 0.6 \) is the threshold value (Fig. 7.2).
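As a minimal sketch of Eqs. (7.6) and (7.7), assuming a reference map S_ref is available for learning the fusion weights (the original learning setup is not spelt out here), the fusion and thresholding could look as follows; all names are illustrative.

```python
import numpy as np

def fuse_maps(maps, S_ref):
    """maps: list of T maps (H, W); S_ref: reference map used to learn the weights."""
    A = np.stack([m.ravel() for m in maps], axis=1)            # (H*W, T)
    beta, *_ = np.linalg.lstsq(A, S_ref.ravel(), rcond=None)   # least-squares weights
    return (A @ beta).reshape(S_ref.shape), beta               # Eq. (7.6)

def threshold_map(S, gamma=0.6):
    """Eq. (7.7): keep pixels above gamma times the maximum response."""
    return (S >= gamma * S.max()).astype(np.uint8)
```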

Fig. 7.2
figure 2

Image decomposition results: a input image; b structure component of the input image; c reconstruction error map; d our final result

7.2.2 Experimental Results

The empirical values, \( \sigma = 3 \), \( N = 400 \), \( T = 2 \) and \( \lambda = 0.01 \), are chosen for the introduced method. The input image is \( 480 \times 640 \) pixels. The method is compared with other existing methods, including the top-hat (TH) filter of Drummond (1993) and the compressive domain (CD) method of Li et al. (2014). Signal-to-clutter ratio gain (SCRG) and background suppression factor (BSF) are used to evaluate these methods. These two metrics are defined as

$$ SCRG = 20 \times \log_{10} \left( {\frac{{(U/C)_{\text{out}} }}{{(U/C)_{\text{in}} }}} \right),\quad BSF = 20 \times \log_{10} \left( {\frac{{C_{\text{in}} }}{{C_{\text{out}} }}} \right), $$
(7.8)

where U is the average target intensity and C is the clutter standard deviation in an infrared image. In this experiment, four different scenes are selected to verify the performance of the introduced method. The first row of Fig. 7.3 presents the four types of inputs. The second row illustrates the predicted target results produced by TH, and the third row indicates the detection results of CD. The introduced method produces the detection results in the fourth row of Fig. 7.3. From these visual examples, we can see that the introduced method accurately detects the infrared small targets, because it exploits sparse errors and structure differences, and these cues better represent the details of the infrared target. However, TH uses a simple top-hat filter to detect the infrared target and produces some false detections (second row of Fig. 7.3). CD improves on TH and performs well in the four scenes. However, some infrared small objects are not well detected, which increases the miss rate of object detection; it is thereby less accurate for infrared small object detection (third row of Fig. 7.3).
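As a hedged sketch of how SCRG and BSF in Eq. (7.8) might be computed, assuming binary masks marking the target region in the input and output images (the exact measurement windows used in the experiments are not specified here):

```python
import numpy as np

def scr(image, target_mask):
    """Signal-to-clutter ratio U/C and clutter C of one image (boolean target_mask)."""
    U = image[target_mask].mean()          # average target intensity
    C = image[~target_mask].std()          # clutter standard deviation
    return U / C, C

def scrg_bsf(img_in, img_out, mask_in, mask_out):
    (uc_in, c_in), (uc_out, c_out) = scr(img_in, mask_in), scr(img_out, mask_out)
    scrg = 20 * np.log10(uc_out / uc_in)   # Eq. (7.8), left
    bsf = 20 * np.log10(c_in / c_out)      # Eq. (7.8), right
    return scrg, bsf
```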

Fig. 7.3
figure 3

Qualitative visual results of four scenes: a three targets in the wild; b two targets in the sky; c three targets in the sea; d one target in the sky

To further assess the introduced method against other competing methods, SCRG and BSF are used to evaluate performance. The quantitative results for all the test images are shown in Table 7.1. The introduced method clearly obtains the highest evaluation scores among the compared methods. Thus, the introduced method has a superior ability to detect infrared small targets in complex scenes, demonstrating its effectiveness and robustness. This is because the framework is carefully designed around the characteristics of infrared small targets: using sparse errors and structure differences, the target candidate areas can be well detected, and a simple fusion framework further improves the accuracy.

Table 7.1 Quantitative evaluations (SCRG/BSF) of four infrared scenes

7.3 Adaptive Mean Shift Algorithm Based on LARK Feature for Infrared Image

7.3.1 Tracking Model Based on Global LARK Feature Matching and CAMSHIFT

The tracking model based on global LARK feature matching and CAMSHIFT mainly uses colour information and LARK structure to calculate the probability distribution of the target in the processing image. The mean shift algorithm is then used to obtain the target centre and its size from the probability map. To shorten the matching time, each frame only processes a region twice the size of the previous frame's target area. The model process is shown in Fig. 7.4.

Fig. 7.4
figure 4

Tracking model based on global LARK feature matching and CAMSHIFT

Principle of LARK Feature Matching

The calculation principle of local kernel values of LARK is described by Takeda et al. (2007), Milanfar (2009), and Luo et al. (2015). To reduce the influence of external interference (e.g. light), the local kernel of each point is normalised, as shown in Eq. (7.9).

$$ k_{i}^{l} = K_{i}^{l} \Big/\sum\limits_{l = 1}^{{P^{2} }} {K_{i}^{l} } ,\quad k_{i} \in R^{{P^{2} \times 1}} ,\;i \in [1,N],\;l \in [1,P^{2} ], $$
(7.9)

where \( K_{i}^{l} \) is the local kernel, N is the total number of pixels in the image and \( P^{2} \) is the number of pixels in the local window. The normalised local kernel of a point in the image is sorted by column as a weight vector:

$$ w_{i} = \left[ {k_{i}^{1} ,k_{i}^{2} , \ldots ,k_{i}^{{P^{2} }} } \right]^{T} . $$
(7.10)

Next, a window of \( P \times P \) pixels is used to traverse the entire image to obtain its weight matrix:

$$ W = \left[ {w_{1} ,w_{2} , \ldots ,w_{N} } \right] \in R^{{P^{2} \times N}} . $$
(7.11)

After obtaining the LARK weight matrices, \( W\_Q \) and \( W\_T \), which represent the features of the template image and the processing image, PCA is used to reduce the redundancy of \( W\_Q \), preserving only the first d principal components to constitute the matrix, \( A_{Q} \in R^{{P^{2} \times d}} \). The feature matrices, \( F_{Q} \) and \( F_{T} \), are then calculated from \( A_{Q} \) as in Wang (2012).

$$ F_{Q} = \left[ {f_{Q}^{1} ,f_{Q}^{2} , \ldots ,f_{Q}^{N} } \right] = A_{Q}^{T} W_{Q} , $$
(7.12)
$$ F_{T} = \left[ {f_{T}^{1} ,f_{T}^{2} , \ldots ,f_{T}^{M} } \right] = A_{Q}^{T} W_{T} , $$
(7.13)

where N and M are the total numbers of pixels in the template and processing images, respectively. Then, LARK feature matching is performed. First, a moving window of the same size as the template image is moved pixel-by-pixel over the processing image. Then, the cosine similarity between \( F_{{T_{i} }} \) of the moving window and \( F_{Q} \) is calculated, as shown in Eq. (7.14).

$$ \rho (f_{Q} ,f_{{T_{i} }} ) = \frac{{f_{Q}^{T} f_{{T_{i} }} }}{{\left\| {f_{Q} } \right\|\left\| {f_{{T_{i} }} } \right\|}} = \cos \theta \in \left[ { - 1,1} \right], $$
(7.14)

where \( f_{Q} \) and \( f_{{T_{i} }} \) are column vectors formed by stacking \( F_{Q} \) and \( F_{{T_{i} }} \) in pixel column order. Lastly, a similarity map is constructed with the mapping function shown in Eq. (7.15).

$$ f(\rho ) = \frac{{\rho^{2} }}{{1 - \rho^{2} }}. $$
(7.15)
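A compact sketch of this matching step (Eqs. 7.14 and 7.15) is given below, assuming the feature matrices \( F_{Q} \) for the template and \( F_{{T_{i} }} \) for each candidate window have already been computed; the function names are illustrative.

```python
import numpy as np

def lark_similarity(F_Q, F_Ti):
    """Cosine similarity (Eq. 7.14) between stacked features, mapped through Eq. (7.15)."""
    f_Q, f_Ti = F_Q.ravel(order='F'), F_Ti.ravel(order='F')   # stack columns
    rho = f_Q @ f_Ti / (np.linalg.norm(f_Q) * np.linalg.norm(f_Ti) + 1e-12)
    return rho ** 2 / (1.0 - rho ** 2 + 1e-12)

def similarity_map(window_features, F_Q):
    """window_features: per-window feature matrices obtained by sliding over the image."""
    return np.array([lark_similarity(F_Q, F_Ti) for F_Ti in window_features])
```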

Framework of Tracking Algorithm

The global LARK feature matching tracking model introduced in this section (Algorithm 1) contains four parts: target selection, LARK feature extraction, target probability map calculation and target position search. The process of obtaining the weighted fusion target probability map is shown in Fig. 7.5.

Fig. 7.5
figure 5

Process diagrams

Algorithm 1

Target tracking model based on global LARK feature matching:

  • Manually select the tracking target as a template and calculate its \( W_{Q} \).

    • Extract two times the target area as the processing image, calculate its \( W_{T} \).

    • Apply PCA to \( W_{Q} \) and \( W_{T} \), and obtain feature matrix \( F_{Q} \) and \( F_{T} \).

    • Transform RGB into HSV, and use the H component to get original target probability graph.

    • Apply LARK feature matching to obtain a structure similarity graph and normalise it.

    • Obtain the weighted fusion target probability map.

    • Apply adaptive mean shift algorithm to locate target position.

    • Loop the second to the fourth step to achieve tracking.
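The following is a rough sketch of one tracking iteration built on OpenCV's back projection and CamShift, fusing the colour probability with the normalised LARK structure similarity map. The fusion weight alpha, the histogram hist (e.g. the H-channel histogram of the selected template from cv2.calcHist) and the function name are assumptions for illustration, not the exact implementation described above.

```python
import cv2
import numpy as np

def track_step(frame_hsv, hist, lark_map, window, alpha=0.5):
    """One tracking iteration: fuse colour back projection with the LARK
    similarity map (both in [0, 1]) and locate the target with CamShift."""
    back_proj = cv2.calcBackProject([frame_hsv], [0], hist, [0, 180], 1)
    prob = alpha * back_proj.astype(np.float32) / 255.0 + (1 - alpha) * lark_map
    prob8 = cv2.convertScaleAbs(prob * 255)                   # 8-bit probability map
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    rot_rect, window = cv2.CamShift(prob8, window, criteria)  # adaptive mean shift
    return rot_rect, window                                   # rotated box and next window
```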

7.3.2 Target Tracking Algorithm Based on Local LARK Feature Statistical Matching

The previous section described a LARK feature global matching method used to obtain a structure similarity map. When tracking non-compact targets, such as pedestrians, their deformation is random, complex and diverse. The LARK features of the same target under different forms are dissimilar in overall structure, whereas the local structures remain highly similar. Thus, in this section, global matching is turned into local matching, and the number of similar local structures is analysed statistically. Infrared images have no colour information and lack rich grey information, but the LARK feature can still describe the basic structural features of infrared targets with weak edges. By combining grey information with local similarity statistics to obtain the target probability map, we can distinguish the non-compact infrared target from the background and realise the subsequent iterative search tracking.

The tracking process, based on local LARK feature statistical matching, is shown in Fig. 7.6. Because infrared images have no colour information, we first obtain the original target probability map according to the grey values. As described in Sect. 7.3.1, every column vector of a feature matrix represents a local structure feature of the image. To reflect the structural features with the least number of feature vectors, cosine similarity is used to reduce the redundancy of the feature matrix. The cosine similarity of Eq. (7.14) is calculated between each \( f_{Q}^{i} \) of the feature matrix, \( F_{Q} \), and the other N − 1 columns. If it exceeds the threshold, \( t1 \), the two vectors are considered similar, and only one of them is preserved; otherwise, both are preserved. After removing the redundancy, the local features of the feature matrix \( F^{\prime}_{Q} = \left[ {f_{Q}^{{1{\prime }}} ,f_{Q}^{{2{\prime }}} , \ldots ,f_{Q}^{{n{\prime }}} } \right] \), \( n < N \), are non-repetitive. This effectively avoids the influence of similar structures originally present in the template image on the statistical matching. Then, the number of similar local structures between the feature matrices of the template image and the processing image is analysed statistically. The cosine similarity matrix is established by calculating the cosine similarity of each vector of \( F_{T} \) with each vector of \( F^{\prime}_{Q} \), as shown in Eq. (7.16).

Fig. 7.6
figure 6

Tracking model based on local LARK feature statistical matching

$$ \rho_{L} = \rho \left\langle {F_{T} ,F_{Q}^{{\prime }} } \right\rangle = \left[ {\begin{array}{*{20}c} {\rho_{11} } & \cdots & {\rho_{1n} } \\ \vdots & \ddots & \vdots \\ {\rho_{M1} } & \cdots & {\rho_{Mn} } \\ \end{array} } \right], $$
(7.16)

where \( \rho_{ij} \) is the cosine similarity of the ith column of \( F_{T} \) and the jth column of \( F^{\prime}_{Q} \). The closer the cosine similarity is to 1, the more similar the local structure represented by the two vectors and the greater the likelihood of the target. Thus, the maximum value of each row is extracted from the matrix, \( \rho_{L} \), and its column location in \( F^{\prime}_{Q} \) is preserved in the index matrix, \( {\text{index}}_{L} \).

$$ \rho^{\prime}_{L} = \left[ {{ \hbox{max} }(\rho_{{1k_{1} }} ),{ \hbox{max} }(\rho_{{2k_{2} }} ), \ldots ,{ \hbox{max} }(\rho_{{Mk_{n} }} )} \right]^{T} , $$
(7.17)
$$ \begin{array}{*{20}c} {{\text{index}}_{L} = \left[ {x_{1} ,x_{2} , \ldots ,x_{M} } \right]^{T} } & {x_{1} ,x_{2} , \ldots ,x_{M} \in \left[ {1,2, \ldots ,n} \right]} \\ \end{array} . $$
(7.18)

The resulting position index matrix, \( {\text{index}}_{L} \), and the maximum similarity matrix, \( \rho^{\prime}_{L} \), are sorted by pixel column order. The similarity threshold, \( t 2 \), is set to reduce interference of the less similar local structure. If the similarity value is greater than \( t 2 \), it is considered to be an effective similarity. Otherwise, it is considered invalid, and the index value of the corresponding position in \( {\text{index}}_{L} \) is set to zero.

The index values in the matrix represent the positions of similar structures in the template image. The more distinct index values a local window contains, the more similar local structures it contains, and thus the greater the probability of the target. Finally, a fixed-size local window is selected to traverse the index matrix. If the number of nonzero pixels of the window in the original target probability map is greater than a certain threshold, the number of non-repetitive index values is counted; otherwise, the number of index values in the window is recorded as zero. The matrix, \( R_{n} \), is constructed from the number of non-repetitive index values for each window. A statistical matching map is obtained by normalising the matrix, \( R_{n} \). The target probability map is obtained by weighting it with the original target probability map, as shown in Fig. 7.7. There is a significant difference between the pixel values of the target and the background area. The target position is then obtained by applying the adaptive mean shift algorithm to the target probability map.
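A rough sketch of this statistical matching step is given below, assuming the index matrix has already been reshaped to the image size and prob0 is the original (grey-value) target probability map; the window size, thresholds and fusion weights are illustrative assumptions.

```python
import numpy as np

def statistical_match(index_map, prob0, win=9, min_nonzero=20):
    """Count non-repetitive index values per local window (zero means invalid)."""
    H, W = index_map.shape
    R = np.zeros((H, W), dtype=np.float32)
    r = win // 2
    for y in range(r, H - r):
        for x in range(r, W - r):
            if np.count_nonzero(prob0[y - r:y + r + 1, x - r:x + r + 1]) > min_nonzero:
                patch = index_map[y - r:y + r + 1, x - r:x + r + 1]
                R[y, x] = len(np.unique(patch[patch > 0]))
    R /= (R.max() + 1e-12)            # normalised statistical matching map
    return 0.5 * R + 0.5 * prob0      # weighted fusion with the original probability map
```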

Fig. 7.7
figure 7

Left is the original target probability map; right is the weighted fusion target probability map

7.3.3 Experiment and Analysis

This section refers to the tracking model based on global LARK feature matching and CAMSHIFT as the global feature matching tracking (GLMT) algorithm, and to the tracking model based on local LARK feature statistical matching as the local LARK feature statistical matching tracking (LLSMT) algorithm. Because GLMT uses global matching, it is suitable for rigid or compact target tracking. Thus, the first part uses the GLMT, compressive tracking (CT), spatiotemporal context (STC) learning and CAMSHIFT algorithms to track human faces in a standard video library and cars in UAV video, and compares the results. Because the LLSMT algorithm uses local statistical matching, it is more suitable for non-compact targets with large changes. The second part uses LLSMT, GLMT and CAMSHIFT to track pedestrians in infrared standard video libraries and compares the results.

  • Target Tracking Experiment in Visible Video

  • Experiment on tracking car

In this experiment, the trajectory of the car is a turn at the beginning and then a straight line. The experimental results are shown in Fig. 7.8. The red box is the tracking result of GLMT algorithm, and the blue box is the tracking result of CAMSHIFT.

Fig. 7.8
figure 8

Tracking results for different frames

As shown in Fig. 7.8, when the target colour and the background are quite different and the target undergoes a significant rotation, both the GLMT and CAMSHIFT algorithms can effectively distinguish the target from the background, owing to their use of colour information. When the colour difference between the target and the background is small, the tracking result of the CAMSHIFT algorithm drifts or even loses the target, because it relies only on colour information. However, the GLMT algorithm uses both colour information and structure features, improving the contrast between target and background and giving a good tracking result.

  • Experiment on tracking a human face

This experiment chooses the human face video of AVSS2007. The human face in the video undergoes fast size changes, a slight rotation and slight occlusion. The background in the video also has a colour similar to the face. Figure 7.9 displays the tracking results of the different algorithms. The red box is the tracking result of GLMT, the green box represents CT, the blue box represents STC and the yellow box represents CAMSHIFT. The centre location error (CLE) curves of the different algorithms' tracking results are shown in Fig. 7.10. The abscissa represents the frame number of the video, and the ordinate represents the error between the centre position of the tracking result and the true target centre position. Table 7.2 lists the corresponding error analysis values, including average CLE, distance precision (DP) and overlap precision (OP).

Fig. 7.9
figure 9

Results of different tracking algorithms for human face video

Fig. 7.10
figure 10

Frame-by-frame comparison of CLE (in pixels) on tracking the human face

Table 7.2 Tracking result analysis on the human face video

As shown in Fig. 7.9, the target edge is not clear, owing to its rapid size changes and other objects of the same colour. Thus, the CAMSHIFT, CT and STC algorithms cannot accurately obtain the target location. Because LARK does not depend on the edge of the target and is more sensitive to the internal structure, the GLMT algorithm can better track the target by globally matching the unchanged internal structure of the target. From Fig. 7.10, the CLE of each frame of the GLMT tracking result is lower than that of CT, STC and CAMSHIFT, and the average CLE of the GLMT algorithm is also the lowest in Table 7.2. DP is computed as the relative number of frames in the sequence where the centre location error is smaller than a certain threshold; the DP threshold is set to 20 pixels. OP is defined as the percentage of frames where the bounding box overlap surpasses a threshold of 0.5. The DP and OP values of the GLMT algorithm are both higher than those of CT, STC and CAMSHIFT in Table 7.2. Thus, we can conclude that the GLMT algorithm tracks rigid or compact targets well.

Tracking Experiment on Infrared Non-compact Target

This experiment chooses pedestrian infrared image sequences from VOT2016. The pedestrian in the sequence has a significant change in posture. Experiment results are shown in Fig. 7.11. The red box is the tracking result of the LLSMT algorithm, the blue box represents GLMT and the green box represents CAMSHIFT.

Fig. 7.11
figure 11

Results of different tracking algorithms for pedestrian infrared image sequence

From Fig. 7.11, in the infrared image the edge of the target is weak and its grey information is not rich. Thus, the tracking result of CAMSHIFT drifts because it only uses grey information. GLMT and LLSMT both use LARK, which describes the internal structure of the infrared target well, and use LARK feature matching to obtain the target region accurately. With large variations of posture, the target shows large changes in overall structure but little change in local structures (e.g. head and feet). The LLSMT algorithm divides the whole structure into local structures and performs statistical matching, so it can track such targets well. Figure 7.12 shows the CLE curves of the three tracking algorithms on the infrared pedestrian. The CLE value of the LLSMT algorithm is the lowest, as seen in Fig. 7.12 and Table 7.3. As shown in Table 7.3, the DP and OP of the LLSMT algorithm are higher than those of GLMT and CAMSHIFT. Thus, we can conclude that the algorithm can track well non-compact infrared targets of various postures.

Fig. 7.12
figure 12

Frame-by-frame comparison of the centre location error (in pixels) on tracking pedestrian

Table 7.3 Tracking result analysis on pedestrian infrared image sequence

7.4 An SMSM Model for Human Action Detection

Considering noise, background interference and massive information, this section introduces a non-learning SMSM model based on the dense computation of a so-called space-time locally adaptive regression kernel to identify non-compact human actions. This model contains three parts:

  • calculation of GLARK feature;

  • multi-scaled composite template set and

  • spatiotemporal multiscale statistical matching.

Before we begin a more detailed elaboration, it is worthwhile to highlight some aspects of the introduced model in Fig. 7.13.

Fig. 7.13
figure 13

Flowchart of our SMSM model on motion detection

Algorithm 1

Human actions detection method based on SMSM model:

  • Construct \( W_{{Q_{i} }} \) and \( W_{{T_{i} }} \), which are a collection of 3D GLARKs associated with \( Q_{i} \), \( T_{i} \). Connect \( W_{{Q_{i} }} \) to \( W_{Q} \).

    • Apply PCA to \( W_{{T_{i} }} \) and obtain \( F_{{T_{i} }} \).

    • Remove similar column vectors in \( W_{Q} \) to get \( F_{Q} \).

    • Compute matrix cosine similarity

      for every target cube \( T_{i} \), \( i = 1, \ldots , \) the number of video frames, do

      $$ \rho_{i} = \left\langle {\frac{{F_{Q} }}{{\left\| {F_{Q} } \right\|_{F} }},\frac{{F_{{T_{i} }} }}{{\left\| {F_{{T_{i} }} } \right\|_{F} }}} \right\rangle_{F} \;{\text{and}} $$
      $$ \rho_{3DGLK} (:,:,i) = \rho \left\langle {F_{Q} ,F_{T} } \right\rangle = \left[ {\begin{array}{*{20}c} {\rho_{11} } & \cdots & {\rho_{{1n_{T} }} } \\ \vdots & \ddots & \vdots \\ {\rho_{{m_{T} 1}} } & \cdots & {\rho_{{m_{T} n_{T} }} } \\ \end{array} } \right] $$

      end for

    • Take the maximum of each row of \( \rho_{3DGLK} \), record the position of the column vector corresponding to this maximum in \( F_{Q} \), count the number of non-duplicate index values and obtain RM.

    • Apply non-maxima suppression parameter, \( \tau \), to RM.

7.4.1 Technical Details of the SMSM Model

Local GLARK Feature

This part first briefly introduces the space-time locally adaptive regression kernel. The LARK proposed by Seo and Milanfar (2009a, b) captures the geometric structure effectively. The locally adaptive regression kernel definition formula (see Seo and Milanfar 2009a, b; Luo et al. 2015) is

$$ K(C_{l} ,\Delta X_{l} ) = \exp ( - ds^{2} ) = \exp \left\{ { - \Delta X_{l}^{T} C_{l} \Delta X_{l} } \right\}. $$
(7.19)

Covariance matrix, \( C_{l} \), is calculated from the simple gradient information of the image. In fact, it is difficult to describe the concrete structural feature information of a target with simple gradient information. Moreover, it is easy to miss the weak edges of the target when the contrast of its edge region is relatively small, which causes missed detections. To make up for this defect, we fully exploit the LARK feature information, introduce the difference-of-Gaussians (DOG) operator into the LARK feature and generate a new GLARK feature descriptor to enhance weak-edge structure information. It is mainly inspired by the receptive field structure of neurons shown in Fig. 7.14 (Kuffler 1953). Rodieck and Stone (1965) established the DOG model to simulate the concentric circular antagonistic receptive field of retinal ganglion cells.

Fig. 7.14
figure 14

Receptive field map: the classical receptive field has a mutually antagonistic structure of the central and surrounding regions; and the non-classical receptive field is a larger area outside the classical receptive field, which removes the suppression of the classical sensory field

It is necessary to note that Gaussian kernel operator is defined by

$$ g( \cdot ,\sigma ) = \frac{1}{{(2\pi \sigma^{2} )^{N/2} }}\exp \left( { - \frac{{\sum\nolimits_{k = 1}^{D} {x_{k}^{2} } }}{{2\sigma^{2} }}} \right). $$
(7.20)

For a two-dimensional \( (D = 2) \) image, Gaussian convolution kernels with different \( \sigma \), acting as multiscale factors, are convolved with the gradient information of each pixel, as shown in Eq. (7.21).

$$ D\left( {x,y,\sigma } \right) = g\left( {x,y,\sigma } \right) \otimes z\left( {x,y} \right), $$
(7.21)
$$ Z\left( {x,y,\sigma ,k} \right) = D\left( {x,y,k\sigma } \right) - D\left( {x,y,\sigma } \right), $$
(7.22)

where \( \otimes \) represents convolution, \( (x,y) \) is the spatial coordinate and \( z(x,y) \) is the gradient of the image, which has two forms: \( z_{x} (x,y) \) and \( z_{y} (x,y) \). \( Z\left( {x,y,\sigma ,k} \right) \) gives the Gaussian difference gradient matrices, \( Z_{x} \) and \( Z_{y} \). Figure 7.15 shows the gradient matrix, z, of a \( 3 \times 3 \) region. Here, we assume a \( 3 \times 3 \) Gauss kernel operator \( g = \left[ {\begin{array}{*{20}c} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \\ \end{array} } \right] \) and calculate the convolution as follows:

Fig. 7.15
figure 15

Gauss kernel convolution

$$ f(i,j) = g * z = \sum\limits_{k,l} {g(i - k,j - l)z(k,l)} = \sum\limits_{k,l} {g(k,l)z(i - k,j - l)} . $$
(7.23)

Next, we take the centre point, \( (2,2) \), of the \( 3 \times 3 \) region as an example.

$$ \left[ {\begin{array}{*{20}c} {Z_{x22} } \\ {Z_{y22} } \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {Z_{x11} } & \cdots & {Z_{x33} } \\ {Z_{y11} } & \cdots & {Z_{y33} } \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} 9 & 8 & 7 & \cdots & 2 & 1 \\ \end{array} } \right]^{T} . $$
(7.24)

In this section, the third dimension of the 3D GLARK is time, and \( \Delta X = \left[ {dx,dy,dt} \right] \). Thus, the new \( C_{GLK} \) is

$$ C_{GLK} = \sum\limits_{{m \in \varOmega_{l} }} {\left[ {\begin{array}{*{20}c} {Z_{x}^{2} (m)} & {Z_{x} (m)Z_{y} (m)} & {Z_{x} (m)Z_{t} (m)} \\ {Z_{x} (m)Z_{y} (m)} & {Z_{y}^{2} (m)} & {Z_{y} (m)Z_{t} (m)} \\ {Z_{x} (m)Z_{t} (m)} & {Z_{y} (m)Z_{t} (m)} & {Z_{t}^{2} (m)} \\ \end{array} } \right]} , $$
(7.25)

where \( \Omega _{l} \) is the space-time analysis window. Then, GLARK is defined by

$$ K(C_{GLK} ,\Delta X_{l} ) = \exp \left\{ { - \Delta X_{l}^{T} C_{GLK} \Delta X_{l} } \right\}. $$
(7.26)

GLARK highlights the edges of an object, which weakens the texture interference, especially paying more attention to the local structure of weak edges. It is expressed as

$$ W_{x} = [W_{x}^{1} ,W_{x}^{2} , \cdots ,W_{x}^{M} ] \in R^{N \times M} , $$
(7.27)

where M represents the number of pixels, one GLARK descriptor per pixel. Figure 7.16 gives the GLARK feature descriptor. As can be seen from the figure, GLARK better describes the structural trend at weak edges, and in normal edge regions the introduced feature descriptor also outperforms that of the Seo algorithm.
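Below is a rough sketch, under stated assumptions, of how the DoG gradients (Eqs. 7.20–7.22) and the 3D GLARK kernel (Eqs. 7.25–7.26) of one space-time window might be computed with scipy; the smoothing is applied to the whole grey video volume for simplicity, and the scale factor k, window radius and function names are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_gradients(volume, sigma=1.0, k=1.6):
    """Difference-of-Gaussian smoothed gradients of a grey video volume (y, x, t)."""
    zy, zx, zt = np.gradient(volume.astype(np.float32))
    dog = lambda z: gaussian_filter(z, k * sigma) - gaussian_filter(z, sigma)
    return dog(zx), dog(zy), dog(zt)

def glark_kernel(Zx, Zy, Zt, centre, radius=1):
    """GLARK values of all offsets in one (2*radius+1)^3 space-time window."""
    y, x, t = centre
    sl = (slice(y - radius, y + radius + 1),
          slice(x - radius, x + radius + 1),
          slice(t - radius, t + radius + 1))
    G = np.stack([Zx[sl].ravel(), Zy[sl].ravel(), Zt[sl].ravel()], axis=1)
    C = G.T @ G                                                   # 3x3 matrix, Eq. (7.25)
    rng = np.arange(-radius, radius + 1)
    dy, dx, dt = np.meshgrid(rng, rng, rng, indexing='ij')
    offs = np.stack([dx.ravel(), dy.ravel(), dt.ravel()], axis=1)  # [dx, dy, dt] offsets
    return np.exp(-np.einsum('ij,jk,ik->i', offs, C, offs))       # Eq. (7.26)
```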

Fig. 7.16
figure 16

GLARK map of hat edge

Multi-scaled Composite Template Set

The colour video contains three colour channels, so each frame is three-dimensional; however, the third dimension of our model is time. Thus, the video is converted to a grey image sequence. Differing from supervised methods that require millions of training images, our template set is minimal. Figure 7.17 displays the templates, from T1 to T5, including pedestrian, long jump, skiing, cliff diving and javelin throw actions.

Fig. 7.17
figure 17

Template set of various actions

The action templates are resized to multiple scales, so the model can recognise actions of multiple sizes in one frame. Figure 7.18 shows the formation of the multi-scaled template set. Every template image is sequenced to obtain the original cube, and then nearest-neighbour interpolation resizes the original template to 0.5 times and 1.5 times, forming three template sequences. Only the first and second dimensions of the original template, which indicate size, are resized; the third dimension, which implies motion, remains the same. 3D GLARK is then utilised to obtain the multi-scaled local structure template set.

Fig. 7.18
figure 18

Multi-scaled template set

As shown in Fig. 7.18, the size of \( {\text{template}}_{0.5} \), \( {\text{template}}_{\text{initial}} \) and \( {\text{template}}_{1.5} \) are, respectively, \( m_{Z} /2 \times n_{Z} /2 \times t_{Z} \), \( m_{z} \times n_{z} \times t_{z} \) and \( 3m_{z} /2 \times 3n_{z} /2 \times t_{z} \). Hence, we can obtain three GLARK matrices, shown as \( W_{0.5} \in R^{{N \times M_{1} }} \), \( M_{1} = M/2^{2} \), \( W_{\text{initial}} \in R^{N \times M} \), \( M = m_{z} \times n_{z} \times t_{z} \) and \( W_{1.5} \in R^{{N \times M_{3} }} \), \( M_{3} = M \times (3/2)^{2} \), where \( N = m_{1} \times n_{1} \times t_{1} \) is the size of 3D windows for calculating GLARK. Then, \( W_{0.5} \), \( W_{\text{initial}} \) and \( W_{1.5} \) are connected to obtain the total template matrix:

$$ W_{Q} = \left[ {\begin{array}{*{20}c} {W_{0.5} } & {W_{\text{initial}} } & {W_{1.5} } \\ \end{array} } \right] \in R^{{N \times M_{t} }} ,\quad M_{t} = M_{1} + M + M_{3} . $$
(7.28)
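The following brief sketch shows how the multi-scaled template matrix of Eq. (7.28) might be assembled, assuming nearest-neighbour resizing of each frame and a glark_features routine (such as the one sketched above) returning an N × M weight matrix per cube; all names are illustrative.

```python
import cv2
import numpy as np

def multiscale_template(template_cube, glark_features, scales=(0.5, 1.0, 1.5)):
    """template_cube: (m_z, n_z, t_z) grey template sequence. Spatial dimensions are
    rescaled with nearest-neighbour interpolation; the temporal dimension is kept."""
    mats = []
    for s in scales:
        frames = [cv2.resize(template_cube[:, :, t], None, fx=s, fy=s,
                             interpolation=cv2.INTER_NEAREST)
                  for t in range(template_cube.shape[2])]
        cube = np.stack(frames, axis=2)
        mats.append(glark_features(cube))   # N x M_s weight matrix (Eq. 7.27)
    return np.concatenate(mats, axis=1)     # W_Q of Eq. (7.28)
```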

SMSM Model

This section studies the non-learning SMSM model. For \( W_{{T_{S} }} \) (‘s’ represents the video frame index), features are derived by applying dimensionality reduction (i.e. PCA) to the resulting arrays (Seo and Milanfar 2009a, b). A threshold, \( \alpha \), the so-called similar structure threshold, is needed to remove the redundancy of \( W_{Q} \). The model calculates the cosine of the angle between every pair of column vectors in the matrix. If the cosine value is greater than \( \alpha \), the two vectors are considered similar, and only one of them is retained. After reducing the dimensionality and redundancy, \( F_{Q} \) is obtained together with \( F_{{T_{S} }} \).

The next step in the introduced framework is a decision rule based on the measurement of matrix cosine similarity (Leibe et al. 2008). For \( F_{{T_{S} }} \), we calculate the cosine of the angle between each column vector, \( f_{T}^{i} \), of \( F_{T} \) and each column vector, \( f_{Q}^{j} \), of \( F_{Q} \). Then, we obtain the cosine similarity matrix, \( \rho_{{3{\text{DGLK}}}} \):

$$ \rho_{{3{\text{DGLK}}}} (:,:,k) = \rho \left\langle {F_{Q} ,F_{T} } \right\rangle = \left[ {\begin{array}{*{20}c} {\rho_{11} } & \cdots & {\rho_{{1n_{T} }} } \\ \vdots & \ddots & \vdots \\ {\rho_{{m_{T} 1}} } & \cdots & {\rho_{{m_{T} n_{T} }} } \\ \end{array} } \right]\left( {k = 1,2, \ldots ,t_{T} } \right), $$
(7.29)

where \( m_{T} \times n_{T} \times t_{T} \) is the size of the test video. This section takes the maximum of each row in the matrix, \( \rho_{{3{\text{DGLK}}}} (:,:,k) \), and records the position of the column vector corresponding to the maximum in \( F_{Q} \). The location information is saved in the \( {\text{index}}_{\text{GLK}} \) matrix:

$$ {\text{index}}_{\text{GLK}} (:,k) = (x_{1} ,x_{2} , \ldots ,x_{{m_{T} }} )^{T} \quad x_{1} ,x_{2} , \ldots ,x_{{m_{T} }} = 1,2, \ldots ,n_{T} . $$
(7.30)

We arrange the position index matrix, \( {\text{index}}_{\text{GLK}} (:,k) \) and \( \rho_{{3{\text{DGLK}}}} (:,:,k) \), into a matrix of \( m_{T} \times n_{T} \), according to the column order. Thus, we need a similarity threshold, \( \theta \), whose empirical value is 0.88, to judge each element, \( \rho^{\prime}_{{3{\text{DGLK}}}} \) in \( \rho_{{3{\text{DGLK}}}} \). If \( \rho^{\prime}_{{3{\text{DGLK}}}} \ge \theta \), two vectors are similar. If \( \rho^{\prime}_{{3{\text{DGLK}}}} < \theta \), two vectors are not similar. At this point, the corresponding position in the \( {\text{index}}_{\text{GLK}} \) is recorded as 0.

This section selects an appropriate local window of \( P \times P \times T \) to traverse the \( {\text{index}}_{\text{GLK}} \) matrix and count the non-duplicate index values in the window. The process can be expressed as \( {\text{num}} = {\text{num}}({\text{Unique}}({\text{index}}_{{{\text{GLK}}_{P \times P \times T} }} )) \), where \( {\text{num}} \) represents the similarity value between the target of interest and the present region. Each index value represents a corresponding structure of the target image, like the template. The more distinct index values there are, the more similar local structures exist in the local window. To this end, we construct a similarity matrix, \( {\text{RM}}_{\text{GLK}} \). Figure 7.19 shows the process from \( {\text{index}}_{\text{GLK}} \) to \( {\text{RM}}_{\text{GLK}} \).

Fig. 7.19
figure 19

Process from \( {\text{index}}_{\text{GLK}} \) to \( {\text{RM}}_{\text{GLK}} \)

On the basis of \( {\text{RM}}_{\text{GLK}} \), we can acquire a global similarity image, RM. Next, we employ a method of non-maxima suppression from Devernay (1995) to extract the local maxima of the resemblance image and seek out the location of people. Therefore, we need a non-maximum suppression threshold, \( \tau \), to decide whether to ignore some areas of RM. Then, we calculate the maximum value of the remaining possible area.
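One possible way to implement this non-maxima suppression step on the resemblance map RM is sketched below, using scipy's maximum filter; the neighbourhood size is an illustrative assumption, and \( \tau \) is the threshold selected in the parameter analysis of Sect. 7.4.2.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def detect_peaks(RM, tau=0.75, size=15):
    """Suppress weak responses of RM, then keep only its local maxima."""
    RM = RM / (RM.max() + 1e-12)
    candidates = RM >= tau                           # ignore low-confidence areas
    local_max = RM == maximum_filter(RM, size=size)  # per-pixel local-maximum test
    ys, xs = np.nonzero(candidates & local_max)
    return list(zip(ys, xs))                         # detected target locations
```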

In Fig. 7.20a, the deep-red part of the map represents higher grey values, indicating the possibility of the existence of the target in the corresponding region of the test frames. It is easy to see that there is a region of high grey value. Thus, there may be a human body in the location of the test frames, T, corresponding to the maximum grey value of the region. Figure 7.20b represents RM with non-maximum suppression; Fig. 7.20c represents the detection result of a sampled frame.

Fig. 7.20
figure 20

a Resemblance maps; b RMs with non-maximum suppression; c detection result of sampled frames

7.4.2 Experiments Analysis

Parameters Analysis

The main parameters for our approach are as follows: similar structure threshold, \( \alpha \), for \( W_{Q} \) and non-maximum suppression threshold, \( \tau \), for RM.

When removing redundancies, we judge whether two vectors are similar by comparing the cosine value of the two weighted vectors with the threshold, \( \alpha \). The larger the threshold, \( \alpha \), the more weight vectors are retained and the more complex the calculation; the smaller the threshold, \( \alpha \), the greater the difference among the retained weight vectors and the less accurate the recognition results. Therefore, a specific threshold, \( \alpha \), must be selected. This section statistically analyses the number of retained weight vectors under different \( \alpha \). As shown in Fig. 7.21, the abscissa represents the threshold, \( \alpha \), and the ordinate represents the number of weight vectors after de-redundancy. The curves show that the number of retained weight vectors first increases slowly and then grows rapidly as the threshold, \( \alpha \), increases. Therefore, the threshold is determined by the turning point of the curve, where the slope is 0.5. In Fig. 7.21a, this section chooses the point (0.986, 909). Thus, threshold \( \alpha \) is set to 0.986, the number of weight vectors after de-redundancy is 909 and the new weight vector matrix is used in the similarity structure analysis. Similarly, in Fig. 7.21b, we choose the point (0.994, 289): threshold \( \alpha \) is set to 0.994 and the number of weight vectors after de-redundancy is 289. In Fig. 7.21c, we choose the point (0.996, 118): threshold \( \alpha \) is set to 0.996 and the number of weight vectors after de-redundancy is 118.

Fig. 7.21
figure 21

Relationship between number of local structures and threshold \( \alpha \): a pedestrian; b athlete; c multiscale people in one video

As shown in Fig. 7.20, to accurately determine the location of the target, a non-maximum suppression threshold, \( \tau \), is required for optimal retention of the pixel values in the RM. Thus, multiple target images are selected and probability density curves are established from their respective RM matrices (see Fig. 7.22a). The probability density curves are integrated to establish the integral sum curves (see Fig. 7.22b). Although the RM probability density distribution of each target image is quite different, their integral sum curves converge at the tail. Because the tail corresponds to the larger RM values, where the target may still occur, the other relatively small grayscale pixel values can be omitted to reduce the amount of computation and improve efficiency. Thus, we select the point where the integral sum curves of the image RMs begin to converge; the right side of this point is retained, and the left is omitted. Figure 7.22b shows that \( \tau \) is set to 0.75.

Fig. 7.22
figure 22

a RM probability density curves; b RM probability density integrals sum curves

Test Results

A quantitative evaluation of the introduced method is presented in the next subsection. This subsection tests our method and the Seo algorithm on challenging scenes (i.e. single objects, fast-moving objects and multiple objects) from the Visual Tracker Benchmark.

  • Single person

To evaluate the robustness of the introduced method, we first test it on visible video with one object. Figure 7.23 shows the results of searching for walking people in a target video (597 frames of 352 × 288 pixels). The query video contains a very short walking action moving to the right (10 frames of 48 × 102 pixels). This section randomly displays the results of three frames. Obviously, the detection effect of the multispectral data is superior to that of the single band. Owing to its single template, the Seo algorithm can only recognise the target when it has a posture similar to the template; for other frames, its performance is poor. The introduced algorithm enlarges the weight matrix commensurate with the abundant spectral information, so not only is the detection rate improved, but the results are also identified more accurately. Simultaneously, because our GLARK focuses on the local structure information of the target, our algorithm detects well when the target is partially occluded, whereas the Seo algorithm does not, because it is more concerned with the overall structure than with the local structure. Thus, when part of the body structure is blocked, the Seo algorithm does not apply.

Fig. 7.23
figure 23

Detection results of pedestrian: a introduced algorithm; b Seo algorithm; c introduced algorithm in overshadowed scene; d Seo algorithm in overshadowed scene

  • Fast motions

This section also tests the introduced method on visible video with fast motions. Figure 7.24 shows the results of detecting a female skater turning (160 frames of 320 × 240 pixels) and a male surfer (84 frames of 480 × 270 pixels). The query videos contain fast motions of the skater (6 frames of 121 × 146 pixels) and the surfer (6 frames of 84 × 112 pixels). Figure 7.24b shows that the Seo algorithm can coarsely localise the athlete in the images, but the bounding boxes only contain parts of the person. In contrast, the turns of the female athlete are detected by the introduced method even though the video contains very fast-moving parts and relatively large variability in spatial scale and appearance compared to the query Q in (a); these bounding boxes locate the athlete in the image well. Additionally, we test a scene with a change of target scale during fast motion. As shown in (c) and (d), the Seo algorithm is very unstable, whereas our algorithm is robust to the changing scale of the target. Analysis suggests that this is because our templates are multi-scaled composites, while the Seo template is relatively simple. Thus, our method overcomes the drawbacks of the Seo algorithm and produces reliable results on this dataset.

Fig. 7.24
figure 24

Fast motions detection: a, c introduced algorithm; b, d Seo algorithm

  • Multiple cases and scales of pedestrians

To further evaluate the performance of the introduced method, we test it on visible video with different people. Figure 7.25 shows the results of detecting multiple cases and different sizes of humans, which occur simultaneously in two monitor videos. The query videos contain two pedestrians who drift ever farther away, so their sizes diminish. Figure 7.25a, c shows the results of the introduced algorithm, and Fig. 7.25b, d shows the detection results of the Seo algorithm. The woman’s coat on the left of the first picture in (a) and (b) and the man’s clothes in (c) and (d) have low contrast against the cement. For such weak human edges, the Seo algorithm easily mistakes the human for background, whereas our algorithm detects the target well in this case, because it enhances weak-edge information locally. Moreover, the multiple scales of the people limit the detection of the Seo algorithm in the same field of view in (c) and (d). Our algorithm does not suffer such interference in the experiments because of its multiscale template set.

Fig. 7.25
figure 25

Visual comparisons of different methods with different scales: a, c introduced algorithm; b, d Seo algorithm

Analysis of the Introduced Method

This section first evaluates the intersection over union (IoU) of our method and the Seo algorithm on the images shown; see Fig. 7.26a. It is clear that the introduced method achieves superior performance compared to the Seo algorithm. For target recognition systems, precision and recall are contradictory quantities. Receiver operating characteristic (ROC) curves (Lachiche and Flach 2003) are introduced to evaluate the effects of target recognition.

Fig. 7.26
figure 26

Comparison between Seo and the introduced algorithm: a IoU curve; b RPC curve

$$ {\text{TPr}} = \frac{TP}{TP + FN},\quad {\text{FPr}} = \frac{FP}{TP + FP}, $$
(7.31)

where TP is the correctly detected target area, FP is the area mistaken for the target and FN is the undetected target area. Hence, the recall–precision curve (RPC) (Leibe et al. 2008) is utilised. As shown in Fig. 7.26b, the horizontal and vertical axes correspond to the two evaluation indexes: 1-precision and recall.
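A small sketch of how these area-based quantities might be computed from binary detection and ground-truth masks is given below; the mask names are illustrative.

```python
import numpy as np

def detection_rates(det_mask, gt_mask):
    """Eq. (7.31): recall (TPr) and 1-precision (FPr) from area overlaps."""
    TP = np.logical_and(det_mask, gt_mask).sum()    # correctly detected target area
    FP = np.logical_and(det_mask, ~gt_mask).sum()   # area mistaken for the target
    FN = np.logical_and(~det_mask, gt_mask).sum()   # undetected target area
    recall = TP / (TP + FN + 1e-12)
    one_minus_precision = FP / (TP + FP + 1e-12)
    return recall, one_minus_precision
```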

The visual tracker benchmark has manually tagged the test sequences with 11 attributes, representing the challenging aspects of visual detection. The 30 videos containing human actions in datasets are tested by our model and the Seo algorithm. The mAP is the average precision of 11 attributes, as shown in Fig. 7.27.

Fig. 7.27
figure 27

Histogram of average precision (%) for each attribute on visual tracker benchmark

Experiments at THUMOS 2014

To make the experimental results of our algorithm more credible, we compare against two other benchmark systems: (1) S-CNN leverages deep networks for temporal action localisation via three segment-based 3D ConvNets; (2) Wang et al. (THUMOS 2014: http://crcv.ucf.edu/THUMOS14/download.html) built a system on iDT with FV representation and frame-level CNN features and performed post-processing to refine the detection results. The temporal action detection task in the THUMOS Challenge 2014 was dedicated to localising action instances in long untrimmed videos (THUMOS 2014: http://crcv.ucf.edu/THUMOS14/download.html). Our model is unsupervised and only uses the test data.

The detection task involved 20 actions, as shown in Table 7.4, and every action included many scenes. The average precision of each action was tested over all scenes with the three methods; mAP is the mean average precision over 213 videos, shown in Table 7.4. Although our mAP is slightly lower than that of S-CNN, our model is unsupervised and does not rely on a training process.

Table 7.4 Histogram of average precision (%) for each class on THUMOS (2014)

7.5 Summary

In view of some problems of the existing detection and tracking methods (e.g. diverse motion shapes, complex image features and weak adaptability to varied scenes), we described the following three models:

  • A new method for infrared small object detection was introduced that detects small objects using sparse errors and structure differences. This is the first time that structure differences and sparse errors have been successfully applied to infrared small object detection. The infrared small target can be easily extracted by exploiting sparse errors and structure differences, and the final result is obtained via a simple fusion framework. Experimental results demonstrate that the introduced method is effective, performs favourably against other methods and yields superior detection results.

  • The introduced algorithm, GLMT, utilises the advantages of LARK spatial structures, which are not disturbed by weak target edges, and combines colour or grey information to improve the contrast between the target and the background. The GLMT algorithm remedies the lack of spatial information and the background interference in CAMSHIFT and tracks compact or non-compact targets with small posture changes in videos of different spectra. Additionally, the local structure statistical matching method of LARK was introduced, exploiting the sensitivity of the LARK feature to changes of weak local structure. Using this local statistical matching, the introduced algorithm, LLSMT, can track non-compact targets with different posture changes.

  • A space-time local structure statistical matching model was introduced to recognise non-compact actions and to expand the scenes of the test video. First, a new GLARK feature was designed by introducing Gaussian difference gradients into LARK descriptors to encode local structure information. Second, a multi-scaled composite template set was applied for actions of multiple sizes. Finally, the space-time local structure statistical matching method was applied to action videos to mark the actions. Experimental results demonstrate that the approach outperforms previous approaches and achieves more robust and accurate results for human action detection. Furthermore, the SMSM model is distinguished from traditional learning methods and matches the template with the test video more efficiently.