Motion detection and tracking technology is one of the core subjects in the field of computer vision and has wide practical value in night-vision research. Traditional learning-based detection and tracking algorithms require many samples and complex models, which are difficult to implement, and their robustness in complex scenes is weak. This chapter introduces a series of non-learning methods for infrared small-target detection and for motion detection and tracking, based on imaging spatial structure, which are robust to complex scenes.

7.1 Target Detection and Tracking Methods

7.1.1 Investigation of Infrared Small-Target Detection

Infrared small-target detection is an important issue in computer vision and in military fields. With the development of infrared imaging technology, infrared sensors now produce high-resolution images, which facilitate object detection. However, the task still faces challenges owing to the complexity of scenes. Various methods have been proposed for infrared small-target detection, such as the morphological top-hat filter of Drummond (1993). Yang et al. (2004) designed an adaptive Butterworth high-pass filter for infrared small objects. Gu et al. (2010) proposed a kernel-based nonparametric regression method for background prediction and small-target detection. Bae et al. (2012) used an edge-directional 2D least-mean-square filter for small and dim infrared target detection. Li et al. (2014) proposed a novel infrared small-target detection method in the compressive domain. Despite these methods, a few common issues endure, related to the complexity of patterns in infrared images, and these methods often fail to detect small objects when the scenes become more challenging. Aiming to solve this problem, a novel infrared small-target detection method using sparse errors and structure differences is introduced. The method exploits sparse errors and structure differences between target and background regions, and a simple fusion method is then applied to generate the final detection result (see Fig. 7.1).

Fig. 7.1
figure 1

Pipeline of infrared small-target detection

7.1.2 Moving Object Detection Based on Non-learning

Accompanied by the increasing number of night-vision sensors, numerous videos are being produced. Thus, it is essential to detect targets and motion automatically, which plays an important role in intelligence operations. Over the last 20 years, several methods have been proposed based on cascade classification (Viola and Jones 2001) and Bayes’ rule (Muesera et al. 1999). To determine the parameters of their classifiers, traditional detection methods (Sanchez et al. 2007; Shen and Fan 2010) require extensive learning, which leads to low real-time performance. Accordingly, methods without learning have gradually been put forward.

In 2007, Takeda et al. (2007) proposed classic kernel regression to recover the high-frequency information of images, which is used for denoising. In 2009, Milanfar (2009) studied the adaptive kernel regression method to remove noise, enhance image details and detect targets. In the same year, Seo and Milanfar (2009a, b) made further efforts and proposed the locally adaptive regression kernel (LARK) method, a new non-parametric method for detecting targets. A few years later, Seo and Milanfar (2009a, b, 2011a, b, c) improved the robustness of the regression kernel in several aspects. However, the matching algorithm presented by Seo and Milanfar (2011a, b, c) (hereafter called the ‘Seo algorithm’) is not suitable for non-compact targets, such as human actions. The overall template, including background, was matched against the test video, which limited the choice of test video scenes: recognition accuracy relies on the background similarity between the template and the test video. Only when the background of a test video is quite similar to the template are the results satisfactory. Conversely, when the view angle changes or the scene becomes complex, the outcomes are disappointing. Wang (2012) amplified template images and divided them into many parts to detect human faces. The template only contains a face, which gives us the inspiration to remove the background. Furthermore, when actions are partially obscured by the landscape, matching with the overall template fails to recognise them.

Inspired by the trend towards detection in video streaming, a spatiotemporal multiscale statistical matching (SMSM) model, based on a new weighted 3D Gaussian difference LARK (GLARK), was introduced, which recognises actions in video more efficiently.

7.1.3 Research on Target Tracking Technology

Target tracking is a key technology of scene perception. Target information, such as moving trajectory, target position and velocity, is the basis for subsequent video image processing, such as target recognition and behavioural analysis.

Many target tracking algorithms have been proposed in recent years (Ming 2016; Rong et al. 2010). The continuously adaptive mean shift (CAMSHIFT) tracking algorithm proposed by Bradski was based on colour information. It effectively solved the problem of target deformation and size scaling and consumed little time. However, it was not good at tracking fast-moving targets or targets in complex backgrounds. Several improved CAMSHIFT tracking algorithms (Hong et al. 2006; Chun and Shi 2015; Xing et al. 2012; Ran 2012) then improved the stability of tracking. However, these algorithms needed distinctive target colours, high-quality images and simple backgrounds. For complex background scenes, there are now many popular online learning-based tracking algorithms. Compressive tracking and its improved algorithm (Kai 2012; Chong 2014) showed good real-time performance and good robustness to target occlusion and appearance changes, but their fixed tracking window could not adapt to scale changes, leading to tracking drift. The visual tracking algorithm based on spatiotemporal context information and its improvement (Zhang et al. 2013; Guo 2016) both required little time. However, because they rely on simple features to obtain the statistical relevance of the target to the surrounding area, these algorithms fail to track when the target moves too fast or is occluded. Additionally, because of low contrast, lack of colour information and small grayscale dynamic range, infrared image tracking has become a hot topic in tracking research. The classical mean shift tracking algorithm of Zhang et al. (2006) used grey information for real-time tracking. However, it was vulnerable to interference from backgrounds with similar grey information and failed to track when the target size changed. Meng (2015) proposed an improved mean shift algorithm with a weighted kernel histogram and brightness–distance space. It successfully tracked rigid infrared targets but could not track a variety of non-rigid targets, such as humans and animals. The mean shift algorithm with speeded-up robust features (SURF), proposed by Zhang et al. (2011), solved the tracking problem of target scale change in the ideal state. However, it could not track small or weakly textured targets because too few feature points could be extracted or matched.

In summary, these tracking algorithms cannot perform target tracking well in a complex background. A tracking model based on global LARK feature matching and the CAMSHIFT algorithm is thus introduced in Sect. 7.3.1. Because the LARK feature is not sensitive to the grey value of each point or to the specific object in the picture, but is more sensitive to changes of grey gradient and graphic structure (Seo and Milanfar 2009a, b, 2011a, b, c; Wang 2012), it is possible to distinguish between rigid or compact targets and background areas by combining SSIM maps and colour probability distributions. To track non-compact infrared targets, we also propose a local LARK feature statistical matching method. LARK features can describe the essential characteristics of weak-edge targets well in an infrared image. Whereas the overall structure of the same target is dissimilar in different forms, there are some similarities in local fine structure (Luo et al. 2015). Thus, global matching is transformed into local matching statistical analysis in Sect. 7.3.2. Combined with infrared image characteristics and microstructure similarity statistical analyses, the introduced method can well distinguish between backgrounds and infrared targets in different forms.

7.2 Infrared Small Object Detection Using Sparse Error and Structure Difference

7.2.1 Framework of Object Detection

Sparse Error. There exist appearance divergences between target and background regions. Generally, the boundary regions of the image are considered background regions. Background templates are constructed from these image boundaries, and the entire image is reconstructed by sparse appearance error modelling, each image patch being represented by bases learnt from a set of infrared image patches. First, a given infrared image is taken as input, and SLIC (Achanta et al. 2012) is used to segment the input image into multiple uniform regions. Each segment is described by \( p = \left\{ {x,y,lu,g_{x} ,g_{y} } \right\} \), where lu is the luminance information (Wang et al. 2015), \( g_{x} \) and \( g_{y} \) are the gradient information, and x and y indicate the coordinates of a pixel. The entire infrared image is represented as \( P = \left[ {p_{1} ,p_{2} , \ldots ,p_{N} } \right] \), where N is the number of segments. The boundary segments, d, are extracted from P as bases, and the background template set, \( D = \left[ {d_{1} ,d_{2} ,d_{3} , \ldots ,d_{M} } \right] \), is constructed, where M is the number of image boundary segments. Given the background templates, the sparse error between object and background regions can be computed; reconstructed from the same bases, object and background regions should differ greatly. Then, each image segment is encoded as

$$ a_{i} = \arg \min_{{a_{i} }} \left\| {p_{i} - Da_{i} } \right\|_{2}^{2} + \lambda \left\| {diag(w_{i} )a_{i} } \right\|_{ 1} , $$
(7.1)

where \( \uplambda \) is the regularisation parameter and \( w_{i} \) is used to represent the weight for segment \( d_{i} \), computed as

$$ w_{i} = \frac{1}{{H(d_{i} )}}\sum\limits_{{j \in H(d_{i} )}} {\exp (\frac{{\left\| {d_{j} - d_{i} } \right\|^{2} }}{{2\sigma^{2} }})} , $$
(7.2)

where \( H(d_{i} ) \) denotes the number of \( d_{i} \)’s neighbours. This weight computes the similarity between segment \( d_{i} \) and its surrounding segments. A large \( w_{i} \) suppresses the corresponding entry of \( a_{i} \), forcing it towards zero, whereas a small \( w_{i} \) allows it to be nonzero. The weight for a background template should be proportional to the similarity between segments in the image boundaries.

Then the regulated reconstruction error for each segment is computed as

$$ E_{i} = \left\| {p_{i} - Da_{i} } \right\|_{2}^{2} . $$
(7.3)

Coarse regions of interest can be easily estimated from the regulated sparse errors: segments with large reconstruction errors correspond to target regions, which differ from background regions.
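As a minimal sketch of this step (not the authors' implementation), the weighted \( \ell_{1} \) problem in Eq. (7.1) can be reduced to an ordinary lasso by rescaling the dictionary columns, after which the reconstruction error of Eq. (7.3) follows directly. The arrays P, D and w below are hypothetical stand-ins for the segment descriptors, boundary templates and weights.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_errors(P, D, w, lam=0.01):
    """P: (f, N) segment descriptors; D: (f, M) boundary templates; w: (M,) weights.
    Returns one regulated reconstruction error per segment (Eq. 7.3)."""
    D_scaled = D / w[np.newaxis, :]          # absorb diag(w) into the dictionary
    errors = np.zeros(P.shape[1])
    for i in range(P.shape[1]):
        # min ||p - D a||^2 + lam * ||diag(w) a||_1 (up to sklearn's scaling of lam)
        lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
        lasso.fit(D_scaled, P[:, i])
        a = lasso.coef_ / w                  # undo the rescaling
        errors[i] = np.sum((P[:, i] - D @ a) ** 2)
    return errors
```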

Structure Difference

Objects often have structure information different from that of background regions. In this section, structure differences between object and background regions are exploited via region covariances. Let F denote the feature image extracted from the input image, I, as \( F =\Gamma (I) \), where \( \Gamma \) denotes a mapping function that extracts a k-dimensional feature vector from each pixel in I. A region, \( q_{u} \), inside F can be represented by a \( k \times k \) covariance matrix, \( C_{{q_{u} }} \)

$$ C_{{q_{u} }} = \frac{1}{n - 1}\sum\limits_{i = 1}^{n} {(q_{i} - \mu )(q_{i} - \mu )^{T} } , $$
(7.4)

where \( q_{i} ,i = 1, \ldots ,n \) denotes the k-dimensional feature vectors inside region \( q_{u} \), and \( \mu \) denotes the mean of these feature vectors. In this section, \( k = 5 \) features (i.e. \( x,y,lu,g_{x} ,g_{y} \)) are used to build the region feature. The structure difference map is computed based on two different covariances:

$$ G(q_{u} ,q_{v} ) = \psi (C_{{q_{u} }} ,C_{{q_{v} }} ), $$
(7.5)

where \( \psi (C_{{q_{u} }} ,C_{{q_{v} }} ) \) is used to compute the similarity between two covariances (Karacan et al. 2013).

This covariance matrix can better capture local image structure data and can effectively estimate the structure differences between objects and background regions. Thus, object regions have higher values of G than background regions.
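The following is a rough sketch of computing a region covariance descriptor and one plausible choice of \( \psi \) (a Gaussian of the log generalised-eigenvalue distance between covariances); the exact similarity used by Karacan et al. (2013) may differ, and the function names are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def region_covariance(feats):
    """feats: (n, k) feature vectors (x, y, lu, gx, gy) of one region (Eq. 7.4)."""
    return np.cov(feats, rowvar=False)

def covariance_similarity(C_u, C_v, eps=1e-6):
    """Similarity of two region covariances via their generalised eigenvalues."""
    k = C_u.shape[0]
    lam = eigh(C_u + eps * np.eye(k), C_v + eps * np.eye(k), eigvals_only=True)
    dist = np.sqrt(np.sum(np.log(lam) ** 2))   # Foerstner-style distance
    return np.exp(-dist)                       # larger value = more similar structure
```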

Fusing Error and Structure

The two maps obtained from these two different views are fused into a final map via a linear combination whose weights are learnt:

$$ S = \sum\limits_{t = 1}^{T} {\beta_{t} S^{t} } , $$
(7.6)

where \( \left\{ {\beta_{t} } \right\} = {\text{argmin}}\sum {\left\| {S - \sum\nolimits_{t} {\beta_{t} S^{t} } } \right\|}_{F}^{2} \), T denotes the number of maps and \( S^{t} \) denotes either the error map or the structure difference map. The weights are learnt using a least-squares estimator; the problem is solved with the conditional random field solution of Liu et al. (2010). Finally, as in the previous method of Li et al. (2014), the final target of interest is extracted by a threshold method, defined as

$$ S^{\prime}(x,y) = \left\{ {\begin{array}{*{20}l} 1 \hfill & {S(x,y) \ge \gamma S_{\hbox{max} } } \hfill \\ 0 \hfill & {\text{others}} \hfill \\ \end{array} ,} \right. $$
(7.7)

where \( S_{ \hbox{max} } \) is the maximum value of S and \( \gamma = 0.6 \) is the threshold value (Fig. 7.2).
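As a minimal sketch of Eqs. (7.6) and (7.7), assuming a reference map S_ref is available for learning the fusion weights (the original learning setup is not spelt out here), the fusion and thresholding could look as follows; all names are illustrative.

```python
import numpy as np

def fuse_maps(maps, S_ref):
    """maps: list of T maps (H, W); S_ref: reference map used to learn the weights."""
    A = np.stack([m.ravel() for m in maps], axis=1)            # (H*W, T)
    beta, *_ = np.linalg.lstsq(A, S_ref.ravel(), rcond=None)   # least-squares weights
    return (A @ beta).reshape(S_ref.shape), beta               # Eq. (7.6)

def threshold_map(S, gamma=0.6):
    """Eq. (7.7): keep pixels above gamma times the maximum response."""
    return (S >= gamma * S.max()).astype(np.uint8)
```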

Fig. 7.2
figure 2

Image decomposition results: a input image; b structure component of the input image; c reconstruction error map; d our final result

7.2.2 Experimental Results

The empirical values, \( \sigma = 3 \), \( N = 400 \), \( T = 2 \) and \( \lambda = 0.01 \), are chosen for the introduced method. The input image is \( 480 \times 640 \) pixels. The method is compared with other existing methods, including the top-hat (TH) filter of Drummond (1993) and the compressive domain (CD) method of Li et al. (2014). Signal-to-clutter ratio gain (SCRG) and background suppression factor (BSF) are used to evaluate these methods. These two metrics are defined as

$$ SCRG = 20 \times \log_{10} \left( {\frac{{(U/C)_{\text{out}} }}{{(U/C)_{\text{in}} }}} \right),\quad BSF = 20 \times \log_{10} \left( {\frac{{C_{\text{in}} }}{{C_{\text{out}} }}} \right), $$
(7.8)

where U is the average target intensity and C is the clutter standard deviation in an infrared image. In this experiment, four different scenes are selected to verify the performance of the introduced method. The first row of Fig. 7.3 presents the four types of inputs. The second row illustrates the predicted target results produced by TH, and the third row indicates the detection results of CD. The introduced method produces the detection results in the fourth row of Fig. 7.3. From these visual examples, we can see that the introduced method accurately detects the infrared small targets, because it exploits sparse errors and structure differences, and these cues better represent the details of the infrared target. However, TH uses a simple top-hat filter to detect the infrared target and produces some false detections (second row of Fig. 7.3). CD improves on TH and performs well in the four scenes. However, some infrared small objects are not well detected, which increases the miss rate of object detection; it is thereby less accurate for infrared small object detection (third row of Fig. 7.3).
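As a hedged sketch of how SCRG and BSF in Eq. (7.8) might be computed, assuming binary masks marking the target region in the input and output images (the exact measurement windows used in the experiments are not specified here):

```python
import numpy as np

def scr(image, target_mask):
    """Signal-to-clutter ratio U/C and clutter C of one image (boolean target_mask)."""
    U = image[target_mask].mean()          # average target intensity
    C = image[~target_mask].std()          # clutter standard deviation
    return U / C, C

def scrg_bsf(img_in, img_out, mask_in, mask_out):
    (uc_in, c_in), (uc_out, c_out) = scr(img_in, mask_in), scr(img_out, mask_out)
    scrg = 20 * np.log10(uc_out / uc_in)   # Eq. (7.8), left
    bsf = 20 * np.log10(c_in / c_out)      # Eq. (7.8), right
    return scrg, bsf
```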

Fig. 7.3
figure 3

Qualitative visual results of four scenes: a three targets in the wild; b two targets in the sky; c three targets in the sea; d one target in the sky

To further assess the introduced method against other competing methods, SCRG and BSF are used to evaluate performance. The quantitative results for all the test images are shown in Table 7.1. The introduced method clearly obtains the highest evaluation scores among the compared methods. Thus, the introduced method has a superior ability to detect infrared small targets in complex scenes, demonstrating its effectiveness and robustness. This is because the framework is carefully designed around the characteristics of infrared small targets: using sparse errors and structure differences, the target candidate areas can be well detected, and a simple fusion framework further improves the accuracy.

Table 7.1 Quantitative evaluations (SCRG/BSF) of four infrared scenes

7.3 Adaptive Mean Shift Algorithm Based on LARK Feature for Infrared Image

7.3.1 Tracking Model Based on Global LARK Feature Matching and CAMSHIFT

The tracking model based on global LARK feature matching and CAMSHIFT mainly uses colour information and LARK structure to calculate the probability distribution of the target in the processing image. The mean shift algorithm is then used to obtain the target centre and its size from the probability map. To shorten the matching time, each frame only processes a region twice the size of the previous frame's target area. The model process is shown in Fig. 7.4.

Fig. 7.4
figure 4

Tracking model based on global LARK feature matching and CAMSHIFT

Principle of LARK Feature Matching

The calculation principle of local kernel values of LARK is described by Takeda et al. (2007), Milanfar (2009), and Luo et al. (2015). To reduce the influence of external interference (e.g. light), the local kernel of each point is normalised, as shown in Eq. (7.9).

$$ k_{i}^{l} = K_{i}^{l} \Big/\sum\limits_{l = 1}^{{P^{2} }} {K_{i}^{l} } ,\quad k_{i} \in R^{{P^{2} \times 1}} ,\;i \in [1,N],\;l \in [1,P^{2} ], $$
(7.9)

where \( K_{i}^{l} \) is the local kernel, N is the total number of pixels in the image and \( P^{2} \) is the number of pixels in the local window. The normalised local kernel of a point in the image is sorted by column as a weight vector:

$$ w_{i} = \left[ {k_{i}^{1} ,k_{i}^{2} , \ldots ,k_{i}^{{P^{2} }} } \right]^{T} . $$
(7.10)

Next, a window of \( P \times P \) pixels is used to traverse the entire image to obtain its weight matrix:

$$ W = \left[ {w_{1} ,w_{2} , \ldots ,w_{N} } \right] \in R^{{P^{2} \times N}} . $$
(7.11)

After obtaining the LARK weight matrices, \( W\_Q \) and \( W\_T \), which represent the features of the template image and the processing image, PCA is used to reduce the redundancy of \( W\_Q \), preserving only the first d principal components to constitute the matrix, \( A_{Q} \in R^{{P^{2} \times d}} \). The feature matrices, \( F_{Q} \) and \( F_{T} \), are then calculated from \( A_{Q} \) as in Wang (2012).

$$ F_{Q} = \left[ {f_{Q}^{1} ,f_{Q}^{2} , \ldots ,f_{Q}^{N} } \right] = A_{Q}^{T} W_{Q} , $$
(7.12)
$$ F_{T} = \left[ {f_{T}^{1} ,f_{T}^{2} , \ldots ,f_{T}^{M} } \right] = A_{Q}^{T} W_{T} , $$
(7.13)

where N and M are the total numbers of pixels in the template and processing images, respectively. Then, LARK feature matching is performed. First, a moving window of the same size as the template image is moved pixel-by-pixel over the processing image. Then, the cosine similarity between \( F_{{T_{i} }} \) of the moving window and \( F_{Q} \) is calculated, as shown in Eq. (7.14).

$$ \rho (f_{Q} ,f_{{T_{i} }} ) = \frac{{f_{Q}^{T} f_{{T_{i} }} }}{{\left\| {f_{Q} } \right\|\left\| {f_{{T_{i} }} } \right\|}} = \cos \theta \in \left[ { - 1,1} \right], $$
(7.14)

where \( f_{Q} \) and \( f_{{T_{i} }} \) are column vectors formed by stacking \( F_{Q} \) and \( F_{{T_{i} }} \) in pixel column order. Lastly, a similarity map is constructed with the mapping function shown in Eq. (7.15).

$$ f(\rho ) = \frac{{\rho^{2} }}{{1 - \rho^{2} }}. $$
(7.15)
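A compact sketch of this matching step (Eqs. 7.14 and 7.15) is given below, assuming the feature matrices \( F_{Q} \) for the template and \( F_{{T_{i} }} \) for each candidate window have already been computed; the function names are illustrative.

```python
import numpy as np

def lark_similarity(F_Q, F_Ti):
    """Cosine similarity (Eq. 7.14) between stacked features, mapped through Eq. (7.15)."""
    f_Q, f_Ti = F_Q.ravel(order='F'), F_Ti.ravel(order='F')   # stack columns
    rho = f_Q @ f_Ti / (np.linalg.norm(f_Q) * np.linalg.norm(f_Ti) + 1e-12)
    return rho ** 2 / (1.0 - rho ** 2 + 1e-12)

def similarity_map(window_features, F_Q):
    """window_features: per-window feature matrices obtained by sliding over the image."""
    return np.array([lark_similarity(F_Q, F_Ti) for F_Ti in window_features])
```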

Framework of Tracking Algorithm

The global LARK feature matching tracking model introduced in this section (Algorithm 1) contains four parts: target selection, LARK feature extraction, target probability map calculation and target position search. The process of obtaining the weighted fusion target probability map is shown in Fig. 7.5.

Fig. 7.5
figure 5

Process diagrams

Algorithm 1

Target tracking model based on global LARK feature matching:

  • Manually select the tracking target as a template and calculate its \( W_{Q} \).

    • Extract two times the target area as the processing image, calculate its \( W_{T} \).

    • Apply PCA to \( W_{Q} \) and \( W_{T} \), and obtain feature matrix \( F_{Q} \) and \( F_{T} \).

    • Transform RGB into HSV, and use the H component to get original target probability graph.

    • Apply LARK feature matching to obtain a structure similarity graph and normalise it.

    • Obtain the weighted fusion target probability map.

    • Apply adaptive mean shift algorithm to locate target position.

    • Loop the second to the fourth step to achieve tracking.
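The following is a rough sketch of one tracking iteration built on OpenCV's back projection and CamShift, fusing the colour probability with the normalised LARK structure similarity map. The fusion weight alpha, the histogram hist (e.g. the H-channel histogram of the selected template from cv2.calcHist) and the function name are assumptions for illustration, not the exact implementation described above.

```python
import cv2
import numpy as np

def track_step(frame_hsv, hist, lark_map, window, alpha=0.5):
    """One tracking iteration: fuse colour back projection with the LARK
    similarity map (both in [0, 1]) and locate the target with CamShift."""
    back_proj = cv2.calcBackProject([frame_hsv], [0], hist, [0, 180], 1)
    prob = alpha * back_proj.astype(np.float32) / 255.0 + (1 - alpha) * lark_map
    prob8 = cv2.convertScaleAbs(prob * 255)                   # 8-bit probability map
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    rot_rect, window = cv2.CamShift(prob8, window, criteria)  # adaptive mean shift
    return rot_rect, window                                   # rotated box and next window
```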

7.3.2 Target Tracking Algorithm Based on Local LARK Feature Statistical Matching

The previous section described a LARK feature global matching method used to obtain a structure similarity map. When tracking non-compact targets, such as pedestrians, their deformation is random, complex and diverse. The LARK features of the same target under different forms are dissimilar in overall structure, whereas the local structures remain highly similar. Thus, in this section, global matching is turned into local matching, and the number of similar local structures is analysed statistically. Infrared images have no colour information and lack rich grey information, but the LARK feature can still describe the basic structural features of infrared targets with weak edges. By combining grey information with local similarity statistics to obtain the target probability map, we can distinguish the non-compact infrared target from the background and realise the subsequent iterative search tracking.

The tracking process, based on local LARK feature statistical matching, is shown in Fig. 7.6. Because infrared images have no colour information, we first obtain the original target probability map according to the grey values. As described in Sect. 7.3.1, every column vector of a feature matrix represents a local structure feature of the image. To reflect the structural features with the least number of feature vectors, cosine similarity is used to reduce the redundancy of the feature matrix. The cosine similarity of Eq. (7.14) is calculated between each \( f_{Q}^{i} \) of the feature matrix, \( F_{Q} \), and the other N − 1 columns. If it exceeds the threshold, \( t1 \), the two vectors are considered similar, and only one of them is preserved; otherwise, both are preserved. After removing the redundancy, the local features of the feature matrix \( F^{\prime}_{Q} = \left[ {f_{Q}^{{1{\prime }}} ,f_{Q}^{{2{\prime }}} , \ldots ,f_{Q}^{{n{\prime }}} } \right] \), \( n < N \), are non-repetitive. This effectively avoids the influence of similar structures originally present in the template image on the statistical matching. Then, the number of similar local structures between the feature matrices of the template image and the processing image is analysed statistically. The cosine similarity matrix is established by calculating the cosine similarity of each vector of \( F_{T} \) with each vector of \( F^{\prime}_{Q} \), as shown in Eq. (7.16).

Fig. 7.6
figure 6

Tracking model based on local LARK feature statistical matching

$$ \rho_{L} = \rho \left\langle {F_{T} ,F_{Q}^{{\prime }} } \right\rangle = \left[ {\begin{array}{*{20}c} {\rho_{11} } & \cdots & {\rho_{1n} } \\ \vdots & \ddots & \vdots \\ {\rho_{M1} } & \cdots & {\rho_{Mn} } \\ \end{array} } \right], $$
(7.16)

where \( \rho_{ij} \) is the cosine similarity of the ith column of \( F_{T} \) and the jth column of \( F^{\prime}_{Q} \). The closer the cosine similarity is to 1, the more similar the local structure represented by the two vectors and the greater the likelihood of the target. Thus, the maximum value of each row is extracted from the matrix, \( \rho_{L} \), and its column location in \( F^{\prime}_{Q} \) is preserved in the index matrix, \( {\text{index}}_{L} \).

$$ \rho^{\prime}_{L} = \left[ {{ \hbox{max} }(\rho_{{1k_{1} }} ),{ \hbox{max} }(\rho_{{2k_{2} }} ), \ldots ,{ \hbox{max} }(\rho_{{Mk_{n} }} )} \right]^{T} , $$
(7.17)
$$ \begin{array}{*{20}c} {{\text{index}}_{L} = \left[ {x_{1} ,x_{2} , \ldots ,x_{M} } \right]^{T} } & {x_{1} ,x_{2} , \ldots ,x_{M} \in \left[ {1,2, \ldots ,n} \right]} \\ \end{array} . $$
(7.18)

The resulting position index matrix, \( {\text{index}}_{L} \), and the maximum similarity matrix, \( \rho^{\prime}_{L} \), are sorted by pixel column order. The similarity threshold, \( t 2 \), is set to reduce interference of the less similar local structure. If the similarity value is greater than \( t 2 \), it is considered to be an effective similarity. Otherwise, it is considered invalid, and the index value of the corresponding position in \( {\text{index}}_{L} \) is set to zero.

The index values in the matrix represent the positions of similar structures in the template image. The more distinct index values a local window contains, the more similar local structures it contains, and thus the greater the probability of the target. Finally, a fixed-size local window is selected to traverse the index matrix. If the number of nonzero pixels of the window in the original target probability map is greater than a certain threshold, the number of non-repetitive index values is counted; otherwise, the number of index values in the window is recorded as zero. The matrix, \( R_{n} \), is constructed from the number of non-repetitive index values for each window. A statistical matching map is obtained by normalising the matrix, \( R_{n} \). The target probability map is obtained by weighting it with the original target probability map, as shown in Fig. 7.7. There is a significant difference between the pixel values of the target and the background area. The target position is then obtained by applying the adaptive mean shift algorithm to the target probability map.
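A rough sketch of this statistical matching step is given below, assuming the index matrix has already been reshaped to the image size and prob0 is the original (grey-value) target probability map; the window size, thresholds and fusion weights are illustrative assumptions.

```python
import numpy as np

def statistical_match(index_map, prob0, win=9, min_nonzero=20):
    """Count non-repetitive index values per local window (zero means invalid)."""
    H, W = index_map.shape
    R = np.zeros((H, W), dtype=np.float32)
    r = win // 2
    for y in range(r, H - r):
        for x in range(r, W - r):
            if np.count_nonzero(prob0[y - r:y + r + 1, x - r:x + r + 1]) > min_nonzero:
                patch = index_map[y - r:y + r + 1, x - r:x + r + 1]
                R[y, x] = len(np.unique(patch[patch > 0]))
    R /= (R.max() + 1e-12)            # normalised statistical matching map
    return 0.5 * R + 0.5 * prob0      # weighted fusion with the original probability map
```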

Fig. 7.7
figure 7

Left is the original target probability map; right is the weighted fusion target probability map

7.3.3 Experiment and Analysis

This section refers to the tracking model based on global LARK feature matching and CAMSHIFT as the global feature matching tracking (GLMT) algorithm, and to the tracking model based on local LARK feature statistical matching as the local LARK feature statistical matching tracking (LLSMT) algorithm. Because GLMT uses global matching, it is suitable for rigid or compact target tracking. Thus, the first part uses the GLMT, compressive tracking (CT), spatiotemporal context (STC) learning and CAMSHIFT algorithms to track human faces in a standard video library and cars in UAV video, and compares the results. Because the LLSMT algorithm uses local statistical matching, it is more suitable for non-compact targets with large changes. The second part uses LLSMT, GLMT and CAMSHIFT to track pedestrians in infrared standard video libraries and compares the results.

  • Target Tracking Experiment in Visible Video

  • Experiment on tracking car

In this experiment, the trajectory of the car is a turn at the beginning and then a straight line. The experimental results are shown in Fig. 7.8. The red box is the tracking result of GLMT algorithm, and the blue box is the tracking result of CAMSHIFT.

Fig. 7.8
figure 8

Tracking results for different frames

As shown in Fig. 7.8, when the target colour and the background are quite different and the target undergoes a significant rotation, both the GLMT and CAMSHIFT algorithms can effectively distinguish the target from the background, owing to their use of colour information. When the colour difference between the target and the background is small, the tracking result of the CAMSHIFT algorithm drifts or even loses the target, because it relies only on colour information. However, the GLMT algorithm uses both colour information and structure features, improving the contrast between target and background and giving a good tracking result.

  • Experiment on tracking a human face

This experiment chooses the human face video of AVSS2007. The human face in the video undergoes fast size changes, a slight rotation and slight occlusion. The background in the video also has a colour similar to the face. Figure 7.9 displays the tracking results of the different algorithms. The red box is the tracking result of GLMT, the green box represents CT, the blue box represents STC and the yellow box represents CAMSHIFT. The centre location error (CLE) curves of the different algorithms' tracking results are shown in Fig. 7.10. The abscissa represents the frame number of the video, and the ordinate represents the error between the centre position of the tracking result and the true target centre position. Table 7.2 lists the corresponding error analysis values, including average CLE, distance precision (DP) and overlap precision (OP).

Fig. 7.9
figure 9

Results of different tracking algorithms for human face video

Fig. 7.10
figure 10

Frame-by-frame comparison of CLE (in pixels) on tracking the human face

Table 7.2 Tracking result analysis on the human face video

As shown in Fig. 7.9, the target edge is not clear, owing to its rapid size changes and other objects of the same colour. Thus, the CAMSHIFT, CT and STC algorithms cannot accurately obtain the target location. Because LARK does not depend on the edge of the target and is more sensitive to the internal structure, the GLMT algorithm can better track the target by globally matching the unchanged internal structure of the target. From Fig. 7.10, the CLE of each frame of the GLMT tracking result is lower than that of CT, STC and CAMSHIFT, and the average CLE of the GLMT algorithm is also the lowest in Table 7.2. DP is computed as the relative number of frames in the sequence where the centre location error is smaller than a certain threshold; the DP threshold is set to 20 pixels. OP is defined as the percentage of frames where the bounding box overlap surpasses a threshold of 0.5. The DP and OP values of the GLMT algorithm are both higher than those of CT, STC and CAMSHIFT in Table 7.2. Thus, we can conclude that the GLMT algorithm tracks rigid or compact targets well.

Tracking Experiment on Infrared Non-compact Target

This experiment chooses pedestrian infrared image sequences from VOT2016. The pedestrian in the sequence has a significant change in posture. Experiment results are shown in Fig. 7.11. The red box is the tracking result of the LLSMT algorithm, the blue box represents GLMT and the green box represents CAMSHIFT.

Fig. 7.11
figure 11

Results of different tracking algorithms for pedestrian infrared image sequence

From Fig. 7.11, in the infrared image the edge of the target is weak and its grey information is not rich. Thus, the tracking result of CAMSHIFT drifts because it only uses grey information. GLMT and LLSMT both use LARK, which describes the internal structure of the infrared target well, and use LARK feature matching to obtain the target region accurately. With large variations of posture, the target shows large changes in overall structure but little change in local structures (e.g. head and feet). The LLSMT algorithm divides the whole structure into local structures and performs statistical matching, so it can track such targets well. Figure 7.12 shows the CLE curves of the three tracking algorithms on the infrared pedestrian. The CLE value of the LLSMT algorithm is the lowest, as seen in Fig. 7.12 and Table 7.3. As shown in Table 7.3, the DP and OP of the LLSMT algorithm are higher than those of GLMT and CAMSHIFT. Thus, we can conclude that the algorithm can track well non-compact infrared targets of various postures.

Fig. 7.12
figure 12

Frame-by-frame comparison of the centre location error (in pixels) on tracking pedestrian

Table 7.3 Tracking result analysis on pedestrian infrared image sequence

7.4 An SMSM Model for Human Action Detection

Considering noise, background interference and massive information, this section introduces a non-learning SMSM model based on the dense computation of a so-called space-time locally adaptive regression kernel to identify non-compact human actions. This model contains three parts:

  • calculation of GLARK feature;

  • multi-scaled composite template set and

  • spatiotemporal multiscale statistical matching.

Before we begin a more detailed elaboration, it is worthwhile to highlight some aspects of the introduced model in Fig. 7.13.

Fig. 7.13
figure 13

Flowchart of our SMSM model on motion detection

Algorithm 1

Human actions detection method based on SMSM model:

  • Construct \( W_{{Q_{i} }} \) and \( W_{{T_{i} }} \), which are a collection of 3D GLARKs associated with \( Q_{i} \), \( T_{i} \). Connect \( W_{{Q_{i} }} \) to \( W_{Q} \).

    • Apply PCA to \( W_{{T_{i} }} \) and obtain \( F_{{T_{i} }} \).

    • Remove similar column vectors in \( W_{Q} \) to get \( F_{Q} \).

    • Compute matrix cosine similarity

      for every target cube \( T_{i} \), \( i = 1, \ldots , \) the number of video frames, do

      $$ \rho_{i} = \left\langle {\frac{{F_{Q} }}{{\left\| {F_{Q} } \right\|_{F} }},\frac{{F_{{T_{i} }} }}{{\left\| {F_{{T_{i} }} } \right\|_{F} }}} \right\rangle_{F} \;{\text{and}} $$
      $$ \rho_{3DGLK} (:,:,i) = \rho \left\langle {F_{Q} ,F_{T} } \right\rangle = \left[ {\begin{array}{*{20}c} {\rho_{11} } & \cdots & {\rho_{{1n_{T} }} } \\ \vdots & \ddots & \vdots \\ {\rho_{{m_{T} 1}} } & \cdots & {\rho_{{m_{T} n_{T} }} } \\ \end{array} } \right] $$

      end for

    • Take the maximum of each row of \( \rho_{3DGLK} \), record the position of the column vector corresponding to this maximum in \( F_{Q} \), count the number of non-duplicate index values and obtain RM.

    • Apply non-maxima suppression parameter, \( \tau \), to RM.

7.4.1 Technical Details of the SMSM Model

Local GLARK Feature

This part first briefly introduces the space-time locally adaptive regression kernel. The LARK proposed by Seo and Milanfar (2009a, b) captures the geometric structure effectively. The locally adaptive regression kernel definition formula (see Seo and Milanfar 2009a, b; Luo et al. 2015) is

$$ K(C_{l} ,\Delta X_{l} ) = \exp ( - ds^{2} ) = \exp \left\{ { - \Delta X_{l}^{T} C_{l} \Delta X_{l} } \right\}. $$
(7.19)

Covariance matrix, \( C_{l} \), is calculated from the simple gradient information of the image. In fact, it is difficult to describe the concrete structural feature information of a target with simple gradient information. Moreover, it is easy to miss the weak edges of the target when the contrast of its edge region is relatively small, which causes missed detections. To make up for this defect, we fully exploit the LARK feature information, introduce the difference-of-Gaussians (DOG) operator into the LARK feature and generate a new GLARK feature descriptor to enhance weak-edge structure information. It is mainly inspired by the receptive field structure of neurons shown in Fig. 7.14 (Kuffler 1953). Rodieck and Stone (1965) established the DOG model to simulate the concentric circular antagonistic receptive field of retinal ganglion cells.

Fig. 7.14
figure 14

Receptive field map: the classical receptive field has a mutually antagonistic structure of the central and surrounding regions; and the non-classical receptive field is a larger area outside the classical receptive field, which removes the suppression of the classical sensory field

It is necessary to note that Gaussian kernel operator is defined by

$$ g( \cdot ,\sigma ) = \frac{1}{{(2\pi \sigma^{2} )^{N/2} }}\exp \left( { - \frac{{\sum\nolimits_{k = 1}^{D} {x_{k}^{2} } }}{{2\sigma^{2} }}} \right). $$
(7.20)

For a two-dimensional \( (D = 2) \) image, Gaussian convolution kernels with different \( \sigma \), acting as multiscale factors, are convolved with the gradient information of each pixel, as shown in Eq. (7.21).

$$ D\left( {x,y,\sigma } \right) = g\left( {x,y,\sigma } \right) \otimes z\left( {x,y} \right), $$
(7.21)
$$ Z\left( {x,y,\sigma ,k} \right) = D\left( {x,y,k\sigma } \right) - D\left( {x,y,\sigma } \right), $$
(7.22)

where \( \otimes \) represents convolution, \( (x,y) \) is the spatial coordinate and \( z(x,y) \) is the gradient of the image, which has two forms: \( z_{x} (x,y) \) and \( z_{y} (x,y) \). \( Z\left( {x,y,\sigma ,k} \right) \) gives the Gaussian difference gradient matrices, \( Z_{x} \) and \( Z_{y} \). Figure 7.15 shows the gradient matrix, z, of a \( 3 \times 3 \) region. Here, we assume a \( 3 \times 3 \) Gauss kernel operator \( g = \left[ {\begin{array}{*{20}c} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \\ \end{array} } \right] \) and calculate the convolution as follows:

Fig. 7.15
figure 15

Gauss kernel convolution

$$ f(i,j) = g * z = \sum\limits_{k,l} {g(i - k,j - l)z(k,l)} = \sum\limits_{k,l} {g(k,l)z(i - k,j - l)} . $$
(7.23)

Next, we take the centre point, \( (2,2) \), of the \( 3 \times 3 \) region as an example.

$$ \left[ {\begin{array}{*{20}c} {Z_{x22} } \\ {Z_{y22} } \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {Z_{x11} } & \cdots & {Z_{x33} } \\ {Z_{y11} } & \cdots & {Z_{y33} } \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} 9 & 8 & 7 & \cdots & 2 & 1 \\ \end{array} } \right]^{T} . $$
(7.24)

In this section, the third dimension of the 3D GLARK is time, and \( \Delta X = \left[ {dx,dy,dt} \right] \). Thus, the new \( C_{GLK} \) is

$$ C_{GLK} = \sum\limits_{{m \in \varOmega_{l} }} {\left[ {\begin{array}{*{20}c} {Z_{x}^{2} (m)} & {Z_{x} (m)Z_{y} (m)} & {Z_{x} (m)Z_{t} (m)} \\ {Z_{x} (m)Z_{y} (m)} & {Z_{y}^{2} (m)} & {Z_{y} (m)Z_{t} (m)} \\ {Z_{x} (m)Z_{t} (m)} & {Z_{y} (m)Z_{t} (m)} & {Z_{t}^{2} (m)} \\ \end{array} } \right]} , $$
(7.25)

where \( \Omega _{l} \) is the space-time analysis window. Then, GLARK is defined by

$$ K(C_{GLK} ,\Delta X_{l} ) = \exp \left\{ { - \Delta X_{l}^{T} C_{GLK} \Delta X_{l} } \right\}. $$
(7.26)

GLARK highlights the edges of an object, which weakens the texture interference, especially paying more attention to the local structure of weak edges. It is expressed as

$$ W_{x} = [W_{x}^{1} ,W_{x}^{2} , \cdots ,W_{x}^{M} ] \in R^{N \times M} , $$
(7.27)

where M represents the number of pixels, one GLARK descriptor per pixel. Figure 7.16 gives the GLARK feature descriptor. As can be seen from the figure, GLARK better describes the structural trend at weak edges, and in normal edge regions the introduced feature descriptor also outperforms that of the Seo algorithm.
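Below is a rough sketch, under stated assumptions, of how the DoG gradients (Eqs. 7.20–7.22) and the 3D GLARK kernel (Eqs. 7.25–7.26) of one space-time window might be computed with scipy; the smoothing is applied to the whole grey video volume for simplicity, and the scale factor k, window radius and function names are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_gradients(volume, sigma=1.0, k=1.6):
    """Difference-of-Gaussian smoothed gradients of a grey video volume (y, x, t)."""
    zy, zx, zt = np.gradient(volume.astype(np.float32))
    dog = lambda z: gaussian_filter(z, k * sigma) - gaussian_filter(z, sigma)
    return dog(zx), dog(zy), dog(zt)

def glark_kernel(Zx, Zy, Zt, centre, radius=1):
    """GLARK values of all offsets in one (2*radius+1)^3 space-time window."""
    y, x, t = centre
    sl = (slice(y - radius, y + radius + 1),
          slice(x - radius, x + radius + 1),
          slice(t - radius, t + radius + 1))
    G = np.stack([Zx[sl].ravel(), Zy[sl].ravel(), Zt[sl].ravel()], axis=1)
    C = G.T @ G                                                   # 3x3 matrix, Eq. (7.25)
    rng = np.arange(-radius, radius + 1)
    dy, dx, dt = np.meshgrid(rng, rng, rng, indexing='ij')
    offs = np.stack([dx.ravel(), dy.ravel(), dt.ravel()], axis=1)  # [dx, dy, dt] offsets
    return np.exp(-np.einsum('ij,jk,ik->i', offs, C, offs))       # Eq. (7.26)
```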

Fig. 7.16
figure 16

GLARK map of hat edge

Multi-scaled Composite Template Set

The colour video contains three colour channels, so each frame is three-dimensional; however, the third dimension of our model is time. Thus, the video is converted to a grey image sequence. Differing from supervised methods that require millions of training images, our template set is minimal. Figure 7.17 displays the templates, from T1 to T5, including pedestrian, long jump, skiing, cliff diving and javelin throw actions.

Fig. 7.17
figure 17

Template set of various actions

The action templates are resized to multiple scales, so the model can recognise actions of multiple sizes in one frame. Figure 7.18 shows the formation of the multi-scaled template set. Every template image is sequenced to obtain the original cube, and then nearest-neighbour interpolation resizes the original template to 0.5 times and 1.5 times, forming three template sequences. Only the first and second dimensions of the original template, which indicate size, are resized; the third dimension, which implies motion, remains the same. 3D GLARK is then utilised to obtain the multi-scaled local structure template set.

Fig. 7.18
figure 18

Multi-scaled template set

As shown in Fig. 7.18, the size of \( {\text{template}}_{0.5} \), \( {\text{template}}_{\text{initial}} \) and \( {\text{template}}_{1.5} \) are, respectively, \( m_{Z} /2 \times n_{Z} /2 \times t_{Z} \), \( m_{z} \times n_{z} \times t_{z} \) and \( 3m_{z} /2 \times 3n_{z} /2 \times t_{z} \). Hence, we can obtain three GLARK matrices, shown as \( W_{0.5} \in R^{{N \times M_{1} }} \), \( M_{1} = M/2^{2} \), \( W_{\text{initial}} \in R^{N \times M} \), \( M = m_{z} \times n_{z} \times t_{z} \) and \( W_{1.5} \in R^{{N \times M_{3} }} \), \( M_{3} = M \times (3/2)^{2} \), where \( N = m_{1} \times n_{1} \times t_{1} \) is the size of 3D windows for calculating GLARK. Then, \( W_{0.5} \), \( W_{\text{initial}} \) and \( W_{1.5} \) are connected to obtain the total template matrix:

$$ W_{Q} = \left[ {\begin{array}{*{20}c} {W_{0.5} } & {W_{\text{initial}} } & {W_{1.5} } \\ \end{array} } \right] \in R^{{N \times M_{t} }} ,\quad M_{t} = M_{1} + M + M_{3} . $$
(7.28)
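The following brief sketch shows how the multi-scaled template matrix of Eq. (7.28) might be assembled, assuming nearest-neighbour resizing of each frame and a glark_features routine (such as the one sketched above) returning an N × M weight matrix per cube; all names are illustrative.

```python
import cv2
import numpy as np

def multiscale_template(template_cube, glark_features, scales=(0.5, 1.0, 1.5)):
    """template_cube: (m_z, n_z, t_z) grey template sequence. Spatial dimensions are
    rescaled with nearest-neighbour interpolation; the temporal dimension is kept."""
    mats = []
    for s in scales:
        frames = [cv2.resize(template_cube[:, :, t], None, fx=s, fy=s,
                             interpolation=cv2.INTER_NEAREST)
                  for t in range(template_cube.shape[2])]
        cube = np.stack(frames, axis=2)
        mats.append(glark_features(cube))   # N x M_s weight matrix (Eq. 7.27)
    return np.concatenate(mats, axis=1)     # W_Q of Eq. (7.28)
```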

SMSM Model

This section studies the non-learning SMSM model. For \( W_{{T_{S} }} \) (‘s’ represents the video frame index), features are derived by applying dimensionality reduction (i.e. PCA) to the resulting arrays (Seo and Milanfar 2009a, b). A threshold, \( \alpha \), the so-called similar structure threshold, is needed to remove the redundancy of \( W_{Q} \). The model calculates the cosine of the angle between every pair of column vectors in the matrix. If the cosine value is greater than \( \alpha \), the two vectors are considered similar, and only one of them is retained. After reducing the dimensionality and redundancy, \( F_{Q} \) is obtained together with \( F_{{T_{S} }} \).

The next step in the introduced framework is a decision rule based on the measurement of matrix cosine similarity (Leibe et al. 2008). For \( F_{{T_{S} }} \), we calculate the cosine of the angle between each column vector, \( f_{T}^{i} \), of \( F_{T} \) and each column vector, \( f_{Q}^{j} \), of \( F_{Q} \). Then, we obtain the cosine similarity matrix, \( \rho_{{3{\text{DGLK}}}} \):

$$ \rho_{{3{\text{DGLK}}}} (:,:,k) = \rho \left\langle {F_{Q} ,F_{T} } \right\rangle = \left[ {\begin{array}{*{20}c} {\rho_{11} } & \cdots & {\rho_{{1n_{T} }} } \\ \vdots & \ddots & \vdots \\ {\rho_{{m_{T} 1}} } & \cdots & {\rho_{{m_{T} n_{T} }} } \\ \end{array} } \right]\left( {k = 1,2, \ldots ,t_{T} } \right), $$
(7.29)

where \( m_{T} \times n_{T} \times t_{T} \) is the size of the test video. This section takes the maximum of each row in the matrix, \( \rho_{{3{\text{DGLK}}}} (:,:,k) \), and records the position of the column vector corresponding to the maximum in \( F_{Q} \). The location information is saved in the \( {\text{index}}_{\text{GLK}} \) matrix:

$$ {\text{index}}_{\text{GLK}} (:,k) = (x_{1} ,x_{2} , \ldots ,x_{{m_{T} }} )^{T} \quad x_{1} ,x_{2} , \ldots ,x_{{m_{T} }} = 1,2, \ldots ,n_{T} . $$
(7.30)

We arrange the position index matrix, \( {\text{index}}_{\text{GLK}} (:,k) \) and \( \rho_{{3{\text{DGLK}}}} (:,:,k) \), into a matrix of \( m_{T} \times n_{T} \), according to the column order. Thus, we need a similarity threshold, \( \theta \), whose empirical value is 0.88, to judge each element, \( \rho^{\prime}_{{3{\text{DGLK}}}} \) in \( \rho_{{3{\text{DGLK}}}} \). If \( \rho^{\prime}_{{3{\text{DGLK}}}} \ge \theta \), two vectors are similar. If \( \rho^{\prime}_{{3{\text{DGLK}}}} < \theta \), two vectors are not similar. At this point, the corresponding position in the \( {\text{index}}_{\text{GLK}} \) is recorded as 0.

This section selects an appropriate local window of \( P \times P \times T \) to traverse the \( {\text{index}}_{\text{GLK}} \) matrix and count the non-duplicate index values in the window. The process can be expressed as \( {\text{num}} = {\text{num}}({\text{Unique}}({\text{index}}_{{{\text{GLK}}_{P \times P \times T} }} )) \), where \( {\text{num}} \) represents the similarity value between the target of interest and the present region. Each index value represents a corresponding structure of the target image, like the template. The more distinct index values there are, the more similar local structures exist in the local window. To this end, we construct a similarity matrix, \( {\text{RM}}_{\text{GLK}} \). Figure 7.19 shows the process from \( {\text{index}}_{\text{GLK}} \) to \( {\text{RM}}_{\text{GLK}} \).

Fig. 7.19
figure 19

Process from \( {\text{index}}_{\text{GLK}} \) to \( {\text{RM}}_{\text{GLK}} \)

On the basis of \( {\text{RM}}_{\text{GLK}} \), we can acquire a global similarity image, RM. Next, we employ a method of non-maxima suppression from Devernay (1995) to extract the local maxima of the resemblance image and seek out the location of people. Therefore, we need a non-maximum suppression threshold, \( \tau \), to decide whether to ignore some areas of RM. Then, we calculate the maximum value of the remaining possible area.
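One possible way to implement this non-maxima suppression step on the resemblance map RM is sketched below, using scipy's maximum filter; the neighbourhood size is an illustrative assumption, and \( \tau \) is the threshold selected in the parameter analysis of Sect. 7.4.2.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def detect_peaks(RM, tau=0.75, size=15):
    """Suppress weak responses of RM, then keep only its local maxima."""
    RM = RM / (RM.max() + 1e-12)
    candidates = RM >= tau                           # ignore low-confidence areas
    local_max = RM == maximum_filter(RM, size=size)  # per-pixel local-maximum test
    ys, xs = np.nonzero(candidates & local_max)
    return list(zip(ys, xs))                         # detected target locations
```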

In Fig. 7.20a, the deep-red part of the map represents higher grey values, indicating the possibility of the existence of the target in the corresponding region of the test frames. It is easy to see that there is a region of high grey value. Thus, there may be a human body in the location of the test frames, T, corresponding to the maximum grey value of the region. Figure 7.20b represents RM with non-maximum suppression; Fig. 7.20c represents the detection result of a sampled frame.

Fig. 7.20
figure 20

a Resemblance maps; b RMs with non-maximum suppression; c detection result of sampled frames

7.4.2 Experiments Analysis

Parameters Analysis

The main parameters for our approach are as follows: similar structure threshold, \( \alpha \), for \( W_{Q} \) and non-maximum suppression threshold, \( \tau \), for RM.

When removing redundancies, we judge whether two vectors are similar by comparing the cosine value of the two weighted vectors with the threshold, \( \alpha \). The larger the threshold, \( \alpha \), the more weight vectors are retained and the more complex the calculation; the smaller the threshold, \( \alpha \), the greater the difference among the retained weight vectors and the less accurate the recognition results. Therefore, a specific threshold, \( \alpha \), must be selected. This section statistically analyses the number of retained weight vectors under different \( \alpha \). As shown in Fig. 7.21, the abscissa represents the threshold, \( \alpha \), and the ordinate represents the number of weight vectors after de-redundancy. The curves show that the number of retained weight vectors first increases slowly and then grows rapidly as the threshold, \( \alpha \), increases. Therefore, the threshold is determined by the turning point of the curve, where the slope is 0.5. In Fig. 7.21a, this section chooses the point (0.986, 909). Thus, threshold \( \alpha \) is set to 0.986, the number of weight vectors after de-redundancy is 909 and the new weight vector matrix is used in the similarity structure analysis. Similarly, in Fig. 7.21b, we choose the point (0.994, 289): threshold \( \alpha \) is set to 0.994 and the number of weight vectors after de-redundancy is 289. In Fig. 7.21c, we choose the point (0.996, 118): threshold \( \alpha \) is set to 0.996 and the number of weight vectors after de-redundancy is 118.

Fig. 7.21
figure 21

Relationship between number of local structures and threshold \( \alpha \): a pedestrian; b athlete; c multiscale people in one video

As shown in Fig. 7.20, to accurately determine the location of the target, a non-maximum suppression threshold, \( \tau \), is required for optimal retention of the pixel values in the RM. Thus, multiple target images are selected and probability density curves are established from their respective RM matrices (see Fig. 7.22a). The probability density curves are integrated to establish the integral sum curves (see Fig. 7.22b). Although the RM probability density distribution of each target image is quite different, their integral sum curves converge at the tail. Because the tail corresponds to the larger RM values, where the target may still occur, the other relatively small grayscale pixel values can be omitted to reduce the amount of computation and improve efficiency. Thus, we select the point where the integral sum curves of the image RMs begin to converge; the right side of this point is retained, and the left is omitted. Figure 7.22b shows that \( \tau \) is set to 0.75.

Fig. 7.22
figure 22

a RM probability density curves; b RM probability density integrals sum curves

Test Results

A quantitative evaluation of the introduced method is presented in the next subsection. This subsection tests our method and the Seo algorithm on challenging scenes (i.e. single objects, fast-moving objects and multiple objects) from the Visual Tracker Benchmark.

  • Single person

To evaluate the robustness of the introduced method, we first test it on visible video with one object. Figure 7.23 shows the results of searching for walking people in a target video (597 frames of 352 × 288 pixels). The query video contains a very short walking action moving to the right (10 frames of 48 × 102 pixels). This section randomly displays the results of three frames. Obviously, the detection effect of the multispectral data is superior to that of the single band. Owing to its single template, the Seo algorithm can only recognise the target when it has a posture similar to the template; for other frames, its performance is poor. The introduced algorithm enlarges the weight matrix commensurate with the abundant spectral information, so not only is the detection rate improved, but the results are also identified more accurately. Simultaneously, because our GLARK focuses on the local structure information of the target, our algorithm detects well when the target is partially occluded, whereas the Seo algorithm does not, because it is more concerned with the overall structure than with the local structure. Thus, when part of the body structure is blocked, the Seo algorithm does not apply.

Fig. 7.23
figure 23

Detection results of pedestrian: a introduced algorithm; b Seo algorithm; c introduced algorithm in overshadowed scene; d Seo algorithm in overshadowed scene

  • Fast motions

This section also tests the introduced method on visible video with fast motions. Figure 7.24 shows the results of detecting a female skater turning (160 frames of 320 × 240 pixels) and a male surfer (84 frames of 480 × 270 pixels). The query videos contain fast motions of the skater (6 frames of 121 × 146 pixels) and the surfer (6 frames of 84 × 112 pixels). Figure 7.24b shows that the Seo algorithm can coarsely localise the athlete in the images, but the bounding boxes only contain parts of the person. In contrast, the turns of the female athlete are detected by the introduced method even though the video contains very fast-moving parts and relatively large variability in spatial scale and appearance compared to the query Q in (a); these bounding boxes locate the athlete in the image well. Additionally, we test a scene with a change of target scale during fast motion. As shown in (c) and (d), the Seo algorithm is very unstable, whereas our algorithm is robust to the changing scale of the target. Analysis suggests that this is because our templates are multi-scaled composites, while the Seo template is relatively simple. Thus, our method overcomes the drawbacks of the Seo algorithm and produces reliable results on this dataset.

Fig. 7.24
figure 24

Fast motions detection: a, c introduced algorithm; b, d Seo algorithm

  • Multiple cases and scales of pedestrians

To further evaluate the performance of the introduced method, we test it on visible video with different people. Figure 7.25 shows the results of detecting multiple cases and different sizes of humans, which occur simultaneously in two monitor videos. The query videos contain two pedestrians who drift ever farther away, so their sizes diminish. Figure 7.25a, c shows the results of the introduced algorithm, and Fig. 7.25b, d shows the detection results of the Seo algorithm. The woman’s coat on the left of the first picture in (a) and (b) and the man’s clothes in (c) and (d) have low contrast against the cement. For such weak human edges, the Seo algorithm easily mistakes the human for background, whereas our algorithm detects the target well in this case, because it enhances weak-edge information locally. Moreover, the multiple scales of the people limit the detection of the Seo algorithm in the same field of view in (c) and (d). Our algorithm does not suffer such interference in the experiments because of its multiscale template set.

Fig. 7.25
figure 25

Visual comparisons of different methods with different scales: a, c introduced algorithm; b, d Seo algorithm

Analysis of the Introduced Method

This section first evaluates the intersection over union (IoU) of our method and the Seo algorithm on the images shown; see Fig. 7.26a. It is clear that the introduced method achieves superior performance compared to the Seo algorithm. For target recognition systems, precision and recall are contradictory quantities. Receiver operating characteristic (ROC) curves (Lachiche and Flach 2003) are introduced to evaluate the effects of target recognition.

Fig. 7.26
figure 26

Comparison between Seo and the introduced algorithm: a IoU curve; b RPC curve

$$ {\text{TPr}} = \frac{TP}{TP + FN},\quad {\text{FPr}} = \frac{FP}{TP + FP}, $$
(7.31)

where TP is the correctly detected target area, FP is the area mistaken for the target and FN is the undetected target area. Hence, the recall–precision curve (RPC) (Leibe et al. 2008) is utilised. As shown in Fig. 7.26b, the horizontal and vertical axes correspond to the two evaluation indexes: 1-precision and recall.
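A small sketch of how these area-based quantities might be computed from binary detection and ground-truth masks is given below; the mask names are illustrative.

```python
import numpy as np

def detection_rates(det_mask, gt_mask):
    """Eq. (7.31): recall (TPr) and 1-precision (FPr) from area overlaps."""
    TP = np.logical_and(det_mask, gt_mask).sum()    # correctly detected target area
    FP = np.logical_and(det_mask, ~gt_mask).sum()   # area mistaken for the target
    FN = np.logical_and(~det_mask, gt_mask).sum()   # undetected target area
    recall = TP / (TP + FN + 1e-12)
    one_minus_precision = FP / (TP + FP + 1e-12)
    return recall, one_minus_precision
```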

The visual tracker benchmark has manually tagged the test sequences with 11 attributes, representing the challenging aspects of visual detection. The 30 videos containing human actions in datasets are tested by our model and the Seo algorithm. The mAP is the average precision of 11 attributes, as shown in Fig. 7.27.

Fig. 7.27
figure 27

Histogram of average precision (%) for each attribute on visual tracker benchmark

Experiments at THUMOS 2014

To make the experimental results of our algorithm more credible, we compare against two other benchmark systems: (1) S-CNN leverages deep networks for temporal action localisation via three segment-based 3D ConvNets; (2) Wang et al. (THUMOS 2014: http://crcv.ucf.edu/THUMOS14/download.html) built a system on iDT with FV representation and frame-level CNN features and performed post-processing to refine the detection results. The temporal action detection task in the THUMOS Challenge 2014 was dedicated to localising action instances in long untrimmed videos (THUMOS 2014: http://crcv.ucf.edu/THUMOS14/download.html). Our model is unsupervised and only uses the test data.

The detection task involved 20 actions, as shown in Table 7.4, and every action included many scenes. The average precision of each action was tested over all scenes with the three methods; mAP is the mean average precision over 213 videos, shown in Table 7.4. Although our mAP is slightly lower than that of S-CNN, our model is unsupervised and does not rely on a training process.

Table 7.4 Histogram of average precision (%) for each class on THUMOS (2014)

7.5 Summary

In view of some problems of the existing detection and tracking methods (e.g. diverse motion shapes, complex image features and weak adaptability to varied scenes), we described the following three models:

  • A new method for infrared small object detection was introduced that detects small objects using sparse errors and structure differences. This is the first time that structure differences and sparse errors have been successfully applied to infrared small object detection. The infrared small target can be easily extracted by exploiting sparse errors and structure differences, and the final result is obtained via a simple fusion framework. Experimental results demonstrate that the introduced method is effective, performs favourably against other methods and yields superior detection results.

  • The introduced algorithm, GLMT, utilises the advantages of LARK spatial structures, which are not disturbed by weak target edges, and combines colour or grey information to improve the contrast between the target and the background. The GLMT algorithm remedies the lack of spatial information and the background interference in CAMSHIFT and tracks compact or non-compact targets with small posture changes in videos of different spectra. Additionally, the local structure statistical matching method of LARK was introduced, exploiting the sensitivity of the LARK feature to changes of weak local structure. Using this local statistical matching, the introduced algorithm, LLSMT, can track non-compact targets with different posture changes.

  • A space-time local structure statistical matching model was introduced to recognise non-compact actions and to expand the scenes of the test video. First, a new GLARK feature was designed by introducing Gaussian difference gradients into LARK descriptors to encode local structure information. Second, a multi-scaled composite template set was applied for actions of multiple sizes. Finally, the space-time local structure statistical matching method was applied to action videos to mark the actions. Experimental results demonstrate that the approach outperforms previous approaches and achieves more robust and accurate results for human action detection. Furthermore, the SMSM model is distinguished from traditional learning methods and matches the template with the test video more efficiently.