1 Introduction

Infrared thermography (IRT, or thermal video) has been widely used in civilian and military applications such as surveillance, night vision and tracking, weather forecasting, firefighting and facility inspections, collecting high-quality image data beyond the range of human visual perception. The exceptional capacity of IRT comes from its ability to detect and record radiation in the long-wavelength infrared range of the electromagnetic spectrum [1]. In contrast to conventional night-vision techniques, IRT is not fundamentally hindered by the lack of local illumination or by disturbing factors such as fog or smoke. Recent advances in IRT cameras have significantly improved the resolution and bit-depth of thermal images, which had previously often been considered inferior to visual images, making IRT suitable and widely used in scenarios containing high-value targets, including remote surveillance applications where distant vehicles, pedestrians or buildings are monitored. For this reason, automatic detection and recognition of these targets has attracted increasing interest in both academia and industry [17].

Despite the advances in acquisition technology, long-range object detection and recognition in IRT images collected under real-world settings remains a challenging research topic. Such images are usually acquired at very long distances, leading to extremely low numbers of pixels on target. A further challenge resides in the nature of IRT imaging: if the temperature of the object of interest is similar to that of the background, the contrast will be low. These adverse effects are significant obstacles that degrade the performance of automatic object detection/recognition (ATD/R) in IRT images and hinder its application in practice. Figure 1 shows two real-world image examples where the targets (a people carrier (“Bus”) in (a) and an estate car (“Skoda”) in (b)) exhibit low resolution and poor contrast, which can lead to a high probability of false alarms in our developed ATD/R system (see Table 3 in Section 4). The Bus target in Fig. 1a has very low resolution (14 × 8 pixels) and is barely visible. The Skoda target in Fig. 1b almost blends into the background.

Fig. 1
figure 1

Two IRT image examples captured at a distance of 600 m (camera wide field of view). The vehicle targets are highlighted by bounding boxes

A number of techniques have been developed to detect and recognise objects in video surveillance. These include region-based segmentation, background subtraction, temporal differencing, active contour models, and generalised Hough transforms [18]. Because surveillance video sequences are typically obtained with static cameras against a fixed background, background subtraction [22] is a commonly used detection approach in this scenario: a background is modelled and moving objects in a scene are then identified by comparing key frames with the background. Moving object recognition methods, in turn, rely on low-level feature based methods (Viola-Jones, Histogram of Oriented Gradients –HoG, Speeded Up Robust Features –SURF, Scale Invariant Feature Transform –SIFT) [3, 16] or texture descriptors (Discrete Wavelet Transform –DWT, Legendre moments and Haralick features) [8] to recognise objects by classifying them into a semantic class or category. Support Vector Machines (SVM) have shown great potential for this classification task, whilst other methods use ensemble classifiers (e.g. AdaBoost in Viola-Jones) [24].

The schemes above may be able to detect objects to some extent in IRT applications. However, when target signatures are small, the capability of existing methods is uncertain due to the difficulties mentioned previously. Figure 2 illustrates the DoG (Difference of Gaussian) pyramids of two images with different vehicle types (Bus and a Transit van (“Van”)), shown in Fig. 2a and b respectively. These two images are so similar at various scales that traditional DoG-based methods (such as SIFT) are prone to making wrong judgements. This issue was also identified by looking at performance on the canonical visual recognition task, PASCAL VOC object detection [5]. In particular, a pipeline that combines a HoG-based feature descriptor [3] with a learned SVM classifier is a common way to perform object recognition in the community. The HoG descriptor extracted from an input image is encoded into a fixed-length representation that can be classified by a linear SVM [3, 7] or an additive-kernel SVM [23]. We have employed this method as a benchmark for performance comparison (we will refer to it as HoG-SVM). We will show in our experiments that HoG-SVM is ineffective in IRT imaging and that its performance is seriously affected by low resolution and poor contrast, since these hand-crafted local descriptors convey limited semantic information about the objects [23, 25]. In particular, HoG features depend strongly on image gradients, i.e. edge information or image contrast. To illustrate this, we present in Fig. 3 some issues that appear when applying HoG-SVM to our collected data, where the top 10 detections are kept (some detections may overlap, so fewer than 10 boxes are observable in the figure). We performed two trial acquisitions (hereinafter \(T_{1}\) and \(T_{2}\), see Section 2) in different weather conditions, which led to datasets with different contrast levels. For \(T_{1}\) data, the vehicle with salient features (see Fig. 3a1) can be detected and recognised by HoG-SVM. However, Fig. 3b1 and Fig. 3c1 show that HoG-SVM fails to detect the vehicle in cases with weak features and poor contrast. In particular, it is highly ineffective for \(T_{2}\) data (Fig. 3c1) due to the low contrast between vehicle and background. In Section 4, a quantitative comparison demonstrates numerically that our scheme is superior to this traditional object recognition method.
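For clarity, the sketch below illustrates a generic HoG-SVM pipeline of the kind used as our baseline. It is not the exact implementation employed in our experiments; the window size, HoG parameters, stride and the top-10 selection are illustrative assumptions.

```python
# Illustrative HoG-SVM baseline: dense HoG features, a linear SVM, and a
# sliding-window detector keeping the top-10 scoring windows.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_descriptor(window):
    # 64x64 grayscale window -> fixed-length HoG vector (parameters are illustrative)
    return hog(window, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

def train_hog_svm(pos_windows, neg_windows):
    # pos_windows / neg_windows: lists of cropped target / background windows
    X = np.array([hog_descriptor(w) for w in pos_windows + neg_windows])
    y = np.array([1] * len(pos_windows) + [0] * len(neg_windows))
    return LinearSVC(C=0.01).fit(X, y)

def detect(image, clf, win=64, stride=16):
    """Slide a window over the image and score each position with the SVM."""
    detections = []
    for r in range(0, image.shape[0] - win, stride):
        for c in range(0, image.shape[1] - win, stride):
            score = clf.decision_function([hog_descriptor(image[r:r + win, c:c + win])])[0]
            detections.append((score, r, c, win, win))
    return sorted(detections, reverse=True)[:10]   # keep the top-10 detections
```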

Fig. 2
figure 2

DoG pyramids for two IRT images with different objects

Fig. 3
figure 3

Three resulting examples using HoG-SVM (a1, b1, c1) and our method (a2, b2, c2)

Deep learning techniques in the form of convolutional neural networks (CNNs) [12] have recently achieved significant progress in object recognition by extracting informative non-linear features through hierarchical, multi-stage processing. The idea of CNNs was originally proposed by LeCun [13] in 1989, but its influence in object recognition was limited, owing to the popularity of SVMs, until the 2010s. In their celebrated paper [12], Krizhevsky et al. substantially improved the earlier CNN model with several key innovations, e.g. the “ReLU” activation \(max(x,0)\), “dropout” regularisation, and a fast GPU implementation. Since then, with the development of CNN architectures [21, 27], CNNs have shown unrivalled success in object detection, which is credited to new localisation formulations proposed in place of the obsolete sliding-window detectors, such as selective search [23] and region proposal networks [20]. To date, a prevalent CNN framework [20] for object recognition is as follows: first, convolutional layers are employed to acquire region-based features for detecting objects of interest; then a region-wise multi-layer perceptron (MLP) classifier performs the classification for recognising objects. Thus, a complete CNN-based ATD/R system can be developed for object detection and recognition in video surveillance. Moreover, CNNs can additionally be used to perform end-to-end mapping between low- and high-resolution images [4] in order to produce a super-resolved version of the input image. This image enhancement process can help to overcome the low pixel-on-target count challenge in long-range IRT acquisitions.

This paper presents an infrared video based surveillance system consisting of a resolution-enhanced ATD/R system that can be widely used in various civilian and military applications. The system has been tested under different seasonal conditions using two datasets featuring pedestrians and 6 different types of vehicle targets. The developed ATD/R system effectively copes with the small pixel-on-target issue and recognises targets of extremely small resolution with superior performance. A comparison with a traditional method (HoG-SVM) confirms this superiority both qualitatively and quantitatively. Preliminary results on the application of super-resolution were presented in [28]; the present paper substantially extends that study in both system development and additional experiments with new data. The main contribution of this paper is twofold:

  1. 1.

    To deal with the small pixel-on-target issue in the IRT imagery, we propose to use a CNN-based super-resolution method [4] to increase target signature resolution and optimise the baseline quality of the input images to the object recognition module.

  2. 2.

    To overcome the small signature and low contrast challenges in IRT-based surveillance, a CNN-based ATD/R system is developed based on a novel Region Proposal Network (Faster-RCNN [20]) which detects objects by extracting convolutional feature maps.

The performance of the proposed system is assessed using IRT surveillance data including 6 types of vehicles and pedestrians (The data is available for public use at http://www.lpi.tel.uva.es/AALARTDATA).

The remainder of the paper is organised as follows: Section 2 describes the data preparation and background. Section 3 introduces the framework of our ATD/R system and the relevant methodologies involved. Experimental results and system evaluation are presented in Section 4. Finally, Section 5 closes the paper with the main conclusions drawn from this research.

2 Data acquisition and pre-processing

2.1 Data acquisition and preparation

We extracted raw images from surveillance video clips acquired using a Catherine MP LWIR camera (Thales UK Ltd) [2], a specialised thermal camera that uses micro-scanning technology to produce frames with a resolution of 640 × 512 pixels. We carried out data collection on two separate occasions, termed Trial 1 (\(T_{1}\)) and Trial 2 (\(T_{2}\)), involving 6 types of vehicles in total as well as pedestrian targets. \(T_{1}\) data was acquired in winter, when there was significant thermal contrast between background and targets. Three types of vehicles, Bus, Skoda and Van, were employed as targets. The targets were acquired using the camera wide field of view at 100 m, 200 m, 300 m, 400 m, 500 m and 600 m, and six groups of video clips were collected, each consisting of approximately 8,000 IRT images. Figure 1 shows two example images from a video clip collected at 600 m range, with a people carrier (left) and an estate car (right) in the respective scenes. The \(T_{2}\) scenes were acquired in spring with lower contrast between the background scenery and the targets, and with a distant cold blue sky expanding the image dynamic range, so the contrast levels and absolute target signatures are lower than in the \(T_{1}\) scenes. The vehicles involved are Landrover, Saloon car (“Saloon”) and Pickup truck (“Truck”). These datasets were acquired with the same camera setup as in \(T_{1}\). Figure 4 presents several example images from \(T_{1}\) and \(T_{2}\). The Bus training and test datasets can be downloaded at http://www.lpi.tel.uva.es/AALARTDATA; the complete dataset will be released soon.

Fig. 4
figure 4

Example images for T1 and T2

2.2 Image preprocessing and camera bias correction

The video clips were initially acquired as “*.vstream” files. Raw frames were subsequently extracted and converted to PNG format to feed into our ATD/R system. IRT images contain a small amount of random noise that is uncorrelated from pixel to pixel, together with small-scale non-uniformities between adjacent pixels; the effect of these perturbations on the subsequent processing was negligible. To decrease variability during model training and CNN computation, the input images were mean-subtracted before entering the pipeline.
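A minimal sketch of this pre-processing step is given below: a frame already converted to PNG is loaded and its mean is subtracted. The use of a per-image mean (rather than a dataset-wide mean) is an assumption for illustration.

```python
# Load a converted IRT frame and subtract its mean before feeding the pipeline.
import numpy as np
from PIL import Image

def load_and_preprocess(png_path):
    img = np.asarray(Image.open(png_path), dtype=np.float32)
    return img - img.mean()   # zero-mean input for the CNN pipeline
```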

3 ATD/R structure and methodology

Our ATD/R system consists of two main stages, as shown in Fig. 5. At the first stage (the blue block in Fig. 5), a contemporary CNN-based super-resolution method (SRCNN) [4] is applied to improve the signature of objects with few pixels on target in the original IRT images. The CNN weights were trained using raw IRT data randomly selected from the dataset. At the second stage (the green block in Fig. 5), a state-of-the-art CNN model, Faster RCNN [20], is applied to perform object detection and recognition. The architectures of these stages are unified into the overall CNN framework and the implementation is based on the Caffe [11] development environment.

Fig. 5
figure 5

Block diagram of our developed ATD/R system

3.1 Convolutional neural networks

Findings on the mechanisms of the visual cortex of the brain have successfully driven CNN design for pattern-based problems. Similar to a traditional neural network architecture, a CNN is made up of layers that connect neurons locally between consecutive layers by learning data-specific kernels. Three main types of layers are employed to build a CNN model: the convolutional layer, the pooling layer and the fully-connected layer. A typical CNN for object recognition is illustrated in Fig. 6: the input is an image and the output is a single vector of class scores. The role of each type of layer can be described as follows:

  1. (i)

    The convolutional layer will generate a volume of feature maps by computing a dot product between the filter weights and the local region of the input volume to which they are connected.

  2. (ii)

    The pooling layer will downsample the feature map along the spatial dimensions.

  3. (iii)

    The fully-connected layer will create the class scores according to the given categories.

In addition, to make the CNN model robust, a ReLU layer applies an elementwise activation function, and a dropout strategy randomly ignores neurons during training to prevent inter-dependencies between neurons.
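As an illustration of how these layer types fit together, the following PyTorch-style sketch builds a toy network with the three layer types plus ReLU and dropout. All layer sizes and the assumed 64 × 64 grayscale input are placeholders; this is not one of the networks used in this work (our implementation is Caffe-based).

```python
import torch.nn as nn

# Toy CNN combining the three layer types described above (sizes are placeholders).
class TinyCNN(nn.Module):
    def __init__(self, num_classes=7):                      # e.g. 6 vehicle types + pedestrian
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),     # convolutional layer
            nn.ReLU(inplace=True),                          # elementwise activation
            nn.MaxPool2d(2),                                # spatial downsampling
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),                                # randomly ignore neurons
            nn.Linear(32 * 16 * 16, num_classes),           # class scores (64x64 input assumed)
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```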

Fig. 6
figure 6

A typical CNN architecture

3.2 Superresolution process using CNNs

3.2.1 Introduction to the SRCNN method

Figure 7 shows the block diagram of processing our data using SRCNN [4]. The overall idea of super-resolution is that a low-resolution image I is upscaled to a new image Y using bicubic interpolation, and a mapping function F is then employed to recover the high-resolution image X from Y. A popular strategy to obtain F is the following: first, patches are generated from Y and represented by a set of pre-trained bases, yielding the feature maps of the low-resolution image; second, a non-linear mapping is applied to the feature maps to generate the representation of a high-resolution patch; finally, the predicted high-resolution patches are averaged to produce the final full image. In SRCNN, these traditional operations are implemented as a three-layer CNN. The mapping F is obtained by a CNN framework consisting of the following three operations (an illustrative sketch is given after the list):

  • (Operation 1) Patch extraction and representation. This is the implementation of the first layer in Fig. 7. It can be described as an operation \(F_{1}\):

    $$ F_{1}(Y)=max(0, W_{1} * Y + B_{1}) $$
    (1)

    where \(W_{1}\) and \(B_{1}\) are the filters and biases, respectively. \(W_{1}\) applies \(n_{1}\) convolutions on the input image, where the kernel size is \(c\times f_{1} \times f_{1}\), with c the image channel. The output includes \(n_{1}\) feature maps. \(B_{1}\) is an \(n_{1}\)-D vector associated with the filters.

  • (Operation 2) Non-linear mapping

    The second layer in Fig. 7 is applied to implement the following operation:

    $$ F_{2}(Y)=max(0, W_{2} * F_{1} (Y) + B_{2}) $$
    (2)

    where \(W_{2}\) is a matrix of \(n_{1} \times 1 \times 1 \times n_{2}\) dimensions and \(B_{2}\) is an \(n_{2}\)-D vector. Each of the outputs is an \(n_{2}\)-D vector that conceptually represents a high-resolution patch.

  • (Operation 3) Reconstruction

    This convolutional layer in Fig. 7 produces the final high-resolution image by applying the following operation:

    $$ F(Y)=W_{3} * F_{2} (Y) + B_{3} $$
    (3)

    where \(W_{3}\) is a matrix of \(n_{2} \times f_{3} \times f_{3} \times c\) dimensions, and \(B_{3}\) is a c-D vector.
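Purely for illustration, the following PyTorch-style sketch shows the three-layer mapping F of (1)–(3). The filter sizes and feature counts (f1 = 9, n1 = 64, f2 = 1, n2 = 32, f3 = 5) follow common SRCNN settings reported in [4] and are assumptions rather than necessarily the exact values used here; our implementation is based on Caffe [11].

```python
import torch.nn as nn

# Three-layer SRCNN mapping F, following Eqs. (1)-(3).
class SRCNN(nn.Module):
    def __init__(self, channels=1):
        super().__init__()
        self.patch_extraction = nn.Conv2d(channels, 64, kernel_size=9, padding=4)   # Eq. (1)
        self.nonlinear_mapping = nn.Conv2d(64, 32, kernel_size=1)                   # Eq. (2)
        self.reconstruction = nn.Conv2d(32, channels, kernel_size=5, padding=2)     # Eq. (3)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, y):                 # y: bicubically upscaled low-resolution image
        f1 = self.relu(self.patch_extraction(y))
        f2 = self.relu(self.nonlinear_mapping(f1))
        return self.reconstruction(f2)    # no activation on the final layer
```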

Fig. 7
figure 7

Block diagram of the super-resolution process using SRCNN. Here f1, f2, f3 are the filter matrices of the three layers. “×” denotes the convolution operation

3.2.2 Training the model weights with the acquired IRT images

The model weights \(W_{1}, W_{2}, W_{3}\) in (1)–(3) are calculated by applying the standard stochastic gradient descent (SGD) algorithm, i.e. back-propagation through the three-layer CNN. A training set of 100 IRT images is randomly selected from our created IRT database. The following steps are performed to obtain the model weights (a training-loop sketch is given after the list).

  1. (i)

    The ground truth images are prepared as \(32 \times 32\)-pixel sub-images randomly cropped from the training set.

  2. (ii)

    Low-resolution images are pre-processed using bicubic interpolation.

  3. (iii)

    The initial filter weights of each layer are generated by drawing randomly from a Gaussian distribution with zero mean and standard deviation 0.001. The learning rates are 0.0001 for the first two layers and 0.00001 for the last layer.
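A minimal sketch of this training procedure is given below, reusing the SRCNN class from the earlier sketch. The data loader supplying bicubically upscaled 32 × 32 sub-image pairs is a placeholder, and the PyTorch-style code is illustrative rather than the Caffe-based implementation actually used.

```python
import torch
import torch.nn.functional as F

model = SRCNN(channels=1)                        # class from the earlier sketch
for m in model.modules():                        # step (iii): Gaussian initialisation
    if isinstance(m, torch.nn.Conv2d):
        torch.nn.init.normal_(m.weight, mean=0.0, std=0.001)
        torch.nn.init.zeros_(m.bias)

optimiser = torch.optim.SGD([                    # per-layer learning rates from step (iii)
    {"params": model.patch_extraction.parameters(), "lr": 1e-4},
    {"params": model.nonlinear_mapping.parameters(), "lr": 1e-4},
    {"params": model.reconstruction.parameters(), "lr": 1e-5},
], momentum=0.9)

training_batches = []   # placeholder: loader of (low_res, ground_truth) 32x32 sub-image pairs
for low_res, ground_truth in training_batches:
    loss = F.mse_loss(model(low_res), ground_truth)   # pixel-wise reconstruction loss
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```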

3.2.3 Applying the obtained model weights in SRCNN

In order to adapt the model to our IRT data and enhance the images more effectively, we integrate the trained model weights into the SRCNN model, replacing the original default weights. The collected IRT images can thus be enhanced appropriately for the acquisition environment and modality properties in practice.
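As a usage example, a super-resolved frame could then be obtained as in the sketch below (again reusing the SRCNN class from the earlier sketch); the weight file name and the ×2 upscaling factor are placeholders.

```python
import torch
import torch.nn.functional as F

model = SRCNN(channels=1)                                     # class from the earlier sketch
model.load_state_dict(torch.load("srcnn_irt_weights.pth"))    # placeholder weight file
model.eval()

def super_resolve(frame, scale=2):
    """frame: 2-D numpy array of a raw IRT image; returns the enhanced image."""
    y = torch.from_numpy(frame).float()[None, None]           # HxW -> 1x1xHxW
    y = F.interpolate(y, scale_factor=scale, mode="bicubic", align_corners=False)
    with torch.no_grad():
        return model(y).squeeze().numpy()
```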

3.3 Object detection and recognition using Faster-RCNN

Faster-RCNN [20], a recently developed object detection system, integrates traditional region proposals and an object detector/classifier into one CNN. It is composed of two main components. The first is the region proposal network (RPN), a fully convolutional network that produces region proposals likely to contain objects. The second is Fast RCNN [9], which uses the proposed regions to perform classification and make a final decision on the existence of those objects. Our ATD/R system employs Faster-RCNN to carry out object detection and recognition, as shown in the green block of Fig. 5. The following subsections introduce how Faster-RCNN works.

3.3.1 RPN for generating region proposals

This region proposal network is constructed as a fully convolutional network (FCN) [15] that produces region bounds and objectness scores simultaneously at each location. As shown in Fig. 8, the RPN architecture is composed of an \(n \times n\) convolutional layer (L1, with n = 3 used here) and two sibling \(1 \times 1\) convolutional layers for box regression (reg) and box classification (cls), respectively; a minimal sketch of this head is given after the list below.

  • (L1 layer): an \(n \times n\) spatial window of the feature map of the last shared convolutional layer is input into this layer for generating region proposals. At each sliding position, \(k (= 9)\) region proposals (the green part in Fig. 8) are created by using 3 scales and 3 aspect ratios. These k proposals are mapped to a feature vector that is fed into the reg layer and cls layer.

  • (reg layer): this layer performs the regression process for the input feature vector in terms of each sliding position. The outputs are the coordinates of k region proposals.

  • (cls layer): this layer estimates the object probability for each region proposal. The outputs are the scores of each proposal.
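The sketch below illustrates this RPN head: the shared \(3 \times 3\) convolution (L1) followed by the sibling \(1 \times 1\) reg and cls convolutions with k = 9 anchors per position. The channel width of 512 follows the common VGG16-based configuration and is an assumption here.

```python
import torch.nn as nn

# RPN head: shared 3x3 convolution, then 1x1 box-regression and objectness layers.
class RPNHead(nn.Module):
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.l1 = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)   # L1 layer
        self.reg = nn.Conv2d(512, 4 * k, kernel_size=1)   # 4 coordinates per proposal
        self.cls = nn.Conv2d(512, 2 * k, kernel_size=1)   # object / background scores
        self.relu = nn.ReLU(inplace=True)

    def forward(self, feature_map):       # feature map of the last shared conv layer
        x = self.relu(self.l1(feature_map))
        return self.reg(x), self.cls(x)
```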

Fig. 8
figure 8

Region Proposal Network (RPN)

3.3.2 Fast RCNN object detection network

Fast RCNN is an improved version of RCNN [10] for accelerating the detection process. It uses bounding box proposal methods [23] to create bounding boxes. Then, Region of Interest (RoI) pooling is applied to generate a feature vector for each bounding box. Afterwards, the feature vector is input to a 2-layer regression network and a classification network for fine-tuning the bounding boxes and obtaining class scores. Finally, non-maximum suppression (NMS) is applied over all boxes to eliminate the redundant bounding boxes.
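The final NMS step can be summarised by the standard greedy procedure sketched below; boxes are assumed to be given as (x1, y1, x2, y2) arrays, and the IoU threshold is illustrative.

```python
import numpy as np

# Greedy non-maximum suppression: keep the highest-scoring box, discard boxes
# that overlap it beyond the IoU threshold, and repeat.
def nms(boxes, scores, iou_threshold=0.3):
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_threshold]   # drop boxes overlapping the kept one
    return keep
```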

3.3.3 Model training

For creating region proposals, the RPN is trained end-to-end by back-propagation and stochastic gradient descent (SGD). For recognising objects, Fast RCNN is adopted and can be trained independently. In Faster RCNN, a unified network is learnt from the RPN and Fast RCNN by sharing convolutional layers (see the green block in Fig. 5). This is implemented by a pragmatic 4-step training algorithm as follows.

  1. (i)

    The RPN is trained as above, initialised with an ImageNet-pre-trained model that was primarily trained on visible-band imagery.

  2. (ii)

    A detection network is trained within Fast RCNN using the proposals generated by the trained RPN; this network is also initialised with the ImageNet-pre-trained model.

  3. (iii)

    RPN training is then re-initialised from the detection network, keeping the shared convolutional layers fixed and fine-tuning only the layers unique to the RPN.

  4. (iv)

    Keeping the shared convolutional layers fixed, the layers unique to Fast RCNN are fine-tuned.

Thus, a unified network is formed because the same convolutional layers are shared by both the RPN and the detection network.

4 Experimental results

The developed ATD/R system is evaluated on the collected \(T_{1}\) and \(T_{2}\) datasets. The evaluation results are presented in both qualitative and quantitative terms. To demonstrate the performance of our ATD/R, we implemented the HoG-SVM method, which adopts a HoG-based sliding-window detector to localise objects in images, an SVM classifier trained for classification, and a non-maximum suppression (NMS) algorithm to eliminate redundant detections. As discussed in Section 1, HoG-SVM is incapable of dealing with low-contrast imagery such as the \(T_{2}\) data, so the comparison is only presented for the 438 \(T_{1}\) test images. Because HoG is sensitive to image contrast, and in order to show the performance of HoG-SVM fairly, we consider that it has detected and recognised a vehicle when its detection has at least 10% overlap with the ground truth. In addition, the same training set as used in our ATD/R is employed to train the HoG model.
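The sketch below shows one way to apply this 10% overlap criterion. Whether “overlap” is measured as intersection over the ground-truth area or as intersection-over-union is an assumption here; the fraction-of-ground-truth interpretation is shown.

```python
# Score a HoG-SVM detection as correct if it covers at least 10% of the ground truth.
def overlap_fraction(det, gt):
    """det, gt: (x1, y1, x2, y2) boxes; returns intersection area / ground-truth area."""
    ix = max(0, min(det[2], gt[2]) - max(det[0], gt[0]))
    iy = max(0, min(det[3], gt[3]) - max(det[1], gt[1]))
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return (ix * iy) / gt_area if gt_area > 0 else 0.0

def is_correct(det, gt, threshold=0.10):
    return overlap_fraction(det, gt) >= threshold
```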

The following two models have been generated from the training datasets:

  1. (i)

    Raw model – the original training dataset is used.

  2. (ii)

    Superresolution model – the superresolution training dataset is used.

For running our ATD/R system and HoG-SVM, our chosen PC configuration was an HP Pavilion 550 with an Intel i7-6700 processor and an Nvidia Quadro K4200 GPU.

4.1 Training set and test set

The ground truth data are created manually by annotating the datasets of the various vehicles at 100 m, 200 m and 400 m distances. To obtain the training and test datasets, the ground truth data (3780 images in total) are segregated into subsets by class and camera distance, such as Bus200m. For each subset, a random 80–20 split into training and test data is made. All training and test subsets are then aggregated separately to generate the training set (3025 images; \(T_{1}\): 1758, \(T_{2}\): 1266) and the test set (755 images; \(T_{1}\): 438, \(T_{2}\): 317). Tables 1 and 2 present the relevant statistics on the target sizes in pixels in our collected datasets. In addition, the 300 m, 500 m and 600 m datasets are reserved for testing and are not involved in the training process. Note that our ATD/R system is initialised with a pre-trained VGG16 model, which was trained on the PASCAL VOC 2007 detection benchmark [5].
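A sketch of this per-subset split and aggregation is given below; the `subsets` dictionary of image lists (keyed by names such as "Bus200m") is a placeholder for the annotated data.

```python
import random

# Split each class-distance subset 80-20 at random, then merge the per-subset splits.
def build_splits(subsets, train_ratio=0.8, seed=0):
    rng = random.Random(seed)
    train_set, test_set = [], []
    for name, images in subsets.items():      # only the 100 m, 200 m and 400 m subsets
        images = list(images)
        rng.shuffle(images)
        cut = int(train_ratio * len(images))
        train_set += images[:cut]
        test_set += images[cut:]
    return train_set, test_set
```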

Table 1 ROI size statistics (pixels) for person targets in the collected data
Table 2 ROI size statistics (pixels) for vehicle targets in the collected data

4.2 Performance results

We employ receiver operating characteristic (ROC) curves [6] to analyse the performance of our object detection and recognition system. The ROCs in Fig. 9 illustrate that both the raw and superresolution models of our ATD/R perform much better than HoG-SVM on the \(T_{1}\) test images. The area under the curve (AUC) is 0.70 for HoG-SVM (green) in Fig. 9, whereas the AUCs obtained for our two models are 0.92 and 0.96, respectively. Figure 9 also shows that the True Positive Rate (TPR) approaches 100% when the False Positive Rate (FPR) exceeds 15% for the superresolution model (cyan) and 45% for the raw model (blue). For HoG-SVM, in contrast, the FPR is at least 95% when the TPR reaches 100%.
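For reference, ROC and AUC values of this kind can be computed from per-detection confidences with a few lines of scikit-learn; the `labels` and `scores` arrays below are placeholders, not our actual evaluation outputs.

```python
from sklearn.metrics import roc_curve, auc

labels = [1, 1, 0, 1, 0]              # placeholder ground truth (1 = target, 0 = background)
scores = [0.9, 0.7, 0.4, 0.6, 0.2]    # placeholder detection confidences
fpr, tpr, _ = roc_curve(labels, scores)
print("AUC = %.2f" % auc(fpr, tpr))
```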

Fig. 9
figure 9

ROCs for Trial 1 test data generated from the HoG-SVM method and our ATD/R

Figure 10 presents ROC curves for our blind test data acquired at 300 m (upper row) and 500 m (lower row) in \(T_{1}\). Data acquired at these distances were not involved in the training process, so the corresponding performance truly reflects the ability of our ATD/R to recognise objects. Perfect AUC values are obtained for all cases at 300 m, meaning that almost all vehicles are correctly recognised with negligible FPR. Our approach also achieves comparable results at 500 m, for instance for Skoda and particularly for Van (Fig. 10f), even though the training dataset only covers 100 m to 400 m. It is worth noting that the superresolution model improves the performance significantly compared with the raw model in the 500 m cases.

Fig. 10
figure 10

ROCs for the 300 m (a, b, c) and 500 m (d, e, f) datasets of Bus, Skoda and Van in Trial 1. Each graph features two ROCs corresponding to raw model and super-resolution model-derived results

For comparison, we apply our ATD/R to the blind datasets in \(T_{2}\); Fig. 11 illustrates the obtained ROCs. Because the weather and seasonal conditions led to low contrast in the recorded imagery, the overall performance is inferior to that in \(T_{1}\). Figure 11a, d show that our method can recognise the Landrover well at both 300 m and 500 m. For the Truck, Fig. 11b, e demonstrate that superresolution helps increase the AUC value from 87% to 94% at 300 m and from 82% to 90% at 500 m. It is worth noting that the AUC value obtained for the Saloon remains around 91% at both 300 m and 500 m.

Fig. 11
figure 11

ROCs for the 300 m (a, b, c) and 500 m (d, e, f) datasets of Truck, Saloon and Landrover in Trial 2. Each graph features two ROCs corresponding to raw model and super-resolution model-derived results

Regarding the datasets acquired at 600 m, the signatures of objects in the images are extremely small: the smallest size is \(8\times 12\) pixels for vehicles and \(8\times 5\) pixels for pedestrians. Figure 12 illustrates that our ATD/R copes well with the difficulties of low resolution and poor contrast. In particular, notably high performance is achieved in the cases of Van, Saloon and Landrover ((c), (e), (f)) when applying the superresolution model.

Fig. 12
figure 12

ROCs for the 600 m datasets of Bus, Skoda, Van, Truck, Saloon and Landrover. Each graph features two ROCs corresponding to raw model and super-resolution model-derived results

To examine the overall performance of our ATD/R across \(T_{1}\) and \(T_{2}\) as a whole, we apply the two generated models to the entire test set (1266 + 317 = 1583 images). In addition, for a more accurate study of the role of superresolution in the overall ATD/R chain, we also applied superresolution to the test data, to see whether the enhanced test data further improve the performance of the final ATD/R. Figure 13 illustrates that the superresolution model performs better than the raw model, regardless of whether the input data are super-resolved. However, the raw model applied to the super-resolved dataset does not provide better performance than with the raw data input.

Fig. 13
figure 13

ROCs for the results using different models in the test data

4.3 Difficult cases

This subsection presents qualitative and quantitative results on the performance of our ATD/R system in some difficult cases, in order to further illustrate its application.

Table 3 gives the detection confidences for the objects shown in Fig. 1 and provides some insight into the method. Before applying the proposed image enhancement, the top three detection probabilities are: Bus, 0.311699; Skoda, 0.875782; and Van, 0.0154172. On these results, the target would be classified as a Skoda since it has the highest detection confidence; however, the ground truth shows that the target is actually a Bus. This error is corrected by adopting the proposed methodology: the second row in Table 3 shows that, after image enhancement, the detection probability of Bus is the highest of the three. Therefore, the ATD/R system can correctly recognise a target that was wrongly interpreted in the absence of image enhancement. Another point worth noting is that, for the Skoda in Fig. 1b, the detection confidence increases from 0.205708 to 0.483660 after data enhancement (Table 3).

Table 3 The recognition results of Fig. 1 using raw training dataset and enhanced training dataset, respectively

Figure 14 shows the detection performance when the raw model is used to detect objects for the cases of Bus 500 m (a1) and Van 300 m (b1). It is worth noting that the Bus is not detected in (a1) and the pedestrian is missed in (b1). The superresolution model, however, detects all objects in both cases, as shown in (a2) and (b2).

Fig. 14
figure 14

a1a2 Results with the raw-model and the superresolution-model for Bus500m respectively, b1b2 Result with the raw-model and the superresolution-model for Van300 m respectively

Concerning cases with extremely small signatures, three examples from the 600 m distance (Bus, Van with pedestrian, and Saloon) are presented in Fig. 15. The results using the raw model are shown in Fig. 15a1, b1, c1: both the Bus and the Saloon are wrongly detected in (a1) and (c1), and the pedestrian is not detected in (b1). Figure 15a2, b2, c2 show the results when the superresolution model is applied to the same images; the relevant objects are now correctly recognised.

Fig. 15
figure 15

Examples showing the performance of the superresolution-model. Upper row: the results for Skoda600m, Van600m, Saloon600 m with the raw-model. Lower row: the results for the same images with the superresolution-model

Fig. 16
figure 16

Examples for Truck600m: a Result with the raw-model, b Result with the superresolution-model

Figure 16 illustrates an example of a pickup truck at 600 m distance in \(T_{2}\). The detection confidence with the superresolution model is 0.896, whereas it is 0.856 with the raw model. This visually demonstrates that the superresolution model can improve the performance compared with the raw model.

4.4 Discussion

Due to the discrepancy in image quality between the two trials, Figs. 10 and 11 show that our ATD/R performs much better on \(T_{1}\) than on \(T_{2}\). Nevertheless, Fig. 9 demonstrates that our ATD/R deals with the issue of low contrast much better than HoG-SVM.

The benefit of superresolution for ATD/R in long-distance cases is generally reduced, since the feature information that can be extracted is limited. Figure 12a, b, d, e indicate that the superresolution model performs similarly to the raw model at the 600 m range. However, in Fig. 12c, f, we can see that the superresolution technique improves the performance significantly. A plausible reason is that the vehicle signatures in the Van and Landrover cases have larger dimensions than those of the other vehicles.

As shown by the presented ROCs, although the overall detection and recognition performance is improved, the small size of objects at distant views can still cause some false positives. Because of the limited number of pixels on the small object signature, the enhancement process may generate inaccurate feature information, leading to false positives. For example, a Bus target is initially detected correctly as a Bus with a confidence of 0.729734 by the raw model, but is wrongly detected as a Skoda with a confidence of 0.772193 after the enhancement processing. This issue may be mitigated in future work by refining the training set so that the ATD/R system can obtain more accurate feature information for recognition. In our system, such false positives introduced by the enhancement process are rare and considerably outweighed by the significant performance gain; indeed, the ROCs show that the true positive rates are greatly improved after the enhancement process.

The use of Faster-RCNN is justified in our work by the difficulty of annotating objects with very small signatures at long distances. The proposed regions in the feature space allow the system to be trained at shorter distances where annotation is feasible (100–400 m) and then deployed on small signatures (600 m) with high accuracy. This is possible because the RPN applies a multi-scale method with sliding windows associated with several scales and aspect ratios. Among CNN-based object recognition techniques, several recently proposed methods have shown good performance in both accuracy and speed; in particular, the single shot multibox detector (SSD) [19] and YOLO [14] can be highlighted. YOLO's model training is based on the entire image rather than on the region proposal network (RPN) used in Faster-RCNN, so its loss function treats all bounding boxes equally. This leads to YOLO underperforming on small objects when accurate annotations cannot be provided (as in our case at 600 m). SSD employs multi-scale features and data augmentation to enhance detection accuracy, and uses filter banks to subsample data for faster computation; however, SSD remains insensitive to smaller objects despite its class-aware design and additional refinements. Therefore, Faster-RCNN is better suited to our IRT images, where very small objects are difficult to annotate and detect accurately.

5 Conclusions

This paper presents an ATD/R system that is able to deal with the main difficulties in IRT video surveillance. First, we propose a CNN-based super-resolution method to improve IRT images, especially in long-distance cases where the small target signature can hinder the detection/recognition process. Then, a state-of-the-art object detection method, Faster RCNN, is employed to carry out object detection and recognition. We integrate these two approaches into one system to produce a robust and accurate surveillance system. Evaluation results show that the developed ATD/R system efficiently deals with the obstacles in IRT video, validating the surveillance system in practice. Valuable extensions of this research include developing advanced super-resolution methods, incorporating appropriate denoising techniques and geometrical feature extraction such as [26], and integrating the methods into a fully deployable system.