
1 Introduction

Spectral imaging sensors aim to obtain the spectrum of the scene in terms of tens or hundreds of bandwise images, each corresponding to a certain narrow wavelength range. The spectral image obtained in this way plays an important role in many fields such as remote sensing [16], medical diagnosis [17], and agriculture [18], also including various computer vision tasks such as image segmentation [43], face recognition [34], and object tracking [44]. Therefore, the development of spectral imaging systems has been an active research field in recent years [7, 15, 29, 47, 50, 51].

However, to capture 3D spectral images with existing 2D sensors, trade-offs between spectral and spatial/temporal resolutions are inevitable [3, 14]. Conventional imaging spectrometers usually operate in a scanning manner, which simply trades the temporal resolution for the spectral resolution [5], either scanning the space [3, 35] or scanning the spectrum [14, 37]. On the other hand, to enable spectral acquisition for dynamic scenes, snapshot spectral imagers relying on computational reconstruction have been developed in the last decade. The representative techniques along this line include computed tomographic imaging spectrometry (CTIS) [10], prism-mask spectral video imaging system (PMVIS) [7], and coded aperture snapshot spectral imager (CASSI) [47]. These snapshot spectral imagers support spectral video acquisition, yet at the cost of sacrificing the spatial resolution or reconstruction fidelity. To address this issue, dual-camera systems incorporating a snapshot spectral imager and an RGB/panchromatic camera with a beam splitter are proposed [30, 48, 49], which combine the high spectral resolution of the former and the high spatial resolution of the latter. Still, these systems are of high cost in terms of hardware implementation.

As an alternative solution, spatial super-resolution (SR) of spectral images has also attracted tremendous research efforts, where only a low-resolution (LR) spectral image is required as input and the high-resolution (HR) spectral image is obtained without the need for additional hardware. Similar to single-image SR, mainstream solutions for this ill-posed problem are example-based ones. As a representative, the sparse coding based approaches learn the relationship between LR spectral cubes and the corresponding HR counterparts as a dictionary from an external database [1, 2, 13], where the HR spectral images can be obtained through scanning-based spectrometers. With the success of deep learning in single-image SR [11, 24], CNN-based methods have rapidly emerged that directly learn an end-to-end mapping between LR and HR spectral images [27, 28, 31, 54], achieving promising performance with higher efficiency than sparse coding. Following this trend, we propose an advanced network architecture for spectral image SR, which integrates the powerful channel attention mechanism [20] with the classic residual blocks [19]. Consequently, the correlations across both the spectral and spatial dimensions are fully exploited and the performance of spectral image SR is significantly improved over plain residual networks.

In practical applications where spectral images are captured, it is often possible to capture an RGB/panchromatic image of the same scene at the same time with a much higher spatial resolution. Different from the above-mentioned systems that use a beam splitter [30, 48], the resulting spectral and RGB/panchromatic measurements are not aligned but rather in a stereo configuration, which eases the hardware implementation. This additional HR RGB/panchromatic image can be of great help for enhancing the LR spectral image, which is usually referred to as color-guided spectral image SR [23, 25, 53] or pan-sharpening [26, 36, 38] and has been investigated extensively in the literature. Based on the proposed network, we then design a fusion framework for color-guided spectral image SR. The spatial resolution of the spectral input is enhanced in one branch, and the spectral resolution of the RGB input is enhanced in the other. These two branches are then fused together through the attention mechanism again to reconstruct the final HR spectral image, which achieves further improvement compared to using the single LR spectral input.

The main contributions of this paper can be summarized as follows:

  • An advanced network architecture for spectral image SR, which integrates the powerful channel attention mechanism with the classic residual blocks.

  • A fusion framework for color-guided spectral image SR based on the proposed network, which is designed for the scenario when stereo pairs of LR spectral and HR RGB measurements are available.

  • State-of-the-art results on the above two tasks, and one of the winning solutions in the PIRM 2018 Spectral Image SR Challenge [40, 41].

2 Related Work

Spectral Image SR. Different from reconstructing a spectral image from the corresponding RGB image [39, 52] (sometimes also referred to as spectral SR), spectral image SR here refers to the spatial resolution enhancement of a spectral image. Similar to single-image SR that enhances the spatial resolution of a 2D image, mainstream solutions for this ill-posed problem are example-based ones. As a representative, the sparse coding based approaches learn the relationship between LR spectral cubes and the corresponding HR counterparts as a dictionary from an external database [1, 2, 13], where the HR spectral images can be obtained through scanning-based spectrometers. With the success of deep learning in single-image SR [11, 24], CNN-based methods have rapidly emerged that directly learn an end-to-end mapping between LR and HR spectral images [27, 28, 31, 54], achieving promising results with higher efficiency than sparse coding. With more advanced network architectures such as the one proposed in this paper, the performance of spectral image SR is expected to be further improved.

Color-Guided Spectral Image SR. Approaches that fuse an LR spectral image with an HR RGB image based on matrix factorization have been actively investigated [23, 25, 53]. These approaches first unfold the spectral image into a matrix, and then decompose the matrix into a spectral basis and corresponding coefficients. Spectral image SR thus becomes the estimation of the spectral basis and coefficients from the LR spectral and HR RGB measurements of the same scene. In addition to the spectral information, some approaches also exploit the spatial structures of spectral images for SR [12, 42, 46]. However, since these matrix factorization based methods operate on unfolded 3D data, it is hard for them to fully exploit the inherent spatial-spectral correlations in spectral images. These correlations can be better exploited using a CNN-based fusion framework, as demonstrated in this paper.

Attention Mechanism. Attention can be viewed as an adaptation mechanism which automatically allocates processing resources towards the most informative components of an input [45]. Following its successful application in neural machine translation [45], an increasing number of methods have been proposed to incorporate attention into deep networks [4, 6, 22, 32, 55]. Inspired by [55], we integrate the channel attention mechanism [20] with the residual block [19], which promotes the reconstruction fidelity of spectral image SR.

3 Spectral Image SR Network

3.1 Baseline VDSR-3D Network

A 3D spectral image can be viewed as a stack of 2D bandwise grayscale images. Hence, classic single-image SR networks can be directly applied to the spectral image SR task. Taking the well-known VDSR [24] network as an example, it can use each LR bandwise image as input and output the corresponding HR image, and the HR spectral image can then be obtained by stacking these reconstructed HR bandwise images. However, this simple reuse of VDSR is not efficient since it ignores the correlations across the spectral dimension. A natural extension would be taking the 3D LR spectral image as input of the network, in which the spectral correlations are learned automatically. This modified VDSR network, denoted as VDSR-3D, is regarded as the baseline model in this paper.
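A minimal sketch of such a baseline is given below, assuming the 14 bands are treated as input channels of 2D convolutions and the LR cube is bicubic-upsampled beforehand (3D convolutions would be an equally valid reading of "VDSR-3D"; the layer count and width are illustrative):

```python
import torch.nn as nn

class VDSR3D(nn.Module):
    """Illustrative baseline: maps a bicubic-upsampled 14-band cube to its HR version."""
    def __init__(self, bands=14, features=64, depth=20):
        super().__init__()
        layers = [nn.Conv2d(bands, features, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(features, features, 3, padding=1), nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(features, bands, 3, padding=1))
        self.body = nn.Sequential(*layers)

    def forward(self, x_bicubic):
        # global residual learning, as in the original VDSR
        return x_bicubic + self.body(x_bicubic)
```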

Although VDSR-3D can achieve better performance than its 2D version, as demonstrated later, there are still some problems in the VDSR architecture. First, it requires a bicubic-interpolated image as the input, which is a sub-optimal solution since a deconvolutional layer can easily learn a better upsampling operation than bicubic interpolation through back-propagation. Moreover, according to the experimental results, it fails to achieve improved performance when the network structure is deepened, which is not favorable for pursuing more accurate solutions. To address these problems, we propose the deep residual attention network, which greatly improves the capability of VDSR-3D.

Fig. 1. Deep residual attention network for spectral image SR

3.2 Deep Residual Attention Network

The architecture of our proposed deep residual attention network is shown in Fig. 1. The network mainly consists of three modules: feature extraction, feature mapping, and reconstruction. We use one deconvolutional layer and one convolutional layer to extract the feature as

$$\begin{aligned} \displaystyle F_{0} =f_{E} ( I_{LR} ), \end{aligned}$$
(1)

where \( f_{E} \) denotes the feature extraction module. Note that we set the stride of the deconvolution to the upsampling scale (i.e., \(\times 3\) in our experiments) to convert the LR spectral input \(I_{LR}\) to the HR space for succeeding procedures. The obtained feature \(F_{0}\) is then fed into the feature mapping module, which is implemented with a number of residual blocks.
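A possible realization of this feature extraction module is sketched below, assuming a \(\times 3\) scale and illustrative channel widths; the deconvolution stride equals the upsampling scale so that \(F_{0}\) already lives in the HR space:

```python
import torch.nn as nn

class FeatureExtraction(nn.Module):
    """f_E in Eq. (1): one deconvolution followed by one convolution (a sketch)."""
    def __init__(self, bands=14, features=64, scale=3):
        super().__init__()
        # stride = upsampling scale maps the LR cube onto the HR grid
        self.deconv = nn.ConvTranspose2d(bands, features, kernel_size=scale, stride=scale)
        self.conv = nn.Conv2d(features, features, kernel_size=3, padding=1)

    def forward(self, i_lr):
        return self.conv(self.deconv(i_lr))
```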

Inspired by [55], we integrate the channel attention mechanism [20] with the residual block [19] to generate the attention resblock, which is also shown in Fig. 1. Let \(F^{'}_{l}\) \((l=1,\dots ,L)\) denote the intermediate feature in the \(l^{th}\) attention resblock. The channel attention coefficient \(\delta _l\) can be calculated as

$$\begin{aligned} \begin{aligned} \delta _l&=Sig( C( P_G ( F^{'}_{l} ) ) ),\\ P_G ( F^{'}_{l, c})&=\dfrac{1}{H\times W}\sum ^{H}_{i=1}\sum ^{W}_{j=1} F^{'}_{l, c}( i,j), \end{aligned} \end{aligned}$$
(2)

where \(H\) and \(W\) denote the spatial resolution of the feature map, \(F^{'}_{l, c}(i,j)\) denotes the value at position \((i, j)\) in the \(c^{th}\) channel of the intermediate feature in the \(l^{th}\) attention resblock, \( P_G(\cdot )\) denotes the global average pooling function, \(C(\cdot )\) denotes the combining operation which includes two successive convolutional layers (with a \(1 \times 1\) kernel size) and a ReLU activation function [33] in between, and \(Sig(\cdot )\) denotes the sigmoid function for normalization.

With the channel attention coefficient \(\delta _l \) obtained, the output of the \(l^{th}\) attention resblock, \(F_{l}\), can then be calculated as

$$\begin{aligned} \displaystyle F_{l} =F_{l-1} +\delta _{l} \cdot F^{'}_{l}. \end{aligned}$$
(3)

With the proposed channel attention, the residual component in the attention resblock is adaptively weighted. Compared to the plain convolutions in VDSR-3D, the attention resblock offers higher reconstruction fidelity with a much deeper network. Moreover, we adopt a skip connection in each attention resblock to alleviate gradient vanishing and ease the convergence of the network. The super-resolved spectral image \(I_{SR}\) is finally reconstructed as

$$\begin{aligned} \displaystyle I_{SR} =f_R( F_{L} ), \end{aligned}$$
(4)

where \(f_R(\cdot )\) denotes the reconstruction module implemented using a single convolutional layer.
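A sketch of one attention resblock following Eqs. (2) and (3) is given below, assuming \(3 \times 3\) convolutions in the residual branch and an illustrative channel-reduction ratio inside the attention part (neither is specified above):

```python
import torch.nn as nn

class AttentionResBlock(nn.Module):
    """One attention resblock: residual branch weighted by channel attention (a sketch)."""
    def __init__(self, features=64, reduction=16):
        super().__init__()
        # residual branch producing the intermediate feature F'_l
        self.body = nn.Sequential(
            nn.Conv2d(features, features, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(features, features, 3, padding=1),
        )
        # channel attention: global average pooling, two 1x1 convs with ReLU in between, sigmoid
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # P_G in Eq. (2)
            nn.Conv2d(features, features // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(features // reduction, features, 1),
            nn.Sigmoid(),                                   # Sig in Eq. (2)
        )

    def forward(self, f_prev):
        f_mid = self.body(f_prev)       # F'_l
        delta = self.attention(f_mid)   # channel attention coefficient delta_l
        return f_prev + delta * f_mid   # Eq. (3)
```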

Fig. 2. Color-guided fusion framework for spectral image SR

3.3 Color-Guided Fusion Framework

In practical applications where spectral images are captured, it is often possible to capture an RGB image of the same scene at the same time with a much higher spatial resolution. However, the resulting spectral and RGB images are not aligned but rather in a stereo configuration. A registration step is needed to address the scene disparity and generate the aligned RGB image as an input for SR in addition to the spectral image. This aligned HR RGB image can provide more information in the spatial dimension to facilitate the spectral image SR.

To this end, we design a color-guided fusion framework with a two-branch architecture to implement the fusion between the spectral information in the spectral image and the spatial information in the RGB image. The fusion framework is shown in Fig. 2. The two branches share the same basic network structure, i.e., the deep residual attention network described in Sect. 3.2. Note that the upscaling operation is no longer needed in the RGB branch, while in the spectral branch it is moved to the tail of the network (due to the restriction of GPU memory in implementation). A concatenating operation is used to combine the features from these two branches. The reconstructed HR spectral image \(I_{SR}\) can be obtained as

$$\begin{aligned} I_{SR} =f_{FUS}( f_{SPE}( I_{LR}) \oplus f_{RGB}( I_{RGB})), \end{aligned}$$
(5)

where \(f_{RGB}(\cdot )\), \(f_{SPE}(\cdot )\), and \(f_{FUS}(\cdot )\) denote the RGB branch, the spectral branch, and the fusion part, respectively. The symbol \(\oplus \) denotes the concatenating operation between features output from these two branches. Note that another attention resblock is used in the fusion part.
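The fusion in Eq. (5) could be sketched as follows, reusing the attention resblock from Sect. 3.2 and assuming both branches output feature maps at the HR spatial resolution (the spectral branch performs its upscaling at the tail); the \(1 \times 1\) merging convolution and all widths are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ColorGuidedFusion(nn.Module):
    """Two-branch fusion (a sketch): f_FUS(f_SPE(I_LR) concatenated with f_RGB(I_RGB))."""
    def __init__(self, spectral_branch, rgb_branch, features=64, bands=14):
        super().__init__()
        self.spectral_branch = spectral_branch   # f_SPE: LR spectral cube -> HR-space features
        self.rgb_branch = rgb_branch             # f_RGB: HR RGB image -> features (no upscaling)
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * features, features, 1),      # merge the concatenated features
            AttentionResBlock(features),               # attention resblock in the fusion part
            nn.Conv2d(features, bands, 3, padding=1),  # reconstruct the HR spectral image
        )

    def forward(self, i_lr, i_rgb):
        f = torch.cat([self.spectral_branch(i_lr), self.rgb_branch(i_rgb)], dim=1)
        return self.fuse(f)
```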

3.4 Implementation Details

Training Process. We train the network of Track 1 (spectral image SR) using LR spectral cubes with a size of \(20\times 20\times 14\) and corresponding HR cubes with a size of \(60\times 60\times 14\). We set the batch size as 64 and adopt the ADAM optimizer with \(\beta _{1} = 0.9\), \(\beta _{2} = 0.999\), and \(\epsilon = 10^{-8}\). The initial learning rate is set as \(1\times 10^{-4}\) and is then halved every \(1\times 10^{5}\) iterations of back-propagation. We stop training when no notable decay of the training loss is observed. Based on the default setting of Track 1, we add the aligned HR RGB patches as inputs to train the network of Track 2 (color-guided spectral image SR), which have the same spatial size as the HR spectral cubes. We implement the proposed networks based on the PyTorch framework and train them using a 1080Ti GPU. It takes about 12 h to train each network.
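The optimizer and schedule described above could be set up as in the following sketch; the model, data loader (batch size 64), and loss function are passed in as placeholders:

```python
import torch

def train(model, train_loader, criterion, iters_per_decay=100000):
    """Training loop sketch: ADAM with the hyper-parameters listed above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                                 betas=(0.9, 0.999), eps=1e-8)
    # halve the learning rate every 1e5 iterations of back-propagation
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                                step_size=iters_per_decay, gamma=0.5)
    for lr_cube, hr_cube in train_loader:   # batches of 20x20x14 LR / 60x60x14 HR patches
        optimizer.zero_grad()
        loss = criterion(model(lr_cube), hr_cube)
        loss.backward()
        optimizer.step()
        scheduler.step()
```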

Testing Process. For Track 1, there is no difference between the training and testing processes. Yet for Track 2, we conduct a post-processing operation to further promote the performance. Since the aligned RGB images are obtained using FlowNet 2.0 [21] according to the challenge configuration, black pixels are usually observed around the borders due to imperfect registration. To address this issue, we first crop 12 pixels from each of the four borders and then fill them in using the network of Track 1. In other words, the center part of the reconstructed image is obtained using the color-guided fusion framework, while the boundary pixels are reconstructed only from the LR spectral image.
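This composition of the two outputs could look like the following hypothetical helper, assuming both networks produce HR cubes of the same size and a 12-pixel margin:

```python
def merge_outputs(sr_fusion, sr_spectral_only, margin=12):
    """Center from the Track 2 fusion network, borders from the Track 1 network (a sketch)."""
    out = sr_spectral_only.clone()
    out[..., margin:-margin, margin:-margin] = sr_fusion[..., margin:-margin, margin:-margin]
    return out
```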

Loss Function. CNN-based methods for single-image SR usually adopt the mean square error (MSE) as the loss function during training [9, 11, 24], which has also been applied to spectral reconstruction tasks [52]. However, the luminance level in spectral images usually varies significantly among different bands, and the same deviation in pixel value may have a different influence on bands with different luminance levels. The MSE loss is thus biased towards the bands with high luminance levels, which is not desired because each band matters equally. Hence, recent methods adopt the mean relative absolute error (MRAE) as a substitute [39].

Let \(I^{( i)}_{SR}\) and \(I^{( i)}_{GT}\) denote the \(i^{th}\) \((i=1,\dots ,N)\) pixel of the reconstructed and groundtruth spectral images, respectively. The MRAE is formulated as

$$\begin{aligned} MRAE=\frac{1}{N}\sum ^{N}_{i=1}\left( \mid I^{( i)}_{SR} -I^{( i)}_{GT} \mid /I^{( i)}_{GT}\right) . \end{aligned}$$
(6)

When there exist zero-valued pixels in the groundtruth image, the MRAE loss becomes infinite, making the network fail to converge. To address this issue, a small value \(\epsilon _1\) is added to the denominator of the MRAE. Considering that the intensity range of the spectral data is from 0 to 65535, \(\epsilon _1\) is set to 1 according to the challenge configuration, yielding the modified MRAE as

$$\begin{aligned} MRAE'=\frac{1}{N}\sum ^{N}_{i=1}\left( \mid O(I^{( i)}_{SR}) -I^{( i)}_{GT} \mid /(I^{( i)}_{GT}+\epsilon _1)\right) , \end{aligned}$$
(7)

where the operation \(O(\cdot )\) rounds values smaller than 1 down to 0, which further decreases the reconstruction error caused by the zero-valued pixels.
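A possible PyTorch realization of Eq. (7) is given below, assuming tensors in the original 0–65535 intensity range, with the rounding \(O(\cdot )\) applied to the reconstructed values:

```python
import torch

def mrae_loss(sr, gt, eps1=1.0):
    sr = torch.where(sr < 1.0, torch.zeros_like(sr), sr)  # O(.): round values below 1 to 0
    return torch.mean(torch.abs(sr - gt) / (gt + eps1))
```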

To investigate the impact of different loss functions, we also implement the spectral information divergence (SID) function which is designed to evaluate the spectral similarity and discriminability [8]. The SID is formulated as

$$\begin{aligned} SID=\frac{1}{N}\sum ^{N}_{i=1}\left( I^{( i)}_{GT}\log \frac{I^{( i)}_{GT}+\epsilon _2}{I^{( i)}_{SR}+\epsilon _2} +I^{( i)}_{SR}\log \frac{I^{( i)}_{SR}+\epsilon _2}{I^{( i)}_{GT}+\epsilon _2}\right) , \end{aligned}$$
(8)

where \(\epsilon _2\) is set to \(1\times 10^{-3}\) according to the challenge configuration to avoid the infinite values caused by zero-valued pixels.
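Under the symmetric relative-entropy reading of SID assumed in Eq. (8), a corresponding loss sketch would be:

```python
import torch

def sid_loss(sr, gt, eps2=1e-3):
    p = sr + eps2   # eps2 keeps the logarithms finite at zero-valued pixels
    q = gt + eps2
    return torch.mean(q * torch.log(q / p) + p * torch.log(p / q))
```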

4 Experimental Results

4.1 Dataset and Evaluation

The experiments are conducted strictly following the instructions of the PIRM 2018 Spectral Image SR Challenge [40, 41]. There are two tracks in this challenge. Track 1 requires upscaling an LR spectral image (with a size of \(80\times 160\times 14\)) by a factor of 3 in the spatial dimension. The dataset for this track consists of 240 spectral images captured by an IMEC 16-band snapshot spectral imager, which is split into 200 for training, 20 for validation, and 20 for testing. Track 2 requires accomplishing the same task with the help of corresponding HR RGB images (with a size of \(240\times 480\)). The dataset contains 120 pairs of stereo images, with one view captured by the IMEC 16-band snapshot spectral imager and the other by an ordinary color camera. The images are split into 100 pairs for training, 10 for validation, and 10 for testing. Also, the dataset provides a registered version of the RGB images using FlowNet 2.0 [21].

In the development phase, data augmentation is performed on the training images, which are randomly rotated by \(90^{\circ }\), \(180^{\circ }\), or \(270^{\circ }\) and flipped horizontally. We reserve 10 pairs of images as our own validation set. Our reported results are all calculated on the official validation set. No additional preprocessing or dataset is needed for either track. The main ranking metrics of this challenge are MRAE, SID, and the mean opinion score (MOS). Note that the challenge utilizes the normalized SID as the final metric, which may differ from the one used in this paper.
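The rotation/flip augmentation could be implemented as in the following sketch, operating on H×W×C arrays; paired spectral and RGB patches would receive the same random transform:

```python
import random
import numpy as np

def augment(cube):
    k = random.randint(0, 3)                  # rotate by 0, 90, 180, or 270 degrees
    cube = np.rot90(cube, k, axes=(0, 1))
    if random.random() < 0.5:                 # random horizontal flip
        cube = cube[:, ::-1, :]
    return np.ascontiguousarray(cube)
```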

4.2 Ablation Experiments

Loss Function. To investigate the impact of different loss functions, we conduct a series of experiments. The experimental results are shown in Table 1. We first compare the results of using MRAE and SID alone as the loss function. There seems to be a trade-off between these two metrics: as one decreases, the other increases. To verify this, we further design a combined loss function where the weight of SID is ten times that of MRAE, which gives a simple joint optimization according to cross-validation. The result shows that the SID improves notably (i.e., by \(13\%\)) at the cost of a slight sacrifice in MRAE (i.e., \(2\%\)). We also use the SID as the loss function to fine-tune the model pretrained with MRAE. During the training process, the SID decreases but the MRAE increases at the same time, which again demonstrates the trade-off between SID and MRAE. Finally, it converges to a point similar to that achieved by using the SID as the loss function alone. In the experiments below and in the challenge, we use the MRAE alone as the loss function since it is the primary evaluation metric.

Table 1. Comparisons between different loss function settings. The MRAE and SID are adopted as the metrics
Fig. 3. The influence of depth on the network performance

Table 2. Comparisons between our proposed methods and baseline models. The validation sets of the challenge are adopted for evaluation. Bold indicates the best performance and italic indicates the second best one
Fig. 4. Visual comparison of a spectral image from the validation set of Track 2. All 14 bands are averaged for the evaluation of spatial information

Network Depth. The depth is an important factor determining the basic capacity of the network. However, a deeper network does not always yield better performance. As depicted in Fig. 3, 45 attention resblocks appear to be the optimal depth under both the SID and MRAE metrics. Note that when the block number goes up to 57, gradient explosion occurs in the training process. To overcome this, the initial learning rate is lowered to 1/5 of the original value, which may account for the performance decrease of the deeper structure. Also, we find that the initialization of the network can slightly affect the performance, but the conclusion holds. The other settings of the network are kept the same to eliminate the influence of other hyper-parameters. In the experiments below and in the challenge, we use 15 attention resblocks in consideration of implementation speed.

4.3 Comparison with Baseline Models

To validate the effectiveness of the proposed deep residual attention network (DRAN) and the color-guided fusion framework (Fusion), we first compare their performance with the baseline models VDSR-2D and VDSR-3D (both with a network depth comparable to DRAN). All these models are trained using the whole training set of each track, and the official validation set is adopted for evaluation. The quantitative results are listed in Table 2. As can be seen, for both tracks, VDSR-3D outperforms VDSR-2D by a large margin, which indicates the importance of exploiting correlations across different bands in spectral image SR. Meanwhile, the results of DRAN show that the architecture of attention resblocks has a distinct advantage over plain residual networks. For Track 2, the fusion method achieves a 12% decrease in MRAE compared with DRAN, which demonstrates the usefulness of the HR RGB images as well as the fusion framework.

Fig. 5. Visual comparison of a spectral image from the validation set of Track 2. All 14 bands are averaged for the evaluation of spatial information

4.4 Ensemble Method and Running Time

For ensemble purposes, we flip and rotate the input image and treat the result as another input, similar to data augmentation. We then apply the inverse transform to the corresponding output. Finally, we average the transformed outputs and the original output to generate the self-ensemble result. In this way, a further improvement (a 3% decrease in MRAE) can be achieved.
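A sketch of such a self-ensemble is given below for the eight rotation/flip transforms (the exact set of transforms averaged here is an assumption), for a model operating on N×C×H×W tensors:

```python
import torch

def self_ensemble(model, x):
    outputs = []
    for k in range(4):                        # four rotations
        for flip in (False, True):            # with and without horizontal flip
            t = torch.rot90(x, k, dims=(2, 3))
            if flip:
                t = torch.flip(t, dims=(3,))
            y = model(t)
            if flip:                          # invert the transforms on the output
                y = torch.flip(y, dims=(3,))
            y = torch.rot90(y, -k, dims=(2, 3))
            outputs.append(y)
    return torch.stack(outputs).mean(dim=0)   # average all transformed outputs
```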

We calculate the average running time using a 1080Ti GPU. The running time includes the ensembling process. The fusion method costs nearly double the time of a single DRAN since it needs the result of DRAN to recover the cropped borders, and the VDSR-based models are slightly faster than DRAN.

Fig. 6. Visual comparison of a spectral image from the validation set of Track 2. Three bands are selected for the evaluation of spectral accuracy

4.5 Visual Quality Comparison

To evaluate the perceptual quality of spectral image SR, we show the visual comparison of two images in Track 2 (for which all methods can be compared together). Note that we average the spectral image across the spectral dimension for a better visual experience in the spatial dimension. As can be seen in Figs. 4 and 5, the edge regions of the super-resolved images from DRAN and Fusion are notably sharper and clearer than those from the VDSR-based models and bicubic interpolation. Also, comparing the results of Fusion and DRAN, the blurring artifacts are alleviated and more details of the spectral images are recovered with the help of the HR RGB images.

To further visualize the spectral accuracy of the reconstructed HR spectral images, we show the error maps of three selected bands of the above two images in Figs. 6 and 7. As can be seen, on the one hand, the errors of DRAN and Fusion are smaller than those of the VDSR-based models and bicubic interpolation, which again validates the effectiveness of the proposed method. On the other hand, the error in the edge regions of Fusion is obviously smaller than that of DRAN, which demonstrates that the HR RGB image mainly contributes to the edge regions.

Fig. 7. Visual comparison of a spectral image from the validation set of Track 2. Three bands are selected for the evaluation of spectral accuracy

5 Conclusions

This paper presents a novel deep residual attention network for the spatial SR of spectral images. The proposed method integrates the channel attention mechanism into the residual network to fully exploit the correlations across both the spectral and spatial dimensions of spectral images, which greatly improves the performance of spectral image SR. In addition, we design a fusion framework based on the proposed network for the scenario where stereo pairs of LR spectral and HR RGB measurements are available, which achieves further improvement compared to using the single LR spectral input. Experimental results demonstrate the superiority of the proposed method over plain residual networks, and our method is one of the winning solutions in the PIRM 2018 Spectral Image SR Challenge.