
1 Introduction

Image fusion is a form of information fusion applied to images. It combines images of the same target or scene acquired by different sensors, or by the same sensor in different imaging modes or at different imaging times. Multi-focus image fusion is a branch of image fusion. The fused image reflects the information of multiple source images and thus gives a more comprehensive description of the target and the scene, making it more suitable for visual perception or computer processing. Multi-focus image fusion has become a representative topic, and many algorithms have been developed for fields such as remote sensing, medical imaging and surveillance [14]. Conventionally, multi-focus image fusion algorithms are divided into transform domain algorithms and spatial domain algorithms [15]. Since many new algorithms have been proposed recently, we divide the existing fusion algorithms into three categories: multi-scale transform methods, sparse representation (SR) and low-rank representation based fusion methods, and deep learning based fusion methods.

Multi-scale transform (MST) methods are the most commonly used, including the discrete wavelet transform (DWT) [9], contourlet transform (CT) [25], shift-invariant shearlet transform [24] and curvelet transform (CVT) [5]. The basic idea is to transform the source images to obtain coefficient representations, fuse the coefficients according to a certain fusion rule, and finally obtain the fused image through the inverse transformation. All these methods share a “decomposition-fusion-reconstruction” framework. They represent structural information well, but can only extract limited directional information and cannot accurately extract complete contours [26].

In recent years, methods based on sparse representation and low-rank representation have also shown significant performance in image fusion. Yin et al. [27] proposed a novel multi-focus image fusion approach whose key point is a maximum weighted multi-norm fusion rule used to reconstruct the fused image from the sparse coefficients and the joint dictionary. A method based on saliency detection in the sparse domain [16] also achieves remarkable results. Yang et al. [26] combined robust sparse representation with an adaptive PCNN, which is also effective. Liu et al. [20] combined the multi-scale transform with sparse representation for image fusion, overcoming the inherent defects of both MST- and SR-based fusion methods. Besides the above methods, Li et al. [10] proposed a multi-focus image fusion method based on dictionary learning and low-rank representation which achieves better performance in both global and local structure. Li et al. also obtained significant results for noisy image fusion using low-rank representation [12].

With the development of deep learning, deep features have been used as saliency features to fuse images. Liu et al. [19] proposed a convolutional sparse representation (CSR)-based image fusion method. The CSR model was introduced by Zeiler et al. [28] in their deconvolutional networks for feature learning; thus, although CSR differs from deep learning methods, the features extracted by CSR are still deep features. Liu et al. [18] also applied a CNN model to image fusion, using it to generate the activity level measurement and the fusion rule. Li et al. [13] proposed an effective image fusion method that uses the fixed VGG-19 [23] to generate a single image containing all the features of the infrared and visible images. However, training a deep model is time consuming and complicated, and it places high demands on hardware.

In this paper, we propose a novel and effective multi-focus fusion method based on the PCA filters of PCANet [4], a very simple deep learning model. The main contribution of this paper is to use PCANet to extract image features and the nuclear norm to construct an effective feature space for image fusion. In particular, the training time of PCANet is shorter than that of other CNN-based networks, while the extracted features play the same role. The experimental results demonstrate that the proposed method achieves state-of-the-art fusion performance in terms of both objective assessment and visual quality.

The rest of this paper is organized as follows. In Sect. 2, we give a brief introduction to related work. In Sect. 3, the proposed multi-focus image fusion method is presented in detail. Section 4 presents the experimental results. Finally, Sect. 5 concludes the paper and puts forward future work.

2 Related Work

PCANet [4] is a very simple deep learning network which contains three parts: cascaded principal component analysis (two stages), binary hashing and block-wise histograms (output layer). In this architecture, PCA is employed to learn multi-stage filter banks, so the network can be designed and trained extremely easily and efficiently. In this paper, we only use the PCA filters to extract image features; binary hashing and block-wise histograms are not used. In the two stages of PCANet [4], the number of filters in the first stage \(L_1\) is set to 8, and the number of filters in the second stage \(L_2\) is the same. Therefore, we obtain 8 image features in the first stage and 64 image features in the second stage. Considering time efficiency and the non-redundancy of the data, we only utilize the image features of the first stage to fuse images. The first stage of PCANet is therefore explained as follows.

For each input image of size \(m \times n\), we take overlapping \(k_{1} \times k_{2}\) patches and collect them together, i.e., \(x_{i,1},x_{i,2},\cdots ,x_{i,mn}\in R^{k_1 k_2}\), where \(x_{i,j}\) denotes the \(j\)-th vectorized patch of image \(I_i\). Subtracting the patch mean from each patch, we obtain \(X_i = [\bar{x}_{i,1},\bar{x}_{i,2},\cdots ,\bar{x}_{i,mn}]\), where \(\bar{x}_{i,j}\) is a mean-centralized patch. Repeating this for all input images \(\{I_i\}_{i=1}^N\) (N is the number of input images) and putting the results together, we get

$$\begin{aligned} X = [\bar{X}_1,\bar{X}_2,\cdots ,\bar{X}_N]\in R^{k_1 k_2 \times Nmn} \end{aligned}$$
(1)

Let \(L_i\) denote the number of filters in layer i, so \(L_1\) is the number of filters in layer 1. PCA minimizes the reconstruction error, i.e.,

$$\begin{aligned} \min _{V\in R^{k_1 k_2 \times L_1}}||X-VV^TX||_F^2, \quad s.t.\ V^TV=I_{L_1} \end{aligned}$$
(2)

where \(I_{L_1}\) is the identity matrix of size \(L_1\times L_1\) and V is the matrix whose columns are the eigenvectors corresponding to the \(L_1\) largest eigenvalues of \(XX^T\). Therefore, the PCA filters are expressed as

$$\begin{aligned} W_l^1 = map_{k_1,k_2}(q_l(XX^T)) \in R^{k_1 \times k_2},\quad l=1,2,\cdots ,L_1 \end{aligned}$$
(3)

where \(map_{k_1,k_2}(v)\) is a function that maps the vector v to a matrix \(W\in R^{k_1 \times k_2}\), and \(q_l (XX^T)\) is the \(l\)-th principal eigenvector of \(XX^T\). The leading principal eigenvectors capture the main variations of all the mean-centralized training patches, so the PCA filters can be used to extract image features.
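To make the first-stage filter learning concrete, the following Python sketch (using NumPy) implements Eqs. (1)-(3). The function and variable names are our own and not taken from the PCANet release; it is a minimal sketch, not the authors' implementation.

```python
import numpy as np

def learn_pca_filters(images, k1=7, k2=7, L1=8):
    """Minimal sketch of first-stage PCA filter learning (Eqs. 1-3).

    images : list of 2-D float arrays (grayscale training images)
    k1, k2 : patch size; L1 : number of filters (8 in the paper).
    """
    cols = []
    for img in images:
        m, n = img.shape
        # collect all overlapping k1 x k2 patches, vectorized as columns
        for y in range(m - k1 + 1):
            for x in range(n - k2 + 1):
                patch = img[y:y + k1, x:x + k2].reshape(-1)
                cols.append(patch - patch.mean())   # remove the patch mean
    X = np.stack(cols, axis=1)                      # shape: k1*k2 x (number of patches)

    # Eq. (2): the optimal V consists of the L1 leading eigenvectors of X X^T
    eigvals, eigvecs = np.linalg.eigh(X @ X.T)
    order = np.argsort(eigvals)[::-1][:L1]
    # Eq. (3): reshape each leading eigenvector back into a k1 x k2 filter
    return [eigvecs[:, i].reshape(k1, k2) for i in order]
```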

3 The Proposed Fusion Method

3.1 Image Features

Considering the generalization performance of the filters, we train the two-stage PCANet on ImageNet [22], which contains 1000 categories, to obtain the PCA filters. We randomly select 5 images from each of the 1000 categories, so 5000 images in total are used to train the PCA filters; all of them are resized to \(256\times 256\) and color images are converted to grayscale. Training the PCA filters is implemented in Matlab R2016b on a 3.60 GHz Intel(R) Core(TM) CPU with 64 GB RAM. We use the filters of the first stage to extract image features; for example, Fig. 1 shows the features extracted from a source image with the fixed PCA filters.
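As an illustration of how the fixed filters are applied, a short sketch is given below. It assumes the feature maps are obtained by 2-D convolution of the grayscale source image with each learned filter, which follows standard PCANet usage but is not copied from the authors' code; the function name is ours.

```python
from scipy.signal import convolve2d

def extract_features(image, filters):
    """Convolve a grayscale source image with the fixed first-stage PCA
    filters to obtain the feature maps illustrated in Fig. 1 (sketch)."""
    # 'same' output size with symmetric padding keeps feature maps aligned
    # with the source image for the later fusion steps
    return [convolve2d(image, f, mode='same', boundary='symm') for f in filters]
```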

Fig. 1.

(a) is an original image and (b)–(i) are image features obtained by the PCA filters of the first stage of the PCANet.

In Fig. 1, (a) is a near-focused source image and (b)–(i) are image features extracted by the PCA filters, such as texture features, vertical textures, horizontal textures, point textures and diagonal textures. Evidently, the first few feature maps capture the image from a global perspective, while the subsequent ones become more specific to local saliency. Image fusion merges images of the same scene with different contents into one image; in other words, the saliency features of the different images are extracted and combined into a single image. In [3], to obtain better performance, the authors remove the first three principal components in the Eigenface method. Accordingly, the feature extraction before fusion should reflect the saliency of the source image as much as possible. We therefore conduct an experiment with four cases: using all 8 image features, discarding the first image feature, discarding the second image feature, and discarding both of them.

The experimental results show that it is better to discard the first two feature maps, so we only utilize the last six feature maps to build the activity level map. This experiment is described in detail in Sect. 4.2.

3.2 Proposed Fusion Method

In this section we will introduce the proposed fusion method in detail. The framework of the proposed fusion method is shown in Fig. 2.

Fig. 2.

The framework of the proposed method.

As shown in Fig. 2, the input images are denoted by \(Source_1\) and \(Source_2\) respectively. The features of \(Source_1\) and \(Source_2\) are extracted with the pre-trained PCA filters, and we only take the last six image features to calculate the activity level map.

In [11], the authors apply the nuclear norm to image features. In this paper, we use the nuclear norm [17], i.e., the sum of the singular values of a matrix, to process the image features and obtain the activity level maps. This procedure is shown in Fig. 3.

Fig. 3.

The procedure of the processing strategy for image features.

Fig. 4.

The construction of focus score maps through the sliding window technique.

As shown in Fig. 3, the border of each feature map is zero-padded and the six feature maps are concatenated into a 6-channel image feature, which is then processed with the nuclear norm. Taking each pixel as the center, a multi-channel block is taken from the same position and reshaped into a two-dimensional matrix. The sum of the singular values of this matrix is calculated and used to replace the original pixel value. The activity level maps are therefore composed of these sums of singular values, as formalized in Eq. 4.

$$\begin{aligned} M_k(x,y)= ||R(F_k^{1:c}(x-t:x+t,\ y-t:y+t))||_* \end{aligned}$$
(4)

where \(R(*)\) is the reshape operation, (x, y) is the position of the pixel, c is the number of channels, k is the index of the pre-registered source image, \((2t+1)\times (2t+1)\) is the block size, \(M_k (*)\) is the activity level map and \(F_k^{1:c} (*)\) denotes the c feature maps of the \(k\)-th source image.
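The following sketch illustrates Eq. (4). It assumes the six selected feature maps are stacked into an H x W x 6 array and that a 5 x 5 window (t = 2) with zero padding is used, as in our experiments; the helper name is ours.

```python
import numpy as np

def activity_level_map(features, t=2):
    """Eq. (4): nuclear norm of the reshaped multi-channel block around
    each pixel. `features` is an H x W x c array of the c feature maps."""
    H, W, c = features.shape
    padded = np.pad(features, ((t, t), (t, t), (0, 0)), mode='constant')  # zero padding
    M = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            block = padded[y:y + 2 * t + 1, x:x + 2 * t + 1, :]
            # reshape the multi-channel block into a 2-D matrix and
            # take the sum of its singular values (the nuclear norm)
            M[y, x] = np.linalg.norm(block.reshape(-1, c), ord='nuc')
    return M
```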

Next, the activity level map is processed to obtain the decision map. Following [21], we process the activity level maps as shown in Fig. 4.

In Fig. 4, for each pair of corresponding blocks \(f_1\) and \(f_2\) centered at the same pixel, we calculate the sum of all coefficients in each block, denoted as \(s_1\) and \(s_2\). If \(s_1>s_2\), the corresponding pixel is set to 1; otherwise, it is set to 0. We thereby obtain two complementary focus score maps, denoted as \(FSM_1\) and \(FSM_2\). The steps are shown in Eqs. 5 and 6.

$$\begin{aligned} FSM_1(x,y)= {\left\{ \begin{array}{ll} 1, &{} s_1>s_2 \\ 0, &{} otherwise \end{array}\right. } \end{aligned}$$
(5)
$$\begin{aligned} s_i=sum(f_i),i \in \{1,2\} \end{aligned}$$
(6)

where \(sum(*)\) is the sum of all coefficients in \(f_i\).
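A minimal sketch of Eqs. (5)-(6) is given below. It assumes a 3 x 3 sliding window (as in Sect. 4.1) and uses local means, which order the same way as local sums over a fixed window; the names are chosen for illustration only.

```python
from scipy.ndimage import uniform_filter

def focus_score_maps(M1, M2, win=3):
    """Eqs. (5)-(6): compare local sums of the two activity level maps."""
    s1 = uniform_filter(M1, size=win)   # local mean ~ local sum / win**2
    s2 = uniform_filter(M2, size=win)
    FSM1 = (s1 > s2).astype(float)      # 1 where source 1 is better focused
    FSM2 = 1.0 - FSM1                   # the two maps are complementary
    return FSM1, FSM2
```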

As the focus score maps usually contain some small holes surrounded by focused regions, we apply a simple post-processing approach to remove them. We use the small region removal strategy of [18], with the area threshold universally set to \({0.1\times H \times W}\), where H and W are the height and width of each source image. Subsequently, we apply morphological closing and opening operations to the focus score maps. Finally, following [21], we combine the two focus score maps into one decision map: for pixels where the two focus score maps are complementary, we take the value of the first focus score map; otherwise, the value is set to 0.5. The final decision map, denoted as \(D_{final}\), is computed as shown in Eq. 7.

$$\begin{aligned} D_{final}(x,y)= {\left\{ \begin{array}{ll} FSM_1(x,y), &{} FSM_1(x,y)+FSM_2(x,y)=1 \\ 0.5, &{} otherwise \end{array}\right. } \end{aligned}$$
(7)

Finally, we obtain the \(D_{final}\) shown in Fig. 5 (a).
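The post-processing can be sketched as follows. This is only an approximation under stated assumptions: the exact small-region handling of [18] may differ, and we simply remove connected components smaller than \(0.1\times H\times W\), apply closing and opening, and then combine the cleaned maps via Eq. (7); function names are ours.

```python
import numpy as np
from scipy.ndimage import label, binary_closing, binary_opening

def decision_map(FSM1, FSM2):
    """Sketch of the post-processing in Sect. 3.2 and Eq. (7)."""
    H, W = FSM1.shape
    area_thr = 0.1 * H * W

    def clean(fsm):
        mask = fsm > 0.5
        labels, num = label(mask)                 # connected components
        for i in range(1, num + 1):
            if (labels == i).sum() < area_thr:    # drop small focused regions
                mask[labels == i] = False
        return binary_opening(binary_closing(mask)).astype(float)

    f1, f2 = clean(FSM1), clean(FSM2)
    # Eq. (7): keep FSM1 where the cleaned maps are still complementary,
    # otherwise fall back to an averaging weight of 0.5
    return np.where(f1 + f2 == 1.0, f1, 0.5)
```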

Fig. 5.

(a) is \(D_{final}\) and (b) is fused image.

3.3 Fusion Method

With the final decision map \(D_{final}\), the fused image F is calculated by

$$\begin{aligned} F = D_{final}Source_1 + (1-D_{final})Source_2. \end{aligned}$$
(8)

The fused image is shown in Fig. 5(b).

The algorithm is described in Table 1.

Table 1. Algorithm flow chart
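For readability, the flow summarized in Table 1 can also be written as a short end-to-end sketch built from the hypothetical helper functions sketched in Sects. 2 and 3 (learn_pca_filters, extract_features, activity_level_map, focus_score_maps, decision_map); it is illustrative, not the released implementation.

```python
import numpy as np

def fuse(source1, source2, filters):
    """End-to-end sketch of the proposed fusion pipeline."""
    feats1 = np.dstack(extract_features(source1, filters)[2:])  # keep last six maps
    feats2 = np.dstack(extract_features(source2, filters)[2:])
    M1 = activity_level_map(feats1, t=2)          # 5x5 nuclear-norm window
    M2 = activity_level_map(feats2, t=2)
    FSM1, FSM2 = focus_score_maps(M1, M2, win=3)  # Eqs. (5)-(6)
    D = decision_map(FSM1, FSM2)                  # post-processing + Eq. (7)
    return D * source1 + (1.0 - D) * source2      # Eq. (8)
```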

4 Experiments

In this section, we describe the source images and the experimental environment, give a detailed account of the feature selection experiment mentioned in Sect. 3.1, and present a subjective and objective comparison between the proposed method and existing methods.

4.1 Experimental Settings

As introduced in Sect. 3.1, our image sets, denoted SET1 and SET2, come from the references [29] and [10]. There are 15 pairs of source images in SET1 and 20 pairs in SET2; part of them are shown in Figs. 6 and 7.

Fig. 6.

Four pairs of source images from SET1.

Fig. 7.

Four pairs of source images from SET2.

Secondly, we compare the proposed method with several existing fusion methods, including: convolutional sparse representation fusion method (CSR) [19], multi-focus image fusion with dense SIFT (DSIFT) [21], multi-focus image fusion with a deep convolutional neural network (CNN) [18], infrared and visible image fusion using a deep learning framework (VGG) [13], discrete cosine harmonic wavelet transform fusion method (DCHWT) [7] and cross bilateral filter fusion method (CBF) [8].

In order to evaluate the proposed method and the existing methods objectively, we choose several image quality metrics: average gradient (AG), entropy (EN), mutual information (MI) [1], FMI_gradient [6] and the sum of the correlations of differences (SCD) [2].

In our experiment, the sliding window size is \(5\times 5\) in nuclear norm used for feature processing and the step is one pixel. The sliding window size is \(3\times 3\) in the construction of focus score maps and the step is one pixel as well.

The fusion algorithm is implemented in Matlab R2016a on a 3.00 GHz Intel(R) Core(TM) CPU with 4 GB RAM.

4.2 Feature Selection Experiment

In this section, we describe the experiment mentioned in Sect. 3.1. On the two image sets (SET1 and SET2), according to the number of image features used, our method is divided into four cases: (1) all features are used, (2) the first image feature is discarded, (3) the second image feature is discarded and (4) the first two image features are discarded. We compare the four cases using multiple evaluation indexes and take the average value over the images of each set as the final result, as shown in Tables 2 and 3.

Table 2. The AG, EN, MI and FMI_gradient average values of the four feature-selection cases for SET1.
Table 3. The AG, EN, MI and SCD average values of the four feature-selection cases for SET2.

In Tables 2 and 3, the best results are bolded. The two tables show that the distribution of the best results over the four cases is the same for each data set, but the values of the fourth case are better than or equal to those of the first three cases. Therefore, only the last six image features are selected, and the first two are abandoned.

4.3 Image Fusion Results

We use the fifteen pairs of source images in SET1 to test the comparison methods and the proposed method. The fused results are shown in Fig. 8, where we choose one pair of source images as an example. The values of AG, EN, MI and FMI_gradient for the fifteen fused images are shown in Tables 4 and 5.

Fig. 8.

Examples of fused results. (a) Source image a; (b) Source image b; (c) CSR; (d) DSIFT; (e) CNN; (f) VGG; (g) DCHWT; (h) CBF; (i) The proposed method.

Fig. 9.

Examples of fused results. (a) Source image a; (b) Source image b; (c) CSR; (d) DSIFT; (e) CNN; (f) VGG; (g) DCHWT; (h) CBF; (i) The proposed method.

Table 4. The AG, EN, MI and FMI_gradient values of the compared methods and the proposed method for 10 pairs source images from SET1.
Table 5. The AG, EN, MI and FMI_gradient values of the compared methods and the proposed method for another 5 pairs source images from SET1.
Table 6. The AG, EN, MI and SCD values of the compared methods and the proposed method for 10 pairs source images from SET2.
Table 7. The AG, EN, MI and SCD values of the compared methods and the proposed method for another 10 pairs source images from SET2.

As shown in Fig. 8, the proposed method achieves almost the same fusion performance as the other classical and novel fusion methods in terms of the human visual system. Therefore we mainly discuss the fusion performance with the quality metrics, as shown in Tables 4 and 5.

In Tables 4 and 5, the best results are bolded and the second-best results are marked in red. In most cases, the proposed method achieves good indicator values.

We also make the same comparison on SET2, which contains 20 pairs of source images. The fused results are shown in Fig. 9, where we again choose one pair of source images as an example. The values of AG, EN, MI and SCD for the twenty fused images are shown in Tables 6 and 7.

As shown in Fig. 9, from the human visual perspective there is almost no significant difference between the fusion results of these methods. Therefore we evaluate the fusion results objectively, as shown in Tables 6 and 7.

In Tables 6 and 7, the best results are bolded and the second-best results are marked in red. In most cases, the proposed method achieves good indicator values as well.

5 Conclusion

In this paper, we propose a novel fusion method based on PCANet. First of all, we utilize the PCA filters to extract image features of source images, and then we apply the nuclear norm to process the image features in order to get activity level maps. Through a series of post-processing operations on activity level maps, the decision map is obtained. Finally, the fused image is obtained by utilizing a weighted fusion rule. The experimental results demonstrate that the proposed method can obtain state-of-the-art fusion performance in terms of both objective assessment and visual quality.