1 Introduction

In medical image analysis, it is sometimes convenient or necessary to infer an image of one modality or resolution from an image of another modality or resolution, for better disease visualization, prediction and detection. A major challenge of cross-modality image segmentation or registration comes from the differences in tissue appearance or spatial resolution between images arising from different physical acquisition principles or parameters, which makes these images difficult to represent and relate. Some existing methods tackle this problem by learning from a large number of registered images and constraining pairwise solutions in a common space. In general, one would prefer high-resolution (HR) three-dimensional Magnetic Resonance Imaging (MRI) with near-isotropic voxel resolution over the more common stacks of multiple 2D slices for accurate quantitative image analysis and diagnosis. Multi-modality imaging can generate tissue contrast arising from various anatomical or functional features that present complementary information about the underlying organ. Acquiring low-resolution (LR) single-modality images, however, is not uncommon.

To solve the above problems, super-resolution (SR) [1, 2] reconstruction is carried out to recover an HR image from its LR counterpart, and cross-modality synthesis (CMS) [3] is proposed for synthesizing target modality data from available source modality images. Generally, these methods exploit image priors from either the internal similarities of the image itself [4] or external data support [5] to construct the relationship between two modalities. Although these methods achieve remarkable results, most of them suffer from fundamental limitations associated with large-scale pairwise training sets or patch-based overlapping mechanisms. Specifically, a large number of multi-modal images is often required to learn sufficiently expressive dictionaries/networks. However, this is impractical since collecting medical images is very costly and limited by many factors. On the other hand, patch-based methods are subject to inconsistencies introduced during the fusion process that takes place in areas where patches overlap.

To deal with the bottlenecks of training data and patch-based implementation, we develop a dual convolutional filter learning (DOTE) method, with an application to neuroimaging, that exploits data (in both source and target modalities from the same set of subjects) more effectively and solves the image SR and CMS problems. The contributions of this work are fourfold: (1) We present a unified model (DOTE) for any cross-modality image synthesis problem; (2) The proposed method can efficiently reduce the amount of training data needed by the model, by generating abundant feedback from dual mapping functions during the training process; (3) Our method integrates feature learning and mapping relation in a closed loop for self-optimization, and local neighbors are preserved intrinsically by working directly on whole images; (4) We evaluate DOTE on two datasets in comparison with state-of-the-art methods. Experimental results demonstrate superior performance of DOTE over these approaches.

Fig. 1. Flowchart of the proposed method for MRI cross-modality synthesis.

2 Method

2.1 Background

Convolutional Sparse Coding (CSC) remedies a fundamental drawback of conventional patch-based sparse representation methods by modeling shift invariance for consistent approximation of local neighbors on whole images. Instead of decomposing a vector as the product of dictionary atoms and coded coefficients, CSC models local interactions more elegantly by representing an image as the sum of convolutions of sparsely distributed feature maps with their corresponding filters. Concretely, given an \(m \times n\) image \(\mathbf {x}\) in vector form, the problem of learning a set of vectorized filters for sparse feature maps is solved by minimizing an objective function that combines a convolutional least-squares term and an \(l_1\)-norm penalty on the representations:

$$\begin{aligned} \begin{aligned} \arg \min _{\mathbf {f},\mathbf {s}} \frac{1}{2}\left\| \mathbf {x}-\sum _{k=1}^{K}\mathbf {f}_{k}*\mathbf {s}_{k} \right\| _{2}^{2}+\lambda \sum _{k=1}^{K}\left\| \mathbf {s}_{k} \right\| _{1}\\ s.t. \; \left\| \mathbf {f}_{k} \right\| _{2}^{2}\le 1 \;\; \forall k=\left\{ 1,...,K \right\} , \end{aligned} \end{aligned}$$
(1)

where \(\mathbf {f}_{k} \in \mathbf {f}=\left[ \mathbf {f}_{1}^{T},...,\mathbf {f}_{K}^{T} \right] ^{T}\) is the k-th \(d \times d\) filter, \(*\) denotes the 2D convolution operator, \(\mathbf {s}_{k} \in \mathbf {s}=\left[ \mathbf {s}_{1}^{T},...,\mathbf {s}_{K}^{T} \right] ^{T}\) refers to the sparse feature map corresponding to \(\mathbf {f}_{k}\) with size \(\left( m+d-1 \right) \times \left( n+d-1 \right) \) to approximate \(\mathbf {x}\), and \(\lambda \) is a regularization parameter. The problem in Eq. (1) can be efficiently and explicitly solved in the Fourier domain, derived within an Alternating Direction Method of Multipliers (ADMM) framework [6].
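To make the terms of Eq. (1) concrete, the following minimal sketch evaluates the CSC objective for given filters and feature maps. It is an illustration under stated assumptions, not the solver of [6]: the helper names `conv2d_valid` and `csc_objective` are hypothetical, images are kept as 2D arrays rather than vectors, and a direct (slow) "valid" convolution is used so that a \(\left( m+d-1 \right) \times \left( n+d-1 \right) \) feature map convolved with a \(d \times d\) filter yields an \(m \times n\) image.

```python
import numpy as np


def conv2d_valid(s, f):
    """Direct 'valid' 2D convolution of feature map s with filter f."""
    d1, d2 = f.shape
    m = s.shape[0] - d1 + 1
    n = s.shape[1] - d2 + 1
    f_flipped = f[::-1, ::-1]  # convolution (not correlation) flips the kernel
    out = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            out[i, j] = np.sum(s[i:i + d1, j:j + d2] * f_flipped)
    return out


def csc_objective(x, filters, maps, lam):
    """Least-squares term plus l1 penalty, as in Eq. (1)."""
    recon = sum(conv2d_valid(s, f) for f, s in zip(filters, maps))
    data_term = 0.5 * np.sum((x - recon) ** 2)
    sparsity = lam * sum(np.abs(s).sum() for s in maps)
    return data_term + sparsity


# Toy example (sizes chosen for illustration only).
rng = np.random.default_rng(0)
m, n, d, K = 8, 8, 3, 4
map_shape = (m + d - 1, n + d - 1)

filters = [rng.standard_normal((d, d)) for _ in range(K)]
# Enforce the constraint ||f_k||_2^2 <= 1 by rescaling into the unit ball.
filters = [f / max(1.0, np.linalg.norm(f)) for f in filters]

# Sparse feature maps: roughly 10% of the entries are non-zero.
maps = [rng.standard_normal(map_shape) * (rng.random(map_shape) < 0.1)
        for _ in range(K)]

# An image that these filters and maps reproduce exactly.
x = sum(conv2d_valid(s, f) for f, s in zip(filters, maps))
print(csc_objective(x, filters, maps, lam=0.05))
```

Because `x` is built from the same filters and maps, the least-squares term vanishes here and the printed value is the sparsity penalty alone.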

Dual Learning (DL) [7] is a new learning paradigm that forms a closed loop between source and target domains to generate informative feedback. Specifically, for any pair of dual tasks (e.g., \(A \leftrightarrow B\)), the DL strategy appoints \(A \rightarrow B\) as the primary task and \(A \leftarrow B\) as the dual task, and forces them to learn from each other to produce the pseudo-input \({A}'\). It can achieve comparable performance by iteratively updating the model and minimizing the reconstruction error \(A-{A}'\), which helps maximize the use of data and therefore makes learning-based methods less dependent on large amounts of training data.

Problem Formulation: The cross-modality image synthesis problem can be formulated as follows: given a 3D image \(\mathbf {X}\) of modality \(\mathcal {M}_{1}\), the task is to infer from \(\mathbf {X}\) a target 3D image \(\mathbf {Y}\) that approximates the ground truth of modality \(\mathcal {M}_{2}\). Let \(\mathcal {X}=\left[ \mathbf {X}_{1}, ...,\mathbf {X}_{C}\right] \in \mathbb {R}^{m\times n\times z\times C}\) be a set of images of modality \(\mathcal {M}_{1}\) in the source domain, and \(\mathcal {Y}=\left[ \mathbf {Y}_{1}, ...,\mathbf {Y}_{C}\right] \in \mathbb {R}^{m\times n\times z\times C}\) be a set of images of modality \(\mathcal {M}_{2}\) in the target domain. Here m, n are the dimensions of the axial view of the image, z denotes the size of the image along the z-axis, and C is the number of elements in the training sets. Each pair \(\left\{ \mathbf {X}_{i},\mathbf {Y}_{i} \right\} \) \(\forall i=\left\{ 1,...,C \right\} \) is registered. To bridge image appearances across different modalities while preserving the intrinsic local interactions (i.e., intra-domain consistency), we propose a method based on CSC to jointly learn a pair of filters \(\mathbf {F}^{x}\) and \(\mathbf {F}^{y}\). Moreover, inspired by the DL strategy, we form a closed loop between both domains and assume that there exists a primal mapping function \(\mathcal {F}\left( \cdot \right) \) from \(\mathcal {X}\) to \(\mathcal {Y}\) for relating and predicting one from the other. We also assume there exists a dual mapping function \(\mathcal {G}\left( \cdot \right) \) from \(\mathcal {Y}\) to \(\mathcal {X}\) to generate feedback for model self-optimization. Experimentally, we investigate human brain MRI and apply our method to two cross-modality synthesis tasks, i.e., image SR and CMS. An overview of our method is depicted in Fig. 1.

Notation: Matrices and 3D images are written in bold uppercase (e.g., image \(\mathbf {X}\)), vectors and vectorized 2D images in bold lowercase (e.g., filter \(\mathbf {f}\)) and scalars in lowercase (e.g., element k).

2.2 Dual Convolutional Filter Learning

Inspired by CSC (cf. Sect. 2.1) and the benefits of conventional coupled sparsity, we propose a dual convolutional filter learning (DOTE) model, which combines the original CSC formulation, a DL strategy and joint representation in a unified framework. More specifically, given \(\mathcal {X}\) together with the corresponding \(\mathcal {Y}\) for training, in order to facilitate a joint mapping, we associate the sparse feature maps of each registered data pair \(\left\{ \mathbf {X}_{i},\mathbf {Y}_{i} \right\} _{i=1}^{C}\) by constructing a forward mapping function \(\mathcal {F}: \mathcal {X} \mapsto \mathcal {Y}\) with \(\mathbf {Y}=\mathcal {F}\left( \mathbf {X} \right) \). Since such a cross-modality synthesis problem satisfies a dual-learning mechanism, we further leverage the duality of the bidirectional transformation between the two domains by establishing a dual mapping function \(\mathcal {G}: \mathcal {Y} \mapsto \mathcal {X}\) with \(\mathbf {X}=\mathcal {G}\left( \mathbf {Y} \right) \). Incorporating the feature map representation and the above closed-loop mapping functions, we can thus derive the following objective function:

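$$\begin{aligned} \begin{aligned} \arg \min _{\mathbf {F}^{x},\mathbf {F}^{y},\mathbf {S}^{x},\mathbf {S}^{y},\mathbf {W}}&\; \frac{1}{2}\left\| \mathbf {X}-\sum _{k=1}^{K}\mathbf {F}_{k}^{x}*\mathbf {S}_{k}^{x} \right\| _{F}^{2}+\frac{1}{2}\left\| \mathbf {Y}-\sum _{k=1}^{K}\mathbf {F}_{k}^{y}*\mathbf {S}_{k}^{y} \right\| _{F}^{2}+\lambda \sum _{k=1}^{K}\left( \left\| \mathbf {S}_{k}^{x} \right\| _{1}+\left\| \mathbf {S}_{k}^{y} \right\| _{1} \right) \\&+\beta \sum _{k=1}^{K}\left( \left\| \mathbf {S}_{k}^{y}-\mathbf {W}_{k}\mathbf {S}_{k}^{x} \right\| _{F}^{2}+\left\| \mathbf {S}_{k}^{x}-\mathbf {W}_{k}^{-1}\mathbf {S}_{k}^{y} \right\| _{F}^{2} \right) +\gamma \sum _{k=1}^{K}\left\| \mathbf {W}_{k} \right\| _{F}^{2}\\ s.t.&\; \left\| \mathbf {F}_{k}^{x} \right\| _{2}^{2}\le 1,\; \left\| \mathbf {F}_{k}^{y} \right\| _{2}^{2}\le 1 \;\; \forall k=\left\{ 1,...,K \right\} , \end{aligned} \end{aligned}$$
(2)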

where \(\mathbf {S}_{k}^{x}\) and \(\mathbf {S}_{k}^{y}\) are the k-th sparse feature maps that approximate data \(\mathbf {X}\) and \(\mathbf {Y}\) when convolved with the k-th filters \(\mathbf {F}_{k}^{x}\) and \(\mathbf {F}_{k}^{y}\) of a fixed spatial support, \(k=1,...,K\). \(\left\| \cdot \right\| _{F}\) is the Frobenius norm, chosen to induce the convolutional least squares approximation, \(*\) denotes the 3D convolution operator, and \(\lambda \), \(\beta \), \(\gamma \) are the regularization parameters. In particular, the dual mapping functions \(\mathcal {F}\left( \mathbf {S}_{k}^{x}, \mathbf {W}_{k}\right) =\mathbf {W}_{k} \mathbf {S}_{k}^{x}\) and \(\mathcal {G}\left( \mathbf {S}_{k}^{y}, \mathbf {W}_{k}^{-1}\right) =\mathbf {W}_{k}^{-1} \mathbf {S}_{k}^{y}\) are used to relate the sparse feature maps of \(\mathbf {X}\) and \(\mathbf {Y}\) over \(\mathbf {F}^{x}\) and \(\mathbf {F}^{y}\). They are obtained by solving two sets of least squares terms, i.e., \(\sum _{k=1}^{K}\left( \left\| \mathbf {S}_{k}^{y}-\mathbf {W}_{k}\mathbf {S}_{k}^{x} \right\| _{F}^{2} + \left\| \mathbf {S}_{k}^{x}-\mathbf {W}_{k}^{-1}\mathbf {S}_{k}^{y} \right\| _{F}^{2}\right) \), with respect to the linear projections.
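The closed loop formed by the two mapping terms can be illustrated numerically. The sketch below is only an illustration under assumptions: each feature map is treated as a \(p \times q\) matrix, \(\mathbf {W}_{k}\) is a dense invertible \(p \times p\) matrix, and the function name `dual_mapping_penalty` is hypothetical.

```python
import numpy as np


def dual_mapping_penalty(S_x, S_y, W):
    """Return the primal and dual least-squares terms for one filter index k.

    primal: ||S_y - W S_x||_F^2       (source -> target)
    dual:   ||S_x - W^{-1} S_y||_F^2  (target -> source, the feedback term)
    """
    W_inv = np.linalg.inv(W)
    primal = np.linalg.norm(S_y - W @ S_x, "fro") ** 2
    dual = np.linalg.norm(S_x - W_inv @ S_y, "fro") ** 2
    return primal, dual


rng = np.random.default_rng(0)
p, q = 6, 10
S_x = rng.standard_normal((p, q))
# A mapping close to the identity, so that it is safely invertible.
W = np.eye(p) + 0.1 * rng.standard_normal((p, p))

# Perfectly consistent pair: both terms are (numerically) zero.
print(dual_mapping_penalty(S_x, W @ S_x, W))

# Perturbing the target maps makes both terms positive.
S_y_noisy = W @ S_x + 0.1 * rng.standard_normal((p, q))
print(dual_mapping_penalty(S_x, S_y_noisy, W))
```

The second call shows why the dual term is informative during training: any mismatch between the two sets of feature maps is penalized in both directions of the loop.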

2.3 Optimization

Similar to classical dictionary learning methods, the objective function in Eq. (2) is not jointly convex with respect to the learned filter pairs, the sparse feature maps and the mapping. We therefore divide the optimization into three sub-problems: learning \(\mathbf {S}^{x}\), \(\mathbf {S}^{y}\), training \(\mathbf {F}^{x}\), \(\mathbf {F}^{y}\), and updating \(\mathbf {W}\).

Computing sparse feature maps: We first initialize the filters \(\mathbf {F}^{x}\), \(\mathbf {F}^{y}\) as two random matrices and the mapping \(\mathbf {W}\) as an identity matrix, then fix them to compute the sparse feature maps \(\mathbf {S}^{x}\), \(\mathbf {S}^{y}\). As a result, the problem of Eq. (2) can be converted into two optimization sub-problems. Unfortunately, these cannot be solved directly under the \(l_{1}\) penalty without breaking rotation invariance. The alternating algorithms of [6] handle this by introducing two auxiliary variables \(\mathbf {U}\) and \(\mathbf {V}\) that enforce the constraints inherent in the splitting. In this paper, we follow [6] and solve the convolution subproblems in the Fourier domain within an ADMM optimization strategy:

(3)

where \(\hat{}\) applied to any symbol denotes its frequency representation (i.e., its Discrete Fourier Transform (DFT)). For instance, \(\hat{\mathbf {X}} \leftarrow f(\mathbf {X})\) where \(f(\cdot )\) is the Fourier transform operator. \(\odot \) represents the component-wise product. \(\mathbf {\Phi }^{T}\) is the inverse DFT matrix, and \(\mathbf {V}\) projects a filter onto the small spatial support. The auxiliary variables \(\mathbf {U}_{k}^{x}\), \(\mathbf {U}_{k}^{y}\), \(\mathbf {V}_{k}^{x}\) and \(\mathbf {V}_{k}^{y}\) relax each of the CSC problems under the dual mapping constraint, decomposing it into several subproblems.
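Two ingredients of Fourier-domain ADMM solvers of this kind can be checked in isolation. The sketch below is not the update rule of Eq. (3); it only illustrates, in 2D and with hypothetical function names, that (a) circular convolution becomes a component-wise product after the DFT, and (b) the \(l_{1}\) penalty is commonly handled by soft-thresholding, which is assumed here to be the proximal step applied to the auxiliary variables.

```python
import numpy as np


def circular_conv2d_fft(f, s):
    """Circular 2D convolution computed as ifft2(fft2(f) * fft2(s))."""
    f_padded = np.zeros_like(s)
    f_padded[:f.shape[0], :f.shape[1]] = f  # zero-pad filter to image size
    return np.real(np.fft.ifft2(np.fft.fft2(f_padded) * np.fft.fft2(s)))


def circular_conv2d_direct(f, s):
    """Same circular convolution, computed in the spatial domain."""
    out = np.zeros_like(s)
    for a in range(f.shape[0]):
        for b in range(f.shape[1]):
            # np.roll shifts s so that entry (i, j) holds s[i - a, j - b].
            out += f[a, b] * np.roll(s, shift=(a, b), axis=(0, 1))
    return out


def soft_threshold(v, tau):
    """Proximal operator of tau * ||v||_1 (element-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


rng = np.random.default_rng(0)
f = rng.standard_normal((3, 3))
s = rng.standard_normal((16, 16))

# The two computations agree up to floating-point error.
print(np.allclose(circular_conv2d_fft(f, s), circular_conv2d_direct(f, s)))

# Small entries are set to zero, large ones are shrunk towards zero.
print(soft_threshold(np.array([-2.0, -0.5, 0.0, 0.5, 2.0]), 1.0))
```

Working in the frequency domain replaces each convolution by an element-wise product, which is what makes the least-squares subproblems cheap to solve.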

Learning convolutional filters: As with the sparse feature maps, the filter pairs are learned by keeping \(\mathbf {S}^{x}\), \(\mathbf {S}^{y}\) and \(\mathbf {W}\) fixed and then learning \(\mathbf {F}^{x}\) and \(\mathbf {F}^{y}\) by minimizing

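$$\begin{aligned} \begin{aligned} \arg \min _{\mathbf {F}^{x},\mathbf {F}^{y}}&\; \frac{1}{2}\left\| \mathbf {X}-\sum _{k=1}^{K}\mathbf {F}_{k}^{x}*\mathbf {S}_{k}^{x} \right\| _{F}^{2}+\frac{1}{2}\left\| \mathbf {Y}-\sum _{k=1}^{K}\mathbf {F}_{k}^{y}*\mathbf {S}_{k}^{y} \right\| _{F}^{2}\\ s.t.&\; \left\| \mathbf {F}_{k}^{x} \right\| _{2}^{2}\le 1,\; \left\| \mathbf {F}_{k}^{y} \right\| _{2}^{2}\le 1 \;\; \forall k=\left\{ 1,...,K \right\} . \end{aligned} \end{aligned}$$
(4)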

Equation (4) can be solved by a one-by-one update strategy through an augmented Lagrangian method [6].
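In augmented Lagrangian schemes for convolutional filter learning, the norm constraint and the small spatial support are typically enforced by a projection step on each filter. The following minimal 2D sketch shows one such projection; it is an assumption about how the constraint can be handled, not a transcription of the procedure in [6], and `project_filter` is a hypothetical name.

```python
import numpy as np


def project_filter(f_full, d):
    """Project an image-sized filter onto the feasible set.

    Step 1: keep only the top-left d x d spatial support (zero elsewhere).
    Step 2: if the l2 norm exceeds 1, rescale onto the unit ball.
    """
    f = np.zeros_like(f_full)
    f[:d, :d] = f_full[:d, :d]
    norm = np.linalg.norm(f)
    if norm > 1.0:
        f = f / norm
    return f


rng = np.random.default_rng(0)
f_full = rng.standard_normal((16, 16))  # unconstrained, image-sized estimate
f_proj = project_filter(f_full, d=5)

print(np.count_nonzero(f_proj[5:, :]) + np.count_nonzero(f_proj[:, 5:]))
print(np.linalg.norm(f_proj) <= 1.0 + 1e-12)
```

The projection is idempotent: applying it to an already feasible filter leaves the filter unchanged.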

Updating mapping: With fixed \(\mathbf {F}^{x}\), \(\mathbf {F}^{y}\), \(\mathbf {S}^{x}\) and \(\mathbf {S}^{y}\), we solve the following ridge regression problem for updating mapping \(\mathbf {W}\):

$$\begin{aligned} \begin{aligned} \min _{\mathbf {W}}\sum _{k=1}^{K}\left( \left\| \mathbf {S}_{k}^{y}-\mathbf {W}_{k}\mathbf {S}_{k}^{x} \right\| _{F}^{2}+\left\| \mathbf {S}_{k}^{x}-\mathbf {W}_{k}^{-1}\mathbf {S}_{k}^{y} \right\| _{F}^{2} \right) +\left( \frac{\gamma }{\beta } \right) \sum _{k=1}^{K}\left\| \mathbf {W}_{k} \right\| _{F}^{2}. \end{aligned} \end{aligned}$$
(5)

In particular, the primal mapping term \(\left\| \mathbf {S}_{k}^{y}-\mathbf {W}_{k}\mathbf {S}_{k}^{x} \right\| _{F}^{2}\) constructs an intrinsic mapping, while the corresponding dual mapping term \(\left\| \mathbf {S}_{k}^{x}-\mathbf {W}_{k}^{-1}\mathbf {S}_{k}^{y} \right\| _{F}^{2}\) is used to give feedback and further optimize the relationship between \(\mathbf {S}_{k}^{x}\) and \(\mathbf {S}_{k}^{y}\). Ideally (at the final solution), \(\mathbf {S}_{k}^{y}=\mathbf {W}_{k}\mathbf {S}_{k}^{x}\), in which case the dual term is redundant because \(\mathbf {W}_{k}^{-1}\mathbf {S}_{k}^{y}=\mathbf {S}_{k}^{x}\), and the problem in Eq. (5) reduces to \(\min _{\mathbf {W}}\sum _{k=1}^{K}\left\| \mathbf {S}_{k}^{y}-\mathbf {W}_{k}\mathbf {S}_{k}^{x} \right\| _{F}^{2}+\left( \frac{\gamma }{\beta } \right) \sum _{k=1}^{K}\left\| \mathbf {W}_{k} \right\| _{F}^{2}\) with the solution \(\mathbf {W}_{k}=\mathbf {S}_{k}^{y}{\mathbf {S}_{k}^{x}}^{T}(\mathbf {S}_{k}^{x}{\mathbf {S}_{k}^{x}}^{T}+\frac{\gamma }{\beta } \mathbf {I})^{-1}\), where \(\mathbf {I}\) is an identity matrix. We summarize the proposed DOTE method in the following Algorithm 1.

Algorithm 1. Summary of the proposed DOTE method.
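The closed-form mapping update of Sect. 2.3 can be sketched in a few lines. This is a minimal illustration under assumptions: each feature map is stored as a \(p \times q\) matrix, the function name `update_mapping` is hypothetical, and the values of \(\gamma \) and \(\beta \) are the ones reported in the experimental setup.

```python
import numpy as np


def update_mapping(S_x, S_y, ridge):
    """Ridge-regression update W = S_y S_x^T (S_x S_x^T + ridge * I)^{-1}."""
    p = S_x.shape[0]
    A = S_x @ S_x.T + ridge * np.eye(p)  # symmetric positive definite
    # W A = S_y S_x^T  is equivalent to  A W^T = S_x S_y^T, since A = A^T.
    # Solving the linear system avoids forming the inverse explicitly.
    return np.linalg.solve(A, S_x @ S_y.T).T


rng = np.random.default_rng(0)
p, q = 5, 50
S_x = rng.standard_normal((p, q))
W_true = np.eye(p) + 0.1 * rng.standard_normal((p, p))
S_y = W_true @ S_x

gamma, beta = 0.15, 0.10
W_ridge = update_mapping(S_x, S_y, ridge=gamma / beta)

# With a vanishing ridge weight the true mapping is recovered.
W_ls = update_mapping(S_x, S_y, ridge=1e-10)
print(np.allclose(W_ls, W_true, atol=1e-6))
```

With a non-zero ridge weight the estimate is shrunk relative to the least-squares solution, which keeps the update well conditioned when the feature maps are very sparse.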

2.4 Synthesis

Once the optimization is completed, we obtain the learned filters \(\mathbf {F}^{x}\), \(\mathbf {F}^{y}\) and the mapping \(\mathbf {W}\). We then apply the proposed model to synthesize images across different modalities (i.e., LR \(\rightarrow \) HR and \(\mathcal {M}_{1} \rightarrow \mathcal {M}_{2}\), respectively). Given a test image \(\mathbf {X}^{t}\), we compute the sparse feature maps \(\mathbf {S}^{tx}\) related to \(\mathbf {F}^{x}\) by solving a single CSC problem like Eq. (1): \(\mathbf {S}^{tx}=\arg \min _{\mathbf {S}^{tx}} \frac{1}{2}\left\| \mathbf {X}^{t}-\sum _{k=1}^{K}\mathbf {F}_{k}^{x}*\mathbf {S}_{k}^{tx} \right\| _{2}^{2}+\lambda \sum _{k=1}^{K}\left\| \mathbf {S}_{k}^{tx} \right\| _{1}\). After that, we synthesize the target modality image of \(\mathbf {X}^{t}\) as the sum of the K target feature maps \(\mathbf {S}_{k}^{ty}=\mathbf {W}_{k}\mathbf {S}_{k}^{tx}\) convolved with \(\mathbf {F}_{k}^{y}\), i.e., \(\mathbf {Y}^{t}=\sum _{k=1}^{K}\mathbf {F}_{k}^{y}*\mathbf {S}_{k}^{ty}\).
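The final synthesis step can be sketched as follows, assuming that the test feature maps have already been computed by a CSC solver. The sketch is 2D for brevity, uses a direct "valid" convolution, and treats each \(\mathbf {W}_{k}\) as a matrix acting on the rows of the k-th feature map; the names `conv2d_valid` and `synthesize` are hypothetical.

```python
import numpy as np


def conv2d_valid(s, f):
    """Direct 'valid' 2D convolution of feature map s with filter f."""
    d1, d2 = f.shape
    m = s.shape[0] - d1 + 1
    n = s.shape[1] - d2 + 1
    f_flipped = f[::-1, ::-1]
    out = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            out[i, j] = np.sum(s[i:i + d1, j:j + d2] * f_flipped)
    return out


def synthesize(maps_x, W, filters_y):
    """Y^t = sum_k F_k^y * (W_k S_k^tx)."""
    maps_y = [W_k @ s for W_k, s in zip(W, maps_x)]  # map to target domain
    return sum(conv2d_valid(s, f) for s, f in zip(maps_y, filters_y))


rng = np.random.default_rng(0)
K, d, size = 3, 3, 10
maps_x = [rng.standard_normal((size, size)) for _ in range(K)]
W = [np.eye(size) for _ in range(K)]  # identity mapping, as at initialization
filters_y = [rng.standard_normal((d, d)) for _ in range(K)]

Y_t = synthesize(maps_x, W, filters_y)
print(Y_t.shape)  # (size - d + 1, size - d + 1)
```

No patch extraction or overlap averaging is involved: the output image is produced in one pass over whole feature maps, which is the property the method relies on to avoid fusion inconsistencies.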

3 Experimental Results

Experimental Setup: The proposed DOTE is validated on two datasets: IXI (including \(256 \times 256 \times p\) MR images of 578 healthy subjects) and NAMIC (involving \(128 \times 128 \times q\) images of 20 subjects). In our experiments, we perform 4-fold cross-validation for testing; that is, we select 144 subjects from IXI and 5 subjects from NAMIC, respectively, as our test data. Following [1], the regularization parameters \(\sigma \), \(\lambda \), \(\beta \), and \(\gamma \) are empirically set to 1, 0.05, 0.10 and 0.15, respectively. The number of filters is set to 800 according to [8]. Convergence towards a primal feasible solution is proved in [6] by first converting Eq. (2) into two optimization sub-problems that involve two proxies \(\mathbf {U}\), \(\mathbf {V}\) and then solving them alternately. DOTE converges after ca. 10 iterations. As evaluation criteria, we adopt the PSNR and SSIM indices to objectively assess the quality of our results.
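For reference, PSNR can be computed as in the minimal sketch below. The choice of the intensity range of the reference image as the peak value is an assumption, since the exact convention used in the experiments is not stated; SSIM is omitted because it requires a windowed implementation.

```python
import numpy as np


def psnr(reference, estimate, data_range=None):
    """Peak signal-to-noise ratio in dB between a reference and an estimate."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    if data_range is None:
        # Assumed convention: peak value = intensity range of the reference.
        data_range = reference.max() - reference.min()
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)


# Toy volume with intensities in [0, 1] and a constant error of 0.1.
reference = np.zeros((4, 4, 4))
reference[1:3, 1:3, 1:3] = 1.0
estimate = reference + 0.1

print(psnr(reference, estimate))  # about 20 dB: 10 * log10(1 / 0.01)
```

Higher values indicate a closer match to the ground truth, so PSNR is best read together with SSIM, which is more sensitive to structural differences.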

MRI Super-Resolution. As introduced in Sect. 1, we first address image SR as one instance of cross-modality image synthesis. In this scenario, we use the T2-w images of the IXI dataset to evaluate DOTE and compare it with ScSR [1], A+ [2], NLSR [4], Zeyde [5], ANR [9], and CSC-SR [8]. LR images are generated by down-sampling the HR ground-truth images using bicubic interpolation. We perform image SR with scaling factor 2, and show visual results in Fig. 2. The quantitative results are reported in Fig. 3, while the average PSNRs and SSIMs for all 144 test subjects are shown in Table 1. The proposed model achieves the best PSNRs and SSIMs. Moreover, to validate our argument that the DL-based self-optimization strategy is beneficial and requires less training data, we compare \(\text {DOTE}_{\text {nodual}}\) (DOTE with the dual mapping term removed) and DOTE under different training data sizes (i.e., \(\frac{1}{4},\frac{1}{2},\frac{3}{4}\) of the original dataset). The results are listed in Table 2. From Table 2, we see that DOTE is always better than \(\text {DOTE}_{\text {nodual}}\), especially with few training samples.

Fig. 2. Example SR results and the corresponding PSNRs and SSIMs.

Fig. 3. Error measures of SR results on the IXI dataset.

Fig. 4. Visual comparison of synthesized results using MIMECS and DOTE.

Table 1. Quantitative evaluation: DOTE vs. other SR methods.

Table 2. Quantitative evaluation: DOTE vs. \(\text {DOTE}_{\text {nodual}}\).

Fig. 5. CMS results: DOTE vs. MIMECS on the IXI dataset.

Cross-Modality Synthesis. For the problem of CMS, we evaluate DOTE and the relevant algorithms on both datasets in four groups of experiments: (1) synthesizing T2-w images from PD-w acquisitions and (2) vice versa; (3) generating T1-w images from T2-w inputs, and (4) vice versa. We conduct experiments (1–2) on the IXI dataset, while (3–4) are explored on the NAMIC dataset. Representative state-of-the-art CMS methods, including Vemulapalli's method [3] and MIMECS [10], are compared with our DOTE approach. We present visual and quantitative results in Figs. 4, 5 and Table 3, respectively. Our algorithm yields better results than MIMECS and Vemulapalli's method on both datasets, supporting our claim that the expanded dual optimization leads to better synthesis.

Table 3. CMS results: DOTE vs. other synthesis methods on the NAMIC dataset.

4 Conclusion

We presented a dual convolutional filter learning (DOTE) method which directly decomposes the whole image based on CSC, such that local neighbors are preserved consistently. The proposed dual mapping functions, integrated with the joint learning model, form a closed loop that leverages the training data more efficiently and keeps a very stable mapping between image modalities. We applied DOTE to both image SR and CMS problems. Experimental results showed that our method outperforms other state-of-the-art approaches. Future work could concentrate on extending DOTE to higher-order imaging modalities like diffusion tensor MRI and to other modalities beyond MRI.