1 Introduction

Magnetic Resonance (MR) Imaging is widely adopted in many diagnostic applications due to its superior soft-tissue contrast, non-invasiveness and excellent spatial resolution. However, MRI is associated with long scan durations, as the data is read out sequentially in k-space and the speed at which k-space can be traversed is limited by the underlying imaging physics. This in turn limits the clinical use of MRI, causes inconvenience to patients, and renders this modality expensive and less accessible. One potential approach to accelerating MRI acquisition is to undersample k-space, i.e. to reduce the number of k-space traversals made during acquisition. However, such undersampling violates the Nyquist-Shannon sampling theorem [7] and generates aliasing artefacts upon reconstruction. A learning-based reconstruction algorithm should effectively compensate for the missing k-space samples by leveraging a priori knowledge of the anatomy at hand and the undersampling pattern.

Fig. 1.

(i) Complex fully convolutional neural network architecture. (ii) Complex dense block, composed of 3 complex conv2D layers, each followed by complex batch normalization and ReLU. (iii) Complex conv2D layer, responsible for performing the complex convolution operation; here \(\mathbf {a}\) and \(\mathbf {b}\) represent the real and imaginary feature maps, and \({W_R}\) and \({W_I}\) represent the real and imaginary parts of the learnable weights.

Deep learning is being increasingly adopted for MR reconstruction. Instead of using handcrafted features, Hammernik et al. [4] learned a set of regularizers under a variational framework for the reconstruction of accelerated MRI data. Kwon et al. [5] used a multilayer perceptron for accelerated parallel MRI. These works were further extended using techniques such as deep residual learning [6], domain adaptation [13], data consistency layers [10], and manifold approximation (AUTOMAP) [14], to name a few. However, all of the above-mentioned reconstruction methods employ real-valued convolution operations in the spatial domain by treating the real and imaginary parts as two independent components. It should be noted that unlike multi-channel images (such as RGB images), where the individual channels are acquired independently, MR data is inherently complex-valued in nature. Quadrature detection is employed to measure the changing circularly polarized magnetic field within the scanner, resulting in two simultaneously acquired data streams with a \(\pi /2\) phase difference. Upon digitization, these signals constitute the real and imaginary parts of each complex data point in k-space. The magnitude derived from this complex-valued data mainly carries information about proton density as well as the relaxation properties of the tissue. The phase can be used to obtain information about, for example, magnetic susceptibility, flow, or temperature. To faithfully recover the complete k-space, it is important to learn the correlation between these two data streams.

In this paper, for the first time, we explore end-to-end learning with complex-valued data targeted at MR reconstruction. Towards this, we propose the Complex Dense Fully Convolutional Network (\(\mathbb {C}\)DFNet), which introduces densely connected fully convolutional blocks built from layers that support deep learning operations on complex-valued data. Complex-valued arithmetic operators for deep learning were proposed by Trabelsi et al. [11], who explored complex counterparts of convolution, batch normalization, network initialization, etc. We also propose a composite loss function that simultaneously minimizes the reconstruction error and improves structural similarity.

2 Methodology

2.1 Problem Formulation

Let the fully-sampled complex-valued MR image be represented as \({\mathbf {x}_{f}\in \mathbb {C}^N }\), consisting of \({\sqrt{N}\times \sqrt{N}}\) pixels arranged as a column vector, with each pixel a complex value with real and imaginary components. This image is reconstructed from fully-sampled measurements in k-space, say \({\mathbf {y}_{f}\in \mathbb {C}^N }\), such that \(\mathbf {y}_{f} = \mathbf {F}_{f}\mathbf {x}_{f}\), where \(\mathbf {F}_{f} \in \mathbb {C}^{N \times N}\) is the fully-sampled encoding matrix. During under-sampling, we acquire measurements in k-space, say \(\mathbf {y}_{u} \in \mathbb {C}^{M}\), where \(M \ll N\). Let the image reconstructed by zero-filling \(\mathbf {y}_{u}\) be represented as \(\mathbf {x}_{u}\), such that \(\mathbf {x}_{u} = \mathbf {F}_{u}^{-1}\mathbf {y}_{u}\). Reconstructing \(\mathbf {x}_{f}\) directly from \(\mathbf {y}_{u}\) is ill-posed, and direct inversion is not possible due to the under-determined nature of the system of equations. In our approach, we approximate \(\mathbf {x}_{f}\) using a complex fully convolutional neural network (represented as \(f_{\mathbb {C}}\)). As \(\mathbf {x}_{u}\) is highly aliased due to sub-Nyquist sampling, \(f_{\mathbb {C}}\) aims at recovering an image \(\mathbf {x}_{r}\) that is as close as possible to the ideal fully-sampled image \(\mathbf {x}_{f}\).
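To make the forward model concrete, the following is a minimal NumPy sketch of simulating under-sampling and zero-filled reconstruction for a single 2D slice. The centred FFT convention and the binary-mask encoding of \(\varOmega \) are our own assumptions; an actual acquisition pipeline would fix these explicitly.

```python
import numpy as np

def zero_filled_recon(x_f: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Simulate y_u = mask * (F x_f) and return x_u = F^{-1} y_u (zero-filled).

    x_f:  complex-valued image of shape (H, W)
    mask: binary k-space mask of shape (H, W); 1 at sampled locations (Omega)
    """
    y_f = np.fft.fftshift(np.fft.fft2(x_f))        # fully-sampled k-space y_f
    y_u = y_f * mask                               # keep only sampled locations
    x_u = np.fft.ifft2(np.fft.ifftshift(y_u))      # aliased, zero-filled image x_u
    return x_u
```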

2.2 Network Architecture

Complex Dense Block: The densely connected block proposed in [2] introduces feed-forward connections from each layer to every other layer (illustrated in Fig. 1(ii)). This architectural choice was demonstrated to encourage feature reuse and strengthen information propagation through the network. We suitably adapt this block for complex-valued data by proposing counterparts of classic deep learning layers such as convolution, batch normalization, non-linearity (ReLU), up-sampling, etc. For the sake of brevity, we discuss only the complex convolution (denoted as \(*_{\mathbb {C}}\)) in detail. Let \(\mathbf {h} = \mathbf {a} + i\mathbf {b}\) be the complex-valued input to a convolution layer with weights \(\mathbf {W} = \mathbf {W_R} + i \mathbf {W_I}\); the complex convolution between \(\mathbf {h}\) and \(\mathbf {W}\) is simulated using real-valued arithmetic as \(\mathbf {W} *_{\mathbb {C}} \mathbf {h} = \left( \mathbf {a} * \mathbf {W_R} - \mathbf {b} * \mathbf {W_I} \right) + i\left( \mathbf {a} * \mathbf {W_I} + \mathbf {b} * \mathbf {W_R} \right) \), as shown in Fig. 1(iii). The complex output feature maps are fed into the complex batch normalization layer, which normalizes the data to have equal variance along the real and imaginary components, thereby preserving the relationship between them. The complex variants of the ReLU non-linearity and max-pooling are applied to the real and imaginary channels separately.
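As an illustration, the complex convolution above can be realized with four real-valued convolutions. The following PyTorch sketch (the class name and default hyper-parameters are our own, illustrative choices) carries the real and imaginary parts as two separate tensors; biases are disabled so that the arithmetic matches the formula exactly.

```python
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    """Complex convolution W *_C h = (a*W_R - b*W_I) + i(a*W_I + b*W_R),
    implemented with two real conv layers holding W_R and W_I."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3, padding: int = 1):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding, bias=False)  # W_R
        self.conv_i = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding, bias=False)  # W_I

    def forward(self, a: torch.Tensor, b: torch.Tensor):
        real = self.conv_r(a) - self.conv_i(b)   # a * W_R - b * W_I
        imag = self.conv_i(a) + self.conv_r(b)   # a * W_I + b * W_R
        return real, imag
```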

Complex Dense Fully Convolutional Network (\(\mathbb {C}\)DFNet): The \(\mathbb {C}\)DFNet \(f_{\mathbb {C}}\) is based on the DenseNet [2] architecture, comprising a sequence of four densely-connected complex encoder blocks with corresponding densely-connected complex decoder blocks, separated by a bottleneck layer (illustrated in Fig. 1(i)). The output of the last decoder block is passed to a reconstruction layer (with complex convolution operators) to reconstruct the image. The encoders and decoders are stacked and trained in a progressive fashion, i.e. the output of one block serves as input to the next. Skip connections are included between each encoder and its corresponding decoder block to fuse high-level representations (decoder) with low-level features (encoder), thereby preserving contextual information. Furthermore, skip connections mitigate the vanishing gradient problem by directly propagating gradients from a decoder block to its respective encoder block. The network \(f_{\mathbb {C}}\) takes the complex-valued aliased image \(\mathbf {x}_{u}\) (generated by zero-filling the under-sampled k-space data \(\mathbf {y}_{u}\)) as input and produces an intermediate reconstructed image \(\widetilde{\mathbf {x}}_{r}\), which is fed into the data consistency layer for imputing missing k-space values.
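The following is a compact skeleton of this encoder-decoder layout, reusing the ComplexConv2d sketch above as a stand-in for full complex dense blocks; the number of levels, channel widths, and the use of nearest-neighbour upsampling are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CDFNetSketch(nn.Module):
    """Two-level encoder-decoder with skip connections (cf. Fig. 1(i)),
    built from the ComplexConv2d module defined in the previous sketch."""
    def __init__(self, ch: int = 32):
        super().__init__()
        self.enc1 = ComplexConv2d(1, ch)
        self.enc2 = ComplexConv2d(ch, ch)
        self.bottleneck = ComplexConv2d(ch, ch)
        self.dec2 = ComplexConv2d(2 * ch, ch)   # 2*ch: skip features concatenated
        self.dec1 = ComplexConv2d(2 * ch, ch)
        self.recon = ComplexConv2d(ch, 1)       # reconstruction layer

    def forward(self, a, b):
        e1 = self.enc1(a, b)
        e2 = self.enc2(F.max_pool2d(e1[0], 2), F.max_pool2d(e1[1], 2))
        bt = self.bottleneck(F.max_pool2d(e2[0], 2), F.max_pool2d(e2[1], 2))
        # decoder path: upsample, then fuse with encoder features (skip connections)
        u2 = [F.interpolate(t, scale_factor=2) for t in bt]
        d2 = self.dec2(torch.cat([u2[0], e2[0]], 1), torch.cat([u2[1], e2[1]], 1))
        u1 = [F.interpolate(t, scale_factor=2) for t in d2]
        d1 = self.dec1(torch.cat([u1[0], e1[0]], 1), torch.cat([u1[1], e1[1]], 1))
        return self.recon(*d1)                  # intermediate reconstruction x~_r
```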

Data Consistency Layer (DCL): We recover a fully reconstructed k-space spectrum \(\widetilde{\mathbf {y}}_{r}\) via a Fourier transform of the reconstructed image \(\widetilde{\mathbf {x}}_{r}\). To retain all the a priori available k-space values \(\mathbf {y}_{u}\) (collected at the k-space locations denoted by the mask \(\varOmega \)) and impute only the missing values (at locations \(\not \in \varOmega \)), the data consistency layer performs the following operation:

$$\begin{aligned} \mathbf {y}_{r}\left( z \right) = {\left\{ \begin{array}{ll} \mathbf {y}_{u}\left( z \right) &{} z \in \varOmega \\ \widetilde{\mathbf {y}}_{r}\left( z \right) &{} z \not \in \varOmega \end{array}\right. } \end{aligned}$$
(1)

After the DCL, the final de-aliased image \(\mathbf {x}_{r}\) is recovered through an inverse Fourier transform of \(\mathbf {y}_{r}\). It must be noted that the inclusion of the DCL within \(f_{\mathbb {C}}\) improves the efficacy of the network by letting it focus exclusively on the missing k-space values while enforcing consistency with the a priori acquired data \(\mathbf {y}_{u}\). Further, the DCL has no learnable parameters and does not increase the complexity of the network.
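A minimal NumPy sketch of the DCL operation in Eq. (1), assuming \(\mathbf {y}_{u}\) and the binary mask follow the same centred FFT convention as the earlier zero-filling sketch:

```python
import numpy as np

def data_consistency(x_r_tilde: np.ndarray, y_u: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Impute missing k-space values (Eq. 1) and return the de-aliased image x_r."""
    y_r_tilde = np.fft.fftshift(np.fft.fft2(x_r_tilde))  # k-space of network output
    y_r = np.where(mask == 1, y_u, y_r_tilde)            # keep acquired samples on Omega
    return np.fft.ifft2(np.fft.ifftshift(y_r))           # final de-aliased image x_r
```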

2.3 Model Learning and Optimization

The network \(f_{\mathbb {C}}\) is optimized to recover the missing k-space data while simultaneously preserving fine-grained anatomical details. We adopt a supervised learning approach, wherein a training dataset \(\mathcal {D}\) of input-target (under-sampled and fully-sampled) pairs (\(\mathbf {x}_{u},\mathbf {x}_{f}\)) is used to train \(f_{\mathbb {C}}\). We use a composite loss function comprising two terms: a mean-squared error term (\({\mathcal {L}_{L_{2}}}\)) and a Structural Similarity Index Measure (SSIM) term (\(\mathcal {L_{\text {SSIM}}}\)), as discussed below:

\(\mathcal {L}_{L_{2}}\) Loss: This loss minimizes the difference between the reconstructed image \(\mathbf {x}_{r}\) and the target fully-sampled image \(\mathbf {x}_{f}\):

$$\begin{aligned} \mathcal {L}_{L_{2}} = \sum _{\left( \mathbf {x}_{u},\mathbf {x}_{f} \right) \in \mathcal {D}} \left\| \mathbf {x}_{f} - \mathbf {x}_{r} \right\| ^{2}_{2} = \sum _{\left( \mathbf {x}_{u},\mathbf {x}_{f} \right) \in \mathcal {D}} \left\| \mathbf {x}_{f} - f_{\mathbb {C}}\left( \mathbf {x}_{u} | \theta \right) \right\| ^{2}_{2} \end{aligned}$$
(2)

The \(\mathcal {L}_{L_{2}}\) loss penalizes large errors but fails to capture the finer details to which the human visual system is sensitive, such as contrast, luminance and structure. To offset this shortcoming, we use SSIM [12], which is perceptually closer to the human visual system, as an additional loss \(\mathcal {L_{\text {SSIM}}}\), defined as:

$$\begin{aligned} \mathcal {L_{\text {SSIM}}} = \sum _{\left( \mathbf {x}_{u},\mathbf {x}_{f} \right) \in \mathcal {D}}\left( 1 - \mathcal {S}\left( \mathbf {x}_{r},\mathbf {x}_{f} \right) \right) \end{aligned}$$
(3)

where \(\mathcal {S}\left( \mathbf {x}_{r},\mathbf {x}_{f} \right) \) is the SSIM calculated between \(\mathbf {x}_{r}\) and \(\mathbf {x}_{f}\). The composite loss function \(\mathcal {L}\) for optimizing \(f_{\mathbb {C}}\) is defined as: \(\mathcal {L}\left( \mathbf {x} , f_{\mathbb {C}}\left( \mathbf {x}_{u} | \theta \right) \right) = \mathcal {L}_{L_{2}} + \lambda \mathcal {L_{\text {SSIM}}}\), where \({\lambda }\) is a scaling constant.
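A sketch of this composite loss in PyTorch; for simplicity both terms are computed here on magnitude images of shape (N, 1, H, W) with intensities normalized to [0, 1]. The differentiable SSIM comes from the third-party pytorch-msssim package, and the value of \(\lambda \) shown is an illustrative assumption (the paper does not report it).

```python
import torch
from pytorch_msssim import ssim  # third-party differentiable SSIM implementation

def composite_loss(x_r: torch.Tensor, x_f: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    """L = L_L2 + lambda * L_SSIM over a batch of magnitude images.
    lam = 0.1 is an illustrative choice of the scaling constant lambda."""
    l2 = torch.sum((x_f - x_r) ** 2)                 # Eq. (2)
    l_ssim = 1.0 - ssim(x_r, x_f, data_range=1.0)    # Eq. (3)
    return l2 + lam * l_ssim
```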

Fig. 2.

Edge-map comparison at an undersampling factor of 4\(\times \). (a), (e) ground truth and its edge-map; (b), (f) undersampled image and its edge-map; (c), (g) DLMRI reconstruction and its edge-map; (d), (h) proposed reconstruction and its edge-map. Here, green represents edges present in the ground truth, red represents edges missing in the reconstructed image compared to the ground truth, and blue represents edges not present in the ground truth but only in the reconstructed images. (Color figure online)

3 Results and Discussion

3.1 Experimental Settings and Evaluation

Dataset. Our experiments were evaluated on the 20 fully-sampled knee k-space datasets publicly available from mridata.org [9]. The data was split randomly into 16 patients for training and the rest for testing. The coil data were fused via a sum-of-squares combination into a single complete k-space dataset, and training data for this proof-of-concept was generated using the Cartesian under-sampling proposed in [10], wherein the eight lowest spatial frequencies were preserved and a zero-mean Gaussian distribution was used to determine the sampling probability along the phase-encoding direction (the frequency-encoding direction was fully sampled).
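A NumPy sketch of this mask generation; the Gaussian width and the exact normalization of the sampling probabilities are our assumptions, as [10] leaves room for implementation choices.

```python
import numpy as np

def cartesian_mask(n_pe, n_fe, accel=4, n_center=8, sigma_frac=0.15, seed=None):
    """Binary Cartesian mask: keep the n_center lowest phase-encode lines and
    draw the remaining lines with a zero-mean Gaussian probability profile;
    the frequency-encoding direction (n_fe) stays fully sampled."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(n_pe, dtype=bool)
    center = n_pe // 2
    mask[center - n_center // 2 : center + n_center // 2] = True   # low frequencies
    k = np.arange(n_pe) - center
    prob = np.exp(-0.5 * (k / (sigma_frac * n_pe)) ** 2)           # Gaussian density
    prob[mask] = 0.0                                               # already sampled
    n_extra = max(n_pe // accel - n_center, 0)
    extra = rng.choice(n_pe, size=n_extra, replace=False, p=prob / prob.sum())
    mask[extra] = True
    return np.repeat(mask[:, None], n_fe, axis=1)                  # broadcast along readout
```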

Baselines and Comparative Methods: To ablatively test the introduction of complex convolutions, we compare with a naïve variant of densely connected networks that treats the complex-valued input as two independent channels (termed BL1). We further assess the contribution of the data consistency layer by defining a variant without the DCL (termed BL2). Finally, to evaluate the contribution of training with \(\mathcal {L_{\text {SSIM}}}\), we set the corresponding factor \(\lambda \) to 0 and contrast with the proposed method (termed BL3). Further, we compare against a state-of-the-art dictionary learning based MR reconstruction method proposed in [8] (termed DLMRI). It must be noted that BL1 is akin to the deep learning based reconstruction method proposed in [3], differing only in the usage of densely-connected blocks. In all the aforementioned network configurations, we used complex convolution operators (except in BL1) with a depth of 32 and a kernel size of \(3 \times 3\); BL1 was designed with a depth of 46 for a fair comparison. Parameters were chosen such that model complexity remained similar across all baselines. The networks were trained until convergence using the RMSProp optimizer with a learning rate of \(5 \times 10^{-5}\), a decay of 0.9 and a batch size of 5 for 50 epochs.
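For concreteness, the reported optimizer settings map onto PyTorch roughly as follows; interpreting the reported decay of 0.9 as RMSprop's smoothing constant alpha is our assumption (it could also denote a learning-rate decay schedule), and the model below is a trivial stand-in for the actual \(\mathbb {C}\)DFNet.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(2, 2, kernel_size=3, padding=1)  # hypothetical stand-in for CDFNet

# RMSProp with lr 5e-5 and decay 0.9 (mapped to alpha here); batch size 5, 50 epochs
optimizer = torch.optim.RMSprop(model.parameters(), lr=5e-5, alpha=0.9)
```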

The networks were evaluated at two acceleration factors, 4\(\times \) and 6\(\times \), along the phase-encoding direction. During training of the deep networks, the under-sampling masks were generated on-the-fly to induce tolerance towards a range of potential aliasing artefacts. We further used image-level rigid and elastic transformations to augment the training data. As demonstrated in [10], the fidelity of image reconstruction is evaluated by measuring the similarity between the reconstructed image and the fully-sampled ground truth image using metrics such as SSIM and mean squared error (MSE). However, these metrics do not explicitly focus on the finer details of the reconstruction, and towards this we employ Pratt's figure of merit (Pratt's FOM) [1] as an additional metric. Pratt's FOM focuses exclusively on the edges and corner points in the reconstructed image that are concurrent with structures present in the ground truth image, while simultaneously penalizing both missing and artificially hallucinated edges.
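As a reference, a common formulation of Pratt's FOM over binary edge maps is \(\mathrm{FOM} = \frac{1}{\max (N_{gt}, N_{rec})}\sum _{i=1}^{N_{rec}} \frac{1}{1 + \alpha d_i^2}\), where \(d_i\) is the distance from the \(i\)-th reconstructed edge pixel to the nearest ground-truth edge and \(\alpha \) is customarily set to 1/9. The sketch below follows this common formulation, which we assume matches the variant used in [1].

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def pratt_fom(edges_gt: np.ndarray, edges_rec: np.ndarray, alpha: float = 1.0 / 9.0) -> float:
    """Pratt's figure of merit between two binary edge maps."""
    gt = edges_gt.astype(bool)
    rec = edges_rec.astype(bool)
    d = distance_transform_edt(~gt)                    # distance to nearest GT edge
    score = np.sum(1.0 / (1.0 + alpha * d[rec] ** 2))  # reward nearby detected edges
    return score / max(gt.sum(), rec.sum())            # penalize missing/spurious edges
```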

3.2 Results

The networks trained at the 4\(\times \) and 6\(\times \) acceleration factors were tested both within and across these factors, resulting in four train-test combinations. All methods were evaluated for each of these combinations to quantify their generalizability to unseen aliasing effects.

Table 1. Pratt’s Figure of Merit of comparative analysis against baselines
Table 2. Quantitative comparison from Cartesian trajectory with undersampling factor of 4\(\times \) and 6\(\times \)
Fig. 3.

Reconstruction results at a 4\(\times \) acceleration factor. (a), (d) undersampled image and its error map; (b), (e) DLMRI reconstruction and its error map; (c), (f) proposed reconstruction and its error map, alongside the ground truth.

Qualitative Analysis: Figure 2 illustrates the comparative results on the recovery of fine-grained details using edge-maps extracted from the under-sampled image (Fig. 2(b, f)), the DLMRI reconstruction (Fig. 2(c, g)) and the proposed method (Fig. 2(d, h)). We observe that the proposed network demonstrates maximal consistency of finer details with respect to the ground truth. Figure 3 highlights the differences with respect to the ground truth through difference maps, focusing in particular on the reconstruction of fine details in the region between the tibia and the femur and in the synovial membrane.

Ablative Testing: To ablatively evaluate the contributions of this work, the proposed method was contrasted against the baselines (discussed in Sect. 3.1), and the observations are tabulated in Table 1. For the sake of brevity, we present only the Pratt's FOM metric in this table. Contrasting the proposed method against BL1 in Table 1, we observe a consistent improvement in reconstruction quality due to the introduction of complex dense blocks in place of vanilla dense blocks. This is particularly evident in the case of aggressive under-sampling (6\(\times \)), where the proposed method outperformed BL1 by a significant margin of 5.7%. Comparing BL2 with the proposed method, the inclusion of the data consistency layer proved highly significant, as evidenced across all validation combinations with an average improvement of over 6%. The use of SSIM as an additional loss function during optimization (comparing BL3 with the proposed method) also consistently improves Pratt's FOM across all test cases.

Comparative Methods: In Table 2, we compare the proposed method against the under-sampled input image (\(\mathbf {x}_{u}\)) and the state-of-the-art compressed sensing approach DLMRI in terms of the evaluation metrics SSIM, MSE and Pratt's FOM. We observe a consistent improvement across all metrics in comparison to DLMRI, with the proposed method recovering finer details significantly better (over 11% improvement in Pratt's FOM). When testing at the aggressive acceleration factor (6\(\times \)), which approaches the limit of sparsity-based methods, we observe that \(\mathbb {C}\)DFNet recovers anatomical details better, as it is trained end-to-end, allowing it to efficiently learn anatomical priors from the training data.

4 Conclusion and Future Work

We have presented a deep learning based MR image reconstruction method wherein real-valued neural network operations are replaced by complex-valued counterparts. We demonstrated that the proposed network architecture outperforms both a state-of-the-art method and its real-valued counterparts by significant margins in terms of recovering fine structures and high-frequency textures. The experiments also show that the proposed method is robust towards the undersampling ratio, which eliminates the need to train a separate large network for each acquisition setting. Finally, Pratt's figure of merit was adopted for evaluation to account for the overall perceptual quality of the reconstructed image. As k-space is inherently complex-valued, we believe this method can be adapted to learn both the domain transformation and the reconstruction. Moreover, non-Cartesian trajectories, which possess different aliasing properties, could be investigated; further validation is warranted to determine the flexibility of our method in this regard.