1 Introduction

Cardiac magnetic resonance imaging (MRI) provides qualitative and quantitative information on the morphology and function of the heart, which is crucial for assessing cardiovascular diseases. Both cardiac MR image segmentation and motion estimation are essential steps for the dynamic exploration of cardiac function. However, one limitation of cardiovascular MR is its low acquisition speed, due to both hardware and physiological constraints. Most approaches undersample the data in k-space and then reconstruct the images [7, 9]. Nevertheless, in most cases perfect reconstructions are not necessary, as long as the images allow clinically relevant parameters, such as changes in ventricular volumes and the elasticity and contractility of the myocardium, to be obtained accurately. Therefore, instead of first recovering non-aliased images, it may be more effective to estimate the final results directly from undersampled MR data and to make such estimates as accurate as possible.

In this paper, we propose a joint deep learning network for cardiac motion estimation and segmentation directly from undersampled cardiac MR data, bypassing the MR reconstruction process. In particular, we extend the joint model proposed in [6], which consists of an unsupervised cardiac motion estimation branch and a weakly-supervised segmentation branch, where the two tasks share the same feature encoder. We investigate the network’s capability to predict motion fields and segmentation maps simultaneously and directly from undersampled cardiac MR data. The problem is formulated by incorporating supervision from fully-sampled MR image pairs in addition to the composite loss function proposed in [6]. Simulation experiments were performed on 220 subjects under different acceleration factors with radial undersampling patterns. The experiments indicate that results learned directly from undersampled data are reasonably accurate and close to predictions from fully-sampled data. This could potentially lead to future work that enables fast and accurate analysis in an integrated MRI reconstruction and analysis pipeline.

1.1 Related Work

Cardiac segmentation and motion estimation are well-studied problems in medical imaging. Traditionally, most approaches consider these two tasks separately [1, 11, 12]. However, segmentation and motion estimation are known to be closely related, and optimising the two tasks jointly has been shown to improve performance on both. Recently, Oksuz et al. [5] proposed a joint optimisation scheme for registration and segmentation using dictionary learning based descriptors, which enables better performance for both of these ill-posed processes. Qin et al. [6] proposed a unified deep learning model for both cardiac motion estimation and segmentation, where no motion ground truth is required and only temporally sparse annotated frames in a cardiac cycle are needed.

However, only a limited number of works focus on obtaining segmentation maps and motion fields directly from undersampled MR data. One research direction is application-driven MRI [2], where an integrated acquisition-reconstruction-segmentation process was adopted to provide a more efficient and accurate solution. Schlemper et al. [10] expanded on the idea of application-driven MRI and presented an end-to-end synthesis network and a latent feature interpolation network to predict segmentation maps from extremely undersampled dynamic MR data. Our work focuses on the scenario where motion fields and segmentation maps can be jointly predicted directly from undersampled MR data, bypassing the usual MR image reconstruction stage.

2 Methods

Our goal is to predict motion fields and segmentation maps simultaneously and directly from undersampled cardiac MR images, and to make these predictions as accurate and efficient as possible. Here we extend the unified model (Motion-Seg Net) proposed in [6] and adapt it to undersampled MR data. The proposed network architecture consists of two branches which perform motion estimation and segmentation jointly, and a well-trained sub-network for fully-sampled images is incorporated to provide additional supervision during the training process. Note that at the test stage, only the undersampled sub-network is needed, and no fully-sampled data is required. The overall architecture of the model is shown in Fig. 1.

Fig. 1. The overall schematic architecture of the proposed network for joint estimation of cardiac motion and segmentation directly from undersampled data. (a) (b) The Motion-Seg net adopted from [6]. (c) Proposed architecture for training the Motion-Seg net on undersampled data. US: undersampled, FS: fully-sampled.

2.1 Unsupervised Cardiac Motion Estimation from Undersampled MR Image

Inspired by the success of the joint prediction network proposed in [6], which effectively learns useful representations, here we propose to adapt the network to undersampled MR data. In the fully-sampled case, self-supervision alone is sufficient for motion estimation. For undersampled images, however, it is difficult to rely merely on self-supervision, i.e., the intensity difference, because of the noise caused by aliasing patterns. To address this, we propose to incorporate the corresponding fully-sampled image pairs as additional supervision to guide the training for the undersampled images. A schematic illustration of the model is shown in Fig. 1(a) and (c).

The task is to find an optical flow representation between the target undersampled frame \(I_{t}^{US}\) and the source undersampled frame \(I_{t+k}^{US}\), where the output is a pixel-wise 2D motion field \(\varDelta ^{US}\) representing the displacement in x and y directions. We exploit a modified version of the network proposed in [6] for the representation learning, which mainly consists of three components: a Siamese network for the feature extraction of both target frame and source frame, where the encoder is adapted from VGG-16 Net; a multi-scale concatenation of features from pairs of frames, motivated by the traditional multi-level registration method [8]; and a bilinear interpolation sampler that warps the source frame to the target one by using the estimated displacement field \({\varDelta }^{US} = ({\varDelta }^{US} x, {\varDelta }^{US} y; \theta _{\varDelta }^{US})\), where the network is parameterised by \(\theta _{\varDelta }^{US}\), which is learned directly from undersampled MR data. Note that an RNN unit could potentially be incorporated to propagate motion information along the temporal dimension [6]; we leave this for future work.
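The bilinear interpolation sampler can be illustrated with a minimal NumPy sketch. It is not the implementation used in [6]: the function name `warp_bilinear` and the choice to clamp out-of-range coordinates to the image border are assumptions made for illustration only.

```python
import numpy as np


def warp_bilinear(image, flow_x, flow_y):
    """Sample `image` at (x + flow_x, y + flow_y) with bilinear interpolation.

    image, flow_x, flow_y: 2D arrays of identical shape (H, W).
    Coordinates falling outside the image are clamped to the border
    (an assumption; the source does not state the boundary handling).
    """
    h, w = image.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")

    # Displaced sampling positions, clamped to the valid range.
    x = np.clip(xs + flow_x, 0, w - 1)
    y = np.clip(ys + flow_y, 0, h - 1)

    # Integer corners surrounding each sampling position.
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)

    # Fractional offsets act as interpolation weights.
    wx = x - x0
    wy = y - y0

    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom
```

Because the output is a weighted sum of neighbouring pixels with weights that depend smoothly on the displacement, the same operation is differentiable with respect to the motion field when written in an automatic differentiation framework, which is what allows the displacement network to be trained through the sampler.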

Due to the severe aliasing patterns in the undersampled MR images, it is not practical to train the spatial transformer network purely by minimising the intensity difference between the transformed undersampled frame and the target undersampled frame. To address this, we propose to introduce the fully-sampled image pairs as supervision for the training. Specifically, instead of warping the undersampled source image, here we propose to transform the corresponding fully-sampled source image, which can be expressed as \(I_{t+k}^{'FS}(x, y) = \varGamma \{I_{t+k}^{FS}(x+\varDelta _{t+k}^{US}x, y+\varDelta _{t+k}^{US}y)\}\). The network can then be trained by optimising the pixel-wise mean squared error between \(I_{t}^{FS}\) and \(I_{t+k}^{'FS}\). To ensure local smoothness, we retain the regularisation term on the gradients of the displacement fields, which uses an approximation of the Huber loss proposed in [3, 6], namely \(\mathcal {H}(\delta _{x, y}\varDelta ^{US}) = \sqrt{\epsilon +\sum _{i=x, y}(\delta _{x}\varDelta ^{US} i^2+\delta _{y}\varDelta ^{US} i^2)}\), where \(\epsilon =0.01\). Therefore, the loss function can be described as follows:

$$\begin{aligned} \mathcal {L}_m = \frac{1}{N_s}\sum _{\left( I_t,I_{t+k} \right) \in S}\big [\Vert I_t^{FS}-I_{t+k}^{'FS}\Vert ^2+\alpha \mathcal {H}(\delta _{x, y}\varDelta _{t+k}^{US})\big ], \end{aligned}$$
(1)

where \(N_s\) stands for the number of sample pairs in the training set S, and \(\alpha \) is a regularisation parameter to trade off between image dissimilarity and local smoothness.
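For a single image pair, Eq. 1 can be sketched in NumPy as follows. This is an illustrative reading, not the authors' code: the Huber term is interpreted as the square root of \(\epsilon \) plus the squared x and y derivatives of both displacement components, the derivatives are approximated with `np.gradient`, and both terms are averaged over pixels. These three choices, and the function names, are assumptions.

```python
import numpy as np


def huber_smoothness(flow_x, flow_y, eps=0.01):
    """Approximate Huber penalty on the spatial gradients of a 2D flow field.

    Assumed reading of H: sqrt(eps + sum over components i in {x, y} of
    (d/dx flow_i)^2 + (d/dy flow_i)^2), averaged over all pixels.
    """
    squared = np.zeros(flow_x.shape, dtype=float)
    for component in (flow_x, flow_y):
        # np.gradient returns derivatives along axis 0 (y) then axis 1 (x).
        d_dy, d_dx = np.gradient(np.asarray(component, dtype=float))
        squared += d_dx ** 2 + d_dy ** 2
    return np.sqrt(eps + squared).mean()


def motion_loss_eq1(target_fs, warped_source_fs, flow_x, flow_y, alpha=0.001):
    """Image dissimilarity on fully-sampled frames plus flow smoothness.

    target_fs:        fully-sampled target frame I_t^FS
    warped_source_fs: fully-sampled source frame warped by the flow
                      predicted from the undersampled pair
    flow_x, flow_y:   displacement components predicted from undersampled data
    alpha:            trade-off weight (0.001 is the value reported in Sect. 3)
    """
    mse = np.mean((target_fs - warped_source_fs) ** 2)
    return mse + alpha * huber_smoothness(flow_x, flow_y)
```

Note that the undersampled images enter this loss only through the predicted flow: the intensity comparison itself is carried out entirely on fully-sampled frames.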

However, we observed that for heavily undersampled images, the weak supervision in Eq. 1 is not sufficient. Therefore, in order to push the results learned from undersampled data to be as accurate as those from fully-sampled data, we additionally introduce a pixel-wise mean squared error loss on the displacement fields between the estimate from undersampled data (\(\varDelta _{t+k}^{US}\)) and that from fully-sampled data (\(\varDelta _{t+k}^{FS}\)). Since only the motion of anatomical structures is of interest, we propose to mask the region of interest (ROI) using the segmentation maps predicted from fully-sampled data, so that only errors within the ROI are backpropagated and contribute to the learning. The proposed loss term can be expressed as \(\mathcal {L}_{\varDelta _{t+k}}=\Vert (\varDelta _{t+k}^{US}-\varDelta _{t+k}^{FS}) * \mathbf{{M}}_t\Vert ^2\), where \(\mathbf{{M}}_t\) is a binary mask (1 for ROI, and 0 for background) generated from the segmentation maps of frame t of the fully-sampled images. Thus, the overall loss function for motion estimation is as follows:

$$\begin{aligned} \mathcal {L}_m = \frac{1}{N_s}\sum _{\left( I_t,I_{t+k} \right) \in S}\big [\Vert I_t^{FS}-I_{t+k}^{'FS}\Vert ^2+\alpha \mathcal {H}(\delta _{x, y}\varDelta _{t+k}^{US}) + \beta \Vert (\varDelta _{t+k}^{US}-\varDelta _{t+k}^{FS})* \mathbf{{M}}_t\Vert ^2 \big ], \end{aligned}$$
(2)

in which an additional trade-off parameter \(\beta \) is introduced. Note that no ground truth displacement fields are required during training, so the motion is still estimated in an unsupervised manner.
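The additional ROI-masked term in Eq. 2 can be sketched as below. This is a minimal illustration under assumptions: the mask is taken to be every non-background label of the fully-sampled segmentation, the error is averaged over all elements, and the names `roi_mask_from_labels` and `masked_flow_loss` are hypothetical.

```python
import numpy as np


def roi_mask_from_labels(seg_labels):
    """Binary ROI mask: 1 wherever the segmentation is not background (label 0)."""
    return (np.asarray(seg_labels) > 0).astype(float)


def masked_flow_loss(flow_us, flow_fs, roi_mask):
    """Mean squared difference between two flow fields, restricted to the ROI.

    flow_us, flow_fs: arrays of shape (2, H, W) holding x and y displacements
                      predicted from undersampled and fully-sampled pairs
    roi_mask:         array of shape (H, W), 1 inside the ROI and 0 outside
    """
    # The mask broadcasts over the two displacement components, so
    # background pixels contribute exactly zero error (and zero gradient).
    diff = (flow_us - flow_fs) * roi_mask
    return np.mean(diff ** 2)
```

In Eq. 2 this term would be scaled by \(\beta \) and added to the two terms of Eq. 1. The fully-sampled flow acts as a fixed target here, since it comes from the pre-trained sub-network rather than from annotated motion.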

2.2 Joint Cardiac Motion Estimation and Segmentation from Undersampled MR Image

Previous works have shown that motion estimation and segmentation tasks are complementary [4, 6, 13]. Therefore, here we couple both tasks for joint prediction from undersampled MR data. The schematic architecture of the unified model is shown in Fig. 1.

The joint model consists of two branches: the motion estimation branch proposed in Sect. 2.1, which introduces additional supervision from fully-sampled images, and the segmentation branch based on the network proposed in [1], where both branches share the joint feature encoder (Siamese style network) as shown in Fig. 1. As the images are only sparsely annotated in time, predictions from the corresponding fully-sampled images are used as supervision for the unlabelled data. Therefore, a categorical cross-entropy loss \(\mathcal {L}_s = -\sum _{l\in L} y_{l}^{GT} \log ( f(x_{l};\varTheta ^{US})) -\sum _{n\in U} \hat{y}_{n}^{FS} \log ( f(x_{n};\varTheta ^{US}) )\) on the labelled data set L and the unlabelled data set U is used for the segmentation branch, in which we define \(x_l\) and \(x_n\) as the input data, \(y_l^{GT}\) as the ground truth, \(\hat{y}_n^{FS}\) as the predictions from fully-sampled images, and f as the segmentation function parameterised by \(\varTheta ^{US}\). In contrast to the loss function in [6], here we do not employ the loss \(\mathcal {L}_w\) between the warped segmentations and the target, as we find that for undersampled cases, minimising \(\mathcal {L}_w\) could introduce more noise and uncertainty into the network training, presumably because of the less accurate predictions. We empirically observed that this could lead to a small performance degradation, especially for the segmentation branch.
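The two-part cross-entropy \(\mathcal {L}_s\) can be sketched for one labelled and one unlabelled frame as follows. The sketch assumes that class probabilities are stored as arrays of shape (C, H, W) and that the loss is averaged over pixels; the source does not specify whether \(\hat{y}_n^{FS}\) is a hard label map or a soft probability map, and the code accepts either. Function names are hypothetical.

```python
import numpy as np


def cross_entropy(probs, target, eps=1e-12):
    """Categorical cross-entropy, summed over classes and averaged over pixels.

    probs:  predicted class probabilities, shape (C, H, W)
    target: one-hot labels or soft targets, shape (C, H, W)
    eps:    small constant that avoids log(0)
    """
    return -np.mean(np.sum(target * np.log(probs + eps), axis=0))


def segmentation_loss(probs_labelled, gt_onehot, probs_unlabelled, fs_prediction):
    """Supervision from manual labels plus supervision from fully-sampled predictions.

    probs_labelled:   undersampled-network output on an annotated frame
    gt_onehot:        manual annotation of that frame
    probs_unlabelled: undersampled-network output on a frame without annotation
    fs_prediction:    output of the fully-sampled sub-network on the same frame
    """
    labelled_term = cross_entropy(probs_labelled, gt_onehot)
    unlabelled_term = cross_entropy(probs_unlabelled, fs_prediction)
    return labelled_term + unlabelled_term
```

The second term is what lets every frame in the cardiac cycle contribute to training the segmentation branch, even though manual annotations exist only for a few frames.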

As a result, the overall loss function for the joint model can be defined as:

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{m}+\lambda \mathcal {L}_{s}, \end{aligned}$$
(3)

where \(\lambda \) is a trade-off parameter for balancing these two tasks. \(\mathcal {L}_{m}\) can take the form of either Eq. 1 or Eq. 2, and we compare the two in the experiments.

3 Experiments and Results

Experiments were performed on 220 short-axis cardiac MR sequences from the UK Biobank study. Each scan contains a sequence of 50 frames, where manual segmentations of the left-ventricular (LV) cavity, the myocardium (Myo) and the right-ventricular (RV) cavity are available on the end-diastolic (ED) and end-systolic (ES) frames. A short-axis image stack typically consists of 10 image slices, and the pixel resolution is \(1.8 \times 1.8 \times 10.0\) \(mm^3\). Since only magnitude images are available, here we employed the phase map synthesis scheme proposed in [10] to synthetically generate phase maps (smoothly varying 2D sinusoid waves), in order to convert magnitude images to complex-valued images and to make the simulation more realistic. In the experiments, the synthesised complex-valued images were back-transformed to regenerate k-space samples. The input undersampled images were generated by randomly undersampling the k-space samples using uniform radial undersampling patterns. For pre-processing, all training images were cropped to the same size of \(192\times 192\), and intensity was normalised to the range of [0,1]. We split the data into 100/100/20 for training/testing/validation. The parameters of the loss function were set to \(\alpha = 0.001\), \(\beta = 1\), and \(\lambda =0.01\), which were chosen using the validation set. The fully-sampled sub-network parameters were loaded from [6], and we trained the undersampled network using the Adam optimiser with a learning rate of 0.0001. Data augmentation was performed on-the-fly, with random rotation, translation, and scaling.
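The simulation pipeline described above (phase synthesis, transform to k-space, radial undersampling, zero-filled inverse transform) can be sketched roughly as follows. This is not the scheme of [10] or the authors' code: the phase parameters, the rasterisation of radial spokes onto the Cartesian grid, the `angle_offset` used to randomise the pattern, and the number of spokes are all assumptions, and the source does not state how the number of spokes maps to an acceleration factor.

```python
import numpy as np


def synth_phase(shape, cycles=(1.0, 0.5)):
    """Smoothly varying 2D sinusoidal phase map in radians (hypothetical parameters)."""
    h, w = shape
    ys, xs = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w), indexing="ij")
    return np.pi * np.sin(2 * np.pi * (cycles[0] * xs + cycles[1] * ys))


def radial_mask(shape, n_spokes, angle_offset=0.0):
    """Uniformly spaced spokes through the k-space centre, rasterised on the grid."""
    h, w = shape
    mask = np.zeros(shape, dtype=bool)
    # Half-pixel steps along each spoke so that no grid cell is skipped.
    radius = np.arange(-max(h, w), max(h, w) + 1) / 2.0
    for angle in angle_offset + np.arange(n_spokes) * np.pi / n_spokes:
        ys = np.round(h // 2 + radius * np.sin(angle)).astype(int)
        xs = np.round(w // 2 + radius * np.cos(angle)).astype(int)
        inside = (ys >= 0) & (ys < h) & (xs >= 0) & (xs < w)
        mask[ys[inside], xs[inside]] = True
    return mask


def undersample(magnitude, n_spokes, angle_offset=0.0):
    """Return the zero-filled magnitude image and the sampling mask."""
    complex_img = magnitude * np.exp(1j * synth_phase(magnitude.shape))
    # Centred 2D FFT: the k-space origin sits at index (H // 2, W // 2).
    kspace = np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(complex_img)))
    mask = radial_mask(magnitude.shape, n_spokes, angle_offset)
    zero_filled = np.fft.fftshift(np.fft.ifft2(np.fft.ifftshift(kspace * mask)))
    return np.abs(zero_filled), mask
```

Drawing `angle_offset` at random for each training sample is one possible way to vary the aliasing pattern while keeping the spokes uniformly spaced.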

As [6] has already shown that the joint model can significantly outperform a single-branch model, in this work we mainly focus on evaluating the performance on undersampled data. We first evaluated the performance of motion estimation by comparing the proposed model with a B-spline free-form deformation (FFD) algorithm [8], and the results are shown in Table 1. Here we examined the effect of different losses on the model’s performance, where we termed the method using \(\mathcal {L}_m\) in the form of Eq. 1 Proposed-A, and the one using Eq. 2 Proposed-B. Motion fields were estimated between the ES and ED frames, and the mean contour distance (MCD) and Hausdorff distance (HD) were computed between the warped ES segmentations and the ED segmentations. Results on fully-sampled (FS) images are presented in Table 1 as a reference. It can be observed that the proposed methods consistently outperform FFD at all acceleration rates with \(p \ll 0.001\) using the Wilcoxon signed rank test, and are able to produce results that are close to those from the fully-sampled images. Furthermore, for higher acceleration rates (\(6 \times \) and \(8 \times \)), Proposed-B produces significantly better results than Proposed-A (\(p \ll 0.001\)). This is consistent with the fact that higher undersampling rates result in more aliased images, so the relatively strong supervision (\(\mathcal {L}_\varDelta \)) is needed more to guide the learning than for images with less aliasing (\(3 \times \)).
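The two contour metrics can be computed from a pair of binary masks as in the sketch below. The exact definitions used in the evaluation are not given in the text, so this assumes the common symmetric forms: MCD as the average of the two directed mean nearest-contour distances, and HD as the maximum of the two directed maxima. Contours are taken as foreground pixels with at least one 4-connected background neighbour, and the default in-plane spacing of 1.8 mm follows the resolution stated above.

```python
import numpy as np


def contour_points(mask):
    """Coordinates of foreground pixels that touch the background (4-connectivity)."""
    fg = np.asarray(mask, dtype=bool)
    padded = np.pad(fg, 1)  # constant padding with False
    # A pixel is interior when all four of its neighbours are foreground.
    interior = (
        padded[:-2, 1:-1] & padded[2:, 1:-1] & padded[1:-1, :-2] & padded[1:-1, 2:]
    )
    return np.argwhere(fg & ~interior)


def contour_distances(mask_a, mask_b, spacing_mm=1.8):
    """Return (mean contour distance, Hausdorff distance) in millimetres."""
    a = contour_points(mask_a) * spacing_mm
    b = contour_points(mask_b) * spacing_mm
    # Pairwise Euclidean distances between all contour points.
    pairwise = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    a_to_b = pairwise.min(axis=1)
    b_to_a = pairwise.min(axis=0)
    mcd = 0.5 * (a_to_b.mean() + b_to_a.mean())
    hd = max(a_to_b.max(), b_to_a.max())
    return mcd, hd
```

For motion evaluation, `mask_a` would be an ES segmentation warped by the estimated motion field and `mask_b` the corresponding ED segmentation, computed per structure. The dense pairwise matrix is adequate for 2D contours of this size but would need a spatial index for much larger point sets.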

Table 1. Evaluation of motion estimation accuracy for undersampled MR data with different acceleration factors in terms of the mean contour distance (MCD) and Hausdorff distance (HD) in mm (mean and standard deviation). The method using \(\mathcal {L}_m\) (Eq. 1) is termed Proposed-A, and the one using \(\mathcal {L}_m\) (Eq. 2) is termed Proposed-B. Bold numbers indicate the best results for different undersampling rates.

We further evaluated the segmentation performance of the model on undersampled data with different acceleration factors. Results reported in Table 2 are Dice scores computed against manual annotations of the LV, Myo, and RV, as well as the clinical parameter ejection fraction (EF). Proposed-A and Proposed-B did not differ significantly in terms of segmentation performance, so here we only report results obtained from Proposed-B in Table 2. It can be seen that although there is a relatively small drop in performance as the acceleration factor increases, the network can be trained robustly on undersampled data, and the clinical parameter predicted directly from undersampled data is very close to that from fully-sampled images. Furthermore, a visualisation of the network predictions on \(8 \times \) accelerated data in a cardiac cycle is shown in Fig. 2, where the myocardial motion indicated by the yellow arrows was established between ED and other time frames. Overall, predictions directly from undersampled MR data are reasonably accurate, despite some small underestimations.
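The Dice score and the ejection fraction can be computed from binary masks as in the following sketch. It uses the standard definitions, Dice = 2|A ∩ B| / (|A| + |B|) and EF = (EDV − ESV) / EDV × 100, with LV volumes obtained by counting voxels; the default voxel volume follows the stated resolution of \(1.8 \times 1.8 \times 10.0\) \(mm^3\). How the evaluation in Table 2 aggregates slices and subjects is not described in the text, so the function signatures are assumptions.

```python
import numpy as np


def dice(pred, ref):
    """Dice overlap between two binary masks (1.0 when both are empty)."""
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    total = pred.sum() + ref.sum()
    if total == 0:
        return 1.0
    return 2.0 * np.logical_and(pred, ref).sum() / total


def ejection_fraction(lv_ed, lv_es, voxel_volume_mm3=1.8 * 1.8 * 10.0):
    """Ejection fraction in percent from LV cavity masks at ED and ES.

    lv_ed, lv_es: binary masks of the LV cavity over the whole slice stack,
                  e.g. arrays of shape (slices, H, W)
    """
    edv = np.asarray(lv_ed, dtype=bool).sum() * voxel_volume_mm3
    esv = np.asarray(lv_es, dtype=bool).sum() * voxel_volume_mm3
    return 100.0 * (edv - esv) / edv
```

Because EF is a ratio of volumes, the voxel volume cancels out; it is kept as a parameter only so that the intermediate EDV and ESV values are in physical units.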

Table 2. Evaluation of segmentation performance under different acceleration factors in terms of Dice Metric (mean and standard deviation) and average percentage (%) error for ejection fraction (EF) compared with fully-sampled data.

Fig. 2. Visualisation of the simultaneous prediction of motion estimation and segmentation on data with an undersampling rate of 8. Myocardial motions are from ED to other time points (numbers on the top right). Segmentations are overlaid on fully-sampled data for better visualisation.

4 Conclusion

In this paper, we explored joint motion estimation and segmentation directly from undersampled cardiac MR data, bypassing the usual image reconstruction stage. The proposed method takes advantage of a unified model which shares the same feature encoder for both tasks and performs them simultaneously. In particular, we introduced a parallel, well-trained sub-network for the corresponding fully-sampled MR image pairs as a source of supervision when training on undersampled data, in order to push the predictions from undersampled data to be as accurate as possible. We showed that the proposed network is robust to undersampled data, and that results predicted directly from undersampled images are close to those from fully-sampled ones, which could potentially enable fast analysis for MR imaging. In the future, it would also be interesting to explore methods that are independent of aliasing patterns and acceleration factors.