1 Introduction

Atrial fibrillation (AF) is a cardiac arrhythmia caused by abnormal electrical discharges in the atrium, often beginning with hemodynamic and/or structural changes in the left atrium (LA) [1]. AF is clinically associated with LA strain, and MRI has been shown to be a promising imaging method for assessing the disease state and predicting adverse clinical outcomes. The LA also plays an important role in patients with ventricular dysfunction, acting as a booster pump to augment ventricular volume [2]. Computed tomography (CT) imaging of the heart is frequently performed when managing AF and prior to pulmonary vein ablation (isolation) therapy due to its rapid processing time. In recent years, there has been increasing interest in shifting towards cardiac MRI because of its excellent soft tissue contrast and lack of radiation exposure. For pulmonary vein ablation therapy planning in AF, precise segmentation of the LA and the proximal pulmonary veins (PPVs) is essential. However, this task is non-trivial because of the numerous anatomical variations of the LA and PPVs.

Historically, statistical shape and atlas-based methods have been the state-of-the-art cardiac segmentation approaches due to their ability to handle large shape/appearance variations. One significant challenge for such approaches is their limited efficiency: an average of 50 min of processing time per volume [3]. Statistical shape models are faster than atlas-based methods, but a high degree of uncertainty in the accuracy of such models is inevitable [4]. To alleviate this problem and accomplish the segmentation of the LA and PPVs from 3D cardiac MRI with high accuracy and efficiency, we propose a new deep CNN. Our proposed method is fully automated and differs substantially from previous methods for LA and PPVs segmentation. The differences and key novelties of the proposed method, named CardiacNET, are as follows:

  • Training a CNN from scratch on 3D cardiac MRI is not feasible given the insufficient 3D training data (with ground truth) and limited computer memory. Instead, we parsed the 3D data into 2D components (axial (A), sagittal (S), and coronal (C)) and utilized a separate deep learning architecture for each component. The proposed CardiacNET was trained using more than 60K 2D slices of cardiac MR images without relying on a network pre-trained on non-medical data.

  • We combined three CNN networks through an adaptive fusion mechanism in which the complementary information of each CNN was utilized to improve segmentation results. The proposed adaptive fusion mechanism is based on a new strategy, called robust region, which roughly measures the reliability of segmentation results without the need for ground truth.

  • We devised a new loss function in the proposed network, based on a modified z-loss, to provide fast convergence of network parameters. This not only improved segmentation results due to fast and reliable estimation of network parameters, but also significantly accelerated the segmentation process. The overall segmentation of a given 3D cardiac MRI takes at most 10 s on a GPU, and 7.5 min on the CPU of a standard workstation.

Fig. 1. High-level overview of the proposed multi-view CNN architecture.

2 Proposed Multi-view Convolutional Neural Network (CNN) Architecture

The proposed pipeline for deep learning based segmentation of the LA and PPVs is summarized in Fig. 1. After parsing the 3D cardiac MRI into axial, sagittal, and coronal views, we used the same CNN architecture for each view. The rationale behind this decision is the limited computer memory and the insufficient amount of 3D data for training on 3D cardiac MRI from scratch. Instead, we reduced the computational burden of CNN training by constraining the problem to the 2D domain. The resulting pixel-wise segmentations from each CNN are combined through an adaptive fusion strategy, designed to maximize the information content obtained from the different views. The details of the pipeline are given in the following subsections.
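For concreteness, the following is a minimal sketch (not the authors' released code) of the view-parsing step: a 3D volume is sliced along each axis to produce the axial, sagittal, and coronal 2D stacks. The axis ordering and the dummy array shape are assumptions for illustration only.

```python
# Minimal sketch: parse a 3D cardiac MR volume into 2D axial/sagittal/coronal
# slice stacks, each of which feeds its own 2D CNN.
import numpy as np

def parse_views(volume: np.ndarray) -> dict:
    """volume: 3D array assumed to be indexed as (axial, coronal, sagittal)."""
    return {
        "axial":    [volume[i, :, :] for i in range(volume.shape[0])],
        "coronal":  [volume[:, j, :] for j in range(volume.shape[1])],
        "sagittal": [volume[:, :, k] for k in range(volume.shape[2])],
    }

# Example with an arbitrary dummy volume.
views = parse_views(np.zeros((100, 320, 320), dtype=np.float32))
print({name: (len(slices), slices[0].shape) for name, slices in views.items()})
```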

Encoder-Decoder CNN: We constructed an encoder-decoder CNN architecture, similar to that of Noh et al. [5]. The network includes 23 layers (11 in the encoder, 12 in the decoder unit). Two max-pooling layers in the encoder unit reduce the image dimensions by half, and a total of 19 convolutional (9 in the encoder, 10 in the decoder), 18 batch normalization, and 18 ReLU (rectified linear unit) layers are used. Specific to the decoder unit, two upsampling layers restore the images to their original size. The kernel size of all filters is \(3\times 3\). The final layer of the network includes a softmax function (logistic) for generating a probability score for each pixel. Details of these layers, together with the associated filter sizes and numbers, are given in Fig. 2.
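The following PyTorch sketch illustrates the general structure described above (conv + batch-normalization + ReLU blocks, two max-pooling and two upsampling stages, 3x3 kernels, and a per-pixel softmax). It is not the exact CardiacNET architecture: the actual layer counts and filter numbers are those given in Fig. 2, and the channel widths below are placeholders.

```python
# Simplified encoder-decoder segmentation CNN in the spirit of the one described
# above; channel widths and block counts are placeholders, not the paper's values.
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),  # 3x3 kernels throughout
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class EncoderDecoder2D(nn.Module):
    def __init__(self, in_ch=1, n_classes=2, width=32):
        super().__init__()
        self.enc1 = nn.Sequential(conv_bn_relu(in_ch, width), conv_bn_relu(width, width))
        self.enc2 = nn.Sequential(conv_bn_relu(width, 2 * width), conv_bn_relu(2 * width, 2 * width))
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.dec2 = nn.Sequential(conv_bn_relu(2 * width, 2 * width), conv_bn_relu(2 * width, width))
        self.dec1 = nn.Sequential(conv_bn_relu(width, width), nn.Conv2d(width, n_classes, kernel_size=1))

    def forward(self, x):
        x = self.pool(self.enc1(x))     # first halving of spatial dimensions
        x = self.pool(self.enc2(x))     # second halving
        x = self.dec2(self.up(x))       # first upsampling
        x = self.dec1(self.up(x))       # second upsampling, back to input size
        return torch.softmax(x, dim=1)  # per-pixel class probabilities

# Example: one 320x320 single-channel slice.
probs = EncoderDecoder2D()(torch.zeros(1, 1, 320, 320))
print(probs.shape)  # torch.Size([1, 2, 320, 320])
```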

Loss Function: We used a new loss function that can estimate the parameters of the proposed network at a much faster rate. We trained the end-to-end mapping with the loss function \(L(\mathbf {o},c)=\,\)softplus\((a(b-z_c))/a\), called z-loss [6], where \(\mathbf {o}\) denotes the output of the network, c denotes the ground truth label, and \(z_c\) indicates the z-normalized score of the ground-truth class, obtained as \(z_c=(o_c-\mu )/\sigma \), where the mean (\(\mu \)) and standard deviation (\(\sigma \)) are computed from \(\mathbf {o}\). The z-loss is obtained by a reparametrization of the soft-plus (SP) function (i.e., \(SP(x)=ln(1+e^x)\)) through two hyperparameters, a and b. Herein, we kept these hyperparameters fixed and trained the network with the resulting reduced z-loss function. The rationale behind this choice is the following: the z-loss provides efficient training because it belongs to the spherical loss family, and it is invariant to scale and shift changes in the output, preventing the output values from drifting toward extremes.
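As an illustration, the following is a minimal sketch of the z-loss under our reading of the formula above (a softplus applied to the z-normalized score of the ground-truth class); the default values of a and b and the numerical guard on \(\sigma \) are assumptions, not values from [6].

```python
# Minimal z-loss sketch: z-normalize the per-pixel class scores, then apply a
# reparametrized softplus to the score of the ground-truth class.
import torch
import torch.nn.functional as F

def z_loss(scores: torch.Tensor, target: torch.Tensor, a: float = 1.0, b: float = 0.0):
    """scores: (N, C) raw network outputs per pixel; target: (N,) ground-truth class indices."""
    mu = scores.mean(dim=1, keepdim=True)
    sigma = scores.std(dim=1, keepdim=True).clamp_min(1e-6)  # guard against zero std (assumption)
    z = (scores - mu) / sigma                                # z-normalized scores
    z_c = z.gather(1, target.unsqueeze(1)).squeeze(1)        # score of the true class
    return (F.softplus(a * (b - z_c)) / a).mean()

# Example with random scores for 4 pixels and 2 classes.
loss = z_loss(torch.randn(4, 2), torch.tensor([0, 1, 1, 0]))
print(loss.item())
```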

Training CardiacNET from Scratch: 3D cardiac MRI volumes, together with their corresponding expert-annotated ground truths, were used to train the CNN after the images were parsed into the three views (A, S, C). Data augmentation was conducted on the training dataset with the translation and rotation operations indicated in Table 1. The augmented 3D images were parsed into A, S, and C views, yielding more than 60K 2D images to train the CNN (approximately 30K for the A and C views, and around 11K for the S view). Nine of the subjects, together with their augmented data, were used for training, and the remaining subject (with its augmented data) was used for validation. As a preprocessing step, all images underwent anisotropic smoothing filtering and histogram matching.
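A minimal sketch of the rotation/translation augmentation is given below; the actual parameter ranges are those listed in Table 1, so the angle and offset values here are placeholders only.

```python
# Rotation/translation augmentation sketch for a 3D volume; parameter values are
# placeholders, not the ranges from Table 1.
import numpy as np
from scipy.ndimage import rotate, shift

def augment(volume: np.ndarray, angle_deg: float, offset_vox: tuple) -> np.ndarray:
    """Rotate in the in-plane axes and translate the 3D volume (nearest-edge padding)."""
    rotated = rotate(volume, angle_deg, axes=(1, 2), reshape=False, mode="nearest")
    return shift(rotated, offset_vox, mode="nearest")

rng = np.random.default_rng(0)
vol = rng.random((64, 128, 128)).astype(np.float32)
aug = augment(vol, angle_deg=10.0, offset_vox=(0, 5, -5))
print(aug.shape)
```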

Fig. 2. Details of the CNN architecture. Note that the image size is not necessarily fixed for each view's CNN.

Table 1. Data augmentation parameters and number of training images

Multi-view Information Fusion: Since cardiac MRI is often not reconstructed with isotropic resolution, we expected varying segmentation accuracy across views. In order to alleviate the potential adverse effects caused by the non-isotropic spatial resolution of a particular view, it is desirable to reduce the contribution of that view to the final segmentation. We achieved this with the adaptive fusion strategy described next. For a given MRI volume I and its corresponding segmentation \(\mathbf {o}\), we proposed a new strategy, called robust region, that roughly determines the reliability of the output segmentation \(\mathbf {o}\) by assessing its object distribution. To achieve this, we hypothesized that the output should include only one connected object when the segmentation is successful; if more than a single connected object is present, the additional objects can be considered false positives. Accordingly, the respective segmentation performance in the A, S, and C views can be compared and weighted. To this end, we utilized connected component analysis (CCA), \(CCA (\mathbf {o})=\{o_{1},\ldots ,o_{n} \,|\, \cup o_{i} = \mathbf {o} \text { and } \cap o_i=\phi \}\), to rank the output segmentations and reduce the contribution of a view's CNN when its false positive findings (non-trusted objects/components) were large and its true positive findings (trusted object/component) were small (Fig. 3). The contribution of each view's CNN was then weighted by \( w= {max_{i} \{|o_{i}| \}}/ \sum _{i} |o_{i}| \), so that higher weights were assigned when the largest connected component dominated the whole output volume. Note that this block is used only in the test phase. Complementary to this strategy, we also used simple linear fusion of the views for comparison (see the Experimental Results section).
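The weighting above can be computed directly from connected component analysis. The sketch below assumes binary masks obtained by thresholding each view's probability map at 0.5 and resampled to a common grid; both choices are illustrative assumptions, not details from the paper.

```python
# Robust-region weighting sketch: the weight of a view is the fraction of its
# foreground volume occupied by the largest connected component (1.0 = a single
# trusted object, lower values = more false-positive components).
import numpy as np
from scipy.ndimage import label

def robust_region_weight(mask: np.ndarray) -> float:
    labeled, n = label(mask > 0)              # connected component analysis
    if n == 0:
        return 0.0
    sizes = np.bincount(labeled.ravel())[1:]  # component volumes, skip background
    return float(sizes.max() / sizes.sum())   # w = max_i |o_i| / sum_i |o_i|

def adaptive_fusion(prob_maps: dict) -> np.ndarray:
    """prob_maps: view name -> per-voxel probability volume on a common grid (assumption)."""
    weights = {v: robust_region_weight(p > 0.5) for v, p in prob_maps.items()}
    total = sum(weights.values()) or 1.0
    return sum((w / total) * prob_maps[v] for v, w in weights.items())
```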

Fig. 3. Connected components obtained from each view were computed, and the residual volume (T-NT) was used to determine the strength of fusion with the other views.

3 Experimental Results

Data sets: Thirty cardiac MRI data sets were provided by the STACOM 2013 challenge organizers [3]. Ten training data sets were provided with ground truth labels, and the remaining twenty were provided as a test set. It is important to note that the complete PVs were not considered in the segmentation challenge; only the proximal segments of the PVs, up to the first branching vessel or 10 mm from the vein ostium, were included. MR images were obtained from a 1.5T Achieva (Philips Healthcare, The Netherlands) scanner with an ECG-gated 3D balanced steady-state free precession acquisition [3] with TR/TE\(\,=4.4/2.4\) ms and flip angle = \(90^\circ \). The typical acquisition time for the cardiac volume imaging was 10 min. The in-plane resolution was 1.25 \(\times \) 1.25 mm\(^2\), and the slice thickness was 2.7 mm. Further details on the data acquisition and image properties can be found in [3].

Fig. 4. First row: sample MRI slices from the S, C, and A views (red contour is the ground truth, green is the output of the proposed method). Second to fifth rows: 3D surface visualization of the ground truth and of the outputs generated by the proposed method with simple fusion (F), adaptive fusion (AF), and the new loss function (SP).

Evaluations: For evaluation and comparison with other state-of-the-art methods, we used the same evaluation metrics provided by the STACOM 2013 challenge: the Dice index and the surface-to-surface (S2S) distance. In addition, we calculated the Dice index and S2S for the LA and PPVs separately. To provide a comprehensive evaluation and comparison, sensitivity (true positive rate), specificity (true negative rate), precision (positive predictive value), and Dice index values for the combined LA and PPVs were included as well. Table 2 summarizes all these evaluation metrics along with efficiency comparisons, where we tested our algorithm on both GPU and CPU. LTSI-VRG, UCL-1C, and UCL-4C are three atlas-based methods whose outputs were published publicly as part of the STACOM 2013 challenge. OBS-2 is the result from a human observer, whose output was also available as part of the challenge. Using a leave-one-out cross-validation strategy on the training dataset, we achieved high sensitivity (0.92) and Dice index (0.93). Similarly, in almost all evaluation metrics on the test set, the proposed method outperformed the state-of-the-art approaches by large margins. Table 2 reports the results of different configurations of CardiacNET: a single CNN on a particular view (e.g., \(S\_CNN\)), simple linear fusion (F-CNN), adaptive fusion (AF-CNN), and adaptive fusion with the new loss function (AF-CNN-SP). In AF-CNN, the loss function was cross-entropy. The best method in the challenge data set was reported to have a Dice index of 0.94 for the LA and 0.65 for the PPVs (below 0.9 for the combined LA and PPVs). With our proposed method, the Dice index for the combined LA and PPVs was well above 0.90. For efficiency, our approach takes at most 10 s on an Nvidia TitanX GPU and 7.5 min on a CPU with an octa-core processor (2.4 GHz). The method in [7] required 30–45 min of processing time (with a quad-core processor (2.13 GHz)). For qualitative evaluation, surface renderings of the output segmentations are compared to the ground truth in Fig. 4. Sample axial, sagittal, and coronal MRI slices are given in the same figure with ground truth annotations overlaid with the segmented LA and PPVs.
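For reference, the voxel-overlap metrics reported in Table 2 can be computed as in the following sketch, assuming binary prediction and ground-truth masks on the same grid (the S2S distance is omitted here).

```python
# Overlap metrics (Dice, sensitivity, specificity, precision) from binary masks.
import numpy as np

def overlap_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    return {
        "dice": 2 * tp / max(2 * tp + fp + fn, 1),
        "sensitivity": tp / max(tp + fn, 1),
        "specificity": tn / max(tn + fp, 1),
        "precision": tp / max(tp + fp, 1),
    }
```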

4 Discussions and Concluding Remarks

The main advantage of CardiacNET is that it is an accurate and efficient method for combined segmentation of the LA and PPVs in atrial fibrillation patients. Precise segmentation of the LA and PPVs is needed for ablation therapy planning and clinical guidance in AF patients. The PPVs have a greater number of anatomical variations than the LA body, making accurate segmentation challenging, and joint segmentation of the LA and PPVs is even more challenging than segmentation of the LA body alone. Nevertheless, on all available quantitative metrics, the proposed method has been shown to greatly improve segmentation accuracy on the existing benchmark for LA and PPVs segmentation. The benchmark evaluation has also allowed the method and its variations to be cross-compared on the same dataset with other existing methods in the literature (Fig. 5).

Table 2. Evaluation metrics for the state-of-the-art and proposed methods. \(^{**}\): running time on CPU; \(^*\): running time on an NVIDIA TitanX GPU
Fig. 5. Box plots for sensitivity, precision, and Dice index for the state-of-the-art (LTSI_VRG, UCL_1C, UCL_4C, OBS_2) and proposed methods (F_CNN, AF_CNN, AF_CNN_SP) on the LA segmentation benchmark.

Despite the efficacy of the proposed method, there are several directions in which our work can be extended in future studies. Firstly, the new method will be tested, evaluated, and validated on more diverse data sets from several independent cohorts, at different imaging resolutions and noise levels, and even across different scanner vendors. Secondly, our framework can be extended to 4D (i.e., motion) analysis of cardiac images by extending our parsing strategy. Thirdly, we aim to explore the feasibility of training fully 3D networks on cardiac MRI when multiple GPUs are available, or of developing sparse CNNs, to address the segmentation problem. Fourthly, with low-dose cardiac CT technology on the rise, it is desirable to have a similar network structure trained on CT scans; the notable efficacy of the deep learning strategies presented in this work promises similar performance on CT.

In conclusion, the proposed method utilizes the strength of a deeply trained CNN to segment the LA and PPVs from cardiac MRI. We have shown that combining information from different MRI views through an adaptive fusion strategy, together with a new loss function, significantly improves segmentation accuracy and efficiency.