1 Introduction

Accurately locating the myocardial ischemia area is in great demand for better myocardial infarction (MI) diagnosis. The use of magnetic resonance contrast agents based on gadolinium-chelates for visualizing the position and size of scarred myocardium has become ‘the gold standard’ for evaluating the area of the MI [1]. However, the contrast agents are not only expensive but also nephrotoxic and neurotoxic and, hence, could harm the patient [2]. In routine clinical procedures, and especially for early screening and postoperative assessment, visual assessment is a popular method, but it is subjective, non-reproducible, and prone to high inter-observer variability. Furthermore, estimating the time course of the wall motion remains difficult even for experienced radiologists.

Therefore, computer-aided detection systems have been developed in recent years to automatically and quantitatively analyze left ventricle (LV) myocardial function. This computerized vision can serve to simulate a trained physician's intuitive attempts at clinical judgment in a medical setting. Previous MI detection methods have been mainly based on information theoretic measures and Kalman filter approaches [3], Bayesian probability models [4], pattern recognition techniques [5, 6], and biomechanical approaches [7]. However, all of these existing methods still fail to directly and accurately identify the position and size of the MI area. More specifically, these methods have not been able to capture sufficient information to establish integrated correspondences between the myocardial motion field and the MI area. More recently, unsupervised deep learning feature selection techniques have been successfully used to solve many difficult computer vision problems. The general concept behind deep learning is to learn hierarchical feature representations by first inferring simple representations and then progressively building up more complex representations from the previous level. This method has been successfully applied to the recognition and prediction of prostate cancer, Alzheimer's disease, and vertebrae and neural foramina stenosis [8].

In this study, an end-to-end deep-learning framework has been developed for accurate and direct detection of infarction size at the pixel level using cardiac magnetic resonance (CMR) images. Our method's contributions and advantages are as follows: (1) for the first time, we propose an MI area detection framework at the pixel level that can give the physician the explicit position, size and shape of the infarcted areas; (2) a feature extraction architecture is used to establish solid correspondences between the myocardial motion field and the MI area, which can help in understanding the complex cardiac structure and periodic nature of heart motion; and (3) a unified deep-learning framework can seamlessly fuse different methods and layers to better learn hierarchical feature representations and feature selection. Therefore, our framework has great potential for improving the efficiency of the clinical diagnosis of MI.

Fig. 1. The architecture of OF-RNN: heart localization layers, motion statistical layers, and fully connected discriminate layers.

2 Methodology

As shown in Fig. 1, there are three function layers inside the OF-RNN. The heart localization layers can automatically detect the ROI, including the LV, and the motion statistical layers can generate motion features that accurately characterize myocardial physiologic and physical function, followed by the fully connected discriminate layers that use stacked auto-encoders and softmax classifiers to detect the MI area from motion features.

Heart localization layers. A Fast R-CNN [9] is used here for the automatic detection of a region of interest (ROI) around the LV, to reduce the computational complexity and improve the accuracy. The first step of the heart localization layers is to generate category-independent region proposals. Afterward, a typical convolutional neural network model produces a convolutional feature map from the input images. Then, for each object proposal, an ROI pooling layer extracts a fixed-length feature vector from the feature map. The ROI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of \(H\) \(\times \) \(W\), where H and W are layer hyper-parameters that are independent of any particular ROI. Finally, each feature vector is fed into a sequence of fully connected layers that branch into two sibling output layers, thereby generating a 64 \(\times \) 64 bounding box for cropping the ROI image sequences, which include the LV, from the CMR sequences.
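The ROI pooling step described above can be illustrated with a minimal NumPy sketch. It is not the Fast R-CNN implementation used in this study: the function name `roi_max_pool`, the single-channel feature map, and the integer bin-edge rule are assumptions made only to show how a region of arbitrary size is reduced to a fixed \(H\) \(\times \) \(W\) grid by max pooling.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the region roi = (y0, x0, y1, x1) (end-exclusive) of a 2D
    feature map into a fixed out_h x out_w grid (hypothetical helper)."""
    y0, x0, y1, x1 = roi
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    # Integer bin edges split the region into out_h x out_w roughly equal cells.
    ys = np.linspace(0, h, out_h + 1).astype(int)
    xs = np.linspace(0, w, out_w + 1).astype(int)
    pooled = np.empty((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Each cell keeps at least one element, even for tiny regions.
            cell = region[ys[i]:max(ys[i + 1], ys[i] + 1),
                          xs[j]:max(xs[j + 1], xs[j] + 1)]
            pooled[i, j] = cell.max()
    return pooled

# Two ROIs of different sizes map to the same fixed 2 x 2 output.
fmap = np.arange(64, dtype=float).reshape(8, 8)
small = roi_max_pool(fmap, (0, 0, 4, 4), 2, 2)
large = roi_max_pool(fmap, (1, 2, 8, 8), 2, 2)
```

Because the output size depends only on the two hyper-parameters, the fully connected layers that follow can accept proposals of any shape.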

Motion statistical layers. The motion statistical layers extract time-series motion features from the ROI image sequences to capture the periodic nature of the heart motion. The local motion features are generated by the LSTM-RNN, and the global motion features are generated by deep optical flow. In the first step, we compute the local motion features from the ROI image sequence. For each ROI sequence, the input is \(I = ({I_1},{I_2}, \ldots ,{I_J})\), \(J = 25\), with each image of size 64 \(\times \) 64, and I(p) denotes the pixel at coordinate \(p = [x,y]\) of image I. A window of size 11 \(\times \) 11 is constructed over the overlapping neighborhood of I[x, y], and its intensity values represent the feature of each p on image \({I_J}\). The features from the J images of the sequence are thus unrolled into a vector \({P_l}(p) \in {R^{11 \times 11 \times J}}\) for each pixel as input. Then, four RNN layers [10] with LSTM cells are used to learn from this input. Given the input \({x_t}\) at time t, where each time step corresponds to one frame so that \({x_t}\) is the part of \({P_l}(p)\) from that frame, and the hidden state of the previous time step \({h_{t-1}}\), the hidden and output layers for the current time step are computed as follows:

$$\begin{aligned} {h_t} = \phi \left( {W_{xh}}\left[ {h_{t - 1}},{x_t} \right] \right) ,\quad {p_t} = \mathrm{softmax}\left( {W_{hy}}{h_t} \right) ,\quad {\hat{y}_t} = \arg \max {p_t} \end{aligned}$$
(1)

where \(x_t\), \(h_t\) and \(\hat{y}_t\) represent the input, hidden, and output layers at each time step t, respectively; \(W_{xh}\) and \(W_{hy}\) are the weight matrices between the input and hidden layers and between the hidden and output layers, respectively; and \(\phi \) denotes the activation function. The LSTM cell [10] is designed to mitigate the vanishing gradient problem. In addition to the hidden layer vector \(h_t\), the LSTMs maintain a memory vector \(c_t\), an input gate \(i_t\), a forget gate \(f_t\), and an output gate \(o_t\). These gates in the LSTMs are computed as follows:

$$\begin{aligned} \left[ \begin{array}{l} {i_t}\\ {f_t}\\ {o_t}\\ {{\tilde{c}}_t} \end{array} \right] = \left( \begin{array}{l} sigm\\ sigm\\ sigm\\ \tanh \end{array} \right) {W_t}\left[ {D({x_t}),{h_{t - 1}}} \right] \end{aligned}$$
(2)

where \({W_t}\) is the weight matrix, and D is the dropout operator. The final memory cell and the final hidden state are given by

$$\begin{aligned} \begin{array}{l} {c_t} = {f_t} \odot {c_{t - 1}} + {i_t} \odot {\tilde{c}_t},\quad \quad {h_t} = {o_t} \odot \tanh ({c_t}) \end{array} \end{aligned}$$
(3)
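As an illustration of the local motion branch, the following NumPy sketch unrolls the 11 \(\times \) 11 neighborhood of one pixel over the J = 25 frames and passes it through a single LSTM step of the form given in Eqs. (2) and (3). It is a sketch under stated assumptions rather than the trained network: the helper names, edge padding at the image border, the hidden size of 8, the random weights, and the omission of the dropout operator D are all choices made here for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def patch_sequence(frames, p, size=11):
    """Return a (J, size*size) array holding the size x size neighborhood of
    pixel p = (x, y) in each of the J frames, flattened per frame."""
    x, y = p
    r = size // 2
    # Edge padding is an assumption; border handling is not specified in the text.
    padded = np.pad(frames, ((0, 0), (r, r), (r, r)), mode="edge")
    return padded[:, y:y + size, x:x + size].reshape(frames.shape[0], -1)

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM step following Eqs. (2)-(3); dropout D is omitted."""
    n = h_prev.size
    z = W @ np.concatenate([x_t, h_prev])  # W has shape (4n, len(x_t) + n)
    i = sigmoid(z[:n])                     # input gate
    f = sigmoid(z[n:2 * n])                # forget gate
    o = sigmoid(z[2 * n:3 * n])            # output gate
    c_tilde = np.tanh(z[3 * n:])           # candidate memory
    c = f * c_prev + i * c_tilde           # Eq. (3), memory cell
    h = o * np.tanh(c)                     # Eq. (3), hidden state
    return h, c

rng = np.random.default_rng(0)
frames = rng.random((25, 64, 64))          # stand-in for one ROI sequence
seq = patch_sequence(frames, p=(10, 20))   # 25 time steps of 121 values
hidden = 8                                 # hypothetical hidden size
W = rng.normal(scale=0.1, size=(4 * hidden, seq.shape[1] + hidden))
h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in seq:                            # one time step per frame
    h, c = lstm_step(x_t, h, c, W)
```

The unrolled sequence holds 11 \(\times \) 11 \(\times \) 25 = 3025 values per pixel, which matches the dimension of \({P_l}(p)\) given above.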

In the second step, we compute the global motion feature of the image sequence using an optical flow algorithm [11] with a deep architecture. An optical flow describes a dense vector field in which a displacement vector is assigned to each pixel and points to where that pixel can be found in another image. Considering two adjacent frames, a reference image I = (\(I_{J-1}\)) and a target image \(I'\) = (\({I_{J}}\)), the goal is to estimate the flow \(w = {(u,v)^ \top }\), which contains both horizontal and vertical components. We assume that the images are already smoothed by using a Gaussian filter with a standard deviation of \(\sigma \). The energy to be optimized is the weighted sum of a data term \(E_D\), a smoothness term \(E_S\), and a matching term \(E_M\):

$$\begin{aligned} E(w) = \int _\varOmega {{E_D} + \alpha {E_S} + \beta {E_M}dx} \end{aligned}$$
(4)

Next, a procedure is developed to produce a pyramid of response maps, and we start from the optical flow constraint, assuming constant brightness. A basic way to build the data term and the smoothness term is the following:

$$\begin{aligned} {E_D} = \delta \varPsi \left( {\sum \limits _{i = 1}^c {{w^ \top }} \bar{J}_0^iw} \right) + \gamma \varPsi \left( {\sum \limits _{i = 1}^c {{w^ \top }} \bar{J}_{xy}^iw} \right) \end{aligned}$$
(5)
$$\begin{aligned} {E_S} = \varPsi (||\nabla u|{|^2} + ||\nabla v|{|^2}) \end{aligned}$$
(6)

where \(\varPsi \) is a robust penalizer; \(\bar{J}_0^i\) and \(\bar{J}_{xy}^i\) are the tensors for channel i; and \(\delta \) and \(\gamma \) are two balancing weights. The matching term encourages the flow estimate to be similar to a precomputed vector field \(w'\) and is weighted by a term c(x):

$$\begin{aligned} {E_M} = c\varPsi ({\left\| {w - {w'}} \right\| ^2}) \end{aligned}$$
(7)
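To make the roles of the three terms in Eq. (4) concrete, the sketch below evaluates a discretized version of the energy for a given flow field. Several simplifications are assumed and are not taken from this study: the robust penalizer is given the Charbonnier form \(\varPsi (s^2) = \sqrt{s^2 + \epsilon ^2}\), the data term keeps only a linearized brightness-constancy residual for a single channel (no gradient-constancy part and no tensor normalization), and the weights are placeholders.

```python
import numpy as np

def psi(s2, eps=1e-3):
    """Robust penalizer; the Charbonnier form is an assumption."""
    return np.sqrt(s2 + eps ** 2)

def flow_energy(I0, I1, u, v, u_m, v_m, c, alpha=1.0, beta=1.0, delta=1.0):
    """Discrete sum over pixels of E_D + alpha * E_S + beta * E_M."""
    # Data term: linearized brightness constancy, (Ix*u + Iy*v + It)^2.
    Iy, Ix = np.gradient(I0)
    It = I1 - I0
    E_D = delta * psi((Ix * u + Iy * v + It) ** 2)
    # Smoothness term, Eq. (6): penalize the gradient magnitude of the flow.
    uy, ux = np.gradient(u)
    vy, vx = np.gradient(v)
    E_S = psi(ux ** 2 + uy ** 2 + vx ** 2 + vy ** 2)
    # Matching term, Eq. (7): stay close to the precomputed field (u_m, v_m).
    E_M = c * psi((u - u_m) ** 2 + (v - v_m) ** 2)
    return float(np.sum(E_D + alpha * E_S + beta * E_M))

# Toy example: a horizontal intensity ramp shifted right by one pixel.
I0 = np.tile(np.arange(16.0), (16, 1))
I1 = I0 - 1.0
ones, zeros = np.ones_like(I0), np.zeros_like(I0)
e_true = flow_energy(I0, I1, ones, zeros, ones, zeros, c=ones)
e_zero = flow_energy(I0, I1, zeros, zeros, ones, zeros, c=ones)
```

In this toy case the true one-pixel shift yields a lower energy than the zero flow, which is the behavior the minimization relies on.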

For any pixel \(p'\) of \(I'\), \({C_{n,p}}(p')\) is a measure of similarity between \(I_{n,p}\) and \(I'_{n,p'}\), where \(I_{n,p}\) is a patch of size N \(\times \) N (\(N \in \{4, 8, 16\}\)) from the first image centered at p. We start with the bottom-level correlation maps, which are iteratively aggregated to obtain the upper levels. This aggregation consists of max-pooling, sub-sampling, computing a shifted average and non-linear rectification. In the end, for each image \(I_{J-1}\), a full motion field \(w_{J-1} = (u_{J-1}, v_{J-1})\) is computed with reference to the next frame \({I_{J}}\).
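A bottom-level correlation map of the kind described above can be sketched as follows. The similarity measure (zero-mean normalized correlation), the function name `correlation_map`, and the convention that p marks the top-left corner of the patch rather than its center are assumptions made for this illustration; the pyramid aggregation steps are not shown.

```python
import numpy as np

def correlation_map(img, img_next, p, n=4):
    """Similarity between the n x n patch of img with top-left corner
    p = (row, col) and every n x n patch of img_next."""
    def unit(patch):
        # Zero-mean, unit-norm patch vector (assumed similarity measure).
        v = patch.ravel() - patch.mean()
        norm = np.linalg.norm(v)
        return v / norm if norm > 0 else v

    r, c = p
    ref = unit(img[r:r + n, c:c + n])
    h, w = img_next.shape
    out = np.zeros((h - n + 1, w - n + 1))
    for i in range(h - n + 1):
        for j in range(w - n + 1):
            out[i, j] = ref @ unit(img_next[i:i + n, j:j + n])
    return out

# Toy example: the second image is the first shifted by (1, 2) pixels.
rng = np.random.default_rng(0)
img = rng.random((16, 16))
img_next = np.roll(img, (1, 2), axis=(0, 1))
cmap = correlation_map(img, img_next, p=(6, 6), n=4)
best = tuple(int(k) for k in np.unravel_index(np.argmax(cmap), cmap.shape))
```

The peak of the map gives the best-matching location in the next frame, and its offset from p is the displacement that the matching term can use as \(w'\).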

Fully connected discriminate layers. The fully connected discriminate layers are used to detect the MI area accurately from the local motion features and the global motion features. First, for each \(w_j\), we use image patches, say 3 \(\times \) 3, and extract the feature by beginning from a point p in the first frame and tracing p through the following frames. We thereby obtain \({P_g}(p)\), which contains a 3 \(\times \) 3 vector for the displacement and a 3 \(\times \) 3 vector for the orientation of p for each frame. Second, we concatenate the local image feature \({P_l}(p)\) from the LSTM-RNN and the motion trajectory feature \({P_g}(p)\) from the optical flow to establish a whole feature vector P(p). Finally, an auto-encoder with three stacked layers is used for learning P(p), followed by a softmax layer that determines whether p belongs to the MI area or not.
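The construction of \({P_g}(p)\) and its concatenation with \({P_l}(p)\) can be sketched as below. The tracking rule (nearest-pixel rounding with clamping at the border), the array layout of the flow fields, and the function names are assumptions for illustration; the stacked auto-encoder and softmax stages are not included.

```python
import numpy as np

def trajectory_features(flows, p, size=3):
    """flows: (J-1, H, W, 2) array of (u, v) fields; p = (x, y) in frame 1.
    Returns the size x size displacement magnitudes and orientations around
    the traced point for every frame, concatenated into one vector."""
    n, H, W, _ = flows.shape
    r = size // 2
    x, y = float(p[0]), float(p[1])
    feats = []
    for j in range(n):
        # Nearest-pixel tracking, clamped so the window stays in the image.
        xi = int(np.clip(round(x), r, W - 1 - r))
        yi = int(np.clip(round(y), r, H - 1 - r))
        win = flows[j, yi - r:yi + r + 1, xi - r:xi + r + 1]
        u, v = win[..., 0], win[..., 1]
        feats.append(np.hypot(u, v).ravel())    # displacement magnitude
        feats.append(np.arctan2(v, u).ravel())  # orientation in radians
        # Move the traced point along the flow to the next frame.
        x += flows[j, yi, xi, 0]
        y += flows[j, yi, xi, 1]
    return np.concatenate(feats)

def full_feature(P_l, P_g):
    """Simple concatenation of local and global features into P(p)."""
    return np.concatenate([np.ravel(P_l), np.ravel(P_g)])

# Toy example: 24 flow fields with a uniform shift of one pixel to the right.
flows = np.zeros((24, 64, 64, 2))
flows[..., 0] = 1.0
P_g = trajectory_features(flows, p=(10, 20))
P = full_feature(np.zeros(3025), P_g)  # placeholder local feature vector
```

With 3 \(\times \) 3 windows, each frame contributes 9 magnitudes and 9 orientations, so 24 flow fields give a 432-dimensional \({P_g}(p)\) in this sketch.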

Fig. 2. (a, b) Our predicted MI area (the green zone) is a good fit for the ground truth (the yellow arrow); (c) our predicted MI area (the green zone) is a good fit for the ground truth (the yellow dotted line).

3 Experimental Results

Data acquisition. We collected short-axis image datasets and the corresponding gadolinium-enhanced images from 114 subjects on a 3T CMR scanner. Each subject's short-axis image dataset consisted of 25 2D images (one cardiac cycle), giving a total of 43 apical, 37 mid-cavity and 34 basal short-axis image datasets for the 114 subjects. The temporal resolution is 45.1 ± 8.8 ms, and the short-axis planes are 8 mm thick. The delayed enhancement images were obtained approximately 20 min after intravenous injection of 0.2 mmol/kg gadolinium diethylenetriaminepentaacetic acid. A cardiologist (with more than 10 years of experience) analyzed the delayed enhancement images and manually traced the MI area by the pattern of late gadolinium enhancement as the ground truth.

Implementation details. We implemented all of the code using Python and MATLAB R2015b on a Linux (Kylin 14.04) desktop computer with an Intel Xeon CPU E5-2650 and 32 GB DDR2 memory. The graphics card is an NVIDIA Quadro K600, and the deep learning models were implemented with Keras (Theano) using the RMSProp solver. The training time was 373 min, and the testing time was 191 s for each subject (25 images).

Performance evaluation criteria. We used three types of criteria to measure the performance of the classifier: (1) the receiver operating characteristic (ROC) curve; (2) the precision-recall (PR) curve; and (3) accuracy, where for pixel-level accuracy we assessed the classifier with a 10-fold cross-validation test, and for segment-level accuracy we used two-thirds of the data for training and the remaining data for testing.
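For reference, the accuracy, sensitivity, and specificity values reported in this section follow the standard confusion-matrix definitions, and a 10-fold split can be generated as in the sketch below. The helper names and the shuffled index-based split are assumptions; in particular, whether folds are formed over subjects or over pixels is not specified here.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Confusion-matrix metrics for labels where 1 = MI and 0 = normal.
    A class with no samples leads to a division by zero (NaN)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    tn = np.sum(~y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    return {
        "accuracy": (tp + tn) / y_true.size,
        "sensitivity": tp / (tp + fn),  # recall on the MI class
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index arrays for a shuffled k-fold split."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

metrics = binary_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
folds = list(k_fold_indices(114, k=10))
```

Sweeping the decision threshold on the softmax output and recomputing sensitivity, specificity, and precision at each threshold yields the points of the ROC and PR curves.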

Automatic localization of the LV. The experimental results show that OF-RNN can obtain good localization of the LV. We achieve an overall classification accuracy of 96.49%, with a sensitivity of 94.39% and a specificity of 98.67%, in locating the LV in the heart localization layers. We used an architecture similar to the Zeiler and Fergus model to pre-train the network. Using selective search's quality mode, we sweep over 2k proposals per image. Our results for the ROI localization bounding box from 2.85k CMR images were compared to the ground truth marked by the expert cardiologist. The ROC and PR curves are shown in Fig. 3(a, b).

MI area detection. Our approach can also accurately detect the MI area, as shown in Fig. 2. The overall pixel classification accuracy is 94.35%, with a sensitivity of 91.23% and a specificity of 98.42%. We used the softmax classifier by fine-tuning the motion statistical layers to assess each pixel (as normal/abnormal). We also compared our results to 16 regional myocardial segments (depicted as normal/abnormal) by following the American Heart Association standards. The accuracy performance for the apical slices was an average of 99.2%; for the mid-cavity slices, it was an average of 98.1%; and for the basal slices, an average of 97.9%. The ROCs and PRs of the motion statistical layers are shown in Fig. 3(a, b).

Fig. 3. (a, b) ROCs and PRs show that our results have good classification performance. (c, d) ROCs and PRs for local motion features and global motion features. (e) The accuracy and time for various patch sizes.

Local and global motion statistical features. A combination of local and global motion statistical features has the potential to improve the results because the features influence one another through a shared representation. To evaluate the effect of the motion features, we ran our framework with only the local features, with only the global features, and with both combined. Table 1 and Fig. 3(c, d) show that, in another 10-fold cross-validation test, combining the motion statistical features gives better accuracy, sensitivity, and specificity than using only the local or only the global motion features.

Table 1. Combined motion statistical features effectively improve the overall accuracy of our method

Size of patch. We use an N \(\times \) N patch to extract the local motion features from the whole image sequence. Because the displacements of the LV wall between two consecutive images are small (approximately 1 or 2 pixels/frame), it is necessary to adjust the size of the patch to capture sufficient local motion information. Figure 3(e) shows the accuracy and computational time of our framework, using from 3 \(\times \) 3 to 17 \(\times \) 17 patches in one 10-fold cross-validation test. We find that the 11 \(\times \) 11 patch size in our framework can obtain better accuracy in a reasonable amount of time.

Performance of the LSTM-RNN. To evaluate the performance of the LSTM-RNN, we replaced the LSTM-RNN with SVMrbf, SAE-3, DBN-3, CNN and RNN in our deep learning framework, and we ran these different frameworks over 114 subjects using a 10-fold cross-validation test. Table 2 reports the classification performance of these five alternative learning strategies: the RNN, Deep Belief Network (DBN), Convolutional Neural Network (CNN), SAE and Support Vector Machine with RBF kernel (SVMrbf). The LSTM-RNN shows better accuracy and precision than all of the other methods.

Fig. 4. A pair of frames at the beginning of systole (a) and at the end of systole (b), followed by the visual results of our deep optical flow (c) and Horn and Schunck (HS) optical flow (d) at pixel precision.

Table 2. LSTM-RNN works best in comparison with other models

Performance of the optical flow. The purpose of the optical flow is to capture the global motion features. To evaluate the performance of our optical flow algorithm with a deep architecture, we used the average angular error (AAE) to evaluate our deep optical flow and other optical flow approaches. The other optical flow methods, including the Horn and Schunck method, pyramid Horn and Schunck method, intensity-based optical flow method, and phase-based optical flow method, can be found in [12]. The comparison results are shown in Table 3, and visual examples are illustrated in Fig. 4.
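The average angular error can be computed as in the following sketch, which uses the common definition based on the angle between the space-time vectors (u, v, 1) of the estimated and reference flows. That this exact variant and the degree unit were used in the comparison is an assumption, and the function name is illustrative.

```python
import numpy as np

def average_angular_error(u, v, u_gt, v_gt):
    """Mean angle in degrees between (u, v, 1) and (u_gt, v_gt, 1)."""
    num = u * u_gt + v * v_gt + 1.0
    den = np.sqrt((u ** 2 + v ** 2 + 1.0) * (u_gt ** 2 + v_gt ** 2 + 1.0))
    # Clip guards against rounding slightly outside [-1, 1].
    cos = np.clip(num / den, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

# Toy example: a one-pixel horizontal flow compared with a zero reference.
u = np.ones((4, 4))
v = np.zeros((4, 4))
zero = np.zeros((4, 4))
aae_same = average_angular_error(zero, zero, zero, zero)
aae_shift = average_angular_error(u, v, zero, zero)
```

Appending the constant 1 keeps the measure well defined where the flow is zero, at the cost of weighting errors on small displacements more heavily than on large ones.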

Table 3. Deep optical flow (OF) can work better in comparison to other optical flow techniques in capturing global motion features

4 Conclusions

We have, for the first time, developed and presented an end-to-end deep-learning framework for the detection of infarction areas at the pixel level from CMR sequences. Our experimental analysis was conducted on 114 subjects, and it yielded an overall classification accuracy of 94.35% at the pixel level. These results demonstrate that our proposed method can aid in the clinical diagnosis and assessment of MI.