1 Introduction

The intervertebral disc is a fibrocartilage disc that connects adjacent vertebrae so that the spine can move within a certain angle. The IVDs have the nature of toughness and elasticity, and can be deformed under pressure, so that the force applied on the IVDs can be evenly distributed into all directions, and ensure the entire surface of vertebral is subjected to the same pressure. IVDs are also the main structure for absorbing shock. When the human body jumps, falls from a high place, and performs other vertical movements, or when the shoulders, back, and waist suddenly load heavy objects, the IVDs can buffer the force by conduction and self-deformation, hence plays the role of protecting the spinal cord and vital organs in the body.

However, with age, excessive activity or overload, it may lead to degeneration of the intervertebral disc, causing lower back pain, numbness of lower limb, nerve injury or even loss of movement, which will seriously affect work ability and life quality. Clinically, medical image analysis is usually the best non-invasive diagnostic method. In order to obtain quantitative parameters, doctors usually manually annotate the IVDs. However, for 3D images, this method is usually tedious, time-consuming, needs a lot of expertise and lack of reproducibility. Therefore, a fully automatic localization and segmentation algorithm of the intervertebral disc can offer visualized 3D reconstructed image and also provide quantitative parameters, which can greatly improve the speed as well as the quality of the diagnosis.

As magnetic resonance imaging has the properties of excellent sensitivity to soft tissue and no radiation, it is widely considered to be the best modality for disc disease diagnosis. Further more, the Dixon method can generate fat only and water only images by combining the in-phase and opposed-phase signal. Making full use of the image information from different modalities can improve the accuracy of the segmentation algorithm. The four modality Dixon sequences are showed in Fig. 1.

Fig. 1.
figure 1

Examples of multi-modality Dixon sequence, including in-phase, opposed-phase, fat and water from left to right. Each modality has different contrasts for specific components, making full use of multi-modality information can result in better contour segmentation. It should be noted that there are more than seven IVDs in the MR images, but only the lumber IVDs are our objective.

The task of this challenge has two parts, the localization and segmentation of intervertebral disc. The objective of segmentation is to obtain the binary mask of each IVD, i.e. each voxel in the image is classified into the disc category or non-disc category. The objective of localization is to obtain the coordinates of the centroid of each disc, which is calculated by the morphological center of each IVD mask. The segmentation algorithm affects both segmentation accuracy and localization accuracy, therefore a good segmentation algorithm is a prerequisite.

1.1 Related Work

In early studies, researchers typically used hand-crafted features [4, 14] based on image intensity or texture features for IVD localization and segmentation. Graph-based methods are commonly used in the segmentation of vertebrae and discs. For example normalized cut [2] and graph cut algorithm [1] were used for IVDs segmentation in spine MR images. And graphical models [5, 12] were used for IVD localization.

As learning-based approaches gain more and more attention in the medical image analysis field, several marginal spacing learning [9] and regression-based methods [3] are proposed for localize IVDs and segment IVDs. However, those methods were limited by the representation capability of the hand-crafted features.

Recently, deep learning methods have revolutionized medical image analysis and computer vision field with its remarkable feature representation capability. For example, Ronneberger et al. [13] proposed U-net for cell segmentation from 2D images and Dou et al. [6] proposed 3D convolutional neural network for 3D liver cancer segmentation. Deep learning methods also improve the performance of IVD localization and segmentation to a brand new level. For example, Li et al. [10] proposed a 3D multi-scale FCN with random modality dropout scheme to better utilize multi-modality information and achieved decent accuracy for IVD localization and segmentation.

1.2 Contribution

We propose a strong and robust deep learning framework for IVDs localization and segmentation from multi-modality MR images. The evaluation results from MICCAI 2018 Automatic Intervertebral Disc Localization and Segmentation from 3D Multi-modality MR Images demonstrated the effectiveness of our proposed framework. Our main contributions can be summarized as follows:

  • We delicately design a 2D fully convolutional network, which only performs downsampling for 2 times, and use densely connected atrous spatial pyramid pooling to capture multi-scale features as well as ensure large enough receptive field. The network consists of three separate pathways for different spatial resolution features, which makes the training of encoder more effective. Further more, a Squeeze-and-Excitation module are used for channel-wise attention. This network is a strong backbone that can be generalized to other medical image segmentation tasks.

  • We designed a 3D V-Net based localization network with a robust post-process scheme to classify the seven lumber disc into seven category and distinguish them from other thoracic discs, which makes the whole framework to be fully automated.

  • We proposed a novel and intuitive training strategy that can make the segmentation network focus on the spine region while ignore the interference from large and complex backgrounds.

  • Our method was evaluated on MICCAI 2018 IVDM3Seg dataset which consists of 16 sets of 3D multi-modality MR images from 8 subjects, and demonstrated superior performance.

Fig. 2.
figure 2

The pipeline of proposed framework for IVD localization and segmentation. The segmentation first perform binary segmentation to classify each voxel into disc and non-disc region. The localization network and post-process treat each disc instance as a categories and assign label from 1–7 from bottom to top.

2 Methodology

The pipeline of our framework for IVD localization and segmentation are illustrated in Fig. 2. Our localization and segmentation framework mainly consists of two parts: the segmentation network, the localization network and the post-process scheme. The objective of segmentation is to output the binary masks of each IVD, however, because of the similarity between thoracic discs and lumber discs, the segmentation network will predict more than 7 IVD masks, though only 7 lumber discs have annotation. To obtain the final result and achieve the purpose of fully automation, we designed a V-Net based localization network which treats each IVD as an instance, i.e. performs 7 class segmentation, and then used a post-processing method to increase the robustness of the localization network.

Fig. 3.
figure 3

The proposed 2D fully convolutional segmentation network, it takes the 4-channel concatenated multi-modality image as input, and outputs the binary mask of IVDs.

2.1 Segmentation Network

In recent years, convolutional neural networks have revolutionized the field of computer vision and medical image analysis. 2D CNNs based methods have made great progress on medical images compared to traditional methods. Recently, 3D CNNs [6, 11] are explored as they can capture volumetric contextual information and have better representation capability. However, 3D CNNs essentially have a disadvantage compared to 2D CNNs, they have a greater demand for data, as 3D CNNs treat a volumetric image as a single sample while 2D CNNs treat each slice as a single sample. There are only 16 samples in the training set, therefore, we think the 2D network is more suitable for this task.

U-net [13] is one of the most successful 2D convolutional neural networks in medical image analysis, many previous deep learning methods are modified based on it. U-net has a symmetric Encoder-Decoder structure, the encoder encodes multi-scale information into feature maps by four downsamplings. The decoder then reconstruct spatial resolution from high-level feature maps by upsampling or deconvolution, while high-resolution features are also concatenated by short connection from encoder to assist reconstruction. However, this structure has three inherent defect for semantic segmentation. First, too many times of downsampling leads to the loss of detail information, although the high-resolution feature maps are used in the reconstruction process, but this low-level feature concatenate and feature fusion can only slightly alleviate the problem. Second, UNet captures multi-scale features by downsapling, which results in capturing only fixed and limited scales of features, making it difficult to represent complex and variable anatomical structures. Third, during the gradient back propagation in the training phase, the encoder will receive two gradient signals from different resolutions, one is the low resolution gradient signal from below, and the other is the same resolution gradient signal from the shortcut connection. It cannot be guaranteed that the two path have the same magnitude of gradient signal because the number of convolution layers on different paths is quite different. The mixing of these two signals in the training process will affect the effectiveness of the encoder training.

To solve the problems mentioned above, we elaborately design our segmentation network, see in Fig. 3. We use a strong backbone network, which is based on DenseNet [8] and uses Squeeze-and-Excitation module [7] as channel-wise attention. For the first problem, reduce the number of downsampling is a intuitive solution. In the trade-off between the GPU memory usage and segmentation accuracy, our network only perform two downsamplings, it can effectively reduce the information loss, and improve the segmentation accuracy of the detailed edge region of discs. However, such structure has a disadvantage that the network can only fuse less scales of features, moreover, the receptive field of convolution kernel become smaller, which makes it difficult to capture more global and high-level features. Therefore, we further use densely connected dilated convolution to solve this problem, or use another name, densely connected atrous spatial pyramid pooling (ASPP). Compared with the serial connected or parallel connected [15] counterpart, densely connected ASPP combines arbitrary scales of features, which can be adjusted by dilation rate, and better feature reuse. In our model, we use the dilation rate of 3, 6, 12, 18 and 24. For the third problem, inspired by some works on multi-task learning, we design three separate paths to handle different resolution signal, i.e. treat each resolution signal path as a single task. This approach can train each path more effectively without interfering with each other.

Fig. 4.
figure 4

The pipeline of localization network and the post-process scheme. (Color figure online)

2.2 Localization Network and Post-process

Although the segmentation network is trained only with 7 lumbar discs annotation, the network predicts more than 7 IVDs because of the similar anatomy pattern of thoracic disc and the lumbar disc. We design a localization network and a post-process scheme to handle the output of the segmentation network, and fully automatically get the target mask of 7 lumbar discs. The structure is shown in Fig. 4. The localization network has a V-Net structure, which is a 3D fully convolutional neural network with residual connection. The ground truth annotation of the localization network is obtained by marking the mask of the 7 IVDs in the original annotation from 1 to 7 from bottom to top, that is, the localization network output 8 channels score map, including seven IVDs and one background.

Then the prediction from localization network and prediction from segmentation network are intersected together. Due to the similar appearance of the IVDs, the predicted mask from localization network in the often have misclassified areas in the upper part of images, but the segmentation of the bottom disc is always right. We then use a reference image from training set as moving image to be registered to the predicted mask, and fit the centerline of spine from the centroid of each disc in the registered mask, i.e. the red line shown in Fig. 4. At last, we calculate the connected area of the predicted mask. Only the connected area that intersects with the fitted centerline of the spine is retained. The other connected regions are set as background. Then, the reserved connected region is assigned with label from 1 to 7 from bottom to top.

This localization and post-processing strategy can greatly improve the robustness of the framework, even if there are some misclassified outliers in the segmentation network, it will not affect the identify of IVDs.

2.3 Training Strategy

To further improve the performance of the segmentation network, we made a natural assumption.

Assumption. Only the spine part of the entire input image is useful for IVD segmentation, while region outside the spine only acts as a useless background, which will reduce the accuracy of the segmentation performance.

We first train a UNet to predict the spine area, where the label was generated by calculating the convex hull of the annotation of discs after several dilation operations. When training the segmentation network, the predicted mask from the UNet was used, and we ignore the loss outside the spine region. In the inference phase, the spine region is also predicted, and all the region outside the spine is set as background in the output of segmentation network.

2.4 Loss Function

When training segmentation network, focal loss was used for better focus on hard samples, i.e. the boundary region of IVDs, and the formula is as follow:

$$\begin{aligned} L(p)=-\alpha (1-p)^\gamma log(p), \end{aligned}$$
(1)

Since the use of focal loss may cause instability problems when training, we first train several epochs using cross entropy loss, then use focal loss.

3 Experiments

3.1 Dataset and Data Augmentation

We evaluated our proposed method on the dataset from MICCAI 2018 IVDM3Seg Challenge using both cross validation on the training data and independent test data on the on-site challenge, where training data consists of 16 sets of 3D multi-modality MR images from 8 subjects, and test data consists of 8 sets of 3D multi-modality MR images from 4 subjects. Each subject was scanned with a 1.5-Tesla MRI scanner of Siemens using Dixon protocol. The voxel spacing of each image is 2 mm \(\times \) 1.25 mm \(\times \) 1.25 mm. For the data augmentation, we use 3D deformation, random scale, random noise, and random crop.

3.2 Evaluation Metrics

Dice overlap coefficient measures the percentage of correctly segmented voxels. Dice is computed by

$$\begin{aligned} Dice(A,B)=\frac{2\mid A\cap B\mid }{\mid A\mid + \mid B\mid }\times 100\%, \end{aligned}$$
(2)

where A is the sets of foreground voxels in the ground-truth data and B is the corresponding sets of foreground voxels in the segmentation result, respectively.

Average absolute distance (ASD) is a metric measures the average absolute distance from the ground truth disc surface and the segmented surface. Smaller average absolute distance means better segmentation accuracy.

Table 1. Dice overlap coefficient of independent test set in on-site challenge.

Localization distance R is computed by

$$\begin{aligned} R=\sqrt{(\varDelta x)^2+(\varDelta y)^2+(\varDelta z)^2}, \end{aligned}$$
(3)

where \(\varDelta x\), \(\varDelta y\), \(\varDelta z\) is the absolute difference between the identified IVD center and the ground truth IVD center calculated from the ground truth segmentation in X, Y and Z axis. Smaller localization distance means better segmentation accuracy.

3.3 Results of MICCAI 2018 and Training Set Cross Validation

The evaluation result of the on-site challenge of MICCAI 2018 IVDM3Seg are listed in Tables 1, 2 and 3. Our method demonstrated good performance and strong robustness. Since the test data are not available to us, the segmentation result are visualized using training set cross validation, see in Fig. 5.

Table 2. Average absolute distance of independent test set in on-site challenge.
Table 3. Average absolute distance of independent test set in on-site challenge.
Fig. 5.
figure 5

Visualization of one subject in training set cross validation, the green line is the prediction of our approach, read line is ground truth and yellow line is the intersection. These images are from subject 3, the images in the first row is obtained in the first phase, while the second row is obtained in the second phase. (Color figure online)

4 Conclusion

In this paper, we present our novel and robust IVD segmentation and localization framework from multi-modality MR images, which achieve state-of-the-art performance. The delicately designed segmentation network can preserve the detailed information as much as possible by reducing the number of downsamplings, and at the same time, using densely connected atrous spatial pyramid pooling to capture and fuse multi-scale information as well as reserve large enough receptive field, which can greatly enhance the feature representation ability of the network. We also design three separate paths to handle different resolution signal to train each path more effectively. A new training strategy is also proposed to prevent the segmentation network from interfered by the large complex background. Furthermore, we propose a localization network with robust post-process scheme to distinguish thoracic discs and lumber discs. The result of MICCAI 2018 challenge on IVD localization and segmentation demonstrated the effectiveness of our proposed method.