
1 Introduction

Approximately 3.6 billion diagnostic radiological examinations, such as radiographs (x-rays), are performed globally every year [1]. Chest radiographs are performed to evaluate the lungs, heart and thoracic viscera, and are crucial for diagnosing various lung disorders at all levels of health care. Computer-aided diagnostic (CAD) tools play an important role in assisting radiologists with the growing number of chest radiographs. Accurate segmentation of anatomical structures in chest radiographs is essential for many analysis tasks in CAD. For example, segmentation of the lungs field can help detect lung diseases and shape irregularities; segmentation of the heart outline can help predict cardiomegaly; and segmentation of the clavicles can improve the diagnosis of pathologies near the apex of the lung.

Evaluating a chest radiograph is a challenging task due to the high variability between patients, unclear and overlapping organ borders, and image artifacts; a clear, high-quality radiograph is not easy to acquire. Over the years, this challenge has drawn many researchers to improving the segmentation of anatomical structures in chest radiographs [2,3,4,5]. An open benchmark dataset provided by van Ginneken et al. [6] has facilitated an objective comparison between the different segmentation methods. Classic approaches include active shape and appearance models, pixel classification methods, hybrid models and landmark-based models. More recently, deep learning approaches have been suggested [2, 3], building on the successful employment of convolutional neural networks (CNNs) in various detection and segmentation tasks in the medical imaging domain [7].

CNN architectures for semantic segmentation usually incorporate encoder and decoder networks [8, 9]: the encoder reduces the resolution of the image to capture its most salient features, and the decoder then restores the original resolution. Another approach is to maintain the resolution throughout the network by incorporating dilated convolutions [10], which enlarge the global receptive field of the CNN to capture larger context. In both approaches, the CNN can output single-class or multi-class segmentation masks, with the output mask at the same resolution as the input radiograph. The training process of each CNN is affected by several choices. One is the selection of the loss function that guides the optimization during training, with different loss functions affecting the final segmentation performance differently. Another is the initialization of the network weights: random initialization, or weights transferred from a network trained on a different task (transfer learning).

In this paper, we explore the segmentation of anatomical structures in chest radiographs, namely the lungs field, the heart and the clavicles, using a set of the most advanced CNN architectures for multi-class semantic segmentation. We propose an improved encoder-decoder style CNN with pre-trained weights in the encoder network and show its superiority over other state-of-the-art CNN architectures. We further examine the use of multiple loss functions for training the best selected network, and the effect of multi-class vs. single-class training. We present qualitative and quantitative comparisons on a common benchmark dataset based on the JSRT database [11]. Our best performing model, the U-Net with an ImageNet pre-trained encoder, outperformed the current state-of-the-art segmentation methods for all anatomical structures.

2 Methods

2.1 Fully Convolutional Neural Network Architectures

Fully convolutional networks (FCNs) are extensively used for semantic segmentation tasks. In this study, four different state-of-the-art architectures were tested, as follows:

FCN - The first FCN architecture that we used in this work is based on the FCN-8s net built on the VGG-16 layer net [9, 12]. The VGG-16 net is converted into an FCN by decapitating the final classification layer and converting the fully connected layers into convolutions. Deconvolution layers are then used to upsample the coarse outputs to pixel-dense outputs. Skip connections merge outputs from earlier pooling layers in the network, which was shown to improve segmentation quality [9].

Fully Convolutional DenseNet - The second network architecture tested is based on the fully convolutional DenseNet presented in [13]. The DenseNet architecture [14] proposes intensive layer fusion: each dense block consists of a set of convolution layers operating at the same scale, where each convolution layer processes the concatenation of all its previous layers, thus enabling the fusion of numerous representation levels. For the fully convolutional DenseNet, a decoding path is added to generate the segmentation output. The fusion between different layers consists of fusion of layers within each dense block, as well as concatenation of the preceding high-level feature maps with those coming from the encoding block at the same scale.

Dilated Residual Networks - The dilated residual network (DRN) [10] uses dilated convolutions [15] to increase the resolution of the output feature maps without reducing the receptive field of individual neurons. It was shown to improve performance compared to the standard residual networks presented in [16]. We implemented the DRN-C-26 as described in [10].

U-Net with VGG-16 Encoder - The U-Net architecture [8] has been extensively used for image-to-image tasks in computer vision, with a major contribution to the image segmentation task. The U-Net includes a contracting path (the encoder) with several layers of convolution and pooling for down-sampling. The second half of the network is an expansion path (the decoder) that applies up-sampling and convolution layers sequentially to generate an output of the same size as the input image. Additionally, the U-Net combines encoder features with decoder features at different levels of the network using skip connections. Iglovikov et al. [17] proposed using a VGG-11 [12] encoder pre-trained on the ImageNet dataset [18] and showed that it can improve the standard U-Net performance in binary segmentation of buildings in aerial images. A similar concept was used in the current study with the more advanced VGG-16 [12] as the encoder. Figure 1 shows a diagram of our proposed network. The chest X-ray image is duplicated to obtain a 3-channel input, similar to the RGB images used as input to the VGG-16 net (which serves as the encoder in the proposed architecture).

Fig. 1. The proposed U-Net architecture with a VGG-16 based encoder.
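
The exact decoder configuration is given in Fig. 1; as a rough illustration, a minimal PyTorch sketch of the described idea might look as follows, reusing the ImageNet pre-trained VGG-16 convolutional stages as the encoder and attaching a decoder with skip connections. The decoder channel widths and the bilinear up-sampling are our illustrative assumptions, not the paper's exact specification.

```python
import torch
import torch.nn as nn
from torchvision import models

class DecoderBlock(nn.Module):
    """Up-sample, concatenate the encoder skip features, then convolve."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x, skip):
        return self.conv(torch.cat([self.up(x), skip], dim=1))

class UNetVGG16(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        # ImageNet pre-trained VGG-16 (torchvision >= 0.13 weights API).
        feats = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features
        # The five convolutional stages of VGG-16; skips are taken before each pool.
        self.enc1 = feats[:4]     # -> 64 channels,  224x224
        self.enc2 = feats[4:9]    # -> 128 channels, 112x112
        self.enc3 = feats[9:16]   # -> 256 channels, 56x56
        self.enc4 = feats[16:23]  # -> 512 channels, 28x28
        self.enc5 = feats[23:30]  # -> 512 channels, 14x14 (bottleneck)
        self.dec4 = DecoderBlock(512, 512, 256)
        self.dec3 = DecoderBlock(256, 256, 128)
        self.dec2 = DecoderBlock(128, 128, 64)
        self.dec1 = DecoderBlock(64, 64, 32)
        self.head = nn.Conv2d(32, num_classes, kernel_size=1)

    def forward(self, x):
        if x.shape[1] == 1:              # grayscale X-ray duplicated to 3 channels
            x = x.repeat(1, 3, 1, 1)
        s1 = self.enc1(x)
        s2 = self.enc2(s1)
        s3 = self.enc3(s2)
        s4 = self.enc4(s3)
        x = self.enc5(s4)
        x = self.dec4(x, s4)
        x = self.dec3(x, s3)
        x = self.dec2(x, s2)
        x = self.dec1(x, s1)
        return torch.sigmoid(self.head(x))  # one score map per class
```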

2.2 Objective Loss Functions

The loss function guides the training process of a convolutional network by measuring the compatibility between the network prediction and the ground-truth label. Let us denote S as the estimated segmentation mask and G as the ground-truth mask. In a multi-class semantic segmentation task with \(C = \{c_1,...,c_m\}\) classes, the total loss (TL) between S and G is defined as the sum of the losses over all classes:

$$\begin{aligned} TL(S,G)=\sum _{c=1}^{m}L_c(S,G) \end{aligned}$$
(1)

In this study we explore the influence of different loss functions on the FCN training process. The Dice similarity coefficient (DSC) and the Jaccard similarity coefficient (JSC) are two well-known segmentation measures that can also serve as objective loss functions in training. These measures between S and G are defined as:

$$\begin{aligned} DSC(S,G) = 2\frac{|SG|}{|S|+|G|} \end{aligned}$$
(2)
$$\begin{aligned} JSC(S,G) = \frac{|SG|}{|S|+|G|-|SG|} \end{aligned}$$
(3)

When used as losses in training, both measures weight false positive (FP) and false negative (FN) detections equally. The Tversky loss [19] introduces weighting into the loss function for highly imbalanced data, where we want to segment small objects. The Tversky index is defined as:

$$\begin{aligned} Tversky(S,G;\alpha ,\beta )=\frac{|SG|}{|SG|+\alpha |S \setminus G|+\beta |G \setminus S|} \end{aligned}$$
(4)

where \(\alpha \) and \(\beta \) control the magnitude of penalties for FPs and FNs, respectively. In our study we used \(\alpha =0.3\) and \(\beta =0.7\).

An additional loss function tested is the binary cross-entropy (BCE), calculated separately for each class segmentation map. For each pixel \(s_i\in S\) and pixel \(g_i\in G\) sharing the same position i, the loss is averaged over all N pixels as follows:

$$\begin{aligned} BCE(S,G)=-\frac{1}{N}\sum _{i=1}^N \left[ g_i\log (s_i) + (1-g_i)\log (1-s_i)\right] \end{aligned}$$
(5)
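
For concreteness, the losses of Eqs. (1)-(5) can be written over soft network predictions as in the following PyTorch sketch. Here s and g are tensors of shape (batch, classes, H, W), and the eps smoothing term is an added numerical-stability assumption, not part of the definitions above.

```python
import torch
import torch.nn.functional as F

def soft_dice_loss(s, g, eps=1e-6):
    """1 - DSC, per Eq. (2); |SG| becomes the sum of the elementwise product."""
    inter = (s * g).sum(dim=(-2, -1))
    return 1 - (2 * inter + eps) / (s.sum(dim=(-2, -1)) + g.sum(dim=(-2, -1)) + eps)

def tversky_loss(s, g, alpha=0.3, beta=0.7, eps=1e-6):
    """1 - Tversky index, per Eq. (4); alpha penalizes FPs, beta penalizes FNs."""
    inter = (s * g).sum(dim=(-2, -1))
    fp = (s * (1 - g)).sum(dim=(-2, -1))  # soft |S \ G|
    fn = ((1 - s) * g).sum(dim=(-2, -1))  # soft |G \ S|
    return 1 - (inter + eps) / (inter + alpha * fp + beta * fn + eps)

def bce_loss(s, g):
    """Eq. (5): BCE averaged over all pixels of each class map."""
    return F.binary_cross_entropy(s, g, reduction='none').mean(dim=(-2, -1))

def total_loss(s, g, per_class_loss=soft_dice_loss):
    """Eq. (1): sum of the per-class losses, averaged over the batch."""
    return per_class_loss(s, g).sum(dim=1).mean()
```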

3 Segmentation of Anatomical Structures

3.1 Dataset

Evaluation of the chest anatomical structure segmentation was done on chest radiographs from the JSRT database [11]. This public database includes 247 posterior-anterior (PA) chest radiograph images of size \(2048\times 2048\) pixels, with 0.175 mm pixel spacing and 12-bit gray levels. Van Ginneken et al. [6] published the Segmentation in Chest Radiographs (SCR) database, a benchmark set of segmentation masks for the lungs field, heart and clavicles (see Fig. 2). The annotations were made by two human observers and a radiologist consultant. The segmentations of the first observer serve as the ground-truth masks; those of the second observer serve as the human-observer reference results. The benchmark data is split into two folds of 124 and 123 cases, each containing an equal number of normal cases and cases with lung nodules. Following the suggested instructions for comparison between segmentation results, images from one fold were used for training and images from the other fold for testing, and vice versa. The final evaluation is defined as the average performance over the two folds.

Fig. 2. Data sample from [6]: (a) chest radiograph image; (b) clavicles segmentation mask; (c) lung segmentation mask; (d) heart segmentation mask.

For training, we resize the images to \(224\times 224\) pixels and normalize each image by its mean and standard deviation. The networks are trained using the Adam optimizer with an initial learning rate of \(10^{-5}\) and default parameters for 100 epochs. We use augmentations of scaling, translation and small rotations. In testing, we threshold the output score maps at 0.25 to generate a binary segmentation mask for each anatomical structure.
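
A minimal training and inference sketch matching this setup, reusing the UNetVGG16 and total_loss sketches above; the random tensors stand in for the preprocessed JSRT images and their three-channel ground-truth masks, and augmentation is omitted for brevity.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in data: replace with the resized, normalized JSRT images and
# their 3-channel (lungs field / heart / clavicles) ground-truth masks.
images = torch.randn(8, 1, 224, 224)
masks = torch.randint(0, 2, (8, 3, 224, 224)).float()
loader = DataLoader(TensorDataset(images, masks), batch_size=4, shuffle=True)

model = UNetVGG16(num_classes=3)                           # sketch from Sect. 2.1
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # default betas

model.train()
for epoch in range(100):
    for x, y in loader:          # scale/translate/rotate augmentation would go here
        optimizer.zero_grad()
        loss = total_loss(model(x), y)   # multi-class Dice loss, Eq. (1)
        loss.backward()
        optimizer.step()

# Testing: threshold the per-class score maps at 0.25 to obtain binary masks.
model.eval()
with torch.no_grad():
    binary_masks = (model(images) > 0.25).float()
```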

3.2 Performance Measures

To measure the performance of the proposed architectures and compare with state-of-the-art results, we use well-accepted segmentation metrics: the Dice similarity coefficient, the Jaccard index (also known as intersection over union) and the mean absolute contour distance (MACD). MACD measures the distance between two contours. For each point on contour A, the distance to the closest point on contour B is computed as the Euclidean distance \(d(a_{i},B) = \min _{b_{j}\in B}||b_{j} - a_{i}||\). The distance values are then averaged over all points. Since the distances from A to B are not the same as from B to A, we take the average of the two directed means as follows:

$$\begin{aligned} MACD(A,B)=\frac{1}{2}(\frac{\sum _{i=1}^{n} d(a_{i},B)}{n} + \frac{\sum _{i=1}^{m} d(b_{i},A)}{m}) \end{aligned}$$
(6)

Because the MACD is given in millimeters, we multiply the original pixel spacing by a factor of 2048/224 to match the resized image resolution.
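
One way to compute MACD on binary masks is via the Euclidean distance transform, which gives every pixel its distance to the nearest contour pixel of the other mask. Below is a sketch assuming boolean NumPy masks at the \(224\times 224\) working resolution; the spacing factor follows the rescaling just described.

```python
from scipy.ndimage import binary_erosion, distance_transform_edt

def contour(mask):
    """One-pixel-thick boundary of a binary mask."""
    mask = mask.astype(bool)
    return mask & ~binary_erosion(mask)

def macd(mask_a, mask_b, spacing_mm=0.175 * 2048 / 224):
    """Eq. (6): symmetric mean absolute contour distance, in millimeters."""
    ca, cb = contour(mask_a), contour(mask_b)
    # Distance transform of the complement = distance to the nearest contour pixel.
    dist_to_b = distance_transform_edt(~cb)
    dist_to_a = distance_transform_edt(~ca)
    d_ab = dist_to_b[ca].mean()   # mean over the n points of contour A
    d_ba = dist_to_a[cb].mean()   # mean over the m points of contour B
    return 0.5 * (d_ab + d_ba) * spacing_mm
```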

3.3 Experimental Results

Table 1 compares the segmentation performance of the four state-of-the-art fully convolutional networks for semantic segmentation listed in Sect. 2.1. All models are trained for multi-class segmentation into three classes: lungs field, heart and clavicles. We use the sigmoid activation function after the last layer of each network, with Dice as the loss function. An additional column in Table 1 indicates whether the network is fine-tuned (FT) from a pre-trained network.

The results show that the best performing architecture for the segmentation of all anatomical structures in chest radiographs is the U-Net with the VGG-16 encoder pre-trained on ImageNet. This architecture achieved the highest segmentation overlap scores (Jaccard) of 0.961, 0.906 and 0.855 for the lungs field, heart and clavicles, respectively. Notably, among all four architectures, the fine-tuned networks performed better than the networks trained from scratch.

Table 1. Segmentation results of the four compared architectures trained with multi-class Dice loss, showing the Dice (D), Jaccard (J) and MACD metrics. Fine-tuned (FT) architectures include a pre-trained VGG-16 as the initial encoder.

For the top performing architecture, the U-Net based network, we further analyzed several training features. Table 2 summarizes the multi-class segmentation performance using different objective loss functions. It is evident that structures with a smaller pixel area, like the clavicles, benefit from loss metrics with pixel weighting, such as the Tversky loss function. We also tested the performance of training a single-class network for each of the three classes vs. multi-class training. For the lungs, single-class training did not result in a significant improvement. However, for the heart and clavicles, the Dice and Jaccard scores with single-class training each improved by 1% in comparison to multi-class training. A final improvement in multi-class segmentation performance was achieved using post-processing consisting of small-object removal and hole filling (sketched below). While the Dice and Jaccard metrics did not improve, the MACD improved from 1.121, 2.569 and 0.871 mm for the lungs, heart and clavicles to 1.019, 2.549 and 0.856 mm, respectively. Figure 3 shows a few segmentation examples of our best performing model. A comparison of our U-Net based model trained with multi-class Dice loss to existing state-of-the-art methods and a human observer, validated on the same benchmark of chest radiographs, is presented in Table 3.
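
The post-processing is described here only as small-object removal and hole filling; one common implementation uses scikit-image and SciPy morphology, as in the following sketch (the min_size threshold and per-channel application are our assumptions).

```python
from scipy.ndimage import binary_fill_holes
from skimage.morphology import remove_small_objects

def postprocess(mask, min_size=256):
    """Remove small spurious components, then fill internal holes."""
    cleaned = remove_small_objects(mask.astype(bool), min_size=min_size)
    return binary_fill_holes(cleaned)

# Applied independently to each thresholded class map, e.g.:
# lungs, heart, clavicles = (postprocess(m) for m in binary_masks)
```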

Table 2. Multi-class segmentation results using different loss functions: DSC, JSC, Tversky and BCE (rows). The Dice (D), Jaccard (J) and MACD metrics (columns) are reported for each anatomical structure.
Fig. 3. Segmentation results of our best performing architecture, with the Jaccard score above each image for the lungs (L), heart (H) and clavicles (C). Ground-truth segmentation is shown in blue, CNN segmentation in red, and the overlap (true detections) in green. (Color figure online)

Table 3. Our best performing architecture compared to state-of-the-art models; “-” means the score was not reported; (*) used a different data split than suggested in the SCR benchmark.

4 Discussion and Conclusion

Segmentation of anatomical structures in chest radiographs is a challenging task that has attracted considerable interest over the years. The advantages of newly introduced CNN architectures, together with the public benchmark dataset provided in [6] for the JSRT images, motivated further studies in this field. Some recent studies focused only on the problem of lung segmentation, and a few have also dealt with heart and clavicle segmentation. In this paper, we evaluated the segmentation performance of four top FCN architectures for semantic segmentation [9, 10, 13, 17] on all three anatomical structures, using multi-class Dice loss.

The network architectures presented in this study are well known and have shown promising results in many computer vision semantic segmentation tasks. The FCN [9] and the U-Net [8] are considered classical approaches, while the FC-DenseNet and the DRN are more recent approaches for semantic segmentation. Hence, it was interesting to see in Table 1 that the classic U-Net and FCN showed superior segmentation performance over the more recent approaches. The advantage of using pre-trained networks for medical imaging tasks has already been shown in several studies [7]; even though only the encoder part of the FCN and U-Net (VGG-16 encoder) networks was pre-trained on the ImageNet database in our case, it still proved advantageous. The best segmentation performance was obtained using the proposed U-Net based architecture with the pre-trained VGG-16 encoder (Table 1).

Next, we explored the effect of training the multi-class segmentation model with different loss functions (Table 2). We demonstrated that small structures such as the clavicles can benefit from weighted loss functions such as the Tversky loss, while the larger structures (lungs and heart) achieved the best segmentation results using the Dice or binary cross-entropy loss functions. Applying additional minor post-processing resulted in a further decrease of the MACD measure, with cleaner and more precise segmentations for all three structures, as displayed in Fig. 3.

Table 3 presents the final comparison between our top selected model, the multi-class U-Net VGG-16 with Dice loss, and state-of-the-art methods [2,3,4,5,6] as well as human observer segmentations [6]. Our model outperformed all state-of-the-art methods tested in this study, and the human observer, for the lungs and heart segmentation. For clavicle segmentation, fewer studies have been conducted. Novikov et al. [2] reported results on a different data split than the benchmark recommendation, so it is not an objective comparison; nevertheless, our proposed network outperformed an additional top reported method [6].

In conclusion, we presented an experimental study in which four top segmentation architectures and several loss functions were compared for the task of segmenting anatomical structures in chest X-ray images. Results were evaluated quantitatively, with qualitative examples from our best performing model. Improving the segmentation of the lungs field, heart and clavicles lays the foundation for better CAD tools and for the development of new applications in medical thoracic image analysis.