1 Introduction

According to the World Health Organization, “cancer is a leading cause of death worldwide, accounting for an estimated 9.6 million deaths in 2018” [1]. Lung cancer is by far the most common form of cancer, which makes early detection and treatment all the more impactful. Early detection can lead to a greater probability of survival, less morbidity, and less expensive treatment. A traditional form of early detection involves radiologists manually screening low-dose computed tomography (CT) scans in search of potentially cancerous lesions or nodules in the lungs. The process can be tedious, time-consuming, and even error-prone. We propose an automatic system of deep learning methods for the detection, segmentation, and classification of pulmonary nodules to potentially take some of the burden off health professionals. The system was developed as part of the LNDb challenge, which is composed of a main challenge and three sub-challenges [2].

The main challenge involves predicting a patient’s follow-up recommendation according to the 2017 Fleischner Society pulmonary nodule guidelines [3]. Given a chest CT, the system must predict one of four follow-up recommendation classes: 0) No routine follow-up required or optional CT at 12 months according to patient risk; 1) CT at 6–12 months required; 2) CT at 3–6 months required; 3) CT, PET/CT or tissue sampling at 3 months required [2]. The four recommendations take into account the number of nodules in a chest CT, their volume, and their texture. The three sub-challenges focus on predicting these three attributes. The first sub-challenge, nodule detection, detects pulmonary nodules in chest CTs. The second sub-challenge, nodule segmentation, segments pulmonary nodules for the purpose of calculating volumes. Lastly, the third sub-challenge, nodule texture characterization, classifies nodules into one of three texture classes [2]. The method we propose participated in all three sub-challenges as well as the main challenge to produce patient follow-up recommendations.

2 Dataset

The LNDb dataset consists of 294 chest CTs along with radiologist annotations. Each CT contains at most six nodules and all nodules have in-slice diameters of at most 30 mm. The annotations provide centroid coordinates and texture ratings for all nodules and segmentations for nodules greater than 3 mm. Of the 294 CTs, 58 CTs and their associated annotations have been withheld for the test dataset by the LNDb challenge [2] (Fig. 1).

Fig. 1. A sample CT from the dataset with a nodule identified using radiologist annotations.

Automatic analysis of CTs poses a unique challenge because CTs are essentially 3D arrays that vary in size: both the number of slices and the in-plane size of each slice differ across CTs. Each value in the 3D array is a single-channel Hounsfield unit (HU) value rather than a typical multi-channel RGB value. These characteristics, along with the fact that nodules are at most 30 mm in diameter, drove the pre-processing steps we took.

3 Method

3.1 Data Exploration and Pre-processing

A graphics card with 8 GB of video memory was used to train three neural networks. Given the memory constraints, the networks had to be trained on smaller 3D patches of CTs. As such, a second dataset of 3D CT patches was derived from the original dataset. A number of pre-processing steps were taken to create the new dataset.

Resampling.

As part of data exploration, we found that the 3D pixel size, or voxel size, varied across CTs: 0.4328 mm–0.8945 mm along the x-axis, 0.4328 mm–0.8945 mm along the y-axis, and 0.5000 mm–1.4000 mm along the z-axis. To avoid issues when training our convolutional neural networks (CNNs), all CTs were resampled so that each voxel had an isotropic size of 0.6375 mm × 0.6375 mm × 0.6375 mm. This size was chosen because it aligns with what the LNDb challenge expects for the segmentation task submission and because it lies in the middle of the observed ranges for each axis.
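A minimal sketch of this resampling step is shown below, assuming SciPy; the paper does not name its implementation, so the function and constant names are illustrative.

```python
import numpy as np
from scipy.ndimage import zoom

TARGET_SPACING = 0.6375  # mm, isotropic target voxel size

def resample_to_isotropic(ct: np.ndarray, spacing: tuple) -> np.ndarray:
    """Resample a CT volume from its native (z, y, x) voxel spacing in mm
    to an isotropic 0.6375 mm grid."""
    factors = [s / TARGET_SPACING for s in spacing]
    # Linear interpolation (order=1) keeps interpolated HU values plausible.
    return zoom(ct, factors, order=1)
```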

Normalization.

As mentioned earlier, CTs are 3D arrays of HU values. We plotted the relative frequency distribution of HU for all CTs overlaid with the relative frequency distribution of HU for all radiologist-segmented nodules. The latter was calculated by applying the masks found in the original dataset to each CT and then counting HU values for only the masked voxels.

Fig. 2. The distribution of HU of CTs overlaid with the distribution of HU of nodules.

Nodules were concentrated in the HU range between −1000 and 500 (Fig. 2). As a result, all resampled CTs were clipped so that HU values less than −1000 were replaced with −1000 and HU values greater than 500 were replaced with 500. Next, the mean HU value, −477.88, was calculated across CTs, and all CTs were zero-centered by subtracting it. Finally, to save disk space, all values were min-max normalized between 0 and 255 so that they could be stored as unsigned 8-bit integers.
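The whole normalization pipeline fits in a few lines; the sketch below assumes NumPy and uses the constants reported above (the exact code used in the paper is not published).

```python
import numpy as np

HU_MIN, HU_MAX = -1000, 500
MEAN_HU = -477.88  # dataset mean reported above

def normalize(ct: np.ndarray) -> np.ndarray:
    """Clip to the HU range where nodules concentrate, zero-center with
    the dataset mean, then min-max scale to uint8 to save disk space."""
    ct = np.clip(ct.astype(np.float32), HU_MIN, HU_MAX)
    ct -= MEAN_HU
    lo, hi = HU_MIN - MEAN_HU, HU_MAX - MEAN_HU  # value bounds after centering
    ct = (ct - lo) / (hi - lo) * 255.0
    return ct.astype(np.uint8)
```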

Patch Extraction.

The final step in creating the derived dataset was to extract equal-sized patches from CTs. By extracting patches and storing them on disk, our neural networks would be able to load them in real time during training. We chose patches of size 51 mm × 51 mm × 51 mm (80 × 80 × 80 voxels at the resampled spacing) because they encapsulate the largest nodules (30 mm), fit in GPU memory, and align with what the LNDb challenge expects for the segmentation task submission.

We first extracted all patches that had radiologist annotations: 768 patches containing nodules and 451 patches containing non-nodule pulmonary lesions. For each annotated patch, we also extracted the corresponding segmentation patches to be used for the segmentation task. For the detection task, we needed more negative patches containing no nodules so that the dataset better represented a full CT. Patches inside the lungs are harder for a detection network to classify than patches outside of the lungs. We therefore leveraged the provided nodule centroid coordinates to extract negative patches at varying distance thresholds from positive patches, each threshold with its own probability of acceptance.

Specifically, centroid coordinate candidates were generated for each CT with a stride of 12.75 mm, yielding over 15,000 negative patch candidates per CT. For each candidate, we calculated the distance to the closest positive nodule centroid. Candidates less than 70 mm from a nodule were randomly accepted 10% of the time, candidates between 70 mm and 100 mm from a nodule were randomly accepted 5% of the time, and candidates more than 100 mm from a nodule were accepted 1% of the time. The thresholds were determined by visually analyzing where negative patches were being sourced from. The goal was to have most negative patches sourced from within the lung while still maintaining representation outside of the lung (Fig. 3). A sketch of this sampling rule follows.
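This is a minimal sketch of the acceptance rule, assuming NumPy; the names and geometry handling are illustrative, not taken from the paper's code.

```python
import numpy as np

# (distance threshold in mm, acceptance probability), per the text
RULES = [(70.0, 0.10), (100.0, 0.05), (np.inf, 0.01)]

def accept_candidate(candidate_mm: np.ndarray,
                     nodule_centroids_mm: np.ndarray,
                     rng: np.random.Generator) -> bool:
    """Randomly accept a negative-patch candidate based on its distance
    to the nearest annotated nodule centroid."""
    nearest = np.linalg.norm(nodule_centroids_mm - candidate_mm, axis=1).min()
    for threshold, prob in RULES:
        if nearest < threshold:
            return rng.random() < prob
    return False
```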

Fig. 3. Positive patches (green) and a subset of randomly selected negative patches (red).

3.2 Nodule Detection

The first step in calculating a follow-up recommendation is nodule detection. We need to be able to determine how many nodules exist in a given CT. As mentioned earlier, a full CT cannot be used as input for a neural network due to hardware constraints. Instead, we trained a binary classifier that classifies 51 mm 3D patches of CT as either containing a nodule or not containing a nodule.

Training.

Of the 236 available CTs, 80% (188 CTs) were used for training and the remaining 48 CTs were used for validation. In terms of patches, 768 patches containing a nodule and 7,785 patches containing no nodules were extracted from the CTs and used to train a CNN based on the VGG network architecture [4]. We chose this roughly 1:9 ratio of classes, as opposed to a more balanced ratio, because it better represents a full CT. To compensate for the class imbalance, we used balanced class weights for the binary cross-entropy loss function.
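The balanced class weights can be derived directly from the patch counts; the sketch below uses scikit-learn and a Keras-style fit call, both of which are assumptions (the paper does not name its framework).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 7,785 negative and 768 positive patches, per the text
labels = np.concatenate([np.zeros(7785, dtype=int), np.ones(768, dtype=int)])
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=labels)
class_weight = dict(enumerate(weights))  # approx. {0: 0.55, 1: 5.57}

# Assumed Keras-style usage:
# model.fit(train_patches, train_labels, class_weight=class_weight, ...)
```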

Fig. 4. The architecture for the detection neural network (left) and the structure of the first convolutional block (right). The numbers in parentheses refer to the output shape of the layer.

To avoid overfitting, a dropout layer with a rate of 0.4 was placed between the two fully-connected layers. We also applied small amounts of training-time augmentation along all axes. Each batch, consisting of 8 patches (due to hardware constraints), was randomly shifted ±6.4 mm, rotated ±10°, and flipped.
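A hedged sketch of this augmentation, assuming SciPy; the rotation plane, interpolation order, and boundary handling are illustrative choices, since the paper does not specify them.

```python
import numpy as np
from scipy.ndimage import rotate, shift

VOXEL = 0.6375  # mm per voxel after resampling

def augment(patch: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Random shift (±6.4 mm), rotation (±10°), and axis flips."""
    max_shift = 6.4 / VOXEL  # mm converted to voxels (~10 voxels)
    patch = shift(patch, rng.uniform(-max_shift, max_shift, size=3),
                  order=1, mode="nearest")
    patch = rotate(patch, rng.uniform(-10, 10), axes=(1, 2),
                   reshape=False, order=1, mode="nearest")
    for axis in range(3):
        if rng.random() < 0.5:
            patch = np.flip(patch, axis=axis)
    return patch
```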

The learning rate was slowly decreased as learning plateaued (Fig. 5). Epochs 1–120 used a learning rate of 1 × 10⁻⁵, epochs 121–140 used 1 × 10⁻⁶, and epochs 141–150 used 1 × 10⁻⁷. In total, training took over 50 h due to the 3D nature of the data, the large number of negative patches, and the hardware.
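The schedule is a simple piecewise-constant function; a sketch follows, with the Keras callback shown as an assumed usage.

```python
def lr_schedule(epoch: int) -> float:
    """Piecewise-constant learning-rate schedule from the text
    (Keras epochs are 0-indexed, hence the off-by-one boundaries)."""
    if epoch < 120:
        return 1e-5
    if epoch < 140:
        return 1e-6
    return 1e-7

# Assumed Keras-style usage:
# from tensorflow.keras.callbacks import LearningRateScheduler
# model.fit(..., epochs=150, callbacks=[LearningRateScheduler(lr_schedule)])
```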

Fig. 5. The loss (left) and accuracy (right) for the detection neural network.

Training Results.

Epoch 146 produced the best weights with respect to the validation dataset results. Results between the train dataset and the validation dataset were similar. While sensitivity was high, precision was much lower (Table 1).

Table 1. The results for the detection neural network for the positive nodule class.

A confusion matrix (Fig. 6) helps to illustrate the high number of false positives.

Fig. 6. The confusion matrix for the detection neural network on the validation dataset.

Applying the Classifier.

To detect nodules within a given CT, we applied the binary classifier to thousands of overlapping 51 mm 3D patches extracted across the entire CT. A stride of 12.75 mm was used for selecting patches because it provided a good balance between high resolution and low computation time. As predictions were made, a probability mask was built: for each patch, a 12.75 mm 3D block located at the patch's centroid was filled with the predicted probability of that patch containing a nodule. As a result, the mask contained higher values wherever the classifier detected a nodule.

A few transformations were applied to the mask so that we could ultimately calculate the centroid coordinates of each nodule. First, a Gaussian filter with a sigma of 10 was applied to blur the 12.75 mm probability blocks. After blurring, a threshold was applied: probabilities below 50% were set to 0 and probabilities above 50% were set to 1. These parameters were picked by manually evaluating their effectiveness on the train dataset. The result is a mask that segments nodules for a full CT (Fig. 7).

Fig. 7. From left to right: the mask, the mask after applying a Gaussian filter, and the mask after applying a threshold.

After producing the mask, the centroid coordinates of each distinct segmentation were determined. Then, 51 mm 3D patches were extracted at each predicted coordinate and classified once more using the binary classifier to obtain a final probability (Fig. 8). The post-processing is sketched below.
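A minimal sketch of the mask post-processing, assuming SciPy; the connected-component step is a natural way to realize "each distinct segmentation", though the paper does not name its implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, label, center_of_mass

def nodule_centroids(prob_mask: np.ndarray) -> list:
    """Blur the probability mask, threshold at 0.5, and return the
    centroid (in voxel coordinates) of each connected component."""
    blurred = gaussian_filter(prob_mask, sigma=10)
    binary = blurred > 0.5
    labeled, n = label(binary)
    return center_of_mass(binary, labeled, index=list(range(1, n + 1)))
```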

Fig. 8. From left to right: a sample CT from the validation dataset, the associated radiologist mask, and our predicted mask.

3.3 Nodule Segmentation

Next, we trained a CNN based on the U-Net architecture [5] to predict nodule segmentations and, in turn, nodule volumes. The only differences between our architecture and the one described in [5] are that we used 3D layers instead of 2D and halved the number of channels in each convolutional layer. The network takes a 51 mm 3D patch as input and predicts a segmentation in the same form, a 51 mm 3D patch, in which each voxel is one or zero depending on whether it belongs to a nodule. A sketch of this architecture follows.
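Under the stated assumptions (3D layers, channels halved from U-Net's 64–1024, and an 80³-voxel input), a compact Keras-style sketch looks like the following; the framework and details such as the up-convolution kernel size are assumptions, not taken from the paper.

```python
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    """Two 3x3x3 convolutions, mirroring U-Net's double-conv blocks in 3D."""
    x = layers.Conv3D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv3D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_unet3d(shape=(80, 80, 80, 1)) -> Model:
    inputs = layers.Input(shape)
    skips, x = [], inputs
    for f in (32, 64, 128, 256):          # half of U-Net's 64, 128, 256, 512
        x = conv_block(x, f)
        skips.append(x)
        x = layers.MaxPooling3D(2)(x)     # 80 -> 40 -> 20 -> 10 -> 5
    x = conv_block(x, 512)                # bottleneck, half of U-Net's 1024
    for f, skip in zip((256, 128, 64, 32), reversed(skips)):
        x = layers.Conv3DTranspose(f, 2, strides=2, padding="same")(x)
        x = layers.Concatenate()([x, skip])
        x = conv_block(x, f)
    outputs = layers.Conv3D(1, 1, activation="sigmoid")(x)
    return Model(inputs, outputs)
```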

Training.

Of the 236 available CTs, 177 were used for training and 59 for validation. From these CTs, the 768 patches containing nodules and 451 patches containing non-nodule pulmonary lesions were used to train and validate the network. A combination of binary cross-entropy and a weighted dice penalty was used as the loss function:

$$ loss = binary\_crossentropy + 2 \cdot (1 - dice\_coefficient) \tag{1} $$
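A direct transcription of Eq. (1), assuming TensorFlow/Keras; the dice smoothing constant is an assumption, since the paper does not report one.

```python
import tensorflow as tf

def dice_coefficient(y_true, y_pred, smooth=1.0):
    """Soft dice over the flattened batch; `smooth` avoids division by zero."""
    y_true = tf.reshape(y_true, [-1])
    y_pred = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true * y_pred)
    return (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)

def combined_loss(y_true, y_pred):
    """Eq. (1): binary cross-entropy plus a doubled dice penalty."""
    bce = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y_true, y_pred))
    return bce + 2.0 * (1.0 - dice_coefficient(y_true, y_pred))
```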

The network was trained for 200 epochs at a learning rate of 1 × 10⁻⁵ (Fig. 9) with a batch size of 4 (due to hardware constraints). As with the detection task, batches were generated and augmented on the fly during training. The augmentations were applied along each axis and consisted of small random 3D shifts of ±3.2 mm, rotations of ±5°, and flips. The dice coefficient by itself was used to evaluate performance on the validation set as the network was trained.

Fig. 9. The loss (left) and dice coefficient (right) for the segmentation neural network.

Training Results.

The weights at epoch 200 were used for evaluation. The dice coefficients for the train dataset and validation dataset were similar: 0.4972 and 0.4851 respectively (Fig. 10).

Fig. 10. From left to right: a sample patch from the validation dataset, the associated radiologist mask, and our predicted mask.

3.4 Texture Characterization

The last CNN classifies 51 mm 3D patches of nodules into one of the three texture classes: ground glass opacities (GGO), part solid, and solid.

Training.

As with the segmentation task, 177 CTs were used for training and 59 for validation, and the 768 patches containing nodules were used to train and validate the network. However, there was a sizeable class imbalance that needed to be addressed: the GGO class was represented 38 times, part solid 58 times, and solid 672 times. To work around the class imbalance, we oversampled the two minority classes by a factor of 5. We also skewed the class weights for the categorical cross-entropy loss function in favor of the two minority classes: a class weight of 4.08 was used for GGO, 2.60 for part solid, and 0.57 for solid. As a result, the network was further incentivized to classify the minority classes correctly.
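A sketch of both mitigation steps; the oversampling helper is hypothetical, and the weights are the ones reported above.

```python
# Class counts from the text: GGO = 38, part solid = 58, solid = 672.
# Oversample the two minority classes by a factor of 5 (hypothetical helper).
def oversample(patches, labels, minority=("ggo", "part_solid"), factor=5):
    extra = [(p, l) for p, l in zip(patches, labels) if l in minority]
    for p, l in extra * (factor - 1):  # each minority patch appears 5x total
        patches.append(p)
        labels.append(l)
    return patches, labels

# Skewed class weights from the text, passed to categorical cross-entropy
# (Keras-style): index 0 = GGO, 1 = part solid, 2 = solid.
class_weight = {0: 4.08, 1: 2.60, 2: 0.57}
# model.fit(..., class_weight=class_weight)
```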

The architecture for this network is similar to the architecture used for the detection task (Fig. 4). As we began training, it became apparent that it was easy to overfit due to the small dataset of nodules and complexity of the network. As such, we changed the architecture by increasing the dropout rate to 0.5 and adding a batch normalization layer within each convolutional block before the max pooling layer. We also used more aggressive augmentations. For each batch, the patches were shifted ±12.8 mm, rotated ±20°, and flipped.

We found that the network struggled to learn with a learning rate of 1 × 10⁻⁵ (Fig. 11). Instead, an initial learning rate of 1 × 10⁻⁴ was used for the first 200 epochs, with performance evaluated every 100 epochs. Once learning began to plateau, the learning rate was decreased to 1 × 10⁻⁵ for epochs 201–300 and again to 1 × 10⁻⁶ for epochs 301–350.

Fig. 11. The loss (left) and accuracy (right) for the classification neural network.

Training Results.

The weights at epoch 318 were used for evaluation. Despite our attempts to mitigate overfitting, the results for the validation dataset were much worse than those for the train dataset, especially for the two minority classes (Tables 2, 3 and 4).

Table 2. The results for the classification neural network for the GGO class.
Table 3. The results for the classification neural network for the Part Solid class.
Table 4. The results for the classification neural network for the Solid class.

Again, a confusion matrix (Fig. 12) illustrates the fact that the network had a hard time learning the GGO and part solid classes.

Fig. 12. The confusion matrix for the classification neural network on the validation dataset.

3.5 Fleischner Classification

The final step combines the detection, segmentation, and texture characterization tasks to produce a Fleischner score, which maps to a follow-up recommendation. The same process as described earlier was used to detect nodules and their centroid coordinates. After detecting nodules, 51 mm 3D patches were extracted at each coordinate and segmentations were produced for them. To calculate volumes, the ones in a given mask were summed and then multiplied by the volume of a single voxel, 0.6375³ ≈ 0.259 mm³. Next, texture classes were predicted for each nodule using the classification network. These features were combined using a script provided by the LNDb challenge to produce the predicted probability of a CT belonging to each of the four Fleischner classes.
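The volume computation reduces to a voxel count; a one-line sketch:

```python
import numpy as np

VOXEL_VOLUME = 0.6375 ** 3  # mm^3 per isotropic voxel (~0.259 mm^3)

def nodule_volume(mask: np.ndarray) -> float:
    """Volume in mm^3 of a predicted binary segmentation mask."""
    return float(mask.sum()) * VOXEL_VOLUME
```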

4 Results

The LNDb challenge provides a means to separately evaluate the performance of our method for each of the four tasks: Fleischner classification, nodule detection, nodule segmentation, and nodule texture characterization. We produced a submission containing predictions for each of these tasks for the 58 test CTs. For the nodule segmentation and texture characterization tasks, the LNDb challenge provides annotations for centroid coordinates. No such annotations were provided nor used for the Fleischner classification or nodule detection tasks.

Nodule Detection.

Nodule detection predictions were evaluated using the free-response receiver operating characteristic (FROC) curve. Average sensitivity was computed at two different agreement levels: all nodules, and nodules marked by at least two radiologists. The sensitivities are averaged to produce a final score [2] (Table 5).

Table 5. The LNDb challenge results for the detection task.

Nodule Segmentation.

Nodule segmentation predictions were scored based on six different metrics: a modified Jaccard index, mean average distance (MAD), Hausdorff distance (HD), modified Pearson correlation coefficient, bias, and standard deviation [2] (Table 6).

Table 6. The LNDb challenge results for the segmentation task.

Texture Characterization.

Texture predictions were compared to the ground truth and agreement was computed according to Fleiss-Cohen weighted Cohen’s kappa [2]. Our texture characterization submission received a score of 0.3342.

Fleischner Classification.

Fleischner score predictions were compared to the ground truth and agreement was computed according to Fleiss-Cohen weighted Cohen’s kappa [2]. Our Fleischner classification submission received a score of 0.5092.

5 Conclusion

Our system struggled with a high number of false positives in the nodule detection task. One solution would be to train a second network dedicated to false-positive reduction; the two networks would then work together as an ensemble to produce more precise predictions. The other problem our system struggled with was an inability to learn the minority classes in the texture characterization task. A larger dataset, a different loss function, more sophisticated oversampling or augmentation techniques, or a different network architecture could improve performance for the minority classes. Lastly, for the nodule segmentation task, the more recently introduced UNet++ architecture [6] may improve performance further.

The LNDb challenge poses unique problems with the potential to improve the early screening process of patients for lung cancer. While the method described in this paper did not have the best performance on the test dataset, it presents a baseline for further work.