1 Introduction

Automated whole breast ultrasound (AWBUS) is a new medical imaging modality approved by the FDA in 2012. The AWBUS technology can automatically depict the whole anatomy of the breast in a 3D volume, and hence enables thorough offline image reading. However, the advantage of volumetric imaging may also introduce more workload for radiologists. Even for a senior radiologist, the reading of an AWBUS volume can take tens of minutes to reach a confident diagnostic workup, due to the large amount of image information and the difficulty of interpreting ultrasound images. Accordingly, a shortage of radiologist manpower can be expected with the popularization of this new imaging technology. To improve the reading efficiency, an automatic segmentation method is proposed in this study to parse the AWBUS images into the breast anatomic layers of subcutaneous fat, breast parenchyma, pectoralis muscles and chest wall. The layer decomposition is shown in Fig. 1. The layer parsing of breast anatomy in the AWBUS images can assist the clinical image reading of less-experienced radiologists and residents. Meanwhile, the breast density, which is an important biomarker of cancer risk [1, 13], can be easily computed from the parsing results. In the context of computer-aided detection, the layer parsing may also help to exclude false-positive detections [2].

Fig. 1.

Illustration of anatomical layers in AWBUS. (A), (B), (C) and (D) indicate the layers of subcutaneous fat, breast parenchyma, muscle and chest wall, respectively. The green lines are the septa boundaries of the layers. The red dotted circle indicates a significant shadowing effect, whereas the blue dotted rectangles mark regions that are difficult for layer differentiation.

Referring to Fig. 1, the segmentation task for layer parsing in AWBUS images can be very challenging. It must not only deal with the intrinsic low-quality properties of ultrasound images, like the speckle noise and shadowing effect, but also tackle the overlapping echogenicity distributions among the different breast anatomical layers. The low image quality and echogenicity overlapping problems may also lead to ill-defined septa boundaries between consecutive layers, see Fig. 1, and hence render the layer parsing task more problematic. On the other hand, the appearance and morphology of the breast anatomic layers can vary significantly from person to person. For subjects with lower breast density, the fat layer can be thicker, whereas AWBUS images depicting dense breasts may show larger breast parenchyma. Therefore, the issue of high inter-subject variability of the anatomical structures should also be well considered in the design of the layer parsing algorithm.

In the literature, most works have focused on the segmentation of breast lesions in ultrasound images [14, 15]. To the best of our knowledge, there is little related work on breast anatomy parsing in AWBUS images. In this study, we propose to leverage the deep convolutional encoder-decoder network, ConvEDNet for short [3,4,5,6], with the two-stage domain transfer (2DT) and deep boundary supervision (DBS) techniques, i.e., deep supervision [7] with the boundary cue, for layer parsing in the AWBUS images. The ConvEDNet [3,4,5,6] can be trained end-to-end for semantic segmentation, and typically consists of a convolutional (encoder) and a deconvolutional (decoder) path, which learn useful object features and restore the object geometry and morphology at the resolution of the input images, respectively. The learning of the ConvEDNet is mainly based on image context cues with the guidance of object labels [3,4,5,6]. However, as discussed earlier, using the image context cues alone may not be sufficient to address the issues of low image quality, ill-defined septa boundaries, etc. Accordingly, we further incorporate the boundary cue drawn by radiologists, via the auxiliary learning techniques of 2DT and DBS, to boost the training of the ConvEDNet. The cues of boundary and image context are complementary to each other and can be synergized to achieve promising image analysis performance [8]. The details of 2DT and DBS will be elaborated later.

The proposed 2DT-DBS ConvEDNet is extensively compared with the state-of-the-art ConvEDNets, i.e., FCN [3], DeconvNet [4], SegNet [5] and U-Net [6]. We also perform ablation experiments to illustrate the effectiveness of the 2DT and DBS techniques. One related deep learning method [8], which also fuses the image context and boundary cues to train an FCN in a multi-task paradigm, is implemented for comparison. Specifically, in [8] the object labeling and boundary delineation are treated as two tasks that co-train the FCN. Our formulation, on the other hand, adopts the deep supervision strategy to augment the feature learning in the encoder path with the auxiliary boundary cue that encodes the object geometry and morphology. With the extensive experimental comparison, it will be shown that the proposed 2DT-DBS ConvEDNet outperforms the other baseline methods for the layer parsing of breast anatomy in AWBUS.

2 Method

The architecture of our network is illustrated in Fig. 2. The mainstream architecture is a ConvEDNet. The segmentation in this study is based on 2D AWBUS images. The data annotation and the learning/testing of the layer parsing methods are performed independently on the sagittal and axial 2D views of the AWBUS volumes, and the final annotation and layer parsing results are obtained by averaging, for each of the three septa boundaries of the four layers, the two corresponding boundary estimates from the two 2D views.
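To make the view-fusion step concrete, the following is a minimal sketch of the averaging described above, under the assumption that each septum boundary is stored as a depth map over the coronal plane, with one estimate reconstructed from the sagittal slices and one from the axial slices; the function and variable names are illustrative only, not the authors' implementation.

```python
import numpy as np

def fuse_views(septa_sagittal, septa_axial):
    """Average per-septum boundary estimates from the two 2D views.

    septa_sagittal, septa_axial: lists of three (H, W) float arrays,
    one per septum boundary (A/B, B/C, C/D), each giving the boundary
    depth at every coronal position.
    """
    return [0.5 * (s + a) for s, a in zip(septa_sagittal, septa_axial)]
```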

Fig. 2.

Our boundary-regularized ConvEDNet architecture. For the encoder layer size of each side network, N is the size of the connecting layer of the mainstream encoder.

Since the ultrasound data are relatively noisy, the segmentation capability of the encoder-decoder path in a ConvEDNet may not be sufficient to address the challenging issues of our problem. In this study, we employ the 2DT and DBS techniques to augment the network training. As shown in Fig. 2, our network is equipped with five auxiliary side networks that impart boundary knowledge to regularize the feature learning.

The computational decomposition of breast anatomy in the AWBUS images is formulated as a pixel classification problem with the four classes of subcutaneous fat, breast parenchyma, pectoralis muscle, and chest wall. Given the annotated label map set, \( C \), and the original 2D AWBUS image set, \( X \), the training of the ConvEDNet seeks proper neural parameters, \( W_{c} \), by minimizing the loss function:

$$ \mathcal{L}\left( C, X; W_{c} \right) = \mathcal{L}_{c}\left( C, X; W_{c} \right) + \left\| W_{c} \right\|_{2}, $$
(1)

where \( \mathcal{L}_{c}(\cdot) \) is the cross entropy function [12], and \( \left\| \cdot \right\|_{2} \) is the \( L_{2} \) norm for regularization. The minimum of the loss function (1) can be sought by stochastic gradient descent for the end-to-end learning of segmentation.
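As a hedged illustration, the objective in Eq. (1) could be written as follows in PyTorch-style Python; the paper's actual implementation is in Caffe [12], and in practice the \( L_{2} \) term is usually realized through the optimizer's weight decay (cf. Sect. 2.4), so the explicit form below only mirrors the equation.

```python
import torch.nn as nn

def loss_eq1(logits, labels, model):
    """Pixel-wise cross entropy plus an L2 penalty, mirroring Eq. (1).

    logits: (B, 4, H, W) class scores; labels: (B, H, W) with values 0..3.
    """
    ce = nn.functional.cross_entropy(logits, labels)
    # L2 norm of all network parameters
    l2 = sum(p.pow(2).sum() for p in model.parameters()).sqrt()
    return ce + l2
```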

2.1 Our Mainstream ConvEDNet (MConvEDNet)

Similar to [4], the encoder of the MConvEDNet is a VGG-16 [9] net with the last classification layer removed, see Fig. 2. We change the kernel size of conv6 and deconv6 to 5 to fit our data. The unpooling layers of the decoder are paired with the max-pooling layers of the encoder. The locations of the maximum activations at the max-pooling layers are memorized with switch variables to assist the unpooling process.
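The pooling/unpooling pairing can be sketched as follows, assuming a PyTorch-style implementation (the paper itself is built on Caffe): the indices recorded by the max-pooling layers, i.e. the switch variables, are reused by the matching unpooling layers in the decoder.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 64, 160, 160)       # a feature map at input resolution
pooled, switches = pool(x)             # encoder: remember max locations
restored = unpool(pooled, switches)    # decoder: restore activations to
assert restored.shape == x.shape       # their original spatial positions
```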

2.2 Two-Stage Domain Transfer (2DT)

Since the cost of collecting and annotating medical images is relatively high, the common approach to attaining good performance with deep learning techniques is to initialize the network with parameters learnt from natural images [10]. However, considering that the domains of natural and AWBUS images are quite different, we propose to carry out the knowledge transfer of model parameters in two stages. Specifically, the first stage of domain transfer employs the VGG-16 [9] as the encoder, followed by a decoder with a single deconvolutional layer, for anatomical edge detection in AWBUS images. The learning of the edge detector is guided by the boundary maps, in which the three septa boundaries of the four layers are drawn. To boost the learning of edge detection, deep supervision with the boundary maps is also implemented by the same five auxiliary side networks shown in Fig. 2. This type of edge detector network is also known as the holistically-nested edge detection (HED) net [11]. Training the AWBUS edge detector lands the VGG-16 encoder in the AWBUS domain and familiarizes it with the presence of speckle noise and the shadowing effect. Similar to [11], the AWBUS edge detection is formulated as a 2-class differentiation with edge pixels labeled 1 and non-edge pixels labeled 0. The learnt encoder of the AWBUS edge detector is denoted as VGG-USEdge. The tasks of anatomic edge detection and layer parsing relate to each other but remain different; therefore, the VGG-USEdge encoder may provide more useful prior knowledge than VGG-16. Accordingly, the VGG-USEdge is applied to initialize the encoder network of our MConvEDNet.
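The second transfer stage could be sketched as below, under the assumption that the HED-style edge network and the MConvEDNet expose their VGG encoders with matching parameter names; the checkpoint path and module names are hypothetical.

```python
import torch

def init_encoder_from_usedge(mconvednet, ckpt="vgg_usedge_encoder.pth"):
    """Copy the VGG-USEdge encoder weights into the MConvEDNet encoder."""
    edge_state = torch.load(ckpt)                       # stage-1 result
    target = mconvednet.encoder.state_dict()
    # keep only the parameters shared by both encoders
    transferred = {k: v for k, v in edge_state.items() if k in target}
    mconvednet.encoder.load_state_dict(transferred, strict=False)
```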

2.3 Deep Boundary Supervision (DBS)

As can be found in Fig. 2, the MConvEDNet is relatively deep, and hence the gradient vanishing issue can occur during network training. Meanwhile, the learning process can also be thwarted by the difficult issues discussed earlier. To further boost the learning process, the deep supervision strategy is employed. Here, we introduce the cue of layer boundaries with the deep supervision strategy to improve the learning. To further illustrate the efficacy of the boundary cue, we implement two comparison options. The first option is deep supervision with the label map cue on the MConvEDNet. The second is to perform the DBS on both the encoder and the decoder, with ten auxiliary side networks in total. It will be shown that the proposed encoder-only DBS boosts the segmentation better than the other deep supervision strategies.

The DBS is realized by adding auxiliary side networks to the ends of the five convolutional blocks in the encoder of the MConvEDNet. The auxiliary side networks are shallow, each simply consisting of a coupled single convolutional and deconvolutional layer, see Fig. 2. Given the neural parameters, \( W_{e}^{p} \), of an auxiliary side network \( p \), \( 1 \le p \le Q \), where \( Q \) is the total number of side networks at the encoder, and the edge map set of the layer boundaries, \( E \), the learning of the end-to-end segmentation with the DBS is realized by minimizing the reformulated loss function

$$ \mathcal{L}\left( C, X; W_{c} \right) = \mathcal{L}_{c}\left( C, X; W_{c} \right) + \left\| W_{c} \right\|_{2} + \sum\nolimits_{p = 1}^{Q} \mathcal{L}_{e}\left( E, X; W_{e}^{p} \right), $$
(2)

where \( \mathcal{L}_{e}(\cdot) \) is the class-balanced cross entropy function for the auxiliary side networks, which accounts for the imbalance between the edge and non-edge classes [11]. With the minimization of the cost function, the encoder network is equipped with the capability to drive the prediction masks of the mainstream network as close to the manual label maps as possible, while keeping the output edge maps of the side networks from deviating significantly from the annotated edge maps. For the comparison implementation of deep supervision with the label map cue, we simply replace the training map set \( E \) with \( C \). Deep supervision with both the label and edge map cues needs two parallel side networks, which consider the training map sets \( C \) and \( E \), respectively.
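A minimal sketch of the class-balanced cross entropy for the side networks (following the weighting scheme of HED [11]) and of the combined objective in Eq. (2) is given below; the shapes and names are assumptions for illustration, not the authors' Caffe implementation.

```python
import torch
import torch.nn as nn

def class_balanced_bce(side_logits, edge_map):
    """HED-style class-balanced cross entropy for one side network.

    side_logits, edge_map: (B, 1, H, W); edge pixels are 1, others 0.
    """
    n_pos = edge_map.sum()
    n_neg = edge_map.numel() - n_pos
    beta = n_neg / (n_pos + n_neg)           # up-weight the rare edge class
    weights = torch.where(edge_map > 0.5, beta, 1.0 - beta)
    return nn.functional.binary_cross_entropy_with_logits(
        side_logits, edge_map, weight=weights)

def loss_eq2(logits, labels, side_logits_list, edge_map, model):
    """Mainstream loss of Eq. (1) plus the Q side-network terms of Eq. (2)."""
    ce = nn.functional.cross_entropy(logits, labels)
    l2 = sum(p.pow(2).sum() for p in model.parameters()).sqrt()
    side = sum(class_balanced_bce(s, edge_map) for s in side_logits_list)
    return ce + l2 + side
```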

2.4 Implementation Details

The learning rate of the mainstream network is initialized as 0.01, while the weight decay and momentum parameters are set as 0.0005 and 0.9, respectively. For the auxiliary networks of deep supervision, the learning rates are \( 10^{-6} \), and the parameters of weight decay and momentum are the same as those of the MConvEDNet. No dropout is implemented, but batch normalization is adopted. The architectures of the auxiliary side networks for the edge detector net and the MConvEDNet are the same for simplicity, but with different random initialization of the network parameters. Our method is developed in the Caffe environment [12].
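These hyper-parameters could be mirrored in a PyTorch optimizer as sketched below (the original implementation runs in Caffe [12]); `mainstream` and `side_nets` are placeholder modules standing in for the real networks.

```python
import torch

mainstream = torch.nn.Conv2d(1, 4, kernel_size=3)   # placeholder: MConvEDNet
side_nets = torch.nn.Conv2d(64, 1, kernel_size=1)   # placeholder: side nets

# per-group learning rates; shared weight decay and momentum
optimizer = torch.optim.SGD(
    [{"params": mainstream.parameters(), "lr": 0.01},
     {"params": side_nets.parameters(), "lr": 1e-6}],
    momentum=0.9, weight_decay=0.0005)
```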

3 Dataset and Annotation

The AWBUS data were collected from Taipei Veterans General Hospital, Taipei, Taiwan, with the approval of its institutional review board (IRB). Sixteen AWBUS volumes acquired from 16 subjects are involved in this study. The subjects' ages range from 30 to 62. The dark non-tissue regions of all AWBUS images are excluded, leaving image content of size \( 160 \times 160 \). The annotation of the boundaries of the breast anatomical layers in the AWBUS images was performed by a radiologist with 5 years of experience in breast ultrasound. The annotated data were further reviewed by a senior radiologist with more than 30 years of experience in medical ultrasound to ensure the correctness of the annotation. Each AWBUS volume contains around 170–200 2D images, and the overall number of 2D images is 3134.

4 Experiments and Results

The evaluation of the AWBUS image segmentation is based on leave-one-out cross validation (LOO-CV). The basic unit of the LOO-CV is an AWBUS volume, not a 2D image. Two assessment metrics, intersection over union (IoU) [4] and curve distance (CD) [15], are adopted for the quantitative evaluation of the computerized segmentation results against the manual annotations. The CD is the average absolute distance between two compared curves. The state-of-the-art ConvEDNets of FCN, DeconvNet, SegNet and U-Net are also implemented as baseline methods for comparison. Meanwhile, the multi-task method [8], denoted as “Multitask”, which fuses the image context and boundary cues, is also implemented for comparison. Combinations of the 2DT and DBS techniques are also implemented to show the effect of each technique on our problem. As discussed in Sect. 2.3, to illustrate the efficacy of DBS, the implementations of DBS on both encoder and decoder (FullyDBS) and of deep supervision with the label map (DLS) are also performed. To show the effectiveness of 2DT, we also implement random parameter initialization (RandI) for the DBS+ConvEDNet.
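For concreteness, the two metrics could be computed as sketched below; the IoU follows the usual per-class definition [4], while the CD is taken here as the mean absolute distance between boundary depths sampled per image column, which is one plausible reading of [15].

```python
import numpy as np

def iou(pred, gt, cls):
    """Intersection over union for one anatomic layer (class index cls)."""
    p, g = pred == cls, gt == cls
    union = np.logical_or(p, g).sum()
    return np.logical_and(p, g).sum() / union if union else np.nan

def curve_distance(curve_a, curve_b):
    """Mean absolute distance between two boundary curves.

    curve_a, curve_b: 1D arrays of boundary depths, one entry per column.
    """
    return np.abs(np.asarray(curve_a) - np.asarray(curve_b)).mean()
```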

Table 1 reports the mean \( \pm \) standard deviation statistics of the CD and IoU metrics for the segmentation results of each implementation over the LOO-CV scheme. Specifically, the segmentation performances w.r.t. the three septa boundaries in-between the layers (CD) and the four anatomic layers (IoU) are listed in the columns of Table 1. The layers A, B, C and D represent fat, breast parenchyma, pectoralis muscles and chest wall, respectively. The lines 1, 2 and 3 are the septa boundaries w.r.t. the layer pairs “A/B”, “B/C”, and “C/D”. It is worth noting that our MConvEDNet is based on the DeconvNet [4]. For visual comparison, the segmentation results of all methods involved in this study are shown in Fig. 3.

Table 1. Segmentation performances of different methods. “Main” represents our mainstream ConvEDNet (MConvEDNet). It is worth noting that the encoders of “DeconvNet” and “DBS+Main” are initialized with VGG-16.
Fig. 3.

Visual comparison of the layer parsing results from different implementations. The layer boundaries of the computerized results are drawn in red, whereas the manual outlines of the radiologists are drawn in green.

5 Discussion and Conclusion

As can be observed from Fig. 3 and Table 1, the FCN segmentation results are relatively unstable: some regions are obviously mislabeled. Therefore, the FCN may be less robust to the low image quality of ultrasound. On the other hand, the DeconvNet is relatively more suitable for our problem because of its deep decoding path. The SegNet results appear worse than those of the FCN for the muscle (C) layer and the septa boundary between the muscle and chest wall layers. This suggests that fixing the feature maps at the same size may not help for our problem. The results of U-Net lie between those of SegNet and FCN, even though the skip connection strategy is adopted in U-Net to alleviate the gradient vanishing problem. Therefore, the feature learning remains relatively difficult even with the skip connections between the encoder and decoder correspondences. Accordingly, the incorporation of the boundary cue may help to improve the ultrasound image segmentation.

It can be found in Table 1 that the best segmentation performance, with both the IoU and CD metrics, is achieved by our method “DBS+Main+2DT”. This suggests that our 2DT-DBS ConvEDNet may have better capability to withstand the speckle noise, shadowing and other challenges discussed in the introduction. Based on the extensive comparisons with the other baseline implementations, the efficacy of the 2DT-DBS ConvEDNet for the layer parsing problem is corroborated.