1 Introduction

Recently, regional anesthesia has become an attractive alternative to general anesthesia in the context of medical surgeries, mainly because it improves postoperative mobility and reduces morbidity and mortality [1]. Regional anesthesia comprises the administration of an anesthetic substance in the area surrounding a nerve structure to block the transmission of nociceptive information (this procedure is known as Peripheral Nerve Blocking, PNB) [2]. Accordingly, the success of regional anesthesia depends on the accurate location of the target nerve [3]. The use of ultrasound images (UI) has gained considerable interest for locating nerve structures in PNB procedures [1], since it allows direct visualization of the target nerve and the anatomical structures around it [4]. Notwithstanding, localizing nerve structures in ultrasound images is a challenging task for the specialists (in this case, anesthesiologists) because these kinds of images are affected by several artifacts, such as attenuation, acoustic shadows, and speckle noise [5]. Thus, the accurate delimitation of a given target depends on the operator's (anesthesiologist's) experience [6].

The above problem can be minimized using automatic nerve-segmentation systems, which are intended to assist the anesthesiologist in locating nerve structures in PNB procedures. Nevertheless, building a system with these specifications requires access to the actual labels, that is, ultrasound images indicating which regions correspond to a nerve (usually, this labeling is performed by an anesthesiologist) [1, 6]. In practice, this is problematic because specialists cannot accurately identify nerve structures in an ultrasound image, given that speckle noise and artifacts hinder the delimitation of anatomical structures [7]. Hence, the obtained labels do not correspond to the ground truth but to a subjective (possibly noisy) interpretation given by each specialist based on their experience and training. For the automatic segmentation of nerve structures without knowledge of the ground truth, a set of noisy annotations (manual segmentations) from various specialists can be used. In this case, the manual segmentations provided by multiple experts are needed to build a segmentation model that measures the annotators' performance from the parametrization of the ultrasound images, in order to deal with the subjectivity present in the labeled regions.

In the presence of multiple annotators, the labels from several experts have been used in different ways to build automatic systems for nerve segmentation. For example, in [6] the authors consider the annotations from a single specialist as the gold standard, whereas in [1] the authors use the majority voting of the annotations as the ground truth. However, these approaches have some problems: if we only use the labels from one annotator, the segmentation results are biased by that annotator's expertise; similarly, majority voting assumes that all annotators are equally reliable, which is uncommon in real scenarios [8]. Another way to deal with the absence of a gold standard is a recent trend in machine learning named "learning from multiple annotators". This relatively new area aims to perform supervised learning tasks when the gold standard is unavailable and we only have access to multiple annotations provided by several experts or annotators. It has been applied to problems such as regression [9], classification [10], and sequence labeling [11].

In this paper, we present a method for the automatic segmentation of nerve structures depicted in ultrasound images, considering the scenario where the ground truth is not available. In particular, we use two classification schemes with multiple annotators to combine the manual segmentations from different experts and reveal discriminant patterns associated with the nerve structures. The first is based on logistic regression, where the annotator performance depends only on the true label and is measured in terms of sensitivity and specificity [12]. The second is also based on logistic regression, but assumes that the annotator performance depends on both the true label and the instance being labeled [13]. We hypothesize that by using classification schemes that consider the non-availability of the ground truth, it is possible to reduce the subjectivity present in the labeled regions. A few works consider the information of several annotators to build nerve-segmentation systems [1, 6]; however, they handle this information with basic schemes (majority voting, or using the information from only one annotator), which are not suitable since they assume that all experts have the same level of expertise. In this sense, the main contribution of our work is an automatic nerve-segmentation system that captures the segmentation expertise of different specialists while accounting for the non-homogeneous performance of the experts. The obtained results show that our approach finds a suitable UI approximation by identifying discriminative nerve patterns according to the opinions given by multiple specialists. Indeed, our proposal outperforms state-of-the-art nerve segmentation approaches in terms of the Dice coefficient.

2 Materials and Methods

2.1 Multi-annotator Classification Schemes

For training a typical classification problem (i.e., a classification scheme that does not consider multiple annotators), we have a training set \({{{\fancyscript{D}}}} = \left\{ (\mathbf {x}_i, t_i)\right\} _{i=1}^N\) with N samples, where \(\mathbf {x}_i\) is an instance described by a \(D\)-dimensional feature vector and \(t_i\) is the label associated with \(\mathbf {x}_i\), which is assumed to be the ground truth. In this work, however, we consider the case where the ground truth is not available for training, and we only have access to a set of (possibly noisy) labels provided by R experts or annotators [12]. Accordingly, the training set in the multiple-annotator context is \({{\fancyscript{D}}} = \left\{ (\mathbf {x}_i, \mathbf {y}_i)\right\} _{i=1}^N\), where \(\mathbf {y}_i= \left[ y_i^1, \dots , y_i^R\right] \) holds the annotations for the i-th sample given by the R annotators. We use two classification schemes with multiple annotators to deal with the automatic segmentation of nerve structures. Both methods introduce a latent random vector \(\mathbf {z} = \left[ z_1, \dots , z_N\right] \), where \(z_i\) represents the unknown ground truth for the i-th sample; a brief description of each method follows.
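To make the data layout concrete, the following minimal sketch (illustrative names and sizes, not taken from the original work) shows how such a multi-annotator training set can be represented: X stores the N feature vectors and Y the N\(\times \)R annotation matrix, while the true labels z remain latent.

```python
# Minimal sketch (assumed names/sizes) of the multi-annotator training set D = {(x_i, y_i)}.
import numpy as np

N, D, R = 1000, 16, 3                      # samples, features, annotators (illustrative sizes)
X = np.random.randn(N, D)                  # feature matrix, one row per instance x_i
Y = np.random.randint(0, 2, size=(N, R))   # Y[i, r] = y_i^r, label of annotator r for instance i
# Note: there is no ground-truth vector z; it is treated as a latent variable to be estimated.
```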

Logistic regression with multiple annotators (LFC). We follow the multi-annotator classification model proposed in [12]. The annotator performance is measured in terms of sensitivity \(\alpha ^r\) and specificity \(\beta ^r\), where \( \alpha ^r = p\left( y^r=1|z=1\right) , \; \beta ^r = p\left( y^r=0|z=0\right) \). Hence, we use the training dataset to construct a multiple-annotator classifier based on logistic regression [14]. Given the samples and the annotations, we need to estimate the parameters associated with the performance of each annotator, \(\varvec{\alpha }=\left[ \alpha ^1, \dots , \alpha ^R\right] \) and \(\varvec{\beta }=\left[ \beta ^1,\dots , \beta ^R\right] \), together with the classifier parameters \(\varvec{w}\). These parameters are estimated with an Expectation-Maximization (EM) algorithm. The likelihood function is given as

$$\begin{aligned} p\left( {\fancyscript{D}}|\varvec{\theta }\right) = \prod _{i=1}^{N}\left[ p_i\prod _{r=1}^{R}\left( \alpha ^r\right) ^{(y_i^r)}\left( 1-\alpha ^r\right) ^{(1-y_i^r)}+(1-p_i)\prod _{r=1}^{R}\left( \beta ^r\right) ^{(1-y_i^r)}\left( 1-\beta ^r\right) ^{(y_i^r)}\right] , \end{aligned}$$

where \(\varvec{\theta }=\left\{ \varvec{\alpha }, \varvec{\beta }, \varvec{w}\right\} \), and \(p_i = p\left( z_i=1|\mathbf {x}_i,\varvec{w}\right) \) is computed by means of a logistic regression function [14]. The EM algorithm comprises the following steps:

E-step: The conditional expectation of the log-likelihood yields

$$\begin{aligned} \mathbb {E}[\ln (p({\fancyscript{D}}, \mathbf {z}|\varvec{\theta }))] = \sum _{i=1}^{N} \mathbb {E}[z_i]\ln (a_{i}p_{i}) + \left( 1-\mathbb {E}[z_i]\right) \ln \left( b_i(1-p_i)\right) , \end{aligned}$$

where \(\displaystyle a_i = \prod _{r=1}^{R}\left( \alpha ^r\right) ^{(y_i^r)}\left( 1-\alpha ^r\right) ^{(1-y_i^r)}\), \(\displaystyle b_i=\prod _{r=1}^{R}\left( \beta ^r\right) ^{(1-y_i^r)}\left( 1-\beta ^r\right) ^{(y_i^r)}\), and \(\mathbb {E}[z_i]\) is the estimated ground truth which follows \(\displaystyle \mathbb {E}[z_i] = \mu _{i} = \frac{a_{i}p_{i}}{a_ip_i+b_i(1-p_i)}.\)

M-step: Given the estimated gold standard \(\mu _i\) and the training data, we estimate the parameters \(\varvec{\theta }\) by maximizing the conditional expectation of the log-likelihood computed in the E-step. The annotators' performance parameters are updated as

$$\begin{aligned} \alpha ^r = \displaystyle \frac{\sum _{i=1}^{N}\mu _iy_i^r}{\sum _{i=1}^{N}\mu _i}, \quad \beta ^r=\frac{\sum _{i=1}^{N}\left( 1-\mu _i\right) \left( 1-y_i^r\right) }{\sum _{i=1}^{N}\left( 1-\mu _i\right) }. \end{aligned}$$

Finally, the parameters of the logistic regression classifier can be computed using equations similar to those of the single-annotator case, with the true labels replaced by the soft labels \(\mu _i\) (see [14]).
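As an illustration, the following sketch (a simplified re-implementation under our own naming conventions, not the authors' code) runs the LFC EM updates described above: the E-step computes the responsibilities \(\mu _i\), the M-step updates \(\alpha ^r\) and \(\beta ^r\) in closed form, and the classifier weights are refined by gradient ascent on the soft-label logistic regression objective.

```python
# Simplified sketch of the LFC EM updates, assuming features X (N x D) and annotations Y (N x R).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lfc_em(X, Y, n_iter=50, lr=0.1):
    N, D = X.shape
    R = Y.shape[1]
    w = np.zeros(D)                  # logistic regression parameters
    alpha = np.full(R, 0.8)          # initial sensitivities (illustrative value)
    beta = np.full(R, 0.8)           # initial specificities (illustrative value)
    for _ in range(n_iter):
        p = sigmoid(X @ w)           # p_i = p(z_i = 1 | x_i, w)
        # E-step: responsibilities mu_i = E[z_i]
        a = np.prod(alpha**Y * (1 - alpha)**(1 - Y), axis=1)
        b = np.prod(beta**(1 - Y) * (1 - beta)**Y, axis=1)
        mu = a * p / (a * p + b * (1 - p) + 1e-12)
        # M-step: closed-form updates for the annotator performance parameters
        alpha = (mu[:, None] * Y).sum(axis=0) / (mu.sum() + 1e-12)
        beta = ((1 - mu)[:, None] * (1 - Y)).sum(axis=0) / ((1 - mu).sum() + 1e-12)
        # M-step: gradient ascent on w using the soft labels mu_i
        w += lr * X.T @ (mu - sigmoid(X @ w)) / N
    return w, alpha, beta, mu
```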

Modeling annotator expertise: Learning when everybody knows a bit of something (MAE). We follow the multi-annotator classification scheme proposed in [13], which extends the model in [12]. Unlike LFC, MAE considers that the label given by annotator r depends on both the unknown true label \(z_i\) and the instance \(\mathbf {x}_i\) being labeled, so that

$$\begin{aligned} p\left( y_i^r|\mathbf {x}_i,z_i\right) = \left( 1-\eta _r(\mathbf {x}_i)\right) ^{|y_i^r - z_i|}\eta _r(\mathbf {x}_i)^{1-|y_i^r - z_i|}, \end{aligned}$$

where \(\eta _r(\mathbf {x}_i)\) follows a logistic regression model, \(\eta _r(\mathbf {x}_i) = \left( 1 + \exp \left( -\varvec{\lambda }_r^{\top }\mathbf {x}_i\right) \right) ^{-1}\). Given the dataset, we need to estimate the parameters associated with the performance of each annotator, \(\varvec{\varLambda }=\left[ \varvec{\lambda }_1, \dots , \varvec{\lambda }_R\right] \), together with the parameters \(\varvec{w}\) of a logistic regression classifier [14]. These parameters are also estimated with an Expectation-Maximization (EM) algorithm. The likelihood function is given as

$$\begin{aligned} p\left( {\fancyscript{D}}|\varvec{\phi }\right) = \prod _{i=1}^{N}\left[ p_i\prod _{r=1}^{R}\left( 1-\eta _r(\mathbf {x}_i)\right) ^{1-y_i^r}\eta _r(\mathbf {x}_i)^{y_i^r}+(1-p_i)\prod _{r=1}^{R}\left( 1-\eta _r(\mathbf {x}_i)\right) ^{y_i^r}\eta _r(\mathbf {x}_i)^{1-y_i^r}\right] , \end{aligned}$$

where \(\varvec{\phi }=\left\{ \varvec{\varLambda }, \varvec{w}\right\} \), and \(p_i = p\left( z_i=1|\mathbf {x}_i,\varvec{w}\right) \) is computed by means of a logistic regression function [14]. The EM algorithm comprises the following steps:

E-step: The conditional expectation of the log-likelihood is defined as

$$\begin{aligned} \mathbb {E}[\ln (p({\fancyscript{D}}, \mathbf {z}|\varvec{\phi }))] = \sum _{i=1}^{N} \mathbb {E}[z_i]\ln (c_{i}p_{i}) + \left( 1-\mathbb {E}[z_i]\right) \ln \left( d_i(1-p_i)\right) , \end{aligned}$$

where \( \displaystyle c_i = \prod _{r=1}^{R}\left( 1-\eta _r(\mathbf {x}_i)\right) ^{1-y_i^r}\eta _r(\mathbf {x}_i)^{y_i^r}\), and \( \displaystyle d_i =\prod _{r=1}^{R}\left( 1-\eta _r(\mathbf {x}_i)\right) ^{y_i^r}\eta _r(\mathbf {x}_i)^{1-y_i^r}\), and \(\mathbb {E}[z_i]\) is the estimated ground truth which follows \( \displaystyle \mathbb {E}[z_i] = \mu _{i} = \frac{c_{i}p_{i}}{c_ip_i+d_i(1-p_i)}.\)

M-step: Given the estimated gold standard \(\mu _i\) and the training data, we estimate the parameters \(\varvec{\phi }\) by maximizing the conditional expectation of the log-likelihood computed in the E-step. To compute the parameters \(\varvec{\lambda }_r\), we use gradient-based methods; the first-order derivative w.r.t. \(\varvec{\lambda }_r\) is

$$\begin{aligned} \frac{\partial \mathbb {E}[\ln (p({\fancyscript{D}}, \mathbf {z}|\varvec{\phi }))]}{\partial \varvec{\lambda }_r} = \sum _{i=1}^{N}(-1)^{y_i^r}\left( 1-2\mu _{i}\right) \eta _r(\mathbf {x}_i)\left( 1-\eta _r(\mathbf {x}_i)\right) \mathbf {x}_i. \end{aligned}$$

Finally, the parameters of the logistic regression classifier can be computed using equations similar to those of the single-annotator case, with the true labels replaced by the soft labels \(\mu _i\) (see [14]).
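A similar sketch (again a simplified, hypothetical implementation rather than the original code) can be written for the MAE scheme: each annotator has a parameter vector \(\varvec{\lambda }_r\), the E-step computes the responsibilities \(\mu _i\), and the M-step updates each \(\varvec{\lambda }_r\) by gradient ascent using the derivative given above.

```python
# Simplified sketch of the MAE EM scheme, reusing X (N x D) and Y (N x R) and the sigmoid above.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mae_em(X, Y, n_iter=50, lr=0.1):
    N, D = X.shape
    R = Y.shape[1]
    w = np.zeros(D)                      # classifier parameters
    Lam = np.zeros((R, D))               # one lambda_r per annotator
    for _ in range(n_iter):
        p = sigmoid(X @ w)               # p_i = p(z_i = 1 | x_i, w)
        eta = sigmoid(X @ Lam.T)         # eta[i, r] = eta_r(x_i)
        # E-step: responsibilities mu_i = E[z_i]
        c = np.prod((1 - eta)**(1 - Y) * eta**Y, axis=1)
        d = np.prod((1 - eta)**Y * eta**(1 - Y), axis=1)
        mu = c * p / (c * p + d * (1 - p) + 1e-12)
        # M-step: gradient ascent on each lambda_r using the derivative stated above
        for r in range(R):
            g = ((-1.0)**Y[:, r]) * (1 - 2 * mu) * eta[:, r] * (1 - eta[:, r])
            Lam[r] += lr * (g[:, None] * X).sum(axis=0) / N
        # M-step: gradient ascent on w with the soft labels mu_i
        w += lr * X.T @ (mu - sigmoid(X @ w)) / N
    return w, Lam, mu
```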

3 Results and Discussions

Ultrasound imaging dataset: To validate the nerve segmentation approach based on classification with multiple annotators, we use a dataset named UI-UTP, which consists of ultrasound image recordings from patients who underwent regional anesthesia through the peripheral nerve blocking procedure. The dataset comprises 48 ultrasound images of the ulnar nerve (21 images) and the median nerve (27 images), collected with a Sonosite Nano-Maxx device at a resolution of 640\(\times \)480 pixels. Each image was labeled by three specialists in anesthesiology to indicate the location of the nerve structures.

Segmentation scheme training and testing: A leave-one-out validation scheme is employed to compute the system performance regarding nerve segmentation in the context of multiple annotators. The nerve segmentation considering multiple experts comprises the following stages. First, we use graph-cuts segmentation [15] to define a region of interest (ROI) in which the nerve region is likely located. Then, a median filter is applied over the ROIs to reduce the speckle noise effect while enhancing the UI quality. Next, each filtered image is divided into different regions by using SLIC superpixels [16], and each superpixel is parametrized using the non-linear Wavelet transform (for details, see [6]). The segmentation problem is then posed as a binary classification task, where each parametrized superpixel is classified as nerve or background. However, as previously pointed out, it is not possible to obtain the ground truth (i.e., the labels indicating which superpixels belong to a nerve region and which do not), since these labels correspond to a subjective assessment given by an anesthesiologist. Accordingly, we use two schemes for binary classification in the context of multiple annotators: the first is a model based on logistic regression, where the annotator performance depends on the true label and is measured in terms of sensitivity and specificity [12] (LFC); the second is a model based on logistic regression that assumes the annotator performance depends on both the true label and the instance being labeled [13] (MAE). Finally, we apply a methodology based on morphological operators to refine the segmentation results (see [6]). The system performance is measured in terms of the Dice coefficient (DC), which quantifies the overlap between the UI segmented by the proposed approach and the labels given by the specialists. In addition, we consider two common approaches for dealing with classification problems with multiple annotators: a logistic regression scheme in which the labels from each individual annotator are taken as the gold standard (LR-EX1, LR-EX2, LR-EX3), and a logistic regression scheme that takes the majority voting of the annotations as the true labels (LR-MV); a simple sketch of the majority-voting fusion and of the Dice computation is given below.
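For illustration, the following sketch (with assumed helper names, not part of the original pipeline code) shows two simple pieces of this evaluation: the majority-voting fusion used for the LR-MV baseline and the Dice coefficient used to compare a predicted nerve mask against a reference segmentation.

```python
# Sketch of label fusion by majority voting and of the Dice coefficient for binary masks.
import numpy as np

def majority_voting(Y):
    """Y: (N, R) binary annotations; returns the per-instance majority label."""
    return (Y.mean(axis=1) >= 0.5).astype(int)

def dice_coefficient(pred_mask, ref_mask):
    """Dice coefficient between two binary masks of the same shape."""
    pred = pred_mask.astype(bool)
    ref = ref_mask.astype(bool)
    intersection = np.logical_and(pred, ref).sum()
    denom = pred.sum() + ref.sum()
    return 2.0 * intersection / denom if denom > 0 else 1.0

# Example: compare a (synthetic) predicted nerve mask against a (synthetic) reference mask.
pred = np.zeros((480, 640), dtype=int); pred[200:260, 300:380] = 1
ref = np.zeros((480, 640), dtype=int); ref[210:270, 310:390] = 1
print(dice_coefficient(pred, ref))
```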

Fig. 1. Segmentation results for an ulnar nerve. Top row, from left to right: the original image and the labels provided by the three experts. Second row, from left to right: the segmentation results provided by LR-EX1, LR-EX2, and LR-EX3. Bottom row, from left to right: the segmentation results for LR-MV, LFC, and MAE.

Obtained results: To visually compare the attained nerve identification in the context of multiple experts, Fig. 1 shows some segmentation results for each multi-annotator classification approach. Overall, all classification methods identify relevant patterns to segment nerves from the UI. Nevertheless, due to the absence of the true label, typical classification methods (LR-EX1, LR-EX2, LR-EX3, and LR-MV) fail to identify the nerves completely, generating false negatives (i.e., nerve regions classified as background). This is a significant issue, since the anesthesiologist needs an accurate delimitation of the nerve structure to define the point where the anesthetic should be spread. Unlike these methods, our approach based on multiple annotators (specifically MAE) considerably reduces the number of false negatives in the segmented image, offering a better identification of the nerve structure. Table 1 shows the results of the morphological validation in terms of the DC for the leave-one-out validation scheme. We perform a statistical significance analysis based on an equal-mean test for the segmentation approaches considered in this work, which allows determining which method provides a higher performance in terms of the Dice coefficient. From these results, the segmentation scheme based on multiple annotators (specifically the model proposed in [13]) outperforms state-of-the-art approaches based on typical supervised learning schemes (i.e., approaches that do not consider multiple annotators). This can be explained by the fact that the multiple-annotator schemes assume that the gold standard is not available in the training stage and estimate the true label from the manual segmentations provided by different experts in anesthesiology. In contrast, segmentation approaches based on typical supervised learning assume the manual segmentations from one of the annotators as the gold standard, which implies that the classifier predictions are biased by that annotator's expertise.

Table 1. Nerve segmentation validation in terms of the Dice coefficient.

4 Conclusion

In this paper, we discuss a first attempt at designing nerve-segmentation systems based on classification with multiple annotators. In this sense, we perform the nerve identification considering the case where the ground truth is not available. In fact, this consideration is not far from reality, since nerve identification in UI depends on the specialist's expertise. Hence, we use multi-annotator classification schemes to jointly estimate the unavailable true labels and the classifier parameters. We tested our strategy on a real-world nerve segmentation dataset captured by "The Automatics Research Group-Universidad Tecnológica de Pereira," which contains UI of ulnar and median nerves. The experimental results showed that the segmentation methodology based on the information from different experts outperforms state-of-the-art alternatives for nerve segmentation in terms of the Dice coefficient. Hence, the proposed method provides a better interpretation of the patterns associated with the nerves by combining the manual segmentations given by multiple anesthesiologists. As future work, the authors plan to use more robust multi-annotator classification schemes (for example, approaches based on deep learning) to further improve the quality of the nerve segmentation.