Sequence labeling with multiple annotators
Abstract
The increasingly popular use of Crowdsourcing as a resource to obtain labeled data has been raising the awareness of the machine learning community to the problem of supervised learning from multiple annotators. Several approaches have been proposed to deal with this issue, but they disregard sequence labeling problems. However, these are very common, for example, in the Natural Language Processing and Bioinformatics communities. In this paper, we present a probabilistic approach for sequence labeling using Conditional Random Fields (CRF) for situations where label sequences from multiple annotators are available but there is no actual ground truth. The approach uses the Expectation-Maximization algorithm to jointly learn the CRF model parameters, the reliability of the annotators and the estimated ground truth. When it comes to performance, the proposed method (CRF-MA) significantly outperforms typical approaches such as majority voting.
Keywords
Multiple annotators · Crowdsourcing · Conditional random fields · Latent variable models · Expectation-Maximization

1 Introduction
The increasing awareness of the importance of Crowdsourcing (Howe 2008) as a means of obtaining labeled data is promoting a shift in machine learning towards models that are annotator-aware. A good example is that of online platforms such as Amazon’s Mechanical Turk (AMT).^{1} These platforms provide an accessible and inexpensive resource to obtain labeled data, whose quality, in many situations, competes directly with that of “experts” (Snow et al. 2008; Novotney and Callison-Burch 2010). Also, by distributing a labeling task among multiple annotators, it can be completed in a considerably smaller amount of time. For such reasons, these online work-recruiting platforms are rapidly changing the way datasets are built.
Furthermore, the social web promotes an implicit form of Crowdsourcing, as multiple web users interact and share content (e.g., document tags, product ratings, opinions, user clicks, etc.). As the social web expands, so does the need for annotator-aware models.
From another perspective, there are tasks for which ground truth labels are simply very hard to obtain. Consider, for instance, the tasks of Sentiment Analysis, Movie Rating or Keyphrase Extraction. These tasks are subjective in nature, and hence the definition of ground truth requires very strict guidelines, which can be very hard to establish and follow. Even in well-studied tasks like Named Entity Recognition, linguists argue about what should and should not be considered a named entity, and consensus is not easily obtained. In cases where the task is inherently subjective, an attainable goal is to build a model that captures the wisdom of the crowds (Surowiecki 2004) as well as possible, while paying less attention to dissonant views.
Another example can be found in the field of medical diagnosis, where obtaining ground truth can mean expensive or invasive medical procedures like biopsies. On the other hand, it is much simpler for a physician to consult his colleagues for an opinion, resulting in a multiple “experts” scenario.
Sequence labeling refers to the supervised learning task of assigning a label to each element of a sequence. Typical examples are Part-of-Speech tagging, Named Entity Recognition and Gene Prediction (Allen et al. 2004; Allen and Salzberg 2005). In such tasks, the individual labels cannot be considered detached from their context (i.e. the preceding and succeeding elements of the sequence and their corresponding labels). Two of the most popular sequence models are hidden Markov models (HMM) (Rabiner 1989) and Conditional Random Fields (CRF) (Lafferty et al. 2001). Due to their usually high-dimensional feature spaces (especially in the case of CRFs), these models frequently require large amounts of labeled data to be properly trained, which hinders the construction and release of datasets and makes it almost prohibitive to rely on a single annotator. Although in some domains the use of unlabeled data can make this problem less severe (Bellare and McCallum 2007), a more natural solution is to rely on multiple annotators. For example, for many tasks, AMT can be used to label large amounts of data (Callison-Burch and Dredze 2010). However, the large numbers needed to compensate for the heterogeneity of the annotators’ expertise rapidly raise the actual cost beyond acceptable values. A parsimonious solution must therefore be designed that can deal with such real-world constraints and heterogeneity.
In the past few years, many approaches have been proposed to deal with the problem of supervised learning from multiple annotators in different paradigms (classification, regression, ranking, etc.); however, the particular problem of sequence labeling from multiple annotators has been left practically untouched, and most applications typically rely on majority voting (e.g. Laws et al. 2011). Given its importance in fields such as Natural Language Processing, Bioinformatics, Computer Vision, Speech and Ubiquitous Computing, sequence labeling from multiple annotators is a very important problem. Unfortunately, due to its nature, typical approaches proposed for binary or categorical classification cannot be directly applied to sequences.
In this paper we propose a probabilistic approach using the Expectation-Maximization (EM) algorithm for sequence labeling with CRFs in the scenario where multiple annotators provide labels with different levels of “reliability” but no actual ground truth is available. The proposed method is able to jointly learn the CRF model parameters, the reliabilities of the annotators and the estimated ground truth label sequences. It is empirically shown that this method outperforms the baselines even with high levels of noise in the annotators’ labels and when the less “trustworthy” annotators dominate. The proposed approach also has the advantage of not requiring repeated labeling of the same input sequences by the different annotators. Finally, this approach can be easily modified to work with other sequence labeling models such as HMMs.
2 Related work
The first works that relate to the problem of learning from multiple annotators go back to 1979 when Dawid and Skene proposed an approach for estimating the error rates of multiple patients (annotators) given their responses (labels) to multiple medical questions. Although this work just focused on estimating the hidden ground truth labels, it inspired other works where there is an explicit attempt to learn a classifier. For example, Smyth et al. (1995) propose a similar approach to solve the problem of volcano detection and classification in Venus imagery with data labelled by multiple experts. Like in previous works, their approach relies on a latent variable model, where they treat the ground truth labels as latent variables. The main difference is that the authors use the estimated (probabilistic) ground truth to explicitly learn a classifier.
More recently, Snow et al. (2008) demonstrated that learning from labels provided by multiple non-expert annotators can be as good as learning from the labels of one expert. Such findings inspired the development of new approaches that, unlike previous ones (Smyth et al. 1995; Donmez and Carbonell 2008; Sheng et al. 2008), do not rely on repeated labeling, i.e. having the same annotators label the same set of instances. In Raykar et al. (2009, 2010) an approach is proposed where the classifier and the annotators’ reliabilities are learnt jointly. Later works then relaxed the assumption that the annotators’ reliabilities do not depend on the instances they are labeling (Yan et al. 2010), and extended the proposed methodology to an active learning scenario (Yan et al. 2011). All these approaches share a few key aspects: (1) they use a latent variable model where the ground truth labels are treated as latent variables; (2) they rely on the EM algorithm (Dempster et al. 1977) to find maximum likelihood estimates for the model parameters; and (3) they deal mostly with binary classification problems (although some suggest extensions to handle categorical, ordinal and even continuous data).
The acclaimed importance of supervised learning from multiple annotators has led to many interesting alternative approaches and variations/extensions of previous works in the past couple of years. In Donmez et al. (2010) the authors propose the use of a particle filter to model the time-varying accuracies of the different annotators. Groot et al. (2011) propose an annotator-aware methodology for regression problems using Gaussian processes, and Wu et al. (2011) present a solution for ranking problems with multiple annotators.
Despite the variety of approaches presented for different learning paradigms, the problem of sequence labeling from multiple annotators has been left practically untouched, with the only relevant work being that of Dredze et al. (2009). In this work the authors propose a method for learning structured predictors, namely CRFs, from instances with multiple labels in the presence of noise. This is achieved by modifying the CRF objective function used for training through the inclusion of a per-label prior, thereby preventing the model from straying too far from the provided priors. The per-label priors are then re-estimated by making use of their likelihoods under the whole dataset. In this way, the model is capable of using knowledge from other parts of the dataset to prefer certain labels over others. By iterating between the computation of the expected values of the label priors and the estimation of the model parameters in an EM-like fashion, the model is expected to give preference to the less noisy labels. Hence, we can view this process as self-training, a process whereby the model is trained iteratively on its own output. Although this approach makes the model computationally tractable, their experimental results indicate that the method only improves performance in scenarios where there is a small amount of training data (low quantity) and where the labels are noisy (low quality).
It is important to stress that, contrary to the model proposed in this paper, the model of Dredze et al. (2009) is a multi-label model, not a multi-annotator model, in the sense that the knowledge of who provided the multiple label sequences is completely discarded. The obvious solution for including this knowledge would be to use a latent ground truth model similar to the one proposed by Raykar et al. (2009, 2010), thus extending this work to sequence labeling tasks. However, treating the ground truth label sequences as latent variables and using the EM algorithm to estimate the model parameters would be problematic, since the number of possible label sequences grows exponentially with the length of the sequence, making the marginalization over the latent variables intractable. In contrast, the approach presented in this paper avoids this problem by treating the annotators’ reliabilities as latent variables, making the marginalization over the latent variables tractable (see Sect. 3).
In the field of Bioinformatics, a similar problem has been attracting attention, in which multiple sources of evidence are combined for gene prediction (e.g. Allen et al. 2004; Allen and Salzberg 2005). In these approaches the outputs of multiple predictors (e.g. HMMs) are usually combined using a vote over the predicted labels, weighted by the confidence (posteriors) of the various sources in their predictions (Allen et al. 2004). Non-linear decision schemes also exist, for example using Decision Trees (Allen and Salzberg 2005), but, as with the linearly weighted voting schemes, the confidence weights are estimated once and never corrected. This contrasts with the approach discussed in this paper, where the goal is to build a single predictor (a CRF) from the knowledge of multiple annotators (sources), and where the confidence of each source is iteratively re-estimated.
3 Approach
3.1 Measuring the reliability of the annotators
Let y^{r} be a sequence of labels assigned by the r-th annotator to some observed input sequence x. If we were told the actual (unobserved) sequence of true labels y for that same input sequence x, we could evaluate the quality (or reliability) of the r-th annotator on a dataset by measuring their precision and recall. Furthermore, we could combine precision and recall into a single measure by using the traditional F1-measure, and use this combined measure to evaluate how “good” or “reliable” a given annotator is according to some ground truth. In practice, any appropriate loss function can be used to evaluate the quality of the annotators. The choice of one metric over others is purely problem-specific. The F-measure was used here due to its wide applicability in sequence labeling problems and, particularly, in the tasks used in the experiments (Sect. 4).
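As a concrete illustration, an annotator's reliability could be computed as follows. This is a hypothetical token-level sketch (the actual evaluation may operate at the segment level); the function name and the convention of treating the "O" label as the negative class are our own assumptions:

```python
def annotator_f1(true_seqs, annot_seqs, outside="O"):
    """Precision, recall and F1 of one annotator's label sequences
    against the ground truth, counting non-Outside tokens as positives."""
    tp = fp = fn = 0
    for y_true, y_r in zip(true_seqs, annot_seqs):
        for t, r in zip(y_true, y_r):
            if r != outside and r == t:
                tp += 1          # correctly predicted a non-Outside label
            elif r != outside:
                fp += 1          # predicted a label the truth does not have
            elif t != outside:
                fn += 1          # missed a label present in the truth
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

An annotator who reproduces the ground truth exactly scores F1 = 1, while one who labels everything "O" scores 0.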
3.2 Sequence labeling
If for a dataset of N input sequences \(\mathcal {X}= \{\mathbf{x}_{i}\}_{i=1}^{N}\) we knew the actual ground truth label sequences \(\mathcal {Y}= \{\mathbf{y}_{i}\}_{i=1}^{N}\), we could model the probabilities of the label sequences \(\mathcal {Y}\) given the input sequences \(\mathcal {X}\) using a linear-chain Conditional Random Field (CRF) (Lafferty et al. 2001).
According to the model defined in Eq. (1), the most probable label sequence for an input sequence x is given by y^{∗} = argmax_{y} p_{crf}(y | x, λ), which can be efficiently determined through dynamic programming using the Viterbi algorithm.
The parameters λ of the CRF model are typically estimated from an i.i.d. dataset by maximum likelihood using limited-memory BFGS (Liu and Nocedal 1989). The log-likelihood for a dataset \(\{\mathbf{x}_{i}, \mathbf{y}_{i}\}_{i=1}^{N}\) is given by \(\sum_{i=1}^{N} \ln p_{\mathit{crf}}(\mathbf{y}_{i} \mid \mathbf{x}_{i}, \boldsymbol{\lambda})\).
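To make the decoding step concrete, a minimal log-space Viterbi sketch for a linear-chain model might look like the following. The emission and transition score matrices are assumed to be given, rather than computed from the CRF's actual feature functions:

```python
import numpy as np

def viterbi(emissions, transitions):
    """Most probable label sequence under a linear-chain model.
    emissions: (T, K) per-position label scores in log-space;
    transitions: (K, K) label-to-label scores in log-space."""
    T, K = emissions.shape
    delta = np.zeros((T, K))            # best score of a path ending in each label
    back = np.zeros((T, K), dtype=int)  # backpointers
    delta[0] = emissions[0]
    for t in range(1, T):
        # scores[i, j]: best path ending in label i at t-1, then label j at t
        scores = delta[t - 1][:, None] + transitions + emissions[t][None, :]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    # trace back the best path from the final position
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

The dynamic program runs in O(TK²) time, which is what makes exact decoding tractable for linear chains.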
3.3 Maximum likelihood estimator
Since we do not know the set of actual ground truth label sequences \(\mathcal {Y}\) for the set of input sequences \(\mathcal {X}\), we must find a way to estimate it using the sets of label sequences provided by the R different annotators \(\{\mathcal {Y}^{1},\ldots,\mathcal {Y}^{R}\}\), and learn a CRF model along the way.
1. draw z ∼ Multinomial(π_{1}, …, π_{R})
2. for each instance x_{i}:
   (a) for each annotator r:
      (i) if z^{r} = 1, draw \(\mathbf{y}_{i}^{r}\) from \(p_{\mathit{crf}}(\mathbf{y}_{i}^{r} \mid \mathbf{x}_{i}, \boldsymbol{\lambda})\)
      (ii) if z^{r} = 0, draw \(\mathbf{y}_{i}^{r}\) from \(p_{\mathit{rand}}(\mathbf{y}_{i}^{r} \mid \mathbf{x}_{i})\)
Although it might seem too restrictive to assume that only one annotator provides the correct label sequences, it is important to note that the model can still capture the uncertainty about who the correct annotator should be. As an alternative to this approach, one could replace the multinomial random variable with multiple Bernoullis (one for each annotator). From a generative perspective, this would allow multiple annotators to be correct. However, this places too much emphasis on the form of \(p_{\mathit{rand}}(\mathbf{y}_{i}^{r} \mid \mathbf{x}_{i})\), since it would be crucial for deciding whether an annotator is likely to be correct. On the other hand, as we shall see later, by using a multinomial, the probabilities \(p_{\mathit{rand}}(\mathbf{y}_{i}^{r} \mid \mathbf{x}_{i})\) cancel out from the updates of the annotators’ reliabilities, thus forcing the annotators to “compete” with each other for the best label sequences.
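For illustration, the generative story above can be simulated directly. Here `sample_crf` and `sample_rand` are hypothetical stand-ins for sampling from \(p_{\mathit{crf}}\) and \(p_{\mathit{rand}}\):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_annotations(X, pi, sample_crf, sample_rand):
    """Generative story: one annotator (drawn from the multinomial pi)
    labels according to the CRF; the others draw sequences from a
    random model. sample_crf / sample_rand map an input x to a label
    sequence."""
    R = len(pi)
    z = rng.multinomial(1, pi)  # one-hot: which annotator is "correct"
    labels = []
    for x in X:
        labels.append([sample_crf(x) if z[r] == 1 else sample_rand(x)
                       for r in range(R)])
    return z, labels
```

Note that z is drawn once for the whole dataset, reflecting the assumption that a single annotator is consistently the reliable one.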
The choice of explicitly including the reliability of the annotators (which we represent through the vector z) as latent variables and marginalizing over them contrasts with typical approaches to learning from multiple annotators (e.g. Raykar et al. 2009, 2010; Dredze et al. 2009; Yan et al. 2011), where the unobserved ground truth labels are treated as latent variables. Since these variables are not observed (i.e. latent), they must be marginalized over. For sequence labeling problems, this marginalization can be problematic due to the combinatorial explosion of possible label sequences over which we would have to marginalize. Instead, by explicitly handling the annotators’ reliabilities as latent variables, this problem is completely avoided.
3.4 EM algorithm
As with other latent variable models, we rely on the Expectation-Maximization algorithm (Dempster et al. 1977) to find maximum likelihood estimates of the parameters of the proposed model.
In the M-step of the EM algorithm we maximize this expectation with respect to the model parameters θ, obtaining new parameter values θ^{new}.
E-step
Evaluate
$$ \gamma\bigl(z^r\bigr) \propto \pi_r^{old} \prod_{i=1}^{N} p_{\mathit{crf}}\bigl(\mathbf{y}_i^r \mid \mathbf{x}_i, \boldsymbol{\lambda}^{old}\bigr). \tag{15} $$
M-step
Estimate the new ground truth label sequences \(\widehat{\mathcal {Y}}^{new}\) and the new parameters θ^{new} = {π^{new}, λ^{new}}, given by
$$ \boldsymbol{\lambda}^{new} = \arg\max_{\boldsymbol{\lambda}} \sum_{i=1}^{N} \sum_{r=1}^{R} \gamma\bigl(z^r\bigr) \ln p_{\mathit{crf}}\bigl(\mathbf{y}_{i}^{r} \mid \mathbf{x}_{i}, \boldsymbol{\lambda}\bigr) \tag{16} $$
$$ \widehat{\mathcal {Y}}^{new} = \arg\max_{\widehat{\mathcal {Y}}} p_{\mathit{crf}}\bigl(\widehat{\mathcal {Y}} \mid \mathcal {X}, \boldsymbol{\lambda}^{new}\bigr) \tag{17} $$
$$ \pi_r^{new} = \frac{F1\mbox{-}\mathit{measure}_{r}}{\sum_{j=1}^{R} F1\mbox{-}\mathit{measure}_{j}} \tag{18} $$
The initialization of the EM algorithm can simply be done by assigning random values to the annotators’ reliabilities or by estimating the ground truth label sequences \(\widehat{\mathcal {Y}}\) using majority voting. The algorithm stops when the expectation in Eq. (11) converges or when the updates to the annotators’ reliabilities fall below a given threshold.
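The E-step and the π update of the M-step can be sketched as follows. This is a minimal, non-authoritative illustration in which the per-annotator CRF log-likelihoods and F1 scores are assumed to be computed elsewhere; the weighted CRF refit and Viterbi re-decoding (Eqs. 16-17) are omitted:

```python
import numpy as np

def em_step(pi, loglik, f1_scores):
    """One EM iteration over the annotator reliabilities.
    loglik[r]    = sum_i ln p_crf(y_i^r | x_i, lambda) under the current CRF;
    f1_scores[r] = F1 of annotator r against the current ground-truth
                   estimate (used in the pi update, Eq. 18)."""
    # E-step: gamma(z^r) proportional to pi_r * prod_i p_crf(y_i^r | x_i, lambda),
    # computed in log-space for numerical stability.
    log_gamma = np.log(pi) + np.asarray(loglik, float)
    log_gamma -= log_gamma.max()
    gamma = np.exp(log_gamma)
    gamma /= gamma.sum()
    # M-step (partial): new pi from the annotators' F1 scores (Eq. 18).
    pi_new = np.asarray(f1_scores, float)
    pi_new /= pi_new.sum()
    return gamma, pi_new
```

In a full implementation, `gamma` would weight each annotator's sequences in the CRF training objective (Eq. 16) before the ground truth is re-decoded.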
3.5 Maximum-a-posteriori
When no prior knowledge about the annotators’ reliabilities is given, the Dirichlet prior can also be used as a non-informative prior with all parameters α_{r} equal. This prior acts as a regularization term, preventing the model from overfitting the data provided by a few annotators. The strength of the regularization depends on the parameter α.
4 Experiments
The proposed approach is evaluated in the field of Natural Language Processing (NLP) on the particular tasks of Named Entity Recognition (NER) and Noun Phrase (NP) chunking. NER refers to the Information Extraction subtask of identifying and classifying atomic elements in text into predefined categories such as the names of persons, organizations, locations and others, while NP chunking consists of recognizing the chunks of sentences that correspond to noun phrases. Because of their many applications, these tasks are considered very important in the field of NLP and other related areas.
We conduct our experiments using two types of data: artificial data generated by simulating multiple annotators, and real data obtained using Amazon’s Mechanical Turk (AMT). In both cases, the label sequences are represented using the traditional BIO (Begin, Inside, Outside) scheme introduced by Ramshaw and Marcus (1995).
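As a small illustration of the BIO scheme, entity spans can be converted to per-token BIO labels as follows (a generic sketch; the helper name and the exclusive-end span convention are our own):

```python
def spans_to_bio(n_tokens, spans):
    """Convert (start, end, type) entity spans into BIO labels.
    end is exclusive; spans are assumed not to overlap."""
    labels = ["O"] * n_tokens
    for start, end, etype in spans:
        labels[start] = "B-" + etype          # Begin of the segment
        for i in range(start + 1, end):
            labels[i] = "I-" + etype          # Inside the segment
    return labels
```

For the sentence "John Smith visited New York", the spans (0, 2, "PER") and (3, 5, "LOC") yield B-PER I-PER O B-LOC I-LOC.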

The proposed approach is compared against the following baselines:
- MV-seq: majority voting at the sequence level (i.e., the label sequence with the most votes wins);
- MV-token: majority voting at the token level (i.e., the BIO label with the most votes for a given token wins);
- MV-seg: a two-step majority voting performed over the BIO labels of the tokens. First, majority voting is used for the segmentation process (i.e. to decide whether a token should be considered part of a segment, a named entity for example); then a second majority voting is used to decide the labels of the identified segments (e.g. what type of named entity it is);
- CRF-CONC: a CRF trained on the data from all annotators concatenated.
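For instance, the token-level majority voting baseline reduces to a per-position vote. The following is a minimal sketch; tie-breaking here is arbitrary, which may differ from the implementation used in the experiments:

```python
from collections import Counter

def mv_token(annotations):
    """Token-level majority voting.
    annotations: list of label sequences (one per annotator) for the
    same sentence; returns the majority BIO label at each position."""
    return [Counter(labels).most_common(1)[0][0]
            for labels in zip(*annotations)]
```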
The proposed model is also compared with the two variants of the multi-label model proposed in Dredze et al. (2009): MultiCRF and MultiCRF-MAX. The latter differs from the former by training the CRF on the most likely (maximum) label instead of training on the (fuzzy) probabilistic labels (see Dredze et al. 2009 for details).
As an upper bound, we also show the results of a CRF trained on the ground truth (gold) data. We refer to this as “CRF-GOLD”.
Table 1  Summary of CRF features

- Word identity features
- Capitalization patterns
- Numeric patterns
- Other morphologic features (e.g. prefixes and suffixes)
- Part-of-Speech tags
- Bigram and trigram features
- Window features (window size = 3)
During our experiments we found that using the square of the F1-measure when computing π_{r} gives the best results. This has the effect of emphasizing the differences between the reliabilities of the different annotators and, consequently, their respective importance when learning the CRF model from the data. Hence, we use this version in all our experiments.
4.1 Artificial data
4.1.1 Named entity recognition
There are a few publicly available “golden” datasets for NER such as the 2003 CONLL English NER task dataset (Sang and Meulder 2003), which is a common benchmark for sequence labeling tasks in the NLP community. Using the 2003 CONLL English NER data we obtained a train set and a test set of 14987 and 3466 sentences respectively.
Since the 2003 CONLL shared NER dataset does not contain labels from multiple annotators, these are simulated for different reliabilities using the following method. First, a CRF is trained on the complete training set. Then, random Gaussian noise (with zero mean and variance σ^{2}) is added to the CRF parameters, and the modified CRF model is used to determine new label sequences for the training set texts. These label sequences differ more or less from the ground truth depending on σ^{2}. By repeating this procedure many times, we can simulate multiple annotators with different levels of reliability.
An alternative approach would be to take the ground truth dataset and randomly flip the label of each token with uniform probability p. However, this would result in simulated annotators that are inconsistent throughout the dataset, labeling the data with a certain level of randomness. We believe that the scenario where the annotators are consistent but might not be as good as an “expert” is more realistic and challenging, and thus more interesting to investigate. We therefore give preference to the CRF-based method in most of our experiments with artificial data. Nonetheless, we also conduct experiments using this alternative label-flipping method to simulate annotators for the NP chunking task.
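The alternative label-flipping simulation can be sketched as follows; this is an illustrative implementation under our own assumptions about the label set and random choice of the replacement label:

```python
import random

def flip_labels(y, p, label_set=("B-NP", "I-NP", "O"), seed=None):
    """Simulate a noisy annotator by replacing each token's BIO label
    with a different, uniformly chosen label with probability p."""
    rng = random.Random(seed)
    noisy = []
    for t in y:
        if rng.random() < p:
            # flip to any label other than the true one
            noisy.append(rng.choice([l for l in label_set if l != t]))
        else:
            noisy.append(t)
    return noisy
```

With p = 0 the annotator reproduces the ground truth; with p = 0.7 (the worst simulated annotator below) most labels are corrupted.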
Table 2  Results for the NER task with 5 simulated annotators (σ² = [0.005, 0.05, 0.05, 0.1, 0.1]) with repeated labeling

Method         Train set                         Test set
               Precision  Recall  F1             Precision  Recall  F1
MV-seq         24.1 %     50.5 %  32.6 ± 2.0 %   47.3 %     30.9 %  37.3 ± 3.1 %
MV-token       56.0 %     69.1 %  61.8 ± 4.1 %   62.4 %     62.3 %  62.3 ± 3.4 %
MV-seg         52.5 %     65.0 %  58.0 ± 6.9 %   60.6 %     57.1 %  58.7 ± 7.1 %
CRF-CONC       47.9 %     49.6 %  48.4 ± 8.8 %   47.8 %     47.1 %  47.1 ± 8.1 %
MultiCRF       39.8 %     22.6 %  28.7 ± 3.8 %   40.0 %     15.4 %  22.1 ± 5.0 %
MultiCRF-MAX   55.0 %     66.7 %  60.2 ± 4.1 %   63.2 %     58.4 %  60.5 ± 3.6 %
CRF-MA         72.9 %     81.7 %  77.0 ± 3.9 %   72.5 %     67.7 %  70.1 ± 2.5 %
CRF-GOLD       99.7 %     99.9 %  99.8 %         86.2 %     87.8 %  87.0 %
The results in Table 2 indicate that CRF-MA outperforms the four proposed baselines on both the train set and the test set. In order to assess the statistical significance of this result, a paired t-test was used to compare the mean F1 score of CRF-MA on the test set against the MV-seq, MV-token, MV-seg and CRF-CONC baselines. The obtained p-values were 4×10^{−25}, 7×10^{−10}, 4×10^{−8} and 1×10^{−14} respectively, which indicates that the differences are all highly significant.
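For reference, the paired t-statistic underlying such tests can be computed over matched per-run F1 scores as below. This sketch is ours and returns only the t-statistic; the p-value would come from a t-distribution with n−1 degrees of freedom:

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t-statistic for two matched samples, e.g. per-run F1
    scores of two methods over repeated experiments."""
    d = [x - y for x, y in zip(a, b)]   # per-run differences
    n = len(d)
    return mean(d) / (stdev(d) / math.sqrt(n))
```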
Regarding the MultiCRF model, we can see that, at best, it performs almost as well as MV-token. Not surprisingly, the “MAX” version of MultiCRF performs better than the standard version. This behavior is expected, since the “hard” labels obtained from majority voting also perform better than the “soft” label effect obtained in CRF-CONC. Nonetheless, neither version of MultiCRF performs as well as CRF-MA (test set p-values are 1×10^{−26} and 1×10^{−11} for MultiCRF and MultiCRF-MAX respectively).
Table 3  Results for the NER task with 5 simulated annotators (σ² = [0.005, 0.05, 0.05, 0.1, 0.1]) without repeated labeling

Method     Train set                        Test set
           Precision  Recall  F1            Precision  Recall  F1
CRF-CONC   52.1 %     56.5 %  54.0 ± 7.3 %  53.9 %     51.7 %  52.6 ± 6.4 %
CRF-MA     63.8 %     71.1 %  67.2 ± 1.7 %  65.7 %     62.7 %  64.2 ± 1.6 %
CRF-GOLD   99.7 %     99.9 %  99.8 %        86.2 %     87.8 %  87.0 %
The comparison of both experiments (i.e. with and without repeated labeling) indicates that, in this setting, having less repeated labeling hurts the performance of CRF-MA. Since this model differentiates between annotators with different levels of expertise, its performance is best when the more reliable ones have annotated more sequences, which is more likely to happen with more repeated labeling. Naturally, the opposite occurs with CRF-CONC. Since in this setting the less reliable annotators dominate, more repeated labeling translates into an even greater predominance of lower-quality annotations, which hurts the performance of CRF-CONC.
4.1.2 Noun phrase chunking
For the NP chunking task, the 2003 CONLL English NER dataset was also used. Besides named entities, this dataset also provides part-of-speech tags and syntactic tags (i.e. noun phrases, verbal phrases, prepositional phrases, etc.). The latter were used to generate a train set and a test set for NP chunking with the same sizes as the corresponding NER datasets.
Table 4  Results for the NP chunking task with 5 simulated annotators (p = [0.01, 0.1, 0.3, 0.5, 0.7]) with repeated labeling

Method         Train set                         Test set
               Precision  Recall  F1             Precision  Recall  F1
MV-seq         50.6 %     55.6 %  53.0 ± 0.4 %   66.1 %     63.1 %  64.6 ± 2.4 %
MV-token       83.6 %     76.1 %  79.7 ± 0.2 %   83.3 %     86.9 %  85.0 ± 0.7 %
CRF-CONC       84.3 %     84.7 %  84.5 ± 1.8 %   83.8 %     82.9 %  83.3 ± 1.9 %
MultiCRF       76.6 %     65.6 %  70.7 ± 0.4 %   75.6 %     64.9 %  69.8 ± 0.4 %
MultiCRF-MAX   83.6 %     81.3 %  82.5 ± 1.0 %   81.2 %     79.0 %  80.1 ± 1.0 %
CRF-MA         92.0 %     91.8 %  91.9 ± 1.9 %   89.7 %     89.7 %  89.7 ± 0.8 %
CRF-GOLD       99.9 %     99.9 %  99.9 %         95.9 %     91.1 %  91.0 %
4.2 Real data
The use of Crowdsourcing platforms to annotate sequences is currently a very active research topic (Laws et al. 2011), with many different solutions being proposed to improve both the annotation and the learning processes at various levels: for example, by evaluating annotators through the use of an expert (Voyer et al. 2010), by using a better annotation interface (Lawson et al. 2010), or by learning from partially annotated sequences, thus reducing annotation costs (Fernandes and Brefeld 2011).
Table 5  Results for the NER task using real data obtained from Amazon’s Mechanical Turk

Method         Train set                  Test set
               Precision  Recall  F1      Precision  Recall  F1
MV-seq         79.0 %     55.2 %  65.0 %  44.3 %     81.0 %  57.3 %
MV-token       79.0 %     54.2 %  64.3 %  45.5 %     80.9 %  58.2 %
MV-seg         83.7 %     51.9 %  64.1 %  46.3 %     82.9 %  59.4 %
CRF-CONC       86.8 %     58.4 %  69.8 %  40.2 %     86.0 %  54.8 %
MultiCRF       67.8 %     15.4 %  25.1 %  74.8 %      3.7 %   7.0 %
MultiCRF-MAX   79.5 %     51.9 %  62.8 %  84.1 %     37.1 %  51.5 %
CRF-MA         86.0 %     65.6 %  74.4 %  49.4 %     85.6 %  62.6 %
CRF-GOLD       99.2 %     99.4 %  99.3 %  79.1 %     80.4 %  74.8 %
Table 6  Results for the NER task using data from Amazon’s Mechanical Turk without repeated labeling (subsampled from the original dataset)

Method     Train set                         Test set
           Precision  Recall  F1             Precision  Recall  F1
CRF-CONC   71.1 %     42.8 %  53.1 ± 10.5 %  35.9 %     70.1 %  47.2 ± 8.7 %
CRF-MA     76.2 %     54.2 %  63.3 ± 1.6 %   46.0 %     78.2 %  57.9 ± 1.8 %
CRF-GOLD   99.2 %     99.4 %  99.3 %         79.1 %     80.4 %  74.8 %
5 Conclusion
This paper presented a probabilistic approach for sequence labeling using CRFs with data from multiple annotators, which relies on a latent variable model where the reliabilities of the annotators are handled as latent variables. The EM algorithm is then used to find maximum likelihood estimates for the CRF model parameters, the reliabilities of the annotators and the ground truth label sequences. The proposed approach is empirically shown to significantly outperform traditional approaches, such as majority voting and training on the concatenated labeled data from all annotators, even with high levels of noise in the annotators’ labels and when the less “trustworthy” annotators dominate. This approach also has the advantage of not requiring repeated labeling of the same input sequences by the different annotators (unlike majority voting, for example). Although we presented a formulation using CRFs, it could easily be modified to work with other sequence labeling models such as HMMs.
In future work we intend to explore dependencies of the annotators’ reliabilities on the input sequences they are labeling, which can be challenging due to the high dimensionality of the feature space, as well as the inclusion of a Dirichlet prior over the qualities of the annotators. Furthermore, the extension of the proposed model to an active learning setting will also be considered. Since the annotators’ reliabilities are estimated by the EM algorithm, this information can be used, for example, to decide who the most trustworthy annotators are. Requesting new labels from those annotators will eventually improve the model’s performance and reduce annotation cost.
Footnotes
1.
2. These constraints are required for Jensen’s inequality to apply and for the EM algorithm presented in Sect. 3.4 to be valid.
3. Note that the ground truth estimate is required to compute the F1 scores of the annotators and to estimate the multinomial parameters π.
4. Datasets available at: http://amilab.dei.uc.pt/fmpr/crfmadatasets.tar.gz. Source code available at: http://amilab.dei.uc.pt/fmpr/macrf.tar.gz.
5. In fact, since a BIO decomposition is being used, there are three possible labels: B-NP, I-NP and O, and these are the labels that were used in the random flipping process.
Acknowledgements
The Fundação para a Ciência e Tecnologia (FCT) is gratefully acknowledged for funding this work with the grants SFRH/BD/78396/2011 and PTDC/EIA-EIA/115014/2009 (CROWDS). We would also like to thank Mark Dredze and Partha Talukdar for kindly providing the code for their implementation of the MultiCRF model (Dredze et al. 2009).
References
 Allen, J., & Salzberg, S. (2005). JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics, 21(18), 3596–3603.
 Allen, J., Pertea, M., & Salzberg, S. (2004). Computational gene prediction using multiple sources of evidence. Genome Research, 14(1), 142–148.
 Bellare, K., & McCallum, A. (2007). Learning extractors from unlabeled text using relevant databases. In Sixth international workshop on information integration on the web.
 Callison-Burch, C., & Dredze, M. (2010). Creating speech and language data with Amazon's Mechanical Turk. In Proceedings of the NAACL HLT 2010 workshop on creating speech and language data with Amazon's Mechanical Turk (pp. 1–12).
 Dawid, A. P., & Skene, A. M. (1979). Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society. Series C. Applied Statistics, 28(1), 20–28.
 Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1–38.
 Donmez, P., & Carbonell, J. (2008). Proactive learning: cost-sensitive active learning with multiple imperfect oracles. In Proceedings of the 17th ACM conference on information and knowledge management (pp. 619–628).
 Donmez, P., Schneider, J., & Carbonell, J. (2010). A probabilistic framework to learn from multiple annotators with time-varying accuracy. In Proceedings of the SIAM international conference on data mining (pp. 826–837).
 Dredze, M., Talukdar, P., & Crammer, K. (2009). Sequence learning from data with multiple labels. In ECML-PKDD 2009 workshop on learning from multi-label data.
 Fernandes, E., & Brefeld, U. (2011). Learning from partially annotated sequences. In Proceedings of the 2011 European conference on machine learning and knowledge discovery in databases (pp. 407–422).
 Groot, P., Birlutiu, A., & Heskes, T. (2011). Learning from multiple annotators with Gaussian processes. In Proceedings of the 21st international conference on artificial neural networks (Vol. 6792, pp. 159–164).
 Howe, J. (2008). Crowdsourcing: why the power of the crowd is driving the future of business (1st edn.). New York: Crown Publishing Group.
 Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th international conference on machine learning (pp. 282–289).
 Laws, F., Scheible, C., & Schütze, H. (2011). Active learning with Amazon Mechanical Turk. In Proceedings of the conference on empirical methods in natural language processing (pp. 1546–1556). Stroudsburg: Association for Computational Linguistics.
 Lawson, N., Eustice, K., Perkowitz, M., & Yetisgen-Yildiz, M. (2010). Annotating large email datasets for named entity recognition with Mechanical Turk. In Proceedings of the NAACL HLT 2010 workshop on creating speech and language data with Amazon's Mechanical Turk (pp. 71–79). Stroudsburg: Association for Computational Linguistics.
 Liu, D., & Nocedal, J. (1989). On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45, 503–528.
 Novotney, S., & Callison-Burch, C. (2010). Cheap, fast and good enough: automatic speech recognition with non-expert transcription. In Human language technologies, HLT '10 (pp. 207–215). Stroudsburg: Association for Computational Linguistics.
 Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE (pp. 257–286).
 Ramshaw, L., & Marcus, M. (1995). Text chunking using transformation-based learning. In Proceedings of the third workshop on very large corpora (pp. 82–94). Stroudsburg: Association for Computational Linguistics.
 Raykar, V., Yu, S., Zhao, L., Jerebko, A., Florin, C., Valadez, G., Bogoni, L., & Moy, L. (2009). Supervised learning from multiple experts: whom to trust when everyone lies a bit. In Proceedings of the 26th international conference on machine learning (pp. 889–896).
 Raykar, V., Yu, S., Zhao, L., Valadez, G., Florin, C., Bogoni, L., & Moy, L. (2010). Learning from crowds. Journal of Machine Learning Research, 1297–1322.
 Sang, E., & Meulder, F. D. (2003). Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the 7th conference on natural language learning at HLT-NAACL (Vol. 4, pp. 142–147).
 Sheng, V., Provost, F., & Ipeirotis, P. (2008). Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 614–622).
 Smyth, P., Fayyad, U., Burl, M., Perona, P., & Baldi, P. (1995). Inferring ground truth from subjective labelling of Venus images. Advances in Neural Information Processing Systems, 1085–1092.
 Snow, R., O'Connor, B., Jurafsky, D., & Ng, A. (2008). Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the conference on empirical methods in natural language processing (pp. 254–263).
 Surowiecki, J. (2004). The wisdom of crowds: why the many are smarter than the few and how collective wisdom shapes business, economies, societies, and nations. New York: Doubleday.
 Sutton, C., & McCallum, A. (2006). Introduction to conditional random fields for relational learning. Cambridge: MIT Press.
 Voyer, R., Nygaard, V., Fitzgerald, W., & Copperman, H. (2010). A hybrid model for annotating named entity training corpora. In Proceedings of the fourth linguistic annotation workshop (pp. 243–246). Stroudsburg: Association for Computational Linguistics.
 Wu, O., Hu, W., & Gao, J. (2011). Learning to rank under multiple annotators. In Proceedings of the 22nd international joint conference on artificial intelligence (pp. 1571–1576).
 Yan, Y., Rosales, R., Fung, G., Schmidt, M., Valadez, G., Bogoni, L., Moy, L., & Dy, J. (2010). Modeling annotator expertise: learning when everybody knows a bit of something. Journal of Machine Learning Research, 9, 932–939.
 Yan, Y., Rosales, R., Fung, G., & Dy, J. (2011). Active learning from crowds. In Proceedings of the 28th international conference on machine learning (pp. 1161–1168).