
1 Introduction

Crowdsourcing is a relatively new approach to data processing problems for which conventional methods turn out to be inaccurate, expensive, or time-consuming. Nowadays, the development of new crowdsourcing techniques is mostly motivated by so-called Big Data problems, including the assessment and clustering of large datasets obtained in aerospace imaging, remote sensing, and even social network analysis. For example, by involving volunteers from all over the world, the Geo-Wiki project tackles environmental monitoring problems with applications to flood resilience, biomass data analysis and forecasting, etc. The Cropland Capture game, a recently developed Geo-Wiki game, aims to map cultivated land using around 17,000 satellite images of the Earth's surface. Despite recent progress in image analysis, the solution to these problems is hard to automate, since human experts still outperform most machine learning systems and other artificial systems in this field. Replacing rare and expensive experts with a team of distributed volunteers seems promising, but this approach raises challenging questions: how can we aggregate individual opinions optimally, obtain confidence bounds, and deal with the unreliability of volunteers?

The main goals of the Geo-Wiki project are collecting land cover data and creating hybrid maps [14]. For example, users answer 'Yes' or 'No' to the question 'Is there any cropland in the red box?' in order to validate the presence or absence of cropland [13]. The study [1], which also relies on Geo-Wiki data, investigated whether crowdsourcing can be used instead of experts. It showed that crowdsourcing is a viable tool for collecting data, but that issues such as how to estimate reliability and confidence still need to be investigated.

This paper presents a case study comparing the performance of several state-of-the-art vote aggregation techniques, specifically developed for the analysis of crowdsourcing campaigns, on the image dataset obtained from the Cropland Capture game. As a baseline, we use classic machine learning algorithms such as Random Forest and AdaBoost, augmented with a preliminary feature selection and preprocessing stage.

The rest of the paper is structured as follows. In Sect. 2, we give a brief overview of work related to vote aggregation in crowdsourcing. In Sect. 3, we describe the general structure of the dataset under consideration. In Sect. 4, we propose quality improvements for the initial image dataset and introduce our vote aggregation heuristic along with existing state-of-the-art algorithms. Finally, in Sect. 5, we analyse the dataset, present our benchmarking results, and use annotator models to classify the volunteers.

2 Related Work

In the theoretical justification of crowdsourcing image-assessment campaigns, there are two main problems of interest. The first is the estimation of the ground truth from crowd opinion. The second, equally important, problem deals with assessing the individual performance of the volunteers who participated in the campaign. Its solution lies in clustering voters, with respect to their behavioural strategies, into groups of honest workers, biased annotators, spammers, malicious users, etc. Feeding this posterior knowledge back by reweighting the individual opinions of the voters can substantially improve the overall performance of the aggregated decision rule.

There are two basic settings of the latter problem. In the first setup, a crowdsourcing campaign includes some quantity of images previously labeled by experts (these labels are called the gold standard). In this case, the problem can be treated as a supervised learning problem and solved with conventional ensemble learning algorithms (for example, boosting [6, 10, 18]). On the other hand, in most cases researchers face a full (or almost full) absence of labeled images; the ground truth has to be retrieved simultaneously with the estimation of the voters' reliability, and some kind of unsupervised learning technique must be developed to solve the problem.

Prior works in this field can be broadly classified into two categories: EM-algorithm inspired and graph-theory based. Works of the first kind extend the results of the seminal paper [2], which applied a variant of the well-known EM algorithm [3] to a crowdsourcing-like setting of the computer-aided diagnosis problem. For instance, in [12], an EM-based framework is provided for several types of unsupervised crowdsourcing settings (categorical, ordinal, and even real-valued answers), taking into account the different competency levels of voters and the different levels of difficulty of the assessment tasks. In [11], by proposing a special type of prior, this approach is extended to the case when most voters are spammers. Papers [7, 9, 15] develop a fully unsupervised framework based on Independent Bayesian Combination of Classifiers (IBCC), a Chinese Restaurant Process (CRP) prior, and Gibbs sampling. Although EM-based techniques perform well in many cases, they are usually criticized for their heuristic nature, since in general there is no guarantee that the algorithm finds a global optimum.

Among the graph-theory based approaches, [5] elaborates an efficient reputation algorithm for identifying adversarial workers in crowdsourcing campaigns. Under certain conditions, the proposed reputation scores are proportional to the reliabilities of the voters as their number tends to infinity. Unlike the majority of EM-based techniques, these results have solid theoretical support, but the conditions under which optimality is proven (especially the graph-regularity condition) are too restrictive to apply them directly in our setup.

A highly intuitive and computationally simple algorithm, Iterative Weighted Majority Voting (IWMV), is proposed in [8]. Remarkably, for its one-step version, theoretical bounds on the error rate are obtained. Experiments on synthetic and real-life data (see [8]) demonstrate that IWMV performs on a par with state-of-the-art algorithms while running around one hundred times faster than its competitors. Since the dataset of the Cropland Capture game contains around 4.6 million votes, this computational efficiency makes IWMV the most suitable representative of the state-of-the-art methods for our benchmark.

The aforementioned arguments motivated us to carry out a case study on the applicability of several state-of-the-art vote aggregation techniques to an actual dataset obtained from the Cropland Capture game. Specifically, we compare the classic EM algorithm [2], the methods proposed in [5], IWMV [8], and a heuristic based on the computed reliability of voters. As a baseline, we use the simple Majority Voting (MV) heuristic and several of the most popular general-purpose machine learning techniques.

3 Data

The results of the game are stored in two tables. The first table contains details of the images: imgID is an image identifier; link is the URL of the image; latitude and longitude are the geo-coordinates of the centroid of the image; zoom is the resolution of the image (values: 300, 500, 1000 m).

All votes, i.e. 'a single decision by a single volunteer about a single image' [13], are collected in the second table: ratingID is a rating identifier; imgID is an image identifier; volunteerID is a volunteer identifier; timestamp is the time when the vote was given; rating is the volunteer's answer. The possible values of rating are 1 ('Yes'), −1 ('No'), and 0 ('Maybe'). The following table shows a sample of the vote data.

ratingID   imgID   volunteerID   timestamp             rating
75811      3009    178           2013-11-18 12:50:31    1
566299     3009    689           2013-12-03 08:10:38    0
641369     3009    1398          2013-12-03 17:10:39   −1
3980868    3009    1365          2014-04-10 16:52:07    1

During the crowdsourcing campaign, around 4.6 million votes were collected. We convert the votes to a rating matrix whose rows correspond to volunteers and whose columns correspond to images:

$$\begin{aligned} R = \big (r_{v,i}\big )^{ |V|,|I| }_{v=1,i=1}, \end{aligned}$$
(1)

where \(V\) is the set of all volunteers, \(I\) is the set of all images with at least one vote, and \(r_{v,i}\) is the vote given by volunteer \(v\) to image \(i\). Due to its unclear definition, the 'Maybe' answer is hard to interpret, so we treat 'Maybe' the same as the situation when the user has not seen the image; both cases are coded as 0. If a volunteer voted several times for the same image, only the last vote is used.
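For concreteness, a minimal sketch of how such a rating matrix can be built from the vote table is given below (pandas/SciPy; the file name votes.csv and the exact column handling are assumptions, not details from the paper):

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Load the vote table (file name is an assumption).
votes = pd.read_csv("votes.csv", parse_dates=["timestamp"])

# Keep only the last vote of each volunteer for each image.
votes = (votes.sort_values("timestamp")
              .drop_duplicates(["volunteerID", "imgID"], keep="last"))

# 'Maybe' (0) is treated like "image not seen": drop it, since missing
# entries of the sparse matrix are already 0.
votes = votes[votes["rating"] != 0]

# Map identifiers to consecutive row/column indices.
vol_idx = {v: i for i, v in enumerate(votes["volunteerID"].unique())}
img_idx = {m: j for j, m in enumerate(votes["imgID"].unique())}
rows = votes["volunteerID"].map(vol_idx).to_numpy()
cols = votes["imgID"].map(img_idx).to_numpy()

# Sparse |V| x |I| rating matrix with entries in {-1, 0, +1}.
R = csr_matrix((votes["rating"].to_numpy(), (rows, cols)),
               shape=(len(vol_idx), len(img_idx)))
```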

4 Methodology

4.1 Detection of Duplicates and Blurry Images

Since the dataset collected via the game was formed by combining different sources, nearly identical images may be referenced by different records. To check this, we download all 170041 JPEG images (512×512 pixels), around 9 GB in total, and employ perceptual hash functions to reveal such cases. Examples of such functions are aHash (Average Hash or Mean Hash), dHash, and pHash [17]. Perceptual hashing aims to detect images between which a human cannot see any difference. We find that pHash performs much better than the computationally cheaper dHash and aHash methods. Note that, for a fixed image, the set of all images that are similar according to pHash contains all images with the same MD5 or SHA1 hash. In total, we detect duplicates for 8300 original images; the votes for duplicates are merged.
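As an illustration, near-duplicate detection with pHash might look as follows (a sketch based on the Python imagehash library; the directory name and the Hamming-distance threshold are assumptions, and the all-pairs comparison is shown only for clarity, since a real run over 170041 images would bucket the hashes instead):

```python
from pathlib import Path
from PIL import Image
import imagehash

# Compute a 64-bit perceptual hash for every image.
hashes = {}
for path in Path("images").glob("*.jpeg"):      # directory name is an assumption
    hashes[path.name] = imagehash.phash(Image.open(path))

# Treat two images as near-duplicates if the Hamming distance between
# their hashes is small (the threshold is illustrative).
THRESHOLD = 4
names = sorted(hashes)
duplicates = []
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if hashes[a] - hashes[b] <= THRESHOLD:  # '-' is the Hamming distance
            duplicates.append((a, b))
```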

Following the idea of the wisdom of the crowd, we need as many votes per image as possible in order to make a better decision about that image. Detecting all similar images therefore increases the statistical significance of the aggregated decisions and decreases the dimensionality of the data. In addition, if such detection is performed before the start of a campaign, it reduces the workload of the volunteers.

A visual inspection of the images shows the presence of illegible and blurry (unfocused) images. As expected, these images bewildered the volunteers. Thus, we apply automatic methods for blur detection. Namely, using the Blur Detection algorithm [16], we detect 2300 poor-quality images for which even experts cannot give the right answer. Note that for those images the voting inconsistency is high; volunteers change their opinions frequently. After consultation with the experts, we remove all poor-quality images to decrease the level of noise and uncertainty in the data. This reduces the number of images from 170041 to 161752, so that \(|I|= 161752\) and \(|V|=2783\).
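We do not reproduce the algorithm of [16] here; as a rough illustration of automatic blur scoring, a commonly used and much simpler proxy is the variance of the Laplacian, sketched below (this is not the method of [16], and the threshold is an arbitrary illustrative value):

```python
import cv2

def is_blurry(path: str, threshold: float = 100.0) -> bool:
    """Flag an image as blurry if the variance of its Laplacian is low.

    A low variance means few sharp edges, which is typical of unfocused
    images. The threshold is illustrative and would need tuning.
    """
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    focus_measure = cv2.Laplacian(gray, cv2.CV_64F).var()
    return focus_measure < threshold
```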

4.2 Majority Voting Based on Reliability

In this subsection, we present a combination of MV and the widely used notion of reliability (see, for example, [5]). It is standard to define the reliability \(w_v\) of volunteer \(v\) as

$$w_v= 2p_v-1$$

where \(p_v\) is the probability that \(v\) gives a correct answer; it is assumed not to depend on the particular task. Obviously, \(w_v \in [-1,1]\). We use traditional weighted MV with the weights obtained by the above rule. The heuristic admits a refinement: one may iteratively remove the volunteer with the highest penalty, recalculate the penalties, and obtain new results for the weighted MV.
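A minimal sketch of the reliability-weighted majority vote, assuming the estimated accuracies \(p_v\) are already available (their estimation, e.g. via the penalties of Algorithm 2, is omitted here):

```python
import numpy as np

def weighted_majority_vote(R, p):
    """Aggregate votes with reliability weights w_v = 2 * p_v - 1.

    R : (|V|, |I|) NumPy array with entries in {-1, 0, +1}, 0 meaning "no vote".
    p : (|V|,) array of estimated probabilities of a correct answer.
    Returns a (|I|,) array of aggregated labels in {-1, +1}.
    """
    w = 2.0 * np.asarray(p) - 1.0   # reliabilities in [-1, 1]
    scores = w @ R                  # weighted sum of votes per image
    # Ties (score == 0) are broken in favour of 'Yes'; this choice is arbitrary.
    return np.where(scores >= 0, 1, -1)
```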

[Algorithm 1: weighted majority voting based on reliability (pseudocode figure not reproduced)]
[Algorithm 2: hard-penalty reputation algorithm of [5] (pseudocode figure not reproduced)]

The proposed heuristic is presented in Algorithm 1, which relies on Algorithm 2 [5, Hard penalty], for which we also provide pseudocode. We now briefly describe the optimal semi-matching (OSM) used in Algorithm 2. Let \(B = (N_{left} \cup N_{right}, E)\) be a bipartite graph with \(N_{left}\) the set of left-hand vertices, \(N_{right}\) the set of right-hand vertices, and edge set \(E \subset N_{left} \times N_{right}\). A semi-matching in \(B\) is a set of edges \(M \subset E\) such that each vertex in \(N_{right}\) is incident to exactly one edge in \(M\). We stress that vertices in \(N_{left}\) may be incident to more than one edge in \(M\).

An OSM is defined using the following cost function. For \(u \in N_{left}\), let \(deg_M(u)\) denote the number of edges in M that are incident to u and let

$$cost_M(u)=\sum _{i=1}^{deg_M(u)}{i} =\frac{(deg_M(u)+1) deg_M(u)}{2}.$$

An optimal semi-matching, then, is one that minimizes \(\sum _{u \in N_{left}}{cost_M(u)}\). This definition is inspired by the load balancing problem studied in [4].
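To make the cost function concrete, the following sketch computes the total cost of a semi-matching represented, for illustration, as a map from each right-hand vertex to its matched left-hand vertex:

```python
from collections import Counter

def semi_matching_cost(match):
    """Total cost of a semi-matching.

    match : dict mapping each right-hand vertex to the left-hand vertex it is
            matched to (exactly one edge per right-hand vertex).
    Uses cost_M(u) = deg_M(u) * (deg_M(u) + 1) / 2 for each left-hand vertex u.
    """
    degrees = Counter(match.values())
    return sum(d * (d + 1) // 2 for d in degrees.values())
```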

We also benchmark the Iterative Weighted Majority Voting (IWMV) algorithm introduced in [8]; its pseudocode is given in Algorithm 3.

[Algorithm 3: Iterative Weighted Majority Voting, IWMV (pseudocode figure not reproduced)]
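The following sketch illustrates the iterative scheme of IWMV for the binary case, following our reading of [8]; the stopping rule and tie-breaking are assumptions and may differ from the reference implementation.

```python
import numpy as np

def iwmv_binary(R, max_iter=50):
    """Iterative weighted majority voting for binary labels in {-1, +1}.

    R : (|V|, |I|) NumPy array with entries in {-1, 0, +1}, 0 meaning "no vote".
    Returns (labels, weights).
    """
    n_voters, _ = R.shape
    voted = (R != 0)
    weights = np.ones(n_voters)                 # start from plain majority voting
    labels = np.where(weights @ R >= 0, 1, -1)

    for _ in range(max_iter):
        # Estimate each voter's accuracy against the current labels.
        agree = (R == labels) & voted
        n_votes = voted.sum(axis=1)
        acc = np.divide(agree.sum(axis=1), n_votes,
                        out=np.full(n_voters, 0.5), where=n_votes > 0)
        weights = 2.0 * acc - 1.0               # binary case of K * accuracy - 1
        new_labels = np.where(weights @ R >= 0, 1, -1)
        if np.array_equal(new_labels, labels):  # stop when labels no longer change
            break
        labels = new_labels
    return labels, weights
```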

5 Benchmark

To evaluate the volunteers' performance, a part of the dataset (854 images) was annotated by an expert after the campaign took place. For these images, 1813 volunteers gave 16,940 votes in total. We then sampled two subsets for training and testing (70/30 split). We now empirically assess the irregularity of the volunteer-image assignment in the expert dataset. First, using Fig. 1, we answer the question 'How many volunteers voted a specified number of times?'. Second, using Fig. 2, we answer the question 'What percentage of volunteers voted a specified number of times or less?'. Finally, by means of Fig. 3, we answer the questions 'How many images were labeled a specified number of times?' and 'What percentage of images were labeled a specified number of times or less?'.

Fig. 1 The images-per-volunteer distribution, answering the question 'How many volunteers voted a specified number of times?'. The number of images assessed by a volunteer is on the horizontal axis, and the number of volunteers is on the vertical axis. For example, one can see that 566 volunteers labeled only one image from the expert dataset

Fig. 2 The cumulative images-per-volunteer distribution, answering the question 'What percentage of volunteers voted a specified number of times or less?'. The maximum number of images assessed by a volunteer is on the horizontal axis, and the number of volunteers is on the vertical axis. One can see that around 49 % of all volunteers saw 2 images or less from the expert dataset; around 70 % saw 5 images or less; 90 % saw 15 images or less; and 95 % saw less than 26 images

Fig. 3 The volunteers-per-image distribution. The left plot answers the question 'How many images were labeled a specified number of times?': for instance, 71 images were labeled by exactly 7 volunteers, 11 images by 59 volunteers, and 5 images by 69 volunteers. The right plot answers the question 'What percentage of images were labeled a specified number of times or less?': around 27 % of images have 5 votes or less, 52 % have 9 votes or less, and 75 % have 25 votes or less

The baseline. To use conventional machine learning algorithms, we first apply SVD to the whole dataset to reduce its dimensionality. A study of the explained variance helps us to make an appropriate choice for the number of features: 5, 14, or 35. We then transform the feature space of the training and testing subsets accordingly. On the basis of 10-fold cross-validation on the training subset, we fit the parameters of the AdaBoost and Random Forest algorithms. For Linear Discriminant Analysis (LDA), we use the default parameters. The accuracy of the algorithms with the fitted parameters is estimated on the testing subset; see Table 1.
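A minimal sketch of this baseline pipeline under stated assumptions: the vote matrix of the expert-labeled images and the expert labels are replaced by random placeholders, and the hyperparameter grids and the choice of 14 components are illustrative rather than the exact values used in the paper.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Placeholder data standing in for the expert-labeled part of the rating
# matrix: rows are images, columns are volunteers, entries are votes.
rng = np.random.default_rng(0)
X = rng.integers(-1, 2, size=(854, 2783)).astype(float)
y = rng.choice([-1, 1], size=854)

# SVD-based dimensionality reduction; 14 is one of the feature counts considered.
X_red = TruncatedSVD(n_components=14, random_state=0).fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X_red, y, test_size=0.3, random_state=0)

# 10-fold cross-validation to fit parameters (the grids are illustrative).
rf = GridSearchCV(RandomForestClassifier(random_state=0),
                  {"n_estimators": [100, 300], "max_depth": [None, 10]},
                  cv=10).fit(X_train, y_train)
ada = GridSearchCV(AdaBoostClassifier(random_state=0),
                   {"n_estimators": [50, 200]}, cv=10).fit(X_train, y_train)
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)  # default parameters

for name, model in [("Random Forest", rf), ("AdaBoost", ada), ("LDA", lda)]:
    print(name, "accuracy:", model.score(X_test, y_test))
```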

Table 1 Accuracy of the baseline algorithms for different numbers of features

Benchmarking of the vote aggregation algorithms is performed as follows. We feed the expert dataset to the algorithms and check their accuracy on the same test subset as above. Note that no transformation of the feature space is required in this case. In this section, we experimentally test the heuristic based on reliability and compare it with the state-of-the-art algorithms designed for crowdsourcing. We use the publicly available code developed for the experiments in [5]. The code implements MV and the EM algorithm [2] in conjunction with the reputation Algorithm 2 (Hard penalty [5]). During each iteration, the reputation algorithm excludes the volunteer with the highest penalty and recalculates the penalties of the remaining volunteers. We also benchmark IWMV [8]; note that an iteration of IWMV has a different meaning, since it re-estimates the weights of all volunteers instead of excluding one of them.

The accuracy of the compared algorithms on the test sample is presented in Table 2. Remarkably, IWMV converges after a single iteration with exactly the same predictions as MV. Although we observe a change of the voters' weights (some are even flipped from 1 to −1), it does not influence the aggregated score of any image enough to alter the decision of MV. More surprisingly, all crowdsourcing algorithms perform on a par with MV. A possible explanation is the irregular task assignment, which leads, in particular, to a high percentage of images with only a few votes. To deal with this issue, we continue our analysis using image thresholding: we perform the same benchmarking on two subsets of the expert dataset, obtained by filtering out images whose number of votes is less than a threshold; see Tables 3 and 4. Another possible explanation is that we mostly deal with reliable volunteers, so the crowdsourcing algorithms cannot profit from the detection of spammers or from flipping the votes of malicious voters. To analyze this hypothesis, in the next subsection we classify the volunteers using the annotator model proposed in [11].
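For completeness, the image-thresholding step can be sketched as follows (the function name and the dense-matrix representation are assumptions for illustration):

```python
import numpy as np

def threshold_images(R, min_votes):
    """Keep only images (columns of R) with at least `min_votes` nonzero votes.

    R : (|V|, |I|) NumPy array with entries in {-1, 0, +1}.
    Returns the filtered matrix and the boolean mask of kept images.
    """
    votes_per_image = np.count_nonzero(R, axis=0)
    keep = votes_per_image >= min_votes
    return R[:, keep], keep
```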

Table 2 Accuracy for ‘crowdsourcing’ algorithms without image thresholding
Table 3 Accuracy for ‘crowdsourcing’ algorithms with image thresholding
Table 4 Accuracy for ‘crowdsourcing’ algorithms with image thresholding
Fig. 4 We depict the ROCs of volunteers having more votes than a given threshold: threshold \(=\) 0 (upper left plot), 12 (upper right plot), 44 (lower left plot), and 100 (lower right plot). These thresholds leave 1813, 262, 52, and 24 volunteers, respectively. The ROCs of spammers lie on the red diagonal line

5.1 Annotator Models

As proposed in [11], a spammer is a person who labels randomly. A possible explanation is that such a volunteer ignores the images while labeling or does not comprehend the labeling criteria. 'More precisely an annotator is a spammer if the probability of an observed label is independent of the true label' [11]. In what follows, we use two important concepts, sensitivity and specificity. The sensitivity \(\alpha ^j\) of volunteer \(j\) is the probability that \(j\) votes 1 when the true label is 1 (the true positive rate). The specificity \(\beta ^j\) is the probability that \(j\) votes −1 when the true label is −1. Note that volunteer \(j\) is a spammer if

$$\alpha ^j + \beta ^j=1.$$

This property suggests an easy way to detect spammers: we simply depict the Receiver Operating Characteristic (ROC) plot containing the individual performance of each volunteer; see Fig. 4. Since the task assignment was highly irregular, it is important to study how the voting activity of the volunteers influences the ROCs. Therefore, Fig. 4 contains not one but four ROC plots, each obtained with a different level of volunteer thresholding; this thresholding removes volunteers whose total number of votes is less than the threshold. Figure 4 provides plausible observations for the dataset: there are no spammers among voters with more than 12 votes; good annotators prevail over all other types of annotators; and the frequently voting volunteers (more than 100 votes) show better accuracy than any examined algorithm, while there are no malicious voters. We conjecture that these are exactly the reasons why the advanced algorithms (EM, IWMV) are merely on a par with MV.
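A minimal sketch of this spammer check for a single volunteer, assuming the expert labels are available (the tolerance eps is an illustrative choice, not a value from the paper):

```python
import numpy as np

def spammer_score(votes, truth):
    """Return (sensitivity, specificity, spammer score) for one volunteer.

    votes, truth : arrays in {-1, +1} over the images this volunteer labeled;
    the volunteer is assumed to have labeled at least one image of each class.
    A perfect spammer has sensitivity + specificity == 1, i.e. score == 0.
    """
    votes, truth = np.asarray(votes), np.asarray(truth)
    alpha = np.mean(votes[truth == 1] == 1)    # sensitivity (true positive rate)
    beta = np.mean(votes[truth == -1] == -1)   # specificity (true negative rate)
    return alpha, beta, abs(alpha + beta - 1.0)

# Example: flag a volunteer whose ROC point is close to the spammer line.
eps = 0.1
a, b, s = spammer_score([1, 1, -1, 1], [1, -1, -1, 1])
print("possible spammer" if s < eps else "not a spammer", a, b, s)
```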

6 Discussion

Comparing the results in Tables 1 and 2, it is remarkable that the 'general purpose' learning algorithms slightly outperform the 'special purpose' crowdsourcing algorithms. Moreover, the naïve heuristic based on reliability (see Algorithm 1) shows the best result. Numerical experiments also show that MV performs on a par with all other algorithms (see Tables 1, 2, 3 and 4). The analysis of the ROCs of the volunteers suggests that the surprisingly high accuracy of the frequently voting volunteers, coupled with the absence of spammers, is a possible explanation for this result. The irregularity of the volunteer-image assignment in the dataset and the high percentage of images with a low number of votes may also contribute to this fact. Note that image thresholding by the number of votes improves the results of the 'crowdsourcing' algorithms (see Tables 2, 3 and 4), although they remain on a par with MV. To summarize, good annotators and the irregular task assignment eliminate any advantage of the state-of-the-art algorithms over MV. Thus, future research on the problem of vote aggregation will benefit from a systems analysis approach capturing tradeoffs similar to the ones shown in this paper.