Abstract
The Cropland Capture game, a recently developed Geo-Wiki game, aims to map cultivated land using around 170,000 satellite images of the Earth's surface. Using a perceptual hash and a blur-detection algorithm, we improve the quality of the Cropland Capture game's dataset. We then benchmark state-of-the-art vote-aggregation algorithms, using the results of well-known machine learning algorithms as a baseline. We demonstrate that the volunteer-image assignment is highly irregular and that only good annotators are present (there are no spammers or malicious voters). We conjecture that the latter fact is the main reason for the surprisingly similar accuracy levels across all examined algorithms. Finally, we increase the estimated consistency with expert opinion from 77% to 91%, and up to 96% if we restrict our attention to images with more than 9 votes.
1 Introduction
Crowdsourcing is a new approach for solving data-processing problems for which conventional methods appear to be inaccurate, expensive, or time-consuming. Nowadays, the development of new crowdsourcing techniques is mostly motivated by so-called Big Data problems, including the assessment and clustering of large datasets obtained in aerospace imaging, remote sensing, and even social network analysis. For example, by involving volunteers from all over the world, the Geo-Wiki project tackles problems of environmental monitoring with applications to flood resilience, biomass data analysis and forecasting, etc. The Cropland Capture game, a recently developed Geo-Wiki game, aims to map cultivated land using around 170,000 satellite images of the Earth's surface. Despite recent progress in image analysis, the solution to these problems is hard to automate, since human experts still outperform the majority of machine learning systems and other artificial systems in this field. Replacing rare and expensive experts with a team of distributed volunteers seems promising, but this approach leads to challenging questions: how can we aggregate individual opinions optimally, obtain confidence bounds, and deal with the unreliability of volunteers?
The main goals of the Geo-Wiki project are collecting land cover data and creating hybrid maps [14]. For example, users answer 'Yes' or 'No' to the question 'Is there any cropland in the red box?' in order to validate the presence or absence of cropland [13]. The paper [1], which relates to the use of Geo-Wiki data, studied the problem of using crowdsourcing instead of experts. It showed that crowdsourcing can be used as a tool for collecting data, but issues such as how to estimate reliability and confidence still need to be investigated.
This paper presents a case study that aims to compare the performance of several state-of-the-art vote aggregation techniques specifically developed for the analysis of crowdsourcing campaigns using the image dataset obtained from the Cropland Capture game. As a baseline, some classic machine learning algorithms such as Random Forest, AdaBoost, etc., augmented with preliminary feature selection and a preprocessing stage, are used.
The rest of the paper is structured as follows. In Sect. 2, we give a brief overview of efforts related to vote aggregation in crowdsourcing. In Sect. 3, we describe the general structure of the dataset under consideration. In Sect. 4, we propose quality improvements for the initial image dataset and introduce our vote-aggregation heuristic along with existing state-of-the-art algorithms. Finally, in Sect. 5, we analyse the dataset, present our benchmarking results, and use annotator models to classify volunteers.
2 Related Work
In the theoretical justification of crowdsourcing image-assessment campaigns, there are two main problems of interest. The first is the estimation of the ground truth from the crowd's opinions. The second, which is equally important, deals with assessing the individual performance of the volunteers who participated in the campaign. Its solution lies in clustering voters with respect to their behavioural strategies into groups of honest workers, biased annotators, spammers, malicious users, etc. Reflecting this posterior knowledge by reweighting the individual opinions of the voters can substantially improve the overall performance of the aggregated decision rule.
There are two basic settings of the latter problem. In the first setup, a crowdsourcing campaign admits some quantity of images previously labeled by experts (such labels are called a gold standard). In this case, the problem can be considered a supervised learning problem, and conventional ensemble learning algorithms (for example, boosting [6, 10, 18]) can be used to solve it. On the other hand, in most cases researchers deal with the full (or almost full) absence of labeled images; the ground truth should be retrieved simultaneously with the estimation of the voters' reliability, and some kind of unsupervised learning technique should be developed to solve the problem.
Prior works in this field can be broadly classified into two categories: EM-algorithm inspired and graph-theory based. Works of the first kind extend the results of the seminal paper [2], which applied a variant of the well-known EM algorithm [3] to a crowdsourcing-like setting of the computer-aided diagnosis problem. For instance, in [12], an EM-based framework is provided for several types of unsupervised crowdsourcing settings (categorical, ordinal, and even real-valued answers), taking into account the different competency levels of voters and the different levels of difficulty of the assessment tasks. In [11], by proposing a special type of prior, this approach is extended to the case when most voters are spammers. Papers [7, 9, 15] develop a fully unsupervised framework based on the Independent Bayesian Combination of Classifiers (IBCC), a Chinese Restaurant Process (CRP) prior, and Gibbs sampling. Although EM-based techniques perform well in many cases, they are usually criticized for their heuristic nature, since in general there is no guarantee that the algorithm finds a global optimum.
Furthermore, in [5], an efficient reputation algorithm for identifying adversarial workers in crowdsourcing campaigns is elaborated. Under certain conditions, the proposed reputation scores are proportional to the reliabilities of the voters as their number tends to infinity. Unlike the majority of EM-based techniques, these results have solid theoretical support, but the conditions under which their optimality is proven (especially the graph-regularity condition) are too restrictive to apply them straightforwardly in our setup.
A highly intuitive and computationally simple algorithm, Iterative Weighted Majority Voting (IWMV), is proposed in [8]. Remarkably, for its one-step version, theoretical bounds on the error rate are obtained. Experiments on synthetic and real-life data (see [8]) demonstrate that IWMV performs on a par with the state-of-the-art algorithms and is around one hundred times faster than its competitors. Since the dataset of the Cropland Capture game contains around 4.6 million votes, this computational effectiveness makes IWMV the most suitable representative of the state-of-the-art methods for the benchmark.
The aforementioned arguments have motivated us to carry out a case study on the applicability of several state-of-the-art vote aggregation techniques to an actual dataset obtained from the Cropland Capture game. Precisely, we compare the classic EM algorithm [2], methods proposed in [5], IWMV [8], and a heuristic based on the computed reliability of voters. As a baseline, we use the simple Majority Voting (MV) heuristic and several of the most popular universal machine learning techniques.
3 Data
The results of the game were captured in two tables. The first table contains the details of the images: imgID is an image identifier; link is the URL of an image; latitude and longitude are geo-coordinates referring to the centroid of the image; zoom is the resolution of an image (values: 300, 500, 1000 m). The following table shows a sample of the image data.
imgID | link | latitude | longitude | zoom |
---|---|---|---|---|
3009 | http://cg.tuwien.ac.at/~sturn/crop/img_-112.313_42.8792_1000.jpg | 42.8792 | −112.313 | 1000 |
3010 | http://cg.tuwien.ac.at/~sturn/crop/img_-112.313_42.8792_500.jpg | 42.8792 | −112.313 | 500 |
3011 | http://cg.tuwien.ac.at/~sturn/crop/img_-112.313_42.8792_300.jpg | 42.8792 | −112.313 | 300 |
All votes, i.e. 'a single decision by a single volunteer about a single image' [13], were collected in the second table: ratingID is a rating identifier; imgID is an image identifier; volunteerID is a volunteer's identifier; timestamp is the time when a vote was given; rating is a volunteer's answer. The possible values for rating are: 0 ('Maybe'), 1 ('Yes'), −1 ('No'). The following table shows a sample of the vote data.
ratingID | imgID | volunteerID | timestamp | rating |
---|---|---|---|---|
75811 | 3009 | 178 | 2013-11-18 12:50:31 | 1 |
566299 | 3009 | 689 | 2013-12-03 08:10:38 | 0 |
641369 | 3009 | 1398 | 2013-12-03 17:10:39 | −1 |
3980868 | 3009 | 1365 | 2014-04-10 16:52:07 | 1 |
During the crowdsourcing campaign, around 4.6 million votes were collected. We convert the votes into a rating matrix whose rows correspond to volunteers and whose columns correspond to images. Let V be the set of all volunteers, I the set of all images with at least one vote, and \(r_{v,i}\) the vote given by volunteer v to image i. Due to an unclear definition, the 'Maybe' answer is hard to interpret; as a result, we treat 'Maybe' the same as the situation when the user has not seen the image, and code both as 0. If a volunteer has multiple votes for the same image, then only the last vote is used.
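As an illustration, the conversion of raw votes into the (sparse) rating matrix described above might be sketched as follows; this is a minimal Python sketch with a hypothetical record layout and function names, not the preprocessing code actually used for the campaign:

```python
from collections import defaultdict

def build_rating_matrix(votes):
    """Build a sparse rating matrix r[v][i] from raw vote records.

    `votes` is an iterable of (volunteer_id, img_id, timestamp, rating)
    tuples with rating in {-1, 0, 1}. 'Maybe' (0) is treated the same as
    'not seen', and for repeated votes only the latest timestamp counts.
    """
    latest = {}  # (volunteer, image) -> (timestamp, rating)
    for v, i, ts, r in votes:
        key = (v, i)
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, r)
    matrix = defaultdict(dict)
    for (v, i), (_, r) in latest.items():
        if r != 0:  # 0 codes both 'Maybe' and 'no vote', so it is dropped
            matrix[v][i] = r
    return matrix

votes = [
    (178, 3009, "2013-11-18 12:50:31", 1),
    (689, 3009, "2013-12-03 08:10:38", 0),   # 'Maybe' -> dropped
    (1398, 3009, "2013-12-03 17:10:39", -1),
    (1398, 3009, "2014-04-10 16:52:07", 1),  # later vote overrides earlier
]
m = build_rating_matrix(votes)
```

ISO-8601 timestamps compare correctly as strings, so no date parsing is needed for the "last vote wins" rule.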
4 Methodology
4.1 Detection of Duplicates and Blurry Images
Since the dataset collected via the game was formed by combining different sources, it is possible that almost identical images are referenced by different records. In order to check this, we download all 170041 .jpeg images (512×512 pixels); the total size of all images is around 9 GB. We then employ perceptual hash functions to reveal such cases. Examples of such functions are aHash (Average Hash or Mean Hash), dHash, and pHash [17]. Perceptual hashing aims to detect images whose differences a human cannot see. We find that pHash performs much better than the computationally less expensive dHash and aHash methods. Note that, for a fixed image, the set of all images that are similar according to pHash contains all images with the corresponding MD5 or SHA1 hash. To summarize, we detect duplicates of 8300 original images; votes for duplicates were merged.
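To illustrate the idea, the simplest of the hashes mentioned above, dHash, compares each downscaled pixel with its right neighbour and declares two images near-duplicates when the Hamming distance between their bit strings is small. The sketch below operates on plain 2D grayscale lists with hypothetical names; it is not the pHash implementation actually used:

```python
def dhash_bits(gray, hash_size=8):
    """Difference hash: nearest-neighbour downscale to a grid of
    hash_size rows x (hash_size + 1) columns, then emit one bit per
    left/right pixel comparison (hash_size^2 bits in total)."""
    h, w = len(gray), len(gray[0])
    rows = [gray[r * h // hash_size] for r in range(hash_size)]
    small = [[row[c * w // (hash_size + 1)] for c in range(hash_size + 1)]
             for row in rows]
    return [int(small[r][c] > small[r][c + 1])
            for r in range(hash_size) for c in range(hash_size)]

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return sum(x != y for x, y in zip(a, b))

# A slightly brightened copy of a gradient hashes identically,
# while a mirrored gradient lands far away in Hamming distance.
img_a = [[c + r for c in range(64)] for r in range(64)]
img_b = [[c + r + 1 for c in range(64)] for r in range(64)]   # brightness shift
img_c = [[63 - c + r for c in range(64)] for r in range(64)]  # mirrored gradient
```

Because the bits encode only relative intensity ordering, global brightness changes leave the hash untouched, which is exactly the robustness that makes perceptual hashes suitable for duplicate detection.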
Accepting the idea of the wisdom of the crowd, we need to collect more votes for each image in order to make a better decision about it. Detecting all similar images increases the statistical significance of the observed effects and decreases the dimensionality of the data. In addition, if the detection is performed before the start of the campaign, it reduces the workload of the volunteers.
A visual inspection of the images shows the presence of illegible and blurry (unfocused) images. As expected, these images bewildered the volunteers. Thus, we apply automatic methods for blur detection. Namely, using the Blur Detection algorithm [16], we detect 2300 poor-quality images for which it is not possible to give the right answer, even for experts. Note that for those images the voting inconsistency is high; volunteers change their opinions frequently. After consultation with the experts, we remove all poor-quality images to decrease the noise level and uncertainty in the data. Finally, we reduce the number of images from 170041 to 161752. Note that \(|I|= 161752\) and \(|V|=2783\).
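The method of [16] is wavelet-based; as a simpler illustration of the same idea (blurry images have weak local intensity changes), a commonly used stand-in scores an image by the variance of its discrete Laplacian. The sketch below uses hypothetical names and plain 2D lists and is not the algorithm of [16]:

```python
def laplacian_variance(gray):
    """Blur score: variance of the 4-neighbour discrete Laplacian.
    Sharp images have strong local intensity changes, so a low
    variance suggests blur."""
    h, w = len(gray), len(gray[0])
    vals = []
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            lap = (gray[r - 1][c] + gray[r + 1][c] + gray[r][c - 1]
                   + gray[r][c + 1] - 4 * gray[r][c])
            vals.append(lap)
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

# A checkerboard (sharp edges everywhere) scores far higher than a
# smooth linear gradient, whose Laplacian vanishes.
sharp = [[255 * ((r + c) % 2) for c in range(16)] for r in range(16)]
smooth = [[r + c for c in range(16)] for r in range(16)]
```

In practice one would pick the rejection threshold from the score distribution of the dataset, e.g. by inspecting the lowest-scoring images by hand, as was effectively done here in consultation with the experts.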
4.2 Majority Voting Based on Reliability
In this subsection, we present a conjunction of MV and the widely used notion of reliability (see, for example, [5]). It is standard to define the reliability \(w_v\) of volunteer v as

$$w_v = 2p_v - 1,$$

where \(p_v\) is the probability that v gives a correct answer; it is assumed that this probability does not depend on the particular task. Obviously, \(w_v \in [-1,1]\). We use traditional weighted MV with weights obtained by the above rule. The heuristic admits a refinement: one may iteratively remove the volunteer with the highest penalty, then recalculate penalties, and obtain new results for the weighted MV.
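The weighting rule above can be sketched in a few lines of Python (hypothetical names; votes are assumed to be in {−1, +1} after the 'Maybe'/no-vote coding):

```python
def reliability_weight(p_correct):
    """Map accuracy p in [0, 1] to a weight w = 2p - 1 in [-1, 1]:
    random guessers get weight 0, adversarial voters a negative weight."""
    return 2.0 * p_correct - 1.0

def weighted_majority(votes, weights):
    """votes: dict volunteer -> rating in {-1, +1};
    weights: dict volunteer -> w_v.
    Returns the sign of the weighted sum: +1, -1, or 0 on a tie."""
    s = sum(weights.get(v, 0.0) * r for v, r in votes.items())
    return (s > 0) - (s < 0)

weights = {v: reliability_weight(p)
           for v, p in {"a": 0.9, "b": 0.6, "c": 0.5}.items()}
# Two reasonably accurate 'Yes' voters outweigh one coin-flipper's 'No',
# because the coin-flipper's weight is exactly zero.
label = weighted_majority({"a": 1, "b": 1, "c": -1}, weights)
```

Note how the rule automatically discounts uninformative voters: a volunteer with \(p_v = 0.5\) contributes nothing, while a systematically wrong voter (\(p_v < 0.5\)) has their vote usefully flipped by the negative weight.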
The proposed heuristic is presented in Algorithm 1 relying on Algorithm 2 [5, Hard penalty], for which we also provide pseudocode. We now briefly describe an optimal semi-matching (OSM) used in Algorithm 2. Let \(B = (N_{left} \cup N_{right}, E)\) be a bipartite graph with \(N_{left}\) the set of left-hand vertices, \(N_{right}\) the set of right-hand vertices, and edge set \(E \subset N_{left} \times N_{right}\). A semi-matching in B is a set of edges \(M \subset E\) such that each vertex in \(N_{right}\) is incident to exactly one edge in M. We stress that it is possible for vertices in \(N_{left}\) to be incident to more than one edge in M.
An OSM is defined using the following cost function. For \(u \in N_{left}\), let \(deg_M(u)\) denote the number of edges in M that are incident to u and let

$$cost_M(u) = \sum _{i=1}^{deg_M(u)} i = \frac{deg_M(u)\,(deg_M(u)+1)}{2}.$$
An optimal semi-matching is then one that minimizes \(\sum _{u \in N_{left}}{cost_M(u)}\). This definition is inspired by the load-balancing problem studied in [4]. We also benchmark the Iterative Weighted Majority Voting (IWMV) algorithm introduced in [8]; see the pseudocode of Algorithm 3.
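For the binary-label case, IWMV alternates between a weighted vote and a re-estimation of each volunteer's accuracy against the current label guess. The sketch below follows the description in [8] in spirit; names and the naive accuracy estimate are our own illustration, not a reproduction of Algorithm 3:

```python
def iwmv(ratings, max_iter=20):
    """Iterative Weighted Majority Voting for binary labels (a sketch).

    ratings: dict volunteer -> {image: rating in {-1, +1}}.
    Returns dict image -> estimated label in {-1, +1}.
    """
    images = sorted({i for r in ratings.values() for i in r})
    weights = {v: 1.0 for v in ratings}  # iteration 0 is plain MV
    labels = {}
    for _ in range(max_iter):
        # E-like step: weighted vote per image (ties broken towards +1)
        new = {}
        for i in images:
            s = sum(weights[v] * r[i] for v, r in ratings.items() if i in r)
            new[i] = 1 if s >= 0 else -1
        # M-like step: re-estimate each volunteer's accuracy p_hat
        # against the current guess, and set the weight to 2*p_hat - 1.
        for v, r in ratings.items():
            p_hat = sum(r[i] == new[i] for i in r) / len(r)
            weights[v] = 2.0 * p_hat - 1.0
        if new == labels:  # labels stable -> converged
            break
        labels = new
    return labels

ratings = {
    "good1": {1: 1, 2: -1, 3: 1},
    "good2": {1: 1, 2: -1, 3: 1},
    "bad":   {1: -1, 2: 1, 3: -1},
}
est = iwmv(ratings)
```

Each iteration is a single pass over the votes, which is consistent with the roughly hundredfold speed advantage over EM-style competitors reported in [8].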
5 Benchmark
To evaluate the volunteers’ performance, a part of the dataset (854 images) was annotated by an expert after the campaign took place. For these images 1813 volunteers gave 16,940 votes in total. We then sampled two subsets for training and testing (70/30 ratio). We now empirically assess the irregularity of volunteer-image assignment in the expert dataset. First, using Fig. 1, we answer the question ‘How many volunteers voted some specified number of times?’ Second, using Fig. 2, we answer the question ‘What is the percentage of volunteers who voted some specified number of times or less?’. Finally, by means of Fig. 3, we answer the questions ‘How many images were labeled a specified number of times?’ and ‘What percentage of images were labeled some specified number of times or less?’.
The baseline. To use some conventional machine learning algorithms, we first apply SVD to the whole dataset to reduce dimensionality. A study of the explained variance helps us to make an appropriate choice for the number of features: 5, 14, 35. Then we transform the feature space of the testing and training subsets accordingly. On the basis of 10-fold cross-validation of the training subset, we fit parameters for the AdaBoost and Random Forest algorithms. For Linear Discriminant Analysis (LDA), we use default parameters. The accuracy of the algorithms with fitted parameters was estimated using the testing subset; see Table 1.
Benchmarking of the vote-aggregation algorithms is performed as follows. We feed the expert dataset to the algorithms and check their accuracy on the same test subset as above; note that no feature-space transformation is required in this case. In this section, we experimentally test the reliability-based heuristic and compare it with the state-of-the-art algorithms designed for crowdsourcing. We use publicly available code (Footnote 1) that was developed for the experiments in [5]. The code implements MV and the EM algorithm [2] in conjunction with reputation Algorithm 2 (Hard penalty [5]). During each iteration, the reputation algorithm excludes the volunteer with the highest penalty and recalculates the penalties for the remaining volunteers. We also benchmark IWMV [8]; note that an iteration of IWMV has a different meaning from an iteration of the reputation algorithm. The accuracy of the compared algorithms on the test sample is presented in Table 2. Remarkably, IWMV converges after a single iteration with exactly the same predictions as MV. Though we observe a change of the voters' weights (some are even flipped from 1 to −1), it does not change the aggregated score of any image enough to alter the decision of MV. More surprisingly, all crowdsourcing algorithms perform on a par with MV. A possible explanation is the irregular task assignment, which leads, in particular, to a high percentage of images with only a few votes. To deal with this issue, we continue our analysis using image thresholding: we perform the same benchmarking for two subsets of the expert dataset, obtained by filtering out images with fewer votes than the threshold; see Tables 3 and 4. Another possible explanation is that we mostly deal with reliable volunteers, so crowdsourcing algorithms cannot profit from the detection of spammers or from flipping the votes of malicious voters. To analyze this hypothesis, in the next subsection we classify volunteers using the annotator model proposed in [11].
5.1 Annotator Models
As proposed in [11], a spammer is a person who labels randomly. A possible explanation is that such a volunteer ignores the images while labeling or does not comprehend the labeling criteria: 'More precisely an annotator is a spammer if the probability of an observed label is independent of the true label' [11]. In what follows, we use two important concepts, sensitivity and specificity. If the true label is 1, the sensitivity \(\alpha ^j\) is defined as the probability that volunteer j votes 1 (this probability corresponds to the true positive rate). If the true label is −1, the specificity \(\beta ^j\) is defined as the probability that volunteer j votes −1. Note that volunteer j is a spammer if

$$\alpha ^j = 1 - \beta ^j,$$

i.e., if the point \((1-\beta ^j, \alpha ^j)\) lies on the diagonal of the ROC plot.
This property suggests an easy way to detect spammers: we simply depict the Receiver Operating Characteristic (ROC) plot containing the details of individual performance; see Fig. 4. Since the task assignment was highly irregular, it is important to study how the voting activity of volunteers influences the ROCs. Therefore, Fig. 4 contains not one but four ROC plots, each obtained for a different level of volunteer thresholding; this thresholding removes volunteers whose total number of votes falls below the threshold. Figure 4 provides plausible observations for the dataset: there are no spammers among voters with more than 12 votes; good annotators prevail over all other types of annotators; and frequently voting volunteers (more than 100 votes) show better accuracy than any examined algorithm, while there are no malicious voters. We conjecture that these are exactly the reasons why the advanced algorithms (EM, IWMV) are on a par with MV.
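The spammer condition can be turned into a simple score, as in [11]: the distance \(|\alpha ^j + \beta ^j - 1|\) of an annotator from the ROC diagonal, where a value near 0 flags a spammer. A minimal sketch, with hypothetical names and sensitivity/specificity estimated from an expert-labeled subset:

```python
def sensitivity_specificity(votes, truth):
    """Estimate (alpha, beta) for one volunteer from labeled images:
    alpha = P(vote = 1 | true = 1), beta = P(vote = -1 | true = -1)."""
    tp = fn = tn = fp = 0
    for img, r in votes.items():
        if img not in truth:
            continue
        if truth[img] == 1:
            tp += (r == 1)
            fn += (r == -1)
        else:
            tn += (r == -1)
            fp += (r == 1)
    alpha = tp / (tp + fn) if tp + fn else 0.5
    beta = tn / (tn + fp) if tn + fp else 0.5
    return alpha, beta

def spammer_score(alpha, beta):
    """Distance from the ROC diagonal: near 0 means the observed label
    is (almost) independent of the true label, i.e. a spammer."""
    return abs(alpha + beta - 1.0)

truth = {1: 1, 2: 1, 3: -1, 4: -1}
good = {1: 1, 2: 1, 3: -1, 4: -1}   # perfect annotator: alpha = beta = 1
spam = {1: 1, 2: -1, 3: 1, 4: -1}   # truth-independent: alpha = beta = 0.5
```

Plotting \((1-\beta ^j, \alpha ^j)\) per volunteer reproduces the ROC view of Fig. 4: good annotators sit in the upper-left corner, spammers on the diagonal, and malicious voters below it.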
6 Discussion
Comparing the results in Tables 1 and 2, it is remarkable that 'general purpose' learning algorithms slightly outperform 'special purpose' crowdsourcing algorithms. Moreover, the naïve heuristic based on reliability (see Algorithm 1) shows the best result. Numerical experiments also show that MV performs on a par with all other algorithms (see Tables 1, 2, 3 and 4). The analysis of the volunteers' ROCs suggests that the surprisingly high accuracy of frequently voting volunteers, coupled with the absence of spammers, is a possible explanation for this result. The irregularity of volunteer-image assignment in the dataset and the high percentage of images with a low number of votes may also contribute. Note that image thresholding by the number of votes improves the results of the 'crowdsourcing' algorithms (see Tables 2, 3 and 4), although they remain on a par with MV. To summarize, good annotators and the irregularity of the assignment eliminate any advantage of state-of-the-art algorithms over MV. Thus, future research on the vote-aggregation problem will benefit from a systems-analysis approach capturing tradeoffs similar to those shown in this paper.
References
Comber, A., Brunsdon, C., See, L., Fritz, S., McCallum, I.: Comparing expert and non-expert conceptualisations of the land: an analysis of crowdsourced land cover data. In: Spatial Information Theory, pp. 243–260. Springer (2013)
Dawid, A.P., Skene, A.M.: Maximum likelihood estimation of observer error-rates using the EM algorithm. Appl. Stat. 20–28 (1979)
Dempster, A.P., et al.: Maximum likelihood from incomplete data via the EM algorithm. JRSS Ser. B 1–38 (1977)
Harvey, N.J., Ladner, R.E., Lovász, L., Tamir, T.: Semi-matchings for bipartite graphs and load balancing. In: Algorithms and Data Structures, pp. 294–306. Springer (2003)
Jagabathula, S., et al.: Reputation-based worker filtering in crowdsourcing. In: Advances in Neural Information Processing Systems. pp. 2492–2500 (2014)
Khattak, F.K., Salleb-Aouissi, A.: Improving crowd labeling through expert evaluation. In: 2012 AAAI Spring Symposium Series (2012)
Kim, H.C., Ghahramani, Z.: Bayesian classifier combination. In: International Conference on Artificial Intelligence and Statistics, pp. 619–627 (2012)
Li, H., Yu, B.: Error rate bounds and iterative weighted majority voting for crowdsourcing. arXiv:1411.4086 (2014)
Moreno, P.G., Teh, Y.W., Perez-Cruz, F., Artés-Rodríguez, A.: Bayesian nonparametric crowdsourcing. arXiv:1407.5017 (2014)
Pareek, H., Ravikumar, P.: Human boosting. In: Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 338–346 (2013)
Raykar, V.C.: Eliminating spammers and ranking annotators for crowdsourced labeling tasks. JMLR 13, 491–518 (2012)
Raykar, V.C., et al.: Learning from crowds. J. Mach. Learn. Res. 11, 1297–1322 (2010)
Salk, C.F., Sturn, T., See, L., Fritz, S., Perger, C.: Assessing quality of volunteer crowdsourcing contributions: lessons from the cropland capture game. Int. J. Digital Earth 1–17 (2015)
See, L., et al.: Building a hybrid land cover map with crowdsourcing and geographically weighted regression. ISPRS J. Photogramm. Remote Sens. 103, 48–56 (2015)
Simpson, E., et al.: Dynamic Bayesian combination of multiple imperfect classifiers. In: Decision Making and Imperfection, pp. 1–35. Springer (2013)
Tong, H., Li, M., Zhang, H., Zhang, C.: Blur detection for digital images using wavelet transform. In: 2004 IEEE International Conference on Multimedia and Expo, 2004. ICME’04, vol. 1, pp. 17–20. IEEE (2004)
Zauner, C.: Implementation and benchmarking of perceptual image hash functions. Ph.D. thesis (2010)
Zhu, X., et al.: Co-training as a human collaboration policy. In: AAAI (2011)
Acknowledgments
This research was supported by Russian Science Foundation, grant no. 14-11-00109, and the EU-FP7 funded ERC CrowdLand project, grant no. 617754.
© 2016 Springer International Publishing Switzerland

Baklanov, A., Fritz, S., Khachay, M., Nurmukhametov, O., See, L. (2016). The Cropland Capture Game: Good Annotators Versus Vote Aggregation Methods. In: Nguyen, T.B., van Do, T., An Le Thi, H., Nguyen, N.T. (eds) Advanced Computational Methods for Knowledge Engineering. Advances in Intelligent Systems and Computing, vol 453. Springer, Cham. https://doi.org/10.1007/978-3-319-38884-7_13