Abstract
Crowdsourcing has proven to be an efficient way to collect large-scale datasets. However, answers provided by the crowd are often noisy and conflicting, which makes aggregating them to infer the ground truth a critical challenge. Existing fine-grained truth discovery methods address this problem by exploiting the correlation between source reliability and task topics or answers. These methods work only on a limited range of tasks: they are incompatible with Writing tasks and Transcription tasks, and they make insufficient use of the global dataset. To maintain compatibility, we consider the existence of clusters in both tasks and sources and propose a general fine-grained method. The proposed approach contains two integral components: kl-means and Pattern-based Truth Discovery (PTD). With the aid of ground truth data, kl-means directly applies a co-clustering reliability model to the correctness matrix to learn the patterns. PTD then aggregates the answers by incorporating the captured patterns, producing a more accurate estimation. As a result, our approach is compatible with all tasks and better captures the correlation among tasks and sources. Experimental results show that our method produces a more precise estimation than other general truth discovery methods, owing to its ability to learn and utilize the patterns of both tasks and sources.
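To make the idea of pattern-based aggregation concrete, the following is a minimal Python sketch and not the authors' kl-means or PTD algorithms. It assumes the source and task cluster labels are already given (the co-clustering step itself is not shown), estimates one accuracy value per (source cluster, task cluster) block from a correctness matrix on gold tasks, and then aggregates answers with a log-odds weighted vote. The function names, the Laplace smoothing, the log-odds weighting, and the toy data are all illustrative assumptions.

```python
import math
from collections import defaultdict


def block_reliability(correct, source_cluster, task_cluster, smoothing=1.0):
    """Estimate P(correct) for each (source cluster, task cluster) block.

    `correct` maps (source, gold_task) -> bool (a sparse correctness matrix).
    Laplace smoothing (an assumption) keeps estimates strictly inside (0, 1).
    """
    hits = defaultdict(float)
    totals = defaultdict(float)
    for (source, task), is_correct in correct.items():
        block = (source_cluster[source], task_cluster[task])
        hits[block] += 1.0 if is_correct else 0.0
        totals[block] += 1.0
    return {b: (hits[b] + smoothing) / (totals[b] + 2 * smoothing) for b in totals}


def aggregate(answers, source_cluster, task_cluster, reliability, default=0.5):
    """For each task, return the answer with the largest summed log-odds weight."""
    scores = defaultdict(lambda: defaultdict(float))
    for (source, task), answer in answers.items():
        p = reliability.get((source_cluster[source], task_cluster[task]), default)
        scores[task][answer] += math.log(p / (1.0 - p))
    return {task: max(votes, key=votes.get) for task, votes in scores.items()}


# Hypothetical cluster labels: s1 is strong on topic "A"; s2 and s3 on topic "B".
source_cluster = {"s1": 0, "s2": 1, "s3": 1}
task_cluster = {"g1": "A", "g2": "A", "g3": "B", "g4": "B", "t1": "A", "t2": "B"}

# Correctness of each source on the gold tasks g1..g4 (toy data).
correct = {
    ("s1", "g1"): True, ("s1", "g2"): True, ("s1", "g3"): False, ("s1", "g4"): False,
    ("s2", "g1"): True, ("s2", "g2"): False, ("s2", "g3"): True, ("s2", "g4"): True,
    ("s3", "g1"): False, ("s3", "g2"): True, ("s3", "g3"): True, ("s3", "g4"): True,
}

# Free-form answers on the unlabeled tasks t1 and t2 (toy data).
answers = {
    ("s1", "t1"): "x", ("s2", "t1"): "y", ("s3", "t1"): "y",
    ("s1", "t2"): "p", ("s2", "t2"): "q", ("s3", "t2"): "q",
}

reliability = block_reliability(correct, source_cluster, task_cluster)
estimated = aggregate(answers, source_cluster, task_cluster, reliability)
```

In this toy data, plain majority voting would choose "y" for t1, whereas the block-level weights favor the single source whose cluster is reliable on topic "A". Because answers are compared only as opaque values, the same vote applies to free-text answers, which is the kind of compatibility the abstract argues for.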
Acknowledgements
This paper was supported by the National Natural Science Foundation of China under Grant Nos. U1301256, 61472383, 61472385, 61672369 and 61572342, the Natural Science Foundation of Jiangsu Province in China under Grant Nos. BK20161257, BK20151240 and BK20161258, and the China Postdoctoral Science Foundation under Grant Nos. 2015M580470 and 2016M591920.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Du, Y., Xu, H., Sun, YE., Huang, L. (2017). A General Fine-Grained Truth Discovery Approach for Crowdsourced Data Aggregation. In: Candan, S., Chen, L., Pedersen, T., Chang, L., Hua, W. (eds) Database Systems for Advanced Applications. DASFAA 2017. Lecture Notes in Computer Science, vol 10177. Springer, Cham. https://doi.org/10.1007/978-3-319-55753-3_1
DOI: https://doi.org/10.1007/978-3-319-55753-3_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-55752-6
Online ISBN: 978-3-319-55753-3
eBook Packages: Computer Science, Computer Science (R0)