Journal of Computer Science and Technology, Volume 33, Issue 2, pp 366–379

CrowdOLA: Online Aggregation on Duplicate Data Powered by Crowdsourcing

  • An-Zhen Zhang
  • Jian-Zhong Li
  • Hong Gao
  • Yu-Biao Chen
  • Heng-Zhao Ma
  • Mohamed Jaward Bah


Recently, there has been an increasing need for interactive, human-driven analysis of large volumes of data. Online aggregation (OLA), which provides a quick sketch of massive data so that users need not wait a long time for the final accurate query result, has drawn significant research attention. However, running OLA directly on duplicate data leads to incorrect query answers, since sampling from duplicate records over-represents the duplicated data in the sample. This violates the prerequisite of uniform distribution on which most statistical theories rely. In this paper, we propose CrowdOLA, a novel framework that integrates online aggregation processing with deduplication. Instead of cleaning the whole dataset, CrowdOLA continuously retrieves block-level samples from the dataset and employs a crowd-based entity resolution approach to detect duplicates in the sample in a pay-as-you-go fashion. After the sample is cleaned, an unbiased estimator corrects the bias introduced by duplication. We evaluate CrowdOLA on both real-world and synthetic workloads. Experimental results show that CrowdOLA provides a good balance between efficiency and accuracy.
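The over-representation effect described in the abstract can be illustrated with a small Python sketch. This is not the CrowdOLA estimator, which is not specified on this page; it is a standard inverse-multiplicity reweighting, shown under the assumption that the duplicate count of every sampled record is known exactly (a real system would have to estimate it). The dataset, entity ids, and function names are hypothetical.

```python
import random
from collections import Counter


def naive_mean(sample):
    """Average over sampled records, ignoring duplication."""
    return sum(value for _, value in sample) / len(sample)


def dedup_weighted_mean(sample, dup_count):
    """Ratio estimator of the per-entity mean.

    Each record is weighted by 1 / (number of copies of its entity),
    so an entity stored k times contributes as if it were stored once.
    """
    weights = [1.0 / dup_count[entity] for entity, _ in sample]
    weighted_sum = sum(w * value for w, (_, value) in zip(weights, sample))
    return weighted_sum / sum(weights)


# Hypothetical dataset of (entity_id, value); entity "c" is stored 6 times.
records = [("a", 10.0), ("b", 20.0)] + [("c", 90.0)] * 6

# Assumption: exact duplicate counts are available for every entity.
dup_count = Counter(entity for entity, _ in records)

# Ground truth: mean over the three distinct entities.
true_mean = (10.0 + 20.0 + 90.0) / 3

# Uniform sample of records (with replacement), as an OLA engine might draw.
random.seed(0)
sample = [random.choice(records) for _ in range(2000)]

print("true mean over entities:", true_mean)
print("naive sample mean:      ", naive_mean(sample))
print("reweighted sample mean: ", dedup_weighted_mean(sample, dup_count))
```

Because entity "c" accounts for six of the eight stored records, the naive sample mean is pulled toward its value, while the reweighted mean stays close to the mean over distinct entities.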


Keywords: online aggregation, entity resolution, crowdsourcing, cloud computing



Supplementary material

11390_2018_1824_MOESM1_ESM.pdf (376 kb)
ESM 1 (PDF 375 kb)



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • An-Zhen Zhang (1)
  • Jian-Zhong Li (1)
  • Hong Gao (1)
  • Yu-Biao Chen (1)
  • Heng-Zhao Ma (1)
  • Mohamed Jaward Bah (1)

  1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
