World Wide Web

, Volume 22, Issue 6, pp 2561–2587 | Cite as

Parallel strategy for multiple scan operations with data replication

  • Xing Wei
  • Huiqi HuEmail author
  • Huichao Duan
  • Weining Qian
  • Aoying Zhou
Part of the following topical collections:
  1. Special Issue on Web and Big Data


To support the large-scale analytic for Web applications, the backend distributed data management system must provide the service for accessing massive data. Thus, the scan operation becomes a critical step. To improve the performance of scan operation, modern data management systems usually rely on the simple partitioned parallelism. Under the partitioned parallelism, tables are consist of several partitions, and each scan operation can access multiple partitions separately. It is a simple and effective solution for a single scan operation. In this paper, we consider managing multiple scan operations together, where the situation is no longer straightforward. To address the problem, we propose the parallel strategy to schedule batched scan operations together beyond the simple partitioned parallelism. For the sake of performance, first, we utilize replications to increase the parallelism and propose an effective load balancing strategy over replication nodes based on linear programming. Second, we propose an effective chunk-based scheduling algorithm for multi-threading parallelism on each node to guarantee all threads have even workloads under a qualified cost model. Finally, we integrate our parallel scan strategy into an open-sourced distributed data management system. Experimental evaluation shows our parallel scan strategy significantly improves the performance of scan operation.


Parallel scan Load balancing Parallel scheduling Distributed data management system 



This is work is partially supported by National Science Foundation of China under grant numbers 61702189, 61432006 and 61672232, and Youth Science and Technology - “Yang Fan” Program of Shanghai under grant number 17YF1427800. Huiqi Hu is the corresponding author.


  1. 1.
    Apache. HBase.
  2. 2.
    Bal, H.E., Kaashoek, M.F., Tanenbaum, A.S., Jansen, J.: Replication techniques for speeding up parallel applications on distributed systems. Concurr. Pract. Exper. 4, 337–355 (1992)CrossRefGoogle Scholar
  3. 3.
    Bouganim, L., Florescu, D., Valduriez, P.: Dynamic load balancing in hierarchical parallel database systems. In: Proc. of the Int. Conf. on Very Large Data Bases (VLDB). Mumbai (1996)Google Scholar
  4. 4.
    Bouganim, L., Florescu, D., Valduriez, P.: Load balancing for parallel query execution on NUMA multiprocessors. Distrib. Parallel Datab. 7(1), 99–121 (1999)CrossRefGoogle Scholar
  5. 5.
    Chang, F., Dean, J., Ghemawat, S., Hsieh, W., Wallach, D., Burrows, M., Chandra, T., Fikes, A., Gruber, R.: Bigtable: A distributed storage system for structured data. In: Proceedings of 7th Symposium on Operating System Design and Implementation (OSDI), pp. 205218 (2006)Google Scholar
  6. 6.
    Chen, M.-S., Yu, P.S., Wu, K.-L.: Scheduling and processor allocation for parallel execution of multi-join queries. In: Proceedings of the Eighth International Conference on Data Engineering, pp 58–67. IEEE Computer Society, Washington, DC (1992)Google Scholar
  7. 7.
    Cockshott, W.P.: Addressing mechanisms and persistent programming chapter 15 in Atkinson others (1988)Google Scholar
  8. 8.
    DeWitt, D., Gray, J.: Parallel database systems: The future of high performance database processing. Commun. ACM 36, 6 (1992)Google Scholar
  9. 9.
    Du, J., Leung, J.Y.T.: Complexity of scheduling parallel task systems. SIAM J. Discret Math. SIAM (1989)Google Scholar
  10. 10.
    Ferhatosmanoglu, H., Tosun, A.S., Canahuate, G., Ramachandran, A.: Efficient parallel processing of range queries through replicated declustering. Distrib. Parallel Datab. 20(2), 117–147 (2006)CrossRefGoogle Scholar
  11. 11.
    Frikken, K., Atallah, M., Prabhakar, S., Safavi-Naini, R.: Optimal parallel i/o for range queries through replication. In: Proceedings of 13th International Conference of Database and Expert Systems Applications (DEXA), pp. 669–678 (2002)Google Scholar
  12. 12.
    Graefe, G.: Volcano-an extensible and parallel query evaluation system. IEEE Trans. Knowl. Data Eng., 6(1) (1994)CrossRefGoogle Scholar
  13. 13.
  14. 14.
    Johnson, R., Hardavellas, N., Pandis, I., Mancheril, N., Harizopoulos, S., Sabirli, K., Ailamaki, A., Falsafi, B.: To share or not to share? In: VLDB (2007)Google Scholar
  15. 15.
    Krikellas, K., Cintra, M., Viglas, S.: Scheduling threads for intra-query parallelism on multicore processors. In: EDBT (2010)Google Scholar
  16. 16.
    Krompass, S., Kuno, H., Dayal, U., Kemper, A.: Dynamic workload management for very large data warehouses: Juggling feathers and bowling balls. In: Proc. of the 33rd Intl. Conf. on Very Large Databases (VLDB), pp. 1105–1115 (2007)Google Scholar
  17. 17.
    Kuo, T.-W., Wei, C.-H., Lam, K.-y.: Real-time data access control on B-tree index structures. In: IEEE 15th International Conference on Data Engineering. Sydney (1999)Google Scholar
  18. 18.
    Lee, R., Ding, X., Chen, F., Lu, Q., Zhang, X.: MCC-DB: Minimizing cache conflicts in multi-core processors for databases. PVLDB 2(1), 373–384 (2009)Google Scholar
  19. 19.
    Lim, L., Wang, M., Vitter, J.S.: SASH: A self-adaptive histogram set for dynamically changing workloads. In: Proceedings of 29th VLDB Conference. Berlin (2003)Google Scholar
  20. 20.
    Microsoft: SQL Server parallelism enhancements (2008)
  21. 21.
  22. 22.
  23. 23.
    Oracle Database 11g. Parallel execution (2007)
  24. 24.
    Pan, C.S., Zymbler, M.L.: Encapsulation of partitioned parallelism into open-source database management systems. Program Comput. Softw. 41(6), 350–360 (2015)CrossRefGoogle Scholar
  25. 25.
    Percival, C.: Cache missing for fun and profit. In: Proc. of BSDCan 2005 (2005)Google Scholar
  26. 26.
    Pivotal. GREENPLUM DB.
  27. 27.
    Qiao, L., Raman, V., Reiss, F., Haas, P.J., Lohman, G.M.: Main-memory scan sharing for multi-core CPUs. Proc. VLDB Endow. 1(1), 610–621 (2008)CrossRefGoogle Scholar
  28. 28.
    Rahm, E., Stöhr, T.: Analysis of parallel scan processing in parallel shared disk database systems. In: Proc. EURO-PAR Conf., LNCS, p. 966. Springer (1995)Google Scholar
  29. 29.
    Ristau, B., Fettweis, G.: An optimization methodology for memory allocation and task scheduling in SoCs via linear programming SAMOS 89–98 (2006)Google Scholar
  30. 30.
    Sokolinsky, LB.: Survey of architectures of parallel database system. Program Comput. Softw. 30(6), 337–346 (2004)CrossRefGoogle Scholar
  31. 31.
    Son, SH.: Replicated data management in distributed database systems, ACM SIGMOD, vol. 17 Issue 4, pp 62–69. ACM, New York (1988)Google Scholar
  32. 32.
    Tsafrir, D.: The context-switch overhead inflicted by hardware interrupts (and the enigma of do-nothing loops). In: Proceeding ecs’07 Experimental computer science on Experimental computer science, pp. 3–3. San Diego (2007)Google Scholar
  33. 33.
    Valduriez, P.: Parallel Database Systems: Open Problems and New Issues, Distributed and Parallel Databases. Springer (1993)Google Scholar

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. 1.School of Data Science and EngineeringEast China Normal UniversityShanghaiChina

Personalised recommendations