Skip to main content
Log in

Distributed stream join under workload variance

  • Published:
World Wide Web Aims and scope Submit manuscript

Abstract

Flexible and self-adaptive stream join processing plays an important role in a parallel shared-nothing environments. Join-Matrix model is a high-performance model which is resilient to data skew and supports arbitrary join predicates for taking random tuple distribution as its routing policy. To maximize system throughputs and minimize network communication cost, a scalable partitioning scheme on matrix is critical. In this paper, we present a novel flexible and adaptive scheme partitioning model for stream join operator, which ensures high throughput but with economical resource usages by allocating resources on demand. Specifically, a lightweight scheme generator, which requires the sample of each stream volume and processing resource quota of each physical machine, generates a join scheme; then a migration plan generator decides how to migrate data among machines under the consideration of minimizing migration cost while ensuring correctness. We do extensive experiments on different kinds of join workloads and the evaluation shows high competence comparing with baseline systems on benchmark data and real data.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9
Figure 10
Figure 11

Similar content being viewed by others

Notes

  1. http://open.weibo.com/wiki/2/statuses/user_timeline

References

  1. Apache Storm. http://storm.apache.org/

  2. The TPC-H Benchmark. http://www.tpc.org/tpch

  3. Ananthanarayanan, R., Basker, V., Das, S., Gupta, A., Jiang, H., Qiu, T., Reznichenko, A., Ryabkov, D., Singh, M., Venkataraman, S.: Photon: fault-tolerant and scalable joining of continuous data streams. In: SIGMOD, pp 577–588 (2013)

  4. Anis Uddin Nasir, M., De Francisci Morales, G., et al.: The power of both choices: Practical load balancing for distributed stream processing engines. In: ICDE, pp 137–148 (2015)

  5. Cormode, G., Muthukrishnan, S.: An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 55(1), 58–75 (2005)

    Article  MathSciNet  MATH  Google Scholar 

  6. Dittrich, J.P., Seeger, B., Taylor, D.S., Widmayer, P.: Progressive merge join: A generic and non-blocking sort-based join algorithm. In: VLDB, pp 299–310 (2002)

  7. Elseidy, M., Elguindy, A., Vitorovic, A., Koch, C.: Scalable and adaptive online joins. In: VLDB, pp 441–452 (2014)

  8. Epstein, R. S., Stonebraker, M., Wong, E.: Distributed query processing in a relational data base system. In: SIGMOD, pp 169–180 (1978)

  9. Fang, J., Wang, X., Zhang, R., Zhou, A.: Flexible and adaptive stream join algorithm. In: APWEB, pp 3–16 (2016)

  10. Fang, J., Zhang, R., Fu, T.Z.J., Zhang, Z., Zhou, A., Zhu, J.: Parallel stream processing against workload skewness and variance. CoRR, abs/1610.05121 (2016)

  11. Gedik, B.: Partitioning functions for stateful data parallelism in stream processing. VLDB J. 23(4), 517–539 (2014)

    Article  Google Scholar 

  12. Graefe, G.: Query evaluation techniques for large databases. ACM Comput. Surv. (CSUR) 25(2), 73–169 (1993)

    Article  Google Scholar 

  13. Huang, X., Cheng, H., Li, R.-H., Qin, L., Yu, J.X.: Top-k structural diversity search in large networks. In: VLDB, pp 1618–1629 (2013)

  14. Huebsch, R., Garofalakis, M., Hellerstein, J., Stoica, I.: Advanced join strategies for large-scale distributed computation. In: VLDB, pp 1484–1495 (2014)

  15. Ives, Z.G., Florescu, D., Friedman, M., Levy, A., Weld, D.S.: An adaptive query execution system for data integration, vol. 28, pp 299–310. ACM (1999)

  16. Kwon, Y., Balazinska, M., et al.: Skewtune: mitigating skew in mapreduce applications. In: SIGMOD, pp 25–36 (2012)

  17. Li, J., Liu, C., Liu, B., Mao, R., Wang, Y., Chen, S., Yang, J.-J., Pan, H., Wang, Q.: Diversity-aware retrieval of medical records. Comput. Ind. 69, 81–91 (2015)

    Article  Google Scholar 

  18. Lin, Q., Ooi, B.C., Wang, Z., Yu, C.: Scalable distributed stream join processing. In: SIGMOD, pp 811–825 (2015)

  19. Liu, B., Zhu, Y., Jbantova, M., et al.: and A dynamically adaptive distributed system for processing complex continuous queries. In: VLDB, pp 1338–1341 (2005)

  20. Lu, M., Tang, Y., Sun, R., Wang, T., Chen, S., Mao, R.: A real time displacement estimation algorithm for ultrasound elastography. Comput. Ind. 69, 61–71 (2015)

    Article  Google Scholar 

  21. Mao, R., Xu, H., Wu, W., Li, J., Li, Y., Lu, M.: Overcoming the challenge of variety: big data abstraction, the next evolution of data management for aal communication systems. Communications Magazine, IEEE 53(1), 42–47 (2015)

    Article  Google Scholar 

  22. Mao, R., Zhang, P., Li, X., Liu, X., Lu, M.: Pivot selection for metric-space indexing. Int. J. Mach. Learn. Cybern. 7(2), 311–323 (2016)

    Article  Google Scholar 

  23. Nasir, M.A.U., Serafini, M., et al.: When two choices are not enough: Balancing at scale in distributed stream processing. In: ICDE (2016)

  24. Okcan, A., Riedewald, M.: Processing theta-joins using mapreduce. In: SIGMOD, pp 949–960 (2011)

  25. Qian, Z., He, Y., Su, C., Wu, Z., Zhu, H., Zhang, T., Zhou, L., Yu, Y., Zhang, Z.: Timestream: reliable stream computation in the cloud. In: Eurosys, pp 1–14 (2013)

  26. Stamos, J.W., Young, H.C.: A symmetric and replicate algorithm for distributed joins. IEEE Trans. Parallel Distrib. Syst. 4(12), 1345–1354 (1993)

    Article  Google Scholar 

  27. Tao, Y., Yiu, M., Papadias, D., et al.: Rpj: Producing fasj join results on streams through rate-based optimization. In: SIGMOD, pp 371–382 (2005)

  28. ufler, N., Augsten, B., Reiser, A., Kemper, A.: Load balancing in mapreduce based on scalable cardinality estimates. In: ICDE, pp 522–533 (2012)

  29. Urhan, T., Franklin, M.J.: Xjoin: A reactively-scheduled pipelined join operatorỳ Bulletin of the Technical Committee (2000)

  30. Vitorovic, A., ElSeidy, M., Koch, C.: Load balancing and skew resilience for parallel joins. In: ICDE (2016)

  31. Wang, J., Huang, J.Z., Guo, J., Lan, Y.: Recommending high-utility search engine queries via a query-recommending model. Neurocomputing 167, 195–208 (2015)

    Article  Google Scholar 

  32. Wilschut, A., Apers, P.: Dataflow query execution in a aarallel main-memory environment. Distributed and Parallel Databases 1(1), 103–128 (1993)

    Article  Google Scholar 

  33. Xing, Y., Hwang, J., Cetintemel, U., Zdonik, S.: Providing resiliency to load variations in distributed stream processing. In: VLDB, pp 775–786 (2006)

  34. Xu, Y., Kostamaa, P., Zhou, X., Chen, L.: Handling data skew in parallel joins in shared-nothing systems. In: SIGMOD, pp 1043–1052 (2008)

  35. Zaharia, M., Das, T., Li, H., Hunter, T., Shenker, S., Stoica, I.: Discretized streams: Fault-tolerant streaming computation at scale. In: SOSP, pp 423–438. ACM (2013)

  36. Zheng, B., Zheng, K., Xiao, X., Su, H., Yin, H., Zhou, X., Li, G.: Keyword-aware continuous knn query on road networks. In: ICDE, pp 871–882 (2016)

  37. Zheng, K., Zheng, Y., Yuan, N.J., Shang, S.: On discovery of gathering patterns from trajectories. In: ICDE, pp 242–253 (2013)

  38. Zheng, K., Zheng, Y., Yuan, N.J., Shang, S., Zhou, X.: Online discovery of gathering patterns over trajectories. IEEE Trans. Knowl. Data Eng. 26(8), 1974–1988 (2014)

    Article  Google Scholar 

  39. Zhu, Z., Xiao, J., Li, J., Wang, F., Zhang, Q.: Global path planning of wheeled robots using multi-objective memetic algorithms. Integrated Computer-Aided Engineering 22(4), 387–404 (2015)

    Article  Google Scholar 

Download references

Acknowledgments

This work is partially supported by National High Technology Research and Development Program of China (863 Project) No. 2015AA015307, National Science Foundation of China under grant (No. 61232002, No. 61672233 and NO. 61572194).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Rong Zhang.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Fang, J., Zhang, R., Wang, X. et al. Distributed stream join under workload variance. World Wide Web 20, 1089–1110 (2017). https://doi.org/10.1007/s11280-017-0431-7

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11280-017-0431-7

Keywords

Navigation