Integrating workload balancing and fault tolerance in distributed stream processing system

  • Junhua FangEmail author
  • Pingfu Chao
  • Rong Zhang
  • Xiaofang Zhou


Distributed Stream Processing Engine (DSPE) is designed for processing continuous streams so as to achieve the real-time performance with low latency guaranteed. To satisfy such requirement, the availability and efficiency are the main concern of the DSPE system, which can be achieved by a proper design of the fault tolerance module and the workload balancing module, respectively. However, the inherent characteristics of data streams, including persistence, dynamic and unpredictability, pose great challenges in satisfying both properties. As far as we know, most of the state-of-the-art DSPE systems take either fault tolerance or workload balancing as its single optimization goal, which in turn receives a higher resource overhead or longer recovery time. In this paper, we combine the fault tolerance and workload balancing mechanisms in the DSPE to reduce the overall resource consumption while keeping the system interactive, high-throughput, scalable and highly available. Based on our data-level replication strategy, our method can handle the dynamic data skewness and node failure scenario: during the distribution fluctuation of the incoming stream, we rebalance the workload by selectively inactivate the data in high-load nodes and activate their replicas on low-load nodes to minimize the migration overhead within the stateful operator; when a fault occurs in the process, the system activates the replicas of the data affected to ensure the correctness while keeping the workload balanced. Extensive experiments on various join workloads on both benchmark data and real data show our superior performance compared with baseline systems.


Real-time data processing Distributed systems Load Balancing High availability computing 



This work is partially supported by National Science Foundation of China under grant (No. 61802273, 61572194, 61772356, 61572335, and 61836007), Postdoctoral Research Foundation of China (2017M621813), Postdoctoral Science Foundation of Jiangsu Province (2018K029C), Natural science fund for colleges and universities in Jiangsu Province (18KJB520044), and Suzhou Science and Technology Development Program(SYG201803). This work is also supported by the Open Program of Neusoft Corporation(SKLSAOP1801) and Blockheaders Co. Ltd.


  1. 1.
  2. 2.
    Abadi, D.J., Carney, D., Çetintemel, U., Cherniack, M., Convey, C., Lee, S., Stonebraker, M., Tatbul, N., Zdonik, S.B.: Aurora: a new model and architecture for data stream management. VLDBJ 12(2), 120–139 (2003)CrossRefGoogle Scholar
  3. 3.
    Aniello, L., Baldoni, R., Querzoni, L.: Adaptive online scheduling in storm. In: DEBS, pp 207–218 (2013)Google Scholar
  4. 4.
    Balazinska, M., Balakrishnan, H., Madden, S.R., Stonebraker, M.: Fault-tolerance in the borealis distributed stream processing system. ACM Trans. Database Syst. (TODS) 33(1), 3 (2008)CrossRefGoogle Scholar
  5. 5.
    Bellavista, P., Corradi, A., Kotoulas, S., Reale, A.: Adaptive fault-tolerance for dynamic resource provisioning in distributed stream processing systems. In: EDBT, pp. 85–96 (2014)Google Scholar
  6. 6.
    Castro Fernandez, R., Migliavacca, M., Kalyvianaki, E., Pietzuch, P.: Integrating scale out and fault tolerance in stream processing using operator state management. In: Proceedings of the Integrating Scale ACM SIGMOD International Conference on Management of Data, p 2013. ACM (2013)Google Scholar
  7. 7.
    Cherniack, M., Balakrishnan, H., Balazinska, M., Carney, D., Cetintemel, U., Xing, Y., Zdonik, S.B.: Scalable distributed stream processing. In: CIDR, vol. 3, pp 257–268 (2003)Google Scholar
  8. 8.
    Coffman Jr, E.G., Garey, M.R., Johnson, D.S.: Approximation algorithms for bin-packing<a an updated survey. In: Algorithm Design for Computer System Design, pp 49–106. Springer (1984)Google Scholar
  9. 9.
    Elseidy, M., Elguindy, A., Vitorovic, A., Koch, C.: Scalable and adaptive online joins. VLDB 7(6), 441–452 (2014)Google Scholar
  10. 10.
    Fang, J., Zhang, R., Fu, T.Z., Zhang, Z., Zhou, A., Zhu, J.: Parallel stream processing against workload skewness and variance. In: Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, pp 15–26. ACM (2017)Google Scholar
  11. 11.
    Fu, T.Z.J., Ding, J., Ma, R.T.B., Winslett, M., Yang, Y., Zhang, Z.: Drs: dynamic resource scheduling for real-time analytics over fast streams. In: ICDCS, pp 411–420. IEEE, Columbus (2015)Google Scholar
  12. 12.
    Gedik, B.: Partitioning functions for stateful data parallelism in stream processing. VLDBJ 23(4), 517–539 (2014)CrossRefGoogle Scholar
  13. 13.
    Ghanbari, H., Simmons, B., Litoiu, M., Iszlai, G.: Exploring alternative approaches to implement an elasticity policy. In: 2011 IEEE International Conference on Cloud Computing (CLOUD), pp. 716–723. IEEE (2011)Google Scholar
  14. 14.
    Heath, T., Martin, R.P., Nguyen, T.D.: Improving cluster availability using workstation validation. In: ACM SIGMETRICS Performance Evaluation Review, vol. 30, pp. 217–227. ACM (2002)Google Scholar
  15. 15.
    Heinze, T., Zia, M., Krahn, R., Jerzak, Z., Fetzer, C.: An adaptive replication scheme for elastic data stream processing systems. In: Proceedings of the 9th ACM International Conference on Distributed Event-Based Systems, pp. 150–161. ACM (2015)Google Scholar
  16. 16.
    Hwang, J.-H., Balazinska, M., Rasin, A., Cetintemel, U., Stonebraker, M., Zdonik, S.: High-availability algorithms for distributed stream processing. In: Data Engineering, 2005. ICDE 2005. Proceedings. 21st International Conference on, pp. 779–790. IEEE (2005)Google Scholar
  17. 17.
    Hwang, J.-H., Cetintemel, U., Zdonik, S.: Fast and highly-available stream processing over wide area networks. In: Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on, pp. 804–813. IEEE (2008)Google Scholar
  18. 18.
    Hwang, J.-H., Xing, Y., Cetintemel, U., Zdonik, S.: A cooperative, self-configuring high-availability solution for stream processing. In: Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on, pp. 176–185. IEEE (2007)Google Scholar
  19. 19.
    Jacques-Silva, G., Gedik, B., Andrade, H., Wu, K.-L., Iyer, R.K.: Fault injection-based assessment of partial fault tolerance in stream processing applications. In: Proceedings of the 5th ACM International Conference on Distributed Event-Based System, pp. 231–242. ACM (2011)Google Scholar
  20. 20.
    Ji, Y., Nica, A., Jerzak, Z., Hackenbroich, G., Fetzer, C.: Quality-driven disorder handling for concurrent windowed stream queries with shared operators. In: Proceedings of the 10th ACM International Conference on Distributed and Event-Based Systems, pp. 25–36. ACM (2016)Google Scholar
  21. 21.
    Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., Lewin, D.: Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In: STOC, pp. 654–663 (1997)Google Scholar
  22. 22.
    Katsipoulakis, N.R., Labrinidis, A., Chrysanthis, P.K.: A holistic view of stream partitioning costs. VLDB 10(11), 1286–1297 (2017)Google Scholar
  23. 23.
    Khandekar, R., Hildrum, K., Parekh, S., Rajan, D., Wolf, J., Wu, K.-L., Andrade, H., Gedik, B.: Cola: Optimizing stream processing applications via graph partitioning. In: Middleware, pp 308–327 (2009)Google Scholar
  24. 24.
    Lin, Q., Ooi, B.C., Wang, Z., Yu, C.: Scalable distributed stream join processing. In: SIGMOD, pp. 811–825 (2015)Google Scholar
  25. 25.
    Nasir, M.A.U., Morales, G.D.F., García-Soriano, D., Kourtellis, N., Serafini, M.: The power of both choices: Practical load balancing for distributed stream processing engines. ICDE (2015)Google Scholar
  26. 26.
    Nasir, M.A.U., Serafini, M., et al.: When two choices are not enough: Balancing at scale in distributed stream processing. In: ICDE (2016)Google Scholar
  27. 27.
    Noghabi, S.A., Paramasivam, K., Pan, Y., Ramesh, N., Bringhurst, J., Gupta, I., Campbell, R.H.: Samza: Stateful scalable stream processing at linkedin. VLDB 10(12), 1634–1645 (2017)Google Scholar
  28. 28.
    Qian, Z., He, Y., Su, C., Wu, Z., Zhu, H., Zhang, T., Zhou, L., Yu, Y., Zhang, Z.: Timestream: Reliable stream computation in the cloud. In: Proceedings of the 8th ACM European Conference on Computer Systems, pp. 1–14. ACM (2013)Google Scholar
  29. 29.
    Rupprecht, L., Culhane, W., Pietzuch, P.: Squirreljoin: Network-aware distributed join processing with lazy partitioning. Proceedings of the VLDB Endowment 10(11), 1250–1261 (2017)CrossRefGoogle Scholar
  30. 30.
    Salama, A., Binnig, C., Kraska, T., Zamanian, E.: Cost-based fault-tolerance for parallel data processing. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 285–297. ACM (2015)Google Scholar
  31. 31.
    Schroeder, B., Gibson, G.: A large-scale study of failures in high-performance computing systems. IEEE Trans. on Dependable and Secure Comput. 7(4), 337–350 (2010)CrossRefGoogle Scholar
  32. 32.
    Su, L., Zhou, Y.: Tolerating correlated failures in massively parallel stream processing engines. In: ICDE, pp. 517–528 (2016)Google Scholar
  33. 33.
    Su, L., Zhou, Y.: Passive and partially active fault tolerance for massively parallel stream processing engines. TKDE (2017)Google Scholar
  34. 34.
    Upadhyaya, P., Kwon, Y., Balazinska, M.: A latency and fault-tolerance optimizer for online parallel query plans. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, pp. 241–252. ACM (2011)Google Scholar
  35. 35.
    Vishwanath, K.V., Nagappan, N.: Characterizing cloud computing hardware reliability. In: Proceedings of the 1st ACM Symposium on Cloud Computing, pp. 193–204. ACM (2010)Google Scholar
  36. 36.
    Wolf, J., Bansal, N., Hildrum, K., Parekh, S., Rajan, D., Wagle, R., Wu, K.-L., Fleischer, L.: Soda: an optimizing scheduler for large-scale stream-based distributed computer systems. In: Middleware, pp 306–325 (2008)Google Scholar
  37. 37.
    Xing, Y., Hwang, J., Cetintemel, U., Zdonik, S.: Providing resiliency to load variations in distributed stream processing. In: VLDB, pp. 775–786 (2006)Google Scholar
  38. 38.
    Xing, Y., Zdonik, S., Hwang, J.: Dynamic load distribution in the borealis stream processor. In: ICDE, pp. 791–802 (2005)Google Scholar
  39. 39.
    Zaharia, M., Das, T., Li, H., Hunt, T.: Discretized streams: Fault-tolerant streaming computation at scale. In: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pp. 423–438. ACM (2013)Google Scholar

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  • Junhua Fang
    • 1
    • 4
    Email author
  • Pingfu Chao
    • 2
  • Rong Zhang
    • 3
  • Xiaofang Zhou
    • 1
    • 2
  1. 1.Institute of Artificial Intelligence, School of Computer Science and TechnologySoochow UniversitySuzhouChina
  2. 2.School of Information Technology and Electrical EngineeringThe University of QueenslandBrisbaneAustralia
  3. 3.School of Data Science and EngineeringEast China Normal UniversityShanghaiChina
  4. 4.Neusoft CorporationShenyangChina

Personalised recommendations