Cluster Computing, Volume 18, Issue 3, pp 1157–1169

A data transmission algorithm for distributed computing system based on maximum flow

  • Xiaolu Zhang
  • Jiafu Jiang
  • Xiaotong Zhang
  • Xuan Wang


Data skew can lead to load imbalance and longer computation time in a distributed computing system. Avoiding data skew and reducing computation time requires transmitting data to appropriate machines, but this may consume too many network resources, so computational resources must be balanced against network resources. In this paper, we introduce a computation model called the distributed two-phase model, in which the processing of a task is divided into two independent phases: data transmission and data computation. Given an upper bound on the relative computation time of the data computation phase, we present a novel algorithm that schedules data transmission to meet this bound while minimizing data transmission time and network bandwidth usage. The algorithm also reduces the number of nodes that participate in the data computation phase, which saves computational resources. Simulation results show that the occupied bandwidth can be reduced effectively (by about 70%) for large-scale data sets and large numbers of nodes. The algorithm is also shown to be applicable in the replication situation.
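The abstract does not give the algorithm itself, so the following is only a minimal sketch of how a maximum-flow computation can test whether a given computation bound is reachable. It is not the authors' method. All names (`max_flow`, `can_meet_bound`, `held`, `link`, `cap`) are hypothetical, and the model rests on two simplifying assumptions: the computation bound is expressed as a per-node data limit `cap`, and each `link[(i, j)]` is the amount of data that can move from node `i` to node `j` within the transmission phase.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp maximum flow; capacity is a dict of dicts {u: {v: c}}."""
    # Build the residual graph, adding reverse edges with capacity 0.
    residual = {source: {}, sink: {}}
    for u, edges in capacity.items():
        for v, c in edges.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + c
            residual[v].setdefault(u, 0)
    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total
        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push the bottleneck amount along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


def can_meet_bound(held, link, cap):
    """Return True if all data can be placed so that no node computes more than cap.

    held: {node: data units initially stored on that node} (assumed input)
    link: {(i, j): data units transferable from i to j} (assumed input)
    cap:  per-node data limit standing in for the computation-time bound
    """
    total = sum(held.values())
    capacity = {"s": {}}
    for i, amount in held.items():
        capacity["s"][("hold", i)] = amount
        # Keeping data local costs no bandwidth, so this edge is effectively unbounded.
        capacity.setdefault(("hold", i), {})[("work", i)] = total
        capacity.setdefault(("work", i), {})["t"] = cap
    for (i, j), limit in link.items():
        capacity.setdefault(("hold", i), {})[("work", j)] = limit
    return max_flow(capacity, "s", "t") == total


held = {"a": 10, "b": 2, "c": 0}          # node "a" holds skewed data
link = {("a", "b"): 3, ("a", "c"): 3}      # transfer limits out of "a"
print(can_meet_bound(held, link, cap=5))   # True
print(can_meet_bound(held, link, cap=4))   # False
```

In this toy instance a bound of 5 units per node is reachable because node "a" can keep 5 units and ship the rest to "b" and "c", whereas a bound of 4 is not, since the links and the remaining capacity of "b" cannot absorb the surplus. The sketch only checks feasibility; it does not minimize bandwidth or transmission time as the paper's algorithm is stated to do.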


Distributed computing system, Minimize bandwidth usage, Data transmission time



This work was supported by the National 863 Project (2011AA040101) and was jointly funded by the Beijing Municipal Education Commission of the Scientific Research.



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Xiaolu Zhang (1)
  • Jiafu Jiang (1)
  • Xiaotong Zhang (1)
  • Xuan Wang (1)
  1. University of Science and Technology Beijing, Beijing, China
