An adaptive failure recovery mechanism based on asymmetric routing for data center networks

Abstract

As the infrastructure of high-performance computing, the data center network plays an important role. Because network failures occur frequently, data center networks demand high-performance, robust, and energy-efficient failure recovery mechanisms. Despite progress, existing work still leaves considerable room for improvement in meeting these requirements. Backup-based failure recovery schemes reserve backup paths in advance, which results in high energy consumption under normal network conditions. To address this energy consumption problem, adaptive failure recovery schemes have been proposed that calculate the rerouting path for the traffic on the failed link, which reduces energy consumption. However, most adaptive failure recovery solutions apply multi-path routing to calculate the rerouting path. Because multi-path routing cannot detect the congestion status of a path under the asymmetric topology caused by link failures, the network becomes congested, which makes it less robust. In view of this, we design and evaluate AFRM, a novel adaptive failure recovery mechanism that overcomes these challenges. AFRM uses asymmetrical routing to calculate the rerouting path in a congestion-aware manner and is more robust to topological asymmetries than existing schemes. The asymmetrical routing dynamically schedules flows to the path with the least marginal cost, which makes AFRM much more energy-efficient. Additionally, AFRM achieves fast link failure detection based on hash storage and flow table matching. Evaluations show that AFRM can balance the trade-off between failure recovery time and energy consumption, reduce flow completion time, and increase network throughput compared with existing schemes.
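
The idea of scheduling a flow onto the path with the least marginal cost can be illustrated with a minimal Python sketch. It is not AFRM's implementation: the abstract does not state the cost function, so the convex M/M/1-style link cost used here is an assumption, and the names link_cost, marginal_cost, and pick_path are hypothetical. The sketch skips candidate paths that cross a failed link and, among the rest, picks the path whose total link cost grows least when the flow's demand is added.

```python
from typing import Dict, List, Optional, Set, Tuple

Link = Tuple[str, str]
Path = List[Link]
INF = float("inf")


def link_cost(load: float, capacity: float) -> float:
    """Assumed convex congestion cost (M/M/1-style); not taken from the paper."""
    if load >= capacity:
        return INF
    return load / (capacity - load)


def marginal_cost(path: Path, demand: float,
                  load: Dict[Link, float], capacity: Dict[Link, float]) -> float:
    """Increase in total link cost if `demand` is added to every link on `path`."""
    total = 0.0
    for link in path:
        after = link_cost(load[link] + demand, capacity[link])
        if after == INF:
            return INF  # the flow does not fit on this link
        total += after - link_cost(load[link], capacity[link])
    return total


def pick_path(candidates: List[Path], demand: float,
              load: Dict[Link, float], capacity: Dict[Link, float],
              failed: Set[Link]) -> Optional[Path]:
    """Return the surviving candidate with the least marginal cost, or None."""
    best: Optional[Path] = None
    best_cost = INF
    for path in candidates:
        if any(link in failed for link in path):
            continue  # path crosses a failed link
        cost = marginal_cost(path, demand, load, capacity)
        if cost < best_cost:
            best, best_cost = path, cost
    return best
```

Because the assumed cost is convex, a heavily loaded link contributes a large marginal cost, so the selection naturally steers rerouted traffic away from paths that are already congested after a failure has made the topology asymmetric.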

Acknowledgements

This work was supported in part by the National Key R&D Program of China under Grant 2018YFE0202800, the National Natural Science Foundation of China under Grants 61634004 and 61934002, the Natural Science Foundation of Shaanxi Province for Distinguished Young Scholars under Grant No. 2020JC-26, the Fundamental Research Funds for the Central Universities under Grant No. JB190105, the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing under Grant No. 2019A01, and the China Postdoctoral Science Foundation under Grant No. 2018M633465.

Author information

Corresponding author

Correspondence to Huaxi Gu.

About this article

Cite this article

Liu, Y., Gu, H., Wang, K. et al. An adaptive failure recovery mechanism based on asymmetric routing for data center networks. J Supercomput 77, 2103–2123 (2021). https://doi.org/10.1007/s11227-020-03337-4

Keywords

  • Data center networks
  • Failure recovery
  • Asymmetrical routing
  • Marginal cost