A Survey on Global Management View: Toward Combining System Monitoring, Resource Management, and Load Prediction

Abstract

Today, enterprise applications impose ever-growing resource requirements to support an increasing number of clients and to deliver an acceptable Quality of Service (QoS) to them. To ensure that such requirements are met, it is essential to apply appropriate resource and application monitoring techniques. These techniques collect data that enable predictions and actions that can improve system performance. Typically, system administrators must consider several data sources and work out the relationships among them on their own. To address this gap in the context of general network-based systems, we present a survey that combines a discussion of system monitoring, data prediction, and resource management procedures in a unified view. The article discusses resource and application monitoring, resource management, and data forecasting from both the performance and architectural perspectives of enterprise systems. Our idea is to describe consolidated subjects such as monitoring metrics and resource scheduling, together with novel trends, including cloud elasticity and artificial intelligence-based load prediction algorithms. This survey links these three pillars, emphasizing the relationships among them and pointing out opportunities and research challenges in the area.

Acknowledgements

This article was partially supported by the following Brazilian agencies: CAPES, CNPq and FAPERGS. In addition, we would like to thank DELL for also supporting this research.

Author information

Corresponding author

Correspondence to Rodrigo da Rosa Righi.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The final version of the present manuscript was approved by Marcio Lena on behalf of Dell Technologies.

About this article

Cite this article

da Rosa Righi, R., Lehmann, M., Gomes, M.M. et al. A Survey on Global Management View: Toward Combining System Monitoring, Resource Management, and Load Prediction. J Grid Computing 17, 473–502 (2019). https://doi.org/10.1007/s10723-018-09471-x

Keywords

  • Performance
  • Monitoring metrics
  • Computing infrastructure
  • Resource monitoring
  • Resource management
  • Load prediction
  • Time series