Performance Modeling of Big Data-Oriented Architectures

  • Marco Gribaudo
  • Mauro Iacono (Email author)
  • Francesco Palmieri
Chapter
Part of the Computer Communications and Networks book series (CCN)

Abstract

Big Data applications provide new, disruptive tools to advance our knowledge of the mechanisms that characterize complex aspects of reality. Be it a high-energy physics experiment or an analysis of social network data, the strength of the approach is the availability of a huge wealth of data. At the same time, this is also the main challenge: the abundance of information must be processed at a bearable cost per information unit, and it requires larger-scale systems to provide enough computing power. This is only possible if the Big Data platform is properly managed and exploited according to the needs of the applications, and a fundamental prerequisite is the ability to properly evaluate the performance of the platform. In this chapter, we give an overview of the main aspects of performance evaluation for Big Data architectures, together with some examples of model-based evaluation, to show how large-scale architectures can be characterized to support their correct management. We also suggest a coarse-grained methodological solution that combines different conceptual and technical tools into a flexible, model-based approach to Big Data systems design, supported by performance analysis, whose core evaluation stage scales up easily by means of Markovian agents.

Keywords

Performance analysis; Big Data architectures; Design methodology; Markovian agents

Notes

Acknowledgments

We would like to thank Dr. E. Barbierato for his valuable comments, which helped us improve the quality of this chapter.

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Marco Gribaudo (1)
  • Mauro Iacono (2) (Email author)
  • Francesco Palmieri (3)

  1. DEIB, Politecnico di Milano, Milan, Italy
  2. DMF, Seconda Università degli Studi di Napoli, Caserta, Italy
  3. DI, Università degli Studi di Salerno, Fisciano, Italy
