Scalability Evaluation of Big Data Processing Services in Clouds

  • Xin Zhou
  • Congfeng Jiang (Email author)
  • Yeliang Qiu
  • Tiantian Fan
  • Yumei Wang
  • Liangbin Zhang
  • Jian Wan
  • Weisong Shi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11459)


Currently, many cloud providers deploy their big data processing systems as cloud services, which helps users conveniently manage and process their data in clouds. Evaluating and comparing the scalability of different providers’ big data processing services is an interesting and challenging problem. Most traditional benchmark tools focus on performance evaluation of big data processing systems, using metrics such as aggregated throughput and IOPS, but do not provide a quantitative analysis of scalability. In this paper, we propose a measurement methodology to quantify the scalability of big data processing services, which makes the scalability of cloud services comparable. We conduct a group of comparative experiments on AliCloud E-MapReduce and Baidu MRS, and collect their respective scalability characteristics under Hadoop and Spark workloads. The scalability characteristics observed in our work could help cloud users choose the cloud service platform best suited to building an optimized big data processing system for their specific goals.


Keywords: Big data · Benchmark · Scalability · AliCloud · Baidu cloud



This work is supported by the Natural Science Foundation of China (No. 61472109, No. 61572163, and No. 61472112) and the Key Research and Development Program of Zhejiang Province (No. 2018C01098, No. 2019C01059, and No. 2019C03134). This work is also supported in part by National Science Foundation (NSF) grants CNS-1205338 and CNS-1561216, and by the Introduction of Innovative R&D Team Program of Guangdong Province (No. 201001D0104726115). This work is supported by Alibaba Group through the Alibaba Innovative Research (AIR) Program, and partially supported by the Visiting Scholarship of Teachers’ Professional Development Program (No. FX2018050).



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Xin Zhou (1, 2)
  • Congfeng Jiang (1, 2), Email author
  • Yeliang Qiu (1, 2)
  • Tiantian Fan (1, 2)
  • Yumei Wang (1, 2)
  • Liangbin Zhang (3)
  • Jian Wan (4)
  • Weisong Shi (5)

  1. Key Laboratory of Complex Systems Modeling and Simulation, Ministry of Education, Hangzhou Dianzi University, Hangzhou, China
  2. School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
  3. College of Big Data and Software Engineering, Zhejiang Wanli University, Ningbo, China
  4. School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou, China
  5. Department of Computer Science, Wayne State University, Detroit, USA
