SAIR: significance-aware approach to improve QoR of big data processing in case of budget constraint

  • Hossein Ahmadvand
  • Maziar Goudarzi


Nowadays, a wide range of enterprises face big data processing in domains such as transaction operations, business calculations, and analytical computations. Large-scale computing is one approach to big data processing. Because large-scale computing is costly and enterprise budgets are limited, it is often infeasible to process all the input data, and the Quality of Result (QoR) may therefore suffer. SAIR is an approach that improves the QoR of big data processing for aggregative usages by exploiting significance variety under a budget constraint. In SAIR, the most significant data portions are assigned to the most efficient resources in terms of time and cost. If budget remains, the other data portions are assigned to the remaining resources. Statistical methods and a sampling technique with a 95% confidence level and a 5% margin of error are used to identify the most and least significant data portions. With this method, users can improve QoR with respect to the budget constraint and the preferred finishing time. In the evaluation, applications from different domains, such as document and text, transaction data, and system logs, are used. Our results indicate that SAIR improves QoR while meeting the budget constraint for the considered usages. This approach improves QoR by up to 15% compared with the state of the art.
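
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. It assumes the sample size is obtained from the standard Cochran formula for a proportion (z = 1.96 for 95% confidence, 5% margin of error, worst-case p = 0.5), and it assumes a simple greedy policy in which resource efficiency is reduced to a single cost-per-unit figure. The function names, the tuple layouts, and the example portions and resources are hypothetical and are not taken from the paper.

```python
import math


def cochran_sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Sample size for estimating a proportion (assumed formula, not from the paper).

    Applies the finite-population correction to Cochran's n0 = z^2 * p * (1 - p) / e^2.
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))


def allocate(portions, resources, budget):
    """Greedy, budget-constrained assignment (illustrative policy only).

    portions:  list of (name, significance, size)
    resources: list of (name, cost_per_unit, capacity)
    Returns (plan, remaining_budget), where plan is a list of
    (portion_name, resource_name, cost). Portions that do not fit are skipped.
    """
    plan = []
    remaining = budget
    free = {name: capacity for name, _, capacity in resources}
    cheapest_first = sorted(resources, key=lambda r: r[1])
    # Visit the most significant portions first.
    for pname, _, size in sorted(portions, key=lambda p: p[1], reverse=True):
        for rname, unit_cost, _ in cheapest_first:
            cost = unit_cost * size
            if free[rname] >= size and cost <= remaining:
                plan.append((pname, rname, cost))
                free[rname] -= size
                remaining -= cost
                break
    return plan, remaining


if __name__ == "__main__":
    # Hypothetical data portions with significance scores estimated from a sample.
    portions = [("logs-a", 0.9, 10), ("logs-b", 0.5, 10), ("logs-c", 0.1, 10)]
    resources = [("fast", 1.0, 10), ("slow", 2.0, 20)]
    print(cochran_sample_size(10_000))
    print(allocate(portions, resources, budget=35))
```

In this sketch the least significant portion is left unprocessed once the budget is exhausted, which mirrors the trade-off described above: QoR is protected by spending the budget on the portions that matter most.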


Big data, Significance, Quality of Result, Data variety, Budget constraint




Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
