Classifying Big Data Analytic Approaches: A Generic Architecture

  • Yudith Cardinale
  • Sonia Guehis
  • Marta Rukoz
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 868)

Abstract

The explosive growth in the volume of generated data that applications must analyze has driven the current Big Data boom, which in turn has produced a vast landscape of architectural solutions. Non-expert users who must decide which analytical solutions best fit their particular constraints and requirements in a Big Data context are now lost among a panoply of disparate solutions. To support users in this difficult selection task, in previous work we proposed a generic architecture for classifying Big Data analytical approaches, together with a set of comparison and evaluation criteria. In this paper, we extend our classification architecture to cover more types of Big Data analytic tools and approaches, and we improve the list of criteria used to evaluate them. We classify existing Big Data analytics solutions according to the proposed generic architecture and evaluate them qualitatively against the comparison criteria. Additionally, we propose a preliminary design of a decision support system intended to generate suggestions for users based on this classification and on a qualitative evaluation in terms of previous users' experiences, users' requirements, the nature of the analysis they need, and the set of evaluation criteria.

Keywords

Big Data Analytic; Analytic models for big data; Analytical data management applications

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Dpto. de Computación y TI, Universidad Simón Bolívar, Caracas, Venezuela
  2. Université Paris Nanterre, Nanterre, France
  3. Université Paris Dauphine, PSL Research University, CNRS, UMR[7243], LAMSADE, Paris, France