Big Data Management in the Cloud: Evolution or Crossroad?

Hameurlain, Abdelkader; Morvan, Franck

doi:10.1007/978-3-319-34099-9_2

Part of the book series: Communications in Computer and Information Science (CCIS, volume 613)

Abstract

In this paper, we provide a synthetic and comprehensive state of the art of big data management in cloud environments. To this end, data management approaches based on parallel and cloud (e.g. MapReduce) systems are surveyed and compared with respect to how well they meet software requirements (e.g. data independence, software reuse), high performance, scalability, elasticity, and data availability. For the proposed cloud systems, we discuss the evolution of their data manipulation languages and draw lessons that should be exploited to ensure the viability of the next generation of large-scale data management systems for big data applications.

Author information

Authors and Affiliations

IRIT (Institut de Recherche en Informatique de Toulouse), Paul Sabatier University, 118 Route de Narbonne, 31062 Toulouse Cedex, France
Abdelkader Hameurlain & Franck Morvan

Corresponding author

Correspondence to Abdelkader Hameurlain.

Editor information

Editors and Affiliations

Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Stanisław Kozielski, Dariusz Mrozek, Paweł Kasprowski, Bożena Małysiak-Mrozek & Daniel Kostrzewa

About this paper

Cite this paper

Hameurlain, A., Morvan, F. (2016). Big Data Management in the Cloud: Evolution or Crossroad? In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. Advanced Technologies for Data Mining and Knowledge Discovery. BDAS 2015, BDAS 2016. Communications in Computer and Information Science, vol 613. Springer, Cham. https://doi.org/10.1007/978-3-319-34099-9_2

DOI: https://doi.org/10.1007/978-3-319-34099-9_2
Published: 28 April 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-34098-2
Online ISBN: 978-3-319-34099-9
eBook Packages: Computer Science, Computer Science (R0)
