Multistore Big Data Integration with CloudMdsQL

Bondiombouy, Carlyna; Kolev, Boyan; Levchenko, Oleksandra; Valduriez, Patrick

doi:10.1007/978-3-662-53455-7_3

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 9940)

Abstract

Multistore systems have recently been proposed to provide integrated access to multiple, heterogeneous data stores through a single query engine. In particular, much attention is being paid to the integration of unstructured big data, typically stored in HDFS, with relational data. A common solution is to use a relational query engine that allows SQL-like queries to retrieve data from HDFS. This requires the system to provide a relational view of the unstructured data and hence is not always feasible. In this paper, we propose a functional SQL-like query language (based on CloudMdsQL) that can integrate data retrieved from different data stores. It takes full advantage of the functionality of the underlying data processing frameworks by allowing the ad-hoc use of user-defined map/filter/reduce operators in combination with traditional SQL statements. Furthermore, our solution allows for optimization by enabling subquery rewriting, so that bind join can be used and filter conditions can be pushed down and applied by the data processing framework as early as possible. We validate our approach through an implementation and experiments with three data stores and representative queries. The experimental results demonstrate the usability of the query language and the benefits of query optimization.
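
To make the bind join idea mentioned in the abstract concrete, the following is a minimal, generic sketch in Python. It is not CloudMdsQL code and not the authors' implementation. The helper names (`make_remote_source`, `bind_join`), the in-memory "stores", and the sample rows are all hypothetical and chosen only for illustration. The sketch assumes the left input is small, collects its distinct join keys, and passes them to the other source as a filter, so that only matching rows are shipped back.

```python
from typing import Callable, Dict, Iterable, List, Set

Row = Dict[str, object]


def make_remote_source(
    rows: List[Row], key: str, shipped: List[int]
) -> Callable[[Set[object]], List[Row]]:
    """Hypothetical stand-in for a data store that accepts a pushed-down key filter."""

    def fetch(keys: Set[object]) -> List[Row]:
        # The filter is applied at the source, before any row is returned.
        matching = [r for r in rows if r[key] in keys]
        shipped.append(len(matching))  # record how many rows left the source
        return matching

    return fetch


def bind_join(
    left_rows: Iterable[Row],
    fetch_right: Callable[[Set[object]], List[Row]],
    key: str,
) -> List[Row]:
    """Join left_rows with a remote source by binding the left join keys."""
    left = list(left_rows)
    # Step 1: collect the distinct join keys of the (assumed small) left side.
    keys = {row[key] for row in left}
    # Step 2: ask the right source only for rows whose key is in that set.
    index: Dict[object, List[Row]] = {}
    for r in fetch_right(keys):
        index.setdefault(r[key], []).append(r)
    # Step 3: combine each left row with its matching right rows.
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Illustrative data only: a small relational table and a larger "big data" side.
authors = [
    {"author": "ann", "country": "FR"},
    {"author": "bob", "country": "BG"},
]
posts = [
    {"author": "ann", "words": 120},
    {"author": "cat", "words": 80},
    {"author": "bob", "words": 45},
    {"author": "ann", "words": 10},
]

shipped: List[int] = []
result = bind_join(authors, make_remote_source(posts, "author", shipped), "author")
```

In this toy run the right-hand source returns three of its four rows, because the row for `cat` is filtered out before it is shipped. With a selective left side and a large right side, the same pattern can reduce the data transferred between stores considerably.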

Acknowledgements

This research has been partially funded by the European Commission under project CoherentPaaS (FP7-611068).

Author information

Authors and Affiliations

Inria and LIRMM, University of Montpellier, Montpellier, France
Carlyna Bondiombouy, Boyan Kolev, Oleksandra Levchenko & Patrick Valduriez

Corresponding author

Correspondence to Boyan Kolev.

Editor information

Editors and Affiliations

IRIT, Paul Sabatier University, Toulouse, France
Abdelkader Hameurlain
FAW, University of Linz, Linz, Austria
Josef Küng
FAW, University of Linz, Linz, Austria
Roland Wagner
HP Labs, Sunnyvale, California, USA
Qimin Chen

About this chapter

Cite this chapter

Bondiombouy, C., Kolev, B., Levchenko, O., Valduriez, P. (2016). Multistore Big Data Integration with CloudMdsQL. In: Hameurlain, A., Küng, J., Wagner, R., Chen, Q. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXVIII. Lecture Notes in Computer Science, vol 9940. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-53455-7_3

DOI: https://doi.org/10.1007/978-3-662-53455-7_3
Published: 10 September 2016
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-53454-0
Online ISBN: 978-3-662-53455-7
eBook Packages: Computer Science, Computer Science (R0)
