Just-In-Time Data Distribution for Analytical Query Processing

Ivanova, Milena; Kersten, Martin; Groffen, Fabian

doi:10.1007/978-3-642-33074-2_16

Milena Ivanova¹⁹,
Martin Kersten¹⁹ &
Fabian Groffen¹⁹

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 7503))

Included in the following conference series:

East European Conference on Advances in Databases and Information Systems

780 Accesses
2 Citations

Abstract

Distributed processing commonly requires data spread across machines using a priori static or hash-based data allocation. In this paper, we explore an alternative approach that starts from a master node in control of the complete database, and a variable number of worker nodes for delegated query processing. Data is shipped just-in-time to the worker nodes using a need to know policy, and is being reused, if possible, in subsequent queries. A bidding mechanism among the workers yields a scheduling with the most efficient reuse of previously shipped data, minimizing the data transfer costs.

Just-in-time data shipment allows our system to benefit from locally available idle resources to boost overall performance. The system is maintenance-free and allocation is fully transparent to users. Our experiments show that the proposed adaptive distributed architecture is a viable and flexible alternative for small scale MapReduce-type of settings.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Bajda-Pawlikowski, K., Abadi, D.J., et al.: Efficient Processing of Data Warehousing Queries in a Split Execution Environment. In: SIGMOD, pp. 1165–1176 (2011)
Google Scholar
Chambers, C., Raniwala, A., et al.: FlumeJava: easy, efficient data-parallel pipelines. In: PLDI, pp. 363–375 (2010)
Google Scholar
Curino, C., Jones, E.P.C., et al.: Relational Cloud: a Database Service for the Cloud. In: CIDR, pp. 235–240 (2011)
Google Scholar
Dean, J., Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clusters. In: Proceedings of OSDI, pp. 137–150 (2004)
Google Scholar
Elmore, A.J., Das, S., Agrawal, D., Abbadi, A.E.: Zephyr: live migration in shared nothing databases for elastic cloud platforms. In: SIGMOD Conference, pp. 301–312 (2011)
Google Scholar
Floratou, A., Patel, J.M., Shekita, E.J., Tata, S.: Column-Oriented Storage Techniques for MapReduce. In: VLDB, pp. 419–429 (2011)
Google Scholar
Franklin, M.J., Jónsson, B.T., Kossmann, D.: Performance tradeoffs for client-server query processing. In: SIGMOD Conference, pp. 149–160 (1996)
Google Scholar
Goncalves, R., Kersten, M.L.: The data cyclotron query processing scheme. In: EDBT, pp. 75–86 (2010)
Google Scholar
Hadoop (2012), http://hadoop.apache.org/
Herodotou, H., Lim, H., et al.: Starfish: A self-tuning system for big data analytics. In: CIDR (2011)
Google Scholar
Ivanova, M., Kersten, M.L., Nes, N.J., Goncalves, R.: An architecture for recycling intermediates in a column-store. ACM Trans. Database Syst. 35(4), 24 (2010)
Article Google Scholar
Jiang, D., Ooi, B.C., Shi, L., Wu, S.: The Performance of MapReduce: An In-depth Study. PVLDB 3(1), 472–483 (2010)
Google Scholar
Kossmann, D., Franklin, M.J., Drasch, G.: Cache investment: integrating query optimization and distributed data placement. ACM Trans. Database Syst. 25(4), 517–558 (2000)
Article MATH Google Scholar
Olston, C., Reed, B., et al.: et al. Pig latin: a not-so-foreign language for data processing. In: SIGMOD Conference, pp. 1099–1110 (2008)
Google Scholar
Olston, C., Reed, B., Silberstein, A., Srivastava, U.: Automatic optimization of parallel dataflow programs. In: USENIX Annual Technical Conference, pp. 267–273 (2008)
Google Scholar
Pavlo, A., Paulson, E., et al.: A Comparison of Approaches to Large-scale Data Analysis. In: SIGMOD Conference, pp. 165–178 (2009)
Google Scholar
Plattner, C., Alonso, G., Özsu, M.T.: Extending DBMSs with Satellite Databases. VLDB J. 17(4), 657–682 (2008)
Article Google Scholar
Raman, V., Han, W., Narang, I.: Parallel querying with non-dedicated computers. In: VLDB, pp. 61–72 (2005)
Google Scholar
Röhm, U., Böhm, K., Schek, H.-J.: Cache-Aware Query Routing in a Cluster of Databases. In: ICDE, pp. 641–650 (2001)
Google Scholar
Thusoo, A., Sarma, J.S., et al.: Hive: a warehousing solution over a map-reduce framework. Proc. VLDB Endow. 2, 1626–1629 ( August 2009)
Google Scholar

Download references

Author information

Authors and Affiliations

Centrum Wiskunde & Informatica (CWI), Amsterdam, The Netherlands
Milena Ivanova, Martin Kersten & Fabian Groffen

Authors

Milena Ivanova
View author publications
You can also search for this author in PubMed Google Scholar
Martin Kersten
View author publications
You can also search for this author in PubMed Google Scholar
Fabian Groffen
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Institute of Computing Science, Poznań University of Technology, Piotrowo 2, 60-965, Poznań, Poland
Tadeusz Morzy
Department of Computer Science, AG DBIS, University of Kaiserslautern, Germany, P.O. Box 3049, 67653, Kaiserslautern, Germany
Theo Härder
Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965, Poznan, Poland
Robert Wrembel

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Ivanova, M., Kersten, M., Groffen, F. (2012). Just-In-Time Data Distribution for Analytical Query Processing. In: Morzy, T., Härder, T., Wrembel, R. (eds) Advances in Databases and Information Systems. ADBIS 2012. Lecture Notes in Computer Science, vol 7503. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33074-2_16

Download citation

DOI: https://doi.org/10.1007/978-3-642-33074-2_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33073-5
Online ISBN: 978-3-642-33074-2
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics