Abstract
With the amount of data now produced in several application domains, it is increasingly difficult to manage and query the related large data repositories (https://www.lsstcorp.org/sciencewiki/images/DC_Handbook_v1.1.pdf). Within the PetaSky project, we focus on the problem of managing scientific data in the field of cosmology. The data we consider are those of the LSST project, whose database is expected to exceed 60 PB overall. This paper presents preliminary results of experiments conducted on the PT1.1 (http://lsst1.ncsa.uiuc.edu/schema/index.php?sVer=PT1_1, with a size of 90 GB) and PT1.2 (http://lsst1.ncsa.uiuc.edu/schema/index.php?sVer=PT1_1, with a size of 145 GB) data sets in order to compare the performance of centralized and distributed database management systems. For centralized systems, we deployed three different DBMSs: MySQL, PostgreSQL and DBMS-X (a commercial relational database). For distributed systems, we deployed HadoopDB and Hive. The goal of these experiments is to report on the ability of these systems to support large-scale declarative queries. We mainly investigate the impact of data partitioning, indexing and compression on query execution performance.
This work is partially supported by the Centre National de la Recherche Scientifique (CNRS) under the project Petasky-Mastodons (http://com.isima.fr/Petasky).
Notes
- 8. XLDB (Extremely Large Data Bases, http://www.xldb.org) and SciDB (Scientific Data Bases, http://www.scidb.org/).
- 9. A tool that gives the average disc speeds (http://en.wikipedia.org/wiki/Hdparm).
- 12. The same configuration is used in [5], the popular Hadoop benchmark paper by Pavlo et al.
- 14. Called Catalog.
- 17. Note that the expected final Source table will have 125 attributes.
- 18. Note that the expected final Object table will have 470 attributes.
- 23. These functions need to be implemented. Such queries will be considered in another test campaign. We already verified that all the functions can be implemented within Hive and HadoopDB.
- 25. This attribute is the primary key in the Object table and a foreign key in the Source table.
- 27. Query #7 is the least expensive one for HadoopDB.
- 28. Query #6 is the most expensive one for Hive.
- 31. https://dev.lsstcorp.org/trac/wiki/db/queries, e.g., Q007, Q008, Q013
References
Abadi, D.: Consistency tradeoffs in modern distributed database system design: CAP is only part of the story. Computer 45(2), 37–42 (2012)
Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silberschatz, A., Rasin, A.: HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proc. VLDB Endow. 2(1), 922–933 (2009)
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: OSDI, pp. 137–150 (2004)
DeWitt, D., Gray, J.: Parallel database systems: the future of high performance database systems. Commun. ACM 35(6), 85–98 (1992)
Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: SIGMOD, pp. 165–178. ACM (2009)
Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., Liu, H., Murthy, R.: Hive - a petabyte scale data warehouse using Hadoop. In: ICDE, pp. 996–1005. IEEE (2010)
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mesmoudi, A., Hacid, MS. (2014). A Comparison of Systems to Large-Scale Data Access. In: Han, WS., Lee, M., Muliantara, A., Sanjaya, N., Thalheim, B., Zhou, S. (eds) Database Systems for Advanced Applications. DASFAA 2014. Lecture Notes in Computer Science, vol 8505. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-43984-5_12
DOI: https://doi.org/10.1007/978-3-662-43984-5_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-43983-8
Online ISBN: 978-3-662-43984-5
eBook Packages: Computer Science (R0)