A Comparison of Systems to Large-Scale Data Access

Mesmoudi, Amin; Hacid, Mohand-Saïd

doi:10.1007/978-3-662-43984-5_12

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8505)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

With the amount of data produced in several application domains, it is increasingly difficult to manage and query the related large data repositories (https://www.lsstcorp.org/sciencewiki/images/DC_ Handbook_v1.1.pdf). Within the PetaSky project, we focus on the problem of managing scientific data in the field of cosmology. The data we consider are those of the LSST project, whose database is expected to exceed 60 PB in total. This paper presents preliminary results of experiments conducted on the PT1.1 (http://lsst1.ncsa.uiuc.edu/schema/index.php?sVer=PT1_1, 90 GB) and PT1.2 (http://lsst1.ncsa.uiuc.edu/schema/index.php?sVer=PT1_2, 145 GB) data sets in order to compare the performance of centralized and distributed database management systems. For centralized systems, we deployed three DBMSs: MySQL, PostgreSQL and DBMS-X (a commercial relational database). For distributed systems, we deployed HadoopDB and Hive. The goal of these experiments is to report on the ability of these systems to support large-scale declarative queries. We mainly investigate the impact of data partitioning, indexing and compression on query execution performance.

This work is partially supported by the Centre National de la Recherche Scientifique (CNRS) under the Petasky-Mastodons project (http://com.isima.fr/Petasky).

Notes

1. http://www.adeptia.com/products/Gartner-Cool-Vendors-in-Integration-2010.pdf
2. http://db.cs.berkeley.edu/papers/hpts85-nothing.pdf
3. http://www.eecs.berkeley.edu/~culler/courses/cs252-s05/papers/DeepData.pdf
4. http://www.cse.buffalo.edu/faculty/tkosar/papers/jnrl_philtrans_2011.pdf
5. http://research.microsoft.com/en-us/um/cambridge/projects/towards2020science/
6. http://www.nitrd.gov/pubs/200311_grand_challenges.pdf
7. http://www.cs.purdue.edu/homes/ake/pub/CommunityCyberInfrastructureEnabledDiscovery.pdf
8. XLDB (Extremely Large Data Bases, http://www.xldb.org) and SciDB (Scientific Data Bases, http://www.scidb.org/).
9. A tool that reports average disk speeds (http://en.wikipedia.org/wiki/Hdparm).
10. http://www.lsst.org/files/docs/SRD.pdf
11. http://hive.apache.org/
12. The same configuration is used in [5], the popular Hadoop benchmark paper by Pavlo et al.
13. http://hadoop.apache.org/
14. Called Catalog.
15. http://lsst1.ncsa.uiuc.edu/schema/index.php?sVer=PT1_1
16. http://lsst1.ncsa.uiuc.edu/schema/index.php?sVer=PT1_2
17. Note that the expected final Source table will have 125 attributes.
18. Note that the expected final Object table will have 470 attributes.
19. http://www.icis.anl.gov/programs/file.php?id=303&obj=MultiFile&field=filename&attachment=yes
20. https://dev.lsstcorp.org/trac
21. https://dev.lsstcorp.org/trac/wiki/db/queries
22. http://lsst1.ncsa.uiuc.edu/schema/index.php
23. These functions need to be implemented. Such queries will be considered in another test campaign. We have already verified that all the functions can be implemented within Hive and HadoopDB.
24. http://com.isima.fr/Petasky/groups/sous-groupe1/queries-1/at_download/file
25. This attribute is the primary key in the Object table and a foreign key in the Source table.
26. http://hadoopdb.sourceforge.net/guide/
27. Query #7 is the least expensive one for HadoopDB.
28. Query #6 is the most expensive one for Hive.
29. https://dev.lsstcorp.org/trac
30. http://www.lsst.org/files/docs/SRD.pdf
31. https://dev.lsstcorp.org/trac/wiki/db/queries, e.g., Q007, Q008, Q013
32. https://www.grid5000.fr/mediawiki/index.php/Grid5000:Home

References

1. Abadi, D.: Consistency tradeoffs in modern distributed database system design: CAP is only part of the story. Computer 45(2), 37–42 (2012)
2. Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silberschatz, A., Rasin, A.: HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proc. VLDB Endow. 2(1), 922–933 (2009)
3. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: OSDI, pp. 137–150 (2004)
4. DeWitt, D., Gray, J.: Parallel database systems: the future of high performance database systems. Commun. ACM 35(6), 85–98 (1992)
5. Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: SIGMOD, pp. 165–178. ACM (2009)
6. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., Liu, H., Murthy, R.: Hive - a petabyte scale data warehouse using Hadoop. In: ICDE, pp. 996–1005. IEEE (2010)

Author information

Authors and Affiliations

CNRS, Université de Lyon, Université Lyon 1, LIRIS, UMR5205, 69622, Lyon, France
Amin Mesmoudi & Mohand-Saïd Hacid

Corresponding author

Correspondence to Amin Mesmoudi.

Editor information

Editors and Affiliations

Pohang University of Science and Technology (POSTECH), Pohang, Republic of Korea
Wook-Shin Han
National University of Singapore, Singapore, Singapore
Mong Li Lee
Udayana University, Badung, Indonesia
Agus Muliantara
Udayana University, Badung, Indonesia
Ngurah Agus Sanjaya
Christian-Albrechts-Universität zu Kiel Institut für Informatik, Kiel, Germany
Bernhard Thalheim
Fudan University, Shanghai, China
Shuigeng Zhou

About this paper

Cite this paper

Mesmoudi, A., Hacid, MS. (2014). A Comparison of Systems to Large-Scale Data Access. In: Han, WS., Lee, M., Muliantara, A., Sanjaya, N., Thalheim, B., Zhou, S. (eds) Database Systems for Advanced Applications. DASFAA 2014. Lecture Notes in Computer Science, vol 8505. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-43984-5_12

DOI: https://doi.org/10.1007/978-3-662-43984-5_12
Published: 11 July 2014
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-43983-8
Online ISBN: 978-3-662-43984-5
eBook Packages: Computer Science, Computer Science (R0)
