A Comparison of Systems to Large-Scale Data Access

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2014)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8505)

Abstract

As the amount of data produced in many application domains grows, it becomes increasingly difficult to manage and query the resulting large data repositories (https://www.lsstcorp.org/sciencewiki/images/DC_Handbook_v1.1.pdf). Within the PetaSky project, we focus on the problem of managing scientific data in the field of cosmology. The data we consider are those of the LSST project, whose database is expected to exceed 60 PB overall. This paper presents preliminary results of experiments conducted on the PT1.1 data set (90 GB; http://lsst1.ncsa.uiuc.edu/schema/index.php?sVer=PT1_1) and the PT1.2 data set (145 GB; http://lsst1.ncsa.uiuc.edu/schema/index.php?sVer=PT1_2) in order to compare the performance of centralized and distributed database management systems. For centralized systems, we deployed three DBMSs: MySQL, PostgreSQL and DBMS-X (a commercial relational database). For distributed systems, we deployed HadoopDB and Hive. The goal of these experiments is to report on the ability of these systems to support large-scale declarative queries. We mainly investigate the impact of data partitioning, indexing and compression on query execution performance.

This work is partially supported by the Centre National de la Recherche Scientifique (CNRS) under the Petasky-Mastodons project (http://com.isima.fr/Petasky).

Notes

  1. http://www.adeptia.com/products/Gartner-Cool-Vendors-in-Integration-2010.pdf

  2. http://db.cs.berkeley.edu/papers/hpts85-nothing.pdf

  3. http://www.eecs.berkeley.edu/~culler/courses/cs252-s05/papers/DeepData.pdf

  4. http://www.cse.buffalo.edu/faculty/tkosar/papers/jnrl_philtrans_2011.pdf

  5. http://research.microsoft.com/en-us/um/cambridge/projects/towards2020science/

  6. http://www.nitrd.gov/pubs/200311_grand_challenges.pdf

  7. http://www.cs.purdue.edu/homes/ake/pub/CommunityCyberInfrastructureEnabledDiscovery.pdf

  8. XLDB (Extremely Large Data Bases, http://www.xldb.org) and SciDB (Scientific Data Bases, http://www.scidb.org/).

  9. A tool that gives the average disk speeds (http://en.wikipedia.org/wiki/Hdparm).

  10. http://www.lsst.org/files/docs/SRD.pdf

  11. http://hive.apache.org/

  12. The same configuration is used in [5], the popular Hadoop benchmark paper by Pavlo et al.

  13. http://hadoop.apache.org/

  14. Called Catalog.

  15. http://lsst1.ncsa.uiuc.edu/schema/index.php?sVer=PT1_1

  16. http://lsst1.ncsa.uiuc.edu/schema/index.php?sVer=PT1_2

  17. Note that the expected final Source table will have 125 attributes.

  18. Note that the expected final Object table will have 470 attributes.

  19. http://www.icis.anl.gov/programs/file.php?id=303&obj=MultiFile&field=filename&attachment=yes

  20. https://dev.lsstcorp.org/trac

  21. https://dev.lsstcorp.org/trac/wiki/db/queries

  22. http://lsst1.ncsa.uiuc.edu/schema/index.php

  23. These functions need to be implemented. Such queries will be considered in another test campaign. We already verified that all the functions can be implemented within Hive and HadoopDB.

  24. http://com.isima.fr/Petasky/groups/sous-groupe1/queries-1/at_download/file

  25. This attribute is the primary key in the Object table and a foreign key in the Source table.

  26. http://hadoopdb.sourceforge.net/guide/

  27. Query #7 is the least expensive one for HadoopDB.

  28. Query #6 is the most expensive one for Hive.

  29. https://dev.lsstcorp.org/trac

  30. http://www.lsst.org/files/docs/SRD.pdf

  31. https://dev.lsstcorp.org/trac/wiki/db/queries, e.g., Q007, Q008, Q013

  32. https://www.grid5000.fr/mediawiki/index.php/Grid5000:Home

References

  1. Abadi, D.: Consistency tradeoffs in modern distributed database system design: CAP is only part of the story. Computer 45(2), 37–42 (2012)

  2. Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silberschatz, A., Rasin, A.: HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proc. VLDB Endow. 2(1), 922–933 (2009)

  3. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: OSDI, pp. 137–150 (2004)

  4. DeWitt, D., Gray, J.: Parallel database systems: the future of high performance database systems. Commun. ACM 35(6), 85–98 (1992)

  5. Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: SIGMOD, pp. 165–178. ACM (2009)

  6. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., Liu, H., Murthy, R.: Hive - a petabyte scale data warehouse using Hadoop. In: ICDE, pp. 996–1005. IEEE (2010)

Author information

Corresponding author

Correspondence to Amin Mesmoudi.

Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Mesmoudi, A., Hacid, MS. (2014). A Comparison of Systems to Large-Scale Data Access. In: Han, WS., Lee, M., Muliantara, A., Sanjaya, N., Thalheim, B., Zhou, S. (eds) Database Systems for Advanced Applications. DASFAA 2014. Lecture Notes in Computer Science, vol 8505. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-43984-5_12

  • DOI: https://doi.org/10.1007/978-3-662-43984-5_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-43983-8

  • Online ISBN: 978-3-662-43984-5

  • eBook Packages: Computer Science, Computer Science (R0)
