User Defined Partitioning - Group Data Based on Computation Model

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5182)

Abstract

A technical trend in supporting large-scale scientific applications is the convergence of data-intensive computation and data management, which gives faster data access and reduces data flow. On a combined cluster platform, co-locating computation and data is the key to efficiency and scalability, and to achieve it, data must be partitioned in a way that is consistent with the computation model. With current parallel database technology, however, data partitioning is used primarily to support flat parallel computing and is based on existing partition key values. For a given application, when the data scopes of function executions are determined by a high-level concept that is related to the application semantics but is not present in the original data, there are no appropriate partition keys for grouping the data.

To make data partitioning application-aware, we introduce the notion of User Defined Data Partitioning (UDP). UDP differs from the usual data partitioning methods in that it does not rely on existing partition key values; instead, it extracts or generates them from the original data in a labeling process. The novelty of UDP is that it allows data partitioning to be based on application-level concepts, so that partitions match the data access scope of the targeted computation model and support parallel computing based on data dependency graphs.
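
The labeling idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record layout, the grid-cell labeling rule, and the names `label_by_grid_cell` and `udp_partition` are all assumptions made for illustration. The point it shows is that the partition key is computed from each record rather than read from an existing column, and that records sharing a label always land on the same node.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

# Hypothetical raw row: (record_id, x, y). No partition key is stored in it.
Record = Tuple[str, float, float]


def label_by_grid_cell(record: Record, cell_size: float = 10.0) -> Tuple[int, int]:
    """Assumed labeling rule: derive a grid-cell label from the coordinates."""
    _, x, y = record
    return (int(x // cell_size), int(y // cell_size))


def udp_partition(
    records: Iterable[Record],
    label: Callable[[Record], Hashable],
    num_nodes: int,
) -> Tuple[Dict[int, List[Record]], Dict[Hashable, int]]:
    """Group records by a generated label, then place each group on one node."""
    # Labeling pass: compute the key for every record and group by it.
    groups: Dict[Hashable, List[Record]] = defaultdict(list)
    for rec in records:
        groups[label(rec)].append(rec)

    # Placement: assign labels to nodes round-robin in sorted label order,
    # so the mapping is deterministic across runs.
    placement = {key: i % num_nodes for i, key in enumerate(sorted(groups))}

    nodes: Dict[int, List[Record]] = {n: [] for n in range(num_nodes)}
    for key, recs in groups.items():
        nodes[placement[key]].extend(recs)
    return nodes, placement


if __name__ == "__main__":
    rows = [("a", 1.0, 2.0), ("b", 3.0, 4.0), ("c", 15.0, 2.0), ("d", 12.0, 25.0)]
    nodes, placement = udp_partition(rows, label_by_grid_cell, num_nodes=2)
    print(placement)
    print(nodes)
```

Swapping in a different `label` function changes the grouping concept without touching the placement logic, which is the separation the labeling process is meant to provide.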

We applied this approach in architecting a hydro-informatics system that supports periodic, near-real-time, data-intensive hydrologic computation on a database cluster. Our experimental results demonstrate its power in tightly coupling data partitioning with “pipelined” parallel computing in the presence of data processing dependencies.
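
One plausible way to picture parallel computing under data processing dependencies is to order partitions into levels, where every partition in a level depends only on partitions in earlier levels and can therefore run concurrently with its level-mates. The sketch below is an assumption for illustration only: the abstract does not describe the scheduling scheme actually used, and the function name `schedule_levels` and the example partition names are hypothetical.

```python
from typing import Dict, List, Set


def schedule_levels(depends_on: Dict[str, Set[str]]) -> List[List[str]]:
    """Split partitions into levels that respect the dependency graph.

    `depends_on` maps each partition to the partitions whose results it needs.
    Every partition, including those with no dependencies, must appear as a key.
    """
    # Work on a copy so the caller's graph is left unchanged.
    remaining = {part: set(deps) for part, deps in depends_on.items()}
    levels: List[List[str]] = []

    while remaining:
        # Partitions whose dependencies are all satisfied can run now.
        ready = sorted(part for part, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("dependency cycle or unknown partition in graph")
        levels.append(ready)

        # Remove the scheduled partitions and mark them as satisfied.
        for part in ready:
            del remaining[part]
        for deps in remaining.values():
            deps.difference_update(ready)

    return levels


if __name__ == "__main__":
    # Hypothetical example: two upstream partitions feed one downstream
    # partition, which in turn feeds a final one.
    graph = {
        "upstream_1": set(),
        "upstream_2": set(),
        "midstream": {"upstream_1", "upstream_2"},
        "downstream": {"midstream"},
    }
    for step, parts in enumerate(schedule_levels(graph)):
        print(step, parts)
```

Partitions within one level have no dependencies on each other, so each can be processed on the node that holds it while later levels wait for their inputs.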



Editor information

Il-Yeol Song, Johann Eder, Tho Manh Nguyen

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chen, Q., Hsu, M. (2008). User Defined Partitioning - Group Data Based on Computation Model. In: Song, IY., Eder, J., Nguyen, T.M. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2008. Lecture Notes in Computer Science, vol 5182. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85836-2_37

  • DOI: https://doi.org/10.1007/978-3-540-85836-2_37

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-85835-5

  • Online ISBN: 978-3-540-85836-2

  • eBook Packages: Computer Science, Computer Science (R0)
