Skip to main content

Parallel Processing of “Group-By Join” Queries on Shared Nothing Machines

  • Conference paper
Book cover Software and Data Technologies (ICSOFT 2006)

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 10))

Included in the following conference series:

  • 647 Accesses

Abstract

SQL queries involving join and group-by operations are frequently used in many decision support applications. In these applications, the size of the input relations is usually very large, so the parallelization of these queries is highly recommended in order to obtain a desirable response time. The main drawbacks of the presented parallel algorithms that treat this kind of queries are that they are very sensitive to data skew and involve expansive communication and Input/Output costs in the evaluation of the join operation. In this paper, we present an algorithm that minimizes the communication cost by performing the group-by operation before redistribution where only tuples that will be present in the join result are redistributed. In addition, it evaluates the query without the need of materializing the result of the join operation and thus reducing the Input/Output cost of join intermediate results. The performance of this algorithm is analyzed using the scalable and portable BSP (Bulk Synchronous Parallel) cost model which predicts a near-linear speed-up even for highly skewed data.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 109.00
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Datta, A., Moon, B., Thomas, H.: A case for parallelism in datawarehousing and OLAP. In: Ninth International Workshop on Database and Expert Systems Applications, DEXA 1998, pp. 226–231. IEEE Computer Society, Vienna (1998)

    Chapter  Google Scholar 

  2. Chaudhuri, S., Shim, K.: Including Group-By in Query Optimization. In: Proceedings of the Twentieth International Conference on Very Large Databases, Santiago, Chile, pp. 354–366 (1994)

    Google Scholar 

  3. Tsois, A., Sellis, T.K.: The generalized pre-grouping transformation: Aggregate-query optimization in the presence of dependencies. In: VLDB, pp. 644–655 (2003)

    Google Scholar 

  4. Bamha, M.: An Optimal Skew-insensitive Join and Multi-join Algorithm for Distributed Architectures. In: Andersen, K.V., Debenham, J., Wagner, R. (eds.) DEXA 2005. LNCS, vol. 3588, pp. 616–625. Springer, Heidelberg (2005)

    Chapter  Google Scholar 

  5. Bamha, M., Hains, G.: A Skew-Insensitive Algorithm for Join and Multi-join Operations on Shared Nothing Machines. In: Ibrahim, M., Küng, J., Revell, N. (eds.) DEXA 2000. LNCS, vol. 1873. Springer, Heidelberg (2000)

    Chapter  Google Scholar 

  6. Bamha, M., Hains, G.: A frequency adaptive join algorithm for Shared Nothing machines. Journal of Parallel and Distributed Computing Practices (PDCP), 3(3), 333–345 (1999); appears also In: Columbus, F. (ed.) Progress in Computer Research, II, Nova Science Publishers (2001)

    Google Scholar 

  7. Seetha, M., Yu, P.S.: Effectiveness of parallel joins. IEEE, Transactions on Knowledge and Data Enginneerings 2, 410–424 (1990)

    Article  Google Scholar 

  8. Hua, K.A., Lee, C.: Handling data skew in multiprocessor database computers using partition tuning. In: Lohman, G.M., Sernadas, A., Camps, R. (eds.) Proc. of the 17th International Conference on Very Large Data Bases, Barcelona, Catalonia, Spain, pp. 525–535. Morgan Kaufmann, San Francisco (1991)

    Google Scholar 

  9. Wolf, J.L., Dias, D.M., Yu, P.S., Turek, J.: New algorithms for parallelizing relational database joins in the presence of data skew. IEEE Transactions on Knowledge and Data Engineering 6, 990–997 (1994)

    Article  Google Scholar 

  10. DeWitt, D.J., Naughton, J.F., Schneider, D.A., Seshadri, S.: Practical Skew Handling in Parallel Joins. In: Proceedings of the 18th VLDB Conference, Vancouver, British Columbia, Canada, pp. 27–40 (1992)

    Google Scholar 

  11. Yan, W.P., Larson, P.K.: Performing group-by before join. In: Proceedings of the 10th IEEE International Conference on Data Engineering, pp. 89–100. IEEE Computer Society Press, Los Alamitos (1994)

    Chapter  Google Scholar 

  12. Shatdal, A., Naughton, J.F.: Adaptive parallel aggregation algorithms. SIGMOD Record (ACM Special Interest Group on Management of Data) 24, 104–114 (1995)

    Google Scholar 

  13. Taniar, D., Jiang, Y., Liu, K., Leung, C.: Aggregate-join query processing in parallel database systems, In: Proceedings of The Fourth International Conference/Exhibition on High Performance Computing in Asia-Pacific Region HPC-Asia 2000, vol. 2, pp. 824–829. IEEE Computer Society Press, Los Alamitos (2000)

    Chapter  Google Scholar 

  14. Taniar, D., Rahayu, J.W.: Parallel processing of ’groupby-before-join’ queries in cluster architecture. In: Proceedings of the 1st International Symposium on Cluster Computing and the Grid, Brisbane, Qld, Australia, pp. 178–185. IEEE Computer Society Press, Los Alamitos (2001)

    Google Scholar 

  15. DeWitt, D.J., Gray, J.: Parallel database systems: The future of high performance database systems. Communications of the ACM 35, 85–98 (1992)

    Article  Google Scholar 

  16. Skillicorn, D.B., Hill, J.M.D., McColl, W.F.: Questions and Answers about BSP. Scientific Programming 6, 249–274 (1997)

    Google Scholar 

  17. Valiant, L.G.: A bridging model for parallel computation. Communications of the ACM 33, 103–111 (1990)

    Article  Google Scholar 

  18. Bisseling, R.H.: Parallel Scientific Computation: A Structured Approach using BSP and MPI. Oxford University Press, Oxford (2004)

    MATH  Google Scholar 

  19. Bamha, M., Hains, G.: An Efficient Equi-semi-join Algorithm for Distributed Architectures. In: Sunderam, V.S., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2005. LNCS, vol. 3515, pp. 755–763. Springer, Heidelberg (2005)

    Google Scholar 

  20. Carter, J.L., Wegman, M.N.: Universal classes of hash functions. Journal of Computer and System Sciences 18, 143–154 (1979)

    Article  MATH  MathSciNet  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Joaquim Filipe Boris Shishkov Markus Helfert

Rights and permissions

Reprints and permissions

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Al Hajj Hassan, M., Bamha, M. (2008). Parallel Processing of “Group-By Join” Queries on Shared Nothing Machines. In: Filipe, J., Shishkov, B., Helfert, M. (eds) Software and Data Technologies. ICSOFT 2006. Communications in Computer and Information Science, vol 10. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70621-2_19

Download citation

  • DOI: https://doi.org/10.1007/978-3-540-70621-2_19

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-70619-9

  • Online ISBN: 978-3-540-70621-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics