Abstract
SQL queries involving join and group-by operations are frequently used in decision support applications. In these applications, the input relations are usually very large, so parallelizing these queries is highly recommended in order to obtain an acceptable response time. The main drawbacks of the parallel algorithms presented in the literature for this kind of query are that they are very sensitive to data skew and incur expensive communication and Input/Output costs when evaluating the join operation. In this paper, we present an algorithm that minimizes the communication cost by performing the group-by operation before redistribution, so that only tuples that will be present in the join result are redistributed. In addition, it evaluates the query without materializing the result of the join operation, thus reducing the Input/Output cost of join intermediate results. The performance of this algorithm is analyzed using the scalable and portable BSP (Bulk Synchronous Parallel) cost model, which predicts a near-linear speed-up even for highly skewed data.
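The idea of grouping before redistribution can be illustrated with a minimal single-process sketch. It is not the paper's algorithm: it assumes a query of the form SELECT R.k, SUM(S.v) FROM R, S WHERE R.k = S.k GROUP BY R.k, simulates the processors as Python lists, and finds the joining keys with a plain global set intersection standing in for whatever distributed mechanism the paper uses. The names `local_group_by` and `group_by_join` are hypothetical.

```python
from collections import defaultdict


def local_group_by(fragment, agg):
    """Pre-aggregate one processor's fragment: key -> partial aggregate."""
    partial = defaultdict(int)
    for key, value in fragment:
        partial[key] += agg(value)
    return dict(partial)


def group_by_join(r_fragments, s_fragments):
    """Evaluate SUM(S.v) grouped by the join key without building the join.

    Each fragment is a list of (key, value) pairs held by one simulated
    processor. Returns (result, sent), where `sent` counts the partial
    aggregates that had to be redistributed.
    """
    p = len(r_fragments)

    # Step 1: local group-by on every processor, before any communication.
    r_partials = [local_group_by(f, lambda _v: 1) for f in r_fragments]  # COUNT
    s_partials = [local_group_by(f, lambda v: v) for f in s_fragments]   # SUM

    # Step 2 (assumption): keys present in both relations, computed here
    # with a global intersection purely for illustration.
    r_keys = set().union(*r_partials)
    s_keys = set().union(*s_partials)
    joining = r_keys & s_keys

    # Step 3: redistribute only partial aggregates whose key will join,
    # hash-partitioned so both sides of a key meet on the same processor.
    r_recv = [defaultdict(int) for _ in range(p)]
    s_recv = [defaultdict(int) for _ in range(p)]
    sent = 0
    for partials, recv in ((r_partials, r_recv), (s_partials, s_recv)):
        for local in partials:
            for key, value in local.items():
                if key in joining:
                    recv[hash(key) % p][key] += value
                    sent += 1

    # Step 4: combine the aggregates; each R tuple with key k matches every
    # S tuple with key k, so SUM over the join is COUNT_R(k) * SUM_S(k).
    result = {}
    for dest in range(p):
        for key, count in r_recv[dest].items():
            result[key] = count * s_recv[dest][key]
    return result, sent
```

In this sketch the join result is never materialized: only one partial aggregate per key and per processor crosses the simulated network, and keys that appear in a single relation are never sent at all.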
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Al Hajj Hassan, M., Bamha, M. (2008). Parallel Processing of “Group-By Join” Queries on Shared Nothing Machines. In: Filipe, J., Shishkov, B., Helfert, M. (eds) Software and Data Technologies. ICSOFT 2006. Communications in Computer and Information Science, vol 10. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70621-2_19
DOI: https://doi.org/10.1007/978-3-540-70621-2_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-70619-9
Online ISBN: 978-3-540-70621-2
eBook Packages: Computer Science, Computer Science (R0)