Advertisement

Generalized communication cost efficient multi-way spatial join: revisiting the curse of the last reducer

  • 15 Accesses

Abstract

With the huge increase in usage of smart mobiles, social media and sensors, large volumes of location-based data is available. Location based data carries important signals pertaining to user intensive information as well as population characteristics. The key analytical tool for location based analysis is multi-way spatial join. Unlike the conventional join strategies, multi-way join using map-reduce offers a scalable, distributed computational paradigm and efficient implementation through communication cost reduction strategies. Controlled Replicate (C-Rep) is a useful strategy used in the literature to perform the multi-way spatial join efficiently. Though C-Rep performance is superior compared to naive sequential join, careful analysis of its performance reveals that such a strategy is plagued by the curse of the last reducer, wherein the skew inherently present in the datasets and the skew introduced by replication operation, causes some of the reducers to take much longer time compared to others. In this work, we design an algorithm GEMS (G eneralized Communication cost E fficient M ulti-Way S patial Join) to address the skewness inherent in the connectivity of spatial objects while performing a multi-way join. We analysed all the algorithms concerned, in terms of I/O and communication costs. We prove that the communication cost of GEMS approach is better than that of C-Rep by a factor O(α) where α is the number of reducers in a single row/column of a grid of reducers. Our experimental results on different datasets indicate that GEMS approach is three times superior(in terms of turn around time) compared to C-Rep.

This is a preview of subscription content, log in to check access.

Access options

Buy single article

Instant unlimited access to the full article PDF.

US$ 39.95

Price includes VAT for USA

Subscribe to journal

Immediate online access to all issues from 2019. Subscription will auto renew annually.

US$ 99

This is the net price. Taxes to be calculated in checkout.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15

Notes

  1. 1.

    Here we use the set of cells \(\mathcal {C}_{cr}\) to notionally distinguish between C-Rep and All-Replicate. Otherwise, \(\mathcal {C}_{cr}\) is equivalent to \(\mathcal {C}_{f}\) in the sense that it also represents the cells of the first quadrant.

  2. 2.

    The code for all the algorithms implemented is available in https://github.com/nageshbhattu/SpatialJoin.git

  3. 3.

    https://osmdata.openstreetmap.de/info/

References

  1. 1.

    Afrati F, Stasinopoulos N, Ullman J D, Vassilakopoulos A (2018) Sharesskew: An algorithm to handle skew for joins in mapreduce. Information Systems

  2. 2.

    Afrati FN, Ullman JD (2010) Optimizing joins in a map-reduce environment. In: Proceedings of the 13th International Conference on Extending Database Technology, EDBT ’10. ACM, New York, pp 99–110

  3. 3.

    Aji A, Wang F, Vo H, Lee R, Liu Q, Zhang X, Saltz J (2013) Hadoop gis: a high performance spatial data warehousing system over mapreduce. Proc VLDB Endowment 6(11):1009–1020

  4. 4.

    Aji A, Hoang V, Wang F (2015) Effective spatial data partitioning for scalable query processing. arXiv:150900910

  5. 5.

    Arge L, Procopiuc O, Ramaswamy S, Suel T, Vitter J S (1998) Scalable sweeping-based spatial join. In: VLDB, vol 98, pp 570–581

  6. 6.

    Blanas S, Patel J M, Ercegovac V, Rao J, Shekita E J, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, pp 975–986

  7. 7.

    Bouros P, Mamoulis N (2017) A forward scan based plane sweep algorithm for parallel interval joins. Proc VLDB Endowment 10(11):1346–1357

  8. 8.

    Bozanis P, Foteinos P (2007) Wer-trees. Data Knowl Eng 63(2):397–413

  9. 9.

    Brinkhoff T, Kriegel H P, Seeger B (1996) Parallel processing of spatial joins using r-trees. In: 1996. Proceedings of the Twelfth International Conference on Data engineering. IEEE, pp 258–265

  10. 10.

    Chaudhuri S (1998) An overview of query optimization in relational systems. In: Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS ’98. ACM, New York, pp 34–43. https://doi.org/10.1145/275487.275492

  11. 11.

    Cho E, Myers SA, Leskovec J (2011) Friendship and mobility: user movement in location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1082–1090

  12. 12.

    Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

  13. 13.

    Dittrich JP, Seeger B (2000) Data redundancy and duplicate detection in spatial join processing. In: 2000. Proceedings. 16th International Conference on Data Engineering. IEEE, pp 535–546

  14. 14.

    Doulkeridis C, NOrvag K (2014) A survey of large-scale analytical query processing in mapreduce. The VLDB J 23(3):355–380. https://doi.org/10.1007/s00778-013-0319-9

  15. 15.

    Du Z, Zhao X, Ye X, Zhou J, Zhang F, Liu R (2017) An effective high-performance multiway spatial join algorithm with spark. ISPRS Int J Geo-Inf 6 (4):96

  16. 16.

    Eldawy A, Mokbel MF (2015a) The era of big spatial data. In: 2015 31st IEEE International Conference on Data Engineering Workshops. IEEE, pp 42–49

  17. 17.

    Eldawy A, Mokbel MF (2015b) Spatialhadoop: A mapreduce framework for spatial data. In: 2015 IEEE 31st International Conference on Data Engineering (ICDE). IEEE, pp 1352–1363

  18. 18.

    Eldawy A, Li Y, Mokbel MF, Janardan R (2013) Cg_hadoop: computational geometry in mapreduce. In: Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM, pp 294–303

  19. 19.

    Günther O (1993) Efficient computation of spatial joins. In: Proceedings of the Ninth International Conference on Data Engineering. IEEE Computer Society, Washington, pp 50–59. http://dl.acm.org/citation.cfm?id=645478.654973

  20. 20.

    Gupta H, Chawda B (2014) ε-controlled-replicate: An improvedcontrolled-replicate algorithm for multi-way spatial join processing on map-reduce. In: International Conference on Web Information Systems Engineering. Springer, pp 278–293

  21. 21.

    Gupta H, Chawda B, Negi S, Faruquie TA, Subramaniam LV, Mohania M (2013) Processing multi-way spatial joins on map-reduce. In: Proceedings of the 16th International Conference on Extending Database Technology, EDBT ’13. ACM, New York, pp 113–124. https://doi.org/10.1145/2452376.2452390

  22. 22.

    Güting R H (1994) An introduction to spatial database systems. VLDB J Int J Very Large Data Bases 3(4):357–399

  23. 23.

    Jacox E H, Samet H (2003) Iterative spatial join. ACM Trans Database Syst 28(3):230–256. https://doi.org/10.1145/937598.937600

  24. 24.

    Jacox E H, Samet H (2007) Spatial join techniques. ACM Trans Database Syst (TODS) 32(1):7

  25. 25.

    Kipf A, Lang H, Pandey V, Persa RA, Boncz P, Neumann T, Kemper A (2018) Adaptive geospatial joins for modern hardware. arXiv:180209488

  26. 26.

    Kriegel N B H P, Schneider R, Seeger B (1990) The r*-tree: an e cient and robust access method for points and rectangles. In: Proceedings of the ACM SIGMOD Conference on Management of Data

  27. 27.

    Leskovec J, Rajaraman A, Ullman JD (2014) Mining of Massive Datasets, 2nd Ed. Cambridge University Press, Cambridge

  28. 28.

    Lin J, et al. (2009) The curse of zipf and limits to parallelization: a look at the stragglers problem in mapreduce. In: 7Th workshop on large-scale distributed systems for information retrieval. ACM Boston, vol 1, pp 57–62

  29. 29.

    Liroz-Gistau M, Akbarinia R, Agrawal D, Valduriez P (2016) Fp-hadoop: Efficient processing of skewed mapreduce jobs. Inf Syst 60:69–84

  30. 30.

    Liu Z, Zhang Q, Ahmed R, Boutaba R, Liu Y, Gong Z (2016) Dynamic resource allocation for mapreduce with partitioning skew. IEEE Trans Comput 65(11):3304–3317. https://doi.org/10.1109/TC.2016.2532860

  31. 31.

    Lo M L, Ravishankar C V (1996) Spatial hash-joins. In: ACM SIGMOD Record. ACM, vol 25, pp 247–258

  32. 32.

    Loboz C (2012) Cloud resource usage—heavy tailed distributions invalidating traditional capacity planning models. J Grid Comput 10(1):85–108

  33. 33.

    Mamoulis N, Papadias D (2001) Multiway spatial joins. ACM Trans Database Syst (TODS) 26(4):424–475

  34. 34.

    Nishimura S, Das S, Agrawal D, El Abbadi A (2013) Hbase: design and implementation of an elastic data infrastructure for cloud-scale location services. Distrib Parallel Database 31(2):289– 319

  35. 35.

    Nobari S, Tauheed F, Heinis T, Karras P, Bressan S, Ailamaki A (2013) Touch: in-memory spatial join by hierarchical data-oriented partitioning. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, pp 701–712

  36. 36.

    Okcan A, Riedewald M (2011) Processing theta-joins using mapreduce. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 949–960

  37. 37.

    Papadias D, Arkoumanis D (2002) Search algorithms for multiway spatial joins. Int J Geograph Inf Sci 16(7):613–639

  38. 38.

    Papadias D, Mamoulis N, Delis B (1998) Algorithms for querying by spatial structure In: Proceedings of Very Large Data Bases Conference (VLDB), New York

  39. 39.

    Papadias D, Mamoulis N, Theodoridis Y (1999) Processing and optimization of multiway spatial joins using r-trees. In: Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, pp 44–55

  40. 40.

    Papadias D, Mamoulis N, Theodoridis Y (2001) Constraint-based processing of multiway spatial joins. Algorithmica 30(2):188–215

  41. 41.

    Park HH, Cha GH, Chung CW (1999) Multi-way spatial joins using r-trees: Methodology and performance evaluation. In: Advances in Spatial Databases. Springer, pp 229–250

  42. 42.

    Patel J M, Patel and DeWitt D J (1996) Partition based spatial-merge join. In: ACM SIGMOD Record. ACM, vol 25, pp 259–270

  43. 43.

    Patel JM, DeWitt DJ (2000) Clone join and shadow join: two parallel spatial join algorithms. In: Proceedings of the 8th ACM international symposium on Advances in geographic information systems. ACM, pp 54–61

  44. 44.

    Pavlo A, Paulson E, Rasin A, Abadi DJ, DeWitt DJ, Madden S, Stonebraker M (2009) A comparison of approaches to large-scale data analysis. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD ’09. ACM, New York, pp 165–178. https://doi.org/10.1145/1559845.1559865

  45. 45.

    Pearce O, Gamblin T, de Supinski BR, Schulz M, Amato NM (2012) Quantifying the effectiveness of load balance algorithms. In: Proceedings of the 26th ACM International Conference on Supercomputing, ICS ’12. ACM, New York, pp 185–194

  46. 46.

    Sabek I, Mokbel MF (2017) On spatial joins in mapreduce. In: Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM, pp 21

  47. 47.

    Singh H, Bawa S (2017) A survey of traditional and mapreducebased spatial query processing approaches. SIGMOD Rec 46(2):18–29. https://doi.org/10.1145/3137586.3137590

  48. 48.

    Vassilakopoulos M, Corral A, Karanikolas N (2011) Join-queries between two spatial datasets indexed by a single r*-tree. SOFSEM 2011: Theory and Practice of Computer Science, pp 533–544

  49. 49.

    Vernica R, Carey M J, Li C (2010) Efficient parallel set-similarity joins using mapreduce. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, pp 495–506

  50. 50.

    Wang K, Han J, Tu B, Dai J, Zhou W, Song X (2010) Accelerating spatial data processing with mapreduce. In: 2010 IEEE 16th International Conference on Parallel and Distributed Systems (ICPADS). IEEE, pp 229–236

  51. 51.

    Zhang S, Han J, Liu Z, Wang K, Feng S (2009a) Spatial queries evaluation with mapreduce. In: 2009. GCC’09. Eighth International Conference on Grid and cooperative computing. IEEE, pp 287– 292

  52. 52.

    Zhang S, Han J, Liu Z, Wang K, Xu Z (2009B) Sjmr: Parallelizing spatial join with mapreduce on clusters. In: 2009. CLUSTER’09. IEEE international conference on Cluster computing and workshops. IEEE, pp 1–8

  53. 53.

    Zhang X, Chen L, Wang M (2012) Efficient multi-way theta-join processing using mapreduce. Proc VLDB Endow 5(11):1184–1195

  54. 54.

    Zhong Y, Han J, Zhang T, Li Z, Fang J, Chen G (2012) Towards parallel spatial query processing for big spatial data. In: 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum. IEEE, pp 2085–2094

  55. 55.

    Zhou X, Abel D J, Truffet D (1998) Data partitioning for parallel spatial join processing. Geoinformatica 2(2):175–204

Download references

Author information

Correspondence to S. Nagesh Bhattu.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix

A Asymptotic complexity of GEMS vs Diagonal based Replication Schemes

The proof for Theorem 1 is provided below:

Proof

Let the grid structure be as given in Fig. 16. Let the number of rows /columns be α. Let the spatial dataset of size D be uniformly distirbuted. Each cell gets \(P = \frac {D}{\alpha ^{2}}\) of the spatial data. Using lemma 1, we can see that the diagonal based replication schemes have a replication cost of O(α4P). Using lemma 2, we can also see that the GEMS approach has communication cost of O(α3P). As a consequence we can observe that the GEMS approach improves over diagonal based replication schemes by O(α) □

Fig. 16
figure16

Diagonal based replication schemes

Lemma 1

The communication cost of diagonal based replication schemes is O(α4P)

Proof

The diagonal based replication schemes perform the replication along the diagonal of the grid structure. Let (i,i) be a cell on the diagonal where i is counted from top right corner of the grid structure as depicted in Fig. 16. Consider the replication cost of all the cells of the grid structure which are on i’th row to the right of cell (i,i). Let j be the index varying from 1 to i-1 where j indicates the position of the cell counted from the right end of the grid structure. The replication cost of j’th cell on the i’th row is j*i*P as marked by the hashed vertical rectangle. Here, P is the amount of data to be replicated. The replication cost of all the cells on the i’th row to the right of cell (i,i) is \({\sum }_{j=1}^{i-1} i*j*P\). The replication cost of cells on the i’th column which are on the top of cell (i,i) can be similarly computed as \({\sum }_{j=1}^{i-1} i*j*P\). The replication cost (Cd) of all the cells in the grid structure can be computed by varying the position of i from 1 to α. Such computation is summarized as:

$$ \begin{array}{@{}rcl@{}} C_{d} &=& \left( {\sum}_{i=1}^{\alpha} (i^{2} + {\sum}_{j=1}^{i-1} 2*i*j) \right)* P \\ &=& \left( {\sum}_{i=1}^{\alpha} (i^{2} + 2 * i* {\sum}_{j=1}^{i-1} j) \right)* P \\ &=& \left( {\sum}_{i=1}^{\alpha} i^{2} + 2 * i* i*(i-1)/2 \right) * P \\ &=& \left( {\sum}_{i=1}^{\alpha} i^{2} + i^{3} - i^{2} \right) * P \\ &=& \left( {\sum}_{i=1}^{\alpha} i^{3} \right) * P \\ &=& \left( \frac{\alpha*(\alpha+1)}{2}\right)^{2} * P = O(\alpha^{4}*P) \end{array} $$
(2)

The first term on right hand side of the Eq. 2 refers to the communication cost of cells on the diagonal of the grid structure which is (i2P). □

Lemma 2

The communication cost of GEMS based replication scheme is O(α3P)

Proof

There are altogether α2 cells in a grid of α rows/columns. If a cell is present in i’th row and j’th column of the grid (as marked in Fig. 17), the replication will be done to all the cells in i’th row and j’th column. Hence, each cell is replicated to 2 ∗ α − 1 cells using GEMS approach. So the total replication cost of GEMS approach is

$$ \begin{array}{@{}rcl@{}} C_{g} = \alpha^{2}*(2*\alpha-1)*P \end{array} $$
(3)

Fig. 17
figure17

GEMS based replication

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Bhattu, S.N., Potluri, A., Kadari, P. et al. Generalized communication cost efficient multi-way spatial join: revisiting the curse of the last reducer. Geoinformatica (2020) doi:10.1007/s10707-019-00387-6

Download citation

Keywords

  • Big data
  • Communication cost
  • Multi-way spatial join
  • Skewness