## Abstract

With the huge increase in the usage of smartphones, social media and sensors, large volumes of location-based data are available. Location-based data carry important signals pertaining to user-specific information as well as population characteristics. The key analytical tool for location-based analysis is the multi-way spatial join. Unlike conventional join strategies, multi-way join using map-reduce offers a scalable, distributed computational paradigm and an efficient implementation through communication-cost reduction strategies. Controlled-Replicate (C-Rep) is a strategy used in the literature to perform the multi-way spatial join efficiently. Though C-Rep's performance is superior to that of a naive sequential join, careful analysis reveals that the strategy is plagued by the curse of the last reducer: the skew inherently present in the datasets, together with the skew introduced by the replication operation, causes some reducers to take much longer than others. In this work, we design an algorithm GEMS (**G**eneralized Communication-cost **E**fficient **M**ulti-Way **S**patial Join) to address the skewness inherent in the connectivity of spatial objects while performing a multi-way join. We analyse all the algorithms concerned in terms of I/O and communication costs, and prove that the communication cost of the GEMS approach is better than that of C-Rep by a factor of O(*α*), where *α* is the number of reducers in a single row/column of a grid of reducers. Our experimental results on different datasets indicate that the GEMS approach is three times faster (in terms of turnaround time) than C-Rep.


## Notes

- 1.
Here we use the set of cells \(\mathcal {C}_{cr}\) to notionally distinguish between C-Rep and All-Replicate. Otherwise, \(\mathcal {C}_{cr}\) is equivalent to \(\mathcal {C}_{f}\) in the sense that it also represents the cells of the first quadrant.

- 2.
The code for all the algorithms implemented is available at https://github.com/nageshbhattu/SpatialJoin.git



## Appendix

### A Asymptotic complexity of GEMS vs. diagonal-based replication schemes

The proof for Theorem 1 is provided below:

*Proof*

Let the grid structure be as given in Fig. 16, and let the number of rows/columns be *α*. Let the spatial dataset of size D be uniformly distributed, so that each cell gets \(P = \frac {D}{\alpha ^{2}}\) of the spatial data. By Lemma 1, the diagonal-based replication schemes have a replication cost of *O*(*α*^{4} ∗ *P*). By Lemma 2, the GEMS approach has a communication cost of *O*(*α*^{3} ∗ *P*). Consequently, the GEMS approach improves over the diagonal-based replication schemes by a factor of *O*(*α*). □

**Lemma 1**

*The communication cost of diagonal-based replication schemes is* O(*α*^{4} ∗ *P*).

*Proof*

The diagonal-based replication schemes perform replication along the diagonal of the grid structure. Let (i,i) be a cell on the diagonal, where i is counted from the top-right corner of the grid structure as depicted in Fig. 16. Consider the replication cost of all the cells of the grid structure on the i'th row to the right of cell (i,i). Let j be an index varying from 1 to i − 1, where j indicates the position of the cell counted from the right end of the grid structure. The replication cost of the j'th cell on the i'th row is i ∗ j ∗ P, as marked by the hashed vertical rectangle, where P is the amount of data to be replicated. The replication cost of all the cells on the i'th row to the right of cell (i,i) is therefore \({\sum }_{j=1}^{i-1} i*j*P\). The replication cost of the cells on the i'th column above cell (i,i) is similarly \({\sum }_{j=1}^{i-1} i*j*P\). The replication cost (*C*_{d}) of all the cells in the grid structure is obtained by varying i from 1 to *α*. Such a computation is summarized as:
\(C_{d} = {\sum }_{i=1}^{\alpha } \left (i^{2}*P + 2{\sum }_{j=1}^{i-1} i*j*P \right ) = {\sum }_{i=1}^{\alpha } i^{3}*P = O(\alpha ^{4}*P)\)   (2)

The first term on the right-hand side of Eq. 2 is the communication cost of the cells on the diagonal of the grid structure, namely *i*^{2} ∗ *P*. □
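As a quick numerical sanity check (an illustrative sketch, not code from the paper; `diagonal_cost` is a hypothetical helper name), the Lemma 1 sum can be evaluated directly and compared against its closed form \((\alpha (\alpha +1)/2)^{2} * P\), taking P = 1:

```python
def diagonal_cost(alpha, P=1):
    """Evaluate the Lemma 1 sum for an alpha x alpha grid of reducers."""
    total = 0
    for i in range(1, alpha + 1):
        # Row cells to the right of (i,i) and column cells above it each
        # contribute sum_{j=1}^{i-1} i*j*P; the diagonal cell adds i^2*P.
        off_diagonal = sum(i * j * P for j in range(1, i))
        total += i * i * P + 2 * off_diagonal
    return total

# The per-cell contribution simplifies to i^3*P, so the total equals
# (alpha*(alpha+1)/2)^2 * P, i.e. O(alpha^4 * P).
for alpha in (4, 8, 16):
    assert diagonal_cost(alpha) == (alpha * (alpha + 1) // 2) ** 2
```

The closed form follows from \({\sum }_{i=1}^{\alpha } i^{3} = (\alpha (\alpha +1)/2)^{2}\).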

**Lemma 2**

*The communication cost of the GEMS-based replication scheme is* O(*α*^{3} ∗ *P*).

*Proof*

There are altogether *α*^{2} cells in a grid with *α* rows/columns. If a cell is in the i'th row and j'th column of the grid (as marked in Fig. 17), it is replicated to all the cells in the i'th row and the j'th column. Hence, each cell is replicated to 2 ∗ *α* − 1 cells under the GEMS approach, so the total replication cost of the GEMS approach is:
\(C_{g} = \alpha ^{2} * (2\alpha - 1) * P = O(\alpha ^{3}*P)\)

□
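To make the O(*α*) gap of Theorem 1 concrete, the two closed-form costs from Lemmas 1 and 2 can be compared directly (an illustrative sketch, not the paper's implementation; `gems_cost` and `diagonal_cost` are hypothetical helper names):

```python
def gems_cost(alpha, P=1):
    # Lemma 2: each of the alpha^2 cells is replicated to the 2*alpha - 1
    # cells sharing its row or column, giving O(alpha^3 * P) in total.
    return alpha * alpha * (2 * alpha - 1) * P

def diagonal_cost(alpha, P=1):
    # Lemma 1 closed form: sum_{i=1}^{alpha} i^3 * P = (alpha*(alpha+1)/2)^2 * P.
    return (alpha * (alpha + 1) // 2) ** 2 * P

# The ratio grows roughly linearly with alpha, matching the O(alpha)
# improvement claimed in Theorem 1.
for alpha in (8, 16, 32, 64):
    print(alpha, round(diagonal_cost(alpha) / gems_cost(alpha), 2))
```

Doubling *α* roughly doubles the ratio, which is the asymptotic O(*α*) advantage of GEMS over the diagonal-based schemes.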

## About this article

### Cite this article

Bhattu, S.N., Potluri, A., Kadari, P. *et al.* Generalized communication cost efficient multi-way spatial join: revisiting the curse of the last reducer.
*Geoinformatica* **24**, 557–589 (2020). https://doi.org/10.1007/s10707-019-00387-6


### Keywords

- Big data
- Communication cost
- Multi-way spatial join
- Skewness