A Comparison of Distributed Spatial Data Management Systems for Processing Distance Join Queries

García-García, Francisco; Corral, Antonio; Iribarne, Luis; Mavrommatis, George; Vassilakopoulos, Michael

doi:10.1007/978-3-319-66917-5_15


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10509)

Included in the following conference series:

European Conference on Advances in Databases and Information Systems


Abstract

Due to the ubiquitous use of spatial data applications and the large amounts of spatial data that these applications generate, the processing of large-scale distance joins in distributed systems is becoming increasingly popular. Two of the most studied distance join queries are the K Closest Pair Query (KCPQ) and the \(\varepsilon\) Distance Join Query (\(\varepsilon\)DJQ). The KCPQ finds the K closest pairs of points from two datasets, and the \(\varepsilon\)DJQ finds all pairs of points from two datasets that are within a distance threshold \(\varepsilon\) of each other. Distributed cluster-based computing systems can be classified into Hadoop-based and Spark-based systems. Based on this classification, in this paper we compare two of the most recent and leading distributed spatial data management systems, namely SpatialHadoop and LocationSpark, by evaluating the performance of existing and newly proposed parallel and distributed distance join query algorithms in different situations with big real-world datasets. As a general conclusion, while SpatialHadoop is a more mature and robust system, LocationSpark is the winner with respect to total execution time.
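To make the two query definitions concrete, the following is a minimal single-machine sketch in Python. It is not the distributed algorithm evaluated in the paper, nor the API of SpatialHadoop or LocationSpark: it is a brute-force illustration that checks every pair. The function names `kcpq` and `edjq` are hypothetical, points are assumed to be coordinate tuples, the distance is assumed to be Euclidean, and "within a distance threshold" is assumed to mean less than or equal to \(\varepsilon\).

```python
import heapq
import math
from itertools import product


def kcpq(P, Q, k):
    """K Closest Pair Query (brute force): the k pairs (p, q), with p in P
    and q in Q, that have the smallest Euclidean distance.

    Returns a list of (distance, p, q) tuples, closest first.
    """
    # Lazily enumerate every cross pair together with its distance.
    pairs = ((math.dist(p, q), p, q) for p, q in product(P, Q))
    # nsmallest keeps only k candidates in memory at a time.
    return heapq.nsmallest(k, pairs)


def edjq(P, Q, eps):
    """Epsilon Distance Join Query (brute force): all pairs (p, q), with p in P
    and q in Q, whose Euclidean distance is at most eps (assumed inclusive).
    """
    return [(p, q) for p, q in product(P, Q) if math.dist(p, q) <= eps]


if __name__ == "__main__":
    P = [(0, 0), (5, 5)]
    Q = [(1, 0), (5, 6), (10, 10)]
    print(kcpq(P, Q, 2))
    print(edjq(P, Q, 1.0))
```

Both functions examine all |P| x |Q| pairs, which is only practical for small inputs; that quadratic cost is the reason such queries are partitioned and run in parallel on large datasets.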

F. García-García, A. Corral, L. Iribarne and M. Vassilakopoulos — Work funded by the MINECO research project [TIN2013-41576-R].




Author information

Authors and Affiliations

Department of Informatics, University of Almeria, Almeria, Spain
Francisco García-García, Antonio Corral & Luis Iribarne
DaSE Lab, Department of Electrical and Computer Engineering, University of Thessaly, Volos, Greece
George Mavrommatis & Michael Vassilakopoulos


Corresponding author

Correspondence to Antonio Corral.

Editor information

Editors and Affiliations

Riga Technical University, Riga, Latvia
Mārīte Kirikova
Norwegian University of Science and Technology, Trondheim, Norway
Kjetil Nørvåg
University of Cyprus, Nicosia, Cyprus
George A. Papadopoulos


About this paper

Cite this paper

García-García, F., Corral, A., Iribarne, L., Mavrommatis, G., Vassilakopoulos, M. (2017). A Comparison of Distributed Spatial Data Management Systems for Processing Distance Join Queries. In: Kirikova, M., Nørvåg, K., Papadopoulos, G. (eds) Advances in Databases and Information Systems. ADBIS 2017. Lecture Notes in Computer Science, vol 10509. Springer, Cham. https://doi.org/10.1007/978-3-319-66917-5_15


DOI: https://doi.org/10.1007/978-3-319-66917-5_15
Published: 25 August 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66916-8
Online ISBN: 978-3-319-66917-5
eBook Packages: Computer Science, Computer Science (R0)
