Volume 23, Issue 1, pp 37–78

Spatial data management in Apache Spark: the GeoSpark perspective and beyond

  • Jia Yu
  • Zongsi Zhang
  • Mohamed Sarwat


The paper presents the details of designing and developing GeoSpark, which extends the core engine of Apache Spark and SparkSQL to support spatial data types, indexes, and geometrical operations at scale. It also gives a detailed analysis of the technical challenges and opportunities of extending Apache Spark to support state-of-the-art spatial data partitioning techniques: uniform grid, R-Tree, Quad-Tree, and KDB-Tree. The paper further shows how building a local spatial index, e.g., an R-Tree or Quad-Tree, on each Spark data partition can speed up local computation and hence decrease the overall runtime of a spatial analytics program. Finally, the paper presents a comprehensive experimental analysis that surveys and evaluates the performance of running de facto spatial operations, such as spatial range, spatial K-Nearest Neighbors (KNN), and spatial join queries, in the Apache Spark ecosystem. Extensive experiments on real spatial datasets show that GeoSpark achieves up to two orders of magnitude faster runtime performance than existing Hadoop-based systems and up to an order of magnitude faster performance than Spark-based systems.
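To make the partitioning idea concrete, here is a minimal single-machine sketch of the simplest technique named above, the uniform grid, together with a range query that scans only the grid cells the query window overlaps. It is an illustration under stated assumptions, not GeoSpark's implementation or API: the names `cell_of`, `partition`, and `range_query` are hypothetical, data points are assumed to lie inside a known extent, and each grid cell stands in for one Spark data partition.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)
Cell = Tuple[int, int]


def cell_of(p: Point, extent: Box, n: int) -> Cell:
    """Map a point to its (column, row) cell in an n x n uniform grid.

    Coordinates outside the extent are clamped to the nearest border cell.
    """
    min_x, min_y, max_x, max_y = extent
    col = int((p[0] - min_x) / (max_x - min_x) * n)
    row = int((p[1] - min_y) / (max_y - min_y) * n)
    return max(0, min(col, n - 1)), max(0, min(row, n - 1))


def partition(points: List[Point], extent: Box, n: int) -> Dict[Cell, List[Point]]:
    """Group points by grid cell; each cell plays the role of one partition."""
    cells: Dict[Cell, List[Point]] = defaultdict(list)
    for p in points:
        cells[cell_of(p, extent, n)].append(p)
    return cells


def range_query(cells: Dict[Cell, List[Point]], extent: Box, n: int,
                window: Box) -> List[Point]:
    """Return the points inside the window, visiting only overlapping cells."""
    lo_col, lo_row = cell_of((window[0], window[1]), extent, n)
    hi_col, hi_row = cell_of((window[2], window[3]), extent, n)
    hits: List[Point] = []
    for col in range(lo_col, hi_col + 1):
        for row in range(lo_row, hi_row + 1):
            # Cells outside the overlap are never touched (partition pruning).
            for x, y in cells.get((col, row), []):
                if window[0] <= x <= window[2] and window[1] <= y <= window[3]:
                    hits.append((x, y))
    return hits


points = [(1.0, 1.0), (2.0, 8.0), (5.0, 5.0), (9.0, 9.0), (7.0, 2.0)]
extent = (0.0, 0.0, 10.0, 10.0)
grid = partition(points, extent, n=4)
result = range_query(grid, extent, 4, window=(4.0, 4.0, 10.0, 10.0))
```

In this sketch the scan inside each cell is a plain linear pass; the local index described above (for example an R-Tree or Quad-Tree per partition) would replace that inner loop.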


Spatial databases, Distributed computing, Big geospatial data




Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Arizona State University, Tempe, USA
