On Parallelizing Large Spatial Queries Using Map-Reduce
Vector spatial data types such as lines, polygons or regions usually comprise hundreds of thousands of latitude-longitude pairs in order to accurately represent the geometry of spatial features such as towns, rivers or villages. As a result, spatial data operations are both computationally and memory intensive. One way to deal with this is to distribute the operations among multiple computational nodes. Parallel spatial databases attempt to do this, but only at very small scales (of the order of tens of nodes at most). Another approach is to use a distributed framework such as Map-Reduce, since spatial data operations map well to this paradigm. Map-Reduce lets us harness commodity hardware operating in a shared-nothing mode, while also lending robustness to the computation, since parts of the computation can be restarted on failure. In this paper, we present HadoopDB, a combination of Hadoop and Postgres spatial, to efficiently handle computations on large spatial data sets. In HadoopDB, Hadoop serves as a means of coordinating the various computational nodes, each of which performs the spatial query on a part of the data set. The Reduce stage collates the partial results to yield the result of the original query. We present performance results showing that common spatial queries yield a speedup that is nearly linear in the number of Hadoop processes deployed.
Keywords: MapReduce, Hadoop, PostGIS, Spatial Data, HadoopDB
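The division of labour described in the abstract (each node runs the spatial query on its own part of the data set, and a Reduce stage collates the partial results) can be illustrated with a minimal single-process sketch. This is not the paper's implementation: the sample points, the round-robin `partition` helper, the bounding-box `local_query` and the `collate` function are all hypothetical stand-ins, and plain Python lists take the place of Hadoop tasks and per-node Postgres instances.

```python
# Illustrative sketch only: simulates the partition / local query / collate
# flow in one process. All names and data here are assumptions, not taken
# from the paper.

# Hypothetical records: (feature_id, longitude, latitude).
POINTS = [
    ("a", 77.59, 12.97),
    ("b", 78.47, 17.38),
    ("c", 72.87, 19.07),
    ("d", 77.21, 28.61),
    ("e", 80.27, 13.08),
    ("f", 88.36, 22.57),
]


def partition(records, num_nodes):
    """Assign records to nodes round-robin (a stand-in for a real partitioner)."""
    chunks = [[] for _ in range(num_nodes)]
    for i, record in enumerate(records):
        chunks[i % num_nodes].append(record)
    return chunks


def local_query(chunk, bbox):
    """Map step: the query a single node would run on its own partition.

    Here the query is a simple bounding-box filter on points.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    return [
        fid
        for fid, lon, lat in chunk
        if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat
    ]


def collate(partials):
    """Reduce step: merge the per-node partial results into one answer."""
    merged = []
    for partial in partials:
        merged.extend(partial)
    return sorted(merged)


def run_query(records, bbox, num_nodes):
    """Partition the data, run the local query per node, then collate."""
    chunks = partition(records, num_nodes)
    partials = [local_query(chunk, bbox) for chunk in chunks]
    return collate(partials)


if __name__ == "__main__":
    bbox = (75.0, 10.0, 81.0, 20.0)  # (min_lon, min_lat, max_lon, max_lat)
    print(run_query(POINTS, bbox, num_nodes=3))
```

Because a bounding-box filter on points can be answered independently on each partition, the collated result does not depend on how many nodes are used. Queries that relate features across partitions, such as spatial joins, would need a partitioning scheme that accounts for features spanning partition boundaries.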