Emerging trends in geospatial artificial intelligence (geoAI): potential applications for environmental epidemiology
Geospatial artificial intelligence (geoAI) is an emerging scientific discipline that combines innovations in spatial science, artificial intelligence methods in machine learning (e.g., deep learning), data mining, and high-performance computing to extract knowledge from spatial big data. In environmental epidemiology, exposure modeling is a commonly used approach to conduct exposure assessment to determine the distribution of exposures in study populations. geoAI technologies provide important advantages for exposure modeling in environmental epidemiology, including the ability to incorporate large amounts of big spatial and temporal data in a variety of formats; computational efficiency; flexibility in algorithms and workflows to accommodate relevant characteristics of spatial (environmental) processes including spatial nonstationarity; and scalability to model other environmental exposures across different geographic areas. The objectives of this commentary are to provide an overview of key concepts surrounding the evolving and interdisciplinary field of geoAI including spatial data science, machine learning, deep learning, and data mining; recent geoAI applications in research; and potential future directions for geoAI in environmental epidemiology.
Keywords: Geospatial artificial intelligence; geoAI; Spatial data science; Machine learning; Deep learning; Data mining; Remote sensing; Environmental epidemiology; Exposure modeling
ACM: Association for Computing Machinery
DSCIC: Data and Software Coordination and Integration Center
EPA: Environmental Protection Agency
geoAI: geospatial artificial intelligence
GIGO: garbage in, garbage out
GIS: geographic information system
GPU: graphics processing unit
PM2.5: particulate matter air pollution < 2.5 μm in diameter
PRISMS: Pediatric Research using the Integrated Sensor Monitoring Systems
SIGSPATIAL: Special Interest Group on Spatial Information
VGI: volunteered geographic information
Spatial science, also referred to as geographic information science, plays an important role in many scientific disciplines as it seeks to understand, analyze, and visualize real-world phenomena according to their locations. Spatial scientists apply technologies such as geographic information systems (GIS) and remote sensing to spatial (e.g., georeferenced) data to achieve these objectives – to identify and make sense of patterns in space. Tied to the current era of big data is the real-time generation of spatial big data, which have become ubiquitously available, from geotagged social media posts on Twitter to environmental sensors collecting meteorological information. It has been suggested that at least 80% of all data are geographic in nature, as the majority of information around us can be georeferenced. By this measure, 80% of the 2.5 exabytes (2,500,000,000 gigabytes) of big data generated every day is geographic. Data science, and by extension spatial data science, are still evolving fields that provide methods to organize how we think about and approach generating new knowledge from (spatial) big data.
The scientific field of geospatial artificial intelligence (geoAI) was recently formed by combining innovations in spatial science with the rapid growth of methods in artificial intelligence (AI), particularly machine learning (e.g., deep learning), data mining, and high-performance computing, to glean meaningful information from spatial big data. geoAI is highly interdisciplinary, bridging many scientific fields including computer science, engineering, statistics, and spatial science. The innovation of geoAI partly lies in its applications to address real-world problems. In particular, geoAI applications were showcased at the inaugural 2017 Association for Computing Machinery (ACM) Special Interest Group on Spatial Information (SIGSPATIAL) International Workshop on GeoAI: AI and Deep Learning for Geographic Knowledge Discovery (the steering committee was led by the U.S. Department of Energy Oak Ridge National Laboratory Urban Dynamics Institute), which included advances in remote sensing image classification and predictive modeling for traffic. Further, the application of AI technologies for knowledge discovery from spatial data reflects a recent trend, as demonstrated in other scientific communities including the International Symposium on Spatial and Temporal Databases. These novel geoAI methods can be used to address human health-related problems, for example, in environmental epidemiology. In particular, geoAI technologies are beginning to be used in the field of environmental exposure modeling, which is commonly used to conduct exposure assessment in these studies.
Ultimately, one of the overarching goals for integrating geoAI with environmental epidemiology is to conduct more accurate and highly resolved modeling of environmental exposures (compared to conventional approaches), which in turn would lead to more accurate assessment of the environmental factors to which we are exposed, and thus improved understanding of the potential associations between environmental exposures and disease in epidemiologic studies. Further, geoAI provides methods to measure new exposures that have been previously difficult to capture.
The purpose of this commentary is to provide an overview of key concepts surrounding the emerging field of geoAI; recent advances in geoAI technologies and applications; and potential future directions for geoAI in environmental epidemiology.
Distinguishing between the buzzwords: the spatial in big data and data science
Several key concepts are currently at the forefront of understanding the geospatial big data revolution. Big data, such as electronic health records and customer transactions, are generally characterized by a high volume of data; large variety of data sources, formats, and structures; and a high velocity of new data creation [5, 6, 7]. As a consequence, big data require specialized methods and techniques for processing and analysis. Data science broadly refers to methods to provide new knowledge from the rigorous analysis of big data, integrating methods and concepts from disciplines including computer science, engineering, and statistics [8, 9]. The data science workflow generally resembles an iterative process of data import and processing, followed by cleaning, transformation, visualization, modeling, and finally communication of results.
Spatial data science is a niche and still-forming field focused on methods to process, manage, analyze, and visualize spatial big data, providing opportunities to derive dynamic insights from complex spatial phenomena. Spatial data science workflows comprise steps for data manipulation, data integration, exploratory data analysis, visualization, and modeling – and are specifically applied to spatial data, often using specialized software for spatial data formats. For example, a spatial data science workflow may include data wrangling using open source solutions such as the Geospatial Data Abstraction Library (GDAL), scripting in R, Python, and Spatial SQL for spatial analyses facilitated by high-performance computing (e.g., querying big data stored on a distributed data infrastructure through cloud computing platforms such as Amazon Web Services for analysis; or spatial big data analytics conducted on a supercomputer), and geovisualization using D3. Spatial data synthesis is considered an important challenge in spatial data science, which includes issues related to spatial data aggregation (of different scales) and spatial data integration (harmonizing diverse spatial data types related to format, reference, unit, etc.). Advances in cyberGIS (defined as GIS based on advanced cyberinfrastructure and e-science) – and more broadly high-performance computing capabilities for high-dimensional data – have played an integral role in transforming our capacity to handle spatial big data and thus for spatial data science applications.
For example, a National Science Foundation-supported cyberGIS supercomputer called ROGER was created in 2014, which enables the execution of geospatial applications requiring advanced cyberinfrastructure through high-performance computing (e.g., > 4 petabytes of high-speed persistent storage), graphics processing unit (GPU)-accelerated computing, big data-intensive subsystems using Hadoop and Spark, and OpenStack cloud computing [11, 13].
As spatial data science continues to evolve as a discipline, spatial big data are constantly expanding, with two prominent examples being volunteered geographic information (VGI) and remote sensing. The term VGI encapsulates user-generated content with a locational component. In the past decade, VGI has seen an explosion with the advent and continued expansion of social media and smartphones, where users can post and thus create geotagged tweets on Twitter, Instagram photos, Snapchat videos, and Yelp reviews. Usage of VGI should be accompanied by an awareness of potential legal issues including but not limited to intellectual property, liability, and privacy for the operator, contributor, and user of VGI. Remote sensing is another type of spatial big data, capturing characteristics of objects from a distance, such as imagery from satellite sensors. Depending on the sensor, remote sensing spatial big data can be expansive in both geographic coverage (spanning the entire globe) and temporal coverage (with frequent revisit times). In recent years, we have seen an enormous increase in satellite remote sensing big data as private companies and governments continue to launch higher resolution satellites. For example, DigitalGlobe collects over 1 billion km² of high-resolution imagery each year as part of its constellation of commercial satellites including the WorldView and GeoEye spacecraft. The U.S. Geological Survey and NASA Landsat program has continually launched earth-observing satellites since 1972, with spatial resolutions as fine as 15 m and increasing spectral resolution with each subsequent Landsat mission (e.g., the Landsat 8 Operational Land Imager and Thermal Infrared Sensor, launched in 2013, comprise 9 spectral bands and 2 thermal bands).
Geospatial artificial intelligence (geoAI): nascent origins
Data science involves the application of methods in scientific fields such as artificial intelligence (AI) and data mining. AI refers to machines that make sense of the world, automating processes that create scalable insights from big data [5, 20]. Machine learning is a subset of AI that focuses on computers acquiring knowledge to iteratively extract information and learn from patterns in raw data [20, 21]. Deep learning is a cutting-edge type of machine learning that draws inspiration from brain function, representing a flexible and powerful way to enable computers to learn from experience and understand the world as a nested hierarchy of concepts, where the computer is able to learn complicated concepts by building them from simpler concepts. Deep learning has been applied to natural language processing, computer vision, and autonomous driving [20, 22]. Data mining refers to techniques to discover new and interesting patterns from large datasets, such as identifying frequent itemsets in online transaction records. Many techniques for data mining were developed as part of machine learning. Applications of data mining techniques include recommender systems and cohort detection in social networks.
Geospatial artificial intelligence (geoAI) is an emerging science that utilizes advances in high-performance computing to apply technologies in AI, particularly machine learning (e.g., deep learning) and data mining, to extract meaningful information from spatial big data. geoAI is both a specialized field within spatial science, because particular spatial technologies, including GIS, must be used to process and analyze spatial data, and an applied type of spatial data science, as it is specifically focused on applying AI technologies to analyze spatial big data. The first-ever International Workshop on geoAI, organized as part of the 2017 ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, brought together scientists across diverse disciplines, including geoscientists, computer scientists, engineers, and entrepreneurs, to discuss the latest trends in deep learning for geographical data mining and knowledge discovery. Featured geoAI applications included deep learning architectures and algorithms for feature recognition in historical maps; multi-sensor remote sensing image resolution enhancement; and identification of the semantic similarity in VGI attributes for OpenStreetMap. The geoAI Workshop is one example of the recent trend in the application of AI to spatial data. AI research has also been presented at the International Symposium on Spatial and Temporal Databases, which features research in spatial, temporal, and spatiotemporal data management and related technologies.
Opportunities for geoAI in environmental epidemiology
Given the advances and capabilities on display in recent research, we can begin to connect the dots regarding how geoAI technologies can be specifically applied to environmental epidemiology. To determine the factors to which we may be exposed and which thus may influence health, environmental epidemiologists implement direct methods of exposure assessment, such as biomonitoring (e.g., measured in urine), and indirect methods, such as exposure modeling. Exposure modeling involves the development of a model to represent a particular environmental variable using various data inputs (such as environmental measurements) and statistical methods (such as land use regression and generalized additive mixed models). Exposure modeling is a cost-effective approach to assess the distribution of exposures in particularly large study populations compared to applying direct methods. Exposure models range from basic proximity-based measures (e.g., buffers and measured distance) to more advanced modeling such as kriging. Spatial science has been critical in exposure modeling for epidemiologic studies over the past two decades, enabling environmental epidemiologists to use GIS technologies to create and link exposure models to health outcome data using geographic variables (e.g., geocoded addresses) to investigate the effects of factors such as air pollution on the risk of developing diseases such as cardiovascular disease [29, 30].
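The proximity-based measures mentioned above (measured distance and buffers) can be illustrated with a minimal sketch. It assumes point locations in decimal degrees and uses the haversine great-circle formula; the coordinates, the function names, and the choice of a spherical Earth radius are illustrative assumptions rather than anything specified in the studies cited here.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_distance_m(home, sources):
    """Measured-distance exposure proxy: distance from a residence to the nearest source."""
    return min(haversine_m(home[0], home[1], s[0], s[1]) for s in sources)


def count_within_buffer(home, sources, radius_m):
    """Buffer exposure proxy: number of sources within radius_m of a residence."""
    return sum(1 for s in sources if haversine_m(home[0], home[1], s[0], s[1]) <= radius_m)


# Hypothetical geocoded residence and hypothetical road sample points (lat, lon).
home = (34.0500, -118.2500)
road_points = [
    (34.0505, -118.2500),  # roughly 56 m north
    (34.0600, -118.2500),  # roughly 1.1 km north
    (34.0500, -118.2400),  # roughly 0.9 km east
]

nearest = nearest_distance_m(home, road_points)
in_100m = count_within_buffer(home, road_points, 100)
in_1000m = count_within_buffer(home, road_points, 1000)
```

In practice these calculations would usually be run in a GIS or spatial database on projected coordinates; the sketch only shows what the two proxy measures compute.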
geoAI methods and big data infrastructures (e.g., Spark and Hadoop) can be applied to address challenges surrounding exposure modeling in environmental epidemiology – including inefficiency in computational processing and time (particularly when big data are compounded with large geographic study areas) and data-related constraints that affect spatial and/or temporal resolution. For example, previous exposure modeling efforts have often been associated with coarse spatial resolutions, impacting the extent to which the exposure model is able to accurately estimate individual-level exposure (i.e., exposure measurement error), as well as limitations in temporal resolution, which may result in failure to capture exposures during time windows relevant to developing the disease of interest. Advances in geoAI enable accurate, high-resolution exposure modeling for environmental epidemiologic studies, especially regarding high-performance computing to handle big data (big in space and time; spatiotemporal) as well as developing and applying machine and deep learning algorithms and big data infrastructures to extract the most meaningful and relevant pieces of input information to, for example, predict the amount of an environmental factor at a particular time and location.
A recent example of geoAI in action for environmental exposure assessment was a data-driven method developed to predict particulate matter air pollution < 2.5 μm in diameter (PM2.5) in Los Angeles, CA, USA. This research utilized the Pediatric Research using the Integrated Sensor Monitoring Systems (PRISMS) Data and Software Coordination and Integration Center (DSCIC) infrastructure [4, 31]. A spatial data mining approach using machine learning and OpenStreetMap (OSM) spatial big data was developed to enable selection of the most important OSM geographic features (e.g., land use and roads) predicting PM2.5 concentrations. This spatial data mining approach addresses important issues in air pollution exposure modeling regarding the spatial and temporal variability of the relevant “neighborhood” within which to determine how and which factors influence predicted exposures (spatial nonstationarity is discussed later). Using millions of geographic features available from OSM, the algorithm to create the PM2.5 exposure model first identified U.S. Environmental Protection Agency (EPA) air monitoring stations that exhibited similar temporal patterns in PM2.5 concentrations. The algorithm next trained a random forest model (a popular machine learning method using decision trees for classification and regression modeling) to generate the relative importance of each OSM geographic feature. This was performed by determining the geo-context, or which OSM features and within what distances (e.g., 100 m vs. 1000 m radius buffers) are associated with air monitoring stations (and their measured PM2.5 levels) characterized by a similar temporal pattern. Finally, the algorithm trained a second random forest model using the geo-contexts and measured PM2.5 at the air monitoring stations to predict PM2.5 concentrations at unmeasured locations (i.e., interpolation).
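To make the idea of a geo-context more concrete, the following sketch counts features of each type within several buffer radii around a monitoring station, producing the kind of feature table that could then be passed to a random forest. It is not the implementation used by Lin et al.; the feature tuples, the projected coordinates in metres, the function name `geo_context`, and the two radii are assumptions chosen for illustration, and the temporal grouping and random forest steps are omitted.

```python
import math
from collections import Counter


def geo_context(station_xy, features, radii_m=(100, 1000)):
    """Count features of each type within each buffer radius of a station.

    station_xy: (x, y) of the station in projected metres (assumed).
    features:   iterable of (feature_type, x, y) in the same projected system.
    Returns a dict keyed by (feature_type, radius_m) with feature counts.
    """
    sx, sy = station_xy
    counts = Counter()
    for ftype, fx, fy in features:
        dist = math.hypot(fx - sx, fy - sy)  # planar distance to the station
        for radius in radii_m:
            if dist <= radius:
                counts[(ftype, radius)] += 1
    return dict(counts)


# Hypothetical OSM-like point features around a station placed at the origin.
features = [
    ("road", 50.0, 0.0),         # 50 m away: inside both buffers
    ("road", 300.0, 400.0),      # 500 m away: inside the 1000 m buffer only
    ("industrial", 0.0, 900.0),  # 900 m away: inside the 1000 m buffer only
    ("park", 2000.0, 0.0),       # 2000 m away: outside both buffers
]

context = geo_context((0.0, 0.0), features)
```

Each (feature type, radius) pair becomes one candidate predictor, which is what allows a downstream model to rank both which features matter and at what distance.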
Prediction errors were minimized through incorporating temporality of measured PM2.5 concentrations in each stage of the algorithm, although modeling would have been improved with time-varying information on predictors. The model predictive performance using measured PM2.5 levels at the EPA air monitoring stations as the gold standard showed an improvement compared to using inverse distance weighting, a commonly used spatial interpolation method. Through this innovative approach, Lin et al. (2017) developed a flexible spatial data mining-based algorithm that removes the need for a priori selection of predictors for exposure modeling, as important predictors may depend on the specific study area and time of day – essentially letting the data decide what is important for exposure modeling.
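For readers unfamiliar with the baseline, inverse distance weighting predicts a value at an unmeasured location as a weighted average of nearby measurements, with weights that decay with distance. The sketch below is a generic illustration; the power parameter of 2, the planar coordinates, and the station values are assumptions and do not reproduce the configuration used in the comparison described above.

```python
import math


def idw_predict(target_xy, stations, power=2.0):
    """Inverse distance weighted estimate at target_xy.

    stations: list of ((x, y), value) pairs in a projected coordinate system.
    power:    distance-decay exponent (2 is a common default, assumed here).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in stations:
        dist = math.hypot(x - target_xy[0], y - target_xy[1])
        if dist == 0.0:
            return value  # target coincides with a station: use its measurement
        weight = 1.0 / dist ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical stations: ((x, y) in km, PM2.5 in micrograms per cubic metre).
stations = [((0.0, 0.0), 10.0), ((10.0, 0.0), 20.0)]

midpoint = idw_predict((5.0, 0.0), stations)    # equidistant from both stations
near_first = idw_predict((2.0, 0.0), stations)  # pulled toward the first station
at_station = idw_predict((0.0, 0.0), stations)  # returns the measured value
```

Because the weights depend only on distance, this baseline cannot use land use, roads, or other contextual predictors, which is the gap the data-driven approach aims to fill.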
The application of geoAI, specifically using machine learning and data mining, to air pollution exposure modeling described in Lin et al. (2017) demonstrates several key advantages for exposure assessment in environmental epidemiology. geoAI algorithms can incorporate large amounts of spatiotemporal big data, which can improve both the spatial and temporal resolutions of the output predictions, depending on the spatial and temporal resolutions of the input data and/or downscaling methodologies to create finer resolution data from relatively coarser data. Beyond incorporating high-resolution big data that are being generated in real-time, existing historical big data, such as Landsat satellite remote sensing imagery from 1972 to present, can be used within geoAI frameworks for historical exposure modeling – advantageous to studying chronic diseases with long latency periods. This seamless usage and integration of spatial big data is facilitated by high-performance computing capabilities, which provide a computationally efficient approach to exposure modeling using high-dimensional data compared to other existing time-intensive approaches (e.g., dispersion modeling for air pollution) that may lack such computational infrastructures.
Further, the flexibility of geoAI workflows and algorithms can address properties of environmental exposures (as spatial processes) that are often ignored during modeling, such as spatial nonstationarity and anisotropy. Spatial nonstationarity occurs when a global model is unsuitable for explaining a spatial process due to local variations in, for example, the associations between the spatial process and its predictors (i.e., drifts over space) [32, 33]. Lin et al. (2017) addressed spatial nonstationarity through creating unique geo-contexts using the OSM geographic features for air monitoring stations grouped into similar temporal patterns. Anisotropic spatial processes are characterized by directional effects; for example, the concentration of an air pollutant may be affected by wind speed and wind direction. The flexibility in geoAI workflows naturally allows for scalability to use and modify algorithms to accommodate more big data (e.g., unconventional datasets such as satellite remote sensing to derive city landscapes for air quality dispersion modeling), different types of big data, and extending modeling to predict different environmental exposures in different geographic areas. An additional facet of this flexibility is that many machine learning and data mining techniques can be conducted without a high degree of feature engineering, enabling the inclusion of large amounts of big data, for example greater numbers of surrogate variables when direct measures are unavailable. Another potential area of application for geoAI involves algorithm development to quickly and accurately classify and identify objects from remote sensing data that have been previously difficult to capture, for example, features of the built environment based on spectral and other characteristics to generate detailed 3D representations of city landscapes.
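A toy illustration of spatial nonstationarity is sketched below: the same predictor has a different association with the outcome in two regions, so a single global regression slope describes neither region well. The data, the region names, and the use of simple least squares are hypothetical and serve only to show the concept; this is not how Lin et al. handled nonstationarity.

```python
def ols_slope(xs, ys):
    """Slope of the simple least-squares regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical data: predictor = road density, outcome = pollutant concentration.
regions = {
    "west": ([1, 2, 3, 4], [2.0, 4.0, 6.0, 8.0]),      # strong local association
    "east": ([1, 2, 3, 4], [10.5, 11.0, 11.5, 12.0]),  # weak local association
}

# Local models: one slope per region.
local_slopes = {name: ols_slope(xs, ys) for name, (xs, ys) in regions.items()}

# Global model: pool all observations and fit a single slope.
all_x = [x for xs, _ in regions.values() for x in xs]
all_y = [y for _, ys in regions.values() for y in ys]
global_slope = ols_slope(all_x, all_y)
```

Here the local slopes are 2.0 and 0.5 while the pooled slope is 1.25, so the global model overstates the association in one region and understates it in the other.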
Ultimately, geoAI applications for environmental epidemiology move us closer to achieving the goal of providing a highly resolved and more accurate picture of the environmental exposures to which we are exposed, which can be combined with other relevant information regarding health outcomes, confounders, etc., to investigate whether a particular environmental exposure is associated with a particular outcome of interest in an epidemiologic study. However, as with any exposure modeling endeavor, there must be careful scrutiny of data quality and consideration of data costs. In the context of the Lin et al. (2017) study, although this type of data-driven approach enables flexibility in the amount of spatial big data that can be incorporated and in allowing the data to determine model inputs, it is incumbent on the spatial data scientist to evaluate data quality and assess whether or not the spatial resolution and other data attributes are useful for the application at hand – to avoid what is referred to as garbage in, garbage out (GIGO) in computer science. Related to data quality is the importance of balancing data-driven approaches against the need for domain-specific expertise. For example, if a particular variable that is a known predictor of PM2.5 (irrespective of time and space) is not selected as part of a data-driven method for inclusion into exposure modeling, this may require modifications to the algorithm, evaluation of the input data, etc. Finally, as a currently evolving field, geoAI requires the expertise of multiple disciplines, including epidemiology, computer science, engineering, and statistics, to establish best practices for how to approach environmental exposure modeling given the complexities introduced by the biological, chemical, and physical properties of different environmental exposures, wide-ranging algorithms that can be developed and applied, and heterogeneous spatial big data characterized by varying scales, formats, and quality.
geoAI is an emerging interdisciplinary scientific field that harnesses the innovations of spatial science, artificial intelligence (particularly machine learning and deep learning), data mining, and high-performance computing for knowledge discovery from spatial big data. geoAI traces part of its roots to spatial data science, which is an evolving field that aims to help organize how we think about and approach processing and analyzing spatial big data. Recent research demonstrates movement towards practical applications of geoAI to address real-world problems, from feature recognition to image enhancement. geoAI offers several advantages for environmental epidemiology, particularly for exposure modeling as part of exposure assessment, including the capability to incorporate large amounts of spatial big data of high spatial and/or temporal resolution; computational efficiency regarding time and resources; flexibility in accommodating important features of spatial (environmental) processes such as spatial nonstationarity; and scalability to model different environmental exposures in different geographic areas. Potential future geoAI applications for environmental epidemiology should utilize cross-disciplinary approaches to developing and establishing rigorous best practices for exposure modeling that include careful consideration of data quality and domain-specific expertise.
This work was supported by the National Institutes of Health (NIH) National Cancer Institute (NCI) Training Program in Cancer Epidemiology (T32 CA009001) and the Prevent Cancer Foundation.
TV was responsible for paper conception. TV, JEH, FL, and Y-YC contributed to the production of the manuscript and provided critical revisions to the final manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 2. IBM. Industry Insights: 2.5 quintillion bytes of data created every day. How does CPG & Retail manage it? https://www.ibm.com/blogs/insights-on-business/consumer-products/2-5-quintillion-bytes-of-data-created-every-day-how-does-cpg-retail-manage-it/. Accessed 30 Oct 2017.
- 3. Baker D, Nieuwenhuijsen MJ. Environmental epidemiology: study methods and application. New York, NY: Oxford University Press.
- 4. Lin Y, Chiang Y-Y, Pan F, Stripelis D, Ambite JL, Eckel SP, Habre R. Mining public datasets for modeling intra-city PM2.5 concentrations at a fine spatial resolution. In: Proceedings of the 25th ACM SIGSPATIAL international conference on advances in geographic information systems. Los Angeles area, CA: ACM; 2017. p. 1–10.
- 5. Dietrich D. Data science & big data analytics: discovering, analyzing, visualizing and presenting data. Indianapolis, IN: John Wiley & Sons, Inc; 2015.
- 7. McAfee A, Brynjolfsson E. Big data: the management revolution. Harv Bus Rev. 2012;90(10):60–8.
- 8. Dominici F, Parkes D. Harvard in Allston: data science: SoundCloud. Harvard University podcast; 2017. https://soundcloud.com/harvard/harvard-in-allston-data-science?in=harvard/sets/harvard-in-allston
- 10. Wickham H, Grolemund G. R for data science. Sebastopol, CA: O'Reilly Media, Inc.; 2016.
- 12. Anselin L. Spatial data, spatial analysis and spatial data science. The University of Chicago: the Center for Spatial Data Science; 2016.
- 13. University of Illinois Urbana-Champaign. ROGER: The CyberGIS Supercomputer. https://wiki.ncsa.illinois.edu/display/ROGER/ROGER%3A+The+CyberGIS+Supercomputer. Accessed 30 Oct 2017.
- 18. DigitalGlobe. The DigitalGlobe Constellation. https://dg-cms-uploads-production.s3.amazonaws.com/uploads/document/file/223/Constellation_Brochure_forWeb.pdf. Accessed 30 Oct 2017.
- 19. U.S. Geological Survey. Landsat. https://landsat.usgs.gov/. Accessed 30 Oct 2017.
- 20. Goodfellow I, Bengio Y, Courville A. Deep learning. Cambridge, MA: The MIT Press; 2016.
- 24. Witten IH, Frank E, Hall MA. Data mining: practical machine learning tools and techniques. 3rd ed. Burlington, MA: Morgan Kaufmann Publishers; 2016.
- 25. Duan W, Chiang Y-Y, Knoblock CA, Jain V, Feldman D, Uhl JH, Leyk S. Automatic alignment of geographic features in contemporary vector data and historical maps. In: Proceedings of the 25th ACM SIGSPATIAL international conference on advances in geographic information systems. Los Angeles area, CA: ACM; 2017. p. 45–54.
- 26. Collins CB, Beck JM, Bridges SM, Rushing JA, Graves SJ. Deep learning for multisensor image resolution enhancement. In: Proceedings of the 25th ACM SIGSPATIAL international conference on advances in geographic information systems. Los Angeles area, CA: ACM; 2017. p. 37–44.
- 27. Majic I, Winter S, Tomko M. Finding equivalent keys in OpenStreetMap: semantic similarity computation based on extensional definitions. In: Proceedings of the 25th ACM SIGSPATIAL international conference on advances in geographic information systems. Los Angeles area, CA: ACM; 2017. p. 24–32.
- 30. Hart JE, Puett RC, Rexrode KM, Albert CM, Laden F. Effect modification of long-term air pollution exposures and the risk of incident cardiovascular disease in US women. J Am Heart Assoc. 2015;4(12).
- 32. O'Sullivan D, Unwin D. Geographic information analysis. Hoboken, NJ: John Wiley & Sons; 2014.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.