# Spatial Data Analysis and Econometrics

## Abstract

Key developments in the econometric analysis of spatial cross-section data are reviewed. The spatial connectivity matrix (W) is introduced and its implications for spatial autocorrelation (SAC) are explained. Alternative statistical tests for spatial autocorrelation are reviewed. The spatial autoregression model (SAR) is introduced and its relation to regression models with spatial lagged dependent variables is explained. A common factor test is described, which tests the hypothesis that SAC is induced by the omission of spatial lagged dependent variables. Alternative estimation methods for spatial lag models are compared and contrasted, including maximum likelihood and instrumental variable methods.

Spatial statistical methods such as spatial principal components generated by W, spatial filtering and geographically weighted regression are reviewed. The fundamental differences between spatial data and time series data are emphasized. Time is inherently sequential whereas space is not. Time is potentially infinite whereas space is not. Time has a natural unit of measurement (hours, months, years) whereas space does not. The MAUP (modifiable areal unit problem) is discussed, which arises because, unlike physical space, socioeconomic space does not have a natural unit of measurement.

## 3.1 Introduction

In Chap. 2 we surveyed key concepts and developments in the econometric analysis of time series, which may be unfamiliar to spatial econometricians, but are essential to the understanding of the econometric analysis of nonstationary spatial panel data. In the present chapter, we survey key concepts in the econometric analysis of spatial data, which may be unfamiliar to time series econometricians, but are essential to the understanding of the econometric analysis of nonstationary panel data. Just as we suggested that Chap. 2 may be skipped by time series practitioners, so we suggest that the present chapter may be skipped by those familiar with spatial econometrics. We note, however, that whereas practitioners of spatial econometrics typically have some familiarity with the econometrics of time series, practitioners of time series econometrics usually have no familiarity with spatial econometrics for reasons given in Chap. 1. Indeed, spatial econometricians have incorporated advances in time series analysis to improve their understanding of concepts such as spatial dependence and spatial scale. The current explosion in computing power and geo-coded information has put space and distance squarely back on the agenda such that notions of ‘the death of distance’ (Cairncross 1997) have been greatly exaggerated.

The intellectual mooring of spatial econometrics is probably Cliff and Ord’s (1969) seminal paper on ‘The Problem of Spatial Autocorrelation’. This marked the collaboration of statistics with geography in an attempt to apply statistical theory to spatial data. Over time, this fusion spawned two sub-disciplines. Within statistics, the emergence of the field of spatial statistics was marked by the publication of volumes by Ripley (1981) and Cressie (1993). These were concerned with the analysis of spatial patterns and spatial stochastic variation using implicitly spatial (geo-referenced) data. In the world of econometrics, the sub-field of spatial econometrics began to take root at roughly the same time as evinced by Paelinck and Klaassen’s *Spatial Econometrics* in 1979. They defined the new field as concerned with spatial dependence and spatial asymmetries in economic relationships between places, and the explicit incorporation of space in urban and regional modeling. Intellectual histories of spatial econometrics consider the publication of this book as a watershed event (Anselin 2010; Florax and van der Vlist 2003).

The academic antecedents of the synthesis of time series and spatial analysis are harder to trace. Bartlett (1955) and Whittle (1954) observed that notions from time series could be extended to the analysis of spatial data, but they drew attention to conceptual differences between time and space. Most importantly, whereas time is linear, sequential and unidirectional because time moves forward in equal steps, space is nonlinear, non-sequential and multi-directional because space has no ordering, it has several dimensions (north-south, east-west), and it is usually not measured in equal steps. In time series data there is only one coordinate (t); t − 1 occurs before t and the difference between t − 1 and t is always 1. By contrast, spatial data need at least two coordinates, latitude (m) and longitude (n); there is no sequencing within and between m and n, and the distance between them varies. Moreover, spatial data consist of areas rather than points in space. Because these areas vary by shape and size the distance between them is not constant.

Perhaps the first attempt at synthesizing spatial analysis and time series can be attributed to Bennett (1979) with the publication of *Spatial Time Series*. The preface of the volume articulates a goal of serving as a ‘bridging function … between spatial analysis procedures and the wider fields of engineering, econometrics and statistics for which most of the theory of non-spatial systems has been developed to date’ (p. ii). The volume then proceeds to present a systems-analytic approach to understanding environmental and socio-economic systems that operate over both space and time such as geomorphological change on the one hand and labor market processes on the other. While the book is cognizant of spatial and temporal dependence, the treatment of these issues and those of spatial non-stationarity and spatial heterogeneity is very different to that of modern-day spatial and time series econometrics and reflects the ‘pre unit-root econometrics’ era. Furthermore, the volume is strangely silent on the issue of spatial panel data in which distinct spatial patterns may arise due to both local clustering and pervasive common factors such as climate or macroeconomic developments. The former represents weak spatial (cross sectional) dependence between units while the latter indicates strong, aggregate dependence generated by shocks that affect the spatial units differentially. Bennett implicitly assumed that the data are stationary, both temporally and spatially as did many other authors writing before the “unit root revolution” of the 1980s.

## 3.2 The Nature of Spatial Data

Spatial data are inherently ‘messy’ especially in the social sciences. In the natural sciences surfaces may be measured by two dimensional grids, or by three dimensional blocks e.g. the atmosphere in meteorology and oceans in oceanography. In the social sciences spatial data come in all sorts of shapes and sizes due to the chaotic development of the spatial economy. Spatial data are collected from ‘the field’ be that the archive, the survey, or social media. As such, they do not neatly adhere to the conventions of classic statistical measurement for stochastic modeling, thereby violating iid assumptions made in standard statistical inference and hypothesis testing. As mentioned, spatial data are generally not equally spaced as are time series data. Spatial observations may be aggregates, such that the dependence structure in the data may change as new observations are added, thereby inducing spatial nonstationarity. Moreover, the locations that form the building blocks of spatial data may be endogenous since the locational choices of households and firms may not be independent of the characteristics of these locations, including their physical and socioeconomic amenities, such as climate, quality of schools and government incentives. The same applies to the realm of non-geographic space. Firms locate in a particular product space in order to maximize profits. Their ‘location’ is therefore endogenous. Instrumenting for this choice of location and its ‘distance’ from other product locations is difficult. It is for this reason that locational choice is mainly considered exogenous, and the characteristics of places are considered as given.

Units of time are fundamentally different in these respects from units of space. The realization of a variable during time *t* does not depend on how time is measured in the way that the realization of a variable in spatial unit *i* may depend on how space is measured. For example, annual GDP is simply the sum of quarterly GDP. Time aggregation does not involve conceptual issues in the measurement of GDP. Matters are different when spatial units are aggregated. Combining spatial units A and B inevitably conceals migration between A and B, and may also conceal differences in the structure of economic activity e.g. agriculture versus manufacturing.

On the other hand, spatial data may have econometric advantages. Pinkse and Slade (2010) note that while endogeneity issues may pervade spatial data, they tend also to offer more instrumental variables for handling these issues because, as already noted, higher order spatial lags are imperfectly correlated with their lower order counterparts. The first order spatial lag for unit i (\( {\tilde{y}}_i \)) is imperfectly correlated with its second-order spatial lag (\( {\tilde{\tilde{y}}}_i \)), hence the latter can be used as an instrumental variable for the former. For example, higher order spatial lags of exogenous variables may serve as instrumental variables for the spatial lag of house prices.

Additionally, spatial data are unlikely to be spatially stationary. Spatial stationarity, as discussed in Chap. 5, arises when the sample moments in spatial cross-section data are independent of where in space they are measured. Weak spatial stationarity applies when first (means) and second (variances and covariances) moments are independent of space. Strong spatial stationarity applies when higher order moments are also independent of space. Many spatial processes exhibit highly irregular (non-smooth) behavior in their covariance structure. For example, boundaries between different locations or geomorphological patterns based on non-uniform geology may induce sharp changes in covariances across space, thereby inducing spatial nonstationarity. Although Tobler’s First Law of Geography implies that spatial covariances tend to zero with the distance between them, this is not a sufficient condition for stationarity (as it would be in time series data) as noted by Granger (1969).

As noted in Chap. 1, spatial asymptotics are inherently different to temporal asymptotics. Since infill-asymptotics are not appropriate for socioeconomic data, we focus on increasing-domain asymptotics. Regarding the latter, in Chap. 1 we distinguished between edge effects that are immovable, and edge effects induced by sampling. On the one hand, dependence between spatial units intuitively slows down the rate of asymptotic convergence relative to independent cross-section data. On the other hand, as we shall see in Chap. 5, the higher dimensionality of spatial data makes it easier to detect spurious correlation in nonstationary spatial data than in nonstationary time series data. As noted by Cressie (1993), matters are different when the data are stationary; in two-dimensional space the central limit theorem may cease to apply. Moreover, for national datasets such as US states, the edges of space are immovable in which case increasing-domain asymptotics are no longer applicable (Cressie 1993; Wooldridge 2010; Elhorst 2014). Whereas T is conceptually infinite, matters are different regarding space, which is inherently fixed if edges are immovable, and especially as far as social sciences are concerned where borders are created through physical barriers, geopolitics, language, religion and culture.

Spatial data differ from time series data in that the area over which they are compiled may be configured in many arbitrary ways. This gives rise to the “modifiable areal unit problem” (MAUP) first identified by Yule and Kendall (1950) and discussed further below. MAUP posits that statistical results can vary depending on the way data are apportioned to spatial units of different sizes (the scale issue). In addition, even when base areal units are of similar size or scale, variation in results can arise due to the topology or zoning system used. MAUP therefore comprises two related issues driven by the nature of spatial data.

The analysis of spatial data requires attention to spatial heterogeneity, spatial dependence and spatial scale. All three are mutually connected: the appropriate modeling of spatial heterogeneity depends on choice of scale and the correct choice of scale increases prediction accuracy and mitigates spatial dependence.

## 3.3 Spatial Connectivity

Spatial dependence is characterized through the connectivity matrix (W), which is specified exogenously. If there are N spatial units, W is an N × N matrix with elements w_{ij} where i, j = 1,2,…,N. These elements express the spatial relation between units i and j. Since unit i cannot have a spatial relationship with itself w_{ii} = 0, i.e. the leading diagonal of W is zero. There are many ways of specifying W. For example, if only contiguous units are related, w_{ij} is zero if units i and j are not contiguous. In this case W is a sparse matrix because most of its elements are zero. If spatial connectivity is defined in terms of the distance d_{ij} between units i and j, W will no longer be sparse but it will be symmetric, i.e. w_{ij} = w_{ji} because d_{ij} = d_{ji}. If spatial connectivity takes account of the relative sizes of units i and j so that w_{ij} depends on the size of unit i relative to the size of unit j, W will be asymmetric because w_{ij} does not equal w_{ji}. A key issue in spatial econometrics involves the specification of W (LeSage and Pace 2014; Qu and Lee 2015). We shall have more to say about this matter in Chap. 4, where we discuss specification tests for alternative definitions of W, and where we consider whether W can be estimated instead of being specified exogenously.

W is usually normalized so that its row elements sum to one, i.e. \( {\sum}_{j\ne i}^N{w}_{ij}=1 \). The spatial lag of y_{i} is defined as \( {\tilde{y}}_i={\sum}_{j\ne i}^N{w}_{ij}{y}_j \), which is a weighted average of y in spatial units outside unit i. The N-vector of spatial lags may be written as \( \tilde{y}= Wy \). Powers of W express higher order spatial lags. For example, \( W\tilde{y}={W}^2y=\tilde{\tilde{y}} \) denotes the second order spatial lag of y.
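As a minimal sketch of these definitions, the following example builds a row-normalized contiguity matrix and computes first and second order spatial lags. The four-unit layout (units arranged along a line) and the values of y are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical example: four units on a line (1-2-3-4); 1 marks a shared border.
adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# Row-normalize so that each row of W sums to one; the diagonal stays zero.
W = adjacency / adjacency.sum(axis=1, keepdims=True)

y = np.array([10.0, 20.0, 30.0, 40.0])

y_lag1 = W @ y        # first order spatial lag: average of y among neighbours
y_lag2 = W @ W @ y    # second order spatial lag: spatial lag of the spatial lag

print(y_lag1)  # [20. 20. 30. 30.]
print(y_lag2)  # [20. 25. 25. 30.]
```

Note that although the diagonal of W is zero, the diagonal of W² is not, so the second order lag of unit i partly reflects y_{i} itself.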

## 3.4 The Spatial Lag Model

Spatial variables are denoted by tildes. For example, \( \tilde{x}= Wx \). In Eq. (3.1a) λ denotes the “spatial autoregressive” or SAR coefficient, and δ denotes the vector of “spatial Durbin” coefficients. In Eq. (3.1b) ρ denotes the “spatial autocorrelation” or SAC coefficient. The SAR coefficient induces spatial spillover because y_{i} depends on y_{j} unless λ = 0. The SAC coefficient induces a second type of spatial spillover because u_{i} depends on u_{j} unless ρ = 0. The spatial Durbin coefficient induces a third type of spatial spillover because y_{i} depends on x_{kj} unless δ_{k} = 0.
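To make the three spillover channels concrete, the sketch below simulates data under an assumed specification that is consistent with the definitions above: y = Xβ + λWy + WXδ + u with u = ρWu + ε. The exact form (no separate intercept, a single covariate), the ring-shaped W in which each unit has two neighbours, the parameter values and the helper name `ring_weights` are all illustrative assumptions.

```python
import numpy as np

def ring_weights(n):
    """Row-normalized W for n units on a circle; each unit has two neighbours."""
    W = np.zeros((n, n))
    idx = np.arange(n)
    W[idx, (idx - 1) % n] = 1.0
    W[idx, (idx + 1) % n] = 1.0
    return W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, lam, rho = 50, 0.4, 0.3      # SAR and SAC coefficients (assumed values)
beta, delta = 1.0, 0.5          # own effect and spatial Durbin effect (assumed values)
W = ring_weights(n)
I = np.eye(n)

x = rng.normal(size=n)
eps = rng.normal(size=n)

# Solve the error equation u = rho * W u + eps for u.
u = np.linalg.solve(I - rho * W, eps)
# Solve the outcome equation y = x*beta + lam * W y + (W x)*delta + u for y.
y = np.linalg.solve(I - lam * W, x * beta + (W @ x) * delta + u)

# The simulated data satisfy both structural equations up to rounding error.
resid_y = y - (x * beta + lam * (W @ y) + (W @ x) * delta + u)
resid_u = u - (rho * (W @ u) + eps)
print(np.abs(resid_y).max(), np.abs(resid_u).max())
```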

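Let a_{ij} and c_{ij} denote the elements of \( A={\left(I-\lambda W\right)}^{-1} \) and \( C= AW \). The spatial impulse response of y_{i} with respect to x_{k} in spatial unit j is:

\[ \frac{\partial {y}_i}{\partial {x}_{kj}}={\beta}_k{a}_{ij}+{\delta}_k{c}_{ij} \qquad (3.3) \]

The first component arises because an increase in x_{k} in unit j affects y_{j} through β_{k}, which spills over onto y_{i}. The second component is induced by the spatial Durbin lag; an increase in x_{kj} spills over directly onto y_{i} through δ_{k}. If the SAR coefficient (λ) is positive a_{ii} exceeds unity; an increase in x_{k} in unit i affects y_{i} directly, which in turn affects y_{j}. The latter reverberates back onto y_{i} hence the multiplier exceeds 1. If the SAR coefficient is negative a_{ii} is less than 1.

Because a_{ij} and c_{ij} vary by i and j, the impulse responses are spatially state dependent. Investigators might be interested in three aggregations of Eq. (3.3). The first refers to the average own impulse response (when j = i):

\[ \frac{1}{N}\sum_{i=1}^N\left({\beta}_k{a}_{ii}+{\delta}_k{c}_{ii}\right) \qquad (3.4a) \]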

Equation (3.4a) includes the direct effect of an increase in x_{k} in unit i on y_{i}, which equals β_{k} in all units, as well as the indirect effects via the spatial lag coefficient (λ) and the spatial Durbin lag coefficient (δ_{k}). Therefore, the average indirect effect is Eq. (3.4a) minus β_{k}. This average indirect effect has become a standard feature in most spatial econometric software.

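The second refers to the impulse response for unit i when x_{k} increases globally and not just locally:

\[ \sum_{j=1}^N\left({\beta}_k{a}_{ij}+{\delta}_k{c}_{ij}\right) \qquad (3.4b) \]

Whereas Eq. (3.4a) refers to the average impulse response when x_{k} increases locally, Eq. (3.4b) refers to the impulse response for region i when x_{k} increases globally. If β, δ and λ are positive, Eq. (3.4b) must be larger than Eq. (3.4a). The third aggregate refers to the average effect for all spatial units when x_{k} increases globally. It is the average of Eq. (3.4b):

\[ \frac{1}{N}\left({\beta}_k{S}_A+{\delta}_k{S}_C\right) \qquad (3.4c) \]

where S_{A} and S_{C} sum the elements in A and C.

The impulse responses in Eqs. (3.3)–(3.4c) refer to y_{i} with respect to x_{kj}. Another type of spatial impulse response is with respect to the innovations (ε). The counterpart to Eq. (3.3) is:

\[ \frac{\partial {y}_i}{\partial {\varepsilon}_j}={d}_{ij} \qquad (3.5) \]

where d_{ij} denote the elements of \( D=A{\left(I-\rho W\right)}^{-1} \).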

Whereas the spatial impulse responses in Eqs. (3.4a–3.4c) do not depend on ρ, matters are different in the case of Eq. (3.5). See e.g. LeSage and Pace (2009) for further discussion of these impulse responses.
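The aggregations discussed above can be sketched numerically. The example assumes that A = (I − λW)^{-1}, C = AW and D = A(I − ρW)^{-1}, with a hypothetical ring-shaped W and illustrative parameter values; `ring_weights` is a helper introduced here, not part of any library.

```python
import numpy as np

def ring_weights(n):
    """Row-normalized W for n units on a circle; each unit has two neighbours."""
    W = np.zeros((n, n))
    idx = np.arange(n)
    W[idx, (idx - 1) % n] = 1.0
    W[idx, (idx + 1) % n] = 1.0
    return W / W.sum(axis=1, keepdims=True)

n, lam, rho = 6, 0.4, 0.3           # assumed values
beta_k, delta_k = 1.0, 0.5          # assumed values
W = ring_weights(n)
I = np.eye(n)

A = np.linalg.inv(I - lam * W)          # multiplier generated by the spatial lag of y
C = A @ W                               # multiplier attached to the spatial Durbin lag
D = A @ np.linalg.inv(I - rho * W)      # response of y to the innovations

# Element (i, j) is the response of y_i to x_k in unit j.
response = beta_k * A + delta_k * C

avg_own = np.trace(response) / n        # average own response (j = i)
global_by_unit = response.sum(axis=1)   # response of each unit to a global increase in x_k
avg_global = response.sum() / n         # average of the global responses
avg_indirect = avg_own - beta_k         # average own response net of the direct effect

print(avg_own, avg_global, avg_indirect)
```

With a row-normalized W the global response is the same in every unit and equals (β_k + δ_k)/(1 − λ), which is 2.5 for the values assumed here. Only D depends on ρ.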

In non-spatial models ρ, λ and δ are assumed to be zero, in which case Eq. (3.2) is simply y = Xβ + u. If λ and δ are in fact not zero, this model is misspecified, and OLS estimates of β will be biased and inconsistent because of omitted variables. Note also that if λ is not zero, OLS is biased and inconsistent because the spatial lag of y must be correlated with u. The OLS estimate of λ is biased upward if λ is positive and biased downward if it is negative. Furthermore, the OLS estimate of λ is biased and inconsistent because of ρ. Spatial autocorrelation induces dependence between the spatial lag of y in Eq. (3.1a) and u since \( \tilde{y} \) depends on \( \tilde{u} \) and u depends on \( \tilde{u} \).
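The upward bias of OLS when λ is positive can be illustrated with a small simulation. This is a sketch under simplifying assumptions: a pure SAR process with no covariates and no SAC (ρ = 0), a hypothetical ring-shaped W, and a single simulated sample. The helper `ring_weights` is introduced here for illustration.

```python
import numpy as np

def ring_weights(n):
    """Row-normalized W for n units on a circle; each unit has two neighbours."""
    W = np.zeros((n, n))
    idx = np.arange(n)
    W[idx, (idx - 1) % n] = 1.0
    W[idx, (idx + 1) % n] = 1.0
    return W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
n, lam_true = 400, 0.5
W = ring_weights(n)

# Pure SAR process: y = lam * W y + eps.
eps = rng.normal(size=n)
y = np.linalg.solve(np.eye(n) - lam_true * W, eps)

# OLS regression of y on its spatial lag (no intercept).
wy = W @ y
lam_ols = (wy @ y) / (wy @ wy)

print(lam_true, lam_ols)   # the OLS estimate lies well above the true value
```

Because the spatial lag of y is correlated with the error term, enlarging the sample does not remove the gap between the OLS estimate and the true value.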

The specification of spatial Durbin lags does not have adverse econometric implications for OLS. If X is exogenous, so are spatial lags of X. Therefore, OLS estimates of β and δ are unbiased and consistent provided λ = 0. If, however, ρ is not zero, these OLS estimates continue to be unbiased and consistent, but they cease to be efficient.

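ML estimation involves the Jacobian \( J=\partial \varepsilon /\partial y \), which according to Eq. (3.2) is equal to \( {D}^{-1}=\left(I-\rho W\right)\left(I-\lambda W\right) \). The log likelihood function includes the logarithm of the determinant of D^{−1} (ln|*D*^{−1}|), and the ML estimates of ρ and λ are solved using the first order conditions with respect to ρ and λ, Eqs. (3.6a) and (3.6b).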

Since W is exogenous, Eqs. (3.6a and 3.6b) depend only on ρ and λ. Equation (3.6a) uses the fact that *∂D*^{−1}/*∂ρ* = − *W*(*I* − *λW*).
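The role of the log determinant can be sketched for a simplified case. The example assumes ρ = 0, so that D^{-1} = I − λW, and maximizes the concentrated log likelihood over a grid of values for λ rather than solving the first order conditions. The ring-shaped W, the parameter values, and the names `ring_weights` and `concentrated_loglik` are illustrative assumptions.

```python
import numpy as np

def ring_weights(n):
    """Row-normalized W for n units on a circle; each unit has two neighbours."""
    W = np.zeros((n, n))
    idx = np.arange(n)
    W[idx, (idx - 1) % n] = 1.0
    W[idx, (idx + 1) % n] = 1.0
    return W / W.sum(axis=1, keepdims=True)

def concentrated_loglik(lam, y, X, W, eigvals):
    """Log likelihood of y = lam*Wy + X b + eps, concentrated with respect to b and sigma^2."""
    n = len(y)
    z = y - lam * (W @ y)
    b = np.linalg.lstsq(X, z, rcond=None)[0]
    e = z - X @ b
    log_det = np.sum(np.log(1.0 - lam * eigvals))    # ln|I - lam W|, the Jacobian term
    return log_det - 0.5 * n * np.log(e @ e / n)

rng = np.random.default_rng(2)
n, lam_true, beta_true = 400, 0.5, 1.0
W = ring_weights(n)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = np.linalg.solve(np.eye(n) - lam_true * W,
                    X @ np.array([0.0, beta_true]) + rng.normal(size=n))

eigvals = np.linalg.eigvalsh(W)          # this W is symmetric, so its eigenvalues are real
grid = np.linspace(-0.95, 0.95, 381)     # candidate values of lam in steps of 0.005
loglik = [concentrated_loglik(lam, y, X, W, eigvals) for lam in grid]
lam_ml = grid[int(np.argmax(loglik))]

print(lam_ml)
```

Computing the eigenvalues of W once means the determinant does not have to be recomputed for every candidate value of λ.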

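Instrumental variable (IV) estimation exploits the expansion \( A=I+\lambda W+{\lambda}^2{W}^2+\dots \), according to which the spatial lag of y depends on spatial lags of X of successively higher order.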

Equation (3.7) clarifies that the instrumental variables consist of higher order spatial lags of the exogenous variable. Since λ is a fraction, the higher order terms in parentheses tend to zero. In practice, the polynomial for A is truncated. If the model includes spatial Durbin lags as in Eq. (3.1a) the IVs must include second order spatial lags and above. Otherwise, they must include first order spatial lags and above. See e.g. Anselin (1988) for further details and discussion.
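A minimal two-stage least squares sketch of this idea follows. It assumes a model without spatial Durbin lags and without SAC, so that first and second order spatial lags of x are valid instruments for the spatial lag of y. The ring-shaped W, the parameter values and the helper `ring_weights` are illustrative assumptions.

```python
import numpy as np

def ring_weights(n):
    """Row-normalized W for n units on a circle; each unit has two neighbours."""
    W = np.zeros((n, n))
    idx = np.arange(n)
    W[idx, (idx - 1) % n] = 1.0
    W[idx, (idx + 1) % n] = 1.0
    return W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(3)
n, lam_true, beta_true = 900, 0.5, 1.0
W = ring_weights(n)
x = rng.normal(size=n)
y = np.linalg.solve(np.eye(n) - lam_true * W, beta_true * x + rng.normal(size=n))

wy = W @ y
R = np.column_stack([np.ones(n), x, wy])                 # regressors: constant, x, spatial lag of y
Z = np.column_stack([np.ones(n), x, W @ x, W @ W @ x])   # instruments: x and its first two spatial lags

# First stage: project the regressors on the instruments.
R_hat = Z @ np.linalg.lstsq(Z, R, rcond=None)[0]
# Second stage: regress y on the projected regressors.
coef = np.linalg.lstsq(R_hat, y, rcond=None)[0]
beta_iv, lam_iv = coef[1], coef[2]

print(beta_iv, lam_iv)
```

Truncating the instrument set at the second order lag mirrors the truncation of the polynomial for A described above.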

## 3.5 Spatial Autocorrelation

The spatial autocorrelation coefficient (ρ) in Eq. (3.1b) would only be justified if the error terms (u) in Eq. (3.1a) happened to be spatially autocorrelated. Spatial autocorrelation arises when georeferenced data are correlated. This association can arise due to technical reasons, for example, data incongruence between the spatial extent of the phenomenon of interest and the institutional units for which it is available. It can also arise for substantive reasons, in particular spillovers and unobserved pervasive phenomena that induce correlation across space. If error terms happen to be spatially autocorrelated, u in Eq. (3.1a) ceases to be iid. Testing for spatial autocorrelation in error terms may reveal model misspecification, and spatial heterogeneity. Due to the multi-directional nature of space, testing for spatial autocorrelation is not a spatial variant of the Durbin–Watson statistic commonly used for testing for temporal autocorrelation. Nevertheless, the basic principles are similar.

We introduce various approaches for measuring spatial autocorrelation in error terms, and their associated significance tests.

### Moran’s I

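Moran’s I statistic for the error terms is:

\[ I=\frac{N}{\sum_i\sum_j{w}_{ij}}\frac{\sum_i\sum_j{w}_{ij}{u}_i{u}_j}{\sum_i{u}_i^2} \qquad (3.8a) \]

If W is row-normalized, \( {\sum}_i{\sum}_j{w}_{ij}=N \) in which event Eq. (3.8a) becomes:

\[ I=\frac{\sum_i\sum_j{w}_{ij}{u}_i{u}_j}{\sum_i{u}_i^2}=\frac{u^{\prime } Wu}{u^{\prime }u} \qquad (3.8b) \]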

If u_{i} = u_{j} the error terms are perfectly positively correlated, in which case I = 1 according to Eq. (3.8b). If at the other extreme u_{i} = −u_{j} the error terms are perfectly negatively correlated in which case I = −1. If the error terms are uncorrelated I = 0.

If the standardized statistic z exceeds its critical value of 1.96 (p = 0.05) the error terms are positively spatially autocorrelated. See Cressie (1993) for a full articulation.

The absence of global spatial autocorrelation (I = 0) might conceal local spatial autocorrelation, or what appears to be global spatial autocorrelation might in fact be induced by pockets of local spatial autocorrelation.
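Moran’s I can be computed in a few lines. The sketch below uses simulated spatially autocorrelated errors on a hypothetical ring-shaped W and, instead of the analytical moments of I, obtains a z score from random permutations of the data across locations. The permutation approach, the parameter values and the helper names are assumptions made for illustration.

```python
import numpy as np

def ring_weights(n):
    """Row-normalized W for n units on a circle; each unit has two neighbours."""
    W = np.zeros((n, n))
    idx = np.arange(n)
    W[idx, (idx - 1) % n] = 1.0
    W[idx, (idx + 1) % n] = 1.0
    return W / W.sum(axis=1, keepdims=True)

def morans_i(u, W):
    """Moran's I of u (in deviations from its mean) for connectivity matrix W."""
    u = u - u.mean()
    return (len(u) / W.sum()) * (u @ W @ u) / (u @ u)

rng = np.random.default_rng(4)
n, rho = 200, 0.6
W = ring_weights(n)

# Spatially autocorrelated errors: u = rho * W u + eps.
u = np.linalg.solve(np.eye(n) - rho * W, rng.normal(size=n))

i_obs = morans_i(u, W)

# Reference distribution under no spatial autocorrelation: shuffle u across locations.
perm = np.array([morans_i(rng.permutation(u), W) for _ in range(999)])
z = (i_obs - perm.mean()) / perm.std(ddof=1)

print(i_obs, z)
```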

Geary’s C statistic is based on squared differences between u_{i} and u_{j}. Perfect positive correlation arises when u_{i} = u_{j} in which case C = 0. Perfect negative correlation arises when u_{i} = −u_{j} in which case C = 2. There is no spatial autocorrelation when C = 1. By focusing on differences between u_{i} and u_{j} rather than their products, C is more sensitive to local spatial autocorrelation than Moran’s I.
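A sketch of the C statistic follows, using the usual textbook definition in which weighted squared differences between pairs of units are scaled by the sample variance (with N − 1 in the numerator). The ring-shaped W, the simulated data and the helper names are illustrative assumptions.

```python
import numpy as np

def ring_weights(n):
    """Row-normalized W for n units on a circle; each unit has two neighbours."""
    W = np.zeros((n, n))
    idx = np.arange(n)
    W[idx, (idx - 1) % n] = 1.0
    W[idx, (idx + 1) % n] = 1.0
    return W / W.sum(axis=1, keepdims=True)

def gearys_c(u, W):
    """Geary's C: weighted squared differences between units relative to the variance of u."""
    diff2 = (u[:, None] - u[None, :]) ** 2
    dev = u - u.mean()
    return (len(u) - 1) * np.sum(W * diff2) / (2.0 * W.sum() * (dev @ dev))

rng = np.random.default_rng(5)
n = 200
W = ring_weights(n)

# Positively autocorrelated errors: neighbours resemble each other, so C falls below 1.
u_pos = np.linalg.solve(np.eye(n) - 0.6 * W, rng.normal(size=n))
c_pos = gearys_c(u_pos, W)

# Alternating pattern: every unit is the mirror image of its neighbours, so C approaches 2.
u_alt = np.where(np.arange(n) % 2 == 0, 1.0, -1.0)
c_alt = gearys_c(u_alt, W)

print(c_pos, c_alt)
```

For the alternating pattern the statistic equals 2(N − 1)/N, which tends to 2 as N grows.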

In contrast to the Moran’s I and C statistics, Getis and Ord (1992) introduced a spatial autocorrelation statistic (the G statistic) that indicates the degree to which low and high values in the data are spatially clustered. This is generally used as a local indicator of spatial autocorrelation whereby the weight matrix is row-adjusted to 1 and the row-multiplier uses N − 1 instead of N. Strictly speaking G measures spatial concentration rather than spatial autocorrelation, and resembles a spatial Gini coefficient. Therefore, unlike Moran’s I and C, which are meaningful measures of spatial autocorrelation in error terms, G is more sensibly applied to measuring the spatial concentration in variables such as crime, income etc. The G statistic is a scale invariant (but not location invariant) statistic applicable to variables that are positive and have a natural origin. The distribution of the statistic in terms of z and p values indicates spatial clustering. For this reason it is popularly portrayed in GIS packages as a ‘hot spot’ statistic. Large and significant positive z scores for G indicate clustering of high values (hot spots). Significant negative z scores indicate clustering of low values (cold spots).

### Lagrange Multiplier Statistic

In Lagrange multiplier (LM) tests the null hypothesis is assumed to be true, and the model is estimated under the restrictions it imposes, i.e. ignoring the alternative hypothesis. Subsequently, an auxiliary equation is estimated to test the null hypothesis. The auxiliary equation usually controls for the variables specified in the model. It also specifies the variables omitted under the null hypothesis. If the LM test does not reject the null hypothesis, no damage was induced by ignoring the alternative. If, on the other hand, the null hypothesis is rejected by the LM test, the alternative should not have been ignored in the first place. Since in many situations the alternative is ignorable, LM tests have become increasingly popular. They save the bother of estimating a more general model whose additional parameters might prove to be irrelevant.

Under the null hypothesis of no SAC, the R^{2} of Eq. (3.14) must be zero because b = c = d = 0 by definition. Regressing error terms on the covariates, which generated them, must deliver zero goodness-of-fit. The LM statistic is NR^{2} which is distributed \( {\chi}_1^2 \). If the LM statistic is less than its critical value, ignoring SAC did no harm. If instead LM exceeds its critical value it must be because e is significantly different from zero, in which event SAC was not ignorable.

Equation (3.14) cannot be estimated by OLS for the same reasons that Eq. (3.1b) cannot be estimated by OLS; the spatial lags of y and u are not independent of v. The preferred method of estimation is ML.
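As an illustration of the LM principle, the sketch below computes the familiar closed-form LM statistic for spatially autocorrelated errors from OLS residuals, rather than the auxiliary-regression version described above. It assumes a model without a spatial lag of y; the ring-shaped W, the simulated data and the names `ring_weights` and `lm_error` are illustrative.

```python
import numpy as np

def ring_weights(n):
    """Row-normalized W for n units on a circle; each unit has two neighbours."""
    W = np.zeros((n, n))
    idx = np.arange(n)
    W[idx, (idx - 1) % n] = 1.0
    W[idx, (idx + 1) % n] = 1.0
    return W / W.sum(axis=1, keepdims=True)

def lm_error(y, X, W):
    """LM statistic for SAC in the errors, computed from OLS residuals (chi-squared, 1 df)."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    s2 = e @ e / len(y)                      # residual variance under the null
    t = np.trace(W.T @ W + W @ W)
    return (e @ W @ e / s2) ** 2 / t

rng = np.random.default_rng(6)
n = 400
W = ring_weights(n)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])

# Errors with SAC (rho = 0.6): the restricted model wrongly ignores it.
u_sac = np.linalg.solve(np.eye(n) - 0.6 * W, rng.normal(size=n))
lm_sac = lm_error(X @ beta + u_sac, X, W)

# Independent errors: the restriction rho = 0 is true.
lm_iid = lm_error(X @ beta + rng.normal(size=n), X, W)

CRIT_5PCT = 3.841   # 5% critical value of chi-squared with 1 degree of freedom
print(lm_sac, lm_iid, CRIT_5PCT)
```

Only the restricted (OLS) model is estimated in both cases, which is the practical attraction of the LM approach.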

## 3.6 Spatial Heterogeneity

Spatial heterogeneity is a special case of observed heterogeneity relating to variation over space. It arises when the effects of covariates on dependent variables vary by location. Geographical patterns that correlate across space may not just be statistical nuisances to be treated by including additional variables, but might express spatial heterogeneity. In contrast to spatial dependence and spillover discussed above, coping with spatial heterogeneity does not call on special estimation techniques. It can be incorporated by using spatially distinct units between which model parameters vary or by allowing model parameters to vary over space as in geographically weighted regression (see below).

Spatial heterogeneity is particularly challenging since it is often difficult to distinguish it from spatial dependence (Anselin 2010). This ‘linear inverse problem’ arises when trying to reconstruct an object (function) from indirect observations (Kirsch 1996). Solving this problem means recovering the function from noisy observations. This is akin to inversely deriving data from parameters of a model instead of vice versa. In the context of distinguishing between spatial heterogeneity and spatial dependence this problem is confounded by the blurred distinction between true and apparent spatial clustering. The essence of the problem is that cross-sectional data, while allowing the identification of clusters and patterns, do not provide sufficient information to identify the processes that led to these patterns. As a result, it is impossible to distinguish between the case where spatial patterns are due to structural change (apparent clustering) or follow from a true inherently spatial process of change.

In practice, this is further complicated because each form of misspecification may suggest the other form in diagnostics and specification tests. For example, tests against spatial autocorrelation have power against heteroscedasticity, and tests against heteroscedasticity have power against spatial autocorrelation (Anselin and Griffith 1988). Spatial heterogeneity provides the basis for the specification of the structure of the heterogeneity in a spatial model. Ignoring spatial heterogeneity leads to estimation inefficiency but not to estimation bias.

### Geographically Weighted Regression

In spatial analysis, heterogeneity is commonly addressed through spatial filtering (Griffith 2003) as discussed further below, or geographically weighted regression (GWR) (Fotheringham et al. 2002). Neither technique has attracted much attention in the econometrics community and both are described briefly here.

In nonparametric regression the parameter estimates vary according to the data in the vicinity of the observations. Hence, for observation i, β_{i} depends on the data in the vicinity of y_{i} and x_{i}. GWR deals with spatial heterogeneity because it allows β_{i} to vary with the data in the vicinity of spatial unit *i*. Typically, linear regression applied to spatial data assumes a spatially stationary process in that a particular shock will elicit the same response across space regardless of where it occurs. This is patently unreasonable as sampling variation may cause relationships to vary across space. Furthermore, intrinsic differences across places (cultural, political and behavioral practices) may elicit different responses to common shocks. Finally, a global formulation may misspecify a model that is inherently local.

The GWR model is:

\[ {y}_i={\alpha}_i+\sum_k{\beta}_{ki}{x}_{ki}+{\varepsilon}_i \qquad (3.15a) \]

Responses to x_{k} vary across spatial units in GWR, as they do in the spatial lag model where a_{ii} and c_{ii} vary by i. However, the mechanism in GWR is non-spatial, and is induced instead by spatial heterogeneity. Anselin (2010) has remarked that it may be difficult to distinguish between spatial heterogeneity as in GWR and spatial dependence as in Eq. (3.1a). However, GWR assumes that y_{i} is independent of x_{j} and ε_{j}, whereas Eq. (3.1a) does not. Since Eqs. (3.1a and 3.15a) are non-nested, non-nested tests (discussed in Chap. 4) should be able to distinguish between them.

Defining θ_{i} = (α_{i} β_{ki})′ the GWR estimator for θ_{i} is:

\[ {\hat{\theta}}_i={\left({X}^{\prime }{V}_iX\right)}^{-1}{X}^{\prime }{V}_iy \]

where X includes a column of ones and V_{i} is a symmetric N × N matrix with spatial weights v_{ij} and v_{ii} = 1. Because these weights depend exclusively on distance, we distinguish them from their counterparts in W. This is why w_{ii} = 0 but v_{ii} = 1. A popular specification of V_{i} is the exponential kernel, in which v_{ij} decays exponentially with the distance between units i and j, and *h* represents the kernel bandwidth. As *h* increases, the gradient of the kernel becomes less steep and more data points are included in the local calibration. The choice of bandwidth involves a tradeoff between bias and variance. Too small a bandwidth leads to large variances in estimates of β_{i} since only a small number of observations are considered in each local regression. Too large a bandwidth creates large bias in the local estimates: it tends to iron out the variance in the data points and masks local characteristics, shrinking the estimates towards their global counterparts. To find the optimal value of h, bandwidth optimization strategies are preferred to an ad hoc selected bandwidth (Fotheringham et al. 2002; Páez et al. 2011). GWR estimates are relatively insensitive to the choice of weighting function as long as it is a continuous distance-based function. However, they are sensitive to the degree of distance-decay.
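The local estimator and the role of the bandwidth can be sketched as follows. The example assumes the kernel v_{ij} = exp(−d_{ij}/h), treats V_{i} as the diagonal matrix holding these weights, and uses simulated data in which the slope rises from west to east. The kernel form, the bandwidth values and the function name `gwr` are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def gwr(y, X, coords, h):
    """Weighted least squares at each location with kernel weights v_ij = exp(-d_ij / h)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    theta = np.empty((len(y), X.shape[1]))
    for i in range(len(y)):
        v = np.exp(-d[i] / h)       # v_ii = 1 because d_ii = 0
        XtV = X.T * v               # X'V_i without forming the N x N diagonal matrix
        theta[i] = np.linalg.solve(XtV @ X, XtV @ y)
    return theta

rng = np.random.default_rng(7)
n = 300
coords = rng.uniform(size=(n, 2))            # locations in the unit square
x = rng.normal(size=n)
beta_true = 1.0 + 2.0 * coords[:, 0]         # slope varies with the east-west coordinate
y = 0.5 + beta_true * x + 0.3 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

theta_local = gwr(y, X, coords, h=0.1)       # small bandwidth: estimates track local slopes
theta_global = gwr(y, X, coords, h=1e6)      # very large bandwidth: close to a single global fit

print(np.corrcoef(theta_local[:, 1], beta_true)[0, 1])
print(np.ptp(theta_global[:, 1]))
```

With the very large bandwidth all weights are almost equal, so the local estimates collapse onto the global OLS estimate, which illustrates the shrinkage described above.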

## 3.7 Modifiable Areal Unit Problem (MAUP)

The MAUP framework addresses the issue of the sensitivity of empirical estimation to the selection of geographic units and particularly their arbitrary nature. Yule and Kendall (1950) observed that correlation coefficients could vary depending on how space is aggregated. This observation involves three separate issues: does the number of spatial units affect the results? Does their size (scale) and shape (topology or configuration) have any effect? Since the number of units varies inversely with their size and perhaps their shape, we focus on size and shape. Size effects arise from choosing a spatial resolution (disaggregating or re-aggregating) from a given set of data. Shape effects arise from choosing the relevant data (defining the shape of spatial units) given a level of spatial resolution.

Suppose that rich and poor people are distributed randomly across space so that in truth there is no spatial inequality. Suppose an investigator aggregates these data into circular areas of fixed diameter. The investigator could create artificial spatial inequality by selectively drawing circles around random spatial concentrations of rich and poor. The investigator could generate yet more spurious inequality if he changes the shapes of his spatial units by grouping poor and rich in separate spatial units. The more freedom allowed in choosing shapes, the more artificial inequality may be generated from the data.
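This thought experiment can be sketched with a one-dimensional simplification: residents are placed at random along a line rather than in the plane, and zones are equal-sized segments rather than circles. These simplifications, the sample size and the function name `between_zone_sd` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
rich = rng.integers(0, 2, size=1000)    # 1 = rich, 0 = poor, located at random along a line

def between_zone_sd(values, n_zones):
    """Standard deviation across equal-sized zones of the share of rich residents."""
    return values.reshape(n_zones, -1).mean(axis=1).std()

sd_large_zones = between_zone_sd(rich, 10)         # 10 contiguous zones of 100 residents
sd_small_zones = between_zone_sd(rich, 100)        # 100 contiguous zones of 10 residents
sd_grouped = between_zone_sd(np.sort(rich), 10)    # residents grouped by income before zoning

print(sd_large_zones, sd_small_zones, sd_grouped)
```

Although there is no true spatial inequality, measured inequality between zones rises as zones become smaller, and rises much further when the investigator is free to group rich and poor residents into separate units.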

The effect of spatial size is not just a statistical issue but also a practical one. With much policy emphasis being directed towards generating ‘agglomerations’ and ‘clusters’, getting size ‘right’ empirically would seem to have important implications with respect to knowing what works. A common approach is to use simulated experiments. Using three experiments Openshaw and Taylor (1979) investigate how different sized units influence the relationship between the percentage of Republican voters and the share of the population over the age of 60 for 99 counties in Iowa. They report clear differences between size and shape and relate this to the interaction of contiguity inherent in the former with the spatial autocorrelation in the data. A Monte Carlo type experiment with random allocation of values and their allocation to geo-referenced points across different sized grids conducted by Amrhein (1995) has shown that size and shape have little effect and that changes in variance are influenced only by number of units.

More recently, Briant et al. (2010) illustrate how differentiating various French zoning systems by size and shape only marginally affects determinants of wages and trade. They find MAUP-based distortions across six different zoning systems, but in general they conclude that size and shape are of secondary importance compared with estimation issues. Overall, spatial size is a more cogent issue than spatial shape, and distortions are larger in estimates of trade than of wages. They suggest that gravity models of trade use variables aggregated under different spatial configurations to those of wage models. For example, in trade estimation both sides of the regression may not be treated uniformly (i.e. averaged) as is likely in wage models. As MAUP distortions are related to whether the distribution of variables is preserved, they are more potent in trade estimations than in wage estimations.

At first sight, the issue of spatial shape seems deceptively innocuous and has been considered of ‘third order’ importance in handling spatial data aggregation (Briant et al. 2010). In fact it consists of two separate but often-confounded issues: spatial zoning representing the arrangement of contiguous polygons, and spatial grouping which expresses the arrangement of non-contiguous polygons. Zoning is really a special case of grouping with a contiguity constraint and represents the most common form of spatial arrangement. For the purpose of spatial econometric analysis the issue of contiguity however is critical as it is this arrangement that facilitates spatial spillover.

Spatial datasets typically comprise discrete units that are administrative or geopolitical creations. They are thus exposed to distortions that can arise from edge effects. For example, spatial units on the edge of a square lattice are less exposed to spatial spillover because they have three neighbors instead of four, and in the corners of the lattice they are even less exposed because they only have two neighbors. In oblong lattices there is less spatial dependence than in square lattices because spatial units are closer to the edge in the former. Surprisingly, seminal discussions of the MAUP (Openshaw and Taylor 1979; Fotheringham and Wong 1991) have overlooked this issue. We address this issue in Chap. 5 where we develop unit root tests for SAR coefficients for different topologies.
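The neighbor counts behind the edge effect are easy to verify. The sketch below assumes rook contiguity (units are neighbors only if they share an edge) and compares a square lattice with an oblong lattice containing the same number of units; the function name `rook_neighbor_counts` and the lattice dimensions are illustrative choices.

```python
import numpy as np

def rook_neighbor_counts(rows, cols):
    """Number of rook (edge-sharing) neighbors of each cell in a rows x cols lattice."""
    counts = np.full((rows, cols), 4)
    counts[0, :] -= 1     # top row has no neighbor above
    counts[-1, :] -= 1    # bottom row has no neighbor below
    counts[:, 0] -= 1     # left column has no neighbor to the left
    counts[:, -1] -= 1    # right column has no neighbor to the right
    return counts

square = rook_neighbor_counts(6, 6)    # 36 units
oblong = rook_neighbor_counts(3, 12)   # 36 units, but more of them on the edge

print(square)
print("mean neighbors, 6 x 6 :", square.mean())
print("mean neighbors, 3 x 12:", oblong.mean())
```

Interior cells have four neighbors, edge cells three and corner cells two. With the number of units held fixed, the oblong lattice has a lower average number of neighbors, which is consistent with the weaker spatial dependence described above.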

In contrast, Briant et al. (2010) have tested different spatial topologies such as administrative, grid and random zoning configurations ignoring the implications of spatial spillovers. In parallel to topology they have looked at whether the nature of the information to be allocated to spatial topologies (summed or aggregated information) is sensitive to spatial shape. In the extreme case, this assumes that values in one spatial unit are independent of adjacent or surrounding units, i.e. spatial autarchy is imposed. Shape is found to be of small importance. Smaller more regular units (grid cells) reduce variance and impose homogeneity. Larger units increase volatility and increase heterogeneity. While this seems to indicate MAUP-like bias, spatial dependence is ignored.

## 3.8 Spatial Filtering

Suppose that the model is *y* = *Xβ* + *u*, but u is spatially autocorrelated according to Eq. (3.1b). Since u = Bε, where B = (I − ρW)^{−1}, the model may be expressed in terms of spatially filtered variables by substituting Bε for u in the model, and then multiplying both sides by B^{−1} to obtain:

\[ (I-\rho W)y = (I-\rho W)X\beta + \varepsilon \tag{3.16a} \]

Notice that in Eq. (3.16a) ε is iid, in which case β may be estimated by OLS given ρ, which is estimated first. A crucial assumption in spatial filtering is that ρ is a nuisance parameter that can be concentrated out of the likelihood function. This would be permissible if ρ and β were independent. Although \( \operatorname{plim}\,\widehat{\beta} \) is independent of ρ, matters are different in finite samples. Consequently, in finite samples spatial filtering might induce bias in the estimates of ρ as well as β. The obvious less radical alternative to spatial filtering is to estimate ρ and β jointly by ML.
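The filtering step itself can be sketched in a few lines. The example below is a simulation under strong simplifying assumptions: W is a row-normalized rook contiguity matrix on a hypothetical 20 × 20 lattice, and ρ is treated as known rather than estimated in a first step, so the finite-sample issues raised above do not appear. The helper `lattice_W` and all parameter values are illustrative.

```python
import numpy as np

def lattice_W(rows, cols):
    """Row-normalized rook contiguity matrix for a rows x cols lattice."""
    n = rows * cols
    W = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    W[i, rr * cols + cc] = 1.0
    return W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(42)
W = lattice_W(20, 20)
n = W.shape[0]
rho, beta = 0.6, np.array([1.0, 2.0])        # assumed true values

X = np.column_stack([np.ones(n), rng.normal(size=n)])
eps = rng.normal(size=n)
A = np.eye(n) - rho * W                      # A = B^{-1} = I - rho W
u = np.linalg.solve(A, eps)                  # u = B eps, spatially autocorrelated
y = X @ beta + u

# Spatially filter y and X with the (here known) rho, then apply OLS.
y_f, X_f = A @ y, A @ X
beta_hat = np.linalg.lstsq(X_f, y_f, rcond=None)[0]
print("filtered OLS estimate of beta:", beta_hat)
```

In practice ρ would be replaced by a first-step estimate, which is where the dependence between the estimates of ρ and β enters.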

Spatial filtering treats (I − ρW) as a common factor that applies to y and X. A rival specification is Eq. (3.1a) in which u is assumed to be spatially uncorrelated. Equation (3.16a) imposes the restrictions λ = ρ and δ = ρβ, as may be seen in Eq. (3.16b). Since Eq. (3.16b) is nested in Eq. (3.1a), a likelihood ratio test of the two models may be used to determine which model is preferable. Since Eq. (3.16b) imposes two restrictions, the likelihood ratio statistic is distributed \( {\chi}_2^2 \) under the null hypothesis. The common factor restriction (spatial filtering) cannot be rejected if 2LR is smaller than the critical value of \( {\chi}_2^2 \); otherwise Eq. (3.1a) is preferable.
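The common factor restriction can be made explicit by expanding the filtered model. The following is a sketch that assumes the unrestricted model of Eq. (3.1a), which is not reproduced in this section, has the form shown in the last line, with coefficient λ on Wy and δ on WX; that form is an assumption made here for illustration.

```latex
\begin{aligned}
\text{filtered model:}\quad (I-\rho W)\,y &= (I-\rho W)\,X\beta + \varepsilon \\
\text{expanded:}\quad y &= \rho W y + X\beta - \rho W X\beta + \varepsilon \\
\text{assumed unrestricted form:}\quad y &= \lambda W y + X\beta + W X\delta + \varepsilon
\end{aligned}
```

Matching coefficients, the coefficient on Wy is restricted to ρ, and the coefficient on WX is restricted to the product of ρ and β with a negative sign under the assumed form. If WX enters Eq. (3.1a) with the opposite sign, the same restriction reads δ = ρβ.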

In summary, spatial filtering treats spatial autocorrelation as a nuisance parameter rather than as a diagnostic device for detecting model misspecification. Evidence of spatial autocorrelation may indicate that the model is misspecified in terms of its spatial dynamics. In Chap. 2 we noted that serially correlated errors in time series models typically imply that the model is dynamically misspecified. We also tend to think that spatially autocorrelated errors typically imply that the model is misspecified in terms of its spatial dynamics. If, indeed, the correct model happens to have spatially autocorrelated error terms, spatially filtered OLS is consistent. However, it may be biased in finite samples. Either way, the justification for spatial filtering is questionable.

### Spatial Eigenvectors

The rank of a diagonalizable matrix is equal to the number of its non-zero eigenvalues. Since W is N × N its rank is at most N. However, many of its eigenvalues are most probably small, especially when N is large. The eigenvalues are labelled by n = 1,2,…,N in terms of their size, where n = 1 is the largest and n = N the smallest. From the spectral decomposition theorem, a symmetric matrix W may be written as W = ΓΛΓ′, where Λ is a diagonal N × N matrix with the eigenvalues on the leading diagonal, and Γ is the associated matrix of eigenvectors with elements γ_{nj}. Since ΓΓ′ = I_{N} the eigenvectors are orthonormal, i.e. they are mutually orthogonal, have unit length, and are therefore linearly independent of each other. Griffith (1996) showed that W can be decomposed into N orthogonal spatial components using the eigenvectors, as in principal components analysis. The most important of these components is for n = 1 and the least important is for n = N. These spatial components may be used to characterize the spatial dependence generated by W in terms of peripherality, regionality and other spatial attributes that might be geographically meaningful.
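The spectral decomposition can be checked numerically. The sketch below uses a hypothetical symmetric contiguity matrix for five units arranged on a line, which is an illustrative choice and not an example from the text. It orders the eigenvalues from largest to smallest as described above, verifies that the eigenvectors are orthonormal, and rebuilds W from its eigenvalues and eigenvectors.

```python
import numpy as np

# Hypothetical symmetric contiguity matrix: five units on a line (1-2-3-4-5).
N = 5
W = np.zeros((N, N))
for i in range(N - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order.
eigvals, Gamma = np.linalg.eigh(W)
order = np.argsort(eigvals)[::-1]            # n = 1 is the largest eigenvalue
eigvals, Gamma = eigvals[order], Gamma[:, order]

W_rebuilt = Gamma @ np.diag(eigvals) @ Gamma.T
n_nonzero = int(np.sum(np.abs(eigvals) > 1e-10))
rank = np.linalg.matrix_rank(W)

print("eigenvalues:", np.round(eigvals, 4))
print("Gamma Gamma' = I:", np.allclose(Gamma @ Gamma.T, np.eye(N)))
print("W = Gamma Lambda Gamma':", np.allclose(W_rebuilt, W))
print("non-zero eigenvalues:", n_nonzero, " rank:", rank)
```

In this small example one eigenvalue is zero, so the rank equals the number of non-zero eigenvalues and is smaller than the dimension of W.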

Because the eigenvectors are orthogonal, there is no collinearity between their M covariates. Consequently, investigators may rapidly determine the optimal specification of Eq. (3.17a) in terms of exclusion restrictions for some and even many of the M eigenvectors. For example, Getis and Griffith (2002) use house price data in which N = 48, M = 14 and the final number of included eigenvectors is only 3. An advantage over Eq. (3.17c) is that Eq. (3.17a) is less parametric in the sense that only the empirically relevant eigenvalues of W are retained.

Equation (3.17d) is the counterpart to Eq. (3.16a) when β and ρ are estimated jointly in that β and ϕ are estimated jointly. However, whereas the estimation of β and ρ is by ML, Eq. (3.17d) may be estimated by OLS.

## 3.9 Spatial Panel Data

This chapter has been exclusively concerned with spatial cross-section data. On the whole, no new major econometric issues are involved in spatial panel data, provided that they are stationary. Textbooks on panel data usually include a chapter on spatial data (Baltagi 2013; Pesaran 2015). See e.g. Elhorst (2014) on static and dynamic spatial panel data econometrics when the data are stationary. If instead the data are nonstationary, radical changes are involved in the econometric analysis of spatial panel data as well as spatial cross-section data. In Chap. 5 we address the issue of nonstationarity in spatial cross-section data. In Chaps. 7–9 we address the issue of spatio-temporal nonstationarity in spatial panel data. Recently, attention has been drawn to the difference between strong and weak cross-section dependence in stationary as well as nonstationary panel data. This issue is addressed in Chap. 10.

The literature on the econometric analysis of spatial panel data dates back at least to Anselin (1988, Chap. 10). See also Chap. 6 below for a historical review. However, empirical work using spatial panel data developed slowly. An early contribution is Elhorst (2003) who considered the case of temporally static models with spatial dynamics induced by spatial lagged endogenous variables with fixed spatial effects. As in cross-section data, the presence of spatial lagged endogenous variables in panel data affects the identification of SAR coefficients. The econometric solutions discussed above regarding cross-section data (ML, IV, GMM) are directly applicable to temporally static panel data, provided that these data are stationary. Elhorst follows ML procedures used for non-spatial panel data by concentrating out the regional fixed effects from the likelihood function by demeaning the data. Panel counterparts for Moran I and LM statistics for testing for spatial autocorrelation in the residuals have been developed too. Procedures for this are available, for example, in Stata using the xsmle command, in Matlab’s econometric toolbox, and in R.

In dynamic panel data models the “incidental parameter problem” is expressed by the fact that the estimated fixed effects are consistent but biased in finite samples. Moreover, concentrating out fixed effects by demeaning or differencing the data induces inconsistency. Since Arellano and Bond (1991) the standard solution to this inconsistency problem has been to use higher order lagged differences of the endogenous variable as instrumental variables for their first order lagged difference. Subsequently, GMM versions of this solution were developed. The availability of these solutions in econometric software packages has made them popular relative to the alternatives.
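The problem and the instrumental variable remedy can be illustrated by simulation. The sketch below is a deliberately simple, just-identified IV estimator in first differences, in which the twice-lagged difference instruments the once-lagged difference. It is not the full GMM estimator of Arellano and Bond (1991). The model is a non-spatial AR(1) panel with fixed effects, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, burn, phi = 5000, 8, 50, 0.5            # assumed design: large N, short T

alpha = rng.normal(size=N)                    # unit fixed effects
y = np.zeros((N, T + burn))
y[:, 0] = alpha / (1 - phi)                   # start at each unit's long-run mean
for t in range(1, T + burn):
    y[:, t] = alpha + phi * y[:, t - 1] + rng.normal(size=N)
y = y[:, burn:]                               # keep the last T periods

# Within (demeaned) estimator: inconsistent for fixed T as N grows.
y_lag, y_cur = y[:, :-1], y[:, 1:]
yl = y_lag - y_lag.mean(axis=1, keepdims=True)
yc = y_cur - y_cur.mean(axis=1, keepdims=True)
phi_within = (yl * yc).sum() / (yl ** 2).sum()

# IV in first differences: instrument dy_{t-1} with dy_{t-2}.
dy = np.diff(y, axis=1)
d_cur, d_lag1, d_lag2 = dy[:, 2:], dy[:, 1:-1], dy[:, :-2]
phi_iv = (d_lag2 * d_cur).sum() / (d_lag2 * d_lag1).sum()

print("true phi:", phi)
print("within estimate:", round(phi_within, 3))
print("IV estimate:    ", round(phi_iv, 3))
```

The within estimate falls well below the true coefficient because demeaning correlates the lagged dependent variable with the transformed error, whereas the twice-lagged difference is uncorrelated with the differenced error and yields an estimate close to the true value.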

The main alternative is bias correction (Hahn and Kuersteiner 2002) of the aforementioned biased estimates induced by the incidental parameter problem. Since the bias is known analytically (as explained in Chap. 6), it is straightforward to correct it. Although bias correction has not proved popular in dynamic panel data econometrics in general, the opposite has been the case in spatial econometrics. Yu et al. (2008) developed quasi ML (QML) methods for panel data models with spatiotemporal dynamics using a two-step procedure. In the first step, they concentrate out the fixed effects and estimate the model ignoring the incidental parameter problem. In the second step they bias-correct the parameter estimates from the first step. A similar two-step procedure was used by Beenstock and Felsenstein (2007) and is discussed in Chap. 6.

Since Pesaran (2006) the spatial econometric analysis of panel data has been challenged by a rival interpretation of cross-section dependence. According to this rival interpretation, cross-section dependence is induced by common factors that have spatially heterogeneous factor loadings. Cross-section dependence stems from the differential effects of shared common factors. Whereas spatial dependence is based on proximity, its common factor rival is not. Subsequently, spatial dependence has been referred to as weak cross-section dependence, and dependence induced by common factors has been referred to as strong cross-section dependence. In Chap. 10 we discuss these issues in detail, and join the growing consensus that the two types of cross-section dependence are not mutually exclusive.

This brief overview of developments in the econometric analysis of spatial panel data refers exclusively to stationary panel data. It certainly does not apply to nonstationary panel data. Indeed, as noted in Chap. 1, there is no literature on the econometric analysis of nonstationary spatial panel data. Perhaps the only exception is Yu et al. (2012) who consider the case where the data are nonstationary and are known to be cointegrated. They assume what we seek to test; in practice we do not know that the data are indeed cointegrated. To these ends we follow the methodological path trod by our predecessors in non-spatial panel data (see Chap. 2). We develop the asymptotic theory for testing for the presence of spatiotemporal unit roots in which N is fixed and T → ∞. We then use simulation methods to compute critical values under the null hypothesis that a spatiotemporal unit root is present. Then we develop the asymptotic theory for testing for spatial panel cointegration where the variables contain spatiotemporal unit roots. Here too N is fixed and T → ∞. Finally, we use simulation methods to compute critical values for spatial panel cointegration tests.

This chapter has presented some of the key ideas in spatial data analysis that are pertinent to time series econometricians. Where applicable, we have highlighted the time series roots of current spatial econometrics. For example, the SAR model is derived from the simple time series autoregressive models and spatial filtering is a variant of time series differencing. Despite these commonalities, there are still many differences. These derive from the unique nature of spatial data and impact on some of the key issues in handling spatial data for econometric time series analysis. The challenges of spatial heterogeneity, dependence, and MAUP, which arise in spatial data, do not have any counterparts in time series data. The ‘messy’ nature of spatial data noted above has manifested itself in the ‘problem’ of spatial (observational) autocorrelation and dependence (Cliff and Ord 1969) that plagues econometric model specifications. Spatial data can also be ‘noisy’, typically characterized by stochastic errors and underlying time trends (nonstationarity).

Furthermore, spatial data can also be ‘dirty’ containing corruptions and inconsistencies. A plethora of dedicated techniques have emerged to mitigate some of these excesses. These harness advances in computational sciences and the increasing availability of spatial panel data. The toolbox serving the spatial econometrician has been continually extended through new forms of spatial data interpolation and imputation techniques, extensions to impulse response modeling through spatial filtering, reducing error propagation through GIS and so on. Manipulating these data via smoothing techniques or partitioning into polygons needs to be cognizant of the spatial dependence present in the data. Cross-product statistics such as Moran’s I, Geary’s C, Getis and Ord’s G and other geo-statistical measures for evaluating dependence such as clustering algorithms, rest on an evaluation of the degree to which the data are spatially heterogeneous. Other techniques such as interpolation or surface smoothing via filtering are also grounded in the nature of trends in the data and the specification of a structure for spatial dependence. As time series econometrics gets increasingly involved in cross sectional dependence (both strong and weak), it cannot continue to be oblivious of these issues.

We devote the next chapter to the spatial connectivity matrix, which is the hallmark of the spatial econometric analysis of cross-section data and panel data, and which has been a key focus in the present chapter.

## References

- Amrhein CG (1995) Searching for the elusive aggregation effect: evidence from statistical simulations. Environ Plan A 27:105–119
- Anselin L (1988) Spatial econometrics: methods and models. Kluwer Academic, Dordrecht
- Anselin L (2010) Thirty years of spatial econometrics. Pap Reg Sci 89(1):3–25
- Anselin L, Griffith DA (1988) Do spatial effects really matter in regression analysis? Pap Reg Sci Assoc 65:11–34
- Arellano M, Bond S (1991) Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations. Rev Econ Stud 58:277–297
- Baltagi BH (2013) Econometric analysis of panel data, 5th edn. Wiley, Chichester
- Bartlett MS (1955) An introduction to stochastic processes. Cambridge University Press, Cambridge
- Beenstock M, Felsenstein D (2007) Spatial vector autoregressions. Spat Econ Anal 2:167–196
- Bennett RJ (1979) Spatial time series: analysis, forecasting, control. Pion, London
- Briant A, Combes PP, Lafourcade M (2010) Dots to boxes: do the size and shape of spatial units jeopardize economic geography estimations? J Urban Econ 67(3):287–302
- Cairncross F (1997) The death of distance: how the communications revolution is changing our lives. Harvard Business School Press, Boston, MA
- Cliff A, Ord J (1969) The problem of spatial autocorrelation. In: Scott AJ (ed) Studies in regional science, London papers in regional science. Pion, London, pp 25–55
- Cressie NAC (1993) Statistics for spatial data. Wiley, New York
- Elhorst JP (2003) Specification and estimation of spatial panel data models. Int Reg Sci Rev 26:244–268
- Elhorst JP (2014) From spatial cross-section data to spatial panel data. Springer, Berlin
- Florax RJGM, van der Vlist AJ (2003) Spatial econometric data analysis: moving beyond traditional models. Int Reg Sci Rev 26(3):223–243
- Fotheringham AS, Wong DWS (1991) The modifiable areal unit problem in multivariate statistical analysis. Environ Plan A 23:1025–1044
- Fotheringham AS, Brunsdon C, Charlton M (2002) Geographically weighted regression: the analysis of spatially varying relationships. Wiley, London
- Getis A, Griffith DA (2002) Comparative spatial filtering in regression analysis. Geogr Anal 34:131–140
- Getis A, Ord JK (1992) The analysis of spatial association using distance statistics. Geogr Anal 24:189–206
- Granger CWJ (1969) Spatial data and time series analysis. In: Scott A (ed) Studies in regional science, London papers in regional science. Pion, London, pp 1–24
- Griffith DA (1996) Spatial autocorrelation and eigenfunctions of the geographic weights matrix accompanying geo-referenced data. Can Geogr 40(4):351–367
- Griffith DA (2003) Spatial autocorrelation and spatial filtering: gaining understanding through theory and scientific visualization. Springer, Berlin
- Hahn J, Kuersteiner G (2002) Asymptotically unbiased inference for a dynamic panel model with fixed effects when both N and T are large. Econometrica 70:1639–1657
- Kirsch A (1996) An introduction to the mathematical theory of inverse problems. Springer, New York
- LeSage JP, Pace RK (2014) The biggest myth in spatial econometrics. Econometrics 2(4):217–249
- LeSage JP, Pace RK (2009) Introduction to spatial econometrics. CRC, Boca Raton, FL
- Moran PAP (1950) Notes on continuous stochastic phenomena. Biometrika 37:17–23
- Openshaw S, Taylor PJ (1979) A million or so correlation coefficients: three experiments on the modifiable areal unit problem. In: Wrigley N (ed) Statistical applications in the spatial sciences. Pion, London, pp 127–144
- Paelinck J, Klaassen L (1979) Spatial econometrics. Saxon House, Farnborough
- Páez A, Farber S, Wheeler D (2011) A simulation-based study of geographically weighted regression as a method for investigating spatially varying relationships. Environ Plan A 43(12):2992–3010
- Pesaran MH (2006) Estimation and inference in large heterogeneous panels with a multifactor error structure. Econometrica 74:967–1012
- Pesaran MH (2015) Time series and panel data econometrics. Oxford University Press, Oxford
- Pinkse J, Slade ME (2010) The future of spatial econometrics. J Reg Sci 50(1):103–117
- Qu X, Lee L (2015) Estimating a spatial autoregressive model with an endogenous spatial weight matrix. J Econ 184(2):209–232
- Ripley BD (1981) Spatial statistics. Wiley, New York
- Whittle P (1954) On stationary processes in the plane. Biometrika 41:434–449
- Wooldridge JM (2010) Econometric analysis of cross section and panel data, 2nd edn. MIT Press, Cambridge, MA
- Yu J, de Jong R, Lee L-F (2008) Quasi-maximum likelihood estimators for spatial dynamic panel data with fixed effects with both N and T large. J Econ 146:118–134
- Yu J, de Jong R, Lee L-F (2012) Estimation for spatial dynamic panel data with fixed effects: the case of spatial cointegration. J Econ 167:16–37
- Yule GU, Kendall MG (1950) An introduction to the theory of statistics. Charles Griffin, London