Spatial Aggregation Methods for Investigating the MAUP Effects in Migration Analysis
In this paper, we investigate the effects of scale and zone configuration on migration indicators and spatial interaction model parameters using a software system known as the IMAGE Studio. Internal migration flows in the United Kingdom, and the local authority districts between which they occur, are aggregated into sets of progressively fewer and larger polygons using alternative zone design algorithms. Indicators of migration intensity, impact and distance are revealed to vary significantly by scale but less so by zonation, whereas migration effectiveness and distance deterrence show greater scale independence but more sensitivity to zone configuration. Equal area and population optimised regions improve the quality of measures to a certain degree depending upon the imposition of shape constraints.
Keywords: Zone aggregation; MAUP effects; Migration indicators; Spatial interaction modelling; IMAGE Studio
Spatial analysts are now familiar with the axiom that statistical indicators and model parameters that quantify different features of a particular human geographic phenomenon may vary with the spatial scale for which data are available and with the configuration (or shape) of the zones for which data are reported at each scale. This variation is attributable to the so-called ‘scale’ and ‘zonation’ effects of the Modifiable Areal Unit Problem (MAUP) that Openshaw (1984) documented carefully in his famous CATMOG publication and which has been addressed by a number of geographers since then, most recently by Lloyd (2014) and Manley (2014). Many studies of the MAUP effects have considered the impact of scale and zonation problems using attribute data in the form of stock variables measured for a limited set of scales and zonation systems. Our context is that of internal migration flows, where two geographies (of origin and destination) are involved and where individuals change usual address from one location to another during some period of time. Internal migration data are often released by the national statistical agencies as flows between the zones that constitute certain administrative or census geographies and in most cases, the geographies of origin and destination are equivalent. Migration flows in the 12 month period before the 2011 Census in the United Kingdom (UK), for example, are available in the form of square origin-destination matrices (with the same zones as origins and destinations) at certain spatial scales (Duke-Williams et al. 2018) and consequently, the volume and intensity of migration between zones will be scale dependent.
Thus, for example, the volume of migrants over 1 year of age between 404 local authority districts in the UK in the 12 months before the 2011 Census was around 2.8 million people and the crude migration intensity was 44.3 per thousand population, whereas only around 1.2 million individuals or 18.7 per thousand of the population moved between the 12 UK regions (2011 Census Special Migration Statistics extracted from UK Data Service using WICID).
The aim of this paper is to investigate what are the MAUP implications for migration indicators and spatial interaction model parameters when we apply different zone design methods to a set of Basic Spatial Units (BSUs) for which we have data on inter-zonal migration flows such as the UK local authority districts mentioned above. We have chosen four alternative zone aggregation methods and our objective is to identify the variation in results of using each method, exposing some of the advantages and problems of each approach along the way. The algorithms are explained in detail in the third section of the paper, the data used in the analyses are introduced in the fourth section, and the results are reported in the fifth section. To begin with, however, we introduce the IMAGE project which has been the context in which this research has been undertaken and outline the structure and framework of the IMAGE Studio and its subsystems. The paper finishes with some conclusions and suggestions for further work.
The MAUP, the IMAGE Project and the IMAGE Studio
Whilst the MAUP was first identified by Gehlke and Biehl (1934), it remained relatively unexplored by geographers until Openshaw and Taylor (1979) demonstrated how spatial data analysis using bivariate correlation methods might result in rather different coefficients depending on the number of spatial units (the scale) used to define the same area. These authors also identified an ‘aggregation problem’ as the second component of the MAUP, arising when the same number of zones were involved but their size and shape were allowed to vary. Subsequently, Openshaw and Rao (1995) used the example of Liverpool to demonstrate how the patterns of concentration of ethnic minority populations across 119 census wards in 1991 could be almost completely reversed by re-engineering the boundaries based on the underlying 2926 census enumeration districts into 119 zones of equal population.
Further explorations of the MAUP were reported in studies during the 1990s (e.g. Fotheringham and Wong 1991; Holt et al. 1996) and Marble (2000) challenged the research community to provide examples of situations in which the MAUP was an important problem. Flowerdew (2011), using bivariate correlation between pairs of variables from the 2001 Census for England, demonstrated that in many cases, the MAUP makes little or no difference but that there are some relationships where the effect is significant. Other studies (e.g. Holt et al. 1996; Tranmer and Steel 2001; Manley 2005) have provided measures that can be used to show the effect of the MAUP based on the variances of the variables concerned or within-area homogeneity. In the following sub-section, we explain the context in which an investigation of the MAUP has been imperative and outline the structure of the software system that has been developed to automate the procedures for identifying both the scale and zonation components.
The IMAGE (Internal Migration Around the GlobE) project is an international research project funded by the Australian Research Council and based at the University of Queensland. It aims to facilitate cross-national comparisons of internal migration using a set of indicators that measure aspects of migration including intensity, distance, connectivity and impact (Bell et al. 2002), and so to advance understanding of the way that migration within countries varies around the world. Considerable effort has been spent on constructing a global inventory of internal migration data sources (Bell et al. 2015a) and creating a repository of migration and related (boundary and population) data sets (Bell et al. 2014). The IMAGE project had a number of objectives that derive from analysis of the data sets held in the repository, including the comparison of overall migration intensities in countries for which data are available or can be estimated (Bell et al. 2015b), the distances over which people migrate and the frictional effect of distance on migration (Stillwell et al. 2016) and the impact of migration on population distributions in different countries (Rees et al. 2016).
One of the key obstacles confronting cross-national comparison of migration indicators is the inequality or inconsistency in the geographical zones for which migration data are captured and collected in different countries. Every country has its own hierarchy of geographies; in some cases, such as small islands or principalities, there is only one spatial unit and no hierarchy; in other cases, data may be available for three or four tiers of geography with different numbers of spatial units in each level. However, the boundaries of each of these sets of zones define polygons that are unique in shape and size and the migration indicators associated with each geography in one country are not directly comparable with those relating to administrative or census geographies in other countries. In attempting to make comparisons of migration rates between, say, the NUTS 1 regions of the European Union (EU) countries, we encounter both components of the MAUP: there are different numbers of NUTS 1 regions in each country and the spatial configuration, i.e. the size and shape of each region, is different. Exactly the same problem applies when we attempt to make cross-national comparisons on a global level.
In response to this challenge, we have proposed a methodology which involves progressively aggregating a set of zones for any single country, called Basic Spatial Units (BSUs), into larger and fewer zones, called Aggregated Spatial Regions (ASRs), and generating multiple different configurations of zones at each level of aggregation or scale. Sets of migration indicators and model parameters are then computed at various levels for different configurations and summarised using measures of central tendency and deviation. Variation in the value of a summary indicator from one level of ASRs to another can be identified as measuring the scale effect, whilst variation in the indicator values between the zone configurations at any one level can be interpreted as the zonation effect. The IMAGE Studio has been constructed for automating the computation processes involved.
The IMAGE Studio
The Data preparation subsystem is where the various data sets are assembled and prepared for use in the other subsystems. The three data sets required are: (i) a matrix of migration flow counts with rows representing origin BSUs and columns representing destination BSUs and with BSU codes 1, 2, … n in the first column and first row respectively; (ii) a vector of populations at risk with the equivalent numeric code for each BSU in the first column; and (iii) boundary data for the BSUs in the form of a shapefile containing the numeric BSU code for each polygon. A matrix of distances between BSUs (with BSU codes in the first row and column in same order as migration flows) can also be input if this is available from a particular source or has been estimated independently.
It is necessary that every BSU polygon is deemed to be contiguous with at least one other polygon and that all ‘island’ polygons are joined to the rest of the system. This latter specification is important in countries where polygons are separated by stretches of water and no contiguous boundaries are present. In the UK, for example, it is clear from Fig. 2 that Northern Ireland and the Western Isles of Scotland are not ‘connected’ to the rest of mainland UK. This process is undertaken manually by adding to the contiguity file the codes of polygons that are most suitable for connection based on ferry routes or just proximity. It is necessary that contiguities are included for pairs of BSUs in both directions. A file of BSU centroids is also produced since these are the points representing the gravitational centres of all BSUs that are used to calculate distances between zones.
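The contiguity and centroid steps described above can be illustrated with a minimal Python sketch. It is not the Studio's own code: the function names, the dictionary-based contiguity structure and the three-zone example (including the "ferry" link) are assumptions made purely for illustration, and distances are simple straight-line distances between centroids.

```python
from math import hypot

def add_links(contiguity, extra_pairs):
    """Return a copy of the contiguity map with extra links added in both directions."""
    result = {bsu: set(neighbours) for bsu, neighbours in contiguity.items()}
    for a, b in extra_pairs:
        result.setdefault(a, set()).add(b)
        result.setdefault(b, set()).add(a)
    return result

def isolated(contiguity):
    """List the BSUs that have no neighbour at all ('island' polygons)."""
    return sorted(bsu for bsu, neighbours in contiguity.items() if not neighbours)

def centroid_distances(centroids):
    """Straight-line distance between every ordered pair of BSU centroids."""
    return {(a, b): hypot(xa - xb, ya - yb)
            for a, (xa, ya) in centroids.items()
            for b, (xb, yb) in centroids.items()}

# Hypothetical three-zone system in which BSU 3 is an island.
contiguity = {1: {2}, 2: {1}, 3: set()}
linked = add_links(contiguity, [(3, 2)])  # manual link, e.g. along a ferry route

centroids = {1: (0.0, 0.0), 2: (3.0, 4.0), 3: (3.0, 10.0)}
distances = centroid_distances(centroids)
```

Adding each manual link in both directions keeps the contiguity information consistent, which is the requirement stated in the text.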
The Aggregation subsystem is required for the creation of spatial aggregations of BSUs into what we call Aggregated Spatial Regions (ASRs). The subsystem provides functionality for either single or multiple aggregation. In the case of the former, the user chooses the number of ASRs that are to be created from the initial BSUs and the number of required configurations of these ASRs at that one selected scale. If the raw data contained 400 BSUs, the user might want to aggregate the BSUs into 200 ASRs, for example, and produce 100 different configurations of these ASRs. Alternatively, with multiple aggregation, the user might specify a scale increment or step with which to aggregate BSUs on an iterative basis as well as the number of configurations at each scale. For example, if there are 100 BSUs and the user aggregates them using a scale step of 10 zones with 100 configurations, then the aggregations will take place into sets of 10, 20, 30, 40, 50, 60, 70, 80 and 90 ASRs with 100 configurations at each scale. Since the initial BSUs are used for creating each configuration at each scale, this process can consume considerable amounts of computer time, so fewer configurations (e.g. 50) are often adopted in practice. Implementing the aggregation process involves choosing a spatial algorithm that is fed with the normalised data from the Data preparation subsystem to produce centroid coordinates, inter-centroid distances, contiguities, flow matrices and populations for each set of ASRs which can then be used in the migration indicators and modelling subsystems. This paper reports some results generated when using different zone design algorithms that are outlined in more detail in the next section.
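To make the aggregation step concrete, the following Python sketch shows one way that BSU flows and populations could be summed into ASRs once each BSU has been assigned to a region, and how the list of scales follows from a scale step. The function names, the dictionary layout and the four-zone example are illustrative assumptions rather than the Studio's implementation.

```python
def aggregate(flows, populations, assignment):
    """Sum BSU-level flows and populations into ASRs.

    flows:       {(origin_bsu, destination_bsu): migrants}
    populations: {bsu: population at risk}
    assignment:  {bsu: asr label}
    """
    asr_flows, asr_pops = {}, {}
    for (origin, destination), count in flows.items():
        key = (assignment[origin], assignment[destination])
        asr_flows[key] = asr_flows.get(key, 0) + count
    for bsu, pop in populations.items():
        asr = assignment[bsu]
        asr_pops[asr] = asr_pops.get(asr, 0) + pop
    return asr_flows, asr_pops

def inter_zonal_total(flows):
    """Migrants crossing a zone boundary (diagonal cells are excluded)."""
    return sum(count for (o, d), count in flows.items() if o != d)

def scales(n_bsus, step):
    """Numbers of ASRs produced by multiple aggregation with a given step."""
    return list(range(step, n_bsus, step))

# Hypothetical four-BSU system aggregated into two ASRs.
flows = {(1, 2): 10, (2, 1): 5, (1, 3): 4, (3, 4): 6, (4, 1): 2, (2, 3): 3}
populations = {1: 100, 2: 200, 3: 300, 4: 400}
assignment = {1: "A", 2: "A", 3: "B", 4: "B"}

asr_flows, asr_pops = aggregate(flows, populations, assignment)
```

In this toy example 30 migrants cross BSU boundaries but only 9 cross ASR boundaries, because flows between BSUs in the same ASR become intra-zonal. This is the mechanism behind the scale dependence of migration volume and intensity.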
The Migration indicators subsystem is where internal migration indicators are calculated for the set of initial BSUs or for each set of ASRs. The subsystem calculates the indicators at two levels: indicators at the global or system-wide level refer to measures for all BSUs or ASRs; indicators at the local level refer to measures for the individual BSUs. Local migration indicators for ASRs are not computed because each set of ASRs will be different from one scale to the next and therefore comparison of local indicators between scales will be compromised. The global indicators include basic descriptive counts: total population, population density, total migration flows and the mean, median, maximum and minimum values in the cells of the migration matrix together with various measures of migration intensity, effectiveness, connectivity and inequality. The local migration indicators computed for each BSU include those used for system-wide analysis and those capturing variation in out-migration and in-migration flows and in distance, turnover and churn. Full details of how each indicator is defined and calculated are available in the IMAGE Studio manual (Daras 2014).
Aggregation Methods and Indicators
Automated Aggregation Methods
Two Initial Random Aggregation (IRA) algorithms have been implemented in the IMAGE Studio: IRA and IRA-wave. The former provides a high degree of randomisation to ensure that the resulting aggregations are different during the iterations. Aggregation only takes place between contiguous zones and the algorithm is implemented following Openshaw’s Fortran subroutine (Openshaw 1976). The latter aggregation algorithm is a hybrid version of the former with strong influences from the mechanics of the Breadth First Search (BFS) algorithm. If we require N aggregated zones, the first step of the IRA-wave algorithm is to select N BSUs randomly from the initial set and assign each one to an empty region (ASR). Using an iterative process until all the BSUs have been allocated to the N ASRs, the algorithm identifies the BSUs contiguous with each ASR, targeting only the BSUs without an assigned ASR and adds them to each ASR respectively. The advantages of using the IRA-wave algorithm include its speed in producing a large number of initial aggregations and the fact that it produces relatively well-shaped regions in comparison to the more irregular shapes derived using the IRA algorithm.
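The IRA-wave steps described above can be sketched in Python as follows. This is one reading of the description, not the Studio's code: the function names and the grid test system are hypothetical, and the rule that ASRs claim unassigned neighbours in a fixed order within each wave is an assumption, since the text does not say how competing claims are resolved.

```python
import random

def ira_wave(contiguity, n_regions, seed=None):
    """Grow n_regions contiguous ASRs outwards from random seed BSUs, wave by wave."""
    rng = random.Random(seed)
    bsus = sorted(contiguity)
    seeds = rng.sample(bsus, n_regions)
    assignment = {bsu: asr for asr, bsu in enumerate(seeds)}
    frontier = {asr: [bsu] for asr, bsu in enumerate(seeds)}
    while len(assignment) < len(bsus):
        grew = False
        for asr in range(n_regions):
            next_wave = []
            for bsu in frontier[asr]:
                for neighbour in sorted(contiguity[bsu]):
                    if neighbour not in assignment:  # only unassigned BSUs are targeted
                        assignment[neighbour] = asr
                        next_wave.append(neighbour)
                        grew = True
            frontier[asr] = next_wave
        if not grew:
            raise ValueError("some BSUs cannot be reached: contiguity is not connected")
    return assignment

def grid_contiguity(rows, cols):
    """Rook contiguity for a rows x cols lattice of hypothetical BSUs."""
    contiguity = {}
    for r in range(rows):
        for c in range(cols):
            cell = r * cols + c
            contiguity[cell] = set()
            if r > 0:
                contiguity[cell].add(cell - cols)
            if r < rows - 1:
                contiguity[cell].add(cell + cols)
            if c > 0:
                contiguity[cell].add(cell - 1)
            if c < cols - 1:
                contiguity[cell].add(cell + 1)
    return contiguity

grid = grid_contiguity(4, 4)
assignment = ira_wave(grid, n_regions=3, seed=1)
```

Because every BSU joins an ASR only through a neighbour that already belongs to it, each ASR is contiguous by construction. Changing the seed gives a different configuration at the same scale.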
The minimisation of the attribute distances between the mean of each ASR and its constituent BSUs produces homogeneous ASRs consisting of BSUs with similar values for the selected variable. The similarity function in the IMAGE Studio can be used to deliver two aggregation outputs: one based on minimising the differences in population density between ASRs, which captures ASR urban/rural characteristics, and the other based on minimising the intra-ASR migration flows between the BSUs in each ASR, which results in ASRs with higher/lower intra-ASR flows respectively.
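A similarity objective of this kind can be illustrated with a short Python sketch. The text does not specify the distance measure, so the use of squared deviations from the ASR mean is an assumption, as are the function name and the density values.

```python
def within_asr_dissimilarity(values, assignment):
    """Sum of squared deviations of each BSU value from the mean of its ASR."""
    members = {}
    for bsu, asr in assignment.items():
        members.setdefault(asr, []).append(values[bsu])
    total = 0.0
    for asr_values in members.values():
        asr_mean = sum(asr_values) / len(asr_values)
        total += sum((v - asr_mean) ** 2 for v in asr_values)
    return total

# Hypothetical population densities for four BSUs: two rural, two urban.
density = {1: 10, 2: 12, 3: 100, 4: 110}

homogeneous = {1: "A", 2: "A", 3: "B", 4: "B"}  # rural with rural, urban with urban
mixed = {1: "A", 3: "A", 2: "B", 4: "B"}        # each ASR mixes rural and urban

score_homogeneous = within_asr_dissimilarity(density, homogeneous)
score_mixed = within_asr_dissimilarity(density, mixed)
```

A zone design procedure that minimises this score would prefer the first configuration, which groups BSUs of similar density.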
In a zone design context, the way to proceed from an existing aggregation to a better one is by swapping areal units at the borders of the ASRs, while optimising an objective function. During these swaps, it is possible for one ASR to lose its contiguity and therefore a method of holding contiguity intact is essential. For example, Openshaw’s Automated Zoning Procedure (AZP) tackled this problem by tracing an adjacency matrix using the Depth First Search (DFS) algorithm. The method of maintaining ASR contiguities should be as simple as possible, avoiding complicated structures that may lead to an exponential increase of processing time, during the iterative zone design procedure.
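The contiguity check that guards each swap can be sketched as below. This is a simple depth-first traversal over neighbour sets, written for illustration only. It is not Openshaw's AZP code, and the function names and the five-zone example are hypothetical.

```python
def is_contiguous(zone, contiguity):
    """True if every BSU in the zone can be reached from any other within the zone."""
    zone = set(zone)
    if not zone:
        return False
    start = next(iter(zone))
    seen, stack = {start}, [start]
    while stack:  # depth-first search with an explicit stack
        bsu = stack.pop()
        for neighbour in contiguity[bsu]:
            if neighbour in zone and neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen == zone

def can_move(bsu, donor, receiver, contiguity):
    """True if moving bsu from donor to receiver leaves both ASRs contiguous."""
    remaining = set(donor) - {bsu}
    touches_receiver = any(n in receiver for n in contiguity[bsu])
    return bool(remaining) and touches_receiver and is_contiguous(remaining, contiguity)

# Hypothetical system: BSU 2 is the only link between BSUs 1 and 3.
contiguity = {1: {2}, 2: {1, 3, 4}, 3: {2, 5}, 4: {2, 5}, 5: {3, 4}}
donor, receiver = {1, 2, 3}, {4, 5}
```

Here BSU 3 can be moved to the receiving ASR, whereas moving BSU 2 would split the donor ASR into two disconnected parts and is therefore rejected.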
Additional zone design properties could be identified as equally important, such as the initial aggregation algorithm, the starting point for a zone design system. An initial aggregation targeting the criteria directly is avoided as the main zone design procedure is likely to be trapped into local optima and end the process, thus providing an inadequate solution. Hence, Openshaw (1977, 1978) suggested the use of an IRA algorithm focusing on the principle of contiguous zones as an appropriate first aggregation, which provides a high degree of randomisation to ensure that the resulting aggregations differ during each iteration. It has been implemented in the IMAGE Studio with object-oriented principles, thus avoiding the sustained sequential processes and resulting in much quicker random aggregation (Daras 2014). However, the alternative IRA-wave algorithm, a hybrid version of the original IRA algorithm and the BFS algorithm, provides a swifter solution and is often preferred when further optimisation is not required.
Although the three characteristics of a zone design system (the objective function, the contiguity checking algorithm and the initial aggregation) are structurally important, it is possible to introduce further criteria in order to influence the shape (compactness) of ASRs. Each criterion applied to zone design acts as a constraint on the optimum solution and adds to the processing time. Therefore, extensive use of criteria should be avoided if the study does not require such constraints.
Internal Migration Indicators
Thus, a high migration impact might result from high levels of both CMI and MEI or a high value of one component offsetting a low value of the other. The variation in the relationship between these two components has been explained by Rees et al. (2016).
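The relationship between the two components and migration impact can be illustrated numerically. The sketch below assumes the definitions commonly used in the IMAGE literature, with all three measures expressed as percentages: CMI as inter-zonal migrants per 100 population, MEI as the sum of absolute net migration over gross migration, and impact measured by an aggregate net migration rate (labelled ANMR here) equal to CMI × MEI / 100. The function name and the three-zone data are hypothetical, and the Studio's own formulas may differ in scaling.

```python
def migration_indicators(flows, populations):
    """Return (CMI, MEI, ANMR) as percentages for a set of zones.

    flows:       {(origin, destination): migrants}; diagonal cells are ignored
    populations: {zone: population at risk}
    """
    out_flow = {zone: 0 for zone in populations}
    in_flow = {zone: 0 for zone in populations}
    for (origin, destination), count in flows.items():
        if origin != destination:
            out_flow[origin] += count
            in_flow[destination] += count
    migrants = sum(out_flow.values())
    population = sum(populations.values())
    net = sum(abs(in_flow[z] - out_flow[z]) for z in populations)
    cmi = 100.0 * migrants / population   # crude migration intensity
    mei = 100.0 * net / (2 * migrants)    # migration effectiveness index
    anmr = 100.0 * 0.5 * net / population # aggregate net migration rate
    return cmi, mei, anmr

# Hypothetical three-zone system.
flows = {(1, 2): 30, (2, 1): 10, (1, 3): 20, (3, 1): 20, (2, 3): 5, (3, 2): 15}
populations = {1: 1000, 2: 500, 3: 500}

cmi, mei, anmr = migration_indicators(flows, populations)
```

With these values CMI is 5.0, MEI is 30.0 and ANMR is 1.5, so the identity ANMR = CMI × MEI / 100 holds. The same impact of 1.5 could equally arise from a CMI of 15.0 combined with an MEI of 10.0, which is the offsetting described in the text.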
Sources of Internal Migration Data and Spatial Units
Internal migration data are collected in countries around the world using various different collection instruments; in England and Wales, for example, the national statistical agency – the Office for National Statistics (ONS) – retains a migration question in its decadal census but estimates annual migration between censuses by comparing the addresses of National Health Service (NHS) patient registers from 1 year to the next, and also draws on the Labour Force Survey (LFS) for samples of data on migrants whose behaviour is linked to the labour market.
In this paper, we use internal migration flows for the UK obtained from the 2011 Census Special Migration Statistics (SMS) to illustrate results from the IMAGE Studio. The data format is a matrix of the flows between 404 local authority districts (LADs) in the UK for the 12 month period prior to the 2011 Census. There are three national statistical agencies in the UK (for England and Wales, Scotland and Northern Ireland respectively), each of which undertakes an independent but partially harmonised census. One consequence of this division of labour is that the ONS has to compile a full set of sub-national migration flows between LADs in the UK. This synthesis is only undertaken with single-year census data once a decade.
Populations at risk are required if the user wishes to compute migration intensities or use the population equality algorithm; in this instance, usually resident populations of LADs across the UK in 2011 are extracted from the 2011 Census using the InFuse interface to Aggregate Data on the UK Data Service web site. These end-of-period populations are not the ideal populations at risk for migration rates in the previous 12 months but since no start-of-period populations are available, and therefore no mid-period populations can easily be derived, the end-of-period populations are deemed to be the most suitable. Finally, the boundaries of these LAD administrative units have been sourced from the UK Data Service repository of Boundary Data using the EasyDownload facility. While the Studio’s Data Preparation Subsystem automatically ensures that all mainland LADs are contiguous with at least one other mainland LAD, contiguities between each of the Isle of Wight, Belfast (Northern Ireland), Western Isles, Orkneys and Shetlands and their respective nearest neighbours on the mainland are added to the contiguities file that is created by the Data Preparation Subsystem for use in the subsequent aggregation. Having explained where the data come from, we turn our attention to reporting the results of running the various aggregation approaches available in the Studio with the UK 2011 Census internal migration flow data.
Choice of Aggregation Algorithm
Scale and Zonation Effects for Selected Indicators Using the IRA-wave Algorithm
The shaded areas around the lines of central tendency reflect the variation due to alternative configurations or shapes of ASRs as measured by the inter-quartile range (darker shading) and the full range (lighter shading). The shaded areas give a useful visualisation of the zonation effect of the MAUP, an effect which is most apparent for the MEI indicator and least evident for the CMI. Thus, we observe that whereas the number of zones is important in measuring the intensity and the overall impact of migration on the population distribution, the shape and configuration of zones is more important when measuring how effective migration is as a process of population redistribution.
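The summary statistics behind such a plot can be sketched in Python using the standard library. The indicator values below are invented for illustration and the function name is hypothetical. The choice of the "inclusive" quartile method is also an assumption, since the text does not state how the inter-quartile range is computed.

```python
from statistics import mean, quantiles

def summarise(values_by_scale):
    """Central tendency and spread of an indicator across configurations at each scale."""
    summary = {}
    for scale, values in sorted(values_by_scale.items()):
        q1, median, q3 = quantiles(values, n=4, method="inclusive")
        summary[scale] = {
            "mean": mean(values),
            "median": median,
            "iqr": q3 - q1,                     # darker shading in the plot
            "range": max(values) - min(values), # lighter shading in the plot
        }
    return summary

# Hypothetical indicator values: five configurations at each of two scales.
values_by_scale = {
    10: [1, 2, 3, 4, 5],
    20: [2, 4, 6, 8, 10],
}

summary = summarise(values_by_scale)
```

The change in the mean or median from one scale to the next corresponds to the scale effect, while the inter-quartile range and full range at a single scale correspond to the zonation effect.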
Scale Effects Using the IRA Algorithm with Alternative Functions
Conclusions
The redistribution of the population through internal migration has become increasingly important as a component of population change in many countries around the world, including the UK, yet most research studies are based on data on migration flows between one set of administrative or statistical zones at one particular spatial scale. This poses intractable problems for policy-makers who want to compare internal migration in different countries using one or more indicators and to understand the relationship between migration and development. As we have shown in the case of the UK, the intensity at which people move between regions depends upon the size and shape of the regions concerned and it is only when all internal migrations are included in an aggregate CMI that direct comparisons between countries can be made and national league tables constructed as reported in Bell et al. (2015b). However, we contend that the IMAGE Studio provides researchers and practitioners with a means to develop a much better understanding of how different migration indicators are affected by the scale and zonation components of the MAUP. In this paper we have looked at selected indicators of migration intensity, effectiveness, impact, distance and distance deterrence and shown that whilst intensity, impact and distance vary significantly by scale but less so by zonation, migration effectiveness and distance deterrence show greater scale independence but more sensitivity to zone shape.
Whilst these results are based on analysis of multiple zone configurations across a range of scales, the paper has also reported the scale effects when zones are optimised at different scales using the alternative algorithms available in the Studio that maximise certain objective functions subject to the constraints of contiguity. There are subtle differences in the scale gradients for particular indicators with the zone shape constraint serving to reduce the variations between the results from using different algorithms in all cases. We also observe that an optimised indicator at a particular scale may fall outside the range of values computed when the IRA-wave algorithm is adopted. This finding is not unexpected because we explore only a fraction of possible configurations using the IRA-wave aggregations (200 iterations per scale) under the shape constraints of adjacent regions. Fundamentally, the full exploration of possible configurations is a large computational problem and even today an exhaustive algorithm is only applicable to small aggregation problems (Keane 1975). One interesting conclusion that emerges from using the IMAGE Studio is that a migration indicator such as the CMI calculated on the basis of published migration data at one spatial scale does not necessarily reflect the ‘true’ migration rate because it reflects the particular size and shape characteristics of the zones that have been used to collect the migration data in the first place. The mean value of the CMI computed from many configurations at any one scale, i.e. with the same number of zones, will offer a better measure. Further research is required using countries where data are available on migration at different spatial scales to compare published rates with estimated means derived using the IMAGE Studio from configurations based on lower level spatial units.
Whilst the results of the IMAGE project have reported the use of the Studio for comparative analysis of internal migration in different countries around the world (Bell et al. 2015b; Rees et al. 2016; Stillwell et al. 2016) where zone systems are very different, there is also the potential to use the Studio to explore how scale and zonation effects might vary by demographic (age, sex, ethnicity) or socio-economic (occupation, tenure, health status) group in any single country (see Stillwell et al. 2018, for an initial study of variations by age group in the UK). A further avenue of investigation might be to explore the relationship between migration indicators and explanatory variables at different spatial scales using correlation analysis of the type that was employed to investigate the MAUP effects in earlier studies of stock variables. Moreover, the aggregation algorithms in the Studio might be usefully adapted to provide an automated system for aggregating explanatory variables and generating summary measures.
The IMAGE Studio was developed as part of the Discovery Project, DP11010136, Comparing Internal Migration around the World (2011–2015), funded by the Australian Research Council.
Compliance with Ethical Standards
Conflict of Interest
The authors declare that they have no conflict of interest.
References
- Bell, M., Bernard, A., Ueffing, P., & Charles-Edwards, E. (2014). The IMAGE repository: A user guide. Working Paper No 2014/01, Queensland Centre for Population Research, School of Geography, Planning and Environmental Management, The University of Queensland, Brisbane.
- Daras, K. (2014). IMAGE studio 1.4.2 user manual. School of Geography, University of Leeds, Leeds.
- Duke-Williams, O., Routsis, V., & Stillwell, J. (2018). Census interaction data and the means of access. In J. Stillwell (Ed.), The Routledge handbook of census resources, methods and applications: Unlocking the UK 2011 census (pp. 110–125). London: Routledge.
- Holt, D., Steel, D. G., & Tranmer, M. (1996). Area homogeneity and the modifiable areal unit problem. Geographical Systems, 3, 181–200.
- Luenberger, D. (1973). Introduction to linear and non-linear programming. Boston: Addison-Wesley.
- Manley, D. (2005). The modifiable areal unit phenomenon: An investigation into the scale effect using UK census data. Unpublished PhD Thesis, School of Geography and Geosciences, University of St Andrews.
- Openshaw, S. (1976). A regionalisation procedure for a comparative regional taxonomy. Area, 8, 149–152.
- Openshaw, S. (1978). An optimal zoning approach to the study of spatially aggregated data. In I. Masser & P. J. B. Brown (Eds.), Spatial representation and spatial interaction (pp. 93–113). Leiden: Martinus Nijhoff.
- Openshaw, S. (1984). The modifiable areal unit problem, CATMOG 38. Norwich: Geo Books.
- Openshaw, S., & Taylor, P. (1979). A million or so correlation coefficients: Three experiments on the modifiable areal unit problem. In N. Wrigley (Ed.), Statistical applications in the spatial sciences (pp. 127–144). London: Pion.
- Rees, P., Bell, M., Kupiszewski, M., Kupiszewska, D., Ueffing, P., Bernard, A., Charles-Edwards, E., & Stillwell, J. (2016). The impact of internal migration on population redistribution: An international comparison. Population, Space and Place, 23. https://doi.org/10.1002/psp.2036
- Stillwell, J. (1983). SPAINT: A computer program for spatial interaction model calibration and analysis. Computer Manual 14, School of Geography, University of Leeds, Leeds.
- Stillwell, J. (1990). Spatial interaction models and the propensity to migrate over distance. In J. Stillwell & P. Congdon (Eds.), Migration models: Macro and micro approaches (pp. 34–56). London: Belhaven Press.
- Stillwell, J., Lomax, N., & Chatagnier, S. (2018). Changing intensities and spatial patterns of internal migration in the United Kingdom. In J. Stillwell (Ed.), The Routledge handbook of census resources, methods and applications: Unlocking the UK 2011 census (pp. 362–376). London: Routledge.
- Tranmer, M., & Steel, D. (2001). Using local census data to investigate scale effects. In N. J. Tate & P. M. Atkinson (Eds.), Modelling scale in geographical information science (pp. 105–122). Chichester: Wiley.
- Wilson, A. G. (1970). Entropy in urban and regional modelling. London: Pion.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.