Quantifying demographic and socioeconomic transitions for computational epidemiology: an open-source modeling approach applied to India
Demographic and socioeconomic changes such as increasing urbanization, migration, and female education shape population health in many low- and middle-income countries. These changes are rarely reflected in computational epidemiological models, which are commonly used to understand population health trends and evaluate policy interventions. Our goal was to create a “backbone” simulation modeling approach to allow computational epidemiologists to explicitly reflect changing demographic and socioeconomic conditions in population health models.
We developed, evaluated, and “open-sourced” a generalized approach to incorporate longitudinal, commonly available demographic and socioeconomic data into epidemiological simulations, illustrating the feasibility and utility of our approach with data from India. We constructed a series of nested microsimulations of increasing complexity, calibrating each model to longitudinal sociodemographic and vital registration data. We then selected the model that was most consistent with the data (i.e., greater accuracy) while containing the fewest parameters (i.e., greater parsimony). We validated the selected model against additional data sources not used for calibration.
We found that standard computational epidemiology models that do not incorporate demographic and socioeconomic trends quickly diverged from past mortality and population size estimates, while our approach remained consistent with observed data over decadal time courses. Our approach additionally enabled the examination of complex relations between demographic, socioeconomic and health parameters, such as the relationship between changes in educational attainment or urbanization and changes in fertility, mortality, and migration rates.
Incorporating demographic and socioeconomic trends in computational epidemiology is feasible through the “open source” approach, and could critically alter population health projections and model-based evaluations of health policy interventions in unintuitive ways.
Keywords: Mathematical model; India; Demography; Socioeconomic factors; Population health
DIC: Deviance Information Criterion
DLHS: District Level Household and Facility Survey
NFHS: National Family and Health Survey
Stanford Project for Open Knowledge in Epidemiology
SRS: Sample Registration System of India
UNPD: United Nations Population Division
To understand and address population health trends and to evaluate potential health policy interventions, mathematical simulation models are commonly used to relate individual-level risk factor exposures and interventions to population-level health outcomes. Demographic and socioeconomic conditions that shape population health are changing rapidly in many low- and middle-income countries. These changes are challenging to incorporate into models, as they affect population health in complex ways. For example, rapid urbanization may have both positive and negative effects on population health. Urbanization can increase access to skilled medical care, and potentially facilitate higher household earnings that correlate with improved health outcomes. However, rural migrants to urban areas often encounter increased exposure to environmental pollution, slum living, and disease risks stemming from unhealthy diets [2, 3, 4]. Large developing countries are shifting from being majority rural to mostly urban by 2050, highlighting the pressing importance of understanding the health effects of complex socioeconomic transitions. In addition to urbanization, other complex socioeconomic transitions include the increase in age-associated disability and chronic diseases [6, 7]. Educational attainment and literacy levels are also increasing [8, 9], and accompany lower fertility, higher female labor force participation and associated complex changes in maternal and child health outcomes.
Modeling the complex interactions of demographic and socioeconomic conditions requires accounting for simultaneous, interacting exposures experienced by individuals over their lifetimes. In the past, numerous high-quality health policy models implicitly assumed that current exposures to demographic and socioeconomic conditions would remain the same in the future [12, 13, 14, 15]. This assumption is understandable, given the challenges of estimating parameters to describe how different birth cohorts experience varying exposures; however, such factors may influence the accuracy of projections made by these models.
The goal of our study is to bridge the gap between available data on demographic and socioeconomic changes in low- and middle-income countries and simulation models of population health and health policy. Our specific aim in this paper is to create, validate, and “open source” a simulation modeling approach that allows population health modelers to explicitly reflect changing demographic and socioeconomic conditions. Our approach is intended to be intuitive and simple enough to be easily incorporated into health policy simulations, but also faithful to available data. We illustrate it using India’s demographic and socioeconomic data on trends in fertility, all-cause mortality, education, and migration, using the types of datasets available from many developing countries. To facilitate replication and extension of our approach, we provide our complete data and model code (see Additional File 1 and Tables AF1-AF5).
Table 1 Data used for model calibration and parameter estimation
- Number of total children ever born to mother, by maternal age and urban/rural residence (Additional file 1: Table AF1)
- Sample Registration System (SRS), 1972–2008: Probability of death by calendar year, age, sex, and urban/rural residence (Additional file 1: Table AF2)
- Prevalence of no education, primary school, secondary school, and greater than secondary school education, by age, sex, and urban/rural residence (Additional file 1: Table AF3)
- Proportion of women who had migrated from urban to rural areas or vice versa within the last 6 years, 12 years, and ever in their lifetime, stratified by age, sex, and urban/rural residence (Additional file 1: Table AF4)
- United Nations Population Division, 1992–2010: Absolute population size by calendar year, stratified by urban/rural residence (Additional file 1: Table AF5)
Comparisons of three calibrated models reveal that one incorporating secular trends is more consistent with the observed data
Columns: model components; ΔDIC when fit against Table 1 data sources
- Model 1: Age, sex, urban/rural residence, fertility, mortality
- Model 2: Age, sex, urban/rural residence, fertility, mortality, educational attainment; ΔDIC +5.2 versus model 1
- Model 3: Age, sex, urban/rural residence, fertility, mortality, educational attainment, migration; ΔDIC −259.1 versus model 2
We used population-representative data sources (Table 1). These included the United Nations Population Division’s (UNPD) historic estimates and future projections of population size and composition by sex and urban/rural residence (1992–2025), as well as India’s Sample Registration System (SRS) and four large-scale household surveys conducted in India between 1992 and 2008 [23, 24, 25, 26, 27]. Household surveys included all available waves of India’s National Family and Health Survey (NFHS-1 [1992–3], NFHS-2 [1998–9], and NFHS-3 [2005–6]), which are part of the Demographic and Health Surveys conducted in over 90 countries every five years. We also used India’s District Level Household and Facility Survey (DLHS-3 [2007–8]) to provide more recent data. All analyses of the household surveys employed sample weights to adjust for non-coverage and non-response, allowing us to calculate nationally-representative estimates.
Starting population size and composition
The starting size and composition of India’s female population in 1992 was input into the models based on the UNPD overall female population size and its urban/rural-specific age distribution. The educational attainment category distribution (0, 1–5, 6–12, or >12 years of schooling) for each age and urban/rural specific subgroup was determined from the NFHS-1.
The NFHS provided data on fertility. Specifically, each NFHS wave provided estimates of the number of children born to a woman stratified by a woman’s age, urban/rural residence, educational attainment category, and calendar year.
The NFHS provided information on migration between rural and urban areas in India, specifically the proportion of women who had migrated from rural to urban areas or vice versa within the previous six years, 12 years, and ever in their lifetime, stratified by age, urban/rural residential status, educational attainment category, and calendar year.
The SRS provided the main data on mortality. Specifically, the SRS life tables for 1972 to 2008 contain estimated death rates stratified by age, sex, urban/rural residence, and calendar year. However, as the SRS life tables are not stratified by educational attainment, we supplemented the SRS data with information derived from the DLHS-3 on mortality stratified by educational attainment from verbal autopsies of household members (further detailed below in the Mortality section of ‘Modeled Processes and Inputs’).
Modeled processes and inputs
As illustrated in Fig. 1, individual women in our models age deterministically and experience annual risks of fertility, mortality, and (when included in the model) migration. Risks depend on the calendar year of a woman’s birth, her current age, urban/rural residence, and (when included) educational attainment category. Below we summarize these modeled processes (further details in Additional file 1: AF1).
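The annual cycle described above can be sketched as a single update step for one simulated woman. This is a minimal illustration, not the authors' MATLAB code: the `Woman` record, the `simulate_year` function, the order of events within a year, and the three risk callables are all assumptions introduced here, and migration is simplified to the rural-to-urban direction.

```python
import random
from dataclasses import dataclass


@dataclass
class Woman:
    birth_year: int
    age: int
    urban: bool
    education: int          # category index 0..3 (hypothetical coding)
    alive: bool = True


def simulate_year(woman, year, p_death, p_birth, p_migrate, assign_education, rng):
    """Advance one woman by one year; return a newborn daughter or None.

    p_death, p_birth and p_migrate are hypothetical callables that return an
    annual probability for this woman in this calendar year. The event order
    (death, then birth, then migration, then aging) is an assumption.
    """
    if not woman.alive:
        return None
    if rng.random() < p_death(woman, year):
        woman.alive = False
        return None
    daughter = None
    if rng.random() < p_birth(woman, year):
        # The newborn girl shares her mother's residence at the time of birth.
        daughter = Woman(birth_year=year, age=0, urban=woman.urban,
                         education=assign_education(year, woman.urban))
    if not woman.urban and rng.random() < p_migrate(woman, year):
        woman.urban = True   # net rural-to-urban move
    woman.age += 1           # aging is deterministic
    return daughter


# Illustrative run with made-up, constant risks.
rng = random.Random(1)
mother = Woman(birth_year=1970, age=22, urban=False, education=1)
child = simulate_year(mother, 1992,
                      p_death=lambda w, y: 0.0,
                      p_birth=lambda w, y: 1.0,
                      p_migrate=lambda w, y: 1.0,
                      assign_education=lambda y, urban: 0,
                      rng=rng)
```

In a full simulation the three callables would be replaced by the calibrated fertility, mortality, and migration functions, each depending on birth year, age, residence, and (when included) educational attainment.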
In equation 1, G(t) is the cumulative age-specific fertility rate up to age t (i.e., total births per woman of each age in a given NFHS survey wave). The parameter F is the cumulative total fertility rate for women of age t in the given survey wave, a is the median age of fertility (the age by which a woman has given birth to half of the total number of children she will ever bear), and b is the length of the age interval during which the fertility level rises from 5 % to 95 % of the cumulative rate. The parameters for the G–P model along with a linear secular trend in fertility were fit to the empirical data as part of our overall calibration procedure, which uses the Markov Chain Monte Carlo (MCMC) estimation approach detailed below. Newborn girls enter the model via this fertility process and are then included as members of new birth cohorts, such that they age and potentially experience fertility themselves in future years. The newborn girls share the same urban/rural residential status as their mothers at the time of their birth and are placed into educational attainment categories based on parameters fit to ensure that the educational attainment of these girls matches secular trends in education prevalence when they reach the 20 to 24 year age range (see below under ‘Educational Attainment’).
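The equation itself is not reproduced in this text, so the following is only a sketch of one Gompertz-type curve that satisfies the three definitions given for F, a, and b: G(t) = F · 0.5^exp(−c (t − a) / b), with c = ln(ln 0.05 / ln 0.95). It should not be read as the authors' exact Equation 1. The function names and the example values are hypothetical.

```python
import math

# Constant chosen so that the curve rises from 5 % to 95 % of F over an
# age interval of exactly length b: ln( ln(0.05) / ln(0.95) ).
SPREAD = math.log(math.log(0.05) / math.log(0.95))


def cumulative_fertility(t, F, a, b):
    """Cumulative births per woman by age t under a Gompertz-type curve.

    F: cumulative total fertility; a: age at which half of F is reached;
    b: length of the age interval between 5 % and 95 % of F.
    """
    return F * 0.5 ** math.exp(-SPREAD * (t - a) / b)


def expected_births(t, F, a, b):
    """Expected births between exact ages t and t + 1 (difference of the curve)."""
    return cumulative_fertility(t + 1, F, a, b) - cumulative_fertility(t, F, a, b)


# Hypothetical parameter values, for illustration only.
F_example, a_example, b_example = 3.0, 25.0, 20.0
half_way = cumulative_fertility(a_example, F_example, a_example, b_example)
```

By construction, the curve equals F/2 at age a, and the ages at which it reaches 5 % and 95 % of F are exactly b years apart. Differencing the cumulative curve, as in `expected_births`, gives the annual fertility input a microsimulation would need.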
We estimated the relative risk of death based on educational attainment for a woman of a given age and urban/rural status from the DLHS (see Additional file 1: AF1 for details on estimating the relative risk). The NFHS provided estimates of the proportion p of the population in each educational attainment category for each group in the three calendar years of the NFHS waves; we focused on 20 to 24 year-olds in the decompositions, since educational attainment had plateaued by this age.
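A common way to combine an overall death rate with category proportions and relative risks is to treat the overall rate as the prevalence-weighted average of the category-specific rates. The sketch below assumes that identity; the article's own decomposition equation is not reproduced here, so this may differ in detail from the authors' method. All numbers and function names are hypothetical.

```python
def reference_mortality(overall_rate, proportions, relative_risks):
    """Back out the reference-group death rate.

    Assumes overall_rate = sum_k p_k * RR_k * m_ref, with RR = 1 for the
    reference (lowest educational attainment) category.
    """
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    weighted_rr = sum(p * rr for p, rr in zip(proportions, relative_risks))
    return overall_rate / weighted_rr


def category_rates(overall_rate, proportions, relative_risks):
    """Death rate in each educational attainment category."""
    m_ref = reference_mortality(overall_rate, proportions, relative_risks)
    return [m_ref * rr for rr in relative_risks]


# Hypothetical inputs: one age and residence stratum, four education categories.
p = [0.4, 0.3, 0.2, 0.1]     # proportion in each category
rr = [1.0, 0.8, 0.6, 0.5]    # relative risk of death versus the reference
rates = category_rates(0.010, p, rr)
```

Recombining the category rates with the proportions returns the original overall rate, which is a quick consistency check on any such decomposition.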
After decomposing the data into the mortality rate for the reference group (lowest educational attainment level) and the relative risk of death for each higher educational attainment group relative to the reference group, we fit a standard Lee-Carter-type model to the decomposed mortality rates to project future trends in mortality. The model fits three parameters to the log mortality rate: a constant, a parameter multiplied by calendar year, and a parameter multiplied by age. The three parameters were fit to the data for each of three age clusters (<1 year olds, 1–10 year olds, >10 year olds) along with all other model parameters (e.g., those describing fertility and migration) as part of a single Markov Chain Monte Carlo (MCMC) fitting process described below.
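The log-linear form described above (a constant, a calendar-year term, and an age term, fitted separately for three age clusters) can be written compactly. In this sketch the parameter values are invented, the cluster labels are hypothetical, and measuring calendar year from 1992 is an assumption made to keep the intercept interpretable.

```python
import math


def age_cluster(age):
    """Map an age in years to one of the three clusters named in the text."""
    if age < 1:
        return "under_1"
    if age <= 10:
        return "1_to_10"
    return "over_10"


def mortality_rate(age, year, params, base_year=1992):
    """Reference-group death rate: log m = c0 + c_year * (year - base) + c_age * age."""
    c0, c_year, c_age = params[age_cluster(age)]
    return math.exp(c0 + c_year * (year - base_year) + c_age * age)


# Hypothetical coefficients (constant, year slope, age slope) per cluster.
example_params = {
    "under_1": (-2.8, -0.03, 0.0),
    "1_to_10": (-5.5, -0.03, -0.10),
    "over_10": (-7.5, -0.01, 0.07),
}
rate_1992 = mortality_rate(40, 1992, example_params)
rate_2010 = mortality_rate(40, 2010, example_params)
```

A negative year coefficient produces a declining secular trend in mortality; setting it to zero recovers a static model in which death rates stay at their starting-year values.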
Educational attainment in each of the four categories (0, 1–5, 6–12, or >12 years of schooling) was assigned to each newborn girl based on birth year and urban/rural residence, accounting for the linear secular trends in educational attainment by calendar year such that the educational prevalence each year follows the trend among women aged 20 to 24 years old.
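One simple way to implement such an assignment is to let the share of each category change linearly with birth year, renormalize the shares, and then draw a category at random. The sketch below assumes this mechanism; the starting shares, slopes, and function names are hypothetical and are not taken from the calibrated model.

```python
import random


def category_shares(birth_year, base_shares, slopes, base_year=1992):
    """Shares of the four education categories for a given birth year.

    Each share follows a linear trend, is floored at zero, and the set is
    renormalized so that the shares sum to one.
    """
    raw = [max(0.0, s + m * (birth_year - base_year))
           for s, m in zip(base_shares, slopes)]
    total = sum(raw)
    return [r / total for r in raw]


def assign_education(birth_year, base_shares, slopes, rng):
    """Draw a category index (0 = no schooling ... 3 = more than 12 years)."""
    shares = category_shares(birth_year, base_shares, slopes)
    return rng.choices(range(len(shares)), weights=shares, k=1)[0]


# Hypothetical rural values: shares in the base year and annual change in share.
base = [0.50, 0.20, 0.25, 0.05]
trend = [-0.01, 0.00, 0.01, 0.00]
shares_2002 = category_shares(2002, base, trend)
draw = assign_education(2002, base, trend, random.Random(7))
```

In the approach described above, separate trends would be used for urban and rural newborns, and the slopes would be calibrated so that each cohort matches the observed prevalence among 20 to 24 year-olds.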
The NFHS-2 provides counts of how many urban dwellers in 1998 lived in a rural zone in 1992, and conversely how many rural dwellers in 1998 lived in an urban zone in 1992. The NFHS-3 similarly reveals how many urban dwellers in 2005 lived in a rural zone in 1999 and 1993, and how many rural dwellers in 2005 lived in an urban zone in 1999 and 1993. We estimated, through the MCMC fitting procedure detailed below, the annual net rate of rural-to-urban migration and the linear trend in this rate to match the NFHS data.
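To compare an annual rate with survey data covering six- or twelve-year windows, the annual rate has to be accumulated over the window. The sketch below assumes a net rate that changes linearly with calendar year and ignores return migration and mortality within the window; the rate, slope, and function names are hypothetical.

```python
def annual_rate(year, r0, slope, base_year=1992):
    """Net rural-to-urban migration rate in a calendar year, clipped to [0, 1]."""
    return min(1.0, max(0.0, r0 + slope * (year - base_year)))


def prob_moved(start_year, end_year, r0, slope):
    """Probability that a rural woman has moved to an urban area by end_year."""
    stay = 1.0
    for year in range(start_year, end_year):
        stay *= 1.0 - annual_rate(year, r0, slope)
    return 1.0 - stay


# Hypothetical parameters: 1 % per year in 1992, rising by 0.02 points per year.
six_year_flat = prob_moved(1992, 1998, r0=0.01, slope=0.0)
six_year_trend = prob_moved(1992, 1998, r0=0.01, slope=0.0002)
```

During calibration, quantities like `six_year_trend` would be compared with the corresponding NFHS proportions for each window.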
Model targets, calibration, selection, and validation
We sought a model that accurately and parsimoniously reproduced the observed data on levels and trends in fertility, mortality, urban/rural status, educational attainment, and migration. We also wanted the model to yield the joint uncertainty distribution around the parameter estimates describing these processes, so that we could learn how these demographic and socioeconomic factors relate to one another. Such a model and joint uncertainty distribution can then be used to make future population projections with uncertainty bounds and be incorporated into decision-analytic health policy models to examine the effects of interventions over time. Therefore, we calibrated each of our three candidate models (Fig. 1) to a set of empirical targets using a standard MCMC algorithm.
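The specific MCMC algorithm is detailed in Additional file 1 rather than in this text, so the following is only a generic random-walk Metropolis sketch on a deliberately tiny problem: estimating one death rate from a hypothetical count of 30 deaths among 1000 women, with a flat prior. The data, step size, and function names are assumptions for illustration.

```python
import math
import random


def log_posterior(rate, deaths, population):
    """Binomial log-likelihood (up to a constant) with a flat prior on (0, 1)."""
    if not 0.0 < rate < 1.0:
        return -math.inf
    return deaths * math.log(rate) + (population - deaths) * math.log(1.0 - rate)


def metropolis(deaths, population, n_iter=20000, burn_in=2000, step=0.01, seed=0):
    """Random-walk Metropolis sampler; returns post-burn-in draws of the rate."""
    rng = random.Random(seed)
    current = rng.uniform(0.01, 0.5)          # random starting point
    current_lp = log_posterior(current, deaths, population)
    samples = []
    for i in range(n_iter):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal, deaths, population)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        if i >= burn_in:
            samples.append(current)
    return samples


# Several chains from different random starts, as a basic convergence check.
chains = [metropolis(30, 1000, seed=s) for s in range(3)]
chain_means = [sum(c) / len(c) for c in chains]
```

Agreement between chains started from different points is the same kind of evidence of convergence that repeated fitting from random starting values provides. In the full model the single rate would be replaced by the vector of fertility, mortality, education, and migration parameters.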
The model targets for calibration included those listed in Table 1 (see complete data in Additional file 1: Tables AF1-AF5). The MCMC algorithm updated vague prior distributions on a set of calibrated parameters to fit these targets. These parameters included the cumulative total fertility rate, median age of fertility, and length of the age interval, each by urban/rural residence and (when included) by educational attainment level, along with a linear secular trend in fertility (Equation 1 above); the Lee-Carter parameters for mortality rate along with the relative risk of mortality by urban/rural residence and (when included) by educational attainment level; the net rural-to-urban migration rate and linear secular trend in migration rate by (when included) educational attainment level; and (when included) the linear secular trend in educational attainment by urban/rural residence. We repeated the fitting process 10 times from random starting points to ensure convergence to a stable posterior joint distribution of parameter estimates (see Additional file 1: Figure AF1). We then selected among the three candidate models using the Deviance Information Criterion (DIC) to choose the calibrated model that best fit the data relative to its complexity (see Additional file 1: AF1 for details of the MCMC calibration, including convergence statistics).
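For readers unfamiliar with the DIC, the standard definition can be computed directly from MCMC output: the posterior mean deviance plus an effective number of parameters, where deviance is −2 times the log-likelihood. The sketch below uses that standard definition; the log-likelihood values are invented and the function names are hypothetical.

```python
def dic(sample_log_likelihoods, log_likelihood_at_posterior_mean):
    """Return (DIC, effective number of parameters p_D).

    DIC = mean deviance + p_D, with p_D = mean deviance minus the deviance
    evaluated at the posterior mean of the parameters.
    """
    n = len(sample_log_likelihoods)
    mean_deviance = sum(-2.0 * ll for ll in sample_log_likelihoods) / n
    deviance_at_mean = -2.0 * log_likelihood_at_posterior_mean
    p_d = mean_deviance - deviance_at_mean
    return mean_deviance + p_d, p_d


def delta_dic(dic_candidate, dic_baseline):
    """Negative values favour the candidate model over the baseline."""
    return dic_candidate - dic_baseline


# Hypothetical log-likelihoods from three posterior draws of one model.
dic_value, p_d = dic([-10.0, -12.0, -11.0], -10.5)
```

Under this convention, a ΔDIC of −259.1 for model 3 versus model 2 indicates that adding migration improved fit by far more than the penalty for the extra parameters, whereas +5.2 for model 2 versus model 1 indicates no such improvement.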
Two forms of model validation were performed [32, 33]. We evaluated internal validity by ensuring that modeled outcomes fit all input data shown in Table 1. We then evaluated external validity by ensuring that modeled estimates of life expectancy among simulated individuals matched independent estimates.
Assessing the importance of capturing secular trends
To assess the degree to which these efforts to model trends in demographic and socioeconomic conditions could alter model projections, we compared the chosen model to a static model equivalent that included all model components but did not include changes over time in any of the demographic or socioeconomic inputs; that is, fertility, mortality, migration or educational attainment parameters were held fixed at their starting-year values, as is the current standard approach [12, 13, 14, 15]. We first compared the two models over the period 1992 to 2010, contrasting their urban- and rural-specific population size estimates with observed data. We next compared the two models starting from the year 2010 and projecting a further 15 years into the future, in order to characterize the degree of divergence between the two sets of model estimates (fixed and with trends) and independent United Nations population projections. We finally compared life expectancy estimates from the two models in terms of both historical (1992–2010) and future projections (2010–2025), and contrasted predictions from the models for the impact of a simulated intervention: efforts to increase the level of educational attainment achieved by rural women, which would have provided universal primary education to rural females in the year 2000. Prior intervention studies (i.e., cluster randomized trials and natural experiments) have established that increasing primary education availability to women lowers mortality through a number of complex mechanisms such as reducing early marriage and associated premature fertility that increases the risk of maternal mortality [35, 36]. We increased the educational attainment rates among rural females to simulate universal primary education in 2000, comparing the resultant estimated life expectancy differences between the two models over subsequent years.
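The logic of the static-versus-trend comparison can be illustrated with a deliberately crude aggregate projection in which per-capita birth and death rates either stay at their starting-year values or change linearly. This is a toy, not the microsimulation itself: the rates, slopes, starting population, and function name are hypothetical.

```python
def project(pop0, years, birth0, death0, birth_slope=0.0, death_slope=0.0):
    """Project population size year by year from per-capita birth and death rates."""
    pop = pop0
    path = [pop]
    for k in range(years):
        birth = max(0.0, birth0 + birth_slope * k)
        death = max(0.0, death0 + death_slope * k)
        pop *= 1.0 + birth - death
        path.append(pop)
    return path


# 18 annual steps, e.g. 1992 to 2010; all values are illustrative.
static = project(1000.0, 18, birth0=0.030, death0=0.010)
trend = project(1000.0, 18, birth0=0.030, death0=0.010,
                birth_slope=-0.0005, death_slope=-0.0001)

# Relative gap between the static projection and the projection with trends.
divergence = [(s - t) / t for s, t in zip(static, trend)]
```

Even with modest slopes, the gap between the two paths widens every year, which is the qualitative pattern the comparison is designed to detect against observed population sizes.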
Model code is provided in Additional file 1: AF1 in accordance with ISPOR-SMDM Modeling Good Research Practices guidelines. All model calculations and simulations were performed in MATLAB version R2013b (The Mathworks, Cambridge, MA, USA). All analyses of survey data used to construct model inputs and targets were performed in Stata 13.1 (StataCorp LP, College Station, TX, USA).
Model selection, calibration, and validation
Relative risk of death declines significantly with education (table columns: RR of death – urban; RR of death – rural)
Calibration resulted in a posterior joint distribution of model parameters that revealed the magnitude of important relationships between key demographic, socioeconomic, and population health variables, such as the relationship between higher educational attainment and the higher probability of rural-to-urban migration. Additional file 1: Figure AF1 displays the posterior probability distributions of the fitted parameters, which are further detailed in Additional file 1: Table AF6. Implications of these relationships are described below.
Modeled fertility trends
Modeled mortality trends
Modeled migration trends
We estimated a significant transition towards increasing exposure to urban environments through migration and urbanization. The overall trend suggested that 8 % of rural women would transition to urban areas between 1992 and 2010 (95 % CI: 6–10 %). Additional file 1: Figure AF4 illustrates the detailed model fits to migration rates across all available years of data (R² > 81 % for the model fit to the migration data). Rural-to-urban transitions were particularly common for women in their early 20s (20–24 year-olds had a 16 % chance of transitioning to urban areas between 1992 and 2010, 95 % CI: 14–17 %). The probability of such a rural-to-urban transition was significantly higher for women with greater educational attainment; the probability was only 5 % for 20–24-year-old rural women in the lowest educational attainment category (95 % CI: 4–6 %), but was 33 % for 20–24 year-old rural women in the highest educational attainment category (95 % CI: 32–35 %).
Modeled educational attainment trends
Calibrated educational attainment exhibited an increasing secular trend and strong urban/rural differences. The proportion of the population in the lowest educational category decreased by 1.4 % (95 % CI: 1.4–1.5 %) among urban and 2.0 % (95 % CI: 1.5–2.6 %) among rural populations over the simulation period, while the proportion of the population in the highest educational category increased non-significantly by 1.3 % among urban (95 % CI: −1.4 to 3.9 %) and by 0.3 % among rural populations (95 % CI: −0.3 to 1.0 %). By 2008, the largest category among urban women was those with 6 to 12 years of schooling, who made up 54 % of the urban population but only 39 % of rural women; among rural women, the largest group was those with no education (41 %). Additional file 1: Figure AF5 illustrates educational attainment trends in India over time along with the model fits disaggregated by age, birth cohort, and urban/rural residence (R² > 97 % for model fits to the educational data).
Model comparison: Are demographic and socioeconomic time trends necessary for modeling important population health risk factors?
Since many current population health models assume that demographic and socioeconomic exposures are fixed in time, we examined whether our modeling approach that included time trends in these exposures might influence modeled outcomes in important ways. We found that including trends is important for predicting population sizes that remain consistent with both past observed data and future demographic projections made from more complex demography models that are not typically possible to incorporate into health policy simulations. Specifically, we compared our model to an equivalent static model that did not include changes in educational attainment or in risks of fertility, mortality, or migration over time, serving as a proxy for many models in the current literature.
There is increasing recognition that demographic and socioeconomic conditions can profoundly affect population health and that in many countries these conditions are changing. Yet changes in such conditions are rarely included in mathematical simulation models of population health. Our goal in this paper was to create, validate, and “open source” a simulation modeling approach to allow population health modelers to explicitly reflect changing demographic and socioeconomic conditions.
Our approach allowed us to account for simultaneous, interacting exposures experienced by individuals over their lifetimes – in this case, exposures related to urban/rural residence, educational attainment, and migration. The approach also facilitated quantification of complex correlations among demographic and socioeconomic factors important to health, such as the relations between educational attainment, fertility, migration, and mortality risk.
The experiments conducted in this study add important knowledge to the existing literature on population health modeling. Numerous methods have been proposed to fit increasingly complex models to empirical data [11, 18, 37, 38, 39, 40, 41]; our approach focuses specifically on estimation of exposures and trends in exposure in a manner that is both accurate and simple. The data fitting and calibration approach itself is not new, relying on Markov Chain Monte Carlo methods that are freely available. The key challenge we addressed here is how to incorporate common demographic and socioeconomic data into a framework that will allow their rapid integration into population health and health policy models. We focused on fertility, mortality, education, and migration data, but our approach can be expanded to other exposures like those repeatedly documented in demographic and health surveys. Incorporating time trends in such data allowed us to generate more accurate population size and life expectancy projections than would have been the case if we had relied on a classical model assuming no change in demographic or socioeconomic factors. Our approach was found to avert potentially serious errors in projections of population health trends. Of note, several HIV-specific models have incorporated some demographic parameters in the past (i.e., migration rates). Our approach here adds to that literature by permitting systematic evaluation of the importance of each of the standard types of demographic parameters (birth rates, death rates, education rates, migration rates), to develop a routine approach to determining whether or not a parameter would add value or unnecessary complexity to a given model.
As with all mathematical models, our model requires assumptions and associated caveats. Our goal was not to capture all aspects of complex demographic and socioeconomic factors that could influence population size, fertility, or mortality. Rather, we purposefully focused on a model that incorporated several important factors while remaining reasonably parsimonious. The approach is readily adaptable and expandable to other modeling situations, and we have aimed to support this by “open-sourcing” the model code. The model does, however, rely on household survey-based data. Such data can suffer from various biases like those related to recall and self-report, which may, for example, lead to some misclassification of educational attainment, particularly as we had to impute mortality rates by educational attainment category given the absence of detailed mortality rates by education class in India’s vital registration database. Nevertheless, our selected model simultaneously fit multiple independent data sources providing information on multiple population health metrics.
Our model is less detailed than many formal models in demography and is intended to be used over short decadal time scales. Our goal was to bridge the gap between health policy models focused on detailed disease natural histories and intervention delivery, and formal demographic models that seek to provide general, very long-range population size projections. Rather than using classical demographic modeling techniques for long-term population size projections, our objective was to find key parameters that could be manipulated to allow public health intervention simulations delivered over 10 to 20 years, such as interventions targeting rural-to-urban migrants, or interventions addressing low educational attainment among rural women [21, 44]. This required modeling both population size itself (as in the demography models) and also generating parameters that characterize underlying factors and relationships linked to changes in population size and mortality (which must be manipulated to simulate interventions). Most formal demographic models capture long-term population trends, but do not include the underlying factors that generate such trends.
Future research should address the question of how data on demographic and socioeconomic conditions might be better standardized across countries. Much of our effort involved data gathering, cleaning, and organization, as shown in the Additional file 1 tables, to enable easier entry into the model generation and comparison process. Providing organized data in formats that allow similar model comparisons across countries could greatly assist in comparing interventions across low- and middle-income countries. Open databases for commonly-used data collected at multiple time points would provide opportunities to understand how social dynamics affect the vulnerability or resilience of different populations, particularly as population processes such as urbanization offer highly complex outcomes that are not intuitive to anticipate.
We find that incorporating demographic and socioeconomic trends into mathematical models of population health and health policy is important, as the omission of such trends influences model-projected outcomes in complex ways. Incorporating demographic and socioeconomic trends is currently highly feasible through an “open source” approach developed in this study, and is becoming even more feasible as data from low- and middle-income countries continue to become widely available.
- 5. United Nations. World Urbanization Prospects: The 2011 Revision. Geneva: UN; 2012.
- 6. United Nations. World Population Prospects: The 2012 Revision. Geneva: UN; 2013.
- 8. UNESCO Institute for Statistics. Adult and Youth Literacy, 1990–2015: Analysis of Data for 41 Selected Countries. Geneva: UN; 2012.
- 10. Grepin KA, Klugman J. Closing the Deadly Gap between What We Know and What We Do: Investing in Women’s Reproductive Health. Washington D.C: World Bank; 2013.
- 11. Goldhaber-Fiebert JD, Brandeau M. Modeling and calibration for exposure to time-varying, modifiable risk factors: the example of smoking behavior in India. Med Decis Making. 2014;34:e0272989X13518272.
- 19. World Bank. World Development Indicators. Washington D.C: IBRD; 2014.
- 20. Murthi M, Guio A-C, Dreze J. Mortality, fertility, and gender bias in India: A district-level analysis. Popul Dev Rev. 1995;745–782.
- 23. Ministry of Home Affairs. Sample registration system. New Delhi: Office of the registrar general & census commissioner, India; 2011.
- 24. International Institute for Population Sciences. National family health survey, India 1992–93. Bombay: IIPS; 1995.
- 25. International Institute for Population Sciences. National family health survey, India 1998–99. Bombay: IIPS; 2001.
- 26. International Institute for Population Sciences. National family health survey, India 2005–06. Bombay: IIPS; 2008.
- 27. International Institute for Population Sciences. District level household and facility Survey 2007–08. Bombay: IIPS; 2010.
- 28. Pasupuleti SS, Pathak P. Special form of Gompertz model and its application. Genus. 2011;66.
- 29. Wang YC, Graubard BI, Rosenberg MA, Kuntz KM, Zauber AG, Kahle L, et al. Derivation of Background Mortality by Smoking and Obesity in Cancer Simulation Models. Med Decis Making. 2012:0272989X12458725.
- 31. O’Hagan A, Forster J, Kendall MG. Bayesian Inference. London: Arnold; 2004.
- 36. Caldwell JC. How is greater maternal education translated into lower child mortality? Health Transit Rev. 1994;224–229.
- 42. Bolker BM. Ecological models and data in R. USA: Princeton University Press; 2008.
- 45. World Health Organization. United Nations High-Level Meeting on Noncommunicable Disease Prevention and Control. Geneva: WHO; 2011.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.