Introduction

Tuberculosis (TB), one of the first known human diseases, is still one of the major causes of death worldwide; about 2 million people die each year from the disease. TB has many indicators, affecting the bones, central nervous system, and many other organ systems, but it is primarily a pulmonary disease that is introduced by the deposition of Mycobacterium tuberculosis, contained in aerosol droplets, onto lung alveolar surfaces (Smith 2013). Tuberculosis is closely linked to both overcrowding and malnutrition, making it one of the principal diseases of poverty (Lawn and Zumla 2011). Many people in the developing world contract tuberculosis because of a poor immune system, largely due to high rates of HIV infection and the corresponding development of AIDS (Lawn and Zumla 2011).

Household crowding, deficient health services, and poor access to health services are well-known factors linked with TB prevalence and should be integrated in health policy planning. An effective programme for TB control should include these features not only as individual risk factors but as population determinants, comprising the basis for a territorial approach to a surveillance system (Barr et al. 2001). Strong evidence for an association between TB and poverty is already available, expressed by higher TB incidence rates in crowed urban areas and amongst low-income and illiterate populations (Waaler 2002). Health services access, in addition, is limited in the same populations, both as a result of lack of sufficient health facilities in poor areas and because poorer and illiterate people are less aware of their own health status. In a typical clinical practice, and most epidemiological studies, socio-economic determinants of disease are included in terms of an individual person’s risk factors (Kawachi and Berkman 2003). In ecological studies, on the other hand, the focus is on the community as an entity in itself, an entity more complex than the sum of the individual persons who make it up (Berkman and Kawachi 2000). However, individual-level factors, such as behaviours and lifestyles, common to persons across areas and subject to public health interventions, may also play a role in TB incidence. The question in ecological analysis is not about the causes of disease cases, but the causes of disease incidence (Rose 2001).

Previous disease mapping work has been based on collecting, mapping and analysing prevalence or incidence data with conventional statistical techniques, which are affected by random variation due to population variability and a loss of statistical power when cases are assigned to subgroups. The differences in geographical distribution due to random errors may be wrongly interpreted as a true variation of epidemiological interest. A Bayesian method, on the other hand, is used to model the random and true variation separately (Bergamaschi et al. 2006) and is a substitute for the frequentist methods.

Bayesian techniques can provide some shrinkage and spatial smoothing of raw standardised incidence ratio estimates, which are intensely influenced by the size of the population at risk, resulting in a noisy and unclear picture of the true unobserved risks (Richardson et al. 2004).

South Africa ranks as the third highest in the world burden of TB and for 3 consecutive years (2007, 2008 and 2009) the disease ranked as the number one among the ten leading underlying natural causes of death in SA (WHO 2010). Tuberculosis in the Eastern Cape Province mainly affects the economically active age group. Within the age group of 25–34 years the percentage distribution of reported TB cases was 15.9, 0.7 and 23.1% for the years 2003, 2004 and 2005 respectively. A report on the South African National Burden of Disease study 2000 Eastern Cape Province by the medical research council (MRC) showed that tuberculosis was the second leading cause of death among women and the third leading cause of death among men aged 15–44 years. Eastern Cape Province ranks as having the second highest burden of TB by province after KwaZulu Natal [National Department of Health, (NDoH) 2010 data]. In people with normal immune systems, the lifetime risk of progressing from latent TB infection to active TB disease is 10%. HIV, by weakening the immune system, increases a person’s risk of progressing from latent TB infection to active TB disease by 10% per year. The province has an extremely high burden of TB, co-infection with TB and HIV (TB/HIV), and multi-drug-resistant tuberculosis (MDR-TB). In 2008, there were more than 60,000 new TB cases in the province. Of these, there were 1251 confirmed cases of MDR-TB and 385 confirmed cases of extensively drug-resistant tuberculosis (XDR-TB). In 2010, the total of new TB and re-treatment cases identified in the province stood at 62,226 (ECAC 2012).

The connection between tuberculosis (TB) and socio-economic status is well acknowledged (Souza et al. 2000; Waaler 2002) and studies have succeeded in demonstrating an association between poverty and TB incidence rates (Krieger et al. 2003). The objective of this article was therefore to model the relationship between TB prevalence at the provincial level and socio-economic and demographic measures using Poisson regression and to discuss the capability of various indicators in guiding preventive actions and interventions.

Methods

Epidemiological data sources

This is a retrospective secondary data study using Eastern Cape Province TB notification and survey data. All data used were extracted from the electronic tuberculosis register (ETR) of the 24 health sub-districts of the province including the two metropolitan municipalities, namely, Nelson Mandela and Buffalo. The data obtained were for the period of 2012–2015.

The socio-economic and socio-demographic indicators and variables were obtained from publications of the Eastern Cape Socio-Economic Consultative Council (ECSECC 2014) containing reports from all local municipalities (Figs. 1 and 2).

Fig. 1
figure 1

Map a: Eastern Cape Province showing 37 district municipalities and 2 metropolitans. Map b: 24 health sub-districts for the TB data set

Fig. 2
figure 2

Posterior estimated relative risks maps of the seven models

Bayesian methods

We used a combination of a Bayesian approach and a generalised linear mixed model (GLMM) to smooth out the variability in observed disease rates and to estimate the association between TB prevalence rates, averaged over the observed period and chosen covariates of socio-economic vulnerability and demographic factors, and it therefore takes the form of:

$$ \mathit{\log}{\lambda}_i={\beta}_0+{\sum}_{i=1}^k{X}_i{\beta}_i+{u}_i+{v}_i $$
(1)

In the convolution regression model in (1) above, vi is a non-spatially structured random effect, typically assumed to be independent Gaussian, with zero mean and variance \( {\sigma}_v^2 \), and it is usually included in models to account for extra-Poisson variation because of some non-measured important covariates. The structured spatial random effects, u = (ui,…, un), explains spatial dependence, with a prior distribution taken as an intrinsic conditional autoregressive model (ICAR), as seen in (2), where the mean value for ui is the weighted average of the neighbouring random effects and the variance, \( {\sigma}_u^2 \), controls the strength of the local spatial dependence:

$$ P\left({\mu}_i|{\mu}_{j\ne i}\right)\sim N\left(\frac{\sum_{j\ne i}{w}_{ij}{\mu}_j}{\sum_{j\ne i}{w}_{ij}},\frac{\sigma^2}{\sum_{j\ne i}{w}_{ij}}\ \right) $$
(2)

Bayesian modelling depends on the ability to compute posterior distributions to provide estimates for all the corresponding model parameters. The majority of these posterior distributions are straightforward to calculate. Distributions with a conjugate prior typically have a posterior distribution that follows a standard distributional form. In many cases, however, the computation required is more complex and a more advanced method is essential to calculate the posterior distribution. These advanced approaches usually make use of some form of numerical simulation, generally by drawing a sample of parameter values from an approximation of the posterior distribution f(θ| Y) to allow estimation of model parameter distribution.

For our model, six covariates (socio-economic and demographic factors) were taken as explanatory variables for the relative risk of the disease. Let λi be the number of new TB cases in area i, x1 = Gini coefficient (a measurement of how income or poverty is equally distributed), x2 = poverty rate (% number of individuals living below the poverty line, although there is no official poverty line defined for South Africa), x3 = unemployment rate (%), x4 = no schooling (%; for persons aged 20+ years), x5 = average household size and x6 = population density of the regions/municipalities. The Gini coefficient and poverty and unemployment rates are considered in this study as distal factors, while no schooling, average household size, and population density are taken as proximal factors.

Two spatial random effects of ui and vi were used as an unstructured random effect to measure for spatial heterogeneity and as a structured random effect to measure for spatial dependency among the regions respectively. In this study, seven separate multilevel models including/excluding the covariates/spatial random effects were developed and treated as non-independent Poisson random variables with means λ = (λ1, …, λ) to investigate whether the covariates influenced part or all of the spatial correlations to TB risks.

Model comparisons were carried out using the deviance information criterion (DIC), which combines a measure of fit and a measure of model complexity based on the effective number of parameters. Smaller DIC values show a more fitting model (Spiegelhalter et al. 2002). The R codes used are presented in the Appendix Table 2.

Results

Table 1 shows the estimates from the combination of the Bayesian approach and the GLMM to assess spatial heterogeneity in the TB relative risk for the 2014 data set by investigating its relationship with some socio-economic and demographic variables in the Eastern Cape Province of South Africa. In 2014, Eastern Cape had 37,365 notified TB cases from 24 health sub-districts, with about 91.1% bacteriological coverage [ratio of the number of pulmonary tuberculosis (PTB) patients diagnosed by bacteriological tests to the total number of PTB patients reported, excluding children 0–4 years].

Table 1 Posterior estimated means and SD of risk factors associated with TB by a GLMM model for Eastern Cape Province

In this study, seven separate multilevel models including/excluding the covariates/spatial random effects were developed and treated as non-independent Poisson random variables with means λi = (λ1, …, λn) to investigate whether the covariates influenced part or all of the spatial correlation in the TB relativity. In Table 1, the measures of association and their respective means and standard errors are presented for all covariates in all models. In the distal models, the most significant explanatory variable was poverty. The Gini coefficient, unemployment, and no schooling, which are three of the most widely used socio-economic indicators in South Africa, were not significant in any of the models and had very low standard errors. This was possibly due to one of the indirct effects of the trio in TB occurrence through their influence on income (Singh-Manoux et al. 2002). For the proximal models in model 2 and the rest of the models, average household size was significant, indicating that each additional person in the house increased the risk of TB. Population density also played a significant role in the disease incidence, although with a low degree of association (models 2, 3, 5, and 6) and indirectly influenced the average household size. In the models without the spatial random effects (models 1, 2, and 3), the effects of poverty and average household size were vital.

The largest significant factor, however, was average household size.

In addition to being an indicator of poverty and population density, it was also an important covariate in the household transmission of TB. The addition of spatial random effects to the full model (comprising the distal and proximal models) decreased the significance of the effects of the previous covariates on TB incidence. By adding the conditional autoregressive random (CAR) effects, we also found that they can bring about significant changes in the posterior mean and variance of fixed effects compared with the non-spatial regression model. On the basis of the DIC, model 5 performed better than all the other models in which only the proximal factors and random effects were considered. In this best fit model, average household size and population density had a positive association with the relative risk of TB prevalence in the province. The impression is that models with a smaller DIC should be chosen in preference to models with a larger DIC. Both models are penalised by the value of \( \overline{\ D} \), which favours a good fit, but also [in common with Akaike information criterion (AIC) and Bayesian information criterion (BIC)] by the effective number of parameters, pD. Since \( \overline{D} \) will decrease as the number of parameters in a model increases, the pD term compensates for this effect by favouring models with a smaller number of parameters.

When the GLMM was adjusted for the proximal and distal spatial random effects models, it was seen to affect largely only the influence of poverty, which resulted in the decrease of the effect of almost all the other covariates. This outcome can be understood not only as a result of the colinearity between average household size and poverty, but also because of the fact that poverty is spatially clustered.

Discussion

The model-by-model approach used to include variables in connected blocks enabled us to identify the significance of each covariate associated with the spread of the disease and the effects of spatial random terms. First, their inclusion in the models also significantly improved the fit of the models by the values from the DIC (Souza et al. 2007). Further, we established that socio-economic variables like the Gini coefficient and unemployment, which are distal factors, cannot on their own explain the incidence of TB. Proximal variables classified as demographic factors, which are related to population density and average household size during the period under study, were found to be significant in this study. The results showed the multilevel effects of socio-economic and demographic factors on TB, and it was generally observed that the relative risks imposed by poverty, population density, and average household size were strong and positively related to the disease outcome. We also found that the Gini coefficient, unemployment, and no schooling were not positively associated with TB prevalence in any of the models.

The combination of some socio-economic and demographic factors as distal and proximal factors, respectively, showed that both risk factors have interplaying effects on TB prevalence, as individuals tend to converge with relation to their similar economic and demographic resemblance. The Eastern Cape Province has a predominantly Black population; hence, there is a stronger tendency to converge as a community, which makes healthy individuals susceptible to the disease. Together, these variables show the role of different social and demographic practices respectively.

When only the goodness of fit as estimated by the DIC was considered, it was found that the models with spatial random effects were better fitted, even when the model had no independent variable. Better specific estimates of the relative risk of the disease, as established by the inclusion and analyses of spatial random effects, and good fits were found in models 4, 5, and 6, thereby ratifying the assumption that the risk of occurrence is associated with the neighbourhood effect of each spatial unit. It is important to note that spatial random effects models on their own do not describe any disease incidence.

Implementing a fully categorised or multilevel Bayesian model to assess the relative risk of the incidence of TB was a suitable technique for this study. It is known that ‘all models are incorrect, but some are useful’; the possibility to assess the impact of explanatory variables at the ecological level as well the spatial patterns should be encouraged and supported in epidemiology in spite of the characteristic problems of this method (Hampel 1987).

Results from this study present two different alternatives for dealing with observing TB in the Eastern Cape Province. The use of socio-economic and demographic variables related to the prevalence of TB in the province points to the indicators to be examined and considered at the provincial level, while the mapping of risk areas show the need for health service interventions to target groups of the population at higher risk of contracting the disease. In addition, poverty, population density, average household size, and the spatial structure emphasise the associations between socio-economic and demographic deprivation and TB in this setting.

A restraint faced by many studies using this methodology is that most TB identification surveys (especially in heavy-burden and low-income countries like South Africa) do not have access to high-level statistical programmes, nor do they often have specialists to perform such analyses satisfactorily. Moreover, identifying the social and demographic processes that exacerbate circumstances of collective risk for the disease are essential to balanced planning of health and government mediations and may aid with creating ESSes on a territorial basis.