Background

Over the last several decades, extensive research on health effects of environmental pollutant exposures has advanced our understanding in pollutant toxicities and related biological mechanisms. These led us to control and lower national regulatory standards for some pollutants (e.g., blood lead levels in children), which has reduced the burden of disease and prevented substantial environmental exposure related diseases. Despite these achievements, a huge research gap remains unanswered: What is the health effect of exposure to pollutant mixtures? Although there is growing concern of potential health effects of exposure to pollutant mixtures, most previous studies have been limited to single pollutants, i.e., the unit of analysis is based on a single pollutant. This is due to statistical challenges, such as high degrees of correlation between pollutants, confounding due to co-pollutants (i.e., a spurious association in a single pollutant approach may be observed if the single pollutant is a proxy for other co-pollutants or a mixture of pollutants), lack of replication cohorts, and lack of statistical approaches to evaluating pollutant mixtures [1]. Other methodologic challenges, such as difficulty in exposure assessment of pollutant mixtures with regards to accuracy and feasibility and measurement errors, are also important to hinder multipollutant approaches but the present study will focus on statistical challenges.

Recently, several methods have been proposed to explore health effects of multiple pollutants [2,3,4]. These include variable selection approaches (least absolute shrinkage and selection operator (LASSO) [5], elastic-net [6], adaptive elastic-net [7]); dimension reduction techniques (principal component analysis (PCA), partial least squares (PLS) [8], weighted quantile sum (WQS) regression [9]); Bayesian approaches (Bayesian model averaging (BMA) [10], Bayesian kernel machine regression (BKMR) [11]); and recursive partitioning (classification and regression tree (CART) [12], random forest [13], Bayesian additive regression tree (BART) [14]). A few studies implemented these methods to analyze multiple pollutants [3, 11, 15,16,17,18]. A recent workshop organized by the National Institute of Environmental Health and Sciences (NIEHS) suggested that there is a need for development of novel statistical approaches and appropriate use of available statistical methods in the analysis of combined exposure data from epidemiologic studies [19]. It is also important that methods accounting for the interactions between pollutants within a mixture and cumulative pollutant exposure are needed to estimate the risk of disease [1].

To address these challenges and research gaps, we propose an updated ‘Environmental Risk Score (ERS)’. We initially proposed an ERS as a potential summary measure of effects of multiple pollutants in epidemiologic research [20]. The underlying idea behind an ERS is to build a predictive risk model as a weighted sum of the pollutant levels from simultaneous assessment of multiple pollutants [20]. Weights are determined by the magnitudes (standardized regression coefficients) of the association between each pollutant and the outcome of interest. Our initial study of ERS has several limitations in that quantification of health effects was based on the main effects of individual pollutants; pollutant-pollutant interactions were not considered; and the computed ERS is conceptualized and created in a disease-specific way (i.e., an ERS for one disease is not applicable for another disease) [20]. The updated ERS we propose here will account for these two limitations. The use of modern machine learning methods we consider also bypass the limitation of working with weighted parameters to derive the ERS and offer the flexibility of working directly with predictions that can be generated by a wide class of tools (for example, tree-based methods, smoothing and selection methods) beyond the traditional regression model.

With this objective, we examined the association between metal mixtures and cardiovascular health outcomes in U.S. adults. Heavy metals are widespread, and notable for their toxic effects even at low levels of exposure encountered in the general environment. The cardiovascular system is a main target organ for various heavy metals [21, 22]. Common biological mechanisms by which heavy metals affect the cardiovascular system include increased oxidative stress and lipid peroxidation, inhibition of antioxidant defense systems, smooth muscle cell proliferation, and endothelial injury and apoptosis [22, 23]. Although cardiovascular effects of individual metals have been extensively studied, little is known about cumulative effects of metal mixtures. To test this question, we examined 20 metal biomarkers measured in urine or whole blood from 6 cycles of the National Health and Nutrition Examination Survey (from NHANES 2003–2004 to 2013–2014). We first constructed ERS of metal mixtures for gamma-glutamyl transferase (GGT), a marker of oxidative stress [24, 25], because oxidative stress is a common disease pathway linking environmental pollutant exposure and numerous health endpoints. We then evaluated the associations between ERS and cardiovascular endpoints (blood pressure, hypertension and total and cardiovascular disease (CVD) mortality). Therefore, the constructed ERS will reflect cumulative risk of oxidative stress due to metal mixtures and can also be applied to other health outcomes relevant to oxidative stress, such as cancer, type-2 diabetes, etc. To identify important metals and potential metal-metal interactions, we employed adaptive elastic-net [7], BART [14] and BKMR [11]. We also used Super Learner, an ensemble machine learning prediction algorithm that utilizes a weighted combination of many candidate learners [26]. We compared the prediction performance among these different statistical approaches in a test set to provide insights with regards to data complexity (e.g., sample size, number of predictors, non-linearity, high degrees of correlation, and potential pollutant-pollutant interactions).

Methods

Study population

Data were from the U.S. NHANES cycles from 2003 to 2004 through 2013–14 (six continuous NHANES cycles). Two earlier cycles of NHANES (1999–2000 and 2001–2002) were not included because arsenic species were not assessed in those cycles. NHANES is a cross-sectional study designed to be representative of the health and diet of the non-institutionalized U.S. population. NHANES employs a complex, multi-stage sampling design with accompanying design weights. For the present study, 10,805 adults aged 20 years or older who participated in the sub-study of heavy metals in urine, were eligible. We excluded 1141 participants who had missing data in the outcome (GGT, n = 663), exposures (heavy metals, n = 862) and core covariates (smoking, education, body mass index (BMI), n = 197) (the numbers of missing in parentheses are not mutually exclusive), resulting in the sample size of 9664 for ERS construction. Those excluded were older and more likely to be female and non-Hispanic black and less educated (Additional file 1: Table S1). Any potential confounding bias by these factors was minimized by adjustment as covariates in regression models. Given that participants did not know their exposure levels at the recruitment, selection bias due to differential missing data in exposure is unlikely. We additionally excluded 464 and 489 participants when examining blood pressure and hypertension, respectively, due to missing in information on blood pressure and hypertension. For mortality, 6404 participants were available for analysis because mortality data were not available in the last two cycles. NHANES is a publicly available data set and all participants in NHANES provide written informed consent, consistent with approval by the National Center for Health Statistics Institutional Review Board.

Assessment for heavy metals

Heavy metals in urine (antimony, total arsenic, arsenous acid, arsenic acid, arsenobetaine, arsenocholine, monomethylarsonic acid, dimethylarsonic acid, barium, cadmium, cobalt, cesium, molybdenum, lead, thallium, tungsten, uranium: n = 17) and whole blood (lead, cadmium, total mercury: n = 3) were analyzed with inductively coupled-plasma dynamic reaction cell-mass spectrometry (ICP-DRC-MS). Detailed laboratory methods and quality control/quality assurance data available in the NHANES website (for example, the laboratory procedure manual for heavy metals in urine in NHANES 2003–2004 available at https://wwwn.cdc.gov/nchs/data/nhanes/2003-2004/labmethods/l06_c_met_hm.pdf). Platinum and beryllium in urine were excluded because more than 95% observations were below limits of detection (LODs). In NHANES, concentrations below LOD were imputed with a value equal to LOD/√2. Although there is some concern in this commonly used substitution method [27], we did not conduct other alternative methods to avoid too complicated modeling and because statistical approaches for mixtures do not support such modeling.

Assessment for outcomes and covariates

Serum GGT was assayed with Beckman Synchron LX20 or Beckman UniCel DxC800 Synchron via an enzymatic rate method. Blood pressure was measured up to four consecutive times by certified examiners with standardized protocols. We calculated means of systolic and diastolic blood pressure (SBP/DBP) by averaging up to three measures after disregarding the first reading. Hypertension was defined as SBP/DBP ≥ 140/90 mmHg, self-reported physician diagnosis of hypertension, or self-reported use of hypertension medication. Public-use linked mortality data were available for NHANES through 2009–2010. These data provides mortality follow-up data from the date of survey participation through December 31, 2011. Therefore, only the first 4 cycles from 2003 to 2004 to 2009–2010 were used for mortality. Total mortality and cause-specific mortalities (CVD (the 10th International Classification of Disease (ICD-10): I00-I09, I11, I13, I20-51) and cancer (C00-C97)) were linked to NHANES participants. Important covariates were chosen a priori and included age, sex, race/ethnicity (Mexican American, Other Hispanic, non-Hispanic white, non-Hispanic black, Other), education (<high school diploma, high school diploma, and >high school diploma), smoking status (never, former, current), BMI, and urinary creatinine. We selected education as an indicator of socioeconomic status because it is widely used and has less missing data than other proxies, such as household income or poverty income ratio. Urinary creatinine, an indicator of urine dilution [28, 29], was measured using either Jaffe reaction or an enzymatic method.

Statistical analysis

Although we used complex sampled NHANES data, we did not consider survey components (sampling weights, clusters, and strata) when constructing ERS because most statistical packages for implementing the ML methods do not allow such survey components. We used the survey package in R (version 3.3.1) in the final analyses of associating ERS and cardiovascular outcomes.

We calculated pair-wise correlations among 20 metals and created a correlation-matrix heat map. We applied logarithmic transformation with base 10 to GGT and metal pollutants because the distributions of the raw values were highly skewed and the shapes of dose–response relationships were closer to be linear with log-transformation. We chose log with base 10 rather than natural log for easier interpretation of regression coefficients (i.e., one unit increase in a log-transformed variable is equivalent to a ten-fold increase in its raw values). All analyses were conducted using R. A schematic representation of the data accumulation and analytic procedures is presented in Fig. 1.

Fig. 1
figure 1

Schematic diagram of Environmental Risk Score (ERS) construction and analytical methods. AENET-I, adaptive elastic-net with main effects and pairwise interactions; BART, Bayesian additive regression tree; BKMR, Bayesian kernel machine regression; PRESS, predicted residual sums of squares; MSE, mean square error; MSPE, mean square prediction error; AUC, area under the receiver operating characteristics curve; OR, odds ratio; SBP/DBP, systolic and diastolic blood pressure; CVD, cardiovascular disease

Construction of ERS using statistical approaches for pollutant mixtures

We utilized adaptive elastic net, Bayesian Additive Regression Trees (BART), Bayesian Kernel Machine Regression (BKMR), and Super Learner (SL) to determine heavy metals contributing to ERS for GGT.

  1. 1)

    Adaptive elastic net (Zou and Zhang, [7])

Elastic net is a hybrid approach that blends LASSO and ridge regression to overcome the limitation of LASSO on data with highly correlated variables [6]. LASSO and ridge regression are regularized regression techniques that include a penalty term to constrain the size of the estimated coefficients. LASSO shrinks coefficients towards exact zeroes and thus performs variable selection. With highly correlated variables as predictors, LASSO tends to select only one out of these correlated variables and ignore the others. Elastic net also executes variable selection, but it has the opportunity to select a group of non-zero collinear variables. Adaptive elastic net, proposed by Zou and Zhang [7], is an adaptive version of elastic net that not only deals with the collinearity problem over elastic net but satisfies the asymptotic normality assumption that allows us to conduct statistical inference and hypothesis testing by providing large sample standard errors (SEs) and p-values. Adaptive weights ensure smaller coefficients are shrunk faster to zero, whereas larger coefficients are penalized less.

For subject i (i = 1,  … , N), let Y i  represent the continuous outcome GGT, \( {E}_i^j\ \left(j=1,\dots, p\right) \) be the j-th given environmental pollutant, and Z i  (k × 1) be the vector of covariates/confounders that the model adjusts for, i.e. age, BMI, urinary creatinine, gender, race/ethnicity, smoking status and education. Let X i denote all predictors including Z i , E i and E i  × E i . Then the estimates from the adaptive elastic net method are defined by

$$ \widehat{\beta}=\left(1+\frac{\lambda_2}{N}\right)\left\{ argmin\left(\sum_{i=1}^N{\left({Y}_i-{X}_i\beta \right)}^2+{\lambda}_1\sum_{j=1}^P{\widehat{\omega}}_j\left|{\beta}_j\right|+{\lambda}_2\sum_{j=1}^P{\beta_j}^2\right)\right\}, $$

where \( {\widehat{\omega}}_j \) represents the weight of j-th metal pollutant, which can be determined by the beta coefficient from elastic net. These weights can allow coefficients of relatively less important variables (metals) to be shrunk to zero’s more efficiently. Optimal tuning parameters (λ1 and λ2) were chosen based on 5-fold cross-validations to minimize prediction errors. In our application we chose not to conduct variable selection on Z i , and thus we did not penalize the coefficients associated with Z i .

We constructed ERS using the adaptive elastic net with an underlying model considering not only the main effects but all possible combinations of pairwise linear interactions (AENET-I). For comparison purposes, we also ran adaptive elastic net with only main effects (AENET-M), which estimate marginal associations between individual variables within a mixture and the outcome. However, in presence of complex non-linear interaction among the predictors, it is hard to assess the marginal effect of a single pollutant without referencing to the value of the other pollutants. The ‘statistical’ interactions in AENET-I imply departures from additive joint effects. ERS was computed as a weighted sum of the selected non-zero predictors from each model:

$$ {ERS}_i=\sum_{j=1}^P\widehat{\beta_j}{E}_i^j+\sum_{k=1}^{P-1}\sum_{l=k+1}^P\widehat{\beta_{kl}}{E}_i^k{E}_i^l $$

where \( \widehat{\beta_j} \) is the coefficient of environmental pollutant predictor j and \( \widehat{\beta_{kl}} \) is the coefficient of the interaction of environmental pollutant predictors k and l. Note that most coefficients are zero since these coefficients are shrunk by AENET. All models were adjusted for age, sex, race/ethnicity, education, smoking status, BMI, and urinary creatinine. The R package gcdnet (version 1.0.4) [30] was used to implement adaptive elastic net. The variables with non-zero coefficients were identified as noteworthy/important and their corresponding coefficient estimates and P-values were computed.

  1. 2)

    Bayesian Additive Regression Trees (BART) (Chipman et al. [14])

BART is a Bayesian “sum-of-trees” model for ensemble inference from a large number of trees [14]. First, we regress Y i Z i and E i Z i , let \( {Y}_i^{\ast } \) and \( {E}_i^{\ast } \) be the residuals obtained from these regression. Then consider the following model.

$$ {Y}_i^{\ast }=f\left({E}_i^{\ast}\right)+{\epsilon}_i,{\varepsilon}_i\sim N\left(0,{\sigma}^2\right). $$

where f is an unknown function and \( {Y}_i^{\ast } \) is the residual of GGT adjusted for all the covariates Z i . \( {E}_i^{\ast } \) is the residual of all the metal levels adjusted for Z. We consider to model f(E ) by a sum of m regression trees \( f\left({E}^{\ast}\right)\approx h\left({E}^{\ast}\right)\equiv {\sum}_{j=1}^m{g}_j\left({E}^{\ast}\right) \) where each g j denotes the j-th regression tree.

A sum-of-trees model is basically an additive model with multiple components. It not only estimates interaction effects, but also takes additive effects into account. Therefore parametric pairwise E x E interaction terms were not specifically included in the BART models (also not in BKMR and SL below). The key idea of BART is to implement the sum-of-trees model with a prior that regularizes the fit by keeping the individual tree effects small. At each Markov chain Monte Carlo (MCMC) iteration, we produced a draw from the joint posterior (f, σ)∣(Y, E, Z) and iterated with Gibbs sampling until the convergence with a default choice of m = 200. BART was implemented with R package BayesTree (version 0.3-1.4) [31]. ERS using BART was constructed as a posterior predicted residual of GGT (we use the posterior mean) for subject i after removing the effect of Z by

$$ {ERS}_i=\widehat{Y^{\ast }} $$

The variable inclusion proportions were computed in the training set as a measure of the variable importance using the R package bartMachine (version 1.2.3) [32].

  1. 3)

    Bayesian kernel machine regression (BKMR) (Bobb et al. [11])

The main idea of kernel machine regression (KMR) is to flexibly model the relationship between a large number of variables and a particular response variable. Our modeling framework is

$$ {Y}_i=h\left({E}_i\right)+{Z}_i^T\beta +{\varepsilon}_i, $$

where ε i  ~ N(0, σ 2) and h is a flexible function of metal pollutants E i , which is characterized through a kernel machine representation. Usually, Gaussian kernel is the default choice, which flexibly captures a wide range of underlying functional forms for h, and can be expressed as

$$ K\left({E}_i,{E}_{i\hbox{'}}\right)=\exp \left\{-\sum_{j=1}^P{r}_j{\left({E}_i^j-{E}_{i\hbox{'}}^j\right)}^2\right\} $$

Here E i and E i represent vectors of predictors for two different individuals, and r j  > 0 denotes the tuning parameter that control the smoothness of h as a function of the exposure E. Intuitively, the kernel function shrinks the predicted response of two individuals with similar exposure profiles toward each other. BKMR is based on KMR and conducts Bayesian inference for the model above [11]. Flat priors for the parameters were chosen in this study. Using MCMC methods, we iterated for 2000 times to ensure the convergence of the method and get the optimal parameter estimates (r, β). BKMR was implemented with R package bkmr (version 0.2.0) [33]. ERS using BKMR was constructed as a predicted value of GGT for subject i by

$$ {ERS}_i=\widehat{h}\left({E}_i\right). $$

The posterior inclusion probabilities of each variable were computed in the training set as a measure of variable importance.

  1. 4)

    Super Learner (SL) (van der Laan et al., [27])

The main idea of SL is to find the optimal prediction by using a combination of predictions from a collection of given algorithms to minimize cross-validation risk [26]. R package SuperLearner (version 2.0.21) [34] was used to predict the outcome (here GGT) through a 10-fold cross-validation. The SL algorithm includes estimations from many candidate learners L. In our study, we utilized the following learners: tree based methods (R packages: bartMachine, caret, randomForest), regression based methods (glm, gam, step, glm.interaction), shrinkage (regularized regression) methods (ridge, glmnet), Bagging and Boosting methods (xgboost, ipredbagg) and others (nnls, SVM).

Suppose we consider M input algorithms/methods for constructing the ensemble predictor. For a given method m (m = 1, 2, 3,  … , M), we fit a model of the form Y  = ψ m (E ), where Y is the residual of GGT adjusted for all the covariates Z and E is the residual of all the metal levels after regressing on Z. Each method provides a prediction of the form \( \widehat{Y_m^{\ast }}= \) \( {\widehat{\psi}}_m\left({E}^{\ast}\right) \). We restrict our attention to candidate ensemble predictors of the form

$$ {\psi}_{SL}\left({E}^{\ast}\right)={\sum}_{m=1}^M{a}_m{\psi}_m\left({E}^{\ast}\right)\kern0.5em \mathrm{such}\kern0.5em \mathrm{that}\kern0.5em \mathrm{each}\kern0.5em {a}_m\ge 0,\forall m\kern0.5em \mathrm{and}\kern0.5em {\sum}_{m=1}^M{a}_m=1. $$

The set of weights corresponding to each method a = {a m , m = 1⋯, M} is now determined by a V-fold cross-validation. For each of the v = 1 ,  ⋯  , V fold, the entire learning set of size n is divided into binary splits of training and validation set that are mutually exclusive. For the v-th fold, we use the indicator (\( {\mathrm{B}}_n^v(i)=0 \)) to denote observations in the training set and \( \Big({\mathrm{B}}_n^v(i)=1 \)) to denote observations in the validation set. Let n test(v) denote the size of the test or validation sample in the v-th fold. The vector a is obtained by estimating \( {\widehat{\psi}}_m \) on the training data but evaluating the risk on the test data and minimizing the V-fold cross-validated prediction loss.

$$ \widehat{a}=\mathit{\arg}\kern0.5em {\mathrm{min}}_a{\sum}_v\frac{1}{n^{test}(v)}{\sum}_{i:{B}_n^v(i)=1}{\left({Y}_i^{\ast }-{\sum}_{m=1}^M{a}_m{\widehat{\psi}}_m\left({E}_i^{\ast}\right)\right)}^2. $$

The final SL fit, i.e.,ERS SL is computed by combining \( \widehat{a} \) with \( {\widehat{\psi}}_m\left({E}^{\ast}\right) \), m = 1 , 2 ,  …  , M:

$$ {\mathrm{ERS}}_{SL}={\sum}_{m=1}^M\widehat{a_m}{\widehat{\psi}}_m\left({E}^{\ast}\right). $$

There is no single consensus measure to evaluate variable importance in the SL. We therefore, computed a sensible metric in the training set as done in BART and BKMR. First we ran SL with all 20 metals and calculated the sum of squared errors (SSE). We then removed one metal, ran SL with the remaining 19 metals and calculated SSE [SSE(−i)]. We repeated this for each metal. The variable importance was computed as the difference between SSE(−i) and SSE divided by SSE [i.e., (SSE(−i) – SSE)/SSE]. For comparison purposes, we also compared individual algorithms within SL in terms of the model prediction performance.

Assessment of predictive power of ERS

To evaluate and compare the performance of 5 ERS’s (AENET-I, AENET-M, BART, BKMR, and SL), we randomly split the full data (all cycles combined) by a 1∶1 ratio: the first part (n  =  4832) used for estimation/training and the second part (n  =  4832) for validation/testing. We repeated this random split three times and constructed ERS each time but the results (i.e., selected predictors or variable importance) were consistent. We, therefore, report the results based on the first split data. Three metrics were computed: First, we used linear regression for GGT and fit ERS as a continuous variable and computed correlation coefficients between GGT and ERS and the predicted residual sums of squares (PRESS), a statistic measuring model goodness of fit. Mean squared error (MSE) in the training data and mean squared prediction error (MSPE) in the testing data were also calculated to assess prediction performance. Second, we dichotomized GGT at the 90th percentile (50 U/L), and conducted logistic regression analysis with this dichotomized outcome and continuous ERS as predictor. We used area under the receiver operating characteristic (ROC) curve or AUC to assess predictive ability of the ERS with these binary endpoints. Third, in order to assess risk stratification/discrimination power of the ERS, we further categorized ERS by its quintiles based on the distribution in the training set and conducted logistic regression with categorical ERS in the testing set. We computed the odds ratio (OR) for the highest quintile vs. the lowest quintile of ERS to measure the risk stratification properties of ERS.

Associations between ERS and cardiovascular endpoints

Next, we examined the associations between each ERS and cardiovascular endpoints, such as SBP, DBP, hypertension, and mortality (total and cardiovascular). Survey linear regression, survey logistic regression and survey Cox-proportional hazard models were used for SBP/DBP, hypertension, and mortality, respectively. Age was used as the time scale in Cox-proportional hazard models [35]. Using age as the time-scale implies delayed entry with left truncation occurring at the age at inclusion. In this approach, the hazard function can be directly interpreted as the age-specific incident function. We also examined cancer mortality which is a non-cardiovascular endpoint but related to oxidative stress as a potential biological mechanism. Separate Cox models were fit for each cause-specific outcome and log(hazard ratio (HR)) was obtained from each Cox model. The same covariates used above were adjusted for all models. To standardize the distributions of different ERS’s, we computed z-scores for each ERS by subtracting the mean of the corresponding ERS divided by its standard deviation (SD). We report β coefficients (95% confidence intervals (CIs)) for SBP/DBP, ORs for hypertension, and HRs for mortality outcomes for a one-unit increase in the z-score of ERS which is equivalent to a one SD increase in its original scale.

Results

Table 1 presents population characteristics for the whole and by each cycle. Mean (SD) of age was 49.2 (17.2) years and approximately 51% were female. Overall, the proportions of Mexican American, other Hispanic, Non-Hispanic white, Non-Hispanic black, and other race/ethnicity were 16.7%, 8.1%, 46.6%, 20.4%, and 8.2%, respectively. Starting in the cycle of 2007–2008, all Hispanics were oversampled, not just Mexican Americans, thus the proportion of other Hispanic increased since this cycle. Likewise, Asian Americans were oversampled starting in the cycle of 2011–2012. There were decreasing trends in the prevalence of current smokers and low education attainment (<high school diploma) over the study period. The means (SDs) of GGT, SBP and DBP were 21.7 (1.94) U/L, 123.2 (18.7) mm Hg, and 70.0 (12.0) mm Hg, respectively. The prevalence of hypertension was 36.9%. Over the mean of 4.5 years of follow-up, the mortality rates of all causes, CVD and cancer were 5.6%, 1.5% and 1.4%, respectively.

Table 1 Characteristics of the study population overall and by NHANES cycle

Table 2 shows geometric means (GMs) and geometric standard deviations (GSDs) of metals measured in either whole blood or urine overall and by each cycle. Most metals showed that GMs were higher than GSDs, suggesting skewness. There were decreasing trends in the population-level concentrations of lead in both blood and urine. Urinary arsenobetaine concentrations increased in the cycles of 2011–2012 and 2013–2014, which may result from the increased proportion of Asians who eat more fish (arsenobetaine is mainly from fish consumption). Figure 2 shows a heat map of Spearman correlations between metal biomarkers. There were moderate to high correlations between blood lead and blood cadmium (Spearman correlation (rho) = 0.35); between blood total mercury and urinary arsenobetain (rho = 0.47); and among arsenic species (monomethylarsonic acid (MMA), arsenous acid (As III), arsenic acid (As V), and arsenocholine: rho = 0.43 to 0.87). A group of urinary metals including total arsenic, dimethylarsonic acid (DMA), cadmium, lead, cobalt, thallium, cesium, barium, uranium, tungsten, and molybdenum also had modest to high correlations each other (rho = 0.22 to 0.82).

Table 2 Geometric means and geometric standard deviations of metals overall and by NHANES cycle
Fig. 2
figure 2

Heat map of Spearman correlations between metal biomarkers. Asterisk next to the metal names indicates metals measured in whole blood. As, arsenic; As III, arsenous acid; As V, arsenic acid; MMA, monomethylarsonic acid (MMA); DMA, dimethylarsonic acid; Mo, molybdenum

Figure 3 and Additional file 1: Table S2 shows the variable selection results for AENET-M and AENET-I. Nine predictors (main effects) were selected in AENET-M and 24 predictors (13 main effects and 11 pairwise interactions) were selected in AENET-I. In AENET-I, important metals selected include lead (blood) (β = 0.31, p = 6.2 × 10−6), cadmium (urine) (β = 0.22, p = 9.8 × 10−7), DMA (β = 0.28, p = 5.9 × 10−5), cobalt (β = −0.14, p = 1.2 × 10−8), MMA (β = −0.10, p = 1.3 × 10−10), and barium (β = 0.05, p = 7.9 × 10−10) in terms of either a large magnitude of the association (|β| > 0.1) or a small p-value (<1 × 10−5). Interactions of cesium-lead (blood) (β = −0.15, p = 4.5 × 10−4) and of mercury-As V (β = −0.15, p = 6.1 × 10−2) were also identified. In BART, urinary cadmium seems to contribute to prediction of ERS most (variable importance (the relative proportion of the contribution) = 9.49%) followed by cobalt (7.21%), DMA (7.14%), tungsten (7.13%), MMA (6.78%) and so on (Additional file 1: Table S3). In SL, As V had the highest variable importance measure (25.1%) followed by MMA (24.2%), tungsten (22.9%), total As (22.6%), DMA (22.6%), and cadmium (21.5%) (Additional file 1: Table S3). Combining the ranks of variable importance between BART and SL, cadmium, tungsten, MMA, DMA, As V, barium and cobalt were ranked as the top 7 metals (Additional file 1: Table S3). In BKMR, almost all metals had posterior inclusion probabilities (PIP) of 1 except arsenous acid (As III, PIP = 0.96) and arsenic acid (As V, PIP = 0.85) (data not shown).

Fig. 3
figure 3

Selected predictors of the main effects (diagonal cells) and pairwise interactions (off-diagonal combinations) for serum gamma-glutamyl transferase (GGT) in adaptive elastic net. Bubble size indicates the magnitude of the association. The number inside indicates p-value. Asterisk next to the metal names indicates metals measured in whole blood. As, arsenic; As III, arsenous acid; As V, arsenic acid; MMA, monomethylarsonic acid (MMA); DMA, dimethylarsonic acid; Mo, molybdenum

The ERS’s from each statistical approach ranged from −0.22 to 0.25 for AENET-M; −0.26 to 0.49 for AENET-I; −0.22 to 0.45 for BART; −1.01 to 1.83 for BKMR; and −0.18 to 0.27 for SL (Table 3). ERS from BKMR had a wider range whereas ERS from SL had a narrower range than other approaches. Pairwise correlations among AENET-I, BART and SL were relatively high (>0.6), whereas BKMR had weak correlations with other approach (<0.12) (Additional file 1: Figure S1).

Table 3 Comparison of ERS distribution and risk prediction performance by different statistical approaches

For risk prediction performance for continuous GGT, AENET-I outperformed other approaches in terms of correlation between GGT and ERS, PRESS, and MSPE in the testing set (Table 3). The correlation coefficients between GGT and ERS in the training set were 0.22 in AENET-M and 0.24 in AENET-I and similar and slightly improved correlation coefficients were obtained in the testing set. The correlation coefficient between GGT and ERS in BART was 0.35 in the training set and 0.20 in the testing set. BKMR and SL overfit GGT prediction in the training set (correlation coefficient = 0.82 in BKMR and 0.75 in SL). BKMR poorly predicted GGT in the testing set (correlation coefficient = 0.0) but SL performed reasonably in the testing set (correlation coefficient = 0.20). For dichotomous GGT in the testing set, AENET-I and SL performed best (Table 3 and Additional file 1: Figure S2). The addition of their ERS modestly improved the AUC for GGT (AUC from 0.66 to 0.70 for both). Again, the same overfitting issue was observed for the ERS from BKMR in the training set.

Figure 4 shows ORs of having high GGT levels (50 U/L) comparing the highest vs. the lowest quintiles of ERS. After controlling for base covariates, ORs of having high GGT levels ranged from 1.69 (95% CI, 1.20 to 2.38) for BKMR, 3.42 (95% CI, 2.45 to 4.78) for BART, 3.62 (95% CI, 2.49 to 5.25) for AENET-M, 3.73 (95% CI, 2.66 to 5.23) for SL, to 4.29 (95% CI, 2.97 to 6.18) for AENET-I. These ORs were even stronger than those for individual metals. ORs comparing the highest vs. the lowest quintile of lead, cadmium, mercury and total arsenic were 1.71 (95% CI, 1.20 to 2.44), 1.17 (95% CI, 0.79 to 1.74), 1.29 (95% CI, 0.95 to 1.75), 1.54 (95% CI, 1.06 to 2.22), respectively. Furthermore, the ERS effect sizes (except ERS for BKMR) were stronger than those for socio-demographic factors or BMI. For example, OR for females vs males was 2.16 (95% CI 1.74, 2.67); OR for 10 kg/m2 increase in BMI was 1.44 (95 CI, 1.25, 1.65); and OR for <high school vs. college or higher was 1.76 (95% CI, 1.26, 2.46) (data not shown).

Fig. 4
figure 4

Odds ratios (95% confidence intervals) of having high GGT (50 U/L and above) comparing the highest vs. the lowest quintiles of ERS and individual pollutants that compose the ERS in the testing set. All models were adjusted for age, BMI, creatinine, gender, race/ethnicity, smoking status and education

Table 4 presents the associations between ERS and various cardiovascular endpoints. Because the ERS prediction performance for BKMR was not good, we excluded the BKMR-based ERS in these analyses. All ERS’s showed significant associations with systolic and diastolic blood pressure and hypertension. ERS based on SL showed the largest associations: one SD increase in SL-ERS was associated with a 1.03 mmHg (95% CI, 0.57, 1.48) higher in SBP; a 1.61 mmHg (95% CI, 1.28, 1.95) higher in DBP. For hypertension, one SD increase in each ERS from AENET-I, BART and SL were associated with ORs of 1.26 (95% CI, 1.15, 1.38), 1.17 (1.09, 1.25), and 1.30 (1.20, 1.40), respectively. For comparison, blood lead and blood cadmium were associated with ORs of 1.08 (0.99, 1.18) and 1.06 (0.98, 1.16). ERS’s showed non-significant positive associations with both total and CVD mortality. Positive associations with cancer mortality were observed for all ERS’s but those associations did not reach statistical significance except AENET-M (HR = 1.50, 95% CI, 1.05, 2.15).

Table 4 Associations of health endpoints (blood pressure, hypertension, and mortality) with ERS’s from different statistical approaches. For the comparison purpose, associations with blood lead and blood cadmium are presented

Discussion

There is growing interest in evaluating health effects of real-life environmental exposure as mixtures. Scientists have recognized the need to develop novel statistical approaches to predicting disease risk associated with exposure to pollutant mixtures [19]. The present study proposes an updated ERS [20] using cross-cutting statistical approaches developed to handle statistical challenges in pollutant mixtures: high degrees of correlations among pollutants; pollutant-pollutant interactions; and cumulative risk from pollutant mixtures [1]. Among the ERS’s constructed to predict GGT, an indicator of oxidative stress, AENET-I, a regularized regression-based variable selection method that considered both main effects and pairwise interactions, performed best. Despite their slight worse performance in prediction, BART, a nonparametric Bayesian regression approach that uses dimensionally adaptive random basis elements [14], and SL, an ensemble of multiple learners [26], in which evaluation of interaction effects are built, showed comparable prediction performance only with the main effect terms in their models. The observed associations between ERS and GGT were larger than those for known toxic metals assessed individually, suggesting that cumulative (combined) effects may be larger than each individual effect. We also found significant associations between ERS and DBP and hypertension, implying that individuals who have higher metal mixtures-related oxidative stress may be at higher risk for high blood pressure.

Choice of statistical approaches is a critical step in research on multi-pollutants and mixtures. As listed in Introduction, a number of different statistical approaches have been developed to analyze high dimensional and correlated data. A workshop by NIEHS in 2015, entitled “Statistical Approaches for Assessing Health Effects of Environmental Chemical Mixtures in Epidemiology Studies”, reported that no one statistical approach performs better than another and a statistical approach should be chosen based on a specific scientific question and hypothesis related to pollutant mixtures [19]. Our primary goal and specific question when constructing ERS using 4 different statistical approaches was to develop better prediction models for early health effects on a common disease pathway (i.e., oxidative stress) while accounting for high degree correlations and potential interactions among metals. Then these prediction models are used to assess their risk stratification and discrimination power for predicting specific health endpoints. The statistical approaches we used and other approaches listed in Introduction are ‘agnostic’ and the selected predictors do not necessarily reflect causal agents or biological interactions [36]. However, when predictive modeling is the primary goal rather than causal explanatory modeling, causality of the selected predictors may be of less importance if highly correlated pollutants are examined [37]. Either causal predictors or their correlated proxies should have similar predictive power and in general, variables with less noise are likely to be selected [38]. Biologically-based dose–response functions, such as physiologically based pharmacokinetic (PBPK) modeling, may better capture causal explanation but require more complex modeling that is hard to test and toxicological data that are often not available. Our predictive modeling can provide a good approximation of underlying causal relationships by capturing complex relationships and patterns among pollutants within mixtures [39].

In terms of choice of statistical models, BKMR was found to be the least “scalable” among the methods we considered. AENET-I has the advantage of providing important variables and working in a parametric regression structure and works well under sparsity and large-p-small-n situation. BART and BKMR are non-parametric methods based on kernel smoothing and recursive partitioning but have difficulty with a truly ultra-dimensional covariate space. SL is a highly flexible and powerful method that can incorporate any prediction generating tool as an ingredient and data-adaptively weigh the predictions. The limitation is that it does not provide measures of variable importance or carry out variable selection. All four methods can handle modest to strong correlation among predictors. While we have allowed our construction of ERS to be driven by statistical considerations, it may be worthwhile to group the pollutants by known biological mechanisms. Bayesian methods and grouped-LASSO type methods have the potential of incorporating this information through a priori grouping and incorporation of priors. Contrasting data-driven methods with biology-driven methods and combining epidemiology with biology in an optimal way remain a relevant research topic.

AENET most likely provides biased coefficients because of shrinkage estimation. Some studies attempted to obtain unpenalized coefficients by re-fitting multiple-exposure OLS regression models with selected variables post E-net selection [17]. In general, it is well-known that running OLS post model selection with the selected variables, using the same dataset leads to high false discovery rates and over-optimistic results. Thus we refrain from using ERS that is constructed by fitting OLS post model-selection. Fitting OLS on a new independent dataset with the selected variables is a valid strategy. There are new variants of LASSO estimates, for example the debiased LASSO that can handle this issue better but are still in developing stages of implementation. With advances in post-model selection theory, we would have better inferential tools under penalized variable selection methods like AENET [40].

One important approach we attempted is to use an oxidative stress marker as the outcome. A challenge for constructing the cumulative effects of multiple pollutant exposure like ERS is that they are disease-specific. ERS constructed to predict CVD may not predict cancer risk well. ERS for common disease pathways, such as oxidative stress, inflammation, epigenetic modification, and endocrine disruption, may capture cumulative early biological effects and discriminate individuals who are at increased risk of manifesting various downstream clinical diseases. GGT has been suggested an oxidative stress marker [24, 25] and widely used in large-scale epidemiologic research [41,42,43]. Markers that directly measure oxidative damages in vivo such as F2-isoprostane [44, 45] may improve risk prediction and thus better discriminate metal mixture health risk. Our analyses of linking ERS for oxidative stress and clinical outcomes suggest that this cumulative risk score is a useful tool to predict the clinical outcomes related to oxidative stress. One limitation is that given its non-specificity, the ERS in the current study may not capture disease risk for complex diseases as good as the ERS constructed for a specific disease. A more targeted ERS for a specific disease will need to be built with different weights.

Cardiovascular effects of heavy metals, especially lead, cadmium and arsenic, have long been acknowledged and examined in epidemiologic research. A recent expert review by the National Toxicology Program found sufficient evidence to conclude that lead exposure, even at low exposure (<10 μg/dL), is associated with elevated blood pressure and hypertension but the evidence is limited or inadequate for other cardiovascular endpoints [46]. For arsenic, several systematic reviews suggest there is evidence for a causal association between high long-term arsenic exposure and CVD endpoints; however, conclusive evidence is lacking for an association with low-to-moderate arsenic levels [47, 48]. For cadmium, chronic exposure may be an independent risk factor for CVD, but further studies with individual level exposure and standardized CVD outcome assessments and accounting for confounding by smoking are needed to establish a causal association [49]. For other non-essential metals, only a few epidemiologic studies were conducted in relation to CVD [50,51,52,53,54]. A recent systematic review concluded that the current evidence is insufficient to support the causality of the association between environmental metals and CVD because of the small number of studies [55].

Although individual metal effects were not our primary interest, the metals identified in the multipollutant models deserve further consideration. Our multipollutant models (AENET-I, BART and SL) confirmed that those historically established metals (lead, cadmium and arsenic as methylated forms (MMA and DMA)) are important predictors. Interestingly, these metals turned out to have important pairwise interactions with other metals: for lead, DMA, cesium, molybdenum, and tungsten; for cadmium, cesium and uranium; for DMA, molybdenum and uranium. Cobalt was selected as the second most important variable in BART. Our AENET models suggested an inverse association between cobalt and GGT. This may not be surprising given that cobalt is an essential trace element and is part of vitamin B12 (cobalamin) which is vital to maintain human health [56]. The levels of cobalt typically found in the environment is not known to be harmful [57]. However, recent studies conducted using NHANES data found significant positive associations of cobalt with cardiovascular and cerebrovascular disease [50] and age-related impaired mobility [58]. Agarwal et al. [50] also reported a significant positive association between cardiovascular and cerebrovascular disease and tungsten which was selected as a main effect as well as interaction with lead in AENET-I and as the fourth most important variable in BART. Tungsten, commonly used in metal alloys [59], has also been positively associated with diabetes [60] and peripheral artery disease [54]. Barium, a silvery-white metal used for many different purposes in industry [61], was also selected in our multipollutant models. Urinary barium has been associated with higher insulin resistance [60], obesity [62], and lower thyroid hormone levels [63]. It should be noted that these findings are subject to reverse causality given the nature of cross-sectional design and that phenotypes such as diabetes may increase metal excretions in urine [64].

The present study has numerous limitations. We examined only metal mixtures. In real world, we are exposed to many other environmental pollutants that may cause oxidative stress leading to CVD and other health consequences. Therefore, the associations found here are still subject to potential confounding by co-pollutants other than metals. Because biomonitoring data in NHANES were not measured in all participants (i.e., some pollutants were measured only in a subset (e.g., urinary metals were measured in one third of the participants in each cycle) and different classes of pollutants were measured in different subsets in order to reduce the burden of examinations), we were not able to include other classes of pollutants in this study. Although assessment of a large number of pollutants from different classes is costly and a challenging task, it is a prerequisite for evaluating the impact of pollutant mixtures on human health.

Relatedly, the environment used in our study is limited to chemical environmental pollutants. Nonchemical stressors such as behavioral and psychosocial factors also play an important role in chronic disease risk independently as well as in conjunction with chemical pollutants. Our approach was to treat these factors as confounders, i.e., to separate the effects of nonchemical stressors out of those of the chemical pollutant mixtures. ERS can be expanded to a broader range of the environment encompassing various non-genetic factors as well as chemical-nonchemical interactions depending on investigator’s points of view and research questions. As the concept of ‘exposome’, defined as the totality of exposures over the course of a lifetime [65, 66], is emerging, our proposed approach will be useful to integrate health risks from a wide range of environmental factors identified by novel statistical approaches beyond the conventional linear regression-based modeling.

In AENET algorithms, we assumed linear (in fact, log-linear) exposure-outcome relationships and linear interactions. However, non-linear, non-monotonic dose–response relationships, for example, low-dose effects of endocrine disrupting chemicals [67], may exist. Due to such model misspecification, important variables may not be selected in AENET and not contribute to the corresponding ERS. It should be noted that the selected pairwise interactions in AENET-I may reflect non-linear dose–response relationships rather than biological interactions. In fact, in presence of non-linearity, a complex linear model attempts to capture quadratic or higher order main effect terms by estimating a non-linear interaction term [68].

In addition, for AENET and BKMR, our strategy has been to always include the covariates/confounders Z in the predictor space and not penalize coefficients associated with Z. For BART and SL we first regressed GGT on Z and each of the E’s on Z and used the residuals from these models as the input variables for outcome and predictors. We are aware that this does not reproduce the model Y|E, Z. This is driven by limitations in the implementation software. By adopting this approach, we also assume that the covariates have a linear relationship with GGT without any non-linearity or interactions within the set of Z and ignore metal by covariate interaction terms of the form E*Z. Future extensions can consider this possibility and attempt to treating Z uniformly across the methods.

The primary goal of epidemiologic research is to identify causal factors and causal structures. Although a hypothesis-driven knowledge-based approach is a standard practice to build the causal structure in epidemiologic research, an agnostic data-driven machine learning approach may be useful when limited knowledge of the causal structure is available. Many environmental chemicals, especially emerging new toxicants, have little information on underlying biological mechanisms and toxicological and epidemiological evidence that preclude us from incorporating the causal structure to the analysis. Data-driven machine learning approaches used in the present study can help us identify causal structure among environmental chemicals within a mixture. Of note, it is of less interest in public health whether the selected chemicals are causal or markers of other causal factors when those chemicals are highly correlated because regulations targeted on one chemical would likely control the other chemicals. Another important role of constructing an ERS is to study effect modification of summary exposure by other factors (for example genetic factors) and provide risk stratification in the population. For example, ERS could be used as a summary measure in a follow-up gene-environment interaction analyses or to study the effect of epigenetic changes on environmental exposure.

As mentioned earlier, we constructed an ERS for oxidative stress as a common pathway for multiple diseases to overcome the construction of multiple disease-specific ERSs. However, complex diseases such as CVD involve multiple disease pathways, and therefore, our proposed ERS may capture only partial disease risk and can be less appropriate if the selected pathway is a minor one in disease etiology. Prior knowledge regarding underlying biological mechanisms is a key prerequisite to conduct the proposed work. Ideally, multiple ERS representing each plausible biological pathway could be constructed and adaptively weighted to provide a more comprehensive risk assessment for a complex chronic disease.

Each individual metal measured in either whole blood or urine has different half-lives and therefore, may not necessarily capture the actual exposure concentrations relevant to the current disease risk. Most urinary metals used in this study have short half-lives and thus their urinary concentrations reflect exposures happened over the past hours to days. Such metals may have not been detected to be associated with GGT in our analysis and could be missed in relation to other health endpoints if long-term cumulative exposure to low levels cause those health outcomes. Another source of exposure measurement error is LODs of all metal biomarkers. For simplicity, we used a conventional ad hoc method of imputing the concentrations below LOD with LOD/√2, but it is known to result in bias especially when the proportion of concentrations below LOD is high [27]. These exposure measurement errors are generally non-differential and lead to bias towards the null if each exposure is evaluated with the outcome individually in single pollutant models. However, bias would be more complicated and even differential measurement errors may occur in multi-pollutant models when multiple pollutants have different degrees of measurement errors and those measurement errors are not independent because some of the effects of more poorly assessed pollutants may be transferred to the effect estimates of better assessed pollutants [69].

Conclusions

In summary, the current study suggests that ERS is a useful tool for characterizing cumulative risk from pollutant mixtures, with accounting for statistical challenges such as high degrees of correlations and pollutant-pollutant interactions. ERS constructed for an intermediate marker like GGT is predictive of related disease endpoints. This new approach of multi-pollutant framework will help better understand the real-world health impacts of pollutant mixtures and facilitate risk stratification and targeted preventive intervention [20]. To overcome the methodological challenges discussed here and raised in the recent NIEHS workshop [19], collective work such as systematic assessment of a wide range of pollutants in well characterized cohorts and creation of a mixture data consortium will need to be done.