Abstract
The availability of race data is essential for identifying and addressing racial/ethnic disparities in the health care system; however, patient self-reported racial/ethnic information is often missing. Indirect methods for estimating race have been developed, but they usually consider only geocoded and surname data as predictors, may perform poorly among racial minorities, do not adjust for possible errors in specific datasets, and cannot provide race estimates for subjects missing some of this information. The objective of this study was to address these limitations by developing novel methods for imputing race/ethnicity when this information is partially missing. By viewing the unobserved race as missing data, we explored different multiple imputation methods for imputing race/ethnicity, and we applied these methods to a subset of Rhode Island Medicaid beneficiaries. Current race imputation methods and newly developed ones were compared using area under the ROC curve statistics and racial composition estimates to identify methods and sets of predictors that yield superior race imputations. Family race was identified as an important predictor and should be included in race estimation models when possible. Bayesian regression models (BRM) provide better race estimates than previously proposed methods. Missing race was multiply imputed using joint modeling and fully conditional specification. Post-imputation analyses showed that fully conditional specification with a BRM is superior to joint modeling for race imputation. The proposed fully conditional specification method is a flexible, effective way of estimating race/ethnicity that allows for propagation of imputation error and ease of interpretation in further analyses.
References
Adjaye-Gbewonyo, D., Bednarczyk, R.A., Davis, R.L., Omer, S.B.: Using the Bayesian improved surname geocoding method (BISG) to create a working classification of race and ethnicity in a diverse managed care population: a validation study. Health Serv. Res. 49(1), 268–283 (2013)
Consumer Financial Protection Bureau: Using publicly available information to proxy for unidentified race and ethnicity: a methodology and assessment. Consumer Financial Protection Bureau, United States (2014)
Elliott, M.N., Fremont, A., Morrison, P.A., Pantoja, P., Lurie, N.: A new method for estimating race/ethnicity and associated disparities where administrative records lack self-reported race/ethnicity. Health Serv. Res. 43(5p1), 1722–1736 (2008)
Elliott, M.N., Morrison, P.A., Fremont, A., McCaffrey, D.F., Pantoja, P., Lurie, N.: Using the Census Bureau’s surname list to improve estimates of race/ethnicity and associated disparities. Health Serv. Outcomes Res. Methodol. 9(2), 69 (2009)
Fawcett, T.: An introduction to ROC analysis. Pattern Recogn. Lett. 27(8), 861–874 (2006)
Fiscella, K., Fremont, A.M.: Use of geocoding and surname analysis to estimate race and ethnicity. Health Serv. Res. 41(4 Pt 1), 1482–1500 (2006)
Hassett, P.: Taking on racial and ethnic disparities in health care: the experience at Aetna. Health Aff. 24(2), 417–420 (2005)
Honaker, J., King, G., Blackwell, M.: Amelia: A program for missing data. R package version 1.7.5 (2018). https://cran.r-project.org/web/packages/Amelia/
Hosmer, D.W., Lemeshow, S.: Applied Logistic Regression. Wiley, Hoboken (2000)
Hosmer, D.W., Lemeshow, S., Sturdivant, R.X.: Applied Logistic Regression. Wiley, Hoboken (2013)
Kruschke, J.K.: Doing Bayesian Data Analysis: A Tutorial with R and BUGS. Academic Press, Burlington, MA (2011)
Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data. Wiley, Hoboken (2002)
Liu, Y., De, A.: Multiple imputation by fully conditional specification for dealing with missing data in a large epidemiologic study. Int. J. Stat. Med. Res. 4(3), 287–295 (2015)
Ma, Y., Zhang, W., Lyman, S., Huang, Y.: The HCUP SID imputation project: improving statistical inferences for health disparities research by imputing missing race data. Health Serv. Res. 53(3), 1870–1889 (2018)
Ng, J.H., Ye, F., Ward, L.M., Haffer, S.C.C., Scholle, S.H.: Data on race, ethnicity, and language largely incomplete for managed care plan members. Health Aff. (Project Hope) 36(3), 548–552 (2017)
Polson, N.G., Scott, J.G., Windle, J.: Bayesian inference for logistic models using Polya-Gamma latent variables (2013a). arXiv:1205.0310
Polson, N.G., Scott, J.G., Windle, J.: BayesLogit (2013b)
Rubin, D.B.: Inference and missing data. Biometrika 63(3), 581–592 (1976)
Rubin, D.B.: Multiple Imputation for Nonresponse in Surveys. Wiley, Hoboken (1987)
Schafer, J.L.: Analysis of Incomplete Multivariate Data, 1. ed., 1. CRC Press Reprint ed. Monographs on Statistics and Applied Probability, vol. 72. Chapman & Hall/CRC, Boca Raton (2000)
Schafer, J.L.: Mix: Estimation/Multiple imputation for mixed categorical and continuous data. R package version 1.0-10 (2017). https://CRAN.R-project.org/package=mix
Seaman, S.R., Hughes, R.A.: Relative efficiency of joint-model and full-conditional-specification multiple imputation when conditional models are compatible: the general location model. Stat. Methods Med. Res. 27(6), 1603–1614 (2018)
Ulmer, C., McFadden, B., Nerenz, D.R.: Race, Ethnicity, and Language Data: Standardization for Health Care Quality Improvement. National Academies Press, Washington, D.C. (2009)
van Buuren, S.: Multiple imputation of discrete and continuous data by fully conditional specification. Stat. Methods Med. Res. 16(3), 219–242 (2007)
van Buuren, S., Groothuis-Oudshoorn, K.: Mice: Multivariate imputation by chained equations in R. J. Stat. Softw. 45(3), 1–67 (2018). http://www.jstatsoft.org/v45/i03/
Word, D.L., Coleman, C.D., Nunziata, R., Kominski, R.: Demographic Aspects of Surnames from Census 2000. US Census Bureau, Suitland (2008)
Acknowledgements
Research was supported by a Grant (R40 MC 28319) from the Health Resources and Services Administration.
Ethics declarations
Conflict of interest
All authors declare they have no conflict of interest.
Ethical approval
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.
Informed consent
The need for informed consent was waived by the institutional review board.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1
Pseudocode for multiple imputation with the FCS algorithm is presented below.
1. Populate incomplete dataset with initial starting values.
2. Draw \( \theta_{R} \) from \( P(\theta_{R} \mid R^{obs}, Y_{-R}, M, A) \).
3. Draw \( R^{miss} \) from \( P(R \mid R^{obs}, Y_{-R}, M, A, \theta_{R}) \) and substitute this into the dataset.
4. Draw \( \theta_{G^{\prime}} \) from \( P(\theta_{G^{\prime}} \mid G^{\prime obs}, Y_{-G^{\prime}}, M, A) \).
5. Draw \( G^{\prime miss} \) from \( P(G^{\prime} \mid G^{\prime obs}, Y_{-G^{\prime}}, M, A, \theta_{G^{\prime}}) \) and substitute this into the dataset.
6. Draw \( \theta_{S^{\prime}} \) from \( P(\theta_{S^{\prime}} \mid S^{\prime obs}, Y_{-S^{\prime}}, M, A) \).
7. Draw \( S^{\prime miss} \) from \( P(S^{\prime} \mid S^{\prime obs}, Y_{-S^{\prime}}, M, A, \theta_{S^{\prime}}) \) and substitute this into the dataset.
8. Repeat Steps 2–7 until the cycle reaches convergence. The current draws are the set of imputed values.
9. Repeat Steps 1–8 \( m \) times to obtain \( m \) imputed datasets.
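The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: every variable is treated as categorical, the BRMs are replaced by a saturated Dirichlet-multinomial conditional model, and the cycle runs a fixed number of times instead of being checked for convergence. The names `fcs_impute`, `draw_dirichlet`, `levels`, `prior`, and `n_cycles` are hypothetical.

```python
import random
from collections import defaultdict


def draw_dirichlet(alphas, rng):
    """Draw one probability vector from Dirichlet(alphas) using gamma variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]


def fcs_impute(rows, levels, n_cycles=10, m=5, prior=1.0, seed=0):
    """Return m completed copies of rows; None marks a missing entry.

    levels[j] lists the categories of column j (hypothetical input format).
    """
    rng = random.Random(seed)
    n_cols = len(levels)
    missing = [[value is None for value in row] for row in rows]
    incomplete = [j for j in range(n_cols) if any(flags[j] for flags in missing)]
    completed_sets = []
    for _ in range(m):  # Step 9: one pass per imputed dataset
        data = [list(row) for row in rows]
        # Step 1: starting values sampled from each column's observed entries.
        for j in incomplete:
            observed = [row[j] for row in rows if row[j] is not None]
            for i, row in enumerate(data):
                if missing[i][j]:
                    row[j] = rng.choice(observed)
        for _ in range(n_cycles):  # Step 8, with a fixed cycle count (assumption)
            for j in incomplete:  # Steps 2-7: visit each incomplete variable
                # Count observed values of column j within each pattern of the
                # other columns; the prior pseudo-count keeps every cell positive.
                counts = defaultdict(lambda: [prior] * len(levels[j]))
                for i, row in enumerate(data):
                    if not missing[i][j]:
                        key = tuple(row[:j] + row[j + 1:])
                        counts[key][levels[j].index(row[j])] += 1
                theta = {}
                for i, row in enumerate(data):
                    if missing[i][j]:
                        key = tuple(row[:j] + row[j + 1:])
                        if key not in theta:
                            # Draw theta_j from its posterior given observed data.
                            theta[key] = draw_dirichlet(counts[key], rng)
                        # Draw the missing value given theta_j and substitute it.
                        row[j] = rng.choices(levels[j], weights=theta[key])[0]
        completed_sets.append(data)
    return completed_sets
```

Each conditional model uses only the rows where the target variable was actually observed, while the conditioning columns may contain current imputations; this mirrors the alternation between parameter draws and imputation draws in Steps 2–7. Fully observed columns are never imputed but still enter every conditioning pattern.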
Appendix 2
To determine the specification for \( P(\mathbf{R} \mid \mathbf{Y}_{-\mathbf{R}}, \mathbf{M}, \mathbf{A}, \boldsymbol{\theta}_{\mathbf{R}}) \) and to compare various multiple imputation methods, the observed race for \( n_{e} \) of the \( n = 6087 \) individuals with fully observed data was set to missing. The remaining \( n_{f} \) individuals were used to obtain parameter estimates for the BRMs. Using the final sample of parameter estimates, we computed, for each of the \( n_{e} \) individuals, the probability that individual \( i \) belongs to race \( r \in \varUpsilon \): \( p_{ir} = \frac{\exp(\mathbf{x}_{i}^{T} \boldsymbol{\beta}_{r})}{\sum_{r^{\prime} \in \varUpsilon} \exp(\mathbf{x}_{i}^{T} \boldsymbol{\beta}_{r^{\prime}})} \). The probabilities \( \mathbf{p}_{i} = (p_{i,\mathrm{White}}, p_{i,\mathrm{Black}}, p_{i,\mathrm{AI}}, p_{i,\mathrm{Hispanic}}, p_{i,\mathrm{API}}) \) for \( i \in \{1, 2, \ldots, n_{e}\} \) were compared to the observed races for these individuals using AUC and racial composition.
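As an illustration of the two computations in this paragraph, the sketch below evaluates the multinomial-logit probabilities \( p_{ir} \) for one individual and a rank-based AUC for one race. It is a minimal example under stated assumptions: the function names are hypothetical, the coefficients passed in would come from a fitted model (none are taken from the paper), and the one-vs-rest treatment of AUC is an assumption, since the appendix does not say how AUC was computed across the five groups.

```python
import math


def race_probabilities(x, betas):
    """Softmax probabilities p_ir = exp(x'beta_r) / sum over r' of exp(x'beta_r')."""
    scores = {r: sum(xk * bk for xk, bk in zip(x, beta)) for r, beta in betas.items()}
    shift = max(scores.values())  # subtract the maximum for numerical stability
    weights = {r: math.exp(s - shift) for r, s in scores.items()}
    total = sum(weights.values())
    return {r: w / total for r, w in weights.items()}


def one_vs_rest_auc(probs, is_member):
    """AUC for one race: chance that a member outranks a non-member (ties count 1/2)."""
    pos = [p for p, y in zip(probs, is_member) if y]
    neg = [p for p, y in zip(probs, is_member) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Here `x` is the covariate vector for one individual and `betas` maps each race label to a coefficient vector of the same length. `one_vs_rest_auc` takes the estimated probabilities for a single race across the testing set together with indicators of whether each individual's observed race matches it.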
Individuals in the testing set were those whose race was set to missing (the \( n_{e} \) individuals above); the remaining \( n_{f} \) individuals formed the training set. Thus, when determining which set each of the \( n \) individuals is placed in, it is important to specify whether race is missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR). We considered all three in our analysis; implementation details for each are given below.
2.1 MCAR
The race data is MCAR if the missingness of race is unrelated to any of the study variables; that is, the pattern of race missingness is independent of observable variables (Little and Rubin 2002). In practice, this is a very strong assumption. We implement this assumption by randomly selecting \( n_{e} \) individuals from the group of \( n \). We set \( n_{e} \) = 2891 and \( n_{f} \) = 3196 such that the percentage of individuals with missing race is 47.5% because this is the percentage of missing race in the full RI dataset.
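A simple random split of this kind can be sketched as follows. The sample sizes are those reported above; the seed and variable names are arbitrary choices made for the illustration.

```python
import random

n, n_e = 6087, 2891  # sample sizes reported in the text
rng = random.Random(2019)  # arbitrary seed, used only for reproducibility

ids = list(range(n))
testing = set(rng.sample(ids, n_e))  # race set to missing
training = [i for i in ids if i not in testing]  # race kept (the n_f individuals)

share_missing = len(testing) / n  # about 0.475
```

Because selection ignores every covariate and race itself, the missingness indicator is independent of the data by construction, which is what MCAR requires.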
2.2 MAR
The race data is MAR if the missingness can be explained by variables for which there is complete information. Using the individuals in rows V and VI of Fig. 1, we fit a logistic regression model where the dependent variable is an indicator for whether race is missing and the independent variables are the fully observed geocoded and surname probabilities, family race indicators, language, and age. Using the estimated parameters for this model (intercept, geocoded, surname, family race indicators, language, and age in Table 8), we estimate the probability that race is missing for each of the \( n \) individuals with fully observed data. These probabilities are then used to determine whether an individual belongs in the training set or the testing set. In our analysis, \( n_{f} \) = 3195 while \( n_{e} \) = 2892; hence, race is set to missing for 47.5% of the \( n \) beneficiaries considered for this simulation.
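The sketch below shows one way such probabilities could be turned into a training/testing split. It assumes an independent Bernoulli draw per individual, which the text does not state explicitly; the function names and the dictionary-based covariate format are hypothetical, and no coefficient values from Table 8 are reproduced.

```python
import math
import random


def missing_probability(covariates, coefs, intercept):
    """Logistic model: P(race missing | fully observed covariates)."""
    eta = intercept + sum(coefs[name] * value for name, value in covariates.items())
    # Branch on the sign of eta so that exp() never overflows.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    return math.exp(eta) / (1.0 + math.exp(eta))


def assign_sets(people, coefs, intercept, seed=0):
    """One Bernoulli draw per person: True means testing set (race set to missing)."""
    rng = random.Random(seed)
    return [rng.random() < missing_probability(p, coefs, intercept) for p in people]
```

Under this scheme the number of individuals placed in the testing set is random, so it varies slightly from one mechanism to the next while the expected share stays fixed.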
2.3 MNAR
The race data is MNAR when the missingness of race depends on race itself, even after controlling for other variables with complete information. To simulate MNAR, we fit the same logistic regression model described in the MAR section and also incorporate parameters for the observed race (Table 9). Using this combined set of parameters, we compute the probability that race is missing for each of the \( n \) beneficiaries with completely observed data. As before, these probabilities are used to determine whether an individual is placed in the training set or the testing set. In our analysis, \( n_{f} \) = 3197 while \( n_{e} \) = 2890; the percentage of individuals whose race is set to missing is 47.5%.
Note: The intercepts reported in Tables 8 and 9 were not the intercepts estimated from the logistic regression models. Rather, they were modified so that the percentage of individuals in the testing set was 47.5% across all missing data mechanisms.
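The appendix does not say how the intercepts were modified. One possible approach, shown here only as an assumption, is to hold the other coefficients fixed and search for the intercept at which the average missingness probability equals the target share. The function names are hypothetical; `linear_parts` holds each individual's linear predictor without the intercept (including the race terms in the MNAR case).

```python
import math


def mean_missing_probability(intercept, linear_parts):
    """Average P(missing) when `intercept` is added to each linear predictor."""
    total = sum(1.0 / (1.0 + math.exp(-(intercept + lp))) for lp in linear_parts)
    return total / len(linear_parts)


def calibrate_intercept(linear_parts, target=0.475, lo=-30.0, hi=30.0, tol=1e-10):
    """Bisection on the intercept; the mean probability is increasing in it."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if mean_missing_probability(mid, linear_parts) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

Bisection works here because the average of logistic probabilities increases monotonically with the intercept. The default search bounds assume linear predictors of moderate size.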
About this article
Cite this article
Silva, G.C., Trivedi, A.N. & Gutman, R. Developing and evaluating methods to impute race/ethnicity in an incomplete dataset. Health Serv Outcomes Res Method 19, 175–195 (2019). https://doi.org/10.1007/s10742-019-00200-9
DOI: https://doi.org/10.1007/s10742-019-00200-9