Developing and evaluating methods to impute race/ethnicity in an incomplete dataset

Health Services and Outcomes Research Methodology

Abstract

The availability of race data is essential for identifying and addressing racial/ethnic disparities in the health care system; however, patient self-reported racial/ethnic information is often missing. Indirect methods for estimating race have been developed, but they usually consider only geocoded and surname data as predictors, may perform poorly among racial minorities, do not adjust for possible errors in specific datasets, and cannot provide race estimates for subjects missing some of this information. The objective of this study was to address these limitations by developing novel methods for imputing race/ethnicity when this information is partially missing. By viewing the unobserved race as missing data, we explored different multiple imputation methods for imputing race/ethnicity, and we applied these methods to a subset of Rhode Island Medicaid beneficiaries. Current race imputation methods and newly developed ones were compared using area under the ROC curve (AUC) statistics and racial composition estimates to identify methods and sets of predictors that yield superior race imputations. Family race was identified as an important predictor and should be included in race estimation models when possible. Bayesian regression models (BRM) provide better race estimates than previously proposed methods. Missing race was multiply imputed using joint modeling and fully conditional specification (FCS). Post-imputation analyses showed that fully conditional specification with a BRM is superior to joint modeling for race imputation. The proposed fully conditional specification method is a flexible, effective way of estimating race/ethnicity that allows for propagation of imputation error and ease of interpretation in further analyses.

Fig. 1
Fig. 2

References

  • Adjaye-Gbewonyo, D., Bednarczyk, R.A., Davis, R.L., Omer, S.B.: Using the Bayesian improved surname geocoding method (BISG) to create a working classification of race and ethnicity in a diverse managed care population: a validation study. Health Serv. Res. 49(1), 268–283 (2013)

  • Consumer Financial Protection Bureau: Using publicly available information to proxy for unidentified race and ethnicity: a methodology and assessment. Consumer Financial Protection Bureau, United States (2014)

  • Elliott, M.N., Fremont, A., Morrison, P.A., Pantoja, P., Lurie, N.: A new method for estimating race/ethnicity and associated disparities where administrative records lack self-reported race/ethnicity. Health Serv. Res. 43(5p1), 1722–1736 (2008)

  • Elliott, M.N., Morrison, P.A., Fremont, A., McCaffrey, D.F., Pantoja, P., Lurie, N.: Using the Census Bureau’s surname list to improve estimates of race/ethnicity and associated disparities. Health Serv. Outcomes Res. Methodol. 9(2), 69 (2009)

  • Fawcett, T.: An introduction to ROC analysis. Pattern Recogn. Lett. 27(8), 861–874 (2006)

  • Fiscella, K., Fremont, A.M.: Use of geocoding and surname analysis to estimate race and ethnicity. Health Serv. Res. 41(4 Pt 1), 1482–1500 (2006)

  • Hassett, P.: Taking on racial and ethnic disparities in health care: the experience at Aetna. Health Aff. 24(2), 417–420 (2005)

  • Honaker, J., King, G., Blackwell, M.: Amelia: A program for missing data. R package version 1.7.5 (2018). https://cran.r-project.org/web/packages/Amelia/

  • Hosmer, D.W., Lemeshow, S.: Applied Logistic Regression. Wiley, Hoboken (2000)

  • Hosmer, D.W., Lemeshow, S., Sturdivant, R.X.: Applied Logistic Regression. Wiley, Hoboken (2013)

  • Kruschke, J.K.: Doing Bayesian Data Analysis: A Tutorial with R and BUGS. Academic Press, Burlington, MA (2011)

  • Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data. Wiley, Hoboken (2002)

  • Liu, Y., De, A.: Multiple imputation by fully conditional specification for dealing with missing data in a large epidemiologic study. Int. J. Stat. Med. Res. 4(3), 287–295 (2015)

  • Ma, Y., Zhang, W., Lyman, S., Huang, Y.: The HCUP SID imputation project: improving statistical inferences for health disparities research by imputing missing race data. Health Serv. Res. 53(3), 1870–1889 (2018)

  • Ng, J.H., Ye, F., Ward, L.M., Haffer, S.C.C., Scholle, S.H.: Data on race, ethnicity, and language largely incomplete for managed care plan members. Health Aff. (Project Hope) 36(3), 548–552 (2017)

  • Polson, N.G., Scott, J.G., Windle, J.: Bayesian inference for logistic models using Polya-Gamma latent variables (2013a). arXiv:1205.0310

  • Polson, N.G., Scott, J.G., Windle, J.: BayesLogit (2013b)

  • Rubin, D.B.: Inference and missing data. Biometrika 63(3), 581–592 (1976)

  • Rubin, D.B.: Multiple Imputation for Nonresponse in Surveys. Wiley, Hoboken (1987)

  • Schafer, J.L.: Analysis of Incomplete Multivariate Data, 1. ed., 1. CRC Press Reprint ed. Monographs on Statistics and Applied Probability, vol. 72. Chapman & Hall/CRC, Boca Raton (2000)

  • Schafer, J.L.: Mix: Estimation/Multiple imputation for mixed categorical and continuous data. R package version 1.0-10. (2017). https://CRAN.R-project.org/package=mix

  • Seaman, S.R., Hughes, R.A.: Relative efficiency of joint-model and full-conditional-specification multiple imputation when conditional models are compatible: the general location model. Stat. Methods Med. Res. 27(6), 1603–1614 (2018)

  • Ulmer, C., McFadden, B., Nerenz, D.R.: Race, Ethnicity, and Language Data: Standardization for Health Care Quality Improvement. National Academies Press, Washington, D.C. (2009)

  • van Buuren, S.: Multiple imputation of discrete and continuous data by fully conditional specification. Stat. Methods Med. Res. 16(3), 219–242 (2007)

  • van Buuren, S., Groothuis-Oudshoorn, K.: Mice: Multivariate imputation by chained equations in R. J. Stat. Softw. 45(3), 1–67. (2018). http://www.jstatsoft.org/v45/i03/

  • Word, D.L., Coleman, C.D., Nunziata, R., Kominski, R.: Demographic Aspects of Surnames from Census 2000. US Census Bureau, Suitland (2008)

Acknowledgements

Research was supported by a Grant (R40 MC 28319) from the Health Resources and Services Administration.

Author information

Corresponding author

Correspondence to Gabriella C. Silva.

Ethics declarations

Conflict of interest

All authors declare they have no conflict of interest.

Ethical approval

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.

Informed consent

The need for informed consent was waived by the institutional review board.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1

Pseudocode for multiple imputation with the FCS algorithm is presented below.

  1. Populate the incomplete dataset with initial starting values.

  2. Draw \( \theta_{R} \) from \( P(\theta_{R} \mid R^{obs}, Y_{-R}, M, A) \).

  3. Draw \( R^{miss} \) from \( P(R \mid R^{obs}, Y_{-R}, M, A, \theta_{R}) \) and substitute this into the dataset.

  4. Draw \( \theta_{G^{\prime}} \) from \( P(\theta_{G^{\prime}} \mid G^{\prime obs}, Y_{-G^{\prime}}, M, A) \).

  5. Draw \( G^{\prime miss} \) from \( P(G^{\prime} \mid G^{\prime obs}, Y_{-G^{\prime}}, M, A, \theta_{G^{\prime}}) \) and substitute this into the dataset.

  6. Draw \( \theta_{S^{\prime}} \) from \( P(\theta_{S^{\prime}} \mid S^{\prime obs}, Y_{-S^{\prime}}, M, A) \).

  7. Draw \( S^{\prime miss} \) from \( P(S^{\prime} \mid S^{\prime obs}, Y_{-S^{\prime}}, M, A, \theta_{S^{\prime}}) \) and substitute this into the dataset.

  8. Repeat Steps 2–7 until the cycle reaches convergence. The current draws are the set of imputed values.

  9. Repeat Steps 1–8 \( m \) times to obtain \( m \) imputed datasets.
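The steps above can be sketched in Python on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses only two categorical variables (`R` and `G`, hypothetical stand-ins for race and a geocoded category), replaces the Bayesian regression models with a saturated Dirichlet-multinomial conditional model, and runs a fixed number of cycles instead of checking convergence. All function and variable names are hypothetical.

```python
import random

def draw_dirichlet(counts, rng, prior=1.0):
    """Draw category probabilities from a Dirichlet(counts + prior) posterior."""
    gammas = [rng.gammavariate(c + prior, 1.0) for c in counts]
    total = sum(gammas)
    return [g / total for g in gammas]

def impute_column(data, target, predictor, levels, missing, rng):
    """One FCS update for `target` given the current values of `predictor`."""
    # Draw theta = P(target | predictor) from its posterior, using only the
    # rows where `target` was actually observed (Steps 2, 4, 6).
    theta = {}
    for p in levels[predictor]:
        counts = [sum(1 for i, row in enumerate(data)
                      if i not in missing[target]
                      and row[predictor] == p and row[target] == t)
                  for t in levels[target]]
        theta[p] = draw_dirichlet(counts, rng)
    # Redraw the missing entries given theta and substitute them (Steps 3, 5, 7).
    for i in missing[target]:
        probs = theta[data[i][predictor]]
        data[i][target] = rng.choices(levels[target], weights=probs)[0]

def fcs_impute(rows, levels, m=5, n_cycles=20, seed=0):
    """Return m completed copies of `rows`; None marks a missing value."""
    rng = random.Random(seed)
    missing = {v: {i for i, r in enumerate(rows) if r[v] is None} for v in levels}
    completed = []
    for _ in range(m):                      # Step 9: m imputed datasets
        data = [dict(r) for r in rows]
        for v in levels:                    # Step 1: starting values
            for i in missing[v]:
                data[i][v] = rng.choice(levels[v])
        for _ in range(n_cycles):           # Step 8: fixed cycles (assumption)
            impute_column(data, "R", "G", levels, missing, rng)
            impute_column(data, "G", "R", levels, missing, rng)
        completed.append(data)
    return completed

# Toy data (invented for illustration only).
rows = [{"R": "a", "G": "x"}, {"R": "a", "G": "x"}, {"R": "b", "G": "y"},
        {"R": None, "G": "x"}, {"R": "b", "G": None}, {"R": "b", "G": "y"}]
levels = {"R": ["a", "b"], "G": ["x", "y"]}
completed = fcs_impute(rows, levels, m=3)
```

Each element of `completed` is one imputed dataset; observed values are left unchanged and only the `None` entries are redrawn.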

Appendix 2

To determine the specification for \( P(\mathbf{R} \mid \mathbf{Y}_{-\mathbf{R}}, \mathbf{M}, \mathbf{A}, \boldsymbol{\theta}_{\mathbf{R}}) \) and to compare various multiple imputation methods, the observed race for \( n_{e} \) of the \( n = 6087 \) individuals with fully observed data was set to missing. The remaining \( n_{f} \) individuals were used to obtain parameter estimates for the BRMs. Using the final sample of parameter estimates, we computed, for each of the \( n_{e} \) individuals, the probability that individual \( i \) belongs to race \( r \in \Upsilon \), given by \( p_{ir} = \frac{\exp(\mathbf{x}_{i}^{T} \boldsymbol{\beta}_{r})}{\sum_{r^{\prime} \in \Upsilon} \exp(\mathbf{x}_{i}^{T} \boldsymbol{\beta}_{r^{\prime}})} \). The probabilities \( \mathbf{p}_{i} = (p_{i,White}, p_{i,Black}, p_{i,AI}, p_{i,Hispanic}, p_{i,API}) \) for \( i \in \{1, 2, \ldots, n_{e}\} \) were compared to the observed races for these individuals using AUC and racial composition.
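As a rough illustration of this evaluation, the sketch below computes the multinomial-logit probabilities \( p_{ir} \) and two summary measures. Several pieces are assumptions rather than details from the article: the coefficient values and covariates are invented, racial composition is estimated as the average of the probabilities, and AUC is computed one race versus the rest with ties counted as one half.

```python
import math

RACES = ["White", "Black", "AI", "Hispanic", "API"]

def race_probabilities(x, betas):
    """Return {race: p_ir} for one covariate vector x under a multinomial logit."""
    scores = [sum(xj * bj for xj, bj in zip(x, betas[r])) for r in RACES]
    top = max(scores)                      # subtract the max for numerical stability
    expo = [math.exp(s - top) for s in scores]
    total = sum(expo)
    return dict(zip(RACES, (e / total for e in expo)))

def composition(prob_rows):
    """Estimated racial composition: mean probability per race (assumption)."""
    n = len(prob_rows)
    return {r: sum(p[r] for p in prob_rows) / n for r in RACES}

def auc_one_vs_rest(prob_rows, observed, race):
    """Share of (member, non-member) pairs ranked correctly; ties count 1/2."""
    pos = [p[race] for p, o in zip(prob_rows, observed) if o == race]
    neg = [p[race] for p, o in zip(prob_rows, observed) if o != race]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical coefficients (intercept, one covariate); White is the reference.
betas = {"White": [0.0, 0.0], "Black": [-0.5, 1.0], "AI": [-2.0, 0.3],
         "Hispanic": [-0.3, -1.0], "API": [-1.5, 0.2]}
X = [[1.0, 2.0], [1.0, -2.0], [1.0, 0.0]]          # invented covariate rows
observed = ["Black", "Hispanic", "White"]          # invented held-out races
probs = [race_probabilities(x, betas) for x in X]
print(composition(probs))
print(auc_one_vs_rest(probs, observed, "Black"))
```

Subtracting the largest score before exponentiating leaves the probabilities unchanged but avoids overflow for large linear predictors.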

Individuals in the testing set were those whose race has been set equal to missing. Thus, when determining which set each of these \( n \) individuals will be placed in, it is important to specify whether race is missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR). We considered all three in our analysis; implementation details for each are available below.

2.1 MCAR

The race data is MCAR if the missingness of race is unrelated to any of the study variables; that is, the pattern of race missingness is independent of observable variables (Little and Rubin 2002). In practice, this is a very strong assumption. We implement this assumption by randomly selecting \( n_{e} \) individuals from the group of \( n \). We set \( n_{e} \) = 2891 and \( n_{f} \) = 3196 such that the percentage of individuals with missing race is 47.5% because this is the percentage of missing race in the full RI dataset.
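A minimal sketch of this MCAR split follows, assuming individuals are identified by row index; the function name and seed are hypothetical.

```python
import random

def mcar_split(n, n_e, seed=0):
    """Randomly pick n_e of n individuals whose race is set to missing."""
    rng = random.Random(seed)
    test = set(rng.sample(range(n), n_e))          # simple random sample
    train = [i for i in range(n) if i not in test]
    return train, sorted(test)

train_idx, test_idx = mcar_split(n=6087, n_e=2891)
print(len(train_idx), len(test_idx), round(100 * len(test_idx) / 6087, 1))
# prints: 3196 2891 47.5
```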

2.2 MAR

The race data is MAR if the missingness can be explained by variables for which there is complete information. Using the individuals in rows V and VI of Fig. 1, we fit a logistic regression model where the dependent variable is an indicator for whether race is missing and the independent variables are the fully observed geocoded and surname probabilities, family race indicators, language, and age. Using the estimated parameters for this model (intercept, geocoded, surname, family race indicators, language, and age in Table 8), we estimate the probability that race is missing for each of the \( n \) individuals with fully observed data. These probabilities are then used to determine whether an individual belongs in the training set or the testing set. In our analysis, \( n_{f} \) = 3195 while \( n_{e} \) = 2892; hence, race is set to missing for 47.5% of the \( n \) beneficiaries considered for this simulation.
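The assignment step can be sketched as follows. The article does not state how the estimated probabilities are turned into set membership, so independent Bernoulli draws are an assumption here; the coefficients and covariate rows are invented and are not the Table 8 values.

```python
import math
import random

def missing_probability(x, intercept, coefs):
    """Logistic-model probability that race is missing for covariates x."""
    eta = intercept + sum(c * v for c, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-eta))

def mar_split(X, intercept, coefs, seed=0):
    """Assign each row to the test set with its missingness probability."""
    rng = random.Random(seed)
    train, test = [], []
    for i, x in enumerate(X):
        p = missing_probability(x, intercept, coefs)
        # Bernoulli draw (assumption): race is set to missing with probability p.
        (test if rng.random() < p else train).append(i)
    return train, test

# Invented fully observed covariates, e.g. a surname probability and scaled age.
X = [[0.9, 0.2], [0.1, 0.7], [0.5, 0.5], [0.3, 0.1]]
train_idx, test_idx = mar_split(X, intercept=-0.1, coefs=[0.8, -0.5])
```

Because missingness here depends only on fully observed covariates and not on race itself, the simulated mechanism is MAR by construction.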

Table 8 Parameter values used to simulate MAR

2.3 MNAR

The race data is MNAR when the missingness of race depends on the racial group itself, even after controlling for other variables with complete information. To simulate MNAR, we fit the same logistic regression model described in the MAR section and also incorporate parameters for the observed race (Table 9). Using this combined set of parameters, we compute the probability that race is missing for each of the \( n \) beneficiaries with completely observed data. As before, these probabilities are used to determine whether an individual is placed in the training set or the testing set. In our analysis, \( n_{f} \) = 3197 while \( n_{e} \) = 2890; the percentage of individuals whose race is set to missing is 47.5%.

Table 9 Parameter values used to simulate MNAR

Note: The intercepts reported in Tables 8 and 9 were not the intercepts estimated from the logistic regression model. Rather, these were modified so that the percentage of individuals in the testing set was 47.5% across all missing data mechanisms.
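One way such an intercept adjustment could be carried out is sketched below. The article does not describe the procedure, so the bisection search, the function names, and the example linear predictors are all assumptions. The idea is that the mean missingness probability increases strictly with the intercept, so a target share such as 47.5% can be matched by a one-dimensional search.

```python
import math

def mean_missing_prob(intercept, linear_parts):
    """Average logistic probability over individuals for a given intercept."""
    probs = [1.0 / (1.0 + math.exp(-(intercept + lp))) for lp in linear_parts]
    return sum(probs) / len(probs)

def calibrate_intercept(linear_parts, target=0.475, lo=-30.0, hi=30.0, tol=1e-10):
    """Bisection on the intercept until the mean probability hits `target`."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if mean_missing_prob(mid, linear_parts) < target:
            lo = mid        # too few missing: raise the intercept
        else:
            hi = mid        # too many missing: lower the intercept
    return (lo + hi) / 2.0

# Invented non-intercept linear predictors (x_i' beta without the intercept).
linear_parts = [-1.0, 0.0, 0.5, 2.0]
b0 = calibrate_intercept(linear_parts)
print(round(mean_missing_prob(b0, linear_parts), 4))
```

With a single individual whose linear predictor is zero, the calibrated intercept reduces to the logit of the target, \( \log(0.475 / 0.525) \).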

About this article

Cite this article

Silva, G.C., Trivedi, A.N. & Gutman, R. Developing and evaluating methods to impute race/ethnicity in an incomplete dataset. Health Serv Outcomes Res Method 19, 175–195 (2019). https://doi.org/10.1007/s10742-019-00200-9

  • DOI: https://doi.org/10.1007/s10742-019-00200-9
