Developing and evaluating methods to impute race/ethnicity in an incomplete dataset

Silva, Gabriella C.; Trivedi, Amal N.; Gutman, Roee

doi:10.1007/s10742-019-00200-9


Published: 08 June 2019

Volume 19, pages 175–195 (2019)

Health Services and Outcomes Research Methodology


Abstract

The availability of race data is essential for identifying and addressing racial/ethnic disparities in the health care system; however, patient self-reported racial/ethnic information is often missing. Indirect methods for estimating race have been developed, but they usually consider only geocoded and surname data as predictors, may perform poorly among racial minorities, do not adjust for possible errors in specific datasets, and cannot provide race estimates for subjects missing some of this information. The objective of this study was to address these limitations by developing novel methods for imputing race/ethnicity when this information is partially missing. By viewing the unobserved race as missing data, we explored different multiple imputation methods for imputing race/ethnicity, and we applied these methods to a subset of Rhode Island Medicaid beneficiaries. Current race imputation methods and newly developed ones were compared using area under the ROC curve (AUC) statistics and racial composition estimates to identify methods and sets of predictors that yield superior race imputations. Family race was identified as an important predictor and should be included in race estimation models when possible. Bayesian regression models (BRM) provide better race estimates than previously proposed methods. Missing race was multiply imputed using joint modeling and fully conditional specification (FCS). Post-imputation analyses showed that fully conditional specification with a BRM is superior to joint modeling for race imputation. The proposed fully conditional specification method is a flexible, effective way of estimating race/ethnicity that allows for propagation of imputation error and ease of interpretation in further analyses.


References

Adjaye-Gbewonyo, D., Bednarczyk, R.A., Davis, R.L., Omer, S.B.: Using the Bayesian improved surname geocoding method (BISG) to create a working classification of race and ethnicity in a diverse managed care population: a validation study. Health Serv. Res. 49(1), 268–283 (2013)
Consumer Financial Protection Bureau: Using publicly available information to proxy for unidentified race and ethnicity: a methodology and assessment. Consumer Financial Protection Bureau, United States (2014)
Elliott, M.N., Fremont, A., Morrison, P.A., Pantoja, P., Lurie, N.: A new method for estimating race/ethnicity and associated disparities where administrative records lack self-reported race/ethnicity. Health Serv. Res. 43(5p1), 1722–1736 (2008)
Elliott, M.N., Morrison, P.A., Fremont, A., McCaffrey, D.F., Pantoja, P., Lurie, N.: Using the Census Bureau’s surname list to improve estimates of race/ethnicity and associated disparities. Health Serv. Outcomes Res. Methodol. 9(2), 69 (2009)
Fawcett, T.: An introduction to ROC analysis. Pattern Recogn. Lett. 27(8), 861–874 (2006)
Fiscella, K., Fremont, A.M.: Use of geocoding and surname analysis to estimate race and ethnicity. Health Serv. Res. 41(4 Pt 1), 1482–1500 (2006)
Hassett, P.: Taking on racial and ethnic disparities in health care: the experience at Aetna. Health Aff. 24(2), 417–420 (2005)
Honaker, J., King, G., Blackwell, M.: Amelia: A program for missing data. R package version 1.7.5 (2018). https://cran.r-project.org/web/packages/Amelia/
Hosmer, D.W., Lemeshow, S.: Applied Logistic Regression. Wiley, Hoboken (2000)
Hosmer, D.W., Lemeshow, S., Sturdivant, R.X.: Applied Logistic Regression. Wiley, Hoboken (2013)
Kruschke, J.K.: Doing Bayesian Data Analysis: A Tutorial with R and BUGS. Academic Press, Burlington, MA (2011)
Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data. Wiley, Hoboken (2002)
Liu, Y., De, A.: Multiple imputation by fully conditional specification for dealing with missing data in a large epidemiologic study. Int. J. Stat. Med. Res. 4(3), 287–295 (2015)
Ma, Y., Zhang, W., Lyman, S., Huang, Y.: The HCUP SID imputation project: improving statistical inferences for health disparities research by imputing missing race data. Health Serv. Res. 53(3), 1870–1889 (2018)
Ng, J.H., Ye, F., Ward, L.M., Haffer, S.C.C., Scholle, S.H.: Data on race, ethnicity, and language largely incomplete for managed care plan members. Health Aff. (Project Hope) 36(3), 548–552 (2017)
Polson, N.G., Scott, J.G., Windle, J.: Bayesian inference for logistic models using Polya-Gamma latent variables (2013a). arXiv:1205.0310
Polson, N.G., Scott, J.G., Windle, J.: BayesLogit (2013b)
Rubin, D.B.: Inference and missing data. Biometrika 63(3), 581–592 (1976)
Rubin, D.B.: Multiple Imputation for Nonresponse in Surveys. Wiley, Hoboken (1987)
Schafer, J.L.: Analysis of Incomplete Multivariate Data, 1. ed., 1. CRC Press Reprint ed. Monographs on Statistics and Applied Probability, vol. 72. Chapman & Hall/CRC, Boca Raton (2000)
Schafer, J.L.: Mix: Estimation/Multiple imputation for mixed categorical and continuous data. R package version 1.0-10. (2017). https://CRAN.R-project.org/package=mix
Seaman, S.R., Hughes, R.A.: Relative efficiency of joint-model and full-conditional-specification multiple imputation when conditional models are compatible: the general location model. Stat. Methods Med. Res. 27(6), 1603–1614 (2018)
Ulmer, C., McFadden, B., Nerenz, D.R.: Race, Ethnicity, and Language Data: Standardization for Health Care Quality Improvement. National Academies Press, Washington, D.C. (2009)
van Buuren, S.: Multiple imputation of discrete and continuous data by fully conditional specification. Stat. Methods Med. Res. 16(3), 219–242 (2007)
van Buuren, S., Groothuis-Oudshoorn, K.: Mice: Multivariate imputation by chained equations in R. J. Stat. Softw. 45(3), 1–67 (2018). http://www.jstatsoft.org/v45/i03/
Word, D.L., Coleman, C.D., Nunziata, R., Kominski, R.: Demographic Aspects of Surnames from Census 2000. US Census Bureau, Suitland (2008)

Acknowledgements

Research was supported by a Grant (R40 MC 28319) from the Health Resources and Services Administration.

Author information

Authors and Affiliations

Department of Biostatistics, Brown University, 121 South Main Street, Box G-S121-7, Providence, RI, 02912, USA
Gabriella C. Silva & Roee Gutman
Department of Health Services, Policy and Practice, Brown University, 121 South Main Street, 7th Floor, Providence, RI, 02903, USA
Amal N. Trivedi


Corresponding author

Correspondence to Gabriella C. Silva.

Ethics declarations

Conflict of interest

All authors declare they have no conflict of interest.

Ethical approval

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.

Informed consent

The need for informed consent was waived by the institutional review board.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1

Pseudocode for multiple imputation with the FCS algorithm is presented below.

1. Populate the incomplete dataset with initial starting values.
2. Draw \( \theta_{R} \) from \( P(\theta_{R} \mid R^{obs}, Y_{-R}, M, A) \).
3. Draw \( R^{miss} \) from \( P(R \mid R^{obs}, Y_{-R}, M, A, \theta_{R}) \) and substitute this into the dataset.
4. Draw \( \theta_{G^{\prime}} \) from \( P(\theta_{G^{\prime}} \mid G^{\prime obs}, Y_{-G^{\prime}}, M, A) \).
5. Draw \( G^{\prime miss} \) from \( P(G^{\prime} \mid G^{\prime obs}, Y_{-G^{\prime}}, M, A, \theta_{G^{\prime}}) \) and substitute this into the dataset.
6. Draw \( \theta_{S^{\prime}} \) from \( P(\theta_{S^{\prime}} \mid S^{\prime obs}, Y_{-S^{\prime}}, M, A) \).
7. Draw \( S^{\prime miss} \) from \( P(S^{\prime} \mid S^{\prime obs}, Y_{-S^{\prime}}, M, A, \theta_{S^{\prime}}) \) and substitute this into the dataset.
8. Repeat Steps 2–7 until the cycle reaches convergence. The current draws are the set of imputed values.
9. Repeat Steps 1–8 \( m \) times to obtain \( m \) imputed datasets.
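
The cycle in Steps 1–9 can be sketched in Python. This is a minimal illustration, not the implementation used in the article: every variable is treated as continuous and imputed with a normal linear model under a flat prior, as a stand-in for the Bayesian regression models applied to race, and a fixed number of cycles replaces a convergence check. The names `draw_and_impute`, `fcs_impute`, `n_cycles`, and `seed` are hypothetical.

```python
import numpy as np

def draw_and_impute(y, X, miss, rng):
    """One variable's pair of steps: draw theta, then draw the missing values."""
    Xo, yo = X[~miss], y[~miss]
    n, p = Xo.shape
    XtX_inv = np.linalg.inv(Xo.T @ Xo)
    beta_hat = XtX_inv @ Xo.T @ yo
    resid = yo - Xo @ beta_hat
    # Draw theta = (sigma^2, beta) from its posterior under a flat prior
    # (assumed normal linear model, not the article's BRM).
    sigma2 = resid @ resid / rng.chisquare(n - p)
    beta = rng.multivariate_normal(beta_hat, sigma2 * XtX_inv)
    # Draw the missing entries given theta and substitute them.
    y_new = y.copy()
    y_new[miss] = X[miss] @ beta + rng.normal(0.0, np.sqrt(sigma2), miss.sum())
    return y_new

def fcs_impute(data, n_cycles=10, m=5, seed=0):
    """Return m completed copies of `data`, where NaN marks a missing entry."""
    rng = np.random.default_rng(seed)
    miss = np.isnan(data)
    imputed = []
    for _ in range(m):                                  # Step 9
        filled = data.copy()
        for j in range(data.shape[1]):                  # Step 1: random observed values
            observed = data[~miss[:, j], j]
            filled[miss[:, j], j] = rng.choice(observed, miss[:, j].sum())
        for _ in range(n_cycles):                       # Step 8: fixed cycle count (assumption)
            for j in range(data.shape[1]):              # Steps 2-7, one variable at a time
                if not miss[:, j].any():
                    continue
                others = np.delete(filled, j, axis=1)
                X = np.column_stack([np.ones(len(filled)), others])
                filled[:, j] = draw_and_impute(filled[:, j], X, miss[:, j], rng)
        imputed.append(filled)
    return imputed
```

Observed entries are never overwritten; only the positions flagged in `miss` change from cycle to cycle, and each of the `m` passes restarts from fresh starting values.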

Appendix 2

To determine the specification for \( P(\mathbf{R} \mid \mathbf{Y}_{-\mathbf{R}}, \mathbf{M}, \mathbf{A}, \boldsymbol{\theta}_{\mathbf{R}}) \) and to compare various multiple imputation methods, the observed race for \( n_{e} \) of the \( n = 6087 \) individuals with fully observed data was set to missing. The remaining \( n_{f} \) individuals were used to obtain parameter estimates for the BRMs. Using the final sample of parameter estimates, we computed, for each of the \( n_{e} \) individuals, the probability that individual \( i \) belongs to race \( r \in \Upsilon \): \( p_{ir} = \frac{\exp(\mathbf{x}_{i}^{T} \boldsymbol{\beta}_{r})}{\sum_{r' \in \Upsilon} \exp(\mathbf{x}_{i}^{T} \boldsymbol{\beta}_{r'})} \). The probabilities \( \mathbf{p}_{i} = (p_{i,\mathrm{White}}, p_{i,\mathrm{Black}}, p_{i,\mathrm{AI}}, p_{i,\mathrm{Hispanic}}, p_{i,\mathrm{API}}) \) for \( i \in \{ 1, 2, \ldots, n_{e} \} \) were compared to the observed races for these individuals using AUC and racial composition.
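
As an illustration of this evaluation step, the sketch below computes the probabilities \( p_{ir} \) from a coefficient matrix and scores them with a one-vs-rest AUC and an estimated racial composition. It rests on assumptions not stated in the text: the AUC is taken per race against all other races, the composition estimate is the column mean of the probabilities, and the names `race_probabilities`, `auc_one_vs_rest`, and `composition` are hypothetical.

```python
import numpy as np

RACES = ["White", "Black", "AI", "Hispanic", "API"]

def race_probabilities(X, betas):
    """p_ir = exp(x_i' beta_r) / sum over r' of exp(x_i' beta_r').

    X is (n_e, k); betas is (k, 5) with one column per race.
    """
    eta = X @ betas
    eta = eta - eta.max(axis=1, keepdims=True)  # guard against overflow
    expo = np.exp(eta)
    return expo / expo.sum(axis=1, keepdims=True)

def auc_one_vs_rest(scores, is_positive):
    """Probability that a random positive outranks a random negative (ties count 1/2)."""
    pos, neg = scores[is_positive], scores[~is_positive]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

def composition(probs):
    """Estimated share of each race: the average probability per column."""
    return probs.mean(axis=0)
```

For race `r` with column index `j`, `auc_one_vs_rest(probs[:, j], observed == r)` gives the AUC for that race, and `composition(probs)` can be set against the observed proportions in the testing set.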

Individuals in the testing set were those whose race has been set equal to missing. Thus, when determining which set each of these \( n \) individuals will be placed in, it is important to specify whether race is missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR). We considered all three in our analysis; implementation details for each are available below.

2.1 MCAR

The race data is MCAR if the missingness of race is unrelated to any of the study variables; that is, the pattern of race missingness is independent of observable variables (Little and Rubin 2002). In practice, this is a very strong assumption. We implement this assumption by randomly selecting \( n_{e} \) individuals from the group of \( n \). We set \( n_{e} \) = 2891 and \( n_{f} \) = 3196 such that the percentage of individuals with missing race is 47.5% because this is the percentage of missing race in the full RI dataset.
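
The MCAR split amounts to a simple random sample without replacement. A minimal sketch, using the sample sizes reported above (the seed and variable names are arbitrary choices for illustration):

```python
import numpy as np

n, n_e = 6087, 2891            # sizes reported in the text
rng = np.random.default_rng(0)

# Choose n_e individuals uniformly at random; their race is set to missing.
test_idx = rng.choice(n, size=n_e, replace=False)
is_test = np.zeros(n, dtype=bool)
is_test[test_idx] = True

n_f = n - is_test.sum()        # individuals left for fitting the models
pct_missing = 100 * n_e / n    # about 47.5
```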

2.2 MAR

The race data is MAR if the missingness can be explained by variables for which there is complete information. Using the individuals in rows V and VI of Fig. 1, we fit a logistic regression model where the dependent variable is an indicator for whether race is missing and the independent variables are the fully observed geocoded and surname probabilities, family race indicators, language, and age. Using the estimated parameters for this model (intercept, geocoded, surname, family race indicators, language, and age in Table 8), we estimate the probability that race is missing for each of the \( n \) individuals with fully observed data. These probabilities are then used to determine whether an individual belongs in the training set or the testing set. In our analysis, \( n_{f} \) = 3195 while \( n_{e} \) = 2892; hence, race is set to missing for 47.5% of the \( n \) beneficiaries considered for this simulation.
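
The assignment step can be sketched as follows. The text does not say how the probabilities are turned into set membership; this sketch assumes one independent Bernoulli draw per individual. The function names, and the arguments `Z` (matrix of fully observed predictors), `intercept`, and `coefs`, are hypothetical placeholders for the quantities in Table 8.

```python
import numpy as np

def missingness_probability(Z, intercept, coefs):
    """Logistic model: P(race missing | fully observed predictors Z)."""
    return 1.0 / (1.0 + np.exp(-(intercept + Z @ coefs)))

def assign_test_set(Z, intercept, coefs, rng):
    """True marks the testing set (race set to missing); assumes Bernoulli draws."""
    p_miss = missingness_probability(Z, intercept, coefs)
    return rng.random(len(p_miss)) < p_miss
```

Because membership is drawn at random under this assumption, the size of the testing set varies slightly from one draw to the next.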

Table 8 Parameter values used to simulate MAR


2.3 MNAR

The race data is MNAR when the missingness of race depends on the racial group itself, even after controlling for other variables with complete information. To simulate MNAR, we fit the same logistic regression model described in the MAR section and also incorporate parameters for the observed race (Table 9). Using this combined set of parameters, we compute the probability that race is missing for each of the \( n \) beneficiaries with completely observed data. As before, these probabilities are used to determine whether an individual will be placed in the training set or the testing set. In our analysis, \( n_{f} \) = 3197 while \( n_{e} \) = 2890; the percentage of individuals whose race is set to missing is 47.5%.
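
The contrast between the two simulated mechanisms can be written as a pair of logistic models. The notation is an assumption introduced here for illustration: \( D_{i} \) indicates that race is missing for individual \( i \), \( \mathbf{z}_{i} \) collects the fully observed predictors, \( \boldsymbol{\alpha} \) holds their coefficients, and \( \gamma_{r} \) is the additional term for observed race \( r \).

```latex
\begin{align*}
\text{MAR:}\quad  & \operatorname{logit} P(D_i = 1 \mid \mathbf{z}_i)
                    = \alpha_0 + \mathbf{z}_i^{T}\boldsymbol{\alpha} \\
\text{MNAR:}\quad & \operatorname{logit} P(D_i = 1 \mid \mathbf{z}_i, R_i = r)
                    = \alpha_0^{*} + \mathbf{z}_i^{T}\boldsymbol{\alpha} + \gamma_r
\end{align*}
```

Under MNAR the probability of missingness changes with \( R_{i} \) even when \( \mathbf{z}_{i} \) is held fixed, which is exactly what the MAR model rules out.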

Table 9 Parameter values used to simulate MNAR


Note: The intercepts reported in Tables 8 and 9 are not the intercepts estimated from the logistic regression model. Rather, they were modified so that the percentage of individuals in the testing set was 47.5% across all missing data mechanisms.
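
The note does not describe how the intercepts were modified. One way to do it, shown here only as an assumed approach, is a bisection search for the intercept at which the average missingness probability equals the target. The function name `calibrate_intercept` and its arguments are hypothetical; `linear_part` stands for the non-intercept part of the linear predictor for each individual.

```python
import numpy as np

def calibrate_intercept(linear_part, target=0.475, lo=-20.0, hi=20.0, tol=1e-10):
    """Find the intercept whose mean missingness probability equals `target`.

    Assumes the solution lies in [lo, hi]; the mean probability is
    increasing in the intercept, so bisection applies.
    """
    def mean_prob(a):
        return np.mean(1.0 / (1.0 + np.exp(-(a + linear_part))))

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_prob(mid) < target:
            lo = mid          # too few missing: raise the intercept
        else:
            hi = mid          # too many missing: lower the intercept
    return 0.5 * (lo + hi)
```

With the slopes held at their estimated values, only the overall rate of missingness shifts, so the relationship between the predictors and missingness is preserved.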


About this article

Cite this article

Silva, G.C., Trivedi, A.N. & Gutman, R. Developing and evaluating methods to impute race/ethnicity in an incomplete dataset. Health Serv Outcomes Res Method 19, 175–195 (2019). https://doi.org/10.1007/s10742-019-00200-9


Received: 19 November 2018
Revised: 10 May 2019
Accepted: 31 May 2019
Published: 08 June 2019
Issue Date: 14 September 2019
DOI: https://doi.org/10.1007/s10742-019-00200-9
