Imputing race and ethnicity in healthcare claims databases

Health Services and Outcomes Research Methodology

Abstract

Our objective was to enhance existing methods for indirectly estimating race/ethnicity in health care data by exploring ways to improve imputation accuracy, using a total of 9,812,306 hospital visits from the Connecticut statewide hospitalization claims database from 2012 to 2017. Using these data, we developed multinomial logistic regression models to predict patients’ race and ethnicity under the assumption that 50% of race/ethnicity values are missing completely at random. Our models included predictors derived from Connecticut birth records, US Census data, and patient-level demographic data, and were compared using performance measures. Our model correctly classified the race and ethnicity of approximately 85% of patients in the Connecticut hospitalization claims data. We found the following [sensitivity, specificity] values (in percent) for our five race/ethnicity categories: non-Hispanic White [94, 83], non-Hispanic Black [76, 97], non-Hispanic Asian or Pacific Islander [41, 99.6], Hispanic [87, 95], and non-Hispanic other race [5, 99.7]. First name, surname, census tract, and insurance type were key predictors. Further, Connecticut-specific name dictionaries were better at identifying non-White race and ethnicity than the national 2010 US Census surname dictionary. Therefore, state-specific health records, census information, and patients’ demographic characteristics can be used to improve the prediction of missing racial and ethnic information in Connecticut hospitalization claims. In addition, this approach can be adapted to other state-specific healthcare databases, which enhances opportunities to investigate and address racial disparities in health outcomes.
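The evaluation described in the abstract (hide a share of known labels completely at random, predict them, then report per-category sensitivity and specificity) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `mask_mcar`, `one_vs_rest_metrics`, and `accuracy`, the shortened category labels, and the toy data are all hypothetical, and the "predictions" are typed in by hand rather than produced by a multinomial logistic regression model.

```python
import random
from collections import Counter


def mask_mcar(labels, frac=0.5, seed=0):
    """Hide a random fraction of labels, independent of any data (MCAR).

    Returns (masked_labels, hidden_indices); hidden entries become None.
    """
    rng = random.Random(seed)
    hidden = set(rng.sample(range(len(labels)), round(frac * len(labels))))
    masked = [None if i in hidden else lab for i, lab in enumerate(labels)]
    return masked, sorted(hidden)


def one_vs_rest_metrics(y_true, y_pred, categories):
    """Return {category: (sensitivity, specificity)}.

    Each category is treated in turn as the 'positive' class and all
    other categories together as the 'negative' class.
    """
    pairs = Counter(zip(y_true, y_pred))
    n = len(y_true)
    metrics = {}
    for c in categories:
        tp = pairs[(c, c)]
        fn = sum(v for (t, p), v in pairs.items() if t == c and p != c)
        fp = sum(v for (t, p), v in pairs.items() if t != c and p == c)
        tn = n - tp - fn - fp
        sens = tp / (tp + fn) if (tp + fn) else float("nan")
        spec = tn / (tn + fp) if (tn + fp) else float("nan")
        metrics[c] = (sens, spec)
    return metrics


def accuracy(y_true, y_pred):
    """Share of records whose predicted category equals the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data: true labels and hand-written predictions.
y_true = ["White", "White", "White", "Black", "Black", "Hispanic"]
y_pred = ["White", "White", "Black", "Black", "Black", "White"]

masked, hidden_idx = mask_mcar(y_true, frac=0.5, seed=1)
metrics = one_vs_rest_metrics(y_true, y_pred, ["White", "Black", "Hispanic"])
overall = accuracy(y_true, y_pred)
```

In a real evaluation, the metrics would be computed only on the records whose labels were hidden, comparing the model's predictions with the held-back true values. The sketch also shows why a small category can have very low sensitivity alongside very high specificity, as reported for the non-Hispanic other race category: a model that rarely predicts a category produces few false positives for it.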

Acknowledgements

Funding to support this research was received through a cooperative agreement between the State of Connecticut and the Centers for Medicare and Medicaid Services (1G1CMS331404). Hospitalization and birth registry data were obtained from the Connecticut Department of Public Health.

Funding

Funding to support this research was received through a cooperative agreement between the State of Connecticut and the Centers for Medicare and Medicaid Services (1G1CMS331404).

Author information

Contributions

OH and RA acquired funding for the project, developed the methodology, and substantively contributed to and revised the manuscript. KZ implemented the methodology and was a major contributor in writing the manuscript. The authors assume full responsibility for all such analyses, interpretations and conclusions. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Ofer Harel.

Ethics declarations

Conflicts of interest

The authors declare that they have no competing interests.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Availability of data and materials

Hospitalization and birth registry data were obtained from the Connecticut Department of Public Health (CT DPH), which does not endorse or assume any responsibility for any analyses, interpretations or conclusions based on the data. The authors assume full responsibility for all such analyses, interpretations and conclusions. This study was approved by the CT DPH Human Investigations Committee. These data are not publicly available because they contain personally identifiable information. Contact the CT DPH to inquire about access to these data sets.

Code availability

The code will be available upon request.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zavez, K., Harel, O. & Aseltine, R.H. Imputing race and ethnicity in healthcare claims databases. Health Serv Outcomes Res Method 22, 493–507 (2022). https://doi.org/10.1007/s10742-022-00273-z

  • DOI: https://doi.org/10.1007/s10742-022-00273-z
