Abstract
Our objective was to enhance existing methods for indirectly estimating race/ethnicity in health care data by exploring ways to improve imputation accuracy, using a total of 9,812,306 hospital visits from the Connecticut statewide hospitalization claims database from 2012 to 2017. Using these data, we developed multinomial logistic regression models to predict patients’ race and ethnicity under the assumption that 50% of race/ethnicity values are missing completely at random. Our models included predictors derived from Connecticut birth records, US Census data, and demographic patient-level data, and were compared using performance measures. Our model correctly classified the race and ethnicity of approximately 85% of patients in the Connecticut hospitalization claims data. We found the following [sensitivities and specificities] for our five race/ethnicity categories: non-Hispanic White [94, 83], non-Hispanic Black [76, 97], non-Hispanic Asian or Pacific Islander [41, 99.6], Hispanic [87, 95], and non-Hispanic other race [5, 99.7]. First name, surname, census tract, and insurance type were key predictors. Further, Connecticut-specific name dictionaries were better at identifying non-White race and ethnicity than the national 2010 US Census surname dictionary. Therefore, state-specific health records, census information, and patients’ demographic characteristics can be used to improve the prediction of missing racial and ethnic information in Connecticut hospitalization claims. In addition, this approach can be adapted to other state-specific healthcare databases, which enhances opportunities to investigate and address racial disparities in health outcomes.
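The per-category sensitivities and specificities reported above are one-vs-rest measures: each race/ethnicity category is treated in turn as the "positive" class and all other categories as "negative". The sketch below shows one way such measures could be computed from true and predicted labels. It is an illustration only, not the authors' code; the function name `one_vs_rest_metrics`, the shortened category labels, and the toy data are all assumptions made for this example.

```python
from collections import Counter


def one_vs_rest_metrics(y_true, y_pred):
    """Return {label: (sensitivity, specificity)}, treating each label
    in turn as the positive class and all other labels as negative."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))  # counts of (true, predicted) pairs
    n = len(y_true)
    metrics = {}
    for label in labels:
        tp = pairs[(label, label)]
        fn = sum(c for (t, p), c in pairs.items() if t == label and p != label)
        fp = sum(c for (t, p), c in pairs.items() if t != label and p == label)
        tn = n - tp - fn - fp
        sens = tp / (tp + fn) if (tp + fn) else float("nan")
        spec = tn / (tn + fp) if (tn + fp) else float("nan")
        metrics[label] = (sens, spec)
    return metrics


# Toy labels invented for illustration; these are NOT study data.
y_true = ["White", "White", "White", "Black", "Black",
          "Hispanic", "Hispanic", "Asian/PI"]
y_pred = ["White", "White", "Black", "Black", "White",
          "Hispanic", "Hispanic", "White"]

metrics = one_vs_rest_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

for label, (sens, spec) in metrics.items():
    print(f"{label}: sensitivity={sens:.2f}, specificity={spec:.2f}")
print(f"overall accuracy={accuracy:.3f}")
```

In this toy example the smallest category is never predicted, so its sensitivity is 0 while its specificity is 1. That is the same qualitative pattern as the low-sensitivity, high-specificity results reported above for the smaller categories.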
Acknowledgements
Funding to support this research was received through a cooperative agreement between the State of Connecticut and the Centers for Medicare and Medicaid Services (1G1CMS331404). Hospitalization and birth registry data were obtained from the Connecticut Department of Public Health.
Author information
Contributions
OH and RA acquired funding for the project, developed the methodology, and substantively contributed to and revised the manuscript. KZ implemented the methodology and was a major contributor to writing the manuscript. The authors assume full responsibility for all analyses, interpretations and conclusions. All authors read and approved the final manuscript.
Ethics declarations
Conflicts of interest
The authors declare that they have no competing interests.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Availability of data and materials
Hospitalization and birth registry data were obtained from the Connecticut Department of Public Health (CT DPH), which does not endorse or assume any responsibility for any analyses, interpretations or conclusions based on the data. The authors assume full responsibility for all such analyses, interpretations and conclusions. This study was approved by the CT DPH Human Investigations Committee. These data are not publicly available because they contain personally identifiable information. Contact the CT DPH to inquire about access to these data sets.
Code availability
The code will be available upon request.
About this article
Cite this article
Zavez, K., Harel, O. & Aseltine, R.H. Imputing race and ethnicity in healthcare claims databases. Health Serv Outcomes Res Method 22, 493–507 (2022). https://doi.org/10.1007/s10742-022-00273-z
DOI: https://doi.org/10.1007/s10742-022-00273-z