Abstract
Background
Studies of adult patients have shown that nursing overtime and unit overcrowding are associated with more adverse patient events, but little evidence exists for the Neonatal Intensive Care Unit (NICU). We investigate the main determinants of nosocomial infections and medical accidents in a NICU using state-of-the-art machine learning techniques. Our analysis is a retrospective study of the 7,438 neonates admitted to the CHU de Québec NICU (capacity of 51 beds) from 10 April 2008 to 28 March 2013. Daily administrative data on nursing overtime hours, total regular hours, number of admissions, patient characteristics, as well as information on nosocomial infections and on the timing and type of medical errors were retrieved from various hospital-level datasets.
Methods
We use a generalized mixed effects regression tree model with random effects (GMERT-RI) to build prediction trees for the two outcomes. The model uses neonates' characteristics and their daily exposure to numerous covariates. GMERT-RI is a recent extension of the standard tree-based method and is suitable for binary outcomes. The model makes it possible to identify the most important predictors.
Results
Diagnosis-related group level, regular hours of work, overtime, admission rates, birth weight and occupancy rates are the main predictors for both outcomes. On the other hand, gestational age, C-section, multiple births, medical/surgical status and number of admissions are poor predictors.
Conclusion
The GMERT-RI algorithm is a powerful tool. It is well suited to unearth potential correlations in the context of unbalanced panel data and discrete health outcomes, two common features of clinical data. In the particular setting of a NICU, we find that institutional features (overtime hours, occupancy rates, etc.) are just as important drivers as neonate-specific medical conditions in predicting medical accidents and health care associated infections. From an operational point of view, prediction trees can complement traditional management tools in preventing undesirable health outcomes in the NICU.
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12553-022-00723-1/MediaObjects/12553_2022_723_Fig1_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12553-022-00723-1/MediaObjects/12553_2022_723_Fig2_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12553-022-00723-1/MediaObjects/12553_2022_723_Fig3_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12553-022-00723-1/MediaObjects/12553_2022_723_Fig4_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12553-022-00723-1/MediaObjects/12553_2022_723_Fig5_HTML.png)
Data availability
The data are confidential, proprietary hospital-level data and cannot be shared.
Code availability
Public domain R packages.
Notes
ML has been used for identification of disease onset, classification of disease severity, predicting epileptic seizures, ... It is fast becoming a hybrid physician-support tool thanks to the vast amount of data generated in healthcare systems. Although machine learning can prove a powerful tool, there is potential for misuse: model performance can be inflated through overfitting, in which case the model will not generalize to the greater population. A number of recent methods, including the one we use, have been proposed to expand the applicability of machine learning tools and ensure robustness of results for within-subject factors and random effects (see Schultz et al. [18]).
Logistic regression is often used as a baseline model against which to gauge more sophisticated machine learning approaches. It is often appropriate for clinical outcomes when using a small set of variables (see Gao et al. [14]). More sophisticated regularized variants of logistic regression (e.g., lasso-regularized and ridge logistic regressions) make it possible to remove uninformative variables and/or identify near-linear relationships between some subsets (see Tibshirani [15]). Yet the main disadvantage of logistic regression is that it may require large sample sizes to achieve reliable performance, particularly in the presence of high-dimensional variable sets (see Schultz et al. [18]).
This occurred whenever a nurse either started her shift earlier than planned or finished later than scheduled. Working beyond 16 consecutive hours per day was prohibited.
Reporting the information on the timing as well as the type of MA is mandatory.
It is the highest cross-validation error that is less than the sum of the minimum cross-validation error and the standard deviation of the error on that tree.
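As a rough sketch of this selection rule, the snippet below applies it literally to a small table of candidate trees. The field names `n_splits`, `xerror` and `xstd`, and all the numbers, are hypothetical and chosen for illustration only; they are not taken from the study.

```python
def select_by_one_sd(candidates):
    """Apply the rule literally to a list of candidate trees.

    Each candidate is a dict with hypothetical keys:
      'n_splits' - number of splits in the tree
      'xerror'   - cross-validation error
      'xstd'     - standard deviation of the cross-validation error
    Returns (threshold, chosen_candidate).
    """
    # Tree with the minimum cross-validation error.
    best = min(candidates, key=lambda c: c["xerror"])
    # Threshold: minimum error plus the standard deviation on that tree.
    threshold = best["xerror"] + best["xstd"]
    # Candidates whose error is strictly below the threshold;
    # fall back to the best tree if xstd is zero.
    eligible = [c for c in candidates if c["xerror"] < threshold] or [best]
    # Highest cross-validation error among the eligible candidates.
    chosen = max(eligible, key=lambda c: c["xerror"])
    return threshold, chosen


# Made-up illustration values.
example_trees = [
    {"n_splits": 0, "xerror": 1.00, "xstd": 0.05},
    {"n_splits": 2, "xerror": 0.80, "xstd": 0.04},
    {"n_splits": 5, "xerror": 0.73, "xstd": 0.04},
    {"n_splits": 9, "xerror": 0.70, "xstd": 0.04},
]
example_threshold, example_choice = select_by_one_sd(example_trees)
```

In this made-up table the error decreases with tree size, so the selected tree is also the smallest one whose error falls below the threshold.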
The misclassification rate (MCR) is given by \(MCR=\left( \sum _{i=1}^{N^{(v)} } \sum _{t=1}^{T^{(v)}_i} \mid y_{it} - \widehat{y}_{it} \mid \right) /T^{(v)}\) where \(\widehat{y}_{it}\) is the predicted class of observation t in cluster i: \(\widehat{y}_{it} = \text {Bernoulli} \left( \widehat{\mu }_{it} \right)\) with \(\widehat{\mu }_{it} = \left( 1 + \exp \left( - \widehat{f}(X^{\prime }_{it}) - Z^{\prime }_{it} \widehat{u}_i \right) \right) ^{-1}\). \(\widehat{f}(X^{\prime }_{it})\) is the predicted fixed component that results from the tree and \(Z^{\prime }_{it} \widehat{u}_i\) is its predicted random part corresponding to its cluster. \(N^{(v)}\) is the number of clusters in the validation set, \(T^{(v)}_i\) is the size of cluster i and \(T^{(v)}\) is the total number of observations in the validation set.
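The following is a minimal sketch of this computation, assuming scalar fixed and random parts and clusters stored as plain lists. The function names and the toy labels are illustrative assumptions, not part of the original analysis.

```python
import math
import random


def predicted_probability(fixed_part, random_part):
    """mu_hat = 1 / (1 + exp(-f_hat(X) - Z'u_hat))."""
    return 1.0 / (1.0 + math.exp(-fixed_part - random_part))


def draw_class(mu_hat, rng):
    """Predicted class as a Bernoulli(mu_hat) draw, as in the formula."""
    return 1 if rng.random() < mu_hat else 0


def misclassification_rate(y_true, y_pred):
    """Sum of |y - y_hat| over all clusters and observations,
    divided by the total number of observations in the validation set.

    y_true, y_pred: lists of clusters, each cluster a list of 0/1 values.
    """
    total_obs = sum(len(cluster) for cluster in y_true)
    errors = sum(
        abs(y - y_hat)
        for true_cluster, pred_cluster in zip(y_true, y_pred)
        for y, y_hat in zip(true_cluster, pred_cluster)
    )
    return errors / total_obs


# Toy validation set with two clusters (made-up labels).
toy_true = [[0, 0, 1], [1, 0]]
toy_pred = [[0, 1, 1], [1, 1]]
toy_mcr = misclassification_rate(toy_true, toy_pred)  # 2 errors out of 5
```

Because the predicted class is a random draw, repeated evaluations can give slightly different rates unless the random generator is seeded.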
In a Monte Carlo simulation study with random effects, Hajjem et al. [13] have shown that the mixed-effects classification trees give better results than the usual classification trees even with a misspecified random component part.
Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset. The Gini impurity can be computed by summing the probability \(p_i\) of an item with label i being chosen times the probability \((1-p_i)\) of a mistake in categorizing that item. To compute Gini impurity \(I_G\) for a set of items with J classes, suppose \(i \in \left\{ 1, 2, ...,J \right\}\) and let \(p_i\) be the fraction of items labeled with class i, then \(I_G = 1 - \sum _{i=1}^{J} p^{2}_i\). In our case, \(J=2\) for accident (1) and no accident (0).
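As a small worked illustration of the formula \(I_G = 1 - \sum _{i=1}^{J} p^{2}_i\), the sketch below computes the impurity of a list of labels. The function name and the example labels are assumptions made for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_i p_i^2, where p_i is the fraction
    of items carrying label i."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


# Binary case (J = 2): one accident (1) among four observations.
# p_1 = 0.25, p_0 = 0.75, so I_G = 1 - (0.0625 + 0.5625) = 0.375.
example_gini = gini_impurity([1, 0, 0, 0])
```

A pure node (all labels identical) has impurity 0, and a perfectly mixed binary node has impurity 0.5, the maximum for \(J=2\).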
We use the recent R package ROCR, which makes it possible to create cutoff-parameterized 2D performance curves by freely combining any two of over 25 performance measures.
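To illustrate what a cutoff-parameterized curve is, here is a plain Python sketch (not the ROCR interface) that pairs two performance measures, the false positive rate and the true positive rate, at each cutoff. The function name, scores and labels are made up for illustration.

```python
def roc_points(scores, labels):
    """Return (cutoff, false_positive_rate, true_positive_rate) tuples,
    one per distinct cutoff, classifying as positive when score >= cutoff."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((cutoff, fp / n_neg, tp / n_pos))
    return points


# Made-up predicted probabilities and observed outcomes.
example_curve = roc_points([0.9, 0.8, 0.4, 0.2], [1, 0, 1, 0])
```

Replacing the two rates with any other pair of measures computed at each cutoff (precision and recall, for instance) gives a different performance curve over the same cutoffs.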
To save on space, probabilities smaller than 1% appear as “0” and those above 99% appear as “1” inside the nodes.
These numbers correspond to the sum of the observations in each cell of the terminal nodes, or leaves. On the left-hand side this gives 126 = 22 + 13 + 37 + 13 + 13 + 28, while on the right-hand side it corresponds to 152 = 20 + 22 + 33 + 17 + 28 + 17 + 15.
Note that the model slightly overestimates the true number of infections, i.e. 372 instead of 272. On the other hand, if we focus on events with probabilities strictly larger than 80% then overestimation is reduced significantly, i.e. from 372 to 289.
References
Tucker J, Tarnow-Mordi W, Gould C, Parry G, Marlow N, on behalf of the UK Neonatal Staffing Study Collaborative Group. UK neonatal intensive care services in 1996. Arch Dis Child Fetal Neonatal Ed. 1999;80:F233-34.
Polin RA, Denson S, Brady MT. Strategies for prevention of health care–associated infections in the NICU. Pediatrics. 2012;129(4):e1085–93.
Beltempo M, Lacroix G, Cabot M, Blais R, Piedboeuf B. Association of nursing overtime, nurse staffing and unit occupancy with medical incidents and outcomes of very preterm infants. J Perinatol. 2017;38:175. https://doi.org/10.1038/jp.2017.146.
Russell RB, Green NS, Steiner CA, Meikle S, Howse JL, Poschman K, Dias T, Potetz L, Davidoff MJ, Damus K, Petrini JR. Cost of hospitalization for preterm and low birth weight infants in the United States. Pediatrics. 2007;120(1):1–9.
Beltempo M, Lacroix G, Cabot M, Piedboeuf B. Factors and costs associated with the use of registered nurse overtime in the neonatal intensive care unit. Pediatrics and Neonatal Nursing Open Journal. 2016;4:17–23.
Berney B, Needleman J. Trends in nurse overtime, 1995–2002. Policy Polit Nurs Pract. 2005;6:183–90.
Bae S-H. Presence of nurse mandatory overtime regulations and nurse and patient outcomes. Nursing Economic$. 2013;31(2):59–89.
Lin H. Revisiting the relationship between nurse staffing and quality of care in nursing homes: An instrumental variables approach. J Health Econ. 2014;37:13–24.
Cimiotti JP, Aiken LH, Sloane DM, Wu ES. Nurse staffing, burnout, and health care-associated infection. Am J Infect Control. 2012;40(6):486–90.
Trinkoff AM, Johantgen M, Storr CL, Gurses AP, Liang Y, Han K. Nurses’ work schedule characteristics, nurse staffing, and patient mortality. Nurs Res. 2011;60(1):1–8.
Beltempo M, Bresson G, Étienne J-M, Lacroix G. Infections, accidents and nursing overtime in a neonatal intensive care unit. Eur J Health Econ. 2021.
Clarke SLN, Parmesar K, Saleem MA, Ramanan AV. Future of machine learning in paediatrics. Arch Dis Child. 2021;1–6.
Hajjem A, Larocque D, Bellavance F. Generalized mixed effects regression trees. Statist Probab Lett. 2017;126:114–8.
Gao C, Sun H, Wang T, Tang M, Bohnen NI, Müller MLTM, Herman T, Giladi N, Kalinin A, Spino C, et al. Model-based and model-free machine learning techniques for diagnostic prediction and classification of clinical outcomes in Parkinson’s disease. Sci Rep. 2018;8(1):1–21.
Tibshirani R. Regression shrinkage and selection via the lasso. J Roy Stat Soc: Ser B (Methodol). 1996;58(1):267–88.
Hsiao C. An Econometrician’s perspective on Big Data. In: Li T, Pesaran MH, Terrell D, editors. Essays in Honor of Cheng Hsiao. Emerald Publishing Limited; 2020. p. 413–23.
Bresson G. Comments on “An econometrician’s perspective on big data” by Cheng Hsiao. In: Li T, Pesaran MH, Terrell D, editors. Essays in Honor of Cheng Hsiao. Emerald Publishing Limited; 2020. p. 431–43.
Schultz BG, Joukhadar Z, Nattala U, Quiroga MDM, Bolk F, Vogel AP. Best practices for supervised machine learning when examining biomarkers in clinical populations. In: Moustafa AA, editor. Big Data in Psychiatry & Neurology. Elsevier; 2021. p. 1–34.
Fédération Interprofessionnelle de la Santé du Québec. Convention collective 2011-2015, article 19.01. 2011.
Hajjem A, Bellavance F, Larocque D. Mixed-effects random forest for clustered data. J Stat Comput Simul. 2014;84(6):1313–28.
Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer-Verlag; 2009.
Hugonnet S, Chevrolet J-C, Pittet D. The effect of workload on infection risk in critically ill patients. Crit Care Med. 2007;35(1):76–81.
Firth D. Bias reduction of maximum likelihood estimates. Biometrika. 1993;80(1):27–38.
King G, Zeng L. Logistic regression in rare events data. Polit Anal. 2001;9(2):137–63.
Bradburn MJ, Deeks JJ, Berlin JA, Russell Localio A. Much ado about nothing: a comparison of the performance of meta-analytical methods with rare events. Stat Med. 2007;26(1):53–77.
Hegelich S. Decision trees and random forests: machine learning techniques to classify rare events. European Policy Analysis. 2016;2(1):98–120.
Zhao Y, Wong ZS-Y, Tsui KL. A framework of rebalancing imbalanced healthcare data for rare events’ classification: a case of look-alike sound-alike mix-up incident detection. J Healthc Eng. 2018;2018:1–11.
Fujiwara K, Huang Y, Hori K, Nishioji K, Kobayashi M, Kamaguchi M, Kano M. Over-and under-sampling approach for extremely imbalanced and small minority data problem in health record analysis. Front Public Health. 2020;8(178):1–15.
Wang HY. Logistic regression for massive data with rare events. In: International Conference on Machine Learning. Proceedings of Machine Learning Research. 2020. p. 9829–36.
Ethics declarations
Ethical approval
This is an observational study. The Research Ethics Board of the Centre Universitaire de l’Hôpital de Québec (CHU de Québec) has approved the research that has been conducted for this study.
Conflicts of interest
The authors have no relevant financial or non-financial interests to disclose.
Acknowledgements
We are grateful to the participants of the research seminar at the Institut de Science Financière et d’Assurances (Institute of Financial Sciences and Insurance), Lyon, France, for their comments and remarks. The usual disclaimer applies.
Appendix: the GMERT-RI algorithm
The GMERT-RI algorithm of [13] is defined as follows. Recall that for the generalized linear mixed model (GLMM),
- \(y_{it} \mid u_i\) belongs to the exponential family of distributions.
- \(\mu _{it}=E \left[ y_{it} \mid u_i \right]\) and \(g \left( \mu _{it} \right) = \eta _{it} = X_{it} \beta + Z_{it} u_{i}\) for some known link function g. In our case, we use the logit link, so that \(\mu _{it} = \frac{e^{ \eta _{it} }}{1 + e^{ \eta _{it} }}\) and \(g \left( \mu _{it} \right) = \log \left( \frac{\mu _{it} }{1 - \mu _{it} } \right) = \eta _{it}\). In vector form for cluster i, \(\mu _{i}=E \left[ y_{i} \mid u_i \right]\) and \(g \left( \mu _{i} \right) = \eta _{i} = X_{i} \beta + Z_{i} u_{i}\) with \(u_{i} \sim N \left( 0, \Sigma \right)\).
- \(Cov \left[ y_{i} \mid u_i \right] =\sigma ^2 v \left( \mu _{i} \right)\), where \(\sigma ^2\) is a dispersion parameter and \(v \left( \mu _{i} \right) = diag \left[ v_{i1}, \ldots , v_{iT_i} \right] = diag \left[ v \left( \mu _{i1} \right) , \ldots , v \left( \mu _{iT_i} \right) \right]\), where \(v \left( . \right)\) is a known variance function.
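The logit link and its inverse can be illustrated with a short sketch. It assumes a single scalar covariate and a scalar random effect; the function names and the values of `x`, `beta`, `z` and `u` are hypothetical and serve only to show how \(\eta _{it}\) maps to \(\mu _{it}\) and back.

```python
import math


def logit(mu):
    """Link function g(mu) = log(mu / (1 - mu))."""
    return math.log(mu / (1.0 - mu))


def inv_logit(eta):
    """Inverse link mu = e^eta / (1 + e^eta), written as 1 / (1 + e^-eta)."""
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical scalar values for one observation t in cluster i.
x, beta = 2.0, 0.4   # fixed part: X_it * beta
z, u = 1.0, -0.5     # random part: Z_it * u_i
eta = x * beta + z * u   # linear predictor, 0.8 - 0.5 = 0.3
mu = inv_logit(eta)      # conditional mean E[y_it | u_i]
eta_back = logit(mu)     # applying the link recovers eta
```

For a binary outcome the variance function is the Bernoulli variance \(v(\mu ) = \mu (1-\mu )\).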
The generalized mixed effects regression tree (GMERT-RI) model, proposed by [13], can be written as \(\eta _{i} = f\left( X_{i} \right) + Z_{i} u_{i}\) with \(u_{i} \sim N \left( 0, \Sigma \right)\), where the linear fixed part \(X_{i} \beta\) is replaced by a function \(f\left( X_{i} \right)\) that is estimated with a standard regression tree model. A first-order Taylor-series expansion yields the linearized response variable \(\widetilde{y}_i = g \left( \mu _{i} \right) + \left( y_i - \mu _i \right) g^{\prime } \left( \mu _{i} \right)\), and the mixed effects regression tree (MERT) pseudo-model is defined as \(\widetilde{y}_i = f\left( X_{i} \right) + Z_{i} u_{i} + e_i\). The GMERT-RI algorithm is essentially the penalized quasi-likelihood (PQL) algorithm used to fit GLMMs, with the weighted linear mixed effects (LME) pseudo-model replaced by a weighted MERT pseudo-model. The GMERT-RI algorithm of [13] is the following:
GMERT-RI ALGORITHM (see Hajjem et al. [13], p. 115)
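The linearization step can be sketched for the logit link, where \(g^{\prime }(\mu ) = 1/(\mu (1-\mu ))\). This covers only the pseudo-response computation, not the full algorithm of [13]. The function names are illustrative, and the working weight \(\mu (1-\mu )\) is the usual choice in iteratively reweighted fitting for a binary outcome, stated here as an assumption rather than taken from the source.

```python
import math


def linearized_response(y, mu):
    """Pseudo-response y_tilde = g(mu) + (y - mu) * g'(mu) for the logit
    link, where g(mu) = log(mu / (1 - mu)) and g'(mu) = 1 / (mu * (1 - mu))."""
    g = math.log(mu / (1.0 - mu))
    g_prime = 1.0 / (mu * (1.0 - mu))
    return g + (y - mu) * g_prime


def working_weight(mu):
    """Assumed working weight mu * (1 - mu) for a binary outcome."""
    return mu * (1.0 - mu)


# With mu = 0.5: g(mu) = 0 and g'(mu) = 4, so an observed event (y = 1)
# gives y_tilde = 0 + 0.5 * 4 = 2, and a non-event (y = 0) gives -2.
y_tilde_event = linearized_response(1, 0.5)
y_tilde_no_event = linearized_response(0, 0.5)
```

In a PQL-type iteration, these pseudo-responses would be refitted with the weighted pseudo-model and \(\mu\) updated until convergence.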
As shown by [13], the GMERT-RI model can be used to predict the response for two categories of new observations: those that belong to a cluster included in the sample used to fit the model, and those that belong to a cluster not included in that sample. To predict the response for a new observation from the first category, one uses both its fixed component prediction \(\widehat{f}\left( X_i \right)\) and the predicted random part \(Z_i \widehat{u}_{i}\) corresponding to its cluster. This is a cluster-specific estimate. For the second category, one can only use the fixed component prediction (i.e., the random part is set to 0).
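The two prediction cases can be sketched as follows, assuming a scalar random effect per cluster and the logit link. The function name, the cluster identifiers and the numerical values are hypothetical.

```python
import math


def predict_probability(fixed_part, cluster_id, random_effects, z=1.0):
    """Predicted probability for a new observation.

    fixed_part     - tree prediction f_hat(X) on the logit scale
    cluster_id     - identifier of the observation's cluster
    random_effects - dict mapping known cluster ids to u_hat
    z              - random-effect covariate (1.0 for a random intercept)

    For a cluster not seen when fitting, the random part is set to 0.
    """
    u_hat = random_effects.get(cluster_id, 0.0)
    eta = fixed_part + z * u_hat
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical fitted random effects for two known clusters.
fitted_effects = {"neonate_1": 1.0, "neonate_2": -0.3}

# Known cluster: eta = -1.0 + 1.0 = 0, so the probability is 0.5.
p_known = predict_probability(-1.0, "neonate_1", fitted_effects)
# Unseen cluster: eta = -1.0, fixed part only.
p_new = predict_probability(-1.0, "neonate_99", fitted_effects)
```

The same tree prediction therefore yields different probabilities depending on whether a cluster-specific random effect is available.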
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Beltempo, M., Bresson, G. & Lacroix, G. Using machine learning to predict nosocomial infections and medical accidents in a NICU. Health Technol. 13, 75–87 (2023). https://doi.org/10.1007/s12553-022-00723-1
DOI: https://doi.org/10.1007/s12553-022-00723-1