Intra-observer and inter-observer variability of clinical annotations of monitoring data
Keywords: Time Series, Relevant Event, Clinical Annotation, Level Shift, Monitoring Time
To evaluate new methods for alarm generation from monitoring data, a gold standard of alarm evaluation is needed. Nearly all clinical studies into monitoring alarms have used clinician judgement and annotation as the reference standard. We investigated the inter-observer variability between two intensivists, and the intra-observer variability of one of them, in the classification of monitoring time series.
A total of 3,092 time series segments (heart rate and blood pressures) of 30 minutes each from six critically ill patients were presented offline to two experienced intensivists (MD1 and MD2). Each physician separately classified every segment by visual inspection into clinically relevant patterns (no change, level shift, trend). One intensivist (MD2) repeated the classification on the same dataset 4 weeks after the first analysis.
MD1 found clinically relevant events in 36% and MD2 in 29% of all time series. In 16% of all cases the two intensivists came to different classifications, and in 10% even the direction of change was classified differently. MD2 classified 10% of all cases differently between the first and second analysis. Even when level changes and trends were treated as one universal pattern of change, intra-individual variability (MD2 first analysis vs MD2 second analysis) was still 5% and inter-individual variability (MD1 vs MD2, only unequivocal classifications) was 10%.
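The percentages above are simple disagreement rates between two label sequences. A minimal sketch of that arithmetic follows, under stated assumptions: the label names (`none`, `shift_up`, `trend_down`, and so on), the helper functions, and the five-segment toy data are all hypothetical and are not taken from the study. The source also does not say whether direction was retained when level changes and trends were merged; this sketch keeps it.

```python
from typing import Sequence


def disagreement_rate(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of segments on which two annotation runs differ."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("annotation lists must be non-empty and equally long")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def collapse(label: str) -> str:
    """Merge level shift and trend into one 'change' pattern.

    Assumption: direction (up/down) is kept after merging.
    """
    if label == "none":
        return label
    direction = label.rsplit("_", 1)[-1]  # "up" or "down"
    return f"change_{direction}"


# Hypothetical toy annotations for five segments (not study data).
md1 = ["none", "shift_up", "trend_up", "none", "trend_down"]
md2 = ["none", "trend_up", "trend_up", "shift_down", "trend_down"]

raw = disagreement_rate(md1, md2)
merged = disagreement_rate([collapse(x) for x in md1],
                           [collapse(x) for x in md2])
print(f"raw disagreement: {raw:.0%}, after merging patterns: {merged:.0%}")
```

The same `disagreement_rate` call applies to intra-observer variability by passing the first and second annotation runs of one observer instead of the runs of two observers.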
Although this study is small and investigated only two observers, it shows that there is considerable intra-individual and inter-individual variability in the classification of monitoring events by experienced clinicians. These findings are supported by studies into image analysis that also found high intra-individual and inter-individual variability. High inter-observer and intra-observer variability is a challenge for clinical studies into new alarm algorithms. Our findings also show a need for reliable classification methods.