Introduction

Patient safety and the mitigation of medical errors are of growing importance [1]. Poor decision-making and lack of skill in clinical procedures can be significant factors in many of the errors that are reported [2]. Research into clinical skills suggests a critical role for ‘continual practice’ and for maximising training time to reach an ‘appropriate’ level of performance [3]. Simulation-based training has demonstrated that skills can be acquired and measured without the need to ‘learn on real patients’ [4, 5]. Many healthcare tasks and procedures can be simulated using computer technology for training purposes, providing novices with a way to improve or maintain their skills [6,7,8]. In addition to technical skill acquisition, we know that errors made in the clinical environment are also related to non-technical skills [9]; hence there is a need to understand the relationship between skill and cognitive load during procedures. For example, a high cognitive load may affect the non-technical leadership skills of the operator in the clinical environment.

Eye tracking in medical research

One interest in measuring performance is the link between visual attention (eye gaze) and clinical performance: whether an operator’s eye gaze behaviour is correlated with their level of competence during a clinical procedure [10,11,12,13,14]. The ‘mind-eye hypothesis’ [15] states that visual attention can indicate cognitive activity [16,17,18]. Put differently, where someone looks can be indicative of their cognitive experience and thus their level of expertise, situational awareness, uncertainty and perhaps the likelihood that their future actions could cause harm to a patient. A recent study of surgical tasks [11] showed that eye tracking metrics could discriminate between novices and experts.

Attentional capacity

Clinical decision-making comprises many steps, including perception, attention, information processing, information storage (including organisation) and then knowledge retrieval from long-term memory at the appropriate time [19]. One aspect of cognition that has received no consideration in the related literature is ‘attention’, yet this is of paramount importance to the interventional cardiologist who is learning a new set of skills. Attention refers to the ability to cognitively focus on an object or activity. It is well known that humans have a limited attentional capacity [20]: the human mind can only attend to a finite amount of information at any given time. When a novice clinical operator is acquiring new skills, they use almost all of their attentional resources to monitor what their hands are doing, in addition to making spatial judgments and clinical decisions. This leaves the novice with limited ‘spare’ attentional capacity [21], which motivates the visual stimulus task used in this study.

This study aims (1) to use wearable technology to determine metrics that could be used to auto-assess operator and procedural performance and (2) to determine whether a visual stimulus task can be used to measure attentional capacity and whether performance on this task is associated with operator errors. Both objectives were carried out using a state-of-the-art, high-fidelity, full-physics VR simulator, which provided the means for recording the procedural performance of interventional cardiologists. This work could lead to ‘smart operating rooms’ that provide live metrics on individual and team performances, offering critical automated analytical feedback.

Ethical approval for this study was granted across the island of Ireland: (1) Ulster University (ref: FCEEFC 20160630), (2) University College Cork (ref: ECM 4 (g) 09/08/16).

Methods

This study investigated the use of (1) eye tracking, (2) psychophysiological monitoring and (3) attentional capacity measurement in surgical simulation-based assessment (specifically in coronary angiography). We recorded data from two different groups of interventional cardiologists to test the significance of metrics in discriminating between novices and experts. Data collection took place in the ASSERT Centre, University College Cork.

Study components

The study set-up comprised a surgical simulator with simulated patient cases, eye tracking glasses and an E4 wristband for monitoring the operator’s psychophysiology. For the visual stimulus task, an additional LCD display screen was used to display the playing cards.

Simulated coronary angiography

A Mentice VIST-Lab and VIST G5 software (developed by Mentice, Sweden) provided the simulated coronary angiography cases (model details: VIST G5 + VIST-C LD, Coronary PRO v2.3.3, Coronary Angiography v1.3.3 and Coronary Educator v1.1.2). Two cases were assessed by a teaching- and consultant-level interventional cardiologist. One case allowed the participant to practise with the system, and the second case was the primary data collection session. Each participant was allocated ‘up to 30 min’ to practise using the first case, allowing them to gain familiarity with the simulator. The investigator provided a demonstration of how to use the simulator. Participants were tasked with recording nine views by controlling the C-arm:

  • Right Coronary Artery (RCA)

    • Left Anterior Oblique (LAO) 30°, Cranial 15°

    • Right Anterior Oblique (RAO) 30°

    • Anteroposterior (AP)

  • Left Anterior Descending (LAD)

    • AP

    • RAO 30°, Caudal 30°

    • RAO 10°, Cranial 40°

    • LAO 50°, Cranial 30°

    • LAO 40°, Caudal 30°

    • Lateral

Wearable technologies

SMI eye tracking glasses were used to measure visual attention during procedures. The glasses allow the participant to move freely while performing the procedure, while capturing temporal and spatial gaze metrics. Empatica’s E4 wristband provided real-time measurements of the participant’s heart rate, inter-beat intervals (for heart rate variability), electrodermal activity (EDA, 4 Hz), skin temperature (4 Hz) and triaxial acceleration (32 Hz).

Visual stimulus card task to measure attentional capacity

To measure attentional capacity by proxy, each participant was given an additional visual stimulus to monitor (playing cards) and tasked to respond verbally with the word ‘heart’ when a given playing card (the queen of hearts) appeared on the LCD screen. It was made clear that the priority should be performing the procedure, but to undertake this additional task if they could. Two variations of the stimulus were provided, one for each of the two attempts. The first acted as a baseline measurement requiring less additional attention, while the second demanded greater attentional capacity: the number of cards presented per minute was increased between the first and second performances.

This aspect of the study is based on the work of Weaver [22] and Smith [23]. In Smith’s experiment, a playing card provides 5.7 bits/item of information. Using this measurement, the difference in information output between the stimulus tasks presented during the first and second procedures can be quantified. However, the exposure duration of the playing card is also important, and a 2 s exposure duration was determined to be appropriate for this study.

Participants were asked to examine the cards and verbally acknowledge a specific target card. Both variations (see ‘Appendix A’ for further detail) share the same design: continual 20 s blocks, each containing one card to be acknowledged. Within each 20 s block, ten different cards would appear for 2 s each. Using a random number generator, the position of the target card within each block was continually changed according to an integer 1–10 (referencing its position in the block). This approach semi-randomised the appearance of the target card while guaranteeing that the participant would have three cards to acknowledge every 60 s. The first variant presented only the three target cards (5.7 bits/item) exposed for 2 s each, giving an information output of 17.1 bits per 60 s. In contrast, the second variant involved a continuous sequence of cards (30 cards per 60 s) and had an information output of 171 bits every 60 s.
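
The information output figures follow directly from Smith’s 5.7 bits/item, and the semi-randomised schedule can be sketched in a few lines of R (hypothetical function and variable names; the original generator is not published):

```r
# Information content of one playing card: log2(52) = 5.7 bits/item
bits_per_card <- log2(52)

3  * bits_per_card   # first variant:  3 cards/60 s  -> ~17.1 bits
30 * bits_per_card   # second variant: 30 cards/60 s -> ~171 bits

# Semi-randomised target schedule: one target per 20 s block,
# its slot (1-10) drawn uniformly at random within the block
make_schedule <- function(n_blocks) {
  sapply(seq_len(n_blocks), function(b) {
    slot <- sample(1:10, 1)          # target's position in the block
    (b - 1) * 20 + (slot - 1) * 2    # target onset time in seconds
  })
}

set.seed(1)
make_schedule(3)   # target onsets across one minute of the task
```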

Protocol

The protocol comprised four stages: (1) demonstration of the VIST-Lab simulator, (2) setting up the wearable technology, (3) participant attempts the first task and (4) participant attempts the second task. Details are as follows:

Explanation and demonstration of the VIST-Lab simulator

  • Participants were informed that a 0.035″ guide wire and 5F catheter with a contrast syringe were already connected for use.

  • C-Arm controls to facilitate different views were demonstrated. They were asked to record nine views.

  • They were shown how to start the case and select instruments.

  • They were provided with a practice case and given up to 30 min, allowing for familiarity with the simulator.

Assistance with wearable technology

  • Before the main procedure, it was necessary to calibrate the eye tracking glasses and begin recording data for both wearable devices.

    • Wristband

      • Once comfortably fitted, the wristband was switched on, and the recording session was initialised via Bluetooth using an iOS application.

    • Eye tracking glasses

      • Once comfortable, the glasses were connected via USB to the portable recording device.

      • Three-point calibration was completed.

Data analysis

Procedural performance

The following performance metrics were exported from the VIST simulator after each session:

  • Performance duration (minutes)

  • Total errors

    • Type 1: vessel wall scraping

    • Type 2: moving without wire

    • Type 3: too deep in ostium

    • Type 4: wire in small branch

  • Wire and catheter use (including counts of each time an instrument was selected and subsequently detected entering the simulator)

Stimulus card task

Using laboratory cameras and eye tracking footage, the cards correctly acknowledged by each participant in each performance were counted against all stimulus cards that appeared. The percentage of correctly acknowledged cards was used as the assessment metric.
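
As a minimal sketch (hypothetical argument names), the metric reduces to a simple proportion:

```r
# Card acknowledgement rate (%): correct verbal acknowledgements
# out of all target cards that appeared during the attempt
card_ack_rate <- function(n_correct, n_targets) 100 * n_correct / n_targets

card_ack_rate(n_correct = 8, n_targets = 12)   # e.g. 66.7%
```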

Eye tracking metrics

Four areas of interest (AOIs) were defined: the instruments selection screen, the stimulus screen displaying the cards, the X-ray and the vital signs (see Fig. 1). Eye gaze metrics are derived from fixations and saccades. A fixation occurs when the participant holds their foveal vision on a single location, and a saccade is the rapid movement between two fixations [24]. The following eye tracking metrics, which have been used in similar studies [25,26,27,28], were calculated:

Fig. 1
figure 1

Main image: Mentice VIST-Lab simulator, with the four AOIs identified. Bottom right: a participant during procedural performance, wearing eye tracking glasses connected to the portable recording device placed to the left on the simulator table and wearing the Empatica E4 wristband (hidden)

  • AOI specific metrics: dwell %, fixation %, first fixation duration (ms).

  • General eye tracking (non-AOI): fixation frequency, saccade frequency and saccade latency (ms).

  • AOI Fixation Transition Counts.

Fixation transitions count the direct switches of fixation from one AOI to another. The counts for transitions between all AOI pairs were summed into a new metric called total transitions, and dividing total transitions by procedure duration gives the fixation transition frequency (transitions between AOIs per second). A sketch of these calculations is given below.
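
A minimal R sketch of the AOI metrics, assuming each fixation has already been labelled with the AOI it falls in (hypothetical column names; the SMI export format may differ, and dwell % is computed here as a share of total fixation time):

```r
# fixations: data frame with one row per fixation, columns:
#   aoi    - AOI label ('instruments', 'xray', 'vitals', 'stimulus' or NA)
#   dur_ms - fixation duration in milliseconds
aoi_metrics <- function(fixations, procedure_s) {
  # Dwell %: share of fixation time spent in each AOI
  dwell_pct <- 100 * tapply(fixations$dur_ms, fixations$aoi, sum) /
               sum(fixations$dur_ms)
  # Fixation %: share of fixation counts in each AOI
  fix_pct <- 100 * table(fixations$aoi) / nrow(fixations)
  # Transitions: consecutive fixations landing in different AOIs
  aoi <- fixations$aoi[!is.na(fixations$aoi)]
  transitions <- sum(aoi[-1] != aoi[-length(aoi)])
  list(dwell_pct = dwell_pct,
       fixation_pct = fix_pct,
       total_transitions = transitions,
       transition_freq = transitions / procedure_s)  # per second
}
```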

Wristband measurements

Measurements recorded from the E4 wristband include heart rate (bpm), inter-beat interval (the SD of inter-beat intervals taken as heart rate variability), EDA (micro-Siemens, µS), skin temperature (°C) and triaxial accelerometry (X-, Y-, Z-axis values at 32 Hz). From the latter, we computed the acceleration magnitude (ACC) as the Euclidean norm of the three axes.
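
The magnitude is a one-line computation over the three axis values:

```r
# Acceleration magnitude from triaxial samples (32 Hz)
acc_magnitude <- function(x, y, z) sqrt(x^2 + y^2 + z^2)
```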

Statistical methods

The R programming language was used for the data analytics. Summary statistics for groups are presented as mean and standard deviation (mean ± SD). The significance test was chosen depending on (1) data distribution: Mann–Whitney U test if non-normally distributed, and (2) unequal/equal variance: Welch t test if unequal, Student’s t test if equal. All significance tests reported as p values were either Mann–Whitney U or Welch tests, as no equal variances were found. Either the Pearson product-moment coefficient (r) or the Spearman rank-order correlation coefficient (ρ) was used for correlation analysis, depending on the normality of the variables. The Shapiro–Wilk test was used for normality testing (null hypothesis: data are normally distributed). Bonferroni-corrected alpha values are also presented for transparency.
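
The test selection logic can be sketched in R as follows (assuming an F test via var.test for the equal-variance check, which is not specified above):

```r
compare_groups <- function(x, y, alpha = 0.05) {
  # Shapiro-Wilk: null hypothesis is that data are normally distributed
  normal <- shapiro.test(x)$p.value >= alpha &&
            shapiro.test(y)$p.value >= alpha
  if (!normal) return(wilcox.test(x, y))       # Mann-Whitney U test
  if (var.test(x, y)$p.value < alpha)          # variances unequal?
    return(t.test(x, y, var.equal = FALSE))    # Welch t test
  t.test(x, y, var.equal = TRUE)               # Student's t test
}

# Bonferroni-corrected alpha for m tests: alpha / m
0.05 / 15   # = 0.003, as reported for 15 tests
0.05 / 4    # = 0.013, as reported for four tests
```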

Results

Table 1 describes the demographics of the novices and experts in this study.

Table 1 Participant demographic information

Novices had a mean experience of 2.8 ± 1.8 years versus 19.9 ± 5.9 for experts (p < 0.001). Novices had participated in a mean of 113 ± 91 coronary angiogram cases in the past 12 months versus 464 ± 225 for experts (p < 0.01). Experts had more experience in simulation-based training (86% vs 43%). Almost all participants had never used the VIST-Lab simulator (0% of novices vs 14% of experts had prior use; 7% overall). Participants were also asked to declare whether they were left- or right-handed (1L and 6R vs 2L and 5R). The only females in the study (n = 3) were novices. Experts were more likely to signal ‘early’ (before the 30 min was complete) that they were ready to begin the next case: experts had a mean practice time of 19 min compared with 28 min for novices (p = 0.04).

Procedural performance

Table 2 presents the key metrics for procedural performance for both attempts. Table 3 presents changes in errors and the stimulus task card acknowledgement %, either improvement or deterioration, between the first and final attempt. It is notable that experts increased their total errors relative to novices, along with showing a poorer card acknowledgement rate.

Table 2 Group comparison: procedure performance metrics
Table 3 Group comparison: key metric changes between attempts

Stimulus task and total errors

Figure 2 shows the correlation between the less demanding stimulus card task (first procedure attempt) and total errors. There is a moderate but non-significant positive correlation between card acknowledgement rate and total errors (ρ = 0.42, p = 0.13). Similar correlation values exist for novices and experts (novices: r = 0.38 [p = 0.39] vs experts: ρ = 0.38 [p = 0.40]). Figure 3 shows the correlation coefficients between card acknowledgement rates and total errors for the final procedure attempt (involving the more demanding card stimulus task). When including all participants, there is a non-significant moderate negative correlation (r = − 0.46, p = 0.10); however, an obvious outlier exists. This outlier’s residual lies 5.98 SDs (standard units/deviations) from the mean residual of the regression line, justifying its removal. With this outlier removed, there is a statistically significant strong negative correlation (r = − 0.84, p = 0.0003). There is a statistically significant strong negative correlation between errors and card acknowledgement rates within the expert group (r = − 0.95, p = 0.0009). However, no corresponding correlation exists when analysing the novice group alone (ρ = − 0.18, p = 0.70).
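
A minimal R sketch of the outlier check and the subsequent correlation, assuming an ordinary least squares fit and standardised residuals (hypothetical data frame and column names):

```r
# df: total_errors and card_ack_pct for each participant (final attempt)
fit <- lm(total_errors ~ card_ack_pct, data = df)

# Standardised residuals: how many SDs each point sits from the line
z <- rstandard(fit)
outliers <- which(abs(z) > 3)   # the removed point scored 5.98

df_clean <- df[-outliers, ]
cor.test(df_clean$card_ack_pct, df_clean$total_errors)  # Pearson r
```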

Fig. 2
figure 2

Card acknowledgement % effect on total errors for first attempt. (1) All participants (full dataset), (2) novice only, (3) expert only

Fig. 3
figure 3

Card acknowledgement % relationship with total errors for the final attempt. (1) All participants (full dataset) included, (2) a clear outlier (a novice) is removed from dataset, (3) novice only, (4) novice only with outlier removed, (5) expert only

Eye tracking metrics

Table 4 presents the group comparison of AOI specific eye tracking metrics: instruments, vital signs, X-ray and stimulus.

Table 4 Group comparison: eye tracking metrics on AOIs on the display screens (Bonferroni-corrected alpha values for 15 tests = 0.003 and for four tests = 0.013)

Experts had a significantly larger dwell % (11.1 ± 4.3 vs 4.7 ± 1.6, p = 0.006) and fixation % (8.5 ± 3.5 vs 3.5 ± 1.4, p = 0.007) on the instruments screen. In addition, experts had a significantly higher dwell % summed over the AOIs (63 ± 10% vs 42 ± 20%, p = 0.03) and summed fixation % (50.2 ± 9.6 vs 33.5 ± 17, p = 0.04).

Table 5 presents the general eye gaze metrics, none of which differed significantly between the groups.

Table 5 Group comparison: general eye tracking metrics

Table 6 shows the fixation transitions between all AOIs. None of the differences in transition counts between the groups are statistically significant. Figure 4 shows the group difference in transition frequency: transitions made between any of the AOIs per second.

Table 6 Group comparison: fixation transitions between AOIs (display screens)
Fig. 4
figure 4

Group comparison for transition frequency over all AOIs

Wristband measurements

Table 7 presents the statistical analysis of the EDA, HRV (i.e. inter-beat interval), skin temperature and accelerometry (ACC) signals recorded from the wearable E4 wristband. The table provides summary statistics (i.e. mean, min, max and SD) for each signal. No strong group differences were found for the E4 wristband signals. As shown in Fig. 5, the only notable significant difference is that experts had a larger SD of EDA (2.52 ± 2.38 vs 0.89 ± 0.74 µS, p = 0.04). However, when applying Bonferroni-corrected alpha values, this finding is not significant (Bonferroni-corrected alpha values for 16 tests = 0.003 and for four tests = 0.013).

Table 7 Group comparison: psychophysiological measurements from E4 wristband recorded during performances (Bonferroni-corrected alpha values for 16 tests = 0.003 and for four tests = 0.013)
Fig. 5
figure 5

Group comparison of calculated SD for recorded EDA during both attempts

Discussion

This is the first study to use eye tracking and psychophysiological monitoring in this setting. This is also the first study to use a visual stimulus task as a proxy for measuring attentional capacity during surgical procedures. The study produced several metrics that could be used in a model to automatically discriminate between novices and experts, perhaps leading to proficiency assessment in the real setting. Experts had greater dwell time on the X-ray, which perhaps indicates their superiority in spatial awareness and coordination; however, this was not statistically significant (p = 0.13). Experts also made more transitions between AOIs, which could indicate more frequent intentional cross-referencing (although this did not quite reach statistical significance, p = 0.06). The wristband produced only a small number of metrics of interest. Regarding accelerometer-recorded movement, the hands/fingers would be of higher value in future analysis and would therefore necessitate a different type of wearable monitoring tool. Most interestingly, we discovered that the card acknowledgement rate during the stimulus task is predictive of the number of handling errors in a procedure (for experts only). It is also interesting to observe the lack of visual attention dedicated to the patient vital signs in both the novice and expert groups (1.6% and 1.7%, respectively).

There is potentially significant value in quantifying behaviour during high-stakes operations in various environments, from the operating room to the cockpit of a commercial aircraft. Despite the difficult and time-consuming methods required to capture these data, their value when used with machine learning techniques could result in smarter, more responsive environments with intelligent feedback provided to the operators.

Experts completed their first attempt faster than novices; however, in the final attempt, there was little difference. This could be indicative of the confounding effect that the added stimulus task had on procedural performance: whatever effect it had on the novices, it may have been much more pronounced for the experts. Experts made fewer total errors in their first attempt in comparison with novices, while in the second attempt this flipped, with experts committing more errors than novices. This is a surprising result; however, it is not statistically significant (p = 0.20). One interesting difference is that in the first performance, the experts had 0 ± 0 vessel wall scraping errors reported from the simulator, while the novices had 1 ± 3. In the final attempt, which included a much more demanding stimulus task, this inverted despite both groups performing the same case for a second time (in theory, one would expect a better performance), with experts recording 2 ± 2 compared with 1 ± 2 for novices.

It can be speculated that experts were affected more by the second variation of the stimulus task than the novices. Alternatively, it may be that the sample size is too small, or that the experts had lost concentration or demonstrated a waning interest in the challenge by the second attempt.

The stimulus card task produced mixed results across the two performances. There were no significant differences in how the groups performed on the additional task while carrying out the procedure. In the second performance, novices improved their correct card acknowledgement rate, while the experts’ rate deteriorated slightly with the change to a more demanding stimulus task. It could be speculated that the distraction of the cards had a greater impact on experts, perhaps because experts can quickly become ‘in the flow’, being more influenced by automatic muscle memory and ‘autopilot’ abilities. Likewise, perhaps the novices are less ‘set’ in the process and, expecting a challenge, were able to adapt better. Hence, while experts should have more attentional capacity for undertaking an additional task, they are influenced by routine automatic muscle memory, which makes it difficult to use an additional task as a proxy for measuring attentional capacity.

The largest effect size found among the key correlations is that, for the final performance, the experts’ card acknowledgement % is strongly negatively correlated with total errors. This relationship for the final attempt is also seen (though not as strongly) across all participants once one outlier is removed. With the less demanding and less frequent stimulus, card acknowledgement % appears weakly positively correlated with total errors; this is consistent in both groups, with almost no difference in correlation coefficient or p value.

This study has suggested that eye tracking could have a role to play in the automated assessment of interventional cardiology trainees on this type of high-fidelity surgical simulator. The eye tracking metrics quantified that experts direct significantly more visual attention (both dwell % and its encompassing fixation %) at the display screens compared with novices. This might be intuitive to those familiar with surgery, who may see it as an expected consequence of superior spatial awareness, analogous to an experienced driver (the expert performs many actions automatically, without delay and without the need to visually attend to the objects their hands interact with). On average, experts spend much more of their visual attention looking at the instruments display screen (selecting and changing instruments). We also found that, on average, experts have a higher frequency of fixation transitions between the display screens compared with novices.

Finally, the attempt to analyse psychophysiological measurements acquired using the E4 wristband provided little insight. One outcome is that experts recorded a significantly higher SD of EDA over time in comparison with novices. What a greater SD in quantified arousal from skin conductance means in a clinical performance setting is up for debate.

Limitations

Despite the high fidelity of the laboratory and virtual reality simulator, these data were not recorded in a real clinical environment with real patients. Moreover, we did not fully simulate environmental features such as noise and ongoing staff interactions. We acknowledge that it may never be possible to simulate a procedure that is on a par with the real event, since the psychological fidelity is very difficult to emulate. This is a limitation of this study, since we are assuming that metrics acquired in simulation settings are transferable to real-life settings.

The low sample numbers, although understandable (given the feasibility of gathering data from numerous extremely busy operators during a three-month period), hinder what can be inferred from the results. While the sample size is small, each correlation coefficient is accompanied by a statistical test and p value that considers the sample size (degrees of freedom) in its calculation.

A further limitation is that one of the procedures included a ‘distraction’ in the form of a secondary unrelated task, i.e. card acknowledgements, and we acknowledge the lack of a proper control group to compare with the procedure that included this additional distracting task. We must also acknowledge that there was no baseline psychophysiological measurement of the participant before the session. For example, no account is taken of a participant who was already stressed, or of participants who may have been eager to leave within a certain time, having a rushing effect on their final procedure attempt.

Another limitation is that we removed an outlier from a correlation computation because its residual was 5.98 SDs from the mean residual of the regression line; however, outliers in small samples can often be meaningful, and removing them can dramatically change results. We acknowledge the limitation of multiple hypothesis tests, which increase the likelihood of type 1 errors (false positives) and the false discovery rate; however, we have included Bonferroni-corrected alpha values where appropriate. Finally, we acknowledge that prior exposure to the simulation technology can be a confounder in studies that measure performance on a simulator where some subjects have prior experience of the technology and others do not, which raises the question of whether some operator performances are partly influenced by proficiency with the simulator technology. However, only 7% of subjects had prior experience with the simulator.

Future work

Some metrics almost statistically discriminate between the two groups but perhaps lack significance due to the low sample numbers. As a result, we have provided guidance in ‘Appendix B’ for future recruitment, using power calculations based on the effect sizes in this study (see the sketch below). Future studies attempting a similar experimental set-up should consider the length of time provided to participants for practising and familiarising themselves with the surgical simulator; this would reduce the confounder of computer literacy. For further testing of the stimulus card task, other metrics, such as the mean saccade latency (ms) specific to the stimulus card, measured from the moment it appears on screen to the moment it is acknowledged, may follow in future work; this would be a more precise measurement of attentional capacity than the rudimentary count of correct card acknowledgements. Furthermore, while we only used the procedure errors detected by the Mentice VIST simulator, other procedural errors could be classified in future studies, such as those described by Mazomenos et al. [29]. Other future work could determine the extent to which brief prior exposure to, or proficiency with, the simulator technology affects the operator’s procedural performance on the simulator. Put differently, can knowledge of the simulator be a confounder in studies such as the one described in this paper?
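
A minimal R sketch of such a power calculation, assuming a two-sample t test and illustrative effect values (Appendix B contains the study-specific figures):

```r
# Per-group sample size for 80% power at alpha = 0.05, given an
# illustrative between-group difference (delta) and common SD
power.t.test(delta = 5, sd = 4, sig.level = 0.05, power = 0.80)
```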

Looking beyond the simulation laboratory setting, capturing psychophysiological metrics and measurements in a real clinical environment, while still running a simulation, would add to the validity of the data captured. In the case of this procedure, this would mean using a simulated operating room with full immersion: leads, scrubs and a theatre team to support the participant. This could drive larger differences between genuine novices and experts. Beyond that, this work is linked to the greater goal of creating what could be called ‘smart theatres’.

Conclusions

This work contributes to the future of sensor-based smart theatres and the ‘quantified physician’ for assessing trainees and operators, and perhaps for providing ongoing automated analytical feedback to individuals and teams to drive performance. The study captured a unique dataset of psychophysiological metrics along with a novel measurement of attentional capacity, recorded during an important, highly skilled clinical procedure. Only a few significant differences between groups were found when using these metrics, most notably the dwell % and fixation % spent on the display screens. However, the point of this exploratory study is to highlight a number of novel variables that warrant further investigation for assessing proficiency, namely: dwell time on screens, fixation transition frequency between screens, SD of the EDA signal and card acknowledgement rates (when using an additional task to measure attentional capacity).

We acknowledge that this paper mainly focuses on ‘construct validity’, since we wanted to determine whether the metrics can distinguish between novices and experts before providing a more granular analysis, which would require a greater number of subjects. Overall, this study provides incentive for further work in the area, with larger sample sizes, a larger range of procedures and higher fidelity environments.