
1 Introduction

Transoesophageal echocardiography (TEE) is the standard for anaesthesia management and outcome evaluation in cardiovascular interventions. It is also used extensively for monitoring critically ill patients in intensive care. The success of the procedure chiefly depends on the acquisition of appropriate ultrasound (US) views that allow a thorough haemodynamic evaluation to be conducted. To capture high-quality TEE images, practitioners must possess refined psychomotor abilities and advanced hand-eye coordination, both of which require rigorous training and practice.

To facilitate the education of new interventionalists and to standardise reporting and quality, accreditation organisations have defined practice guidelines for performing a comprehensive TEE exam [5, 6]. Nevertheless, training is hindered because performance evaluation is carried out almost exclusively through expert supervision. Typically, senior medical personnel grade TEE exams and review logbooks, a laborious process that requires a significant amount of time. As a result, trainees rarely receive immediate feedback. Performance evaluation is a key element in interventional medicine, and alternative, preferably automated, methods for evaluating TEE competency are necessary [14]. So far, objective assessment in TEE has focused exclusively on the kinematic analysis of the US probe, with various motion parameters found to be indicative of the level of operational expertise [9, 10]. Although these are important findings, probe kinematic information is not available in clinical settings and is only captured in simulation systems. Recent studies emphasise the benefits of virtual reality (VR) simulators, which offer a risk-free environment where trainees can practice repeatedly at their own convenience [2]. Evidence of performance improvement after training on VR systems, as well as of skill retention and transferability, has been reported [1, 3, 4, 11, 13]. Incorporating performance evaluation and structured feedback will allow further use of VR platforms for training and assessment.

In this work, we introduce the use of Convolutional Neural Networks (CNNs) for the automated evaluation of acquired TEE images. CNNs have found many applications in medical imaging and computer-assisted surgery [8], but this is the first time they are applied to skills assessment. We aim to generate high-level features in order to develop a system capable of assigning TEE performance scores, essentially replicating expert evaluation. We generated a dataset of 16060 simulated TEE images from participants of varied experience and used it to retrain two CNN architectures (Alexnet, VGG), converted to perform regression. Three reviewers provided ground-truth labels by blindly grading the images with two different manual scores. Tested on a set of 2596 images, the developed CNN architectures estimated the average reviewers' score with a root mean square error (RMSE) ranging from 7% to 14%. This level of accuracy, which is close to the resolution of the average scores from the three evaluators, highlights the potential of CNN algorithms for refined performance evaluation.

2 Methods

2.1 Dataset Generation

Fig. 1. (a) The HeartWorks simulator, with the US probe movements shown inset; (b) the heart model, the probe and the US scanning field; (c) the simulated TEE image.

We experimented using the HeartWorks TEE simulation platform (Inventive Medical Ltd, London, U.K.), a high-fidelity VR simulator that emulates realistic exam settings (Fig. 1). Synthetic US images are generated from an anatomically accurate cardiac model, illustrated in Fig. 1b, which is deformable to mimic a beating heart. A detector on the probe's tip extracts the position and orientation of the US scanning field, which are then used to graphically render the 2D US slice (Fig. 1c) from the 3D model. The data collection study consisted of a single TEE exam in which participants had to capture 10 US views, shown in Fig. 2, in a specific sequence. The selected views are a subset of the 20 views recommended by ASE/SCA [6] and include planes from every depth window of the TEE exam (mid-esophageal, transgastric and deep-transgastric). Experiments were performed under the supervision of a consultant anaesthetist who introduced the study and relayed the sequence of views. To capture and store data, participants used a foot-pedal to generate a full-HD image and a short video (\(\sim 1.5\,s\)) of the imaged US plane. Each video contained 44 frames.

Fig. 2. The sequence of the 10 TEE views used in the study: 1: Mid-Esophageal 4-Chamber (centred at tricuspid valve), 2: Mid-Esophageal 2-Chamber, 3: Mid-Esophageal Aortic Valve Short-Axis, 4: Transgastric Mid-Short-Axis, 5: Mid-Esophageal Right Ventricle Inflow-Outflow, 6: Mid-Esophageal Aortic Valve Long-Axis, 7: Transgastric 2-Chamber, 8: Mid-Esophageal 4-Chamber (centred at left ventricle), 9: Deep Transgastric Long-Axis, 10: Mid-Esophageal Mitral Commissural.

In total, 38 participants of varied experience performed the experiments. The population included accredited anaesthetists who had performed more than 500 exams, less experienced practitioners, and trainees in the early stages of their residency. Participants were allowed time to familiarise themselves with the setup and the simulator. Manual scoring was blindly performed by three expert anaesthetists based solely on the acquired videos/images. Each view was assessed with two distinct image quality metrics. The first metric is a criteria-based score evaluated on a predetermined checklist, in which each item was assigned a binary value (0: not met, 1: met). The checklists for two of the views are depicted in Table 1 and were derived following the latest ASE/SCA imaging guidelines for each view [6]. This technique broadly evaluates three attributes: the correct angulation of the US probe in each view, the presence/visibility of specific heart tissue, and the proper positioning of the probe in the oesophageal lumen. The number of items varied between views, and so did the maximum score; the percentage of criteria met over the total number (CP) was therefore used to provide a uniform measure among all views. The second score is a general impression (GI) assessment of the US video/image, scored on a 0–4 scale, which assesses the overall quality of the acquired image. Grades from the three evaluators were averaged to obtain a single mean score per US view for each volunteer. As expected, the two scores are highly correlated (\(\rho \sim 0.93\)). Inter-rater variability was independently evaluated for each view using the intraclass correlation coefficient (ICC) and Krippendorff's Alpha (KA). Both metrics show very good agreement between the three evaluators, with ICC \(\sim 0.9\) and KA \(\sim 0.8\) for all views.
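To make the two manual metrics concrete, the short sketch below shows how per-view CP and GI labels could be computed from three raters' marks; the checklist values and GI marks are hypothetical and the snippet is illustrative only, not part of the study code.

```python
import numpy as np

# Hypothetical checklist marks from three raters for one captured view
# (1 = criterion met, 0 = not met); the real checklists differ per view.
checklists = np.array([
    [1, 1, 0, 1, 0, 1],   # rater 1
    [1, 1, 0, 1, 1, 1],   # rater 2
    [1, 0, 0, 1, 0, 1],   # rater 3
])
gi_scores = np.array([2.0, 3.0, 2.0])  # general impression marks on the 0-4 scale

# CP: percentage of checklist items met, averaged over the three raters to
# give one mean label per view; GI labels are averaged in the same way.
cp_label = 100.0 * checklists.mean(axis=1).mean()
gi_label = gi_scores.mean()
print(f"CP label: {cp_label:.1f}%, GI label: {gi_label:.2f}")

# Across many views, the CP-GI correlation can be verified on arrays of
# per-view labels, e.g. with scipy.stats.pearsonr(all_cp, all_gi).
```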

Figure 3 illustrates two examples at the opposite ends of the quality spectrum from views 3 and 7. The average quality scores are given inset, and we annotated the elements in the images that satisfy the criteria in the checklist of each view, provided in Table 1. The images on the left are of poor quality and only meet a small number of the checklists' items. For example, the top-left ME AV SAX image has the correct probe rotation and visualises the three cusps of the aortic valve, but fails to meet the rest of the criteria. The bottom-left image of the TG 2C view only achieved correct probe angulation, but because of inadequate positioning fails to satisfy the remaining criteria. Consequently, both CP and GI scores are low, since both images on the left side are of unacceptable quality. The images on the right side are examples of ideally imaged views, fully satisfying the respective checklists and achieving full marks in both metrics.

Fig. 3. Scoring examples for Views 3 and 7, from different participants, with annotated structures of importance. Left images are scored poorly, whereas right images obtain excellent marks. Top row, View 3 - LA: left atrium, RA: right atrium, TV: tricuspid valve, RV: right ventricle, AV: aortic valve, PV: pulmonary valve; the circle indicates visibility of the AV cusps. Bottom row, View 7 - LV: left ventricle, LA: left atrium, MV: mitral valve; arrows show the leaflets.

Table 1. The checklists used for the ME AV SAX (View 3) and TG 2C (View 7) TEE views.

We recorded 365 video sequences from the 38 participants, with 15 views failing to store properly. For our investigation, we extracted all 16060 (i.e. 365\(\,\times \,\)44) frames from the stored videos and used the mean manual scores as labels. All frames from a given video were labelled with the average score of that view, on the premise that reviewers assigned their grades after watching the short videos, so the mark equally represents all frames. No probe movement takes place in the videos, only the simulated beating of the heart model; therefore, the qualitative attributes of the stored view are the same in all frames. We divided the dataset using the 80%-20% rule for training and testing, considering the total number of volunteers. Frames from 32 participants were designated for training (13464) and frames from 6 participants for testing (2596).
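The participant-level split matters because frames from the same volunteer are highly correlated; a frame-level random split would leak information between training and testing. A minimal sketch of such a split is given below; the file name, column names and fixed seed are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame index: one row per extracted frame, where each frame
# inherits the mean reviewer scores of its parent video (44 frames/video).
# Expected columns: participant_id, video_id, frame_path, cp_label, gi_label.
frames = pd.read_csv("frame_index.csv")

# Split on participants rather than frames, so no volunteer appears in both
# sets (the study used 32 participants for training and 6 for testing).
rng = np.random.default_rng(seed=0)
participants = frames["participant_id"].unique()
rng.shuffle(participants)

n_train = int(round(0.8 * len(participants)))
train_ids = set(participants[:n_train])

train_df = frames[frames["participant_id"].isin(train_ids)]
test_df = frames[~frames["participant_id"].isin(train_ids)]
```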

2.2 CNN Architectures

We opted to develop CNN models that perform a regression task and are trained to estimate the performance score as a single continuous variable; \(CP\in [0, 100]\), \(GI\in [0, 4]\). Since the checklists' criteria and their number differ among views, it was not feasible to structure and train a single model evaluating individual criteria for all views, as this would require an inefficient approach with separate sub-models per view. Hence a single CP score per view was computed and estimated. We experimented with two established CNN architectures, namely Alexnet and VGG, originally built to perform image classification tasks [7, 12]. We repurposed them by restructuring their output stage, considering 10 available classes, one for each TEE view. The final fully-connected (FC) layer of both CNNs is resized to a dimension of 10. One additional FC layer with output size \(d=1\) and linear activation is added to complete the regression operation and estimate the score. For classifying the input to one of the TEE views, softmax activation is applied after the FC layer with \(d=10\). Effectively, we structure our network so that it can be trained both to estimate the performance scores and to recognise the corresponding view of the input. Figure 4 illustrates the two customised architectures with the added layers.
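A minimal tf.keras sketch of this two-headed output stage is given below, assuming a VGG16 backbone from keras.applications (Alexnet is not bundled with Keras and would need a custom definition); the sizes of the dense layers before the final FC layers follow the standard VGG configuration and are assumptions, not values taken from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_VIEWS = 10  # one class per TEE view

# Convolutional base initialised with ImageNet (ILSVRC) weights,
# matching the paper's use of publicly available pretrained weights.
backbone = tf.keras.applications.VGG16(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))

x = layers.Flatten()(backbone.output)
x = layers.Dense(4096, activation="relu")(x)
x = layers.Dense(4096, activation="relu")(x)

# Final FC layer resized to dimension d = 10 (one unit per TEE view).
view_features = layers.Dense(NUM_VIEWS)(x)

# Classification branch: softmax over the 10 views.
view_probs = layers.Softmax(name="view")(view_features)

# Regression branch: one additional FC layer with output size d = 1 and
# linear activation, estimating the performance score (CP or GI).
score = layers.Dense(1, activation="linear", name="score")(view_features)

model = Model(inputs=backbone.input, outputs=[score, view_probs])
```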

Fig. 4. The two networks, (a) Alexnet and (b) VGG, developed for the TEE score estimation task. The customised output stage, with the added FC layers and the softmax activation for classification, is enclosed in the boxes.

3 Experimentation and Results

CNN models were implemented with the TensorFlow framework. The training dataset was randomised and images were resized from 1200\(\,\times \,\)1000 to 227\(\,\times \,\)227 for Alexnet and 224\(\,\times \,\)224 for VGG. Batches of 128 (Alexnet) and 64 (VGG) were used. The mean square error was set as the loss function, and gradient descent optimisation with adaptive moment estimation (Adam) was performed with a learning rate of 0.001. Both networks were initialised with publicly available weights from the ILSVRC challenge [7, 12], apart from the additional dense layers we introduced, which were assigned random weights and trained from scratch. Backpropagation was used to update the weights. The two architectures were independently trained for each performance metric, and convergence was achieved after 2K iterations for Alexnet and after 12K for VGG. The models were also trained to classify images to their respective view, achieving over 98% accuracy.

Table 2 lists the overall RMSE results and the RMSE on score intervals from estimating the two image quality scores on the 2596 testing images. Both models perform adequately but, owing to its denser structure, VGG outperforms Alexnet significantly and has smaller error variability, providing excellent accuracy for both metrics. To obtain a single score per video, similarly to the three evaluators, we grouped the predictions of the frames from the same video and averaged them. The per-video results for 59 videos from the 6 testing participants (one video was not captured) are shown in Fig. 5. The RMSE of the grouped results is lower for both networks, which also give consistent estimations on frames from the same video, indicated by low standard deviation values (\(\sigma _{CP}\simeq 3.5\), \(\sigma _{GI}\simeq 0.2\)).
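Continuing the model sketch of Sect. 2.2, a training configuration along the lines described above (MSE loss for the score head, Adam at a learning rate of 0.001, one model per performance metric) could look as follows; the dataset pipeline and loss weighting are placeholders rather than the paper's exact setup.

```python
import tensorflow as tf

# Mean square error for the regression head, cross-entropy for the 10-way
# view classification head, optimised with Adam (adaptive moment estimation).
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss={"score": "mse", "view": "sparse_categorical_crossentropy"},
    metrics={"score": [tf.keras.metrics.RootMeanSquaredError()],
             "view": ["accuracy"]},
)

# A hypothetical tf.data pipeline would yield (image, {"score": y, "view": v})
# batches, with images resized from 1200x1000 to 224x224 (VGG) or 227x227
# (Alexnet), e.g. via tf.image.resize, and batch sizes of 64 or 128.
# model.fit(train_ds, epochs=...)
```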

Table 2. Overall and interval RMSE results of the developed networks.
Fig. 5. Grouped estimation results per testing video. RMSE and average \(\sigma \) values: (top) Alexnet − CP: 15.78 (\(\sigma \) = 3.34), GI: 0.8 (\(\sigma \) = 0.22); (bottom) VGG − CP: 5.2 (\(\sigma \) = 3.55), GI: 0.33 (\(\sigma \) = 0.23).
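The grouping behind Fig. 5 simply averages the frame-level predictions of each test video; a sketch is shown below, with a hypothetical per-frame prediction file and column names chosen for illustration (the GI score would be handled identically).

```python
import pandas as pd

# Hypothetical per-frame predictions on the 2596 test frames, one row per
# frame, with columns: video_id, cp_pred, cp_label.
preds = pd.read_csv("frame_predictions.csv")

# One score per video: average the 44 frame-level predictions; the standard
# deviation shows how consistent the estimates are within each video.
per_video = preds.groupby("video_id").agg(
    cp_mean=("cp_pred", "mean"),
    cp_std=("cp_pred", "std"),
    cp_label=("cp_label", "first"),
)

rmse_cp = ((per_video["cp_mean"] - per_video["cp_label"]) ** 2).mean() ** 0.5
print(f"Per-video CP RMSE: {rmse_cp:.2f}, "
      f"mean sigma: {per_video['cp_std'].mean():.2f}")
```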

4 Conclusions

In this article we demonstrated the applicability of CNN architectures for the automated quality evaluation of TEE images. We collected a rich dataset of 16060 simulated images, graded with two manual scores (CP, GI) assigned by three evaluators. We experimented with two established CNN models, restructured to perform regression, and trained them to estimate the manual scores. Validated on 2596 images, the developed models estimate the manual scores with high accuracy. Alexnet achieved an overall RMSE of 16.23% and 0.83, while the denser VGG performed better, achieving 7.28% and 0.42 for CP and GI respectively. These very promising outcomes indicate the potential of CNN methods for automated skill assessment in image-guided surgical and diagnostic procedures. Future work will focus on augmenting the CNN models and investigating how well they translate to evaluating the quality of real TEE images.