
1 Introduction

The aim of a computer-assisted surgery (CAS) system is to provide the surgeon with the right type of assistance at the right time. To achieve this, context awareness is crucial. This means that the system must be able to understand the processes currently taking place in the operating room (OR) and adapt its behavior accordingly. Surgical workflow analysis covers the challenging topic of perceiving, understanding, and describing surgical processes [11].

A common approach is to analyze surgical processes by interpreting a time series of signals that are recorded by sensors – in some cases also by humans – in the OR. As laparoscopic surgeries are performed via camera, methods that require only video as input sensor data are of special interest, since the video can be collected effortlessly during surgery.

State-of-the-art video-based approaches for workflow analysis rely on deep neural networks [1, 2, 9, 15, 18]. However, deep learning-based methods require large amounts of labeled data for training. Especially in surgery, obtaining a sufficient amount of annotated video data is difficult and costly.

To alleviate the problem of limited training data, it is common to pretrain neural networks and fine-tune them afterwards. Often, networks are pretrained using labeled data coming from another domain, such as ImageNet [4]. Another way is to use unlabeled data from the same domain and train on a proxy task using labels inherent in the data, which is called self-supervised learning.

For self-supervised learning from video, a number of ideas have been proposed [5, 8, 12, 13, 16]. Most exploit the temporal coherence of video, which implies that (i) consecutive frames are in temporal order, (ii) frames change slowly over time, and (iii) frames change steadily, i.e., abrupt motions are unlikely.

The studies [12, 13] propose proxy tasks based on the temporal order between frames. In line with this, [2] use the task of ordering pairs of laparoscopic images to pretrain a network for surgical phase segmentation. Surgical phase segmentation [14] is the problem of recognizing the surgical phase being performed by the surgeon at each point during surgery. Another proxy task for this problem is to predict the progress and remaining duration of a surgery [18].

Intuitively, these tasks encourage the network to learn discriminative features that are useful to infer the absolute or relative temporal position of a video frame. In contrast, [5, 8, 16] aim at learning features that are invariant to typical alterations occurring between adjacent frames, such as slight rotations or deformations. To this end, they aim to ensure that temporally close frames, which most likely depict the same semantic scene, are mapped to similar representations in feature space. This idea goes back to Slow Feature Analysis (SFA) [17].

In this paper, we describe and compare different approaches to exploit temporal coherence while pretraining a convolutional neural network (CNN) for surgical phase segmentation. We assume the pretraining encourages the CNN to learn features that are invariant to irrelevant changes between adjacent frames, such as slight movements of instruments or of the endoscope, while being discriminative enough to distinguish between semantically different frames.

To promote reproducibility and to fuel future research, we made our code available at https://gitlab.com/nct_tso_public/pretrain_tc.

Experiments using the Cholec80 dataset [15] demonstrate that a CNN pretrained to exploit the temporal coherence of unlabeled laparoscopic video outperforms a non-pretrained CNN after being fine-tuned for surgical phase segmentation. When only 20 labeled videos are available, the proposed pretraining achieves an increase from 67.8 to 78.6 as measured by \(F_1\) score.

2 Methods

The core of our neural network architecture for surgical phase segmentation is a ResNet-50 CNN [6]. We initialize it with ImageNet [4] pretrained weights and further train it on unlabeled videos of laparoscopic surgeries, using an SFA-based approach for self-supervised learning. This encourages the CNN to map temporally close video frames to similar representations in feature space.

More formally, the CNN learns an embedding \(f:\mathbb {R}^{3 \times h \times w}\rightarrow \mathbb {R}^{d}\), where \(\mathbb {R}^{d}\) is the d-dimensional feature space and \(\mathbb {R}^{3 \times h \times w}\) is the space of laparoscopic video frames with height h, width w, and three color channels (RGB). Let \(I_t \in \mathbb {R}^{3 \times h \times w}\) denote the frame at time step t. To satisfy temporal coherence, we require that \(f(I_t) \approx f(I_{t + \varDelta })\) for a small \(\varDelta \) with \(|\varDelta | < \delta \). To learn an embedding that is discriminative and to avoid trivial solutions such as \(f(I_t):= 0\), we require that \(f(I_t)\) and \(f(I_{t + \varGamma })\) lie further apart in feature space when \(\varGamma \) is large, i.e., \(|\varGamma | > \gamma \) (see Subsect. 2.1 for details). \(\delta \) and \(\gamma \) are non-negative real-valued parameters.

To evaluate the efficacy of the proposed self-supervised pretraining approach, we extend the CNN into a recurrent neural network (RNN) and fine-tune the CNN-RNN for surgical phase segmentation using annotated laparoscopic videos (see Subsect. 2.2). We can then compare the performance of the pretrained CNN-RNN to the performance of a CNN-RNN that has been trained solely for the surgical phase segmentation task (see Sect. 3).

2.1 Self-supervised Pretraining

For self-supervised pretraining, the output layer of the ResNet-50 CNN is replaced with a fully connected layer with \(d = 4096\) output neurons (FeatureNet). As the CNN has been pretrained on ImageNet, we only adjust the weights of the conv5_x layers and of the newly added fully connected layer during training.
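The following PyTorch-style sketch illustrates how such a FeatureNet could be assembled from torchvision's ResNet-50. The class name and the freezing logic are our own illustration and need not match the released code.

```python
import torch.nn as nn
from torchvision import models

class FeatureNet(nn.Module):
    """Sketch: ResNet-50 with its classifier replaced by a d-dimensional embedding
    layer; only conv5_x (torchvision's "layer4") and the new layer remain trainable."""
    def __init__(self, d=4096):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)  # ImageNet weights
        backbone.fc = nn.Linear(backbone.fc.in_features, d)  # new d-dimensional embedding layer
        for name, param in backbone.named_parameters():
            # Freeze everything except conv5_x ("layer4") and the new fully connected layer.
            param.requires_grad = name.startswith("layer4") or name.startswith("fc")
        self.backbone = backbone

    def forward(self, x):
        return self.backbone(x)  # F_t = f(I_t), shape (batch, d)
```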

Given a frame \(I_t\), we calculate the embedding \(F_t := f(I_t)\) by forwarding the frame through FeatureNet and taking the output \((o_1, o_2, ..., o_d)^T \in \mathbb {R}^{d}\) at the last layer. We train FeatureNet to learn a temporally coherent video frame embedding using one of the following methods; a code sketch of all three loss variants is given after the list. Throughout this section, D denotes a distance function, in our case the Euclidean (L2) distance.

(a) Training with contrastive loss

    Given a video with T frames, we create a tuple \((I_t, I_{t + \varDelta }, I_{t + \varGamma })\) by sampling t from \([0, T - 1]\), \(\varDelta \) from \([-\delta , \delta ]\), and \(\varGamma \) from \([-(T - 1), -\gamma ] \cup [\gamma , T - 1]\) uniformly at random. Regarding FeatureNet as a Siamese network [3], we propagate the temporally close pair \((I_t, I_{t + \varDelta })\) through the CNN and calculate \(D(F_t, F_{t + \varDelta })\). Likewise, we propagate the temporally distant pair \((I_t, I_{t + \varGamma })\) and calculate \(D(F_t, F_{t + \varGamma })\). Finally, we calculate the contrastive loss [5]

    $$\begin{aligned}L_c(F_t, F_{t + \varDelta }, F_{t + \varGamma }) = D(F_t, F_{t + \varDelta }) + \mathtt {max}\lbrace 0, m_c - D(F_t, F_{t + \varGamma })\rbrace . \end{aligned}$$

    This loss function encourages \(F_t\) to be close to \(F_{t + \varDelta }\), while \(F_t\) and \(F_{t + \varGamma }\) are enforced to be separated by margin \(m_c\).

(b) Training with ranking loss

    A training tuple \((I_t, I_{t + \varDelta }, I_{t + \varGamma })\) is created the same way as in method (a). Regarding FeatureNet as a Triplet Siamese Network, we propagate the triplet \((I_t, I_{t + \varDelta }, I_{t + \varGamma })\) through the CNN and calculate the ranking loss [16]

    $$\begin{aligned} L_r(F_t, F_{t + \varDelta }, F_{t + \varGamma }) = \mathtt {max}\lbrace 0, D(F_t, F_{t + \varDelta }) - D(F_t, F_{t + \varGamma }) + m_r\rbrace . \end{aligned}$$

    This loss function considers the distance between \(F_t\) and \(F_{t + \varDelta }\) relative to the distance between \(F_t\) and \(F_{t + \varGamma }\) and encourages \(F_t\) and \(F_{t + \varDelta }\) to be closer together than \(F_t\) and \(F_{t + \varGamma }\) by a margin of \(m_r\).

(c) Training with 1\(^\mathbf{st}\) & 2\(^\mathbf{nd}\) order contrastive loss

    While (first order) temporal coherence requires the first order temporal derivatives in the learned feature space to be small, i.e., \(F_t \approx F_{t + \varDelta }\), second order temporal coherence [8] requires the second order temporal derivatives to be small, i.e., \(F_t - F_{t + \varDelta } \approx F_{t + \varDelta } - F_{t + 2\varDelta }\) for a small value of \(\varDelta \).

    Intuitively, first order temporal coherence ensures that embeddings do not change quickly over time, while second order temporal coherence ensures that the changes are consistent, or steady, across neighboring frames. Applying the contrastive loss function to second order temporal coherence yields

    $$\begin{aligned} L_{c_2}(F_t, F_{t + \varDelta }, F_{t + 2\varDelta }, F_{t + \varGamma }) = L_c(F_t - F_{t + \varDelta }, F_{t + \varDelta } - F_{t + 2\varDelta }, F_{t + \varDelta } - F_{t + \varGamma }). \end{aligned}$$

    In practice, we create a training tuple \((I_t, I_{t + \varDelta }, I_{t + 2\varDelta }, I_{t + \varGamma })\) by sampling t, \(\varDelta \), and \(\varGamma \) as described in method (a). Regarding FeatureNet as a Triplet Siamese Network, we propagate the triplets \((I_t, I_{t + \varDelta }, I_{t + 2\varDelta })\) and \((I_t, I_{t + \varDelta }, I_{t + \varGamma })\) through the network and calculate \(L_{c_2}\). We then combine it with the first order contrastive loss \(L_c\) into an overall loss \(L_{c + c_2} = L_c + \omega L_{c_2},\) where \(\omega = 0.5\) is a non-negative real-valued weight parameter.
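The three loss variants translate directly into code. The sketch below assumes that the embeddings \(F_t\), \(F_{t+\varDelta}\), \(F_{t+2\varDelta}\), and \(F_{t+\varGamma}\) have already been computed by FeatureNet for a batch of sampled tuples; the margins and the weight \(\omega\) use the values reported in Sect. 3, while the function and variable names are illustrative and not taken from the released code.

```python
import torch

def l2_dist(a, b):
    # Euclidean (L2) distance between embeddings, per sample in the batch.
    return torch.norm(a - b, p=2, dim=1)

def contrastive_loss(F_t, F_td, F_tg, m_c=2.0):          # variant (a)
    return (l2_dist(F_t, F_td)
            + torch.clamp(m_c - l2_dist(F_t, F_tg), min=0)).mean()

def ranking_loss(F_t, F_td, F_tg, m_r=2.0):              # variant (b)
    return torch.clamp(l2_dist(F_t, F_td) - l2_dist(F_t, F_tg) + m_r,
                       min=0).mean()

def first_and_second_order_loss(F_t, F_td, F_t2d, F_tg,
                                m_c=2.0, omega=0.5):     # variant (c)
    # Second order term: apply the contrastive loss to first order differences.
    l_c  = contrastive_loss(F_t, F_td, F_tg, m_c)
    l_c2 = contrastive_loss(F_t - F_td, F_td - F_t2d, F_td - F_tg, m_c)
    return l_c + omega * l_c2
```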

Table 1. Performance of the baseline (first row) and the pretrained models on the surgical phase segmentation task. #OPs denotes how many labeled OPs were used.

2.2 Supervised Fine-Tuning for Surgical Phase Segmentation

Once pretraining is complete, we modify the CNN for surgical phase segmentation by extending it into an RNN using a long short-term memory unit (LSTM) [7] with 512 neurons. The LSTM is followed by a fully connected layer, which has one output neuron per surgical phase. We refer to this CNN-LSTM as PhaseNet. During fine-tuning, the weights of the CNN and the LSTM are jointly optimized. However, the weights of the ResNet-50 layers below conv5_x stay frozen.
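A minimal sketch of such a CNN-LSTM is given below, assuming the FeatureNet sketched in Subsect. 2.1 provides the 4096-dimensional frame embeddings; the class name, the exact wiring, and the default of seven phases (P1 to P7 of Cholec80) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PhaseNet(nn.Module):
    """Sketch of the CNN-LSTM; feature_net is a FeatureNet-style CNN returning
    d-dimensional frame embeddings."""
    def __init__(self, feature_net, d=4096, hidden_size=512, num_phases=7):
        super().__init__()
        self.feature_net = feature_net
        self.lstm = nn.LSTM(d, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_phases)  # one output neuron per phase

    def forward(self, frames, state=None):
        # frames: (batch, time, 3, h, w)
        b, t = frames.shape[:2]
        feats = self.feature_net(frames.flatten(0, 1)).view(b, t, -1)
        out, state = self.lstm(feats, state)
        return self.classifier(out), state  # per-frame phase logits and LSTM state
```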

3 Evaluation

For evaluation, we used the publicly available Cholec80 dataset [15]. It consists of 80 videos of laparoscopic cholecystectomies, annotated with surgical phase labels. We divided the dataset into four sets A, B, C, and D of equal size and similar average procedure length. A, B, and C were used for training, while D was withheld for testing. For pretraining, we extracted video frames at 5 Hz. Training and testing for phase segmentation were performed at 1 Hz. Each frame was downsized to \(384 \times 216\) px.
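As an illustration of this preprocessing step, the following OpenCV sketch extracts frames at a given rate and downsizes them to \(384 \times 216\) px. It is our own example, not the pipeline used in the released code.

```python
import cv2

def extract_frames(video_path, target_hz=5, size=(384, 216)):
    """Extract frames from a surgery video at roughly target_hz and resize them."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    step = max(int(round(fps / target_hz)), 1)   # keep every step-th frame
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.resize(frame, size))  # (216, 384, 3) BGR image
        idx += 1
    cap.release()
    return frames
```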

We trained three different versions of FeatureNet, one with each of the pretraining variants described in Sect. 2.1. The union of sets A, B, and C (i.e., 60 videos in total) was used as training data, ignoring the labels. Each CNN was trained for 25 epochs. Per epoch, we randomly sampled 250 tuples per video, which were processed in batches of size 64. \(\delta \) was set to 30 s (15 s for variant (c)), \(\gamma \) to 120 s and \(m_c = m_r = 2\). We used the Adam optimizer [10] with a learning rate of \(10^{-4}\). All newly added layers were initialized with random values from the range \((\frac{-1}{\sqrt{n}} , \frac{1}{\sqrt{n}})\), with n being the number of neurons in the layer.
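The tuple sampling and the initialization of new layers could look as follows. The handling of indices that fall outside the video and the interpretation of n as the layer's output size are our own assumptions, as the paper does not specify them.

```python
import math
import random
import torch
import torch.nn as nn

def sample_tuple(T, delta_max, gamma_min):
    """Sample frame indices (t, t + Delta, t + 2*Delta, t + Gamma) for a video with
    T frames, where |Delta| <= delta_max and |Gamma| >= gamma_min."""
    t = random.randint(0, T - 1)
    delta = random.randint(-delta_max, delta_max)
    # All Gamma with |Gamma| >= gamma_min that keep t + Gamma inside the video
    # (assumes T > gamma_min, which holds for full-length surgeries).
    candidates = [g for g in range(-(T - 1), T)
                  if abs(g) >= gamma_min and 0 <= t + g < T]
    gamma = random.choice(candidates)
    clamp = lambda i: min(max(i, 0), T - 1)
    return t, clamp(t + delta), clamp(t + 2 * delta), clamp(t + gamma)

def init_new_layer(layer: nn.Linear):
    # Uniform initialization in (-1/sqrt(n), 1/sqrt(n)); we take n to be the
    # number of output neurons, which the paper leaves unspecified.
    bound = 1.0 / math.sqrt(layer.out_features)
    nn.init.uniform_(layer.weight, -bound, bound)
    nn.init.uniform_(layer.bias, -bound, bound)

# At 5 Hz, delta = 30 s and gamma = 120 s correspond to 150 and 600 frames.
# Only the trainable parameters are handed to the optimizer:
# optimizer = torch.optim.Adam(
#     (p for p in feature_net.parameters() if p.requires_grad), lr=1e-4)
```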

To evaluate the suitability of the proposed pretraining approach for surgical phase segmentation, each of the pretrained CNNs (contrastive, ranking, and 1\(^\mathbf{st}\) & 2\(^\mathbf{nd}\) order contrastive) was extended into a PhaseNet and fine-tuned using the labeled videos from either set A (#OPs = 20), sets A and B (#OPs = 40), or sets A, B, and C (#OPs = 60). As baseline, a PhaseNet without self-supervised pretraining (no pretraining) was fine-tuned in the same manner. Note that the underlying ResNet-50 CNN had still been pretrained on ImageNet.

For fine-tuning the networks, we used the Adam optimizer [10] with a learning rate of \(10^{-4}\) and a batch size of 128. After every batch, the content of the LSTM’s hidden state was saved and restored for the next batch. Due to hardware constraints, gradients were accumulated over three batches before the optimizer step was applied. Training was stopped once the accuracy on the training set climbed above 99.9%. All newly added layers were initialized as described above.
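The state carry-over and gradient accumulation could be implemented as in the following fragment, which assumes a PhaseNet instance and an iterator yielding consecutive (frames, labels) chunks of one annotated video; it is a sketch, not the authors' training loop.

```python
import torch
import torch.nn as nn

def finetune_one_video(phase_net, video_batches, optimizer, accumulate=3):
    """video_batches yields (frames, labels) with frames of shape (batch, time, 3, h, w)
    and integer labels of shape (batch, time); this interface is our assumption."""
    criterion = nn.CrossEntropyLoss()
    state = None                                  # LSTM hidden/cell state
    for i, (frames, labels) in enumerate(video_batches):
        logits, state = phase_net(frames, state)
        # Keep the state content for the next batch, but cut the gradient flow
        # across batch boundaries.
        state = tuple(s.detach() for s in state)
        loss = criterion(logits.flatten(0, 1), labels.flatten())
        loss.backward()                           # accumulate gradients
        if (i + 1) % accumulate == 0:             # optimizer step every 3 batches
            optimizer.step()
            optimizer.zero_grad()

# optimizer = torch.optim.Adam(
#     (p for p in phase_net.parameters() if p.requires_grad), lr=1e-4)
```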

The results of evaluating each PhaseNet on test set D can be found in Table 1. We calculated the metrics accuracy, recall, and precision as defined in [14]. The \(F_1\) score is the harmonic mean of precision and recall. The metrics were averaged over all operations in the test set. Table 2 presents the phase-wise results of the best performing pretrained PhaseNet (1\(^\mathbf{st}\) & 2\(^\mathbf{nd}\) order contrastive) compared to the PhaseNet that did not undergo self-supervised pretraining.
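For reference, the following sketch shows how precision, recall, and the \(F_1\) score for one phase could be computed from frame-wise predictions of a single operation; the exact definitions and averaging follow [14] and are not reproduced in detail here.

```python
import numpy as np

def phase_f1(pred, gt, phase):
    """Frame-wise precision, recall, and F1 for one phase in one video.
    pred and gt are arrays of per-frame phase labels; edge-case handling is ours."""
    tp = np.sum((pred == phase) & (gt == phase))
    fp = np.sum((pred == phase) & (gt != phase))
    fn = np.sum((pred != phase) & (gt == phase))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall > 0 else 0.0)
```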

Table 2. Comparison of the baseline and the best performing pretrained model. We report the average \(F_1\) scores calculated for each of the phases P1 to P7.

4 Discussion

Table 1 clearly shows that all three pretrained models outperform the baseline when fine-tuned on the same set of labeled training data. The performance boost is especially apparent when only 20 labeled videos are available. Here, in terms of \(F_1\) score, pretraining achieves an increase from 67.8 to as much as 78.6 while halving the standard deviation. Pretraining still improves performance when more labeled videos are available. Notably, the pretrained models fine-tuned on only 40 labeled videos outperform the baseline trained on 60 videos. We conclude that the proposed SFA-based pretraining enables a CNN to learn feature representations that are beneficial to the task of surgical phase segmentation.

Comparing the three pretraining variants, we do not find substantial differences. Overall, using a combination of first and second order temporal coherence for pretraining seems to offer the largest performance boost, especially when only few (20) labeled videos are used.

Looking at the results with respect to each surgical phase (Table 2), we see that most phases benefit greatly from pretraining (variant 1\(^\mathbf{st}\) & 2\(^\mathbf{nd}\) order contrastive) when only 20 labeled videos are available. The effect of pretraining diminishes when the number of labeled videos is increased, but is still noticeable in the majority of phases. Only the benefit to phase P7 seems negligible.

P7 shares visual similarities with P5 and P6, which makes these phases difficult to distinguish. Since P7 is short (about 1 to 3 min), frames that are labeled as temporally close during pretraining may actually belong to preceding phases. Likewise, frames that belong to preceding phases but are temporally close are not selected as distant pairs. Hence, the network learns features that are invariant rather than discriminative with regard to phase P7 and phases P5 or P6.

To shed some light on the features learned during pretraining, we investigated which images the network considers similar. We selected query frames \(\lbrace I^q \rbrace \) from a video used during pretraining. Then, for each frame \(I^q\) and each video v in the test set, we identified the frame \(I^{q, v}\) in v that is most similar to \(I^q\), i.e., closest to \(I^q\) in feature space. More formally, \(I^{q, v} = \mathop {\hbox {argmin}}\nolimits _{I_t \in v} D(f(I^q), f(I_t))\), where D was chosen to be the L2 distance. To calculate the embedding f, we used the 1\(^\mathbf{st}\) & 2\(^\mathbf{nd}\) order contrastive pretrained FeatureNet (before fine-tuning).
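This retrieval step amounts to a nearest-neighbor search in feature space. A minimal sketch, assuming the frames of one test video are available as a tensor; names are illustrative.

```python
import torch

@torch.no_grad()
def retrieve_closest(query_frame, video_frames, feature_net):
    """Find the frame of one test video that is closest to the query in feature space.
    query_frame: (3, h, w); video_frames: (T, 3, h, w)."""
    f_q = feature_net(query_frame.unsqueeze(0))        # (1, d)
    f_v = feature_net(video_frames)                    # (T, d)
    dists = torch.norm(f_v - f_q, p=2, dim=1)          # L2 distance to the query
    idx = torch.argmin(dists).item()
    return idx, dists[idx].item()                      # most similar frame index and its distance
```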

Figure 1 presents four selected queries. Generally, it can be seen that images that are close in feature space show similar scenes with regard to anatomical structures and/or tool presence. The first and second query frames depict scenes that only differ in the amount of blood visible, a trait also observed in the retrieved frames. Likewise, the third and fourth query frames show similar scenes. However, the third query frame is unusual as the specimen bag is closed. Observing that the retrieved images are semantically not closely related to the query frame, we assume that its embedding does not reflect the presence of the specimen bag. For the fourth query frame, which is visually similar but more representative, semantically similar frames are retrieved.

We refrain from comparing temporal coherence-based learning to other pretraining methods for surgical phase segmentation [2, 18] since these studies were conducted using other datasets, namely EndoVis2015 (7 cholecystectomies) in [2] and 120 cholecystectomies in [18].

Fig. 1. Image retrieval task. Each row represents one query. Left-most: query frame. Right: the frames closest in feature space, one per test video. Numbers denote the distance to the query frame. The depicted frames are sorted with regard to this distance.

5 Summary

In this paper, we show that the temporal coherence of unlabeled laparoscopic video can be exploited for self-supervised pretraining by training a CNN to map temporally close video frames onto embeddings that are close in feature space. When extended into a CNN-LSTM architecture for surgical phase segmentation, all pretrained models outperform the non-pretrained baseline when fine-tuned on the same labeled dataset. Using a combination of first and second order temporal coherence, the pretrained models even perform similarly to or better than the baseline when fine-tuned on less labeled data. Combining our approach with temporal order-based concepts into a more holistic temporal coherence-based pretraining method could possibly enhance the discriminative properties of the learned embedding and improve performance even further.

Future work will address the question of whether the learned embeddings can be used for unsupervised detection of more fine-grained video segments, such as surgical activities or steps. Furthermore, we will investigate whether the notion of slow and steady features is beneficial for regularization during supervised training, compared to using the concept only during a separate pretraining phase.