
1 Introduction

The aim of a computer-assisted surgery (CAS) system is to provide the surgeon with the right type of assistance at the right time. To achieve this, context awareness is crucial. This means that the system must be able to understand the processes currently taking place in the operating room (OR) and adapt its behavior accordingly. Surgical workflow analysis covers the challenging topic of perceiving, understanding, and describing surgical processes [11].

A common approach is to analyze surgical processes by interpreting a time series of signals that are recorded by sensors – in some cases also by humans – in the OR. As laparoscopic surgeries are performed via camera, methods that require only video as input sensor data are of special interest, since the video can be collected effortlessly during surgery.

State-of-the-art video-based approaches for workflow analysis rely on deep neural networks [1, 2, 9, 15, 18]. However, deep learning-based methods require large amounts of labeled data for training. Especially in surgery, obtaining a sufficient amount of annotated video data is difficult and costly.

To alleviate the problem of limited training data, it is common to pretrain neural networks and fine-tune them afterwards. Often, networks are pretrained using labeled data coming from another domain, such as ImageNet [4]. Another way is to use unlabeled data from the same domain and train on a proxy task using labels inherent in the data, which is called self-supervised learning.

For self-supervised learning from video, a number of ideas have been proposed [5, 8, 12, 13, 16]. Most exploit the temporal coherence of video, which implies that (i) consecutive frames are in temporal order, (ii) frames change slowly over time, and (iii) frames change steadily, i.e., abrupt motions are unlikely.

The studies [12, 13] propose proxy tasks based on the temporal order between frames. In line with this, [2] use the task of ordering pairs of laparoscopic images to pretrain a network for surgical phase segmentation. Surgical phase segmentation [14] is the problem of recognizing the surgical phase being performed by the surgeon at each point during surgery. Another proxy task for this problem is to predict the progress and remaining duration of a surgery [18].

Intuitively, these tasks encourage the network to learn discriminative features that are useful to infer the absolute or relative temporal position of a video frame. In contrast, [5, 8, 16] aim at learning features that are invariant to typical alterations occurring between adjacent frames, such as slight rotations or deformations. To this end, they aim to ensure that temporally close frames, which most likely depict the same semantic scene, are mapped to similar representations in feature space. This idea goes back to Slow Feature Analysis (SFA) [17].

In this paper, we describe and compare different approaches to exploit temporal coherence while pretraining a convolutional neural network (CNN) for surgical phase segmentation. We assume the pretraining encourages the CNN to learn features that are invariant to irrelevant changes between adjacent frames, such as slight movements of instruments or of the endoscope, while being discriminative enough to distinguish between semantically different frames.

To promote reproducibility and to fuel future research, we made our code available at https://gitlab.com/nct_tso_public/pretrain_tc.

Experiments using the Cholec80 dataset [15] demonstrate that a CNN pretrained to exploit the temporal coherence of unlabeled laparoscopic video outperforms a non-pretrained CNN after being fine-tuned for surgical phase segmentation. When only 20 labeled videos are available, the proposed pretraining achieves an increase from 67.8 to 78.6 as measured by \(F_1\) score.

2 Methods

The core of our neural network architecture for surgical phase segmentation is a ResNet-50 CNN [6]. We initialize it with ImageNet [4] pretrained weights and further train it on unlabeled videos of laparoscopic surgeries, using an SFA-based approach for self-supervised learning. This encourages the CNN to map temporally close video frames to similar representations in feature space.

More formally, the CNN learns an embedding \(f:\mathbb {R}^{3 \times h \times w}\rightarrow \mathbb {R}^{d}\), where \(\mathbb {R}^{d}\) is the d-dimensional feature space and \(\mathbb {R}^{3 \times h \times w}\) is the space of laparoscopic video frames with height h, width w, and three color channels (RGB). Let \(I_t \in \mathbb {R}^{3 \times h \times w}\) denote the frame at time step t. To satisfy temporal coherence, we require that \(f(I_t) \approx f(I_{t + \varDelta })\) for a small \(\varDelta \) with \(|\varDelta | < \delta \). To learn an embedding that is discriminative and to avoid trivial solutions such as \(f(I_t):= 0\), we require that \(f(I_t)\) and \(f(I_{t + \varGamma })\) lie further apart in feature space when \(\varGamma \) is large, i.e., \(|\varGamma | > \gamma \) (see Subsect. 2.1 for details). \(\delta \) and \(\gamma \) are non-negative real-valued parameters.

To evaluate the efficacy of the proposed self-supervised pretraining approach, we extend the CNN into a recurrent neural network (RNN) and fine-tune the CNN-RNN for surgical phase segmentation using annotated laparoscopic videos (see Subsect. 2.2). We can then compare the performance of the pretrained CNN-RNN to the performance of a CNN-RNN that has been trained solely for the surgical phase segmentation task (see Sect. 3).

2.1 Self-supervised Pretraining

For self-supervised pretraining, the output layer of the ResNet-50 CNN is replaced with a fully connected layer with \(d = 4096\) output neurons (FeatureNet). As the CNN has been pretrained on ImageNet, we only adjust the weights of the conv5_x layers and of the newly added fully connected layer during training.
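The following PyTorch-style sketch illustrates how such a FeatureNet could be assembled from torchvision's ResNet-50. The class name and the freezing logic are our own illustration and need not match the released code.

```python
import torch.nn as nn
from torchvision import models

class FeatureNet(nn.Module):
    """Sketch: ResNet-50 with its classifier replaced by a d-dimensional embedding
    layer; only conv5_x (torchvision's "layer4") and the new layer remain trainable."""
    def __init__(self, d=4096):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)  # ImageNet weights
        backbone.fc = nn.Linear(backbone.fc.in_features, d)  # new d-dimensional embedding layer
        for name, param in backbone.named_parameters():
            # Freeze everything except conv5_x ("layer4") and the new fully connected layer.
            param.requires_grad = name.startswith("layer4") or name.startswith("fc")
        self.backbone = backbone

    def forward(self, x):
        return self.backbone(x)  # F_t = f(I_t), shape (batch, d)
```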

Given a frame \(I_t\), we calculate the embedding \(F_t := f(I_t)\) by forwarding the frame through FeatureNet and taking the output \((o_1, o_2, ..., o_d)^T \in \mathbb {R}^{d}\) at the last layer. We train FeatureNet to learn a temporally coherent video frame embedding using one of the following methods; a code sketch of all three loss variants is given after the list. Throughout this section, D denotes a distance function, in our case the Euclidean (L2) distance.

(a) Training with contrastive loss

    Given a video with T frames, we create a tuple \((I_t, I_{t + \varDelta }, I_{t + \varGamma })\) by sampling t from \([0, T - 1]\), \(\varDelta \) from \([-\delta , \delta ]\), and \(\varGamma \) from \([-(T - 1), -\gamma ] \cup [\gamma , T - 1]\) uniformly at random. Regarding FeatureNet as a Siamese network [3], we propagate the temporally close pair \((I_t, I_{t + \varDelta })\) through the CNN and calculate \(D(F_t, F_{t + \varDelta })\). Likewise, we propagate the temporally distant pair \((I_t, I_{t + \varGamma })\) and calculate \(D(F_t, F_{t + \varGamma })\). Finally, we calculate the contrastive loss [5]

    $$\begin{aligned}L_c(F_t, F_{t + \varDelta }, F_{t + \varGamma }) = D(F_t, F_{t + \varDelta }) + \mathtt {max}\lbrace 0, m_c - D(F_t, F_{t + \varGamma })\rbrace . \end{aligned}$$

    This loss function encourages \(F_t\) to be close to \(F_{t + \varDelta }\), while \(F_t\) and \(F_{t + \varGamma }\) are enforced to be separated by margin \(m_c\).

(b) Training with ranking loss

    A training tuple \((I_t, I_{t + \varDelta }, I_{t + \varGamma })\) is created the same way as in method (a). Regarding FeatureNet as a Triplet Siamese Network, we propagate the triplet \((I_t, I_{t + \varDelta }, I_{t + \varGamma })\) through the CNN and calculate the ranking loss [16]

    $$\begin{aligned} L_r(F_t, F_{t + \varDelta }, F_{t + \varGamma }) = \mathtt {max}\lbrace 0, D(F_t, F_{t + \varDelta }) - D(F_t, F_{t + \varGamma }) + m_r\rbrace . \end{aligned}$$

    This loss function considers the distance between \(F_t\) and \(F_{t + \varDelta }\) relative to the distance between \(F_t\) and \(F_{t + \varGamma }\) and encourages \(F_t\) and \(F_{t + \varDelta }\) to be closer together than \(F_t\) and \(F_{t + \varGamma }\) by a margin of \(m_r\).

(c) Training with 1\(^\mathbf{st}\) & 2\(^\mathbf{nd}\) order contrastive loss

    While (first order) temporal coherence requires the first order temporal derivatives in the learned feature space to be small, i.e., \(F_t \approx F_{t + \varDelta }\), second order temporal coherence [8] requires the second order temporal derivatives to be small, i.e., \(F_t - F_{t + \varDelta } \approx F_{t + \varDelta } - F_{t + 2\varDelta }\) for a small value of \(\varDelta \).

    Intuitively, first order temporal coherence ensures that embeddings do not change quickly over time, while second order temporal coherence ensures that the changes are consistent, or steady, across neighboring frames. Applying the contrastive loss function to second order temporal coherence yields

    $$\begin{aligned} L_{c_2}(F_t, F_{t + \varDelta }, F_{t + 2\varDelta }, F_{t + \varGamma }) = L_c(F_t - F_{t + \varDelta }, F_{t + \varDelta } - F_{t + 2\varDelta }, F_{t + \varDelta } - F_{t + \varGamma }). \end{aligned}$$

    In practice, we create a training tuple \((I_t, I_{t + \varDelta }, I_{t + 2\varDelta }, I_{t + \varGamma })\) by sampling t, \(\varDelta \), and \(\varGamma \) as described in method (a). Regarding FeatureNet as a Triplet Siamese Network, we propagate the triplets \((I_t, I_{t + \varDelta }, I_{t + 2\varDelta })\) and \((I_t, I_{t + \varDelta }, I_{t + \varGamma })\) through the network and calculate \(L_{c_2}\). We then combine it with the first order contrastive loss \(L_c\) into an overall loss \(L_{c + c_2} = L_c + \omega L_{c_2},\) where \(\omega = 0.5\) is a non-negative real-valued weight parameter.
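The three loss variants translate directly into code. The sketch below assumes that the embeddings \(F_t\), \(F_{t+\varDelta}\), \(F_{t+2\varDelta}\), and \(F_{t+\varGamma}\) have already been computed by FeatureNet for a batch of sampled tuples; the margins and the weight \(\omega\) use the values reported in Sect. 3, while the function and variable names are illustrative and not taken from the released code.

```python
import torch

def l2_dist(a, b):
    # Euclidean (L2) distance between embeddings, per sample in the batch.
    return torch.norm(a - b, p=2, dim=1)

def contrastive_loss(F_t, F_td, F_tg, m_c=2.0):          # variant (a)
    return (l2_dist(F_t, F_td)
            + torch.clamp(m_c - l2_dist(F_t, F_tg), min=0)).mean()

def ranking_loss(F_t, F_td, F_tg, m_r=2.0):              # variant (b)
    return torch.clamp(l2_dist(F_t, F_td) - l2_dist(F_t, F_tg) + m_r,
                       min=0).mean()

def first_and_second_order_loss(F_t, F_td, F_t2d, F_tg,
                                m_c=2.0, omega=0.5):     # variant (c)
    # Second order term: apply the contrastive loss to first order differences.
    l_c  = contrastive_loss(F_t, F_td, F_tg, m_c)
    l_c2 = contrastive_loss(F_t - F_td, F_td - F_t2d, F_td - F_tg, m_c)
    return l_c + omega * l_c2
```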

Table 1. Performance of the baseline (first row) and the pretrained models on the surgical phase segmentation task. #OPs denotes how many labeled OPs were used.

2.2 Supervised Fine-Tuning for Surgical Phase Segmentation

Once pretraining is complete, we modify the CNN for surgical phase segmentation by extending it into an RNN using a long short-term memory unit (LSTM) [7] with 512 neurons. The LSTM is followed by a fully connected layer, which has one output neuron per surgical phase. We refer to this CNN-LSTM as PhaseNet. During fine-tuning, the weights of the CNN and the LSTM are jointly optimized. However, the weights of the ResNet-50 layers below conv5_x stay frozen.
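A minimal sketch of such a CNN-LSTM is given below, assuming the FeatureNet sketched in Subsect. 2.1 provides the 4096-dimensional frame embeddings; the class name, the exact wiring, and the default of seven phases (P1 to P7 of Cholec80) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PhaseNet(nn.Module):
    """Sketch of the CNN-LSTM; feature_net is a FeatureNet-style CNN returning
    d-dimensional frame embeddings."""
    def __init__(self, feature_net, d=4096, hidden_size=512, num_phases=7):
        super().__init__()
        self.feature_net = feature_net
        self.lstm = nn.LSTM(d, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_phases)  # one output neuron per phase

    def forward(self, frames, state=None):
        # frames: (batch, time, 3, h, w)
        b, t = frames.shape[:2]
        feats = self.feature_net(frames.flatten(0, 1)).view(b, t, -1)
        out, state = self.lstm(feats, state)
        return self.classifier(out), state  # per-frame phase logits and LSTM state
```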

3 Evaluation

For evaluation, we used the publicly available Cholec80 dataset [15]. It consists of 80 videos of laparoscopic cholecystectomies, annotated with surgical phase labels. We divided the dataset into four sets A, B, C, and D of equal size and similar average procedure length. A, B, and C were used for training, while D was withheld for testing. For pretraining, we extracted video frames at 5 Hz. Training and testing for phase segmentation were performed at 1 Hz. Each frame was downsized to \(384 \times 216\) px.
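As an illustration of this preprocessing step, the following OpenCV sketch extracts frames at a given rate and downsizes them to \(384 \times 216\) px. It is our own example, not the pipeline used in the released code.

```python
import cv2

def extract_frames(video_path, target_hz=5, size=(384, 216)):
    """Extract frames from a surgery video at roughly target_hz and resize them."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    step = max(int(round(fps / target_hz)), 1)   # keep every step-th frame
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.resize(frame, size))  # (216, 384, 3) BGR image
        idx += 1
    cap.release()
    return frames
```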

We trained three different versions of FeatureNet, one with each of the pretraining variants described in Sect. 2.1. The union of sets A, B, and C (i.e., 60 videos in total) was used as training data, ignoring the labels. Each CNN was trained for 25 epochs. Per epoch, we randomly sampled 250 tuples per video, which were processed in batches of size 64. \(\delta \) was set to 30 s (15 s for variant (c)), \(\gamma \) to 120 s and \(m_c = m_r = 2\). We used the Adam optimizer [10] with a learning rate of \(10^{-4}\). All newly added layers were initialized with random values from the range \((\frac{-1}{\sqrt{n}} , \frac{1}{\sqrt{n}})\), with n being the number of neurons in the layer.
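The tuple sampling and the initialization of new layers could look as follows. The handling of indices that fall outside the video and the interpretation of n as the layer's output size are our own assumptions, as the paper does not specify them.

```python
import math
import random
import torch
import torch.nn as nn

def sample_tuple(T, delta_max, gamma_min):
    """Sample frame indices (t, t + Delta, t + 2*Delta, t + Gamma) for a video with
    T frames, where |Delta| <= delta_max and |Gamma| >= gamma_min."""
    t = random.randint(0, T - 1)
    delta = random.randint(-delta_max, delta_max)
    # All Gamma with |Gamma| >= gamma_min that keep t + Gamma inside the video
    # (assumes T > gamma_min, which holds for full-length surgeries).
    candidates = [g for g in range(-(T - 1), T)
                  if abs(g) >= gamma_min and 0 <= t + g < T]
    gamma = random.choice(candidates)
    clamp = lambda i: min(max(i, 0), T - 1)
    return t, clamp(t + delta), clamp(t + 2 * delta), clamp(t + gamma)

def init_new_layer(layer: nn.Linear):
    # Uniform initialization in (-1/sqrt(n), 1/sqrt(n)); we take n to be the
    # number of output neurons, which the paper leaves unspecified.
    bound = 1.0 / math.sqrt(layer.out_features)
    nn.init.uniform_(layer.weight, -bound, bound)
    nn.init.uniform_(layer.bias, -bound, bound)

# At 5 Hz, delta = 30 s and gamma = 120 s correspond to 150 and 600 frames.
# Only the trainable parameters are handed to the optimizer:
# optimizer = torch.optim.Adam(
#     (p for p in feature_net.parameters() if p.requires_grad), lr=1e-4)
```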

To evaluate the suitability of the proposed pretraining approach for surgical phase segmentation, each of the pretrained CNNs (contrastive, ranking, and 1\(^\mathbf{st}\) & 2\(^\mathbf{nd}\) order contrastive) was extended into a PhaseNet and fine-tuned using the labeled videos from either set A (#OPs = 20), sets A and B (#OPs = 40), or sets A, B, and C (#OPs = 60). As baseline, a PhaseNet without self-supervised pretraining (no pretraining) was fine-tuned in the same manner. Note that the underlying ResNet-50 CNN had still been pretrained on ImageNet.

For fine-tuning the networks, we used the Adam optimizer [10] with a learning rate of \(10^{-4}\) and a batch size of 128. After every batch, the content of the LSTM’s hidden state was saved and restored for the next batch. Due to hardware constraints, gradients were accumulated over three batches before the optimizer step was applied. Training was stopped once the accuracy on the training set climbed above 99.9%. All newly added layers were initialized as described above.
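The state carry-over and gradient accumulation could be implemented as in the following fragment, which assumes a PhaseNet instance and an iterator yielding consecutive (frames, labels) chunks of one annotated video; it is a sketch, not the authors' training loop.

```python
import torch
import torch.nn as nn

def finetune_one_video(phase_net, video_batches, optimizer, accumulate=3):
    """video_batches yields (frames, labels) with frames of shape (batch, time, 3, h, w)
    and integer labels of shape (batch, time); this interface is our assumption."""
    criterion = nn.CrossEntropyLoss()
    state = None                                  # LSTM hidden/cell state
    for i, (frames, labels) in enumerate(video_batches):
        logits, state = phase_net(frames, state)
        # Keep the state content for the next batch, but cut the gradient flow
        # across batch boundaries.
        state = tuple(s.detach() for s in state)
        loss = criterion(logits.flatten(0, 1), labels.flatten())
        loss.backward()                           # accumulate gradients
        if (i + 1) % accumulate == 0:             # optimizer step every 3 batches
            optimizer.step()
            optimizer.zero_grad()

# optimizer = torch.optim.Adam(
#     (p for p in phase_net.parameters() if p.requires_grad), lr=1e-4)
```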

The results of evaluating each PhaseNet on test set D can be found in Table 1. We calculated the metrics accuracy, recall, and precision as defined in [14]. The \(F_1\) score is the harmonic mean of precision and recall. The metrics were averaged over all operations in the test set. Table 2 presents the phase-wise results of the best performing pretrained PhaseNet (1\(^\mathbf{st}\) & 2\(^\mathbf{nd}\) order contrastive) compared to the PhaseNet that did not undergo self-supervised pretraining.
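For reference, the following sketch shows how precision, recall, and the \(F_1\) score for one phase could be computed from frame-wise predictions of a single operation; the exact definitions and averaging follow [14] and are not reproduced in detail here.

```python
import numpy as np

def phase_f1(pred, gt, phase):
    """Frame-wise precision, recall, and F1 for one phase in one video.
    pred and gt are arrays of per-frame phase labels; edge-case handling is ours."""
    tp = np.sum((pred == phase) & (gt == phase))
    fp = np.sum((pred == phase) & (gt != phase))
    fn = np.sum((pred != phase) & (gt == phase))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall > 0 else 0.0)
```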

Table 2. Comparison of the baseline and the best performing pretrained model. We report the average \(F_1\) scores calculated for each of the phases P1 to P7.

4 Discussion

Table 1 clearly shows that all three pretrained models outperform the baseline when fine-tuned on the same set of labeled training data. The performance boost is especially apparent when only 20 labeled videos are available. Here, in terms of \(F_1\) score, pretraining achieves an increase from 67.8 to as much as 78.6 while halving the standard deviation. Pretraining still improves performance when more labeled videos are available. Notably, the pretrained models fine-tuned on only 40 labeled videos outperform the baseline trained on 60 videos. We conclude that the proposed SFA-based pretraining enables a CNN to learn feature representations that are beneficial to the task of surgical phase segmentation.

Comparing the three pretraining variants, we do not find substantial differences. Overall, using a combination of first and second order temporal coherence for pretraining seems to offer the largest performance boost, especially when only few (20) labeled videos are used.

Looking at the results with respect to each surgical phase (Table 2), we see that most phases benefit greatly from pretraining (variant 1\(^\mathbf{st}\) & 2\(^\mathbf{nd}\) order contrastive) when only 20 labeled videos are available. The effect of pretraining diminishes when the number of labeled videos is increased, but is still noticeable in the majority of phases. Only the benefit to phase P7 seems negligible.

P7 shares visual similarities with P5 and P6, which makes these phases difficult to distinguish. Since P7 is short (about 1 to 3 min), frames that are labeled as temporally close during pretraining may actually belong to preceding phases. Likewise, frames that belong to preceding phases but are temporally close are not selected as distant pairs. Hence, the network learns features that are invariant rather than discriminative with regard to phase P7 and phases P5 or P6.

To shed some light on the features learned during pretraining, we investigated which images the network considers similar. We selected query frames \(\lbrace I^q \rbrace \) from a video used during pretraining. Then, for each frame \(I^q\) and each video v in the test set, we identified the frame \(I^{q, v}\) in v that is most similar to \(I^q\), i.e., closest to \(I^q\) in feature space. More formally, \(I^{q, v} = \mathop {\hbox {argmin}}\nolimits _{I_t \in v} D(f(I^q), f(I_t))\), where D was chosen to be the L2 distance. To calculate the embedding f, we used the 1\(^\mathbf{st}\) & 2\(^\mathbf{nd}\) order contrastive pretrained FeatureNet (before fine-tuning).
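This retrieval step amounts to a nearest-neighbor search in feature space. A minimal sketch, assuming the frames of one test video are available as a tensor; names are illustrative.

```python
import torch

@torch.no_grad()
def retrieve_closest(query_frame, video_frames, feature_net):
    """Find the frame of one test video that is closest to the query in feature space.
    query_frame: (3, h, w); video_frames: (T, 3, h, w)."""
    f_q = feature_net(query_frame.unsqueeze(0))        # (1, d)
    f_v = feature_net(video_frames)                    # (T, d)
    dists = torch.norm(f_v - f_q, p=2, dim=1)          # L2 distance to the query
    idx = torch.argmin(dists).item()
    return idx, dists[idx].item()                      # most similar frame index and its distance
```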

Figure 1 presents four selected queries. Generally, it can be seen that images that are close in feature space show similar scenes with regard to anatomical structures and/or tool presence. The first and second query frames depict scenes that only differ in the amount of blood visible, a trait also observed in the retrieved frames. Likewise, the third and fourth query frames show similar scenes. However, the third query frame is unusual as the specimen bag is closed. Observing that the retrieved images are semantically not closely related to the query frame, we assume that its embedding does not reflect the presence of the specimen bag. For the fourth query frame, which is visually similar but more representative, semantically similar frames are retrieved.

We refrain from comparing temporal coherence-based learning to other pretraining methods for surgical phase segmentation [2, 18] since these studies were conducted using other datasets, namely EndoVis2015 (7 cholecystectomies) in [2] and 120 cholecystectomies in [18].

Fig. 1. Image retrieval task. Each row represents one query. Left-most: query frame. Right: the frames closest in feature space, one per test video. Numbers denote the distance to the query frame. The depicted frames are sorted with regard to this distance.

5 Summary

In this paper, we show that the temporal coherence of unlabeled laparoscopic video can be exploited for self-supervised pretraining by training a CNN to map temporally close video frames onto embeddings that are close in feature space. When extended into a CNN-LSTM architecture for surgical phase segmentation, all pretrained models outperform the non-pretrained baseline when fine-tuned on the same labeled dataset. Using a combination of first and second order temporal coherence, the pretrained models even perform similarly to or better than the baseline when fine-tuned on less labeled data. Combining our approach with temporal order-based concepts into a more holistic temporal coherence-based pretraining method could possibly enhance the discriminative properties of the learned embedding and improve performance even further.

Future work will address the question of whether the learned embeddings can be used for unsupervised detection of more fine-grained video segments, such as surgical activities or steps. Furthermore, we will investigate whether the notion of slow and steady features is beneficial for regularization during supervised training, compared to using the concept only during a separate pretraining phase.