Using 3D Convolutional Neural Networks to Learn Spatiotemporal Features for Automatic Surgical Gesture Recognition in Video

Isabel Funke; Sebastian Bodenstedt; Florian Oehme; Felix von Bechtolsheim; Jürgen Weitz; Stefanie Speidel

doi:10.1007/978-3-030-32254-0_52

Publication: Contribution to book/conference proceedings/anthology/report › Contribution to conference proceedings › Contributed › Peer-reviewed

Contributors

Isabel Funke, Nationales Centrum für Tumorerkrankungen Dresden (Author)
Sebastian Bodenstedt, Nationales Zentrum für Tumorerkrankungen (NCT) Dresden (Author)
Florian Oehme, Klinik und Poliklinik für Viszeral-, Thorax- und Gefäßchirurgie, Universitätsklinikum Carl Gustav Carus Dresden (Author)
Felix von Bechtolsheim, Klinik und Poliklinik für Viszeral-, Thorax- und Gefäßchirurgie, Universitätsklinikum Carl Gustav Carus Dresden (Author)
Jürgen Weitz, Klinik und Poliklinik für Viszeral-, Thorax- und Gefäßchirurgie, Exzellenzcluster CeTI: Zentrum für Taktiles Internet, Universitätsklinikum Carl Gustav Carus Dresden (Author)
Stefanie Speidel, Nationales Centrum für Tumorerkrankungen Dresden, Exzellenzcluster CeTI: Zentrum für Taktiles Internet (Author)

Abstract

Automatically recognizing surgical gestures is a crucial step towards a thorough understanding of surgical skill. Possible areas of application include automatic skill assessment, intra-operative monitoring of critical surgical steps, and semi-automation of surgical tasks. Solutions that rely only on the laparoscopic video and do not require additional sensor hardware are especially attractive as they can be implemented at low cost in many scenarios. However, surgical gesture recognition based only on video is a challenging problem that requires effective means to extract both visual and temporal information from the video. Previous approaches mainly rely on frame-wise feature extractors, either handcrafted or learned, which fail to capture the dynamics in surgical video. To address this issue, we propose to use a 3D Convolutional Neural Network (CNN) to learn spatiotemporal features from consecutive video frames. We evaluate our approach on recordings of robot-assisted suturing on a bench-top model, which are taken from the publicly available JIGSAWS dataset. Our approach achieves high frame-wise surgical gesture recognition accuracies of more than 84%, outperforming comparable models that either extract only spatial features or model spatial and low-level temporal information separately. For the first time, these results demonstrate the benefit of spatiotemporal CNNs for video-based surgical gesture recognition.
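
The contrast the abstract draws between frame-wise feature extractors and spatiotemporal features can be illustrated with a small sketch. The code below is not the authors' network; it is a naive, pure-Python 3D cross-correlation over a hypothetical toy clip, with a hand-set kernel standing in for weights that a 3D CNN would learn. The function name `conv3d_valid`, the clips, and the kernel are all assumptions made for illustration.

```python
def conv3d_valid(clip, kernel):
    """Naive 'valid' 3D cross-correlation over (time, height, width).

    clip and kernel are nested lists indexed as [t][y][x].
    """
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # The kernel spans several consecutive frames (dt),
                # not just a spatial neighbourhood (dy, dx).
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Hypothetical toy clip: 3 grayscale frames of 3x3 pixels,
# with one bright pixel moving to the right along the top row.
moving_clip = [
    [[1, 0, 0], [0, 0, 0], [0, 0, 0]],
    [[0, 1, 0], [0, 0, 0], [0, 0, 0]],
    [[0, 0, 1], [0, 0, 0], [0, 0, 0]],
]
# The same first frame repeated: identical appearance, no motion.
static_clip = [moving_clip[0]] * 3

# Hand-set kernel of shape (2, 1, 1): next frame minus current frame.
# In a trained 3D CNN such weights would be learned, not fixed.
temporal_kernel = [[[-1.0]], [[1.0]]]

motion_response = conv3d_valid(moving_clip, temporal_kernel)
static_response = conv3d_valid(static_clip, temporal_kernel)

print(motion_response[0][0])  # top row, first frame pair: [-1.0, 1.0, 0.0]
print(static_response[0][0])  # top row, static clip: [0.0, 0.0, 0.0]
```

Because the kernel extends along the time axis, it responds to the moving pixel and stays at zero for the static clip, even though both clips begin with the same frame. A feature extractor applied to each frame in isolation has no access to this difference.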

Details

Original language	English
Title	Medical Image Computing and Computer Assisted Intervention – MICCAI 2019 - 22nd International Conference, Proceedings
Editors	Dinggang Shen, Pew-Thian Yap, Tianming Liu, Terry M. Peters, Ali Khan, Lawrence H. Staib, Caroline Essert, Sean Zhou
Publisher	Springer Science and Business Media B.V.
Pages	467-475
Number of pages	9
ISBN (electronic)	978-3-030-32254-0
ISBN (Print)	978-3-030-32253-3
Publication status	Published - 2019
Peer-review status	Yes

Publication series

Series	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	11768 LNCS
ISSN	0302-9743

Conference

Title	22nd International Conference on Medical Image Computing and Computer-Assisted Intervention, MICCAI 2019
Duration	13 - 17 October 2019
City	Shenzhen
Country	China

External IDs

Scopus	85075673945
ORCID	/0000-0002-4590-1908/work/163293979

Subject terms

Keywords

Action segmentation, Convolutional Neural Network, Spatiotemporal modeling, Surgical gesture, Video understanding

Library subject terms

004 Computer science
