Gesture Recognition in Robotic Surgery With Multimodal Attention

Beatrice Van Amsterdam; Isabel Funke; Eddie Edwards; Stefanie Speidel; Justin Collins; Ashwin Sridhar; John Kelly; Matthew J. Clarkson; Danail Stoyanov

doi:10.1109/TMI.2022.3147640

Publication: Contribution to journal › Research article › Contributed › Peer-reviewed

Contributors

Beatrice Van Amsterdam, University College London (Author)
Isabel Funke, Nationales Centrum für Tumorerkrankungen Dresden, Exzellenzcluster CeTI: Zentrum für Taktiles Internet (Author)
Eddie Edwards, University College London (Author)
Stefanie Speidel, Deutsches Zentrum für Neurodegenerative Erkrankungen, Standort Dresden (Partner: DZNE der Helmholtzgemeinschaft), Nationales Centrum für Tumorerkrankungen Dresden (Author)
Justin Collins, University College London (Author)
Ashwin Sridhar, University College London (Author)
John Kelly, University College London (Author)
Matthew J. Clarkson, University College London (Author)
Danail Stoyanov, University College London (Author)

Abstract

Automatically recognising surgical gestures from surgical data is an important building block of automated activity recognition and analytics, technical skill assessment, intra-operative assistance and eventually robotic automation. The complexity of articulated instrument trajectories and the inherent variability due to surgical style and patient anatomy make analysis and fine-grained segmentation of surgical motion patterns from robot kinematics alone very difficult. Surgical video provides crucial information from the surgical site with context for the kinematic data and the interaction between the instruments and tissue. Yet sensor fusion between the robot data and surgical video stream is non-trivial because the data have different frequency, dimensions and discriminative capability. In this paper, we integrate multimodal attention mechanisms in a two-stream temporal convolutional network to compute relevance scores and weight kinematic and visual feature representations dynamically in time, aiming to aid multimodal network training and achieve effective sensor fusion. We report the results of our system on the JIGSAWS benchmark dataset and on a new in vivo dataset of suturing segments from robotic prostatectomy procedures. Our results are promising and obtain multimodal prediction sequences with higher accuracy and better temporal structure than corresponding unimodal solutions. Visualization of attention scores also gives physically interpretable insights on network understanding of strengths and weaknesses of each sensor.
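The attention-based weighting described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes both streams have already been reduced to per-frame feature vectors of equal size, and the function name `attention_fusion` and the scoring vectors `w_kin` and `w_vis` are hypothetical stand-ins for learned parameters. Each time step gets one relevance score per modality, the scores are normalised with a softmax, and the features are combined as a weighted sum.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention_fusion(kin_feats, vis_feats, w_kin, w_vis):
    """Fuse two feature sequences with per-time-step attention weights.

    kin_feats, vis_feats: arrays of shape (T, D), one feature vector per frame.
    w_kin, w_vis: arrays of shape (D,), hypothetical scoring vectors that
    stand in for learned parameters.
    Returns the fused sequence (T, D) and the attention weights (T, 2).
    """
    # One scalar relevance score per time step and modality.
    scores = np.stack([kin_feats @ w_kin, vis_feats @ w_vis], axis=1)
    # Normalise across the two modalities so the weights sum to 1 per frame.
    weights = softmax(scores, axis=1)
    # Weighted sum of the kinematic and visual features at each time step.
    fused = weights[:, :1] * kin_feats + weights[:, 1:] * vis_feats
    return fused, weights


# Toy data: 8 time steps, 4 features per modality (random, for illustration only).
rng = np.random.default_rng(0)
T, D = 8, 4
kin = rng.normal(size=(T, D))
vis = rng.normal(size=(T, D))
fused, weights = attention_fusion(kin, vis, rng.normal(size=D), rng.normal(size=D))
```

Plotting the two columns of `weights` over time gives the kind of per-sensor relevance curve that the abstract refers to when it mentions visualising attention scores.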

Details

Original language	English
Pages (from - to)	1677-1687
Number of pages	11
Journal	IEEE Transactions on Medical Imaging
Volume	2022
Issue number	41(7)
Publication status	Published - 1 July 2022
Peer-review status	Yes

External IDs

PubMed	35108200
ORCID	/0000-0002-4590-1908/work/163293968

Keywords

ASJC Scopus subject areas

Keywords

Multimodal attention, Robotic surgery, Surgical data science, Surgical gesture recognition

Library keywords

610 Medicine and health

Research portal of TU Dresden