NeuralFeels with neural fields: Visuotactile perception for in-hand manipulation

Research output: Contribution to journal > Research article > Contributed > peer-review

Contributors

  • Sudharshan Suresh - Carnegie Mellon University, Meta (Author)
  • Haozhi Qi - Meta, University of California at Berkeley (Author)
  • Tingfan Wu - Meta (Author)
  • Taosha Fan - Meta (Author)
  • Luis Pineda - Meta (Author)
  • Mike Lambeta - Meta (Author)
  • Jitendra Malik - Meta, University of California at Berkeley (Author)
  • Mrinal Kalakrishnan - Meta (Author)
  • Roberto Calandra - Clusters of Excellence CeTI: Centre for Tactile Internet, Chair of Machine Learning for Robotics (CeTi) (Author)
  • Michael Kaess - Carnegie Mellon University (Author)
  • Joseph Ortiz - Meta (Author)
  • Mustafa Mukadam - Meta (Author)

Abstract

To achieve human-level dexterity, robots must infer spatial awareness from multimodal sensing to reason over contact interactions. During in-hand manipulation of novel objects, such spatial awareness involves estimating the object’s pose and shape. The status quo for in-hand perception primarily uses vision and is restricted to tracking a priori known objects. Moreover, visual occlusion of objects in hand is imminent during manipulation, preventing current systems from pushing beyond tasks without occlusion. We combined vision and touch sensing on a multifingered hand to estimate an object’s pose and shape during in-hand manipulation. Our method, NeuralFeels, encodes object geometry by learning a neural field online and jointly tracks it by optimizing a pose graph problem. We studied multimodal in-hand perception in simulation and the real world, interacting with different objects via a proprioception-driven policy. Our experiments showed final reconstruction F scores of 81% and average pose drifts of 4.7 millimeters, which was further reduced to 2.3 millimeters with known object models. In addition, we observed that, under heavy visual occlusion, we could achieve improvements in tracking up to 94% compared with vision-only methods. Our results demonstrate that touch, at the very least, refines and, at the very best, disambiguates visual estimates during in-hand manipulation. We release our evaluation dataset of 70 experiments, FeelSight, as a step toward benchmarking in this domain. Our neural representation driven by multimodal sensing can serve as a perception backbone toward advancing robot dexterity.

Details

Original language: English
Article number: eadl0628
Journal: Science Robotics
Volume: 9
Issue number: 96
Publication status: Published - Nov 2024
Peer-reviewed: Yes

External IDs

PubMed: 39536124
ORCID: /0000-0001-9430-8433/work/173989269