A Touch, Vision, and Language Dataset for Multimodal Alignment

Letian Fu; Gaurav Datta; Huang Huang; William Chung Ho Panitch; Jaimyn Drake; Joseph Ortiz; Mustafa Mukadam; Mike Lambeta; Roberto Calandra; Ken Goldberg

Research output: Contribution to journal › Conference article › Contributed › peer-review

Contributors

Letian Fu, University of California at Berkeley (Author)
Gaurav Datta, University of California at Berkeley (Author)
Huang Huang, University of California at Berkeley (Author)
William Chung Ho Panitch, University of California at Berkeley (Author)
Jaimyn Drake, University of California at Berkeley (Author)
Joseph Ortiz, Meta Platforms, Inc. (Author)
Mustafa Mukadam, Meta Platforms, Inc. (Author)
Mike Lambeta, Meta Platforms, Inc. (Author)
Roberto Calandra, Clusters of Excellence CeTI: Centre for Tactile Internet, Chair of Machine Learning for Robotics (CeTi) (Author)
Ken Goldberg, University of California at Berkeley (Author)

Abstract

Touch is an important sensing modality for humans, but it has not yet been incorporated into a multimodal generative language model. This is partially due to the difficulty of obtaining natural language labels for tactile data and the complexity of aligning tactile readings with both visual observations and language descriptions. As a step towards bridging that gap, this work introduces a new dataset of 44K in-the-wild vision-touch pairs, with English language labels annotated by humans (10%) and textual pseudo-labels from GPT-4V (90%). We use this dataset to train a vision-language-aligned tactile encoder for open-vocabulary classification and a touch-vision-language (TVL) model for text generation using the trained encoder. Results suggest that by incorporating touch, the TVL model improves (+29% classification accuracy) tactile-vision-language alignment over existing models trained on any pair of those modalities. Although only a small fraction of the dataset is human labeled, the TVL model demonstrates improved visual-tactile understanding over GPT-4V (+12%) and open-source vision-language models (+32%) on a new touch-vision understanding benchmark. Code, checkpoints and data are available on https://tactile-vlm.github.io.

Details

Original language	English
Pages (from-to)	14080-14101
Number of pages	22
Journal	Proceedings of Machine Learning Research
Volume	235
Publication status	Published - 2024
Peer-reviewed	Yes

Conference

Title	41st International Conference on Machine Learning
Abbreviated title	ICML 2024
Conference number	41
Duration	21 - 27 July 2024
Website	https://icml.cc/Conferences/2024
Location	Messe Wien Congress and Convention Center
City	Vienna
Country	Austria

External IDs

ORCID	/0000-0001-9430-8433/work/173989268
