A Touch, Vision, and Language Dataset for Multimodal Alignment

Publication: Contribution to journal › Conference article › Contributed › Peer-reviewed

Contributors

  • Letian Fu, University of California at Berkeley (Author)
  • Gaurav Datta, University of California at Berkeley (Author)
  • Huang Huang, University of California at Berkeley (Author)
  • William Chung Ho Panitch, University of California at Berkeley (Author)
  • Jaimyn Drake, University of California at Berkeley (Author)
  • Joseph Ortiz, Meta Platforms, Inc. (Author)
  • Mustafa Mukadam, Meta Platforms, Inc. (Author)
  • Mike Lambeta, Meta Platforms, Inc. (Author)
  • Roberto Calandra, Cluster of Excellence CeTI: Centre for Tactile Internet, Chair of Machine Learning for Robotics (CeTi) (Author)
  • Ken Goldberg, University of California at Berkeley (Author)

Abstract

Touch is an important sensing modality for humans, but it has not yet been incorporated into a multimodal generative language model. This is partially due to the difficulty of obtaining natural language labels for tactile data and the complexity of aligning tactile readings with both visual observations and language descriptions. As a step towards bridging that gap, this work introduces a new dataset of 44K in-the-wild vision-touch pairs, with English language labels annotated by humans (10%) and textual pseudo-labels from GPT-4V (90%). We use this dataset to train a vision-language-aligned tactile encoder for open-vocabulary classification and a touch-vision-language (TVL) model for text generation using the trained encoder. Results suggest that by incorporating touch, the TVL model improves (+29% classification accuracy) tactile-vision-language alignment over existing models trained on any pair of those modalities. Although only a small fraction of the dataset is human labeled, the TVL model demonstrates improved visual-tactile understanding over GPT-4V (+12%) and open-source vision-language models (+32%) on a new touch-vision understanding benchmark. Code, checkpoints and data are available at https://tactile-vlm.github.io.
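
The abstract describes aligning a tactile encoder with a pretrained vision-language embedding space and using it for open-vocabulary classification. The sketch below illustrates the zero-shot inference step under the assumption of a CLIP-style shared embedding space; the encoder modules, prompt template, label set, and dimensions are illustrative placeholders, not the authors' released code or API.

# Minimal sketch of open-vocabulary tactile classification in a shared
# embedding space, in the spirit of the TVL tactile encoder described above.
# All module names and dimensions are illustrative stand-ins.
import torch
import torch.nn.functional as F

EMBED_DIM = 512  # assumed shared embedding size

# Stand-in encoder: in the real model this would be the trained tactile
# encoder aligned to a frozen vision-language embedding space.
tactile_encoder = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(3 * 224 * 224, EMBED_DIM),
)

def encode_text(prompts):
    """Placeholder text embedding; a real pipeline would call the
    vision-language model's text tower (e.g., a CLIP-style encoder)."""
    return torch.randn(len(prompts), EMBED_DIM)

# Candidate labels define the "open vocabulary" at inference time.
labels = ["smooth glass", "rough fabric", "soft foam", "cold metal"]
text_emb = F.normalize(encode_text([f"this feels like {l}" for l in labels]), dim=-1)

# A single tactile reading (e.g., a tactile sensor image), batch of 1.
tactile_image = torch.rand(1, 3, 224, 224)
touch_emb = F.normalize(tactile_encoder(tactile_image), dim=-1)

# Cosine similarity against every label embedding; highest score wins.
logits = touch_emb @ text_emb.T          # shape: (1, num_labels)
probs = logits.softmax(dim=-1)
print(labels[probs.argmax(dim=-1).item()], probs.max().item())

Because classification reduces to nearest-neighbor search over text embeddings, the label set can be changed at inference time without retraining, which is what makes the classification open-vocabulary.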

Details

Original language: English
Pages (from - to): 14080-14101
Number of pages: 22
Journal: Proceedings of Machine Learning Research
Volume: 235
Publication status: Published - 2024
Peer-review status: Yes

Conference

Title: 41st International Conference on Machine Learning
Short title: ICML 2024
Event number: 41
Duration: 21 - 27 July 2024
Website:
Location: Messe Wien Congress and Convention Center
City: Vienna
Country: Austria

External IDs

ORCID /0000-0001-9430-8433/work/173989268