A Touch, Vision, and Language Dataset for Multimodal Alignment
Research output: Contribution to journal › Conference article › Contributed › peer-review
Abstract
Touch is an important sensing modality for humans, but it has not yet been incorporated into a multimodal generative language model. This is partially due to the difficulty of obtaining natural language labels for tactile data and the complexity of aligning tactile readings with both visual observations and language descriptions. As a step towards bridging that gap, this work introduces a new dataset of 44K in-the-wild vision-touch pairs, with English language labels annotated by humans (10%) and textual pseudo-labels from GPT-4V (90%). We use this dataset to train a vision-language-aligned tactile encoder for open-vocabulary classification and a touch-vision-language (TVL) model for text generation using the trained encoder. Results suggest that by incorporating touch, the TVL model improves (+29% classification accuracy) tactile-vision-language alignment over existing models trained on any pair of those modalities. Although only a small fraction of the dataset is human labeled, the TVL model demonstrates improved visual-tactile understanding over GPT-4V (+12%) and open-source vision-language models (+32%) on a new touch-vision understanding benchmark. Code, checkpoints and data are available on https://tactile-vlm.github.io.
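As an illustration of what open-vocabulary classification with a language-aligned tactile encoder could look like, the sketch below picks the text label whose embedding is most similar to a tactile embedding. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`cosine`, `classify_open_vocab`), the 3-dimensional embeddings, and the label set are all hypothetical, and cosine similarity is assumed as the scoring rule because the abstract does not specify one.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_open_vocab(touch_embedding, label_embeddings):
    """Return the best-matching label and the similarity score of every label.

    Because the labels are ordinary text embeddings, the label set can be
    changed at inference time without retraining (the "open-vocabulary" part).
    """
    scores = {
        label: cosine(touch_embedding, embedding)
        for label, embedding in label_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical embeddings: in practice these would come from a text encoder
# and a tactile encoder trained into the same embedding space.
label_embeddings = {
    "smooth": [1.0, 0.0, 0.0],
    "rough": [0.0, 1.0, 0.0],
    "soft": [0.0, 0.0, 1.0],
}
touch_embedding = [0.1, 0.9, 0.2]  # made-up output of a tactile encoder

best_label, scores = classify_open_vocab(touch_embedding, label_embeddings)
print(best_label)  # prints "rough"
```

Adding a new category in this sketch only requires adding one more entry to `label_embeddings`.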
Details
Original language | English |
---|---|
Pages (from-to) | 14080-14101 |
Number of pages | 22 |
Journal | Proceedings of Machine Learning Research |
Volume | 235 |
Publication status | Published - 2024 |
Peer-reviewed | Yes |
Conference
Title | 41st International Conference on Machine Learning |
---|---|
Abbreviated title | ICML 2024 |
Conference number | 41 |
Duration | 21 - 27 July 2024 |
Location | Messe Wien Congress and Convention Center |
City | Vienna |
Country | Austria |
External IDs
ORCID | /0000-0001-9430-8433/work/173989268 |
---|---|