TAFT: A Transformer-Based Approach for Format Transformation

Publication: Contribution to book/conference proceedings/anthology/report › Contribution to conference proceedings › Contributed › Peer-reviewed

Abstract

Heterogeneous data formats within data lakes make it difficult to analyze or further process the data. While data cleaning tools can remove heterogeneities within individual documents, they fail to address global format heterogeneities across multiple documents. For example, two documents may each store addresses in an internally consistent format, so neither is a target for existing data cleaning tools. However, these two formats may still differ from each other, which constitutes a global format heterogeneity. To close this gap, we present the framework TAFT (A Transformer-based Approach for Format Transformation), designed to remove these global format heterogeneities at scale, without a human in the loop. To this end, we leverage a transformer-based model to convert the document columns into a uniform format based on types describing their content, such as Address or Name. With minimal configuration effort, we achieve state-of-the-art results without any further human intervention.
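The distinction between local and global format heterogeneity can be illustrated with a small sketch. The code below is not TAFT and does not use a transformer model; it is a hypothetical illustration in which the helper names (`format_signature`, `column_signatures`) and the sample addresses are assumptions made for this example. It reduces each value to a coarse pattern and shows that two address columns can each be internally consistent while still differing from each other.

```python
import re

def format_signature(value: str) -> str:
    """Reduce a value to a coarse pattern: digit runs become 9, letter runs become A."""
    sig = re.sub(r"\d+", "9", value)
    # [^\W\d_] matches letters only (word characters minus digits and underscore).
    sig = re.sub(r"[^\W\d_]+", "A", sig)
    return sig

def column_signatures(column):
    """Return the set of distinct format patterns found in a column."""
    return {format_signature(v) for v in column}

# Hypothetical address columns from two documents (illustrative data only).
doc_a = ["12 Corso Italia, 56125 Pisa", "7 Via Roma, 50123 Firenze"]
doc_b = ["Pisa 56125; Corso Italia 12", "Firenze 50123; Via Roma 7"]

sig_a = column_signatures(doc_a)
sig_b = column_signatures(doc_b)

# One pattern per column: no heterogeneity inside either document.
locally_consistent = len(sig_a) == 1 and len(sig_b) == 1
# Different patterns across columns: a global format heterogeneity.
globally_heterogeneous = sig_a != sig_b

print("locally consistent:", locally_consistent)
print("globally heterogeneous:", globally_heterogeneous)
```

A per-document cleaning tool would find nothing to fix in either column here. The mismatch only becomes visible when the two columns are compared, which is the situation the abstract describes.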

Details

Original language: English
Title: DS Late Breaking Contributions 2024
Editors: Francesca Naretto, Roberto Pellungrini
Number of pages: 4
Volume: 3928
Publication status: Published - 2024
Peer-review status: Yes

Publication series

Series: CEUR Workshop Proceedings
Volume: 3928
ISSN: 1613-0073

Conference

Title: 27th International Conference on Discovery Science
Short title: DS 2024
Event number: 27
Duration: 14 - 16 October 2024
Location: University of Pisa
City: Pisa
Country: Italy

External IDs

ORCID /0000-0001-8107-2775/work/180371895
ORCID /0000-0002-5985-4348/work/180372254
Scopus 86000509016

Keywords

  • data preparation, format transformation, heterogeneity