TAFT: A Transformer-Based Approach for Format Transformation
Publication: Contribution to book/conference proceedings/anthology/report › Contribution to conference proceedings › Contributed › Peer-reviewed
Contributors
Abstract
Heterogeneous data formats within data lakes make it difficult to analyze or further process the data. While data cleaning tools can remove heterogeneities within individual documents, they fail to address global format heterogeneities across multiple documents. For example, two documents may each store addresses in an internally consistent format, so neither is a target for existing data cleaning tools. However, these consistent formats may still differ from each other, which constitutes a global format heterogeneity. To close this gap, we present the framework TAFT (A Transformer-based Approach for Format Transformation), designed to remove these global format heterogeneities at scale, without human-in-the-loop involvement. To this end, we leverage a transformer-based model to convert document columns into a uniform format based on types describing their content, such as Address or Name. With minimal configuration effort, we achieve state-of-the-art results without any further human intervention.
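To make the notion of a global format heterogeneity concrete, the following minimal sketch checks whether two columns are each internally consistent yet differ from each other in format. It is not TAFT's method, which relies on a transformer-based model; the helper names `shape`, `dominant_shape`, and `globally_heterogeneous`, the sample name columns, and the 0.9 consistency threshold are all illustrative assumptions.

```python
import re
from collections import Counter


def shape(value: str) -> str:
    """Reduce a string to a coarse pattern: letter runs -> 'A', digit runs -> '9'."""
    pattern = re.sub(r"[A-Za-z]+", "A", value)
    return re.sub(r"[0-9]+", "9", pattern)


def dominant_shape(column):
    """Return the most common pattern of a non-empty column and its share of values."""
    counts = Counter(shape(v) for v in column)
    pattern, n = counts.most_common(1)[0]
    return pattern, n / len(column)


def globally_heterogeneous(col_a, col_b, threshold=0.9):
    """True if both columns are internally consistent but use different patterns.

    The threshold is an assumed cut-off for "internally consistent".
    """
    pat_a, share_a = dominant_shape(col_a)
    pat_b, share_b = dominant_shape(col_b)
    return share_a >= threshold and share_b >= threshold and pat_a != pat_b


# Hypothetical Name columns from two documents, each consistent on its own.
doc_a = ["Ada Lovelace", "Alan Turing", "Grace Hopper"]
doc_b = ["Lovelace, Ada", "Turing, Alan", "Hopper, Grace"]

print(dominant_shape(doc_a))
print(dominant_shape(doc_b))
print(globally_heterogeneous(doc_a, doc_b))
```

A per-document cleaning tool would find nothing to repair in either column, since each one follows a single pattern. The mismatch only becomes visible when the two columns are compared, which is the gap the abstract describes.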
Details
| Original language | English |
|---|---|
| Title | DS Late Breaking Contributions 2024 |
| Editors | Francesca Naretto, Roberto Pellungrini |
| Number of pages | 4 |
| Volume | 3928 |
| Publication status | Published - 2024 |
| Peer-reviewed | Yes |
Publication series
| Series | CEUR Workshop Proceedings |
|---|---|
| Volume | 3928 |
| ISSN | 1613-0073 |
Conference
| Title | 27th International Conference on Discovery Science |
|---|---|
| Short title | DS 2024 |
| Event number | 27 |
| Duration | 14 - 16 October 2024 |
| Website | |
| Venue | University of Pisa |
| City | Pisa |
| Country | Italy |
External IDs
| ORCID | /0000-0001-8107-2775/work/180371895 |
|---|---|
| ORCID | /0000-0002-5985-4348/work/180372254 |
| Scopus | 86000509016 |
Subject terms
ASJC Scopus subject areas
Keywords
- data preparation, format transformation, heterogeneity