TAFT: A Transformer-Based Approach for Format Transformation
Research output: Contribution to book/Conference proceedings/Anthology/Report › Conference contribution › Contributed › peer-review
Abstract
Heterogeneous data formats within data lakes make it difficult to analyze or further process the data. While data cleaning tools can remove heterogeneities within individual documents, they fail to address global format heterogeneities across multiple documents. For example, two documents may each store addresses in an internally consistent format, so neither is flagged by existing data cleaning tools. However, the two formats may still differ from each other, which constitutes a global format heterogeneity. To close this gap, we present the framework TAFT (A Transformer-based Approach for Format Transformation), designed to remove these global format heterogeneities at scale without human-in-the-loop involvement. To this end, we leverage a transformer-based model to convert document columns into a uniform format based on types that describe their content, such as Address or Name. With minimal configuration effort, we achieve state-of-the-art results without any further human intervention.
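The notion of a global format heterogeneity can be illustrated with a minimal sketch. It is not TAFT's transformer-based method: it only shows the problem the abstract describes, using a simple regex-based format signature. The function names (`format_signature`, `column_signatures`) and the sample address values are hypothetical and chosen for illustration only.

```python
import re


def format_signature(value: str) -> str:
    """Reduce a string to a coarse pattern: letter runs -> 'A', digit runs -> '9'."""
    signature = re.sub(r"[A-Za-z]+", "A", value)
    signature = re.sub(r"\d+", "9", signature)
    return signature


def column_signatures(column: list[str]) -> set[str]:
    """Collect the distinct format signatures that occur in one column."""
    return {format_signature(value) for value in column}


# Hypothetical address columns taken from two different documents.
doc_a = ["12 Main Street, 01069 Dresden", "7 Park Road, 56126 Pisa"]
doc_b = ["Dresden 01069; Main Street 12", "Pisa 56126; Park Road 7"]

sig_a = column_signatures(doc_a)
sig_b = column_signatures(doc_b)

# Each column uses exactly one format, so a per-document cleaner sees nothing to fix.
locally_consistent = len(sig_a) == 1 and len(sig_b) == 1

# The two formats differ from each other: a global format heterogeneity.
globally_heterogeneous = sig_a != sig_b

print(sig_a, sig_b, locally_consistent, globally_heterogeneous)
```

Under these assumptions, both columns pass a per-document consistency check, yet their signatures differ. This is the cross-document case that the abstract says existing data cleaning tools do not target.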
Details
| Original language | English |
|---|---|
| Title of host publication | DS Late Breaking Contributions 2024 |
| Editors | Francesca Naretto, Roberto Pellungrini |
| Number of pages | 4 |
| Volume | 3928 |
| Publication status | Published - 2024 |
| Peer-reviewed | Yes |
Publication series
| Series | CEUR Workshop Proceedings |
|---|---|
| Volume | 3928 |
| ISSN | 1613-0073 |
Conference
| Title | 27th International Conference on Discovery Science |
|---|---|
| Abbreviated title | DS 2024 |
| Conference number | 27 |
| Duration | 14 - 16 October 2024 |
| Location | University of Pisa |
| City | Pisa |
| Country | Italy |
External IDs
| ORCID | /0000-0001-8107-2775/work/180371895 |
|---|---|
| ORCID | /0000-0002-5985-4348/work/180372254 |
| Scopus | 86000509016 |
Keywords
- data preparation, format transformation, heterogeneity