TAFT: A Transformer-Based Approach for Format Transformation

Research output: Contribution to book/Conference proceedings/Anthology/Report, Conference contribution, Contributed, peer-review

Abstract

Heterogeneous data formats within data lakes make it difficult to analyze or further process the data. While data cleaning tools can remove heterogeneities within individual documents, they fail to address global format heterogeneities across multiple documents. For example, two documents may each store addresses in an internally consistent format, so neither is a target for existing data cleaning tools. However, the two formats may still differ from each other, which constitutes a global format heterogeneity. To close this gap, we present the framework TAFT (A Transformer-based Approach for Format Transformation), designed to remove these global format heterogeneities at scale, without human-in-the-loop involvement. To this end, we leverage a transformer-based model to convert the document columns into a uniform format based on types describing their content, such as Address or Name. With minimal configuration effort, we achieve state-of-the-art results without any further human intervention.
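The notion of a global format heterogeneity can be illustrated with a small sketch. This is not the TAFT implementation: the rule-based `normalize_name` function below is a hypothetical stand-in for the transformer-based model, and the names `FORMATTERS` and `unify_column`, as well as the sample data, are assumptions made purely for illustration. Each document is internally consistent, yet the two differ from each other until both are converted according to the column type.

```python
from typing import Callable, Dict, List


def normalize_name(value: str) -> str:
    """Hypothetical stand-in for the model: rewrite 'Last, First' as 'First Last'."""
    if "," in value:
        last, first = (part.strip() for part in value.split(",", 1))
        return f"{first} {last}"
    # Already in 'First Last' form: only collapse stray whitespace.
    return " ".join(value.split())


# Assumed mapping from a column's content type to its target-format converter.
FORMATTERS: Dict[str, Callable[[str], str]] = {"Name": normalize_name}


def unify_column(values: List[str], column_type: str) -> List[str]:
    """Convert every value of a column to the uniform format of its type."""
    formatter = FORMATTERS.get(column_type)
    if formatter is None:
        # Unknown type: return the column unchanged.
        return list(values)
    return [formatter(v) for v in values]


# Two documents, each consistent on its own, but in different formats.
doc_a = ["Ada Lovelace", "Alan Turing"]
doc_b = ["Hopper, Grace", "Knuth, Donald"]

unified_a = unify_column(doc_a, "Name")
unified_b = unify_column(doc_b, "Name")
```

A per-document cleaning tool would find nothing to fix in either `doc_a` or `doc_b`; the mismatch only becomes visible, and fixable, when both columns are mapped to one target format for the shared type.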

Details

Original language: English
Title of host publication: DS Late Breaking Contributions 2024
Editors: Francesca Naretto, Roberto Pellungrini
Number of pages: 4
Volume: 3928
Publication status: Published - 2024
Peer-reviewed: Yes

Publication series

Series: CEUR Workshop Proceedings
Volume: 3928
ISSN: 1613-0073

Conference

Title: 27th International Conference on Discovery Science
Abbreviated title: DS 2024
Conference number: 27
Duration: 14 - 16 October 2024
Location: University of Pisa
City: Pisa
Country: Italy

External IDs

ORCID /0000-0001-8107-2775/work/180371895
ORCID /0000-0002-5985-4348/work/180372254
Scopus 86000509016

Keywords

  • data preparation, format transformation, heterogeneity