Word2Vec Embeddings for Categorical Values in Synthetic Tabular Generation

Publication: Contribution to journal, Conference article, Contributed, Peer-reviewed

Abstract

Although a growing number of generative models for synthetic tabular data exist, not all of them handle numerical and categorical data equally well, and some approaches are limited to numerical data. To extend the application of these methods beyond numerical data, we propose a Word2Vec-inspired approach for converting categorical values into numerical values. We demonstrate on the Census Income dataset that the proposed embeddings are capable of learning semantic relationships for ordinal variables. In general, we observed that the quality of the learned embeddings increases with larger embedding sizes. We trained state-of-the-art CTGAN models on this data and compared them to CTGAN's built-in method for learning categorical data. Our proposed method achieved comparable results. We therefore suggest our proposed method as a versatile algorithm that can improve synthetic tabular data generation without the need to change existing architectures.
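
The abstract does not spell out the training procedure, so the following is only a minimal sketch of one way a Word2Vec-style embedding of categorical values could look, not the authors' implementation. It assumes that each table row is treated as a "sentence" of `column=value` tokens, that every other token in the same row counts as context, and that a plain skip-gram model with a full softmax is trained by SGD. The function names, hyperparameters, and the toy rows (loosely named after Census Income columns) are all hypothetical.

```python
import math
import random


def row_to_tokens(row):
    """Turn one table row (dict of column -> category) into 'column=value' tokens."""
    return [f"{col}={val}" for col, val in row.items()]


def train_embeddings(rows, dim=3, epochs=50, lr=0.05, seed=0):
    """Train skip-gram embeddings with a full softmax (assumed setup, see lead-in)."""
    rng = random.Random(seed)
    sentences = [row_to_tokens(r) for r in rows]
    vocab = sorted({tok for s in sentences for tok in s})
    index = {tok: i for i, tok in enumerate(vocab)}
    n = len(vocab)

    # Input weights are the embeddings; output weights score context tokens.
    w_in = [[rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for _ in range(n)]
    w_out = [[0.0] * dim for _ in range(n)]

    # Assumption: all other tokens in the same row form the context window.
    pairs = [(index[c], index[o]) for s in sentences for c in s for o in s if c != o]

    for _ in range(epochs):
        rng.shuffle(pairs)
        for center, context in pairs:
            v = w_in[center]
            scores = [sum(a * b for a, b in zip(v, w_out[j])) for j in range(n)]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            grad_v = [0.0] * dim
            for j in range(n):
                # Softmax cross-entropy error for candidate context token j.
                err = exps[j] / total - (1.0 if j == context else 0.0)
                for k in range(dim):
                    grad_v[k] += err * w_out[j][k]  # uses the pre-update weight
                    w_out[j][k] -= lr * err * v[k]
            for k in range(dim):
                v[k] -= lr * grad_v[k]

    return {tok: w_in[index[tok]] for tok in vocab}


def encode_row(row, embeddings):
    """Replace each categorical value by its embedding and concatenate the vectors."""
    vec = []
    for tok in row_to_tokens(row):
        vec.extend(embeddings[tok])
    return vec


# Hypothetical toy rows, for illustration only.
rows = [
    {"education": "Bachelors", "occupation": "Tech-support"},
    {"education": "Masters", "occupation": "Prof-specialty"},
    {"education": "Bachelors", "occupation": "Prof-specialty"},
    {"education": "HS-grad", "occupation": "Craft-repair"},
]
emb = train_embeddings(rows)
encoded = encode_row(rows[0], emb)
```

Under these assumptions, `encoded` is a purely numerical vector that a generator restricted to numerical inputs could consume. Mapping generated vectors back to categories (for example by nearest-neighbour lookup among the embeddings) would be a separate step and is not shown here.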

Details

Original language: English
Pages (from - to): 613-622
Number of pages: 10
Journal: International Conference on Computational Science and Computational Intelligence (CSCI)
Publication status: Published - 2022
Peer-review status: Yes

Conference

Title: 2022 International Conference on Computational Science and Computational Intelligence, CSCI 2022
Duration: 14 - 16 December 2022
City: Las Vegas
Country: USA/United States

External IDs

ORCID /0000-0002-1887-4772/work/164198991
ORCID /0000-0002-9888-8460/work/164199201

Keywords

  • Embeddings, synthetic tabular data generation, Word2Vec