Word2Vec Embeddings for Categorical Values in Synthetic Tabular Generation
Publication: Contribution to journal › Conference article › Contributed › Peer-reviewed
Abstract
Although a growing number of generative models for synthetic tabular data exist, not all of them handle numerical and categorical data equally well. Some approaches are limited to numerical data. To extend the application of these methods beyond numerical data, we propose a Word2Vec-inspired approach for converting categorical values into numerical values. We demonstrate on the Census Income dataset that the proposed embeddings are capable of learning semantic relationships for ordinal variables. In general, we observed that the quality of the learned embeddings increases with larger embedding sizes. We trained state-of-the-art CTGAN models on the embedded data and compared them with CTGAN's built-in method for learning categorical data. Our proposed method achieved comparable results. We therefore suggest our proposed method as a versatile algorithm that can improve synthetic tabular data generation without the need to change existing architectures.
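To make the idea of a Word2Vec-inspired encoding of categorical values concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each table row is treated like a sentence and each `column=value` pair like a word, and it trains a small skip-gram model with negative sampling in plain Python. The function names (`row_to_tokens`, `train_embeddings`, `encode_row`), the hyperparameters, and the toy rows are all hypothetical and chosen only for illustration.

```python
import math
import random


def row_to_tokens(row):
    """Turn one table row (a dict) into 'column=value' tokens, like words in a sentence."""
    return [f"{col}={val}" for col, val in row.items()]


def train_embeddings(rows, dim=8, epochs=50, lr=0.05, negatives=3, seed=0):
    """Skip-gram with negative sampling over rows; returns {token: vector}.

    Assumption: the context of a value is every other value in the same row.
    """
    rng = random.Random(seed)
    sentences = [row_to_tokens(r) for r in rows]
    vocab = sorted({t for s in sentences for t in s})
    index = {t: i for i, t in enumerate(vocab)}
    # Input (target) vectors start small and random; output (context) vectors start at zero.
    w_in = [[rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for _ in vocab]
    w_out = [[0.0] * dim for _ in vocab]

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, x))))

    for _ in range(epochs):
        for sent in sentences:
            ids = [index[t] for t in sent]
            for target in ids:
                for context in ids:
                    if context == target:
                        continue
                    # One positive pair plus a few uniformly sampled negatives.
                    samples = [(context, 1.0)]
                    for _ in range(negatives):
                        neg = rng.randrange(len(vocab))
                        if neg != context:
                            samples.append((neg, 0.0))
                    grad_in = [0.0] * dim
                    for out_id, label in samples:
                        score = sum(a * b for a, b in zip(w_in[target], w_out[out_id]))
                        err = (sigmoid(score) - label) * lr
                        for k in range(dim):
                            grad_in[k] += err * w_out[out_id][k]
                            w_out[out_id][k] -= err * w_in[target][k]
                    for k in range(dim):
                        w_in[target][k] -= grad_in[k]
    return {t: w_in[index[t]] for t in vocab}


def encode_row(row, embeddings):
    """Replace each categorical value by its vector and concatenate the vectors."""
    vec = []
    for tok in row_to_tokens(row):
        vec.extend(embeddings[tok])
    return vec


# Made-up toy rows (not taken from the paper's experiments).
rows = [
    {"education": "Bachelors", "occupation": "Exec-managerial"},
    {"education": "Masters", "occupation": "Exec-managerial"},
    {"education": "HS-grad", "occupation": "Craft-repair"},
    {"education": "Bachelors", "occupation": "Prof-specialty"},
]
emb = train_embeddings(rows, dim=4)
numeric_row = encode_row(rows[0], emb)  # purely numerical vector of length 2 * 4
```

Under these assumptions, the encoded rows contain only numbers and could be passed to a generator that supports numerical columns only. Mapping generated vectors back to categories (for example by nearest-neighbour lookup among the learned vectors) would be an additional step that this sketch does not cover.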
Details
Original language | English |
---|---|
Pages (from-to) | 613-622 |
Number of pages | 10 |
Journal | International Conference on Computational Science and Computational Intelligence (CSCI) |
Publication status | Published - 2022 |
Peer-reviewed | Yes |
Conference
Title | 2022 International Conference on Computational Science and Computational Intelligence, CSCI 2022 |
---|---|
Duration | 14 - 16 December 2022 |
City | Las Vegas |
Country | USA/United States |
External IDs
ORCID | /0000-0002-1887-4772/work/164198991 |
---|---|
ORCID | /0000-0002-9888-8460/work/164199201 |
Keywords
- Embeddings, synthetic tabular data generation, Word2Vec