Word2Vec Embeddings for Categorical Values in Synthetic Tabular Generation
Research output: Contribution to journal › Conference article › Contributed › peer-review
Contributors
Abstract
Although a growing number of generative models for synthetic tabular data exist, not all of them handle numerical and categorical data equally well, and some approaches are limited to numerical data. To extend the application of these methods beyond numerical data, we propose a Word2Vec-inspired approach for converting categorical values into numerical values. We demonstrate on the Census Income dataset that the proposed embeddings are capable of learning semantic relationships for ordinal variables. In general, we observed that the quality of the learned embeddings increases with larger embedding sizes. We trained state-of-the-art CTGAN models on this data and compared them with CTGAN's built-in method for learning categorical data. Our proposed method achieved comparable results. We therefore suggest our proposed method as a versatile algorithm that can improve synthetic tabular data generation without the need to change existing architectures.
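The abstract does not spell out the training procedure, so the following is only a minimal sketch of what a Word2Vec-inspired encoding of categorical values could look like, not the authors' implementation. It assumes that each table row is treated as a "sentence" whose "words" are `column=value` tokens, and that a small skip-gram model with a full softmax is trained so that each value predicts the other values in its row. The function names (`build_vocab`, `train_embeddings`, `encode_row`), the toy rows, and all hyperparameters are hypothetical and chosen for illustration.

```python
import numpy as np

def build_vocab(rows):
    """Map each 'column=value' token to an integer id."""
    tokens = sorted({f"{col}={val}" for row in rows for col, val in row.items()})
    return {tok: i for i, tok in enumerate(tokens)}

def train_embeddings(rows, dim=4, epochs=200, lr=0.1, seed=0):
    """Skip-gram with full softmax: each value predicts the other values in its row.

    Assumption: a row plays the role of a Word2Vec sentence.
    """
    vocab = build_vocab(rows)
    rng = np.random.default_rng(seed)
    w_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # embeddings we keep
    w_out = rng.normal(scale=0.1, size=(len(vocab), dim))  # context weights

    # Build (center, context) id pairs from every row.
    pairs = []
    for row in rows:
        ids = [vocab[f"{c}={v}"] for c, v in row.items()]
        pairs += [(a, b) for a in ids for b in ids if a != b]

    for _ in range(epochs):
        for center, context in pairs:
            h = w_in[center].copy()
            scores = w_out @ h
            probs = np.exp(scores - scores.max())
            probs /= probs.sum()
            probs[context] -= 1.0           # gradient of cross-entropy w.r.t. scores
            grad_in = w_out.T @ probs       # computed before w_out is updated
            w_out -= lr * np.outer(probs, h)
            w_in[center] -= lr * grad_in
    return vocab, w_in

def encode_row(row, vocab, emb):
    """Replace each categorical value by its embedding and concatenate."""
    return np.concatenate([emb[vocab[f"{c}={v}"]] for c, v in row.items()])

# Toy rows loosely modelled on Census Income columns (illustrative data only).
rows = [
    {"education": "Bachelors", "income": "<=50K"},
    {"education": "Masters", "income": ">50K"},
    {"education": "Doctorate", "income": ">50K"},
    {"education": "Bachelors", "income": "<=50K"},
]
vocab, emb = train_embeddings(rows)
numeric_row = encode_row(rows[1], vocab, emb)
```

Under these assumptions, `numeric_row` is a purely numerical vector that a numerical-only generator could consume. Mapping generated vectors back to categories (for example by nearest-neighbour lookup in `emb`) would be a separate step that this sketch leaves out.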
Details
Original language | English |
---|---|
Pages (from-to) | 613-622 |
Number of pages | 10 |
Journal | International Conference on Computational Science and Computational Intelligence (CSCI) |
Publication status | Published - 2022 |
Peer-reviewed | Yes |
Conference
Title | 2022 International Conference on Computational Science and Computational Intelligence, CSCI 2022 |
---|---|
Duration | 14 - 16 December 2022 |
City | Las Vegas |
Country | United States of America |
External IDs
ORCID | /0000-0002-1887-4772/work/164198991 |
---|---|
ORCID | /0000-0002-9888-8460/work/164199201 |
Keywords
- Embeddings, synthetic tabular data generation, Word2Vec