Word2Vec Embeddings for Categorical Values in Synthetic Tabular Generation

Research output: Contribution to journal › Conference article › Contributed › Peer-review

Abstract

Although more and more generative models for synthetic tabular data exist, not all of them handle numerical and categorical data equally well. Some approaches are limited to numerical data only. To extend the applicability of these methods beyond numerical data, we propose a Word2Vec-inspired approach for converting categorical values into numerical values. We demonstrate on the Census Income dataset that the proposed embeddings are capable of learning semantic relationships for ordinal variables. In general, we observed that the quality of the learned embeddings increases with larger embedding sizes. We trained state-of-the-art CTGAN models on the embedded data and compared them to CTGAN's built-in method for handling categorical data; our proposed method achieved comparable results. We therefore suggest the proposed method as a versatile algorithm that can improve synthetic tabular data generation without the need to change existing architectures.
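
The abstract only sketches the idea at a high level. As a rough, hedged illustration of the general approach it describes (categorical values mapped to learned numeric embeddings so that numeric-only generators can be applied), the snippet below treats each row's categorical values as a Word2Vec "sentence" using gensim. This is not the authors' implementation; the dataset path, column handling, token format, and embedding size are assumptions made purely for illustration.

```python
# Illustrative sketch only, NOT the paper's exact method: learn Word2Vec-style
# embeddings for categorical values and replace the categorical columns with them.
import pandas as pd
from gensim.models import Word2Vec

df = pd.read_csv("adult.csv")  # Census Income (Adult) dataset; path is assumed
cat_cols = df.select_dtypes(include="object").columns

# Treat each row's categorical values as one "sentence"; prefix tokens with the
# column name so identical strings in different columns remain distinct.
sentences = [
    [f"{col}={row[col]}" for col in cat_cols]
    for _, row in df.iterrows()
]

emb_size = 8  # the paper reports embedding quality improving with larger sizes
w2v = Word2Vec(sentences, vector_size=emb_size, window=len(cat_cols),
               min_count=1, sg=1, epochs=20)

# Replace each categorical column with its learned embedding vector, yielding a
# purely numerical table that numeric-only generators (e.g. CTGAN variants) can use.
numeric_parts = [df.drop(columns=cat_cols).reset_index(drop=True)]
for col in cat_cols:
    vecs = df[col].map(lambda v: w2v.wv[f"{col}={v}"])
    numeric_parts.append(
        pd.DataFrame(vecs.tolist(),
                     columns=[f"{col}_emb{i}" for i in range(emb_size)])
    )
df_numeric = pd.concat(numeric_parts, axis=1)
```

The resulting all-numeric table could then be fed to a generative model in place of the raw mixed-type data; how the embeddings are inverted back to categories after generation (e.g. nearest-neighbour lookup in embedding space) is a design choice not specified in the abstract.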

Details

Original language: English
Pages (from-to): 613-622
Number of pages: 10
Journal: International Conference on Computational Science and Computational Intelligence (CSCI)
Publication status: Published - 2022
Peer-reviewed: Yes

Conference

Title: 2022 International Conference on Computational Science and Computational Intelligence, CSCI 2022
Duration: 14 - 16 December 2022
City: Las Vegas
Country: United States of America

External IDs

ORCID /0000-0002-1887-4772/work/164198991
ORCID /0000-0002-9888-8460/work/164199201

Keywords

  • Embeddings, synthetic tabular data generation, Word2Vec