Genetic Algorithm for Subset Selection in Synthetic Tabular Data
Research output: Contribution to journal › Conference article › Contributed › peer-review
Abstract
Subset selection has been widely studied but remains underexplored for synthetic tabular data, particularly in data-sharing contexts that require high-quality data. While generative models can produce large volumes of synthetic data, directly sampling and releasing such data risks including low-quality or unrepresentative samples, which can reduce data utility. An alternative approach is to generate more data than needed and subsequently select a subset that better meets specific quality criteria. This paper introduces a genetic algorithm (GA)-based method for optimizing such subset selection. The proposed GA is independent of any specific fitness function, enabling adaptation to diverse evaluation metrics, their combinations, or varying use-case requirements. We benchmarked the method on five medical datasets, each synthesized by multiple generative architectures, and consistently found that the GA-selected subsets outperformed both the initial synthetic datasets and a random subset selection baseline. Notably, initializing the GA with systematically generated synthetic subsets yielded nearly twice the improvement over the baselines that random initialization did, emphasizing the importance of more informed starting solutions. The proposed GA-based method proved especially beneficial for smaller datasets, which are frequently encountered in clinical domains such as rare disease research. While performance gains diminished for larger datasets due to combinatorial complexity, this work highlights the potential of GA-driven optimization as a foundation for future research into scalable and adaptive subset selection methods for synthetic data sharing.
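The idea described in the abstract (oversample, then search for a better subset under a pluggable fitness function) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `ga_select_subset`, its parameters, the operators chosen here (union-based crossover, swap mutation, tournament selection, single-elite carry-over), and the toy fitness function are all assumptions made for illustration. The `initial_population` argument stands in for the informed initialization the abstract mentions.

```python
import random
from typing import Callable, List, Optional, Sequence

Subset = List[int]  # row indices into the oversampled synthetic pool


def ga_select_subset(
    pool_size: int,
    subset_size: int,
    fitness: Callable[[Subset], float],  # higher is better; any metric can be plugged in
    population_size: int = 30,
    generations: int = 50,
    mutation_rate: float = 0.1,
    initial_population: Optional[Sequence[Subset]] = None,
    seed: int = 0,
) -> Subset:
    """Hypothetical GA sketch: evolve fixed-size index subsets of a data pool."""
    rng = random.Random(seed)
    all_rows = range(pool_size)

    # Start from caller-provided subsets (assumed to hold subset_size distinct
    # indices each), then fill the rest of the population at random.
    population = [sorted(s) for s in (initial_population or [])][:population_size]
    while len(population) < population_size:
        population.append(sorted(rng.sample(all_rows, subset_size)))

    def crossover(a: Subset, b: Subset) -> Subset:
        # The child draws its rows, without duplicates, from the union of both parents.
        return rng.sample(sorted(set(a) | set(b)), subset_size)

    def mutate(child: Subset) -> Subset:
        # Swap each position, with probability mutation_rate, for a row not in the subset.
        chosen = set(child)
        unused = [i for i in all_rows if i not in chosen]
        for pos in range(len(child)):
            if unused and rng.random() < mutation_rate:
                swap_in = unused.pop(rng.randrange(len(unused)))
                unused.append(child[pos])
                child[pos] = swap_in
        return sorted(child)

    def tournament(scored) -> Subset:
        # Pick the fittest of up to three randomly drawn individuals.
        contenders = rng.sample(scored, min(3, len(scored)))
        return max(contenders, key=lambda pair: pair[0])[1]

    for _ in range(generations):
        scored = [(fitness(s), s) for s in population]
        elite = max(scored, key=lambda pair: pair[0])[1]
        next_population = [list(elite)]  # elitism: the best subset always survives
        while len(next_population) < population_size:
            child = crossover(tournament(scored), tournament(scored))
            next_population.append(mutate(child))
        population = next_population

    return max(population, key=fitness)


# Toy usage: a pool of 200 scalar "rows" and a made-up fitness that rewards
# subsets whose mean is close to 0.0. A real use case would substitute a
# utility or fidelity metric computed on the selected synthetic rows.
data_rng = random.Random(1)
pool = [data_rng.gauss(0.0, 1.0) for _ in range(200)]


def toy_fitness(subset: Subset) -> float:
    return -abs(sum(pool[i] for i in subset) / len(subset))


baseline = sorted(data_rng.sample(range(len(pool)), 20))  # random-subset baseline
selected = ga_select_subset(len(pool), 20, toy_fitness, initial_population=[baseline])
print(toy_fitness(baseline), toy_fitness(selected))
```

Because the fitness function is passed in as a plain callable, swapping metrics or combining several into a weighted score requires no change to the search loop. In this sketch, carrying the elite forward means the returned subset never scores worse than the best subset in the initial population, assuming a deterministic fitness function.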
Details
| Original language | English |
|---|---|
| Pages (from-to) | 1-6 |
| Journal | IEEE Access |
| Publication status | Published - 2025 |
| Peer-reviewed | Yes |
Conference
| Title | 2nd International Conference on Artificial Intelligence, Computer, Data Sciences, and Applications |
|---|---|
| Abbreviated title | ACDSA 2025 |
| Conference number | 2 |
| Duration | 7 - 9 August 2025 |
| Location | Antalya Bilim University |
| City | Antalya |
| Country | Turkey |
External IDs
| ORCID | /0000-0002-1887-4772/work/196688956 |
|---|---|
| ORCID | /0000-0002-9888-8460/work/196691456 |
Keywords
- genetic algorithm, subset selection, synthetic data, tabular data