Genetic Algorithm for Subset Selection in Synthetic Tabular Data
Publication: Contribution to journal › Conference article › Contributed › Peer-reviewed
Contributors
Abstract
Subset selection has been widely studied but remains underexplored for synthetic tabular data, particularly in data-sharing contexts that require high-quality data. While generative models can produce large volumes of synthetic data, directly sampling and releasing such data risks including low-quality or unrepresentative samples, which can reduce data utility. An alternative approach is to generate more data than needed and subsequently select a subset that better meets specific quality criteria. This paper introduces a genetic algorithm (GA)-based method for optimizing such subset selection. The proposed GA is independent of any specific fitness function, enabling adaptation to diverse evaluation metrics, their combinations, or varying use-case requirements. We benchmarked the method on five medical datasets, each synthesized by multiple generative architectures, and consistently found that the GA-selected subsets outperformed both the initial synthetic datasets and a random subset selection baseline. Notably, initializing the GA with systematically generated synthetic subsets led to nearly twice the improvement over the baselines compared to random initialization, emphasizing the importance of more informed starting solutions. The proposed GA-based method proved especially beneficial for smaller datasets, which are frequently encountered in clinical domains such as rare disease research. While performance gains diminished for larger datasets due to combinatorial complexity, this work highlights the potential of GA-driven optimization as a foundation for future research into scalable and adaptive subset selection methods for synthetic data sharing.
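The idea of oversampling and then selecting a subset with a fitness-agnostic GA can be illustrated with a minimal sketch. This is not the authors' implementation: the function `ga_select_subset`, its parameters, the union-based crossover, the swap mutation, the elitist selection scheme, and the toy `mean_gap` fitness are all assumptions chosen for illustration. The optional `initial_population` argument is a hypothetical hook for seeding the search with informed starting subsets instead of purely random ones.

```python
import random
from typing import Callable, List, Sequence

Subset = List[int]  # sorted row indices into the oversampled synthetic pool


def ga_select_subset(
    pool_size: int,
    subset_size: int,
    fitness: Callable[[Subset], float],
    population_size: int = 30,   # assumed >= 2
    generations: int = 50,
    mutation_rate: float = 0.2,
    seed: int = 0,
    initial_population: Sequence[Subset] = (),
) -> Subset:
    """Return the best subset of row indices found (higher fitness is better)."""
    rng = random.Random(seed)

    def random_subset() -> Subset:
        return sorted(rng.sample(range(pool_size), subset_size))

    def crossover(a: Subset, b: Subset) -> Subset:
        # The child draws from the union of both parents, so it always
        # contains exactly subset_size distinct rows.
        union = sorted(set(a) | set(b))
        return sorted(rng.sample(union, subset_size))

    def mutate(s: Subset) -> Subset:
        # With probability mutation_rate, swap one selected row
        # for one currently unselected row.
        if rng.random() >= mutation_rate or subset_size == pool_size:
            return s
        chosen = set(s)
        outside = [i for i in range(pool_size) if i not in chosen]
        chosen.remove(rng.choice(s))
        chosen.add(rng.choice(outside))
        return sorted(chosen)

    # Start from caller-provided subsets (if any), then fill up randomly.
    population = [sorted(s) for s in initial_population][:population_size]
    while len(population) < population_size:
        population.append(random_subset())

    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        elite = ranked[: max(2, population_size // 5)]  # keep the best fifth
        children = []
        while len(elite) + len(children) < population_size:
            p1, p2 = rng.sample(elite, 2)
            children.append(mutate(crossover(p1, p2)))
        population = elite + children

    return max(population, key=fitness)


# Toy stand-in for a synthetic pool: one numeric column with 100 rows.
pool = [float(i) for i in range(100)]
target_mean = 20.0  # hypothetical statistic of the real data


def mean_gap(subset: Subset) -> float:
    # Negative distance between the subset mean and the target mean.
    return -abs(sum(pool[i] for i in subset) / len(subset) - target_mean)


seed_subset = list(range(10))  # a deliberately poor starting subset
best = ga_select_subset(len(pool), 10, mean_gap, initial_population=[seed_subset])
```

Because the fitness is passed in as a plain callable, `mean_gap` could be swapped for any other utility or fidelity score, or a weighted combination of several, without touching the search loop. Since the elite is carried over unchanged in every generation, the returned subset is never worse than the best member of the starting population under a deterministic fitness.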
Details
| Original language | English |
|---|---|
| Pages (from - to) | 1-6 |
| Journal | IEEE Access |
| Publication status | Published - 2025 |
| Peer-review status | Yes |
Conference
| Title | 2nd International Conference on Artificial Intelligence, Computer, Data Sciences, and Applications |
|---|---|
| Short title | ACDSA 2025 |
| Event number | 2 |
| Duration | 7 - 9 August 2025 |
| Website | |
| Location | Antalya Bilim University |
| City | Antalya |
| Country | Turkey |
External IDs
| ORCID | /0000-0002-1887-4772/work/196688956 |
|---|---|
| ORCID | /0000-0002-9888-8460/work/196691456 |
Keywords
- genetic algorithm, subset selection, synthetic data, tabular data