Genetic Algorithm for Subset Selection in Synthetic Tabular Data

Publication: Contribution to journal, Conference article, Contributed, Peer-reviewed

Abstract

Subset selection has been widely studied but remains underexplored for synthetic tabular data, particularly in data-sharing contexts that require high-quality data. While generative models can produce large volumes of synthetic data, directly sampling and releasing such data risks including low-quality or unrepresentative samples, which can reduce data utility. An alternative approach is to generate more data than needed and subsequently select a subset that better meets specific quality criteria. This paper introduces a genetic algorithm (GA)-based method for optimizing such subset selection. The proposed GA is independent of any specific fitness function, enabling adaptation to diverse evaluation metrics, their combinations, or varying use-case requirements. We benchmarked the method on five medical datasets, each synthesized by multiple generative architectures, and consistently found that the GA-selected subsets outperformed both the initial synthetic datasets and a random subset selection baseline. Notably, initializing the GA with systematically generated synthetic subsets led to nearly twice the improvement over the baselines compared to random initialization, emphasizing the importance of more informed starting solutions. The proposed GA-based method proved especially beneficial for smaller datasets, which are frequently encountered in clinical domains such as rare disease research. While performance gains diminished for larger datasets due to combinatorial complexity, this work highlights the potential of GA-driven optimization as a foundation for future research into scalable and adaptive subset selection methods for synthetic data sharing.
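The oversample-then-select idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the operators (union-based crossover, swap mutation, tournament selection, two-individual elitism), all function and parameter names, and the toy fitness function are assumptions chosen for illustration. The sketch only mirrors two properties the abstract states: the fitness function is a pluggable callable, and the initial population can optionally be seeded with pre-built subsets instead of purely random ones. It assumes higher fitness is better.

```python
import random
from typing import Callable, List, Optional, Sequence

Subset = List[int]  # sorted row indices into the oversampled synthetic dataset


def crossover(a: Subset, b: Subset, k: int, rng: random.Random) -> Subset:
    """Draw k distinct indices from the union of two parent subsets."""
    pool = sorted(set(a) | set(b))
    return sorted(rng.sample(pool, k))


def mutate(s: Subset, n_rows: int, rate: float, rng: random.Random) -> Subset:
    """Swap each selected index for an unselected one with probability `rate`."""
    chosen = set(s)
    for idx in list(s):
        if rng.random() < rate and len(chosen) < n_rows:
            new = rng.randrange(n_rows)
            while new in chosen:
                new = rng.randrange(n_rows)
            chosen.remove(idx)
            chosen.add(new)
    return sorted(chosen)


def select_subset(
    n_rows: int,
    k: int,
    fitness: Callable[[Subset], float],  # any metric; higher is assumed better
    generations: int = 50,
    pop_size: int = 30,                  # must be at least 3 for the tournament
    mutation_rate: float = 0.05,
    initial: Optional[Sequence[Subset]] = None,  # optional informed starting subsets
    seed: int = 0,
) -> Subset:
    rng = random.Random(seed)
    # Start from any supplied subsets, then fill up with random ones.
    population = [sorted(s) for s in (initial or [])][:pop_size]
    while len(population) < pop_size:
        population.append(sorted(rng.sample(range(n_rows), k)))

    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        next_gen = ranked[:2]  # elitism: the best subsets survive unchanged
        while len(next_gen) < pop_size:
            # Tournament selection of size 3 for each parent.
            p1 = max(rng.sample(ranked, 3), key=fitness)
            p2 = max(rng.sample(ranked, 3), key=fitness)
            child = mutate(crossover(p1, p2, k, rng), n_rows, mutation_rate, rng)
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)


# Toy usage: one numeric column, and a placeholder fitness that rewards
# subsets whose mean is close to a target mean (not a metric from the paper).
values = [float(i) for i in range(100)]
target_mean = 49.5


def fitness(subset: Subset) -> float:
    mean = sum(values[i] for i in subset) / len(subset)
    return -abs(mean - target_mean)


best = select_subset(n_rows=len(values), k=10, fitness=fitness)
```

Because the search only ever calls `fitness(subset)`, swapping in a different quality metric or a weighted combination of metrics requires no change to the search loop. The number of candidate subsets grows combinatorially with `n_rows` and `k`, which is consistent with the abstract's remark that gains diminish on larger datasets.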

Details

Original language: English
Pages (from - to): 1-6
Journal: IEEE Access
Publication status: Published - 2025
Peer-review status: Yes

Conference

Title: 2nd International Conference on Artificial Intelligence, Computer, Data Sciences, and Applications
Short title: ACDSA 2025
Event number: 2
Duration: 7 - 9 August 2025
Venue: Antalya Bilim University
City: Antalya
Country: Turkey

External IDs

ORCID /0000-0002-1887-4772/work/196688956
ORCID /0000-0002-9888-8460/work/196691456

Keywords

  • genetic algorithm, subset selection, synthetic data, tabular data