Genetic Algorithm for Subset Selection in Synthetic Tabular Data

Research output: Contribution to journal › Conference article › Contributed › peer-review

Abstract

Subset selection has been widely studied but remains underexplored for synthetic tabular data, particularly in data-sharing contexts that require high-quality data. While generative models can produce large volumes of synthetic data, directly sampling and releasing such data risks including low-quality or unrepresentative samples, which can reduce data utility. An alternative approach is to generate more data than needed and subsequently select a subset that better meets specific quality criteria. This paper introduces a genetic algorithm (GA)-based method for optimizing such subset selection. The proposed GA is independent of any specific fitness function, enabling adaptation to diverse evaluation metrics, their combinations, or varying use-case requirements. We benchmarked the method on five medical datasets, each synthesized by multiple generative architectures, and consistently found that the GA-selected subsets outperformed both the initial synthetic datasets and a random subset selection baseline. Notably, initializing the GA with systematically generated synthetic subsets yielded nearly twice the improvement over the baselines that random initialization did, emphasizing the importance of more informed starting solutions. The proposed GA-based method proved especially beneficial for smaller datasets, which are frequently encountered in clinical domains such as rare disease research. While performance gains diminished for larger datasets due to combinatorial complexity, this work highlights the potential of GA-driven optimization as a foundation for future research into scalable and adaptive subset selection methods for synthetic data sharing.
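The general idea of oversampling and then selecting a subset with a GA can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `ga_select_subset`, the operators (truncation selection, union-based crossover, index-swap mutation, elitism), all parameter values, and the toy fitness function are assumptions chosen for illustration only. The fitness function is passed in as a plain callable, mirroring the abstract's point that the GA is independent of any specific metric, and `initial_population` allows informed starting subsets to be supplied instead of purely random ones.

```python
import random


def ga_select_subset(pool, k, fitness, generations=50, pop_size=20,
                     mutation_rate=0.1, initial_population=None, seed=0):
    """Pick k rows from pool that maximize fitness (illustrative sketch only)."""
    rng = random.Random(seed)
    n = len(pool)

    def score(individual):
        # An individual is a list of k distinct row indices into pool.
        return fitness([pool[i] for i in individual])

    # Start from any supplied subsets, then pad with random ones.
    population = [list(ind) for ind in (initial_population or [])][:pop_size]
    while len(population) < pop_size:
        population.append(rng.sample(range(n), k))

    best = max(population, key=score)
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        parents = ranked[:pop_size // 2]       # truncation selection
        children = [list(ranked[0])]           # elitism: keep the current best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Crossover: draw k distinct indices from the parents' union.
            child = rng.sample(sorted(set(a) | set(b)), k)
            # Mutation: occasionally swap an index for an unused one.
            for pos in range(k):
                if rng.random() < mutation_rate:
                    unused = [i for i in range(n) if i not in child]
                    if unused:
                        child[pos] = rng.choice(unused)
            children.append(child)
        population = children
        best = max(population + [best], key=score)
    return sorted(best)


# Toy stand-in for a quality metric (an assumption, not from the paper):
# prefer subsets whose mean is close to a target value.
def toy_fitness(rows):
    return -abs(sum(rows) / len(rows) - 30.0)


pool = [float(i) for i in range(100)]          # stand-in for synthetic rows
seed_subset = list(range(10))                  # one "informed" starting subset
selected = ga_select_subset(pool, 10, toy_fitness,
                            initial_population=[seed_subset])
```

Because the best individual is carried forward in every generation, the returned subset never scores worse than any subset in the starting population. In practice, `toy_fitness` would be replaced by whatever utility or fidelity metric the use case requires.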

Details

Original language: English
Pages (from-to): 1-6
Journal: IEEE Access
Publication status: Published - 2025
Peer-reviewed: Yes

Conference

Title: 2nd International Conference on Artificial Intelligence, Computer, Data Sciences, and Applications
Abbreviated title: ACDSA 2025
Conference number: 2
Duration: 7-9 August 2025
Location: Antalya Bilim University
City: Antalya
Country: Turkey

External IDs

ORCID /0000-0002-1887-4772/work/196688956
ORCID /0000-0002-9888-8460/work/196691456

Keywords

  • genetic algorithm, subset selection, synthetic data, tabular data