Enhancing Clinical Data Infrastructure for AI Research: Comparative Evaluation of Data Management Architectures

Research output: Contribution to journal > Research article > Contributed > peer-review

Abstract

BACKGROUND: The rapid growth of clinical data, driven by digital technologies and high-resolution sensors, presents significant challenges for health care organizations aiming to support advanced artificial intelligence research and improve patient care. Traditional data management approaches may struggle to handle the large, diverse, and rapidly updating datasets prevalent in modern clinical environments.

OBJECTIVE: This study aimed to compare 3 clinical data management architectures (clinical data warehouses, clinical data lakes, and clinical data lakehouses) by analyzing their performance using the FAIR (findable, accessible, interoperable, and reusable) principles and the big data 5 V's (volume, variety, velocity, veracity, and value). The aim was to provide guidance on selecting an architecture that balances robust data governance with the flexibility required for advanced analytics.

METHODS: We developed a comprehensive analysis framework that integrates aspects of data governance with technical performance criteria. A rapid literature review was conducted to synthesize evidence from multiple studies, focusing on how each architecture manages large, heterogeneous, and dynamically updating clinical data. The review assessed key dimensions such as scalability, real-time processing capabilities, metadata consistency, and the technical expertise required for implementation and maintenance.

RESULTS: The results show that clinical data warehouses offer strong data governance, stability, and structured reporting, making them well suited for environments that require strict compliance and reliable analysis. However, they are limited in terms of real-time processing and scalability. In contrast, clinical data lakes offer greater flexibility and cost-effective scalability for managing heterogeneous data types, although they may suffer from inconsistent metadata management and challenges in maintaining data quality. Clinical data lakehouses combine the strengths of both approaches by supporting real-time data ingestion and structured querying; however, their hybrid nature requires high technical expertise and involves complex integration efforts.

CONCLUSIONS: The optimal data management architecture for clinical applications depends on an organization's specific needs, available resources, and strategic goals. Health care institutions need to weigh the trade-offs between robust data governance, operational flexibility, and scalability to build future-proof infrastructures that support both clinical operations and artificial intelligence research. Further research should focus on reducing the complexity of hybrid models and improving the integration of clinical standards to increase overall system reliability and ease of implementation.

Details

Original language: English
Article number: e74976
Journal: Journal of Medical Internet Research
Volume: 27
Publication status: Published - 1 Aug 2025
Peer-reviewed: Yes

External IDs

PubMedCentral PMC12357119
Scopus 105012764050
ORCID /0000-0003-2126-290X/work/196678716
ORCID /0000-0003-0154-2867/work/196689297
ORCID /0000-0002-9888-8460/work/196691455

Keywords

  • Artificial Intelligence, Data Management, Humans