How well do process-based and data-driven hydrological models learn from limited discharge data?

Research output: Contribution to journal, Research article, Contributed, peer-review

Contributors

  • Maria Staudinger, University of Zurich (Author)
  • Anna Herzog, University of Potsdam (Author)
  • Ralf Loritz, Karlsruhe Institute of Technology (Author)
  • Tobias Houska, Chair of Soil Resources and Land Use, Department of Landscape Ecology and Resources Management, Justus Liebig University Giessen (Author)
  • Sandra Pool, Swiss Federal Institute of Aquatic Science and Technology (Author)
  • Diana Spieler, Chair of Hydrology, TUD Dresden University of Technology (Author)
  • Paul D. Wagner, Kiel University (Author)
  • Juliane Mai, University of Waterloo (Author)
  • Jens Kiesel, Kiel University, Stone Environmental, Inc. (Author)
  • Stephan Thober, Helmholtz Centre for Environmental Research (Author)
  • Björn Guse, Kiel University, Helmholtz Centre Potsdam - German Research Centre for Geosciences (Author)
  • Uwe Ehret, Karlsruhe Institute of Technology (Author)

Abstract

It is widely assumed that data-driven models achieve good results only with sufficiently large training data, whereas process-based models are usually expected to be superior in data-poor situations. To investigate this, we calibrated several process-based and data-driven hydrological models using training datasets of observed discharge that differed in terms of both the number of data points and the type of data selection, allowing us to make a systematic comparison of the learning behaviour of the different model types. Four data-driven models (conditional probability distributions, regression trees, artificial neural networks, and long short-term memory networks) and three process-based models (GR4J, HBV, and SWAT+) were included in the testing, applied in three meso-scale catchments representing different landscapes in Germany: the Iller in the Alpine region, the Saale in the low mountain ranges, and the Selke in the transition between the Harz and central German lowlands. We used information measures (joint entropy and conditional entropy) for system analysis and model performance evaluation because they offer several desirable properties: they extend seamlessly from uni- to multivariate data, they allow direct comparison of predictive uncertainty with and without model simulations, and their boundedness helps to put results into perspective. In addition to the main question of this study – to what extent does the performance of different models depend on the training dataset? – we investigated whether the selection of training data (random, according to information content, contiguous time periods, or independent time points) plays a role. We also examined whether the shape of the learning curve for different models can be used to predict the achievable model performance based on the information contained in the data and whether using more spatially distributed model inputs improves model performance compared to using spatially lumped inputs. 
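The information measures named above can be illustrated with a minimal sketch. It assumes equal-width binning of continuous series and plain histogram (plug-in) estimates of Shannon entropy; the function names `discretize`, `joint_entropy`, and `conditional_entropy` are hypothetical, and the binning scheme is an illustrative choice, not necessarily the estimator used in the study.

```python
import numpy as np


def discretize(values, n_bins):
    """Map continuous values to integer bin labels (equal-width bins; an assumption)."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Only the interior edges are passed, so labels run from 0 to n_bins - 1.
    return np.digitize(values, edges[1:-1])


def joint_entropy(*labels):
    """Shannon entropy in bits of the joint distribution of one or more label arrays."""
    stacked = np.column_stack(labels)
    # Count how often each distinct combination of labels occurs.
    _, counts = np.unique(stacked, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def conditional_entropy(target, *predictors):
    """H(target | predictors) = H(target, predictors) - H(predictors), in bits."""
    if not predictors:
        return joint_entropy(target)
    return joint_entropy(target, *predictors) - joint_entropy(*predictors)
```

For discrete data the conditional entropy lies between 0 (the predictors determine the target) and the unconditional entropy of the target (the predictors carry no information), which is the kind of boundedness that helps put results into perspective. Because `joint_entropy` accepts any number of label arrays, the same code covers the univariate and multivariate cases.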
Process-based models outperformed data-driven ones for small amounts of training data due to their predefined structure. However, as the amount of training data increases, the learning curve of process-based models quickly saturates, and data-driven models become more effective. In particular, the long short-term memory network outperforms all process-based models when trained with more than 2–5 years of data and continues to learn from additional training data without approaching saturation. Surprisingly, fully random sampling of training data points for the HBV model led to better learning results than consecutive random sampling or optimal sampling in terms of information content. Analysing multivariate catchment data allows predictions about how these data can be used to predict discharge. When no memory was considered, the conditional entropy was high. However, as soon as memory was introduced in the form of the previous day or week, the conditional entropy decreased, suggesting that memory is an important component of the data and that capturing it improves model performance. This was particularly evident in the catchments in the low mountain ranges and the Alpine region.
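Two of the training-data selection schemes compared above can be sketched as index samplers over a daily time series. This is a minimal illustration under stated assumptions: "fully random" is read as drawing independent time steps without replacement, and "consecutive random" as one contiguous period with a random start. The function names are hypothetical and the study's actual sampling procedures may differ.

```python
import numpy as np


def sample_random_points(n_total, n_train, rng):
    """Draw n_train independent time-step indices without replacement."""
    return np.sort(rng.choice(n_total, size=n_train, replace=False))


def sample_contiguous_block(n_total, n_train, rng):
    """Draw one contiguous run of n_train time steps with a random start index."""
    # The upper bound is exclusive, so the block always fits inside the series.
    start = rng.integers(0, n_total - n_train + 1)
    return np.arange(start, start + n_train)


# Hypothetical setup: ten years of daily data, one year used for training.
rng = np.random.default_rng(seed=0)
n_days = 10 * 365
random_idx = sample_random_points(n_days, 365, rng)
block_idx = sample_contiguous_block(n_days, 365, rng)
```

Both samplers return the same number of training points, so any difference in the resulting learning curves reflects which points were selected rather than how many.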

Details

Original language: English
Pages (from-to): 5005-5029
Number of pages: 25
Journal: Hydrology and Earth System Sciences
Volume: 29
Issue number: 19
Publication status: Published - 8 Oct 2025
Peer-reviewed: Yes