Autonomy Loops for Monitoring, Operational Data Analytics, Feedback, and Response in HPC Operations

Research output: Contribution to book/conference proceedings/anthology/report > Conference contribution > Contributed > peer-review

Contributors

  • Francieli Boito, University of Bordeaux (Author)
  • Jim Brandt, Sandia National Laboratories (Author)
  • Valeria Cardellini, University of Rome Tor Vergata (Author)
  • Philip Carns, Argonne National Laboratory (Author)
  • Florina M. Ciorba, University of Basel (Author)
  • Hilary Egan, National Renewable Energy Laboratory (Author)
  • Ahmed Eleliemy, University of Basel (Author)
  • Ann Gentile, Sandia National Laboratories (Author)
  • Thomas Gruber, Erlangen National High Performance Computing Center (NHR@FAU) (Author)
  • Jeff Hanson, Hewlett Packard Enterprise (Author)
  • Utz-Uwe Haus, Hewlett Packard Labs (Author)
  • Kevin Huck, University of Oregon (Author)
  • Thomas Ilsche, Center for Information Services and High Performance Computing (ZIH), TUD Dresden University of Technology (Author)
  • Thomas Jakobsche, University of Basel (Author)
  • Terry Jones, Oak Ridge National Laboratory (Author)
  • Sven Karlsson, Technical University of Denmark (Author)
  • Abdullah Mueen, University of New Mexico (Author)
  • Michael Ott, Leibniz Supercomputing Centre (Author)
  • Tapasya Patki, Lawrence Livermore National Laboratory (Author)
  • Ivy Peng, KTH Royal Institute of Technology (Author)
  • Krishnan Raghavan, Argonne National Laboratory (Author)
  • Stephen Simms, Lawrence Berkeley National Laboratory (Author)
  • Kathleen Shoga, Lawrence Livermore National Laboratory (Author)
  • Michael Showerman, University of Illinois at Urbana-Champaign (Author)
  • Devesh Tiwari, Northeastern University (Author)
  • Torsten Wilde, Hewlett Packard Enterprise (Author)
  • Keiji Yamamoto, RIKEN R-CCS (Author)

Abstract

Many High Performance Computing (HPC) facilities have developed and deployed frameworks in support of continuous monitoring and operational data analytics (MODA) to help improve efficiency and throughput. Because of the complexity and scale of systems and workflows and the need for low-latency response to address dynamic circumstances, automated feedback and response have the potential to be more effective than current human-in-the-loop approaches, which are laborious and error-prone. Progress has been limited, however, by factors such as the lack of infrastructure and feedback hooks, and successful deployment is often site- and case-specific. In this position paper, we report on the outcomes and plans from a recent Dagstuhl Seminar, seeking to carve a path for community progress in the development of autonomous feedback loops for MODA, based on the established formalism of similar (MAPE-K) loops in autonomous computing and self-adaptive systems. By defining and developing such loops for significant cases experienced across HPC sites, we seek to extract commonalities and develop conventions that will facilitate interoperability and interchangeability with system hardware, software, and applications across different sites, and will motivate vendors and others to provide telemetry interfaces and feedback hooks to enable community development and pervasive deployment of MODA autonomy loops.

Details

Original language: English
Title of host publication: 2023 IEEE International Conference on Cluster Computing Workshops (CLUSTER Workshops)
Publisher: IEEE
Pages: 37-43
Number of pages: 7
ISBN (print): 979-8-3503-7063-8
Publication status: Published - 31 Oct 2023
Peer-reviewed: Yes

Workshop

Title: 2023 IEEE International Conference on Cluster Computing Workshops
Abbreviated title: CLUSTER Workshops 2023
Duration: 31 October 2023
Degree of recognition: International event
Location: Hilton Santa Fe Historic Plaza
City: Santa Fe
Country: United States of America

External IDs

Scopus: 85179622490
ORCID: /0000-0002-5437-3887/work/154740531

Keywords

  • Seminars, Feedback loop, Production systems, Data analysis, Conferences, Propulsion, Throughput