Autonomy Loops for Monitoring, Operational Data Analytics, Feedback, and Response in HPC Operations

Publication: Contribution to book/conference report/anthology/report; Contribution to conference proceedings; Contributed; Peer-reviewed

Contributors

  • Francieli Boito, University of Bordeaux (Author)
  • Jim Brandt, Sandia National Laboratories (Author)
  • Valeria Cardellini, University of Rome Tor Vergata (Author)
  • Philip Carns, Argonne National Laboratory (Author)
  • Florina M. Ciorba, Universität Basel (Author)
  • Hilary Egan, National Renewable Energy Laboratory (Author)
  • Ahmed Eleliemy, Universität Basel (Author)
  • Ann Gentile, Sandia National Laboratories (Author)
  • Thomas Gruber, Zentrum für Nationales Hochleistungsrechnen Erlangen (NHR@FAU) (Author)
  • Jeff Hanson, Hewlett Packard Enterprise (Author)
  • Utz-Uwe Haus, Hewlett Packard Labs (Author)
  • Kevin Huck, University of Oregon (Author)
  • Thomas Ilsche, Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH), Technische Universität Dresden (Author)
  • Thomas Jakobsche, Universität Basel (Author)
  • Terry Jones, Oak Ridge National Laboratory (Author)
  • Sven Karlsson, Technical University of Denmark (Author)
  • Abdullah Mueen, University of New Mexico (Author)
  • Michael Ott, Leibniz-Rechenzentrum (LRZ) (Author)
  • Tapasya Patki, Lawrence Livermore National Laboratory (Author)
  • Ivy Peng, KTH Royal Institute of Technology (Author)
  • Krishnan Raghavan, Argonne National Laboratory (Author)
  • Stephen Simms, Lawrence Berkeley National Laboratory (Author)
  • Kathleen Shoga, Lawrence Livermore National Laboratory (Author)
  • Michael Showerman, University of Illinois at Urbana-Champaign (Author)
  • Devesh Tiwari, Northeastern University (Author)
  • Torsten Wilde, Hewlett Packard Enterprise (Author)
  • Keiji Yamamoto, RIKEN R-CCS (Author)

Abstract

Many High Performance Computing (HPC) facilities have developed and deployed frameworks in support of continuous monitoring and operational data analytics (MODA) to help improve efficiency and throughput. Because of the complexity and scale of systems and workflows and the need for low-latency response to address dynamic circumstances, automated feedback and response have the potential to be more effective than current human-in-the-loop approaches, which are laborious and error-prone. Progress has been limited, however, by factors such as the lack of infrastructure and feedback hooks, and successful deployment is often site- and case-specific. In this position paper, we report on the outcomes and plans from a recent Dagstuhl Seminar, seeking to carve a path for community progress in the development of autonomous feedback loops for MODA, based on the established formalism of similar (MAPE-K) loops in autonomous computing and self-adaptive systems. By defining and developing such loops for significant cases experienced across HPC sites, we seek to extract commonalities and develop conventions that will facilitate interoperability and interchangeability with system hardware, software, and applications across different sites, and will motivate vendors and others to provide telemetry interfaces and feedback hooks to enable community development and pervasive deployment of MODA autonomy loops.

Details

Original language: English
Title: 2023 IEEE International Conference on Cluster Computing Workshops (CLUSTER Workshops)
Publisher: IEEE
Pages: 37-43
Number of pages: 7
ISBN (Print): 979-8-3503-7063-8
Publication status: Published - 31 Oct. 2023
Peer-review status: Yes

Workshop

Title: 2023 IEEE International Conference on Cluster Computing Workshops
Short title: CLUSTER Workshops 2023
Duration: 31 October 2023
Degree of recognition: International event
Location: Hilton Santa Fe Historic Plaza
City: Santa Fe
Country: USA/United States

External IDs

Scopus: 85179622490
ORCID: /0000-0002-5437-3887/work/154740531

Keywords

  • Seminars, Feedback loop, Production systems, Data analysis, Conferences, Propulsion, Throughput