PIKA: Center-Wide and Job-Aware Cluster Monitoring

Publication: Contribution to book/conference proceedings/anthology/report; Contribution to conference proceedings; Contributed; Peer-reviewed

Abstract

Nowadays, performance optimization is more or less an established procedure in high-performance computing (HPC) centers. To sustainably increase compute efficiency of such systems, we need to increase the awareness of efficiency on both the operator's and the users' side. Therefore, we propose an infrastructure for continuous monitoring and analysis, which automatically characterizes HPC jobs and provides a systematic approach to identify underperforming compute jobs with optimization potential. The recorded metadata and time-series data can be visualized live at runtime or post-mortem and are eventually stored for long-term analysis. The monitoring has a negligible overhead on the compute nodes and neither influences nor limits the user applications.

Details

Original language: English
Title: 2020 IEEE International Conference on Cluster Computing (CLUSTER)
Publisher: IEEE Computer Society, Washington
Pages: 424-432
Number of pages: 9
ISBN (electronic): 978-1-7281-6677-3
ISBN (print): 978-1-7281-6678-0
Publication status: Published - 14 Sept. 2020
Peer-reviewed: Yes

Publication series

Series: IEEE International Conference on Cluster Computing
ISSN: 1552-5244

Conference

Title: 2020 IEEE International Conference on Cluster Computing
Short title: CLUSTER 2020
Duration: 14-17 September 2020
Degree of recognition: International event
Location: online
City: Kobe
Country: Japan

External IDs

Scopus 85096230773
WOS 000698696500051

Keywords

  • monitoring, data collection, data visualization, data analysis, collectd, LIKWID