PIKA: Center-Wide and Job-Aware Cluster Monitoring

Research output: Contribution to book/conference proceedings/anthology/report; Conference contribution; Contributed; peer-review

Abstract

Nowadays, performance optimization is more or less an established procedure in high-performance computing (HPC) centers. To sustainably increase compute efficiency of such systems, we need to increase the awareness of efficiency on both the operator's and the users' side. Therefore, we propose an infrastructure for continuous monitoring and analysis, which automatically characterizes HPC jobs and provides a systematic approach to identify underperforming compute jobs with optimization potential. The recorded metadata and time-series data can be visualized live at runtime or post-mortem and are eventually stored for long-term analysis. The monitoring has a negligible overhead on the compute nodes and neither influences nor limits the user applications.

Details

Original language: English
Title of host publication: 2020 IEEE International Conference on Cluster Computing (CLUSTER)
Publisher: IEEE Computer Society, Washington
Pages: 424-432
Number of pages: 9
ISBN (electronic): 978-1-7281-6677-3
ISBN (print): 978-1-7281-6678-0
Publication status: Published - 14 Sept 2020
Peer-reviewed: Yes

Publication series

Series: IEEE International Conference on Cluster Computing
ISSN: 1552-5244

Conference

Title: 2020 IEEE International Conference on Cluster Computing
Abbreviated title: CLUSTER 2020
Duration: 14 - 17 September 2020
Degree of recognition: International event
Location: online
City: Kobe
Country: Japan

External IDs

Scopus: 85096230773
WOS: 000698696500051

Keywords

  • monitoring, data collection, data visualization, data analysis, collectd, LIKWID