PIKA: Center-Wide and Job-Aware Cluster Monitoring

Robert Dietrich; Frank Winkler; Andreas Knüpfer; Wolfgang Nagel

doi:10.1109/CLUSTER49012.2020.00061

Research output: Contribution to book/Conference proceedings/Anthology/Report › Conference contribution › Contributed › peer-review

Contributors

Robert Dietrich, Center for Information Services and High Performance Computing (ZIH) (Author)
Frank Winkler, Center for Information Services and High Performance Computing (ZIH) (Author)
Andreas Knüpfer, Center for Information Services and High Performance Computing (ZIH) (Author)
Wolfgang Nagel, Center for Information Services and High Performance Computing (ZIH), Chair of Computer Architecture (Author)

Abstract

Nowadays, performance optimization is more or less an established procedure in high-performance computing (HPC) centers. To sustainably increase compute efficiency of such systems, we need to increase the awareness of efficiency on both the operator's and the users' side. Therefore, we propose an infrastructure for continuous monitoring and analysis, which automatically characterizes HPC jobs and provides a systematic approach to identify underperforming compute jobs with optimization potential. The recorded metadata and time-series data can be visualized live at runtime or post-mortem and are eventually stored for long-term analysis. The monitoring has a negligible overhead on the compute nodes and neither influences nor limits the user applications.

Details

Original language	English
Title of host publication	2020 IEEE International Conference on Cluster Computing (CLUSTER)
Publisher	IEEE Computer Society, Washington
Pages	424-432
Number of pages	9
ISBN (electronic)	978-1-7281-6677-3
ISBN (print)	978-1-7281-6678-0
Publication status	Published - 14 Sept 2020
Peer-reviewed	Yes

Publication series

Series	IEEE International Conference on Cluster Computing
ISSN	1552-5244

Conference

Title	2020 IEEE International Conference on Cluster Computing
Abbreviated title	CLUSTER 2020
Duration	14 - 17 September 2020
Website	https://clustercomp.org/2020/
Degree of recognition	International event
Location	online
City	Kobe
Country	Japan

External IDs

Scopus	85096230773
WOS	000698696500051

Keywords

monitoring, data collection, data visualization, data analysis, collectd, LIKWID
