PIKA: Center-Wide and Job-Aware Cluster Monitoring
Research output: Contribution to book/Conference proceedings/Anthology/Report › Conference contribution › Contributed › peer-review
Abstract
Nowadays, performance optimization is more or less an established procedure in high-performance computing (HPC) centers. To sustainably increase compute efficiency of such systems, we need to increase the awareness of efficiency on both the operator's and the users' side. Therefore, we propose an infrastructure for continuous monitoring and analysis, which automatically characterizes HPC jobs and provides a systematic approach to identify underperforming compute jobs with optimization potential. The recorded metadata and time-series data can be visualized live at runtime or post-mortem and are eventually stored for long-term analysis. The monitoring has a negligible overhead on the compute nodes and neither influences nor limits the user applications.
Details
Original language | English |
---|---|
Title of host publication | 2020 IEEE International Conference on Cluster Computing (CLUSTER) |
Publisher | IEEE Computer Society, Washington |
Pages | 424-432 |
Number of pages | 9 |
ISBN (electronic) | 978-1-7281-6677-3 |
ISBN (print) | 978-1-7281-6678-0 |
Publication status | Published - 14 Sept 2020 |
Peer-reviewed | Yes |
Publication series
Series | IEEE International Conference on Cluster Computing |
---|---|
ISSN | 1552-5244 |
Conference
Title | 2020 IEEE International Conference on Cluster Computing |
---|---|
Abbreviated title | CLUSTER 2020 |
Duration | 14 - 17 September 2020 |
Degree of recognition | International event |
Location | online |
City | Kobe |
Country | Japan |
External IDs
Scopus | 85096230773 |
---|---|
WOS | 000698696500051 |
Keywords
- monitoring, data collection, data visualization, data analysis, collectd, LIKWID