Towards Resilience in HPC: A Prototype to Analyze and Predict System Behavior

Siavash Ghiasvand

Research output: Types of thesis › Doctoral thesis

Contributors

Siavash Ghiasvand, Center for Information Services and High Performance Computing (ZIH) (Author)

Abstract

As high performance computing (HPC) systems grow in size and complexity, and with the advent of faster and more complex Exascale systems, failures have become the norm rather than the exception. Hence, protection mechanisms need to be improved. Even the de facto standard mechanisms, such as checkpoint/restart and redundancy, may fail to support the continuous operation of future HPC systems in the presence of failures. Failure prediction is a new protection approach that is beneficial for HPC systems with a short mean time between failures. It extends the existing protection mechanisms by dynamically adjusting the protection level. This work provides a prototype that analyzes and predicts system behavior using statistical analysis, paving the path toward resilience in HPC systems. The proposed anomaly detection method is noise-tolerant by design and produces accurate results with as little as 30 minutes of historical data. Machine learning models complement the main approach and further improve the accuracy of failure predictions to as much as 85%. The fully automatic, unsupervised behavior analysis approach proposed in this work is a novel solution for protecting future extreme-scale systems against failures.

Details

Original language	English
Awarding Institution	Technische Universität Dresden
Supervisors/Advisors	Nagel, Wolfgang Erwin (Mentor); Lieber, Matthias (Mentor); Ciorba, Florina Monica (Mentor)
Publication status	Published - 2020

Keywords

anomaly detection, failure prediction, high performance computing, system logs, resilience

Related content

Best PhD Pitch

Third place - ScienceSlam

Second place - ScienceSlam