Towards Resilience in HPC: A Prototype to Analyze and Predict System Behavior

Research output: Types of Thesis › Doctoral thesis

Abstract

As high performance computing (HPC) systems grow in size and complexity, and with the advent of faster and more complex Exascale systems, failures have become the norm rather than the exception. Hence, protection mechanisms need to be improved. Even the de facto mechanisms, such as checkpoint/restart or redundancy, may fail to support the continuous operation of future HPC systems in the presence of failures. Failure prediction is a new protection approach that benefits HPC systems with a short mean time between failures. The failure prediction mechanism extends the existing protection mechanisms by dynamically adjusting the protection level. This work provides a prototype that analyzes and predicts system behavior using statistical analysis, paving the path toward resilience in HPC systems. The proposed anomaly detection method is noise-tolerant by design and produces accurate results with as little as 30 minutes of historical data. Machine learning models complement the main approach and further improve the accuracy of failure predictions up to 85%. The fully automatic, unsupervised behavior analysis approach proposed in this work is a novel solution to protect future extreme-scale systems against failures.

Details

Original language: English
Awarding Institution
Supervisors/Advisors
  • Nagel, Wolfgang Erwin, Mentor
  • Lieber, Matthias, Mentor
  • Ciorba, Florina Monica, Mentor
Publication status: Published - 2020

Keywords

  • anomaly detection, failure prediction, high performance computing, system logs, resilience