Towards Resilience in HPC: A Prototype to Analyze and Predict System Behavior

Siavash Ghiasvand

Publication: Thesis › Dissertation

Contributors

Siavash Ghiasvand, Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH) (Author)

Abstract

Following the growth of high performance computing (HPC) systems in size and complexity, and with the advent of faster and more complex Exascale systems, failures have become the norm rather than the exception. Hence, protection mechanisms need to be improved. The de facto mechanisms, such as checkpoint/restart or redundancy, may also fail to support the continuous operation of future HPC systems in the presence of failures. Failure prediction is a new protection approach that is beneficial for HPC systems with a short mean time between failures. The failure prediction mechanism extends the existing protection mechanisms by dynamically adjusting the protection level. This work provides a prototype that analyzes and predicts system behavior using statistical analysis, paving the path toward resilience in HPC systems. The proposed anomaly detection method is noise-tolerant by design and produces accurate results with as little as 30 minutes of historical data. Machine learning models complement the main approach and further improve the accuracy of failure predictions to up to 85%. The fully automatic, unsupervised behavior analysis approach proposed in this work is a novel solution to protect future extreme-scale systems against failures.

Details

Original language	English
Degree-granting institution	Technische Universität Dresden
Supervisor / Advisor	Nagel, Wolfgang Erwin (mentor); Lieber, Matthias (mentor); Ciorba, Florina Monica (mentor)
Publication status	Published - 2020

Keywords

anomaly detection, failure prediction, high performance computing, system logs, resilience

Related content

Best PhD Pitch

Third place - ScienceSlam

Second place - ScienceSlam