Lessons Learned from Spatial and Temporal Correlation of Node Failures in High Performance Computers

Publication: Contribution to book/conference proceedings/anthology/report › Contribution to conference proceedings › Contributed › Peer-reviewed

Contributors

Abstract

In this paper, we study the correlation of node failures in time and space. The study is based on measurements from a production high performance computer over an 8-month period. We outline the possible types of correlation between node failures and show that, in many cases, observed node failures are directly correlated. The significance of such a study is twofold: it gives a clearer understanding of how node failures correlate, and it enables failure detection as early as possible. The results are intended to help system administrators minimize (or even prevent) the destructive effects of correlated node failures.

Details

Original language: English
Title: 2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP)
Publisher: Wiley-IEEE Press
Pages: 377-381
Number of pages: 5
ISBN (Print): 978-1-4673-8775-0
Publication status: Published - 2016
Peer-review status: Yes

Conference

Title: 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing
Short title: PDP 2016
Event number: 24
Duration: 17 - 19 February 2016
Degree of recognition: International event
Location: Aquila Atlantis Hotel
City: Heraklion
Country: Greece

External IDs

Scopus 84968830699
WOS 000381810900055

Keywords