Lessons Learned from Spatial and Temporal Correlation of Node Failures in High Performance Computers
Research output: Contribution to book/Conference proceedings/Anthology/Report › Conference contribution › Contributed › peer-review
Abstract
In this paper, we study the correlation of node failures in time and space. Our study is based on measurements from a production high performance computer over an 8-month period. We outline possible types of correlations between node failures and show that, in many cases, observed node failures are directly correlated. The significance of such a study is twofold: it provides a clearer understanding of correlations between node failures, and it enables failure detection as early as possible. The results of this study are intended to help system administrators minimize (or even prevent) the destructive effects of correlated node failures.
Details
Original language | English |
---|---|
Title of host publication | 2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP) |
Publisher | Wiley-IEEE Press |
Pages | 377-381 |
Number of pages | 5 |
ISBN (print) | 978-1-4673-8775-0 |
Publication status | Published - 2016 |
Peer-reviewed | Yes |
Conference
Title | 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing |
---|---|
Abbreviated title | PDP 2016 |
Conference number | 24 |
Duration | 17 - 19 February 2016 |
Degree of recognition | International event |
Location | Aquila Atlantis Hotel |
City | Heraklion |
Country | Greece |
External IDs
Scopus | 84968830699 |
---|---|
WOS | 000381810900055 |