Analysis of Node Failures in High Performance Computers Based on System Logs

Siavash Ghiasvand; Florina Monica Ciorba; Ronny Tschüter; Wolfgang Erwin Nagel

Research output: Contribution to conferences › Poster › Contributed › peer-review

Contributors

Siavash Ghiasvand, Center for Information Services and High Performance Computing (ZIH) (Author)
Florina Monica Ciorba, Center for Information Services and High Performance Computing (ZIH) (Author)
Ronny Tschüter, Center for Information Services and High Performance Computing (ZIH) (Author)
Wolfgang Erwin Nagel, Center for Information Services and High Performance Computing (ZIH), Chair of Computer Architecture (Author)

Abstract

The growth in size and complexity of HPC systems leads to a rapid increase in their failure rates. In the near future, the mean time between failures of HPC systems is expected to become too short, and current failure recovery mechanisms will no longer be able to recover the systems from failures. Early failure detection is thus essential to prevent their destructive effects. Based on measurements of a production system at TU Dresden over an 8-month period, we study the correlation of node failures in time and space. We infer possible types of correlations and show that in many cases the observed node failures are directly correlated. The significance of such a study lies in achieving a clearer understanding of correlations between observed node failures and in enabling failure detection as early as possible. The results aim to help system administrators minimize (or prevent) the destructive effects of failures.

Details

Original language	English
Publication status	Published - Nov 2015
Peer-reviewed	Yes

Conference

Title	The International Conference for High Performance Computing, Networking, Storage, and Analysis
Abbreviated title	SC 15
Conference number	15
Duration	15 - 20 November 2015
Website	http://sc15.supercomputing.org
Degree of recognition	International event
Location	Convention Center
City	Austin
Country	United States of America

Research Portal of the TU Dresden
