Distributed wait state tracking for runtime MPI deadlock detection

Publication: Contribution to book/conference proceedings/anthology/report › Contribution to conference proceedings › Contributed › Peer-reviewed

Contributors

Abstract

The widely used Message Passing Interface (MPI) with its multitude of communication functions is prone to usage errors. Runtime error detection tools aid in the removal of these errors. We develop MUST as one such tool that provides a wide variety of automatic correctness checks. Its correctness checks can be run in a distributed mode, except for its deadlock detection. This limitation applies to a wide range of tools that either use centralized detection algorithms or a timeout approach. In order to provide scalable and distributed deadlock detection with detailed insight into deadlock situations, we propose a model for MPI blocking conditions that we use to formulate a distributed algorithm. This algorithm implements scalable MPI deadlock detection in MUST. Stress tests at up to 4,096 processes demonstrate the scalability of our approach. Finally, overhead results for a complex benchmark suite demonstrate an average runtime increase of 34% at 2,048 processes.
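The idea of modelling blocking conditions can be illustrated with a small, centralized sketch. This is not the distributed algorithm the abstract refers to, and it is not how MUST is implemented; it only assumes that each blocked rank waits either for all of a set of ranks ("AND", e.g. a collective) or for any one of them ("OR", e.g. a wildcard receive). The function name, the input format, and the two mode labels are hypothetical and chosen for illustration only.

```python
def find_deadlocked(waits, num_ranks):
    """Return the set of ranks that can never be released.

    waits maps a blocked rank to (mode, targets), where mode is "AND"
    (all targets must be able to progress) or "OR" (one target suffices).
    Ranks absent from waits are assumed to be running (assumption).
    """
    # Ranks that are not blocked can make progress by definition.
    released = {r for r in range(num_ranks) if r not in waits}

    # Repeat until no further rank can be released (fixed point).
    changed = True
    while changed:
        changed = False
        for rank, (mode, targets) in waits.items():
            if rank in released:
                continue
            ready = [t in released for t in targets]
            can_progress = all(ready) if mode == "AND" else any(ready)
            if can_progress:
                released.add(rank)
                changed = True

    # Whatever was never released is part of a deadlock.
    return set(range(num_ranks)) - released


# Ranks 0 and 1 wait for each other; rank 2 is running.
cyclic = {0: ("AND", {1}), 1: ("AND", {0})}
print(find_deadlocked(cyclic, 3))

# Rank 0 accepts a message from rank 1 or rank 2, so running rank 2 unblocks it.
wildcard = {0: ("OR", {1, 2}), 1: ("AND", {0})}
print(find_deadlocked(wildcard, 3))
```

A check of this kind needs the wait states of all ranks in one place, which is the centralization that the abstract identifies as the scalability limit.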

Details

Original language: English
Title: SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Publisher: IEEE, New York [et al.]
Pages: 1-12
Number of pages: 12
ISBN (Print): 978-1-4503-2378-9
Publication status: Published - 2013
Peer-review status: Yes

Conference

Title: 2013 International Conference for High Performance Computing, Networking, Storage and Analysis
Short title: SC13
Duration: 17 - 22 November 2013
Degree of recognition: International event
Location
City: Denver
Country: USA/United States

External IDs

researchoutputwizard legacy.publication#56764
ORCID /0000-0002-5321-9343/work/142236749
Scopus 84899699102

Keywords

  • distributed wait state tracking, runtime MPI deadlock detection