Distributed wait state tracking for runtime MPI deadlock detection

Tobias Hilbrich; Bronis R. de Supinski; Wolfgang E. Nagel; Joachim Protze; Christel Baier; Matthias S. Müller

doi:10.1145/2503210.2503237

Publication: Contribution to book/conference proceedings/anthology/report › Contribution to conference proceedings › Contributed › Peer-reviewed

Contributors

Tobias Hilbrich, Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH) (Author)
Bronis R. de Supinski, Lawrence Livermore National Laboratory (Author)
Wolfgang E. Nagel, Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH), Chair of Computer Architecture (Author)
Joachim Protze, Jülich Aachen Research Alliance (JARA), Rheinisch-Westfälische Technische Hochschule Aachen (Author)
Christel Baier, Chair of Algebraic and Logical Foundations of Computer Science (Author)
Matthias S. Müller, Jülich Aachen Research Alliance (JARA), Rheinisch-Westfälische Technische Hochschule Aachen (Author)

Abstract

The widely used Message Passing Interface (MPI) with its multitude of communication functions is prone to usage errors. Runtime error detection tools aid in the removal of these errors. We develop MUST as one such tool that provides a wide variety of automatic correctness checks. Its correctness checks can be run in a distributed mode, except for its deadlock detection. This limitation applies to a wide range of tools that either use centralized detection algorithms or a timeout approach. In order to provide scalable and distributed deadlock detection with detailed insight into deadlock situations, we propose a model for MPI blocking conditions that we use to formulate a distributed algorithm. This algorithm implements scalable MPI deadlock detection in MUST. Stress tests at up to 4,096 processes demonstrate the scalability of our approach. Finally, overhead results for a complex benchmark suite demonstrate an average runtime increase of 34% at 2,048 processes.

Details

Original language	English
Title	SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Publisher	Institute of Electrical and Electronics Engineers (IEEE)
Pages	1-12
Number of pages	12
ISBN (Print)	978-1-4503-2378-9
Publication status	Published - 2013
Peer-review status	Yes

Conference

Title	2013 International Conference for High Performance Computing, Networking, Storage and Analysis
Short title	SC13
Event number
Duration	17 - 22 November 2013
Degree of recognition	International event
Location
City	Denver
Country	USA/United States

External IDs

researchoutputwizard	legacy.publication#56764
ORCID	/0000-0002-5321-9343/work/142236749
Scopus	84899699102

Keywords

distributed wait state tracking, runtime MPI deadlock detection
