An Adaptive Failure Detection Protocol

Christof Fetzer; Michel Raynal; Frederic Tronel

Research output: Contribution to conferences › Paper › Contributed › peer-review

Contributors

Christof Fetzer - Chair of Systems Engineering (Author)
Michel Raynal (Author)
Frederic Tronel (Author)

Abstract

The detection of process failures is a crucial problem system designers have to cope with in order to build fault-tolerant distributed platforms. Unfortunately, it is impossible to distinguish with certainty a crashed process from a very slow process in a purely asynchronous distributed system. This prevents some problems from being solved in such systems. That is why failure detector oracles have been introduced to circumvent these impossibility results. The paper presents a relatively simple protocol that allows a process to "monitor" another process, and consequently to detect its crash. This protocol relies as much as possible on application messages to do this monitoring. Unlike previous process crash detection protocols, it uses control messages only when no application message is sent by the monitoring process to the observed process. When the underlying system satisfies the partial synchrony assumption, it actually implements an eventually perfect failure detector (i.e., a failure detector of the class usually denoted ◇P). Moreover, if the average observed transmission delay is finite and the upper layer application terminates within a bounded number of steps for any failure detector in ◇P after the failure detector becomes "perfect", then, when run with the proposed protocol, it also terminates correctly. These properties make the protocol inexpensive, implementable, and powerful. The paper also describes performance measurements of an implementation of the protocol.
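
The abstract's two central ideas (let application messages double as probes, and send a control message only when the application is silent) can be illustrated with a small sketch. This is not the authors' protocol, which this record does not reproduce; it is a generic timeout-based monitor of the kind commonly used to approximate ◇P under partial synchrony, where the timeout grows after each false suspicion. The class name `AdaptiveMonitor`, the fields `idle_period` and `backoff`, the default constants, and the assumption that every message to the observed process is eventually answered are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdaptiveMonitor:
    """Hypothetical monitor for ONE observed process (illustrative only)."""

    timeout: float = 1.0        # assumed initial wait before suspecting
    backoff: float = 2.0        # assumed growth factor after a false suspicion
    idle_period: float = 0.5    # assumed silence after which a control message is sent
    pending_since: Optional[float] = None  # send time of the oldest unanswered message
    last_sent: float = 0.0      # time of the last message sent to the observed process
    suspected: bool = False
    control_messages: int = 0   # how many explicit control messages were needed

    def on_app_send(self, now: float) -> None:
        """An application message to the observed process doubles as a probe."""
        self._mark_sent(now)

    def on_receive(self, now: float) -> None:
        """Any message from the observed process shows it had not crashed."""
        self.pending_since = None
        if self.suspected:
            # False suspicion: the process was only slow, so wait longer next time.
            self.suspected = False
            self.timeout *= self.backoff

    def tick(self, now: float) -> None:
        """Periodic check driven by the caller's clock."""
        if self.pending_since is not None and now - self.pending_since > self.timeout:
            self.suspected = True
        # Send a control message only if nothing is outstanding and the
        # application has been silent for at least idle_period.
        if self.pending_since is None and now - self.last_sent >= self.idle_period:
            self.control_messages += 1
            self._mark_sent(now)

    def _mark_sent(self, now: float) -> None:
        self.last_sent = now
        if self.pending_since is None:
            self.pending_since = now
```

Passing the clock value in explicitly keeps the sketch deterministic. While application traffic flows, `control_messages` stays at zero; it only increases during idle periods, which mirrors the cost argument made in the abstract.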

Details

Original language	English
Pages	146-153
Number of pages	8
Publication status	Published - 2001
Peer-reviewed	Yes

Conference

Title	2001 Pacific Rim International Symposium on Dependable Computing
Abbreviated title	PRDC '01
Duration	17 December 2001
Degree of recognition	International event
Location
City	Seoul
Country	Korea, Republic of

External IDs

Scopus	84885938405

Keywords

Research priority areas of TU Dresden

Information Technology and Microelectronics

DFG Classification of Subject Areas according to Review Boards

Security and Dependability

Keywords

protocols, Detectors, Computer crashes, Fault detection, condition monitoring, Delay, Measurement, Middleware, Fault tolerance, fault diagnosis, distributed computing, system recovery, adaptive failure detection protocol, process failure detection, system designers, Fault-tolerant distributed platforms, crashed process, very slow process, purely asynchronous distributed system, failure detector oracles, simple protocol, application messages, process crash detection protocols, control messages, monitoring process, observed process, partial synchrony assumption, perfect failure detector, average observed transmission delay, upper layer application, performance measurements
