Scalable error isolation for distributed systems: modeling, correctness proofs, and additional experiments

Publication: Preprint/Documentation/Report (Working paper)

Abstract

In distributed systems, data corruption on a single node can propagate to other nodes in the system and cause severe outages. The probability of data corruption is already non-negligible today in large computer populations (e.g., in large datacenters). The resilience of processors is expected to decline in the near future, making it necessary to devise cost-effective software approaches to deal with data corruption.

In this paper, we present SEI, an algorithm that tolerates Arbitrary State Corruption (ASC) faults and prevents data corruption from propagating across a distributed system. SEI scales in three dimensions: memory, number of processing threads, and development effort. To evaluate development effort, fault coverage, and performance with our library, we hardened two real-world applications: a DNS resolver and memcached. Hardening these applications required minimal changes to the existing code base, and the performance overhead is negligible in the case of applications that are not CPU-intensive, such as memcached. The memory overhead is negligible independent of the application when using ECC memory. Finally, SEI covers faults effectively: it detected all hardware-injected errors and reduced undetected errors from 44% down to only 0.15% of the software-injected computation errors in our experiments.

Details

Original language: English
Publication status: Published - 1 Feb. 2015
