Scalable error isolation for distributed systems: modeling, correctness proofs, and additional experiments
Research output: Preprint/Documentation/Report › Working paper
Abstract
In distributed systems, data corruption on a single node can propagate to other nodes in the system and cause severe outages. The probability of data corruption is already non-negligible today in large computer populations (e.g., in large datacenters). The resilience of processors is expected to decline in the near future, making it necessary to devise cost-effective software approaches to deal with data corruption.

In this paper, we present SEI, an algorithm that tolerates Arbitrary State Corruption (ASC) faults and prevents data corruption from propagating across a distributed system. SEI scales in three dimensions: memory, number of processing threads, and development effort. To evaluate development effort, fault coverage, and performance with our library, we hardened two real-world applications: a DNS resolver and memcached. Hardening these applications required minimal changes to the existing code base, and the performance overhead is negligible in the case of applications that are not CPU-intensive, such as memcached. The memory overhead is negligible independent of the application when using ECC memory. Finally, SEI covers faults effectively: it detected all hardware-injected errors and reduced undetected errors from 44% down to only 0.15% of the software-injected computation errors in our experiments.
Details
Original language | English |
---|---|
Publication status | Published - 1 Feb 2015 |