Transactional Encoding for Tolerating Transient Hardware Errors

Research output: Contribution to book/Conference proceedings/Anthology/ReportChapter in book/Anthology/ReportContributedpeer-review

Contributors

Abstract

The decreasing feature size of integrated circuits leads to less reliable hardware with higher likelihood for errors. Without adding additional failure detection and masking mechanisms, the next generations of CPUs would at least be unfit for executing mission- and safety-critical applications. One common approach is the replicated execution of programs on redundant cores, which is increasingly difficult considering that most programs are non-deterministic. To be able to detect and mask execution errors, one typically need to execute three copies of each thread.

In this paper, we propose and evaluate transactional encoding, a novel approach to detect and mask transient hardware errors such that one can build safe applications on top of unreliable components. Transactional encoding relies on a combination of arithmetic codes for detecting transient hardware errors and transactional memory for recovery and tolerance of transient errors. We present a prototype software implementation that encodes applications using an LLVM-based compiler and executes them with a customized software transactional memory algorithm. Our evaluation shows that our system can successfully survive between 90-96% of transient hardware errors.

Details

Original languageEnglish
Title of host publicationStabilization, Safety, and Security of Distributed Systems: 15th International Symposium (SSS 2015)
Pages1-16
Number of pages16
Volume8255
Publication statusPublished - 2013
Peer-reviewedYes

External IDs

Scopus 84893972869

Keywords

Research priority areas of TU Dresden

DFG Classification of Subject Areas according to Review Boards

Keywords

  • Error Detectin, Code Word, Error Recovery, Trnsactional memory, arithmetic code