Transactional Encoding for Tolerating Transient Hardware Errors
Research output: Contribution to book/Conference proceedings/Anthology/Report › Chapter in book/Anthology/Report › Contributed › peer-review
Contributors
Abstract
The decreasing feature size of integrated circuits leads to less reliable hardware with higher likelihood for errors. Without adding additional failure detection and masking mechanisms, the next generations of CPUs would at least be unfit for executing mission- and safety-critical applications. One common approach is the replicated execution of programs on redundant cores, which is increasingly difficult considering that most programs are non-deterministic. To be able to detect and mask execution errors, one typically need to execute three copies of each thread.
In this paper, we propose and evaluate transactional encoding, a novel approach to detect and mask transient hardware errors such that one can build safe applications on top of unreliable components. Transactional encoding relies on a combination of arithmetic codes for detecting transient hardware errors and transactional memory for recovery and tolerance of transient errors. We present a prototype software implementation that encodes applications using an LLVM-based compiler and executes them with a customized software transactional memory algorithm. Our evaluation shows that our system can successfully survive between 90-96% of transient hardware errors.
In this paper, we propose and evaluate transactional encoding, a novel approach to detect and mask transient hardware errors such that one can build safe applications on top of unreliable components. Transactional encoding relies on a combination of arithmetic codes for detecting transient hardware errors and transactional memory for recovery and tolerance of transient errors. We present a prototype software implementation that encodes applications using an LLVM-based compiler and executes them with a customized software transactional memory algorithm. Our evaluation shows that our system can successfully survive between 90-96% of transient hardware errors.
Details
Original language | English |
---|---|
Title of host publication | Stabilization, Safety, and Security of Distributed Systems: 15th International Symposium (SSS 2015) |
Pages | 1-16 |
Number of pages | 16 |
Volume | 8255 |
Publication status | Published - 2013 |
Peer-reviewed | Yes |
External IDs
Scopus | 84893972869 |
---|
Keywords
Research priority areas of TU Dresden
DFG Classification of Subject Areas according to Review Boards
Keywords
- Error Detectin, Code Word, Error Recovery, Trnsactional memory, arithmetic code