Transactional Encoding for Tolerating Transient Hardware Errors
Publikation: Beitrag in Buch/Konferenzbericht/Sammelband/Gutachten › Beitrag in Buch/Sammelband/Gutachten › Beigetragen › Begutachtung
Beitragende
Abstract
The decreasing feature size of integrated circuits leads to less reliable hardware with higher likelihood for errors. Without adding additional failure detection and masking mechanisms, the next generations of CPUs would at least be unfit for executing mission- and safety-critical applications. One common approach is the replicated execution of programs on redundant cores, which is increasingly difficult considering that most programs are non-deterministic. To be able to detect and mask execution errors, one typically need to execute three copies of each thread.
In this paper, we propose and evaluate transactional encoding, a novel approach to detect and mask transient hardware errors such that one can build safe applications on top of unreliable components. Transactional encoding relies on a combination of arithmetic codes for detecting transient hardware errors and transactional memory for recovery and tolerance of transient errors. We present a prototype software implementation that encodes applications using an LLVM-based compiler and executes them with a customized software transactional memory algorithm. Our evaluation shows that our system can successfully survive between 90-96% of transient hardware errors.
In this paper, we propose and evaluate transactional encoding, a novel approach to detect and mask transient hardware errors such that one can build safe applications on top of unreliable components. Transactional encoding relies on a combination of arithmetic codes for detecting transient hardware errors and transactional memory for recovery and tolerance of transient errors. We present a prototype software implementation that encodes applications using an LLVM-based compiler and executes them with a customized software transactional memory algorithm. Our evaluation shows that our system can successfully survive between 90-96% of transient hardware errors.
Details
Originalsprache | Englisch |
---|---|
Titel | Stabilization, Safety, and Security of Distributed Systems: 15th International Symposium (SSS 2015) |
Seiten | 1-16 |
Seitenumfang | 16 |
Band | 8255 |
Publikationsstatus | Veröffentlicht - 2013 |
Peer-Review-Status | Ja |
Externe IDs
Scopus | 84893972869 |
---|
Schlagworte
Forschungsprofillinien der TU Dresden
DFG-Fachsystematik nach Fachkollegium
Schlagwörter
- Error Detectin, Code Word, Error Recovery, Trnsactional memory, arithmetic code