Generic Soft-Error Detection and Correction for Concurrent Data Structures

Publication: Contribution to journal, Research article, Contributed, Peer-reviewed

Contributors

  • Christoph Borchert, Technische Universität (TU) Dortmund (Author)
  • Horst Schirmeier, Technische Universität (TU) Dortmund (Author)
  • Olaf Spinczyk, Technische Universität (TU) Dortmund (Author)

Abstract

Recent studies indicate that transient memory errors (soft errors) have become a relevant source of system failures. This paper presents a generic software-based fault-tolerance mechanism that transparently recovers from memory errors in object-oriented program data structures. The main benefits are the flexibility to choose from an extensible toolbox of easily pluggable error detection and correction schemes, such as Hamming and CRC codes. This is achieved by a combination of aspect-oriented and generative programming techniques. Furthermore, we present a wait-free synchronization algorithm for error detection in data structures that are used concurrently by multiple threads of control. We give a formal correctness proof and show the excellent scalability of our approach in a multiprocessor environment. In a case study, we present our experiences with selectively hardening the eCos operating system and its benchmark suite. We explore the trade-off between resiliency and performance by choosing only the most vulnerable data structures for error recovery. Thereby, the total number of system failures, manifesting as silent data corruptions and crashes, is reduced by 69.14 percent at a negligible runtime overhead of 0.36 percent.
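To make the idea of checksum-based detection and recovery in the abstract concrete, the following is a minimal hand-written sketch, not the paper's mechanism. The abstract says that protection is added through aspect-oriented and generative programming and that Hamming and CRC codes are pluggable. This sketch instead assumes a plain wrapper template that pairs a CRC-32 with a duplicate copy for recovery. The names Protected, CheckResult, and flip_bit are hypothetical, and the code ignores concurrency entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Bitwise CRC-32 (reflected polynomial 0xEDB88320) over raw bytes.
inline std::uint32_t crc32(const void* data, std::size_t len) {
    const unsigned char* p = static_cast<const unsigned char*>(data);
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; ++bit) {
            // Mask is all ones when the lowest bit is set, else zero.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

enum class CheckResult { Ok, Corrected, Uncorrectable };

// Hypothetical wrapper: a value, a replica, and a CRC of the value.
// Assumption: T is trivially copyable and has no padding bytes.
template <typename T>
class Protected {
    static_assert(std::is_trivially_copyable<T>::value,
                  "T must be trivially copyable");

public:
    explicit Protected(const T& v) : value_(v), replica_(v), checksum_(0) {
        store(v);
    }

    // Write the value, refresh the replica, and recompute the checksum.
    void store(const T& v) {
        std::memcpy(&value_, &v, sizeof(T));
        std::memcpy(&replica_, &v, sizeof(T));
        checksum_ = crc32(&value_, sizeof(T));
    }

    // Detect corruption of the value; recover from the replica if it
    // still matches the stored checksum.
    CheckResult check() {
        if (crc32(&value_, sizeof(T)) == checksum_) return CheckResult::Ok;
        if (crc32(&replica_, sizeof(T)) == checksum_) {
            std::memcpy(&value_, &replica_, sizeof(T));
            return CheckResult::Corrected;
        }
        return CheckResult::Uncorrectable;
    }

    // Verify (and repair if possible) before every read.
    const T& load() {
        check();
        return value_;
    }

    // Fault-injection hook for experiments: flips one bit of the value.
    void flip_bit(std::size_t bit) {
        unsigned char* bytes = reinterpret_cast<unsigned char*>(&value_);
        bytes[bit / 8] = static_cast<unsigned char>(bytes[bit / 8] ^ (1u << (bit % 8)));
    }

private:
    T value_;
    T replica_;
    std::uint32_t checksum_;
};
```

In this sketch, a caller would construct Protected<std::uint32_t> p(42u), inject a fault with p.flip_bit(5), and observe that p.check() reports CheckResult::Corrected while p.load() returns the original value. Duplication doubles the memory cost; an error-correcting code such as the Hamming code named in the abstract would be the more economical choice.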

Details

Original language: English
Article number: 7097670
Pages (from - to): 22-36
Number of pages: 15
Journal: IEEE Transactions on Dependable and Secure Computing
Volume: 14
Issue number: 1
Publication status: Published - 1 Feb. 2017
Peer-review status: Yes
Externally published: Yes

External IDs

Scopus: 85010434989
ORCID: /0000-0002-1427-9343/work/167216811

Keywords

  • Redundancy, Data structures, Runtime, Instruction sets, Benchmark testing, Programming, Kernel