Hybrid Hardware/Software Detection of Multi-Bit Upsets in Memory
Publication: Contribution to book/conference proceedings/anthology/report › Contribution to conference proceedings › Contributed › Peer-reviewed
Contributors
Abstract
Bit flips in main memory can be caused by a multitude of environmental effects, such as heat or radiation, as well as by malicious actors exploiting Rowhammer-style hardware vulnerabilities. The industry-standard countermeasure is SEC-DED ECC memory, which can reliably correct single-bit and detect double-bit flips in a data word. However, larger multi-bit upsets (MBUs) regularly occur in real-world systems, and – as shown by an analysis in this paper – have a high probability of being miscorrected. Software-implemented hardware fault tolerance (SIHFT) mechanisms can flexibly handle MBUs, but incur significant runtime costs. In this paper, we propose to combine hardware ECC as a low-cost detector and SIHFT as a handler for miscorrected MBUs that recategorizes them as uncorrectable. A preliminary evaluation on the basis of differential checksums shows a 98.5 % reduction in miscorrected silent data corruptions with a very moderate execution-time overhead.
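The miscorrection effect and the hybrid idea described in the abstract can be illustrated with a toy model. The sketch below is not the paper's implementation. It assumes a tiny extended Hamming (8,4) SEC-DED code instead of the (72,64) codes typical of ECC DIMMs, and it substitutes a plain CRC-32 (`zlib.crc32`) for the differential checksums used in the evaluation. The names `encode`, `decode`, and `hybrid_read` are hypothetical helpers introduced only for this illustration.

```python
import zlib

def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SEC-DED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    bits = [0] * 8                    # index 0: overall parity, 1..7: Hamming(7,4)
    bits[3], bits[5], bits[6], bits[7] = d
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    bits[0] = sum(bits[1:]) % 2       # makes the total number of ones even
    return bits

def decode(bits):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = list(bits)                 # work on a copy, like a read that does not scrub
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos           # XOR of the positions of all set bits
    parity_odd = sum(bits) % 2 == 1
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        bits[syndrome] ^= 1           # odd parity: assumed single-bit error
        status = "corrected"          # (syndrome 0 = the overall parity bit)
    else:
        return None, "uncorrectable"  # even parity, non-zero syndrome: double error
    nibble = bits[3] | bits[5] << 1 | bits[6] << 2 | bits[7] << 3
    return nibble, status

def hybrid_read(codewords, expected_crc):
    """Decode all words; only if ECC reported a correction, verify a software checksum."""
    data, saw_correction = [], False
    for cw in codewords:
        nibble, status = decode(cw)
        if status == "uncorrectable":
            return None, "uncorrectable"
        saw_correction |= status == "corrected"
        data.append(nibble)
    if saw_correction and zlib.crc32(bytes(data)) != expected_crc:
        return None, "uncorrectable"  # miscorrection recategorized as uncorrectable
    return data, "corrected" if saw_correction else "ok"

payload = [0x3, 0xA, 0xF, 0x0]
crc = zlib.crc32(bytes(payload))      # software-side checksum (assumption: CRC-32)
memory = [encode(n) for n in payload]

memory[1][6] ^= 1                     # single-bit flip: ECC repairs it correctly
single = hybrid_read(memory, crc)

for pos in (1, 2, 7):                 # 3-bit upset in one word
    memory[2][pos] ^= 1
ecc_only = decode(memory[2])          # ECC reports "corrected" but returns wrong data
hybrid = hybrid_read(memory, crc)     # checksum mismatch exposes the miscorrection
```

In this toy code every odd-weight error pattern looks like a single-bit error to the decoder, so the 3-bit upset is "corrected" into a different valid codeword. The software check runs only when the ECC decoder reports a correction, which is one way to read the abstract's description of hardware ECC as a low-cost detector that triggers the more expensive software handler.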
Details
| Original language | English |
|---|---|
| Title | Proceedings - 2024 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, DSN-W 2024 |
| Publisher | Institute of Electrical and Electronics Engineers (IEEE) |
| Pages | 94-97 |
| Number of pages | 4 |
| ISBN (electronic) | 9798350395723 |
| ISBN (print) | 979-8-3503-9573-0 |
| Publication status | Published - 27 June 2024 |
| Peer-review status | Yes |
Conference
| Title | 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks |
|---|---|
| Short title | DSN 2024 |
| Event number | 54 |
| Duration | 24 - 27 June 2024 |
| Website | |
| Venue | Pullman Hotel |
| City | Brisbane |
| Country | Australia |
External IDs
| ORCID | /0000-0002-1427-9343/work/166764857 |
|---|---|
| Scopus | 85203820132 |
Keywords
ASJC Scopus subject areas
Keywords
- Fault tolerance, Fault tolerant systems, Hardware, Heating systems, Memory management, Runtime, Software, fault tolerance, ECC, multi-bit upset, fault detection, software-implemented hardware fault tolerance, hybrid, DRAM