Hybrid Hardware/Software Detection of Multi-Bit Upsets in Memory

Research output: Contribution to book/Conference proceedings/Anthology/Report › Conference contribution › Contributed › peer-review

Contributors

Abstract

Bit flips in main memory can be caused by a multitude of environmental effects, such as heat or radiation, as well as by malicious actors exploiting Rowhammer-style hardware vulnerabilities. The industry-standard countermeasure is SEC-DED ECC memory, which can reliably correct single-bit and detect double-bit flips in a data word. However, larger multi-bit upsets (MBUs) regularly occur in real-world systems, and – as shown by an analysis in this paper – have a high probability of being miscorrected. Software-implemented hardware fault tolerance (SIHFT) mechanisms can flexibly handle MBUs, but incur significant runtime costs. In this paper, we propose to combine hardware ECC as a low-cost detector and SIHFT as a handler for miscorrected MBUs that recategorizes them as uncorrectable. A preliminary evaluation on the basis of differential checksums shows a 98.5 % reduction in miscorrected silent data corruptions with a very moderate execution-time overhead.
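
The miscorrection problem and the hybrid idea described in the abstract can be illustrated with a small sketch. It is not the paper's implementation: it assumes a toy 8-bit SEC-DED code (Hamming(7,4) plus an overall parity bit) in place of real ECC DRAM, and a plain additive checksum in place of the differential checksums used in the evaluation. All names (`encode`, `decode`, `flip`, `checksum`, `hybrid_read`) are hypothetical. In this toy code every triple-bit flip is miscorrected, which is a property of its small size and not a figure from the paper.

```python
# Toy SEC-DED code: Hamming(7,4) plus one overall parity bit (8 bits per word).
# Index 0 holds the overall parity; indices 1..7 are the Hamming positions.
DATA_POS = (3, 5, 6, 7)  # positions that carry the four data bits


def encode(nibble):
    """Encode a 4-bit value into an 8-bit SEC-DED codeword (list of bits)."""
    code = [0] * 8
    for i, pos in enumerate(DATA_POS):
        code[pos] = (nibble >> i) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the total number of ones even
    return code


def decode(code):
    """Return (status, value) the way a SEC-DED decoder would report it."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    odd_parity = sum(code) % 2
    if odd_parity:
        # Odd number of flips: the decoder assumes exactly one and "repairs" it.
        # Syndrome 0 means the overall parity bit itself (index 0) is flipped.
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome:
        status = "uncorrectable"  # even number of flips, detected only
    else:
        status = "ok"
    value = sum(code[pos] << i for i, pos in enumerate(DATA_POS))
    return status, value


def flip(code, positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    return [b ^ 1 if i in positions else b for i, b in enumerate(code)]


def checksum(values):
    """Stand-in software checksum (assumption: simple additive checksum)."""
    return sum(values) % 256


def hybrid_read(memory, index, stored_checksum):
    """Use the ECC 'corrected' event as a cheap trigger for a software check."""
    status, value = decode(memory[index])
    if status == "corrected":
        values = [decode(word)[1] for word in memory]
        if checksum(values) != stored_checksum:
            # The hardware correction was wrong: recategorize as uncorrectable.
            return "uncorrectable", None
    return status, value


data = [0x3, 0xA, 0x5, 0xC]
memory = [encode(word) for word in data]
stored = checksum(data)

memory[1] = flip(memory[1], {1, 2, 7})  # inject a triple-bit upset
print(decode(memory[1]))                # plain ECC view of the damaged word
print(hybrid_read(memory, 1, stored))   # ECC plus software check
```

The software check runs only when the decoder reports a correction, so fault-free reads pay no extra cost in this sketch. That mirrors the division of labor named in the abstract, with hardware ECC as the detector and software as the handler.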

Details

Original language: English
Title of host publication: 2024 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W)
Publisher: IEEE
Pages: 94-97
Number of pages: 4
ISBN (electronic): 9798350395723
ISBN (print): 979-8-3503-9573-0
Publication status: Published - 27 Jun 2024
Peer-reviewed: Yes

Conference

Title: 2024 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Abbreviated title: DSN 2024
Conference number: 54
Duration: 24 - 27 June 2024
Location: Pullman Hotel
City: Brisbane
Country: Australia

External IDs

ORCID /0000-0002-1427-9343/work/166764857
Scopus 85203820132

Keywords

  • Fault tolerance, Fault tolerant systems, Hardware, Heating systems, Memory management, Runtime, Software