RAMpage: Graceful Degradation Management for Memory Errors in Commodity Linux Servers

Publication: Contribution to book/conference proceedings/anthology/report; contribution to conference proceedings; contributed; peer-reviewed

Contributors

  • Horst Schirmeier, Chair of Operating Systems, Technische Universität (TU) Dortmund (Author)
  • Jens Neuhalfen, Technische Universität (TU) Dortmund (Author)
  • Ingo Korb, Technische Universität (TU) Dortmund (Author)
  • Olaf Spinczyk, Technische Universität (TU) Dortmund (Author)
  • Michael Engel, Technische Universität (TU) Dortmund (Author)

Abstract

Memory errors are a major source of reliability problems in current computers. Undetected errors may result in program termination, or, even worse, silent data corruption. Recent studies have shown that the frequency of permanent memory errors is an order of magnitude higher than previously assumed and regularly affects everyday operation. Often, neither additional circuitry to support hardware-based error detection nor downtime for performing hardware tests can be afforded. In the case of permanent memory errors, a system faces two challenges: detecting errors as early as possible and handling them while avoiding system downtime. To increase system reliability, we have developed RAMpage, an online memory testing infrastructure for commodity x86-64-based Linux servers, which is capable of efficiently detecting memory errors and which provides graceful degradation by withdrawing affected memory pages from further use. We describe the design and implementation of RAMpage and present results of an extensive qualitative as well as quantitative evaluation.

Details

Original language: English
Title: 2011 IEEE 17th Pacific Rim International Symposium on Dependable Computing
Publisher: IEEE
Pages: 89-98
Number of pages: 10
ISBN (Print): 978-0-7695-4590-5
Publication status: Published - 14 Dec. 2011
Peer-review status: Yes

Conference

Title: 2011 IEEE 17th Pacific Rim International Symposium on Dependable Computing
Duration: 12 - 14 December 2011
Location: Pasadena, CA, USA

External IDs

Scopus 84857724913
ORCID /0000-0002-1427-9343/work/167216806

Keywords

  • Random access memory, Kernel, Testing, Memory management, Linux, Degradation, Servers