RAMpage: Graceful Degradation Management for Memory Errors in Commodity Linux Servers

Research output: Contribution to book/Conference proceedings/Anthology/Report > Conference contribution > Contributed > peer-review

Contributors

  • Horst Schirmeier, Chair of Operating Systems, Dortmund University of Technology (Author)
  • Jens Neuhalfen, Dortmund University of Technology (Author)
  • Ingo Korb, Dortmund University of Technology (Author)
  • Olaf Spinczyk, Dortmund University of Technology (Author)
  • Michael Engel, Dortmund University of Technology (Author)

Abstract

Memory errors are a major source of reliability problems in current computers. Undetected errors may result in program termination, or, even worse, silent data corruption. Recent studies have shown that the frequency of permanent memory errors is an order of magnitude higher than previously assumed and regularly affects everyday operation. Often, neither additional circuitry to support hardware-based error detection nor downtime for performing hardware tests can be afforded. In the case of permanent memory errors, a system faces two challenges: detecting errors as early as possible and handling them while avoiding system downtime. To increase system reliability, we have developed RAMpage, an online memory testing infrastructure for commodity x86-64-based Linux servers, which is capable of efficiently detecting memory errors and which provides graceful degradation by withdrawing affected memory pages from further use. We describe the design and implementation of RAMpage and present results of an extensive qualitative as well as quantitative evaluation.

Details

Original language: English
Title of host publication: 2011 IEEE 17th Pacific Rim International Symposium on Dependable Computing
Publisher: IEEE
Pages: 89-98
Number of pages: 10
ISBN (print): 978-0-7695-4590-5
Publication status: Published - 14 Dec 2011
Peer-reviewed: Yes

Conference

Title: 2011 IEEE 17th Pacific Rim International Symposium on Dependable Computing
Duration: 12 - 14 December 2011
Location: Pasadena, CA, USA

External IDs

Scopus 84857724913
ORCID /0000-0002-1427-9343/work/167216806

Keywords

  • Random access memory, Kernel, Testing, Memory management, Linux, Degradation, Servers