Low-overhead fault tolerance for high-throughput data processing systems

André Martin; Thomas Knauth; Stephan Creutz; Diogo Becker de Brum; Stefan Weigert; Andrey Brito; Christof Fetzer

doi:10.1109/ICDCS.2011.29

Publication: Contribution to conferences › Paper › Contributed › Peer-reviewed

Contributors

André Martin, Chair of Systems Engineering (SE) (Author)
Thomas Knauth, Chair of Systems Engineering (SE) (Author)
Stephan Creutz (Author)
Diogo Becker de Brum (Author)
Stefan Weigert, Chair of Systems Engineering (SE) (Author)
Andrey Brito (Author)
Christof Fetzer, Chair of Systems Engineering (SE) (Author)

Abstract

The MapReduce programming paradigm has proved to be a useful approach for building highly scalable data processing systems. One important reason for its success is its simplicity, including that of its fault tolerance mechanisms. However, this simplicity comes at a price: efficiency. MapReduce's fault tolerance scheme stores too much intermediate information on disk, which negatively affects job completion time. In particular, this inefficiency precludes the use of MapReduce in near real-time scenarios where jobs need to produce results quickly. In this paper, we discuss an alternative fault tolerance scheme that is inspired by virtual synchrony. The key feature of our approach is low-overhead deterministic execution, which reduces the amount of persistently stored information. In addition, because persisting intermediate results is no longer required for fault tolerance, we use more efficient communication techniques that considerably improve job completion time and throughput. Our contribution is twofold: (i) we enable the use of MapReduce for jobs ranging from seconds to a few tens of seconds, satisfying these deadlines even in the case of failures, and (ii) we considerably reduce the fault tolerance overhead and, as such, the overhead of MapReduce in general. Our modifications are transparent to the application.
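
The general idea behind deterministic execution can be illustrated with a small sketch. This is not the implementation described in the paper; it is a toy word count, and the names `map_words`, `reduce_counts`, `run_job`, and `input_log` are hypothetical. It assumes only that the ordered input is kept durable (a Python list stands in for a persistent log) and that every processing step is deterministic, so lost in-memory intermediate data can be recomputed by replaying the input instead of being read back from disk.

```python
from collections import defaultdict


def map_words(line):
    # Deterministic: the output depends only on the input line.
    return [(word, 1) for word in line.split()]


def reduce_counts(pairs):
    # Sum the values per key; sorting makes the result order reproducible.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def run_job(input_log):
    # Intermediate pairs stay in memory; nothing is written to disk here.
    intermediate = []
    for line in input_log:
        intermediate.extend(map_words(line))
    return reduce_counts(intermediate)


# Assumption: only the ordered input is durable (a list stands in for a log).
input_log = ["a b a", "b c"]

first_run = run_job(input_log)

# Simulated failure: the in-memory intermediate data is lost, so the job
# is replayed from the logged input and must yield the same result.
replayed = run_job(input_log)
assert first_run == replayed
```

Because both runs consume the same input in the same order and no step depends on timing or randomness, the replayed result matches the original. That property is what makes persisting the intermediate pairs unnecessary in this sketch.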

Details

Original language	English
Pages	689-699
Number of pages	11
Publication status	Published - 2011
Peer-reviewed	Yes

Conference

Title	2011 31st IEEE International Conference on Distributed Computing Systems
Abbreviated title	ICDCS '11
Conference number	31
Duration	20 - 24 June 2011
Degree of recognition	International event
City	Minneapolis
Country	United States of America

External IDs

Scopus	80051907648

Keywords

Research priority areas of TU Dresden

Information Technology and Microelectronics

DFG Classification of Subject Areas according to Review Boards

Security and Dependability

Keywords

Fault Tolerance, Fault tolerant systems, synchronisation, Programming, Computer crashes, Aggregates, Monitoring, data handling, deterministic algorithms, Parallel programming

TU Dresden Research Portal