Active replication for latency-sensitive stream processing in Apache Flink
Publication: Contribution to conference › Paper › Contributed › Peer-reviewed
Contributors
Abstract
Stream processing frameworks allow processing
massive amounts of data shortly after it is produced, and
enable a fast reaction to events in scenarios such as data
center monitoring, smart transportation, or telecommunication
networks. Many scenarios depend on the fast and reliable
processing of incoming data, requiring low end-to-end latencies
from the ingestion of a new event to the corresponding output.
The occurrence of faults jeopardizes these guarantees: currently
leading high-availability solutions for stream processing, such as
those of Spark Streaming or Apache Flink, implement passive replication
through snapshotting, requiring a stop-the-world operation to
recover from a failure. Active replication, while incurring higher
deployment costs, can overcome these limitations, making it
possible to mask the impact of faults and meet stringent end-to-end
latency requirements. We present the design, implementation,
and evaluation of active replication in the popular Apache
Flink platform. Our study explores two alternative designs: a
leader-based approach leveraging external services (Kafka and
ZooKeeper) and a leaderless implementation leveraging a novel
deterministic merging algorithm. Our evaluation, using a series
of microbenchmarks and a SaaS cloud monitoring scenario on a
37-server cluster, shows that actively replicated Flink can fully
mask the impact of faults on end-to-end latency.
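
The abstract does not describe the paper's deterministic merging algorithm, so the following is only a generic sketch of the underlying idea, not the authors' method. It assumes each input channel delivers events already sorted by timestamp, and that every replica breaks timestamp ties by channel id. The names `Event` and `deterministic_merge` are hypothetical and chosen for illustration.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Event(NamedTuple):
    timestamp: int  # assumed: timestamp assigned upstream, non-decreasing per channel
    channel: int    # assumed: stable id of the input channel
    payload: str


def deterministic_merge(channels: Iterable[Iterable[Event]]) -> Iterator[Event]:
    """Merge per-channel streams into one order that is identical on every replica.

    Events are ordered by (timestamp, channel id), so the result depends only
    on the events themselves and not on arrival timing or channel ordering.
    """
    return heapq.merge(*channels, key=lambda e: (e.timestamp, e.channel))


# Two replicas see the same channels in a different order,
# yet both emit the same merged sequence.
a = [Event(1, 0, "a1"), Event(3, 0, "a2")]
b = [Event(1, 1, "b1"), Event(2, 1, "b2")]
replica_1 = list(deterministic_merge([a, b]))
replica_2 = list(deterministic_merge([b, a]))
```

Because both replicas process events in the same order, either one can take over output without a recovery pause. A real deployment would additionally have to decide how long to wait on a slow or silent channel, which this sketch leaves out.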
Details
Original language | English |
---|---|
Pages | 56-66 |
Number of pages | 11 |
Publication status | Published - 2021 |
Peer-review status | Yes |
Conference
Title | 2021 40th International Symposium on Reliable Distributed Systems |
---|---|
Short title | SRDS 2021 |
Event number | 40 |
Duration | 20 - 23 September 2021 |
Website | |
City | Chicago |
Country | United States |
External IDs
Scopus | 85122983074 |
---|---|
Mendeley | 1003d635-7258-3bbd-b32f-09baa4a7bdee |
Keywords
ASJC Scopus subject areas
Keywords
- Active replication, Apache Flink, Fault tolerance, Stream Processing, cloud computing, computer centres, fault tolerant computing, merging