ApproxJoin: Approximate Distributed Joins

Do Le Quoc; Istemi Ekin Akkus; Pramod Bhatotia; Spyros Blanas; Ruichuan Chen; Christof Fetzer; Thorsten Strufe

doi:10.1145/3267809.3267834

ApproxJoin: Approximate Distributed Joins

Publikation: Beitrag in Buch/Konferenzbericht/Sammelband/Gutachten › Beitrag in Konferenzband › Beigetragen › Begutachtung

Beitragende

Do Le Quoc - , Professur für Systems Engineering (SE) (Autor:in)
Istemi Ekin Akkus - (Autor:in)
Pramod Bhatotia - , Distributed Systems (NFoG) (cfaed) (Autor:in)
Spyros Blanas - (Autor:in)
Ruichuan Chen - (Autor:in)
Christof Fetzer - , Professur für Systems Engineering (SE) (Autor:in)
Thorsten Strufe - , Professur für Datenschutz und Datensicherheit (DD) (Autor:in)

Abstract

A distributed join is a fundamental operation for processing massive datasets in parallel. Unfortunately, computing an equi-join over such datasets is very resource-intensive, even when done in parallel. Given this cost, the equi-join operator becomes a natural candidate for optimization using approximation techniques, which allow users to trade accuracy for latency. Finding the right approximation technique for joins, however, is a challenging task. Sampling, in particular, cannot be directly used in joins; naïvely performing a join over a sample of the dataset will not preserve statistical properties of the query result.

To address this problem, we introduce ApproxJoin. We interweave Bloom filter sketching and stratified sampling with the join computation in a new operator that preserves statistical properties of an aggregation over the join output. ApproxJoin leverages Bloom filters to avoid shuffling non-joinable data items around the network, and then applies stratified sampling to obtain a representative sample of the join output. We implemented ApproxJoin in Apache Spark, and evaluated it using microbenchmarks and real-world workloads. Our evaluation shows that ApproxJoin scales well and significantly reduces data movement, without sacrificing tight error bounds on the accuracy of the final results. ApproxJoin achieves a speedup of up to 9x over unmodified Spark-based joins with the same sampling ratio. Furthermore, the speedup is accompanied by a significant reduction in the shuffled data volume, which is up to 82x less than unmodified Spark-based joins.

Details

Originalsprache	Englisch
Titel	Proceedings of the ACM Symposium on Cloud Computing (SoCC18)
Erscheinungsort	New York, NY, USA
Herausgeber (Verlag)	Association for Computing Machinery, Inc
Seiten	426–438
Seitenumfang	13
ISBN (Print)	9781450360111
Publikationsstatus	Veröffentlicht - 2018
Peer-Review-Status	Ja

Externe IDs

Scopus	85059015197

Schlagworte

Forschungsprofillinien der TU Dresden

Informationstechnologien und Mikroelektronik

DFG-Fachsystematik nach Fachkollegium

Sicherheit und Verlässlichkeit

Schlagwörter

stratified sampling, approximate computing and distributed systems, multi-way joins, Approximate join processing, Approximate join processing, multi-way joins, stratified sampling, approximate computing and distributed systems

Forschungsportal der TU Dresden