Cache-efficient aggregation: Hashing is sorting

Ingo Müller; Peter Sanders; Arnaud Lacurie; Wolfgang Lehner; Franz Färber

doi:10.1145/2723372.2747644

Cache-efficient aggregation: Hashing is sorting

Publikation: Beitrag in Buch/Konferenzbericht/Sammelband/Gutachten › Beitrag in Konferenzband › Beigetragen › Begutachtung

Beitragende

Ingo Müller - , Karlsruher Institut für Technologie, SAP Research (Autor:in)
Peter Sanders - , Karlsruher Institut für Technologie (Autor:in)
Arnaud Lacurie - , SAP Research (Autor:in)
Wolfgang Lehner - , Professur für Datenbanken (Autor:in)
Franz Färber - , Karlsruher Institut für Technologie (Autor:in)

Abstract

For decades researchers have studied the duality of hashing and sorting for the implementation of the relational operators, especially for efficient aggregation. Depending on the underlying hardware and software architecture, the specifically implemented algorithms, and the data sets used in the experiments, different authors came to different conclusions about which is the better approach. In this paper we argue that in terms of cache efficiency, the two paradigms are actually the same. We support our claim by showing that the complexity of hashing is the same as the complexity of sorting in the external memory model. Furthermore we make the similarity of the two approaches obvious by designing an algorithmic framework that allows to switch seamlessly between hashing and sorting during execution. The fact that we mix hashing and sorting routines in the same algorithmic framework allows us to leverage the advantages of both approaches and makes their similarity obvious. On a more practical note, we also show how to achieve very low constant factors by tuning both the hashing and the sorting routines to modern hardware. Since we observe a complementary dependency of the constant factors of the two routines to the locality of the input, we exploit our framework to switch to the faster routine where appropriate. The result is a novel relational aggregation algorithm that is cache-efficient-independently and without prior knowledge of input skew and output cardinality-, highly parallelizable on modern multi-core systems, and operating at a speed close to the memory bandwidth, thus outperforming the state-of-the-art by up to 3:7×.

Details

Originalsprache	Englisch
Titel	SIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
Herausgeber (Verlag)	Association for Computing Machinery
Seiten	1123-1136
Seitenumfang	14
ISBN (elektronisch)	9781450327589
Publikationsstatus	Veröffentlicht - 27 Mai 2015
Peer-Review-Status	Ja

Konferenz

Titel	ACM SIGMOD International Conference on Management of Data, SIGMOD 2015
Dauer	31 Mai - 4 Juni 2015
Stadt	Melbourne
Land	Australien

Externe IDs

ORCID	/0000-0001-8107-2775/work/198592324

Schlagworte

ASJC Scopus Sachgebiete

Software
Information systems

Schlagwörter

"Group by", Adaptive algorithm, Aggregation, Cache-efficient, Grouping, Hashing, Robust performance, Shared-memory, Sorting

Forschungsportal der TU Dresden