Cache-efficient aggregation: Hashing is sorting

Ingo Müller; Peter Sanders; Arnaud Lacurie; Wolfgang Lehner; Franz Färber

doi:10.1145/2723372.2747644

Cache-efficient aggregation: Hashing is sorting

Research output: Contribution to book/Conference proceedings/Anthology/Report › Conference contribution › Contributed › peer-review

Contributors

Ingo Müller - , Karlsruhe Institute of Technology, SAP Research (Author)
Peter Sanders - , Karlsruhe Institute of Technology (Author)
Arnaud Lacurie - , SAP Research (Author)
Wolfgang Lehner - , Chair of Databases (Author)
Franz Färber - , Karlsruhe Institute of Technology (Author)

Abstract

For decades researchers have studied the duality of hashing and sorting for the implementation of the relational operators, especially for efficient aggregation. Depending on the underlying hardware and software architecture, the specifically implemented algorithms, and the data sets used in the experiments, different authors came to different conclusions about which is the better approach. In this paper we argue that in terms of cache efficiency, the two paradigms are actually the same. We support our claim by showing that the complexity of hashing is the same as the complexity of sorting in the external memory model. Furthermore we make the similarity of the two approaches obvious by designing an algorithmic framework that allows to switch seamlessly between hashing and sorting during execution. The fact that we mix hashing and sorting routines in the same algorithmic framework allows us to leverage the advantages of both approaches and makes their similarity obvious. On a more practical note, we also show how to achieve very low constant factors by tuning both the hashing and the sorting routines to modern hardware. Since we observe a complementary dependency of the constant factors of the two routines to the locality of the input, we exploit our framework to switch to the faster routine where appropriate. The result is a novel relational aggregation algorithm that is cache-efficient-independently and without prior knowledge of input skew and output cardinality-, highly parallelizable on modern multi-core systems, and operating at a speed close to the memory bandwidth, thus outperforming the state-of-the-art by up to 3:7×.

Details

Original language	English
Title of host publication	SIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
Publisher	Association for Computing Machinery
Pages	1123-1136
Number of pages	14
ISBN (electronic)	9781450327589
Publication status	Published - 27 May 2015
Peer-reviewed	Yes

Conference

Title	ACM SIGMOD International Conference on Management of Data, SIGMOD 2015
Duration	31 May - 4 June 2015
City	Melbourne
Country	Australia

External IDs

ORCID	/0000-0001-8107-2775/work/198592324

Keywords

ASJC Scopus subject areas

Software
Information Systems

Keywords

"Group by", Adaptive algorithm, Aggregation, Cache-efficient, Grouping, Hashing, Robust performance, Shared-memory, Sorting

Research Portal of the TU Dresden

Contributors

Abstract

Details

Conference

External IDs

Keywords

ASJC Scopus subject areas

Keywords