Cache-efficient aggregation: Hashing is sorting

Research output: Contribution to book/Conference proceedings/Anthology/ReportConference contributionContributedpeer-review

Contributors

  • Ingo Müller - , Karlsruhe Institute of Technology, SAP Research (Author)
  • Peter Sanders - , Karlsruhe Institute of Technology (Author)
  • Arnaud Lacurie - , SAP Research (Author)
  • Wolfgang Lehner - , Chair of Databases (Author)
  • Franz Färber - , Karlsruhe Institute of Technology (Author)

Abstract

For decades researchers have studied the duality of hashing and sorting for the implementation of the relational operators, especially for efficient aggregation. Depending on the underlying hardware and software architecture, the specifically implemented algorithms, and the data sets used in the experiments, different authors came to different conclusions about which is the better approach. In this paper we argue that in terms of cache efficiency, the two paradigms are actually the same. We support our claim by showing that the complexity of hashing is the same as the complexity of sorting in the external memory model. Furthermore we make the similarity of the two approaches obvious by designing an algorithmic framework that allows to switch seamlessly between hashing and sorting during execution. The fact that we mix hashing and sorting routines in the same algorithmic framework allows us to leverage the advantages of both approaches and makes their similarity obvious. On a more practical note, we also show how to achieve very low constant factors by tuning both the hashing and the sorting routines to modern hardware. Since we observe a complementary dependency of the constant factors of the two routines to the locality of the input, we exploit our framework to switch to the faster routine where appropriate. The result is a novel relational aggregation algorithm that is cache-efficient-independently and without prior knowledge of input skew and output cardinality-, highly parallelizable on modern multi-core systems, and operating at a speed close to the memory bandwidth, thus outperforming the state-of-the-art by up to 3:7×.

Details

Original languageEnglish
Title of host publicationSIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
PublisherAssociation for Computing Machinery
Pages1123-1136
Number of pages14
ISBN (electronic)9781450327589
Publication statusPublished - 27 May 2015
Peer-reviewedYes

Conference

TitleACM SIGMOD International Conference on Management of Data, SIGMOD 2015
Duration31 May - 4 June 2015
CityMelbourne
CountryAustralia

External IDs

ORCID /0000-0001-8107-2775/work/198592324

Keywords

ASJC Scopus subject areas

Keywords

  • "Group by", Adaptive algorithm, Aggregation, Cache-efficient, Grouping, Hashing, Robust performance, Shared-memory, Sorting