Rapidgzip: Parallel Decompression and Seeking in Gzip Files Using Cache Prefetching

Research output: Conference contribution (contributed, peer-reviewed)


Gzip is a ubiquitously used file compression format. Although a multitude of gzip implementations exist, only pugz can fully utilize current multi-core processor architectures for decompression. Yet, pugz cannot decompress arbitrary gzip files: it requires the decompressed stream to contain only byte values 9-126. In this work, we present a generalization of the parallelization scheme used by pugz that can be reliably applied to arbitrary gzip-compressed data without compromising performance. We show that the requirements pugz poses on the file contents can be dropped by implementing an architecture based on a cache and a parallelized prefetcher. This architecture can safely handle the faulty decompression results that can appear when threads use trial and error to start decompressing in the middle of a gzip file. Using 128 cores, our implementation reaches 8.7 GB/s decompression bandwidth for gzip-compressed base64-encoded data, a speedup of 55 over single-threaded GNU gzip, and 5.6 GB/s for the Silesia corpus, a speedup of 33 over GNU gzip.
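The trial-and-error step described in the abstract can be illustrated with a small sketch: a worker thread that lands in the middle of a gzip file probes candidate offsets, attempting to decode a raw deflate stream at each one until decoding succeeds. This is a heavily simplified, hypothetical illustration using Python's `zlib` at byte granularity; the actual rapidgzip implementation works at bit granularity and validates speculative results later via its cache and prefetcher, and `find_deflate_start` is an invented helper name, not part of rapidgzip's API.

```python
import zlib


def find_deflate_start(buf):
    """Trial-and-error scan: try decoding a raw deflate stream at each
    byte offset until one decompresses cleanly.

    Simplified sketch only: real deflate blocks may start at arbitrary
    bit offsets, and a speculative decode must still be verified against
    the preceding stream before its output is trusted.
    """
    for off in range(len(buf)):
        decoder = zlib.decompressobj(wbits=-15)  # raw deflate, no gzip header
        try:
            out = decoder.decompress(buf[off:])
        except zlib.error:
            continue  # not a plausible block start; try the next offset
        if decoder.eof:  # stream terminated with a proper final block
            return off, out
    return None, b""


# Demo: prefix the stream with 0xff bytes, which always decode as an
# invalid deflate block type, so only the true start offset succeeds.
payload = b"hello rapidgzip " * 4
compressor = zlib.compressobj(wbits=-15)
stream = compressor.compress(payload) + compressor.flush()
offset, recovered = find_deflate_start(b"\xff" * 8 + stream)
```

In the full scheme, many such workers run in parallel on different chunks, their (possibly faulty) outputs land in a cache, and a prefetcher schedules chunks ahead of the sequential reader so verified results are ready when needed.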


Original language: English
Title of host publication: HPDC 2023 - Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing
Place of publication: Orlando, FL, US
Publisher: Association for Computing Machinery (ACM), New York
Number of pages: 13
ISBN (electronic): 9798400701559
Publication status: Published - 19 Jun 2023

External IDs

Scopus: 85169612617