Integrating Lightweight Compression Capabilities into Apache Arrow

Research output: Contribution to book/Conference proceedings/Anthology/Report > Conference contribution > Contributed > peer-review

Abstract

With the ongoing shift to a data-driven world in almost all application domains, the management and in particular the analytics of large amounts of data gain in importance. For that reason, a variety of new big data systems has been developed in recent years. Aside from that, a revision of the data organization and formats has been initiated as a foundation for these big data systems. In this context, Apache Arrow is a novel cross-language development platform for in-memory data with a standardized language-independent columnar memory format. The data is organized for efficient analytic operations on modern hardware, whereby Apache Arrow only supports dictionary encoding as a specific compression approach. However, there exists a large corpus of lightweight compression algorithms for columnar data which helps to reduce the necessary memory space as well as to increase the processing performance. Thus, we present a flexible and language-independent approach integrating lightweight compression algorithms into the Apache Arrow framework in this paper. With our so-called ArrowComp approach, we preserve the unique properties of Apache Arrow, but enhance the platform with a large variety of lightweight compression capabilities.

Details

Original language: English
Title of host publication: DATA 2020 - Proceedings of the 9th International Conference on Data Science, Technology and Applications
Editors: Slimane Hammoudi, Christoph Quix, Jorge Bernardino
Publisher: SciTePress - Science and Technology Publications
Pages: 55-66
Number of pages: 12
ISBN (electronic): 9789897584404
Publication status: Published - 2020
Peer-reviewed: Yes

Conference

Title: 9th International Conference on Data Science, Technology and Applications
Abbreviated title: DATA 2020
Conference number: 9
Description: held in conjunction with ICSOFT 2020, ICINCO 2020, SIMULTECH 2020, ICETE 2020 and DeLTA 2020
Duration: 7-9 July 2020
Location: Online
City: Paris
Country: France

External IDs

dblp: conf/data/HildebrandtHL20
Scopus: 85091968887
ORCID: /0000-0001-8107-2775/work/142253553

Keywords

  • Apache Arrow, Columnar data, Data formats, Integration, Lightweight compression