A cost-based storage format selector for materialized results in big data frameworks

Rana Faisal Munir; Alberto Abelló; Oscar Romero; Maik Thiele; Wolfgang Lehner

doi:10.1007/s10619-019-07271-0

Publication: Contribution to journal › Research article › Contributed › Peer-reviewed

Contributors

Rana Faisal Munir, Chair of Databases, UPC Universitat Politècnica de Catalunya (Barcelona Tech) (Author)
Alberto Abelló, UPC Universitat Politècnica de Catalunya (Barcelona Tech) (Author)
Oscar Romero, UPC Universitat Politècnica de Catalunya (Barcelona Tech) (Author)
Maik Thiele, Chair of Databases (Author)
Wolfgang Lehner, Chair of Databases (Author)

Abstract

Modern big data frameworks (such as Hadoop and Spark) allow multiple users to do large-scale analysis simultaneously, by deploying data-intensive workflows (DIWs). These DIWs of different users share many common tasks (i.e., 50–80%), which can be materialized and reused in future executions. Materializing the output of such common tasks improves the overall processing time of DIWs and also saves computational resources. Current solutions for materialization store data on Distributed File Systems by using a fixed storage format. However, a fixed choice is not the optimal one for every situation. Specifically, different layouts (i.e., horizontal, vertical or hybrid) have a huge impact on execution, according to the access patterns of the subsequent operations. In this paper, we present a cost-based approach that helps decide on the most appropriate storage format in every situation. A generic cost-based framework that selects the best format by considering the three main layouts is presented. Then, we use our framework to instantiate cost models for specific Hadoop storage formats (namely SequenceFile, Avro and Parquet), and test it with two standard benchmark suites. Our solution gives on average 1.33× speedup over fixed SequenceFile, 1.11× speedup over fixed Avro, 1.32× speedup over fixed Parquet, and overall, it provides 1.25× speedup.
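
To make the idea of cost-based layout selection concrete, the following is a minimal sketch and not the cost model from the paper. It assumes a toy I/O cost (bytes written once plus bytes read per expected access) and picks the cheapest of the three layouts named in the abstract. The names `AccessPattern`, `estimated_cost` and `select_layout`, as well as all numeric constants, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Illustrative constants only (assumptions, not values from the paper).
WRITE_FACTOR = {"horizontal": 1.0, "vertical": 1.3, "hybrid": 1.2}
STITCH_PER_COLUMN = 0.02    # vertical: assumed tuple-reassembly cost per column read
HYBRID_READ_OVERHEAD = 1.1  # hybrid: assumed overhead for row-group metadata


@dataclass(frozen=True)
class AccessPattern:
    """Hypothetical summary of how later operations read a materialized result."""
    columns_read: int   # columns referenced by the consuming operations
    total_columns: int  # columns in the materialized result
    reads: int          # expected number of times the result is read


def estimated_cost(layout: str, size_bytes: float, pattern: AccessPattern) -> float:
    """Toy cost: one write plus `reads` scans, in bytes."""
    projected = pattern.columns_read / pattern.total_columns
    if layout == "horizontal":
        # Row layout: every scan reads whole rows, whatever columns are needed.
        read = size_bytes
    elif layout == "vertical":
        # Column layout: read only the needed columns, then stitch tuples together.
        read = size_bytes * (projected + STITCH_PER_COLUMN * pattern.columns_read)
    elif layout == "hybrid":
        # Row groups with columnar chunks: read needed columns plus some overhead.
        read = size_bytes * projected * HYBRID_READ_OVERHEAD
    else:
        raise ValueError(f"unknown layout: {layout}")
    return WRITE_FACTOR[layout] * size_bytes + pattern.reads * read


def select_layout(size_bytes: float, pattern: AccessPattern) -> str:
    """Return the layout with the lowest estimated cost."""
    return min(WRITE_FACTOR, key=lambda layout: estimated_cost(layout, size_bytes, pattern))


if __name__ == "__main__":
    full_scan = AccessPattern(columns_read=20, total_columns=20, reads=1)
    narrow_projection = AccessPattern(columns_read=2, total_columns=20, reads=10)
    print(select_layout(1e9, full_scan))
    print(select_layout(1e9, narrow_projection))
```

Under these assumed constants, a result that is scanned in full once favors the horizontal layout, while a result that is read repeatedly through a narrow projection favors a columnar one. A fixed format cannot be the cheapest choice for both access patterns, which is the observation that motivates a per-result selector.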

Details

Original language	English
Pages (from - to)	335-364
Number of pages	30
Journal	Distributed and parallel databases : an international journal
Volume	38
Issue number	2
Publication status	Published - 1 June 2020
Peer-review status	Yes

External IDs

Scopus	85065646456
ORCID	/0000-0001-8107-2775/work/142253445

Keywords

ASJC Scopus subject areas

Keywords

Big data, Cost model, Data-intensive workflows, HDFS, Materialized results, Storage format

TU Dresden Research Portal