ResilientStore: A heuristic-based data format selector for intermediate results

Research output: Contribution to book/Conference proceedings/Anthology/Report > Conference contribution > Contributed > peer-review

Contributors

  • Rana Faisal Munir, UPC Polytechnic University of Catalonia (Barcelona Tech) (Author)
  • Oscar Romero, UPC Polytechnic University of Catalonia (Barcelona Tech) (Author)
  • Alberto Abelló, UPC Polytechnic University of Catalonia (Barcelona Tech) (Author)
  • Besim Bilalli, UPC Polytechnic University of Catalonia (Barcelona Tech) (Author)
  • Maik Thiele, TUD Dresden University of Technology (Author)
  • Wolfgang Lehner, TUD Dresden University of Technology (Author)

Abstract

Large-scale data analysis is an important activity in many organizations that typically requires the deployment of data-intensive workflows. As data is processed, these workflows generate large intermediate results, which are typically pipelined from one operator to the next. However, if materialized, these results become reusable, so subsequent workflows need not recompute them. There are already many solutions that materialize intermediate results, but all of them assume a fixed data format. A fixed format, however, may not be the optimal one for every situation. For example, it is well known that different data fragmentation strategies (e.g., horizontal and vertical) behave better or worse according to the access patterns of the subsequent operations. In this paper, we present ResilientStore, which assists in selecting the most appropriate data format for materializing intermediate results. Given a workflow and a set of materialization points, it uses rule-based heuristics to choose the best storage data format based on subsequent access patterns. We have implemented ResilientStore for HDFS and three different data formats: SequenceFile, Parquet and Avro. Experimental results show that our solution gives 18% better performance than any solution based on a single fixed format.
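
The abstract does not spell out the rules themselves, so the following is only a rough sketch of what a rule-based selector over access patterns might look like. The `AccessPattern` fields, the `choose_format` function, the 0.5 projection threshold, and the ordering of the rules are all hypothetical and are not taken from the paper. The only background assumed is that Parquet is a columnar (vertical) layout, while Avro and SequenceFile are row-oriented (horizontal).

```python
from dataclasses import dataclass
from enum import Enum


class Format(Enum):
    SEQUENCE_FILE = "SequenceFile"
    PARQUET = "Parquet"
    AVRO = "Avro"


@dataclass
class AccessPattern:
    """Hypothetical summary of how later operators read one intermediate result."""
    columns_read: int     # columns the consuming operators actually reference
    total_columns: int    # columns present in the materialized result
    has_selection: bool   # True if a consuming operator filters rows


def choose_format(p: AccessPattern, projection_threshold: float = 0.5) -> Format:
    """Pick a format with illustrative rules; threshold and rule order are assumptions."""
    if p.total_columns <= 0:
        raise ValueError("total_columns must be positive")
    fraction = p.columns_read / p.total_columns
    # Assumed rule 1: narrow projections favor a columnar (vertical) layout.
    if fraction <= projection_threshold:
        return Format.PARQUET
    # Assumed rule 2: wide reads combined with row filters go to Avro.
    if p.has_selection:
        return Format.AVRO
    # Assumed rule 3: plain full scans fall back to SequenceFile.
    return Format.SEQUENCE_FILE
```

For instance, under these assumed rules a consumer that reads 2 of 10 columns would be assigned Parquet, while a full scan without filters would be assigned SequenceFile. A real selector would derive its rules and thresholds from measurements on the target cluster.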

Details

Original language: English
Title of host publication: Model and Data Engineering
Editors: Óscar Pastor, Jesús M. Almendros Jiménez, Yamine Aït-Ameur, Ladjel Bellatreche
Publisher: Springer Verlag
Pages: 42-56
Number of pages: 15
ISBN (print): 9783319455464
Publication status: Published - 2016
Peer-reviewed: Yes
Externally published: Yes

Publication series

Series: Lecture Notes in Computer Science, Volume 9893
ISSN: 0302-9743

Conference

Title: 6th International Conference on Model and Data Engineering, MEDI 2016
Duration: 21 - 23 September 2016
City: Almeria
Country: Spain

External IDs

ORCID /0000-0001-8107-2775/work/142253538

Keywords

  • Big data, Data format, Data-intensive workflows, HDFS, Intermediate results
