Table identification and reconstruction in spreadsheets

Research output: Contribution to book/conference proceedings/anthology/report > Conference contribution > Contributed > peer-review

Contributors

  • Elvis Koci - Chair of Databases, UPC Polytechnic University of Catalonia (Barcelona Tech) (Author)
  • Maik Thiele - Chair of Databases (Author)
  • Oscar Romero - UPC Polytechnic University of Catalonia (Barcelona Tech) (Author)
  • Wolfgang Lehner - Chair of Databases (Author)

Abstract

Spreadsheets are one of the most successful content generation tools, used in almost every enterprise to perform data transformation, visualization, and analysis. The high degree of freedom provided by these tools results in very complex sheets, intermingling the actual data with formatting, formulas, layout artifacts, and textual metadata. To unlock the wealth of data contained in spreadsheets, a human analyst will often have to understand and transform the data manually. To overcome this cumbersome process, we propose a framework that is able to automatically infer the structure and extract the data from these documents in a canonical form. In this paper, we describe our heuristics-based method for discovering tables in spreadsheets, given that each cell is classified as either header, attribute, metadata, data, or derived. Experimental results on a real-world dataset of 439 worksheets (858 tables) show that our approach is feasible and effectively identifies tables within partially structured spreadsheets.
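
The abstract does not spell out the heuristics themselves, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes the sheet is already available as a grid of cell labels (one of header, attribute, metadata, data, or derived, with None for empty cells), groups adjacent non-empty cells into connected regions, and keeps the regions that contain at least one header cell and one data cell. The function name find_table_regions, the grid representation, and the "header plus data" filter are all hypothetical choices made for illustration.

```python
from collections import deque


def find_table_regions(grid):
    """Return bounding boxes (top, left, bottom, right), inclusive, of
    4-connected regions of labeled cells that contain both a 'header'
    and a 'data' cell. Illustrative assumption, not the paper's rules."""
    rows = len(grid)
    seen = set()
    regions = []
    for r in range(rows):
        for c in range(len(grid[r])):
            if grid[r][c] is None or (r, c) in seen:
                continue
            # Breadth-first search over neighbouring non-empty cells.
            queue = deque([(r, c)])
            seen.add((r, c))
            cells = []
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (
                        0 <= ny < rows
                        and 0 <= nx < len(grid[ny])
                        and grid[ny][nx] is not None
                        and (ny, nx) not in seen
                    ):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            labels = {grid[y][x] for y, x in cells}
            # Assumed filter: a candidate table needs a header and data.
            if "header" in labels and "data" in labels:
                ys = [y for y, _ in cells]
                xs = [x for _, x in cells]
                regions.append((min(ys), min(xs), max(ys), max(xs)))
    return regions


H, D = "header", "data"
sheet = [
    [H, H, None, None, None],
    [D, D, None, H, H],
    [D, D, None, D, D],
    [None, None, None, None, None],
    ["metadata", None, None, None, None],
]
print(find_table_regions(sheet))
```

In this toy sheet the two side-by-side blocks are reported as separate candidate tables, while the isolated metadata cell is ignored. A realistic approach would also need to handle empty rows inside a table, misclassified cells, and tables that touch each other, none of which this sketch attempts.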

Details

Original language: English
Title of host publication: Advanced Information Systems Engineering
Editors: Eric Dubois, Klaus Pohl
Publisher: Springer, Berlin [et al.]
Pages: 527-541
Number of pages: 15
ISBN (print): 9783319595351
Publication status: Published - 2017
Peer-reviewed: Yes

Publication series

Series: Lecture Notes in Computer Science, Volume 10253
ISSN: 0302-9743

Conference

Title: Forum and Doctoral Consortium Papers Presented at the 29th International Conference on Advanced Information Systems Engineering, CAiSE-Forum-DC 2017
Duration: 12 - 16 June 2017
City: Essen
Country: Germany

External IDs

ORCID /0000-0001-8107-2775/work/142253526

Keywords

  • Document, Grid, Identification, Layout, Recognition, Spreadsheet, Table, Tabular

Library keywords