Table identification and reconstruction in spreadsheets

Publication: Contribution to book/conference proceedings/anthology/report › Contribution to conference proceedings › Contributed › Peer-reviewed

Contributors

Abstract

Spreadsheets are one of the most successful content generation tools, used in almost every enterprise to perform data transformation, visualization, and analysis. The high degree of freedom provided by these tools results in very complex sheets, intermingling the actual data with formatting, formulas, layout artifacts, and textual metadata. To unlock the wealth of data contained in spreadsheets, a human analyst will often have to understand and transform the data manually. To overcome this cumbersome process, we propose a framework that is able to automatically infer the structure and extract the data from these documents in a canonical form. In this paper, we describe our heuristics-based method for discovering tables in spreadsheets, given that each cell is classified as either header, attribute, metadata, data, or derived. Experimental results on a real-world dataset of 439 worksheets (858 tables) show that our approach is feasible and effectively identifies tables within partially structured spreadsheets.
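The abstract gives the input (a sheet whose cells are already labelled as header, attribute, metadata, data, or derived) and the output (tables), but not the heuristics themselves. As a rough illustration only, and not the authors' method, the sketch below applies one generic heuristic: treat each group of adjacent non-empty cells as a table candidate and keep it if it contains at least one data or derived cell. The function name `find_table_candidates`, the use of `None` for empty cells, and the sample sheet are assumptions made for this example.

```python
from collections import deque


def find_table_candidates(grid):
    """Group adjacent non-empty cells and return bounding boxes of candidates.

    grid: list of rows; each cell is a label string or None for an empty cell.
    Returns (top, left, bottom, right) tuples in row-major discovery order.
    Illustrative heuristic only; not the method described in the paper.
    """
    seen = set()
    boxes = []
    for r, row in enumerate(grid):
        for c, label in enumerate(row):
            if label is None or (r, c) in seen:
                continue
            # Breadth-first search over 4-connected non-empty neighbours.
            queue = deque([(r, c)])
            seen.add((r, c))
            top, left, bottom, right = r, c, r, c
            has_data = False
            while queue:
                cr, cc = queue.popleft()
                top, bottom = min(top, cr), max(bottom, cr)
                left, right = min(left, cc), max(right, cc)
                has_data = has_data or grid[cr][cc] in ("data", "derived")
                for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                    in_bounds = 0 <= nr < len(grid) and 0 <= nc < len(grid[nr])
                    if in_bounds and grid[nr][nc] is not None and (nr, nc) not in seen:
                        seen.add((nr, nc))
                        queue.append((nr, nc))
            # Regions without any data or derived cell (e.g. a lone title) are skipped.
            if has_data:
                boxes.append((top, left, bottom, right))
    return boxes


# Hypothetical sheet: a title cell, then two side-by-side tables split by an empty column.
H, D = "header", "data"
sheet = [
    ["metadata", None, None, None, None],
    [None, None, None, None, None],
    [H, H, None, H, H],
    [D, D, None, D, D],
    [D, D, None, D, "derived"],
]
candidates = find_table_candidates(sheet)
```

On this sample sheet the two blocks on either side of the empty column come out as separate candidates, and the isolated metadata cell is dropped. Real sheets would need further rules, for example for tables that contain empty rows or that touch each other.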

Details

Original language: English
Title: Advanced Information Systems Engineering
Editors: Eric Dubois, Klaus Pohl
Publisher: Springer, Berlin [etc.]
Pages: 527-541
Number of pages: 15
ISBN (Print): 9783319595351
Publication status: Published - 2017
Peer review status: Yes

Publication series

Series: Lecture Notes in Computer Science, Volume 10253
ISSN: 0302-9743

Conference

Title: Forum and Doctoral Consortium Papers Presented at the 29th International Conference on Advanced Information Systems Engineering, CAiSE-Forum-DC 2017
Duration: 12 - 16 June 2017
City: Essen
Country: Germany

External IDs

ORCID /0000-0001-8107-2775/work/142253526

Keywords

Keywords

  • Document, Grid, Identification, Layout, Recognition, Spreadsheet, Table, Tabular

Library keywords