Table identification and reconstruction in spreadsheets
Research output: Contribution to book/Conference proceedings/Anthology/Report › Conference contribution › Contributed › Peer-reviewed
Abstract
Spreadsheets are one of the most successful content generation tools, used in almost every enterprise to perform data transformation, visualization, and analysis. The high degree of freedom provided by these tools results in very complex sheets, intermingling the actual data with formatting, formulas, layout artifacts, and textual metadata. To unlock the wealth of data contained in spreadsheets, a human analyst will often have to understand and transform the data manually. To overcome this cumbersome process, we propose a framework that is able to automatically infer the structure and extract the data from these documents in a canonical form. In this paper, we describe our heuristics-based method for discovering tables in spreadsheets, given that each cell is classified as either header, attribute, metadata, data, or derived. Experimental results on a real-world dataset of 439 worksheets (858 tables) show that our approach is feasible and effectively identifies tables within partially structured spreadsheets.
Details
Original language | English |
---|---|
Title | Advanced Information Systems Engineering |
Editors | Eric Dubois, Klaus Pohl |
Publisher | Springer, Berlin [and others] |
Pages | 527-541 |
Number of pages | 15 |
ISBN (Print) | 9783319595351 |
Publication status | Published - 2017 |
Peer-reviewed | Yes |
Publication series
Series | Lecture Notes in Computer Science, Volume 10253 |
---|---|
ISSN | 0302-9743 |
Conference
Title | Forum and Doctoral Consortium Papers Presented at the 29th International Conference on Advanced Information Systems Engineering, CAiSE-Forum-DC 2017 |
---|---|
Duration | 12 - 16 June 2017 |
City | Essen |
Country | Germany |
External IDs
ORCID | /0000-0001-8107-2775/work/142253526 |
---|
Keywords
- Document, Grid, Identification, Layout, Recognition, Spreadsheet, Table, Tabular