Table identification and reconstruction in spreadsheets
Research output: Contribution to book/conference proceedings/anthology/report › Conference contribution › Contributed › peer-review
Abstract
Spreadsheets are one of the most successful content generation tools, used in almost every enterprise to perform data transformation, visualization, and analysis. The high degree of freedom provided by these tools results in very complex sheets, intermingling the actual data with formatting, formulas, layout artifacts, and textual metadata. To unlock the wealth of data contained in spreadsheets, a human analyst will often have to understand and transform the data manually. To overcome this cumbersome process, we propose a framework that is able to automatically infer the structure and extract the data from these documents in a canonical form. In this paper, we describe our heuristics-based method for discovering tables in spreadsheets, given that each cell is classified as either header, attribute, metadata, data, or derived. Experimental results on a real-world dataset of 439 worksheets (858 tables) show that our approach is feasible and effectively identifies tables within partially structured spreadsheets.
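To make the task concrete, the following is a minimal sketch of one simple table-discovery heuristic that works on pre-classified cells. It is not the method described in the paper, whose heuristics are not given in this record. It assumes a worksheet is encoded as a list of rows whose cells hold one of the five labels named in the abstract, or `None` for an empty cell. The function name `find_tables`, the grid encoding, the choice to exclude `metadata` cells from table regions, and the rule that a table needs a header plus at least one data or derived cell are all illustrative assumptions.

```python
from collections import deque

# Labels assumed to belong inside a table region. Excluding "metadata"
# (titles, notes) is an assumption of this sketch, not a rule from the paper.
TABLE_LABELS = {"header", "attribute", "data", "derived"}


def find_tables(grid):
    """Return bounding boxes (top, left, bottom, right) of candidate tables.

    `grid` is a list of rows; each cell is a label string or None (empty).
    Cells are grouped by 4-connected adjacency, and a group counts as a
    table only if it has a header and at least one data or derived cell.
    """
    n_rows = len(grid)
    n_cols = max((len(row) for row in grid), default=0)

    def label(r, c):
        # Rows may be ragged; missing cells are treated as empty.
        row = grid[r]
        return row[c] if c < len(row) else None

    seen = set()
    tables = []
    for r in range(n_rows):
        for c in range(n_cols):
            if (r, c) in seen or label(r, c) not in TABLE_LABELS:
                continue
            # Breadth-first search over adjacent table-like cells.
            seen.add((r, c))
            queue = deque([(r, c)])
            cells = []
            while queue:
                cr, cc = queue.popleft()
                cells.append((cr, cc))
                for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                    if (0 <= nr < n_rows and 0 <= nc < n_cols
                            and (nr, nc) not in seen
                            and label(nr, nc) in TABLE_LABELS):
                        seen.add((nr, nc))
                        queue.append((nr, nc))
            labels = {label(cr, cc) for cr, cc in cells}
            if "header" in labels and labels & {"data", "derived"}:
                tables.append((
                    min(cr for cr, _ in cells),
                    min(cc for _, cc in cells),
                    max(cr for cr, _ in cells),
                    max(cc for _, cc in cells),
                ))
    return tables


if __name__ == "__main__":
    H, D = "header", "data"
    sheet = [
        ["metadata", None, None, None, None],
        [H, H, None, H, H],
        [D, D, None, D, D],
        [D, "derived", None, D, D],
    ]
    print(find_tables(sheet))  # [(1, 0, 3, 1), (1, 3, 3, 4)]
```

In this toy sheet, the empty middle column separates two side-by-side tables, and the lone metadata cell in the first row is ignored. Real worksheets would likely need further rules, for example for tables separated by formatting alone or for empty cells inside a table.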
Details
Original language | English |
---|---|
Title of host publication | Advanced Information Systems Engineering |
Editors | Eric Dubois, Klaus Pohl |
Publisher | Springer, Berlin [u. a.] |
Pages | 527-541 |
Number of pages | 15 |
ISBN (print) | 9783319595351 |
Publication status | Published - 2017 |
Peer-reviewed | Yes |
Publication series
Series | Lecture Notes in Computer Science, Volume 10253 |
---|---|
ISSN | 0302-9743 |
Conference
Title | Forum and Doctoral Consortium Papers Presented at the 29th International Conference on Advanced Information Systems Engineering, CAiSE-Forum-DC 2017 |
---|---|
Duration | 12 - 16 June 2017 |
City | Essen |
Country | Germany |
External IDs
ORCID | /0000-0001-8107-2775/work/142253526 |
---|
Keywords
- Document, Grid, Identification, Layout, Recognition, Spreadsheet, Table, Tabular