DECO: A dataset of annotated spreadsheets for layout and table recognition

Research output: Contribution to book/conference proceedings/anthology/report > Conference contribution > Contributed > peer-review

Contributors

  • Elvis Koci, Chair of Databases (Author)
  • Maik Thiele, Chair of Databases (Author)
  • Josephine Rehak, TUD Dresden University of Technology (Author)
  • Oscar Romero, UPC Polytechnic University of Catalonia (Barcelona Tech) (Author)
  • Wolfgang Lehner, Chair of Databases (Author)

Abstract

This paper presents DECO (Dresden Enron COrpus), a dataset of spreadsheet files annotated on the basis of layout and contents. It comprises 1,165 files extracted from the Enron corpus. Three different annotators (judges) assigned layout roles (e.g., Header, Data, and Notes) to non-empty cells and marked the borders of tables. Files that do not contain tables were flagged using categories such as Template, Form, and Report. A thorough analysis was then performed to uncover the characteristics of the overall dataset and of specific annotations. The results are discussed in this paper, providing several takeaways for future work. Furthermore, this work describes the annotation methodology in detail, going through the individual steps. The dataset, methodology, and tools are made publicly available so that they can be adopted for further studies. DECO is available at: https://wwwdb.inf.tu-dresden.de/research-projects/deexcelarator/

Details

Original language: English
Title of host publication: 2019 International Conference on Document Analysis and Recognition (ICDAR)
Publisher: IEEE Computer Society, Washington
Pages: 1280-1285
Number of pages: 6
ISBN (electronic): 9781728128610, 978-1-7281-3014-9
ISBN (print): 978-1-7281-3015-6
Publication status: Published - Sept 2019
Peer-reviewed: Yes

Publication series

Series: International Conference on Document Analysis and Recognition (ICDAR)
ISSN: 1520-5363

Conference

Title: 15th IAPR International Conference on Document Analysis and Recognition
Abbreviated title: ICDAR 2019
Conference number: 15
Duration: 20 - 25 September 2019
Location: International Convention Centre
City: Sydney
Country: Australia

External IDs

dblp: conf/icdar/KociTR0L19
ORCID: /0000-0001-8107-2775/work/142253490

Keywords

  • Annotation, Corpus, Dataset, Enron, Forms, Layout, Recognition, Spreadsheet, Table, Templates