DECO: A dataset of annotated spreadsheets for layout and table recognition

Publication: Contribution to book/conference proceedings/anthology/report; Contribution to conference proceedings; Contributed; Peer-reviewed

Abstract

This paper presents DECO (Dresden Enron COrpus), a dataset of spreadsheet files annotated on the basis of layout and contents. It comprises 1,165 files extracted from the Enron corpus. Three different annotators (judges) assigned layout roles (e.g., Header, Data, and Notes) to non-empty cells and marked the borders of tables. Files that do not contain tables were flagged with categories such as Template, Form, and Report. Subsequently, a thorough analysis is performed to uncover the characteristics of the overall dataset and of specific annotations. The results are discussed in this paper and provide several takeaways for future work. Furthermore, this work describes the annotation methodology in detail, going through the individual steps. The dataset, methodology, and tools are publicly available so that they can be adopted for further studies. DECO is available at: https://wwwdb.inf.tu-dresden.de/research-projects/deexcelarator/

Details

Original language: English
Title: 2019 International Conference on Document Analysis and Recognition (ICDAR)
Publisher: IEEE Computer Society, Washington
Pages: 1280-1285
Number of pages: 6
ISBN (electronic): 9781728128610, 978-1-7281-3014-9
ISBN (print): 978-1-7281-3015-6
Publication status: Published - Sept. 2019
Peer-reviewed: Yes

Publication series

Series: International Conference on Document Analysis and Recognition (ICDAR)
ISSN: 1520-5363

Conference

Title: 15th IAPR International Conference on Document Analysis and Recognition
Short title: ICDAR 2019
Event number: 15
Duration: 20 - 25 September 2019
Location: International Convention Centre
City: Sydney
Country: Australia

External IDs

dblp: conf/icdar/KociTR0L19
ORCID: /0000-0001-8107-2775/work/142253490

Keywords

  • Annotation, Corpus, Dataset, Enron, Forms, Layout, Recognition, Spreadsheet, Table, Templates