Cell Classification for layout recognition in spreadsheets

Publication: Contribution to book/conference proceedings/anthology/report, Contribution to conference proceedings, Contributed, Peer-reviewed

Contributors

  • Elvis Koci, Technische Universität Dresden, UPC Universitat Politècnica de Catalunya (Barcelona Tech) (Author)
  • Maik Thiele, Technische Universität Dresden (Author)
  • Oscar Romero, UPC Universitat Politècnica de Catalunya (Barcelona Tech) (Author)
  • Wolfgang Lehner, Chair of Databases (Author)

Abstract

Spreadsheets constitute a notably large and valuable dataset of documents within enterprise settings and on the Web. Although spreadsheets are intuitive to use and equipped with powerful functionalities, extracting and reusing data from them remains a cumbersome and mostly manual task. Their greatest strength, the large degree of freedom they provide to the user, is at the same time their greatest weakness, since data can be arbitrarily structured. Therefore, in this paper we propose a supervised learning approach for layout recognition in spreadsheets. We work at the cell level, aiming to predict each cell's correct layout role out of five predefined alternatives. For this task we have considered a large number of features not covered before by related work. Moreover, we gather a considerably large dataset of annotated cells, from spreadsheets exhibiting variability in format and content. Our experiments, with five different classification algorithms, show that we can predict cell layout roles with high accuracy. Subsequently, in this paper we focus on revising the classification results, with the aim of repairing misclassifications. We propose a sophisticated approach, composed of three steps, which effectively corrects a reasonable number of inaccurate predictions.

Details

Original language: English
Title: Knowledge Discovery, Knowledge Engineering and Knowledge Management - 8th International Joint Conference, IC3K 2016, Revised Selected Papers
Editors: David Aveiro, Ana Fred, Jan Dietz, Jorge Bernardino, Kecheng Liu, Joaquim Filipe
Publisher: Springer Verlag
Pages: 78-100
Number of pages: 23
ISBN (Print): 9783319997001
Publication status: Published - 2019
Peer-review status: Yes

Publication series

Series: Communications in Computer and Information Science
Volume: 914
ISSN: 1865-0929

Conference

Title: 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, IC3K 2016
Duration: 9 - 11 November 2016
City: Porto
Country: Portugal

External IDs

ORCID /0000-0001-8107-2775/work/142253495

Keywords

  • Analysis, Classification, Document, Layout, Recognition, Spreadsheet, Table, Tabular