Active Learning for Spreadsheet Cell Classification

Publication: Contribution to book/conference proceedings/anthology/report > Contribution to conference proceedings > Contributed > Peer-reviewed

Abstract

Spreadsheets are among the most successful content generation tools, used in almost every enterprise to create a plethora of semi-structured data. However, this information is often intermingled with formatting, layout, and textual metadata, which makes it hard to identify and extract the actual tabular payload. For this reason, automated information extraction from spreadsheets is a challenging task. Previous work proposed cell classification as a first step of the table extraction process; this, however, requires a substantial amount of labeled training data, which is expensive to obtain. Therefore, in this paper we investigate a semi-supervised approach called Active Learning (AL), which can be used to train classification models by selecting only the most informative examples from an unlabeled dataset. Specifically, we implement an AL cycle for spreadsheet cell classification and investigate different selection strategies and stopping criteria. We compare the performance of various AL strategies and derive guidelines for semi-supervised cell classification. Our experiments show that, by implementing AL for cell classification, we are able to reduce the amount of training data by 90% without any loss in accuracy compared to a passive classifier.
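The AL cycle described in the abstract (train, select the most informative unlabeled cells, query their labels, retrain, stop) can be illustrated with a minimal pool-based sketch. Everything specific here is an assumption for illustration and not taken from the paper: the nearest-centroid stand-in classifier, the least-confidence selection strategy, the two stopping criteria (a labeling budget and an uncertainty threshold), the synthetic 2-D features, and the hypothetical cell classes "data" and "header".

```python
import math
import random


def fit_centroids(X, y):
    """Stand-in classifier: one mean feature vector per class label."""
    sums, counts = {}, {}
    for features, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for k, value in enumerate(features):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict_proba(centroids, features):
    """Class probabilities from a softmax over negative centroid distances."""
    scores = {c: math.exp(-math.dist(features, m)) for c, m in centroids.items()}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


def uncertainty(centroids, features):
    """Least-confidence score: 1 minus the top class probability."""
    return 1.0 - max(predict_proba(centroids, features).values())


def active_learning(pool, oracle, seed_idx, batch_size=5, budget=40,
                    min_uncertainty=0.05):
    """Query the oracle for the most uncertain cells until a criterion fires."""
    labels = {i: oracle(i) for i in seed_idx}
    unlabeled = [i for i in range(len(pool)) if i not in labels]
    while True:
        idx = list(labels)
        centroids = fit_centroids([pool[i] for i in idx],
                                  [labels[i] for i in idx])
        # Stopping criterion 1: pool exhausted or labeling budget spent.
        if not unlabeled or len(labels) >= budget:
            break
        unlabeled.sort(key=lambda i: uncertainty(centroids, pool[i]),
                       reverse=True)
        # Stopping criterion 2: even the most uncertain cell is classified
        # with high confidence, so further labels are unlikely to help.
        if uncertainty(centroids, pool[unlabeled[0]]) < min_uncertainty:
            break
        take = min(batch_size, budget - len(labels))
        for i in unlabeled[:take]:
            labels[i] = oracle(i)  # simulated human annotation
        unlabeled = unlabeled[take:]
    return centroids, labels


# Synthetic pool: two Gaussian blobs standing in for cell feature vectors.
rng = random.Random(0)
pool = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
pool += [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(100)]
truth = ["data"] * 100 + ["header"] * 100  # hypothetical cell classes

# The seed set contains one cell of each class so both centroids exist.
centroids, labels = active_learning(pool, lambda i: truth[i], seed_idx=[0, 100])
predicted = [max(p, key=p.get)
             for p in (predict_proba(centroids, x) for x in pool)]
accuracy = sum(p == t for p, t in zip(predicted, truth)) / len(pool)
print(f"labeled {len(labels)} of {len(pool)} cells, pool accuracy {accuracy:.2f}")
```

A passive baseline for comparison would train the same classifier on all 200 labeled cells; the sketch only shows the shape of the loop and makes no claim about the savings reported in the paper.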

Details

Original language: English
Title: Proceedings of the Workshops of the EDBT/ICDT 2020 Joint Conference, Copenhagen, Denmark, March 30, 2020
Editors: Alexandra Poulovassilis
Number of pages: 6
Publication status: Published - 2020
Peer-reviewed: Yes

Publication series

Series: CEUR Workshop Proceedings
Volume: 2578
ISSN: 1613-0073

Conference

Title: Workshops of the 23rd International Conference on Extending Database Technology/23rd International Conference on Database Theory, EDBT-ICDT-WS 2020
Duration: 30 March - 2 April 2020
City: Copenhagen
Country: Denmark

External IDs

Scopus 85082745566
ORCID /0000-0001-8107-2775/work/142253451
ORCID /0000-0002-5985-4348/work/162348854

Keywords

  • Active Learning, Classification, Information Extraction, Machine Learning, Semi-supervised, Spreadsheets