Active Learning for Spreadsheet Cell Classification

Research output: Contribution to book/conference proceedings/anthology/report › Conference contribution › Contributed › peer-review

Contributors

Abstract

Spreadsheets are among the most successful content generation tools, used in almost every enterprise to create a plethora of semistructured data. However, this information is often intermingled with various formatting, layout, and textual metadata, making it hard to identify and extract the actual tabularly structured payload. For this reason, automated information extraction from spreadsheets is a challenging task. Previous papers proposed cell classification as a first step of the table extraction process; this, however, requires a substantial amount of labeled training data, which is expensive to obtain. Therefore, in this paper we investigate a semi-supervised approach called Active Learning (AL), which can be used to train classification models by selecting only the most informative examples from an unlabeled dataset. Specifically, we implement an AL cycle for spreadsheet cell classification and investigate different selection strategies and stopping criteria. We compare the performance of various AL strategies and derive guidelines for semi-supervised cell classification. Our experiments show that, by implementing AL for cell classification, we are able to reduce the amount of training data by 90% without any loss of accuracy compared to a passive classifier.
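
The AL cycle mentioned in the abstract can be illustrated with a minimal pool-based sketch. This is not the paper's implementation: the nearest-centroid classifier, the margin-based selection strategy, the fixed label budget used as stopping criterion, the two-dimensional synthetic features, and the labels "header" and "data" are all assumptions chosen only to keep the example self-contained.

```python
import math
import random


class NearestCentroid:
    """Tiny stand-in classifier: predicts the label of the closest class centroid."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for i, v in enumerate(x):
                acc[i] += v
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {
            label: [v / counts[label] for v in acc] for label, acc in sums.items()
        }
        return self

    def _distances(self, x):
        # (distance, label) pairs, closest centroid first.
        return sorted((math.dist(x, c), label) for label, c in self.centroids.items())

    def predict(self, x):
        return self._distances(x)[0][1]

    def margin(self, x):
        # Gap between the two closest centroids; a small gap means an uncertain prediction.
        d = self._distances(x)
        return d[1][0] - d[0][0]


def active_learning(pool, oracle, seed_indices, budget):
    """Pool-based AL cycle: train, pick the most uncertain example, ask the oracle, repeat.

    Assumes the seed set contains at least two classes, so that a margin can be computed.
    """
    labeled = {i: oracle(i) for i in seed_indices}
    model = NearestCentroid()
    while True:
        idx = list(labeled)
        model.fit([pool[i] for i in idx], [labeled[i] for i in idx])
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        # Assumed stopping criterion: a fixed labeling budget (or an exhausted pool).
        if len(labeled) >= budget or not unlabeled:
            return model, labeled
        # Assumed selection strategy: smallest margin, i.e. the least certain example.
        query = min(unlabeled, key=lambda i: model.margin(pool[i]))
        labeled[query] = oracle(query)


def make_pool(n_per_class=100, seed=0):
    """Synthetic stand-in for cell feature vectors: two Gaussian blobs in 2D."""
    rng = random.Random(seed)
    pool, truth = [], []
    for label, (cx, cy) in {"header": (0.0, 0.0), "data": (3.0, 3.0)}.items():
        for _ in range(n_per_class):
            pool.append((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)))
            truth.append(label)
    return pool, truth


pool, truth = make_pool()
# The oracle plays the role of the human annotator; here it just looks up the true label.
model, labeled = active_learning(
    pool, oracle=lambda i: truth[i], seed_indices=[0, 100], budget=20
)
accuracy = sum(model.predict(x) == t for x, t in zip(pool, truth)) / len(pool)
print(f"labels used: {len(labeled)} of {len(pool)}, accuracy: {accuracy:.2f}")
```

A passive baseline would label the whole pool before training. Comparing its accuracy with that of the loop above at different budgets is the kind of experiment the abstract describes, although the numbers from this toy setup say nothing about the paper's results.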

Details

Original language: English
Title of host publication: Proceedings of the Workshops of the EDBT/ICDT 2020 Joint Conference, Copenhagen, Denmark, March 30, 2020
Editors: Alexandra Poulovassilis
Number of pages: 6
Publication status: Published - 2020
Peer-reviewed: Yes

Publication series

Series: CEUR Workshop Proceedings
Volume: 2578
ISSN: 1613-0073

Conference

Title: Workshops of the 23rd International Conference on Extending Database Technology/23rd International Conference on Database Theory, EDBT-ICDT-WS 2020
Duration: 30 March - 2 April 2020
City: Copenhagen
Country: Denmark

External IDs

Scopus 85082745566
ORCID /0000-0001-8107-2775/work/142253451
ORCID /0000-0002-5985-4348/work/162348854

Keywords

  • Active Learning, Classification, Information Extraction, Machine Learning, Semi-supervised, Spreadsheets