Active Learning for Spreadsheet Cell Classification
Publication: Contribution to book/conference proceedings/anthology/report › Contribution to conference proceedings › Contributed › Peer-reviewed
Contributors
Abstract
Spreadsheets are among the most successful content generation tools, used in almost every enterprise to create a plethora of semi-structured data. However, this information is often intermingled with various formatting, layout, and textual metadata, making it hard to identify and extract the actual tabular payload. For this reason, automated information extraction from spreadsheets is a challenging task. Previous papers proposed cell classification as a first step of the table extraction process; this step, however, requires a substantial amount of labeled training data, which is expensive to obtain. Therefore, in this paper we investigate a semi-supervised approach called Active Learning (AL), which can be used to train classification models by selecting only the most informative examples from an unlabeled dataset. In detail, we implement an AL cycle for spreadsheet cell classification and investigate different selection strategies and stopping criteria. We compare the performance of various AL strategies and derive guidelines for semi-supervised cell classification. Our experiments show that, by implementing AL for cell classification, we are able to reduce the amount of training data by 90% without any loss of accuracy compared to a passive classifier.
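The AL cycle described in the abstract (train on a small labeled set, query the most informative unlabeled cell, retrain, stop when a criterion is met) can be illustrated with a minimal sketch. Everything here is an assumption for illustration and not the paper's implementation: the nearest-centroid classifier, the margin-based uncertainty score, the two synthetic numeric features per cell, the class names `header` and `data`, and the parameters `budget` and `min_margin` are all hypothetical.

```python
import random

def centroids(labeled):
    """Mean feature vector per class, from (features, label) pairs."""
    sums, counts = {}, {}
    for x, y in labeled:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def predict(model, x):
    """Label of the nearest class centroid."""
    return min(model, key=lambda y: sq_dist(model[y], x))

def margin(model, x):
    """Gap between the two nearest centroids; a small gap means uncertain.
    Assumes the model already contains at least two classes."""
    d = sorted(sq_dist(c, x) for c in model.values())
    return d[1] - d[0]

def active_learning(pool, oracle, seed_idx, budget, min_margin):
    """Pool-based AL loop with a margin-based selection strategy.
    Stops when the label budget is spent, the pool is empty, or the most
    uncertain remaining cell has a margin of at least min_margin."""
    seed = set(seed_idx)
    labeled = [(pool[i], oracle(i)) for i in seed_idx]
    unlabeled = [i for i in range(len(pool)) if i not in seed]
    model = centroids(labeled)
    while unlabeled and len(labeled) < budget:
        # Selection strategy: query the cell the model is least sure about.
        query = min(unlabeled, key=lambda i: margin(model, pool[i]))
        if margin(model, pool[query]) >= min_margin:
            break  # stopping criterion: nothing uncertain is left
        labeled.append((pool[query], oracle(query)))  # ask the annotator
        unlabeled.remove(query)
        model = centroids(labeled)  # retrain on the enlarged labeled set
    return model, len(labeled)

# Synthetic, well-separated stand-in for cell features (hypothetical data).
rng = random.Random(0)
truth = ["data"] * 50 + ["header"] * 50
pool = [
    [rng.uniform(-1, 1) + (5 if y == "header" else 0) for _ in range(2)]
    for y in truth
]

# The oracle simulates a human annotator by looking up the true label.
model, n_labeled = active_learning(
    pool, oracle=lambda i: truth[i], seed_idx=[0, 50], budget=20, min_margin=25.0
)
accuracy = sum(predict(model, x) == y for x, y in zip(pool, truth)) / len(pool)
print(n_labeled, accuracy)
```

The seed set must cover at least two classes, since the margin is undefined otherwise. In a realistic setting the selection strategy and stopping criterion are the parts to vary and compare; the loop structure itself would stay the same.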
Details
Original language | English |
---|---|
Title | Proceedings of the Workshops of the EDBT/ICDT 2020 Joint Conference, Copenhagen, Denmark, March 30, 2020 |
Editors | Alexandra Poulovassilis |
Number of pages | 6 |
Publication status | Published - 2020 |
Peer-review status | Yes |
Publication series
Series | CEUR Workshop Proceedings |
---|---|
Volume | 2578 |
ISSN | 1613-0073 |
Conference
Title | Workshops of the 23rd International Conference on Extending Database Technology/23rd International Conference on Database Theory, EDBT-ICDT-WS 2020 |
---|---|
Duration | 30 March - 2 April 2020 |
City | Copenhagen |
Country | Denmark |
External IDs
Scopus | 85082745566 |
---|---|
ORCID | /0000-0001-8107-2775/work/142253451 |
ORCID | /0000-0002-5985-4348/work/162348854 |
Keywords
ASJC Scopus subject areas
Keywords
- Active Learning, Classification, Information Extraction, Machine Learning, Semi-supervised, Spreadsheets