Active Learning for Spreadsheet Cell Classification
Research output: Contribution to book/conference proceedings/anthology/report › Conference contribution › Contributed › peer-review
Abstract
Spreadsheets are arguably the most successful content generation tools, used in almost every enterprise to create a plethora of semistructured data. However, this information is often intermingled with various formatting, layout, and textual metadata, making it hard to identify and extract the actual tabularly structured payload. For this reason, automated information extraction from spreadsheets is a challenging task. Previous papers proposed cell classification as a first step of the table extraction process; this, however, requires a substantial amount of labeled training data, which is expensive to obtain. Therefore, in this paper we investigate a semi-supervised approach called Active Learning (AL), which can be used to train classification models by selecting only the most informative examples from an unlabeled dataset. In detail, we implement an AL cycle for spreadsheet cell classification and investigate different selection strategies and stopping criteria. We compare the performance of various AL strategies and derive guidelines for semi-supervised cell classification. Our experiments show that, by implementing AL for cell classification, we are able to reduce the amount of training data by 90% without any loss in accuracy compared to a passive classifier.
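To make the idea of an AL cycle concrete, the following is a minimal sketch of pool-based active learning with one selection strategy (smallest-margin uncertainty sampling) and two stopping criteria (a label budget and a confidence threshold). It is not the implementation from the paper: the nearest-centroid classifier, the toy 2-D points standing in for cell features, and all names (`active_learning`, `margin`, `min_margin`, `budget`) are assumptions chosen purely for illustration.

```python
def centroids(labeled):
    """Mean feature vector per class from (features, label) pairs."""
    sums, counts = {}, {}
    for x, y in labeled:
        s = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            s[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}


def distances(cents, x):
    """Euclidean distance from x to every class centroid."""
    return {y: sum((a - b) ** 2 for a, b in zip(c, x)) ** 0.5
            for y, c in cents.items()}


def predict(cents, x):
    """Assign x to the class with the nearest centroid."""
    d = distances(cents, x)
    return min(d, key=d.get)


def margin(cents, x):
    """Gap between the two nearest centroids; a small gap means uncertain."""
    d = sorted(distances(cents, x).values())
    return d[1] - d[0]


def active_learning(pool, oracle, seed_set, budget, min_margin=None):
    """Query the most uncertain pool item until a stopping criterion fires.

    seed_set must contain at least one labeled example per class.
    Stops when the budget is spent, the pool is empty, or (if min_margin
    is given) even the most uncertain item has a margin >= min_margin.
    """
    labeled, pool, queries = list(seed_set), list(pool), 0
    while pool and queries < budget:
        cents = centroids(labeled)  # "retrain" on the current labels
        margins = [margin(cents, x) for x in pool]
        idx = min(range(len(pool)), key=margins.__getitem__)
        if min_margin is not None and margins[idx] >= min_margin:
            break  # the model is confident about everything left in the pool
        x = pool.pop(idx)
        labeled.append((x, oracle(x)))  # ask the (human) annotator
        queries += 1
    return centroids(labeled), labeled


# Toy data (hypothetical): two well-separated groups of 2-D feature vectors.
class0 = [(float(i), float(j)) for i in range(5) for j in range(5)]
class1 = [(float(i + 10), float(j + 10)) for i in range(5) for j in range(5)]
truth = {p: 0 for p in class0}
truth.update({p: 1 for p in class1})

seed_set = [((0.0, 0.0), 0), ((14.0, 14.0), 1)]
seeds = {x for x, _ in seed_set}
pool = [p for p in truth if p not in seeds]

model, labeled = active_learning(pool, truth.__getitem__, seed_set, budget=5)
accuracy = sum(predict(model, p) == y for p, y in truth.items()) / len(truth)
print(len(labeled), accuracy)
```

In a realistic setting the oracle would be a human annotator and the classifier a stronger model; only the shape of the loop (train, score the pool, query, check the stopping criterion) is meant to carry over.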
Details
Original language | English |
---|---|
Title of host publication | Proceedings of the Workshops of the EDBT/ICDT 2020 Joint Conference, Copenhagen, Denmark, March 30, 2020 |
Editors | Alexandra Poulovassilis |
Number of pages | 6 |
Publication status | Published - 2020 |
Peer-reviewed | Yes |
Publication series
Series | CEUR Workshop Proceedings |
---|---|
Volume | 2578 |
ISSN | 1613-0073 |
Conference
Title | Workshops of the 23rd International Conference on Extending Database Technology/23rd International Conference on Database Theory, EDBT-ICDT-WS 2020 |
---|---|
Duration | 30 March - 2 April 2020 |
City | Copenhagen |
Country | Denmark |
External IDs
Scopus | 85082745566 |
---|---|
ORCID | /0000-0001-8107-2775/work/142253451 |
ORCID | /0000-0002-5985-4348/work/162348854 |
Keywords
- Active Learning, Classification, Information Extraction, Machine Learning, Semi-supervised, Spreadsheets