Building the Dresden Web Table Corpus: A Classification Approach
Research output: Contribution to book/Conference proceedings/Anthology/Report › Conference contribution › Contributed › peer-review
Abstract
In recent years, researchers have recognized relational tables on the Web as an important source of information. To assist this research, we developed the Dresden Web Table Corpus (DWTC), a collection of about 125 million data tables extracted from the Common Crawl (CC), which contains 3.6 billion web pages and is 266 TB in size. As the vast majority of HTML tables are used for layout purposes and only a small share contains genuine tables with different surface forms, accurate table detection is essential for building a large-scale Web table corpus. Furthermore, correctly recognizing the table structure (e.g., horizontal listings, matrices) is important in order to understand the role of each table cell, distinguishing between label and data cells. In this paper, we present an extensive table layout classification that enables us to identify the main layout categories of Web tables with very high precision. To this end, we identify and develop a plethora of table features, different feature selection techniques, and several classification algorithms. We evaluate the effectiveness of the selected features and compare the performance of various state-of-the-art classification algorithms. Finally, the winning approach is employed to classify millions of tables, resulting in the DWTC.
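To make the idea of table features concrete, the following is a minimal sketch of how a few simple structural and content features might be computed from an HTML table. The abstract does not list the paper's actual features or classifiers, so the feature names (`rows`, `max_cols`, `header_ratio`, `numeric_ratio`) and the helper `looks_like_layout_table` are hypothetical illustrations, not the authors' method. Nested tables are not handled.

```python
from html.parser import HTMLParser


class TableFeatureParser(HTMLParser):
    """Collects the cells of each <tr> in an HTML fragment (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of (tag, text) cells per <tr>
        self._cell_tag = None   # "td" or "th" while inside a cell
        self._cell_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._cell_tag = tag
            self._cell_text = []

    def handle_data(self, data):
        if self._cell_tag is not None:
            self._cell_text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell_tag == tag:
            text = "".join(self._cell_text).strip()
            self.rows[-1].append((tag, text))
            self._cell_tag = None


def extract_features(html):
    """Return a dict of hypothetical features for a single table."""
    parser = TableFeatureParser()
    parser.feed(html)
    cells = [cell for row in parser.rows for cell in row]
    n_cells = len(cells)
    headers = sum(1 for tag, _ in cells if tag == "th")
    numeric = sum(1 for _, text in cells if text.replace(".", "", 1).isdigit())
    return {
        "rows": len(parser.rows),
        "max_cols": max((len(row) for row in parser.rows), default=0),
        "header_ratio": headers / n_cells if n_cells else 0.0,
        "numeric_ratio": numeric / n_cells if n_cells else 0.0,
    }


def looks_like_layout_table(features):
    # Hand-written rule of thumb for illustration; the paper instead
    # trains and compares classification algorithms on such features.
    return features["rows"] < 2 or features["max_cols"] < 2
```

In a learned setup, feature dictionaries like these would be turned into vectors and passed to a trained classifier rather than to a fixed rule.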
Details
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 2015 2nd IEEE/ACM International Symposium on Big Data Computing, BDC 2015 |
| Editors | Rajkumar Buyya, Ioan Raicu, Omer Rana |
| Publisher | Institute of Electrical and Electronics Engineers (IEEE) |
| Pages | 41-50 |
| Number of pages | 10 |
| ISBN (electronic) | 978-0-7695-5696-3 |
| Publication status | Published - 11 Feb 2016 |
| Peer-reviewed | Yes |
Conference
| Title | 2nd IEEE/ACM International Symposium on Big Data Computing, BDC 2015 |
|---|---|
| Duration | 7 - 10 December 2015 |
| City | Limassol |
| Country | Cyprus |
External IDs
| ORCID | /0000-0001-8107-2775/work/198592310 |
|---|---|
Keywords
- Data preprocessing, Machine learning, Web mining