Building the Dresden Web Table Corpus: A Classification Approach
Publication: Contribution to book/conference report/anthology/expert report › Contribution to conference proceedings › Contributed › Peer-reviewed
Contributors
Abstract
In recent years, researchers have recognized relational tables on the Web as an important source of information. To assist this research, we developed the Dresden Web Table Corpus (DWTC), a collection of about 125 million data tables extracted from the Common Crawl (CC), which contains 3.6 billion web pages and is 266 TB in size. As the vast majority of HTML tables are used for layout purposes and only a small share contains genuine tables with different surface forms, accurate table detection is essential for building a large-scale Web table corpus. Furthermore, correctly recognizing the table structure (e.g. horizontal listings, matrices) is important for understanding the role of each table cell, in particular for distinguishing between label and data cells. In this paper, we present an extensive table layout classification that enables us to identify the main layout categories of Web tables with very high precision. To this end, we identify and develop a wide range of table features, together with different feature selection techniques and several classification algorithms. We evaluate the effectiveness of the selected features and compare the performance of various state-of-the-art classification algorithms. Finally, the winning approach is employed to classify millions of tables, resulting in the DWTC.
Details
| Original language | English |
|---|---|
| Title | Proceedings - 2015 2nd IEEE/ACM International Symposium on Big Data Computing, BDC 2015 |
| Editors | Rajkumar Buyya, Ioan Raicu, Omer Rana |
| Publisher | Institute of Electrical and Electronics Engineers (IEEE) |
| Pages | 41-50 |
| Number of pages | 10 |
| ISBN (electronic) | 978-0-7695-5696-3 |
| Publication status | Published - 11 Feb 2016 |
| Peer-review status | Yes |
Conference
| Title | 2nd IEEE/ACM International Symposium on Big Data Computing, BDC 2015 |
|---|---|
| Duration | 7-10 December 2015 |
| City | Limassol |
| Country | Cyprus |
External IDs
| ORCID | /0000-0001-8107-2775/work/198592310 |
|---|---|
Keywords
- Data preprocessing, Machine learning, Web mining