Building the Dresden Web Table Corpus: A Classification Approach

Publication: Contribution to book/conference proceedings/anthology/report; Contribution to conference proceedings; Contributed; Peer-reviewed

Contributors

Abstract

In recent years, researchers have recognized relational tables on the Web as an important source of information. To support this research, we developed the Dresden Web Table Corpus (DWTC), a collection of about 125 million data tables extracted from the Common Crawl (CC), which contains 3.6 billion web pages and is 266 TB in size. Because the vast majority of HTML tables are used for layout purposes and only a small share contains genuine tables with different surface forms, accurate table detection is essential for building a large-scale Web table corpus. Furthermore, correctly recognizing the table structure (e.g., horizontal listings, matrices) is important for understanding the role of each table cell and for distinguishing between label and data cells. In this paper, we present an extensive table layout classification that enables us to identify the main layout categories of Web tables with very high precision. To this end, we identify and develop a large set of table features and apply different feature selection techniques and several classification algorithms. We evaluate the effectiveness of the selected features and compare the performance of various state-of-the-art classification algorithms. Finally, the winning approach is employed to classify millions of tables, resulting in the DWTC.
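
As a rough illustration of the kind of structural table features the abstract refers to, the following minimal Python sketch derives a few counts from an HTML snippet and applies a toy rule to separate layout tables from relational ones. The feature names (`rows`, `max_cols`, `header_ratio`, `nested_tables`), the class `TableFeatureParser`, and the rule in `guess_layout` are hypothetical and chosen only for illustration; they are not the feature set or the classifier used in the paper, which trains and compares classification algorithms rather than using a hand-written rule.

```python
from html.parser import HTMLParser


class TableFeatureParser(HTMLParser):
    """Collects simple structural counts, assuming one top-level <table> per snippet."""

    def __init__(self):
        super().__init__()
        self.depth = 0           # current <table> nesting depth
        self.rows = []           # number of cells in each top-level row
        self.th_cells = 0        # header cells in the top-level table
        self.td_cells = 0        # data cells in the top-level table
        self.nested_tables = 0   # tables found inside the top-level table

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            if self.depth > 1:
                self.nested_tables += 1
        elif self.depth == 1:
            # Only rows and cells of the outermost table are counted.
            if tag == "tr":
                self.rows.append(0)
            elif tag in ("td", "th") and self.rows:
                self.rows[-1] += 1
                if tag == "th":
                    self.th_cells += 1
                else:
                    self.td_cells += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1


def extract_features(html):
    """Return a dict of hypothetical structural features for one table snippet."""
    parser = TableFeatureParser()
    parser.feed(html)
    parser.close()
    cells = parser.th_cells + parser.td_cells
    return {
        "rows": len(parser.rows),
        "max_cols": max(parser.rows, default=0),
        "header_ratio": parser.th_cells / cells if cells else 0.0,
        "nested_tables": parser.nested_tables,
    }


def guess_layout(features):
    """Toy rule standing in for a trained classifier (an assumption, not the paper's method)."""
    if features["nested_tables"] > 0 or features["rows"] < 2 or features["max_cols"] < 2:
        return "layout"
    return "relational"


if __name__ == "__main__":
    data_table = (
        "<table><tr><th>Name</th><th>Value</th></tr>"
        "<tr><td>x</td><td>1</td></tr></table>"
    )
    print(guess_layout(extract_features(data_table)))
```

In a realistic pipeline, feature dictionaries like these would be turned into vectors and passed to a trained classifier, with feature selection applied beforehand; the rule above only shows where such a model would plug in.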

Details

Original language: English
Title: Proceedings - 2015 2nd IEEE/ACM International Symposium on Big Data Computing, BDC 2015
Editors: Rajkumar Buyya, Ioan Raicu, Omer Rana
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Pages: 41-50
Number of pages: 10
ISBN (electronic): 978-0-7695-5696-3
Publication status: Published - 11 Feb 2016
Peer-review status: Yes

Conference

Title: 2nd IEEE/ACM International Symposium on Big Data Computing, BDC 2015
Duration: 7 - 10 December 2015
City: Limassol
Country: Cyprus

External IDs

ORCID /0000-0001-8107-2775/work/198592310

Keywords

  • Data preprocessing, Machine learning, Web mining