Building the Dresden Web Table Corpus: A Classification Approach

Research output: Contribution to book/Conference proceedings/Anthology/Report › Conference contribution › Contributed › peer-review

Abstract

In recent years, researchers have recognized relational tables on the Web as an important source of information. To assist this research, we developed the Dresden Web Table Corpus (DWTC), a collection of about 125 million data tables extracted from the Common Crawl (CC), which contains 3.6 billion web pages and is 266 TB in size. Because the vast majority of HTML tables are used for layout purposes and only a small share contains genuine tables with different surface forms, accurate table detection is essential for building a large-scale Web table corpus. Furthermore, correctly recognizing the table structure (e.g., horizontal listings, matrices) is important for understanding the role of each table cell and for distinguishing between label and data cells. In this paper, we present an extensive table layout classification that enables us to identify the main layout categories of Web tables with very high precision. To this end, we identify and develop a plethora of table features, different feature selection techniques, and several classification algorithms. We evaluate the effectiveness of the selected features and compare the performance of various state-of-the-art classification algorithms. Finally, the winning approach is employed to classify millions of tables, resulting in the DWTC.

Details

Original language: English
Title of host publication: Proceedings - 2015 2nd IEEE/ACM International Symposium on Big Data Computing, BDC 2015
Editors: Rajkumar Buyya, Ioan Raicu, Omer Rana
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Pages: 41-50
Number of pages: 10
ISBN (electronic): 978-0-7695-5696-3
Publication status: Published - 11 Feb 2016
Peer-reviewed: Yes

Conference

Title: 2nd IEEE/ACM International Symposium on Big Data Computing, BDC 2015
Duration: 7 - 10 December 2015
City: Limassol
Country: Cyprus

External IDs

ORCID /0000-0001-8107-2775/work/198592310

Keywords

  • Data preprocessing, Machine learning, Web mining