Towards a Hybrid Imputation Approach Using Web Tables

Ahmad Ahmadov; Maik Thiele; Julian Eberius; Wolfgang Lehner; Robert Wrembel

doi:10.1109/BDC.2015.38

Publication: Contribution to book/conference proceedings/anthology/report › Contribution to conference proceedings › Contributed › Peer-reviewed

Contributors

Ahmad Ahmadov, Chair of Databases (Author)
Maik Thiele, Chair of Databases (Author)
Julian Eberius, Chair of Databases (Author)
Wolfgang Lehner, Chair of Databases (Author)
Robert Wrembel, Poznań University of Technology (Author)

Abstract

Data completeness is one of the most important data quality dimensions and an essential premise in data analytics. With emerging Big Data trends such as the data lake concept, which provides a low-cost data preparation repository instead of moving curated data into a data warehouse, the problem of data completeness is further exacerbated. While the process of filling in missing values is traditionally addressed by the data imputation community using statistical techniques, we complement these approaches by using external data sources from the data lake or even the Web to look up missing values. In this paper, we propose a novel hybrid data imputation strategy that takes into account the characteristics of an incomplete dataset and, based on them, chooses the best imputation approach: a statistical approach such as regression analysis, a Web-based lookup, or a combination of both. We formalize and implement both imputation approaches, including a Web table retrieval and matching system, and evaluate them extensively using a corpus of 125M Web tables. We show that applying statistical techniques in conjunction with external data sources leads to an imputation system that is robust, accurate, and has high coverage at the same time.
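
The hybrid idea described in the abstract can be illustrated with a small sketch. It is not the authors' system: the decision rule (prefer a lookup when the row's key is found in an external table, otherwise fall back to a one-variable regression if its fit on the complete rows is good enough), the function names `fit_line` and `hybrid_impute`, the `min_r2` threshold, and the `_source` field are all assumptions made for illustration. The external Web table is stood in for by a plain dictionary.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        # Degenerate case: no variance, so the line carries no information.
        return my, 0.0, 0.0
    b = sxy / sxx
    a = my - b * mx
    return a, b, (sxy * sxy) / (sxx * syy)


def hybrid_impute(rows, key, predictor, target, web_table, min_r2=0.8):
    """Fill missing `target` values by lookup or regression (illustrative rule).

    `web_table` maps key values to target values and stands in for an
    external source; `min_r2` is a hypothetical quality threshold.
    """
    complete = [r for r in rows
                if r[target] is not None and r[predictor] is not None]
    model = None
    if len(complete) >= 3:
        a, b, r2 = fit_line([r[predictor] for r in complete],
                            [r[target] for r in complete])
        if r2 >= min_r2:
            model = (a, b)

    result = []
    for row in rows:
        row = dict(row)  # copy, so the input is left untouched
        if row[target] is not None:
            row["_source"] = "original"
        elif row[key] in web_table:
            row[target] = web_table[row[key]]
            row["_source"] = "lookup"
        elif model is not None and row[predictor] is not None:
            row[target] = model[0] + model[1] * row[predictor]
            row["_source"] = "regression"
        else:
            row["_source"] = "unresolved"
        result.append(row)
    return result
```

Called with a list of dictionaries, the function returns copies of the rows with the gaps filled where possible and a `_source` tag recording which path produced each value. Combining both paths is what gives higher coverage than either alone: the lookup covers rows whose key appears in the external table, and the regression covers rows that have a usable predictor value.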

Details

Original language	English
Title	2015 IEEE/ACM 2nd International Symposium on Big Data Computing (BDC)
Editors	Rajkumar Buyya, Ioan Raicu, Omer Rana
Publisher	Institute of Electrical and Electronics Engineers (IEEE)
Pages	21-30
Number of pages	10
ISBN (electronic)	978-0-7695-5696-3
Publication status	Published - 11 Feb 2016
Peer review status	Yes

Conference

Title	2nd IEEE/ACM International Symposium on Big Data Computing, BDC 2015
Duration	7 - 10 December 2015
City	Limassol
Country	Cyprus

External IDs

ORCID	/0000-0001-8107-2775/work/198592309
