Retro: Relation retrofitting for in-database machine learning on textual data

Publication: Contribution to book/conference proceedings/anthology/report; contribution to conference proceedings; contributed; peer-reviewed

Contributors

Abstract

There are massive amounts of textual data residing in databases, valuable for many machine learning (ML) tasks. Since ML techniques depend on numerical input representations, word embeddings are increasingly utilized to convert symbolic representations such as text into meaningful numbers. However, a naïve one-to-one mapping of each word in a database to a word embedding vector is not sufficient, since it would fail to incorporate the rich context information given by the database schema, e.g., which words appear in the same column or are related to each other. Additionally, many text values in databases are very specific and would not have any counterpart within the word embedding. In this paper, we therefore propose Retro (RElational reTROfitting), a novel approach to learn numerical representations of text values in databases, capturing the information encoded by general-purpose word embeddings and the database-specific information encoded by the tabular relations. We formulate relation retrofitting as a learning problem and present an efficient algorithm solving it. We investigate the impact of various hyperparameters on the learning problem. Our evaluation shows that embeddings generated for database text values using Retro are ready to use for many ML tasks and even outperform state-of-the-art techniques.

Details

Original language: English
Title: Advances in Database Technology - EDBT 2020
Editors: Angela Bonifati, Yongluan Zhou, Marcos Antonio Vaz Salles, Alexander Bohm, Dan Olteanu, George Fletcher, Arijit Khan, Bin Yang
Publisher: OpenProceedings.org
Pages: 411-414
Number of pages: 4
ISBN (electronic): 9783893180837
Publication status: Published - 2020
Peer review status: Yes

Publication series

Series: Advances in database technology : proceedings / EDBT ...
Volume: 2020-March

Conference

Title: 23rd International Conference on Extending Database Technology, EDBT 2020
Duration: 30 March - 2 April 2020
City: Copenhagen
Country: Denmark

External IDs

Scopus 85084173498
ORCID /0000-0001-8107-2775/work/142253449

Keywords