Retro: Relation retrofitting for in-database machine learning on textual data

Research output: Contribution to book/conference proceedings/anthology/report > Conference contribution > Contributed > peer-review

Contributors

Abstract

Massive amounts of textual data reside in databases and are valuable for many machine learning (ML) tasks. Since ML techniques depend on numerical input representations, word embeddings are increasingly used to convert symbolic representations such as text into meaningful numbers. However, a naïve one-to-one mapping of each word in a database to a word embedding vector is not sufficient, since it fails to incorporate the rich context information given by the database schema, e.g., which words appear in the same column or are related to each other. Additionally, many text values in databases are very specific and have no counterpart in the word embedding. In this paper, we therefore propose Retro (RElational reTROfitting), a novel approach to learning numerical representations of text values in databases that captures both the information encoded by general-purpose word embeddings and the database-specific information encoded by the tabular relations. We formulate relation retrofitting as a learning problem and present an efficient algorithm for solving it. We investigate the impact of various hyperparameters on the learning problem. Our evaluation shows that embeddings generated for database text values using Retro are ready to use for many ML tasks and even outperform state-of-the-art techniques.
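To make the idea in the abstract concrete, the following is a minimal sketch of a generic retrofitting-style update. It is not the Retro algorithm or its objective, which this record does not specify. It assumes a simple scheme in which each text value is pulled toward its pretrained vector (weight `alpha`) and toward the values it is related to in the database (weight `beta`). The function name `retrofit`, both weights, and the toy values are hypothetical.

```python
def retrofit(base, relations, dim, alpha=1.0, beta=1.0, iterations=20):
    """Simplified retrofitting-style update (an assumption, not the paper's method).

    base:      dict mapping a text value to its pretrained vector (list of floats);
               values missing from this dict are treated as out-of-vocabulary.
    relations: iterable of (value_a, value_b) pairs taken from the database,
               e.g. values linked through a foreign key.
    """
    # Build an undirected neighbour map from the relation pairs.
    neighbours = {}
    for a, b in relations:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    values = set(base) | set(neighbours)
    # Start from the pretrained vector, or from zeros when there is none.
    vectors = {v: list(base.get(v, [0.0] * dim)) for v in values}

    for _ in range(iterations):
        for v in sorted(values):
            nbrs = neighbours.get(v, ())
            # Out-of-vocabulary values have no pretrained anchor.
            anchor = alpha if v in base else 0.0
            weight = anchor + beta * len(nbrs)
            if weight == 0.0:
                continue  # isolated and out-of-vocabulary: nothing to learn from
            new = [anchor * x for x in base.get(v, [0.0] * dim)]
            for n in nbrs:
                for k in range(dim):
                    new[k] += beta * vectors[n][k]
            # Weighted average of the pretrained vector and the neighbours.
            vectors[v] = [x / weight for x in new]
    return vectors


# Toy data (hypothetical): "movie_b_oov" has no pretrained vector.
base = {"movie_a": [1.0, 0.0], "director_x": [0.0, 1.0]}
relations = [("movie_a", "director_x"), ("movie_b_oov", "director_x")]
vectors = retrofit(base, relations, dim=2)
```

In this toy setup, the out-of-vocabulary value `movie_b_oov` receives a non-zero vector purely from its relation to `director_x`, which illustrates how relational context can cover text values that a one-to-one lookup would miss.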

Details

Original language: English
Title of host publication: Advances in Database Technology - EDBT 2020
Editors: Angela Bonifati, Yongluan Zhou, Marcos Antonio Vaz Salles, Alexander Bohm, Dan Olteanu, George Fletcher, Arijit Khan, Bin Yang
Publisher: OpenProceedings.org
Pages: 411-414
Number of pages: 4
ISBN (electronic): 9783893180837
Publication status: Published - 2020
Peer-reviewed: Yes

Publication series

Series: Advances in database technology : proceedings / EDBT ...
Volume: 2020-March

Conference

Title: 23rd International Conference on Extending Database Technology, EDBT 2020
Duration: 30 March - 2 April 2020
City: Copenhagen
Country: Denmark

External IDs

Scopus: 85084173498
ORCID: /0000-0001-8107-2775/work/142253449