Retro: Relation retrofitting for in-database machine learning on textual data
Research output: Contribution to book/conference proceedings/anthology/report › Conference contribution › Contributed › peer-review
Contributors
Abstract
Databases hold massive amounts of textual data that are valuable for many machine learning (ML) tasks. Since ML techniques depend on numerical input representations, word embeddings are increasingly used to convert symbolic representations such as text into meaningful numbers. However, a naïve one-to-one mapping of each word in a database to a word embedding vector is not sufficient, since it would fail to incorporate the rich context information given by the database schema, e.g., which words appear in the same column or are related to each other. Additionally, many text values in databases are very specific and would not have any counterpart in the word embedding. In this paper, we therefore propose Retro (RElational reTROfitting), a novel approach to learn numerical representations of text values in databases, capturing both the information encoded by general-purpose word embeddings and the database-specific information encoded by the tabular relations. We formulate relation retrofitting as a learning problem and present an efficient algorithm to solve it. We investigate the impact of various hyperparameters on the learning problem. Our evaluation shows that embeddings generated for database text values using Retro are ready to use for many ML tasks and even outperform state-of-the-art techniques.
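To make the idea of combining pretrained word vectors with relational context more concrete, the following is a minimal sketch of a generic retrofitting-style averaging update. It is not the algorithm from the paper: the function name `retrofit`, the weights `alpha` and `beta`, and the toy movie data are all assumptions chosen for illustration. Each text value is repeatedly pulled toward its initial vector and toward the vectors of the values it is related to, so a value with no pretrained counterpart (here starting from a zero vector) still ends up with a non-trivial representation.

```python
# Hypothetical illustration only: a generic retrofitting-style update,
# NOT the algorithm presented in the Retro paper.

def retrofit(initial, related, alpha=1.0, beta=1.0, iterations=20):
    """Pull each vector toward its initial embedding and its related values.

    initial: dict mapping text value -> list of floats (a pretrained vector,
             or a zero vector for values missing from the word embedding).
    related: dict mapping text value -> list of related text values
             (e.g. values linked through a shared row or a foreign key).
    alpha:   weight of the initial vector (assumed hyperparameter).
    beta:    weight of each related value's vector (assumed hyperparameter).
    """
    vectors = {key: list(vec) for key, vec in initial.items()}
    for _ in range(iterations):
        for key, start in initial.items():
            neighbours = [n for n in related.get(key, []) if n in vectors]
            total_weight = alpha + beta * len(neighbours)
            # Weighted sum of the initial vector and the current neighbour vectors.
            updated = [alpha * x for x in start]
            for n in neighbours:
                for i, x in enumerate(vectors[n]):
                    updated[i] += beta * x
            vectors[key] = [x / total_weight for x in updated]
    return vectors


# Toy data: "Inception" has no pretrained vector, so it starts at zero.
initial = {
    "Inception": [0.0, 0.0],
    "Christopher Nolan": [1.0, 0.0],
    "thriller": [0.0, 1.0],
}
related = {
    "Inception": ["Christopher Nolan", "thriller"],
    "Christopher Nolan": ["Inception"],
    "thriller": ["Inception"],
}
result = retrofit(initial, related)
print(result["Inception"])
```

In this toy setup the vector for "Inception" moves away from zero toward a blend of its two related values, while "Christopher Nolan" and "thriller" stay closest to their initial vectors. How strongly relations should count relative to the pretrained vectors is exactly the kind of hyperparameter choice the abstract mentions.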
Details
Original language | English |
---|---|
Title of host publication | Advances in Database Technology - EDBT 2020 |
Editors | Angela Bonifati, Yongluan Zhou, Marcos Antonio Vaz Salles, Alexander Böhm, Dan Olteanu, George Fletcher, Arijit Khan, Bin Yang |
Publisher | OpenProceedings.org |
Pages | 411-414 |
Number of pages | 4 |
ISBN (electronic) | 9783893180837 |
Publication status | Published - 2020 |
Peer-reviewed | Yes |
Publication series
Series | Advances in database technology : proceedings / EDBT ... |
---|---|
Volume | 2020-March |
Conference
Title | 23rd International Conference on Extending Database Technology, EDBT 2020 |
---|---|
Duration | 30 March - 2 April 2020 |
City | Copenhagen |
Country | Denmark |
External IDs
Scopus | 85084173498 |
---|---|
ORCID | /0000-0001-8107-2775/work/142253449 |