Corpus and Baseline Model for Domain-Specific Entity Recognition in German

Research output: Contribution to book/conference proceedings/anthology/report, Conference contribution, Contributed, peer-review

Abstract

Transfer learning approaches are a promising means of analyzing low-resource, domain-specific texts. The German SmartData corpus is the first German corpus annotated with entities from different domains, and it thus makes it possible to investigate transfer learning approaches for Named Entity Recognition (NER) on different domains. To prepare such investigations, this work includes a thorough analysis of the SmartData corpus and a revision of its annotations and of its split into training and test data, taking into account the distribution of document and entity types. On this basis, a baseline model for NER using BiLSTM-CRF neural networks, including hyperparameter optimization, is presented.

Details

Original language: English
Title of host publication: 2020 6th IEEE Congress on Information Science and Technology (CiSt)
Publisher: Wiley-IEEE Press
Pages: 314-320
Number of pages: 7
ISBN (print): 978-1-7281-6647-6
Publication status: Published - 12 Jun 2021
Peer-reviewed: Yes

Conference

Title: 2020 6th IEEE Congress on Information Science and Technology (CiSt)
Duration: 5 - 12 June 2021
Location: Agadir - Essaouira, Morocco

External IDs

Scopus 85103811992
IEEE 10.1109/CiSt49399.2021.9357189
ORCID /0000-0001-9756-6390/work/142250120

Keywords

  • Training, Information science, Annotations, Transfer learning, Neural networks, Training data, Optimization, Named Entity Recognition, NER, natural language processing