Named Entity Recognition for Low-Resource Languages - Profiting from Language Families
Research output: Contribution to book/conference proceedings/anthology/report › Conference contribution › Contributed › peer-review
Abstract
Machine learning drives progress in many areas of Natural Language Processing (NLP). Until now, many NLP systems and much research have focused on high-resource languages, i.e. languages for which many data resources exist. Recently, so-called low-resource languages have increasingly come into focus. In this context, multi-lingual language models trained on languages related to a target low-resource language may enable NLP tasks for that language. In this work, we investigate the use of multi-lingual models for Named Entity Recognition (NER) for low-resource languages. We consider the West Slavic language family and the low-resource languages Upper Sorbian and Kashubian. Three RoBERTa models were trained from scratch: two mono-lingual models, one for Czech and one for Polish, and one bi-lingual model for Czech and Polish. These models were evaluated on the NER downstream task for Czech, Polish, Upper Sorbian, and Kashubian, and compared to existing state-of-the-art models such as RobeCzech, HerBERT, and XLM-R. The results indicate that the mono-lingual models perform better on the language they were trained on, and that both the mono-lingual and language family models outperform the large multi-lingual model in downstream tasks. Overall, the study shows that low-resource West Slavic languages can benefit from closely related languages and their models.
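As an illustration of how models might be compared on an NER downstream task, the following is a minimal sketch of entity-level F1 scoring over BIO-tagged sentences. The abstract does not state which metric or tagging scheme the authors used, so the BIO scheme, the exact-match criterion, and the function names `bio_to_spans` and `entity_f1` are assumptions made for this example only.

```python
from typing import List, Set, Tuple

# Hypothetical span representation: (entity type, start index, exclusive end index).
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one sentence of BIO tags (assumed scheme)."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span that runs to the end of the sentence.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exactly matching entity spans."""
    tp = n_gold = n_pred = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = bio_to_spans(g_tags), bio_to_spans(p_tags)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Toy data, not taken from the paper: one sentence with two gold entities,
# of which the hypothetical model finds only the first.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
score = entity_f1(gold, pred)
```

Under these assumptions, the same function could be applied to each model's predictions on each language's test set, giving one comparable score per model and language pair.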
Details
Original language | English |
---|---|
Title of host publication | EACL 2023 - 9th Workshop on Slavic Natural Language Processing, Proceedings of the SlavicNLP 2023 |
Publisher | The Association for Computational Linguistics |
Pages | 1-10 |
Number of pages | 10 |
ISBN (electronic) | 9781959429579 |
Publication status | Published - 2023 |
Peer-reviewed | Yes |
Publication series
Series | Proceedings of the Workshop (SlavicNLP) |
---|---|
Workshop
Title | 9th Workshop on Slavic Natural Language Processing |
---|---|
Abbreviated title | Slavic NLP 2023 |
Conference number | 9 |
Description | held in conjunction with the 17th Conference of the European Chapter of the Association for Computational Linguistics |
Duration | 6 May 2023 |
Location | Valamar Lacroma Dubrovnik & online |
City | Dubrovnik |
Country | Croatia |
External IDs
ORCID | /0000-0001-9756-6390/work/146644781 |
---|---|
ORCID | /0000-0003-2684-102X/work/146646156 |