Named Entity Recognition for Low-Resource Languages - Profiting from Language Families
Publication: Contribution to book/conference proceedings/anthology/report › Contribution to conference proceedings › Contributed › Peer-reviewed
Contributors
Abstract
Machine learning drives progress in many areas of Natural Language Processing (NLP). Until now, many NLP systems and much NLP research have focused on high-resource languages, i.e. languages for which many data resources exist. Recently, so-called low-resource languages have increasingly come into focus. In this context, multi-lingual language models trained on languages related to a target low-resource language may enable NLP tasks on that language. In this work, we investigate the use of multi-lingual models for Named Entity Recognition (NER) for low-resource languages. We consider the West Slavic language family and the low-resource languages Upper Sorbian and Kashubian. Three RoBERTa models were trained from scratch: two mono-lingual models, one each for Czech and Polish, and one bi-lingual model for Czech and Polish. These models were evaluated on the NER downstream task for Czech, Polish, Upper Sorbian, and Kashubian, and compared to existing state-of-the-art models such as RobeCzech, HerBERT, and XLM-R. The results indicate that the mono-lingual models perform better on the language they were trained on, and that both the mono-lingual and language family models outperform the large multi-lingual model in downstream tasks. Overall, the study shows that low-resource West Slavic languages can benefit from closely related languages and their models.
Details
Original language | English |
---|---|
Title | EACL 2023 - 9th Workshop on Slavic Natural Language Processing, Proceedings of the SlavicNLP 2023 |
Publisher | The Association for Computational Linguistics |
Pages | 1-10 |
Number of pages | 10 |
ISBN (electronic) | 9781959429579 |
Publication status | Published - 2023 |
Peer-review status | Yes |
Publication series
Series | Proceedings of the Workshop (SlavicNLP) |
---|---|
Workshop
Title | 9th Workshop on Slavic Natural Language Processing |
---|---|
Short title | Slavic NLP 2023 |
Event number | 9 |
Description | held in conjunction with the 17th Conference of the European Chapter of the Association for Computational Linguistics |
Duration | 6 May 2023 |
Website | |
Location | Valamar Lacroma Dubrovnik & online |
City | Dubrovnik |
Country | Croatia |
External IDs
ORCID | /0000-0001-9756-6390/work/146644781 |
---|---|
ORCID | /0000-0003-2684-102X/work/146646156 |