Named Entity Recognition for Low-Resource Languages - Profiting from Language Families
Research output: Contribution to book/Conference proceedings/Anthology/Report › Conference contribution › Contributed › peer-review
Abstract
Machine learning drives progress in many areas of Natural Language Processing (NLP). Until now, many NLP systems and much research have focused on high-resource languages, i.e. languages for which many data resources exist. Recently, so-called low-resource languages have increasingly come into focus. In this context, multi-lingual language models trained on languages related to a target low-resource language may enable NLP tasks on that language. In this work, we investigate the use of multi-lingual models for Named Entity Recognition (NER) in low-resource languages. We consider the West Slavic language family and the low-resource languages Upper Sorbian and Kashubian. Three RoBERTa models were trained from scratch: two mono-lingual models, one for Czech and one for Polish, and one bi-lingual model for Czech and Polish. These models were evaluated on the NER downstream task for Czech, Polish, Upper Sorbian, and Kashubian, and compared to existing state-of-the-art models such as RobeCzech, HerBERT, and XLM-R. The results indicate that the mono-lingual models perform better on the language they were trained on, and that both the mono-lingual and language family models outperform the large multi-lingual model in downstream tasks. Overall, the study shows that low-resource West Slavic languages can benefit from closely related languages and their models.
Details
| Original language | English |
|---|---|
| Title of host publication | EACL 2023 - 9th Workshop on Slavic Natural Language Processing, Proceedings of the SlavicNLP 2023 |
| Publisher | The Association for Computational Linguistics |
| Pages | 1-10 |
| Number of pages | 10 |
| ISBN (electronic) | 9781959429579 |
| Publication status | Published - 2023 |
| Peer-reviewed | Yes |
Publication series
| Series | Proceedings of the Workshop (SlavicNLP) |
|---|---|
Workshop
| Title | 9th Workshop on Slavic Natural Language Processing |
|---|---|
| Abbreviated title | Slavic NLP 2023 |
| Conference number | 9 |
| Description | held in conjunction with the 17th Conference of the European Chapter of the Association for Computational Linguistics |
| Duration | 6 May 2023 |
| Location | Valamar Lacroma Dubrovnik & online |
| City | Dubrovnik |
| Country | Croatia |
External IDs
| ORCID | /0000-0001-9756-6390/work/146644781 |
|---|---|
| ORCID | /0000-0003-2684-102X/work/146646156 |