Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs

Research output: Preprint/Documentation/Report › Preprint

Contributors

Abstract

As part of the OpenGPT-X project, two multilingual large language models (LLMs) have been developed to support all 24 official languages of the European Union, promoting Europe's linguistic diversity. These models were trained on a dataset consisting of approximately 60% non-English data and utilize a custom multilingual tokenizer to address the limitations of existing LLMs, which often focus on English or a few high-resource languages. The development process involved optimizing data composition, tokenizer design, and training methodologies. The models demonstrate competitive performance across multilingual benchmarks, as evidenced by their results on European versions of ARC, HellaSwag, MMLU, and TruthfulQA.
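As a minimal illustration of the custom multilingual tokenizer mentioned above, the sketch below compares token counts for the same sentence in a few EU languages using the Hugging Face `transformers` library. The repository id is an assumption (the checkpoints are published under the OpenGPT-X organisation on the Hugging Face Hub, but the exact id may differ), so treat this as a sketch rather than the authors' evaluation code.

```python
# Minimal sketch (assumption: the Hub id below is illustrative and may differ;
# requires `transformers` and network access to download the tokenizer).
from transformers import AutoTokenizer

MODEL_ID = "openGPT-X/Teuken-7B-instruct-research-v0.4"  # assumed repository id

# The same sentence rendered in a few official EU languages.
samples = {
    "English": "The weather is nice today and we are going for a walk.",
    "German":  "Das Wetter ist heute schön und wir gehen spazieren.",
    "French":  "Il fait beau aujourd'hui et nous allons nous promener.",
    "Polish":  "Dzisiaj jest ładna pogoda i idziemy na spacer.",
}

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# A lower tokens-per-word ratio ("fertility") means the tokenizer encodes
# that language more compactly, which is the goal of a multilingual vocabulary.
for lang, text in samples.items():
    tokens = tokenizer.tokenize(text)
    fertility = len(tokens) / len(text.split())
    print(f"{lang:8s}  tokens={len(tokens):3d}  tokens/word={fertility:.2f}")
```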

Details

Original language: English
Publication status: Published - 30 Sept 2024

Keywords

  • cs.CL
  • cs.AI
  • cs.LG