Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs

Research output: Preprint/Documentation/Report › Preprint

Abstract

We present two multilingual LLMs designed to embrace Europe's linguistic diversity by supporting all 24 official languages of the European Union. Trained on a dataset comprising around 60% non-English data and utilizing a custom multilingual tokenizer, our models address the limitations of existing LLMs that predominantly focus on English or a few high-resource languages. We detail the models' development principles, i.e., data composition, tokenizer optimization, and training methodologies. The models demonstrate competitive performance across multilingual benchmarks, as evidenced by results on European versions of ARC, HellaSwag, MMLU, and TruthfulQA.
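
As an illustrative aside, the sketch below shows how a released checkpoint of a model like this might be loaded and queried with the Hugging Face transformers library. The repository identifier is a placeholder and not taken from this record; substitute the actual published model name.

```python
# Minimal sketch, assuming the model is published on the Hugging Face Hub
# under a repository such as the placeholder below.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openGPT-X/Teuken-7B-instruct"  # hypothetical identifier, adjust to the released checkpoint

# The custom multilingual tokenizer ships with the model repository.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Tokenize a non-English prompt to exercise the multilingual tokenizer.
prompt = "Die Hauptstadt von Frankreich ist"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate a short continuation and decode it back to text.
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```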

Details

Original language: English
Publication status: Published - 30 Sept 2024

Keywords

  • cs.CL, cs.AI, cs.LG