CARInA – A Corpus of Aligned German Read Speech Including Annotations

H. Kath; S. Stone; S. Rapp; P. Birkholz

doi:10.1109/ICASSP43922.2022.9746160

Research output: Contribution to book/Conference proceedings/Anthology/Report › Conference contribution › Contributed › peer-review

Contributors

H. Kath, TUD Dresden University of Technology, German Research Center for Artificial Intelligence (DFKI) (Author)
S. Stone, Chair of Speech Technology and Cognitive Systems (Author)
S. Rapp, Darmstadt University of Applied Sciences (Author)
P. Birkholz, Chair of Speech Technology and Cognitive Systems (Author)

Abstract

This paper presents the semi-automatically created Corpus of Aligned Read Speech Including Annotations (CARInA), a speech corpus based on the German Spoken Wikipedia Corpus (GSWC). CARInA tokenizes, consolidates and organizes the vast but rather unstructured material contained in GSWC. The contents are grouped by annotation completeness and extended by canonic, morphosyntactic and prosodic annotations, which are provided in BPF and TextGrid format. The corpus contains 194 hours of speech material from 327 speakers, of which 124 hours are fully phonetically aligned and 30 hours are fully aligned at all annotation levels. CARInA is freely available, designed to grow and improve over time, and suitable for large-scale speech analyses or machine learning tasks, as illustrated by two examples shown in this paper.

Details

Original language	English
Title of host publication	ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Pages	6157-6161
Number of pages	5
ISBN (electronic)	9781665405409
Publication status	Published - 2022
Peer-reviewed	Yes

External IDs

Scopus	85131257708

Research Portal of the TU Dresden
