Articulatory Copy Synthesis Based on the Speech Synthesizer VocalTractLab and Convolutional Recurrent Neural Networks

Y. Gao; P. Birkholz; Ya Li

doi:10.1109/TASLP.2024.3372874

Articulatory Copy Synthesis Based on the Speech Synthesizer VocalTractLab and Convolutional Recurrent Neural Networks

Publikation: Beitrag in Fachzeitschrift › Forschungsartikel › Beigetragen › Begutachtung

Beitragende

Y. Gao - , Beijing University of Posts and Telecommunications (Autor:in)
P. Birkholz - , Professur für Sprachtechnologie und Kognitive Systeme (Autor:in)
Ya Li - , Beijing University of Posts and Telecommunications (Autor:in)

Abstract

Articulatory copy synthesis (ACS) refers to the synthetic reproduction of natural utterances. The existing methods of ACS have the limitations of poor generalizability for unknown speakers, high computing costs, the lack of systematic evaluation, etc. Here we propose an ACS method based on the articulatory speech synthesizer VocalTractLab (VTL) and convolutional recurrent neural networks. We first created paired articulatory-acoustic samples using VTL, and then trained neural-network-based ACS models with acoustic features and articulatory trajectories as inputs and outputs, respectively. The basic approach for training relied on fully synthetic training data (and was later supplemented with natural speech and corresponding synthetic articulatory data). In addition, to represent as much of the articulatory and acoustic space as possible, the training samples were augmented by varying the phonation type, speaking effort, and the vocal tract length of the synthetic utterances. Furthermore, two regularization methods were proposed: one based on the smoothness loss of articulatory trajectories and another based on the acoustic loss between original and estimated acoustic features. For given new utterances of arbitrary length, the trained ACS models could estimate articulatory trajectories that were then fed into VTL to synthesize new speech. Experiments showed that our proposed ACS method achieved an average correlation coefficient of 0.983 between the reference and estimated VTL articulatory parameters for speaker-dependent German utterances. When applied to speaker-independent German, English, and Mandarin Chinese utterances, the copy-synthesized speech achieved recognition rates of 73.88%, 52.92%, and 52.41%, respectively, using the automatic speech recognizer Google Speech-to-Text.

Details

Originalsprache	Englisch
Seiten (von - bis)	1845-1858
Seitenumfang	14
Fachzeitschrift	IEEE/ACM Transactions on Audio Speech and Language Processing
Jahrgang	32
Publikationsstatus	Veröffentlicht - 2024
Peer-Review-Status	Ja

Externe IDs

Scopus	85187340878

Forschungsportal der TU Dresden

Articulatory Copy Synthesis Based on the Speech Synthesizer VocalTractLab and Convolutional Recurrent Neural Networks

Beitragende

Abstract

Details

Externe IDs

Schlagworte