Articulatory Copy Synthesis Based on the Speech Synthesizer VocalTractLab and Convolutional Recurrent Neural Networks

Y. Gao; P. Birkholz; Ya Li

doi:10.1109/TASLP.2024.3372874

Research output: Contribution to journal › Research article › Contributed › peer-review

Contributors

Y. Gao, Beijing University of Posts and Telecommunications (Author)
P. Birkholz, Chair of Speech Technology and Cognitive Systems (Author)
Ya Li, Beijing University of Posts and Telecommunications (Author)

Abstract

Articulatory copy synthesis (ACS) refers to the synthetic reproduction of natural utterances. Existing ACS methods generalize poorly to unknown speakers, incur high computing costs, and lack systematic evaluation, among other limitations. Here we propose an ACS method based on the articulatory speech synthesizer VocalTractLab (VTL) and convolutional recurrent neural networks. We first created paired articulatory-acoustic samples using VTL, and then trained neural-network-based ACS models with acoustic features as inputs and articulatory trajectories as outputs. The basic training approach relied on fully synthetic training data, which was later supplemented with natural speech and corresponding synthetic articulatory data. In addition, to cover as much of the articulatory and acoustic space as possible, the training samples were augmented by varying the phonation type, speaking effort, and vocal tract length of the synthetic utterances. Furthermore, two regularization methods were proposed: one based on the smoothness loss of articulatory trajectories and another based on the acoustic loss between original and estimated acoustic features. Given new utterances of arbitrary length, the trained ACS models could estimate articulatory trajectories, which were then fed into VTL to synthesize new speech. Experiments showed that the proposed ACS method achieved an average correlation coefficient of 0.983 between the reference and estimated VTL articulatory parameters for speaker-dependent German utterances. When applied to speaker-independent German, English, and Mandarin Chinese utterances, the copy-synthesized speech achieved recognition rates of 73.88%, 52.92%, and 52.41%, respectively, using the automatic speech recognizer Google Speech-to-Text.
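
Two quantities named in the abstract, the smoothness loss on articulatory trajectories and the average correlation coefficient between reference and estimated parameters, can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulas, so everything here is an assumption: the penalty is taken to be the mean squared frame-to-frame difference, the correlation is taken to be Pearson's r averaged over parameters, and the function names `smoothness_penalty` and `mean_correlation` and the toy sinusoidal data are hypothetical and do not come from the paper or from VTL.

```python
import numpy as np


def smoothness_penalty(traj):
    """Mean squared frame-to-frame difference.

    traj has shape (frames, parameters). This first-order form is an
    assumption; the paper's smoothness loss may be defined differently.
    """
    diffs = np.diff(traj, axis=0)  # differences between consecutive frames
    return float(np.mean(diffs ** 2))


def mean_correlation(reference, estimate):
    """Pearson correlation per parameter (column), averaged over parameters.

    Assumes no column is constant, since that would make the
    denominator zero.
    """
    ref = reference - reference.mean(axis=0)
    est = estimate - estimate.mean(axis=0)
    num = (ref * est).sum(axis=0)
    den = np.sqrt((ref ** 2).sum(axis=0) * (est ** 2).sum(axis=0))
    return float(np.mean(num / den))


# Toy data: two smooth "articulatory" trajectories plus a noisy estimate.
t = np.linspace(0.0, 1.0, 200)
reference = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=1)
rng = np.random.default_rng(0)
estimate = reference + 0.05 * rng.standard_normal(reference.shape)

print("smoothness (reference):", smoothness_penalty(reference))
print("smoothness (estimate): ", smoothness_penalty(estimate))
print("mean correlation:      ", mean_correlation(reference, estimate))
```

In this toy setting the noisy estimate has a larger penalty than the smooth reference, which is the behaviour a smoothness regularizer would discourage during training.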

Details

Original language	English
Pages (from-to)	1845-1858
Number of pages	14
Journal	IEEE/ACM Transactions on Audio, Speech, and Language Processing
Volume	32
Publication status	Published - Mar 2024
Peer-reviewed	Yes

External IDs

Scopus	85187340878

Keywords

Speech inversion, VocalTractLab (VTL), articulatory synthesis, convolutional recurrent neural networks, copy synthesis
