Articulatory Copy Synthesis Based on the Speech Synthesizer VocalTractLab and Convolutional Recurrent Neural Networks
Research output: Contribution to journal › Research article › Contributed › peer-review
Abstract
Articulatory copy synthesis (ACS) refers to the synthetic reproduction of natural utterances. Existing ACS methods suffer from poor generalizability to unknown speakers, high computational cost, and a lack of systematic evaluation, among other limitations. Here we propose an ACS method based on the articulatory speech synthesizer VocalTractLab (VTL) and convolutional recurrent neural networks. We first created paired articulatory-acoustic samples using VTL, and then trained neural-network-based ACS models with acoustic features as inputs and articulatory trajectories as outputs. The basic training approach relied on fully synthetic training data and was later supplemented with natural speech and corresponding synthetic articulatory data. In addition, to cover as much of the articulatory and acoustic space as possible, the training samples were augmented by varying the phonation type, speaking effort, and vocal tract length of the synthetic utterances. Furthermore, two regularization methods were proposed: one based on a smoothness loss on the articulatory trajectories and another based on an acoustic loss between the original and estimated acoustic features. Given new utterances of arbitrary length, the trained ACS models estimated articulatory trajectories that were then fed into VTL to synthesize new speech. Experiments showed that the proposed ACS method achieved an average correlation coefficient of 0.983 between the reference and estimated VTL articulatory parameters for speaker-dependent German utterances. When applied to speaker-independent German, English, and Mandarin Chinese utterances, the copy-synthesized speech achieved recognition rates of 73.88%, 52.92%, and 52.41%, respectively, using the automatic speech recognizer Google Speech-to-Text.
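The abstract names a smoothness loss on articulatory trajectories and an average correlation coefficient between reference and estimated parameters, but does not give their formulas. The following is a minimal sketch of one plausible reading, not the authors' implementation: it assumes the smoothness term is the mean squared second-order difference over time and that the reported score is the Pearson correlation computed per parameter and then averaged. The function names, the frames-by-parameters layout, and the toy trajectories are all hypothetical.

```python
import math
from typing import Sequence

# Assumed layout: one inner sequence per frame, one value per articulatory parameter.
Trajectory = Sequence[Sequence[float]]


def smoothness_loss(traj: Trajectory) -> float:
    """Mean squared second-order difference over time (assumed form of the loss)."""
    n_frames, n_params = len(traj), len(traj[0])
    if n_frames < 3:
        return 0.0  # a second difference needs at least three frames
    total = 0.0
    for t in range(1, n_frames - 1):
        for p in range(n_params):
            d2 = traj[t + 1][p] - 2.0 * traj[t][p] + traj[t - 1][p]
            total += d2 * d2
    return total / ((n_frames - 2) * n_params)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length series (neither may be constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def mean_correlation(reference: Trajectory, estimate: Trajectory) -> float:
    """Correlation per articulatory parameter, averaged over all parameters."""
    n_params = len(reference[0])
    scores = [
        pearson([frame[p] for frame in reference], [frame[p] for frame in estimate])
        for p in range(n_params)
    ]
    return sum(scores) / n_params


# Toy data: two smooth "parameters" and an estimate with frame-to-frame jitter.
reference = [[math.sin(0.1 * t), math.cos(0.1 * t)] for t in range(100)]
estimate = [[v + 0.05 * (-1) ** t for v in frame] for t, frame in enumerate(reference)]

print("smoothness (reference):", smoothness_loss(reference))
print("smoothness (estimate): ", smoothness_loss(estimate))
print("mean correlation:      ", mean_correlation(reference, estimate))
```

In this toy case the jittery estimate still correlates strongly with the reference while scoring much worse on smoothness, which illustrates why a smoothness term could usefully complement a pure trajectory-matching objective.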
Details
Original language | English |
---|---|
Pages (from-to) | 1845-1858 |
Number of pages | 14 |
Journal | IEEE/ACM Transactions on Audio, Speech, and Language Processing |
Volume | 32 |
Publication status | Published - 2024 |
Peer-reviewed | Yes |
External IDs
Scopus | 85187340878 |
---|---|