Glottal inverse filtering based on articulatory synthesis and deep learning

Research output: Conference contribution in conference proceedings (contributed, peer-reviewed)

Abstract

We propose a new method to estimate the glottal excitation of the vocal tract from speech signals based on deep learning. To that end, a bidirectional recurrent neural network with long short-term memory units was trained to predict the glottal airflow derivative from the speech signal. Since natural reference data for this task is unobtainable at the required scale, we used the articulatory speech synthesizer VocalTractLab to generate a large dataset containing synchronous connected speech and glottal airflow signals for training. The trained model's performance was objectively evaluated by means of stationary synthetic signals from the OPENGLOT glottal inverse filtering benchmark dataset and by using our dataset of connected synthetic speech. Compared to the state of the art, the proposed model produced more accurate estimates on OPENGLOT's physically synthesized signals but was less accurate for its computationally simulated signals. However, our model was much more accurate and plausible on the connected speech signals, especially for sounds with mixed excitation (e.g. fricatives) or sounds with pronounced zeros in their transfer function (e.g. nasals). Future work will introduce more variety into the training data (e.g. regarding pitch and phonation) and focus on estimating features of the glottal flow instead of the entire waveform.
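
To make the approach concrete, the sketch below shows the kind of model the abstract describes: a bidirectional LSTM that maps framed speech features to the glottal airflow derivative, trained on synchronous speech/airflow pairs such as those generated with VocalTractLab. The use of PyTorch, the feature dimensionality, the layer sizes, the learning rate, and the name GlottalFlowEstimator are all illustrative assumptions, not details taken from the paper.

    # Minimal sketch of a bidirectional LSTM regressor for the glottal
    # airflow derivative. All hyperparameters here are assumed values.
    import torch
    import torch.nn as nn

    class GlottalFlowEstimator(nn.Module):
        def __init__(self, input_dim: int = 80, hidden_dim: int = 256):
            super().__init__()
            # Bidirectional recurrent network with LSTM units, as in the
            # abstract; two layers and 256 units per direction are assumed.
            self.blstm = nn.LSTM(input_size=input_dim, hidden_size=hidden_dim,
                                 num_layers=2, batch_first=True,
                                 bidirectional=True)
            # Linear readout to one airflow-derivative value per frame.
            self.readout = nn.Linear(2 * hidden_dim, 1)

        def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
            # speech_feats: (batch, time, input_dim) framed speech features
            hidden, _ = self.blstm(speech_feats)
            return self.readout(hidden).squeeze(-1)  # (batch, time)

    # One training step on synchronous (speech, glottal flow derivative)
    # pairs, e.g. synthesized with an articulatory synthesizer; dummy
    # tensors stand in for real data here.
    model = GlottalFlowEstimator()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    speech = torch.randn(8, 200, 80)   # 8 utterances, 200 frames each
    target = torch.randn(8, 200)       # frame-wise airflow derivative

    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(speech), target)
    loss.backward()
    optimizer.step()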

Details

Original language: English
Title of host publication: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Pages: 1327-1331
Number of pages: 5
Volume: 2022-September
Publication status: Published - 2022
Peer-reviewed: Yes

External IDs

Scopus: 85140055042

Keywords

  • Glottal inverse filtering, glottal source estimation, source-filter separation, speech synthesis