Crowdsourcing Ecologically-Valid Dialogue Data for German

Yannick Frommherz; Alessandra Zarcone

doi:10.3389/fcomp.2021.686050

Research output: Contribution to journal › Research article › Contributed › peer-review

Contributors

Yannick Frommherz, Chair of Applied Linguistics (Author)
Alessandra Zarcone (Author)

Abstract

Despite their increasing success, user interactions with smart speech assistants (SAs) are still very limited compared to human-human dialogue. One way to make SA interactions more natural is to train the underlying natural language processing modules on data which reflects how humans would talk to an SA if it were capable of understanding and producing natural dialogue given a specific task. Such data can be collected using a Wizard-of-Oz (WOz) approach, in which both the user side and the system side are played by humans. WOz allows researchers to simulate human-machine interaction while benefitting from the fact that all participants are human and thus dialogue-competent. More recent approaches have leveraged simple templates specifying a dialogue scenario for crowdsourcing large-scale datasets. Template-based collection efforts, however, come at the cost of data diversity and naturalness. We present a method to crowdsource dialogue data for the SA domain in the WOz framework, which aims at limiting researcher-induced bias in the data while still allowing for a low-resource, scalable data collection. Our method can also be applied to languages other than English (in our case German), for which fewer crowd-workers may be available. We collected data asynchronously, relying only on existing functionalities of Amazon Mechanical Turk, by formulating the task as a dialogue continuation task. Coherence in dialogues is ensured, as crowd-workers always read the dialogue history, and as a unifying scenario is provided for each dialogue. In order to limit bias in the data, rather than using template-based scenarios, we handcrafted situated scenarios designed neither to pre-script the task in every single detail nor to prime the participants’ lexical choices. Our scenarios cued people’s knowledge of common situations and entities relevant for our task without mentioning them directly, relying instead on vague language and circumlocutions.
We compare our data (which we publish as the CROWDSS corpus; n = 113 dialogues) with data from MultiWOZ, showing that our scenario approach led to considerably less scripting and priming and thus more ecologically-valid dialogue data. This suggests that small investments in the collection setup can go a long way in improving data quality, even in a low-resource setup.

Details

Original language	English
Article number	686050
Journal	Frontiers in Computer Science
Volume	3
Publication status	Published - 21 Jun 2021
Peer-reviewed	Yes

External IDs

Scopus	85117930230
ORCID	/0000-0002-3167-1670/work/142249135

Keywords

German, Wizard-of-Oz, crowdsourcing, dialogue data, ecological validity, situated knowledge, voice assistants

Research Portal of the TU Dresden
