A Pipeline for the Creation of Multimodal Corpora from YouTube Videos

Nathan Dykes; Anna Wilson; Peter Uhrig

Research output: Contribution to conferences › Paper › Contributed › peer-review

Contributors

Nathan Dykes, Friedrich-Alexander University Erlangen-Nürnberg (Author)
Anna Wilson, University of Oxford (Author)
Peter Uhrig, Department of Distributed and Data Intensive Computing (VDR), Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI Dresden) (Author)

Abstract

This paper introduces an open-source pipeline for the creation of multimodal corpora from YouTube videos. It minimizes storage and bandwidth requirements, because the videos themselves need not be downloaded and can remain on YouTube's servers. It also minimizes processing requirements by using YouTube's automatically generated subtitles, thus avoiding a computationally expensive automatic speech recognition processing step. The pipeline combines standard tools and provides as its output a corpus file in the industry-standard vertical format used by many corpus managers. It is straightforwardly extensible with the addition of further levels of annotation and can be adapted to languages other than English.
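
As a rough illustration of the output stage described in the abstract, the sketch below writes time-stamped subtitle tokens to a vertical-format file body: one token per line with tab-separated attributes, wrapped in an XML-like structural tag. It is not the authors' implementation. The function name `to_vertical`, the video ID `abc123`, and the choice of a single start-time attribute per token are assumptions made for this example; the actual pipeline may use different tools, columns, and structural attributes.

```python
from xml.sax.saxutils import escape, quoteattr


def to_vertical(video_id, words):
    """Render (token, start_seconds) pairs as vertical-format text.

    Assumed layout (illustrative only): one <text> element per video,
    one token per line, with the token's start time as a second,
    tab-separated attribute.
    """
    # Structural tag carrying the (hypothetical) video identifier.
    lines = [f"<text id={quoteattr(video_id)}>"]
    for token, start in words:
        # Escape &, <, > so tokens cannot be mistaken for structural tags.
        lines.append(f"{escape(token)}\t{start:.2f}")
    lines.append("</text>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical tokens, as they might come from automatic subtitles.
    sample = [("hello", 0.5), ("world", 1.0)]
    print(to_vertical("abc123", sample), end="")
```

Keeping a time offset next to each token is what would let a corpus query result link back to the matching point in the video, which can then stay on YouTube's servers. Further annotation levels, such as part-of-speech tags, could be added as extra tab-separated columns.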

Details

Original language	English
Pages	1-5
Number of pages	5
Publication status	Published - 2023
Peer-reviewed	Yes

Workshop

Title	1st Workshop on Linguistic Insights from and for Multimodal Language Processing
Abbreviated title	LIMO 2023
Conference number	1
Description	Co-located with KONVENS 2023 (Konferenz zur Verarbeitung natürlicher Sprache/Conference on Natural Language Processing)
Duration	22 September 2023
Website	https://sites.google.com/view/limo2023/home
Location	Technische Hochschule Ingolstadt
City	Ingolstadt
Country	Germany

Keywords

ASJC Scopus subject areas

Linguistics and Language
Language and Linguistics