A Pipeline for the Creation of Multimodal Corpora from YouTube Videos

Nathan Dykes; Anna Wilson; Peter Uhrig

Publication: Contribution to conferences › Paper › Contributed › Peer-reviewed

Contributors

Nathan Dykes, Friedrich-Alexander-Universität Erlangen-Nürnberg (Author)
Anna Wilson, University of Oxford (Author)
Peter Uhrig, Abteilung Verteiltes und Datenintensives Rechnen (VDR), Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI Dresden) (Author)

Abstract

This paper introduces an open-source pipeline for the creation of multimodal corpora from YouTube videos. It minimizes storage and bandwidth requirements, because the videos themselves need not be downloaded and can remain on YouTube's servers. It also minimizes processing requirements by using YouTube's automatically generated subtitles, thus avoiding a computationally expensive automatic speech recognition processing step. The pipeline combines standard tools and provides as its output a corpus file in the industry-standard vertical format used by many corpus managers. It is straightforwardly extensible with the addition of further levels of annotation and can be adapted to languages other than English.
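
As a rough illustration of the output format described in the abstract, the sketch below renders timestamped subtitle cues as vertical text: one token per line, with structural boundaries marked by XML-style tags. It is not the authors' implementation. The function name `cues_to_vertical`, the `<cue>` structure, its `start`/`end` attributes, and the per-token start-time column are all assumptions made for this example, and the cues are hard-coded rather than fetched from YouTube.

```python
from xml.sax.saxutils import escape, quoteattr


def cues_to_vertical(video_id, cues):
    """Render (start, end, text) subtitle cues as a vertical-format document.

    Hypothetical helper: each cue becomes one <cue> structure, and each
    whitespace-separated token becomes one line whose second,
    tab-separated column repeats the cue's start time in seconds.
    """
    lines = [f"<doc id={quoteattr(video_id)}>"]
    for start, end, text in cues:
        lines.append(f'<cue start="{start:.2f}" end="{end:.2f}">')
        for token in text.split():
            # Escape markup characters so a token is not mistaken
            # for a structure tag.
            lines.append(f"{escape(token)}\t{start:.2f}")
        lines.append("</cue>")
    lines.append("</doc>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up cues standing in for automatically generated subtitles.
    example_cues = [(0.0, 1.5, "hello world"), (1.5, 3.0, "so yeah")]
    print(cues_to_vertical("abc123", example_cues), end="")
```

Because only the subtitle text and its timestamps are stored, a corpus built this way can refer back to the video by its identifier and time offset instead of keeping a local copy of the video.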

Details

Original language	English
Pages	1-5
Number of pages	5
Publication status	Published - 2023
Peer-reviewed	Yes

Workshop

Title	1st Workshop on Linguistic Insights from and for Multimodal Language Processing
Short title	LIMO 2023
Event number	1
Description	Co-located with KONVENS 2023 (Konferenz zur Verarbeitung natürlicher Sprache/Conference on Natural Language Processing)
Duration	22 September 2023
Website	https://sites.google.com/view/limo2023/home
Location	Technische Hochschule Ingolstadt
City	Ingolstadt
Country	Germany

Keywords

ASJC Scopus subject areas

Linguistics and Language
Language and Linguistics