A Pipeline for the Creation of Multimodal Corpora from YouTube Videos

Publication: Contribution to conference, paper, contributed, peer-reviewed

Contributors

Abstract

This paper introduces an open-source pipeline for the creation of multimodal corpora from YouTube videos. It minimizes storage and bandwidth requirements, because the videos themselves need not be downloaded and can remain on YouTube's servers. It also minimizes processing requirements by using YouTube's automatically generated subtitles, thus avoiding a computationally expensive automatic speech recognition processing step. The pipeline combines standard tools and provides as its output a corpus file in the industry-standard vertical format used by many corpus managers. It is straightforwardly extensible with the addition of further levels of annotation and can be adapted to languages other than English.
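The final step described in the abstract, turning timed subtitle text into a vertical-format corpus file, can be illustrated with a minimal sketch. It is not the authors' implementation: the cue data, the regex tokenizer, the function names, and the attribute names (`id`, `start`, `end`) are all assumptions made for illustration, and the actual pipeline relies on standard tools that the abstract does not name. The sketch assumes the subtitles have already been retrieved and parsed into (start, end, text) cues.

```python
import re
from xml.sax.saxutils import quoteattr

# Hypothetical input: subtitle cues as (start_seconds, end_seconds, text).
# In a real pipeline these would come from YouTube's auto-generated subtitles.
CUES = [
    (0.0, 2.5, "Hello and welcome."),
    (2.5, 5.0, "Today's topic is corpora."),
]

# Naive tokenizer (an assumption): words, optionally with internal
# apostrophes, or single punctuation characters.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def tokenize(text):
    """Split a subtitle line into tokens."""
    return TOKEN_RE.findall(text)


def to_vertical(video_id, cues):
    """Render cues as vertical text: one token per line, with
    structural tags on their own lines carrying the timing metadata."""
    lines = [f"<text id={quoteattr(video_id)}>"]
    for start, end, text in cues:
        lines.append(f'<s start="{start:.2f}" end="{end:.2f}">')
        lines.extend(tokenize(text))
        lines.append("</s>")
    lines.append("</text>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_vertical("VIDEO_ID", CUES))
```

Keeping the cue timestamps as attributes on each structural tag is what would let a corpus query result be traced back to a position in the video, which remains on YouTube's servers. Further annotation levels, such as part-of-speech tags or lemmas, would typically be added as extra tab-separated columns on each token line.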

Details

Original language: English
Pages: 1-5
Number of pages: 5
Publication status: Published - 2023
Peer-review status: Yes

Workshop

Title: 1st Workshop on Linguistic Insights from and for Multimodal Language Processing
Short title: LIMO 2023
Event number: 1
Description: Co-located with KONVENS 2023 (Konferenz zur Verarbeitung natürlicher Sprache/Conference on Natural Language Processing)
Date: 22 September 2023
Location: Technische Hochschule Ingolstadt
City: Ingolstadt
Country: Germany

Keywords