L3X: Long Object List Extraction from Long Documents

Sneha Singhania; Simon Razniewski; Gerhard Weikum

doi:10.1145/3746252.3761460

Publication: Contribution to book/conference proceedings/anthology/report › Contribution to conference proceedings › Contributed › Peer-reviewed

Contributors

Sneha Singhania, Max Planck Institute for Informatics (Author)
Simon Razniewski, Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI Dresden), Professur für Wissensbasierte Künstliche Intelligenz (ScaDS.AI Dresden/Leipzig) (Author)
Gerhard Weikum, Max Planck Institute for Informatics (Author)

Abstract

Information extraction with LLMs is typically geared toward extracting individual subject-predicate-object (SPO) triples from short factual texts such as Wikipedia or news articles. In contrast, the L3X methodology tackles the task of extracting long lists from long texts: given a target subject S and predicate P, the goal is to extract the complete list of all objects O for which SPO holds. This is especially challenging over long texts, like entire books or large web crawls, where many objects are long-tail entities. We demonstrate L3X, a web-based system designed for this previously unexplored task. L3X comprises recall-oriented candidate generation using LLMs in RAG mode, with novel methods for ranking and batching passages, followed by precision-oriented scrutinization. Our demo supports exploring multiple configurations, including LLM-only and RAG baselines, showcasing use cases like fiction-character relations from book series (e.g., 50+ friends of Harry Potter) and business relations from web pages (e.g., CEOs of Toyota).
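
The two-stage idea in the abstract (recall-oriented candidate generation over ranked, batched passages, then precision-oriented scrutinization) can be illustrated with a small Python sketch. This is not the authors' implementation: every function name, the term-count ranking, the batch size, and the regex-based stand-ins for the LLM calls are assumptions made purely for illustration.

```python
import re
from typing import Callable, Dict, List

# Hypothetical signatures for the two LLM-backed steps (assumptions, not L3X's API).
Generator = Callable[[str, str, List[str]], List[str]]
Verifier = Callable[[str, str, str, List[str]], bool]


def rank_passages(passages: List[str], query_terms: List[str]) -> List[str]:
    """Toy ranking: order passages by how often the query terms occur in them."""
    def score(passage: str) -> int:
        text = passage.lower()
        return sum(text.count(term.lower()) for term in query_terms)
    return sorted(passages, key=score, reverse=True)


def make_batches(items: List[str], size: int) -> List[List[str]]:
    """Split the ranked passages into consecutive groups of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def extract_objects(passages: List[str], subject: str, predicate: str,
                    generate: Generator, verify: Verifier,
                    top_k: int = 3, batch_size: int = 2) -> List[str]:
    """Stage 1 collects candidates for recall; stage 2 filters them for precision."""
    ranked = rank_passages(passages, [subject, predicate])[:top_k]
    support: Dict[str, List[str]] = {}
    for group in make_batches(ranked, batch_size):
        for candidate in generate(subject, predicate, group):
            support.setdefault(candidate, []).extend(group)
    return sorted(obj for obj, evidence in support.items()
                  if verify(subject, predicate, obj, evidence))


def toy_generate(subject: str, predicate: str, batch: List[str]) -> List[str]:
    """Stand-in for an LLM call: over-generate every capitalized word except the subject."""
    names = re.findall(r"\b[A-Z][a-z]+\b", " ".join(batch))
    return [name for name in names if name != subject]


def toy_verify(subject: str, predicate: str, obj: str, evidence: List[str]) -> bool:
    """Stand-in for scrutinization: keep a candidate only if a passage states the relation."""
    claim = f"{obj} is a {predicate} of {subject}"
    return any(claim in passage for passage in evidence)


PASSAGES = [
    "Ron is a friend of Harry.",
    "Hermione is a friend of Harry.",
    "Draco taunts Harry in the corridor.",
    "The weather was grey.",
]

result = extract_objects(PASSAGES, "Harry", "friend", toy_generate, toy_verify)
print(result)  # ['Hermione', 'Ron']
```

In this toy run the generator proposes "Draco" as a candidate because the name appears in a retrieved passage, and the verification step discards it because no passage supports the relation. A real system would replace both stand-ins with LLM calls and a proper retriever.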

Details

Original language	English
Title	CIKM 2025 - Proceedings of the 34th ACM International Conference on Information and Knowledge Management
Pages	6693-6697
Number of pages	5
Publication status	Published - 10 Nov 2025
Peer-review status	Yes

Conference

Title	34th ACM International Conference on Information and Knowledge Management
Short title	CIKM 2025
Event number	34
Duration	10-14 November 2025
Website	https://cikm2025.org/
Venue	COEX
City	Seoul
Country	South Korea

External IDs

ORCID	/0000-0002-5410-218X/work/200631820

Keywords

information extraction, long documents, narrative text
