Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference

Cornelius Kummer; Lena Jurkschat; Michael Färber; Sahar Vahdati

doi:10.1007/978-3-032-21289-4_17

Research output: Contribution to book/Conference proceedings/Anthology/Report › Conference contribution › Contributed › peer-review

Contributors

Cornelius Kummer - TUD Dresden University of Technology (Author)
Lena Jurkschat - Department Cognitive AI, Department of Distributed and Data Intensive Computing (VDR), Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI Dresden) (Author)
Michael Färber - Chair of Scalable Software Architectures for Data Analytics (ScaDS.AI Dresden/Leipzig), Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI Dresden) (Author)
Sahar Vahdati - Department of Distributed and Data Intensive Computing (VDR), Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI Dresden) (Author)

Abstract

With the wide adoption of language models for IR – and specifically RAG systems – the latency of the underlying LLM becomes a crucial bottleneck, since the long contexts of retrieved passages lead to large prompts and therefore increased compute. Prompt compression, which reduces the size of input prompts while aiming to preserve performance on downstream tasks, has established itself as a cost-effective and low-latency method for accelerating inference in large language models. However, its usefulness depends on whether the additional preprocessing time during generation is offset by faster decoding. We present the first systematic, large-scale study of this trade-off, with thousands of runs and 30,000 queries across several open-source LLMs and three GPU classes. Our evaluation separates compression overhead from decoding latency while tracking output quality and memory usage. LLMLingua achieves up to 18% end-to-end speed-ups when prompt length, compression ratio, and hardware capacity are well matched, with response quality remaining statistically unchanged across summarization, code generation, and question answering tasks. Outside this operating window, however, the compression step dominates and cancels out the gains. We also show that effective compression can reduce memory usage enough to offload workloads from data center GPUs to commodity cards, with only a 0.3 s increase in latency. Our open-source profiler predicts the latency break-even point for each model–hardware setup, providing practical guidance on when prompt compression delivers real-world benefits.
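
The trade-off described in the abstract can be illustrated with a minimal sketch. It is not the authors' profiler: the class, field, and function names below are hypothetical, and the timings are made-up values chosen only to show the arithmetic. The assumption is simply that end-to-end latency with compression is the compression time plus the LLM latency on the shortened prompt, so compression pays off only when its overhead is smaller than the inference time it saves.

```python
from dataclasses import dataclass


@dataclass
class LatencySample:
    """One measured run, all times in seconds (hypothetical field names)."""
    compress_s: float          # time spent compressing the prompt
    infer_compressed_s: float  # LLM latency on the compressed prompt
    infer_original_s: float    # LLM latency on the uncompressed prompt


def end_to_end_speedup(s: LatencySample) -> float:
    """Relative end-to-end gain; positive means compression was faster overall."""
    with_compression = s.compress_s + s.infer_compressed_s
    return (s.infer_original_s - with_compression) / s.infer_original_s


def pays_off(s: LatencySample) -> bool:
    """True if the compression overhead is smaller than the inference time saved."""
    return s.compress_s < (s.infer_original_s - s.infer_compressed_s)


# Illustrative numbers only, not results from the paper.
inside_window = LatencySample(compress_s=0.5, infer_compressed_s=1.0, infer_original_s=2.0)
outside_window = LatencySample(compress_s=1.5, infer_compressed_s=1.0, infer_original_s=2.0)

print(end_to_end_speedup(inside_window), pays_off(inside_window))
print(end_to_end_speedup(outside_window), pays_off(outside_window))
```

In the second sample the compression step costs more than the inference time it saves, which corresponds to the situation the abstract describes as falling outside the operating window.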

Details

Original language	English
Title of host publication	Advances in Information Retrieval
Editors	Ricardo Campos, Adam Jatowt, Yanyan Lan, Mohammad Aliannejadi, Christine Bauer, Sean MacAvaney, Avishek Anand, Zhaochun Ren, Suzan Verberne, Nan Bai, Masoud Mansoury
Publisher	Springer Science and Business Media B.V.
Pages	257-271
Number of pages	15
ISBN (electronic)	978-3-032-21289-4
ISBN (print)	978-3-032-21288-7
Publication status	Published - Mar 2026
Peer-reviewed	Yes

Publication series

Series	Lecture Notes in Computer Science
Volume	16483 LNCS
ISSN	0302-9743

Conference

Title	48th European Conference on Information Retrieval
Abbreviated title	ECIR 2026
Conference number	48
Duration	29 March - 2 April 2026
Website	https://ecir2026.eu/
Location	Lijm & Cultuur
City	Delft
Country	Netherlands

External IDs

ORCID	/0000-0001-5458-8645/work/215836104

Keywords

inference, latency analysis, LLMs, open source models, performance evaluation, prompt compression

Research Portal of the TU Dresden
