PEDANTIC: A Dataset for the Automatic Examination of Definiteness in Patent Claims

Valentin Knappich; Annemarie Friedrich; Anna Hätty; Simon Razniewski

Publication: Contribution to book/conference proceedings/anthology/report › Conference contribution › Contributed › Peer-reviewed

Contributors

Valentin Knappich, Bosch Center for Artificial Intelligence, Universität Augsburg (Author)
Annemarie Friedrich, Universität Augsburg (Author)
Anna Hätty, Bosch Center for Artificial Intelligence (Author)
Simon Razniewski, Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI Dresden), Professur für Wissensbasierte Künstliche Intelligenz (ScaDS.AI Dresden/Leipzig), Technische Universität Dresden (Author)

Abstract

Patent claims define the scope of protection for an invention. If there are ambiguities in a claim, it is rejected by the patent office. In the US, this is referred to as indefiniteness (35 U.S.C. § 112(b)) and is among the most frequent reasons for patent application rejection. The development of automatic methods for patent definiteness examination has the potential to make patent drafting and examination more efficient, but no annotated dataset has been published to date. We introduce PEDANTIC (Patent Definiteness Examination Corpus), a novel dataset of 14k US patent claims from patent applications relating to Natural Language Processing (NLP), annotated with reasons for indefiniteness. We construct PEDANTIC using a fully automatic pipeline that retrieves office action documents from the USPTO and uses Large Language Models (LLMs) to extract the reasons for indefiniteness. A human validation study confirms the pipeline’s accuracy in generating high-quality annotations. To gain insight beyond binary classification metrics, we implement an LLM-as-Judge evaluation that compares the free-form reasoning of every model-cited reason with every examiner-cited reason. We show that LLM agents based on Qwen 2.5 32B and 72B struggle to outperform logistic regression baselines on definiteness prediction, even though they often correctly identify the underlying reasons. PEDANTIC provides a valuable resource for patent AI researchers, enabling the development of advanced examination models. We release the dataset and code at https://github.com/boschresearch/pedantic-patentsemtech.
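
The pairwise comparison of model-cited and examiner-cited reasons mentioned in the abstract can be pictured with the minimal sketch below. It is an illustration under stated assumptions, not the authors' implementation: the names `pairwise_reason_scores` and `toy_judge` are hypothetical, the precision/recall aggregation is one plausible way to summarize the pairwise verdicts, and the word-overlap judge is only a stand-in for an actual LLM call.

```python
import re
from typing import Callable, Dict, Sequence


def toy_judge(model_reason: str, examiner_reason: str) -> bool:
    """Hypothetical stand-in for an LLM judge.

    Returns True when the two reasons share enough vocabulary
    (Jaccard overlap of lowercase word sets >= 0.3).
    """
    a = set(re.findall(r"[a-z]+", model_reason.lower()))
    b = set(re.findall(r"[a-z]+", examiner_reason.lower()))
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= 0.3


def pairwise_reason_scores(
    model_reasons: Sequence[str],
    examiner_reasons: Sequence[str],
    judge: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Compare every model-cited reason with every examiner-cited reason."""
    # match[i][j] is True when the judge accepts model reason i
    # as describing the same issue as examiner reason j.
    match = [[judge(m, e) for e in examiner_reasons] for m in model_reasons]

    # A model reason counts as correct if it matches any examiner reason.
    matched_model = sum(any(row) for row in match)
    # An examiner reason counts as found if any model reason matches it.
    matched_examiner = sum(
        any(match[i][j] for i in range(len(model_reasons)))
        for j in range(len(examiner_reasons))
    )

    precision = matched_model / len(model_reasons) if model_reasons else 0.0
    recall = matched_examiner / len(examiner_reasons) if examiner_reasons else 0.0
    return {"precision": precision, "recall": recall}


# Invented example reasons, for illustration only.
model_reasons = [
    "the term 'the device' lacks antecedent basis",
    "the term 'substantially' is a relative term",
]
examiner_reasons = ["'the device' lacks antecedent basis"]

scores = pairwise_reason_scores(model_reasons, examiner_reasons, toy_judge)
print(scores)
```

In a real setup the `judge` argument would wrap a prompt to an LLM that reads both free-form reasons and returns a verdict; passing it in as a callable keeps the aggregation logic independent of any particular model or API.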

Details

Original language	English
Title	6th Workshop on Patent Text Mining and Semantic Technologies (PatentSemTech)
Pages	21-38
Number of pages	18
Volume	4062
Publication status	Published - 2025
Peer-reviewed	Yes

Publication series

Series	CEUR Workshop Proceedings
ISSN	1613-0073

Workshop

Title	6th Workshop on Patent Text Mining and Semantic Technologies
Short title	PatentSemTech 2025
Event number	6
Description	Held in conjunction with the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2025)
Duration	17 July 2025
Website	http://ifs.tuwien.ac.at/patentsemtech/
Location	Padova Congress center
City	Padova
Country	Italy

External IDs

ORCID	/0000-0002-5410-218X/work/198595065

Keywords

ASJC Scopus subject areas

General Computer Science

Keywords

Patent AI, Patent Clarity, Patent Classification, Patent Definiteness, Patent Examination
