<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PEDANTIC: A Dataset for the Automatic Examination of Definiteness in Patent Claims</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Valentin Knappich</string-name>
          <email>valentin.knappich@de.bosch.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Annemarie Friedrich</string-name>
          <email>annemarie.friedrich@uni-a.de</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Hätty</string-name>
          <email>anna.haetty@de.bosch.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simon Razniewski</string-name>
          <aff>ScaDS.AI, TU Dresden</aff>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <fpage>21</fpage>
      <lpage>38</lpage>
      <abstract>
        <p>Patent claims define the scope of protection for an invention. If there are ambiguities in a claim, it is rejected by the patent office. In the US, this is referred to as indefiniteness (35 U.S.C. § 112(b)) and is among the most frequent reasons for patent application rejection. The development of automatic methods for patent definiteness examination has the potential to make patent drafting and examination more efficient, but no annotated dataset has been published to date. We introduce PEDANTIC (Patent Definiteness Examination Corpus), a novel dataset of 14k US patent claims from patent applications relating to Natural Language Processing (NLP), annotated with reasons for indefiniteness. We construct PEDANTIC using a fully automatic pipeline that retrieves office action documents from the USPTO and uses Large Language Models (LLMs) to extract the reasons for indefiniteness. A human validation study confirms the pipeline's accuracy in generating high-quality annotations. To gain insight beyond binary classification metrics, we implement an LLM-as-Judge evaluation that compares the free-form reasoning of every model-cited reason with every examiner-cited reason. We show that LLM agents based on Qwen 2.5 32B and 72B struggle to outperform logistic regression baselines on definiteness prediction, even though they often correctly identify the underlying reasons. PEDANTIC provides a valuable resource for patent AI researchers, enabling the development of advanced examination models. We release the dataset and code at https://github.com/boschresearch/pedantic-patentsemtech.</p>
      </abstract>
      <kwd-group>
        <kwd>Patent AI</kwd>
        <kwd>Patent Examination</kwd>
        <kwd>Patent Definiteness</kwd>
        <kwd>Patent Clarity</kwd>
        <kwd>Patent Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The patent system plays a crucial role in fostering innovation by granting inventors exclusive rights to
their inventions. Central to this system are patent claims, which are concise and precise statements
that define the metes and bounds of the protected invention. The process of obtaining a patent involves
rigorous examination by patent offices to ensure that the application meets specific criteria, including
novelty, non-obviousness, and, critically, definiteness. The latter requires that every claim is sufficiently
clear and unambiguous to enable a person skilled in the art (called the Person of Ordinary Skill in the
Art, or POSITA) to understand the scope of the invention. In American patent law, this is defined in 35
U.S.C. § 112(b) (comparable to clarity in the EU), which states that “the specification shall conclude with
one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant
regards as his invention.” The Manual of Patent Examining Procedure (MPEP) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] provides detailed
instructions for the examination. We present the most common categories of indefiniteness in Table 1.
      </p>
      <p>
        Ensuring definiteness is challenging for patent attorneys and examiners. Patent applications typically
undergo multiple rejection-response cycles, averaging 26 months from filing to disposition [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. With
increasing application volumes [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], AI-powered methods are needed to improve examination efficiency
and consistency. Automating examination could yield significant cost savings. For instance, it could
assist examiners, aid attorneys in drafting more robust applications, and improve accessibility for those
without extensive patent expertise. It could also provide feedback for training and evaluating automatic
drafting systems [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Indefiniteness is one of the most common reasons for rejection [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6, 7</xref>
        ], yet its automatic examination
has received very little attention in prior work compared to novelty and non-obviousness. To bridge
this gap, we specifically target the automatic examination of definiteness. We argue that predicting
definiteness as a binary label is insufficient for practical applications. For patent drafters and examiners,
knowing that a claim is indefinite does not provide actionable information. Rather, understanding
why it is indefinite is crucial, as it enables targeted improvements to the claim. For this purpose, we
introduce PEDANTIC, a dataset of patent claims annotated with detailed justifications for rejection due
to indefiniteness. In addition to the claims, PEDANTIC includes fine-grained indefiniteness categories
(such as ’antecedent basis’) along with free-form reasoning and the affected ranges of the claim
for every indefiniteness reason. We create it fully automatically, leveraging Large Language Models
(LLMs) to parse USPTO office actions into a structured format that includes these annotations. In
total, PEDANTIC includes 14k claims from 3k utility patent applications relating to Natural Language
Processing (NLP) filed after 2014. It facilitates a nuanced evaluation, distinguishing between systems
that identify indefiniteness based on superficial clues and those that are able to pinpoint the correct
underlying issue. In addition to established classification metrics, we implement a reference-based
LLM-as-Judge [8] evaluation that compares each model-cited rejection reason with every examiner-cited
rejection reason. We release our dataset and code publicly to facilitate further research.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Automated Patent Examination. Most related work on automatic patent examination has focused
on novelty and non-obviousness. Datasets for novelty assessment [9, 10, 11, 12] pair patent claims
either with short passages [9, 10] or large chunks [11, 12] from prior art. They use citations marked
as novelty-destroying by the examiner as positive samples and obtain negative samples either from
other citations [9, 10, 11] and/or from related patents [10, 12]. Novelty has been evaluated using
BERT-based models [9, 10, 11, 12, 13, 14], graph neural networks [15, 16], and LLMs [11, 17]. Lee et al. [18]
furthermore present a dataset for the prediction of patent edits following rejections based on novelty
and non-obviousness.</p>
      <p>Patent Clarity and Definiteness. Hido et al. [19] predict patentability using linguistic features
including syntactic complexity and word age. Kong et al. [20] model patent readability using § 112(a)
(lack of disclosure) with linguistic features, finding university-issued patents more readable than
corporate-issued ones. Lo and Chu [7] frame patentability prediction as a multi-label classification task,
including definiteness, using BERT-based models, but do not report definiteness-specific results. Ashtor
[21] trains a definiteness classifier based on linguistic features and reaches 68% AUROC. They use this
classifier as a proxy for clarity and show that clarity has improved over time through policy changes
regarding definiteness rejections.</p>
      <p>LLM Explanations. While the primary purpose of a classifier is to predict the most likely class,
practical applications often demand more transparency and insight into the reason behind a prediction. There
is a large body of work concerning explainable AI (XAI) [22] that has developed various techniques to
produce explanations for a classifier’s prediction, including feature attribution methods, counterfactual
explanations, rule extraction, and example-based reasoning. Recently, LLMs have been employed to
directly generate explanations along with their predictions. Such self-explanations have been shown to
perform on par with traditional explainability methods [23], but also have limited faithfulness [24, 25].
In this work, we propose to use the free-form reasons written by the patent examiner in the rejection
full-text as ground truth to evaluate whether a model predicts indefiniteness for the right reasons.</p>
    </sec>
    <sec id="sec-3">
      <title>3. PEDANTIC Dataset</title>
      <p>In this section, we describe the creation of our dataset, including the retrieval of seed patent applications,
the retrieval of rejection notices, document parsing, and dataset splitting.</p>
      <sec id="sec-3-1">
        <title>3.1. Seed Patent Applications</title>
        <p>Our dataset creation pipeline first requires a set of seed patent applications. We focus on patent
applications in the area of NLP, i.e., applications with the CPC class “G06F40” (“Handling natural
language data”), but our pipeline is agnostic to this and can be readily used with seed applications
from any other field. We query the USPTO Open Data Portal (ODP, https://data.uspto.gov/) API for
applications from 2014 to date with at least one filed rejection notice. We do not consider applications
prior to 2014 because the requirements for definiteness changed significantly after the US Supreme
Court’s ruling on Nautilus v. Biosig [26]. Before 2014, a claim was only rejected for indefiniteness if it
was “insolubly ambiguous”, whereas afterward, indefiniteness was interpreted as a claim “[failing] to
inform, with reasonable certainty, those skilled in the art about the scope of the invention”. Ashtor [21]
finds empirical evidence that this ruling significantly increased both the fraction of claims rejected for
indefiniteness and the average claim clarity. Since this shift would otherwise be an exploitable bias, we
filter out all applications filed before 2014.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Document Download</title>
        <p>For all retrieved seed patents, we download the claim, specification, and office action documents from
the ODP File Wrapper API (https://data.uspto.gov/apis/patent-file-wrapper/search). We only consider
the first office action, as it generally establishes the core grounds for rejection. While subsequent office
actions may address remaining or newly introduced issues, the initial rejection often provides the most
comprehensive overview of the examiner’s concerns. We select the latest claim and specification
documents preceding this office action and download all three documents in XML format. We parse the
XML documents into Markdown and JSON for easier further processing.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>The most common categories of indefiniteness with their MPEP sections (where applicable).</p>
          </caption>
          <table>
            <thead>
              <tr><th>Reason</th><th>MPEP</th><th>Description</th></tr>
            </thead>
            <tbody>
              <tr><td>Antecedent Basis</td><td>2173.05(e)</td><td>Claim contains a term referencing an element lacking a clear prior introduction, creating ambiguity as to what it references.</td></tr>
              <tr><td>Undefined Term</td><td>2173.05(a)</td><td>A term lacks a clear, accepted and/or unambiguous meaning to a POSITA, making the claim’s scope uncertain.</td></tr>
              <tr><td>Relative Term</td><td>2173.05(b)</td><td>A relative term (e.g., ’thin,’ ’substantial’) is used without providing a clear point of comparison, rendering the claim’s scope indefinite.</td></tr>
              <tr><td>Exemplary Phrasing</td><td>2173.05(d)</td><td>Claim uses ’such as,’ or similar phrasing, making it unclear whether the listed items are exhaustive or merely examples, leading to indefiniteness.</td></tr>
              <tr><td>Functional Claiming</td><td>2173.05(g)</td><td>Claim recites ’means for’ or ’step for’ without disclosing adequate corresponding structure, material, or acts in the specification, as required under 35 U.S.C. 112(f) or the pre-AIA equivalent.</td></tr>
              <tr><td>Contradicting Limitations</td><td>–</td><td>Claim includes an element that contradicts or is inconsistent with other claim limitations, making the claim’s scope unclear.</td></tr>
              <tr><td>Omission of Essential Elements or Steps</td><td>–</td><td>Claim fails to recite an element, step, or cooperative relationship between elements/steps that is essential to the invention as disclosed.</td></tr>
              <tr><td>Dependence</td><td>–</td><td>Claim depends on an indefinite claim.</td></tr>
              <tr><td>Other</td><td>2173</td><td>Catch-all category for indefiniteness reasons not covered above.</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Office Action Parsing</title>
        <p>We parse the office actions’ full-text into a structured representation using Gemma 3 27B [27]. First,
we select the sections related to indefiniteness by filtering for sections with headings containing “112”.
Next, we prompt the LLM with the selected sections and instruct it to extract indefiniteness reasons in a
JSON schema. In particular, each rejection contains the text snippet arguing why the claim is indefinite,
a category from Table 1, and a list of recited phrases. We instruct the LLM to extract the free-form
argumentation and recited phrases verbatim and not to paraphrase, extend, or explain them. The prompt
is attached in Appendix A. Lastly, we use fuzzy matching to find the occurrences of recited phrases in
the claim text. An example from our dataset is shown in Figure 1. A claim can have multiple rejection
reasons, but most have only one (see Section 3.7).</p>
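        <p>As an illustration, a character-level fuzzy match could be implemented as follows (our own sketch; the paper does not specify the matching algorithm or threshold):</p>
        <preformat>
# Minimal sketch (assumed implementation, not the authors' released code):
# locate a verbatim-extracted recited phrase in the claim text, tolerating
# small formatting differences via difflib similarity over a sliding window.
from difflib import SequenceMatcher

def find_recited_phrase(claim_text, phrase, min_ratio=0.9):
    """Return the (start, end) span of the best fuzzy match of phrase in claim_text."""
    n = len(phrase)
    best_span, best_ratio = None, 0.0
    for start in range(max(1, len(claim_text) - n + 1)):
        window = claim_text[start:start + n]
        ratio = SequenceMatcher(None, window.lower(), phrase.lower()).ratio()
        if ratio &gt; best_ratio:
            best_span, best_ratio = (start, start + n), ratio
    return best_span if best_ratio &gt;= min_ratio else None

# Example: map the examiner's recitation onto a character range of the claim.
claim = "A method comprising: receiving the input signal from a sensor."
print(find_recited_phrase(claim, "the input signal"))  # -&gt; (31, 47)
        </preformat>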
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Sampling Definite Claims</title>
        <p>To ensure robust evaluation, we balance the dataset between definite and indefinite claims. We use
claims from applications whose office actions do not contain the term “112(b)”, i.e., applications whose
claims are all definite. This makes sure that all claims labelled as definite are indeed not rejected for
indefiniteness; using claims from the same applications as the previously extracted indefinite claims
could introduce noise if the parsing pipeline does not detect all rejections. To also balance the number
of applications in each class, we first compute the average number of indefinite claims per application
included in the dataset. We iterate through the definite applications in random order and sample the
same number of claims until there are as many definite as indefinite claims. While sampling claims
from the applications, we round the number of sampled claims down or up depending on whether there
are currently more or fewer claims per application than in the indefinite samples.</p>
        <p>[Table 2: dataset statistics per split — total, definite, and indefinite claims, independent and dependent claims, applications, and counts per rejection reason; visible fragment: train split with 8730 claims (60.06%); the remaining values are not recoverable from this version.]</p>
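        <p>A simplified sketch of this sampling loop (our own reconstruction of the described procedure, not the released code):</p>
        <preformat>
import random

# Minimal sketch (assumption: simplified version of the sampling in Sec. 3.4):
# draw definite claims from 112(b)-free applications until the class counts match,
# keeping the claims-per-application ratio close to the indefinite side.
def sample_definite_claims(definite_apps, n_indefinite, claims_per_app, seed=42):
    """definite_apps: dict mapping application id -&gt; list of claim texts."""
    rng = random.Random(seed)
    app_ids = list(definite_apps)
    rng.shuffle(app_ids)
    sampled, counts = [], []
    for app_id in app_ids:
        if len(sampled) &gt;= n_indefinite:
            break
        # Round the per-application sample size up if we are currently below the
        # target claims-per-application ratio, down otherwise.
        current = sum(counts) / len(counts) if counts else 0.0
        k = int(claims_per_app) + (1 if current &lt; claims_per_app else 0)
        take = definite_apps[app_id][:min(k, n_indefinite - len(sampled))]
        sampled.extend((app_id, claim) for claim in take)
        counts.append(len(take))
    return sampled
        </preformat>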
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Dataset Splits</title>
        <p>We randomly split the resulting dataset into train (60%), test (30%), and validation (10%). To avoid data
leakage, we perform this split on the application-level, such that claims from the same application are
always in the same split.</p>
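        <p>A minimal sketch of such a grouped split using scikit-learn (our own illustration; the paper does not state which library it uses):</p>
        <preformat>
# Minimal sketch (assumed): application-level splitting so that all claims of
# one application land in the same split (60/30/10 train/test/validation).
from sklearn.model_selection import GroupShuffleSplit

def split_by_application(claims, app_ids, seed=42):
    gss = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=seed)
    train_idx, rest_idx = next(gss.split(claims, groups=app_ids))
    # Split the remaining 40% into test (30% overall) and validation (10% overall).
    rest_groups = [app_ids[i] for i in rest_idx]
    gss2 = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=seed)
    test_rel, val_rel = next(gss2.split(rest_idx, groups=rest_groups))
    return train_idx, rest_idx[test_rel], rest_idx[val_rel]
        </preformat>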
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Human Validation Study</title>
        <p>To validate the quality of our extracted annotations, we manually inspect 50 randomly sampled claims
from PEDANTIC. Among these claims, 24 are definite and 26 are indefinite, with a total of 27 reasons
for indefiniteness (one indefinite claim has two reasons while the rest have one). For each claim, we
compare the rejection document with the extracted annotations and analyze the correctness of the
binary label, the extracted reasoning texts, and the assigned categories. We find that all binary labels
are correct. The extracted free-form reasoning is also correct in all samples except one, where it says
“The underlined lacks antecedent basis”, but the underline formatting is not carried over into the
Markdown. This validates the reliability of our automatically extracted binary labels and reasoning
texts. However, we find substantial noise in the assigned categories, with 19 out of 27 reasons having
the correct one. In six out of eight incorrect categories, none of the proposed categories would have
been a good fit; the LLM assigned the category “undefined term” instead of “other” as instructed. In the
remaining two out of eight incorrect category assignments, the correct category would have been
“antecedent basis”, but the LLM assigned “undefined term”. In both cases, the term “antecedent basis”
was not mentioned explicitly, unlike in most reasoning texts of this category.</p>
      </sec>
      <sec id="sec-3-7">
        <title>3.7. Dataset Statistics</title>
        <p>Indefiniteness Categories. Table 2 also shows the distribution of indefiniteness categories. Missing
or ambiguous antecedent bases and unclear definitions dominate the dataset, constituting a combined
73% of all indefiniteness reasons. Relative terms and functional claiming are also commonly cited, each
constituting about 9% of indefiniteness reasons. Contradicting limitations, exemplary phrasing and
omission of essential elements or steps are each cited in less than 5% of indefiniteness reasons.
Characteristics of definite vs. indefinite claims. If superficial characteristics differ between definite
and indefinite claims, a trained classifier will likely use them as a shortcut. We therefore report several
such characteristics across the classes. As shown in Figure 2, the fraction of indefinite claims remains
around 50% over time, i.e., models gain no advantage from the filing date, even if they infer it from
the specific technology and terminology in the claim. As shown in Table 3, the indefinite claims are
more frequently independent claims than definite claims. In other words, in our dataset, independent
claims are more likely to be rejected due to indefiniteness than dependent claims (a dependent claim
refers back to another claim, incorporating its features and narrowing its scope of protection). A plausible reason
is that independent claims are longer, i.e., they introduce more features that could be indefinite. We
consequently observe that indefinite claims are longer on average in terms of characters, words, and
features.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Definiteness Prediction</title>
      <p>In this section, we present two baseline approaches to predict indefiniteness: logistic regression and an
LLM agent. The available input is the claim in question and the accompanying patent description.</p>
      <sec id="sec-4-1">
        <title>4.1. Logistic Regression</title>
        <p>We include logistic regression in our experiments because it is computationally efficient and interpretable,
and because Ashtor [21] has shown it to achieve non-trivial performance predicting indefiniteness.
We use TF-IDF features and a number of handcrafted linguistic features, similar to Ashtor [21]. The
latter include the claim length, the claim length relative to the description length, readability metrics,
trigger word flags, and a flag indicating whether the claim is independent. The full list is visualized in
Figure 3. For all feature sets, we train separate classifiers for the binary classification and the multi-label
classification.</p>
        <p>[Table 3: counts of definite and indefinite claims, split by independent vs. dependent status; the values are not recoverable from this version.]</p>
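        <p>A minimal sketch of this feature setup (the handcrafted features shown are only a subset, and all names are our own; the paper’s full feature list is in Figure 3):</p>
        <preformat>
# Minimal sketch (assumed configuration, not the authors' exact one): TF-IDF
# features concatenated with handcrafted linguistic features, fed into a
# logistic regression classifier.
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def handcrafted_features(claim, description, is_independent):
    words = claim.split()
    return [
        len(claim),                             # claim length (characters)
        len(claim) / max(1, len(description)),  # length relative to description
        len(set(words)) / max(1, len(words)),   # type-token ratio
        float("step for" in claim.lower()),     # trigger-word flag
        float(is_independent),                  # independent-claim flag
    ]

def featurize(claims, descriptions, independents, vectorizer=None):
    if vectorizer is None:
        vectorizer = TfidfVectorizer(max_features=20000).fit(claims)
    tfidf = vectorizer.transform(claims)
    hand = csr_matrix(np.array([handcrafted_features(c, d, i)
                                for c, d, i in zip(claims, descriptions, independents)]))
    return hstack([tfidf, hand]), vectorizer

# X_train, vec = featurize(train_claims, train_descs, train_indep)
# clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        </preformat>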
      </sec>
      <sec id="sec-4-2">
        <title>4.2. LLM Agent</title>
        <p>
          LLMs have shown impressive performance on many tasks related to patents [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], yet they have not
been evaluated on definiteness prediction. We implement a zero-shot LLM agent to identify the issues
causing indefiniteness in the claims. The prompt (see Appendix B) includes instructions and the claim
in question. The agent is equipped with two tools that allow it to search for relevant information in the
remaining document. First, we add a tool that returns a claim given its claim number, allowing the agent
to analyze parent claims, and other claims deemed relevant. Second, we implement a TF-IDF search
tool that allows the agent to retrieve paragraphs containing certain key words or phrases from the
patent’s description section. We choose this tool-based approach because pasting the entire description
quickly fills up the context window, while pre-selecting the relevant parts is restrictive and requires
hand-crafting selection criteria and retrieval mechanisms. Using these tools, the agent can flexibly
retrieve whatever information it deems relevant. The agent analyses the claim and performs tool calls
in an interleaved fashion until it arrives at a final prediction.
        </p>
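        <p>A minimal sketch of how the two tools could be implemented (class and method names are our own; the paper does not specify the exact interface):</p>
        <preformat>
# Minimal sketch (assumed tool implementations): claim lookup by number and a
# TF-IDF search over the description paragraphs.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class PatentTools:
    def __init__(self, claims, description_paragraphs):
        self.claims = claims                    # claim number -&gt; claim text
        self.paragraphs = description_paragraphs
        self.vectorizer = TfidfVectorizer().fit(description_paragraphs)
        self.matrix = self.vectorizer.transform(description_paragraphs)

    def get_claim(self, number):
        """Tool 1: return a claim by its number (e.g., to inspect parent claims)."""
        return self.claims.get(number, f"No claim {number} in this application.")

    def search_description(self, query, top_k=3):
        """Tool 2: return the description paragraphs most similar to the query."""
        scores = cosine_similarity(self.vectorizer.transform([query]), self.matrix)[0]
        top = scores.argsort()[::-1][:top_k]
        return [self.paragraphs[i] for i in top]
        </preformat>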
        <p>Rather than generating the binary label directly (definite/indefinite), we instruct the LLM to use a
verbalized expression of likelihood among a set of possible options ranging from “almost no chance” to
“almost certain”. Confidence scores, if reliable, should allow users to configure a level of strictness for
issue detection. Each expression is converted to a numerical probability according to the empirically
determined human perception of probability expressions by Fagen-Ulmschneider [28]. We choose this
approach because Tian et al. [29] and Xiong et al. [30] show that asking the LLM for a confidence value
is equally or more reliable than logit-based confidence estimation.</p>
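        <p>For illustration, such a mapping could look as follows (the probability values below are our own rough illustration, not the exact medians from the perception study [28]):</p>
        <preformat>
# Minimal sketch with illustrative probabilities only; the paper maps each
# expression to the empirically determined human perception reported by
# Fagen-Ulmschneider [28].
LIKELIHOOD_TO_PROB = {
    "almost no chance": 0.02,
    "highly unlikely":  0.10,
    "unlikely":         0.25,
    "about even":       0.50,
    "likely":           0.70,
    "highly likely":    0.90,
    "almost certain":   0.95,
}

def verbalized_to_probability(expression):
    return LIKELIHOOD_TO_PROB[expression.strip().lower()]
        </preformat>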
        <p>Lastly, we instruct the LLM to format the result as JSON according to a fixed schema. The final output
contains a prediction of the likelihood of the claim being rejected, and a list of potential reasons for the
indefiniteness, where each reason contains a confidence score (using the same verbalized expressions
as above), a free-form reasoning, one of the categories in Table 1, and a list of claim recitations. Thus,
the output is in the same format as the structured representations extracted from the office actions,
with the additional confidence scores. This allows quantitative evaluation of the binary classification,
multi-label classification, and the correctness of textual reasons.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation Metrics</title>
      <p>In this section, we propose a set of metrics to evaluate claim indefiniteness prediction models. First,
since the core task is binary classification, we use established metrics that measure how well the system
can predict the binary label “definite” or “indefinite”. While this binary label is already interesting by
itself, practical applications demand a more fine-grained and explainable approach. It is equally or even
more important to know why a claim is indefinite to give actionable feedback to drafters. To that end,
we implement an LLM-as-Judge [8] approach that determines whether an examiner-cited reason and
a model-cited reason point to the same essential issue in the claim. Lastly, we evaluate the task as a
multi-label classification, where the overlap in the categories of cited reasons also indicates whether a
model decided to accept or reject a claim for the right reasons.</p>
      <sec id="sec-5-1">
        <title>5.1. Binary Classification</title>
        <p>We evaluate binary classification performance for identifying indefinite claims using the following
metrics:
• Precision: The proportion of true positives (i.e., claims correctly predicted as indefinite) out of all
claims predicted as indefinite by the model.
• Recall: The proportion of true positives out of all claims found indefinite by the examiner.
• F1-score: The harmonic mean of precision and recall, providing a balanced measure of both.
• AUROC: The area under the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate
against the false positive rate for varying classification thresholds.
• Accuracy: The proportion of correctly classified claims (as either definite or indefinite) out of all
claims.</p>
        <p>These metrics provide a general understanding of the system’s ability to distinguish between definite
and indefinite claims. For all models, we compute these metrics for a confidence threshold that balances
the predictions, i.e., we first determine the threshold with which half of the claims from the validation
set are predicted as indefinite.</p>
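        <p>For illustration, a minimal sketch of this balancing step (function names are ours; the paper does not publish this helper):</p>
        <preformat>
# Minimal sketch: pick the confidence threshold under which half of the
# validation claims are predicted indefinite, then apply it at test time.
import numpy as np

def balanced_threshold(val_confidences):
    """Median confidence = threshold at which 50% of claims are flagged."""
    return float(np.median(val_confidences))

def predict_indefinite(confidences, threshold):
    return np.asarray(confidences) &gt; threshold
        </preformat>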
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Pairwise Reasoning Judge</title>
        <p>To assess the quality of the model’s reasoning for indefiniteness, we employ a reference-based
LLM-as-Judge approach [8]. We prompt Gemma 3 27B to evaluate whether an examiner-cited reason and a
model-cited reason point to the same essential issue in the claim on a scale from 1 (worst) to 5 (best)
(we normalize the final scores to the range [0, 100]). The few-shot prompt is attached in Appendix C.
To allow the model to analyze the ground truth and the predicted reason before settling on a grade,
we first prompt it to find similarities and differences, then ask for a numerical grade in a follow-up
message. Following the latest research in LLM-as-Judge systems [31, 32], we use the probability-weighted
mean as the similarity score of a reason-pair:</p>
        <disp-formula><tex-math><![CDATA[\operatorname{norm}(s) = 100 \cdot \frac{\sum_{g=1}^{5} p_g \, g \; - \; 1}{4},]]></tex-math></disp-formula>
        <p>where g is a grade and p_g is the LLM-determined probability of that grade (i.e., the exponential of the
logit of the token corresponding to the grade).</p>
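        <p>The sketch below illustrates this computation; the dictionary interface is our own simplification of reading the grade-token probabilities from the judge’s logits:</p>
        <preformat>
# Minimal sketch (assumed interface): turn the judge's per-grade token
# probabilities into the probability-weighted score norm(s) defined above.
def probability_weighted_score(grade_probs):
    """grade_probs: grade (1..5) -&gt; probability derived from the grade-token logits."""
    total = sum(grade_probs.values())
    mean_grade = sum(g * p for g, p in grade_probs.items()) / total
    return 100.0 * (mean_grade - 1.0) / 4.0  # normalize 1..5 onto [0, 100]

# Example: a judge that puts most probability mass on grade 4.
print(probability_weighted_score({3: 0.1, 4: 0.8, 5: 0.1}))  # 75.0
        </preformat>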
        <p>We compute this similarity for every pair of examiner-cited reason and model-cited reason. Hence,
for a sample with m examiner-cited reasons and n model-cited reasons, we obtain a similarity matrix
S ∈ R<sup>m×n</sup>. From this matrix, we compute precision, recall, and F1, each in a thresholded and a soft
variant. In the thresholded variant, we consider a model-cited reason correct if there is an examiner-cited
reason with which it has a similarity score of 75 (corresponding to “The reasons are closely related and
largely address the same issue in the claim”) and above, and vice versa.
We report the macro average (i.e., the average of all average per-claim scores) and the micro average
(i.e., the average of all per-reason scores). For the soft variant, we do not set a fixed threshold but
directly average over the maximum similarity in the respective dimension:</p>
        <disp-formula><tex-math><![CDATA[P = \frac{1}{n} \sum_{j=1}^{n} \max_{i} S_{ij}, \qquad R = \frac{1}{m} \sum_{i=1}^{m} \max_{j} S_{ij}.]]></tex-math></disp-formula>
        <p>In both cases, F1 is computed as the harmonic mean of P and R.</p>
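        <p>A compact sketch of both metric variants for a single claim, computed from its m×n similarity matrix (function name and return format are our own; macro/micro averaging across claims is omitted):</p>
        <preformat>
# Minimal sketch: soft and thresholded precision/recall/F1 from the m x n
# similarity matrix S (rows: examiner-cited reasons, columns: model-cited reasons).
import numpy as np

def judge_metrics(S, threshold=75.0):
    precision_soft = S.max(axis=0).mean()   # best match per model-cited reason
    recall_soft = S.max(axis=1).mean()      # best match per examiner-cited reason
    precision_thr = (S.max(axis=0) &gt;= threshold).mean()
    recall_thr = (S.max(axis=1) &gt;= threshold).mean()
    f1 = lambda p, r: 2 * p * r / (p + r) if p + r else 0.0
    return {
        "soft": (precision_soft, recall_soft, f1(precision_soft, recall_soft)),
        "thresholded": (precision_thr, recall_thr, f1(precision_thr, recall_thr)),
    }
        </preformat>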
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Multi-Label Classification</title>
        <p>We also assess the system as a multi-label classification task, where each label serves as a binary indicator
denoting the presence or absence of a specific category within the list of indefiniteness reasons. A high
score means that the model has classified a claim for reasons belonging to the same or overlapping
indefiniteness categories as the examiner, and thus also represents a kind of explanation for the binary
classification. We report the micro and macro average of the category-wise F1 scores. As with the
binary classification, we balance the predictions using the confidence threshold per category under
which the fraction of positive predictions matches the distribution in the validation set.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <p>[Table 4: definiteness prediction results — the fraction of claims predicted indefinite, binary precision (P), recall (R), F1, AUROC and accuracy, and multi-label macro/micro F1. Visible rows: Random, TF-IDF (46.2% indef.), Ling. Features (49.2%), All Features (49.2%); the numeric scores are not recoverable from this version.]</p>
      <p>Logistic Regression. Among the three proposed variants of logistic regression, the one with linguistic
features and the one with all features perform comparably, while using only TF-IDF features performs
substantially worse. This indicates that the handcrafted features are indeed helpful to differentiate
definite and indefinite claims, and that TF-IDF features provide only limited additional information.</p>
      <p>We show the feature importance values in Figure 3. The most important feature indicating definiteness
is the IOU of words between claim and description, i.e., samples with a high word overlap between
claim and description are less likely to be rejected for indefiniteness. The highest-weighted features
indicating indefiniteness are the number of unique word stems, the type token ratio and the number
of stopwords. Among the binary features showing whether a trigger word is contained in the claim,
“step for” is the most indicative of indefiniteness. The flag whether the claim is independent indicates
definiteness, which is in line with our dataset statistics, albeit with moderate importance. Many of
the text complexity and readability metrics have high importance scores. A higher text complexity is
associated with definiteness (with Gunning Fog, Automated Readability Index, and Flesch Kincaid Grade
having a negative feature weight) and better readability is associated with indefiniteness (with Flesch
Reading Ease and Dale Chall Readability Score having a positive feature weight). While this might
seem counterintuitive when relating definiteness to clarity or understandability, it makes sense when
considering definiteness as the absence of ambiguity, as removing ambiguities often adds complexity
through additional specification.</p>
      <p>LLM Agent. The LLM agents using Qwen 2.5 32B and 72B both make use of the provided tools, each
making at least one tool call for 98% of the samples. Notably, the larger model makes more tool calls (3.3
calls per sample on average) than the smaller model (2.3 calls per sample on average). This indicates that
the larger model is more thorough in searching for relevant information in the document, which could
be a reason for the slightly higher AUROC and accuracy. Recall and F1 are notably worse for the larger
model, caused by the imbalance of its predictions. Both metrics favor classifiers that frequently predict
indefiniteness: on a balanced dataset, F1 is by definition 2/3 when predicting indefiniteness for every
sample. The distribution of confidence scores produced by the 72B model makes it impossible to find a
threshold that moves the fraction of indefinite predictions close to 50%. We show the distribution of
confidence scores for both LLM agents and the logistic regression in Figure 4. The LLM agents are
typically highly confident in either extreme, whereas the logistic regression produces a smoother
distribution with a peak in the middle. In addition, as visible in Figure 4, there is no clear relationship
between confidence and accuracy in the LLMs’ binary predictions. This is unlike the predictions of the
logistic regression model, where the accuracy drops notably around the classification threshold. Overall,
the results indicate that the LLMs’ confidence estimates are not well-calibrated, and that better
calibration could enhance the LLMs’ classification performance.</p>
      <p>Multi-Label Classification. LLM agents and logistic regression achieve only moderate F1 scores. All
models achieve a much better micro F1 than macro F1, indicating that the performance varies between
categories. All models perform much better on the well-represented categories than on rare ones.
Given the substantial noise in the labels for the category “undefined term” (see Section 3.6), additional
investigations are necessary to draw further conclusions.</p>
      <p>Ensemble. We achieve the best overall performance using an ensemble between Qwen 2.5 72B and
the logistic regression. To create the ensemble, we average the predicted probability of indefiniteness for
every sample. This leads to moderate improvements, indicating that the predictions are complementary
to some degree.</p>
      <p>Judge Evaluation. Our LLM-as-Judge evaluation provides further insight into the LLM agents’
performance by directly analyzing the quality of their identified reasons for indefiniteness. Table 5 shows the
results for Qwen 2.5 32B and 72B. Interestingly, the smaller model clearly outperforms the larger one in
all judge metrics, despite exhibiting slightly lower binary classification performance. There seems to be
a disconnect between the models’ ability to identify potential reasons for indefiniteness and their ability
to accurately decide whether a claim is indefinite. The 32B model is able to correctly identify 35.4% of
the examiner-cited reasons (micro R≥75; 25.0% for 72B). Both models list many reasons, even if they
consider them unlikely to cause indefiniteness, causing the precision to be lower than the recall.</p>
      <table-wrap id="tab6">
        <label>Table 6</label>
        <caption>
          <p>Example reasons for indefiniteness cited by the examiner and generated by the Qwen 2.5 32B agent, with the LLM-as-Judge similarity score.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Examiner-cited reason</th><th>Model-cited reason</th><th>Judge score</th></tr>
          </thead>
          <tbody>
            <tr><td>The term ’right before’ in claim 9 is a relative term which renders the claim indefinite. The term ’right before’ is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. The amount of time necessary for an input to constitute being obtained ’right before’ the first input is unclear. Furthermore and for example, a person of ordinary skill in the art may consider an arbitrary number of previous inputs, such as three, to constitute inputs obtained ’right before’ a first input.</td><td>The term ’right before’ is a relative term that lacks a clear, objective standard for determining the exact input used to determine the first domain.</td><td>100%</td></tr>
            <tr><td>The claims recite the limitation ’ranking the one or more groups’. There is insufficient antecedent basis for this limitation in the claim, it appears likely that this claim should instead depend from claim 2.</td><td>The term ’one or more groups’ in claim 3 lacks antecedent basis as it is not introduced or defined in claim 1.</td><td>75%</td></tr>
            <tr><td>As per Claim 10, ’each highlight in the second pane in the second interface’ lacks antecedent basis to the extent that ’each’ implies that there is more than one highlight in the second pane. Claim 9 only recites where a single highlight is dropped into the second pane and Claim 1 recites where selected ones of the highlights are associated with the second pane of the second interface in a sequence (associating does not imply putting the selected highlights into the second pane of the second interface, and associating in a sequence can be where a sequence in the second interface includes elements referencing to the highlights themselves without including the highlights themselves).</td><td>The term ’highlight panel’ is not clearly defined in the claims or the description, leading to uncertainty about its scope.</td><td>25%</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Figure 5 also shows the relation between reason-level confidence and the soft micro-averaged
precision. Both models exhibit a positive trend; as their confidence in a reason increases, the likelihood
of the examiner citing the same or a similar reason also increases. The Pearson correlation between
confidence and precision is positive and moderate (r = 0.432 for 32B and r = 0.514 for 72B). That is,
the reason-level confidence appears to be better calibrated than the claim-level confidence. Future work
should further investigate the relation between these two sub-tasks and develop methods to bridge this
disconnect.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Discussion</title>
      <p>We here compare example reasons for indefiniteness generated by Qwen 2.5 32B with those written by
the human examiner. Table 6 lists three examples. In Example 1, the LLM agent correctly identified
the reason for indefiniteness, albeit with a less verbose explanation. In Example 2, the LLM agent also
found the correct underlying issue. However, its explanation is not as helpful as the one provided
by the human examiner, which additionally suggests that the claims might accidentally depend from
the wrong base claim. In Example 3, the model-generated reason points to similar problematic phrases,
but without identifying the underlying issue. The LLM-as-Judge correctly identifies that there is little
substantial overlap and assigns a lower score of 25%. Generally, the model-generated reasons seem to
be less detailed, and often lack the deep analysis and reasoning performed by human examiners.</p>
      <p>While our dataset creation process is domain-agnostic, our study focuses on patents from the NLP
domain. Therefore, the claim language and terminology are relatively homogeneous. Other domains
might differ in terms of terminology, distribution of claim types and categories of indefiniteness, and
legal conventions. Future work should generalize our study design and verify that our findings transfer
to other domains.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion and Outlook</title>
      <p>In this work, we tackle the task of automatic patent definiteness examination. We present PEDANTIC,
the first publicly available dataset for this task to enable reproducible experiments with different
examination models. We conduct first experiments on PEDANTIC using logistic regression and LLMs.
All models are able to perform better than random, but there is substantial room for improvement.
LLMs are able to identify many of the reasons for indefiniteness also cited by the human examiner, yet
their binary classification performance fails to clearly outperform logistic regression with hand-crafted
features. We show that poor calibration is one of the issues of current LLM-based setups. Promising
directions for future research include better calibration methods, fine-tuning on task-specific data using
supervised and/or reinforcement learning, and using the rejection full-text for in-depth evaluation of
other patent requirements like novelty and non-obviousness.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>We would like to thank the patent attorneys Philipp Mangold and Charlotte Hellmann for insightful
discussions and helpful pointers.</p>
    </sec>
    <sec id="sec-10">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Gemini 2.0 Flash in order to check grammar and
spelling, improve the writing style, and paraphrase and reword. After using these tools/services, the
authors reviewed and edited the content as needed and take full responsibility for the publication’s
content.</p>
    </sec>
    <sec id="sec-11">
      <title>A. Ofice Action Parsing Prompt</title>
      <p>1 ### TASK
2
3 Your task is to extract data related to claim rejections under 35 U.S.C. 112(b) or pre-AIA 35 U.</p>
      <p>S.C. 112, second paragraph (indefiniteness).
4
5 You will be provided with a snippet of text from a USPTO Office Action. You will identify claim
rejections based on indefiniteness (35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second
paragraph) and represent them in a JSON format.
Listing 1: Prompt used to parse ofice action into the described JSON schema.
rejection_categories are replaced with their respective values.</p>
      <p>schema and
B. Indefiniteness Examination Prompt
1 Examine this patent claim with respect to definiteness.
2
3 ### Guidelines
4
5 - Carefully dissect the claim and its features and search for common patterns that cause
indefiniteness.
6 - Think step by step and reason about potential issues! You can correct yourself at any point in
time, only the final verdict counts.
7 - Be very thorough and list all potential issues you can find!
8 - Ultimately, estimate the likelihood of the claim begin rejected due to indefiniteness. Note
that a single issue renders the entire claim indefinite.
9 - Completely ignore all other aspects like novelty or non-obviousness, focus entirely on
indefiniteness.
Listing 2: Prompt used to examine a given claim with respect to indefiniteness. indefiniteness_categories
and claim are replaced with their respective values.
1 &lt;instruction&gt;
2 You will evaluate the performance of an AI system used to identify indefiniteness issues in
patent claims. You will be given 2 short text snippets that mention an issue in a claim
that renders it indefinite, one written by a human examiner, one written by the AI. Your
task is to determine whether the two text snippets refer to the same issue or not. Ignore
differences in phrasing and evaluate only whether they pinpoint the same issue. You can
always assume that both text snippets talk about the same claim. Use the scale shown below.
3 &lt;/instruction&gt;
4 &lt;scale&gt;
5 1: The reasons are completely unrelated and address different concerns.
6 2: The reasons address distinct aspects of the claim with minimal overlap.
7 3: The reasons overlap in some areas but also have notable differences.
8 4: The reasons are closely related and largely address the same issue in the claim.
9 5: The reasons are essentially identical and address the same issue in the claim.
10 &lt;/scale&gt;
11 &lt;examples&gt;
12 &lt;example&gt;
13 &lt;text-1&gt;
14 The phrase ’like’ (stated in all the claims) renders the claim(s) indefinite because the claim(s
) include(s) elements not actually disclosed (those encompassed by ’like’), thereby
rendering the scope of the claim(s) unascertainable. See MPEP 2173.05(d).
15 &lt;/text-1&gt;
16 &lt;text-2&gt;
17 The claim does not specify how these ’component variables’ are utilized or how they relate to
the method described in Claim 13, which is critical to understanding the claim’s scope.
18 &lt;/text-2&gt;
19 &lt;score&gt; 1 &lt;/score&gt;
20 &lt;/example&gt;
21 &lt;example&gt;
22 &lt;text-1&gt;
23 The phrase ’like’ (stated in all the claims) renders the claim(s) indefinite because the claim(s
) include(s) elements not actually disclosed (those encompassed by ’like’), thereby
rendering the scope of the claim(s) unascertainable. See MPEP 2173.05(d).
24 &lt;/text-1&gt;
25 &lt;text-2&gt;
26 The use of ’like’ in ’example component variables of solution automation &amp; interface analysis (
like ’solution automation workflow variables’)’ suggests that these might be examples
rather than exhaustive lists, leading to potential ambiguity.
27 &lt;/text-2&gt;
28 &lt;score&gt; 5 &lt;/score&gt;
29 &lt;/example&gt;
30 &lt;example&gt;
31 &lt;text-1&gt;
32 Claim 1 recites the limitation ’a corresponding link instruction’ (line 2) and ’at least one
link instruction’ (line 12). It would be unclear to one having ordinary skill in the art
whether the above limitations are intended to be identical to, common to, or distinct from
one another.
33 &lt;/text-1&gt;
34 &lt;text-2&gt;
35 The term ’link instruction’ is not clearly defined in the specification, leading to ambiguity in
the scope of the claim.
36 &lt;/text-2&gt;
37 &lt;score&gt; 4 &lt;/score&gt;
38 &lt;/example&gt;
39 &lt;/examples&gt;
40 Apply this scheme, as shown in the examples, to the text snippets shown below.
41 &lt;text-1&gt;{reason_1}&lt;/text-1&gt;
42 &lt;text-2&gt;{reason_2}&lt;/text-2&gt;
43 Before you settle on a score, summarize the similarities and differences between the two reasons
for indefiniteness.</p>
      <p>Listing 3: Prompt used to evaluate the similarity of two reasons for indefiniteness. reason_1 and reason_2
are replaced with their respective values.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>USPTO</surname>
          </string-name>
          ,
          <source>Manual of Patent Examining Procedure (MPEP)</source>
          ,
          <year>2024</year>
          . URL: https://www.uspto.gov/web/offices/pac/mpep/index.html.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>USPTO</surname>
          </string-name>
          , Pendency | Patents Dashboard | USPTO,
          <year>2025</year>
          . URL: https://www.uspto.gov/dashboard/patents/pendency.html.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>World Intellectual Property Organization</surname>
          </string-name>
          ,
          <source>World Intellectual Property Indicators</source>
          <year>2024</year>
          , World Intellectual Property Organization,
          <year>2024</year>
          . doi:10.34667/TIND.50133.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Knappich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Razniewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hätty</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Friedrich,</surname>
          </string-name>
          <article-title>Pap2pat: Towards automated paper-to-patent drafting using chunk-based outline-guided generation</article-title>
          ,
          <source>arXiv preprint arXiv:2410.07009</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Myers</surname>
          </string-name>
          , S. Beliveau,
          <source>USPTO Patent Prosecution Research Data: Unlocking Office Action Traits</source>
          ,
          <year>2017</year>
          . doi:10.2139/ssrn.3024621.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <source>The Most Common Rejections: 102, 103, and 112(b)</source>
          ,
          <year>2019</year>
          . URL: https://blog.juristat.com/most-common-rejections.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] H.-C. Lo, J.-M. Chu, Pre-trained Transformer-based Classification for Automated Patentability Examination, in: 2021 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE), 2021, pp. 1–5. doi:10.1109/CSDE53843.2021.9718474.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, I. Stoica, Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, in: Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, 2023.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] J. Risch, N. Alder, C. Hewel, R. Krestel, PatentMatch: A Dataset for Matching Patent Claims &amp; Prior Art, 2020. doi:10.48550/arXiv.2012.13919. arXiv:2012.13919.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] K. Vowinckel, V. D. Hähnke, SEARCHFORMER: Semantic patent embeddings by siamese transformers for prior art search, World Patent Information 73 (2023) 102192. doi:10.1016/j.wpi.2023.102192.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] M. Blume, G. Heidari, C. Hewel, Comparing complex concepts with transformers: Matching patent claims against natural language text, in: 5th Workshop on Patent Text Mining and Semantic Technologies (PatentSemTech), volume 3775 of CEUR Workshop Proceedings, CEUR-WS.org, 2024, pp. 72–76.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] A. Parikh, S. Dori-Hacohen, ClaimCompare: A data pipeline for evaluation of novelty destroying patent pairs, in: 5th Workshop on Patent Text Mining and Semantic Technologies (PatentSemTech), volume 3775 of CEUR Workshop Proceedings, CEUR-WS.org, 2024, pp. 61–66.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] V. Stamatis, M. Salampasis, K. Diamantaras, A novel re-ranking architecture for patent search, World Patent Information 78 (2024) 102282. doi:10.1016/j.wpi.2024.102282.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] J. Shan, Q. Zhang, C. Shi, M. Gui, S. Wang, U. Naseem, Structural Representation Learning and Disentanglement for Evidential Chinese Patent Approval Prediction, in: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, ACM, 2024, pp. 2014–2023. doi:10.1145/3627673.3679766.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] Y. Shen, Z. Lin, PatentGrapher: A PLM-GNNs Hybrid Model for Comprehensive Patent Plagiarism Detection Across Full Claim Texts, IEEE Access 12 (2024) 182717–182725. doi:10.1109/ACCESS.2024.3508762.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] T. Wei, D. Feng, S. Song, C. Zhang, An extraction and novelty evaluation framework for technology knowledge elements of patents, Scientometrics (2024). doi:10.1007/s11192-024-04990-9.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] H. Ikoma, T. Mitamura, Can AI Examine Novelty of Patents?: Novelty Evaluation Based on the Correspondence between Patent Claim and Prior Art, 2025. doi:10.48550/arXiv.2502.06316. arXiv:2502.06316.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] R. Lee, A. Spangher, X. Ma, PatentEdits: Framing Patent Novelty as Textual Entailment, 2024. arXiv:2411.13477.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] S. Hido, S. Suzuki, R. Nishiyama, T. Imamichi, R. Takahashi, T. Nasukawa, T. Idé, Y. Kanehira, R. Yohda, T. Ueno, A. Tajima, T. Watanabe, Modeling Patent Quality: A System for Large-scale Patentability Analysis using Text Mining, Journal of Information Processing 20 (2012) 655–666. doi:10.2197/ipsjjip.20.655.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] N. Kong, U. Dulleck, A. B. Jaffe, S. Sun, S. Vajjala, Linguistic metrics for patent disclosure: Evidence from university versus corporate patents, Research Policy 52 (2023) 104670. doi:10.1016/j.respol.2022.104670.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] J. H. Ashtor, Modeling patent clarity, Research Policy 51 (2022) 104415. doi:10.1016/j.respol.2021.104415.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] M. Mersha, K. Lam, J. Wood, A. K. AlShami, J. Kalita, Explainable artificial intelligence: A survey of needs, techniques, applications, and future direction, Neurocomputing 599 (2024) 128111. URL: https://www.sciencedirect.com/science/article/pii/S0925231224008828. doi:10.1016/j.neucom.2024.128111.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] S. Huang, S. Mamidanna, S. Jangam, Y. Zhou, L. H. Gilpin, Can large language models explain themselves? A study of LLM-generated self-explanations, 2023. URL: https://arxiv.org/abs/2310.11207. arXiv:2310.11207.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] A. Madsen, S. Chandar, S. Reddy, Are self-explanations from Large Language Models faithful?, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Findings of the Association for Computational Linguistics: ACL 2024, Association for Computational Linguistics, 2024, pp. 295–337. doi:10.18653/v1/2024.findings-acl.19.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] C. Agarwal, S. H. Tanneru, H. Lakkaraju, Faithfulness vs. plausibility: On the (un)reliability of explanations from large language models, 2024. URL: https://arxiv.org/abs/2402.04614. arXiv:2402.04614.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] Nautilus, Inc. v. Biosig Instruments, Inc., 572 U.S. 898 (2014). URL: https://supreme.justia.com/cases/federal/us/572/898/.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] A. K. et al., Gemma 3 technical report, 2025. URL: https://arxiv.org/abs/2503.19786. arXiv:2503.19786.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] W. Fagen-Ulmschneider, Perception of Probability Words, 2025. URL: https://waf.cs.illinois.edu/visualizations/Perception-of-Probability-Words/.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] K. Tian, E. Mitchell, A. Zhou, A. Sharma, R. Rafailov, H. Yao, C. Finn, C. Manning, Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2023, pp. 5433–5442. URL: https://aclanthology.org/2023.emnlp-main.330/. doi:10.18653/v1/2023.emnlp-main.330.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] M. Xiong, Z. Hu, X. Lu, Y. Li, J. Fu, J. He, B. Hooi, Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs, in: The Twelfth International Conference on Learning Representations, 2024. URL: https://openreview.net/forum?id=gjeQKFxFpZ.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] V. Wang, M. J. Q. Zhang, E. Choi, Improving LLM-as-a-judge inference with the judgment distribution, 2025. URL: https://arxiv.org/abs/2503.03064. arXiv:2503.03064.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] M. Yasunaga, L. Shamis, C. Zhou, A. Cohen, J. Weston, L. Zettlemoyer, M. Ghazvininejad, ALMA: Alignment with minimal annotation, 2024. URL: https://arxiv.org/abs/2412.04305. arXiv:2412.04305.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>