=Paper=
{{Paper
|id=Vol-3775/paper5
|storemode=property
|title=ClaimCompare: A Data Pipeline for Evaluation of Novelty Destroying Patent Pairs
|pdfUrl=https://ceur-ws.org/Vol-3775/paper5.pdf
|volume=Vol-3775
|authors=Arav Parikh,Shiri Dori-Hacohen
|dblpUrl=https://dblp.org/rec/conf/patentsemtech/ParikhD24
}}
==ClaimCompare: A Data Pipeline for Evaluation of Novelty Destroying Patent Pairs==
Arav Parikh, Shiri Dori-Hacohen
University of Connecticut, School of Computing, Reducing Information Ecosystem Threats (RIET) Lab
Abstract
A fundamental step in the patent application process is determining whether prior patents exist that are novelty destroying. This step is routinely performed by both applicants and examiners in order to assess the novelty of proposed inventions among the millions of applications filed annually. However, conducting this search is time- and labor-intensive, as searchers must navigate complex legal and technical jargon while covering a large number of legal claims. Automated approaches that use information retrieval (IR) and machine learning (ML) to detect novelty destroying patents present a promising avenue to streamline this process, yet research focusing on this space remains limited. In this paper, we introduce a novel data pipeline, ClaimCompare, designed to generate labeled patent claim datasets suitable for training IR and ML models to address the challenge of novelty destruction assessment. To the best of our knowledge, ClaimCompare is the first pipeline that can generate multiple novelty destroying patent datasets. To illustrate the practical relevance of this pipeline, we utilize it to construct a sample dataset comprising over 27K patents in the electrochemical domain: 1,045 base patents from the USPTO, each associated with 25 related patents labeled according to whether they destroy the novelty of the base patent. Subsequently, we conduct preliminary experiments showcasing the efficacy of this dataset in fine-tuning transformer models to identify novelty destroying patents, demonstrating 29.2% and 32.7% absolute improvements in MRR and P@1, respectively.
Keywords
Patent novelty destruction, patent claims, data pipeline, machine learning, information retrieval
5th Workshop on Patent Text Mining and Semantic Technologies (PatentSemTech) 2024
arav.parikh@uconn.edu (A. Parikh); shiridh@uconn.edu (S. Dori-Hacohen)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Patent search is a rich and challenging space which comprises a diverse set of tasks, including Freedom to Operate (FTO) searches, novelty or patentability searches, and validity searches. Within this spectrum, patentability searches hold particular significance as they help gauge whether an invention's claims (i.e., structural features) are novel and non-obvious, and are therefore a critical part of the patent examination process. This assessment typically involves patent examiners, as well as inventors or their legal representatives, meticulously combing through prior art databases to uncover any existing disclosures that could potentially anticipate the invention and, consequently, undermine its novelty. In the United States, in particular, prior art is deemed "novelty destroying" if it anticipates or references every element of at least one of a proposed invention's claims.

Traditionally, prior art searches have been performed manually, with searchers iteratively crafting and revising complex keyword and Boolean queries to obtain the most relevant documents. However, with the number of patent applications and the volume of prior art growing annually, the labor-intensive nature of this process has become increasingly unsustainable, driving a growing interest in the usage of information retrieval (IR), machine learning (ML), and deep learning (DL) approaches to streamline search methodologies and optimize result relevance, for example, via query expansion and targeted semantic similarity techniques [1, 2, 3]. These advances build on prior work in the patent space pertaining to automated patent landscaping and other automation tasks related to patent code classification and categorization [4, 5, 6, 7]. Far less work has focused on finding novelty destroying prior art; in fact, to date, there is only one other public dataset dedicated to this task [8], and none on US patent data.

Contributions and Scope. In this paper, we introduce ClaimCompare, a novel data pipeline for generating patent claim datasets, labeled with respect to the novelty destruction search problem, in order to facilitate improved performance in this space. To the best of our knowledge, ClaimCompare is the first pipeline that can generate multiple novelty destroying patent datasets. We leverage publicly available United States Patent and Trademark Office (USPTO) APIs, alongside web-scraping Google Patents, in order to curate such datasets. To simplify the problem, we focus only on identifying novelty destroying patents, rather than all potential literature contributing to novelty destruction. Our contributions are as follows:

• We construct ClaimCompare, a pipeline utilizing the USPTO API and Google Patents to generate curated novelty destroying datasets.

• We utilize ClaimCompare to curate a sample dataset of 27K patents in a specialized domain, comprising 1,045 base patents and 25 related patents for each. Of the base patents, 357 (34%) have one or more identified novelty destroying patent(s).
• To assess the effectiveness of ClaimCompare and the sample dataset, we perform experiments utilizing LLMs fine-tuned on our dataset to assist with novelty determination. Our experiments demonstrate 29.2% and 32.7% absolute improvement in MRR and P@1, respectively, over a baseline model.

We envision ClaimCompare being used to generate both generic and domain-specific training datasets at scale, focused specifically on the task of novelty determination. These datasets can subsequently be used to train and test a variety of IR, AI/ML, and/or DL models for this task. We release all our pipeline code and data at https://github.com/RIET-lab/claim-compare.

Dataset | Size | Data Source | Positive Samples | Negative Samples | Matching Strategy | Balanced?
PatentMatch v2 | 25K | EPO | Search report "X" citations | Search report "A" citations | Specific excerpts/lines | Yes
CC Sample | 27K | USPTO | Office action 102 rejections | Similar keyword patents | Entire claim sets | No

Table 1: PatentMatch vs. a sample ClaimCompare (CC) dataset. Note that the ClaimCompare pipeline can be used as-is in order to generate many other datasets, including significantly larger ones.

2. Prior Work

Despite a very rich literature on patent search overall [9], there is little work on the important, highly-specialized task of detecting novelty destroying patents.

The now discontinued CLEF-IP tracks in 2012 and 2013 present useful datasets related to patentability searching. The 2012 edition released a claims-to-passage dataset containing 2.3 million European patent documents with 2,700 corresponding relevance judgements [10]. The shared task is to create the most effective passage retrieval system given a particular topic. However, the task focused on claim sets, rather than single claims, and considers "X" (novelty destroying) and "Y" (inventiveness destroying) passages as equally relevant, merging the two problems rather than isolating novelty destruction.

A key paper on the task of novelty evaluation is PatentMatch [8], which offers the first dataset directly addressing novelty destruction in patents by leveraging European Patent Office (EPO) search reports. The dataset is composed of pairs, each containing an individual patent application claim paired with a passage from either an "X" citation or an "A" (background) citation. Unfortunately, as the authors note, fine-tuning a BERT model on the dataset produces relatively poor results with an accuracy of 54% in the best case. A follow-up empirical study utilizing the dataset also found fine-tuned BERT and SBERT models to perform poorly, with accuracies of 54% and 57%, respectively [11]. In examining the PatentMatch dataset further, we see that many of the excerpts are quite short in length, lacking the context of the broader patent, which helps explain why context-dependent transformer models like BERT struggle to effectively capture the nuanced semantic relationships defining novelty destruction between patents. In light of this observation, we opt against incorporating the specific excerpts into our dataset, favoring instead the inclusion of broader claim sets which concisely encapsulate the essential elements that define the novelty of an invention, or lack thereof, while still providing sufficient context.

To the best of our knowledge, our paper is the first to approach novelty destruction from a US-centric perspective, and offers the first public dataset for this task as well as a pipeline to easily generate additional datasets.

3. Methodology

We now introduce the ClaimCompare data pipeline (Figure 1), which generates domain-centric and domain-agnostic datasets in order to train DL models to assess patent novelty. For the remainder of the paper, we focus on describing our pipeline, sharing information about the sample dataset, and demonstrating its effectiveness for this task via preliminary experiments involving large language models (LLMs) fine-tuned on this sample dataset.

3.1. Approach

To ground the use of our ClaimCompare pipeline in the context of novelty determination specifically, we define the novelty destruction search problem as a sub-problem of the prior art retrieval process. Accordingly, ClaimCompare operates under the assumption that a preliminary set of prior art patents of size k has already been retrieved by the searcher for a given query (i.e., base patent q), using preexisting methods. While the models trained with ClaimCompare's datasets can certainly be used to improve direct retrieval of only novelty destroying patents (i.e., when k is made sufficiently large), in this paper we focus on applying these models as a filter on top of a smaller subset of retrieved patents.
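To illustrate this intended usage, the sketch below applies a trained pairwise classifier as a filter over a set of retrieved candidates. It is a minimal illustration rather than our actual code: the checkpoint path is a placeholder for whichever model is fine-tuned on a ClaimCompare dataset (see Section 4).

```python
# Score each retrieved candidate against the base patent and keep candidates
# that the model considers likely novelty destroying.
# The checkpoint path below is a placeholder, not a released model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "path/to/fine-tuned-claim-classifier"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)

def filter_candidates(base_claims: str, candidate_claims: list[str], threshold: float = 0.5):
    """Return (candidate, score) pairs whose positive-class probability exceeds the threshold."""
    flagged = []
    for cand in candidate_claims:
        inputs = tokenizer(base_claims, cand, truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=-1)
        score = probs[0, 1].item()  # probability of the "novelty destroying" class
        if score >= threshold:
            flagged.append((cand, score))
    return sorted(flagged, key=lambda x: x[1], reverse=True)
```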
Figure 1: The ClaimCompare Pipeline. ClaimCompare can accept any initial seed queries and generate a novelty destroying dataset for that query set. For each base patent, the pipeline finds k novelty destroying and related, non-novelty destroying patents. The pipeline utilizes two types of API calls to the USPTO (Bulk Data and Office Action APIs) and scrapes data from Google Patents to account for patent number changes.
To develop ClaimCompare, we primarily rely on two publicly available USPTO APIs to access the patent data necessary to teach subsequently trained models to semantically differentiate between novelty destruction and mere relevance. Relevant patents constitute "negative" samples in our dataset and are queried for with keywords extracted from the base patents, mimicking a common prior art search practice. In contrast, novelty destroying "positive" samples are generated by obtaining USPTO office actions with rejections citing novelty destroying prior art patents. For our positive samples, we only consider office actions that contain a 102 rejection, indicating that the citation is considered to be novelty destroying for the corresponding application by skilled patent examiners. Our focus on 102 rejections is deliberate, as they only reference elements from a single document as sufficient to destroy the novelty of a patent application, whereas the more nuanced and complex 103 rejections often refer to elements from multiple patents and/or common technical knowledge coming together to invalidate the novelty of the proposed patent.

3.2. Implementation Details

ClaimCompare's pipeline starts with a set of seed queries, which are sent to the USPTO Bulk Data API (https://developer.uspto.gov/api-catalog/bulk-search-and-download). In the context of the provided sample dataset, we use the phrase "redox flow battery" as a query to retrieve inventions in the electrochemical device space; naturally, this can be replaced with keywords/phrases in any given domain.

For each retrieved patent application, we collect the application and publication numbers, abstract, and claims in order to form our set of base patents; we then set out to acquire their cited novelty destroying patent(s), if applicable, and other related patents. For the cited patents, we query the USPTO Office Action Citation API (https://developer.uspto.gov/api-catalog/uspto-office-action-citations-api-beta) using the base patents' application numbers. If an office action is found with a 102 rejection, we take the rejection text and pass it to a text2text generation base T5 model along with a standardized prompt in order to extract the publication number of the novelty destroying patent cited within the text. The T5 performs this task with a 94% success rate, which is sufficiently high for our needs.

We then perform a simple cleanup on the publication number and use it to query for its claims via Google Patents (https://patents.google.com), which can redirect to the most up-to-date version of the patent if the number is outdated, unlike the APIs. To acquire our negative samples, we extract keywords and phrases from the base patent abstract. We apply the KeyBERT model to get the top 5 keywords from the abstract [12], with which we query the USPTO Bulk Data API for however many relevant patents are needed to meet the limit of k per base patent, depending on the number of positive samples previously acquired. We omit smaller details of the related patent query process for space considerations; we refer interested readers to our codebase.
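As a rough illustration of the per-base-patent logic just described, the following Python sketch mirrors the control flow only: the three placeholder helpers stand in for the pipeline's actual USPTO API calls and Google Patents scraping, and the prompt wording is illustrative rather than the standardized prompt we use.

```python
# Illustrative sketch of ClaimCompare's per-base-patent orchestration.
# The placeholder helpers below are NOT real API wrappers; they stand in for
# the pipeline's USPTO requests and Google Patents scraping code.
from keybert import KeyBERT
from transformers import pipeline

kw_model = KeyBERT()
t5 = pipeline("text2text-generation", model="t5-base")
PROMPT = "Extract the publication number of the patent cited in this rejection: "  # illustrative

def get_102_rejections(application_number):
    """Placeholder: return the rejection text of any 102 rejections found via
    the USPTO Office Action Citation API for this application."""
    return []

def get_claims_from_google_patents(publication_number):
    """Placeholder: scrape the (possibly redirected) Google Patents page and
    return the patent's claim set, or None if unavailable."""
    return None

def search_bulk_data(keywords, limit):
    """Placeholder: keyword query against the USPTO Bulk Data API, returning
    up to `limit` related patents' claim sets."""
    return []

def build_row(base_patent, k=25):
    """Collect up to k labeled candidates (positives + keyword negatives) for one base patent."""
    positives = []
    for rejection_text in get_102_rejections(base_patent["application_number"]):
        # Extract the cited publication number from the rejection text with T5.
        pub_number = t5(PROMPT + rejection_text)[0]["generated_text"].strip()
        claims = get_claims_from_google_patents(pub_number)
        if claims:
            positives.append({"claims": claims, "label": 1})
    # Top-5 abstract keywords drive the query for merely-relevant negatives.
    keywords = [kw for kw, _ in kw_model.extract_keywords(base_patent["abstract"], top_n=5)]
    negatives = [{"claims": c, "label": 0}
                 for c in search_bulk_data(keywords, limit=k - len(positives))]
    return positives + negatives
```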
3.3. Dataset Structure

Of the 1,045 rows in our raw sample dataset, 357 (34%) contain at least one positive sample. Of these 357 rows, 36 (10%) have two positive samples; there are no rows with three or more novelty destroying patents in our dataset.
To ensure that the number of samples per row is always k, we set Claims_25 and/or Claims_24 and their corresponding publication numbers as null values, depending on the number of novelty destroying patents present. If none are found for a given base patent, then the rejection columns are set as null instead.

Clearly, given the structure of our dataset, there is an inherent lack of balance between the two classes. We deliberately maintain this imbalance for two main reasons. Firstly, it reflects the current state of prior art searching, where there are far more relevant samples than novelty destroying samples for any given patent. While imbalanced datasets typically pose challenges for training ML or DL models, we are intrigued to explore how this realistic representation of class distribution influences model performance in our experimentation. Secondly, although we do not directly train a ranking model in our experimentation, we indirectly test the ability of our model to rank prior art patents based on their likelihood of invalidating the novelty of a given base patent, a task which necessitates a larger value for k and, consequently, an imbalanced dataset.
4. Experimental Setup and Results

In order to assess the effectiveness of our dataset in training LLMs to assist with novelty determination, we test whether fine-tuning these models outperforms a baseline pre-trained BERT-based transformer model.

4.1. Training Data

To prepare our raw sample dataset for model training, we drop all non-claim columns and convert the row-wise format of the dataset into a pairwise format such that each base patent is individually matched with each of its relevant or novelty destroying patents to form the training examples. In other words, instead of having rows where a base patent is matched with 25 related patents, there are now 25 rows enumerating each of these matches. The base patents are found in the Claims_x column while their related matches are each found in the Claims_y column. If the pair is novelty destroying, the Label column contains a 1 to denote the positive match; otherwise, it contains a 0.

We use an 80-10-10 stratified train-val-test split, with each of the splits possessing roughly the same proportion of positive samples. To avoid data leakage, the base patents are restricted to one of the splits such that all 25 pairs can be found in that split. To mitigate the effects of sampling bias, we randomly subsample the raw dataset twice more to generate a total of three unique train-val-test splits for our model. The results we present are all averaged across these three runs.

We also intentionally downsample the negative samples in our training dataset to observe the effect this has on both the validation and testing metrics. We perform this downsampling by simply reducing k = 25 to k = 10 and k = 5 on our three shuffled train-val-test splits, retaining all positive samples while randomly sampling the required number of negative examples from the larger set for each base patent.
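The snippet below is a minimal pandas sketch of this row-to-pairwise conversion and the grouped 80-10-10 split. The raw column names (base_claims, Claims_1 ... Claims_k, Label_1 ... Label_k) are illustrative placeholders rather than the dataset's exact schema, and exact stratification by positive-sample proportion is omitted for brevity.

```python
# Minimal sketch: explode each base-patent row into (base, candidate, label)
# pairs, then split so all pairs of a base patent land in the same split.
# Column names are assumed placeholders, not the released dataset's schema.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def to_pairwise(raw: pd.DataFrame, k: int = 25) -> pd.DataFrame:
    rows = []
    for _, r in raw.iterrows():
        for i in range(1, k + 1):
            candidate = r.get(f"Claims_{i}")
            if pd.isna(candidate):            # null-padded slots (e.g. Claims_25) are skipped
                continue
            rows.append({
                "base_id": r["publication_number"],
                "Claims_x": r["base_claims"],
                "Claims_y": candidate,
                "Label": int(r.get(f"Label_{i}", 0)),
            })
    return pd.DataFrame(rows)

def grouped_split(pairs: pd.DataFrame, seed: int = 0):
    """Roughly 80-10-10 split that keeps every pair of a base patent in one split."""
    outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    train_idx, rest_idx = next(outer.split(pairs, groups=pairs["base_id"]))
    rest = pairs.iloc[rest_idx]
    inner = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=seed)
    val_idx, test_idx = next(inner.split(rest, groups=rest["base_id"]))
    return pairs.iloc[train_idx], rest.iloc[val_idx], rest.iloc[test_idx]
```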
Model | AUROC | AP | MRR | P@1
General Baseline | 0.473 | 0.350 | 0.697 | 0.651
Domain Baseline | 0.589 | 0.464 | 0.703 | 0.651
Fine-Tuned (k = 25) | 0.999 | 0.999 | 0.989 | 0.978
Fine-Tuned (k = 10) | 0.999 | 0.998 | 0.987 | 0.975
Fine-Tuned (k = 5) | 0.982 | 0.975 | 0.967 | 0.934

Table 2: Model testing results. General baseline and fine-tuned models rely on DistilRoBERTa. Domain baseline relies on BERT for Patents.

4.2. Model Fine-Tuning

For our experiments, we fine-tune a sequence classification model with our sample dataset. We primarily use the base DistilRoBERTa model (https://huggingface.co/distilbert/distilroberta-base) due to its compact size and robust performance on related tasks [13]. We also use the BERT for Patents model (https://huggingface.co/anferico/bert-for-patents) as a stronger, domain-specific baseline but unfortunately lack the computational resources to fine-tune such a large model. As a result, we choose to fine-tune the DistilRoBERTa model "from scratch," presenting an intriguing opportunity to assess the ability of this model to adapt to both the broader patent domain and our specific novelty determination use case. We train the model for 3 epochs with a cross entropy loss function, a training and validation batch size of 16, a learning rate of 0.00002, and a weight decay of 0.01.
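A minimal sketch of this fine-tuning setup using the Hugging Face Trainer with the hyperparameters listed above; the train_pairs and val_pairs DataFrames are assumed to come from the preprocessing sketch in Section 4.1, and the output directory name is a placeholder.

```python
# Sketch of pairwise fine-tuning of DistilRoBERTa for sequence classification.
# Hyperparameters follow the text; dataset plumbing is simplified.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

def tokenize(batch):
    # Encode each (base claims, candidate claims) pair as one input sequence.
    return tokenizer(batch["Claims_x"], batch["Claims_y"], truncation=True, max_length=512)

train_ds = Dataset.from_pandas(train_pairs.rename(columns={"Label": "labels"})).map(tokenize, batched=True)
val_ds = Dataset.from_pandas(val_pairs.rename(columns={"Label": "labels"})).map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="claimcompare-distilroberta",  # placeholder
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
)
# The default loss for sequence classification is cross entropy, as in the text.
Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds,
        tokenizer=tokenizer).train()
```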
4.3. Model Evaluation and Discussion

Once the model has been trained, we assess its performance on our testing dataset. However, rather than solely test its ability to perform pairwise classifications, we take each set of 25 test patents and combine the model's pairwise predictions for each of the patents in the set using a simple logical OR as an ensemble, such that if any patent in the set is deemed to be novelty destroying, the base patent is found not novel. To quantify the model's performance on this task, we compute classification metrics suitable for our imbalanced dataset such as average precision (AP) and area under the receiver operating characteristic curve (AUROC).
Since these metrics require prediction scores as opposed to labels to compute, we take the maximum novelty destruction score (i.e., the pairwise probability of the positive class) out of all the patents as the representative score for the set.

Additionally, we evaluate the model in the context of ranking, using the novelty destruction scores to rank order the test patents in each set of 25. We use ranking metrics such as MRR and Precision@1 to assess the performance. The unique nature of the novelty destruction task means there are often no members of the positive class in a given set; we therefore introduce a placeholder patent, assigned a score of 0.9, in order to accurately compute these metrics. The placeholder patent is inserted into the set prior to ranking, in order to denote a patent which the model regards as reasonably novelty destroying. To facilitate accurate comparisons across the board on the testing set, we introduce the placeholder patent to all of the patent sets, such that if a novelty destroying patent is actually present, it is ideally ranked first, above the placeholder patent. Alternatively, when no novelty destroying patent is present, the placeholder patent is ideally ranked #1.
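For concreteness, the sketch below computes the set-level score and the placeholder-based ranking metrics for a single base patent's candidate set; function and variable names are illustrative rather than taken from our codebase. MRR and P@1 are then averaged over all test sets.

```python
# Set-level evaluation sketch: max pairwise positive-class probability as the
# set score, plus placeholder-based reciprocal rank and P@1 for one set.
import numpy as np

PLACEHOLDER_SCORE = 0.9  # score assigned to the placeholder patent

def set_score(pair_probs):
    """Representative score for one base patent's candidate set (used for AUROC/AP)."""
    return max(pair_probs)

def rank_metrics(pair_probs, pair_labels):
    """Reciprocal rank and P@1 for one candidate set, with the placeholder appended.
    The placeholder counts as the relevant item only when no true positive exists."""
    scores = list(pair_probs) + [PLACEHOLDER_SCORE]
    labels = list(pair_labels) + [0 if any(pair_labels) else 1]
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    rank = next(i + 1 for i, idx in enumerate(order) if labels[idx] == 1)
    return 1.0 / rank, float(rank == 1)  # (reciprocal rank, P@1)
```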
Examining the impact of fine-tuning with our dataset on model performance (Table 2), we see significant improvement over both baselines. As expected, the models perform quite poorly in the baseline, zero-shot setting, especially in terms of classification, with BERT for Patents exhibiting only a slight improvement over DistilRoBERTa. The near-perfect classification and ranking performance metrics we see after fine-tuning, however, are quite surprising. The high AUROC and AP, in particular, indicate that the model is able to differentiate between the positive and negative samples with a high degree of confidence. The k = 25 model is the top performer, but with little distinguishing it from the k = 10 model. This is intriguing as the model appears to be performing better in spite of the greater imbalance, perhaps simply due to the presence of more data allowing the model to better learn the relationships defining novelty destruction.

4.4. Limitations and Future Work

These results, though incredibly promising, point to some limitations, given the large gap between the baseline and fine-tuned models and the exceptionally high testing scores. While we can confidently say that the fine-tuned models are more suited to the novelty evaluation task than the base models, further comparisons are needed against models pre-trained on general legal data, which would likely provide a stronger baseline; due to computational resource limitations at submission time, we leave this to future work.

We note that our models do not appear to be overfitting, given that the loss trends during training are nominal, and these models generalize well to the unseen data of both the validation and testing sets. Manual examination of a significant portion of the dataset shows no sign of data leakage, and the training appears to be sound as well. Thus, we hypothesize that the high absolute results are the consequence of the negative samples selected via keyword search being far too "easy" to differentiate from the positive, novelty destroying samples. Whether this issue stems from the keyword extraction process itself, from the fact that more powerful Boolean queries are needed to obtain the most relevant results, or perhaps from another unseen factor, requires further experimentation. We leave an in-depth exploration of how to leverage unique inter-patent relationships to build upon our pipeline to future work; likewise with the task of finding more semantically similar non-novelty destroying patents (i.e., harder negatives) to match with the application to improve robustness at the classification boundary. We note that the USPTO has a new Citation API that can potentially be of use for these goals.

We also invite future work to employ ClaimCompare in order to generate additional datasets, inclusive of additional technical fields, for example, by creating queries that utilize Cooperative Patent Classification (CPC) codes instead of keywords. With these datasets, future work can train state-of-the-art models, such as generative LLMs, to improve performance even further.

5. Conclusion

In this paper, we introduce a novel pipeline, ClaimCompare, to generate datasets geared towards patent novelty evaluation; offer a sample dataset generated using ClaimCompare; and assess its utility in fine-tuning a novelty destroying classifier. We leverage USPTO APIs to obtain our novelty and non-novelty destroying data. To the best of our knowledge, this is the first usage of USPTO office actions and patent claims as a source of data for this task, providing high quality datasets while being fairly straightforward, flexible, and easy to replicate.

We believe ClaimCompare holds the potential to accelerate research in the patent retrieval field, in conjunction with the use of LLMs and other cutting-edge DL models. Improving novelty determination holds the promise of reducing the time and monetary burden of patent search and patentability determinations, while also increasing accuracy, thereby saving searchers' time and making patent databases more accessible for all. In this sense, we hope that our pipeline can facilitate the democratization of what has been a historically complex process, and enable inventors, attorneys, and patent examiners alike to assess patent novelty with greater ease.
References
[1] H. Aras, R. Türker, D. Geiss, M. Milbradt, H. Sack, Get
your hands dirty: Evaluating word2vec models for patent
data, in: International Conference on Semantic Systems,
2018.
[2] L. Helmers, F. Horn, F. Biegler, T. Oppermann, K.-R.
Müller, Automating the search for a patent’s prior art
with a full text similarity search, PLOS ONE 14 (2019)
1–17.
[3] J. Navrozidis, H. Jansson, Using natural language pro-
cessing to identify similar patent documents, LU-CS-EX
(2020).
[4] A. Abood, D. Feltenberger, Automated patent landscap-
ing, Artificial Intelligence and Law 26 (2018) 103–125.
[5] S. Choi, H. Lee, E. L. Park, S. Choi, Deep patent land-
scaping model using transformer and graph embedding,
arXiv preprint arXiv:1903.05823 (2019).
[6] M. F. Grawe, C. A. Martins, A. G. Bonfante, Automated
patent classification using word embedding, in: 2017 16th
IEEE International Conference on Machine Learning and
Applications (ICMLA), IEEE, 2017, pp. 408–411.
[7] C. J. Fall, A. Törcsvári, K. Benzineb, G. Karetka, Auto-
mated categorization in the international patent classifi-
cation, in: ACM SIGIR Forum, volume 37, ACM New York,
NY, USA, 2003, pp. 10–25.
[8] J. Risch, N. Alder, C. Hewel, R. Krestel, PatentMatch: a
dataset for matching patent claims & prior art, arXiv
preprint arXiv:2012.13919 (2020).
[9] R. Krestel, R. Chikkamath, C. Hewel, J. Risch, A sur-
vey on deep learning for patent analysis, World Patent
Information 65 (2021) 102035.
[10] J. Gobeill, P. Ruch, BiTeM site report for the claims to
passage task in CLEF-IP 2012, in: CLEF 2012, 2012.
[11] R. Chikkamath, M. Endres, L. Bayyapu, C. Hewel, An
empirical study on patent novelty detection: A novel
approach using machine learning and natural language
processing, in: 2020 Seventh International Conference
on Social Networks Analysis, Management and Security
(SNAMS), IEEE, 2020, pp. 1–7.
[12] M. Grootendorst, KeyBERT: Minimal keyword extraction
with BERT, 2020.
[13] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT,
a distilled version of BERT: smaller, faster, cheaper and
lighter, arXiv preprint arXiv:1910.01108 (2019).