=Paper=
{{Paper
|id=Vol-2909/paper2
|storemode=property
|title=Prior Art Search and Reranking for Generated Patent Text
|pdfUrl=https://ceur-ws.org/Vol-2909/paper2.pdf
|volume=Vol-2909
|authors=Jieh-Sheng Lee,Jieh Hsiang
}}
==Prior Art Search and Reranking for Generated Patent Text==
Jieh-Sheng Lee∗, National Taiwan University, Taipei, Taiwan (d04922013@csie.ntu.edu.tw)
Jieh Hsiang, National Taiwan University, Taipei, Taiwan (hsiang@csie.ntu.edu.tw)
ABSTRACT
Generative models, such as GPT-2, have demonstrated impressive results recently. A fundamental question we would like to address is: where did the generated text come from? This work is our initial effort toward answering the question by using prior art search. The purpose of the prior art search is to find the most similar prior text in the training data of GPT-2. We take a reranking approach and apply it to the patent domain. Specifically, we pre-train GPT-2 models from scratch by using the patent data from the USPTO. The input for the prior art search is the patent text generated by the GPT-2 model. We also pre-trained BERT models from scratch for converting patent text to embeddings. The steps of reranking are: (1) search for the most similar text in the training data of GPT-2 by taking a bag-of-words ranking approach (BM25), (2) convert the search results in text format to BERT embeddings, and (3) provide the final result by ranking the BERT embeddings based on their similarities with the patent text generated by GPT-2. The experiments in this work show that such reranking is better than ranking with embeddings alone. However, our mixed results also indicate that calculating the semantic similarities among long text spans is still challenging. To our knowledge, this work is the first to implement a reranking system to identify retrospectively the most similar inputs to a GPT model based on its output.

KEYWORDS
patent, natural language generation, natural language processing, deep learning, semantic search

∗ Admitted in New York and passed the USPTO patent bar exam.

PatentSemTech, July 15th, 2021, online. © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

1 INTRODUCTION
Generative models based on Deep Learning techniques have shown significant progress in recent years. A long-term objective of our research is to evaluate the novelty of the text produced by generative models in the patent domain. Before evaluating the novelty, a prerequisite is to identify the closest prior art. The scope of the prior art is the patent text used for training the generative models. From the perspective of system implementation, the approach this paper takes is to integrate "patent text generation" and "prior art search." The purpose of this paper is to fulfill the prerequisite of identifying prior art so that the novelty of the generated patent text can be evaluated in the future. Assuming that the opposite of novelty in text generation is memorizing training text, the pre-training of the GPT-2 generative models in [18] is indicative of such novelty. In the paper, the authors observed some text-memorizing behavior in their models on longer strings that are repeated many times in the dataset. The authors quantified how often exact memorization shows up in the generated text by measuring the percentage of 8-gram overlap. According to the authors, most samples have less than 1% overlap, including over 30% of samples with no overlap. Such results indicate that GPT-2 models can generate novel text relatively well.

In the patent domain, the authors in [12] applied such a GPT-2 model to generate patent claims. The authors proposed an idea called "Augmented Inventing," aiming to help inventors conceive new patents in a better way. Since patent claims are generally longer than ordinary sentences, the authors proposed a "span-based" approach to decompose a long patent text into multiple shorter text spans. The authors also proposed an "auto-complete" function to generate patent text on a span basis. From a legal perspective, such a function will be valuable if it can generate something new and meet (at least) the "novelty" requirement in patent laws. However, for a generative model to meet the legal requirement, a fundamental question is how to calculate the similarity between generated patent text and prior patents. In [12], the GPT-2 model can generate plausible patent claims in surface form, but it is unclear how novel the patent text is. To address the problem, the authors proposed a dual-Transformer framework (using one Transformer to measure the other Transformer), and they tried to measure the quality of patent text generation (by span relevancy in [11] and by semantic search in [10]). Despite these efforts, measuring the novelty in patent text generation remains an open problem.

From a different perspective, building a generative patent model to augment inventors might be the beginning of the era of human-machine co-inventing or meta-inventing (inventing how to invent). In such an era, measuring the novelty created by the generative model will be an essential function. To measure the novelty, it is required to compare the output of the model with its inputs. In this work, our implementation scope is to compare the generated patent text with the original patent text in the training dataset. Since the training dataset is large, in order to narrow the scope of comparison, it is required to identify the most similar prior text in the training dataset. Therefore, our implementation is to build such a prior art search system. We found that reranking is a practical way to make the search more effective. As proof of concept, we limit the data scope in this work to granted patents only. The prior art search is also limited to finding the most relevant text in a span-based fashion. How to aggregate the similarities of multiple text spans into a longer sentence or a paragraph is a topic for the future.
2 RELATED WORK
The main challenge of our prior art search is how to calculate the semantic similarity between two patent text spans. In the past, most prior art searches were performed at the word level, such as keywords or phrases. For example, the authors in [4] found that combining unigrams and PoS-filtered skipgrams leads to a significant improvement in classification scores over the unigram baseline. In recent years, researchers moved toward neural network models and embeddings for the semantic search of longer text. For example, in [21], the authors utilized domain-specific word embeddings for patent classification. In [16], the authors proposed "Quick Thought" to represent a sentence in a fixed-length vector. The scheme is similar to the skip-gram method in Word2Vec [17], escalating the idea from the word level to the sentence level. Another line of development is based on new neural architectures, such as the Transformer [26]. Notably, BERT [2] and RoBERTa [15] set a new state-of-the-art performance on sentence-pair regression tasks, e.g., semantic textual similarity (STS). According to [19], however, BERT is unsuitable for semantic similarity search on a large scale. For example, finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations (65 hours) with BERT. The authors in [19] proposed a modification of the pre-trained BERT model to use siamese and triplet network structures. Their model, called "Sentence-BERT," can derive semantically meaningful sentence embeddings to be compared by using cosine similarity. Significantly, it reduces the effort for finding the most similar pair from 65 hours with BERT/RoBERTa to about 5 seconds while maintaining the accuracy of BERT, according to [19]. Specific to the patent domain, the authors in [14] showed that embeddings could be a better metric than conventional ROUGE (word-based) for measuring semantic similarity. The metric for measuring embeddings in [14] is based on the Universal Sentence Encoder (USE) [7] without any fine-tuning.
A further line of development is to combine both the word level and the embedding level. For example, NBoost [23] can deploy Transformer models to improve the relevance of search results on conventional word-based search engines, such as Elasticsearch using BM25. According to [23], NBoost works like a proxy between users and Elasticsearch. It leverages fine-tuned models to produce domain-specific results. In a search request, the user sends a query to NBoost. Then, NBoost asks for results from Elasticsearch, picks the best ones based on the fine-tuned model, and returns its final results to the user. Specifically, if a user asks for 10 results, NBoost can increase the number of requests for Elasticsearch to produce 100 records (word-based) and then pick the best 10 results (embedding-based). Such a technique is called reranking.
3 APPROACH

3.1 Semantic Search with Reranking
Compared with contextualized word embeddings, we found the research in sentence embeddings more challenging and less explored. For example, the USE model in [7] is publicly available, but the code for pre-training or fine-tuning is not. Without fine-tuning with domain-specific data, a model could deviate from a downstream task and fail to perform well in the specific domain. In our experience, the USE model alone without fine-tuning is not satisfactory as a metric for measuring the semantic similarity of patent spans. We also found that with a BERT model, even one pre-trained with a patent corpus, the false-positive rate of the semantic similarity based on BERT embeddings is still high. The Sentence-BERT in [19] might be a solution to these problems. However, if we would like to take the Sentence-BERT approach, an obstacle will be data. Sentence-BERT requires both positive and negative examples to learn the similarity function. In this work, all of the data from the USPTO are positive examples. As for how to prepare negative examples in the future, the PatentMatch dataset for training a binary text pair classifier in [20] can be a reference. Since none of the above (USE, BERT, and Sentence-BERT) is a viable option, we resorted to the reranking idea demonstrated by NBoost. Besides, we found that the first author of [19] proposes a similar view on his GitHub repository. According to the author, a bag-of-words search, such as BM25, can have a higher recall, but its precision is lower. Conversely, an embedding-based search can have higher precision, but its recall is lower. Therefore, to obtain both higher recall and higher precision, a reranking strategy is to perform the word-based ranking upfront for higher recall and then perform the embedding-based ranking for higher precision. It is noted that, in our initial experiments, ranking based on embeddings first and reranking based on words later does not perform well. In such a configuration, the false-positive rate in the ranking of embeddings is too high.
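A minimal sketch of this two-step strategy in Python is shown below. We assume (these names are ours, not part of any code released with the paper) an Elasticsearch index "patent_spans" with a "text" field and an embed() helper that returns one vector per span:

    # Sketch of word-based ranking first (BM25, high recall), then
    # embedding-based reranking (cosine similarity, high precision).
    import numpy as np
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    def rerank(query_span, embed, top_k=100, final_k=10):
        # Step 1: BM25 ranking via Elasticsearch's default scoring.
        body = {"size": top_k, "query": {"match": {"text": query_span}}}
        hits = es.search(index="patent_spans", body=body)["hits"]["hits"]
        candidates = [h["_source"]["text"] for h in hits]
        # Step 2: convert the query and the candidates to embeddings.
        vectors = np.asarray(embed([query_span] + candidates))
        q, c = vectors[0], vectors[1:]
        # Step 3: rerank the candidates by cosine similarity to the query.
        sims = (c @ q) / (np.linalg.norm(c, axis=1) * np.linalg.norm(q) + 1e-9)
        order = np.argsort(-sims)[:final_k]
        return [(candidates[i], float(sims[i])) for i in order]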
[Figure 1: System Architecture. Stage 1 (solid lines): raw USPTO data is downloaded and split into patent spans; vocabularies are built for GPT-2 and BERT; the spans are indexed in Elasticsearch (BM25), and GPT-2 and BERT are pre-trained. Stage 2 (dotted lines): generated spans are searched in Elasticsearch, the ranked results are converted by Bert-as-Service into embeddings, and Annoy produces the reranked results by cosine similarity.]

3.2 System Architecture
This section explains the overall architecture of our implementation and the function/data flows in the architecture. For details, section 4 will cover the data part and its preprocessing, and section 5.1 will provide the code repositories we leveraged from others. Fig. 1 shows our system architecture. The upper portion of the figure (stage 1 of our implementation) represents what we have to build before a user can trigger GPT-2 for text generation. The function flows in this upper portion are depicted in solid lines. The bottom portion of the figure (stage 2 of our implementation) shows the function flows (in dotted lines) for ranking (BM25) and reranking (BERT embeddings).

At stage 1, we download raw data from the USPTO and split them into patent spans. The patent spans are fed into Elasticsearch (with
default settings) for indexing so that we can query them based on BM25 ranking. Before pre-training GPT-2 and BERT from scratch, we built their vocabulary files. The details of pre-training GPT-2 and BERT are provided in sections 5.2 and 5.3, respectively. At stage 2, a user provides the input text to GPT-2 and the parameters of GPT-2 inferencing. GPT-2 generates a patent text span based on these settings. For reranking, the generated patent span first goes to Elasticsearch to obtain the most relevant prior patent spans based on BM25. The ranked prior patent spans then go to Bert-as-Service [28] to be converted to BERT embeddings. Next, the embeddings go to Annoy [22] for reranking based on cosine similarity. Lastly, the final and reranked embeddings are decoded back to patent spans in text format and shown to the user. In our experiments, based on the same user's input, we let GPT-2 generate multiple patent text spans and collect both positive and negative reranking results based on each of the multiple patent text spans.
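The stage-2 flow maps directly onto the two libraries named above. A sketch under our assumptions (a bert-serving server already running with our patent BERT model, 768-dimensional BERT-Base vectors, and the variables ranked_spans and generated_span holding the BM25 results and the GPT-2 output; the variable names are ours):

    # Convert BM25-ranked spans to embeddings and rerank them with Annoy.
    from bert_serving.client import BertClient
    from annoy import AnnoyIndex

    bc = BertClient()                    # connects to the bert-as-service server
    vectors = bc.encode(ranked_spans)    # one fixed-length vector per span

    index = AnnoyIndex(768, "angular")   # angular distance tracks cosine similarity
    for i, v in enumerate(vectors):
        index.add_item(i, v)
    index.build(10)                      # 10 trees; more trees improve recall

    query_vec = bc.encode([generated_span])[0]
    top_ids = index.get_nns_by_vector(query_vec, 10)  # the 10 nearest spans
    reranked = [ranked_spans[i] for i in top_ids]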
4 DATA

4.1 Data source
In [13], the authors use the patent datasets on BigQuery provided by Google [5]. Although it is flexible to manipulate the data by SQL statements, we found that the data provided by BigQuery are not updated frequently. We turned to the USPTO PatentsView [25] for bulk download files and more frequent updates. At the time of this writing, the latest version of the "patent.tsv.zip" file is dated 2020-03-31. For incremental download instead of bulk download, the USPTO Open Data Portal [24] can be another choice. The raw data provided by PatentsView and the Open Data Portal are plain text in TSV or XML format. The downside of using such raw text is the extra effort of data preprocessing, compared with the flexibility of SQL statements in BigQuery. Practitioners need to consider the tradeoff between flexibility and update frequency. We opt for more frequently updated data in this work.

4.2 Datasets for GPT-2
During data preprocessing, we follow the span-based approach in [12] and the "structural metadata" and "metadata mapping" approaches in [14]. The structural metadata in [14] is defined to include patent title, abstract, independent claim, and dependent claim. According to the authors, it is a mechanism to control what kind of patent text to generate. Metadata mapping is a mechanism to guide GPT-2 in generating from one kind of patent text to another. We differ from [14] in three ways: (1) we add new tags for patent drawing descriptions, (2) we add <|dep|> for dependent claims, and (3) we remove the proposed "backward" tags because backward text generation is not required in this work. Table 1 shows our special tags for structural metadata and the mappings between metadata. We found the span-based approach helpful for splitting long claims into short text spans. However, for patent abstracts, such a span-based approach may not apply. If an abstract's text has been taken verbatim from a claim, the span splitting mechanism may apply. If not, there might be no span to split in a sentence. When no span is found in the abstract, we split a patent abstract into multiple sentences instead. Collectively, the split sentences or spans are treated the same way in our data processing, and we refer to both of them as "spans" in this work.

Table 1: Special Tags & Mappings

Tags for Metadata

    metadata          | prefix                    | appendix
    title             | <|start_of_title|>        | <|end_of_title|>
    abstract          | <|start_of_abstract|>     | <|end_of_abstract|>
    figure            | <|start_of_figure|>       | <|end_of_figure|>
    independent claim | <|start_of_claim|>        | <|end_of_claim|>
    dependent claim   | <|dep|><|start_of_claim|> | <|end_of_claim|>
    span / sentence   | (n/a)                     | <|span|>

Metadata Mappings

    metadata 1 | mapping            | metadata 2
    title      | <|title2abstract|> | abstract
    abstract   | <|abstract2title|> | title
    claim      | <|claim2abstract|> | abstract
    abstract   | <|abstract2claim|> | claim
    title      | <|title2figure|>   | figure
    figure     | <|figure2title|>   | title
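As an illustration of how these tags compose one training example (the helper name, the exact whitespace, and the placement of <|span|> after each span, which is our reading of the "appendix" column in Table 1, are all assumptions; the paper defines the tags but not this code):

    # Illustrative assembly of a title-to-abstract training example from
    # the tags in Table 1. Table 1 lists <|span|> as an appendix, so each
    # span inside the abstract is followed by the tag.
    def make_title2abstract_example(title, abstract_spans):
        text = "<|start_of_title|> " + title + " <|end_of_title|>"
        text += " <|title2abstract|> <|start_of_abstract|> "
        text += " ".join(span + " <|span|>" for span in abstract_spans)
        return text + " <|end_of_abstract|>"

    example = make_title2abstract_example(
        "Wi-Fi network connection method",
        ["An apparatus and methods are provided for automatically "
         "detecting and connecting to a Wi-Fi network."])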
Based on the approaches mentioned above, the actual pipeline to build the datasets for GPT-2 includes: (1) downloading the raw TSV file from USPTO PatentsView and splitting it into smaller files, (2) extracting text based on metadata (e.g., title, abstract, etc.) and uploading it to the Elasticsearch server, (3) retrieving patent text from the Elasticsearch server, adding special tags to it, and saving it in text format, and (4) converting the text files from the previous step to TFRecord format for TensorFlow code. In step (1), the text data from USPTO PatentsView is about 48.7G (version: 2019-10-08). Such a corpus is larger than the WebText corpus (40G) used by OpenAI for GPT-2 pre-training. In step (4), the total number of tokens is 32.3B (32,398,927,872).

By concatenating all of the text with special tags, the total amount of data reaches 180G. Due to resource constraints, we did not concatenate a dependent claim with its corresponding independent claim. Independent claims are generally much longer than their dependent claims. If our training data captured such claim dependency for all dependent claims, e.g., "(claim 1) <|dep|> (claim 2)" and "...<|abstract2claim|><|start_of_claim|> (claim1+claim2)" (some special tags omitted for clarity), it is possible that the total amount of text data would exceed 570G (the size of the text data for training GPT-3 [1]). We leave such an experiment for future researchers. It is also noted that the <|figure2title|> mapping in Table 1 does not exist in our training data. We reserve this mapping for testing whether it is possible for GPT-2 models to do the same few-shot learning as GPT-3.

4.3 Datasets for BERT
According to BERT's code repository [6], the input for pre-training is a plain text file having one sentence per line. Consecutive lines are the actual sentences for the "next sentence prediction" task. Documents are delimited by empty lines. The final datasets contain serialized text in TFRecord file format. In our case, we follow the format and prepare the plain text file with one span or sentence per line. We did not add our special tags or metadata mappings to the text file because such annotations are designed for GPT-2 only. The
total number of words serialized in our training data is 6.8 billion
words. It is larger than the 3.3 billion word corpus (BooksCorpus
with 800M words and English Wikipedia with 2,500M words) for
pre-training the official BERT model.
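For concreteness, a minimal sketch of the input file layout (the spans here are illustrative, borrowed from the qualitative examples in section 5.4; one span per line, and an empty line between documents):

    An apparatus and methods are provided for automatically detecting and connecting to a Wi-Fi network.
    This automatic connectivity may allow a mobile device to roam across Wi-Fi hotspots of Wi-Fi networks and offload the traffic to Wi-Fi networks.

    A first wireless device determines whether the first wireless device is in a specified proximity to a second wireless device.
    The Wi-Fi transceiver receives a Wi-Fi control signal from a control signal generator.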
4.4 Data for Elasticsearch server
The purpose of the Elasticsearch server in our data pipeline is twofold. First, it provides the ranking mechanism based on a bag-of-
words approach. For example, we can query the top n records (e.g.,
100) based on BM25. Second, the Elasticsearch server is convenient
for us to aggregate various patent text from different raw files. Such
aggregation replaces the BigQuery and SQL statements in [13]. In
step (2) of the data pipeline in 4.2, we split the patent text into spans
or sentences and upload them to the Elasticsearch server. The total
number of records in Elasticsearch is 343,987,632, and they occupy 59.7GB.
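A sketch of step (2) of the pipeline in section 4.2 (uploading spans to Elasticsearch), with index and field names as our assumptions:

    # Bulk-index patent spans so that Elasticsearch's default BM25
    # similarity can later serve the top-n queries for reranking.
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")

    def index_spans(spans):
        actions = ({"_index": "patent_spans",
                    "_source": {"patent_id": pid, "field": field, "text": text}}
                   for pid, field, text in spans)
        helpers.bulk(es, actions)

    index_spans([("8590023", "abstract",
                  "This automatic connectivity may allow a mobile device to "
                  "roam across Wi-Fi hotspots of Wi-Fi networks and offload "
                  "the traffic to Wi-Fi networks.")])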
[Figure 2: Training loss of GPT-2 models (Base & Large).]

5 IMPLEMENTATION & EXPERIMENTS

5.1 GitHub repositories
In addition to the official code of BERT by Google and GPT-2 by OpenAI, our implementation leverages the following repositories:
(a) imcaspar/gpt2-ml [30]
(b) huggingface/transformers [27]
(c) ConnorJL/GPT2 [9]
(d) huggingface/tokenizers [8]
(e) hanxiao/bert-as-service [28]
(f) spotify/annoy [22]
According to [14], OpenAI trained their models with TPU, but the code for training was not released. The authors in [12] resorted to [9] since it can leverage TPU, and the trained model is compatible with OpenAI's code for inferencing on GPU. According to [9], a potential downside is that the performance of its 1.5B model seems inferior to the official model performance by OpenAI. Therefore, we checked alternatives and found "transformers" [27] and "gpt2-ml" [30]. The former is a more promising codebase for several technical reasons (omitted here for brevity). Unfortunately, we tried it and realized that, although PyTorch's support for TPU is maturing, the specific code for GPT-2 training is not ready. Therefore, we opted for "gpt2-ml," which has successfully built a 1.5B model. The "gpt2-ml" repository is forked from Grover [29], which was developed by the Allen Institute. Grover is designed for fake news detection. According to [29], Grover obtains over 92% accuracy at telling apart human-written from machine-written news. The authors also released the 1.5B Grover GPT-2 model. The 1.5B model's availability from a reputable institute is the main reason we selected the "gpt2-ml" repository to work on. One disadvantage of Grover's model is that it is not compatible with OpenAI's GPT-2 model, which means we cannot re-use OpenAI's code for inferencing; we have to use the inferencing code from Grover. We expect that "transformers" might be the best choice for researchers to pre-train OpenAI GPT models with TPU and retain compatibility with OpenAI GPT-2 and GPT-3 models in the near future. Regarding the other repositories on the above list, their respective functions are: "tokenizers" [8] for fast tokenization (replacing Google's and OpenAI's code) and building a vocabulary from the patent corpus, "bert-as-service" [28] for fast conversion from text to BERT embeddings, and "annoy" [22] for searching and ranking BERT embeddings efficiently.
5.2 Implementation details: GPT-2
Before pre-training, we use "tokenizers" (ByteLevelBPETokenizer) to build a vocabulary specific to our patent corpus, instead of using the default vocabulary released in "gpt2-ml". We set the same vocabulary size (50257) to build our vocabulary. One advantage of building our own vocabulary file is that each special tag in our design can be encoded as one token instead of multiple tokens (as it would be with a vocabulary built by others). The model sizes we experiment with are Base (similar to OpenAI's 117M) and Large (similar to OpenAI's 345M).

The total number of tokens in the TFRecords for GPT-2 is about 32.3B. For training the Base model, we found that batch_size_per_core = 16 and max_seq_length = 1024 are workable on Colab. A larger batch size triggers an OOM (out-of-memory) error. The number of TPU cores on Colab is 8. Our goal is to train at least one epoch. Therefore, we set our training steps to 248,000 (32,398,927,872 / 1024 / 16 / 8 = 247,184). For training the Large model, we set batch_size_per_core = 4 to avoid the OOM error and used the same number of training steps. Fig. 2 shows the curves of training loss. The final loss values are 1.122 (Base) and 0.9934 (Large), respectively. It is noted that the largest model (1.5B) triggers the OOM error even after setting the batch size to 1. We leave the 1.5B model to the future, when more resources are available.
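The vocabulary-building step above can be sketched as follows (the file path is a placeholder; the tag list mirrors Table 1):

    # Build a byte-level BPE vocabulary from the patent corpus with
    # HuggingFace "tokenizers", registering each special tag as one token.
    from tokenizers import ByteLevelBPETokenizer

    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(
        files=["patent_spans.txt"],   # plain-text patent corpus
        vocab_size=50257,             # same size as OpenAI's GPT-2 vocabulary
        special_tokens=[
            "<|start_of_title|>", "<|end_of_title|>",
            "<|start_of_abstract|>", "<|end_of_abstract|>",
            "<|start_of_figure|>", "<|end_of_figure|>",
            "<|start_of_claim|>", "<|end_of_claim|>",
            "<|dep|>", "<|span|>",
            "<|title2abstract|>", "<|abstract2title|>",
            "<|claim2abstract|>", "<|abstract2claim|>",
            "<|title2figure|>", "<|figure2title|>"])
    tokenizer.save_model(".")         # writes vocab.json and merges.txt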
5.3 Implementation details: BERT
Before pre-training, we use "tokenizers" (BertWordPieceTokenizer) to build a vocabulary (uncased) specific to our patent corpus instead of using the default vocabulary released in the official BERT code. We set the same vocabulary size (30522) to build our vocabulary. As for the BERT model size, we experiment with BERT-Base and BERT-Large.

According to [3], the pre-training for the BERT-Large model took 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus by using 16 Cloud TPU devices (256 batch size * 512 tokens = 128,000 tokens/batch).
Our pre-training data contains 6.8 billion words (6,824,071,153). Since Colab provides one Cloud TPU device only, the total number of tokens per batch is more limited (64 batch size * 128 tokens = 8,192 tokens/batch). We set our training steps to 2,000,000 to pre-train approximately 2.4 epochs over the 6.8 billion word corpus. Except for these, we use the same hyperparameters provided in [3] for the BERT-Base and BERT-Large models. For evaluation, we set eval_batch_size=32 and max_eval_steps=100,000. The evaluation results are:
• loss = 1.0650321
• masked_lm_accuracy = 0.78279483
• masked_lm_loss = 0.96379614
• next_sentence_accuracy = 0.9975
• next_sentence_loss = 0.0040773232

[Figure 3: Training loss of BERT models (Base & Large).]

For comparing model performance, we trained the BERT-Base model with similar settings. Fig. 3 shows the curves of training loss for the BERT-Large and BERT-Base models. As expected, the BERT-Large model has a lower curve.

5.4 Qualitative examples
In this section, we provide positive and negative examples from our reranking experiments. Our proof-of-concept results (POC 1~7) are available on the web at https://usptg.herokuapp.com/mlld. In POC 1, the results contain 100 generated patent spans (no cherry-picking) for a patent abstract (similar experiments can be conducted on patent claims in the future). The input for GPT-2 is the first sentence in the abstract of US Patent 10,694,449 (granted on 2020-06-23). We selected three generated patent spans for prior art search and reranking, as below.
• (POC 1)
• input: An apparatus and methods are provided for automatically detecting and connecting to a Wi-Fi network.
• output: [1-4] In accordance with a signal strength measurement from a Wi-Fi transceiver during an idle period when a Wi-Fi network is detected, the Wi-Fi transceiver sends on to a server an indication of a connection mode of the user equipment.
• output: [1-33] In an embodiment, a method is provided for automatically connecting a mobile telephone to a Wi-Fi network.
• output: [1-42] The apparatus can include a device for detecting whether a wireless device is in proximity to a wireless device associated with the Wi-Fi network.

Taking [1-33] (generated by GPT-2) as the input, our prior art search retrieves the top 100 records by BM25 and reranks them by embeddings. The [3/100] record in POC 4 (as below) is subjectively a positive example for us. Compared with other records in POC 4, the [3/100] record "automatic connectivity....a mobile device to roam" is more relevant to the "automatically connecting a mobile telephone" in [1-33] of POC 1. The [3/100] record is ranked as 26 based on BM25 and reranked as 3 based on embedding. Therefore, the reranking is effective in boosting its ranking. The [3/100] record is the 5th span in the abstract of patent 8590023, which was in the dataset for pre-training GPT-2 in the first place.
• (POC 4)
• patent: 8590023 [ A-4 ] (5th span in abstract)
• text: This automatic connectivity may allow a mobile device to roam across Wi-Fi hotspots of Wi-Fi networks and offload the traffic to Wi-Fi networks.
• ranked by BM25: 26
• re-ranked by embedding: 3

The rankings by BM25 and by embedding similarity may differ, be similar, or be the same. For example, in POC 4, the [1/100] record (as below) shows that both ranks are top 1. The [1/100] record in POC 4 is also semantically similar to the [1-33] record in POC 1. The [1/100] record in POC 4 is the first span in the abstract of patent 10356696, which was in the dataset for pre-training GPT-2 in the first place.
• (POC 4)
• patent: 10356696 [ A-0 ] (1st span in abstract)
• text: An apparatus and methods are provided for automatically detecting and connecting to a Wi-Fi network.
• ranked by BM25: 1
• re-ranked by embedding: 1

We also found negative examples. In POC 5, the following is the top record according to both BM25 and embeddings. However, the similarity between the top record in POC 5 and the input of POC 5 (the GPT-2 output [1-4] in POC 1) seems remote. Such a result suggests that sentence similarity is still a difficult problem. One possible reason is that the coverage of the recall by BM25 is not broad enough. Therefore, it filters out suitable candidates for calculating embedding similarity too soon.
• (POC 5)
• patent: 9373249 [ A-1 ] (2nd span in abstract)
• text: The Wi-Fi transceiver receives a Wi-Fi control signal from a control signal generator.
• ranked by BM25: 1
• re-ranked by embedding: 1

In addition to BM25 and embedding, we found that adding a keyword can be a beneficial enhancement if a user has a clear idea about what to look for. For example, the top reranking result in the
following POC 2 will be less relevant if "proximity" in the [1-42] record of POC 1 is the point of interest. The [1-42] record of POC 1 is the input for the prior art search in POC 2. Such a result is reasonable because there is no clue for the model to weigh the point of interest more.
• (POC 2, reranked as top 1)
• patent: 7302229 [ A-2 ] (2nd span in abstract)
• text: In one embodiment, availability of wireless connectivity may be determined to a first user of a wireless service at a first wireless communication device to communicate with an access point associated with a Wi-Fi wireless network that offers the wireless service.

To boost the search relevancy, we add a keyword setting to BM25. After adding "proximity" as a required term in the BM25 search, in the following POC 6, the relevancy of its top record increases significantly. The total number of positive results increases too. Using a keyword as the first filter in reranking is a research topic we plan to study in the future, because adding such a hard constraint could be a double-edged sword.
• (POC 6, reranked as top 1)
• patent: 9986380 [ A-0 ] (1st span in abstract)
• text: A first wireless device determines whether the first wireless device is in a specified proximity to a second wireless device based on a signal wirelessly transmitted by the second wireless device.
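The required-keyword variant can be expressed as an Elasticsearch bool query (again with our assumed index and field names, and the es and generated_span from the earlier sketches); the filter clause makes "proximity" mandatory without letting it dominate the BM25 score:

    # BM25 search where "proximity" must appear in every hit. The filter
    # clause is non-scoring, so the generated span still drives relevance.
    query = {
        "bool": {
            "filter": [{"match": {"text": "proximity"}}],
            "must": [{"match": {"text": generated_span}}],
        }
    }
    hits = es.search(index="patent_spans", body={"size": 100, "query": query})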
It is noted that POC 3 (omitted here for brevity) shows an example of a complete patent abstract containing several text spans generated by GPT-2. Each text span can go through the same prior art search with reranking, as demonstrated above. We leave such an enhancement to the future. It is also noted that, in our early experiments, using embeddings alone (without BM25) produces many false-positive results, as shown in POC 7. For example, the similarity between "Coherent LADAR using intra-pixel quadrature detection" and "In-pixel correlated double sampling with fold-over detection" is a negative example. There are many negative results with unreasonable similarities. Therefore, embeddings alone are not effective for semantic search. Comparing our initial experiments (embedding only) and later experiments (reranking by BM25 and embeddings), we conclude that the reranking is more effective even though it still produces very mixed results.

5.5 Failure case: few-shot learning
Although this work focuses on GPT-2, we are also interested in the capabilities of the latest GPT-3. GPT-3 is an autoregressive language model with 175 billion parameters. According to the authors, that is 10x more than any previous language model. By scaling up, the model can perform few-shot learning purely via text interaction, without any gradient updates or fine-tuning. We estimate that the largest GPT-3 model is about 507 times bigger than the GPT-2 model we utilized. We hypothesize that the patent text structure is more uniform and less diverse than the training data for GPT-3. Hence, we wonder whether few-shot learning might be possible on our GPT-2 model too. We prepare our input text in the following format:

<|start_of_figure|> (text1) <|end_of_figure|> <|figure2title|> <|start_of_title|> (text2) <|end_of_title|>

The <|figure2title|> mapping is defined in the vocabulary file, but no training data contains such a mapping. Our purpose is to test whether the model can learn such a new mapping by few-shot learning. In our experiments, we concatenate several records of different figure texts and titles. Then, we remove the title in the last record. If the few-shot learning works, the model should generate the removed patent title for the last record. Unfortunately, we found that it does not work. The limitation in model size is probably the primary root cause. Although such a failure case was anticipated, we found one intriguing pattern: the model keeps generating the patent title of the second record most of the time. We leave this case to the future. Determining the minimal model size to achieve few-shot learning in the patent domain is also an important topic for future research.
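A sketch of how such a few-shot prompt can be assembled (the helper name and record contents are illustrative, not from the paper's code):

    # Build the few-shot prompt: full figure-text/title pairs first, then
    # a final record that stops at <|figure2title|> so GPT-2 must generate
    # the missing title itself.
    def figure2title_prompt(records):
        parts = []
        for figure_text, title in records[:-1]:
            parts.append("<|start_of_figure|> " + figure_text + " <|end_of_figure|> "
                         "<|figure2title|> "
                         "<|start_of_title|> " + title + " <|end_of_title|>")
        last_figure_text, _ = records[-1]
        parts.append("<|start_of_figure|> " + last_figure_text + " <|end_of_figure|> "
                     "<|figure2title|>")
        return " ".join(parts)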
6 FUTURE RESEARCH
Our experiments show mixed results, and the topics for future researchers include:
• How to make reranking more effective?
• How to measure the "novelty" and "non-obviousness" (requirements in patent laws) between the generated patent text and prior patent text?
• What are the legal & ethical considerations before releasing a generative patent model?
• Can the discrepancy of the rankings between BM25 and embedding be a source for data augmentation? For example, Sentence-BERT requires both positive and negative examples to train. Ranking by embeddings first and filtering by BM25 later might be a way to collect negative training examples.

7 CONCLUSION
Reranking with BM25 and embeddings is a practical approach for producing better search results than using embeddings alone. Our reranking is a two-step approach in which the search is performed based on BM25 first and then based on the cosine similarity of embeddings. If a user has a clear point of interest in mind, the search can be more productive with an extra step of providing a keyword to the BM25 search. In this work, the input for the prior art search is the patent text span generated by a GPT-2 model. The objective of our prior art search is to identify retrospectively the most similar patent text spans in the training data of the GPT-2 model. Although our experiments show the effectiveness of reranking in the patent domain, they also show that semantic search for longer text remains challenging. By finding the similarity between GPT-2's inputs and outputs, we expect that this work and its future enhancements can help researchers understand GPT-2 better. Particularly, in the patent domain, it is critical to evaluate the novelty in GPT-2 and GPT-3 models. To evaluate the novelty, a prerequisite is to identify the closest training data. The progress in this paper is toward fulfilling the prerequisite so that the novelty of the generated patent text can be evaluated in the future. In our system architecture, we integrate several building blocks, notably pre-training GPT-2, pre-training BERT, using Elasticsearch for BM25 ranking, and reranking by embedding similarity with Annoy. Such a proof-of-concept implementation is a practical reference for future researchers.
REFERENCES
[1] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. ArXiv (2020). https://arxiv.org/abs/2005.14165
[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
[4] Eva D'hondt, Suzan Verberne, Niklas Weber, Kees Koster, and Lou Boves. 2012. Using skipgrams and PoS-based feature selection for patent classification. Computational Linguistics in the Netherlands Journal 2 (Dec. 2012), 52–70. https://www.clips.uantwerpen.be/clinjournal/clinj/article/view/15
[5] Google. [n.d.]. Google Patents public datasets on BigQuery. https://console.cloud.google.com/bigquery?p=patents-public-data.
[6] Google. [n.d.]. google-research/bert. https://github.com/google-research/bert.
[7] Google. [n.d.]. Universal Sentence Encoder. https://tfhub.dev/google/universal-sentence-encoder/2.
[8] HuggingFace. 2020. Fast State-of-the-Art Tokenizers optimized for Research and Production. https://github.com/huggingface/tokenizers.
[9] Connor Leahy. 2019. An implementation of training for GPT2, supports TPUs. https://github.com/ConnorJL/GPT2.
[10] Jieh-Sheng Lee. 2020. Measuring and Controlling Text Generation by Semantic Search. In WWW '20: Companion Proceedings of the Web Conference 2020. Taipei, Taiwan, 269–273. https://doi.org/10.1145/3366424.3382086
[11] Jieh-Sheng Lee and Jieh Hsiang. 2019. Measuring Patent Claim Generation by Span Relevancy. In Proceedings of the Thirteenth International Workshop on Juris-informatics (JURISIN). Keio University, Kanagawa, Japan.
[12] Jieh-Sheng Lee and Jieh Hsiang. 2020. Patent claim generation by fine-tuning OpenAI GPT-2. World Patent Information (2020). In press.
[13] Jieh-Sheng Lee and Jieh Hsiang. 2020. PatentBERT: Patent classification with fine-tuning a pre-trained BERT model. World Patent Information 61, 101965 (2020). https://doi.org/10.1016/j.wpi.2020.101965
[14] Jieh-Sheng Lee and Jieh Hsiang. 2020. PatentTransformer-2: Controlling Patent Text Generation by Structural Metadata. (2020). https://arxiv.org/abs/2001.03708
[15] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. ArXiv (2019). http://arxiv.org/abs/1907.11692
[16] Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. In International Conference on Learning Representations. https://openreview.net/forum?id=rJvJXZb0W
[17] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26. Curran Associates, Inc., 3111–3119.
[18] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners.
[19] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. http://arxiv.org/abs/1908.10084
[20] Julian Risch, Nicolas Alder, Christoph Hewel, and Ralf Krestel. 2020. PatentMatch: A Dataset for Matching Patent Claims & Prior Art. arXiv:2012.13919 [cs.IR]
[21] Julian Risch and Ralf Krestel. 2019. Domain-specific word embeddings for patent classification. Data Technol. Appl. 53 (2019), 108–122.
[22] Spotify. 2018. Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk. https://github.com/spotify/annoy.
[23] Cole Thienes and Jack Pertschuk. 2019. NBoost: Neural Boosting Search Results. https://github.com/koursaros-ai/nboost.
[24] USPTO. [n.d.]. USPTO Open Data Portal. https://developer.uspto.gov/data.
[25] USPTO. [n.d.]. USPTO PatentsView. https://www.patentsview.org/download.
[26] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 5998–6008. http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
[27] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art Natural Language Processing. ArXiv (2019). https://arxiv.org/abs/1910.03771
[28] Han Xiao. 2018. bert-as-service. https://github.com/hanxiao/bert-as-service.
[29] Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending Against Neural Fake News. In Advances in Neural Information Processing Systems 32.
[30] Zhibo Zhang. 2019. GPT2-ML: GPT-2 for Multiple Languages. https://github.com/imcaspar/gpt2-ml.