Prior Art Search and Reranking for Generated Patent Text

Jieh-Sheng Lee∗ (d04922013@csie.ntu.edu.tw), National Taiwan University, Taipei, Taiwan
Jieh Hsiang (hsiang@csie.ntu.edu.tw), National Taiwan University, Taipei, Taiwan

ABSTRACT

Generative models, such as GPT-2, have recently demonstrated impressive results. A fundamental question we would like to address is: where did the generated text come from? This work is our initial effort toward answering that question by using prior art search. The purpose of the prior art search is to find the most similar prior text in the training data of GPT-2. We take a reranking approach and apply it to the patent domain. Specifically, we pre-train GPT-2 models from scratch by using patent data from the USPTO. The input for the prior art search is the patent text generated by the GPT-2 model. We also pre-train BERT models from scratch for converting patent text to embeddings. The steps of reranking are: (1) search for the most similar text in the training data of GPT-2 by taking a bag-of-words ranking approach (BM25), (2) convert the search results in text format to BERT embeddings, and (3) provide the final result by ranking the BERT embeddings based on their similarities with the patent text generated by GPT-2. The experiments in this work show that such reranking is better than ranking with embeddings alone. However, our mixed results also indicate that calculating the semantic similarity among long text spans is still challenging. To our knowledge, this work is the first to implement a reranking system that retrospectively identifies the most similar inputs to a GPT model based on its output.

KEYWORDS

patent, natural language generation, natural language processing, deep learning, semantic search

1 INTRODUCTION

Generative models based on Deep Learning techniques have shown significant progress in recent years. A long-term objective of our research is to evaluate the novelty of the text produced by generative models in the patent domain. Before evaluating novelty, a prerequisite is to identify the closest prior art. The scope of the prior art is the patent text used for training the generative models. From the perspective of system implementation, the approach this paper takes is to integrate “patent text generation” and “prior art search.” The purpose of this paper is to fulfill the prerequisite of identifying prior art so that the novelty of the generated patent text can be evaluated in the future. Assuming that the opposite of novelty in text generation is memorizing training text, the pre-training of the GPT-2 generative models in [18] is indicative of such novelty. In that paper, the authors observed some text-memorizing behavior in their models on longer strings that are repeated many times in the dataset. They quantified how often exact memorization shows up in the generated text by measuring the percentage of 8-gram overlap. According to the authors, most samples have less than 1% overlap, including over 30% of samples with no overlap. Such results indicate that GPT-2 models can generate novel text relatively well.

In the patent domain, the authors in [12] applied such a GPT-2 model to generate patent claims. They proposed an idea called “Augmented Inventing,” aiming to help inventors conceive new patents in a better way. Since patent claims are generally longer than ordinary sentences, the authors proposed a “span-based” approach to decompose a long patent text into multiple shorter text spans. They also proposed an “auto-complete” function to generate patent text on a span basis. From a legal perspective, such a function will be valuable if it can generate something new and meet (at least) the “novelty” requirement in patent laws. However, for a generative model to meet the legal requirement, a fundamental question is how to calculate the similarity between generated patent text and prior patents. In [12], the GPT-2 model can generate plausible patent claims in surface form, but it is unclear how novel the patent text is. To address the problem, the authors proposed a dual-Transformer framework (using one Transformer to measure the other) and tried to measure the quality of patent text generation (by span relevancy in [11] and by semantic search in [10]). Despite these efforts, measuring novelty in patent text generation remains an open problem.

From a different perspective, building a generative patent model to augment inventors might be the beginning of an era of human-machine co-inventing or meta-inventing (inventing how to invent). In such an era, measuring the novelty created by the generative model will be an essential function. To measure novelty, it is necessary to compare the output of the model with its inputs. In this work, our implementation scope is to compare the generated patent text with the original patent text in the training dataset. Since the training dataset is large, in order to narrow the scope of comparison, it is necessary to identify the most similar prior text in the training dataset. Therefore, our implementation is to build such a prior art search system. We found that reranking is a practical way to make the search more effective. As proof of concept, we limit the data scope in this work to granted patents only. The prior art search is also limited to finding the most relevant text in a span-based fashion. How to aggregate the similarities of multiple text spans into a longer sentence or a paragraph is a topic for future work.

∗ Admitted in New York and passed the USPTO patent bar exam.

PatentSemTech, July 15th, 2021, online
© 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

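The 8-gram overlap measure discussed in the introduction can be sketched in a few lines of Python. This is a simplified illustration of the idea reported in [18], not the authors' actual code; all function and variable names here are ours:

```python
def ngram_set(tokens, n=8):
    """Collect all n-grams (as tuples) from a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_percentage(generated_tokens, training_ngrams, n=8):
    """Percentage of the generated sample's n-grams that appear verbatim
    in the training corpus; 0.0 means no exact 8-gram memorization."""
    grams = [tuple(generated_tokens[i:i + n])
             for i in range(len(generated_tokens) - n + 1)]
    if not grams:
        return 0.0
    hits = sum(1 for g in grams if g in training_ngrams)
    return 100.0 * hits / len(grams)
```

Under this measure, a generated sample with less than 1% overlap would fall into the majority bucket reported in [18].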


2 RELATED WORK

Our prior art search’s main challenge is how to calculate the semantic similarity between two patent text spans. In the past, most prior art searches were performed at the word level, using keywords or phrases. For example, the authors in [4] found that combining unigrams and PoS-filtered skipgrams leads to a significant improvement in classification scores over the unigram baseline. In recent years, researchers moved toward neural network models and embeddings for the semantic search of longer text. For example, in [21], the authors utilized domain-specific word embeddings for patent classification. In [16], the authors proposed “Quick Thought” to represent a sentence as a fixed-length vector. The scheme is similar to the skip-gram method in Word2Vec [17], escalating the idea from the word level to the sentence level. Another line of development is based on new neural architectures, such as the Transformer [26]. Notably, BERT [2] and RoBERTa [15] set a new state-of-the-art performance on sentence-pair regression tasks, e.g., semantic textual similarity (STS). According to [19], however, BERT is unsuitable for semantic similarity search on a large scale. For example, finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations (65 hours) with BERT. The authors in [19] proposed a modification of the pre-trained BERT model to use siamese and triplet network structures. Their model, called “Sentence-BERT,” can derive semantically meaningful sentence embeddings that can be compared using cosine similarity. According to [19], it significantly reduces the effort of finding the most similar pair from 65 hours with BERT/RoBERTa to about 5 seconds, while maintaining the accuracy of BERT. Specific to the patent domain, the authors in [14] showed that embeddings could be a better metric than conventional ROUGE (word-based) for measuring semantic similarity. The metric for measuring embeddings in [14] is based on the Universal Sentence Encoder (USE) [7] without any fine-tuning.

A further line of development is to combine both the word level and the embedding level. For example, NBoost [23] can deploy Transformer models to improve the relevance of search results on conventional word-based search engines, such as Elasticsearch using BM25. According to [23], NBoost works like a proxy between users and Elasticsearch. It leverages fine-tuned models to produce domain-specific results. In a search request, the user sends a query to NBoost. NBoost then asks for results from Elasticsearch, picks the best ones based on the fine-tuned model, and returns its final results to the user. Specifically, if a user asks for 10 results, NBoost can increase the number of requests for Elasticsearch to produce 100 records (word-based) and then pick the best 10 results (embedding-based). Such a technique is called reranking.

3 APPROACH

3.1 Semantic Search with Reranking

Compared with contextualized word embeddings, we found the research in sentence embeddings more challenging and less explored. For example, the USE model in [7] is publicly available, but the code for pre-training or fine-tuning is not. Without fine-tuning on domain-specific data, a model could deviate from a downstream task and fail to perform well in the specific domain. In our experience, the USE model alone without fine-tuning is not satisfactory as a metric to measure the semantic similarity of patent spans. We also found that, even with a BERT model pre-trained on a patent corpus, the false-positive rate of the semantic similarity based on BERT embeddings is still high. The Sentence-BERT in [19] might be a solution to these problems. However, if we would like to take the Sentence-BERT approach, an obstacle will be data. Sentence-BERT requires both positive and negative examples to learn the similarity function. In this work, all of the data from the USPTO are positive examples. As for how to prepare negative examples in the future, the PatentMatch dataset for training a binary text pair classifier in [20] can be a reference.

Since none of the above (USE, BERT, and Sentence-BERT) is a viable option, we resorted to the reranking idea demonstrated by NBoost. In addition, we found that the first author of [19] proposes a similar view on his GitHub repository. According to the author, a bag-of-words search, such as BM25, can have higher recall, but its precision is lower. Conversely, an embedding-based search can have higher precision, but its recall is lower. Therefore, to obtain both higher recall and higher precision, a reranking strategy is to perform the word-based ranking upfront for higher recall and then perform the embedding-based ranking for higher precision. It is noted that, in our initial experiments, ranking based on embeddings first and reranking based on words later does not perform well. In such a configuration, the false-positive rate in the ranking of embeddings is too high.

[Figure 1: System Architecture]
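The rank-then-rerank strategy can be sketched as follows. This is a toy, stdlib-only illustration: in our system, stage 1 is Elasticsearch’s built-in BM25 and stage 2 uses BERT embeddings from Bert-as-Service, whereas here the BM25 scoring is a minimal re-implementation and the “embeddings” are plain vectors supplied by the caller:

```python
import math

def bm25_rank(query, docs, top_n=100, k1=1.2, b=0.75):
    """Stage 1 (word-based, higher recall): indices of the top_n
    documents under a minimal BM25 score."""
    toks = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in toks) / len(toks)
    n_docs = len(docs)

    def idf(term):
        df = sum(1 for t in toks if term in t)
        return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))

    def score(t):
        s = 0.0
        for term in set(query.lower().split()):
            tf = t.count(term)
            s += idf(term) * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(t) / avgdl))
        return s

    scores = [score(t) for t in toks]
    return sorted(range(n_docs), key=lambda i: -scores[i])[:top_n]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(a * a for a in v)))

def rerank(query_vec, candidate_ids, doc_vecs, top_k=10):
    """Stage 2 (embedding-based, higher precision): reorder the BM25
    candidates by cosine similarity to the query embedding."""
    return sorted(candidate_ids, key=lambda i: -cosine(query_vec, doc_vecs[i]))[:top_k]
```

The word-based stage casts a wide net (high recall) and the embedding-based stage keeps only the semantically closest candidates (high precision), which is the configuration that worked in our experiments.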


3.2 System Architecture

This section explains the overall architecture of our implementation and the function/data flows in the architecture. For details, section 4 will cover the data part and its preprocessing, and section 5.1 will list the code repositories we leveraged from others. Fig. 1 shows our system architecture. The upper portion of the figure (stage 1 of our implementation) represents what we have to build before a user can trigger GPT-2 for text generation. The function flows in this upper portion are depicted in solid lines. The bottom portion of the figure (stage 2 of our implementation) shows the function flows (in dotted lines) for ranking (BM25) and reranking (BERT embeddings).

At stage 1, we download raw data from the USPTO and split it into patent spans. The patent spans are fed into Elasticsearch (with default settings) for indexing so that we can query them based on BM25 ranking. Before pre-training GPT-2 and BERT from scratch, we built their vocabulary files. The details of pre-training GPT-2 and BERT are provided in sections 5.2 and 5.3, respectively. At stage 2, a user provides the input text to GPT-2 along with the parameters for GPT-2 inferencing. GPT-2 generates a patent text span based on these settings. For reranking, the generated patent span first goes to Elasticsearch to obtain the most relevant prior patent spans based on BM25. The ranked prior patent spans then go to Bert-as-Service [28] and are converted to BERT embeddings. Next, the embeddings go to Annoy [22] for reranking based on cosine similarity. Lastly, the final and reranked embeddings are decoded back to patent spans in text format and shown to the user. In our experiments, based on the same user input, we let GPT-2 generate multiple patent text spans to collect both positive and negative reranking results for each of the multiple patent text spans.

4 DATA

4.1 Data source

In [13], the authors use the patent datasets on BigQuery provided by Google [5]. Although it is flexible to manipulate the data with SQL statements, we found that the data provided by BigQuery are not updated frequently. We turned to USPTO PatentsView [25] for bulk download files and more frequent updates. At the moment of this writing, the latest version of the “patent.tsv.zip” file is dated 2020-03-31. For incremental download instead of bulk download, the USPTO Open Data Portal [24] can be another choice. The raw data provided by PatentsView and the Open Data Portal are plain text in TSV or XML format. The downside of using such raw text is the extra effort on data preprocessing, compared with the flexibility of SQL statements in BigQuery. Practitioners need to consider the tradeoff between flexibility and data freshness. We opt for more frequently updated data in this work.

4.2 Datasets for GPT-2

During data preprocessing, we follow the span-based approach in [12] and the “structural metadata” and “metadata mapping” approaches in [14]. The structural metadata in [14] is defined to include patent title, abstract, independent claim, and dependent claim. According to the authors, it is a mechanism to control what kind of patent text to generate. Metadata mapping, in turn, is a mechanism to guide GPT-2 to generate from one kind of patent text to another. Where we differ from [14] is: (1) we add new tags for patent drawing descriptions, (2) we add <|dep|> for dependent claims, and (3) we remove the proposed “backward” tags because backward text generation is not required in this work. Table 1 shows our special tags for structural metadata and the mappings between metadata. We found the span-based approach helpful for splitting long claims into short text spans. However, for patent abstracts, such a span-based approach may not apply. If an abstract’s text has been taken verbatim from a claim, the span splitting mechanism may apply. If not, there might be no span to split in a sentence. When no span is found in the abstract, we split the abstract into multiple sentences instead. Collectively, the split sentences or spans are treated the same way in our data processing, and we refer to both of them as “span” in this work.

Table 1: Special Tags & Mappings

Tags for Metadata
  metadata             prefix                      appendix
  title                <|start_of_title|>          <|end_of_title|>
  abstract             <|start_of_abstract|>       <|end_of_abstract|>
  figure               <|start_of_figure|>         <|end_of_figure|>
  independent claim    <|start_of_claim|>          <|end_of_claim|>
  dependent claim      <|dep|><|start_of_claim|>   <|end_of_claim|>
  span / sentence      (n/a)                       <|span|>

Metadata Mappings
  metadata 1   mapping              metadata 2
  title        <|title2abstract|>   abstract
  abstract     <|abstract2title|>   title
  claim        <|claim2abstract|>   abstract
  abstract     <|abstract2claim|>   claim
  title        <|title2figure|>     figure
  figure       <|figure2title|>     title

Based on the approaches mentioned above, the actual pipeline to build the datasets for GPT-2 includes: (1) downloading the raw TSV file from USPTO PatentsView and splitting it into smaller files, (2) extracting text based on metadata (e.g., title, abstract, etc.) and uploading it to the Elasticsearch server, (3) retrieving patent text from the Elasticsearch server, adding special tags, and saving the result in text format, and (4) converting the text files from the previous step to TFRecord format for the TensorFlow code. In step (1), the text data from USPTO PatentsView is about 48.7G (version: 2019-10-08). Such a corpus is larger than the WebText corpus (40G) used by OpenAI for GPT-2 pre-training. In step (4), the total number of tokens is 32.3B (32,398,927,872).

By concatenating all of the text with special tags, the total amount of data reaches 180G. Due to resource constraints, we did not concatenate a dependent claim with its corresponding independent claim. Independent claims are generally much longer than their dependent claims. If our training data captured such claim dependency for all dependent claims, e.g., “(claim 1) <|dep|> (claim 2)” and “...<|abstract2claim|><|start_of_claim|> (claim1+claim2)” (some special tags omitted for clarity), the total amount of text data might exceed 570G (the size of the text data for training GPT-3 [1]). We leave such an experiment to future researchers. It is also noted that the <|figure2title|> mapping in Table 1 does not exist in our training data. We reserve this mapping for testing whether GPT-2 models can do the same few-shot learning as GPT-3.
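As an illustration, the tagging scheme of Table 1 can be applied with a small helper like the following. This is our own sketch for clarity, not the actual preprocessing code:

```python
# Prefix/appendix pairs from Table 1 ("figure" covers drawing descriptions).
TAGS = {
    "title": ("<|start_of_title|>", "<|end_of_title|>"),
    "abstract": ("<|start_of_abstract|>", "<|end_of_abstract|>"),
    "figure": ("<|start_of_figure|>", "<|end_of_figure|>"),
    "claim": ("<|start_of_claim|>", "<|end_of_claim|>"),
}

def wrap(kind, text, dependent=False):
    """Wrap one piece of metadata with its special tags; a dependent
    claim additionally gets the <|dep|> prefix, as in Table 1."""
    start, end = TAGS[kind]
    if kind == "claim" and dependent:
        start = "<|dep|>" + start
    return start + text + end

def mapping_example(src_kind, src_text, dst_kind, dst_text):
    """Build one metadata-mapping training example, e.g. title -> abstract."""
    return wrap(src_kind, src_text) + f"<|{src_kind}2{dst_kind}|>" + wrap(dst_kind, dst_text)
```

For example, mapping_example("title", t, "abstract", a) yields a training line of the form <|start_of_title|>t<|end_of_title|><|title2abstract|><|start_of_abstract|>a<|end_of_abstract|>.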


4.3 Datasets for BERT

According to BERT’s code repository [6], the input for pre-training is a plain text file with one sentence per line. Consecutive lines are the actual sentences for the “next sentence prediction” task. Documents are delimited by empty lines. The final datasets contain serialized text in TFRecord file format. In our case, we follow this format and prepare the plain text file with one span or sentence per line. We did not add our special tags or metadata mappings to the text file because such annotations are designed for GPT-2 only. The total number of words serialized in our training data is 6.8 billion. It is larger than the 3.3 billion word corpus (BooksCorpus with 800M words and English Wikipedia with 2,500M words) used for pre-training the official BERT model.

4.4 Data for Elasticsearch server

The purpose of the Elasticsearch server in our data pipeline is twofold. First, it provides the ranking mechanism based on a bag-of-words approach. For example, we can query the top n records (e.g., 100) based on BM25. Second, the Elasticsearch server is a convenient way for us to aggregate various patent text from different raw files. Such aggregation replaces the BigQuery and SQL statements in [13]. In step (2) of the data pipeline in section 4.2, we split the patent text into spans or sentences and upload them to the Elasticsearch server. The total number of records in Elasticsearch is 343,987,632, and they occupy 59.7GB.

[Figure 2: Training loss of GPT-2 models (Base & Large)]
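A top-n BM25 query against such an index can be expressed as a standard Elasticsearch match query. The request body below is a sketch; the index and field names (patent_spans, span_text) are illustrative placeholders, not our actual schema:

```python
def build_bm25_query(span_text, top_n=100):
    """Request body for Elasticsearch: top_n hits ranked by BM25
    (Elasticsearch's default similarity) for a full-text match on one field."""
    return {
        "size": top_n,
        "query": {"match": {"span_text": span_text}},
    }
```

With the official Python client, this body would be sent via something like es.search(index="patent_spans", body=build_bm25_query(generated_span)).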

5 IMPLEMENTATION & EXPERIMENTS

5.1 GitHub repositories

In addition to the official code of BERT by Google and GPT-2 by OpenAI, our implementation leverages the following repositories:

   (a) imcaspar/gpt2-ml [30]
   (b) huggingface/transformers [27]
   (c) ConnorJL/GPT2 [9]
   (d) huggingface/tokenizers [8]
   (e) hanxiao/bert-as-service [28]
   (f) spotify/annoy [22]

According to [14], OpenAI trained their models with TPUs, but the code for training was not released. The authors in [12] resorted to [9] since it can leverage TPUs and the trained model is compatible with OpenAI’s code for inferencing on GPU. According to [9], a potential downside is that the performance of the 1.5B model seems inferior to the official model performance by OpenAI. Therefore, we checked alternatives and found “transformers” [27] and “gpt2-ml” [30]. The former is a more promising codebase for several technical reasons (omitted here for brevity). Unfortunately, we tried it and realized that, while PyTorch’s support for TPU is maturing, the specific code for GPT-2 training is not ready. Therefore, we opted for “gpt2-ml,” which has successfully built a 1.5B model. The “gpt2-ml” repository is forked from Grover [29], which was developed by the Allen Institute. Grover is designed for fake news detection. According to [29], Grover obtains over 92% accuracy at telling apart human-written from machine-written news. The authors also released the 1.5B Grover GPT-2 model. The 1.5B model’s availability from a reputable institute is the main reason we selected the “gpt2-ml” repository to work on. One disadvantage of Grover’s model is that it is not compatible with OpenAI’s GPT-2 model, which means we cannot re-use OpenAI’s code for inferencing; we have to use the inferencing code from Grover. We expect that “transformers” might become the best choice for researchers to pre-train OpenAI GPT models with TPUs while retaining compatibility with OpenAI GPT-2 and GPT-3 models in the near future. Regarding the other repositories on the above list, their respective functions are: “tokenizers” [8] for fast tokenization (replacing Google’s and OpenAI’s code) and building vocabulary from the patent corpus, “bert-as-service” [28] for fast conversion from text to BERT embeddings, and “annoy” [22] for searching and ranking BERT embeddings efficiently.

5.2 Implementation details: GPT-2

Before pre-training, we use “tokenizers” (ByteLevelBPETokenizer) to build a vocabulary specific to our patent corpus, instead of using the default vocabulary released in “gpt2-ml”. We set the same vocabulary size (50257) to build our vocabulary. One advantage of building our own vocabulary file is that each special tag in our design can be encoded as one token instead of multiple tokens (as would happen with the original vocabulary by others). The model sizes we experiment with are Base (similar to OpenAI’s 117M) and Large (similar to OpenAI’s 345M).

The total number of tokens in the TFRecords for GPT-2 is about 32.3B. For training the Base model, we found that batch_size_per_core = 16 and max_seq_length = 1024 are workable on Colab. A larger batch size will trigger an OOM (out-of-memory) error. The number of TPU cores on Colab is 8. Our goal is to train at least one epoch. Therefore, we set our training steps to 248,000 (32,398,927,872 / 1024 / 16 / 8 = 247,184). For training the Large model, we set batch_size_per_core = 4 to avoid the OOM error and used the same number of training steps. Fig. 2 shows the curves of training loss. The final loss values are 1.122 (Base) and 0.9934 (Large), respectively. It is noted that the largest model (1.5B) will trigger the OOM error even after setting the batch size to 1. We leave the 1.5B model for the future when more resources are available.

5.3 Implementation details: BERT

Before pre-training, we use “tokenizers” (BertWordPieceTokenizer) to build the vocabulary (uncased) specific to our patent corpus instead of using the default vocabulary released with the official BERT code. We set the same vocabulary size (30522) to build our vocabulary. As for the BERT model size, we experiment with BERT-Base and BERT-Large.

According to [3], the pre-training for the BERT-Large model took 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus by using 16 Cloud TPU devices (256 batch
                                                                            21
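As a sanity check, the one-epoch step budget for the GPT-2 Base model described above can be reproduced in a few lines (all numbers are taken from the text; this is an illustrative calculation, not training code):

```python
# One-epoch training-step budget for GPT-2 Base on a Colab TPU.
total_tokens = 32_398_927_872    # tokens in the TFRecords (about 32.3B)
max_seq_length = 1024            # tokens per training sequence
batch_size_per_core = 16         # Base model (4 for the Large model)
tpu_cores = 8                    # TPU cores available on Colab

tokens_per_step = max_seq_length * batch_size_per_core * tpu_cores
steps_per_epoch = total_tokens // tokens_per_step
print(steps_per_epoch)  # 247184, rounded up to the 248,000 steps we use
```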
PatentSemTech, July 15th, 2021, online                                                                                   Jieh-Sheng Lee and Jieh Hsiang


Our pre-training data contains 6.8 billion words (6,824,071,153). Since Colab provides one Cloud TPU device only, the total number of tokens per batch is more limited (64 batch size * 128 tokens = 8,192 tokens/batch). We set our training steps to 2,000,000 to pre-train approximately 2.4 epochs over the 6.8 billion word corpus. Except for these, we use the same hyperparameters provided in [3] for the BERT-Base and BERT-Large models. For evaluation, we set eval_batch_size=32 and max_eval_steps=100,000. The evaluation results are:
    • loss = 1.0650321
    • masked_lm_accuracy = 0.78279483
    • masked_lm_loss = 0.96379614
    • next_sentence_accuracy = 0.9975
    • next_sentence_loss = 0.0040773232

   For comparing model performance, we trained the BERT-Base model with similar settings. Fig. 3 shows the curves of training loss for the BERT-Large and BERT-Base models. As expected, the BERT-Large model has a lower curve.

   Figure 3: Training loss of BERT models (Base & Large)

5.4    Qualitative examples
In this section, we provide positive and negative examples from our reranking experiments. Our proof-of-concept results (POC 1~7) are available on the web (https://usptg.herokuapp.com/mlld). In POC 1, the results contain 100 generated patent spans (no cherry-picking) in the patent abstract (similar experiments can be conducted on patent claims in the future). The input for GPT-2 is the first sentence in the abstract of the US Patent 10,694,449 (granted on 2020-06-23). We selected three generated patent spans for prior art search and reranking, as below.
    • (POC 1)
    • input: An apparatus and methods are provided for automatically detecting and connecting to a Wi-Fi network.
    • output: [1-4] In accordance with a signal strength measurement from a Wi-Fi transceiver during an idle period when a Wi-Fi network is detected, the Wi-Fi transceiver sends on to a server an indication of a connection mode of the user equipment.
    • output: [1-33] In an embodiment, a method is provided for automatically connecting a mobile telephone to a Wi-Fi network.
    • output: [1-42] The apparatus can include a device for detecting whether a wireless device is in proximity to a wireless device associated with the Wi-Fi network.

   Taking [1-33] (generated by GPT-2) as the input, our prior art search retrieves the top 100 records by BM25 and reranks them by embeddings. The [3/100] record in POC 4 (as below) is subjectively a positive example for us. Compared with other records in POC 4, the [3/100] record ("automatic connectivity ... a mobile device to roam") is more relevant to the "automatically connecting a mobile telephone" in [1-33] of POC 1. The [3/100] record is ranked 26 by BM25 and reranked 3 by embedding. Therefore, the reranking is effective in boosting its ranking. The [3/100] record is the 5th span in the abstract of patent 8590023, which was in the dataset for pre-training GPT-2 in the first place.
    • (POC 4)
    • patent: 8590023 [ A-4 ] (5th span in abstract)
    • text: This automatic connectivity may allow a mobile device to roam across Wi-Fi hotspots of Wi-Fi networks and offload traffic to Wi-Fi networks.
    • ranked by BM25: 26
    • re-ranked by embedding: 3

   The rankings by BM25 and embedding similarity may be different, similar, or the same. For example, in POC 4, the [1/100] record (as below) shows that both ranks are top 1. The [1/100] record in POC 4 is also semantically similar to the [1-33] record in POC 1. The [1/100] record in POC 4 is the first span in the abstract of patent 10356696, which was in the dataset for pre-training GPT-2 in the first place.
    • (POC 4)
    • patent: 10356696 [ A-0 ] (1st span in abstract)
    • text: An apparatus and methods are provided for automatically detecting and connecting to a Wi-Fi network.
    • ranked by BM25: 1
    • re-ranked by embedding: 1

   We also found negative examples. In POC 5, the following is the top record according to both BM25 and embeddings. However, the similarity between the top record in POC 5 and the input of POC 5 (the GPT-2 output [1-4] in POC 1) seems remote. Such a result suggests that sentence similarity is still a difficult problem. One possible reason is that the recall coverage of BM25 is not broad enough, so suitable candidates are filtered out too soon, before the embedding similarity is calculated.
    • (POC 5)
    • patent: 9373249 [ A-1 ] (2nd span in abstract)
    • text: The Wi-Fi transceiver receives a Wi-Fi control signal from a control signal generator.
    • ranked by BM25: 1
    • re-ranked by embedding: 1

   In addition to BM25 and embedding, we found that adding a keyword can be a beneficial enhancement if a user has a clear idea about what to look for. For example, the top reranking result in the
following POC 2 will be less relevant if "proximity" in the [1-42] record of POC 1 is the point of interest. The [1-42] record of POC 1 is the input for the prior art search in POC 2. Such a result is reasonable because there is no clue for the model to weigh the point of interest more.
    • (POC 2, reranked as top 1)
    • patent: 7302229 [ A-2 ] (2nd span in abstract)
    • text: In one embodiment, availability of wireless connectivity may be determined to a first user of a wireless service at a first wireless communication device to communicate with an access point associated with a Wi-Fi wireless network that offers the wireless service.

   To boost the search relevancy, we add a keyword setting to BM25. After adding "proximity" as a required term in the BM25 search, in the following POC 6, the relevancy of its top record increases significantly. The total number of positive results increases too. Using a keyword as the first filter in reranking is a research topic we plan to study in the future because adding such a hard constraint could be a double-edged sword.
    • (POC 6, reranked as top 1)
    • patent: 9986380 [ A-0 ] (1st span in abstract)
    • text: A first wireless device determines whether the first wireless device is in a specified proximity to a second wireless device based on a signal wirelessly transmitted by the second wireless device.

   It is noted that POC 3 (omitted here for brevity) shows an example of a complete patent abstract containing several text spans generated by GPT-2. Each text span can go through the same prior art search with reranking, as demonstrated above. We leave such an enhancement to the future. It is also noted that, in our early experiments, using embeddings alone (without BM25) produces many false-positive results, as shown in POC 7. For example, the similarity between "Coherent LADAR using intra-pixel quadrature detection" and "In-pixel correlated double sampling with fold-over detection" is a negative example. There are many negative results with unreasonable similarities. Therefore, embeddings alone are not effective for semantic search. Comparing our initial experiments (embedding only) and later experiments (reranking by BM25 and embeddings), we conclude that the reranking is more effective even though it still produces very mixed results.

5.5    Failure case: few-shot learning
Although this work focuses on GPT-2, we are also interested in the capabilities of the latest GPT-3. GPT-3 is an autoregressive language model with 175 billion parameters, which, according to the authors, is 10x more than any previous language model. By scaling up, the model can perform few-shot learning purely via text interaction, without any gradient updates or fine-tuning. We estimate that the largest GPT-3 model is about 507 times bigger than the GPT-2 model we utilized. We hypothesize that the patent text structure is more uniform and less diverse than the training data for GPT-3. Hence, we wonder whether few-shot learning might be possible on our GPT-2 model too. We prepare our input text in the following format:

   <|start_of_figure|> (text1) <|end_of_figure|> <|figure2title|> <|start_of_title|> (text2) <|end_of_title|>

   The <|figure2title|> mapping is defined in the vocabulary file, and no training data contains such a mapping. Our purpose is to test whether the model can learn such a new mapping by few-shot learning. In our experiments, we concatenate several records of different figure texts and titles. Then, we remove the title in the last record. If the few-shot learning works, the model should generate the removed patent title in the last record. Unfortunately, we found it not workable. The limitation in the model size is probably the primary root cause. Although such a failure case was anticipated, we found one intriguing pattern: the model keeps generating the patent title of the second record most of the time. We leave this case to the future. Determining the minimal model size to achieve few-shot learning in the patent domain is also an important topic for future research.

6    FUTURE RESEARCH
Our experiments show mixed results, and the topics for future research include:
    • How to make reranking more effective?
    • How to measure the "novelty" and "non-obviousness" (requirements in patent laws) between the generated patent text and prior patent text?
    • What are the legal & ethical considerations before releasing a generative patent model?
    • Can the discrepancy of the rankings between BM25 and embedding be a source for data augmentation? For example, Sentence-BERT requires both positive and negative examples to train. Ranking by embeddings first and filtering by BM25 later might be a way to collect negative training examples.

7    CONCLUSION
Reranking with BM25 and embeddings is a practical approach for producing better search results than using embeddings alone. Our reranking is a two-step approach in which the search is performed based on BM25 first and then based on the cosine similarity of embeddings. If a user has a clear point of interest in mind, the search can be more productive by adding an extra step of providing a keyword to the BM25 search. In this work, the input for the prior art search is the patent text span generated by a GPT-2 model. The objective of our prior art search is to identify retrospectively the most similar patent text spans in the training data of the GPT-2 model. Although our experiments show the effectiveness of reranking in the patent domain, they also show that semantic search for longer text remains challenging. By finding the similarity between GPT-2's inputs and outputs, we expect that this work and its future enhancements can help researchers understand GPT-2 better. Particularly, in the patent domain, it is critical to evaluate the novelty in GPT-2 and GPT-3 models. To evaluate the novelty, a prerequisite is to identify the closest training data. The progress in this paper is toward fulfilling this prerequisite so that the novelty of the generated patent text can be evaluated in the future. In our system architecture, we integrate several building blocks, notably pre-training GPT-2, pre-training BERT, using Elasticsearch for BM25 ranking, and reranking by embedding similarity with Annoy. Such a proof-of-concept implementation is a practical reference for future researchers.
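The two-step search summarized above can be sketched as follows. This is a minimal illustration only: the bag-of-words `embed` function is a stand-in for our BERT span embeddings, and `bm25_hits` stands for the top-100 records that Elasticsearch would return in step 1; the texts and vocabulary are toy examples.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def embed(text, vocab=("wi-fi", "connect", "signal", "proximity")):
    # Stand-in embedding: prefix-term counts over a tiny vocabulary.
    # In our system this would be a BERT embedding of the text span.
    words = text.lower().split()
    return [sum(w.startswith(term) for w in words) for term in vocab]

def rerank(query, bm25_hits, top_n=10):
    # Step 2 of the two-step search: re-order the BM25 candidates
    # (step 1) by cosine similarity of their embeddings to the query.
    q = embed(query)
    scored = sorted(bm25_hits, key=lambda text: -cosine(q, embed(text)))
    return scored[:top_n]

# bm25_hits stands for the records returned by the BM25 search.
bm25_hits = [
    "The Wi-Fi transceiver receives a control signal from a generator.",
    "A method for automatically connecting a telephone to a Wi-Fi network.",
]
print(rerank("connecting a mobile telephone to a Wi-Fi network", bm25_hits)[0])
```

A keyword constraint, as in POC 6, would correspond to filtering `bm25_hits` for a required term before the reranking step.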
REFERENCES
 [1] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. ArXiv (2020). https://arxiv.org/abs/2005.14165
 [2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
 [3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
 [4] Eva D'hondt, Suzan Verberne, Niklas Weber, Kees Koster, and Lou Boves. 2012. Using skipgrams and PoS-based feature selection for patent classification. Computational Linguistics in the Netherlands Journal 2 (Dec. 2012), 52–70. https://www.clips.uantwerpen.be/clinjournal/clinj/article/view/15
 [5] Google. [n.d.]. Google Patents public datasets on BigQuery. https://console.cloud.google.com/bigquery?p=patents-public-data.
 [6] Google. [n.d.]. google-research/bert. https://github.com/google-research/bert.
 [7] Google. [n.d.]. Universal Sentence Encoder. https://tfhub.dev/google/universal-sentence-encoder/2.
 [8] HuggingFace. 2020. Fast State-of-the-Art Tokenizers optimized for Research and Production. https://github.com/huggingface/tokenizers.
 [9] Connor Leahy. 2019. An implementation of training for GPT2, supports TPUs. https://github.com/ConnorJL/GPT2.
[10] Jieh-Sheng Lee. 2020. Measuring and Controlling Text Generation by Semantic Search. In WWW '20: Companion Proceedings of the Web Conference 2020. Taipei, Taiwan, 269–273. https://doi.org/10.1145/3366424.3382086
[11] Jieh-Sheng Lee and Jieh Hsiang. 2019. Measuring Patent Claim Generation by Span Relevancy. In Proceedings of the Thirteenth International Workshop on Juris-informatics (JURISIN). Keio University, Kanagawa, Japan.
[12] Jieh-Sheng Lee and Jieh Hsiang. 2020. Patent claim generation by fine-tuning OpenAI GPT-2. World Patent Information (2020). In press.
[13] Jieh-Sheng Lee and Jieh Hsiang. 2020. PatentBERT: Patent classification with fine-tuning a pre-trained BERT model. World Patent Information 61, 101965 (2020). https://doi.org/10.1016/j.wpi.2020.101965
[14] Jieh-Sheng Lee and Jieh Hsiang. 2020. PatentTransformer-2: Controlling Patent Text Generation by Structural Metadata. (2020). https://arxiv.org/abs/2001.03708
[15] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. ArXiv (2019). http://arxiv.org/abs/1907.11692
[16] Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. In International Conference on Learning Representations. https://openreview.net/forum?id=rJvJXZb0W
[17] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 3111–3119.
[18] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2018. Language Models are Unsupervised Multitask Learners.
[19] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. http://arxiv.org/abs/1908.10084
[20] Julian Risch, Nicolas Alder, Christoph Hewel, and Ralf Krestel. 2020. PatentMatch: A Dataset for Matching Patent Claims & Prior Art. arXiv:2012.13919 [cs.IR]
[21] Julian Risch and Ralf Krestel. 2019. Domain-specific word embeddings for patent classification. Data Technol. Appl. 53 (2019), 108–122.
[22] Spotify. 2018. Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk. https://github.com/spotify/annoy.
[23] Cole Thienes and Jack Pertschuk. 2019. NBoost: Neural Boosting Search Results. https://github.com/koursaros-ai/nboost.
[24] USPTO. [n.d.]. USPTO Open Data Portal. https://developer.uspto.gov/data.
[25] USPTO. [n.d.]. USPTO PatentsView. https://www.patentsview.org/download.
[26] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., 5998–6008. http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
[27] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art Natural Language Processing. ArXiv (2019). https://arxiv.org/abs/1910.03771
[28] Han Xiao. 2018. bert-as-service. https://github.com/hanxiao/bert-as-service.
[29] Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending Against Neural Fake News. In Advances in Neural Information Processing Systems 32.
[30] Zhibo Zhang. 2019. GPT2-ML: GPT-2 for Multiple Languages. https://github.com/imcaspar/gpt2-ml.