<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Prior Art Search and Reranking for Generated Patent Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jieh-Sheng Lee∗</string-name>
          <email>d04922013@csie.ntu.edu.tw</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jieh Hsiang</string-name>
          <email>hsiang@csie.ntu.edu.tw</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Taiwan University</institution>
          ,
          <addr-line>Taipei</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>24</lpage>
      <abstract>
        <p>Generative models, such as GPT-2, have demonstrated impressive results recently. A fundamental question we would like to address is: where did the generated text come from? This work is our initial effort toward answering the question by using prior art search. The purpose of the prior art search is to find the most similar prior text in the training data of GPT-2. We take a reranking approach and apply it to the patent domain. Specifically, we pre-train GPT-2 models from scratch by using the patent data from the USPTO. The input for the prior art search is the patent text generated by the GPT-2 model. We also pre-train BERT models from scratch for converting patent text to embeddings. The steps of reranking are: (1) search for the most similar text in the training data of GPT-2 by taking a bag-of-words ranking approach (BM25), (2) convert the search results in text format to BERT embeddings, and (3) provide the final result by ranking the BERT embeddings based on their similarities with the patent text generated by GPT-2. The experiments in this work show that such reranking is better than ranking with embeddings alone. However, our mixed results also indicate that calculating the semantic similarities among long text spans is still challenging. To our knowledge, this work is the first to implement a reranking system to identify retrospectively the most similar inputs to a GPT model based on its output.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>Generative models based on Deep Learning techniques have shown
significant progress in recent years. A long-term objective of our
research is to evaluate the novelty of the text produced by
generative models in the patent domain. Before evaluating novelty,
a prerequisite is to identify the closest prior art. The scope of the
prior art is the patent text used for training the generative models.
From the perspective of system implementation, the approach this
paper takes is to integrate “patent text generation” and “prior art
search.” The purpose of this paper is to fulfill the prerequisite of
identifying prior art so that the novelty of the generated patent
text can be evaluated in the future.</p>
      <p>∗Admitted in New York and passed the USPTO patent bar exam.
PatentSemTech, July 15th, 2021, online. © 2021 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)</p>
      <p>
        Assuming that the opposite of novelty in text generation is memorizing training text, the
pre-training of the GPT-2 generative models in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] is indicative of
such novelty. In that paper, the authors observed some
text-memorizing behavior in their models on longer strings that are repeated
many times in the dataset. The authors quantified how often
exact memorization shows up in the generated text by measuring
the percentage of 8-gram overlap. According to the authors,
most samples have less than 1% overlap, including over 30% of
samples with no overlap. Such results indicate that GPT-2 models
can generate novel text relatively well.
      </p>
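The 8-gram overlap measurement can be sketched as follows. This is an illustrative reconstruction, not the code of [18]; whitespace tokenization stands in for the model's actual tokenizer.

```python
def ngrams(tokens, n=8):
    """All n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_percentage(generated, training_docs, n=8):
    """Percentage of n-grams in `generated` that also occur somewhere
    in the training documents."""
    gen_grams = ngrams(generated.split(), n)
    if not gen_grams:
        return 0.0
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc.split(), n)
    return 100.0 * len(gen_grams & train_grams) / len(gen_grams)
```

A sample with no 8-gram in common with the training set scores 0%, matching the “no overlap” bucket reported above.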
      <p>
        In the patent domain, the authors in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] applied such a GPT-2
model to generate patent claims. The authors proposed an idea
called “Augmented Inventing,” aiming to help inventors conceive
new patents in a better way. Since patent claims are generally longer
than ordinary sentences, the authors proposed a “span-based”
approach to decompose a long patent text into multiple shorter text
spans. The authors also proposed an “auto-complete” function
to generate patent text on a span basis. From a legal
perspective, such a function will be valuable if it can generate
something new and meet (at least) the “novelty” requirement in patent
laws. However, for a generative model to meet the legal
requirement, a fundamental question is to calculate the similarity between
generated patent text and prior patents. In [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], the GPT-2 model
can generate plausible patent claims in surface form, but it is unclear
how novel the patent text is. To address the problem, the authors
proposed a dual-Transformer framework (using one Transformer
to measure the other Transformer), and they tried to measure the
quality of patent text generation (by span relevancy in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and
by semantic search in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]). Despite these efforts, measuring the
novelty in patent text generation remains an open problem.
      </p>
      <p>From a different perspective, building a generative patent model
to augment inventors might be the beginning of the era of
human-machine co-inventing or meta-inventing (inventing how to invent).
In such an era, measuring the novelty created by the generative
model will be an essential function. To measure the novelty, it is
required to compare the output of the model and its inputs. In
this work, our implementation scope is to compare the generated
patent text with the original patent text in the training dataset.
Since the training dataset is large, in order to narrow the scope of
comparison, it is required to identify the most similar prior text in
the training dataset. Therefore, our implementation is to build such
a prior art search system. We found that reranking is a practical
way to make the search more effective. As a proof of concept, we limit
the data scope in this work to granted patents only. The prior art
search is also limited to finding the most relevant text in a span-based
fashion. How to aggregate the similarities of multiple text spans
into a longer sentence or paragraph is a topic for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        Our prior art search’s main challenge is how to calculate the
semantic similarity between two patent text spans. In the past, most
of the prior art searches were performed at the word level, such
as keywords or phrases. For example, the authors in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] found
that combining unigrams and PoS-filtered skipgrams leads to a
significant improvement in classification scores over the unigram
baseline. In recent years, researchers moved toward neural network
models and embeddings for the semantic search of longer text. For
example, in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], the authors utilized domain-specific word
embeddings for patent classification. In [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], the authors proposed
“Quick Thought” to represent a sentence in a fixed-length vector.
The scheme is similar to the skip-gram method in Word2Vec [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]
by escalating the idea from word level to sentence level. Another
line of development is based on new neural architectures, such as
Transformer [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. Notably, BERT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and RoBERTa [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] set a new
state-of-the-art performance on sentence-pair regression tasks, e.g.,
semantic textual similarity (STS). According to [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], however, BERT
is unsuitable for semantic similarity search on a large scale. For
example, finding the most similar pairs in a collection of 10,000
sentences requires about 50 million inference computations (65
hours) with BERT. The authors in [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] proposed a modification of
the pre-trained BERT model to use siamese and triplet network
structures. Their model called “Sentence-BERT” can derive
semantically meaningful sentence embeddings to be compared by using
cosine similarity. Significantly, it reduces the effort of finding the
most similar pair from 65 hours with BERT/RoBERTa to about
5 seconds while maintaining the accuracy from BERT, according
to [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Specific to the patent domain in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], the authors showed
that embeddings could be a better metric than conventional ROUGE
(word-based) for measuring semantic similarity. The metric for
measuring embeddings in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] is based on the Universal Sentence
Encoder (USE) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] without any fine-tuning.
      </p>
      <p>
        A further line of development is to combine both word level and
embedding level. For example, NBoost [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] can deploy Transformer
models to improve the relevance of search results on conventional
word-based search engines, such as Elasticsearch using BM25.
According to [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], NBoost works like a proxy between users and
Elasticsearch. It leverages fine-tuned models to produce
domain-specific results. In a search request, the user sends a query to NBoost.
Then, NBoost asks for results from Elasticsearch, picks the best ones
based on the fine-tuned model, and returns its final results to the
user. Specifically, if a user asks for 10 results, NBoost can increase
the number of requests for Elasticsearch to produce 100 records
(word-based) and then pick the best 10 results (embedding-based).
Such a technique is called reranking.
      </p>
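The over-fetch-and-rerank pattern described above can be sketched generically. `bm25_search` and `embed` are placeholder callables for a word-based engine and an embedding model, not NBoost's actual API:

```python
def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def rerank_search(query, bm25_search, embed, k=10, multiplier=10):
    """Fetch k * multiplier candidates from a word-based engine (high recall),
    then return the top k by embedding similarity (high precision)."""
    candidates = bm25_search(query, k * multiplier)  # e.g., 100 records for k=10
    q = embed(query)
    return sorted(candidates, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:k]
```

With k=10 and multiplier=10, the engine is asked for 100 word-ranked records and the best 10 are returned, mirroring the NBoost behavior described above.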
    </sec>
    <sec id="sec-3">
      <title>APPROACH</title>
    </sec>
    <sec id="sec-4">
      <title>Semantic Search with Reranking</title>
      <p>
        Compared with contextualized word embeddings, we found the
research in sentence embeddings more challenging and less explored.
For example, the USE model in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is publicly available, but the
code for pre-training or fine-tuning is not. Without fine-tuning
with domain-specific data, a model could deviate from a
downstream task and fail to perform well in the specific domain. Our
experience found that the USE model alone without fine-tuning
is not satisfactory as a metric for measuring the
semantic similarity in patent spans. We also found that, even with a BERT
model pre-trained on a patent corpus, the false-positive rate
of semantic similarity based on BERT embeddings is still high.
The Sentence-BERT in [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] might be a solution to these problems.
However, if we would like to take the Sentence-BERT approach,
an obstacle will be data. Sentence-BERT requires both positive and
negative examples to learn the similarity function. In this work,
all of the data from USPTO are positive examples. As for how to
prepare negative examples in the future, the PatentMatch dataset
for training a binary text pair classifier in [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] can be a reference.
      </p>
      <p>
        Since none of those above (USE, BERT, and Sentence-BERT) is a
viable option, we resorted to the reranking idea demonstrated by
NBoost. In addition, we found that the first author of [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] proposed a
similar view on his GitHub repository. According to the author, a
bag-of-words search, such as BM25, can have a higher recall, but
its precision is lower. Conversely, the embedding-based search can
have higher precision, but its recall is lower. Therefore, to obtain
both higher recall and higher precision, a reranking strategy
is to perform word-based ranking upfront for higher recall and
then perform embedding-based ranking for higher precision.
It is noted that, in our initial experiments, we found that ranking
based on embeddings first and reranking based on words later does
not perform well. In such a configuration, the false-positive rate in
the ranking of embeddings is too high.
      </p>
    </sec>
    <sec id="sec-5">
      <title>System Architecture</title>
      <p>This section explains the overall architecture of our implementation
and the function/data flows in the architecture. For details, section 4
will cover the data part and its preprocessing, and section 5.1 will
provide the code repositories we leveraged from others. Fig. 1 shows
our system architecture. The upper portion of the figure (stage 1
of our implementation) represents what we have to build before a
user can trigger GPT-2 for text generation. The function flows in
this upper portion are depicted in solid lines. The bottom portion
of the figure (stage 2 of our implementation) shows the function
flows (in dotted lines) for ranking (BM25) and reranking (BERT
embeddings).</p>
      <p>
        At stage 1, we download raw data from the USPTO and split them
into patent spans. The patent spans are fed into Elasticsearch (with
default settings) for indexing so that we can query them based on
BM25 ranking. Before pre-training GPT-2 and BERT from scratch,
we built their vocabulary files. The details of pre-training GPT-2
and BERT are provided in sections 5.2 and 5.3, respectively. At stage
2, a user provides the input text to GPT-2 and the parameters of
GPT-2 inferencing. GPT-2 generates a patent text span based on
these settings. For reranking, the generated patent span first goes to
Elasticsearch to obtain the most relevant prior patent spans based on
BM25. The ranked prior patent spans then go to Bert-as-Service [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]
and are converted to BERT embeddings. Next, the embeddings go to
Annoy [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] for reranking based on cosine similarity. Lastly, the
final, reranked embeddings are decoded back to patent spans
in text format and shown to the user. In our experiments, based on
the same user input, we let GPT-2 generate multiple patent text
spans and collected both positive and negative reranking results
for each of them.
      </p>
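A note on the Annoy step: Annoy's “angular” metric equals sqrt(2·(1 − cosine)), so ranking by ascending angular distance is the same as ranking by descending cosine similarity. The sketch below (an exhaustive sort for clarity, rather than Annoy's approximate tree index) illustrates that reranking step:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def angular_distance(u, v):
    """Annoy's 'angular' metric: sqrt(2 * (1 - cos(u, v)))."""
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine(u, v))))

def rerank_by_embedding(query_vec, candidate_vecs):
    """Indices of candidates ordered from most to least similar to the query."""
    return sorted(range(len(candidate_vecs)),
                  key=lambda i: angular_distance(query_vec, candidate_vecs[i]))
```

In the actual pipeline, the candidate vectors are the BERT embeddings of the BM25-ranked spans, and Annoy returns the nearest neighbors of the generated span's embedding.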
    </sec>
    <sec id="sec-6">
      <title>DATA</title>
    </sec>
    <sec id="sec-7">
      <title>Data source</title>
      <p>
        In [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], the authors use the patent datasets on BigQuery provided
by Google [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Although it is flexible to manipulate the data by
SQL statements, we found that the data provided by BigQuery are
not updated frequently. We turned to the USPTO PatentsView [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]
for bulk download files and more updates. At the time of this
writing, the latest version of the “patent.tsv.zip” file is dated
2020-03-31. For incremental downloads instead of bulk downloads,
the USPTO Open Data Portal [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] can be another choice. The raw
data provided by the PatentsView and the Open Data Portal are
plain text in TSV or XML format. The downside of using such raw
text is the extra effort of data preprocessing, compared with the
flexibility of SQL statements in BigQuery. Practitioners need to
consider the tradeoff between flexibility and update frequency. We
opt for more frequently updated data in this work.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Datasets for GPT-2</title>
      <p>
        During data preprocessing, we follow the span-based approach
in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and follow the “structural metadata” and “metadata
mapping” approaches in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The structural metadata in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] is defined
to include patent title, abstract, independent claim, and dependent
claim. According to the authors, it is a mechanism to control what
kind of patent text to generate. Regarding metadata mapping, it is a
mechanism to guide GPT-2 for generating from one kind of patent
text to another. Where we differ from [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] is: (1) we add new tags
for patent drawing descriptions, (2) we add &lt;|dep|&gt; for dependent
claims, (3) we remove the proposed “backward” tags because
backward text generation is not required in this work. Table 1 shows
our special tags for structural metadata and mappings between
metadata. We found the span-based approach helpful for splitting
long claims into short text spans. However, for patent abstracts,
such a span-based approach may not apply. If an abstract’s text has
been taken verbatim from a claim, the span splitting mechanism
may apply. If not, there might be no span to split in a sentence.
When no span is found in the abstract, we split a patent abstract
into multiple sentences instead. Collectively the split sentences or
spans are treated the same way in our data processing, and we refer
to both of them as “span” in this work.
      </p>
      <p>Based on the approaches as mentioned above, the actual pipeline
to build the datasets for GPT-2 includes: (1) downloading the raw
TSV file from USPTO PatentsView and splitting it into smaller
files, (2) extracting text based on metadata (e.g., title, abstract, etc.)
and uploading them to the Elasticsearch server, (3) retrieving patent
text from the Elasticsearch server, adding special tags to them,
and saving them in text format, and (4) converting the text files in the
previous step to TFRecord format for Tensorflow code. In step (1),
the text data from USPTO PatentsView is about 48.7G (version:
2019-10-08). Such a corpus is larger than the WebText corpus (40G)
used by OpenAI for GPT-2 pre-training. In step (4), the total number
of tokens is 32.3B (32,398,927,872).</p>
      <p>
        By concatenating all of the text with special tags, the total
amount of data reaches 180G. Due to resource constraints, we
did not concatenate a dependent claim with its corresponding
independent claim. Independent claims are generally much longer
than their dependent claims. If our training data capture such claim
dependency for all dependent claims, e.g., “(claim 1) &lt;|dep|&gt; (claim
2)” and “...&lt;|abstract2claim|&gt;&lt;|start_of_claim|&gt; (claim1+claim2)”
(some special tags omitted for clarity), it is possible that the total
amount of text data may exceed 570G (the size of text data for
training GPT-3 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]). We leave such an experiment for future researchers.
It is also noted that the &lt;|figure2title|&gt; mapping in Table 1 does
not exist in our training data. We reserve this mapping for testing
whether it is possible for GPT-2 models to do the same kind of few-shot
learning as GPT-3.
      </p>
    </sec>
    <sec id="sec-9">
      <title>Datasets for BERT</title>
      <p>
        According to BERT’s code repository [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the input for pre-training
is a plain text file having one sentence per line. Consecutive lines
are the actual sentences for the "next sentence prediction" task.
Documents are delimited by empty lines. The final datasets contain
serialized text in TFRecord file format. In our case, we follow the
format and prepare the plain text file with one span or sentence per
line. We did not add our special tags or metadata mappings to the
text file because such annotations are designed for GPT-2 only. The
total number of words serialized in our training data is 6.8 billion
words. It is larger than the 3.3 billion word corpus (BooksCorpus
with 800M words and English Wikipedia with 2,500M words) for
pre-training the official BERT model.
      </p>
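The input layout described above (one span or sentence per line, documents separated by an empty line) can be produced with a short helper. The function name and the list-of-lists document structure are our own illustration:

```python
def to_bert_pretraining_text(documents):
    """Serialize documents into BERT's pre-training input format:
    one sentence (or span) per line, with an empty line between documents.

    `documents` is a list of documents, each a list of span strings."""
    blocks = ["\n".join(spans) for spans in documents]
    return "\n\n".join(blocks) + "\n"
```

The resulting plain text file is then converted to TFRecord format by the `create_pretraining_data.py` script in the BERT repository.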
    </sec>
    <sec id="sec-10">
      <title>Data for Elasticsearch server</title>
      <p>
        The purpose of the Elasticsearch server in our data pipeline is
twofold. First, it provides a ranking mechanism based on a
bag-of-words approach. For example, we can query the top n records (e.g.,
100) based on BM25. Second, the Elasticsearch server makes it convenient
to aggregate various patent text from different raw files. Such
aggregation replaces the BigQuery and SQL statements in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. In
step (2) of the data pipeline in 4.2, we split the patent text into spans
or sentences and upload them to the Elasticsearch server. The total
number of records in Elasticsearch is 343,987,632, and they occupy
59.7GB.
      </p>
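For reference, the BM25 ranking that Elasticsearch applies with default settings can be written out in a dependency-free form. This is the textbook Okapi/Lucene-style formulation for illustration, not Elasticsearch's exact implementation:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """BM25 score of one tokenized document for a query, with Lucene-style IDF.

    `corpus` is the list of all tokenized documents (spans)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N   # average document length
    dl = len(doc_terms)
    score = 0.0
    for term in set(query_terms):
        df = sum(1 for d in corpus if term in d)   # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        tf = doc_terms.count(term)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score
```

Querying the top n records then amounts to sorting spans by this score, which Elasticsearch performs server-side over the indexed records.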
    </sec>
    <sec id="sec-11">
      <title>IMPLEMENTATION &amp; EXPERIMENTS</title>
    </sec>
    <sec id="sec-12">
      <title>GitHub repositories</title>
      <p>
In addition to the official code of BERT by Google and GPT-2 by
OpenAI, our implementation leverages the following repositories:
(a) imcaspar/gpt2-ml [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]
(b) huggingface/transformers [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]
(c) ConnorJL/GPT2 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
(d) huggingface/tokenizers [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
(e) hanxiao/bert-as-service [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]
(f) spotify/annoy [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]
      </p>
      <p>
        According to [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], OpenAI trained their models with TPU, but
the code for training was not released. The authors in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] resorted
to [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] since it can leverage TPU and the trained model is compatible
with OpenAI’s code for inferencing on GPU. According to [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], a
potential downside is that the performance of the 1.5B model seems
inferior to the official model performance by OpenAI. Therefore,
we checked alternatives and found “transformers” [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] and
“gpt2-ml” [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. The former is a more promising codebase for several
technical reasons (omitted here for brevity). Unfortunately, we tried
and realized that PyTorch’s support for TPU is maturing, but the
specific code for GPT-2 training is not ready. Therefore, we opted
for “gpt2-ml”, which has successfully built a 1.5B model. The
“gpt2-ml” repository is forked from Grover [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], which was developed
by the Allen Institute. Grover is designed for fake news detection.
According to [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], Grover obtains over 92% accuracy at telling
apart human-written from machine-written news. The authors also
released the 1.5B Grover GPT-2 model. The 1.5B model’s availability
from a reputable institute is the main reason we selected the
“gpt2-ml” repository to work on. One disadvantage of Grover’s model
is that it is not compatible with OpenAI’s GPT-2 model. This means
that we cannot re-use OpenAI’s code for inferencing. We have
to use the inferencing code from Grover’s code. We expect that
“transformers” might be the best choice for researchers to pre-train
OpenAI GPT models with TPU and retain the compatibility with
OpenAI GPT-2 and GPT-3 models in the near future. Regarding
the other repositories on the above list, their respective functions
are: “tokenizers” [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for fast tokenization (replacing Google’s and
OpenAI’s code) and building a vocabulary from the patent corpus,
“bert-as-service” [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] for fast conversion from text to BERT embeddings,
and “annoy” [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] for searching and ranking BERT embeddings
efficiently.
      </p>
    </sec>
    <sec id="sec-13">
      <title>Implementation details: GPT-2</title>
      <p>Before pre-training, we use “tokenizers” (ByteLevelBPETokenizer)
to build the vocabulary specific to our patent corpus, instead of
using the default vocabulary released in “gpt2-ml”. We set the same
vocabulary size (50257) to build our vocabulary. One advantage
of building our own vocabulary file is that each special tag in our
design can be encoded as one token instead of multiple (if using the
original vocabulary by others). The model sizes we experiment with
are Base (similar to OpenAI’s 117M) and Large (similar to OpenAI’s
345M).</p>
      <p>The total number of tokens in the TFRecords for GPT-2 is about
32.3B. For training the Base model, we found that batch_size_per_core
= 16 and max_seq_length = 1024 are workable on Colab. Larger
batch size will trigger an OOM (out-of-memory) error. The number
of TPU cores on Colab is 8. Our goal is to train at least one epoch.
Therefore, we set our training steps as 248,000 (32,398,927,872
/ 1024 / 16 / 8 = 247,184). For training the Large model, we set
batch_size_per_core=4 to avoid the OOM error and set the same
training steps. Fig. 2 shows the curves of training loss. The final
loss values are 1.122 (Base) and 0.9934 (Large), respectively. It is
noted that the largest model (1.5B) will trigger the OOM error even
after setting the batch size to 1. We leave the 1.5B model for the
future, when more resources are available.</p>
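The step count above follows from simple arithmetic, reproduced here for clarity:

```python
# One epoch of the Base model on Colab's 8-core TPU.
total_tokens = 32_398_927_872   # tokens in the GPT-2 TFRecords (section 4.2)
max_seq_length = 1024
batch_size_per_core = 16
tpu_cores = 8

tokens_per_step = max_seq_length * batch_size_per_core * tpu_cores
steps_per_epoch = total_tokens // tokens_per_step
print(steps_per_epoch)  # 247184, which we rounded up to 248,000 training steps
```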
    </sec>
    <sec id="sec-14">
      <title>Implementation details: BERT</title>
      <p>Before pre-training, we use “tokenizers” (BertWordPieceTokenizer)
to build the vocabulary (uncased) specific to our patent corpus
instead of using the default vocabulary released in the official BERT code.
We set the same vocabulary size (30522) to build our vocabulary.
As for the BERT model size, we experiment with BERT-Base and
BERT-Large.</p>
      <p>
        According to [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the pre-training for the BERT-Large model
took 1,000,000 steps, which is approximately 40 epochs over the
3.3 billion word corpus by using 16 Cloud TPU devices (256 batch
size * 512 tokens = 128,000 tokens/batch). Our pre-training data
contains 6.8 billion words (6,824,071,153). Since the Colab provides
one Cloud TPU device only, the total number of tokens per batch is
more limited (64 batch size * 128 tokens = 8,192 tokens/batch). We
set our training steps as 2,000,000 to pre-train approximately 2.4
epochs over the 6.8 billion word corpus. Except for these, we use the
same hyperparameters provided in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for the BERT-Base and
BERT-Large models. For evaluation, we set eval_batch_size=32 and
max_eval_steps=100,000. The evaluation results are:
• loss = 1.0650321
• masked_lm_accuracy = 0.78279483
• masked_lm_loss = 0.96379614
• next_sentence_accuracy = 0.9975
• next_sentence_loss = 0.0040773232
      </p>
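The epoch estimate above can be reproduced as follows; like the text, it compares tokens per batch directly against the word count of the corpus as an approximation:

```python
# Approximate number of pre-training epochs for our BERT models on one Cloud TPU.
batch_size = 64
seq_length = 128
training_steps = 2_000_000
corpus_words = 6_824_071_153    # words serialized in our training data

tokens_per_batch = batch_size * seq_length            # 8,192 tokens/batch
epochs = training_steps * tokens_per_batch / corpus_words
print(round(epochs, 1))  # approximately 2.4
```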
      <p>For comparing model performance, we trained the BERT-Base
model with similar settings. Fig. 3 shows the curves of training
loss for the BERT-Large and BERT-Base models. As expected, the
BERT-Large model has a lower curve.
</p>
    </sec>
    <sec id="sec-15">
      <title>Qualitative examples</title>
      <p>
        In this section, we provide positive and negative examples in our
reranking experiments. Our proof-of-concept results (POC 1–7) are
available on the web. In POC 1, the results contain 100
generated patent spans (no cherry-picking) in patent abstract form (similar
experiments can be conducted on patent claims in the future). The
input for GPT-2 is the first sentence in the abstract of the US Patent
10,694,449 (granted on 2020-06-23). We selected three generated
patent spans for prior art search and reranking, as below.
• (POC 1)
• input: An apparatus and methods are provided for
automatically detecting and connecting to a Wi-Fi network.
• output: [1-4] In accordance with a signal strength
measurement from a Wi-Fi transceiver during an idle period when a
Wi-Fi network is detected, the Wi-Fi transceiver sends on
to a server an indication of a connection mode of the user
equipment.
• output: [1-33] In an embodiment, a method is provided for
automatically connecting a mobile telephone to a Wi-Fi
network.
• output: [1-42] The apparatus can include a device for
detecting whether a wireless device is in proximity to a wireless
device associated with the Wi-Fi network.
      </p>
      <p>
        Taking [1-33] (generated by GPT-2) as the input, our prior art
search retrieves the top 100 records by BM25 and reranks them by
embeddings. The [3/100] record in POC 4 (as below) is subjectively
a positive example for us. Compared with other records in POC
4, the [3/100] record “automatic connectivity....a mobile device to
roam” is more relevant to the “automatically connecting a mobile
telephone” in [1-33] of POC 1. The [3/100] record is ranked as 26
based on BM25 and reranked as 3 based on embedding. Therefore,
the reranking is effective in boosting its ranking. The [3/100] record
is the 5th span in the abstract of patent 8590023, which was in the
dataset for pre-training GPT-2 in the first place.
      </p>
      <p>• (POC 4)
• patent: 8590023 [ A-4 ] (5th span in abstract)
• text: This automatic connectivity may allow a mobile device
to roam across Wi-Fi hotspots of Wi-Fi networks and offload
traffic to Wi-Fi networks.
• ranked by BM25: 26
• re-ranked by embedding: 3</p>
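      <p>The two-step retrieval described above can be sketched in a few lines of Python. This is a minimal illustration with toy vectors, not the paper's actual implementation (which uses Elasticsearch for BM25 and pre-trained BERT embeddings); the doc ids echo the POC patents only for readability:</p>

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def rerank_by_embedding(query_vec, candidates):
    """Step 2: re-rank BM25 candidates (doc_id, embedding) by cosine
    similarity to the embedding of the generated patent text."""
    ranked = sorted(candidates,
                    key=lambda c: cosine(query_vec, c[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked]

# Toy stand-ins for the top records returned by a BM25 search (step 1).
query = [1.0, 0.0, 0.0]
bm25_top = [("8590023-A4", [0.9, 0.1, 0.0]),
            ("9373249-A1", [0.0, 1.0, 0.0]),
            ("10356696-A0", [1.0, 0.05, 0.0])]
print(rerank_by_embedding(query, bm25_top))
# → ['10356696-A0', '8590023-A4', '9373249-A1']
```

      <p>A record such as the [3/100] above moves from BM25 rank 26 to embedding rank 3 precisely because step 2 re-orders the BM25 candidate pool by this similarity.</p>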
      <p>
        The rankings by BM25 and by embedding similarity may differ
or coincide. For example, in POC 4, the [1/100] record (shown
below) ranks top 1 under both. The [1/100] record in POC
4 is also semantically similar to the [
        <xref ref-type="bibr" rid="ref1 ref10 ref11 ref12 ref13 ref14 ref15 ref16 ref17 ref18 ref19 ref2 ref20 ref21 ref22 ref23 ref24 ref25 ref26 ref27 ref28 ref29 ref3 ref30 ref4 ref5 ref6 ref7 ref8 ref9">1-33</xref>
        ] record in POC 1. The
[1/100] record in POC 4 is the first span in the abstract of patent
10356696, which was in the dataset for pre-training GPT-2 in the
first place.
      </p>
      <p>• (POC 4)
• patent: 10356696 [ A-0 ] (1st span in abstract)
• text: An apparatus and methods are provided for
automatically detecting and connecting to a Wi-Fi network.
• ranked by BM25: 1
• re-ranked by embedding: 1</p>
      <p>
        We also found negative examples. In POC 5, the following is
the top record according to both BM25 and embeddings. However,
the similarity between the top record in POC 5 and the input of
POC 5 (the GPT-2 output [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1-4</xref>
        ] in POC 1) seems remote. Such a
result suggests that sentence similarity is still a difficult problem.
One possible reason is that the recall of BM25 is not broad
enough, so it filters out suitable candidates too early, before the
embedding-similarity stage can consider them.
      </p>
      <p>• (POC 5)
• patent: 9373249 [ A-1 ] (2nd span in abstract)
• text: The Wi-Fi transceiver receives a Wi-Fi control signal
from a control signal generator.
• ranked by BM25: 1
• re-ranked by embedding: 1</p>
      <p>
        In addition to BM25 and embeddings, we found that adding a
keyword can be a beneficial enhancement when a user has a clear
idea of what to look for. For example, the top reranking result in the
following POC 2 is less relevant if “proximity” in the [
        <xref ref-type="bibr" rid="ref1 ref10 ref11 ref12 ref13 ref14 ref15 ref16 ref17 ref18 ref19 ref2 ref20 ref21 ref22 ref23 ref24 ref25 ref26 ref27 ref28 ref29 ref3 ref30 ref4 ref5 ref6 ref7 ref8 ref9">1-42</xref>
        ]
record of POC 1 is the point of interest. The [
        <xref ref-type="bibr" rid="ref1 ref10 ref11 ref12 ref13 ref14 ref15 ref16 ref17 ref18 ref19 ref2 ref20 ref21 ref22 ref23 ref24 ref25 ref26 ref27 ref28 ref29 ref3 ref30 ref4 ref5 ref6 ref7 ref8 ref9">1-42</xref>
        ] record of POC 1
is the input for the prior art search in POC 2. Such a result is
reasonable because there is no clue for the model to weigh the
point of interest more heavily.
      </p>
      <p>• (POC 2, reranked as top 1)
• patent: 7302229 [ A-2 ] (2nd span in abstract)
• text: In one embodiment, availability of wireless connectivity
may be determined to a first user of a wireless service at a
first wireless communication device to communicate with
an access point associated with a Wi-Fi wireless network
that offers the wireless service.</p>
      <p>To boost the search relevance, we add a keyword setting to BM25.
After adding “proximity” as a required term in the BM25 search,
the relevance of the top record in the following POC 6 increases
significantly, and the total number of positive results increases too.
Using a keyword as a first filter in reranking is a research topic
we plan to study in the future, because adding such a hard constraint
could be a double-edged sword.</p>
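      <p>As one way to express the keyword setting, the required term can be added as an extra must clause in an Elasticsearch bool query, so every hit must contain it before the embedding reranking runs. The field name, index layout, and helper below are assumptions for this sketch, not the paper's actual schema:</p>

```python
def bm25_query(generated_text, required_term=None, size=100):
    """Build an Elasticsearch query body: BM25 match on the generated
    patent text, optionally with a keyword every hit must contain.
    The "text" field name is an assumption for illustration."""
    must = [{"match": {"text": generated_text}}]
    if required_term is not None:
        # The hard constraint acting as a first filter before reranking.
        must.append({"match_phrase": {"text": required_term}})
    return {"size": size, "query": {"bool": {"must": must}}}

body = bm25_query("automatically connecting a mobile telephone to a Wi-Fi network",
                  required_term="proximity")
```

      <p>Such a body would then be submitted through the Elasticsearch client's search call; without <monospace>required_term</monospace> the query reduces to the plain BM25 ranking used in the earlier POCs.</p>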
      <p>• (POC 6, reranked as top 1)
• patent: 9986380 [ A-0 ] (1st span in abstract)
• text: A first wireless device determines whether the first
wireless device is in a specified proximity to a second wireless
device based on a signal wirelessly transmitted by the second
wireless device.</p>
      <p>It is noted that POC 3 (omitted here for brevity) shows an
example of a complete patent abstract containing several text spans
generated by GPT-2. Each text span can go through the same prior
art search with reranking, as demonstrated above. We leave such
an enhancement to future work. It is also noted that, in our early
experiments, using embeddings alone (without BM25) produced
many false-positive results, as shown in POC 7. For example, the
high similarity between “Coherent LADAR using intra-pixel quadrature
detection” and “In-pixel correlated double sampling with fold-over
detection” is a negative example, and there are many such results
with unreasonably high similarities. Therefore, embeddings alone are
not effective for semantic search. Comparing our initial experiments
(embeddings only) with the later ones (reranking with BM25 and
embeddings), we conclude that reranking is more effective, even
though it still produces very mixed results.</p>
    </sec>
    <sec id="sec-16">
      <title>Failure case: few-shot learning</title>
      <p>Although this work focuses on GPT-2, we are also interested in the
capabilities of the more recent GPT-3. GPT-3 is an autoregressive
language model with 175 billion parameters; according to its authors,
this is 10x more than any previous non-sparse language model. By
scaling up, the model can perform few-shot learning purely via text
interaction, without any gradient updates or fine-tuning. We estimate
that the largest GPT-3 model is about 507 times bigger than the
GPT-2 model we utilized. We hypothesize that patent text is more
uniform in structure and less diverse than the training data for
GPT-3. Hence, we wonder whether few-shot learning might be possible
with our GPT-2 model too. We prepare our input text in the following
format:
&lt;|start_of_figure|&gt; (text1) &lt;|end_of_figure|&gt; &lt;|figure2title|&gt;
&lt;|start_of_title|&gt; (text2) &lt;|end_of_title|&gt;</p>
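      <p>The few-shot prompt in this format can be assembled as follows. This is a sketch with made-up placeholder records, assuming only the special tokens shown above:</p>

```python
def build_fewshot_prompt(records):
    """Concatenate (figure_text, title) records in the prompt format
    above, leaving the last title blank for the model to complete."""
    parts = []
    for i, (figure_text, title) in enumerate(records):
        parts += ["<|start_of_figure|>", figure_text, "<|end_of_figure|>",
                  "<|figure2title|>", "<|start_of_title|>"]
        if i < len(records) - 1:      # keep titles only for the demo records
            parts += [title, "<|end_of_title|>"]
    return " ".join(parts)

prompt = build_fewshot_prompt([("figure text 1", "title one"),
                               ("figure text 2", "title two"),
                               ("figure text 3", "title three")])
```

      <p>The prompt ends right after the last <monospace>&lt;|start_of_title|&gt;</monospace>, so a model that had learned the mapping would continue with the removed title.</p>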
      <p>The &lt;|figure2title|&gt; mapping is defined in the vocabulary file,
but no training data contains such a mapping. Our purpose is to
test whether the model can learn this new mapping by few-shot
learning. In our experiments, we concatenate several records of
different figure text and titles, and then remove the title from the
last record. If few-shot learning works, the model should generate
the removed patent title for the last record. Unfortunately, we found
that it does not. The limited model size is probably the primary
root cause. Although this failure was anticipated, we observed one
intriguing pattern: most of the time, the model keeps generating
the patent title from the second record. We leave this case to
future work. Determining the minimal model size needed to achieve
few-shot learning in the patent domain is also an important topic
for future research.</p>
    </sec>
    <sec id="sec-17">
      <title>FUTURE RESEARCH</title>
      <p>Our experiments show mixed results, and the topics for future
research include:
• How to make reranking more effective?
• How to measure the “novelty” and “non-obviousness”
(requirements in patent law) between the generated patent
text and prior patent text?
• What are the legal &amp; ethical considerations before releasing
a generative patent model?
• Can the discrepancy between the BM25 and embedding
rankings be a source for data augmentation? For example,
Sentence-BERT requires both positive and negative examples
for training. Ranking by embeddings first and filtering by BM25
later might be a way to collect negative training examples.</p>
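      <p>The data-augmentation idea in the last bullet can be sketched minimally, assuming two precomputed rankings (the doc ids below are illustrative, not real patent spans):</p>

```python
def mine_negative_candidates(embedding_ranking, bm25_top_k):
    """Spans ranked high by embedding similarity but absent from the
    BM25 top-k are candidate negative pairs for training a model such
    as Sentence-BERT."""
    kept = set(bm25_top_k)
    return [doc_id for doc_id in embedding_ranking if doc_id not in kept]

negatives = mine_negative_candidates(
    embedding_ranking=["d1", "d2", "d3", "d4"],   # best-first by embedding
    bm25_top_k=["d1", "d3"])                      # survivors of the BM25 filter
print(negatives)   # → ['d2', 'd4']
```

      <p>The intuition is that a span with high embedding similarity but little lexical overlap is exactly the kind of "hard negative" that contrastive training benefits from, though whether such pairs are truly negatives would need manual validation.</p>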
    </sec>
    <sec id="sec-18">
      <title>CONCLUSION</title>
      <p>Reranking with BM25 and embeddings is a practical approach for
producing better search results than using embeddings alone. Our
reranking is a two-step approach in which the search is performed
based on BM25 first and then performed based on the cosine
similarity of embeddings. If a user has a clear point of interest in mind,
the search can be more productive by adding an extra step of
providing a keyword to the BM25 search. In this work, the input for
the prior art search is the patent text span generated by a GPT-2
model. The objective of our prior art search is to identify
retrospectively the most similar patent text spans in the training data
of the GPT-2 model. Although our experiments show the
effectiveness of reranking in the patent domain, they also show that
semantic search for longer text remains challenging. By finding
the similarity between GPT-2’s inputs and outputs, we expect that
this work and its future enhancement can help researchers
understand GPT-2 better. Particularly, in the patent domain, it is critical
to evaluate the novelty in GPT-2 and GPT-3 models. To evaluate
the novelty, a prerequisite is to identify the closest training data.
The progress in this paper works toward fulfilling this prerequisite so
that the novelty of the generated patent text can be evaluated in the
future. In our system architecture, we integrate several building
blocks, notably pre-training GPT-2, pre-training BERT, using
Elasticsearch for BM25 ranking, and reranking by embedding similarity
with Annoy. Such a proof-of-concept implementation is a practical
reference for future researchers.</p>
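      <p>For completeness, the similarity metric behind the Annoy-based reranking step can be sketched in pure Python. Annoy's "angular" distance is the Euclidean distance between normalized vectors, i.e. sqrt(2 - 2*cos(u, v)); the snippet illustrates the metric only, not the actual <monospace>annoy.AnnoyIndex</monospace> implementation:</p>

```python
from math import sqrt

def angular_distance(u, v):
    """Annoy-style angular distance: sqrt(2 - 2*cos(u, v))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    return sqrt(max(0.0, 2.0 - 2.0 * cos))  # clamp guards rounding error

d_same = angular_distance([1.0, 0.0], [3.0, 0.0])   # parallel vectors → 0.0
d_orth = angular_distance([1.0, 0.0], [0.0, 1.0])   # orthogonal → sqrt(2)
```

      <p>Because the distance is a monotone function of cosine similarity, nearest neighbors under this metric coincide with the highest-cosine spans used in the reranking step.</p>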
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Tom</surname>
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Brown</surname>
            , Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler,
            <given-names>Mateusz</given-names>
          </string-name>
          <string-name>
            <surname>Litwin</surname>
          </string-name>
          , Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner,
          <string-name>
            <surname>Sam</surname>
            <given-names>McCandlish</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Alec</given-names>
            <surname>Radford</surname>
          </string-name>
          , Ilya Sutskever, and
          <string-name>
            <given-names>Dario</given-names>
            <surname>Amodei</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Language Models are Few-Shot Learners</article-title>
          .
          <source>ArXiv</source>
          (
          <year>2020</year>
          ). https://arxiv.org/abs/2005.14165
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ming-Wei</surname>
            <given-names>Chang</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers).
          <source>Association for Computational Linguistics</source>
          , Minneapolis, Minnesota,
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . https://doi.org/10.18653/v1/N19-1423
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ming-Wei</surname>
            <given-names>Chang</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers).
          <source>Association for Computational Linguistics</source>
          , Minneapolis, Minnesota,
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . https://doi.org/10.18653/v1/N19-1423
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Eva</surname>
            <given-names>D</given-names>
          </string-name>
          'hondt, Suzan Verberne, Niklas Weber,
          <string-name>
            <given-names>Kees</given-names>
            <surname>Koster</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Lou</given-names>
            <surname>Boves</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Using skipgrams and PoS-based feature selection for patent classification</article-title>
          .
          <source>Computational Linguistics in the Netherlands Journal</source>
          <volume>2</volume>
          (
          <issue>Dec</issue>
          .
          <year>2012</year>
          ),
          <fpage>52</fpage>
          -
          <lpage>70</lpage>
          . https://www.clips.uantwerpen.be/clinjournal/clinj/article/view/15
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Google</surname>
          </string-name>
          . [n.d.].
          <article-title>Google Patents public datasets on BigQuery</article-title>
          . https://console.cloud.google.com/bigquery?p=patents-public-data.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Google</surname>
          </string-name>
          . [n.d.]. google-research/bert. https://github.com/google-research/bert.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Google</surname>
          </string-name>
          . [n.d.]. Universal Sentence Encoder. https://tfhub.dev/google/universal-sentence-encoder/2.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>HuggingFace.</surname>
          </string-name>
          <year>2020</year>
          .
          <article-title>Fast State-of-the-Art Tokenizers optimized for Research and Production</article-title>
          . https://github.com/huggingface/tokenizers.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Connor</given-names>
            <surname>Leahy</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>An implementation of training for GPT2, supports TPUs</article-title>
          . https://github.com/ConnorJL/GPT2.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Jieh-Sheng Lee</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Measuring and Controlling Text Generation by Semantic Search</article-title>
          .
          <source>In WWW '20: Companion Proceedings of the Web Conference</source>
          <year>2020</year>
          . Taipei, Taiwan,
          <fpage>269</fpage>
          -
          <lpage>273</lpage>
          . https://doi.org/10.1145/3366424.3382086
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Jieh-Sheng Lee</surname>
            and
            <given-names>Jieh</given-names>
          </string-name>
          <string-name>
            <surname>Hsiang</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Measuring Patent Claim Generation by Span Relevancy</article-title>
          .
          <source>In Proceedings of the Thirteenth International Workshop on Juris-informatics (JURISIN)</source>
          . Keio University Kanagawa, Japan.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Jieh-Sheng Lee</surname>
            and
            <given-names>Jieh</given-names>
          </string-name>
          <string-name>
            <surname>Hsiang</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Patent claim generation by fine-tuning OpenAI GPT-2</article-title>
          . World Patent Information (
          <year>2020</year>
          ). in press.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Jieh-Sheng Lee</surname>
            and
            <given-names>Jieh</given-names>
          </string-name>
          <string-name>
            <surname>Hsiang</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>PatentBERT: Patent classification with fine-tuning a pre-trained BERT model</article-title>
          .
          <source>World Patent Information</source>
          <volume>61</volume>
          ,
          <issue>101965</issue>
          (
          <year>2020</year>
          ). https://doi.org/10.1016/j.wpi.2020.101965
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Jieh-Sheng Lee</surname>
            and
            <given-names>Jieh</given-names>
          </string-name>
          <string-name>
            <surname>Hsiang</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>PatentTransformer-2: Controlling Patent Text Generation by Structural Metadata</article-title>
          . (
          <year>2020</year>
          ). https://arxiv.org/abs/2001.03708
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Yinhan</surname>
            <given-names>Liu</given-names>
          </string-name>
          , Myle Ott, Naman Goyal, Mandar Joshi Jingfei Du, Danqi Chen,
          <string-name>
            <surname>Omer Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mike</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Luke</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Veselin</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>RoBERTa: A Robustly Optimized BERT Pretraining Approach</article-title>
          .
          <source>ArXiv</source>
          (
          <year>2019</year>
          ). http://arxiv.org/abs/1907.11692
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Lajanugen</given-names>
            <surname>Logeswaran</surname>
          </string-name>
          and
          <string-name>
            <given-names>Honglak</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>An efficient framework for learning sentence representations</article-title>
          .
          <source>In International Conference on Learning Representations</source>
          . https://openreview.net/forum?id=rJvJXZb0W
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Tomas</surname>
            <given-names>Mikolov</given-names>
          </string-name>
          , Ilya Sutskever, Kai Chen, Greg S Corrado, and
          <string-name>
            <given-names>Jeff</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Distributed Representations of Words and Phrases and their Compositionality</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          26,
          <string-name>
            <surname>C. J. C. Burges</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Bottou</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Welling</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Ghahramani</surname>
            , and
            <given-names>K. Q.</given-names>
          </string-name>
          <string-name>
            <surname>Weinberger</surname>
          </string-name>
          (Eds.). Curran Associates, Inc.,
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Alec</surname>
            <given-names>Radford</given-names>
          </string-name>
          , Jeffrey Wu, Rewon Child, David Luan,
          <string-name>
            <given-names>Dario</given-names>
            <surname>Amodei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Sutskever</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Language Models are Unsupervised Multitask Learners</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Nils</given-names>
            <surname>Reimers</surname>
          </string-name>
          and
          <string-name>
            <given-names>Iryna</given-names>
            <surname>Gurevych</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks</article-title>
          .
          <source>In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing</source>
          .
          <source>Association for Computational Linguistics</source>
          . http://arxiv.org/abs/1908.10084
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Julian</surname>
            <given-names>Risch</given-names>
          </string-name>
          , Nicolas Alder, Christoph Hewel, and
          <string-name>
            <given-names>Ralf</given-names>
            <surname>Krestel</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>PatentMatch: A Dataset for Matching Patent Claims &amp; Prior Art</article-title>
          . arXiv:2012.13919 [cs.IR]
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Julian</given-names>
            <surname>Risch</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ralf</given-names>
            <surname>Krestel</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Domain-specific word embeddings for patent classification</article-title>
          .
          <source>Data Technol. Appl</source>
          .
          <volume>53</volume>
          (
          <year>2019</year>
          ),
          <fpage>108</fpage>
          -
          <lpage>122</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Spotify</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk</article-title>
          . https://github.com/spotify/annoy.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Cole</given-names>
            <surname>Thienes</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jack</given-names>
            <surname>Pertschuk</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>NBoost: Neural Boosting Search Results</article-title>
          . https://github.com/koursaros-ai/nboost.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] USPTO. [n.d.]. USPTO Open Data Portal. https://developer.uspto.gov/data.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25] USPTO. [n.d.].
          <source>USPTO PatentsView</source>
          . https://www.patentsview.org/download.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Ashish</surname>
            <given-names>Vaswani</given-names>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
          Łukasz Kaiser, and
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention is All you Need</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          30, I. Guyon,
          <string-name>
            <given-names>U. V.</given-names>
            <surname>Luxburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          , and R. Garnett (Eds.). Curran Associates, Inc.,
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          . http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Thomas</surname>
            <given-names>Wolf</given-names>
          </string-name>
          , Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and
          <string-name>
            <given-names>Jamie</given-names>
            <surname>Brew</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>HuggingFace's Transformers: State-of-the-art Natural Language Processing</article-title>
          .
          <source>ArXiv</source>
          (
          <year>2019</year>
          ). https://arxiv.org/abs/1910.03771
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Han</given-names>
            <surname>Xiao</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>bert-as-service</article-title>
          . https://github.com/hanxiao/bert-as-service.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Rowan</surname>
            <given-names>Zellers</given-names>
          </string-name>
          , Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and
          <string-name>
            <given-names>Yejin</given-names>
            <surname>Choi</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Defending Against Neural Fake News</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          <volume>32</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Zhibo</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>GPT2-ML: GPT-2 for Multiple Languages</article-title>
          . https://github.com/imcaspar/gpt2-ml.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>