<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Explainable Artificial Intelligence for Highlighting and Searching in Patent Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Renukswamy Chikkamath</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rana Fassahat Ali</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christoph Hewel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Markus Endres</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>PAUSTIAN &amp; PARTNERS</institution>
          ,
          <addr-line>Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Applied Sciences</institution>
          ,
          <addr-line>Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Passau</institution>
          ,
          <addr-line>Passau</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>12</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>The verbose content and redundant information present in patents often add complexity to reading and understanding them. Individual subject matters related to an invention and its decisiveness are scattered throughout patent documents. Moreover, these matters could provide relevant key arguments for an effective examination or critical assessment of an invention. To address these complexities and facilitate patent practitioners' efficient reading and in-page semantic searches of patents, we generated a multiclass dataset representing key arguments of patents on a sentence level. Essentially, these key arguments are the concrete details related to an invention, such as the problem it solves or the technical effects or advantages it achieves. We fine-tuned Transfer Learning models on this novel dataset and developed two Chromium extensions. One extension automatically highlights these key arguments using our fine-tuned model, and the other steers semantic search within any opened patent document in the browser. The data and code related to this work are released to the community via a GIT repository. The empirical test cases and manually labeled gold truth data provide evidence supporting our hypothesis regarding in-page patent search and efficient reading, respectively.</p>
      </abstract>
      <kwd-group>
        <kwd>Patent analysis</kwd>
        <kwd>prior art search</kwd>
        <kwd>patent language model</kwd>
        <kwd>sentence classification</kwd>
        <kwd>patent datasets</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>1.1. Motivation</p>
      <p>A patent is a form of intellectual property that provides the owner with legal rights to prohibit others from producing, using, or selling the invention. However, these rights are granted in exchange for disclosing how the invention works. Before a patent can be granted, it must undergo a rigorous examination process, known as the prior art search. This search is typically conducted in two stages: the first stage occurs early in the patent life cycle, when patent attorneys draft the patent application, and the second stage takes place later in the patent life cycle, when patent examiners review the patent application.</p>
      <p>Since patent claims define the scope of protection, finding any prior art or other competing art that can be used as evidence for the proposed claims is a crucial step. A patent comprises not only one or several claims defining the legal scope of protection but also a specification providing one or several specific embodiments of the invention. Patent owners tend to keep the specification as general as possible, which may not only be advantageous for further broadening the scope of protection but may also relieve the patent owners from publishing their developed technology. Therefore, most parts of the specification only repeat the text of the patent claims and add generalized boilerplate text concerning the functioning of an invention. Even though a patent specification may typically be 10 to 30 pages long, there are only a few short text passages that explain the concrete technical effects of the invention.</p>
      <p>Therefore, it is often challenging for patent practitioners, including attorneys and examiners, to comprehend the invention’s definition in the claims, which problem is addressed by the invention, or which technical effects or benefits are achieved by the invention. However, without understanding the motivation behind the invention, it is difficult to compare it with other inventions when assessing its inventive step over the prior art.</p>
      <p>For example, suppose the claimed invention defines a heating system with three temperature sensors. In that case, the closest prior art document, such as an older patent, may only disclose a heating system with two temperature sensors. In such cases, important questions arise: What is the technical effect of the third sensor? Why does the prior art suggest only two sensors? If the motivations behind the two concepts are completely different, the claimed invention might be considered as implying an inventive step over the prior art.</p>
      <p>PatentSemTech'23: 4th Workshop on Patent Text Mining and Semantic Technologies, co-located with the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 27th, 2023, Taipei, Taiwan. * Corresponding author. renukswamy.chikkamath@hm.edu (R. Chikkamath); ali11@ads.uni-passau.de (R. F. Ali); hewel@paustian.de (C. Hewel); markus.endres@hm.edu (M. Endres)</p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.</p>
      <p>Consequently, patent analysis often requires retrieving those few text passages in a patent that can reveal the motivation behind the claimed invention. In this work, we aim to address the aforementioned difficulties and ease the prior art search. Specifically, we focus on automatically highlighting these text passages with the help of the Chrome extension (Analyse), as shown in Figure 1. The Analyse extension is supported by an Artificial Intelligence (AI) model that is fine-tuned on a novel dataset developed in this work. We also present a Chrome extension (search text box) to facilitate cross-questioning during patent analysis, as depicted in Figure 2.</p>
      <p>1.2. Highlight and Search in Patent Text</p>
      <p>The quality of a patent prior art search is greatly influenced by the readability and understandability of patents. In prior art search or patent analysis in general, the most important parts of patents are considered to be the claims and the technical description, which disclose and describe an invention respectively. Since the claims are written in legal terminology, they are often difficult to understand just by reading them alone. Detailed descriptions of patents represent key arguments of any invention, such as advantages, solutions, problems, and justifications for claim features. Understanding and differentiating the above-mentioned points in a timely manner aids examiners and attorneys in critical assessments and effective analysis in the light of prior art. The focus of this work is to ease the readability and understandability of patents, unlike investigations of information retrieval or prior art search approaches.</p>
      <p>The AI-based assistance presented in this work is in greater demand when individual patents are considered for analysis, and this assistance has two main benefits. Firstly, it provides ease of readability by automatically highlighting technical aspects related to the invention on the sentence level. Secondly, it offers deeper understanding by allowing readers to ask various cross-questions. For example, the question “What are the problems with conventional mouse catchers?” in the patent Mousetrap<sup>1</sup> can be searched, as shown in Figure 2. Such a tool can enhance the user experience by providing the opportunity to explore the documents in greater detail and work with the semantics and context of the patent text, unlike keyword-matching in-page searches (Ctrl+F based search).</p>
      <p>Highlighting at the sentence level is more interesting and important than at the keyword or paragraph level. This is because keywords in patents can be succinct but do not provide any evidence to understand the context in which key arguments are used. On the other hand, paragraphs can be informative but can contain mixed opinions. For example, individual sentences explaining different arguments of inventions (advantageous effects, problems, solutions) can be visible in one paragraph. Therefore, in this work, we focus on identifying and highlighting key arguments only at the sentence level.</p>
      <p>In this paper, we present a sentence-level patent dataset designed to highlight key arguments for any invention at the sentence level. This is a multi-class dataset that was utilized to fine-tune Bert-for-Patents [<xref ref-type="bibr" rid="ref1">1</xref>]. We developed two Chromium extensions: one for automatically highlighting arguments, facilitated by the internally fine-tuned Bert-for-Patents model; and another for in-page semantic search based on SQuAD<sup>2</sup> models. A free-flow natural language query can be used to search within the opened document on the web. Both extensions can work well on text present in any web page or document. However, the experiments in this work are limited to Google Patents<sup>3</sup>, for example, a patent opened in Google Patents as shown in Figures 1 and 2.</p>
      <p>The remainder of this work is organized as follows: Section 2 describes related work. Section 3 explains the methodologies used to develop the data, with a detailed multi-stage flowchart to describe the models developed in this work. Section 4 outlines the browser extension communication architecture. In Section 5, we discuss the results achieved in this work, including a sample test case. Finally, in Section 6, we conclude our work and suggest possible future directions.</p>
      <p>1 https://patents.google.com/patent/US8943741; 2 https://rajpurkar.github.io/SQuAD-explorer/; 3 https://patents.google.com/</p>
      <sec id="sec-1-1">
        <title>2. Related Work</title>
        <p>The research aspects of this work are related to the intersection of tasks such as text highlighting, sentence classification, and question answering in the field of Natural Language Processing (NLP).</p>
        <p>In recent times, language representation learning, also known as language model development, and research on reading comprehension, such as question-answering models, have grown rapidly in the field of NLP. Notable models that have achieved top performance include Turing NLR-v5 [<xref ref-type="bibr" rid="ref2">2</xref>] and Turing ULR-v6 [<xref ref-type="bibr" rid="ref3">3</xref>]. These models outperform other state-of-the-art models in both sentence classification, as benchmarked by GLUE<sup>4</sup> on the Stanford Sentiment Treebank (SST-2) dataset, and question answering, based on the Stanford Question Answering Dataset (SQuAD). Various other variants<sup>5</sup> of the BERT [<xref ref-type="bibr" rid="ref4">4</xref>] architecture can also be seen as competitors in various settings. In recent years, Google released a language model pre-trained on patent data called BERT-for-Patents [<xref ref-type="bibr" rid="ref1">1</xref>]. Since this model is trained on more than 100 million patents, unlike the above-mentioned general-purpose models, we have used it to fine-tune our classification model.</p>
        <p>Text highlighting in this context emphasizes the significance of readability and understandability of patent and non-patent text. There is evidence in the literature regarding how patent examiners from the European Patent Office (EPO) initially read patent documents to come to a preliminary understanding of the patent. In particular, there is a greater need for developing tools to assist them in skimming through patents and achieving a deeper understanding of the contents [<xref ref-type="bibr" rid="ref5">5</xref>]. Moreover, there is a lot of motivation from patent attorneys on the web to assess the parameters for patentability and skim through the document to find individual subject matters<sup>6,7</sup>.</p>
        <p>Although there is much interest in the readability of patents [<xref ref-type="bibr" rid="ref6 ref7 ref8 ref9">6, 7, 8, 9</xref>], these approaches are limited to the analysis of claims. However, segmentation and analysis of claims are other segments of research in prior art search. To the best of our knowledge, there are no approaches that focus on patent text at the sentence level to highlight relevant key arguments. Highlighting important aspects of the text in the context of education/learning is not new [<xref ref-type="bibr" rid="ref10">10</xref>]. In other non-patent domains, generating and providing a quick summary with highlighted text is proposed to emphasize textual elements [<xref ref-type="bibr" rid="ref11 ref12 ref13 ref14">11, 12, 13, 14</xref>]. Text highlighting in general encourages a thorough understanding of a document [<xref ref-type="bibr" rid="ref15">15</xref>] and also supports easier subsequent literature study [<xref ref-type="bibr" rid="ref16">16</xref>]. To ease access, developing browser extensions to highlight text on the web has drawn attention. For instance, highlighting the disputed claims on web pages and finding the relevant article from the web for facilitating the arguments in claims is proposed by Ennals et al. [<xref ref-type="bibr" rid="ref17">17</xref>]. Other related research also showed that reading comprehension can be attained by text highlighting on the web or any digital text content [<xref ref-type="bibr" rid="ref14 ref18 ref19 ref20">18, 19, 14, 20</xref>].</p>
        <p>In the patent domain, a few private-sector providers have developed solutions for multi-color highlighting of keywords<sup>8,9</sup>. However, such approaches would not be efficient because patent applications can be written using different terminologies even for the same concept. Furthermore, considering the context in addition to keywords adds domain knowledge that can explain why a particular keyword was highlighted. Additionally, these solutions are paid, and the reader has to manually find and highlight keywords. These solutions are more like digital pens to highlight and keep a record of keywords, which is again a time-consuming task.</p>
        <p>To utilize AI models in the process of automatic highlighting in the patent domain, IPGoggles<sup>10</sup> (one of the motivations for this paper) proposes a new-age cloud-based solution. This service highlights keywords or even phrases in patents based on sentiment. Professionals believe that reading and understanding patents becomes challenging, even at an individual document level, given the huge amount of prior art. However, IPGoggles utilizes general-purpose AI models that are not fine-tuned on patent data to identify technical aspects or key arguments. In the patent domain, researchers have developed a dataset (PaSa) to identify the technical aspects of patent documents on a paragraph level [<xref ref-type="bibr" rid="ref21">21</xref>]. It contains patent paragraphs named under the headings “Technical Problem,” “Solution to Problem,” and “Advantageous Effects of Invention.”</p>
        <p>In PaSa, United States Patent and Trademark Office (USPTO)<sup>11</sup> patent grants from 2010 to 2020 were searched to identify the technical aspects mentioned in clear and distinguishable paragraphs. The authors argue that these paragraphs are not common in all patents, but rather reflect a patent drafting style (based on region) that is mostly followed by Asia-specific patents. Moreover, it is even harder to find these specific paragraphs in Asia-specific patents before 2010 (refer to Table 4 [<xref ref-type="bibr" rid="ref21">21</xref>], which shows a gradual decrease in the number from 2020 to 2010). This provides strong motivation to utilize these important and infrequent paragraphs as the basis of our investigations. However, the PaSa dataset has not been used in any downstream application or tool so far. Therefore, we decided to develop a dataset using PaSa and to use it to further train AI models that can be used in downstream applications, such as a Chrome extension.</p>
        <p>In the state of the art, there is either evidence of highlighting technical aspects based on general-purpose AI models or evidence of a dataset to identify technical aspects on the paragraph level. However, to the best of our knowledge, there are no approaches that focus on identifying and highlighting technical aspects on a sentence level using a domain and task-specific dataset. Therefore, in this work, we propose and develop a dataset for finding technical aspects on a sentence level (refer to Section 1.2 to know why the sentence level is preferred). Furthermore, we utilize this dataset to fine-tune a patent domain-specific language model. This fine-tuned model is deployed in a Chrome extension service as a prototype. The technique utilized to develop the sentence-level dataset and the variety of models fine-tuned are described in detail in Section 3.</p>
        <p>4 https://gluebenchmark.com/leaderboard; 5 https://huggingface.co/models?sort=downloads&amp;search=bert; 6 https://www.heerlaw.com/difference-patentability-assessment-patent-search; 7 https://www.brmpatentattorneys.com.au/intellectual-property-law-melbourne/how-to-read-a-patent/; 8 https://help.patsnap.com/hc/en-us/articles/115005478629-What-Can-I-Do-When-I-View-A-Patent; 9 https://patseer.com; 10 https://ipgoggles.com/; 11 https://developer.uspto.gov/product/patent-grant-full-text-dataxml</p>
      </sec>
      <sec id="sec-1-2">
        <title>3. Data and Models</title>
        <p>To the best of our knowledge, there is no dataset available in the literature that identifies technical aspects at the sentence level. Therefore, we proposed to generate a sentence-level dataset based on a paragraph-level dataset called PaSa [<xref ref-type="bibr" rid="ref21">21</xref>]. The patent paragraphs of PaSa (shown in the top left of Figure 3) represent essential key arguments that are crucial for effective patent reading. They also facilitate critical assessment of the boundaries of an invention. To aid patent practitioners in making decisions during report writing or formal hearings in examinations, AI models trained on such a dataset are necessary. However, it is not always true that all sentences in a specific paragraph represent the heading. The following excerpt from the patent “US10834907B2” shows that there are sentences reflecting both problems and advantages under the same heading “Technical Problem”.</p>
        <p>For example: "In summer, when rock oysters come in season, sea areas are highly contaminated. . . which causes inhibition of distribution...Accordingly, an object of the present invention is to provide . . . enables the production of virus-free oysters having no experience of being exposed to a sea area. . . . present invention solves the above-mentioned problems."</p>
        <p>Therefore, in this work, we utilized the PaSa dataset to develop sentence-level data for identifying the key technical aspects present in patents. We also used the “sentiments” naming convention for the three classes in our dataset, which are solutions-neutral, advantages-positive, and problems-negative.</p>
        <p>The dataset generation and model training in this work can be seen in three stages, as shown in Figure 3. In Stage-I, as a straightforward approach, we used the NLTK tokenizers<sup>12</sup> to convert a paragraph into sentences based on full stops. Further, preprocessing was carried out to remove smaller sentences containing fewer than 20 characters, which are mostly small phrases or sentences oriented toward special symbols. After preprocessing, PaSa_Sentence-Baseline contains 940,000 sentences, and Figure 3 displays samples from each class. It is clear that the dataset is unbalanced, as we have fewer samples in the positive and negative classes.</p>
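        <p>The Stage-I preprocessing can be sketched roughly as follows. This is a minimal illustration, not the authors' code: a naive split on full stops stands in for the NLTK tokenizer, the function names are hypothetical, and it assumes that every sentence simply inherits the label of its PaSa paragraph. Only the 20-character cutoff is taken from the text.</p>
        <preformat>
```python
MIN_CHARS = 20  # sentences with fewer characters are dropped (cutoff stated in the text)


def split_into_sentences(paragraph):
    """Naive full-stop splitter; an assumed stand-in for the NLTK tokenizer."""
    sentences = []
    for piece in paragraph.split(". "):
        piece = piece.strip()
        if not piece:
            continue
        # Restore the full stop that str.split removed from all but the last piece.
        sentences.append(piece if piece.endswith(".") else piece + ".")
    return sentences


def paragraph_to_samples(paragraph, label):
    """Turn one labeled paragraph into (sentence, label) pairs.

    Assumption: each sentence inherits the paragraph-level label.
    """
    return [
        (sentence, label)
        for sentence in split_into_sentences(paragraph)
        if len(sentence) >= MIN_CHARS
    ]


if __name__ == "__main__":
    text = (
        "Oysters are contaminated in summer. See FIG. 1. "
        "The invention provides virus-free oysters."
    )
    for sample in paragraph_to_samples(text, "negative"):
        print(sample)
```
        </preformat>
        <p>In this toy input the short pieces around "FIG. 1" fall below the cutoff and are discarded, which mirrors the kind of symbol-oriented fragments the length filter is meant to remove.</p>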
        <p>To maintain standard experimental settings, as in PaSa, and to avoid class imbalance problems, we chose only 150k samples (set A) to train the baseline models in Stage-I. The remaining samples were used for other experiments: “except set A” was used in Stage-II, and 650 samples were set aside for manual labeling in Stage-III. In Stage-I, we also used the original PaSa paragraph dataset to train transformer models, as the PaSa paper focused only on machine learning models. In Stage-II, we generated an improvised version (set B) of the PaSa_Sentence Baseline data to address errors and shortcomings identified in using PaSa_Sentence Baseline (refer to Section 5.2 for error analysis). The data samples used for the various purposes (set A, set B, manually labeled data) were kept mutually disjoint to avoid bias in training the models.</p>
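        <p>One way to keep such subsets disjoint is to shuffle the sample identifiers once and cut the shuffled list into consecutive slices. The sketch below is an assumption about how this could be done, not the procedure used in this work: the function name, the seed, and the absence of class-wise balancing are all illustrative choices.</p>
        <preformat>
```python
import random


def make_disjoint_splits(sample_ids, set_a_size, manual_size, seed=0):
    """Partition sample ids into three non-overlapping subsets.

    set_a         -> baseline training samples (Stage-I)
    manual        -> samples reserved for manual labeling (Stage-III)
    except_set_a  -> everything else (used in Stage-II)
    """
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    set_a = ids[:set_a_size]
    manual = ids[set_a_size:set_a_size + manual_size]
    except_set_a = ids[set_a_size + manual_size:]
    return {"set_a": set_a, "manual": manual, "except_set_a": except_set_a}


if __name__ == "__main__":
    # Scaled-down stand-in for 940,000 sentences, 150k for set A, 650 manual.
    splits = make_disjoint_splits(range(9400), set_a_size=1500, manual_size=65)
    print({name: len(ids) for name, ids in splits.items()})
```
        </preformat>
        <p>Because every identifier lands in exactly one slice, no sentence can leak from the training subset into the manually labeled evaluation subset.</p>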
        <p>We fine-tuned pre-trained transformer models from the Hugging Face platform<sup>13</sup> on our datasets. With the exception of Bert-For-Patents, the remaining three baseline models (refer to Stage-I) were pre-trained on non-patent literature and hosted on Hugging Face. The naming convention (Bert-For-Patents-#) indicates that these models were fine-tuned on different datasets. For example, Bert-For-Patents-2 is a completely new pre-trained model that was fine-tuned using PaSa paragraph data in Stage-II. In Stage-III, the same Bert-For-Patents-2 (fine-tuned) was used solely for making predictions on “except set A” (i.e., “except set A” played no role in training Bert-For-Patents-2). Thus, all models and datasets used were kept separate. The baseline models shown in Figure 3 were fine-tuned with a sequence length of 512 and a batch size of 16, except for Bert-For-Patents-#, which was fine-tuned with a sequence length of 128 and a batch size of 8. The reason for this difference is that Bert-For-Patents-# is an extremely large architecture with 24 hidden layers, which imposes hardware constraints during fine-tuning, even on an NVIDIA server with an A30 GPU.</p>
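        <p>The prediction step on “except set A” is later combined with a confidence cutoff: only sentences whose prediction score exceeds 70% are kept (see Section 5). A minimal sketch of that filter is shown below. The record layout with "sentence", "label", and "score" keys is an assumption chosen to resemble typical classifier output, and the function name is hypothetical.</p>
        <preformat>
```python
def keep_confident(predictions, threshold=0.70):
    """Keep only predictions whose score is strictly above the threshold.

    predictions: list of dicts with "sentence", "label" and "score" keys
    (an assumed record layout, not the authors' data format).
    """
    return [p for p in predictions if p["score"] > threshold]


if __name__ == "__main__":
    predictions = [
        {"sentence": "The device reduces noise.", "label": "positive", "score": 0.95},
        {"sentence": "A housing is provided.", "label": "neutral", "score": 0.70},
        {"sentence": "Prior traps often fail.", "label": "negative", "score": 0.41},
    ]
    for record in keep_confident(predictions):
        print(record["label"], record["sentence"])
```
        </preformat>
        <p>A strict comparison is used here because the text speaks of scores "greater than 70%"; whether the boundary value itself was kept is not stated.</p>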
        <p>For the in-page patent semantic search, we used question-answering models<sup>14</sup> based on the SQuAD dataset and hosted on Hugging Face. The best-performing and most downloaded models are Bert Large (uncased), RoBerta base, and DistilBert based (cased). To the best of our knowledge, no datasets in SQuAD format are available in the patent domain (which opens the door for research in developing a question-answering dataset in the patent domain). SQuAD models are feasible for in-page searching in this work because natural text queries can be searched within a given context (e.g., patent text in chunks), unlike keyword matches. SQuAD models can be easily hosted and deployed in Chrome extensions. Therefore, we investigated the aforementioned models in our in-page semantic search extension. The components of the Chromium extension are explained in detail with the help of the communication architecture in Section 4.</p>
        <p>14 https://huggingface.co/models?pipeline_tag=question-answering&amp;sort=downloads</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Browser Extensions</title>
      <p>The browser extensions developed in this work are aimed at enhancing the readability and understandability of patents. Readability is more effective when the technical aspects of the considered patents are automatically highlighted. This automation is based on knowledge from domain-specific AI models fine-tuned in this work, and the respective model is deployed in a Chrome extension (refer to Figure 1). The understandability of patents is improved when there is an opportunity to ask cross-questions during patent analysis within a patent document. Such a feature is provided by our other extension developed in this work (refer to Figure 2). Patent practitioners can install and activate these two Chrome extensions in their browsers for effective prior art searches (refer to the GIT repository<sup>15</sup> of this work for installation). More details, including the usability of the Chrome extension, request run times, and responsiveness of the interface, are also added to the GIT repository.</p>
      <p>15 https://github.com/Renuk9390/expaai_model; 16 https://flask.palletsprojects.com/en/2.2.x/; 17 https://huggingface.co/docs/transformers/main_classes/pipelines</p>
      <p>The browser extensions presented in this paper operate on browsers such as Google Chrome (Chromium based), with development in two parts: i) a Python Flask<sup>16</sup> API for the models (acts as backend) and ii) a Chromium extension (acts as front end). We used Flask to develop an API for our models; to get the predictions from our fine-tuned models, we utilized Hugging Face transformers pipelines<sup>17</sup>. We hosted our fine-tuned models in the Hugging Face repository to make use of them in pipelines. The API has two POST endpoints, one for each of the tasks (classification/sentiment-predict and in-page semantic search). The classification POST endpoint accepts an array of sentences of any opened document in the browser and collects the prediction response from the transformer pipeline with our fine-tuned model (Bert-For-Patents-3). Further, the endpoint assigns classes to the array of sentences. With respect to the semantic search POST endpoint, a context (complete patent text in our case) and a question are given as input and passed to the question-answering model pipeline (e.g., Bert large trained on SQuAD from Hugging Face<sup>18</sup>). The response is an answer (start and end positions of text from the context considered) for the question searched.</p>
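      <p>The request and response shapes of the two endpoints described in this section can be illustrated with plain functions. The sketch below deliberately leaves out the Flask routing and the real model pipelines: the model is passed in as a callable and replaced by a stub, and all field names ("sentences", "question", "context") and function names are assumptions made for illustration, not the authors' API. Start and end positions are treated as character offsets here, which is also an assumption.</p>
      <preformat>
```python
def handle_sentiment_predict(payload, classify):
    """Classification endpoint body: one label and score per input sentence."""
    sentences = payload["sentences"]
    predictions = classify(sentences)  # expected: one dict per sentence
    return [
        {"sentence": s, "label": p["label"], "score": p["score"]}
        for s, p in zip(sentences, predictions)
    ]


def handle_semantic_search(payload, answer):
    """Semantic search endpoint body: answer span located inside the context."""
    context = payload["context"]
    result = answer(question=payload["question"], context=context)
    start, end = result["start"], result["end"]
    return {
        "answer": context[start:end],
        "start": start,
        "end": end,
        "score": result["score"],
    }


# Stub models so the sketch runs without any downloaded weights.
def fake_classify(sentences):
    return [{"label": "positive", "score": 0.9} for _ in sentences]


def fake_answer(question, context):
    needle = "snap shut"
    start = context.find(needle)
    return {"start": start, "end": start + len(needle), "score": 0.8}


if __name__ == "__main__":
    print(handle_sentiment_predict({"sentences": ["The trap is reusable."]}, fake_classify))
    print(handle_semantic_search(
        {"question": "What is the problem?",
         "context": "Conventional traps snap shut on fingers."},
        fake_answer,
    ))
```
      </preformat>
      <p>Passing the model in as a callable keeps the handlers testable without a server; in a deployed backend the stubs would be replaced by the fine-tuned classification and question-answering pipelines.</p>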
      <p>We use chrome-extension-cli19 for developing the
Chromium extension. In addition, we used technologies
such as Javascript, HTML, and CSS for data handling and
styling. The communication architecture of the browser
extension with its components is shown in Figure 4. The
functionalities of individual components are as follows:
• Popup: The component that is visible when we
click the browser extension icon, which acts as
the only point of contact between the user and
the extension. The popup is responsible for
providing buttons for both classifications with
multicolor highlighting and a search bar. Additionally,
the Loader shows the task being performed or
stopped. The Popup script communicates with
both the “Content” and “Background”
components. Text content from the web page will be
accessed, analyzed (predictions, answers), and
highlighted in the final step.
• Content: This component collects the text
present in the opened web page and
communicates with both the “Background” and “Popup”
components. The “Content” component is
responsible for receiving a message from the “Popup”
script and for sending and receiving messages
to and from the “Background” component. In
this case, it prepares the content for analysis and
highlights the relevant content on the web page
based on predictions from the “Background”
component. Highlighting the content (sentences and
answers) is one of the salient tasks of the
"Content" component. This is achieved by using a
18https://huggingface.co/</p>
      <p>bert-large-uncased-whole-word-masking-finetuned-squad
19https://github.com/dutiyesh/chrome-extension-cli</p>
    </sec>
    <sec id="sec-3">
      <title>5. Findings</title>
      <p>5.1. Scores and Test Cases
In this section, we discuss the results of this work and
perform an error analysis to show how the dataset
representation problem afects the model performances.</p>
      <p>“div” number or “class” on the HTML page for Patents-2 to obtain improved samples from our
“Basethe respective matched answer or sentence to line_preprocessed Dataset”. We considered only those
highlight. samples where the prediction score was greater than 70%
• Background: This is the only component com- when predicted by Bert-for-Patents-2.
municating with the Flask API backend. When With respect to in-page semantic search, we are
utiit receives a message from the “Content” com- lizing models (Bert Large uncased, RoBerta base, and
ponent with a payload to perform a task, the DistilBert based cased) which are fine-tuned on SQuAD
API endpoint will be called with inputs. Back- data. To our knowledge, there are no SQuAD formatted
ground listens to two types of messages from datasets in the patent domain to address in-page question
Content such as “Patent_Text” for highlighting answering. Therefore in this work, we are not fine-tuning
technical aspects based on the type of class it them on any patent data. Instead, we only perform test
belongs to and “Patent_Semantic_Search” to ac- cases to compare and evaluate them. For the test cases,
complish in-page search. After receiving a re- we considered various contexts (patent text) and
quessponse from API, the response will be sent to tions to compare the answering capability of said models.
“Content” for further processing. In addition, DistilBert is competitive with Bert Large in some cases.
Background is also responsible for sending mes- For instance, as depicted in Figure 5, we provided the
sages task_started and task_stopped to “Popup” same context and question to the aforementioned models.
to keep the “Loader” busy or active for taking Bert Large exhibited superior performance in retrieving
the next task from the user. More details on the the answer; nevertheless, DistilBert also performed
reacommunication of components can be collected sonably well in retrieving the correct answer. In most
via the code base repository of this paper. cases, Bert Large uncased model performed better in
ifnding accurate answers for longer queries (which are
common in patent searches). Therefore, Bert Large is
deployed in the in-page semantic search extension.</p>
      <p>To test and debug the API endpoints, we used an open-source application called Insomnia<sup>20</sup>. We sent test requests from Insomnia to the in-page semantic search API endpoint and to the classification (aka sentiment_predict) API endpoint. For example, we passed an array of sentences to the sentiment_predict endpoint, and the fine-tuned model returned a response with the label and prediction probability score. Similarly, for semantic_search, we passed a sample patent text as the context along with a question, and the response included the begin and end token numbers of the possible answer text snippet together with confidence scores. After confirming in these Insomnia tests that the APIs functioned as intended, we deployed them in the Chromium extensions.</p>
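      <p>As a minimal sketch of how such a semantic_search response could be turned back into text, the Python snippet below maps the returned begin and end token numbers onto the context. The field names start, end and score, the whitespace tokenization, and the example values are assumptions made for illustration; they are not taken from the released code, and a BERT model would index WordPiece tokens instead.</p>
      <preformat>
```python
from typing import Dict, List


def extract_answer(context_tokens: List[str], response: Dict) -> str:
    """Return the snippet selected by inclusive start/end token indices."""
    start, end = response["start"], response["end"]
    n = len(context_tokens)
    # Reject indices that do not describe a valid span inside the context.
    if start not in range(n) or end not in range(start, n):
        raise ValueError("token indices fall outside the context")
    return " ".join(context_tokens[start:end + 1])


context = "The conveyor estimates the weight of a transport object without a load cell"
tokens = context.split()  # assumption: whitespace tokens, not WordPiece tokens
response = {"start": 3, "end": 8, "score": 0.91}  # hypothetical response values
answer = extract_answer(tokens, response)
print(answer, response["score"])
```
      </preformat>
      <p>Treating the end index as inclusive is also an assumption in this sketch; an API that returns an exclusive end index would need the slice adjusted by one.</p>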
      <sec id="sec-3-1">
        <p>There are three different ways in which the labels are assigned to the sentence-level dataset of this work. First, labels are assigned automatically based on the NLTK tokenizer (in STAGE-I). Second, labels are given by the fine-tuned paragraph model (in STAGE-II). Third, labels are assigned manually (in STAGE-III). Although the “Baseline models” developed in this work perform well in terms of accuracy, there are cases where a model’s validation loss is lower than its training loss at the end of the 3rd epoch.</p>
        <p>For the models, the validation data was easier to predict than the training data was to learn. This signifies a dataset representation problem, i.e., classes are not equally represented by all the samples, for various reasons as shown below. The models fine-tuned on this poorly represented data induce bias in predicting the validation set. Various samples in the PaSa_Sentence Baseline_preprocessed data can serve as examples of substandard training samples.</p>
        <p>Example 1: “In the view of the problem of the background art, it is an object of the present invention to provide a conveyor which estimates the weight of a transport object while it is carried without using devices such as a load cell which directly measures weight.”</p>
        <p>Observation 1: The above example is automatically labeled as a negative class during PaSa_Baseline generation, but it is not when we do manual labeling. During patent drafting, mostly in “Technical Problem” paragraphs, attorneys/applicants commonly use underlined phrases to quickly repeat their invention while describing problems with other prior art. If sentences with such underlined phrases are present in the negative class, then such samples can be discarded.</p>
        <p>Example 2: “An embodiment provides a lighting device in which an optical plate is disposed on at least one light source and a light source module including the same.”</p>
        <p>Observation 2: The above sentence, as well as others that are similar, are automatically labeled as negative even though they are not. This indicates the presence of mixed opinions at times on the paragraph level, which also appears in some sentences.</p>
        <p>There are other samples that are very long (60-70 words); in such cases, smaller sentences are joined using special symbols such as ";,:". Manually checking every such sample in large datasets is laborious. Therefore, we decided to fine-tune a model on the paragraph level so that this model would have a greater understanding of the representativeness of classes on advantages, problems, and solutions in a patent text. Such a fine-tuned model is used to select the sentences that show at least a 70% probability of representing a class. Further, we have used these improvised samples (PaSa_Sentence Improvised Dataset) to fine-tune a new model (Bert-For-Patents-3), which outperforms other baseline models in terms of both accuracy and class representativeness.</p>
        <p>“PaSa_Sentence Improvised Dataset” (refer to Stage-II in Figure 3) is used to fine-tune Bert-for-Patents-3. Due to the improvements made in the dataset, this model shows an accuracy of 97.11%. As shown in Figure 3, Bert-for-Patents-2, fine-tuned on a paragraph level with an accuracy of 98.13%, is competent enough to represent the classes. Therefore, we decided to use Bert-for-…</p>
        <p>We manually labeled 650 randomly selected samples, which were not used in any of the experiments. The original labels for these samples from PaSa_Baseline_preprocessed were kept separate. To verify the presence of bias and representation problems in the baseline models, we compared the prediction accuracies of manual predictions, baseline models, and Bert-for-Patents-3. The manual and Bert-for-Patents-3 prediction accuracies were 68.59% and 69.05%, respectively. Bert-for-Patents-3 was fine-tuned on the improved dataset, and its prediction performance was closer to the manual labels. However, due to bias, the baseline models showed higher scores with accuracies of 87.80% (DistilBert base uncased), 87.04% (Bert base uncased), and 94.06% (Bert-for-Patents-1). Therefore, Bert-for-Patents-3 is more suitable for use in the Chrome extension for highlighting technical aspects.</p>
        <p>Technical aspects in a patent represent advantages over the prior art, proposed solutions, or problems with other prior art. The core objective of this work was to automatically identify and highlight these aspects in patents. Although this objective may resemble a sentiment analysis problem, general sentiment analysis datasets or algorithms are not suitable for this task. Our sentence-level dataset is distinct from other sentiment analysis datasets such as IMDB<sup>21</sup> and Amazon product reviews<sup>22</sup>. These datasets mostly contain sentences expressing people’s opinions on products, things, or other social aspects. In contrast, our dataset highlights the key technical arguments in patents that demonstrate the invention’s technical capabilities in comparison to the prior art. Most importantly, our dataset is specific to the patent domain and accounts for patent-specific vocabulary and knowledge.</p>
        <p><sup>20</sup> https://docs.insomnia.rest/insomnia/get-started</p>
        <p><sup>21</sup> https://www.imdb.com/interfaces/</p>
        <p><sup>22</sup> https://cseweb.ucsd.edu/~jmcauley/datasets.html</p>
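        <p>The comparison against manually assigned labels discussed in this section amounts to a plain accuracy computation per model. The following Python sketch illustrates that calculation under stated assumptions: the label names, the model names model_a and model_b, and the four toy samples are invented for illustration and are not data from this work.</p>
        <preformat>
```python
from typing import Dict, List


def accuracy(predicted: List[str], reference: List[str]) -> float:
    """Fraction of samples whose predicted label equals the reference label."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("label lists must be non-empty and equally long")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Toy manual labels and model outputs (hypothetical, for illustration only).
manual = ["advantage", "problem", "solution", "problem"]
predictions: Dict[str, List[str]] = {
    "model_a": ["advantage", "problem", "solution", "advantage"],
    "model_b": ["problem", "problem", "solution", "advantage"],
}

# Score every model against the same manually labeled reference set.
scores = {name: accuracy(labels, manual) for name, labels in predictions.items()}
print(scores)
```
        </preformat>
        <p>Scoring every model against one held-out, manually labeled set keeps the comparison independent of the automatically generated labels that the models were trained on.</p>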
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Conclusion and Future Work</title>
      <p>In this work, we present a multi-class dataset at the
sentence level to highlight the technical subject matters of
patents, which can serve as important key arguments to
determine a patent’s novelty. We fine-tuned language
models on our new dataset and developed a Chromium
extension to automatically highlight key arguments based
on predictions, provided the probability exceeds 70%. We
also developed another Chromium extension to facilitate
in-page semantic search.</p>
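      <p>The highlighting rule summarized above can be pictured as a simple threshold check on each sentence-level prediction. The Python sketch below is only an illustration under assumptions: the dictionary keys sentence, label and score, the label names, and the example predictions are hypothetical and do not reproduce the extension's actual code.</p>
      <preformat>
```python
from typing import Dict, List, Optional

THRESHOLD = 0.70  # probability cut-off stated in the text


def label_to_highlight(prediction: Dict) -> Optional[str]:
    """Return the class to highlight, or None if the probability is too low."""
    if prediction["score"] > THRESHOLD:
        return prediction["label"]
    return None


# Hypothetical model outputs for two sentences of a patent page.
predictions: List[Dict] = [
    {"sentence": "The device reduces energy consumption.", "label": "advantage", "score": 0.93},
    {"sentence": "Figure 2 shows a side view.", "label": "solution", "score": 0.41},
]
highlights = [label_to_highlight(p) for p in predictions]
print(highlights)
```
      </preformat>
      <p>Sentences that map to None would simply be left unhighlighted, so low-confidence predictions do not clutter the page.</p>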
      <p>We anticipate a growing need for AI-based tools to
assist patent practitioners in conducting patent prior art
searches. We hope this empirical work serves as
preliminary research and motivates researchers and patent
practitioners to develop tools that can automate prior
art searches. Future work in this area could identify
additional technical aspects in patent documents and
train new classes for highlighting. For this study, we
focused only on advantages, problems, and solutions.
Furthermore, sentence-level data could be improved to
enhance the representativeness of samples belonging to
a particular class. For example, sentences representing
"advantages" should not be mixed with sentences related
to "problems".</p>
      <p>Developing a question-answering dataset in the patent
domain is crucial, and such datasets can be used to
develop tools to automate in-page semantic searches. We
also hope that AI-based tools to assist prior art searches
will enhance the interaction of patent analysts with
patent documents. For instance, the automatic highlight
and semantic search tools prototyped in this work can
allow for cross-questioning within any patent document
opened in a web browser.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <p>This research is part of the project "BigScience", which is funded by the Bavarian State Ministry for Economic Affairs, Regional Development, and Energy under the grant number DIK0259/01.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Srebrovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yonamine</surname>
          </string-name>
          ,
          <article-title>Leveraging the BERT algorithm for Patents with TensorFlow and BigQuery</article-title>
          ,
          <source>Technical Report, Global Patents</source>
          , Google, https://services.google.com/fh . . . ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bajaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tiwary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bennett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>Metro: Efficient denoising pretraining of large scale autoencoding language models with model generated signals</article-title>
          ,
          <source>arXiv preprint arXiv:2204.06644</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Patra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singhal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <article-title>Beyond english-centric bitexts for better multilingual language representation learning</article-title>
          ,
          <source>arXiv preprint arXiv:2210.14867</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lahorte</surname>
          </string-name>
          ,
          <article-title>Inside the mind of an epo examiner</article-title>
          ,
          <source>World Patent Information</source>
          <volume>54</volume>
          (
          <year>2018</year>
          )
          <fpage>S18</fpage>
          -
          <lpage>S22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Shinmori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Okumura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Marukawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Iwayama</surname>
          </string-name>
          ,
          <article-title>Patent claim processing for readability-structure analysis and term explanation</article-title>
          ,
          <source>in: Proceedings of the ACL-2003 workshop on Patent corpus processing</source>
          ,
          <year>2003</year>
          , pp.
          <fpage>56</fpage>
          -
          <lpage>65</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ferraro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Suominen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nualart</surname>
          </string-name>
          ,
          <article-title>Segmentation of patent claims for improving their readability</article-title>
          ,
          <source>in: Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>66</fpage>
          -
          <lpage>73</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Shinmori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Okumura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Marukawa</surname>
          </string-name>
          ,
          <article-title>Aligning patent claims with detailed descriptions for readability</article-title>
          ,
          <source>in: NTCIR</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sheremetyeva</surname>
          </string-name>
          ,
          <article-title>Natural language analysis of patent claims</article-title>
          ,
          <source>in: Proceedings of the ACL-2003 workshop on Patent corpus processing</source>
          ,
          <year>2003</year>
          , pp.
          <fpage>66</fpage>
          -
          <lpage>73</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Rello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Saggion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          ,
          <article-title>Keyword highlighting improves comprehension for people with dyslexia</article-title>
          ,
          <source>in: Proceedings of the 3rd workshop on predicting and improving text readability for target reader populations (PITR)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>30</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Spala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dernoncourt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dockhorn</surname>
          </string-name>
          ,
          <article-title>A web-based framework for collecting and assessing highlighted sentences in a document</article-title>
          ,
          <source>in: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>78</fpage>
          -
          <lpage>81</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Thadani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stent</surname>
          </string-name>
          ,
          <article-title>The role of discourse units in near-extractive summarization</article-title>
          ,
          <source>in: Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>137</fpage>
          -
          <lpage>147</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K.</given-names>
            <surname>Woodsend</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lapata</surname>
          </string-name>
          ,
          <article-title>Automatic generation of story highlights</article-title>
          ,
          <source>in: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>565</fpage>
          -
          <lpage>574</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kaisser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hearst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Lowe</surname>
          </string-name>
          ,
          <article-title>Improving search results quality by customizing summary lengths</article-title>
          ,
          <source>in: Proceedings of ACL-08: HLT</source>
          ,
          <year>2008</year>
          , pp.
          <fpage>701</fpage>
          -
          <lpage>709</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>F. I.</given-names>
            <surname>Craik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Lockhart</surname>
          </string-name>
          ,
          <article-title>Levels of processing: A framework for memory research</article-title>
          ,
          <source>Journal of Verbal Learning and Verbal Behavior</source>
          <volume>11</volume>
          (
          <year>1972</year>
          )
          <fpage>671</fpage>
          -
          <lpage>684</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Faw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. G.</given-names>
            <surname>Waller</surname>
          </string-name>
          ,
          <article-title>Mathemagenic behaviours and efficiency in learning from prose materials: Review, critique and recommendations</article-title>
          ,
          <source>Review of Educational Research</source>
          <volume>46</volume>
          (
          <year>1976</year>
          )
          <fpage>691</fpage>
          -
          <lpage>720</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ennals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Trushkowsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Agosta</surname>
          </string-name>
          ,
          <article-title>Highlighting disputed claims on the web</article-title>
          ,
          <source>in: Proceedings of the 19th international conference on World wide web</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>341</fpage>
          -
          <lpage>350</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yeari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Oudega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>van den Broek</surname>
          </string-name>
          ,
          <article-title>The effect of highlighting on processing and memory of central and peripheral text information: Evidence from eye movements</article-title>
          ,
          <source>Journal of Research in Reading</source>
          <volume>40</volume>
          (
          <year>2017</year>
          )
          <fpage>365</fpage>
          -
          <lpage>383</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Knollman-Porter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Wallace</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Deville</surname>
          </string-name>
          ,
          <article-title>Effect of digital highlighting on reading comprehension given text-to-speech technology for people with aphasia</article-title>
          ,
          <source>Aphasiology</source>
          <volume>35</volume>
          (
          <year>2021</year>
          )
          <fpage>200</fpage>
          -
          <lpage>221</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Winchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mozer</surname>
          </string-name>
          ,
          <article-title>Highlights as an early predictor of student comprehension and interests</article-title>
          ,
          <source>Cognitive Science</source>
          <volume>44</volume>
          (
          <year>2020</year>
          )
          <elocation-id>e12901</elocation-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>R.</given-names>
            <surname>Chikkamath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. R.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hewel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Endres</surname>
          </string-name>
          ,
          <article-title>Patent sentiment analysis to highlight patent paragraphs</article-title>
          ,
          <source>arXiv preprint arXiv:2111.09741</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>