Explainable Artificial Intelligence for Highlighting and Searching in Patent Text

Renukswamy Chikkamath (1,*), Rana Fassahat Ali (2), Christoph Hewel (3) and Markus Endres (1)
1 University of Applied Sciences, Munich, Germany
2 University of Passau, Passau, Germany
3 PAUSTIAN & PARTNERS, Munich, Germany

PatentSemTech'23: 4th Workshop on Patent Text Mining and Semantic Technologies, co-located with the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 27th, 2023, Taipei, Taiwan.
* Corresponding author: renukswamy.chikkamath@hm.edu (R. Chikkamath); ali11@ads.uni-passau.de (R. F. Ali); hewel@paustian.de (C. Hewel); markus.endres@hm.edu (M. Endres)

Abstract
The verbose content and redundant information in patents often make them complex to read and understand. The individual subject matters of an invention, and what is decisive about it, are scattered throughout a patent document, yet these matters can provide the key arguments for an effective examination or critical assessment of the invention. To address these complexities and to support patent practitioners in efficient reading and in-page semantic search of patents, we generated a multi-class dataset that captures the key arguments of patents at the sentence level. Essentially, these key arguments are the concrete details of an invention, such as the problem it solves or the technical effects and advantages it achieves. We fine-tuned transfer learning models on this novel dataset and developed two Chromium extensions: one automatically highlights these key arguments using our fine-tuned model, and the other steers semantic search within any patent document opened in the browser. The data and code of this work are released to the community via a GIT repository. Empirical test cases and manually labeled gold-truth data provide evidence supporting our hypotheses regarding in-page patent search and efficient reading, respectively.

Keywords: Patent analysis, prior art search, patent language model, sentence classification, patent datasets

1. Introduction

1.1. Motivation

A patent is a form of intellectual property that provides the owner with legal rights to prohibit others from producing, using, or selling the invention. These rights, however, are granted in exchange for disclosing how the invention works. Before a patent can be granted, it must undergo a rigorous examination process known as the prior art search. This search is typically conducted at two stages of the patent life cycle: first, early on, when patent attorneys draft the patent application; and second, later, when patent examiners review the application.

Since patent claims define the scope of protection, finding any prior art or other competing art that can be used as evidence for the proposed claims is a crucial step. A patent does not only comprise one or several claims defining the legal scope of protection but also a specification providing one or several specific embodiments of the invention. Patent owners tend to keep the specification as general as possible, which may not only be advantageous for broadening the scope of protection but may also relieve the patent owners from publishing their developed technology. Therefore, most parts of the specification merely repeat the text of the patent claims and add generalized boilerplate text concerning the functioning of the invention. Even though a patent specification may typically be 10 to 30 pages long, only a few short text passages explain the concrete technical effects of the invention.

It is therefore often challenging for patent practitioners, including attorneys and examiners, to comprehend the invention's definition in the claims, the problem addressed by the invention, or the technical effects and benefits it achieves. Yet without understanding the motivation behind the invention, it is difficult to compare it with other inventions when assessing its inventive step over the prior art.
For example, suppose the claimed invention defines a heating system with three temperature sensors, while the closest prior art document, such as an older patent, only discloses a heating system with two temperature sensors. In such cases, important questions arise: What is the technical effect of the third sensor? Why does the prior art suggest only two sensors? If the motivations behind the two concepts are completely different, the claimed invention might be considered to imply an inventive step over the prior art.

Consequently, patent analysis often requires retrieving those few text passages in a patent that reveal the motivation behind the claimed invention. In this work, we aim to address the aforementioned difficulties and ease the prior art search. Specifically, we focus on automatically highlighting these text passages with the help of a Chrome extension (Analyse), as shown in Figure 1. The Analyse extension is supported by an Artificial Intelligence (AI) model that is fine-tuned on a novel dataset developed in this work. We also present a Chrome extension (a search text box) to facilitate cross-questioning during patent analysis, as depicted in Figure 2.

Figure 1: Chromium extension to highlight technical aspects. The Analyse button activates this extension, whereby technical problems, solutions, and advantages are automatically colored in red, yellow, and green, respectively.

Figure 2: Chromium extension for in-page semantic search. A search bar can be used to ask a question, and the answer is highlighted within the opened web page or patent.

1.2. Highlight and Search in Patent Text

The quality of a patent prior art search is greatly influenced by the readability and understandability of patents. In prior art search, and in patent analysis in general, the most important parts of a patent are considered to be the claims and the technical description, which disclose and describe the invention, respectively. Since the claims are written in legal terminology, they are often difficult to understand just by reading them alone. Detailed descriptions of patents represent key arguments of any invention, such as advantages, solutions, problems, and justifications for claim features. Understanding and differentiating these points in a timely manner aids examiners and attorneys in critical assessment and effective analysis in light of the prior art. The focus of this work is to ease the readability and understandability of patents, rather than to investigate information retrieval or prior art search approaches as such.

The AI-based assistance presented in this work is in greatest demand when individual patents are analyzed, and it has two main benefits. Firstly, it eases readability by automatically highlighting technical aspects of the invention at the sentence level. Secondly, it offers deeper understanding by allowing readers to ask various cross-questions.
For example, the question "What are the problems with conventional mouse catchers?" can be searched in the patent Mousetrap (https://patents.google.com/patent/US8943741), as shown in Figure 2. Such a tool can enhance the user experience by providing the opportunity to explore documents in greater detail and to work with the semantics and context of the patent text, unlike keyword-matching in-page searches (Ctrl+F-based search).

Highlighting at the sentence level is more interesting and important than at the keyword or paragraph level. Keywords in patents can be succinct, but they do not provide any evidence for understanding the context in which key arguments are used. Paragraphs, on the other hand, can be informative but may contain mixed opinions; for example, individual sentences explaining different arguments of an invention (advantageous effects, problems, solutions) can appear in one paragraph. Therefore, in this work, we focus on identifying and highlighting key arguments at the sentence level only.

In this paper, we present a sentence-level patent dataset designed to capture the key arguments of any invention. This is a multi-class dataset that was utilized to fine-tune Bert-for-Patents [1]. We developed two Chromium extensions: one for automatically highlighting arguments, powered by the internally fine-tuned Bert-for-Patents model, and another for in-page semantic search based on SQuAD (https://rajpurkar.github.io/SQuAD-explorer/) models. A free-flow natural language query can be used to search within the document opened on the web. Both extensions work on text present in any web page or document; however, our experiments are limited to Google Patents (https://patents.google.com/), for example a patent opened in Google Patents as shown in Figures 1 and 2.

The remainder of this work is organized as follows: Section 2 describes related work. Section 3 explains the methodologies used to develop the data, with a detailed multi-stage flowchart describing the models developed in this work. Section 4 outlines the browser extension communication architecture. In Section 5, we discuss the results achieved in this work, including a sample test case. Finally, in Section 6, we conclude our work and suggest possible future directions.

2. Related Work

The research aspects of this work lie at the intersection of text highlighting, sentence classification, and question answering in the field of Natural Language Processing (NLP).

In recent times, language representation learning, also known as language model development, and research on reading comprehension, such as question-answering models, have grown rapidly in NLP. Notable models that have achieved top performance include Turing NLR-v5 [2] and Turing ULR-v6 [3]. These models outperform other state-of-the-art models in both sentence classification, for example on the GLUE benchmark (https://gluebenchmark.com/leaderboard), which includes the Stanford Sentiment Treebank (SST-2) dataset, and question answering, based on the Stanford Question Answering Dataset (SQuAD). Various other variants of the BERT [4] architecture (https://huggingface.co/models?sort=downloads&search=bert) can also be seen as competitors in various settings. In recent years, Google released a language model pre-trained on patent data called BERT-for-Patents [1]. Since this model is trained on more than 100 million patents, unlike the above-mentioned general-purpose models, we used it to fine-tune our classification model.

Text highlighting in this context emphasizes the significance of the readability and understandability of patent and non-patent text. There is evidence in the literature on how patent examiners at the European Patent Office (EPO) initially read patent documents to arrive at a preliminary understanding of a patent; in particular, there is a clear need for tools that assist them in skimming through patents and achieving a deeper understanding of the contents [5]. Moreover, patent attorneys on the web provide ample motivation for assessing the parameters of patentability and skimming through a document to find individual subject matters (https://www.heerlaw.com/difference-patentability-assessment-patent-search, https://www.brmpatentattorneys.com.au/intellectual-property-law-melbourne/how-to-read-a-patent/).
Although there is considerable interest in the readability of patents [6, 7, 8, 9], these approaches are limited to the analysis of claims; the segmentation and analysis of claims, however, form a separate line of research in prior art search. To the best of our knowledge, there are no approaches that focus on patent text at the sentence level to highlight relevant key arguments.

Highlighting important aspects of a text in the context of education and learning is not new [10]. In other, non-patent domains, generating a quick summary with highlighted text has been proposed to emphasize textual elements [11, 12, 13, 14]. Text highlighting in general encourages a thorough understanding of a document [15] and also supports easier subsequent literature study [16]. To ease access, developing browser extensions that highlight text on the web has drawn attention: for instance, Ennals et al. [17] propose highlighting disputed claims on web pages and finding relevant articles from the web to support the arguments in those claims. Other related research has likewise shown that reading comprehension can be improved by text highlighting on the web or in any digital text content [18, 19, 14, 20].

In the patent domain, a few private-sector providers have developed solutions for multi-color highlighting of keywords (https://help.patsnap.com/hc/en-us/articles/115005478629-What-Can-I-Do-When-I-View-A-Patent-, https://patseer.com). However, such approaches are not efficient, because patent applications can be written using different terminologies even for the same concept. Furthermore, considering the context in addition to keywords adds domain knowledge that can explain why a particular keyword was highlighted. Additionally, these solutions are paid, and the reader has to find and highlight keywords manually. They are more like digital pens for highlighting and keeping a record of keywords, which is again a time-consuming task.
To utilize AI models for automatic highlighting in the patent domain, IPGoggles (https://ipgoggles.com/), one of the motivations for this paper, proposes a new-age cloud-based solution. This service highlights keywords or even phrases in patents based on sentiment. Professionals agree that reading and understanding patents is challenging even at the level of an individual document, given the huge amount of prior art. However, IPGoggles utilizes general-purpose AI models that are not fine-tuned on patent data to identify technical aspects or key arguments.

In the patent domain, researchers have developed a dataset (PaSa) to identify the technical aspects of patent documents at the paragraph level [21]. It contains patent paragraphs under the headings "Technical Problem," "Solution to Problem," and "Advantageous Effects of Invention." For PaSa, United States Patent and Trademark Office (USPTO, https://developer.uspto.gov/product/patent-grant-full-text-dataxml) patent grants from 2010 to 2020 were searched to identify the technical aspects mentioned in clear and distinguishable paragraphs. The authors argue that these paragraphs are not common in all patents, but rather reflect a patent drafting style (based on region) that is mostly followed in Asia-specific patents. Moreover, it is even harder to find these specific paragraphs in Asia-specific patents before 2010 (refer to Table 4 of [21], which shows a gradual decrease in their number from 2020 back to 2010). This provides strong motivation to use these important and infrequent paragraphs as the basis of our investigations. However, the PaSa dataset has not been used in any downstream application or tool so far. Therefore, we decided to derive a dataset from PaSa and use it to train AI models that can serve downstream applications such as a Chrome extension.

In the state of the art, there is either evidence of highlighting technical aspects based on general-purpose AI models, or evidence of a dataset for identifying technical aspects at the paragraph level. To the best of our knowledge, however, there are no approaches that focus on identifying and highlighting technical aspects at the sentence level using a domain- and task-specific dataset. Therefore, in this work, we propose and develop a dataset for finding technical aspects at the sentence level (refer to Section 1.2 for why the sentence level is preferred). Furthermore, we utilize this dataset to fine-tune a patent domain-specific language model, and we deploy this fine-tuned model in a Chrome extension service as a prototype. A detailed description of the technique used to develop the sentence-level dataset and of the variety of fine-tuned models follows in the next Section 3.

3. Data and Models

To the best of our knowledge, there is no dataset available in the literature that identifies technical aspects at the sentence level. Therefore, we proposed to generate a sentence-level dataset based on a paragraph-level dataset called PaSa [21]. The patent paragraphs of PaSa (shown in the top left of Figure 3) represent essential key arguments that are crucial for effective patent reading; they also facilitate critical assessment of the boundaries of an invention. To aid patent practitioners in making decisions during report writing or formal hearings in examinations, AI models trained on such a dataset are necessary. However, it is not always true that all sentences in a specific paragraph reflect its heading. The following excerpt from patent "US10834907B2" shows sentences reflecting both problems and advantages under the same heading, "Technical Problem":

"In summer, when rock oysters come in season, sea areas are highly contaminated ... which causes inhibition of distribution ... Accordingly, an object of the present invention is to provide ... enables the production of virus-free oysters having no experience of being exposed to a sea area ... present invention solves the above-mentioned problems."

Therefore, in this work, we utilized the PaSa dataset to develop sentence-level data for identifying the key technical aspects present in patents. We also adopted a "sentiment" naming convention for the three classes in our dataset: solutions-neutral, advantages-positive, and problems-negative.

The dataset generation and model training in this work can be seen as three stages, as shown in Figure 3.

Figure 3: PaSa sentence-level dataset generation and models, including the types and statistics of the datasets in different settings.

In Stage-I, as a straightforward approach, we used the NLTK tokenizer (https://www.nltk.org/) to convert each paragraph into sentences based on full stops. Further preprocessing removed short sentences of fewer than 20 characters, which are mostly small phrases or fragments built around special symbols.
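As a minimal sketch of this Stage-I split-and-filter step (assuming plain-string paragraphs; the function and variable names are ours, not from the released code):

```python
# Stage-I sketch: split PaSa paragraphs into sentences and drop fragments
# shorter than 20 characters, as described above.
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # NLTK sentence tokenizer models

def paragraph_to_sentences(paragraph, min_chars=20):
    """Return the sentences of a paragraph, skipping very short ones,
    which are mostly stray phrases or special-symbol fragments."""
    return [s.strip() for s in sent_tokenize(paragraph)
            if len(s.strip()) >= min_chars]

paragraph = ("An object of the present invention is to provide a conveyor "
             "which estimates the weight of a transport object. Fig. 1; (a)")
print(paragraph_to_sentences(paragraph))
```

Each surviving sentence inherits the label of the paragraph it came from; this is exactly the weak-labeling assumption whose limits are analyzed in Section 5.2.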
After preprocessing, PaSa_Sentence-Baseline contains 940,000 sentences; Figure 3 displays samples from each class. The dataset is clearly unbalanced, with fewer samples in the positive and negative classes.

To maintain standard experimental settings, as in PaSa, and to avoid class-imbalance problems, we selected only 150k samples (set A) to train the baseline models in Stage-I. The remaining samples were used for other purposes: "except set A" was used in Stage-II, and 650 samples were set aside for manual labeling in Stage-III. In Stage-I, we also used the original PaSa paragraph dataset to train transformer models, since the PaSa paper considered only machine learning models. In Stage-II, we generated an improvised version (set B) of the PaSa_Sentence Baseline data to address errors and shortcomings identified in the baseline data (refer to Section 5.2 for the error analysis). The data samples used for the various purposes (set A, set B, manually labeled data) were kept completely disjoint to avoid bias in model training.

We fine-tuned pre-trained transformer models from the Hugging Face platform (https://huggingface.co/models) on our datasets. With the exception of Bert-For-Patents, the three remaining baseline models (refer to Stage-I) were pre-trained on non-patent literature and hosted on Hugging Face. The naming convention Bert-For-Patents-# indicates that these models were fine-tuned on different datasets. For example, Bert-For-Patents-2 is a completely fresh pre-trained model that was fine-tuned on PaSa paragraph data in Stage-II. In Stage-III, the same (fine-tuned) Bert-For-Patents-2 was used solely for making predictions on "except set A" (i.e., "except set A" played no role in training Bert-For-Patents-2). Thus, all models and datasets were kept separate. The baseline models shown in Figure 3 were fine-tuned with a sequence length of 512 and a batch size of 16, except for Bert-For-Patents-#, which was fine-tuned with a sequence length of 128 and a batch size of 8. The reason for this difference is that Bert-For-Patents-# is an extremely large architecture with 24 hidden layers and creates hardware constraints during fine-tuning, even on an NVIDIA server with an A30 GPU.
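A condensed sketch of one such fine-tuning run (a baseline setting with sequence length 512 and batch size 16) is shown below, assuming a generic BERT checkpoint and a toy two-sentence dataset as stand-ins; the actual checkpoints and data splits are those described above, and the released repository remains authoritative.

```python
# Hedged fine-tuning sketch for the three-class sentence classifier.
# The base checkpoint and the two-example dataset are placeholders.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

base = "bert-base-uncased"  # placeholder; swap in a BERT-for-Patents checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=3)

# Labels: 0 = solution (neutral), 1 = advantage (positive), 2 = problem (negative)
train = Dataset.from_dict({
    "text": ["Accordingly, an object of the invention is to provide a sensor.",
             "The conventional trap frequently misfires and must be reset by hand."],
    "label": [0, 2],
}).map(lambda b: tokenizer(b["text"], truncation=True, padding="max_length",
                           max_length=512), batched=True)

args = TrainingArguments(output_dir="pasa-sentence-clf",
                         per_device_train_batch_size=16, num_train_epochs=3)
Trainer(model=model, args=args, train_dataset=train).train()
```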
For the in-page patent semantic search, we used SQuAD-based question-answering models hosted on Hugging Face (https://huggingface.co/models?pipeline_tag=question-answering&sort=downloads). The best-performing and most downloaded models are BERT Large (uncased), RoBERTa base, and DistilBERT (cased). To the best of our knowledge, no SQuAD-format datasets are available in the patent domain, which opens the door for research into developing a patent question-answering dataset. SQuAD models are suitable for the in-page search in this work because natural-text queries can be answered within a given context (e.g., patent text in chunks), unlike keyword matching, and they can easily be hosted and deployed in Chrome extensions. Therefore, we investigated the aforementioned models for our in-page semantic search extension. The components of the Chromium extension are explained in detail, with the help of a communication architecture, in the next Section 4.
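To make the retrieval mechanism concrete, the following sketch runs the publicly available SQuAD-tuned BERT Large checkpoint (the one referenced in Section 4) over an invented patent snippet; the context and question pairing are illustrative only.

```python
# Sketch of a SQuAD-style in-page search: a free-flow question is answered
# against a chunk of patent text. The context below is an invented example.
from transformers import pipeline

qa = pipeline("question-answering",
              model="bert-large-uncased-whole-word-masking-finetuned-squad")

context = ("Conventional mouse catchers rely on spring-loaded traps that "
           "frequently misfire and must be reset by hand after every catch.")
result = qa(question="What are the problems with conventional mouse catchers?",
            context=context)
# `result` holds the answer span plus its character offsets and a confidence
# score, e.g. {'answer': ..., 'start': ..., 'end': ..., 'score': ...}
print(result)
```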
4. Browser Extensions

The browser extensions developed in this work aim to enhance the readability and understandability of patents. Readability improves when the technical aspects of the patent under consideration are highlighted automatically; this automation draws on the knowledge of the domain-specific AI models fine-tuned in this work, with the respective model deployed behind a Chrome extension (refer to Figure 1). Understandability improves when there is an opportunity to ask cross-questions within a patent document during analysis; this feature is provided by the other extension developed in this work (refer to Figure 2). Patent practitioners can install and activate these two Chrome extensions in their browsers for effective prior art searches (refer to the GIT repository of this work, https://github.com/Renuk9390/expaai_model, for installation). More details, including the usability of the Chrome extensions, request run times, and the responsiveness of the interface, are also provided in the GIT repository.

The browser extensions presented in this paper run on Chromium-based browsers such as Google Chrome, with development in two parts: i) a Python Flask (https://flask.palletsprojects.com/en/2.2.x/) API for the models (the backend), and ii) the Chromium extension (the front end). We used Flask to develop an API for our models; to obtain predictions from our fine-tuned models, we utilized Hugging Face transformers pipelines (https://huggingface.co/docs/transformers/main_classes/pipelines), and we hosted our fine-tuned models in the Hugging Face repository so that the pipelines can use them. The API has two POST endpoints, one for each task (classification/sentiment-predict and in-page semantic search). The classification POST endpoint accepts an array of sentences from the document opened in the browser, collects the prediction response from the transformer pipeline with our fine-tuned model (Bert-For-Patents-3), and assigns classes to the sentences. For the semantic search POST endpoint, a context (the complete patent text in our case) and a question are given as input and passed to a question-answering model pipeline (e.g., BERT Large trained on SQuAD, https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad). The response is an answer (start and end positions of text within the considered context) to the searched question.
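A hedged sketch of this backend follows. The two route names mirror the endpoint names mentioned in Section 5 (sentiment_predict and semantic_search), but the JSON field names and the classifier checkpoint path are our placeholders; the released repository is authoritative.

```python
# Minimal Flask backend sketch with the two POST endpoints described above.
# CLASSIFIER_PATH is a placeholder for the fine-tuned Bert-For-Patents-3.
from flask import Flask, jsonify, request
from transformers import pipeline

CLASSIFIER_PATH = "path/to/bert-for-patents-3"  # placeholder checkpoint
app = Flask(__name__)
classify = pipeline("text-classification", model=CLASSIFIER_PATH)
answer = pipeline("question-answering",
                  model="bert-large-uncased-whole-word-masking-finetuned-squad")

@app.route("/sentiment_predict", methods=["POST"])
def sentiment_predict():
    sentences = request.get_json()["sentences"]   # array of page sentences
    return jsonify(classify(sentences))           # label + probability each

@app.route("/semantic_search", methods=["POST"])
def semantic_search():
    payload = request.get_json()                  # {"context": ..., "question": ...}
    return jsonify(answer(question=payload["question"],
                          context=payload["context"]))

if __name__ == "__main__":
    app.run(port=5000)
```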
We used chrome-extension-cli (https://github.com/dutiyesh/chrome-extension-cli) to develop the Chromium extension, together with JavaScript, HTML, and CSS for data handling and styling. The communication architecture of the browser extension and its components is shown in Figure 4.

Figure 4: Browser extension communication architecture with its components.

The functionalities of the individual components are as follows:

• Popup: The component that becomes visible when the browser extension icon is clicked; it is the only point of contact between the user and the extension. The popup provides the buttons for classification with multi-color highlighting as well as the search bar, and its Loader indicates whether a task is running or stopped. The Popup script communicates with both the "Content" and "Background" components. Text content from the web page is accessed, analyzed (predictions, answers), and highlighted in the final step.

• Content: This component collects the text of the opened web page and communicates with both the "Background" and "Popup" components; it receives messages from the "Popup" script and exchanges messages with the "Background" component. It prepares the content for analysis and highlights the relevant content on the web page based on the predictions received from the "Background" component. Highlighting the content (sentences and answers) is one of the salient tasks of the "Content" component; it is achieved via the "div" number or "class" of the respective matched answer or sentence in the HTML page.

• Background: This is the only component that communicates with the Flask API backend. When it receives a message from the "Content" component with a task payload, it calls the corresponding API endpoint with the inputs. Background listens for two types of messages from Content: "Patent_Text", for highlighting technical aspects according to their predicted class, and "Patent_Semantic_Search", for the in-page search. After receiving a response from the API, it forwards the response to "Content" for further processing. In addition, Background sends the messages task_started and task_stopped to "Popup" to keep the "Loader" busy or ready for the user's next task. More details on the communication between components can be found in the code repository of this paper.

5. Findings

In this section, we discuss the results of this work and perform an error analysis showing how a dataset representation problem affects model performance.

5.1. Scores and Test Cases

Table 1 displays the classification accuracies of the models developed in this work on the PaSa_Sentence Baseline_preprocessed dataset. Bert-for-Patents-1 outperforms the other models, presumably because it was pre-trained by Google on patent literature. As a result, we opted to use only the Bert-for-Patents pre-trained architecture in Stage-II.

Table 1: Classification scores at the sentence level
Data Size | Model              | Accuracy
150k      | BerTweet           | 80%
150k      | Bert base          | 83.5%
150k      | DistilBert         | 84%
150k      | Bert-for-Patents-1 | 86.30%

The PaSa_Sentence Improvised Dataset (refer to Stage-II in Figure 3) was used to fine-tune Bert-for-Patents-3; owing to the improvements in the dataset, this model reaches an accuracy of 97.11%. As shown in Figure 3, Bert-for-Patents-2, fine-tuned at the paragraph level with an accuracy of 98.13%, is competent enough to represent the classes. Therefore, we decided to use Bert-for-Patents-2 to obtain improved samples from our Baseline_preprocessed dataset, keeping only those samples whose prediction score exceeded 70% under Bert-for-Patents-2.
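This confidence filter reduces to a few lines; in the sketch below, the paragraph-model checkpoint path is a placeholder and the 0.7 threshold is the one stated above.

```python
# Stage-II/III relabeling sketch: the paragraph-level model scores each
# baseline sentence, and only confident predictions (> 70%) are kept for
# the improvised dataset. The checkpoint path is a placeholder.
from transformers import pipeline

scorer = pipeline("text-classification", model="path/to/bert-for-patents-2")

def filter_confident(sentences, threshold=0.70):
    kept = []
    for sentence, pred in zip(sentences, scorer(sentences)):
        if pred["score"] > threshold:  # model's top-class probability
            kept.append({"text": sentence, "label": pred["label"]})
    return kept
```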
For the in-page semantic search, we use models (BERT Large uncased, RoBERTa base, and DistilBERT cased) that are fine-tuned on SQuAD data. To our knowledge, there are no SQuAD-formatted datasets in the patent domain for in-page question answering; therefore, we do not fine-tune these models on any patent data in this work, but only run test cases to compare and evaluate them. For the test cases, we considered various contexts (patent texts) and questions to compare the answering capability of these models. DistilBERT is competitive with BERT Large in some cases: as depicted in Figure 5, we provided the same context and question to the aforementioned models, and while BERT Large was superior in retrieving the answer, DistilBERT also retrieved the correct answer reasonably well. In most cases, the BERT Large uncased model was better at finding accurate answers to longer queries, which are common in patent searches. Therefore, BERT Large is deployed in the in-page semantic search extension.

Figure 5: An example of SQuAD-based in-page patent semantic search tested on different models.

To test and debug the API endpoints, we used an open-source application called Insomnia (https://docs.insomnia.rest/insomnia/get-started). We sent Insomnia test requests to the in-page semantic search endpoint and the classification (aka sentiment_predict) endpoint. For example, we passed an array of sentences to the sentiment_predict endpoint, and the fine-tuned model returned a response with the label and prediction probability score for each sentence. Similarly, for semantic_search, we passed a sample patent text as context along with a question, and the retrieved response included the begin and end token positions of the possible answer snippet together with confidence scores. After confirming that the APIs functioned as intended, we deployed them in the Chromium extensions.
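The same smoke tests can be scripted; the snippet below is a hedged requests-based equivalent of the Insomnia checks, assuming the backend sketched in Section 4 runs locally on port 5000 with the placeholder JSON fields used there.

```python
# Scripted equivalent of the Insomnia smoke tests for both endpoints.
# Host, port, and JSON field names follow our earlier backend sketch.
import requests

API = "http://localhost:5000"

resp = requests.post(f"{API}/sentiment_predict",
                     json={"sentences": ["An object of the invention is "
                                         "to reduce sensor noise."]})
print(resp.json())  # expected: one label + probability per sentence

resp = requests.post(f"{API}/semantic_search",
                     json={"context": "Full patent text goes here ...",
                           "question": "What problem does the invention solve?"})
print(resp.json())  # expected: answer span with start/end positions and score
```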
5.2. Error Analysis

Labels are assigned to the sentence-level dataset of this work in three different ways: firstly, by automatic labeling based on the NLTK tokenizer (Stage-I); secondly, by the fine-tuned paragraph model (Stage-II); and thirdly, by manual annotation (Stage-III). Although the baseline models developed in this work show good accuracy, there are cases in which a model's validation loss is lower than its training loss at the end of the third epoch, i.e., the validation data was easier to predict than the training data was to learn. This signals a dataset representation problem: the classes are not equally well represented by all samples, for the reasons shown below. Models fine-tuned on such poorly representative data acquire a bias in predicting the validation set. The PaSa_Sentence Baseline_preprocessed data contains various examples of substandard training samples.

Example 1: "In the view of the problem of the background art, it is an object of the present invention to provide a conveyor which estimates the weight of a transport object while it is carried without using devices such as a load cell which directly measures weight."

Observation 1: This example is automatically labeled as the negative class during PaSa_Baseline generation, but it is not labeled negative under manual annotation. During patent drafting, mostly in "Technical Problem" paragraphs, attorneys and applicants commonly use underlined phrases to briefly restate their invention while describing the problems of other prior art. If sentences with such underlined phrases end up in the negative class, such samples can be discarded.

Example 2: "An embodiment provides a lighting device in which an optical plate is disposed on at least one light source and a light source module including the same."

Observation 2: This sentence, and others like it, are automatically labeled as negative even though they are not. This indicates that the mixed opinions sometimes present at the paragraph level also surface in individual sentences.

There are further samples that are very long (60-70 words), where smaller sentences are joined by special symbols such as ";", ",", or ":". Manually checking every such sample in a large dataset is laborious. Therefore, we decided to fine-tune a model at the paragraph level, so that it would acquire a stronger understanding of how the advantage, problem, and solution classes are represented in patent text. This fine-tuned model is used to retain only the sentences that show at least a 70% probability of representing a class. We then used these improvised samples (the PaSa_Sentence Improvised Dataset) to fine-tune a new model (Bert-For-Patents-3), which outperforms the other baseline models in terms of both accuracy and class representativeness.

We manually labeled 650 randomly selected samples that were not used in any of the experiments; the original labels of these samples from PaSa_Baseline_preprocessed were kept separate. To verify the presence of bias and representation problems in the baseline models, we compared the prediction accuracies of the manual labels, the baseline models, and Bert-for-Patents-3. The manual and Bert-for-Patents-3 prediction accuracies were 68.59% and 69.05%, respectively: Bert-for-Patents-3, fine-tuned on the improved dataset, predicts closest to the manual labels. The baseline models, by contrast, showed inflated scores due to bias, with accuracies of 87.80% (DistilBert base uncased), 87.04% (Bert base uncased), and 94.06% (Bert-for-Patents-1). Therefore, Bert-for-Patents-3 is the more suitable model for the highlighting Chrome extension.

Technical aspects in a patent represent advantages over the prior art, proposed solutions, or problems with other prior art, and the core objective of this work was to identify and highlight these aspects automatically. Although this objective may resemble a sentiment analysis problem, general sentiment analysis datasets and algorithms are not suitable for the task. Our sentence-level dataset is distinct from sentiment analysis datasets such as IMDB (https://www.imdb.com/interfaces/) and Amazon product reviews (https://cseweb.ucsd.edu/~jmcauley/datasets.html): those datasets mostly contain sentences expressing people's opinions on products, things, or other social matters, whereas our dataset captures the key technical arguments in patents that demonstrate an invention's technical capabilities in comparison with the prior art. Most importantly, our dataset is specific to the patent domain and accounts for patent-specific vocabulary and knowledge.

6. Conclusion and Future Work

In this work, we present a multi-class dataset at the sentence level for highlighting the technical subject matters of patents, which can serve as important key arguments in determining a patent's novelty. We fine-tuned language models on our new dataset and developed a Chromium extension that automatically highlights key arguments based on predictions whose probability exceeds 70%. We also developed another Chromium extension to facilitate in-page semantic search.

We anticipate a growing need for AI-based tools that assist patent practitioners in conducting prior art searches, and we hope this empirical work serves as preliminary research that motivates researchers and patent practitioners to develop tools that can automate them. Future work in this area could identify additional technical aspects in patent documents and train new classes for highlighting; in this study, we focused only on advantages, problems, and solutions. Furthermore, the sentence-level data could be improved to enhance the representativeness of the samples belonging to each class: for example, sentences representing "advantages" should not be mixed with sentences related to "problems".

Developing a question-answering dataset in the patent domain is crucial, as such datasets can be used to build tools that automate in-page semantic search. We also hope that AI-based tools for assisting prior art searches will enhance the interaction of patent analysts with patent documents; for instance, the automatic highlighting and semantic search tools prototyped in this work allow cross-questioning within any patent document opened in a web browser.

Acknowledgments

This research is part of the project "BigScience", which is funded by the Bavarian State Ministry for Economic Affairs, Regional Development, and Energy under grant number DIK0259/01.
References

[1] R. Srebrovic, J. Yonamine, Leveraging the BERT algorithm for patents with TensorFlow and BigQuery, Technical Report, Global Patents, Google, 2020.
[2] P. Bajaj, C. Xiong, G. Ke, X. Liu, D. He, S. Tiwary, T.-Y. Liu, P. Bennett, X. Song, J. Gao, METRO: Efficient denoising pretraining of large scale autoencoding language models with model generated signals, arXiv preprint arXiv:2204.06644 (2022).
[3] B. Patra, S. Singhal, S. Huang, Z. Chi, L. Dong, F. Wei, V. Chaudhary, X. Song, Beyond English-centric bitexts for better multilingual language representation learning, arXiv preprint arXiv:2210.14867 (2022).
[4] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[5] P. Lahorte, Inside the mind of an EPO examiner, World Patent Information 54 (2018) S18–S22.
[6] A. Shinmori, M. Okumura, Y. Marukawa, M. Iwayama, Patent claim processing for readability: structure analysis and term explanation, in: Proceedings of the ACL-2003 Workshop on Patent Corpus Processing, 2003, pp. 56–65.
[7] G. Ferraro, H. Suominen, J. Nualart, Segmentation of patent claims for improving their readability, in: Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR), 2014, pp. 66–73.
[8] A. Shinmori, M. Okumura, Y. Marukawa, Aligning patent claims with detailed descriptions for readability, in: NTCIR, 2004.
[9] S. Sheremetyeva, Natural language analysis of patent claims, in: Proceedings of the ACL-2003 Workshop on Patent Corpus Processing, 2003, pp. 66–73.
[10] L. Rello, H. Saggion, R. Baeza-Yates, Keyword highlighting improves comprehension for people with dyslexia, in: Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR), 2014, pp. 30–37.
[11] S. Spala, F. Dernoncourt, W. Chang, C. Dockhorn, A web-based framework for collecting and assessing highlighted sentences in a document, in: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, 2018, pp. 78–81.
[12] J. J. Li, K. Thadani, A. Stent, The role of discourse units in near-extractive summarization, in: Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2016, pp. 137–147.
[13] K. Woodsend, M. Lapata, Automatic generation of story highlights, in: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 2010, pp. 565–574.
[14] M. Kaisser, M. A. Hearst, J. B. Lowe, Improving search results quality by customizing summary lengths, in: Proceedings of ACL-08: HLT, 2008, pp. 701–709.
[15] F. I. Craik, R. S. Lockhart, Levels of processing: A framework for memory research, Journal of Verbal Learning and Verbal Behavior 11 (1972) 671–684.
[16] H. W. Faw, T. G. Waller, Mathemagenic behaviours and efficiency in learning from prose materials: Review, critique and recommendations, Review of Educational Research 46 (1976) 691–720.
[17] R. Ennals, B. Trushkowsky, J. M. Agosta, Highlighting disputed claims on the web, in: Proceedings of the 19th International Conference on World Wide Web, 2010, pp. 341–350.
[18] M. Yeari, M. Oudega, P. van den Broek, The effect of highlighting on processing and memory of central and peripheral text information: Evidence from eye movements, Journal of Research in Reading 40 (2017) 365–383.
[19] J. A. Brown, K. Knollman-Porter, K. Hux, S. E. Wallace, C. Deville, Effect of digital highlighting on reading comprehension given text-to-speech technology for people with aphasia, Aphasiology 35 (2021) 200–221.
[20] A. Winchell, A. Lan, M. Mozer, Highlights as an early predictor of student comprehension and interests, Cognitive Science 44 (2020) e12901.
[21] R. Chikkamath, V. R. Parmar, C. Hewel, M. Endres, Patent sentiment analysis to highlight patent paragraphs, arXiv preprint arXiv:2111.09741 (2021).