1. Introduction

Explainable Artificial Intelligence for Highlighting and Searching in Patent Text

Renukswamy Chikkamath

Rana Fassahat Ali

Christoph Hewel

Markus Endres

PAUSTIAN & PARTNERS, Munich, Germany; University of Applied Sciences, Munich, Germany; University of Passau, Passau, Germany


The verbose content and redundant information present in patents often add complexity to reading and understanding them. Individual subject matters related to an invention and its decisiveness are scattered throughout patent documents. Moreover, these matters could provide relevant key arguments for an effective examination or critical assessment of an invention. To address these complexities and facilitate patent practitioners' efficient reading and in-page semantic searches of patents, we generated a multiclass dataset representing key arguments of patents on a sentence level. Essentially, these key arguments are the concrete details related to an invention, such as the problem it solves or the technical effects or advantages it achieves. We fine-tuned transfer learning models on this novel dataset and developed two Chromium extensions. One extension automatically highlights these key arguments using our fine-tuned model, and the other supports semantic search within any patent document opened in the browser. The data and code related to this work are released to the community via a Git repository. The empirical test cases and manually labeled gold truth data provide evidence supporting our hypothesis regarding in-page patent search and efficient reading, respectively.

Keywords: patent analysis, prior art search, patent language model, sentence classification, patent datasets

1. Introduction

1.1. Motivation

A patent is a form of intellectual property that provides the owner with legal rights to prohibit others from producing, using, or selling the invention. However, these rights are granted in exchange for disclosing how the invention works. Before a patent can be granted, it must undergo a rigorous examination process, known as the prior art search. This search is typically conducted in two stages: the first stage occurs in the early stages of the patent life cycle when patent attorneys draft the patent application. The second stage takes place in the later stages of the patent life cycle when patent examiners review the patent application.

Since patent claims define the scope of protection, finding any prior art or other competing art that can be used as evidence for the proposed claims is a crucial step. A patent does not only comprise one or several claims defining the legal scope of protection but also a specification providing one or several specific embodiments of the invention. Patent owners tend to keep the specification as general as possible, which may not only be advantageous for further broadening the scope of protection but may also relieve the patent owners from publishing their developed technology. Therefore, most parts of the specification only repeat the text of the patent claims and add generalized boilerplate text concerning the functioning of an invention. Even if a patent specification may typically be 10 to 30 pages long, there are only a few short text passages that explain the concrete technical effects of the invention.

Therefore, it is often challenging for patent practitioners, including attorneys and examiners, to comprehend the invention’s definition in the claims, which problem is addressed by the invention, or which technical effects or benefits are achieved by the invention. However, without understanding the motivation behind the invention, it is difficult to compare it with other inventions when assessing its inventive step over the prior art.

For example, suppose the claimed invention defines a heating system with three temperature sensors. In that case, the closest prior art document, such as an older patent, may only disclose a heating system with two temperature sensors. In such cases, important questions arise, such as: What is the technical effect of the third sensor? Why does the prior art suggest only two sensors? If the motivations behind the two concepts are completely different, the claimed invention might be considered as implying an inventive step over the prior art.

PatentSemTech'23: 4th Workshop on Patent Text Mining and Semantic Technologies, colocated with the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 27th, 2023, Taipei, Taiwan.
* Corresponding author.
renukswamy.chikkamath@hm.edu (R. Chikkamath); ali11@ads.uni-passau.de (R. F. Ali); hewel@paustian.de (C. Hewel); markus.endres@hm.edu (M. Endres)

CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073. License: Creative Commons Attribution 4.0 International (CC BY 4.0).

Consequently, patent analysis often requires retrieving those few text passages in a patent that can reveal the motivation behind the claimed invention. In this work, we aim to address the aforementioned difficulties and ease the prior art search. Specifically, we focus on automatically highlighting these text passages with the help of a Chrome extension (Analyse), as shown in Figure 1. The Analyse extension is supported by an Artificial Intelligence (AI) model that is fine-tuned on a novel dataset developed in this work. We also present a Chrome extension (search text box) to facilitate cross-questioning during patent analysis, as depicted in Figure 2.

1.2. Highlight and Search in Patent Text

The quality of a patent prior art search is greatly influenced by the readability and understandability of patents. In prior art search or patent analysis in general, the most important parts of patents are considered to be the claims and the technical description, which disclose and describe an invention, respectively. Since the claims are written in legal terminology, they are often difficult to understand just by reading them alone. Detailed descriptions of patents represent key arguments of any invention, such as advantages, solutions, problems, and justifications for claim features. Understanding and differentiating the above-mentioned points in a timely manner aids examiners and attorneys in critical assessments and effective analysis in the light of prior art. The focus of this work is to ease the readability and understandability of patents, unlike investigations of information retrieval or prior art search approaches.

The AI-based assistance presented in this work is in greater demand when individual patents are considered for analysis, and this assistance has two main benefits. Firstly, it provides ease of readability by automatically highlighting technical aspects related to the invention on the sentence level. Secondly, it offers deeper understanding by allowing readers to ask various cross-questions. For example, the question “What are the problems with conventional mouse catchers?” in the patent Mousetrap (footnote 1) can be searched, as shown in Figure 2. Such a tool can enhance the user experience by providing the opportunity to explore the documents in greater detail and work with the semantics and context of the patent text, unlike keyword-matching in-page searches (Ctrl+F based search).

1 https://patents.google.com/patent/US8943741

Highlighting at the sentence level is more interesting and important than at the keyword or paragraph level. This is because keywords in patents can be succinct but do not provide any evidence to understand the context in which key arguments are used. On the other hand, paragraphs can be informative but can contain mixed opinions. For example, individual sentences explaining different arguments of inventions (advantageous effects, problems, solutions) can be visible in one paragraph. Therefore, in this work, we focus on identifying and highlighting key arguments only at the sentence level.

In this paper, we present a sentence-level patent dataset designed to highlight key arguments for any invention at the sentence level. This is a multi-class dataset that was utilized to fine-tune Bert-for-Patents [1]. We developed two Chromium extensions: one for automatically highlighting arguments, facilitated by the internally fine-tuned Bert-for-Patents model; and another for in-page semantic search based on SQuAD (footnote 2) models. A free-flow natural language query can be used to search within the opened document on the web. Both extensions can work on text present in any web page or document. However, the experiments so far are limited to Google Patents (footnote 3), for example, a patent opened in Google Patents as shown in Figures 1 and 2.

The remainder of this work is organized as follows: Section 2 describes related work. Section 3 explains the methodologies used to develop the data, with a detailed multi-stage flowchart to describe the models developed in this work. Section 4 outlines the browser extension communication architecture. In Section 5, we discuss the results achieved in this work, including a sample test case. Finally, in Section 6, we conclude our work and suggest possible future directions.

2 https://rajpurkar.github.io/SQuAD-explorer/
3 https://patents.google.com/

2. Related Work

The research aspects of this work are related to the intersection of tasks such as text highlighting, sentence classification, and question answering in the field of Natural Language Processing (NLP).

In recent times, language representation learning, also known as language model development, and research on reading comprehension, such as question-answering models, have grown rapidly in the field of NLP. Notable models that have achieved top performance include Turing NLR-v5 [2] and Turing ULR-v6 [3]. These models outperform other state-of-the-art models in both sentence classification, such as on the GLUE (footnote 4) benchmark with the Stanford Sentiment Treebank (SST-2) dataset, and question answering, based on the Stanford Question Answering Dataset (SQuAD). Various other variants (footnote 5) of the BERT [4] architecture can also be seen as competitors in various settings. In recent years, Google released a language model pre-trained on patent data called BERT-for-Patents [1]. Since this model is trained on more than 100 million patents, unlike the above-mentioned general-purpose models, we have used it to fine-tune our classification model.

Text highlighting in this context emphasizes the significance of readability and understandability of patent and non-patent text. There is evidence in the literature regarding how patent examiners from the European Patent Office (EPO) initially read patent documents to come to a preliminary understanding of the patent. In particular, there is a greater need for developing tools to assist them in skimming through patents and achieving a deeper understanding of the contents [5]. Moreover, there is a lot of motivation from patent attorneys on the web to assess the parameters for patentability and skim through the document to find individual subject matters (footnotes 6, 7).

Although there is much interest in the readability of patents [6, 7, 8, 9], these approaches are limited to the analysis of claims. However, segmentation and analysis of claims are other segments of research in prior art search. To the best of our knowledge, there are no approaches that focus on patent text at the sentence level to highlight relevant key arguments. Highlighting important aspects of the text in the context of education/learning is not new [10]. In other non-patent domains, generating and providing a quick summary with highlighted text is proposed to emphasize textual elements [11, 12, 13, 14]. Text highlighting in general encourages a thorough understanding of a document [15] and also supports easier subsequent literature study [16]. To ease access, developing browser extensions to highlight text on the web has drawn attention. For instance, highlighting the disputed claims on web pages and finding the relevant article from the web for facilitating the arguments in claims is proposed by Ennals et al. [17]. Other related research also showed that reading comprehension can be attained by text highlighting on the web or any digital text content [18, 19, 14, 20].

4 https://gluebenchmark.com/leaderboard
5 https://huggingface.co/models?sort=downloads&search=bert
6 https://www.heerlaw.com/difference-patentability-assessment-patent-search
7 https://www.brmpatentattorneys.com.au/intellectual-property-law-melbourne/how-to-read-a-patent/

In the patent domain, a few private-sector providers have developed solutions for multi-color highlighting of keywords (footnotes 8, 9). However, such approaches would not be efficient because patent applications can be written using different terminologies even for the same concept. Furthermore, considering the context in addition to keywords adds domain knowledge that can explain why a particular keyword was highlighted. Additionally, these solutions are paid, and the reader has to manually find and highlight keywords. These solutions are more like digital pens to highlight and keep a record of keywords, which is again a time-consuming task.

To utilize AI models in the process of automatic highlighting in the patent domain, IPGoggles (footnote 10) (one of the motivations for this paper) proposes a new-age cloud-based solution. This service highlights keywords or even phrases in patents based on sentiment. Professionals believe that reading and understanding patents becomes challenging, even at an individual document level, given the huge amount of prior art. However, IPGoggles utilizes general-purpose AI models that are not fine-tuned on patent data to identify technical aspects or key arguments. In the patent domain, researchers have developed a dataset (PaSa) to identify the technical aspects of patent documents on a paragraph level [21]. It contains patent paragraphs named under the headings “Technical Problem,” “Solution to Problem,” and “Advantageous Effects of Invention.”

In PaSa, United States Patent and Trademark Office (USPTO) (footnote 11) patent grants from 2010 to 2020 were searched to identify the technical aspects mentioned in clear and distinguishable paragraphs. The authors argue that these paragraphs are not common in all patents, but rather reflect a patent drafting style (based on region) that is mostly followed by Asia-specific patents. Moreover, it is even harder to find these specific paragraphs in Asia-specific patents before 2010 (refer to Table 4 [21], which shows a gradual decrease in the number from 2020 to 2010). This provides strong motivation to utilize these important and infrequent paragraphs as the basis of our investigations. However, the PaSa dataset has not been used in any downstream application or tool so far. Therefore, we decided to develop a dataset using PaSa and to use it to further train AI models that can be used in downstream applications, such as a Chrome extension.

In the state of the art, there is either evidence of highlighting technical aspects based on general-purpose AI models or evidence of a dataset to identify technical aspects on the paragraph level. However, to the best of our knowledge, there are no approaches that focus on identifying and highlighting technical aspects on a sentence level using a domain and task-specific dataset. Therefore, in this work, we propose and develop a dataset for finding technical aspects on a sentence level (refer to Section 1.2 for why the sentence level is preferred). Furthermore, we utilize this dataset to fine-tune a patent domain-specific language model. This fine-tuned model is deployed in a Chrome extension service as a prototype. The technique used to develop the sentence-level dataset and the variety of models fine-tuned are described in detail in Section 3.

8 https://help.patsnap.com/hc/en-us/articles/115005478629-What-Can-I-Do-When-I-View-A-Patent
9 https://patseer.com
10 https://ipgoggles.com/
11 https://developer.uspto.gov/product/patent-grant-full-text-dataxml

3. Data and Models

To the best of our knowledge, there is no dataset available in the literature that identifies technical aspects at the sentence level. Therefore, we proposed to generate a sentence-level dataset based on a paragraph-level dataset called PaSa [21]. The patent paragraphs of PaSa (shown in the top left of Figure 3) represent essential key arguments that are crucial for effective patent reading. They also facilitate critical assessment of the boundaries of an invention. To aid patent practitioners in making decisions during report writing or formal hearings in examinations, AI models trained on such a dataset are necessary. However, it is not always true that all sentences in a specific paragraph represent the heading. The following excerpt from the patent “US10834907B2” shows that there are sentences reflecting both problems and advantages under the same heading “Technical Problem”.

For example: "In summer, when rock oysters come in season, sea areas are highly contaminated. . . which causes inhibition of distribution...Accordingly, an object of the present invention is to provide . . . enables the production of virus-free oysters having no experience of being exposed to a sea area. . . . present invention solves the above-mentioned problems."

Therefore, in this work, we utilized the PaSa dataset to develop sentence-level data for identifying the key technical aspects present in patents. We also used the “sentiments” naming convention for the three classes in our dataset, which are solutions-neutral, advantages-positive, and problems-negative.

The dataset generation and model training in this work can be seen in three stages, as shown in Figure 3. In Stage-I, as a straightforward approach, we used the NLTK tokenizers (footnote 12) to convert a paragraph into sentences based on full stops. Further, preprocessing was carried out to remove smaller sentences containing fewer than 20 characters, which are mostly small phrases or sentences oriented toward special symbols. After preprocessing, PaSa_Sentence-Baseline contains 940,000 sentences, and Figure 3 displays samples from each class. It is clear that the dataset is unbalanced as we have fewer samples in the positive and negative classes.
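The Stage-I preprocessing described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the paper uses the NLTK tokenizers, whereas the sketch below substitutes a plain regular-expression split on full stops so that it runs without extra dependencies. The function names and the sample paragraph are hypothetical; only the 20-character threshold comes from the text.

```python
import re

MIN_CHARS = 20  # sentences shorter than this are dropped, as stated in the text


def split_into_sentences(paragraph: str) -> list[str]:
    """Split a paragraph after each full stop that is followed by whitespace.

    Assumption: a naive stand-in for the NLTK tokenizer used in the paper.
    """
    parts = re.split(r"(?<=\.)\s+", paragraph.strip())
    return [part.strip() for part in parts if part.strip()]


def paragraph_to_samples(paragraph: str, label: str) -> list[tuple[str, str]]:
    """Give every sufficiently long sentence the label of its paragraph."""
    return [
        (sentence, label)
        for sentence in split_into_sentences(paragraph)
        if len(sentence) >= MIN_CHARS
    ]


# Hypothetical paragraph taken from a "Technical Problem" heading.
paragraph = (
    "Conventional traps are difficult to clean after use. See FIG. 2. "
    "They also expose the user to the captured animal."
)
samples = paragraph_to_samples(paragraph, "negative")
```

Because every sentence inherits the label of its paragraph, a paragraph with mixed content yields mislabeled sentences, which is the representation problem the later stages try to address.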

To maintain standard experimental settings, as in PaSa, and to avoid class imbalance problems, we chose only 150k samples (set A) to train the baseline models in Stage-I. The remaining samples were reserved for other experiments: the “except set A” portion was used in Stage-II, and 650 samples were manually labeled in Stage-III. In Stage-I, we also used the original PaSa paragraph dataset to train transformer models, as the PaSa paper focused only on machine learning models. In Stage-II, we generated an improved version (set B, the “improvised” dataset) of the PaSa_Sentence Baseline data to address errors and shortcomings identified in using PaSa_Sentence Baseline (refer to Section 5.2 for error analysis). The data samples used for the various purposes (set A, set B, manually labeled data) were kept disjoint to avoid bias in training the models.
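One way to keep set A, the “except set A” pool, and the manually labeled samples disjoint is to partition the sample indices once, up front. The sketch below is a hypothetical illustration: the equal per-class quota (the text states only the 150k total for set A), the random seed, and all names are assumptions rather than details from the paper.

```python
import random
from collections import defaultdict


def partition_samples(labels, per_class, manual_size, seed=0):
    """Return three disjoint index lists: set_a, manual, except_a.

    labels      -- one class label per sentence
    per_class   -- assumed quota per class for the balanced set A
    manual_size -- number of held-out samples reserved for manual labeling
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    set_a = []
    for indices in by_class.values():
        rng.shuffle(indices)
        set_a.extend(indices[:per_class])  # same quota for every class

    taken = set(set_a)
    remaining = [i for i in range(len(labels)) if i not in taken]
    rng.shuffle(remaining)
    manual = remaining[:manual_size]
    except_a = remaining[manual_size:]
    return set_a, manual, except_a


# Toy, deliberately unbalanced label list (hypothetical data).
labels = ["neutral"] * 12 + ["positive"] * 5 + ["negative"] * 4
set_a, manual, except_a = partition_samples(labels, per_class=3, manual_size=2)
```

Working on indices rather than on copies of the sentences makes it easy to verify afterwards that no sample appears in more than one subset.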

We took pre-trained transformer models from the Hugging Face platform (footnote 13) and fine-tuned them on our datasets. With the exception of Bert-For-Patents, the remaining three baseline models (refer to Stage-I) were pre-trained on non-patent literature and hosted on Hugging Face. The naming convention (Bert-For-Patents-#) indicates that these models were fine-tuned on different datasets. For example, Bert-For-Patents-2 is a fresh copy of the pre-trained model that was fine-tuned using PaSa paragraph data in Stage-II. In Stage-III, the same fine-tuned Bert-For-Patents-2 was used solely for making predictions on “except set A” (i.e., “except set A” played no role in training Bert-For-Patents-2). Thus, all models and datasets used were kept separate. The baseline models shown in Figure 3 were fine-tuned with a sequence length of 512 and a batch size of 16, except for Bert-For-Patents-#, which was fine-tuned with a sequence length of 128 and a batch size of 8. The reason for this difference is that Bert-For-Patents-# is an extremely large architecture with 24 hidden layers and imposes hardware constraints during fine-tuning, even on an NVIDIA server with an A30 GPU.

For the in-page patent semantic search, we used question-answering models (footnote 14) based on the SQuAD dataset and hosted on Hugging Face. The best-performing and most downloaded models are Bert Large (uncased), RoBerta base, and DistilBert base (cased). To the best of our knowledge, no datasets in SQuAD format are available in the state of the art for the patent domain (which opens the door for research in developing a question-answering dataset in the patent domain). SQuAD models are feasible for in-page searching in this work because natural text queries can be searched within a given context (e.g., patent text in chunks), unlike keyword matches. SQuAD models can be easily hosted and deployed in Chrome extensions. Therefore, we investigated the aforementioned models in our in-page semantic search extension. The components of the Chromium extension are explained in detail with the help of the communication architecture in Section 4.

14 https://huggingface.co/models?pipeline_tag=question-answering&sort=downloads

4. Browser Extensions

The browser extensions developed in this work are aimed at enhancing the readability and understandability of patents. Readability is more effective when the technical aspects of the considered patents are automatically highlighted. This automation is based on knowledge from domain-specific AI models fine-tuned in this work, and the respective model is deployed in a Chrome extension (refer to Figure 1). The understandability of patents is improved when there is an opportunity to ask cross-questions during patent analysis within a patent document. Such a feature is provided by our other extension developed in this work (refer to Figure 2). Patent practitioners can install and activate these two Chrome extensions in their browsers for effective prior art searches (refer to the Git repository (footnote 15) of this work for installation). More details, including the usability of the Chrome extension, request run times, and responsiveness of the interface, are also added to the Git repository.

The browser extensions presented in this paper operate on browsers such as Google Chrome (Chromium based), with development in two parts: i) a Python Flask (footnote 16) API for the models (acts as backend) and ii) a Chromium extension (acts as front end). We used Flask to develop an API for our models; to get the predictions from our fine-tuned models, we utilized Hugging Face transformers pipelines (footnote 17). We hosted our fine-tuned models in the Hugging Face repository to make use of them in pipelines. The API has two POST endpoints, one for each of the tasks (classification/sentiment-predict and in-page semantic search). The classification POST endpoint accepts an array of sentences of any opened document in the browser and collects the prediction response from the transformer pipeline with our fine-tuned model (Bert-For-Patents-3). Further, the endpoint assigns classes to the array of sentences. With respect to the semantic search POST endpoint, a context (complete patent text in our case) and a question are given as input and passed to the question-answering model pipeline (e.g., Bert large trained on SQuAD from Hugging Face (footnote 18)). The response is an answer (start and end positions of text from the context considered) for the question searched.

15 https://github.com/Renuk9390/expaai_model
16 https://flask.palletsprojects.com/en/2.2.x/
17 https://huggingface.co/docs/transformers/main_classes/pipelines
18 https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad

We used chrome-extension-cli (footnote 19) for developing the Chromium extension. In addition, we used technologies such as JavaScript, HTML, and CSS for data handling and styling. The communication architecture of the browser extension with its components is shown in Figure 4. The functionalities of individual components are as follows:

• Popup: The component that is visible when we click the browser extension icon, which acts as the only point of contact between the user and the extension. The popup is responsible for providing buttons for both classifications with multicolor highlighting and a search bar. Additionally, the Loader shows the task being performed or stopped. The Popup script communicates with both the “Content” and “Background” components. Text content from the web page will be accessed, analyzed (predictions, answers), and highlighted in the final step.
• Content: This component collects the text present in the opened web page and communicates with both the “Background” and “Popup” components. The “Content” component is responsible for receiving a message from the “Popup” script and for sending and receiving messages to and from the “Background” component. In this case, it prepares the content for analysis and highlights the relevant content on the web page based on predictions from the “Background” component. Highlighting the content (sentences and answers) is one of the salient tasks of the “Content” component. This is achieved by using a “div” number or “class” on the HTML page for the respective matched answer or sentence to highlight.
• Background: This is the only component communicating with the Flask API backend. When it receives a message from the “Content” component with a payload to perform a task, the API endpoint will be called with inputs. Background listens to two types of messages from Content: “Patent_Text” for highlighting technical aspects based on the type of class it belongs to, and “Patent_Semantic_Search” to accomplish in-page search. After receiving a response from the API, the response will be sent to “Content” for further processing. In addition, Background is also responsible for sending the messages task_started and task_stopped to “Popup” to keep the “Loader” busy or active for taking the next task from the user. More details on the communication of components can be collected via the code base repository of this paper.

19 https://github.com/dutiyesh/chrome-extension-cli

To test and debug the API endpoints for intended functioning, we used an open-source application called Insomnia (footnote 20). We provided Insomnia test requests to the in-page semantic search API and the classification (aka sentiment_predict) API endpoints. For example, we passed an array of sentences to the sentiment_predict API endpoint, and the fine-tuned model returned a response with the label and prediction probability score. Similarly, for semantic_search, we passed a sample patent text as a context along with a question, and the retrieved response included the begin and end token numbers of the possible answer text snippet with confidence scores. After confirming the intended functioning of the APIs using Insomnia tests, we deployed the APIs in the Chromium extensions.

20 https://docs.insomnia.rest/insomnia/get-started

5. Findings

5.1. Scores and Test Cases

In this section, we discuss the results of this work and perform an error analysis to show how the dataset representation problem affects the model performances.

“PaSa_Sentence Improvised Dataset” (refer to Stage-II in Figure 3) is used to fine-tune Bert-for-Patents-3. Due to the improvements made in the dataset, this model shows an accuracy of 97.11%. As shown in Figure 3, Bert-for-Patents-2, fine-tuned on a paragraph level with an accuracy of 98.13%, is competent enough to represent the classes. Therefore, we decided to use Bert-for-Patents-2 to obtain improved samples from our “Baseline_preprocessed Dataset”. We considered only those samples where the prediction score was greater than 70% when predicted by Bert-for-Patents-2.

With respect to in-page semantic search, we utilize models (Bert Large uncased, RoBerta base, and DistilBert base cased) which are fine-tuned on SQuAD data. To our knowledge, there are no SQuAD-formatted datasets in the patent domain to address in-page question answering. Therefore, in this work, we do not fine-tune them on any patent data. Instead, we only perform test cases to compare and evaluate them. For the test cases, we considered various contexts (patent text) and questions to compare the answering capability of said models. DistilBert is competitive with Bert Large in some cases. For instance, as depicted in Figure 5, we provided the same context and question to the aforementioned models. Bert Large exhibited superior performance in retrieving the answer; nevertheless, DistilBert also performed reasonably well in retrieving the correct answer. In most cases, the Bert Large uncased model performed better in finding accurate answers for longer queries (which are common in patent searches). Therefore, Bert Large is deployed in the in-page semantic search extension.

There are three different ways in which the labels are assigned to the sentence-level dataset of this work. Firstly, automatic labeling is based on the NLTK tokenizer (in Stage-I). Secondly, labels are given by the fine-tuned paragraph model (in Stage-II). Thirdly, labels are assigned manually (in Stage-III). Although the “Baseline models” developed in this work show good performances in terms of accuracy, there are cases where the models’ validation loss is less than the training loss at the end of the 3rd epoch. In other words, the validation data was easier for the models to predict than the training data was to learn. This signifies a dataset representation problem, i.e., classes are not equally represented by all the samples because of various reasons as shown below. The models fine-tuned on this poorly represented data induce bias in predicting the validation set. There are various samples in the PaSa_Sentence Baseline_preprocessed data which can be examples of substandard training samples.

Example 1: “In the view of the problem of the background art, it is an object of the present invention to provide a conveyor which estimates the weight of a transport object while it is carried without using devices such as a load cell which directly measures weight.”

Observation 1: The above example is automatically labeled as a negative class during PaSa_Baseline generation, but it is not when we do manual labeling. During patent drafting, mostly in “Technical Problem” paragraphs, attorneys/applicants commonly use underlined phrases to quickly repeat their invention while describing problems with other prior art. If sentences with such underlined phrases are present in the negative class, then such samples can be discarded.

Example 2: “An embodiment provides a lighting device in which an optical plate is disposed on at least one light source and a light source module including the same.”

Observation 2: The above sentence, as well as others that are similar, are automatically labeled as negative even though they are not. This indicates the presence of mixed opinions at times on the paragraph level, which also appears in some sentences.

There are other samples that are very long (60-70 words); in such cases, smaller sentences are joined using special symbols such as ";,:". Manually checking every such sample in large datasets is laborious. Therefore, we decided to fine-tune a model on the paragraph level so that this model would have a greater understanding of the representativeness of classes on advantages, problems, and solutions in a patent text. Such a fine-tuned model is used to consider the sentences that show at least a 70% probability of representing a class. Further, we have used these improvised samples (PaSa_Sentence Improvised Dataset) to fine-tune a new model (Bert-For-Patents-3), which outperforms other baseline models in terms of both accuracy and class representativeness.

We manually labeled 650 randomly selected samples, which were not used in any of the experiments. The original labels for these samples from PaSa_Baseline_preprocessed were kept separate. To verify the presence of bias and representation problems in the baseline models, we compared the prediction accuracies of manual predictions, baseline models, and Bert-for-Patents-3. The manual and Bert-for-Patents-3 prediction accuracies were 68.59% and 69.05%, respectively. Bert-for-Patents-3 was fine-tuned on the improved dataset, and its prediction performance was closer to the manual labels. However, due to bias, the baseline models showed higher scores with accuracies of 87.80% (DistilBert base uncased), 87.04% (Bert base uncased), and 94.06% (Bert-for-Patents-1). Therefore, Bert-for-Patents-3 is more suitable for use in the Chrome extension for highlighting technical aspects.

Technical aspects in a patent represent advantages over the prior art, proposed solutions, or problems with other prior art. The core objective of this work was to automatically identify and highlight these aspects in patents. Although this objective may resemble a sentiment analysis problem, general sentiment analysis datasets or algorithms are not suitable for this task. Our sentence-level dataset is distinct from other sentiment analysis datasets such as IMDB (footnote 21) and Amazon product reviews (footnote 22). These datasets mostly contain sentences expressing people’s opinions on products, things, or other social aspects. In contrast, our dataset highlights the key technical arguments in patents that demonstrate the invention’s technical capabilities in comparison to the prior art. Most importantly, our dataset is specific to the patent domain and accounts for patent-specific vocabulary and knowledge.

21 https://www.imdb.com/interfaces/
22 https://cseweb.ucsd.edu/~jmcauley/datasets.html

6. Conclusion and Future Work

In this work, we present a multi-class dataset at the sentence level to highlight the technical subject matters of patents, which can serve as important key arguments to determine a patent’s novelty. We fine-tuned language models on our new dataset and developed a Chromium extension to automatically highlight key arguments based on predictions, provided the probability exceeds 70%. We also developed another Chromium extension to facilitate in-page semantic search.
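The 70% rule mentioned above can be illustrated with a minimal sketch. It assumes that the classifier returns one probability per class for each sentence. The function name `select_highlights`, the label names, and the example probabilities are hypothetical and are not taken from the released code. The sketch also treats a probability of exactly 70% as passing, which is an assumption about the boundary.

```python
from typing import Dict, List, Tuple

# Probability cut-off stated in the text; inclusive boundary is an assumption.
THRESHOLD = 0.70


def select_highlights(
    sentences: List[str],
    probabilities: List[Dict[str, float]],
    threshold: float = THRESHOLD,
) -> List[Tuple[str, str, float]]:
    """Keep sentences whose most probable class reaches the threshold.

    Returns (sentence, label, probability) triples in document order.
    """
    highlights = []
    for sentence, class_probs in zip(sentences, probabilities):
        # Pick the class with the highest predicted probability.
        label, prob = max(class_probs.items(), key=lambda item: item[1])
        if prob >= threshold:
            highlights.append((sentence, label, prob))
    return highlights


# Illustrative inputs only: made-up sentences and made-up model outputs.
example_sentences = ["Sentence A.", "Sentence B.", "Sentence C."]
example_probabilities = [
    {"advantage": 0.91, "problem": 0.05, "solution": 0.04},
    {"advantage": 0.40, "problem": 0.35, "solution": 0.25},
    {"advantage": 0.10, "problem": 0.15, "solution": 0.75},
]

for sentence, label, prob in select_highlights(example_sentences, example_probabilities):
    print(f"{label} ({prob:.2f}): {sentence}")
```

In this toy input, the second sentence is left unhighlighted because no class reaches the cut-off, which mirrors how low-confidence predictions are kept out of the highlighted text.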

We anticipate a growing need for AI-based tools to assist patent practitioners in conducting patent prior art searches. We hope this empirical work serves as preliminary research and motivates researchers and patent practitioners to develop tools that can automate prior art searches. Future work in this area could identify additional technical aspects in patent documents and train new classes for highlighting. For this study, we focused only on advantages, problems, and solutions. Furthermore, sentence-level data could be improved to enhance the representativeness of samples belonging to a particular class. For example, sentences representing "advantages" should not be mixed with sentences related to "problems".

Developing a question-answering dataset in the patent domain is crucial, and such datasets can be used to develop tools to automate in-page semantic searches. We also hope that AI-based tools to assist prior art searches will enhance the interaction of patent analysts with patent documents. For instance, the automatic highlight and semantic search tools prototyped in this work can allow for cross-questioning within any patent document opened in a web browser.

Acknowledgments

This research is part of the project "BigScience", which is funded by the Bavarian State Ministry for Economic Affairs, Regional Development, and Energy under the grant number DIK0259/01.

References

[1] Srebrovic, Yonamine, Leveraging the BERT algorithm for Patents with TensorFlow and BigQuery, Technical Report, Global Patents, Google, https://services.google.com/fh..., 2020.

[2] Bajaj, Xiong, Ke, Liu, He, Tiwary, T.-Y. Liu, Bennett, Song, Gao, Metro: Efficient denoising pretraining of large scale autoencoding language models with model generated signals, arXiv preprint arXiv:2204.06644 (2022).

[3] Patra, Singhal, Huang, Chi, Dong, Wei, Chaudhary, Song, Beyond english-centric bitexts for better multilingual language representation learning, arXiv preprint arXiv:2210.14867 (2022).

[4] Devlin, M.-W. Chang, Lee, Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).

[5] Lahorte, Inside the mind of an epo examiner, World Patent Information 54 (2018) S18-S22.

[6] Shinmori, Okumura, Marukawa, Iwayama, Patent claim processing for readability-structure analysis and term explanation, in: Proceedings of the ACL-2003 workshop on Patent corpus processing, 2003, pp. 56-65.

[7] Ferraro, Suominen, Nualart, Segmentation of patent claims for improving their readability, in: Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR), 2014, pp. 66-73.

[8] Shinmori, Okumura, Marukawa, Aligning patent claims with detailed descriptions for readability, in: NTCIR, 2004.

[9] Sheremetyeva, Natural language analysis of patent claims, in: Proceedings of the ACL-2003 workshop on Patent corpus processing, 2003, pp. 66-73.

[10] Rello, Saggion, Baeza-Yates, Keyword highlighting improves comprehension for people with dyslexia, in: Proceedings of the 3rd workshop on predicting and improving text readability for target reader populations (PITR), 2014, pp. 30-37.

[11] Spala, Dernoncourt, Chang, Dockhorn, A web-based framework for collecting and assessing highlighted sentences in a document, in: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, 2018, pp. 78-81.

[12] J. J. Li, Thadani, Stent, The role of discourse units in near-extractive summarization, in: Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2016, pp. 137-147.

[13] Woodsend, Lapata, Automatic generation of story highlights, in: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 2010, pp. 565-574.

[14] Kaisser, M. A. Hearst, J. B. Lowe, Improving search results quality by customizing summary lengths, in: Proceedings of ACL-08: HLT, 2008, pp. 701-709.

[15] F. I. Craik, R. S. Lockhart, Levels of processing: A framework for memory research, Journal of Verbal Learning and Verbal Behavior 11 (1972) 671-684.

[16] H. W. Faw, T. G. Waller, Mathemagenic behaviours and efficiency in learning from prose materials: Review, critique and recommendations, Review of Educational Research 46 (1976) 691-720.

[17] Ennals, Trushkowsky, J. M. Agosta, Highlighting disputed claims on the web, in: Proceedings of the 19th international conference on World wide web, 2010, pp. 341-350.

[18] Yeari, Oudega, P. van den Broek, The effect of highlighting on processing and memory of central and peripheral text information: Evidence from eye movements, Journal of Research in Reading 40 (2017) 365-383.

[19] J. A. Brown, Knollman-Porter, Hux, S. E. Wallace, Deville, Effect of digital highlighting on reading comprehension given text-to-speech technology for people with aphasia, Aphasiology 35 (2021) 200-221.

[20] Winchell, Lan, M. Mozer, Highlights as an early predictor of student comprehension and interests, Cognitive Science 44 (2020) e12901.

[21] Chikkamath, V. R. Parmar, Hewel, Endres, Patent sentiment analysis to highlight patent paragraphs, arXiv preprint arXiv:2111.09741 (2021).