Explainable Artificial Intelligence for Highlighting and Searching in Patent Text

Renukswamy Chikkamath (1,*), Rana Fassahat Ali (2), Christoph Hewel (3) and Markus Endres (1)
1 University of Applied Sciences, Munich, Germany
2 University of Passau, Passau, Germany
3 PAUSTIAN & PARTNERS, Munich, Germany

PatentSemTech'23: 4th Workshop on Patent Text Mining and Semantic Technologies, co-located with the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 27th, 2023, Taipei, Taiwan.
* Corresponding author: renukswamy.chikkamath@hm.edu (R. Chikkamath); ali11@ads.uni-passau.de (R. F. Ali); hewel@paustian.de (C. Hewel); markus.endres@hm.edu (M. Endres)

Abstract
The verbose content and redundant information in patents often make them complex to read and understand. The individual subject matters of an invention, and what is decisive about it, are scattered throughout a patent document, yet these matters can provide the key arguments for an effective examination or critical assessment of the invention. To address these complexities and to support patent practitioners in efficient reading and in-page semantic search of patents, we generated a multi-class dataset that captures the key arguments of patents at the sentence level. Essentially, these key arguments are the concrete details of an invention, such as the problem it solves or the technical effects and advantages it achieves. We fine-tuned transfer learning models on this novel dataset and developed two Chromium extensions: one automatically highlights these key arguments using our fine-tuned model, and the other steers semantic search within any patent document opened in the browser. The data and code of this work are released to the community via a GIT repository. Empirical test cases and manually labeled gold-truth data provide evidence supporting our hypotheses regarding in-page patent search and efficient reading, respectively.

Keywords: Patent analysis, prior art search, patent language model, sentence classification, patent datasets

1. Introduction

1.1. Motivation

A patent is a form of intellectual property that provides the owner with legal rights to prohibit others from producing, using, or selling the invention. These rights, however, are granted in exchange for disclosing how the invention works. Before a patent can be granted, it must undergo a rigorous examination process known as the prior art search. This search is typically conducted at two stages of the patent life cycle: first, early on, when patent attorneys draft the patent application; and second, later, when patent examiners review the application.

Since patent claims define the scope of protection, finding any prior art or other competing art that can be used as evidence for the proposed claims is a crucial step. A patent does not only comprise one or several claims defining the legal scope of protection but also a specification providing one or several specific embodiments of the invention. Patent owners tend to keep the specification as general as possible, which may not only be advantageous for broadening the scope of protection but may also relieve the patent owners from publishing their developed technology. Therefore, most parts of the specification merely repeat the text of the patent claims and add generalized boilerplate text concerning the functioning of the invention. Even though a patent specification may typically be 10 to 30 pages long, only a few short text passages explain the concrete technical effects of the invention.

It is therefore often challenging for patent practitioners, including attorneys and examiners, to comprehend the invention's definition in the claims, the problem addressed by the invention, or the technical effects and benefits it achieves. Yet without understanding the motivation behind the invention, it is difficult to compare it with other inventions when assessing its inventive step over the prior art.
For example, suppose the claimed invention defines a heating system with three temperature sensors, while the closest prior art document, such as an older patent, only discloses a heating system with two temperature sensors. In such cases, important questions arise: What is the technical effect of the third sensor? Why does the prior art suggest only two sensors? If the motivations behind the two concepts are completely different, the claimed invention might be considered to imply an inventive step over the prior art.

Consequently, patent analysis often requires retrieving those few text passages in a patent that reveal the motivation behind the claimed invention. In this work, we aim to address the aforementioned difficulties and ease the prior art search. Specifically, we focus on automatically highlighting these text passages with the help of a Chrome extension (Analyse), as shown in Figure 1. The Analyse extension is supported by an Artificial Intelligence (AI) model that is fine-tuned on a novel dataset developed in this work. We also present a Chrome extension (a search text box) to facilitate cross-questioning during patent analysis, as depicted in Figure 2.

Figure 1: Chromium extension to highlight technical aspects. The Analyse button activates this extension, whereby technical problems, solutions, and advantages are automatically colored in red, yellow, and green, respectively.

Figure 2: Chromium extension for in-page semantic search. A search bar can be used to ask a question, and the answer is highlighted within the opened web page or patent.

1.2. Highlight and Search in Patent Text

The quality of a patent prior art search is greatly influenced by the readability and understandability of patents. In prior art search, and in patent analysis in general, the most important parts of a patent are considered to be the claims and the technical description, which disclose and describe the invention, respectively. Since the claims are written in legal terminology, they are often difficult to understand just by reading them alone. Detailed descriptions of patents represent key arguments of any invention, such as advantages, solutions, problems, and justifications for claim features. Understanding and differentiating these points in a timely manner aids examiners and attorneys in critical assessment and effective analysis in light of the prior art. The focus of this work is to ease the readability and understandability of patents, rather than to investigate information retrieval or prior art search approaches as such.

The AI-based assistance presented in this work is in greatest demand when individual patents are analyzed, and it has two main benefits. Firstly, it eases readability by automatically highlighting technical aspects of the invention at the sentence level. Secondly, it offers deeper understanding by allowing readers to ask various cross-questions.
For example, the question "What are the problems with conventional mouse catchers?" can be searched in the patent Mousetrap (https://patents.google.com/patent/US8943741), as shown in Figure 2. Such a tool can enhance the user experience by providing the opportunity to explore documents in greater detail and to work with the semantics and context of the patent text, unlike keyword-matching in-page searches (Ctrl+F-based search).

Highlighting at the sentence level is more interesting and important than at the keyword or paragraph level. Keywords in patents can be succinct, but they do not provide any evidence for understanding the context in which key arguments are used. Paragraphs, on the other hand, can be informative but may contain mixed opinions; for example, individual sentences explaining different arguments of an invention (advantageous effects, problems, solutions) can appear in one paragraph. Therefore, in this work, we focus on identifying and highlighting key arguments at the sentence level only.

In this paper, we present a sentence-level patent dataset designed to capture the key arguments of any invention. This is a multi-class dataset that was utilized to fine-tune Bert-for-Patents [1]. We developed two Chromium extensions: one for automatically highlighting arguments, powered by the internally fine-tuned Bert-for-Patents model, and another for in-page semantic search based on SQuAD (https://rajpurkar.github.io/SQuAD-explorer/) models. A free-flow natural language query can be used to search within the document opened on the web. Both extensions work on text present in any web page or document; however, our experiments are limited to Google Patents (https://patents.google.com/), for example a patent opened in Google Patents as shown in Figures 1 and 2.

The remainder of this work is organized as follows: Section 2 describes related work. Section 3 explains the methodologies used to develop the data, with a detailed multi-stage flowchart describing the models developed in this work. Section 4 outlines the browser extension communication architecture. In Section 5, we discuss the results achieved in this work, including a sample test case. Finally, in Section 6, we conclude our work and suggest possible future directions.

2. Related Work

The research aspects of this work lie at the intersection of text highlighting, sentence classification, and question answering in the field of Natural Language Processing (NLP).

In recent times, language representation learning, also known as language model development, and research on reading comprehension, such as question-answering models, have grown rapidly in NLP. Notable models that have achieved top performance include Turing NLR-v5 [2] and Turing ULR-v6 [3]. These models outperform other state-of-the-art models in both sentence classification, for example on the GLUE benchmark (https://gluebenchmark.com/leaderboard), which includes the Stanford Sentiment Treebank (SST-2) dataset, and question answering, based on the Stanford Question Answering Dataset (SQuAD). Various other variants of the BERT [4] architecture (https://huggingface.co/models?sort=downloads&search=bert) can also be seen as competitors in various settings. In recent years, Google released a language model pre-trained on patent data called BERT-for-Patents [1]. Since this model is trained on more than 100 million patents, unlike the above-mentioned general-purpose models, we used it to fine-tune our classification model.

Text highlighting in this context emphasizes the significance of the readability and understandability of patent and non-patent text. There is evidence in the literature on how patent examiners at the European Patent Office (EPO) initially read patent documents to arrive at a preliminary understanding of a patent; in particular, there is a clear need for tools that assist them in skimming through patents and achieving a deeper understanding of the contents [5]. Moreover, patent attorneys on the web provide ample motivation for assessing the parameters of patentability and skimming through a document to find individual subject matters (https://www.heerlaw.com/difference-patentability-assessment-patent-search, https://www.brmpatentattorneys.com.au/intellectual-property-law-melbourne/how-to-read-a-patent/).
Although there is considerable interest in the readability of patents [6, 7, 8, 9], these approaches are limited to the analysis of claims; the segmentation and analysis of claims, however, form a separate line of research in prior art search. To the best of our knowledge, there are no approaches that focus on patent text at the sentence level to highlight relevant key arguments.

Highlighting important aspects of a text in the context of education and learning is not new [10]. In other, non-patent domains, generating a quick summary with highlighted text has been proposed to emphasize textual elements [11, 12, 13, 14]. Text highlighting in general encourages a thorough understanding of a document [15] and also supports easier subsequent literature study [16]. To ease access, developing browser extensions that highlight text on the web has drawn attention: for instance, Ennals et al. [17] propose highlighting disputed claims on web pages and finding relevant articles from the web to support the arguments in those claims. Other related research has likewise shown that reading comprehension can be improved by text highlighting on the web or in any digital text content [18, 19, 14, 20].

In the patent domain, a few private-sector providers have developed solutions for multi-color highlighting of keywords (https://help.patsnap.com/hc/en-us/articles/115005478629-What-Can-I-Do-When-I-View-A-Patent-, https://patseer.com). However, such approaches are not efficient, because patent applications can be written using different terminologies even for the same concept. Furthermore, considering the context in addition to keywords adds domain knowledge that can explain why a particular keyword was highlighted. Additionally, these solutions are paid, and the reader has to find and highlight keywords manually. They are more like digital pens for highlighting and keeping a record of keywords, which is again a time-consuming task.
To utilize AI models for automatic highlighting in the patent domain, IPGoggles (https://ipgoggles.com/), one of the motivations for this paper, proposes a new-age cloud-based solution. This service highlights keywords or even phrases in patents based on sentiment. Professionals agree that reading and understanding patents is challenging even at the level of an individual document, given the huge amount of prior art. However, IPGoggles utilizes general-purpose AI models that are not fine-tuned on patent data to identify technical aspects or key arguments.

In the patent domain, researchers have developed a dataset (PaSa) to identify the technical aspects of patent documents at the paragraph level [21]. It contains patent paragraphs under the headings "Technical Problem," "Solution to Problem," and "Advantageous Effects of Invention." For PaSa, United States Patent and Trademark Office (USPTO, https://developer.uspto.gov/product/patent-grant-full-text-dataxml) patent grants from 2010 to 2020 were searched to identify the technical aspects mentioned in clear and distinguishable paragraphs. The authors argue that these paragraphs are not common in all patents, but rather reflect a patent drafting style (based on region) that is mostly followed in Asia-specific patents. Moreover, it is even harder to find these specific paragraphs in Asia-specific patents before 2010 (refer to Table 4 of [21], which shows a gradual decrease in their number from 2020 back to 2010). This provides strong motivation to use these important and infrequent paragraphs as the basis of our investigations. However, the PaSa dataset has not been used in any downstream application or tool so far. Therefore, we decided to derive a dataset from PaSa and use it to train AI models that can serve downstream applications such as a Chrome extension.

In the state of the art, there is either evidence of highlighting technical aspects based on general-purpose AI models, or evidence of a dataset for identifying technical aspects at the paragraph level. To the best of our knowledge, however, there are no approaches that focus on identifying and highlighting technical aspects at the sentence level using a domain- and task-specific dataset. Therefore, in this work, we propose and develop a dataset for finding technical aspects at the sentence level (refer to Section 1.2 for why the sentence level is preferred). Furthermore, we utilize this dataset to fine-tune a patent domain-specific language model, and we deploy this fine-tuned model in a Chrome extension service as a prototype. A detailed description of the technique used to develop the sentence-level dataset and of the variety of fine-tuned models follows in the next Section 3.

3. Data and Models

To the best of our knowledge, there is no dataset available in the literature that identifies technical aspects at the sentence level. Therefore, we proposed to generate a sentence-level dataset based on a paragraph-level dataset called PaSa [21]. The patent paragraphs of PaSa (shown in the top left of Figure 3) represent essential key arguments that are crucial for effective patent reading; they also facilitate critical assessment of the boundaries of an invention. To aid patent practitioners in making decisions during report writing or formal hearings in examinations, AI models trained on such a dataset are necessary. However, it is not always true that all sentences in a specific paragraph reflect its heading. The following excerpt from patent "US10834907B2" shows sentences reflecting both problems and advantages under the same heading, "Technical Problem":

"In summer, when rock oysters come in season, sea areas are highly contaminated ... which causes inhibition of distribution ... Accordingly, an object of the present invention is to provide ... enables the production of virus-free oysters having no experience of being exposed to a sea area ... present invention solves the above-mentioned problems."

Therefore, in this work, we utilized the PaSa dataset to develop sentence-level data for identifying the key technical aspects present in patents. We also adopted a "sentiment" naming convention for the three classes in our dataset: solutions-neutral, advantages-positive, and problems-negative.

The dataset generation and model training in this work can be seen as three stages, as shown in Figure 3.

Figure 3: PaSa sentence-level dataset generation and models, including the types and statistics of the datasets in different settings.

In Stage-I, as a straightforward approach, we used the NLTK tokenizer (https://www.nltk.org/) to convert each paragraph into sentences based on full stops. Further preprocessing removed short sentences of fewer than 20 characters, which are mostly small phrases or fragments built around special symbols.
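As a minimal sketch of this Stage-I split-and-filter step (assuming plain-string paragraphs; the function and variable names are ours, not from the released code):

```python
# Stage-I sketch: split PaSa paragraphs into sentences and drop fragments
# shorter than 20 characters, as described above.
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # NLTK sentence tokenizer models

def paragraph_to_sentences(paragraph, min_chars=20):
    """Return the sentences of a paragraph, skipping very short ones,
    which are mostly stray phrases or special-symbol fragments."""
    return [s.strip() for s in sent_tokenize(paragraph)
            if len(s.strip()) >= min_chars]

paragraph = ("An object of the present invention is to provide a conveyor "
             "which estimates the weight of a transport object. Fig. 1; (a)")
print(paragraph_to_sentences(paragraph))
```

Each surviving sentence inherits the label of the paragraph it came from; this is exactly the weak-labeling assumption whose limits are analyzed in Section 5.2.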
After preprocessing, PaSa_Sentence-Baseline contains 940,000 sentences; Figure 3 displays samples from each class. The dataset is clearly unbalanced, with fewer samples in the positive and negative classes.

To maintain standard experimental settings, as in PaSa, and to avoid class-imbalance problems, we selected only 150k samples (set A) to train the baseline models in Stage-I. The remaining samples were used for other purposes: "except set A" was used in Stage-II, and 650 samples were set aside for manual labeling in Stage-III. In Stage-I, we also used the original PaSa paragraph dataset to train transformer models, since the PaSa paper considered only machine learning models. In Stage-II, we generated an improvised version (set B) of the PaSa_Sentence Baseline data to address errors and shortcomings identified in the baseline data (refer to Section 5.2 for the error analysis). The data samples used for the various purposes (set A, set B, manually labeled data) were kept completely disjoint to avoid bias in model training.

We fine-tuned pre-trained transformer models from the Hugging Face platform (https://huggingface.co/models) on our datasets. With the exception of Bert-For-Patents, the three remaining baseline models (refer to Stage-I) were pre-trained on non-patent literature and hosted on Hugging Face. The naming convention Bert-For-Patents-# indicates that these models were fine-tuned on different datasets. For example, Bert-For-Patents-2 is a completely fresh pre-trained model that was fine-tuned on PaSa paragraph data in Stage-II. In Stage-III, the same (fine-tuned) Bert-For-Patents-2 was used solely for making predictions on "except set A" (i.e., "except set A" played no role in training Bert-For-Patents-2). Thus, all models and datasets were kept separate. The baseline models shown in Figure 3 were fine-tuned with a sequence length of 512 and a batch size of 16, except for Bert-For-Patents-#, which was fine-tuned with a sequence length of 128 and a batch size of 8. The reason for this difference is that Bert-For-Patents-# is an extremely large architecture with 24 hidden layers and creates hardware constraints during fine-tuning, even on an NVIDIA server with an A30 GPU.
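A condensed sketch of one such fine-tuning run (a baseline setting with sequence length 512 and batch size 16) is shown below, assuming a generic BERT checkpoint and a toy two-sentence dataset as stand-ins; the actual checkpoints and data splits are those described above, and the released repository remains authoritative.

```python
# Hedged fine-tuning sketch for the three-class sentence classifier.
# The base checkpoint and the two-example dataset are placeholders.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

base = "bert-base-uncased"  # placeholder; swap in a BERT-for-Patents checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=3)

# Labels: 0 = solution (neutral), 1 = advantage (positive), 2 = problem (negative)
train = Dataset.from_dict({
    "text": ["Accordingly, an object of the invention is to provide a sensor.",
             "The conventional trap frequently misfires and must be reset by hand."],
    "label": [0, 2],
}).map(lambda b: tokenizer(b["text"], truncation=True, padding="max_length",
                           max_length=512), batched=True)

args = TrainingArguments(output_dir="pasa-sentence-clf",
                         per_device_train_batch_size=16, num_train_epochs=3)
Trainer(model=model, args=args, train_dataset=train).train()
```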
For the in-page patent semantic search, we used SQuAD-based question-answering models hosted on Hugging Face (https://huggingface.co/models?pipeline_tag=question-answering&sort=downloads). The best-performing and most downloaded models are BERT Large (uncased), RoBERTa base, and DistilBERT (cased). To the best of our knowledge, no SQuAD-format datasets are available in the patent domain, which opens the door for research into developing a patent question-answering dataset. SQuAD models are suitable for the in-page search in this work because natural-text queries can be answered within a given context (e.g., patent text in chunks), unlike keyword matching, and they can easily be hosted and deployed in Chrome extensions. Therefore, we investigated the aforementioned models for our in-page semantic search extension. The components of the Chromium extension are explained in detail, with the help of a communication architecture, in the next Section 4.
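To make the retrieval mechanism concrete, the following sketch runs the publicly available SQuAD-tuned BERT Large checkpoint (the one referenced in Section 4) over an invented patent snippet; the context and question pairing are illustrative only.

```python
# Sketch of a SQuAD-style in-page search: a free-flow question is answered
# against a chunk of patent text. The context below is an invented example.
from transformers import pipeline

qa = pipeline("question-answering",
              model="bert-large-uncased-whole-word-masking-finetuned-squad")

context = ("Conventional mouse catchers rely on spring-loaded traps that "
           "frequently misfire and must be reset by hand after every catch.")
result = qa(question="What are the problems with conventional mouse catchers?",
            context=context)
# `result` holds the answer span plus its character offsets and a confidence
# score, e.g. {'answer': ..., 'start': ..., 'end': ..., 'score': ...}
print(result)
```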
4. Browser Extensions

The browser extensions developed in this work aim to enhance the readability and understandability of patents. Readability improves when the technical aspects of the patent under consideration are highlighted automatically; this automation draws on the knowledge of the domain-specific AI models fine-tuned in this work, with the respective model deployed behind a Chrome extension (refer to Figure 1). Understandability improves when there is an opportunity to ask cross-questions within a patent document during analysis; this feature is provided by the other extension developed in this work (refer to Figure 2). Patent practitioners can install and activate these two Chrome extensions in their browsers for effective prior art searches (refer to the GIT repository of this work, https://github.com/Renuk9390/expaai_model, for installation). More details, including the usability of the Chrome extensions, request run times, and the responsiveness of the interface, are also provided in the GIT repository.

The browser extensions presented in this paper run on Chromium-based browsers such as Google Chrome, with development in two parts: i) a Python Flask (https://flask.palletsprojects.com/en/2.2.x/) API for the models (the backend), and ii) the Chromium extension (the front end). We used Flask to develop an API for our models; to obtain predictions from our fine-tuned models, we utilized Hugging Face transformers pipelines (https://huggingface.co/docs/transformers/main_classes/pipelines), and we hosted our fine-tuned models in the Hugging Face repository so that the pipelines can use them. The API has two POST endpoints, one for each task (classification/sentiment-predict and in-page semantic search). The classification POST endpoint accepts an array of sentences from the document opened in the browser, collects the prediction response from the transformer pipeline with our fine-tuned model (Bert-For-Patents-3), and assigns classes to the sentences. For the semantic search POST endpoint, a context (the complete patent text in our case) and a question are given as input and passed to a question-answering model pipeline (e.g., BERT Large trained on SQuAD, https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad). The response is an answer (start and end positions of text within the considered context) to the searched question.
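A hedged sketch of this backend follows. The two route names mirror the endpoint names mentioned in Section 5 (sentiment_predict and semantic_search), but the JSON field names and the classifier checkpoint path are our placeholders; the released repository is authoritative.

```python
# Minimal Flask backend sketch with the two POST endpoints described above.
# CLASSIFIER_PATH is a placeholder for the fine-tuned Bert-For-Patents-3.
from flask import Flask, jsonify, request
from transformers import pipeline

CLASSIFIER_PATH = "path/to/bert-for-patents-3"  # placeholder checkpoint
app = Flask(__name__)
classify = pipeline("text-classification", model=CLASSIFIER_PATH)
answer = pipeline("question-answering",
                  model="bert-large-uncased-whole-word-masking-finetuned-squad")

@app.route("/sentiment_predict", methods=["POST"])
def sentiment_predict():
    sentences = request.get_json()["sentences"]   # array of page sentences
    return jsonify(classify(sentences))           # label + probability each

@app.route("/semantic_search", methods=["POST"])
def semantic_search():
    payload = request.get_json()                  # {"context": ..., "question": ...}
    return jsonify(answer(question=payload["question"],
                          context=payload["context"]))

if __name__ == "__main__":
    app.run(port=5000)
```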
We used chrome-extension-cli (https://github.com/dutiyesh/chrome-extension-cli) to develop the Chromium extension, together with JavaScript, HTML, and CSS for data handling and styling. The communication architecture of the browser extension and its components is shown in Figure 4.

Figure 4: Browser extension communication architecture with its components.

The functionalities of the individual components are as follows:

• Popup: The component that becomes visible when the browser extension icon is clicked; it is the only point of contact between the user and the extension. The popup provides the buttons for classification with multi-color highlighting as well as the search bar, and its Loader indicates whether a task is running or stopped. The Popup script communicates with both the "Content" and "Background" components. Text content from the web page is accessed, analyzed (predictions, answers), and highlighted in the final step.

• Content: This component collects the text of the opened web page and communicates with both the "Background" and "Popup" components; it receives messages from the "Popup" script and exchanges messages with the "Background" component. It prepares the content for analysis and highlights the relevant content on the web page based on the predictions received from the "Background" component. Highlighting the content (sentences and answers) is one of the salient tasks of the "Content" component; it is achieved via the "div" number or "class" of the respective matched answer or sentence in the HTML page.

• Background: This is the only component that communicates with the Flask API backend. When it receives a message from the "Content" component with a task payload, it calls the corresponding API endpoint with the inputs. Background listens for two types of messages from Content: "Patent_Text", for highlighting technical aspects according to their predicted class, and "Patent_Semantic_Search", for the in-page search. After receiving a response from the API, it forwards the response to "Content" for further processing. In addition, Background sends the messages task_started and task_stopped to "Popup" to keep the "Loader" busy or ready for the user's next task. More details on the communication between components can be found in the code repository of this paper.

5. Findings

In this section, we discuss the results of this work and perform an error analysis showing how a dataset representation problem affects model performance.

5.1. Scores and Test Cases

Table 1 displays the classification accuracies of the models developed in this work on the PaSa_Sentence Baseline_preprocessed dataset. Bert-for-Patents-1 outperforms the other models, presumably because it was pre-trained by Google on patent literature. As a result, we opted to use only the Bert-for-Patents pre-trained architecture in Stage-II.

Table 1: Classification scores at the sentence level
Data Size | Model              | Accuracy
150k      | BerTweet           | 80%
150k      | Bert base          | 83.5%
150k      | DistilBert         | 84%
150k      | Bert-for-Patents-1 | 86.30%

The PaSa_Sentence Improvised Dataset (refer to Stage-II in Figure 3) was used to fine-tune Bert-for-Patents-3; owing to the improvements in the dataset, this model reaches an accuracy of 97.11%. As shown in Figure 3, Bert-for-Patents-2, fine-tuned at the paragraph level with an accuracy of 98.13%, is competent enough to represent the classes. Therefore, we decided to use Bert-for-Patents-2 to obtain improved samples from our Baseline_preprocessed dataset, keeping only those samples whose prediction score exceeded 70% under Bert-for-Patents-2.
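This confidence filter reduces to a few lines; in the sketch below, the paragraph-model checkpoint path is a placeholder and the 0.7 threshold is the one stated above.

```python
# Stage-II/III relabeling sketch: the paragraph-level model scores each
# baseline sentence, and only confident predictions (> 70%) are kept for
# the improvised dataset. The checkpoint path is a placeholder.
from transformers import pipeline

scorer = pipeline("text-classification", model="path/to/bert-for-patents-2")

def filter_confident(sentences, threshold=0.70):
    kept = []
    for sentence, pred in zip(sentences, scorer(sentences)):
        if pred["score"] > threshold:  # model's top-class probability
            kept.append({"text": sentence, "label": pred["label"]})
    return kept
```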
For the in-page semantic search, we use models (BERT Large uncased, RoBERTa base, and DistilBERT cased) that are fine-tuned on SQuAD data. To our knowledge, there are no SQuAD-formatted datasets in the patent domain for in-page question answering; therefore, we do not fine-tune these models on any patent data in this work, but only run test cases to compare and evaluate them. For the test cases, we considered various contexts (patent texts) and questions to compare the answering capability of these models. DistilBERT is competitive with BERT Large in some cases: as depicted in Figure 5, we provided the same context and question to the aforementioned models, and while BERT Large was superior in retrieving the answer, DistilBERT also retrieved the correct answer reasonably well. In most cases, the BERT Large uncased model was better at finding accurate answers to longer queries, which are common in patent searches. Therefore, BERT Large is deployed in the in-page semantic search extension.

Figure 5: An example of SQuAD-based in-page patent semantic search tested on different models.

To test and debug the API endpoints, we used an open-source application called Insomnia (https://docs.insomnia.rest/insomnia/get-started). We sent Insomnia test requests to the in-page semantic search endpoint and the classification (aka sentiment_predict) endpoint. For example, we passed an array of sentences to the sentiment_predict endpoint, and the fine-tuned model returned a response with the label and prediction probability score for each sentence. Similarly, for semantic_search, we passed a sample patent text as context along with a question, and the retrieved response included the begin and end token positions of the possible answer snippet together with confidence scores. After confirming that the APIs functioned as intended, we deployed them in the Chromium extensions.
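The same smoke tests can be scripted; the snippet below is a hedged requests-based equivalent of the Insomnia checks, assuming the backend sketched in Section 4 runs locally on port 5000 with the placeholder JSON fields used there.

```python
# Scripted equivalent of the Insomnia smoke tests for both endpoints.
# Host, port, and JSON field names follow our earlier backend sketch.
import requests

API = "http://localhost:5000"

resp = requests.post(f"{API}/sentiment_predict",
                     json={"sentences": ["An object of the invention is "
                                         "to reduce sensor noise."]})
print(resp.json())  # expected: one label + probability per sentence

resp = requests.post(f"{API}/semantic_search",
                     json={"context": "Full patent text goes here ...",
                           "question": "What problem does the invention solve?"})
print(resp.json())  # expected: answer span with start/end positions and score
```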
5.2. Error Analysis

Labels are assigned to the sentence-level dataset of this work in three different ways: firstly, by automatic labeling based on the NLTK tokenizer (Stage-I); secondly, by the fine-tuned paragraph model (Stage-II); and thirdly, by manual annotation (Stage-III). Although the baseline models developed in this work show good accuracy, there are cases in which a model's validation loss is lower than its training loss at the end of the third epoch, i.e., the validation data was easier to predict than the training data was to learn. This signals a dataset representation problem: the classes are not equally well represented by all samples, for the reasons shown below. Models fine-tuned on such poorly representative data acquire a bias in predicting the validation set. The PaSa_Sentence Baseline_preprocessed data contains various examples of substandard training samples.

Example 1: "In the view of the problem of the background art, it is an object of the present invention to provide a conveyor which estimates the weight of a transport object while it is carried without using devices such as a load cell which directly measures weight."

Observation 1: This example is automatically labeled as the negative class during PaSa_Baseline generation, but it is not labeled negative under manual annotation. During patent drafting, mostly in "Technical Problem" paragraphs, attorneys and applicants commonly use underlined phrases to briefly restate their invention while describing the problems of other prior art. If sentences with such underlined phrases end up in the negative class, such samples can be discarded.

Example 2: "An embodiment provides a lighting device in which an optical plate is disposed on at least one light source and a light source module including the same."

Observation 2: This sentence, and others like it, are automatically labeled as negative even though they are not. This indicates that the mixed opinions sometimes present at the paragraph level also surface in individual sentences.

There are further samples that are very long (60-70 words), where smaller sentences are joined by special symbols such as ";", ",", or ":". Manually checking every such sample in a large dataset is laborious. Therefore, we decided to fine-tune a model at the paragraph level, so that it would acquire a stronger understanding of how the advantage, problem, and solution classes are represented in patent text. This fine-tuned model is used to retain only the sentences that show at least a 70% probability of representing a class. We then used these improvised samples (the PaSa_Sentence Improvised Dataset) to fine-tune a new model (Bert-For-Patents-3), which outperforms the other baseline models in terms of both accuracy and class representativeness.

We manually labeled 650 randomly selected samples that were not used in any of the experiments; the original labels of these samples from PaSa_Baseline_preprocessed were kept separate. To verify the presence of bias and representation problems in the baseline models, we compared the prediction accuracies of the manual labels, the baseline models, and Bert-for-Patents-3. The manual and Bert-for-Patents-3 prediction accuracies were 68.59% and 69.05%, respectively: Bert-for-Patents-3, fine-tuned on the improved dataset, predicts closest to the manual labels. The baseline models, by contrast, showed inflated scores due to bias, with accuracies of 87.80% (DistilBert base uncased), 87.04% (Bert base uncased), and 94.06% (Bert-for-Patents-1). Therefore, Bert-for-Patents-3 is the more suitable model for the highlighting Chrome extension.

Technical aspects in a patent represent advantages over the prior art, proposed solutions, or problems with other prior art, and the core objective of this work was to identify and highlight these aspects automatically. Although this objective may resemble a sentiment analysis problem, general sentiment analysis datasets and algorithms are not suitable for the task. Our sentence-level dataset is distinct from sentiment analysis datasets such as IMDB (https://www.imdb.com/interfaces/) and Amazon product reviews (https://cseweb.ucsd.edu/~jmcauley/datasets.html): those datasets mostly contain sentences expressing people's opinions on products, things, or other social matters, whereas our dataset captures the key technical arguments in patents that demonstrate an invention's technical capabilities in comparison with the prior art. Most importantly, our dataset is specific to the patent domain and accounts for patent-specific vocabulary and knowledge.

6. Conclusion and Future Work

In this work, we present a multi-class dataset at the sentence level for highlighting the technical subject matters of patents, which can serve as important key arguments in determining a patent's novelty. We fine-tuned language models on our new dataset and developed a Chromium extension that automatically highlights key arguments based on predictions whose probability exceeds 70%. We also developed another Chromium extension to facilitate in-page semantic search.

We anticipate a growing need for AI-based tools that assist patent practitioners in conducting prior art searches, and we hope this empirical work serves as preliminary research that motivates researchers and patent practitioners to develop tools that can automate them. Future work in this area could identify additional technical aspects in patent documents and train new classes for highlighting; in this study, we focused only on advantages, problems, and solutions. Furthermore, the sentence-level data could be improved to enhance the representativeness of the samples belonging to each class: for example, sentences representing "advantages" should not be mixed with sentences related to "problems".

Developing a question-answering dataset in the patent domain is crucial, as such datasets can be used to build tools that automate in-page semantic search. We also hope that AI-based tools for assisting prior art searches will enhance the interaction of patent analysts with patent documents; for instance, the automatic highlighting and semantic search tools prototyped in this work allow cross-questioning within any patent document opened in a web browser.

Acknowledgments

This research is part of the project "BigScience", which is funded by the Bavarian State Ministry for Economic Affairs, Regional Development, and Energy under grant number DIK0259/01.
References

[1] R. Srebrovic, J. Yonamine, Leveraging the BERT algorithm for patents with TensorFlow and BigQuery, Technical Report, Global Patents, Google, 2020.
[2] P. Bajaj, C. Xiong, G. Ke, X. Liu, D. He, S. Tiwary, T.-Y. Liu, P. Bennett, X. Song, J. Gao, METRO: Efficient denoising pretraining of large scale autoencoding language models with model generated signals, arXiv preprint arXiv:2204.06644 (2022).
[3] B. Patra, S. Singhal, S. Huang, Z. Chi, L. Dong, F. Wei, V. Chaudhary, X. Song, Beyond English-centric bitexts for better multilingual language representation learning, arXiv preprint arXiv:2210.14867 (2022).
[4] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[5] P. Lahorte, Inside the mind of an EPO examiner, World Patent Information 54 (2018) S18–S22.
[6] A. Shinmori, M. Okumura, Y. Marukawa, M. Iwayama, Patent claim processing for readability: structure analysis and term explanation, in: Proceedings of the ACL-2003 Workshop on Patent Corpus Processing, 2003, pp. 56–65.
[7] G. Ferraro, H. Suominen, J. Nualart, Segmentation of patent claims for improving their readability, in: Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR), 2014, pp. 66–73.
[8] A. Shinmori, M. Okumura, Y. Marukawa, Aligning patent claims with detailed descriptions for readability, in: NTCIR, 2004.
[9] S. Sheremetyeva, Natural language analysis of patent claims, in: Proceedings of the ACL-2003 Workshop on Patent Corpus Processing, 2003, pp. 66–73.
[10] L. Rello, H. Saggion, R. Baeza-Yates, Keyword highlighting improves comprehension for people with dyslexia, in: Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR), 2014, pp. 30–37.
[11] S. Spala, F. Dernoncourt, W. Chang, C. Dockhorn, A web-based framework for collecting and assessing highlighted sentences in a document, in: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, 2018, pp. 78–81.
[12] J. J. Li, K. Thadani, A. Stent, The role of discourse units in near-extractive summarization, in: Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2016, pp. 137–147.
[13] K. Woodsend, M. Lapata, Automatic generation of story highlights, in: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 2010, pp. 565–574.
[14] M. Kaisser, M. A. Hearst, J. B. Lowe, Improving search results quality by customizing summary lengths, in: Proceedings of ACL-08: HLT, 2008, pp. 701–709.
[15] F. I. Craik, R. S. Lockhart, Levels of processing: A framework for memory research, Journal of Verbal Learning and Verbal Behavior 11 (1972) 671–684.
[16] H. W. Faw, T. G. Waller, Mathemagenic behaviours and efficiency in learning from prose materials: Review, critique and recommendations, Review of Educational Research 46 (1976) 691–720.
[17] R. Ennals, B. Trushkowsky, J. M. Agosta, Highlighting disputed claims on the web, in: Proceedings of the 19th International Conference on World Wide Web, 2010, pp. 341–350.
[18] M. Yeari, M. Oudega, P. van den Broek, The effect of highlighting on processing and memory of central and peripheral text information: Evidence from eye movements, Journal of Research in Reading 40 (2017) 365–383.
[19] J. A. Brown, K. Knollman-Porter, K. Hux, S. E. Wallace, C. Deville, Effect of digital highlighting on reading comprehension given text-to-speech technology for people with aphasia, Aphasiology 35 (2021) 200–221.
[20] A. Winchell, A. Lan, M. Mozer, Highlights as an early predictor of student comprehension and interests, Cognitive Science 44 (2020) e12901.
[21] R. Chikkamath, V. R. Parmar, C. Hewel, M. Endres, Patent sentiment analysis to highlight patent paragraphs, arXiv preprint arXiv:2111.09741 (2021).