<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Employing Active Learning for Training a DL-Model for Citation Identification in Patent Text ⋆</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Farag</forename><surname>Saad</surname></persName>
							<email>farag.saad@fiz-karlsruhe.de</email>
							<affiliation key="aff0">
								<orgName type="institution">FIZ Karlsruhe - Leibniz Institute for Information Infrastructure</orgName>
								<address>
									<addrLine>Hermann-von-Helmholtz-Platz 1</addrLine>
									<postCode>76344</postCode>
									<settlement>Eggenstein-Leopoldshafen</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Hidir</forename><surname>Aras</surname></persName>
							<email>hidir.aras@fiz-karlsruhe.de</email>
							<affiliation key="aff0">
								<orgName type="institution">FIZ Karlsruhe - Leibniz Institute for Information Infrastructure</orgName>
								<address>
									<addrLine>Hermann-von-Helmholtz-Platz 1</addrLine>
									<postCode>76344</postCode>
									<settlement>Eggenstein-Leopoldshafen</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mark</forename><surname>Prince</surname></persName>
							<email>mprince@cas.org</email>
							<affiliation key="aff1">
								<orgName type="department">CAS - Chemical Abstracts Service</orgName>
								<address>
									<addrLine>2540 Olentangy River Rd</addrLine>
									<postCode>43202</postCode>
									<settlement>Columbus</settlement>
									<region>OH</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="laboratory">Workshop on Patent Text Mining and Semantic Technologies</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Employing Active Learning for Training a DL-Model for Citation Identification in Patent Text ⋆</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">32C418FF5CEAF9D493CE7B1EF74DAEB5</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:10+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Patent Citations</term>
					<term>Named Entity Recognition</term>
					<term>Deep learning</term>
					<term>Active learning</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Citations play an important role in patent analytics. Because the existing citation lists in patent documents are incomplete, automatically detecting citations in the patent text and enhancing these lists has long been a user need in patent information retrieval. In this paper, we describe an approach for identifying citations in patent text using Deep Learning (DL) models. We apply active learning for training and improving a DL-based named entity recognition (NER) model for this task. The evaluation showed a high accuracy for the focused citation type, i.e., the p-c-p (patent cites patent) case.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Citations play an important role in patent retrieval and analytics. Since the existing citation lists in patent documents are incomplete, it has long been a user wish to automatically detect citations in the patent text and complete these lists. Furthermore, citations within the full text of a patent do not follow well-defined patterns or rules, so identifying them with high accuracy is a challenging task. In addition, researching and implementing a suitable approach for utilizing citations from patent text depends on the use case (e.g., search for prior art, linking to corresponding online sources, etc.). Successfully creating such citation lists automatically from patent text will enable users to extend their discovery more efficiently to stated background or adjacent prior art.</p><p>Patent citations typically come in two types: a patent cites another patent (herein referred to as p-c-p), or a patent cites literature (herein referred to as p-c-l), also known as NPL (non-patent literature) citations. In this paper the focus is on the p-c-p use case, for which we have designed, implemented and evaluated a citation identification approach.</p><p>There are two citation pattern types in the p-c-p use case: the standard and the non-standard citation pattern. In the standard pattern, patent applicants use a simple form for referencing other patent publications, e.g., "US20050114951A1, WO 2006122188", while in the non-standard pattern they use more complex forms for citing other patents, e.g., "U.S. Pat. Nos. 
<ref type="bibr">6,808,085; 6,736,293; 6,732,955; 6,708,846; 6,626,379; 6,626,330; 6,626,328; 6,454,185</ref>" etc.</p><p>The remainder of this paper is organized as follows: Section 2 briefly reviews related work, Section 3 presents the proposed approach, Section 4 presents and discusses an empirical evaluation, Section 5 outlines future work directions, and Section 6 concludes the paper.</p></div>
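The standard citation pattern above is regular enough to be matched by simple rules. A minimal, hypothetical regex sketch follows; the authority list and digit lengths are assumptions, and the paper's actual approach is a trained NER model precisely because the non-standard patterns resist such rules:

```python
import re

# Illustrative (hypothetical) matcher for the *standard* p-c-p citation
# pattern only, e.g. "US20050114951A1" or "WO 2006122188".  The authority
# codes and digit ranges below are assumptions for demonstration.
STANDARD_CITATION = re.compile(
    r"\b(?P<authority>US|WO|EP|JP|DE)\s?"  # patent authority code
    r"(?P<number>\d{6,11})"                # publication number
    r"\s?(?P<kind>[A-C]\d)?\b"             # optional kind code, e.g. A1
)

def find_standard_citations(text: str) -> list[str]:
    """Return the standard-pattern citation strings found in a paragraph."""
    return [m.group(0).strip() for m in STANDARD_CITATION.finditer(text)]
```

Note that a non-standard enumeration such as "U.S. Pat. Nos. 6,808,085" yields no match here, which illustrates why a learned model is needed for that pattern type.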
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Several machine learning approaches have been applied to the problem of extracting data from free text (NER), e.g., citation extraction, among them Support Vector Machines (SVM), e.g., <ref type="bibr" target="#b0">[1]</ref>, Hidden Markov Models (HMM), e.g., <ref type="bibr" target="#b1">[2]</ref>, and Conditional Random Fields (CRF), e.g., <ref type="bibr" target="#b2">[3]</ref>. In the past few years, however, Deep Learning (DL) approaches for the NER task (mainly LSTM, Long Short-Term Memory, and CNN, Convolutional Neural Network) have become dominant, as they significantly outperformed the previous state-of-the-art approaches <ref type="bibr" target="#b3">[4]</ref>. In contrast to classical machine learning approaches, where features are designed and prepared through human effort, deep learning is able to automatically discover hidden features from unlabelled data. The first application of a neural network (NN) to NER was proposed in <ref type="bibr" target="#b4">[5]</ref>. The authors considered six standard NLP tasks, among them NER, where atomic elements in a sentence are labelled with categories such as "PERSON", "COMPANY", or "LOCATION". They used feature vectors generated from all words in an unlabelled corpus. A separate orthographic feature was included, based on the assumption that a capital letter at the beginning of a word is a strong indication that the word is a named entity. The proposed handcrafted features were later replaced with word embeddings <ref type="bibr" target="#b5">[6]</ref> <ref type="bibr" target="#b6">[7]</ref>. Word embeddings, which represent word meanings in 𝑛-dimensional space, were learned from unlabelled data. A major strength of these approaches is that they allow the design of training algorithms that avoid task-specific engineering and instead rely on large amounts of unlabelled data to discover internal word representations that are useful for the NER task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">P-C-P Approach based on Deep Learning</head><p>To find suitable training data for the p-c-p NER model, we first investigated whether there is a publicly available training dataset we could rely on to build the NER precursor model, which is then used to enlarge the training data at hand. A precursor model is a temporary model that is trained on a small set of training data and is later re-trained on larger training data.</p><p>In the following, we give some insight into the publicly available training data as well as the training data generated using the active learning framework.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Public Training Dataset</head><p>We have identified two freely available datasets: the GROBID<ref type="foot" target="#foot_0">1</ref> dataset and the manually expert-created dataset at FIZ Karlsruhe. The GROBID project specializes in literature citation extraction, but has recently also done some work related to patent citation extraction. After pre-processing steps (e.g., removing non-English patents, corrupted documents, etc.), the total number of annotated documents obtained was 130, belonging to three patent authorities: EPO<ref type="foot" target="#foot_1">2</ref>, US<ref type="foot" target="#foot_2">3</ref> and PCT (WIPO)<ref type="foot" target="#foot_3">4</ref>. The citation coverage per patent document varies: some documents contain only two citations, while others contain up to 375. The domain of focus is life science. The total number of extracted training paragraphs holding citations was 487. The FIZ dataset contains 41 annotated patent documents, with between 2 and 66 citations per patent. The domains of focus are life science and technology. The total number of extracted training paragraphs was 230. The two datasets come in different formats, neither equivalent to the state-of-the-art NER format, e.g., the IOB (inside-outside-beginning) tagging format (see <ref type="bibr" target="#b3">[4]</ref>). We have unified both datasets into the standard IOB format and used the resulting training data (717 training paragraphs) to train the precursor p-c-p NER model.</p></div>
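The unification into IOB tags can be sketched as follows. This is a minimal illustration assuming, hypothetically, that each source annotation is given as character offsets; the two source datasets actually use their own formats:

```python
# Minimal sketch: convert a span-annotated paragraph into IOB tags.
# tokens: list of (text, start_offset, end_offset)
# spans:  list of (start_offset, end_offset) citation annotations
def to_iob(tokens, spans, label="CITATION"):
    tags = []
    for text, start, end in tokens:
        tag = "O"
        for s, e in spans:
            if start == s:                # token opens an annotated span
                tag = "B-" + label
            elif s < start and end <= e:  # token continues inside a span
                tag = "I-" + label
        tags.append(tag)
    return tags
```

For example, a single-token citation receives a lone B- tag, while a citation split over several tokens receives B- followed by I- tags, which is exactly the information an IOB-trained NER model consumes.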
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Generated P-C-P Training Data</head><p>The freely acquired training data used for the precursor p-c-p NER model is not sufficient and needed to be improved in quality and quantity. Therefore, we have generated our own training data. To achieve this, patent paragraphs that hold many citations are first pre-annotated by the p-c-p precursor model. Then, patent subject matter experts (SMEs) used the visual annotation user interface of the Prodigy<ref type="foot" target="#foot_4">5</ref> annotation tool to review and enhance the annotated p-c-p citations. Prodigy is an annotation tool with an easy-to-use interactive interface that supports active learning. It is a scriptable tool that allows users to create the annotations themselves, enabling rapid iterations.</p><p>From the utilized patent full-text databases, PCT (WIPO) and US, we have prepared 500 citation-rich paragraphs each (1000 in total), belonging to an equally distributed set of patent documents based on their IPC/CPC<ref type="foot" target="#foot_5">6</ref> classes. In order to use citation-rich paragraphs for training, we have kept only those parts of the detailed description (DETD) that hold at least 8 cited patents identified by the precursor p-c-p model. Furthermore, we made sure to have sufficient training instance candidates (in the selected paragraphs) representing both types of citation patterns, the standard as well as the non-standard one. The training instance candidates are then reviewed by the SMEs, who accept or correct each instance using the Prodigy annotation tool (See Figure <ref type="figure">1</ref>). The process starts by training a precursor model based on the public dataset, which is then utilized to enlarge the training data iteratively (See Figure <ref type="figure" target="#fig_1">2</ref> (1)). 
This precursor model was used to filter the acquired US and PCT raw data, where we kept only those parts of the detailed description (DETD) text that hold at least 8 patent citations (See Figure <ref type="figure" target="#fig_1">2</ref> (2)). We then pre-processed and prepared the raw data to ensure that it holds sufficient citations, and loaded part of it into the Prodigy tool, along with the integrated precursor model, to start the training process for the final NER model in several iterations. The input training instance candidates (in the selected paragraphs), which were annotated by the precursor model, are loaded into the Prodigy tool and presented to the SMEs for review. The SMEs interacted with the pre-annotations and either approved, corrected or added new annotations (See Figure <ref type="figure" target="#fig_1">2</ref> (3)) in the presented paragraphs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Model Design and Training</head><p>In the first iteration, the SMEs reviewed 250 pre-annotated paragraphs for each of the US and PCT databases. We used these reviewed pre-annotated paragraphs to re-train the NER model (See Figure <ref type="figure" target="#fig_1">2 (4</ref>)), where the model reached an F-score of 85%, hence another iteration was required. To enhance the model performance, we picked up more raw data and pre-annotated it again (250 paragraphs for each of the US and PCT databases) using the enhanced p-c-p NER model (See Figure <ref type="figure" target="#fig_1">2</ref> (2)). The newly prepared citation-rich paragraphs were reviewed by the SMEs and used to re-train the NER model. If needed, this process is repeated iteratively and ends when we reach a certain degree of confidence that the final NER model is sufficiently trained to be applied to the p-c-p citation identification task (See Figure <ref type="figure" target="#fig_1">2 (4)</ref>).</p></div>
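The iterative re-training loop with its stopping criterion can be sketched as follows. The callables `train`, `evaluate` and `next_batch` are placeholders for the actual spaCy/Prodigy steps, and the target F-score of 0.92 reflects the value at which the paper stopped after two iterations:

```python
# Hypothetical sketch of the re-training loop in Section 3.3: fetch a
# new SME-reviewed batch, re-train, evaluate, and stop once the model
# is judged sufficiently trained (here: F1 reaches a target value).
def active_learning_loop(train, evaluate, next_batch,
                         target_f1=0.92, max_iterations=10):
    history = []
    for _ in range(max_iterations):
        batch = next_batch()   # SME-reviewed pre-annotated paragraphs
        model = train(batch)   # re-train the NER model on the batch
        f1 = evaluate(model)   # F1 on a held-out evaluation corpus
        history.append(f1)
        if f1 >= target_f1:
            break
    return history
```

With the scores reported in the paper, the loop would record 0.85 after the first iteration and terminate at 0.92 after the second.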
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Evaluation</head><p>Once the p-c-p model was sufficiently trained, the citation identification approach was evaluated by processing an evaluation corpus of 245 patents (prepared by SMEs), representing a random collection of US (128) and PCT (117) patents. To start the p-c-p model evaluation process, all required materials, e.g., the identified citations of the SMEs' evaluation corpus, were handed over to the SMEs.</p><p>We have evaluated the p-c-p model based on the 𝑃 𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 (𝑃 ) (See equation 1), 𝑅𝑒𝑐𝑎𝑙𝑙 (𝑅) (See equation 2), and 𝐹 1-𝑆𝑐𝑜𝑟𝑒 (See equation 3) measures. In order to compute the scores, the False Positive (𝐹 𝑃 ), True Positive (𝑇 𝑃 ) and False Negative (𝐹 𝑁 ) counts are determined first. The 𝐹 𝑃 refers to the number of wrongly identified citations by the p-c-p model, the 𝑇 𝑃 to the number of correctly identified citations, and the 𝐹 𝑁 to the number of citations the model fails to identify.</p><formula xml:id="formula_0">𝑅𝑒𝑐𝑎𝑙𝑙 = 𝑇 𝑃 / (𝑇 𝑃 + 𝐹 𝑁 )<label>(2)</label></formula><formula xml:id="formula_1">𝐹 1-𝑆𝑐𝑜𝑟𝑒 = 2 · 𝑃 𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 · 𝑅𝑒𝑐𝑎𝑙𝑙 / (𝑃 𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑅𝑒𝑐𝑎𝑙𝑙)<label>(3)</label></formula><p>As shown in Table <ref type="table" target="#tab_1">1</ref>, the p-c-p model identified 978 citations in total, of which 951 were correct (TP) and 28 wrong (FP), while 70 citations were missed (FN). One explanation for the missed citations is that in rare cases patent applicants do not write citations in the standard way; for example, consider the citation "This is a continuation-in-part of my copending application Ser. No. 43,784 for Catalytic Reformer Process". Here the patent authority marker (US, WO, JP, etc.) is missing, so the model has no clue which patent authority is meant; therefore, to minimize the error rate, the model was trained to neglect such incomplete citations.</p><p>Even though the p-c-p model also obtained high evaluation scores for other patent authorities, e.g., Finnish, Japanese, etc. 
that it was not trained on, it failed to identify some citations that appear in special contexts. An effective solution for these failures is to increase the training data to cover more patent authorities, which can be done efficiently using the framework described in Section 3.2.</p><p>Overall, the p-c-p model performed very well in most cases, achieving a high precision of 96% and a high recall of 89%. In addition, we computed the F1-score, which takes both precision and recall into account, to measure the overall accuracy of the model. Despite the fact that the p-c-p model was trained on a rather small dataset of 1717 training paragraphs, it reaches an F1-score of 92% on the evaluation corpus prepared by the SMEs.</p></div>
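Equations (1)-(3) written out as plain functions, with a small worked example (the counts below are illustrative, not the paper's):

```python
# Precision, recall and F1 from TP/FP/FN counts, as in equations (1)-(3).
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Illustrative counts: 8 correct, 2 wrong, 2 missed identifications.
p, r, f1 = precision(8, 2), recall(8, 2), f1_score(8, 2, 2)
```

F1 is the harmonic mean of precision and recall, so it only rewards models that keep both values high at the same time.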
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Future Work Directions</head><p>The DL-based p-c-p NER model has been trained with a small set of training data (1717 training paragraphs) related to two patent authorities, US and PCT. Even though the model also obtained high evaluation scores on data from other patent authorities, it needs further training and testing to cover more citations belonging to different patent authorities, e.g., to cover citation patterns which might be specific to a particular patent authority. Based on our experience so far, a few thousand training paragraphs for each patent authority should be sufficient. To speed up this process, the visual active learning approach developed in this paper can be utilized.</p><p>To utilize the extracted citations for further tasks or applications, such as search or linking patents with a literature knowledge base through citations, the extracted citations need to be post-processed. This is important because significant portions of the identified citations were presented in the non-standard citation form, e.g., U.S. Pat. Nos. <ref type="bibr">5,188,960; 5,689,052; 5,880,275; 5,986,177; 7,105,332; 7,208,474.</ref> Hence, after extraction, the individual citations should be normalized accordingly: US5188960, US5689052, US5880275, etc. Another example of normalization is splitting an identified citation string, e.g., "EP 0 716 884 A2", into meaningful segments: the patent authority "EP", the patent number "0716884", the patent kind code "A2", and, finally, the normalized patent string "EP0716884A2".</p><p>To consider detailed patent citation types, such as the filing number of a patent application or the publication of a patent application, it is essential to integrate a patent-citation-specific scheme into the developed approach. 
For example, if we consider US patent citations, we noticed that the filing number of a US patent application has a specific format (e.g., No. 16/769,261), the publication of a US patent application has a specific format (e.g., US 2005/0114951 A1, starting with a year number), and a US patent has a specific format (e.g., US 6,808,085). Encoding such features into the developed approach will certainly lead to a significant improvement.</p></div>
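The two normalization steps described above can be sketched as follows. The exact post-processing used in the paper is not specified, so both functions are illustrative and cover only the example forms quoted in the text:

```python
import re

def expand_us_pat_nos(text: str) -> list[str]:
    """Expand a non-standard enumeration like
    'U.S. Pat. Nos. 5,188,960; 5,689,052' into ['US5188960', 'US5689052']."""
    numbers = re.findall(r"\d(?:[\d,]*\d)?", text.split("Nos.")[-1])
    return ["US" + n.replace(",", "") for n in numbers]

def split_citation(citation: str) -> dict:
    """Split a citation like 'EP 0 716 884 A2' into authority,
    number, kind code and the normalized string."""
    m = re.fullmatch(r"([A-Z]{2})\s*((?:\d\s*)+)([A-C]\d)?", citation)
    authority, number = m.group(1), m.group(2).replace(" ", "")
    kind = m.group(3) or ""
    return {"authority": authority, "number": number, "kind": kind,
            "normalized": authority + number + kind}
```

Applied to the quoted examples, the first function yields US5188960, US5689052, etc., and the second yields the segments EP / 0716884 / A2 with the normalized string EP0716884A2.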
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>In this paper, we have developed a DL-based p-c-p NER model to identify citations in patent full text. To this end, we have designed, implemented and evaluated an active learning framework for patent citation identification that can be used by patent SMEs to iteratively and significantly improve the model performance with less manual effort, and thus to train a robust p-c-p citation identification model with high accuracy. In the first iteration, the reviewed pre-annotated paragraphs (250 for each of the US and PCT databases) led to a significant model improvement. However, another iteration involving another 250 pre-annotated paragraphs per database was required in order to achieve the desired F1-score of 92% and to stop the re-training process.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 2 shows the workflow how the final NER model (based on Convolutional Neural Networks (CNNs)) is built for the p-c-p task within the active learning framework. To train the NER model we have utilized the open source framework spaCy<ref type="foot" target="#foot_6">7</ref>. For rapid implementation, we have used the spaCy implementation which is provided by the Prodigy framework. As spaCy offers no pre-</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Building the p-c-p NER model and improving it through experts' interaction</figDesc><graphic coords="4,89.29,84.19,416.69,226.77" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="3,89.29,84.19,416.70,255.13" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>, United States Provisional Application No.61/914,561, Japanese Unexamined Patent Publication No. 4-187748, US provisional application Serial No 61/640,128" etc. Based on that, we have developed and trained a suitable p-c-p NER DL model for identifying and extracting those types of citation patterns automatically (see Section 3).</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc>The p-c-p NER model overall scores for precision, recall and F1-score. TP is the number of correctly identified citations by the p-c-p model, and FN is the number of citations the model fails to identify.</figDesc><table><row><cell>DATABASE</cell><cell>Identified citations</cell><cell>FP</cell><cell>TP</cell><cell>FN</cell><cell>Precision</cell><cell>Recall</cell><cell>F1-Score</cell></row><row><cell>US</cell><cell>727</cell><cell>17</cell><cell>710</cell><cell>23</cell><cell>0.97</cell><cell>0.96</cell><cell>0.96</cell></row><row><cell>PCT</cell><cell>251</cell><cell>11</cell><cell>241</cell><cell>47</cell><cell>0.95</cell><cell>0.83</cell><cell>0.88</cell></row><row><cell>Summation</cell><cell>978</cell><cell>28</cell><cell>951</cell><cell>70</cell><cell>0.96</cell><cell>0.89</cell><cell>0.92</cell></row></table><note>𝑃 𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 = 𝑇 𝑃 / (𝑇 𝑃 + 𝐹 𝑃 ) (1)</note></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://github.com/kermitt2/grobid/tree/master/grobid-trainer/resources/dataset</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://www.epo.org/en</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://www.cas.org/support/training/stnanavist/uspatfull-anavist</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://www.cas.org/support/training/stnanavist/pctfull-anavist</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">https://prodi.gy/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">https://www.wipo.int/classifications/ipc/en/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6">https://spacy.io/</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A structural svm approach for reference parsing</title>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">X</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">R</forename><surname>Thoma</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICMLA.2010.77</idno>
	</analytic>
	<monogr>
		<title level="m">Ninth International Conference on Machine Learning and Applications</title>
				<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="479" to="484" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A trigram hidden markov model for metadata extraction from heterogeneous references</title>
		<author>
			<persName><forename type="first">B</forename><surname>Ojokoh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tang</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.ins.2011.01.014</idno>
		<ptr target="https://doi.org/10.1016/j.ins.2011.01.014" />
	</analytic>
	<monogr>
		<title level="j">Information Sciences</title>
		<imprint>
			<biblScope unit="volume">181</biblScope>
			<biblScope unit="page" from="1538" to="1551" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Grobid: Combining automatic bibliographic data recognition and term extraction for scholarship publications</title>
		<author>
			<persName><forename type="first">P</forename><surname>Lopez</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:27383212" />
	</analytic>
	<monogr>
		<title level="m">European Conference on Research and Advanced Technology for Digital Libraries</title>
				<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Improving named entity recognition for biomedical and patent data using bi-lstm deep neural network models</title>
		<author>
			<persName><forename type="first">F</forename><surname>Saad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Aras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Hackl-Sommer</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-51310-8_3</idno>
		<ptr target="https://doi.org/10.1007/978-3-030-51310-8_3" />
	</analytic>
	<monogr>
		<title level="m">Natural Language Processing and Information Systems -25th International Conference on Applications of Natural Language to Information Systems, NLDB 2020</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<editor>
			<persName><forename type="first">E</forename><surname>Métais</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Meziane</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Horacek</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Cimiano</surname></persName>
		</editor>
		<meeting><address><addrLine>Saarbrücken, Germany</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2020">June 24-26, 2020</date>
			<biblScope unit="volume">12089</biblScope>
			<biblScope unit="page" from="25" to="36" />
		</imprint>
	</monogr>
	<note>Proceedings</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A unified architecture for natural language processing: Deep neural networks with multitask learning</title>
		<author>
			<persName><forename type="first">R</forename><surname>Collobert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Weston</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 25th international conference on Machine learning</title>
				<meeting>the 25th international conference on Machine learning</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="160" to="167" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Natural language processing (almost) from scratch</title>
		<author>
			<persName><forename type="first">R</forename><surname>Collobert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Weston</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bottou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Karlen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kavukcuoglu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">P</forename><surname>Kuksa</surname></persName>
		</author>
		<idno>CoRR abs/1103.0398</idno>
		<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
	<note type="report_type">Computing Research Repository</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Improving feature extraction using a hybrid of cnn and lstm for entity identification</title>
		<author>
			<persName><forename type="first">E</forename><surname>Parsaeimehr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fartash</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Torkestani</surname></persName>
		</author>
		<idno type="DOI">10.1007/s11063-022-11122-y</idno>
	</analytic>
	<monogr>
		<title level="j">Neural Process Lett</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<biblScope unit="page" from="5979" to="5994" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
