<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Employing Active Learning for Training a DL-Model for Citation Identification in Patent Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Farag Saad</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hidir Aras</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark Prince</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CAS - Chemical Abstracts Service</institution>
          ,
          <addr-line>2540 Olentangy River Rd, Columbus, OH 43202</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>FIZ Karlsruhe - Leibniz Institute for Information Infrastructure</institution>
          ,
          <addr-line>Hermann-von-Helmholtz Platz 1 · 76344 Eggenstein-Leopoldshafen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <fpage>77</fpage>
      <lpage>81</lpage>
      <abstract>
        <p>Citations play an important role in patent analytics. Because the existing citation lists in patent documents are incomplete, automatically detecting and enhancing them from the patent text has long been a user need in patent information retrieval. In this paper, we describe an approach for identifying citations in patent text using Deep Learning (DL) models. We apply active learning to train and improve a DL-based named entity recognition (NER) model for this task. The evaluation showed high accuracy for the targeted citation type, i.e. the p-c-p (patent cites patent) case.</p>
      </abstract>
      <kwd-group>
        <kwd>Patent Citations</kwd>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Deep learning</kwd>
        <kwd>Active learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Citations play an important role in patent retrieval and analytics. Since the existing citation lists in patent documents are incomplete, it is a long-cherished user wish to automatically determine and complete them from the patent text. Furthermore, citations within the full-text of a patent do not follow well-defined patterns or rules, so identifying them with high accuracy is a challenging task. In addition, researching and implementing a suitable approach for utilizing citations from patent text depends on the use case (e.g., search for prior art, linking to corresponding online sources, etc.). Successfully creating such citation lists automatically from patent text will enable users to extend their discovery more efficiently to stated background or adjacent prior art.</p>
      <p>In general, patent citations come in two types: a patent cites another patent (herein referred to as p-c-p), or a patent cites literature (herein referred to as p-c-l), also known as NPL (non-patent literature) citations. In this paper the focus is on the p-c-p use case, for which we have designed, implemented and evaluated an approach for patent citation identification.</p>
      <p>There are two citation pattern types for the p-c-p use case: the standard citation pattern type and the non-standard citation pattern type. In the standard citation pattern type, patent applicants tend to use a simple form for referencing other patent publications, e.g., "US20050114951A1, WO 2006122188", while in the non-standard citation pattern type applicants tend to use more complex patterns for citing other patents, e.g., "U.S. Pat. Nos. 6,808,085; 6,736,293; 6,732,955; 6,708,846; 6,626,379; 6,626,330; 6,626,328; 6,454,185, United States Provisional Application No. 61/914,561, Japanese Unexamined Patent Publication No. 4-187748, US provisional application Serial No 61/640,128". Based on that, we have developed and trained a suitable p-c-p NER DL model for identifying and extracting these types of citation patterns automatically (see Section 3).</p>
      <p>In the following, we first briefly review the related work in Section 2, followed by a presentation of the proposed approach in Section 3. In Section 4 an empirical evaluation is presented and discussed. Future work is outlined in Section 5, and a conclusion is given in Section 6.</p>
      <sec id="sec-1-1">
        <title>2. Related Work</title>
        <p>In [5] the authors considered six standard NLP tasks, among them the NER task, where atomic elements in the sentence were labelled into categories such as "PERSON", "COMPANY", or "LOCATION". The authors used feature vectors generated from all words in an unlabelled corpus. A separate (orthographic) feature was included, based on the assumption that a capital letter at the beginning of a word is a strong indication that the word is a named entity. The proposed controlled features were later replaced with word embeddings [6] [7]. Word embeddings, which represent word meanings in an n-dimensional space, were learned from unlabelled data. A major strength of these approaches is that they allow the design of training algorithms that avoid task-specific engineering and instead rely on large, unlabelled data to discover internal word representations that are useful for the NER task.</p>
      </sec>
      <sec id="sec-1-3">
        <title>3. P-C-P Approach based on Deep Learning</title>
        <p>To find suitable training data for the p-c-p NER model, we first investigated whether there is any publicly available training dataset on which we could build the NER precursor model, which would then be used to enlarge the training data at hand. A precursor model is a type of temporary model which is trained on a small set of training data and is re-trained further based on larger training data. In the following, we give some insight into the publicly available training data as well as the training data generated using the active learning framework.</p>
        <sec id="sec-1-3-1">
          <title>3.1. Public Training Dataset</title>
          <p>We have identified two freely available datasets: the GROBID dataset (https://github.com/kermitt2/grobid/tree/master/grobidtrainer/resources/dataset) and the manually/expert-created dataset at FIZ Karlsruhe. The GROBID project specializes in literature citation extraction; however, it has recently done some work related to patent citation extraction. After pre-processing steps (e.g., removing non-English patents, corrupted documents, etc.), the total number of obtained annotated documents was 130, belonging to three patent authorities: EPO (https://www.epo.org/en), US (https://www.cas.org/support/training/stnanavist/uspatfullanavist) and PCT/WIPO (https://www.cas.org/support/training/stnanavist/pctfull-anavist). The citation coverage per patent document varies: some documents contain two citations, while others contain up to 375 citations. The domain of focus is life science. The total number of extracted training paragraphs that hold citations was 487. The FIZ dataset contains 41 annotated patent documents. The citation coverage for each patent varies between 2 and 66 citations. The domains of focus are life science and technology. The total number of extracted training paragraphs was 230. Both datasets have a different format, not equivalent to the NER state-of-the-art format, e.g., the IOB (inside-outside-beginning tagging) format (see [4]). We have unified the format of both datasets to the standard IOB format and processed the resulting training data (717 training paragraphs) to train the precursor p-c-p NER model.</p>
        </sec>
        <sec id="sec-1-3-2">
          <title>3.2. Generated P-C-P Training Data</title>
          <p>For training the precursor p-c-p NER model, the freely acquired training data is not sufficient and needs to be improved in quality and quantity. Therefore, we have generated our own training data. To achieve this goal, patent paragraphs that hold many citations are first pre-annotated by the p-c-p precursor model. Then, patent subject matter experts (SMEs) used the visual annotation user interface of the Prodigy annotation tool (https://prodi.gy/) to review and enhance the annotated p-c-p citations. Prodigy is an annotation tool with an easy-to-use interactive interface that supports active learning. It is a scriptable tool that allows users to create the annotations themselves, enabling rapid iterations.</p>
          <p>From the utilized patent full-text databases PCT (WIPO) and US we have prepared 500 citation-rich paragraphs each (1000 in total), belonging to an equally distributed set of patent documents based on their IPC/CPC classes (https://www.wipo.int/classifications/ipc/en/). In order to use citation-rich paragraphs for training, we have kept only the part of the detailed description (DETD) which holds at least 8 cited patents identified by the precursor p-c-p model. Furthermore, we took into account having sufficient training instance candidates (in the selected paragraphs) that represent the two types of citation patterns, the standard as well as the non-standard one. The training instance candidates are then reviewed by the SMEs, who accept or correct each instance using the Prodigy annotation tool (see Figure 1).</p>
        </sec>
        <sec id="sec-1-3-3">
          <title>3.3. Model Design and Training</title>
          <p>Figure 2 shows the workflow in which the final NER model (based on Convolutional Neural Networks (CNNs)) is built for the p-c-p task within the active learning framework. To train the NER model we have utilized the open-source framework spaCy (https://spacy.io/). For rapid implementation, we have used the spaCy implementation provided by the Prodigy framework. As spaCy offers no pre-trained NER model for identifying citations in patent text, we have trained our own DL-based p-c-p NER model utilizing the Prodigy framework (see Figure 2).</p>
        </sec>
      </sec>
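The IOB unification mentioned in Section 3.1 can be illustrated with a small sketch. The whitespace tokenizer and the character-span input format below are simplifying assumptions for illustration, not the authors' actual conversion pipeline:

```python
# Convert character-span citation annotations to IOB token tags.
# Hypothetical input format; the paper's source datasets differ.
def spans_to_iob(text, spans, label="CITATION"):
    """spans: list of (start, end) character offsets of cited patents."""
    tags = []
    pos = 0
    for token in text.split():          # naive whitespace tokenization
        start = text.index(token, pos)
        end = start + len(token)
        pos = end
        tag = "O"
        for s, e in spans:
            if start >= s and end <= e:
                # first token of the span gets B-, the rest get I-
                tag = ("B-" if start == s else "I-") + label
                break
        tags.append((token, tag))
    return tags

example = "See U.S. Pat. No. 6,808,085 for details."
# the citation span covers "U.S. Pat. No. 6,808,085"
print(spans_to_iob(example, [(4, 27)]))
```

Each token inside an annotated span is tagged `B-CITATION` (begin) or `I-CITATION` (inside), everything else `O`, which is the standard IOB shape expected by sequence-labelling trainers.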
      <p>The process starts by training a precursor model based on the public dataset, which is then utilized to enlarge the training data iteratively (see Figure 2 (1)). This precursor model was used to filter the acquired US and PCT raw data, where we have kept only the part of the Detailed Description (DETD) text which holds at least 8 patent citations (see Figure 2 (2)). We have then pre-processed and prepared the raw data to ensure that it holds sufficient citations. We then loaded part of it into the Prodigy tool along with the integrated precursor model to start the training process for the final NER model in several iterations. The input training instance candidates (in the selected paragraphs), which were annotated by the precursor model, are loaded into the Prodigy tool and presented to the SMEs for reviewing. The SMEs interacted with the pre-annotations and either approved, corrected or added new annotations in the presented paragraphs (see Figure 2 (3)).</p>
      <p>In the first iteration, the SMEs reviewed 250 pre-annotated paragraphs for each of the US and PCT databases. We have used these reviewed pre-annotated paragraphs to re-train the NER model (see Figure 2 (4)), where the model reached an F-score of 85%, and hence another iteration was required. To enhance the model performance, we have picked up more raw data and pre-annotated it again (250 paragraphs for each of the US and PCT databases), using the enhanced p-c-p NER model (see Figure 2 (2)). The newly prepared citation-rich paragraphs were reviewed by the SMEs and used to re-train the NER model. If needed, this process is iteratively repeated and ends when we reach a certain degree of confidence that the final NER model is sufficiently trained to be applied to the p-c-p citation identification task (see Figure 2 (4)).</p>
      <sec id="sec-1-2">
        <title>4. Evaluation</title>
        <p>Once the p-c-p model is sufficiently trained, the citation identification approach is evaluated by processing the evaluation corpus of 245 patents (prepared by SMEs), representing a random collection of patents from the US (128 patents) and PCT (117 patents). To start the p-c-p model evaluation process, all required materials, e.g., the identified citations of the SMEs' evaluation corpus, etc., were handed over to the SMEs.</p>
        <p>We have evaluated the p-c-p model based on the Precision (see Equation 1), Recall (see Equation 2) and F1-Score (see Equation 3) measures. In order to compute the scores, the False Positive (FP), True Positive (TP) and False Negative (FN) counts are determined first. FP refers to the number of wrongly identified citations by the p-c-p model. TP refers to the number of correctly identified citations by the p-c-p model. FN refers to the number of citations that the model fails to identify.</p>
        <p>Precision = TP / (TP + FP) (1)</p>
        <p>Recall = TP / (TP + FN) (2)</p>
        <p>F1-Score = 2 * (Precision * Recall) / (Precision + Recall) (3)</p>
        <p>As shown in Table 1, in total the p-c-p model successfully identified 951 citations out of 978 citations and failed to identify only 28 citations. An explanation for these failures is that in rare cases patent applicants do not cite citations in the right way; for example, consider the citation "This is a continuation-in-part of my copending application Ser. No. 43,784 for Catalytic Reformer Process". Here the patent authority marker (US, WO, JP, etc.) is missing, so the model has no clue which patent authority is meant; therefore, to minimize the error rate, the model was trained to neglect such incomplete citations.</p>
        <p>Even though the p-c-p model also obtained a high evaluation score for different patent authorities, e.g., Finnish, Japanese, etc., that it was not trained on, it failed to identify some citations that appear in special contexts. An effective solution for these failures is to increase the training data to cover more patent authorities. This can be done efficiently using the framework described in Section 3.2.</p>
        <p>Generally, the p-c-p model performed very well in most cases and achieved a high precision of 96% and a high recall of 89%. In addition, we have computed the F1-Score measure to take both Precision and Recall into account in order to ultimately measure the accuracy of the model. Despite the fact that the p-c-p model was trained on a very small training dataset consisting of 1717 training paragraphs, the F1-Score (using the evaluation corpus prepared by the SMEs) shows that the p-c-p model achieves a certain degree of accuracy, reaching an F1-Score of 92%.</p>
      </sec>
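The three measures used in the evaluation reduce to simple arithmetic over the TP/FP/FN counts. A minimal sketch of Equations 1-3 (the counts below are illustrative, not those of Table 1):

```python
def ner_scores(tp, fp, fn):
    """Precision, Recall and F1-Score from citation-level counts (Eqs. 1-3)."""
    precision = tp / (tp + fp)                 # Eq. 1
    recall = tp / (tp + fn)                    # Eq. 2
    f1 = 2 * (precision * recall) / (precision + recall)  # Eq. 3
    return precision, recall, f1

# Illustrative counts: 90 correct, 5 spurious, 10 missed citations.
p, r, f1 = ner_scores(tp=90, fp=5, fn=10)
print(round(p, 3), round(r, 3), round(f1, 3))
```

Because F1 is the harmonic mean of precision and recall, a model cannot compensate for poor recall with high precision alone, which is why the paper reports all three figures.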
    </sec>
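The iterative train-review-retrain cycle described in Sections 3.2 and 3.3 can be summarized as a loop. All callbacks below (`pre_annotate`, `review`, `retrain`, `evaluate`) are placeholders for the Prodigy/spaCy steps of the paper, not real API calls:

```python
def active_learning_loop(model, raw, pre_annotate, review, retrain, evaluate,
                         target_f1=0.92, batch=500):
    """Sketch of the Figure 2 workflow: pre-annotate citation-rich
    paragraphs, let SMEs review them, re-train, stop at the target score."""
    f1 = evaluate(model)
    while f1 < target_f1 and raw:
        candidates, raw = raw[:batch], raw[batch:]      # next citation-rich batch
        gold = review(pre_annotate(model, candidates))  # SME review in Prodigy
        model = retrain(model, gold)                    # re-train the NER model
        f1 = evaluate(model)                            # check the stop criterion
    return model, f1
```

Injecting the four steps as callbacks keeps the loop itself tool-agnostic; in the paper's setup each iteration consumed 250 paragraphs per database and stopped once the evaluation score was judged sufficient.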
    <sec id="sec-2">
      <title>5. Future Work Directions</title>
      <p>The DL-based p-c-p NER model has been trained with a small set of training data (1717 training paragraphs) related to two patent authorities, US and PCT. Even though the model also obtained a high evaluation score for data from different patent authorities, the developed model needs further training and testing to cover more citations belonging to different patent authorities, e.g., to cover citation patterns which might be specific to a given patent authority. Based on our experience so far, a few thousand training paragraphs for each patent authority should be sufficient. To speed up this process, the visual active learning approach developed in this paper can be utilized.</p>
      <p>To utilize the extracted citations for further tasks or applications, such as search or linking patents with a literature knowledge base through citations, the extracted citations need to be post-processed. This is important because significant portions of the identified citations were presented in the non-standard citation form, e.g., U.S. Pat. Nos. 5,188,960; 5,689,052; 5,880,275; 5,986,177; 7,105,332; 7,208,474. Hence, after extraction, the individual citations should be normalized accordingly: US5188960, US5689052, US5880275, etc. Another example of normalization is splitting up an identified citation string, e.g., "EP 0 716 884 A2", into meaningful segments: the patent authority "EP", the patent number "0716884", the patent kind code "A2", and, finally, the normalized patent string "EP0716884A2".</p>
      <p>To consider a detailed patent citation type, such as the filing number of a patent application or the publication of a patent application, it is essential to integrate a patent-citation-specific scheme into the developed approach. For example, for US patent citations we noticed that the filing number of a US patent application has a specific format (e.g., No. 16/769,261), the publication of a US patent application has a specific format (e.g., US 2005/0114951 A1, starting with a year number), and a US patent has a specific format (e.g., US 6,808,085). Encoding such features into the developed approach will certainly lead to a significant improvement.</p>
    </sec>
    <sec id="sec-3">
      <title>6. Conclusion</title>
      <p>In this paper, we have developed a DL-based p-c-p NER model to identify citations in the patent full-text. To realize that, we have designed, implemented and evaluated an active learning framework for patent citation identification employing a DL approach. Furthermore, to train a robust citation identification p-c-p model with high accuracy, we have designed the active learning framework so that patent SMEs can use it to iteratively improve the model performance significantly with less manual effort. In the first iteration, the reviewed pre-annotated paragraphs (250 for each of the US and PCT databases) led to a significant model improvement. However, another iteration involving a further 250 pre-annotated paragraphs for each database was required in order to achieve the desired F1-Score of 92% and to stop the re-training process.</p>
    </sec>
    <sec id="sec-4">
      <title>References</title>
      <p>[1] X. Zhang, J. Zou, D. X. Le, G. R. Thoma, A structural SVM approach for reference parsing, in: 2010 Ninth International Conference on Machine Learning and Applications, 2010, pp. 479-484. doi:10.1109/ICMLA.2010.77.</p>
      <p>[2] B. Ojokoh, M. Zhang, J. Tang, A trigram hidden Markov model for metadata extraction from heterogeneous references, Information Sciences 181 (2011) 1538-1551. URL: https://www.sciencedirect.com/science/article/pii/S0020025511000259. doi:10.1016/j.ins.2011.01.014.</p>
      <p>[3] P. Lopez, GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications, in: European Conference on Research and Advanced Technology for Digital Libraries, 2009. URL: https://api.semanticscholar.org/CorpusID:27383212.</p>
      <p>[4] F. Saad, H. Aras, R. Hackl-Sommer, Improving named entity recognition for biomedical and patent data using Bi-LSTM deep neural network models, in: E. Métais, F. Meziane, H. Horacek, P. Cimiano (Eds.), Natural Language Processing and Information Systems - 25th International Conference on Applications of Natural Language to Information Systems, NLDB 2020, Saarbrücken, Germany, June 24-26, 2020, Proceedings, volume 12089 of Lecture Notes in Computer Science, Springer, 2020, pp. 25-36. URL: https://doi.org/10.1007/978-3-030-51310-8_3. doi:10.1007/978-3-030-51310-8_3.</p>
      <p>[5] R. Collobert, J. Weston, A unified architecture for natural language processing: Deep neural networks with multitask learning, in: Proceedings of the 25th International Conference on Machine Learning, ACM, 2008, pp. 160-167.</p>
      <p>[6] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. P. Kuksa, Natural language processing (almost) from scratch, Computing Research Repository - CoRR abs/1103.0398 (2011).</p>
      <p>[7] E. Parsaeimehr, M. Fartash, J. A. Torkestani, Improving feature extraction using a hybrid of CNN and LSTM for entity identification, Neural Processing Letters 55 (2023) 5979-5994. doi:10.1007/s11063-022-11122-y.</p>
    </sec>
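The citation normalization proposed in Section 5 is largely mechanical. The following regex-based sketch handles the two example forms given there ("U.S. Pat. Nos." lists and EP-style strings) under simplified assumptions about the input; it is an illustration, not the planned post-processing component:

```python
import re

def normalize_us_list(citation):
    """'U.S. Pat. Nos. 5,188,960; 5,689,052' -> ['US5188960', 'US5689052'].
    Assumes comma-grouped US patent numbers, as in the Section 5 example."""
    numbers = re.findall(r"\d{1,3}(?:,\d{3})+", citation)
    return ["US" + n.replace(",", "") for n in numbers]

def segment_patent_string(citation):
    """Split an EP-style string like 'EP 0 716 884 A2' into authority,
    number and kind code, plus the normalized form."""
    m = re.match(r"([A-Z]{2})\s*((?:\d\s*)+)\s*([A-Z]\d?)", citation)
    if not m:
        return None
    authority = m.group(1)
    number = m.group(2).replace(" ", "")
    kind = m.group(3)
    return {"authority": authority, "number": number, "kind": kind,
            "normalized": authority + number + kind}

print(normalize_us_list("U.S. Pat. Nos. 5,188,960; 5,689,052; 5,880,275"))
print(segment_patent_string("EP 0 716 884 A2"))
```

A production version would need per-authority rules (e.g., the US filing-number and publication formats discussed in Section 5), which is exactly the citation-specific scheme the paper proposes to integrate.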
  </body>
  <back>
    <ref-list />
  </back>
</article>