=Paper=
{{Paper
|id=Vol-3775/paper16
|storemode=property
|title=Employing Active Learning for Training a DL-Model for Citation Identification in Patent Text
|pdfUrl=https://ceur-ws.org/Vol-3775/paper16.pdf
|volume=Vol-3775
|authors=Farag Saad,Hidir Aras,Mark Prince
|dblpUrl=https://dblp.org/rec/conf/patentsemtech/SaadAP24
}}
==Employing Active Learning for Training a DL-Model for Citation Identification in Patent Text==
Farag Saad1,*, Hidir Aras1 and Mark Prince2

1 FIZ Karlsruhe - Leibniz Institute for Information Infrastructure, Hermann-von-Helmholtz-Platz 1, 76344 Eggenstein-Leopoldshafen, Germany
2 CAS - Chemical Abstracts Service, 2540 Olentangy River Rd, Columbus, OH 43202, USA
Abstract
Citations play an important role in patent analytics. Because the existing citation lists in patent documents are incomplete, detecting and completing them automatically from the patent text has long been a user need in patent information retrieval. In this paper, we describe an approach for identifying citations in patent text using Deep Learning (DL) models. We apply active learning to train and improve a DL-based named entity recognition (NER) model for this task. The evaluation showed high accuracy for the targeted citation type, i.e., the p-c-p (patent cites patent) case.
Keywords
Patent Citations, Named Entity Recognition, Deep Learning, Active Learning
Paper presented at the 5th Workshop on Patent Text Mining and Semantic Technologies (PatentSemTech) 2024. Published in CEUR Workshop Proceedings, pp. 77–81.
Corresponding author: Farag Saad (farag.saad@fiz-karlsruhe.de); further contacts: hidir.aras@fiz-karlsruhe.de (H. Aras), MPrince@cas.org (M. Prince).
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Citations play an important role in patent retrieval and analytics. Since the existing citation lists in patent documents are incomplete, automatically identifying and completing them from the patent text has long been a user need. Furthermore, citations within the full text of a patent do not follow well-defined patterns or rules, so identifying them with high accuracy is a challenging task. In addition, a suitable approach for utilizing citations from patent text depends on the use case (e.g., search for prior art, linking to corresponding online sources, etc.). Successfully creating such citation lists automatically from the patent text will enable users to extend their discovery more efficiently to stated background or adjacent prior art.

In general, patent citations come in two types: a patent cites another patent (herein referred to as p-c-p), or a patent cites literature (herein referred to as p-c-l), also known as NPL (non-patent literature) citations. In this paper, the focus is on the p-c-p use case, for which we have designed, implemented and evaluated an approach for patent citation identification.

There are two citation pattern types for the p-c-p use case: the standard and the non-standard citation pattern. In the standard pattern, patent applicants use a simple form for referencing other patent publications, e.g., "US20050114951A1, WO 2006122188", while in the non-standard pattern applicants use more complex forms for citing other patents, e.g., "U.S. Pat. Nos. 6,808,085; 6,736,293; 6,732,955; 6,708,846; 6,626,379; 6,626,330; 6,626,328; 6,454,185, United States Provisional Application No. 61/914,561, Japanese Unexamined Patent Publication No. 4-187748, US provisional application Serial No 61/640,128". Based on this, we have developed and trained a p-c-p NER DL model for identifying and extracting these types of citation patterns automatically (see Section 3).

In the following, we first briefly review the related work in Section 2, followed by a presentation of the proposed approach in Section 3. In Section 4, an empirical evaluation is presented and discussed. Future work directions are given in Section 5, and a conclusion in Section 6.

2. Related Work

Several machine learning approaches have been applied to the problem of extracting data from free text (NER), i.e., citation extraction, among them Support Vector Machines (SVM), e.g., [1], Hidden Markov Models (HMM), e.g., [2], and Conditional Random Fields (CRF), e.g., [3]. In the past few years, however, Deep Learning (DL) approaches to the NER task (mainly LSTM = Long Short-Term Memory and CNN = Convolutional Neural Network) have become dominant, as they significantly outperformed the previous state-of-the-art approaches [4]. In contrast to classical machine learning approaches, where features are designed and prepared through human effort, deep learning is able to automatically discover hidden features from unlabelled data. The first application of a neural network (NN) to NER was proposed in [5]. In this paper
the authors considered six standard NLP tasks, among them the NER task, where atomic elements in a sentence were labelled into categories such as "PERSON", "COMPANY", or "LOCATION". The authors used feature vectors generated from all words in an unlabelled corpus. A separate (orthographic) feature was included, based on the assumption that a capital letter at the beginning of a word is a strong indication that the word is a named entity. The manually designed features were later replaced with word embeddings [6] [7]. Word embeddings, which represent word meanings in n-dimensional space, were learned from unlabelled data. A major strength of these approaches is that they allow the design of training algorithms that avoid task-specific engineering and instead rely on large amounts of unlabelled data to discover internal word representations that are useful for the NER task.

3. P-C-P Approach based on Deep Learning

To find suitable training data for the p-c-p NER model, we first investigated whether there is any publicly available training dataset that we can rely on to build the NER precursor model, which is then used to enlarge the training data at hand. A precursor model is a temporary model that is trained on a small set of training data and is re-trained further on larger training data. In the following, we give some insight into the publicly available training data as well as the training data generated using the active learning framework.

3.1. Public Training Dataset

We have identified two freely available datasets: the GROBID dataset (https://github.com/kermitt2/grobid/tree/master/grobid-trainer/resources/dataset) and the manually/expert-created dataset at FIZ Karlsruhe. The GROBID project specializes in literature citation extraction; however, it has recently done some work related to patent citation extraction. After pre-processing steps (e.g., removing non-English patents, corrupted documents, etc.), the total number of obtained annotated documents was 130, belonging to three patent authorities: EPO (https://www.epo.org/en), US (https://www.cas.org/support/training/stnanavist/uspatfull-anavist) and PCT/WIPO (https://www.cas.org/support/training/stnanavist/pctfull-anavist). The citation coverage for each patent document varies: some documents contain two citations while others contain up to 375 citations. The domain of focus is life science. The total number of extracted training paragraphs that hold citations was 487. The FIZ dataset contains 41 annotated patent documents. The citation coverage for each patent varies between 2 and 66 citations. The domains of focus are life science and technology. The total number of extracted training paragraphs was 230. Both datasets have a different format, not equivalent to the NER state-of-the-art format, e.g., the IOB (Inside-Outside-Beginning) tagging format (see [4]). We have unified the format of both datasets to the standard IOB format and processed the resulting training data (with 717 training paragraphs) to train the precursor p-c-p NER model.

3.2. Generated P-C-P Training Data

For training the precursor p-c-p NER model, the freely acquired training data is not sufficient and needed to be improved in quality and quantity. Therefore, we have generated our own training data. To achieve this goal, patent paragraphs that hold many citations are first pre-annotated by the p-c-p precursor model. Then, patent subject matter experts (SMEs) used the visual annotation user interface of the Prodigy annotation tool (https://prodi.gy/) to review and enhance the annotated p-c-p citations. Prodigy is an annotation tool with an easy-to-use interactive interface that supports active learning. It is a scriptable tool that allows users to create the annotations themselves, enabling rapid iterations.

From the utilized patent full-text databases PCT (WIPO) and US, we have prepared 500 citation-rich paragraphs each (1000 in total), belonging to an equally distributed set of patent documents based on their IPC/CPC classes (https://www.wipo.int/classifications/ipc/en/). In order to use citation-rich paragraphs for training, we have kept only the part of the detailed description (DETD) which holds at least 8 cited patents identified by the precursor p-c-p model. Furthermore, we took into account having sufficient training instance candidates (in the selected paragraphs) that represent the two types of citation patterns, the standard as well as the non-standard one. The training instance candidates are then reviewed by the SMEs, who accept or correct each instance using the Prodigy annotation tool (see Figure 1).

3.3. Model Design and Training

Figure 2 shows the workflow in which the final NER model (based on Convolutional Neural Networks (CNNs)) is built for the p-c-p task within the active learning framework. To train the NER model we have utilized the open source framework spaCy (https://spacy.io/). For rapid implementation, we have used the spaCy implementation which is provided by the Prodigy framework.
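Section 3.1 states that both datasets were unified to the standard IOB format before training. As a minimal sketch of what that encoding looks like (the tokenization and the label name CIT are illustrative assumptions, not the paper's actual schema):

```python
# Minimal sketch: encode a tokenized paragraph with citation spans
# into IOB tags (B-CIT / I-CIT / O). Token boundaries and the label
# name are illustrative assumptions, not the paper's actual schema.

def to_iob(tokens, citation_spans):
    """citation_spans: list of (start, end) token indices, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in citation_spans:
        tags[start] = "B-CIT"           # first token of the citation
        for i in range(start + 1, end):
            tags[i] = "I-CIT"           # continuation tokens
    return list(zip(tokens, tags))

tokens = ["See", "U.S.", "Pat.", "Nos.", "6,808,085", "and", "6,736,293", "."]
# One multi-token citation covering "U.S. Pat. Nos. 6,808,085"
for tok, tag in to_iob(tokens, [(1, 5)]):
    print(f"{tok}\t{tag}")
```

Each token then carries exactly one tag, which is the input format expected by standard sequence-labelling NER trainers.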
Figure 1: Pre-annotations provided by the p-c-p model and displayed in the Prodigy tool interface.
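The paper does not publish its training configuration. Purely as a hedged sketch, a spaCy v3 pipeline with a tok2vec + NER component for a task like this might be configured along these lines (all paths, values and the label setup are assumptions, not the authors' settings):

```ini
# Illustrative spaCy v3 config.cfg excerpt for a citation NER pipeline.
# All paths and values here are assumptions for the sketch, not the
# authors' actual configuration.

[paths]
train = "./train.spacy"
dev = "./dev.spacy"

[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]

[components.tok2vec]
factory = "tok2vec"

[components.ner]
factory = "ner"

[training]
max_epochs = 20
```

Such an excerpt would typically be completed with `python -m spacy init fill-config` and trained with `python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy`.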
As spaCy offers no pre-trained NER model for identifying citations in patent text, we have trained our own DL-based p-c-p NER model utilizing the Prodigy framework (see Figure 2).

The process starts by training a precursor model on the public dataset, which is then utilized to enlarge the training data iteratively (see Figure 2 (1)). This precursor model was used to filter the acquired US and PCT raw data, where we kept only the part of the Detailed Description (DETD) text which holds at least 8 patent citations (see Figure 2 (2)). We then pre-processed and prepared the raw data to ensure that it holds sufficient citations. We then loaded part of it into the Prodigy tool along with the integrated precursor model to start the training process for the final NER model in several iterations. The input training instance candidates (in the selected paragraphs), which were annotated by the precursor model, are loaded into the Prodigy tool and presented to the SMEs for review. The SMEs interacted with the pre-annotations and either approved, corrected or added new annotations (see Figure 2 (3)) in the presented paragraphs.

In the first iteration, the SMEs reviewed 250 pre-annotated paragraphs for each of the US and PCT databases. We used these reviewed pre-annotated paragraphs to re-train the NER model (see Figure 2 (4)), where the model reached an F-Score of 85%, and hence another iteration was required. To enhance the model performance, we picked up more raw data and pre-annotated it again (250 paragraphs for each of the US and PCT databases), using the enhanced p-c-p NER model (see Figure 2 (2)). The newly prepared citation-rich paragraphs were reviewed by the SMEs and used to re-train the NER model. If needed, this process is iteratively repeated and ends when we reach a certain degree of confidence that the final NER model is sufficiently trained to be applied to the p-c-p citation identification task (see Figure 2 (4)).

4. Evaluation

Once the p-c-p model is sufficiently trained, the citation identification approach is evaluated by processing an evaluation corpus of 245 patents (prepared by SMEs), representing a random collection of patents from the US (128 patents) and PCT (117 patents). To start the p-c-p model evaluation process, all required materials, e.g., the identified citations of the SMEs' evaluation corpus, etc., were handed over to the SMEs.

We have evaluated the p-c-p model based on the Precision (P) (see Equation 1), Recall (R) (see Equation 2), and F1-Score (see Equation 3) measures. In order to compute the scores, the False Positive (FP), True Positive (TP) and False Negative (FN) counts are determined first.
Figure 2: Building the p-c-p NER model and improving it through experts' interaction.

Table 1
The p-c-p NER model overall scores for precision, recall and F1-Score

DATABASE   Identified citations   FP   TP   FN   Precision   Recall   F1-Score
US         727                    17   710  23   0.97        0.96     0.96
PCT        251                    11   241  47   0.95        0.83     0.88
Total      978                    28   951  70   0.96        0.89     0.92
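The FP/TP/FN counts underlying Table 1 come from comparing the model's predicted citations against the SME gold annotations. A minimal sketch of such a comparison, assuming exact span matching (the paper does not state its matching criterion) and hypothetical data:

```python
# Minimal sketch: derive TP/FP/FN counts by exact matching of predicted
# citation spans against gold annotations. The matching criterion and
# the example data are assumptions, not taken from the paper.

def count_matches(gold, predicted):
    """gold, predicted: iterables of (text, start, end) citation spans."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)   # correctly identified citations
    fp = len(predicted - gold)   # wrongly identified citations
    fn = len(gold - predicted)   # citations the model missed
    return tp, fp, fn

gold = {("US5188960", 10, 19), ("EP0716884A2", 40, 51)}
pred = {("US5188960", 10, 19), ("US999", 60, 65)}
tp, fp, fn = count_matches(gold, pred)
print(tp, fp, fn)  # → 1 1 1
```

A real evaluation may relax exact matching (e.g., allowing partial overlaps), which would shift the counts accordingly.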
The FP count refers to the number of wrongly identified citations by the p-c-p model. The TP count refers to the number of correctly identified citations by the p-c-p model. The FN count refers to the number of citations that the model fails to identify.

Precision = TP / (TP + FP)    (1)

Recall = TP / (TP + FN)    (2)

F1-Score = 2 * (Precision * Recall) / (Precision + Recall)    (3)

As shown in Table 1, in total the p-c-p model identified 951 citations correctly (TP) out of 978 identified citations, with only 28 wrongly identified citations (FP). An explanation of these failures is that in rare cases patent applicants do not cite patents in the standard way; for example, consider the citation "This is a continuation-in-part of my copending application Ser. No. 43,784 for Catalytic Reformer Process". Here the patent authority marker (US, WO, JP, etc.) is missing, so the model has no clue which patent authority is meant; therefore, to minimize the error rate, the model was trained to neglect such incomplete citations.

Even though the p-c-p model also obtained high evaluation scores for patent authorities it was not trained on, e.g., Finnish, Japanese, etc., it failed to identify some citations that appear in special contexts. An effective solution for these failures is to increase the training data to cover more patent authorities. This can be done efficiently using the framework described in Section 3.2.

Generally, the p-c-p model performed very well in most cases, achieving a high precision of 96% and a high recall of 89%. In addition, we have computed the F1-Score measure to take both Precision and Recall into account in order to ultimately measure the accuracy of the model. Despite the fact that the p-c-p model was trained on a very small training dataset consisting of 1717 training paragraphs, the F1-Score (using the evaluation corpus prepared by the SMEs) shows that the p-c-p model achieves a high degree of accuracy, reaching an F1-Score of 92%.
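Equations (1)-(3) translate directly into code. The counts in this sketch are illustrative only, deliberately not the evaluation counts reported in Table 1:

```python
# Equations (1)-(3) as code. The counts below are illustrative only,
# not the evaluation counts reported in the paper's Table 1.

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(p, r):
    return 2 * (p * r) / (p + r)

tp, fp, fn = 90, 10, 10          # hypothetical counts
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, round(f1_score(p, r), 2))  # → 0.9 0.9 0.9
```

Since the F1-Score is the harmonic mean of precision and recall, it penalizes a model that trades one measure for the other, which is why it is used here as the single accuracy figure.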
5. Future Work Directions

The DL-based p-c-p NER model has been trained with a small set of training data (1717 training paragraphs) related to two patent authorities, US and PCT. Even though the model also obtained high evaluation scores for data from other patent authorities, it needs further training and testing to cover more citations belonging to different patent authorities, e.g., to cover citation patterns which might be specific to a particular patent authority. Based on our experience so far, a few thousand training paragraphs for each patent authority should be sufficient. To speed up this process, the visual active learning approach developed in this paper can be utilized.

To utilize the extracted citations for further tasks or applications, such as search or linking patents with a literature knowledge base through citations, the extracted citations need to be post-processed. This is important because significant portions of the identified citations were presented in the non-standard citation form, e.g., U.S. Pat. Nos. 5,188,960; 5,689,052; 5,880,275; 5,986,177; 7,105,332; 7,208,474. Hence, after extraction, the individual citations should be normalized accordingly: US5188960, US5689052, US5880275, etc. Another example of normalization is splitting an identified citation string, e.g., "EP 0 716 884 A2", into meaningful segments: the patent authority "EP", the patent number "0716884", the patent kind code "A2", and, finally, the normalized patent string "EP0716884A2".

To consider detailed patent citation types, such as the filing number of a patent application, the publication of a patent application, etc., it is essential to integrate a patent-citation-specific scheme into the developed approach. For example, considering US patent citations, we noticed that the filing number of a US patent application has a specific format (e.g., No. 16/769,261), the publication of a US patent application has a specific format (e.g., US 2005/0114951 A1, starting with a year number), and a US patent has a specific format (e.g., US 6,808,085). Encoding such features into the developed approach will certainly lead to a significant improvement.

6. Conclusion

In this paper, we have developed a DL-based p-c-p NER model to identify citations in patent full text. To realize this, we have designed, implemented and evaluated an active learning framework for patent citation identification employing a DL approach. To train a robust p-c-p citation identification model with high accuracy, this framework can be used by patent SMEs to iteratively improve the model performance significantly with less manual effort. In the first iteration, the reviewed pre-annotated paragraphs (250 for each of the US and PCT databases) led to a significant model improvement. However, another iteration involving another 250 pre-annotated paragraphs per database was required in order to achieve the desired F1-Score of 92% and to stop the re-training process.

References

[1] X. Zhang, J. Zou, D. X. Le, G. R. Thoma, A structural SVM approach for reference parsing, in: 2010 Ninth International Conference on Machine Learning and Applications, 2010, pp. 479–484. doi:10.1109/ICMLA.2010.77.

[2] B. Ojokoh, M. Zhang, J. Tang, A trigram hidden Markov model for metadata extraction from heterogeneous references, Information Sciences 181 (2011) 1538–1551. URL: https://www.sciencedirect.com/science/article/pii/S0020025511000259. doi:10.1016/j.ins.2011.01.014.

[3] P. Lopez, GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications, in: European Conference on Research and Advanced Technology for Digital Libraries, 2009. URL: https://api.semanticscholar.org/CorpusID:27383212.

[4] F. Saad, H. Aras, R. Hackl-Sommer, Improving named entity recognition for biomedical and patent data using Bi-LSTM deep neural network models, in: E. Métais, F. Meziane, H. Horacek, P. Cimiano (Eds.), Natural Language Processing and Information Systems - 25th International Conference on Applications of Natural Language to Information Systems, NLDB 2020, Saarbrücken, Germany, June 24-26, 2020, Proceedings, volume 12089 of Lecture Notes in Computer Science, Springer, 2020, pp. 25–36. doi:10.1007/978-3-030-51310-8_3.

[5] R. Collobert, J. Weston, A unified architecture for natural language processing: Deep neural networks with multitask learning, in: Proceedings of the 25th International Conference on Machine Learning, ACM, 2008, pp. 160–167.

[6] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. P. Kuksa, Natural language processing (almost) from scratch, Computing Research Repository - CoRR abs/1103.0398 (2011).

[7] E. Parsaeimehr, M. Fartash, J. A. Torkestani, Improving feature extraction using a hybrid of CNN and LSTM for entity identification, Neural Processing Letters 55 (2023) 5979–5994. doi:10.1007/s11063-022-11122-y.