                                Employing Active Learning for Training a DL-Model for
                                Citation Identification in Patent Text⋆
                                Farag Saad1,*,† , Hidir Aras1,† and Mark Prince2,†
                                1
                                    FIZ Karlsruhe - Leibniz Institute for Information Infrastructure, Hermann-von-Helmholtz Platz 1 · 76344 Eggenstein-Leopoldshafen,Germany
                                2
                                    CAS - Chemical Abstracts Service, 2540 Olentangy River Rd, Columbus, OH 43202, USA


                                                  Abstract
Citations play an important role in patent analytics. Because the citation lists in patent documents are incomplete, automatically detecting and extending them from the patent text has long been a user need in patent information retrieval. In this paper, we describe an approach for identifying citations in patent text using Deep Learning (DL) models. We apply active learning to train and improve a DL-based named entity recognition (NER) model for this task. The evaluation showed high accuracy for the targeted citation type, i.e. the p-c-p (patent cites patent) case.

                                                  Keywords
                                                  Patent Citations, Named Entity Recognition, Deep learning, Active learning



1. Introduction

Citations play an important role in patent retrieval and analytics. Since the existing citation lists in patent documents are incomplete, it is a long-standing user wish to determine and complete them automatically from the patent text. Moreover, citations within a patent full-text do not follow well-defined patterns or rules, so identifying them with high accuracy is a challenging task. In addition, researching and implementing a suitable approach for utilizing citations from patent text depends on the use case (e.g., searching for prior art, linking to corresponding online sources, etc.). Successfully creating such citation lists automatically from patent text will enable users to extend their discovery more efficiently to the stated background or adjacent prior art.

In general, patent citations come in two types: a patent cites another patent (herein referred to as p-c-p) or a patent cites literature (herein referred to as p-c-l), also known as NPL (non-patent literature) citations. In this paper the focus is on the p-c-p use case, for which we have designed, implemented and evaluated an approach for patent citation identification.

There are two citation pattern types for the p-c-p use case: the standard and the non-standard citation pattern type. In the standard citation pattern type, patent applicants tend to use a simple form for referencing other patent publications, e.g., "US20050114951A1, WO 2006122188", while in the non-standard citation pattern type patent applicants tend to use more complex patterns for citing other patents, e.g., "U.S. Pat. Nos. 6,808,085; 6,736,293; 6,732,955; 6,708,846; 6,626,379; 6,626,330; 6,626,328; 6,454,185, United States Provisional Application No. 61/914,561, Japanese Unexamined Patent Publication No. 4-187748, US provisional application Serial No. 61/640,128". Based on that, we have developed and trained a suitable p-c-p NER DL model for identifying and extracting these types of citation patterns automatically (see Section 3).

In the following, we first briefly review the related work in Section 2, followed by a presentation of the proposed approach in Section 3. In Section 4 an empirical evaluation is presented and discussed. Future work is outlined in Section 5, and a conclusion is given in Section 6.

2. Related Work

Several machine learning approaches have been applied to the problem of extracting data from free text (NER), e.g., citation extraction; among these approaches are Support Vector Machines (SVM), e.g., [1], Hidden Markov Models (HMM), e.g., [2], and Conditional Random Fields (CRF), e.g., [3]. In the past few years, however, Deep Learning (DL) approaches for the NER task (mainly Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN) models) became dominant, as they significantly outperformed the previous state-of-the-art approaches [4]. In contrast to classical machine learning approaches, where features are designed and prepared through human effort, deep learning is able to automatically discover hidden features from unlabelled data. The first application of a neural network (NN) to NER was proposed in [5].

---
5th Workshop on Patent Text Mining and Semantic Technologies (PatentSemTech) 2024
* Corresponding author.
$ farag.saad@fiz-karlsruhe.de (F. Saad); hidir.aras@fiz-karlsruhe.de (H. Aras); MPrince@cas.org (M. Prince)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
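To make the two pattern types concrete: a naive rule-based baseline can capture many standard-form citations, while the non-standard forms quoted above quickly defeat such rules, which motivates the NER approach. The following sketch is purely illustrative and not part of the described system; the regular expression and function name are our own.

```python
import re

# Illustrative pattern for "standard" citations such as "US20050114951A1"
# or "WO 2006122188": a two-letter authority code, an optional space,
# a publication number, and an optional kind code.
STANDARD_CITATION = re.compile(r"\b([A-Z]{2})\s?(\d{6,11})\s?([A-Z]\d?)?")

def find_standard_citations(text):
    """Return (authority, number, kind code) tuples for standard-form citations."""
    return [m.groups() for m in STANDARD_CITATION.finditer(text)]

print(find_standard_citations("see US20050114951A1 and WO 2006122188"))
# A non-standard citation such as "U.S. Pat. Nos. 6,808,085; 6,736,293"
# yields no match here, illustrating why fixed rules do not generalize.
```
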




Farag Saad et al. CEUR Workshop Proceedings                                                                                   77–81



In that work, the authors considered six standard NLP tasks, among them NER, where atomic elements in the sentence are labelled with categories such as "PERSON", "COMPANY", or "LOCATION". The authors used feature vectors generated from all words in an unlabelled corpus. A separate (orthographic) feature was included based on the assumption that a capital letter at the beginning of a word is a strong indication that the word is a named entity. The proposed hand-crafted features were later replaced with word embeddings [6][7]. Word embeddings, which represent word meanings in an n-dimensional space, are learned from unlabelled data. A major strength of these approaches is that they allow the design of training algorithms that avoid task-specific engineering and instead rely on large, unlabelled data to discover internal word representations that are useful for the NER task.

3. P-C-P Approach based on Deep Learning

To find suitable training data for the p-c-p NER model, we first investigated whether there is any publicly available training dataset that we can rely on to build the NER precursor model, which is then used to enlarge the training data we have in hand. A precursor model is a temporary model that is trained on a small set of training data and re-trained further on larger training data.

In the following, we give some insight into the publicly available training data as well as the training data generated using the active learning framework.

3.1. Public Training Dataset

We have identified two freely available datasets: the GROBID¹ dataset and the manually, expert-created dataset at FIZ Karlsruhe. The GROBID project specializes in literature citation extraction; however, it has recently done some work related to patent citation extraction. After pre-processing steps, e.g., removing non-English patents, corrupted documents, etc., the total number of obtained annotated documents was 130, belonging to three patent authorities: EPO², US³ and PCT (WIPO)⁴. The citation coverage per patent document varies: some documents contain two citations while others contain up to 375 citations. The domain of focus is life science. The total number of extracted training paragraphs that hold citations was 487. The FIZ dataset contains 41 annotated patent documents. The citation coverage per patent varies between 2 and 66 citations. The domains of focus are life science and technology. The total number of extracted training paragraphs was 230. Both datasets have a different format, not equivalent to the NER state-of-the-art format, e.g., the IOB (inside-outside-beginning tagging) format (see [4]). We have unified the format of both datasets to the standard IOB format and processed the resulting training data (717 training paragraphs in total) to train the precursor p-c-p NER model.

3.2. Generated P-C-P Training Data

For training the precursor p-c-p NER model, the freely acquired training data is not sufficient and needs to be improved in quality and quantity. Therefore, we have generated our own training data. To achieve this goal, patent paragraphs that hold many citations are first pre-annotated by the p-c-p precursor model. Then, the patent subject matter experts (SMEs) used the visual annotation user interface of the Prodigy⁵ annotation tool to review and enhance the annotated p-c-p citations. Prodigy is an annotation tool with an easy-to-use interactive interface that supports active learning. It is a scriptable tool that allows users to create the annotations themselves, enabling rapid iterations.

From the utilized patent full-text databases PCT (WIPO) and US we have prepared 500 citation-rich paragraphs each (1000 in total), belonging to an equally distributed set of patent documents based on their IPC/CPC⁶ classes. In order to use citation-rich paragraphs for training, we have kept only the part of the detailed description (DETD) which holds at least 8 cited patents identified by the precursor p-c-p model. Furthermore, we took care to have sufficient training instance candidates (in the selected paragraphs) that represent the two types of citation patterns, the standard as well as the non-standard one. The training instance candidates are then reviewed by the SMEs, who accept or correct each instance using the Prodigy annotation tool (see Figure 1).

3.3. Model Design and Training

Figure 2 shows the workflow by which the final NER model (based on Convolutional Neural Networks (CNNs)) is built for the p-c-p task within the active learning framework. To train the NER model we have utilized the open-source framework spaCy⁷. For rapid implementation, we have used the spaCy implementation which is provided by the Prodigy framework.

¹ https://github.com/kermitt2/grobid/tree/master/grobid-trainer/resources/dataset
² https://www.epo.org/en
³ https://www.cas.org/support/training/stnanavist/uspatfull-anavist
⁴ https://www.cas.org/support/training/stnanavist/pctfull-anavist
⁵ https://prodi.gy/
⁶ https://www.wipo.int/classifications/ipc/en/
⁷ https://spacy.io/
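The IOB unification described in Section 3.1 can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and character-offset span annotations; it is not the exact conversion script used by the authors.

```python
def to_iob(text, spans, label="CITATION"):
    """Convert character-level citation spans to token-level IOB tags.

    `spans` is a list of (start, end) character offsets; tokens are taken
    to be whitespace-separated (a simplification for illustration).
    """
    tokens, tags, pos = [], [], 0
    for tok in text.split():
        start = text.index(tok, pos)      # character offset of this token
        end = start + len(tok)
        pos = end
        tag = "O"
        for s, e in spans:
            if start >= s and end <= e:   # token lies inside a citation span
                tag = "B-" + label if start == s else "I-" + label
                break
        tokens.append(tok)
        tags.append(tag)
    return list(zip(tokens, tags))

text = "cited in US20050114951A1 and elsewhere"
print(to_iob(text, [(9, 24)]))
# [('cited', 'O'), ('in', 'O'), ('US20050114951A1', 'B-CITATION'),
#  ('and', 'O'), ('elsewhere', 'O')]
```
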








Figure 1: Pre-annotations provided by the p-c-p model and displayed in the Prodigy tool interface.



As spaCy offers no pre-trained NER model for identifying citations in patent text, we have trained our own DL-based p-c-p NER model utilizing the Prodigy framework (see Figure 2).

The process starts by training a precursor model based on the public dataset, which is then utilized to enlarge the training data iteratively (see Figure 2 (1)). This precursor model was used to filter the acquired US and PCT raw data, where we have kept only the part of the Detailed Description (DETD) text which holds at least 8 patent citations (see Figure 2 (2)). We have then pre-processed and prepared the raw data to ensure that it holds sufficient citations. We then loaded part of it into the Prodigy tool along with the integrated precursor model to start the training process for the final NER model in several iterations. The input training instance candidates (in the selected paragraphs), which were annotated by the precursor model, are loaded into the Prodigy tool and presented to the SMEs for review. The SMEs interacted with the pre-annotations and either approved, corrected or added new annotations (see Figure 2 (3)) in the presented paragraphs.

In the first iteration, the SMEs reviewed 250 pre-annotated paragraphs for each of the US and PCT databases. We used these reviewed pre-annotated paragraphs to re-train the NER model (see Figure 2 (4)), where the model reached an F-score of 85%, and hence another iteration was required. To enhance the model performance, we picked up more raw data and pre-annotated it again (250 paragraphs for each of the US and PCT databases), using the enhanced p-c-p NER model (see Figure 2 (2)). The newly prepared citation-rich paragraphs were reviewed by the SMEs and used to re-train the NER model. If needed, this process is repeated iteratively and ends when we reach a certain degree of confidence that the final NER model is sufficiently trained to be applied to the p-c-p citation identification task (see Figure 2 (4)).

4. Evaluation

Once the p-c-p model is sufficiently trained, the citation identification approach is evaluated by processing the evaluation corpus of 245 patents (prepared by SMEs), representing a random collection of patents from the US (128 patents) and PCT (117 patents). To start the p-c-p model evaluation process, all required materials, e.g., the identified citations of the SMEs' evaluation corpus, etc., were handed over to the SMEs.

We have evaluated the p-c-p model based on the Precision (P) (see equation 1), Recall (R) (see equation 2), and F1-score (see equation 3) measures. In order to compute the scores, the False Positive (FP), True Positive (TP) and False Negative (FN) counts are determined first. FP refers to the number of citations wrongly identified by the p-c-p model.







Figure 2: Building the p-c-p NER model and improving it through experts' interaction.
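The iterative build-review-retrain workflow of Figure 2 can be sketched as a loop. All callables here are hypothetical stand-ins for the Prodigy/spaCy components (our own naming), and the stop criterion stands for the confidence threshold described in the text.

```python
def active_learning_loop(raw_paragraphs, model, pre_annotate, review,
                         retrain, evaluate, target_f1=0.92, batch_size=500):
    """Train-review-retrain loop in the spirit of Figure 2 (1)-(4).

    All callables are stand-ins: `pre_annotate` labels a batch with the
    current model, `review` applies the SMEs' corrections, `retrain`
    returns an updated model, and `evaluate` returns its F1 score.
    """
    training_data = []
    while raw_paragraphs:
        batch, raw_paragraphs = raw_paragraphs[:batch_size], raw_paragraphs[batch_size:]
        candidates = pre_annotate(model, batch)   # (2) pre-annotate raw paragraphs
        training_data += review(candidates)       # (3) SMEs approve/correct/add
        model = retrain(model, training_data)     # (4) re-train the NER model
        if evaluate(model) >= target_f1:          # stop once confident enough
            break
    return model

# Toy demo: the "model" is just its F1 score, improving with each iteration.
final = active_learning_loop(
    raw_paragraphs=list(range(1500)), model=0.80,
    pre_annotate=lambda m, b: b, review=lambda c: c,
    retrain=lambda m, d: round(m + 0.05, 2), evaluate=lambda m: m)
print(final)  # 0.95: the loop stops in the third iteration
```
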


Table 1
The p-c-p NER model's overall precision, recall and F1-score

    DATABASE     Identified citations   FP    TP    FN   Precision   Recall   F1-Score
    US                   727            17   710    23     0.97       0.96      0.96
    PCT                  251            11   241    47     0.95       0.83      0.88
    Summation            978            28   951    70     0.96       0.89      0.92



TP refers to the number of citations correctly identified by the p-c-p model, and FN refers to the number of citations that the model fails to identify.

    Precision = TP / (TP + FP)    (1)

    Recall = TP / (TP + FN)    (2)

    F1-Score = 2 * (Precision * Recall) / (Precision + Recall)    (3)

As shown in Table 1, in total 951 of the 978 citations identified by the p-c-p model were correct, while only 28 were wrongly identified. An explanation of these failures is that in rare cases patent applicants do not cite in the standard way; for example, consider the citation "This is a continuation-in-part of my copending application Ser. No. 43,784 for Catalytic Reformer Process". Here the patent authority marker (US, WO, JP, etc.) is missing, so the model has no clue which patent authority is meant; therefore, to minimize the error rate, the model was trained to neglect such incomplete citations.

Even though the p-c-p model also obtained a high evaluation score for patent authorities it was not trained on, e.g., Finnish, Japanese, etc., it failed to identify some citations that appear in special contexts. An effective solution for these failures is to increase the training data to cover more patent authorities. This can be done efficiently using the framework described in Section 3.2.

Generally, the p-c-p model performed very well in most cases, achieving a high precision of 96% and a high recall of 89%. In addition, we have computed the F1-score measure, which takes both precision and recall into account, to ultimately measure the accuracy of the model. Despite the fact that the p-c-p model was trained on a very small training dataset consisting of 1717 training paragraphs, the F1-score (on the evaluation corpus prepared by the SMEs) shows that the p-c-p model achieves a high degree of accuracy, reaching 92%.



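Equations (1)-(3) translate directly into code. The following sketch recomputes the measures from the raw counts of the US row of Table 1 (the table reports these values in rounded form):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 as defined in equations (1)-(3)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the US row of Table 1: 710 TP, 17 FP, 23 FN.
p, r, f1 = precision_recall_f1(710, 17, 23)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # P=0.977 R=0.969 F1=0.973
```

Note that F1 can equivalently be computed as 2*TP / (2*TP + FP + FN), which avoids the intermediate precision and recall values.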



5. Future Work Directions

The DL-based p-c-p NER model has been trained with a small set of training data (1717 training paragraphs) related to the two patent authorities US and PCT. Even though the model also obtained a high evaluation score on data of other patent authorities, the developed model needs further training and testing to cover more citations belonging to different patent authorities, e.g., to cover citation patterns which might be specific to some patent authority. Based on our experience so far, a few thousand training paragraphs for each patent authority should be sufficient. To speed up this process, the visual active learning approach developed in this paper can be utilized.

To utilize the extracted citations for further tasks or applications, such as search, linking patents with a literature knowledge base through citations, etc., the extracted citations need to be post-processed. This is important because significant portions of the identified citations were presented in the non-standard citation form, e.g., U.S. Pat. Nos. 5,188,960; 5,689,052; 5,880,275; 5,986,177; 7,105,332; 7,208,474. Hence, after extraction, the individual citations should be normalized accordingly: US5188960, US5689052, US5880275, etc. Another example of normalization is splitting an identified citation string, e.g., "EP 0 716 884 A2", into meaningful segments: the patent authority "EP", the patent number "0716884", the patent kind code "A2", and, finally, the normalized patent string "EP0716884A2".

To consider a detailed patent citation type, such as the filing number of a patent application, the publication of a patent application, etc., it is essential to integrate a patent-citation-specific scheme into the developed approach. For example, for US patent citations we noticed that the filing number of a US patent application has a specific format (e.g., No. 16/769,261), the publication of a US patent application has a specific format (e.g., US 2005/0114951 A1; starting with a year number) and a US patent has a specific format (e.g., US 6,808,085). Encoding such features into the developed approach will certainly lead to a significant improvement.

6. Conclusion

In this paper, we have developed a DL-based p-c-p NER model to identify citations in patent full-text. To realize this, we have designed, implemented and evaluated an active learning framework for patent citation identification employing a DL approach. Furthermore, to train a robust p-c-p citation identification model with high accuracy, we have designed the active learning framework such that patent SMEs can iteratively and significantly improve the model performance with less manual effort. In the first iteration, the reviewed pre-annotated paragraphs (250 for each of the US and PCT databases) led to a significant model improvement. However, another iteration involving another 250 pre-annotated paragraphs for each database was required in order to achieve the desired F1-score of 92% and to stop the re-training process.

References

[1] X. Zhang, J. Zou, D. X. Le, G. R. Thoma, A structural SVM approach for reference parsing, in: 2010 Ninth International Conference on Machine Learning and Applications, 2010, pp. 479-484. doi:10.1109/ICMLA.2010.77.
[2] B. Ojokoh, M. Zhang, J. Tang, A trigram hidden Markov model for metadata extraction from heterogeneous references, Information Sciences 181 (2011) 1538-1551. URL: https://www.sciencedirect.com/science/article/pii/S0020025511000259. doi:10.1016/j.ins.2011.01.014.
[3] P. Lopez, GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications, in: European Conference on Research and Advanced Technology for Digital Libraries, 2009. URL: https://api.semanticscholar.org/CorpusID:27383212.
[4] F. Saad, H. Aras, R. Hackl-Sommer, Improving named entity recognition for biomedical and patent data using bi-LSTM deep neural network models, in: E. Métais, F. Meziane, H. Horacek, P. Cimiano (Eds.), Natural Language Processing and Information Systems - 25th International Conference on Applications of Natural Language to Information Systems, NLDB 2020, Saarbrücken, Germany, June 24-26, 2020, Proceedings, volume 12089 of Lecture Notes in Computer Science, Springer, 2020, pp. 25-36. doi:10.1007/978-3-030-51310-8_3.
[5] R. Collobert, J. Weston, A unified architecture for natural language processing: Deep neural networks with multitask learning, in: Proceedings of the 25th International Conference on Machine Learning, ACM, 2008, pp. 160-167.
[6] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. P. Kuksa, Natural language processing (almost) from scratch, Computing Research Repository - CoRR abs/1103.0398 (2011).
[7] E. Parsaeimehr, M. Fartash, J. A. Torkestani, Improving feature extraction using a hybrid of CNN and LSTM for entity identification, Neural Processing Letters 55 (2023) 5979-5994. doi:10.1007/s11063-022-11122-y.




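The citation normalization described in Section 5 can be sketched as follows. These helper functions are illustrative (our own naming, not part of the evaluated system) and assume the citation strings have already been extracted by the p-c-p model; the expected results are the normalized forms given in the text above.

```python
import re

def normalize_us_pat_nos(citation):
    """Expand a non-standard "U.S. Pat. Nos." list into normalized numbers."""
    numbers = re.findall(r"\d{1,3}(?:,\d{3})+", citation)
    return ["US" + n.replace(",", "") for n in numbers]

def split_citation(citation):
    """Split a citation string like "EP 0 716 884 A2" into its segments."""
    authority, *middle, kind = citation.split()
    number = "".join(middle)
    return {"authority": authority, "number": number, "kind_code": kind,
            "normalized": authority + number + kind}

print(normalize_us_pat_nos("U.S. Pat. Nos. 5,188,960; 5,689,052"))
# ['US5188960', 'US5689052']
print(split_citation("EP 0 716 884 A2"))
# {'authority': 'EP', 'number': '0716884', 'kind_code': 'A2',
#  'normalized': 'EP0716884A2'}
```
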