<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Employing Active Learning for Training a DL-Model for Citation Identification in Patent Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Farag Saad</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hidir Aras</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark Prince</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CAS - Chemical Abstracts Service</institution>
          ,
          <addr-line>2540 Olentangy River Rd, Columbus, OH 43202</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>FIZ Karlsruhe - Leibniz Institute for Information Infrastructure</institution>
          ,
          <addr-line>Hermann-von-Helmholtz Platz 1 · 76344 Eggenstein-Leopoldshafen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <fpage>77</fpage>
      <lpage>81</lpage>
      <abstract>
        <p>Citations play an important role in patent analytics. Because the existing citation lists in patent documents are incomplete, automatically detecting and enhancing them from the patent text has long been a user need in patent information retrieval. In this paper, we describe an approach for identifying citations in patent text using Deep Learning (DL) models. We apply active learning to train and improve a DL-based named entity recognition (NER) model for this task. The evaluation showed high accuracy for the targeted citation type, i.e. the p-c-p (patent cites patent) case.</p>
      </abstract>
      <kwd-group>
        <kwd>Patent Citations</kwd>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Deep learning</kwd>
        <kwd>Active learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Citations play an important role in patent retrieval and analytics. Since the existing citation lists in patent documents are incomplete, it is a long-cherished user wish to automatically determine and complete them from the patent text. Furthermore, citations within the full-text of a patent do not follow well-defined patterns or rules, so identifying them with high accuracy is a challenging task. In addition, researching and implementing a suitable approach for utilizing citations from patent text depends on the use case (e.g., search for prior art, linking to corresponding online sources, etc.). Successfully creating such citation lists automatically from patent text will enable users to extend their discovery more efficiently to stated background or adjacent prior art.</p>
      <p>In general, patent citations come in two types: a patent cites another patent (herein referred to as p-c-p), or a patent cites literature (herein referred to as p-c-l), also known as NPL (non-patent literature) citations. In this paper the focus is on the p-c-p use case, for which we have designed, implemented and evaluated an approach for patent citation identification.</p>
      <p>There are two citation pattern types for the p-c-p use case: the standard citation pattern type and the non-standard citation pattern type. In the standard citation pattern type, patent applicants tend to use a simple form for referencing other patent publications, e.g., "US20050114951A1, WO 2006122188", while in the non-standard citation pattern type applicants tend to use more complex patterns for citing other patents, e.g., "U.S. Pat. Nos. 6,808,085; 6,736,293; 6,732,955; 6,708,846; 6,626,379; 6,626,330; 6,626,328; 6,454,185, United States Provisional Application No. 61/914,561, Japanese Unexamined Patent Publication No. 4-187748, US provisional application Serial No 61/640,128". Based on that, we have developed and trained a suitable p-c-p NER DL model for identifying and extracting these types of citation patterns automatically (see Section 3).</p>
      <p>In the following, we first briefly review the related work in Section 2, followed by a presentation of the proposed approach in Section 3. In Section 4 an empirical evaluation is presented and discussed. Future work is outlined in Section 5, and a conclusion is given in Section 6.</p>
      <sec id="sec-1-1">
        <title>2. Related Work</title>
        <p>In [5] the authors considered six standard NLP tasks, among them the NER task, where atomic elements in the sentence were labelled into categories such as "PERSON", "COMPANY", or "LOCATION". The authors used feature vectors generated from all words in an unlabelled corpus. A separate (orthographic) feature was included, based on the assumption that a capital letter at the beginning of a word is a strong indication that the word is a named entity. The proposed controlled features were later replaced with word embeddings [6] [7]. Word embeddings, which represent word meanings in an n-dimensional space, were learned from unlabelled data. A major strength of these approaches is that they allow the design of training algorithms that avoid task-specific engineering and instead rely on large, unlabelled data to discover internal word representations that are useful for the NER task.</p>
      </sec>
      <sec id="sec-1-3">
        <title>3. P-C-P Approach based on Deep Learning</title>
        <p>To find suitable training data for the p-c-p NER model, we first investigated whether there is any publicly available training dataset on which we could build the NER precursor model, which would then be used to enlarge the training data at hand. A precursor model is a type of temporary model which is trained on a small set of training data and is re-trained further based on larger training data. In the following, we give some insight into the publicly available training data as well as the training data generated using the active learning framework.</p>
        <sec id="sec-1-3-1">
          <title>3.1. Public Training Dataset</title>
          <p>We have identified two freely available datasets: the GROBID dataset (https://github.com/kermitt2/grobid/tree/master/grobidtrainer/resources/dataset) and the manually/expert-created dataset at FIZ Karlsruhe. The GROBID project specializes in literature citation extraction; however, it has recently done some work related to patent citation extraction. After pre-processing steps (e.g., removing non-English patents, corrupted documents, etc.), the total number of obtained annotated documents was 130, belonging to three patent authorities: EPO (https://www.epo.org/en), US (https://www.cas.org/support/training/stnanavist/uspatfullanavist) and PCT/WIPO (https://www.cas.org/support/training/stnanavist/pctfull-anavist). The citation coverage per patent document varies: some documents contain two citations, while others contain up to 375 citations. The domain of focus is life science. The total number of extracted training paragraphs that hold citations was 487. The FIZ dataset contains 41 annotated patent documents. The citation coverage for each patent varies between 2 and 66 citations. The domains of focus are life science and technology. The total number of extracted training paragraphs was 230. Both datasets have a different format, not equivalent to the NER state-of-the-art format, e.g., the IOB (inside-outside-beginning tagging) format (see [4]). We have unified the format of both datasets to the standard IOB format and processed the resulting training data (717 training paragraphs) to train the precursor p-c-p NER model.</p>
        </sec>
        <sec id="sec-1-3-2">
          <title>3.2. Generated P-C-P Training Data</title>
          <p>For training the precursor p-c-p NER model, the freely acquired training data is not sufficient and needs to be improved in quality and quantity. Therefore, we have generated our own training data. To achieve this goal, patent paragraphs that hold many citations are first pre-annotated by the p-c-p precursor model. Then, patent subject matter experts (SMEs) used the visual annotation user interface of the Prodigy annotation tool (https://prodi.gy/) to review and enhance the annotated p-c-p citations. Prodigy is an annotation tool with an easy-to-use interactive interface that supports active learning. It is a scriptable tool that allows users to create the annotations themselves, enabling rapid iterations.</p>
          <p>From the utilized patent full-text databases PCT (WIPO) and US we have prepared 500 citation-rich paragraphs each (1000 in total), belonging to an equally distributed set of patent documents based on their IPC/CPC classes (https://www.wipo.int/classifications/ipc/en/). In order to use citation-rich paragraphs for training, we have kept only the part of the detailed description (DETD) which holds at least 8 cited patents identified by the precursor p-c-p model. Furthermore, we took into account having sufficient training instance candidates (in the selected paragraphs) that represent the two types of citation patterns, the standard as well as the non-standard one. The training instance candidates are then reviewed by the SMEs, who accept or correct each instance using the Prodigy annotation tool (see Figure 1).</p>
        </sec>
        <sec id="sec-1-3-3">
          <title>3.3. Model Design and Training</title>
          <p>Figure 2 shows the workflow in which the final NER model (based on Convolutional Neural Networks (CNNs)) is built for the p-c-p task within the active learning framework. To train the NER model we have utilized the open-source framework spaCy (https://spacy.io/). For rapid implementation, we have used the spaCy implementation provided by the Prodigy framework. As spaCy offers no pre-trained NER model for identifying citations in patent text, we have trained our own DL-based p-c-p NER model utilizing the Prodigy framework (see Figure 2).</p>
        </sec>
      </sec>
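The IOB unification mentioned in Section 3.1 can be illustrated with a small sketch. The whitespace tokenizer and the character-span input format below are simplifying assumptions for illustration, not the authors' actual conversion pipeline:

```python
# Convert character-span citation annotations to IOB token tags.
# Hypothetical input format; the paper's source datasets differ.
def spans_to_iob(text, spans, label="CITATION"):
    """spans: list of (start, end) character offsets of cited patents."""
    tags = []
    pos = 0
    for token in text.split():          # naive whitespace tokenization
        start = text.index(token, pos)
        end = start + len(token)
        pos = end
        tag = "O"
        for s, e in spans:
            if start >= s and end <= e:
                # first token of the span gets B-, the rest get I-
                tag = ("B-" if start == s else "I-") + label
                break
        tags.append((token, tag))
    return tags

example = "See U.S. Pat. No. 6,808,085 for details."
# the citation span covers "U.S. Pat. No. 6,808,085"
print(spans_to_iob(example, [(4, 27)]))
```

Each token inside an annotated span is tagged `B-CITATION` (begin) or `I-CITATION` (inside), everything else `O`, which is the standard IOB shape expected by sequence-labelling trainers.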
      <p>The process starts by training a precursor model based on the public dataset, which is then utilized to enlarge the training data iteratively (see Figure 2 (1)). This precursor model was used to filter the acquired US and PCT raw data, where we have kept only the part of the Detailed Description (DETD) text which holds at least 8 patent citations (see Figure 2 (2)). We have then pre-processed and prepared the raw data to ensure that it holds sufficient citations. We then loaded part of it into the Prodigy tool along with the integrated precursor model to start the training process for the final NER model in several iterations. The input training instance candidates (in the selected paragraphs), which were annotated by the precursor model, are loaded into the Prodigy tool and presented to the SMEs for reviewing. The SMEs interacted with the pre-annotations and either approved, corrected or added new annotations in the presented paragraphs (see Figure 2 (3)).</p>
      <p>In the first iteration, the SMEs reviewed 250 pre-annotated paragraphs for each of the US and PCT databases. We have used these reviewed pre-annotated paragraphs to re-train the NER model (see Figure 2 (4)), where the model reached an F-score of 85%, and hence another iteration was required. To enhance the model performance, we have picked up more raw data and pre-annotated it again (250 paragraphs for each of the US and PCT databases), using the enhanced p-c-p NER model (see Figure 2 (2)). The newly prepared citation-rich paragraphs were reviewed by the SMEs and used to re-train the NER model. If needed, this process is iteratively repeated and ends when we reach a certain degree of confidence that the final NER model is sufficiently trained to be applied to the p-c-p citation identification task (see Figure 2 (4)).</p>
      <sec id="sec-1-2">
        <title>4. Evaluation</title>
        <p>Once the p-c-p model is sufficiently trained, the citation identification approach is evaluated by processing the evaluation corpus of 245 patents (prepared by SMEs), representing a random collection of patents from the US (128 patents) and PCT (117 patents). To start the p-c-p model evaluation process, all required materials, e.g., the identified citations of the SMEs' evaluation corpus, etc., were handed over to the SMEs.</p>
        <p>We have evaluated the p-c-p model based on the Precision (see Equation 1), Recall (see Equation 2) and F1-Score (see Equation 3) measures. In order to compute the scores, the False Positive (FP), True Positive (TP) and False Negative (FN) counts are determined first. FP refers to the number of wrongly identified citations by the p-c-p model. TP refers to the number of correctly identified citations by the p-c-p model. FN refers to the number of citations that the model fails to identify.</p>
        <p>Precision = TP / (TP + FP) (1)</p>
        <p>Recall = TP / (TP + FN) (2)</p>
        <p>F1-Score = 2 * (Precision * Recall) / (Precision + Recall) (3)</p>
        <p>As shown in Table 1, in total the p-c-p model successfully identified 951 citations out of 978 citations and failed to identify only 28 citations. An explanation for these failures is that in rare cases patent applicants do not cite citations in the right way; for example, consider the citation "This is a continuation-in-part of my copending application Ser. No. 43,784 for Catalytic Reformer Process". Here the patent authority marker (US, WO, JP, etc.) is missing, so the model has no clue which patent authority is meant; therefore, to minimize the error rate, the model was trained to neglect such incomplete citations.</p>
        <p>Even though the p-c-p model also obtained a high evaluation score for different patent authorities, e.g., Finnish, Japanese, etc., that it was not trained on, it failed to identify some citations that appear in special contexts. An effective solution for these failures is to increase the training data to cover more patent authorities. This can be done efficiently using the framework described in Section 3.2.</p>
        <p>Generally, the p-c-p model performed very well in most cases and achieved a high precision of 96% and a high recall of 89%. In addition, we have computed the F1-Score measure to take both Precision and Recall into account in order to ultimately measure the accuracy of the model. Despite the fact that the p-c-p model was trained on a very small training dataset consisting of 1717 training paragraphs, the F1-Score (using the evaluation corpus prepared by the SMEs) shows that the p-c-p model achieves a certain degree of accuracy, reaching an F1-Score of 92%.</p>
      </sec>
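The three measures used in the evaluation reduce to simple arithmetic over the TP/FP/FN counts. A minimal sketch of Equations 1-3 (the counts below are illustrative, not those of Table 1):

```python
def ner_scores(tp, fp, fn):
    """Precision, Recall and F1-Score from citation-level counts (Eqs. 1-3)."""
    precision = tp / (tp + fp)                 # Eq. 1
    recall = tp / (tp + fn)                    # Eq. 2
    f1 = 2 * (precision * recall) / (precision + recall)  # Eq. 3
    return precision, recall, f1

# Illustrative counts: 90 correct, 5 spurious, 10 missed citations.
p, r, f1 = ner_scores(tp=90, fp=5, fn=10)
print(round(p, 3), round(r, 3), round(f1, 3))
```

Because F1 is the harmonic mean of precision and recall, a model cannot compensate for poor recall with high precision alone, which is why the paper reports all three figures.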
    </sec>
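The iterative train-review-retrain cycle described in Sections 3.2 and 3.3 can be summarized as a loop. All callbacks below (`pre_annotate`, `review`, `retrain`, `evaluate`) are placeholders for the Prodigy/spaCy steps of the paper, not real API calls:

```python
def active_learning_loop(model, raw, pre_annotate, review, retrain, evaluate,
                         target_f1=0.92, batch=500):
    """Sketch of the Figure 2 workflow: pre-annotate citation-rich
    paragraphs, let SMEs review them, re-train, stop at the target score."""
    f1 = evaluate(model)
    while f1 < target_f1 and raw:
        candidates, raw = raw[:batch], raw[batch:]      # next citation-rich batch
        gold = review(pre_annotate(model, candidates))  # SME review in Prodigy
        model = retrain(model, gold)                    # re-train the NER model
        f1 = evaluate(model)                            # check the stop criterion
    return model, f1
```

Injecting the four steps as callbacks keeps the loop itself tool-agnostic; in the paper's setup each iteration consumed 250 paragraphs per database and stopped once the evaluation score was judged sufficient.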
    <sec id="sec-2">
      <title>5. Future Work Directions</title>
      <p>The DL-based p-c-p NER model has been trained with a small set of training data (1717 training paragraphs) related to two patent authorities, US and PCT. Even though the model also obtained a high evaluation score for data from different patent authorities, the developed model needs further training and testing to cover more citations belonging to different patent authorities, e.g., to cover citation patterns which might be specific to a given patent authority. Based on our experience so far, a few thousand training paragraphs for each patent authority should be sufficient. To speed up this process, the visual active learning approach developed in this paper can be utilized.</p>
      <p>To utilize the extracted citations for further tasks or applications, such as search or linking patents with a literature knowledge base through citations, the extracted citations need to be post-processed. This is important because significant portions of the identified citations were presented in the non-standard citation form, e.g., U.S. Pat. Nos. 5,188,960; 5,689,052; 5,880,275; 5,986,177; 7,105,332; 7,208,474. Hence, after extraction, the individual citations should be normalized accordingly: US5188960, US5689052, US5880275, etc. Another example of normalization is splitting up an identified citation string, e.g., "EP 0 716 884 A2", into meaningful segments: the patent authority "EP", the patent number "0716884", the patent kind code "A2", and, finally, the normalized patent string "EP0716884A2".</p>
      <p>To consider a detailed patent citation type, such as the filing number of a patent application or the publication of a patent application, it is essential to integrate a patent-citation-specific scheme into the developed approach. For example, for US patent citations we noticed that the filing number of a US patent application has a specific format (e.g., No. 16/769,261), the publication of a US patent application has a specific format (e.g., US 2005/0114951 A1, starting with a year number), and a US patent has a specific format (e.g., US 6,808,085). Encoding such features into the developed approach will certainly lead to a significant improvement.</p>
    </sec>
    <sec id="sec-3">
      <title>6. Conclusion</title>
      <p>In this paper, we have developed a DL-based p-c-p NER model to identify citations in the patent full-text. To realize that, we have designed, implemented and evaluated an active learning framework for patent citation identification employing a DL approach. Furthermore, to train a robust citation identification p-c-p model with high accuracy, we have designed the active learning framework so that patent SMEs can use it to iteratively improve the model performance significantly with less manual effort. In the first iteration, the reviewed pre-annotated paragraphs (250 for each of the US and PCT databases) led to a significant model improvement. However, another iteration involving a further 250 pre-annotated paragraphs for each database was required in order to achieve the desired F1-Score of 92% and to stop the re-training process.</p>
    </sec>
    <sec id="sec-4">
      <title>References</title>
      <p>[1] X. Zhang, J. Zou, D. X. Le, G. R. Thoma, A structural SVM approach for reference parsing, in: 2010 Ninth International Conference on Machine Learning and Applications, 2010, pp. 479-484. doi:10.1109/ICMLA.2010.77.</p>
      <p>[2] B. Ojokoh, M. Zhang, J. Tang, A trigram hidden Markov model for metadata extraction from heterogeneous references, Information Sciences 181 (2011) 1538-1551. URL: https://www.sciencedirect.com/science/article/pii/S0020025511000259. doi:10.1016/j.ins.2011.01.014.</p>
      <p>[3] P. Lopez, GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications, in: European Conference on Research and Advanced Technology for Digital Libraries, 2009. URL: https://api.semanticscholar.org/CorpusID:27383212.</p>
      <p>[4] F. Saad, H. Aras, R. Hackl-Sommer, Improving named entity recognition for biomedical and patent data using Bi-LSTM deep neural network models, in: E. Métais, F. Meziane, H. Horacek, P. Cimiano (Eds.), Natural Language Processing and Information Systems - 25th International Conference on Applications of Natural Language to Information Systems, NLDB 2020, Saarbrücken, Germany, June 24-26, 2020, Proceedings, volume 12089 of Lecture Notes in Computer Science, Springer, 2020, pp. 25-36. URL: https://doi.org/10.1007/978-3-030-51310-8_3. doi:10.1007/978-3-030-51310-8_3.</p>
      <p>[5] R. Collobert, J. Weston, A unified architecture for natural language processing: Deep neural networks with multitask learning, in: Proceedings of the 25th International Conference on Machine Learning, ACM, 2008, pp. 160-167.</p>
      <p>[6] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. P. Kuksa, Natural language processing (almost) from scratch, Computing Research Repository - CoRR abs/1103.0398 (2011).</p>
      <p>[7] E. Parsaeimehr, M. Fartash, J. A. Torkestani, Improving feature extraction using a hybrid of CNN and LSTM for entity identification, Neural Processing Letters 55 (2023) 5979-5994. doi:10.1007/s11063-022-11122-y.</p>
    </sec>
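The citation normalization proposed in Section 5 is largely mechanical. The following regex-based sketch handles the two example forms given there ("U.S. Pat. Nos." lists and EP-style strings) under simplified assumptions about the input; it is an illustration, not the planned post-processing component:

```python
import re

def normalize_us_list(citation):
    """'U.S. Pat. Nos. 5,188,960; 5,689,052' -> ['US5188960', 'US5689052'].
    Assumes comma-grouped US patent numbers, as in the Section 5 example."""
    numbers = re.findall(r"\d{1,3}(?:,\d{3})+", citation)
    return ["US" + n.replace(",", "") for n in numbers]

def segment_patent_string(citation):
    """Split an EP-style string like 'EP 0 716 884 A2' into authority,
    number and kind code, plus the normalized form."""
    m = re.match(r"([A-Z]{2})\s*((?:\d\s*)+)\s*([A-Z]\d?)", citation)
    if not m:
        return None
    authority = m.group(1)
    number = m.group(2).replace(" ", "")
    kind = m.group(3)
    return {"authority": authority, "number": number, "kind": kind,
            "normalized": authority + number + kind}

print(normalize_us_list("U.S. Pat. Nos. 5,188,960; 5,689,052; 5,880,275"))
print(segment_patent_string("EP 0 716 884 A2"))
```

A production version would need per-authority rules (e.g., the US filing-number and publication formats discussed in Section 5), which is exactly the citation-specific scheme the paper proposes to integrate.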
  </body>
  <back>
    <ref-list />
  </back>
</article>