=Paper=
{{Paper
|id=Vol-3004/paper4
|storemode=property
|title=Design and Implementation of Keyphrase Extraction Engine for Chinese Scientific Literature
|pdfUrl=https://ceur-ws.org/Vol-3004/paper4.pdf
|volume=Vol-3004
|authors=Liangping Ding,Zhixiong Zhang,Huan Liu,Yang Zhao
|dblpUrl=https://dblp.org/rec/conf/jcdl/DingZLZ21
}}
==Design and Implementation of Keyphrase Extraction Engine for Chinese Scientific Literature==
<pdf width="1500px">https://ceur-ws.org/Vol-3004/paper4.pdf</pdf>
<pre>
            EEKE 2021 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents


 Design and Implementation of Keyphrase Extraction
       Engine for Chinese Scientific Literature
                         Liangping Ding                                                  Zhixiong Zhang∗
                 dingliangping@mail.las.ac.cn                                         zhangzhx@mail.las.ac.cn
    National Science Library, Chinese Academy of Sciences              National Science Library, Chinese Academy of Sciences
                         Beijing, China                                                    Beijing, China
      Department of Library Information and Archives                     Department of Library Information and Archives
    Management, University of Chinese Academy of Science               Management, University of Chinese Academy of Science
                         Beijing, China                                                    Beijing, China

                            Huan Liu                                                         Yang Zhao
                   liuhuan@mail.las.ac.cn                                            zhaoyang@mail.las.ac.cn
    National Science Library, Chinese Academy of Sciences              National Science Library, Chinese Academy of Sciences
                        Beijing, China                                                     Beijing, China
      Department of Library Information and Archives                     Department of Library Information and Archives
    Management, University of Chinese Academy of Science               Management, University of Chinese Academy of Science
                        Beijing, China                                                     Beijing, China

Abstract                                                               reading interests. Keyphrase extraction task is the basis for
Accurate keyphrases summarize the main topics, which are               many natural language processing tasks such as information
important for information retrieval and many other natural             retrieval [2], text summarization [3], text classification [4],
language processing tasks. In this paper, we construct a               opinion mining [5], and document indexing [6].
keyphrase extraction engine for Chinese scientific literature             For Chinese scientific literature, there are cases of miss-
to assist researchers in improving the efficiency of scientific        ing keyphrases stored by publishers. In addition, many
research. There are four key technical problems in the                 keyphrases given by authors do not fully reveal the main idea
process of building the engine: how to select a keyphrase              of the text. So keyphrase extraction for Chinese scientific
extraction algorithm, how to build a large-scale training set          literature is particularly important, not only to fill the gap
to achieve application-level performance, how to adjust and            of keyphrase metadata fields in publishers’ repositories, but
optimize the model to achieve better application results, and          also serve as an effective complement to the keyphrases
how to be conveniently invoked by researchers. Aiming at               given by authors themselves. It can also provide reference
the above problems, we propose corresponding solutions.                for researchers when writing Chinese scientific papers.
The engine is able to automatically recommend four to five                The training corpus used in current Chinese keyphrase
keyphrases for the Chinese scientific abstracts given by the           extraction models is generally limited to one or several
user, and the response speed is generally within 3 seconds.            subject areas and is relatively small in size [7], which is
The keyphrase extraction engine for Chinese scientific                 difficult to be oriented to large-scale applications. Moreover,
literature is developed based on advanced deep learning                the keyphrase extraction models are generally self-stored
algorithms, large-scale training set, and high-performance             by the developers, making it difficult for widespread use by
computing capacity, which might be an effective tool for               researchers.
researchers and publishers to quickly capture the key stating             To address the above problems, we constructed a keyphrase
points of scientific text.                                             extraction engine for Chinese scientific literature based on
                                                                       a large-scale training corpus from multiple disciplines for
Keywords: Keyphrase Extraction, Artificial Intelligence En-            practical applications. The engine can be easily called by
gine, Chinese Scientific Literature                                    means of Application Programming Interface (API) without
                                                                       local model installation and configuration. In this paper,
1    Introduction                                                      we discuss the overall construction idea of building the
Keyphrase extraction task is a branch of information extrac-           keyphrase extraction engine for Chinese scientific literature,
tion and has been a research hotspot for many years. It aims           the solutions to the key technical problems, and the specific
to identify important topical phrases from text [1], which is          engineering implementation of the engine.
of great significance for readers to quickly grasp the main            2    Related Work
idea of the articles and select the articles that meet their
                                                                       Currently, the popular keyphrase extraction methods can be
∗ Corresponding Author                                                 divided into three categories: (1) keyphrase extraction based

 Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

                                                                  26
           EEKE 2021 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents

on traditional two-stage ranking; (2) keyphrase extraction              select a keyphrase extraction algorithm, how to build a large-
based on sequence labeling; (3) keyphrase extraction based              scale training set to achieve application-level performance,
on span prediction. The traditional two-stage ranking based             how to adjust and optimize the model to achieve better
methods use some heuristic rules to identify candidate                  application results, and how to be conveniently invoked
keyphrases from the text in the first stage, and use a ranking          by researchers.
algorithm to rank the candidate keyphrases in the second                   To address the problem of how to choose an appropriate
stage. The commonly used ranking algorithms include term                keyphrase extraction algorithm, we first investigated the
frequency [8], TF*IDF [9], etc. A major drawback of this                current popular and advanced keyphrase extraction algo-
two-stage approach is the error propagation, which means                rithms, and used publicly available dataset to compare model
that the error caused in the candidate keyphrases generation            performance and determine an optimal keyphrase extraction
will be passed to the candidate keyphrases ranking.                     model for engine construction.
   To address this issue, researchers proposed unified keyphrase           To address the problem of how to construct an application-
extraction formulations, which regard keyphrase extraction              level large-scale training set, we took advantage of the
task as a sequence labeling task or span prediction task.               title, abstract and keyphrase metadata fields of the Chinese
Sequence labeling formulation usually uses BIO [10] or                  Science Citation Database (CSCD) to construct a large-
BIOES [11] tagging schemes to annotate tokens in the                    scale training set covering multidisciplinary fields such
text sequences, and then train keyphrase extraction models              as medicine and health, industrial technology, agricultural
based on machine learning algorithms [7] or deep learning               science, mathematical science, chemistry and biological
algorithms [12][13]. The idea of span prediction formulation            science.
originates from machine reading comprehension based on                     To address the problem of how to adjust and optimize
SQuAD format [14], which predicts the role of tokens in                 the model to achieve better application results, we used the
the sequence by training two binary classifiers to determine            TF*IDF algorithm as a complement to compensate for the
whether they are the start and end positions of keyphrases              data shortage of humanities domain in the training corpus.
[15]. While no consensus has been reached about what kind               And we used large-scale scientific literature as a corpus to
of formulation should be used for supervised keyphrase                  calculate the inverse document frequency. Aiming at the
extraction task.                                                        problem that keyphrases are often truncated by TF*IDF
   In addition, keyphrase extraction algorithm is another               algorithm, we proposed a circular iterative splicing algorithm
important issue that should be paid attention to. In 2018,              to capture more accurate keyphrases.
Google released pretrained language model BERT [16],                       To address the problem of how to be conveniently invoked
which attracted widespread attention in the field of natural            by researchers, we deployed the keyphrase extraction model
language processing. This study is widely regarded as a                 as a service, so that researchers can call the API of the model
landmark discovery that provides a new paradigm for the                 by GET or POST method to obtain the keyphrase extraction
field of natural language processing. In the past three years,          results for the given text, without the need for local model
a large number of pretrained language models have emerged,              installation and configuration.
and many researchers found that using pretrained language
models can lead to large improvements in the model per-                 4   Solutions to Key Technical Problems
formance of downstream tasks [17][18]. Furthermore, some                For the four key technical problems faced in the engine con-
researchers suggested that incorporating external features              struction process, we proposed the corresponding solutions.
such as lexicon feature to pretrained language model can
further boost the model performance[19][20].                            4.1 Selection of Keyphrase Extraction Model
   Even though advanced keyphrase extraction algorithms                 Pretrained language model BERT has captured common
are applied, there are less publicly available keyphrase                language representations from large-scale corpus, enabling
extraction engine that can be directly called by users to               downstream supervised learning tasks to achieve great
the best of our knowledge, limiting the industrialization               model performance even with a small amount of labeled
of academic achievements. In this paper, we illustrate the              data. We assumed that taking advantage of pretrained
construction process of keyphrase extraction engine for                 language model, which has been pretrained using large-scale
Chinese scientific literature, aiming to provide reference              unsupervised text, is of great value to build a keyphrase
for academic researches and industrial usage of keyphrase               extraction model for Chinese scientific literature applicable
extraction.                                                             to multi-disciplines. Therefore, we decided to construct a
                                                                        keyphrase extraction model for Chinese scientific literature
3   The Overall Construction Idea                                       based on BERT-Base-Chinese, and tried to experiment with
To build a keyphrase extraction engine for Chinese scientific           both sequence labeling formulation and span prediction for-
literature that can be used for practical applications in multi-        mulation to find the optimal keyphrase extraction algorithm
ple disciplines, there are four key technical problems: how to          for keyphrase extraction engine.


                                                                   27
          EEKE 2021 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents

Table 1. Experimental Results of Keyphrase Extraction                       the pretrained language model BERT to output the
Model on CAKE test set                                                      probability of each category. The parameters of BERT
                                                                            was fine-tuned by CAKE training data.
          Models             Precision   Recall    F1-score              2. Based on 𝐵𝐸𝑅𝑇 + 𝑆𝑜 𝑓 𝑡𝑀𝑎𝑥 model, we fused POS fea-
                                                                            ture into the embedding space of BERT to incorporate
    BERT+SoftMax                63.81%   56.52%     59.94%                  word semantics indirectly and constructed the 𝐵𝐸𝑅𝑇 +
  BERT+POS+SoftMax              64.83%   57.44%     60.91%                  𝑃𝑂𝑆+𝑆𝑜 𝑓 𝑡𝑀𝑎𝑥 model. The POS tagging was generated
 BERT+Lexicon+SoftMax           68.06%   60.67%     64.15%                  by Hanlp1 . The details of feature incorporation and
      BERT+CRF                  64.87%   59.15%     61.88%                  model construction are shown in [22].
      BERT+Span                 65.51%   57.61%     61.31%               3. We collected keyphrases from the keyphrase metadata
                                                                            fields in CSCD restricted to medical domain. Based
     Table 2. Assessment Results of the Training set                        on 𝐵𝐸𝑅𝑇 + 𝑆𝑜 𝑓 𝑡𝑀𝑎𝑥 model, we used BIO tagging
                                                                            scheme to generate lexicon feature and embedded it
                   Indicators                     Results                   into BERT to add in domain features and indicate word
                                                                            boundary information to some extent, composing
                 Precision                  99.38%                          𝐵𝐸𝑅𝑇 + 𝐿𝑒𝑥𝑖𝑐𝑜𝑛 + 𝑆𝑜 𝑓 𝑡𝑀𝑎𝑥 model.
                   Recall                   97.56%                       4. The 𝐵𝐸𝑅𝑇 +𝐶𝑅𝐹 model used Conditional Random Field
                 F1-score                   98.46%                          (CRF) layer on top of BERT to capture the sequential
   Number of correct keyphrases identified 4,447,454                        features among labels. To learn a reasonable transition
    Number of all keyphrases identified    4,447,454                        matrix, we used a hierarchical learning rate, using a
    Number of author-given keyphrases      4,558,596                        learning rate of 5e-5 for training the parameters of the
                                                                            neural network layers of BERT and a learning rate of
                                                                            0.01 for training the parameters of the CRF layer.
   It is worth noting that for Chinese keyphrase extraction,             5. 𝐵𝐸𝑅𝑇 + 𝑆𝑝𝑎𝑛 model defined keyphrase extraction task
there is no delimiter like space in English to indicate the                 as a span prediction problem. Two binary classifiers
segmentation of words. So it’s necessary to consider whether                were trained to determine whether each token is a
to use character or word as the minimal language unit to                    start position or an end position of the keyphrase.
feed into the model. It has been shown that for Chinese
                                                                        Table 1 shows the keyphrase extraction performance of
keyphrase extraction task, using character as the smallest
                                                                     the above-mentioned models on the CAKE test set. In the
linguistic unit can achieve better results [21]. In Chinese,
                                                                     experiments of keyphrase extraction for Chinese scientific
word is the smallest unit for expressing semantics. Even
                                                                     literature, we were concerned with how many correct
though character formulation can avoid the errors caused
                                                                     keyphrases we can identify from the given text. Therefore,
by Chinese tokenizer, it also loses some of the semantics. To
                                                                     we compared the keyphrases predicted by the model with the
remedy the deficiency, we considered incorporating external
                                                                     keyphrases given by the authors and calculated the precision,
features including POS feature and lexicon feature into the
                                                                     recall and F1-score to evaluate the model performance. The
model to add in semantics and human knowledge indirectly.
                                                                     formula for each indicator is as follows.
   We used the publicly available Chinese keyphrase ex-
traction dataset CAKE [21] for the experiments to deter-                                                     𝑐
                                                                                               𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 =                      (1)
mine the best algorithm, which is a dataset containing                                                       𝑟
Chinese medical abstracts from CSCD in sequence labeling                                                𝑐
                                                                                                 𝑅𝑒𝑐𝑎𝑙𝑙 =                      (2)
format. 100,000 abstracts are included in the training set                                              𝑠
and 3,094 abstracts are included in the test set. Based on                                     2 × 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 × 𝑅𝑒𝑐𝑎𝑙𝑙
                                                                                 𝐹 1 − 𝑠𝑐𝑜𝑟𝑒 =                                 (3)
the training set of CAKE, we conducted experiments on                                            𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑅𝑒𝑐𝑎𝑙𝑙
five models: 𝐵𝐸𝑅𝑇 + 𝑆𝑜 𝑓 𝑡𝑀𝑎𝑥, 𝐵𝐸𝑅𝑇 + 𝑃𝑂𝑆 + 𝑆𝑜 𝑓 𝑡𝑀𝑎𝑥,                  where c denotes the number of keyphrases predicted by
𝐵𝐸𝑅𝑇 + 𝐿𝑒𝑥𝑖𝑐𝑜𝑛 + 𝑆𝑜 𝑓 𝑡𝑀𝑎𝑥, 𝐵𝐸𝑅𝑇 + 𝐶𝑅𝐹 , and 𝐵𝐸𝑅𝑇 + 𝑆𝑝𝑎𝑛.            the model that match the author-given keyphrases; r denotes
The first four of these models are based on sequence labeling        the number of keyphrases predicted by the model in total;
task formulation, while the last model is based on span              and s denotes the number of all author-given keyphrases.
prediction formulation. The short description of each model             The experimental results showed that the best results can
is shown in the following:                                           be achieved by adding a SoftMax layer directly on top of
                                                                     the BERT model for classification incorporating the lexicon
   1. The 𝐵𝐸𝑅𝑇 + 𝑆𝑜 𝑓 𝑡𝑀𝑎𝑥 model defined the task of
                                                                     features simultaneously, which is 𝐵𝐸𝑅𝑇 +𝐿𝑒𝑥𝑖𝑐𝑜𝑛+𝑆𝑜 𝑓 𝑡𝑀𝑎𝑥
      keyphrase extraction from Chinese scientific literature
                                                                     model. Without adding external features, 𝐵𝐸𝑅𝑇 + 𝐶𝑅𝐹
      as a character-level sequence labeling task, where
                                                                     model and 𝐵𝐸𝑅𝑇 + 𝑆𝑝𝑎𝑛 model achieved better results than
      each token was annotated in BIO tagging scheme.
      A SoftMax classification layer was added on top of             1 https://github.com/hankcs/HanLP


                                                                28
           EEKE 2021 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents

                                Table 3. Statistics of the Discipline Distribution in Training Set

   Chinese Library Classification (CLC)                                 Discipline                            Number of Abstracts
                      R                                  Medicine and health sciences                                421,879
                      T                                     Industrial Technology                                    386,649
                      S                                      Agricultural science                                    142,866
                      O                              Mathematics, physics and chemistry                              80,052
                      Q                                          Life sciences                                       60,901
                      P                                   Astronomy and geoscience                                    56,301
                      X                                     Environmental science                                    54,712
                      F                                           Economics                                           27,078
                      U                                         Transportation                                        15,664
                      V                                    Aviation and Aerospace                                    13,956
                      G                             Culture, science, education and sports                            7,848
                      N                                         Natural science                                       3,565
                      C                                         Social sciences                                        3,505
                      B                                    Philosophy and religions                                    3,379
                      K                                     History and geography                                     2,001
                      E                                        Military science                                       1,059
                      D                                         Politics and law                                        971
                      J                                                Art                                              712
                      H                                   Languages and linguistics                                     278
                      Z                                         General works                                            48
                      I                                            Literature                                           23
                      A                      Marxism, Leninism, Maoism and Deng Xiaoping theory                          12


𝐵𝐸𝑅𝑇 + 𝑆𝑜 𝑓 𝑡𝑀𝑎𝑥 model. We finally decided to use the                   BIO tagging scheme to convert the final concatenated text
𝐵𝐸𝑅𝑇 + 𝐿𝑒𝑥𝑖𝑐𝑜𝑛 + 𝑆𝑜 𝑓 𝑡𝑀𝑎𝑥 model architecture to build the              into sequence labeling format, and assigned labels to each
keyphrase extraction engine for Chinese scientific literature.          token to generate the dataset in the format required for
                                                                        model training. Specifically, given the concatenated text and
                                                                        keyphrases, we assigned label "B" to the first token of the
4.2    Construction of Application-Level Large-Scale
                                                                        keyphrase in the text, "I" to the other tokens of the keyphrase,
       Training Set
                                                                        and "O" to the tokens in the text that did not belong to any
We aimed to build a keyphrase extraction engine for Chinese             keyphrase.
scientific literature applicable to multidisciplinary fields               To ensure that the training set is of high quality and
using large scale training data, while CAKE dataset only                to avoid providing incorrect supervised signals for model
contained 100,000 abstracts from medical field, which cannot            training, we assessed the quality of the training set by
meet the demand for practical applications. So we con-                  comparing author-given keyphrases with the automatic
structed a large-scale dataset based on CSCD and evaluated              extracted keyphrases in the dataset. The assessment results
the quality of the dataset. The details of the training set             of the training set are shown in Table 2. It is worth noting
generation are described as follows.                                    that in the process of training set generation, we used the
   In order to ensure that the constructed training set had a           same processing technique as Ding et al. [21] and therefore
high recall and can annotate as many keyphrases as possible,            the quality of the training set cannot reach to 100%. For
we processed the tile, abstract and keyphrase fields in the             example, if there was an inclusion relationship between two
Chinese Science Citation Database and selected the records              keyphrases, the longest keyphrase would be selected for
in which all of the author’s given keyphrases appeared in the           labeling; if there was an overlapping relationship between
abstract. Finally, a total of 1,137,945 records were obtained           two keyphrases, the two keyphrases would be concatenated
to satisfy the above conditions, and the total number of                according to the overlapping tokens.
keyphrases was 1,055,335 (removing duplicates).                            In order to ensure that the model can support large-scale
   We selected 1.1 million records for generating the training          applications in multidisciplinary domains, we counted the
set and 37,945 records for generating the test set. Based on the        first-class discipline distribution in the training set based on
obtained titles, abstracts and keyphrases, we concatenated
titles and their corresponding abstracts by period and used


                                                                   29
             EEKE 2021 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents

Table 4. Parameter Configuration of the Proposed Approach                            Table 5. Keyphrase Extraction Model Performance on All-
                                                                                     Domain Test Set
                 Parameters                        Values
                                                                                                              Indicators              Results
                Batch size                        7
                   Epoch                          1                                                     Precision                     59.11%
                Optimizer                       Adam                                                      Recall                      46.84%
         Learning rate scheduler           exponential decay                                            F1-score                      52.26%
           Initial learning rate                 5e-5                                     Number of correct keyphrases identified      77,735
          Max sequence length                    512                                       Number of all keyphrases identified        131,517
                                                                                           Number of author-given keyphrases          165,956
Chinese Library Classification (CLC), and the statistics are
shown in Table 3 2 .                                                                 these domains, causing the problem that the number of
4.3 Model Adjustment and Optimization                                                keyphrases that can be identified for these domains are very
Based on the finalized 𝐵𝐸𝑅𝑇 + 𝐿𝑒𝑥𝑖𝑐𝑜𝑛 + 𝑆𝑜 𝑓 𝑡𝑀𝑎𝑥 model,                             limited. To address this issue, we decided to use TF*IDF
we fine-tuned the model using 1.1 million BIO-format                                 algorithm as a complement to the extraction results of
Chinese scientific records from multidisciplinary domains.                           the 𝐵𝐸𝑅𝑇 + 𝐿𝑒𝑥𝑖𝑐𝑜𝑛 + 𝑆𝑜 𝑓 𝑡𝑀𝑎𝑥 model to capture the high
The parameters used in the training process are shown in                             frequency keyphrases that appeared in the text.
Table 43 . Due to memory limitation, it was not feasible to                             We randomly selected 1 million abstracts from the Chinese
load the entire dataset into memory, so we transformed the                           Science Citation Database as the training corpus for the
data into the format shown in Figure 1. We loaded the data by                        calculation of inverse document frequency (IDF), using Jieba4
Pytorch DataLoader, which read one record at a time by using                         as the Chinese tokenizer to segment the words. To guide
an iterator, and calculated the gradient of the model after the                      the word separation and avoid the professional terms to be
amount of data reached to the batch size. The final model                            cut incorrectly, we introduced all the keyphrases in CSCD,
performance on our all-domain test set are shown in Table                            totaling 2,606,322 (no duplicates), as Jieba’s user-defined
5. It’s worth noting that the practical keyphrase extraction                         lexicon. The phrases in the custom lexicons as well as nouns
results are greater than the statistical indicators because we                       in the corpus were calculated for their inverse document
used exact match principle to calculate the related indicators,                      frequency, and finally an IDF file was obtained for subsequent
while there are some recognized keyphrases not included                              keyphrase extraction of Chinese scientific literature based
in the author-given keyphrases but still indicate the main                           on the TF*IDF algorithm.
point of the text.                                                                      At the same time, in order to solve the problem that
                                                                                     the keyphrases extracted by TF*IDF algorithm were often
                                                                                     truncated, we designed a circular iterative splicing algorithm
                                                                                     as improved TF*IDF algorithm. This algorithm spliced two-
                                                                                     by-two keyphrases identified by TF*IDF algorithm and
                                                                                     determined whether the spliced keyphrases still appeared in
                                                                                     the original text. The iterative splicing was continued until
                                                                                     no new keyphrases appeared. We combined the recognized
                                                                                     keyphrases of 𝐵𝐸𝑅𝑇 + 𝐿𝑒𝑥𝑖𝑐𝑜𝑛 + 𝑆𝑜 𝑓 𝑡𝑀𝑎𝑥 model with that
                                                                                     of the improved TF*IDF algorithm as the final keyphrase
             Figure 1. Input Format to DataLoader                                    extraction results for Chinese scientific literature, and the
                                                                                     specific process of the model is as follows.
  By observing the test results of the model during the                                 For the given scientific abstract, use 𝐵𝐸𝑅𝑇 + 𝐿𝑒𝑥𝑖𝑐𝑜𝑛 +
practical application, we found that the model did not achieve                       𝑆𝑜 𝑓 𝑡𝑀𝑎𝑥 model to recognize keyphrases firstly. If the
the expected prediction results for the data in the humanities                       number of the recognized keyphrases of 𝐵𝐸𝑅𝑇 + 𝐿𝑒𝑥𝑖𝑐𝑜𝑛 +
domain and could not capture the high-frequency words                                𝑆𝑜 𝑓 𝑡𝑀𝑎𝑥 was less than 4, the TF*IDF algorithm would
appearing in the text. As shown in Table 3, the sample size                          be introduced as a complement. Otherwise, the keyphrase
for the humanities domain was small, and apparently the                              extraction results of the 𝐵𝐸𝑅𝑇 + 𝐿𝑒𝑥𝑖𝑐𝑜𝑛 + 𝑆𝑜 𝑓 𝑡𝑀𝑎𝑥 model
model did not capture enough features on the data from                               were returned directly. In the keyphrase extraction process of
2 Some articles have more than one CLC code, the statistics total is over 1.1        TF*IDF algorithm, the keyphrases were restricted to nouns
million.                                                                             or pronouns, etc. to get the top 10 keyphrases in TF*IDF
3 Noted that because of computational limitation, the batch size was set
                                                                                     value.
to 7 and we assumed that 1 epoch was enough because of the large-scale
training set.                                                                        4 https://github.com/fxsjy/jieba


                                                                                30
           EEKE 2021 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents


                    Figure 2. Variables in the Iteration Process of the Circular Iterative Splicing Algorithm

   We removed short keyphrases which were overlapped                  ‘架构设计(Architecture Design)’, ‘民用飞机(Civilian Air-
with other keyphrases, and the keyphrases whose length                craft)’. The keyphrases extracted by 𝐵𝐸𝑅𝑇 + 𝐿𝑒𝑥𝑖𝑐𝑜𝑛 +
were less than two. Then we used the circular iterative               𝑆𝑜 𝑓 𝑡𝑀𝑎𝑥 model were less than four, so the TF*IDF algorithm
splicing algorithm to splice the keyphrases identified by             was trigger to get the top 10 keyphrases according to the
TF*IDF two by two in two directions, splicing from the left           TF*IDF value. After the preprocessing, there were eight
and the right. And if the spliced keyphrases still appeared in        extracted keyphrases by TF*IDF algorithm as ‘民 用 飞
the text, keep the spliced keyphrases and tag the two original        机(Civilian Aircraft)’, ‘架构设计(Architecture Design)’, ‘电
keyphrases as used keyphrases for deletion. Otherwise, keep           传 飞 控 系 统(Telex Flight Control System)’, ‘安 全 性 需
the keyphrases that are not successfully spliced. This process        求(Security Requirements)’, ‘适航规范(Airworthiness Spec-
was iterated until there were no new keyphrases appearing in          ifications)’, ‘需求论证(Proof of Need)’, ‘具体体现(Specific
the original text. The keyphrases identified by the improved          Embodiment)’, ‘安全要求(Security requirements)’. As we
TF*IDF algorithm were sorted in descending order according            can see, there were some redundant keyphrases recognized
to the TF*IDF value.                                                  by traditional TF*IDF, such as ‘具体体现(Specific Embodi-
   The keyphrase extraction results of the 𝐵𝐸𝑅𝑇 + 𝐿𝑒𝑥𝑖𝑐𝑜𝑛 +           ment)’. Next, the circular iterative splicing algorithm would
𝑆𝑜 𝑓 𝑡𝑀𝑎𝑥 model and the improved TF*IDF algorithm were                splice the keyphrases two by two. The changes in variables
combined and ranked as the final results. The priority of             during iteration process are shown in Figure 2.
the keyphrases identified by the 𝐵𝐸𝑅𝑇 + 𝐿𝑒𝑥𝑖𝑐𝑜𝑛 + 𝑆𝑜 𝑓 𝑡𝑀𝑎𝑥              As we can see, in the first iteration, there were seven
model were higher than that of the keyphrases identified by           spliced keyphrases occurred in the abstract, in which three
the improved TF*IDF algorithm. Based on this principle, we            were new keyphrases and four were original keyphrases
merged the keyphrase extraction results of 𝐵𝐸𝑅𝑇 +𝐿𝑒𝑥𝑖𝑐𝑜𝑛 +            (Unused Keyphrases) that didn’t splice with other keyphrases.
𝑆𝑜 𝑓 𝑡𝑀𝑎𝑥 model with improved TF*IDF algorithm, and took              And in the second iteration, no new keyphrases arose, so
the longest keyphrase for the keyphrases with inclusion               the iteration finished and all the seven spliced keyphrases
relationship. Finally, the top five keyphrases became the             kept unused and returned as the results of improved TF*IDF
final keyphrases. In addition, we used some heuristic rules           algorithm. Then, we ranked the keyphrases generated by
to filter the final keyphrases, such as removing keyphrases           the improved TF*IDF algorithm and combined them with
ending with special characters, etc., to improve the accuracy         that of 𝐵𝐸𝑅𝑇 + 𝐿𝑒𝑥𝑖𝑐𝑜𝑛 + 𝑆𝑜 𝑓 𝑡𝑀𝑎𝑥 model to get the further
of the keyphrase extraction model.                                    results as ‘适 航 安 全 性(Airworthiness Safety)’, ‘架 构 设
   To further elaborate, for an input abstract 5 , the 𝐵𝐸𝑅𝑇 +         计(Architecture Design)’, ‘民用飞机(Civilian Aircraft)’,‘民
𝐿𝑒𝑥𝑖𝑐𝑜𝑛 + 𝑆𝑜 𝑓 𝑡𝑀𝑎𝑥 model would process this input first and          用飞机电传飞控系统(Civilian Aircraft Telemetry Flight
got the keyphrases as ‘适航安全性(Airworthiness Safety)’,                  Control System)’, ‘电传飞控系统架构设计(Architecture
5 https://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CJFD&
                                                                      Design of the Telemetry Flight Control System)’, ‘安全性需
                                                                      求(Security Requirements)’, ‘民用飞机适航规范(Airworthiness
dbname=CJFDAUTODAY&filename=HKKX202103004&v=
G8TESBUsSe2JeIClg6moqemy3ExscLTVMNxH885u%25mmd2BI%                    Specifications for Civil Aircraft)’, ‘需求论证(Proof of Need)’,
25mmd2Bl9p5i%25mmd2FUmcOUqnMUOyTZM5


                                                                 31
           EEKE 2021 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents


             Figure 3. The Processing Flow of the Keyphrase Extraction Engine for Chinese Scientific Literature

‘具体体现(Specific Embodiment)’, ‘安全要求(Security re-                         call the keyphrase extraction engine, depositing the abstracts
quirements)’. Finally, we removed the short keyphrases                  of multiple Chinese scientific articles into a list, and pass
who had an inclusion relationship with others and got                   in the verification code. After the engine responds, it will
the ultimate top 5 recognized keyphrases as ‘适 航 安 全                    return the keyphrase extraction results of all the abstracts
性(Airworthiness Safety)’, ‘民用飞机电传飞控系统(Civilian                          in the list in JSON format to achieve batch processing. the
Aircraft Telemetry Flight Control System)’, ‘电传飞控系                      details of the POST API call are shown in Table 7.
统架构设计(Architecture Design of the Telemetry Flight
Control System)’, ‘安全性需求(Security Requirements)’, ‘民                    5     Engineering Implementation
用 飞 机 适 航 规 范(Airworthiness Specifications for Civil                    In order to display the keyphrase extraction results intu-
Aircraft)’. It can be seen that the final keyphrase extraction          itively and meet the demands for different users to call the
results of our proposed hybrid model are better than that of            engine, we currently provide three ways to call the keyphrase
the 𝐵𝐸𝑅𝑇 +𝐿𝑒𝑥𝑖𝑐𝑜𝑛 +𝑆𝑜 𝑓 𝑡𝑀𝑎𝑥 model and the TF*IDF model.                extraction API: browser online demo, Python code access and
                                                                        client access. The calling flow of the keyphrase extraction
4.4 API Design                                                          engine for Chinese scientific literature is shown in Figure 3.
In order to avoid various hardware and software constraints
that may be encountered in the local deployment of the                  5.1   Browser Online Demo
model, and to provide a fast and convenient way for re-
searchers to invoke the keyphrase extraction model, we
deployed the keyphrase extraction model as a service, and
built a keyphrase extraction engine for Chinese scientific
literature through API calls. Researchers can call the API of
the engine in two ways, POST and GET, to achieve automatic
keyphrase extraction of Chinese scientific literature. Pass
in the abstract of Chinese scientific literature and the
verification code, and the engine would return the keyphrase
extraction results in JSON format.
   For the GET method, users can send an request to the
URL: http://sciengine.las.ac.cn/keywords_extraction_cn to
call the keyphrase extraction engine, passing in the abstract
of a Chinese scientific literature abstract and the verification
code. When the engine receives the call, it will respond by
returning the keyphrase extraction results in JSON format.                      Figure 4. Browser Online Demo Interface
Details of the GET API call are shown in Table VI.
   For the POST method, users can send an request to the                  Users can visit the URL: http://sciengine.las.ac.cn/Keywords_
URL: http://sciengine.las.ac.cn/keywords_extraction_cn to               BIO_Lexi to test the keyphrase extraction engine online.


                                                                   32
     EEKE 2021 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents

                                        Table 6. GET API Call Details

                      Format                                 Example
Request URL           /keywords_extraction_cn                http://sciengine.las.ac.cn/keywords_extraction_cn
Request Parameters    "data": abstract of Chinese scientific {"data": "新辅助治疗背景下胰腺癌扩大切除术
                      literature, "token": Verification Code 的应用价值。胰腺癌恶性程度高,预后较差,治疗
                                                             效果仍不理想...(The value of extended resection
                                                             of pancreatic cancer in the context of neoadjuvant
                                                             therapy. Pancreatic cancer is highly malignant, with
                                                             poor prognosis and still unsatisfactory treatment
                                                             results...)", "token":99999}
Browser parameter     /Keywords_BIO_Lexi?data=               http://sciengine.las.ac.cn/keywords_extraction_
                      &token=                                cn?data=新辅助治疗背景下胰腺癌扩大切除术的
                                                             应用价值。胰腺癌恶性程度高,预后较差,治疗效
                                                             果仍不理想...(The value of extended resection of
                                                             pancreatic cancer in the context of neoadjuvant
                                                             therapy. Pancreatic cancer is highly malignant, with
                                                             poor prognosis and still unsatisfactory treatment
                                                             results....)&token=99999
Success message       "keywords": [keyphrases list]          {"keywords":["胰腺癌(pancreatic cancer)", "新辅助
                                                             治疗(neoadjuvant therapy)", "扩大切除术(extended
                                                             resection)" ] }
Error message         "info":error message                   {"info": "Server not available!"}, {"info": "Token
                                                             incorrect!"}

                                       Table 7. POST API Call Details

                      Format                                 Example
Request URL           /keywords_extraction_cn                http://sciengine.las.ac.cn/keywords_extraction_cn
Request Parameters    "data": abstract of Chinese scientific {"data": ["新辅助治疗背景下胰腺癌扩大切除术
                      literature, "token": Verification Code 的应用价值。胰腺癌恶性程度高,预后较差,治疗
                                                             效果仍不理想...(The value of extended resection
                                                             of pancreatic cancer in the context of neoadjuvant
                                                             therapy. Pancreatic cancer is highly malignant, with
                                                             poor prognosis and still unsatisfactory treatment
                                                             results...)", "参芪地黄汤联合ACEI/ARB类药物治疗
                                                             糖尿病肾病的Meta分析...(Meta-analysis of Shenqi
                                                             Dihuang Decoction combined with ACEI/ARB
                                                             drugs in the treatment of diabetic nephropathy...)"],
                                                             "token":99999}
Success message       Abstract ID:[keyphrases list]          {0: ["胰 腺 癌(pancreatic cancer)", "新 辅 助 治
                                                             疗(neoadjuvant therapy)", "扩大切除术(extended
                                                             resection)" ], 1: ["糖尿病肾病(diabetic nephropa-
                                                             thy)", "ACEI/ARB类(ACEI/ARB)", "META分析(Meta-
                                                             analysis)", "参 芪 地 黄 汤(Shenqi Dihuang Decoc-
                                                             tion)"]}
Error message         "info":error message                   {"info": "Server not available!"}, {"info": "Token
                                                             incorrect!"}


                                                      33
           EEKE 2021 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents

Type the abstract of a Chinese scientific literature in the input        single line of code. Users can download and install the
box (it is recommended to use the title + ‘。’ + abstract as              client from the website http://sciengine.las.ac.cn/Client, and
input), click the keyphrase extraction button, and the engine            get the verification code as the login credentials to call
API will be called automatically to invoke the bottom model              the keyphrase extraction engine API to achieve automatic
and four to five keyphrases related to the main idea will be             keyphrase extraction of Chinese scientific literature. The
returned. The response time of the engine is generally within            keyphrase extraction engine client interface is shown in
3 seconds and the interface of the browser online demo is                Figure 5 and the specific operation process is as follows.
shown in Figure 4.                                                           1. After opening the client and entering the verification
5.2 Python Code Access                                                          code, click the button of "Keyphrase Extraction for
                                                                                Chinese Scientific Literature" in the menu bar to enter
Technical staff who are familiar with Python programming                        the interface of keyphrase extraction function. Click
language can download the corresponding sample codes                            "Browse" button to import the file to be processed, and
from the website http://sciengine.las.ac.cn/Scripts and revise                  it means successful if the data presentation box shows
the file path to achieve convenient usage. There are four                       the imported data, and the message box shows the
files: 𝑘𝑒𝑦𝑝ℎ𝑟𝑎𝑠𝑒𝑠_𝑒𝑥𝑡𝑟𝑎𝑐𝑡𝑖𝑜𝑛_𝑐𝑛_𝑔𝑒𝑡 .𝑝𝑦, the sample code                        total number of the data.
for calling the API of the keyphrase extraction engine                       2. Click "Start Extraction" button, the client will automat-
using GET method; 𝑘𝑒𝑦𝑝ℎ𝑟𝑎𝑠𝑒𝑠_𝑒𝑥𝑡𝑟𝑎𝑐𝑡𝑖𝑜𝑛_𝑐𝑛_𝑝𝑜𝑠𝑡 .𝑝𝑦, the                        ically carry out the function of keyphrase extraction
sample code for calling the API of the keyphrase extraction                     for Chinese scientific literature and display the real-
engine using POST method; 𝑖𝑛𝑝𝑢𝑡_𝑐𝑛.𝑡𝑥𝑡, the sample input                        time processing progress.
file; 𝑅𝑒𝑎𝑑𝑀𝑒.𝑡𝑥𝑡, the description file.                                      3. When the extraction is finished, the client will pop
   When using the GET method to call the API, input                             up completion window and automatically show the
the verification code and the Chinese abstract that needs                       output file path.
to be recognized in the corresponding location of the                        4. Click "Open" button to view the output file.
code. Then run 𝑘𝑒𝑦𝑝ℎ𝑟𝑎𝑠𝑒𝑠_𝑒𝑥𝑡𝑟𝑎𝑐𝑡𝑖𝑜𝑛_𝑐𝑛_𝑔𝑒𝑡 .𝑝𝑦 file and
the automatic keyphrase extraction results will be printed
                                                                         6    Conclusions
directly. When using POST method to call the API, open
the 𝑘𝑒𝑦𝑝ℎ𝑟𝑎𝑠𝑒𝑠_𝑒𝑥𝑡𝑟𝑎𝑐𝑡𝑖𝑜𝑛_𝑐𝑛_𝑝𝑜𝑠𝑡 .𝑝𝑦 file with the Python               In this paper, we make full use of the large-scale training cor-
editor and input the verification code to the corresponding              pus of Chinese Science Citation Database and the pretrained
location in the code. Set the paths of the input file and                language model BERT to construct a keyphrase extraction
output file, where the format of the input file is one line per          engine for Chinese scientific literature. We incorporate lexi-
abstract. Run 𝑘𝑒𝑦𝑝ℎ𝑟𝑎𝑠𝑒𝑠_𝑒𝑥𝑡𝑟𝑎𝑐𝑡𝑖𝑜𝑛_𝑐𝑛_𝑝𝑜𝑠𝑡 .𝑝𝑦 file, and                con features into the high-dimensional vector space of BERT,
the program will read the input file and write the keyphrase             fusing human knowledge to instruct the model training. To
extraction results to the output file.                                   support practical applications in multidisciplinary fields, the
                                                                         TF*IDF algorithm is introduced as a complement to better
5.3   Client Access                                                      capture the high-frequency words appearing in the text. We
                                                                         deploy the engine as a service, which can be invoked using
                                                                         the API, and the response speed is generally within 3 seconds.
                                                                         And we provide example scripts in Python for technical staff
                                                                         and a visualization client for non-technical personnel to use
                                                                         without writing a line of code. We hope that our keyphrase
                                                                         extraction engine can provide a feasible path for researchers
                                                                         to improve efficiency.

                                                                         7    ACKNOWLEDGMENTS
                                                                         The work is supported by the project “Artificial Intelligence
                                                                         (AI) Engine Construction Based on Scientific Literature
                                                                         Knowledge" (Grant No.E0290906) and the project “Key Tech-
                                                                         nology Optimization Integration and System Development of
                                                                         Next Generation Open Knowledge Service Platform" (Grant
                  Figure 5. Client Interface                             No.2021XM45).

  In order to provide for non-technical personnel to use,                References
we designed the client to realize the keyphrase extraction                [1] Peter D Turney. Learning algorithms for keyphrase extraction.
service for Chinese scientific literature without writing a                   Information retrieval, 2(4):303–336, 2000.


                                                                    34
             EEKE 2021 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents

 [2] Steve Jones and Mark S Staveley. Phrasier: a system for interactive              [20] Xiangyang Li, Huan Zhang, and Xiao-Hua Zhou. Chinese clinical
     document retrieval using keyphrases. In Proceedings of the 22nd annual                named entity recognition with variant neural structures based on bert
     international ACM SIGIR conference on Research and development in                     methods. Journal of biomedical informatics, 107:103422, 2020.
     information retrieval, pages 160–167, 1999.                                      [21] Liangping Ding, Zhixiong Zhang, Huan Liu, Jie Li, and Gaihong
 [3] Yongzheng Zhang, Nur Zincir-Heywood, and Evangelos Milios. World                      Yu. Automatic keyphrase extraction from scientific chinese medical
     wide web site summarization. Web Intelligence and Agent Systems: An                   abstracts based on character-level sequence labeling. Journal of Data
     International Journal, 2(1):39–53, 2004.                                              and Information Science,, 6(3):33–57, 2020.
 [4] Anette Hulth and Beáta Megyesi. A study on automatically extracted               [22] Liangping Ding, Zhixiong Zhang, and Yang Zhao. Bert-based chinese
     keywords in text categorization. In Proceedings of the 21st International             medical keyphrase extraction model enhanced with external features.
     Conference on Computational Linguistics and 44th Annual Meeting of                    International Conference on Asia-Pacific Digital Libraries, 2021.
     the Association for Computational Linguistics, pages 537–544, 2006.
 [5] Gábor Berend. Opinion expression mining by exploiting keyphrase
     extraction. In Proceedings of the 5th International Joint Conference on
     Natural Language Processing, pages 1162–1170. Asian Federation of
     Natural Language Processing, 2011.
 [6] Yi-fang Brook Wu, Quanzhi Li, Razvan Stefan Bot, and Xin Chen.
     Domain-specific keyphrase extraction. In Proceedings of the 14th ACM
     international conference on Information and knowledge management,
     pages 283–284, 2005.
 [7] Chengzhi Zhang. Automatic keyword extraction from documents
     using conditional random fields. Journal of Computational Information
     Systems, 4(3):1169–1180, 2008.
 [8] Anette Hulth. Improved automatic keyword extraction given more
     linguistic knowledge. In Proceedings of the 2003 conference on Empirical
     methods in natural language processing, pages 216–223, 2003.
 [9] Gerard Salton, Chung-Shu Yang, and CLEMENT T Yu. A theory of
     term importance in automatic text analysis. Journal of the American
     society for Information Science, 26(1):33–44, 1975.
[10] Lance A Ramshaw and Mitchell P Marcus. Text chunking using
     transformation-based learning. In Natural language processing using
     very large corpora, pages 157–176. Springer, 1999.
[11] Lev Ratinov and Dan Roth. Design challenges and misconceptions in
     named entity recognition. In Proceedings of the Thirteenth Conference
     on Computational Natural Language Learning (CoNLL-2009), pages
     147–155, 2009.
[12] Qi Zhang, Yang Wang, Yeyun Gong, and Xuan-Jing Huang. Keyphrase
     extraction using deep recurrent neural networks on twitter. In
     Proceedings of the 2016 conference on empirical methods in natural
     language processing, pages 836–845, 2016.
[13] Dhruva Sahrawat, Debanjan Mahata, Mayank Kulkarni, Haimin
     Zhang, Rakesh Gosangi, Amanda Stent, Agniv Sharma, Yaman Kumar,
     Rajiv Ratn Shah, and Roger Zimmermann. Keyphrase extraction
     from scholarly articles as sequence labeling using contextualized
     embeddings. arXiv preprint arXiv:1910.08840, 2019.
[14] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang.
     Squad: 100,000+ questions for machine comprehension of text. arXiv
     preprint arXiv:1606.05250, 2016.
[15] Funan Mu, Zhenting Yu, LiFeng Wang, Yequan Wang, Qingyu Yin,
     Yibo Sun, Liqun Liu, Teng Ma, Jing Tang, and Xing Zhou. Keyphrase
     extraction with span-based feature representations. arXiv preprint
     arXiv:2002.05407, 2020.
[16] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
     Bert: Pre-training of deep bidirectional transformers for language
     understanding. arXiv preprint arXiv:1810.04805, 2018.
[17] Iz Beltagy, Kyle Lo, and Arman Cohan. Scibert: A pretrained language
     model for scientific text. arXiv preprint arXiv:1903.10676, 2019.
[18] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu
     Kim, Chan Ho So, and Jaewoo Kang. Biobert: a pre-trained
     biomedical language representation model for biomedical text mining.
     Bioinformatics, 36(4):1234–1240, 2020.
[19] Tianyu Liu, Jin-Ge Yao, and Chin-Yew Lin. Towards improving neural
     named entity recognition with gazetteers. In Proceedings of the 57th
     Annual Meeting of the Association for Computational Linguistics, pages
     5301–5307, 2019.


                                                                                 35

</pre>