<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Design and Implementation of Keyphrase Extraction Engine for Chinese Scientific Literature</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Liangping Ding</string-name>
          <email>dingliangping@mail.las.ac.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Huan Liu</string-name>
          <email>liuhuan@mail.las.ac.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhixiong Zhang∗</string-name>
          <email>zhangzhx@mail.las.ac.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yang Zhao</string-name>
          <email>zhaoyang@mail.las.ac.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Science Library, Chinese Academy of Sciences</institution>
          ,
          <addr-line>Beijing, China</addr-line>
          ,
          <institution>Department of Library Information and Archives</institution>
          ,
          <addr-line>Management</addr-line>
          ,
          <institution>University of Chinese Academy of Science</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>26</fpage>
      <lpage>35</lpage>
      <abstract>
        <p>Accurate keyphrases summarize a document's main topics and are important for information retrieval and many other natural language processing tasks. In this paper, we construct a keyphrase extraction engine for Chinese scientific literature to assist researchers in improving the efficiency of scientific research. There are four key technical problems in the process of building the engine: how to select a keyphrase extraction algorithm, how to build a large-scale training set to achieve application-level performance, how to adjust and optimize the model to achieve better application results, and how to make the engine conveniently invocable by researchers. Aiming at the above problems, we propose corresponding solutions. The engine automatically recommends four to five keyphrases for the Chinese scientific abstract given by the user, and the response speed is generally within 3 seconds. The keyphrase extraction engine for Chinese scientific literature is developed based on advanced deep learning algorithms, a large-scale training set, and high-performance computing capacity, and might be an effective tool for researchers and publishers to quickly capture the key points of scientific text.</p>
      </abstract>
      <kwd-group>
        <kwd>Keyphrase Extraction</kwd>
        <kwd>Artificial Intelligence Engine</kwd>
        <kwd>Chinese Scientific Literature</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Keyphrase extraction task is a branch of information
extraction and has been a research hotspot for many years. It aims
to identify important topical phrases from text [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which is
of great significance for readers to quickly grasp the main
idea of articles and select those that meet their
reading interests. Keyphrase extraction is also the basis for
many natural language processing tasks such as information
retrieval [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], text summarization [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], text classification [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
opinion mining [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and document indexing [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>For Chinese scientific literature, there are cases of
missing keyphrases stored by publishers. In addition, many
keyphrases given by authors do not fully reveal the main idea
of the text. Keyphrase extraction for Chinese scientific
literature is therefore particularly important, not only to fill the gap
of keyphrase metadata fields in publishers’ repositories, but
also to serve as an effective complement to the keyphrases
given by authors themselves. It can also provide a reference
for researchers when writing Chinese scientific papers.</p>
      <p>
        The training corpus used in current Chinese keyphrase
extraction models is generally limited to one or several
subject areas and is relatively small in size [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which makes it
difficult to support large-scale applications. Moreover,
keyphrase extraction models are generally kept private
by their developers, making widespread use by
researchers difficult.
      </p>
      <p>To address the above problems, we constructed a keyphrase
extraction engine for Chinese scientific literature based on
a large-scale training corpus from multiple disciplines for
practical applications. The engine can be easily called by
means of Application Programming Interface (API) without
local model installation and configuration. In this paper,
we discuss the overall construction idea of building the
keyphrase extraction engine for Chinese scientific literature,
the solutions to the key technical problems, and the specific
engineering implementation of the engine.
</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Currently, popular keyphrase extraction methods can be
divided into three categories: (1) keyphrase extraction based
on traditional two-stage ranking; (2) keyphrase extraction
based on sequence labeling; (3) keyphrase extraction based
on span prediction. The traditional two-stage ranking
methods use heuristic rules to identify candidate
keyphrases from the text in the first stage, and use a ranking
algorithm to rank the candidate keyphrases in the second
stage. Commonly used ranking algorithms include term
frequency [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], TF*IDF [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], etc. A major drawback of this
two-stage approach is error propagation: errors made during
candidate keyphrase generation are passed on to
candidate keyphrase ranking.
      </p>
      <p>Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>To address this issue, researchers proposed unified keyphrase
extraction formulations, which regard keyphrase extraction
task as a sequence labeling task or span prediction task.</p>
      <p>
        Sequence labeling formulation usually uses BIO [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] or
BIOES [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] tagging schemes to annotate tokens in the
text sequences, and then train keyphrase extraction models
based on machine learning algorithms [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] or deep learning
algorithms [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ][
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The idea of span prediction formulation
originates from machine reading comprehension based on
SQuAD format [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], which predicts the role of tokens in
the sequence by training two binary classifiers to determine
whether they are the start and end positions of keyphrases
[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. However, no consensus has been reached about which
formulation should be used for the supervised keyphrase
extraction task.
      </p>
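      <p>As a concrete illustration of the span-prediction decoding described above, the following Python sketch pairs each predicted start position with the nearest end position at or after it. The helper name, the probability threshold, and the nearest-end pairing heuristic are our assumptions for illustration, not details taken from the cited work.</p>

```python
def decode_spans(tokens, start_probs, end_probs, threshold=0.5):
    """Decode keyphrases from two binary classifiers' outputs: positions whose
    start (resp. end) probability clears the threshold are treated as span
    boundaries, and each start is paired with the nearest end at or after it."""
    starts = [i for i, p in enumerate(start_probs) if p >= threshold]
    ends = [i for i, p in enumerate(end_probs) if p >= threshold]
    phrases = []
    for s in starts:
        # nearest end position not before this start
        e = next((j for j in ends if j >= s), None)
        if e is not None:
            phrases.append("".join(tokens[s:e + 1]))
    return phrases
```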
      <p>
        In addition, keyphrase extraction algorithm is another
important issue that should be paid attention to. In 2018,
Google released pretrained language model BERT [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ],
which attracted widespread attention in the field of natural
language processing. This work is widely regarded as a
landmark that provides a new paradigm for the
field of natural language processing. In the past three years,
a large number of pretrained language models have emerged,
and many researchers found that using pretrained language
models can lead to large improvements in the model
performance of downstream tasks [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ][
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Furthermore, some
researchers suggested that incorporating external features
such as lexicon feature to pretrained language model can
further boost the model performance [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ][
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
      <p>Even though advanced keyphrase extraction algorithms
have been applied, to the best of our knowledge there are few
publicly available keyphrase extraction engines that can be
directly called by users, which limits the industrialization
of academic achievements. In this paper, we illustrate the
construction process of a keyphrase extraction engine for
Chinese scientific literature, aiming to provide a reference
for academic research and industrial usage of keyphrase
extraction.</p>
    </sec>
    <sec id="sec-3">
      <title>The Overall Construction Idea</title>
      <p>To build a keyphrase extraction engine for Chinese scientific
literature that can be used for practical applications in
multiple disciplines, there are four key technical problems: how to
select a keyphrase extraction algorithm, how to build a
large-scale training set to achieve application-level performance,
how to adjust and optimize the model to achieve better
application results, and how to be conveniently invoked
by researchers.</p>
      <p>To address the problem of how to choose an appropriate
keyphrase extraction algorithm, we first investigated the
current popular and advanced keyphrase extraction
algorithms, and used publicly available dataset to compare model
performance and determine an optimal keyphrase extraction
model for engine construction.</p>
      <p>To address the problem of how to construct an
application-level large-scale training set, we took advantage of the
title, abstract and keyphrase metadata fields of the Chinese
Science Citation Database (CSCD) to construct a
large-scale training set covering multidisciplinary fields such
as medicine and health, industrial technology, agricultural
science, mathematical science, chemistry and biological
science.</p>
      <p>To address the problem of how to adjust and optimize
the model to achieve better application results, we used the
TF*IDF algorithm as a complement to compensate for the
data shortage of humanities domain in the training corpus.
We also used large-scale scientific literature as a corpus to
calculate the inverse document frequency. To address the
problem that keyphrases are often truncated by the TF*IDF
algorithm, we proposed a circular iterative splicing algorithm
to capture more accurate keyphrases.</p>
      <p>To address the problem of how to be conveniently invoked
by researchers, we deployed the keyphrase extraction model
as a service, so that researchers can call the API of the model
by GET or POST method to obtain the keyphrase extraction
results for the given text, without the need for local model
installation and configuration.
</p>
    </sec>
    <sec id="sec-4">
      <title>Solutions to Key Technical Problems</title>
      <p>For the four key technical problems faced in the engine
construction process, we proposed the corresponding solutions.
4.1 Selection of Keyphrase Extraction Model
Pretrained language model BERT has captured common
language representations from large-scale corpus, enabling
downstream supervised learning tasks to achieve great
model performance even with a small amount of labeled
data. We assumed that taking advantage of pretrained
language model, which has been pretrained using large-scale
unsupervised text, is of great value to build a keyphrase
extraction model for Chinese scientific literature applicable
to multi-disciplines. Therefore, we decided to construct a
keyphrase extraction model for Chinese scientific literature
based on BERT-Base-Chinese, and tried to experiment with
both sequence labeling formulation and span prediction
formulation to find the optimal keyphrase extraction algorithm
for keyphrase extraction engine.</p>
      <p>
        It is worth noting that for Chinese keyphrase extraction,
there is no delimiter like space in English to indicate the
segmentation of words. So it’s necessary to consider whether
to use character or word as the minimal language unit to
feed into the model. It has been shown that for Chinese
keyphrase extraction task, using character as the smallest
linguistic unit can achieve better results [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. In Chinese,
word is the smallest unit for expressing semantics. Even
though character formulation can avoid the errors caused
by Chinese tokenizer, it also loses some of the semantics. To
remedy the deficiency, we considered incorporating external
features including POS feature and lexicon feature into the
model to add in semantics and human knowledge indirectly.
      </p>
      <p>
        We used the publicly available Chinese keyphrase
extraction dataset CAKE [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] for the experiments to
determine the best algorithm, which is a dataset containing
Chinese medical abstracts from CSCD in sequence labeling
format. 100,000 abstracts are included in the training set
and 3,094 abstracts are included in the test set. Based on
the training set of CAKE, we conducted experiments on
five models: BERT + SoftMax, BERT + POS + SoftMax,
BERT + Lexicon + SoftMax, BERT + CRF, and BERT + Span.
The first four of these models are based on the sequence labeling
task formulation, while the last model is based on the span
prediction formulation. A short description of each model
follows:
1. The BERT + SoftMax model defined the task of
keyphrase extraction from Chinese scientific literature
as a character-level sequence labeling task, where
each token was annotated in the BIO tagging scheme.
A SoftMax classification layer was added on top of
the pretrained language model BERT to output the
probability of each category. The parameters of BERT
were fine-tuned on the CAKE training data.
2. Based on the BERT + SoftMax model, we fused the POS
feature into the embedding space of BERT to incorporate
word semantics indirectly and constructed the BERT +
POS + SoftMax model. The POS tags were generated
by HanLP (https://github.com/hankcs/HanLP). The details of feature incorporation and
model construction are shown in [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
3. We collected keyphrases from the keyphrase metadata
fields in CSCD restricted to the medical domain. Based
on the BERT + SoftMax model, we used the BIO tagging
scheme to generate the lexicon feature and embedded it
into BERT to add domain features and indicate word
boundary information to some extent, composing the
BERT + Lexicon + SoftMax model.
4. The BERT + CRF model used a Conditional Random Field
(CRF) layer on top of BERT to capture the sequential
features among labels. To learn a reasonable transition
matrix, we used a hierarchical learning rate, using a
learning rate of 5e-5 for training the parameters of the
neural network layers of BERT and a learning rate of
0.01 for training the parameters of the CRF layer.
5. The BERT + Span model defined the keyphrase extraction
task as a span prediction problem. Two binary classifiers
were trained to determine whether each token is a
start position or an end position of a keyphrase.
      </p>
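      <p>The four sequence-labeling models all emit character-level BIO tags at prediction time; recovering keyphrase strings from such a tag sequence can be sketched as follows (a hypothetical helper, not the authors' code):</p>

```python
def decode_bio(tokens, tags):
    """Collect keyphrases from a character-level BIO tag sequence: a phrase
    starts at a 'B' tag and extends over the following consecutive 'I' tags."""
    phrases, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                phrases.append("".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            if current:
                phrases.append("".join(current))
            current = []
    if current:
        phrases.append("".join(current))
    return phrases
```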
      <p>Table 1 shows the keyphrase extraction performance of
the above-mentioned models on the CAKE test set. In the
experiments of keyphrase extraction for Chinese scientific
literature, we were concerned with how many correct
keyphrases we can identify from the given text. Therefore,
we compared the keyphrases predicted by the model with the
keyphrases given by the authors and calculated the precision,
recall and F1-score to evaluate the model performance. The
formula for each indicator is as follows.</p>
      <p>Precision = c / r (1)</p>
      <p>Recall = c / s (2)</p>
      <p>F1-score = (2 × Precision × Recall) / (Precision + Recall) (3)</p>
      <p>where c denotes the number of keyphrases predicted by
the model that match the author-given keyphrases; r denotes
the total number of keyphrases predicted by the model;
and s denotes the number of all author-given keyphrases.</p>
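      <p>Under these definitions, the exact-match computation can be sketched in Python (the helper name is ours; the authors' evaluation scripts are not part of the paper):</p>

```python
def evaluate_keyphrases(predicted, author_given):
    """Exact-match evaluation: c = correct matches, r = total predicted,
    s = total author-given keyphrases."""
    c = len(set(predicted) & set(author_given))
    r = len(set(predicted))
    s = len(set(author_given))
    precision = c / r if r else 0.0
    recall = c / s if s else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```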
      <p>The experimental results showed that the best results were
achieved by adding a SoftMax layer directly on top of
the BERT model for classification while incorporating the lexicon
features, i.e., the BERT + Lexicon + SoftMax
model. Without adding external features, the BERT + CRF
model and the BERT + Span model achieved better results than</p>
      <sec id="sec-4-1">
        <title>Construction of Application-Level Large-Scale Training Set</title>
        <p>the BERT + SoftMax model. We finally decided to use the
BERT + Lexicon + SoftMax model architecture to build the
keyphrase extraction engine for Chinese scientific literature.</p>
        <p>We aimed to build a keyphrase extraction engine for Chinese
scientific literature applicable to multidisciplinary fields
using large-scale training data, whereas the CAKE dataset only
contained 100,000 abstracts from the medical field, which cannot
meet the demand for practical applications. So we
constructed a large-scale dataset based on CSCD and evaluated
the quality of the dataset. The details of the training set
generation are described as follows.</p>
        <p>In order to ensure that the constructed training set had a
high recall and can annotate as many keyphrases as possible,
we processed the title, abstract and keyphrase fields in the
Chinese Science Citation Database and selected the records
in which all of the author’s given keyphrases appeared in the
abstract. Finally, a total of 1,137,945 records were obtained
to satisfy the above conditions, and the total number of
keyphrases was 1,055,335 (removing duplicates).</p>
        <p>We selected 1.1 million records for generating the training
set and 37,945 records for generating the test set. Based on the
obtained titles, abstracts and keyphrases, we concatenated
titles and their corresponding abstracts with a period, and used the
BIO tagging scheme to convert the final concatenated text
into sequence labeling format, and assigned labels to each
token to generate the dataset in the format required for
model training. Specifically, given the concatenated text and
keyphrases, we assigned label "B" to the first token of the
keyphrase in the text, "I" to the other tokens of the keyphrase,
and "O" to the tokens in the text that did not belong to any
keyphrase.</p>
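        <p>The labeling step described above can be sketched as a small Python helper (the function name is ours; the handling of overlapping and included keyphrases, done as in Ding et al., is simplified here to skipping colliding spans):</p>

```python
def bio_encode(text, keyphrases):
    """Character-level BIO labeling: 'B' marks the first character of each
    keyphrase occurrence in the text, 'I' its remaining characters, 'O' the rest."""
    tags = ["O"] * len(text)
    for kp in keyphrases:
        start = text.find(kp)
        while start != -1:
            # only label spans that do not collide with an earlier keyphrase
            if all(t == "O" for t in tags[start:start + len(kp)]):
                tags[start] = "B"
                for i in range(start + 1, start + len(kp)):
                    tags[i] = "I"
            start = text.find(kp, start + 1)
    return tags
```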
        <p>
          To ensure that the training set is of high quality and
to avoid providing incorrect supervised signals for model
training, we assessed the quality of the training set by
comparing author-given keyphrases with the automatically
extracted keyphrases in the dataset. The assessment results
of the training set are shown in Table 2. It is worth noting
that in the process of training set generation, we used the
same processing technique as Ding et al. [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] and therefore
the quality of the training set cannot reach 100%. For
example, if there was an inclusion relationship between two
keyphrases, the longest keyphrase would be selected for
labeling; if there was an overlapping relationship between
two keyphrases, the two keyphrases would be concatenated
according to the overlapping tokens.
        </p>
        <p>In order to ensure that the model can support large-scale
applications in multidisciplinary domains, we counted the
first-class discipline distribution in the training set based on
the Chinese Library Classification (CLC); the statistics are
shown in Table 3. (Some articles have more than one CLC
code, so the statistics total is over 1.1 million.)</p>
        <sec id="sec-4-1-1">
          <title>Model Adjustment and Optimization</title>
          <p>Based on the finalized BERT + Lexicon + SoftMax model,
we fine-tuned the model using 1.1 million BIO-format
Chinese scientific records from multidisciplinary domains.
The parameters used in the training process are shown in
Table 4. (Because of computational limitations, the batch size
was set to 7, and we assumed that 1 epoch was enough given
the large-scale training set.) Due to memory limitations, it was
not feasible to load the entire dataset into memory, so we
transformed the data into the format shown in Figure 1. We
loaded the data with the PyTorch DataLoader, which read one
record at a time using an iterator, and calculated the gradient
of the model after the amount of data reached the batch size.
The final model performance on our all-domain test set is
shown in Table 5. It is worth noting that the practical keyphrase
extraction results are better than the statistical indicators
suggest, because we used the exact-match principle to calculate
the related indicators, while some recognized keyphrases are
not included in the author-given keyphrases but still indicate
the main points of the text.</p>
          <p>By observing the test results of the model during
practical application, we found that the model did not achieve
the expected prediction results for data in the humanities
domain and could not capture the high-frequency words
appearing in the text. As shown in Table 3, the sample size
for the humanities domain was small, and apparently the
model did not capture enough features on the data from</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>Indicators</title>
        <p>these domains, causing the problem that the number of
keyphrases that can be identified for these domains is very
limited. To address this issue, we decided to use the TF*IDF
algorithm as a complement to the extraction results of
the BERT + Lexicon + SoftMax model to capture the
high-frequency keyphrases that appear in the text.</p>
        <p>We randomly selected 1 million abstracts from the Chinese
Science Citation Database as the training corpus for the
calculation of inverse document frequency (IDF), using Jieba
as the Chinese tokenizer to segment the words. To guide
word segmentation and prevent professional terms from being
cut incorrectly, we introduced all the keyphrases in CSCD,
totaling 2,606,322 (without duplicates), as Jieba’s user-defined
lexicon. The inverse document frequency was calculated for the
phrases in the custom lexicon as well as the nouns
in the corpus, and finally an IDF file was obtained for subsequent
keyphrase extraction of Chinese scientific literature based
on the TF*IDF algorithm.</p>
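        <p>A minimal sketch of this IDF construction and TF*IDF ranking is shown below. It assumes already-tokenized documents over a toy corpus; the engine itself uses Jieba with the CSCD user-defined lexicon and restricts candidates to nouns, which is not reproduced here.</p>

```python
import math
from collections import Counter

def build_idf(tokenized_corpus):
    """IDF over a corpus of tokenized documents: log(N / document frequency)."""
    n = len(tokenized_corpus)
    df = Counter()
    for doc in tokenized_corpus:
        df.update(set(doc))  # count each term once per document
    return {term: math.log(n / df[term]) for term in df}

def tfidf_top_terms(doc_tokens, idf, top_k=10):
    """Rank a document's terms by term frequency times IDF, highest first."""
    tf = Counter(doc_tokens)
    total = len(doc_tokens)
    scores = {t: (tf[t] / total) * idf.get(t, 0.0) for t in tf}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```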
        <p>At the same time, in order to solve the problem that
the keyphrases extracted by the TF*IDF algorithm were often
truncated, we designed a circular iterative splicing algorithm
as an improved TF*IDF algorithm. This algorithm spliced the
keyphrases identified by the TF*IDF algorithm two by two and
determined whether the spliced keyphrases still appeared in
the original text. The iterative splicing continued until
no new keyphrases appeared. We combined the recognized
keyphrases of the BERT + Lexicon + SoftMax model with those
of the improved TF*IDF algorithm as the final keyphrase
extraction results for Chinese scientific literature; the
specific process is as follows.</p>
        <p>For a given scientific abstract, the BERT + Lexicon +
SoftMax model is used to recognize keyphrases first. If the
number of keyphrases recognized by BERT + Lexicon +
SoftMax was less than 4, the TF*IDF algorithm would
be introduced as a complement; otherwise, the keyphrase
extraction results of the BERT + Lexicon + SoftMax model
were returned directly. In the keyphrase extraction process of
the TF*IDF algorithm, the keyphrases were restricted to nouns
or pronouns, etc., and the top 10 keyphrases by TF*IDF
value were taken. We removed short keyphrases that overlapped
with other keyphrases, as well as keyphrases whose length
was less than two. Then we used the circular iterative
splicing algorithm to splice the keyphrases identified by
TF*IDF two by two in two directions, splicing from the left
and from the right. If the spliced keyphrases still appeared in
the text, we kept the spliced keyphrases and tagged the two original
keyphrases as used keyphrases for deletion; otherwise, we kept
the keyphrases that were not successfully spliced. This process
was iterated until no new keyphrases appeared in
the original text. The keyphrases identified by the improved
TF*IDF algorithm were sorted in descending order according
to the TF*IDF value.</p>
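        <p>The circular iterative splicing loop described above can be sketched as follows (a simplified, hypothetical implementation: the double loop splices each pair in both orders, and the final TF*IDF re-ranking is omitted):</p>

```python
def iterative_splice(keyphrases, text):
    """Repeatedly concatenate keyphrase pairs; keep a concatenation if it still
    occurs in the original text, marking both parts as used, and iterate until
    no new keyphrase appears."""
    current = list(dict.fromkeys(keyphrases))  # drop duplicates, keep order
    for _ in range(len(text)):  # safety bound; usually converges in a few rounds
        spliced, used = [], set()
        for a in current:
            for b in current:
                if a != b and a + b in text and a + b not in spliced:
                    spliced.append(a + b)
                    used.update((a, b))
        if not any(s not in current for s in spliced):
            break  # no new keyphrase appeared in this round
        # keep spliced phrases plus the originals that were never used
        current = spliced + [kp for kp in current if kp not in used]
    return current
```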
        <p>The keyphrase extraction results of the BERT + Lexicon +
SoftMax model and the improved TF*IDF algorithm were
combined and ranked as the final results. The priority of
the keyphrases identified by the BERT + Lexicon + SoftMax
model was higher than that of the keyphrases identified by
the improved TF*IDF algorithm. Based on this principle, we
merged the keyphrase extraction results of the BERT + Lexicon +
SoftMax model with those of the improved TF*IDF algorithm, and took
the longest keyphrase for keyphrases with an inclusion
relationship. Finally, the top five keyphrases became the
final keyphrases. In addition, we used some heuristic rules
to filter the final keyphrases, such as removing keyphrases
ending with special characters, to improve the accuracy
of the keyphrase extraction model.</p>
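        <p>The merging and inclusion-filtering step can be sketched as follows (a hypothetical helper; the heuristic rules for special characters are omitted):</p>

```python
def merge_results(model_kps, tfidf_kps, top_k=5):
    """Merge model keyphrases (higher priority, listed first) with improved
    TF*IDF keyphrases, keep only the longest keyphrase among those related by
    inclusion, and return the top candidates."""
    merged = list(dict.fromkeys(model_kps + tfidf_kps))
    # drop any keyphrase strictly contained in another candidate
    kept = [kp for kp in merged
            if not any(kp != other and kp in other for other in merged)]
    return kept[:top_k]
```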
        <p>To further elaborate, for an input abstract
(https://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CJFD&amp;dbname=CJFDAUTODAY&amp;filename=HKKX202103004&amp;v=G8TESBUsSe2JeIClg6moqemy3ExscLTVMNxH885u%25mmd2BI%25mmd2Bl9p5i%25mmd2FUmcOUqnMUOyTZM5),
the BERT + Lexicon + SoftMax model would process this input first and
get the keyphrases ‘适航安全性(Airworthiness Safety)’,
‘架构设计(Architecture Design)’, and ‘民用飞机(Civilian
Aircraft)’. The keyphrases extracted by the BERT + Lexicon +
SoftMax model were fewer than four, so the TF*IDF algorithm
was triggered to get the top 10 keyphrases according to the
TF*IDF value. After the preprocessing, there were eight
keyphrases extracted by the TF*IDF algorithm: ‘民用飞机(Civilian Aircraft)’,
‘架构设计(Architecture Design)’, ‘电传飞控系统(Telex Flight
Control System)’, ‘安全性需求(Security Requirements)’,
‘适航规范(Airworthiness Specifications)’, ‘需求论证(Proof of Need)’,
‘具体体现(Specific Embodiment)’, ‘安全要求(Security Requirements)’. As we
can see, there were some redundant keyphrases recognized
by the traditional TF*IDF algorithm, such as ‘具体体现(Specific
Embodiment)’. Next, the circular iterative splicing algorithm would
splice the keyphrases two by two. The changes in variables
during the iteration process are shown in Figure 2.</p>
        <p>As we can see, in the first iteration, seven
spliced keyphrases occurred in the abstract, of which three
were new keyphrases and four were original keyphrases
(unused keyphrases) that did not splice with other keyphrases.
In the second iteration, no new keyphrases arose, so
the iteration finished and all seven spliced keyphrases
were kept and returned as the results of the improved TF*IDF
algorithm. Then, we ranked the keyphrases generated by
the improved TF*IDF algorithm and combined them with
those of the BERT + Lexicon + SoftMax model to get the further
results: ‘适航安全性(Airworthiness Safety)’, ‘架构设计(Architecture
Design)’, ‘民用飞机(Civilian Aircraft)’, ‘民用飞机电传飞控系统(Civilian
Aircraft Telemetry Flight Control System)’, ‘电传飞控系统架构设计(Architecture
Design of the Telemetry Flight Control System)’, ‘安全性需求(Security
Requirements)’, ‘民用飞机适航规范(Airworthiness Specifications
for Civil Aircraft)’, ‘需求论证(Proof of Need)’,
‘具体体现(Specific Embodiment)’, ‘安全要求(Security
Requirements)’. Finally, we removed the short keyphrases
that had an inclusion relationship with others and got
the ultimate top 5 recognized keyphrases: ‘适航安全性(Airworthiness
Safety)’, ‘民用飞机电传飞控系统(Civilian
Aircraft Telemetry Flight Control System)’, ‘电传飞控系统架构设计(Architecture
Design of the Telemetry Flight Control System)’,
‘安全性需求(Security Requirements)’, ‘民用飞机适航规范(Airworthiness
Specifications for Civil Aircraft)’. It can be seen that the
final keyphrase extraction results of our proposed hybrid
model are better than those of the BERT + Lexicon + SoftMax
model and the TF*IDF model alone.</p>
        <sec id="sec-4-2-1">
          <title>API Design</title>
          <p>In order to avoid various hardware and software constraints
that may be encountered in the local deployment of the
model, and to provide a fast and convenient way for
researchers to invoke the keyphrase extraction model, we
deployed the keyphrase extraction model as a service, and
built a keyphrase extraction engine for Chinese scientific
literature through API calls. Researchers can call the API of
the engine in two ways, POST and GET, to achieve automatic
keyphrase extraction of Chinese scientific literature. Pass
in the abstract of Chinese scientific literature and the
verification code, and the engine would return the keyphrase
extraction results in JSON format.</p>
          <p>For the GET method, users can send a request to the
URL http://sciengine.las.ac.cn/keywords_extraction_cn to
call the keyphrase extraction engine, passing in the abstract
of a Chinese scientific article and the verification
code. When the engine receives the call, it responds by
returning the keyphrase extraction results in JSON format.
Details of the GET API call are shown in Table 6.</p>
          <p>For the POST method, users can send a request to the
URL http://sciengine.las.ac.cn/keywords_extraction_cn to
call the keyphrase extraction engine, depositing the abstracts
of multiple Chinese scientific articles into a list and passing
in the verification code. After the engine responds, it
returns the keyphrase extraction results of all the abstracts
in the list in JSON format to achieve batch processing. The
details of the POST API call are shown in Table 7.</p>
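          <p>A minimal Python client for the batch POST call might look like the sketch below. The JSON body shape and the token value 99999 follow the paper's example; the Content-Type header and the use of the standard-library urllib are our assumptions, and the network request itself is not executed here.</p>

```python
import json
import urllib.request

API_URL = "http://sciengine.las.ac.cn/keywords_extraction_cn"

def build_post_payload(abstracts, token):
    """JSON body for the batch POST call: a list of abstracts plus the token."""
    return {"data": abstracts, "token": token}

def extract_keyphrases(abstracts, token):
    """POST the abstracts to the engine and return the parsed JSON response."""
    body = json.dumps(build_post_payload(abstracts, token)).encode("utf-8")
    request = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```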
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Engineering Implementation</title>
      <p>In order to display the keyphrase extraction results
intuitively and meet the demands of different users calling the
engine, we currently provide three ways to call the keyphrase
extraction API: browser online demo, Python code access, and
client access. The calling flow of the keyphrase extraction
engine for Chinese scientific literature is shown in Figure 3.</p>
      <sec id="sec-5-1">
        <title>Browser Online Demo</title>
        <p>Users can visit the URL http://sciengine.las.ac.cn/Keywords_BIO_Lexi
to test the keyphrase extraction engine online.</p>
        <sec id="sec-5-1-1">
          <title>GET API Call (Table 6)</title>
          <p>Request URL: http://sciengine.las.ac.cn/keywords_extraction_cn. Request parameters format: "data": the abstract of a Chinese scientific article; "token": the verification code. Success message: {"keywords": [keyphrases list]}. Example request parameters: {"data": "新辅助治疗背景下胰腺癌扩大切除术的应用价值。胰腺癌恶性程度高,预后较差,治疗效果仍不理想...(The value of extended resection of pancreatic cancer in the context of neoadjuvant therapy. Pancreatic cancer is highly malignant, with poor prognosis and still unsatisfactory treatment results...)", "token": 99999}. Example request URL: http://sciengine.las.ac.cn/keywords_extraction_cn?data=新辅助治疗背景下胰腺癌扩大切除术的应用价值。胰腺癌恶性程度高,预后较差,治疗效果仍不理想...(The value of extended resection of pancreatic cancer in the context of neoadjuvant therapy. Pancreatic cancer is highly malignant, with poor prognosis and still unsatisfactory treatment results...)&amp;token=99999. Example response: {"keywords": ["胰腺癌(pancreatic cancer)", "新辅助治疗(neoadjuvant therapy)", "扩大切除术(extended resection)"]}.</p>
        </sec>
        <sec id="sec-5-1-2">
          <title>POST API Call (Table 7)</title>
          <p>Request URL: http://sciengine.las.ac.cn/keywords_extraction_cn. Request parameters format: {"data": [abstracts of Chinese scientific articles], "token": verification code}. Success message: {Abstract ID: [keyphrases list]}. Example request body: {"data": ["新辅助治疗背景下胰腺癌扩大切除术的应用价值。胰腺癌恶性程度高,预后较差,治疗效果仍不理想...(The value of extended resection of pancreatic cancer in the context of neoadjuvant therapy. Pancreatic cancer is highly malignant, with poor prognosis and still unsatisfactory treatment results...)", "参芪地黄汤联合ACEI/ARB类药物治疗糖尿病肾病的Meta分析...(Meta-analysis of Shenqi Dihuang Decoction combined with ACEI/ARB drugs in the treatment of diabetic nephropathy...)"], "token": 99999}. Example response: {0: ["胰腺癌(pancreatic cancer)", "新辅助治疗(neoadjuvant therapy)", "扩大切除术(extended resection)"], 1: ["糖尿病肾病(diabetic nephropathy)", "ACEI/ARB类(ACEI/ARB)", "META分析(Meta-analysis)", "参芪地黄汤(Shenqi Dihuang Decoction)"]}.</p>
        </sec>
        <sec id="sec-5-1-3">
          <title>Error Messages</title>
          <p>Error message format: {"info": error message}. Examples: {"info": "Server not available!"}, {"info": "Token incorrect!"}.</p>
        </sec>
        <sec id="sec-5-1-4">
          <title>Using the Online Demo</title>
          <p>Type the abstract of a Chinese scientific article in the input
box (it is recommended to use the title + ‘。’ + abstract as
input) and click the keyphrase extraction button; the engine
API will be called automatically to invoke the underlying model,
and four to five keyphrases related to the main idea will be
returned. The response time of the engine is generally within
3 seconds, and the interface of the browser online demo is
shown in Figure 4.</p>
        </sec>
      </sec>
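<p>The two reply shapes described above (a keyphrases payload on success, an "info" message on failure) suggest a small response handler. The function below is an illustrative sketch, not part of the shipped scripts.</p>

```python
import json

def parse_reply(raw: str):
    """Decode an engine reply: raise the server's message on an error
    reply ({"info": ...}), otherwise return the keyphrase payload."""
    reply = json.loads(raw)
    if isinstance(reply, dict) and "info" in reply:
        raise RuntimeError(reply["info"])  # e.g. "Token incorrect!"
    return reply
```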
      <sec id="sec-5-2">
        <title>Python Code Access</title>
        <p>Technical staff who are familiar with the Python programming
language can download the corresponding sample codes from the
website http://sciengine.las.ac.cn/Scripts and revise the file
paths for convenient usage. The download contains four files:
a sample script for calling the API of the keyphrase extraction
engine using the GET method, a sample script for calling the
API using the POST method, a sample input file, and a
description file.</p>
        <p>When using the GET method to call the API, input the
verification code and the Chinese abstract to be recognized at
the corresponding locations in the code, then run the GET
sample script; the automatic keyphrase extraction results will
be printed directly. When using the POST method to call the
API, open the POST sample script with a Python editor and input
the verification code at the corresponding location in the
code. Set the paths of the input file and output file, where
the format of the input file is one line per abstract. Run the
POST sample script, and the program will read the input file
and write the keyphrase extraction results to the output
file.</p>
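<p>The POST workflow just described (read one abstract per line, send the batch, write the results) might be sketched as follows, with the HTTP call factored out so the data handling is visible. The helper names and the tab-separated output format are our own illustration, not the shipped sample code.</p>

```python
import json

def load_abstracts(text: str) -> list:
    """Input-file format described in the paper: one abstract per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def build_request(abstracts: list, token: str) -> bytes:
    """JSON body for the POST API: a list of abstracts plus the token."""
    return json.dumps({"data": abstracts, "token": token},
                      ensure_ascii=False).encode("utf-8")

def format_results(reply: dict) -> str:
    """The engine maps each abstract ID to its keyphrases;
    emit one tab-separated line per abstract."""
    return "\n".join(f"{i}\t{', '.join(kws)}"
                     for i, kws in sorted(reply.items()))

# An actual invocation (network access required) might look like:
#   from urllib.request import urlopen, Request
#   body = build_request(load_abstracts(open("input.txt", encoding="utf-8").read()), "99999")
#   req = Request("http://sciengine.las.ac.cn/keywords_extraction_cn",
#                 data=body, headers={"Content-Type": "application/json"})
#   open("output.txt", "w", encoding="utf-8").write(
#       format_results(json.loads(urlopen(req).read())))
```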
      </sec>
      <sec id="sec-5-3">
        <title>Client Access</title>
        <p>single line of code. Users can download and install the
client from the website htp://sciengine.las.ac.cn/Client, and
get the verification code as the login credentials to call
the keyphrase extraction engine API to achieve automatic
keyphrase extraction of Chinese scientific literature. The
keyphrase extraction engine client interface is shown in
Figure 5 and the specific operation process is as follows.
1. After opening the client and entering the verification
code, click the button of "Keyphrase Extraction for
Chinese Scientific Literature" in the menu bar to enter
the interface of keyphrase extraction function. Click
"Browse" button to import the file to be processed, and
it means successful if the data presentation box shows
the imported data, and the message box shows the
total number of the data.
2. Click "Start Extraction" button, the client will
automatically carry out the function of keyphrase extraction
for Chinese scientific literature and display the
realtime processing progress.
3. When the extraction is finished, the client will pop
up completion window and automatically show the
output file path.</p>
        <p>4. Click the "Open" button to view the output file.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>In this paper, we make full use of the large-scale training
corpus of the Chinese Science Citation Database and the
pretrained language model BERT to construct a keyphrase
extraction engine for Chinese scientific literature. We
incorporate lexicon features into the high-dimensional vector
space of BERT, fusing human knowledge to guide the model
training. To support practical applications in
multidisciplinary fields, the TF*IDF algorithm is introduced
as a complement to better capture the high-frequency words
appearing in the text. We deploy the engine as a service that
can be invoked through the API, with a response time generally
within 3 seconds. We also provide example scripts in Python
for technical staff and a visualization client for
non-technical personnel to use without writing a line of code.
We hope that our keyphrase extraction engine can provide a
feasible path for researchers to improve efficiency.</p>
    </sec>
    <sec id="sec-7">
      <title>ACKNOWLEDGMENTS</title>
      <p>The work is supported by the project "Artificial Intelligence
(AI) Engine Construction Based on Scientific Literature
Knowledge" (Grant No. E0290906) and the project "Key Technology
Optimization Integration and System Development of Next
Generation Open Knowledge Service Platform" (Grant
No. 2021XM45).</p>
    </sec>
  </body>
  <back>
  </back>
</article>