=Paper=
{{Paper
|id=Vol-3004/paper4
|storemode=property
|title=Design and Implementation of Keyphrase Extraction Engine for Chinese Scientific Literature
|pdfUrl=https://ceur-ws.org/Vol-3004/paper4.pdf
|volume=Vol-3004
|authors=Liangping Ding,Zhixiong Zhang,Huan Liu,Yang Zhao
|dblpUrl=https://dblp.org/rec/conf/jcdl/DingZLZ21
}}
==Design and Implementation of Keyphrase Extraction Engine for Chinese Scientific Literature==
''EEKE 2021 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents''

Liangping Ding (dingliangping@mail.las.ac.cn), Zhixiong Zhang† (zhangzhx@mail.las.ac.cn), Huan Liu (liuhuan@mail.las.ac.cn), Yang Zhao (zhaoyang@mail.las.ac.cn)

National Science Library, Chinese Academy of Sciences, Beijing, China; Department of Library, Information and Archives Management, University of Chinese Academy of Sciences, Beijing, China

† Corresponding author. Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===Abstract===
Accurate keyphrases summarize the main topics of a document and are important for information retrieval and many other natural language processing tasks. In this paper, we construct a keyphrase extraction engine for Chinese scientific literature to help researchers improve the efficiency of scientific research. Building the engine raises four key technical problems: how to select a keyphrase extraction algorithm, how to build a large-scale training set that achieves application-level performance, how to adjust and optimize the model for better practical results, and how to make the engine convenient for researchers to invoke. We propose a solution to each of these problems. The engine automatically recommends four to five keyphrases for a user-supplied Chinese scientific abstract, generally responding within 3 seconds. Developed on advanced deep learning algorithms, a large-scale training set, and high-performance computing capacity, the engine can be an effective tool for researchers and publishers to quickly capture the key points of scientific text.

'''Keywords:''' Keyphrase Extraction, Artificial Intelligence Engine, Chinese Scientific Literature

===1 Introduction===
Keyphrase extraction is a branch of information extraction and has been a research hotspot for many years. It aims to identify important topical phrases in text [1], which helps readers quickly grasp the main idea of an article and select articles that match their reading interests. Keyphrase extraction is also the basis of many natural language processing tasks such as information retrieval [2], text summarization [3], text classification [4], opinion mining [5], and document indexing [6].

For Chinese scientific literature, publishers sometimes store no keyphrases at all, and many author-given keyphrases do not fully reveal the main idea of the text. Keyphrase extraction for Chinese scientific literature is therefore particularly important: it can fill gaps in the keyphrase metadata fields of publishers' repositories, serve as an effective complement to the keyphrases given by authors, and provide a reference for researchers writing Chinese scientific papers.

The training corpora used by current Chinese keyphrase extraction models are generally limited to one or a few subject areas and are relatively small [7], which makes them hard to apply at scale. Moreover, keyphrase extraction models are generally kept by their developers, making widespread use by researchers difficult.

To address these problems, we constructed a keyphrase extraction engine for Chinese scientific literature based on a large-scale, multi-disciplinary training corpus for practical applications. The engine can be called through an Application Programming Interface (API) without local model installation or configuration.
In this paper, we discuss the overall idea behind building the keyphrase extraction engine for Chinese scientific literature, the solutions to the key technical problems, and the specific engineering implementation of the engine.

===2 Related Work===
Currently, popular keyphrase extraction methods fall into three categories: (1) traditional two-stage ranking; (2) sequence labeling; (3) span prediction. Two-stage ranking methods use heuristic rules to identify candidate keyphrases from the text in the first stage and a ranking algorithm to rank the candidates in the second stage; commonly used ranking criteria include term frequency [8] and TF*IDF [9]. A major drawback of the two-stage approach is error propagation: errors made during candidate generation are passed on to candidate ranking.

To address this issue, researchers proposed unified formulations that cast keyphrase extraction as a sequence labeling task or a span prediction task. The sequence labeling formulation usually annotates the tokens of a text with the BIO [10] or BIOES [11] tagging scheme and then trains extraction models based on machine learning [12] or deep learning algorithms [12][13]. The span prediction formulation originates from machine reading comprehension in the SQuAD format [14]; it trains two binary classifiers that predict whether each token is the start or the end position of a keyphrase [15]. No consensus has been reached on which formulation should be used for supervised keyphrase extraction.

The choice of the underlying model is another important issue. In 2018, Google released the pretrained language model BERT [16], which attracted widespread attention in natural language processing and is widely regarded as a landmark that provides a new paradigm for the field. In the past three years, a large number of pretrained language models have emerged, and many researchers have found that they lead to large improvements on downstream tasks [17][18]. Furthermore, some researchers have suggested that incorporating external features, such as lexicon features, into a pretrained language model can further boost performance [19][20].

Even though advanced keyphrase extraction algorithms exist, to the best of our knowledge few publicly available keyphrase extraction engines can be called directly by users, which limits the industrial application of academic results. In this paper, we illustrate the construction of a keyphrase extraction engine for Chinese scientific literature, aiming to provide a reference for both academic research and industrial use of keyphrase extraction.

===3 The Overall Construction Idea===
To build a keyphrase extraction engine for Chinese scientific literature that can serve practical applications across multiple disciplines, four key technical problems must be solved: how to select a keyphrase extraction algorithm, how to build a large-scale training set that achieves application-level performance, how to adjust and optimize the model for better practical results, and how to make the engine convenient for researchers to invoke.

To choose an appropriate keyphrase extraction algorithm, we first surveyed current popular and advanced keyphrase extraction algorithms, then used a publicly available dataset to compare model performance and determine the optimal model for engine construction.

To construct an application-level large-scale training set, we took advantage of the title, abstract, and keyphrase metadata fields of the Chinese Science Citation Database (CSCD) to build a training set covering multidisciplinary fields such as medicine and health, industrial technology, agricultural science, mathematical science, chemistry, and biological science.

To adjust and optimize the model, we used the TF*IDF algorithm as a complement that compensates for the shortage of humanities data in the training corpus, computing inverse document frequencies over a large corpus of scientific literature. Because keyphrases are often truncated by the TF*IDF algorithm, we proposed a circular iterative splicing algorithm to capture more accurate keyphrases.

To make the engine convenient to invoke, we deployed the keyphrase extraction model as a service, so that researchers can call the API of the model via GET or POST to obtain keyphrase extraction results for a given text, without local model installation or configuration.

===4 Solutions to Key Technical Problems===
For each of the four key technical problems faced during engine construction, we propose a corresponding solution.

====4.1 Selection of Keyphrase Extraction Model====
The pretrained language model BERT captures general language representations from large-scale corpora, enabling downstream supervised learning tasks to achieve strong performance even with small amounts of labeled data.
We assumed that a pretrained language model, trained on large-scale unsupervised text, would be of great value for building a keyphrase extraction model for Chinese scientific literature applicable to multiple disciplines. We therefore decided to build the model on BERT-Base-Chinese and experimented with both the sequence labeling formulation and the span prediction formulation to find the optimal algorithm for the engine.

It is worth noting that Chinese has no delimiter, such as the space in English, to indicate word segmentation, so it is necessary to decide whether the character or the word is the minimal linguistic unit fed into the model. It has been shown that for Chinese keyphrase extraction, using characters as the smallest unit achieves better results [21]. In Chinese, however, the word is the smallest unit of semantics: although the character formulation avoids errors introduced by Chinese tokenizers, it also loses some semantics. To remedy this deficiency, we considered incorporating external features, including part-of-speech (POS) and lexicon features, into the model to add semantics and human knowledge indirectly.

We used the publicly available Chinese keyphrase extraction dataset CAKE [21] to determine the best algorithm. CAKE contains Chinese medical abstracts from CSCD in sequence labeling format, with 100,000 abstracts in the training set and 3,094 abstracts in the test set. On the CAKE training set we experimented with five models: BERT+SoftMax, BERT+POS+SoftMax, BERT+Lexicon+SoftMax, BERT+CRF, and BERT+Span. The first four use the sequence labeling formulation; the last uses span prediction. In short:

# The BERT+SoftMax model treats keyphrase extraction from Chinese scientific literature as a character-level sequence labeling task in which each token is annotated with the BIO tagging scheme. A SoftMax classification layer on top of the pretrained language model BERT outputs the probability of each category, and the parameters of BERT are fine-tuned on the CAKE training data.
# The BERT+POS+SoftMax model extends BERT+SoftMax by fusing a POS feature into BERT's embedding space to incorporate word semantics indirectly. The POS tags were generated by HanLP (https://github.com/hankcs/HanLP); details of the feature incorporation and model construction are given in [22].
# The BERT+Lexicon+SoftMax model extends BERT+SoftMax with a lexicon feature: we collected keyphrases from the CSCD keyphrase metadata fields restricted to the medical domain, encoded them with the BIO tagging scheme, and embedded the resulting feature into BERT to add domain knowledge and, to some extent, word boundary information.
# The BERT+CRF model places a Conditional Random Field (CRF) layer on top of BERT to capture sequential dependencies among labels. To learn a reasonable transition matrix, we used hierarchical learning rates: 5e-5 for the neural network layers of BERT and 0.01 for the CRF layer.
# The BERT+Span model defines keyphrase extraction as a span prediction problem: two binary classifiers are trained to determine whether each token is the start or the end position of a keyphrase.

In these experiments we were concerned with how many correct keyphrases can be identified from a given text, so we compared the keyphrases predicted by each model with the author-given keyphrases and computed precision, recall, and F1-score:

: Precision = c / r &nbsp;&nbsp;(1)
: Recall = c / s &nbsp;&nbsp;(2)
: F1-score = (2 × Precision × Recall) / (Precision + Recall) &nbsp;&nbsp;(3)

where c is the number of predicted keyphrases that match the author-given keyphrases, r is the total number of keyphrases predicted by the model, and s is the total number of author-given keyphrases.

Table 1 shows the performance of the five models on the CAKE test set.

{| class="wikitable"
|+ Table 1. Experimental Results of Keyphrase Extraction Models on the CAKE Test Set
! Model !! Precision !! Recall !! F1-score
|-
| BERT+SoftMax || 63.81% || 56.52% || 59.94%
|-
| BERT+POS+SoftMax || 64.83% || 57.44% || 60.91%
|-
| BERT+Lexicon+SoftMax || 68.06% || 60.67% || 64.15%
|-
| BERT+CRF || 64.87% || 59.15% || 61.88%
|-
| BERT+Span || 65.51% || 57.61% || 61.31%
|}

The best results were achieved by adding a SoftMax classification layer directly on top of BERT while simultaneously incorporating the lexicon feature, i.e., the BERT+Lexicon+SoftMax model. Without external features, the BERT+CRF and BERT+Span models achieved better results than BERT+SoftMax. We finally decided to use the BERT+Lexicon+SoftMax architecture to build the keyphrase extraction engine for Chinese scientific literature.
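The exact-match evaluation behind Eqs. (1)-(3) can be sketched in a few lines of Python. This is an illustrative helper of ours, not the paper's evaluation code; it computes the three indicators for one document from the predicted and author-given keyphrase lists.

```python
def keyphrase_prf(predicted, gold):
    """Exact-match precision, recall and F1 for one document.

    c = correct predictions, r = all predictions, s = author-given
    keyphrases, following Eqs. (1)-(3) in the text.
    """
    pred, gold = set(predicted), set(gold)
    c = len(pred & gold)
    r = len(pred)
    s = len(gold)
    precision = c / r if r else 0.0
    recall = c / s if s else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Note that, as the paper observes later for the all-domain test set, exact matching understates practical quality: a predicted keyphrase absent from the author-given list may still be a good keyphrase.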
====4.2 Construction of Application-Level Large-Scale Training Set====
We aimed to build a keyphrase extraction engine for Chinese scientific literature applicable to multidisciplinary fields and trained on large-scale data, while the CAKE dataset contains only 100,000 abstracts from the medical field, which cannot meet the demand for practical applications. We therefore constructed a large-scale dataset based on CSCD and evaluated its quality. The details of the training set generation are as follows.

To ensure that the constructed training set has high recall and annotates as many keyphrases as possible, we processed the title, abstract, and keyphrase fields of the Chinese Science Citation Database and selected the records in which all of the author-given keyphrases appear in the abstract. In total, 1,137,945 records satisfied this condition, containing 1,055,335 unique keyphrases. We used 1.1 million records to generate the training set and the remaining 37,945 records to generate the test set.

Based on the obtained titles, abstracts, and keyphrases, we concatenated each title with its corresponding abstract, separated by a period, and used the BIO tagging scheme to convert the concatenated text into sequence labeling format. Specifically, given the concatenated text and its keyphrases, we assigned the label "B" to the first token of each keyphrase occurrence, "I" to the other tokens of the keyphrase, and "O" to the tokens that do not belong to any keyphrase.
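The label assignment just described can be sketched as follows. `to_bio` is an illustrative helper of ours (not the paper's preprocessing code); it matches longer keyphrases first, so that when one keyphrase contains another, the longest one claims the characters, in the spirit of the longest-keyphrase rule used for the training set.

```python
def to_bio(text, keyphrases):
    """Character-level BIO labels for `text` given a list of keyphrases."""
    labels = ["O"] * len(text)
    # Longer keyphrases first, so nested matches keep the longest phrase.
    for kp in sorted(set(keyphrases), key=len, reverse=True):
        start = text.find(kp)
        while start != -1:
            span = range(start, start + len(kp))
            # Only label characters not already claimed by a longer phrase.
            if all(labels[i] == "O" for i in span):
                labels[start] = "B"
                for i in range(start + 1, start + len(kp)):
                    labels[i] = "I"
            start = text.find(kp, start + 1)
    return labels
```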
To ensure that the training set is of high quality and to avoid providing incorrect supervision signals for model training, we assessed its quality by comparing the author-given keyphrases with the automatically labeled keyphrases; the assessment results are shown in Table 2. Note that we used the same processing technique as Ding et al. [21], so the quality of the training set cannot reach 100%: if there is an inclusion relationship between two keyphrases, only the longest keyphrase is labeled, and if two keyphrases overlap, they are concatenated at the overlapping tokens.

{| class="wikitable"
|+ Table 2. Assessment Results of the Training Set
! Indicator !! Result
|-
| Precision || 99.38%
|-
| Recall || 97.56%
|-
| F1-score || 98.46%
|-
| Number of correct keyphrases identified || 4,447,454
|-
| Number of all keyphrases identified || 4,447,454
|-
| Number of author-given keyphrases || 4,558,596
|}

To ensure that the model can support large-scale applications in multidisciplinary domains, we counted the first-level discipline distribution of the training set according to the Chinese Library Classification (CLC); the statistics are shown in Table 3. Because some articles carry more than one CLC code, the totals exceed 1.1 million.

{| class="wikitable"
|+ Table 3. Statistics of the Discipline Distribution in the Training Set
! CLC !! Discipline !! Number of Abstracts
|-
| R || Medicine and health sciences || 421,879
|-
| T || Industrial technology || 386,649
|-
| S || Agricultural science || 142,866
|-
| O || Mathematics, physics and chemistry || 80,052
|-
| Q || Life sciences || 60,901
|-
| P || Astronomy and geoscience || 56,301
|-
| X || Environmental science || 54,712
|-
| F || Economics || 27,078
|-
| U || Transportation || 15,664
|-
| V || Aviation and aerospace || 13,956
|-
| G || Culture, science, education and sports || 7,848
|-
| N || Natural science || 3,565
|-
| C || Social sciences || 3,505
|-
| B || Philosophy and religions || 3,379
|-
| K || History and geography || 2,001
|-
| E || Military science || 1,059
|-
| D || Politics and law || 971
|-
| J || Art || 712
|-
| H || Languages and linguistics || 278
|-
| Z || General works || 48
|-
| I || Literature || 23
|-
| A || Marxism, Leninism, Maoism and Deng Xiaoping theory || 12
|}

====4.3 Model Adjustment and Optimization====
Based on the finalized BERT+Lexicon+SoftMax model, we fine-tuned the model on the 1.1 million BIO-format Chinese scientific records from multidisciplinary domains. The parameters used in training are shown in Table 4. Because of computational limitations the batch size was set to 7, and we assumed that one epoch was sufficient given the large-scale training set.

{| class="wikitable"
|+ Table 4. Parameter Configuration of the Proposed Approach
! Parameter !! Value
|-
| Batch size || 7
|-
| Epochs || 1
|-
| Optimizer || Adam
|-
| Learning rate scheduler || exponential decay
|-
| Initial learning rate || 5e-5
|-
| Max sequence length || 512
|}

Due to memory limitations, it was not feasible to load the entire dataset into memory, so we transformed the data into the format shown in Figure 1 and loaded it with a PyTorch DataLoader, which reads one record at a time through an iterator; the gradient of the model is computed once the amount of data reaches the batch size.

''Figure 1. Input Format to DataLoader''

The final model performance on our all-domain test set is shown in Table 5. Note that the practical keyphrase extraction results are better than these statistics suggest, because the indicators were computed under the exact-match principle, while some recognized keyphrases that are not among the author-given keyphrases still capture the main points of the text.

{| class="wikitable"
|+ Table 5. Keyphrase Extraction Model Performance on the All-Domain Test Set
! Indicator !! Result
|-
| Precision || 59.11%
|-
| Recall || 46.84%
|-
| F1-score || 52.26%
|-
| Number of correct keyphrases identified || 77,735
|-
| Number of all keyphrases identified || 131,517
|-
| Number of author-given keyphrases || 165,956
|}
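The streaming scheme described above can be sketched without any framework: records are pulled from an iterator one at a time and grouped into fixed-size batches, so the full BIO-format file never has to reside in memory. In the engine this role is played by a PyTorch DataLoader; the generator below is an illustrative stand-in of ours.

```python
def stream_batches(records, batch_size=7):
    """Lazily yield lists of `batch_size` records from any iterable.

    `records` can be, e.g., a file handle yielding one BIO-format line
    per record, so the whole dataset never resides in memory.
    """
    batch = []
    for rec in records:          # one record at a time, via an iterator
        batch.append(rec)
        if len(batch) == batch_size:
            yield batch          # a gradient step would be taken here
            batch = []
    if batch:                    # trailing, smaller batch
        yield batch
```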
To guide an iterator, and calculated the gradient of the model after the the word separation and avoid the professional terms to be amount of data reached to the batch size. The final model cut incorrectly, we introduced all the keyphrases in CSCD, performance on our all-domain test set are shown in Table totaling 2,606,322 (no duplicates), as Jiebaโs user-defined 5. Itโs worth noting that the practical keyphrase extraction lexicon. The phrases in the custom lexicons as well as nouns results are greater than the statistical indicators because we in the corpus were calculated for their inverse document used exact match principle to calculate the related indicators, frequency, and finally an IDF file was obtained for subsequent while there are some recognized keyphrases not included keyphrase extraction of Chinese scientific literature based in the author-given keyphrases but still indicate the main on the TF*IDF algorithm. point of the text. At the same time, in order to solve the problem that the keyphrases extracted by TF*IDF algorithm were often truncated, we designed a circular iterative splicing algorithm as improved TF*IDF algorithm. This algorithm spliced two- by-two keyphrases identified by TF*IDF algorithm and determined whether the spliced keyphrases still appeared in the original text. The iterative splicing was continued until no new keyphrases appeared. We combined the recognized keyphrases of ๐ต๐ธ๐ ๐ + ๐ฟ๐๐ฅ๐๐๐๐ + ๐๐ ๐ ๐ก๐๐๐ฅ model with that of the improved TF*IDF algorithm as the final keyphrase Figure 1. Input Format to DataLoader extraction results for Chinese scientific literature, and the specific process of the model is as follows. By observing the test results of the model during the For the given scientific abstract, use ๐ต๐ธ๐ ๐ + ๐ฟ๐๐ฅ๐๐๐๐ + practical application, we found that the model did not achieve ๐๐ ๐ ๐ก๐๐๐ฅ model to recognize keyphrases firstly. 
If the the expected prediction results for the data in the humanities number of the recognized keyphrases of ๐ต๐ธ๐ ๐ + ๐ฟ๐๐ฅ๐๐๐๐ + domain and could not capture the high-frequency words ๐๐ ๐ ๐ก๐๐๐ฅ was less than 4, the TF*IDF algorithm would appearing in the text. As shown in Table 3, the sample size be introduced as a complement. Otherwise, the keyphrase for the humanities domain was small, and apparently the extraction results of the ๐ต๐ธ๐ ๐ + ๐ฟ๐๐ฅ๐๐๐๐ + ๐๐ ๐ ๐ก๐๐๐ฅ model model did not capture enough features on the data from were returned directly. In the keyphrase extraction process of 2 Some articles have more than one CLC code, the statistics total is over 1.1 TF*IDF algorithm, the keyphrases were restricted to nouns million. or pronouns, etc. to get the top 10 keyphrases in TF*IDF 3 Noted that because of computational limitation, the batch size was set value. to 7 and we assumed that 1 epoch was enough because of the large-scale training set. 4 https://github.com/fxsjy/jieba 30 EEKE 2021 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents Figure 2. Variables in the Iteration Process of the Circular Iterative Splicing Algorithm We removed short keyphrases which were overlapped โๆถๆ่ฎพ่ฎก(Architecture Design)โ, โๆฐ็จ้ฃๆบ(Civilian Air- with other keyphrases, and the keyphrases whose length craft)โ. The keyphrases extracted by ๐ต๐ธ๐ ๐ + ๐ฟ๐๐ฅ๐๐๐๐ + were less than two. Then we used the circular iterative ๐๐ ๐ ๐ก๐๐๐ฅ model were less than four, so the TF*IDF algorithm splicing algorithm to splice the keyphrases identified by was trigger to get the top 10 keyphrases according to the TF*IDF two by two in two directions, splicing from the left TF*IDF value. After the preprocessing, there were eight and the right. 
If a spliced keyphrase still appears in the text, it is kept and the two original keyphrases are tagged as used keyphrases for deletion; otherwise, the keyphrases that were not successfully spliced are kept. This process is iterated until no new keyphrase appears in the original text. The keyphrases identified by the improved TF*IDF algorithm are then sorted in descending order of TF*IDF value.

The keyphrase extraction results of the BERT+Lexicon+SoftMax model and the improved TF*IDF algorithm are combined and ranked as the final results, with the keyphrases identified by the model given higher priority than those of the improved TF*IDF algorithm. Based on this principle, we merge the two result sets and keep the longest keyphrase whenever one keyphrase has an inclusion relationship with another. Finally, the top five keyphrases become the final keyphrases. In addition, we use some heuristic rules, such as removing keyphrases that end with special characters, to filter the final keyphrases and improve the accuracy of the keyphrase extraction model.

To further elaborate, consider an input abstract from an article on civil aircraft airworthiness (https://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CJFD&dbname=CJFDAUTODAY&filename=HKKX202103004&v=G8TESBUsSe2JeIClg6moqemy3ExscLTVMNxH885u%25mmd2BI%25mmd2Bl9p5i%25mmd2FUmcOUqnMUOyTZM5); the Chinese keyphrases are given here by their English glosses. The BERT+Lexicon+SoftMax model processes the input first and obtains the keyphrases "Airworthiness Safety", "Architecture Design", and "Civilian Aircraft". Since fewer than four keyphrases were extracted, the TF*IDF algorithm is triggered to get the top 10 keyphrases by TF*IDF value. After preprocessing, eight keyphrases extracted by TF*IDF remain: "Civilian Aircraft", "Architecture Design", "Telex Flight Control System", "Security Requirements", "Airworthiness Specifications", "Proof of Need", "Specific Embodiment", and "Safety Requirements" (the last two Chinese phrases are distinct but close in meaning). As can be seen, traditional TF*IDF recognizes some redundant keyphrases, such as "Specific Embodiment".

Next, the circular iterative splicing algorithm splices the keyphrases two by two; the changes of the variables during the iteration process are shown in Figure 2. In the first iteration, seven spliced keyphrases occur in the abstract, of which three are new keyphrases and four are original keyphrases (unused keyphrases) that did not splice with any other keyphrase. In the second iteration no new keyphrase arises, so the iteration finishes and all seven keyphrases are kept and returned as the results of the improved TF*IDF algorithm.

''Figure 2. Variables in the Iteration Process of the Circular Iterative Splicing Algorithm''

We then ranked the keyphrases generated by the improved TF*IDF algorithm and combined them with those of the BERT+Lexicon+SoftMax model, giving "Airworthiness Safety", "Architecture Design", "Civilian Aircraft", "Civilian Aircraft Telex Flight Control System", "Architecture Design of the Telex Flight Control System", "Security Requirements", "Airworthiness Specifications for Civil Aircraft", "Proof of Need", "Specific Embodiment", and "Safety Requirements". Finally, we removed the shorter keyphrases that have an inclusion relationship with others and obtained the ultimate top 5 keyphrases: "Airworthiness Safety", "Civilian Aircraft Telex Flight Control System", "Architecture Design of the Telex Flight Control System", "Security Requirements", and "Airworthiness Specifications for Civil Aircraft". The final keyphrase extraction results of the proposed hybrid model are better than those of the BERT+Lexicon+SoftMax model or the TF*IDF algorithm alone.
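The splicing step can be sketched compactly under our reading of the description above: candidate pairs are concatenated in both directions, a concatenation is kept only if it occurs verbatim in the abstract, the two constituents of a successful splice are dropped, and the process repeats until no new keyphrase appears. The function name is illustrative; the real engine additionally handles TF*IDF ranking, length filtering, and the merge with the model's results.

```python
def iterative_splice(keyphrases, text):
    """Circular iterative splicing sketch: return unused plus spliced keyphrases."""
    phrases = set(keyphrases)
    while True:
        new, used = set(), set()
        for a in phrases:
            for b in phrases:
                if a == b:
                    continue
                # Splice from the left and from the right.
                for spliced in (a + b, b + a):
                    if spliced in text and spliced not in phrases:
                        new.add(spliced)
                        used.update((a, b))
        if not new:                       # no new keyphrase appeared: stop
            return phrases
        phrases = (phrases - used) | new  # drop spliced constituents, keep the rest
```

Because Chinese text has no spaces between words, the plain concatenation test `spliced in text` matches the behavior described for the engine; the toy English examples below use space-free strings for the same reason.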
After the engine responds, it will the ultimate top 5 recognized keyphrases as โ้ ่ช ๅฎ ๅ จ return the keyphrase extraction results of all the abstracts ๆง(Airworthiness Safety)โ, โๆฐ็จ้ฃๆบ็ตไผ ้ฃๆง็ณป็ป(Civilian in the list in JSON format to achieve batch processing. the Aircraft Telemetry Flight Control System)โ, โ็ตไผ ้ฃๆง็ณป details of the POST API call are shown in Table 7. ็ปๆถๆ่ฎพ่ฎก(Architecture Design of the Telemetry Flight Control System)โ, โๅฎๅ จๆง้ๆฑ(Security Requirements)โ, โๆฐ 5 Engineering Implementation ็จ ้ฃ ๆบ ้ ่ช ่ง ่(Airworthiness Specifications for Civil In order to display the keyphrase extraction results intu- Aircraft)โ. It can be seen that the final keyphrase extraction itively and meet the demands for different users to call the results of our proposed hybrid model are better than that of engine, we currently provide three ways to call the keyphrase the ๐ต๐ธ๐ ๐ +๐ฟ๐๐ฅ๐๐๐๐ +๐๐ ๐ ๐ก๐๐๐ฅ model and the TF*IDF model. extraction API: browser online demo, Python code access and client access. The calling flow of the keyphrase extraction 4.4 API Design engine for Chinese scientific literature is shown in Figure 3. In order to avoid various hardware and software constraints that may be encountered in the local deployment of the 5.1 Browser Online Demo model, and to provide a fast and convenient way for re- searchers to invoke the keyphrase extraction model, we deployed the keyphrase extraction model as a service, and built a keyphrase extraction engine for Chinese scientific literature through API calls. Researchers can call the API of the engine in two ways, POST and GET, to achieve automatic keyphrase extraction of Chinese scientific literature. Pass in the abstract of Chinese scientific literature and the verification code, and the engine would return the keyphrase extraction results in JSON format. 
For the GET method, users can send an request to the URL: http://sciengine.las.ac.cn/keywords_extraction_cn to call the keyphrase extraction engine, passing in the abstract of a Chinese scientific literature abstract and the verification code. When the engine receives the call, it will respond by returning the keyphrase extraction results in JSON format. Figure 4. Browser Online Demo Interface Details of the GET API call are shown in Table VI. For the POST method, users can send an request to the Users can visit the URL: http://sciengine.las.ac.cn/Keywords_ URL: http://sciengine.las.ac.cn/keywords_extraction_cn to BIO_Lexi to test the keyphrase extraction engine online. 32 EEKE 2021 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents Table 6. GET API Call Details Format Example Request URL /keywords_extraction_cn http://sciengine.las.ac.cn/keywords_extraction_cn Request Parameters "data": abstract of Chinese scientific {"data": "ๆฐ่พ ๅฉๆฒป็่ๆฏไธ่ฐ่ บ็ๆฉๅคงๅ้คๆฏ literature, "token": Verification Code ็ๅบ็จไปทๅผใ่ฐ่ บ็ๆถๆง็จๅบฆ้ซ,้ขๅ่พๅทฎ,ๆฒป็ ๆๆไปไธ็ๆณ...(The value of extended resection of pancreatic cancer in the context of neoadjuvant therapy. Pancreatic cancer is highly malignant, with poor prognosis and still unsatisfactory treatment results...)", "token":99999} Browser parameter /Keywords_BIO_Lexi?data= http://sciengine.las.ac.cn/keywords_extraction_ &token= cn?data=ๆฐ่พ ๅฉๆฒป็่ๆฏไธ่ฐ่ บ็ๆฉๅคงๅ้คๆฏ็ ๅบ็จไปทๅผใ่ฐ่ บ็ๆถๆง็จๅบฆ้ซ,้ขๅ่พๅทฎ,ๆฒป็ๆ ๆไปไธ็ๆณ...(The value of extended resection of pancreatic cancer in the context of neoadjuvant therapy. Pancreatic cancer is highly malignant, with poor prognosis and still unsatisfactory treatment results....)&token=99999 Success message "keywords": [keyphrases list] {"keywords":["่ฐ่ บ็(pancreatic cancer)", "ๆฐ่พ ๅฉ ๆฒป็(neoadjuvant therapy)", "ๆฉๅคงๅ้คๆฏ(extended resection)" ] } Error message "info":error message {"info": "Server not available!"}, {"info": "Token incorrect!"} Table 7. 
Table 7. POST API Call Details

Request URL
  Format:  /keywords_extraction_cn
  Example: http://sciengine.las.ac.cn/keywords_extraction_cn

Request Parameters
  Format:  "data": list of abstracts of Chinese scientific papers; "token": verification code
  Example: {"data": ["The value of extended resection of pancreatic cancer in the context of neoadjuvant therapy. Pancreatic cancer is highly malignant, with poor prognosis and still unsatisfactory treatment results...", "Meta-analysis of Shenqi Dihuang Decoction combined with ACEI/ARB drugs in the treatment of diabetic nephropathy..."], "token": 99999}

Success message
  Format:  abstract ID: [keyphrases list]
  Example: {0: ["pancreatic cancer", "neoadjuvant therapy", "extended resection"], 1: ["diabetic nephropathy", "ACEI/ARB", "Meta-analysis", "Shenqi Dihuang Decoction"]}

Error message
  Format:  "info": error message
  Example: {"info": "Server not available!"}, {"info": "Token incorrect!"}

Type the abstract of a Chinese scientific paper in the input box (it is recommended to use the title + "。" + the abstract as input) and click the keyphrase extraction button; the engine will be called automatically to invoke the underlying model, and four to five keyphrases related to the main idea of the text will be returned. The response time of the engine is generally within 3 seconds, and the interface of the browser online demo is shown in Figure 4.

5.2 Python Code Access

Technical staff who are familiar with the Python programming language can download the corresponding sample code from the website http://sciengine.las.ac.cn/Scripts and revise the file paths for convenient usage. There are four files: keyphrase_extraction_cn_get.py, the sample code for calling the API of the keyphrase extraction engine using the GET method; keyphrase_extraction_cn_post.py, the sample code for calling the API using the POST method; input_cn.txt, the sample input file; and readme.txt, the description file.

When using the GET method to call the API, enter the verification code and the Chinese abstract to be processed at the corresponding locations in the code. Then run keyphrase_extraction_cn_get.py, and the automatic keyphrase extraction results will be printed directly.

When using the POST method to call the API, open keyphrase_extraction_cn_post.py with a Python editor and enter the verification code at the corresponding location in the code. Set the paths of the input file and the output file, where the format of the input file is one abstract per line. Run keyphrase_extraction_cn_post.py, and the program will read the input file and write the keyphrase extraction results to the output file.

5.3 Client Access

In order to serve non-technical personnel, we designed a client that realizes the keyphrase extraction service for Chinese scientific literature without writing a single line of code. Users can download and install the client from the website http://sciengine.las.ac.cn/Client and use the verification code as the login credential to call the keyphrase extraction engine API, achieving automatic keyphrase extraction of Chinese scientific literature. The keyphrase extraction engine client interface is shown in Figure 5, and the specific operation process is as follows.

1. After opening the client and entering the verification code, click the "Keyphrase Extraction for Chinese Scientific Literature" button in the menu bar to enter the keyphrase extraction interface. Click the "Browse" button to import the file to be processed; the import is successful if the data presentation box shows the imported data and the message box shows the total number of records.
2. Click the "Start Extraction" button; the client will automatically carry out keyphrase extraction for Chinese scientific literature and display the real-time processing progress.
3. When the extraction is finished, the client will pop up a completion window and automatically show the output file path.
4. Click the "Open" button to view the output file.

Figure 5. Client Interface

6 Conclusions

In this paper, we make full use of the large-scale training corpus of the Chinese Science Citation Database and the pretrained language model BERT to construct a keyphrase extraction engine for Chinese scientific literature. We incorporate lexicon features into the high-dimensional vector space of BERT, fusing human knowledge to instruct the model training. To support practical applications in multidisciplinary fields, the TF*IDF algorithm is introduced as a complement to better capture the high-frequency words appearing in the text. We deploy the engine as a service that can be invoked through the API, with a response time generally within 3 seconds. We also provide example scripts in Python for technical staff and a visualization client that lets non-technical personnel use the engine without writing a line of code. We hope that our keyphrase extraction engine can provide a feasible path for researchers to improve their efficiency.

7 ACKNOWLEDGMENTS

The work is supported by the project "Artificial Intelligence (AI) Engine Construction Based on Scientific Literature Knowledge" (Grant No. E0290906) and the project "Key Technology Optimization Integration and System Development of Next Generation Open Knowledge Service Platform" (Grant No. 2021XM45).

References

[1] Peter D. Turney. Learning algorithms for keyphrase extraction. Information Retrieval, 2(4):303–336, 2000.
[2] Steve Jones and Mark S. Staveley. Phrasier: a system for interactive document retrieval using keyphrases. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 160–167, 1999.
[3] Yongzheng Zhang, Nur Zincir-Heywood, and Evangelos Milios. World wide web site summarization. Web Intelligence and Agent Systems: An International Journal, 2(1):39–53, 2004.
[4] Anette Hulth and Beáta Megyesi. A study on automatically extracted keywords in text categorization. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 537–544, 2006.
[5] Gábor Berend. Opinion expression mining by exploiting keyphrase extraction. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 1162–1170. Asian Federation of Natural Language Processing, 2011.
[6] Yi-fang Brook Wu, Quanzhi Li, Razvan Stefan Bot, and Xin Chen. Domain-specific keyphrase extraction. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pages 283–284, 2005.
[7] Chengzhi Zhang. Automatic keyword extraction from documents using conditional random fields. Journal of Computational Information Systems, 4(3):1169–1180, 2008.
[8] Anette Hulth. Improved automatic keyword extraction given more linguistic knowledge. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 216–223, 2003.
[9] Gerard Salton, Chung-Shu Yang, and Clement T. Yu. A theory of term importance in automatic text analysis. Journal of the American Society for Information Science, 26(1):33–44, 1975.
[10] Lance A. Ramshaw and Mitchell P. Marcus. Text chunking using transformation-based learning. In Natural Language Processing Using Very Large Corpora, pages 157–176. Springer, 1999.
[11] Lev Ratinov and Dan Roth. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 147–155, 2009.
[12] Qi Zhang, Yang Wang, Yeyun Gong, and Xuan-Jing Huang. Keyphrase extraction using deep recurrent neural networks on Twitter. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 836–845, 2016.
[13] Dhruva Sahrawat, Debanjan Mahata, Mayank Kulkarni, Haimin Zhang, Rakesh Gosangi, Amanda Stent, Agniv Sharma, Yaman Kumar, Rajiv Ratn Shah, and Roger Zimmermann. Keyphrase extraction from scholarly articles as sequence labeling using contextualized embeddings. arXiv preprint arXiv:1910.08840, 2019.
[14] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
[15] Funan Mu, Zhenting Yu, LiFeng Wang, Yequan Wang, Qingyu Yin, Yibo Sun, Liqun Liu, Teng Ma, Jing Tang, and Xing Zhou. Keyphrase extraction with span-based feature representations. arXiv preprint arXiv:2002.05407, 2020.
[16] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[17] Iz Beltagy, Kyle Lo, and Arman Cohan. SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676, 2019.
[18] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240, 2020.
[19] Tianyu Liu, Jin-Ge Yao, and Chin-Yew Lin. Towards improving neural named entity recognition with gazetteers. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5301–5307, 2019.
[20] Xiangyang Li, Huan Zhang, and Xiao-Hua Zhou. Chinese clinical named entity recognition with variant neural structures based on BERT methods. Journal of Biomedical Informatics, 107:103422, 2020.
[21] Liangping Ding, Zhixiong Zhang, Huan Liu, Jie Li, and Gaihong Yu. Automatic keyphrase extraction from scientific Chinese medical abstracts based on character-level sequence labeling. Journal of Data and Information Science, 6(3):33–57, 2020.
[22] Liangping Ding, Zhixiong Zhang, and Yang Zhao. BERT-based Chinese medical keyphrase extraction model enhanced with external features. In International Conference on Asia-Pacific Digital Libraries, 2021.