Research on Fine-grained S&T Entity Identification with Contextual Semantics in Think-Tank Text

Mengge Sun 1,2, Yanpeng Wang 1,2,∗ and Yang Zhao 1,2
1 National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2 Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China


Abstract
Automatically extracting fine-grained S&T problems from think-tank reports written by numerous experts has become one of the effective ways to perceive global trends in S&T development. We transform the automatic identification of fine-grained S&T problems into a multi-category S&T entity extraction task with contextual semantics. To address the shortage of high-quality datasets and fully exploit the potential of LLMs, we use LLMs as annotators and place them in an active learning loop that determines which samples to annotate most efficiently. During this cyclic annotation process, we simultaneously train the target entity extraction model, RoBERTa-BiLSTM-CRF. The model ultimately achieves an F1 value of 86.02% on our task, and its effectiveness and reliability are verified through comparison experiments against benchmark models. This study alleviates, to some extent, the dependency on manually annotated datasets, while providing high-quality data support and an effective modeling method for mining and analyzing fine-grained S&T problems.

Keywords
S&T entity with contextual semantics, LLM annotators, active learning, RoBERTa-BiLSTM-CRF



1. Introduction

A think tank is composed of multidisciplinary experts in a country and gathers national intellectual resources; it is an important force for influencing government decision-making and promoting social development. Think-tank reports usually focus on major issues of great concern to national governments or the public, serve as indicators and weather vanes of national policy and scientific research, and have high intelligence value. Therefore, automatically extracting the scientific and technological problems mentioned in think-tank reports can clarify policy and public concerns efficiently and objectively. This paper defines "fine-grained S&T problems" as "research directions or problems constrained by conditions such as application scenarios, technological solutions, and technological routes", and further treats them as "S&T entities with contextual semantics".

Most prior extraction of S&T problem representations has relied on manual annotation, rule-based matching, machine learning, hybrid models, or deep learning. H. Chu and Q. Ke [1] used manual annotation to analyze the distribution of methods across academic journals; such expert annotation is highly accurate but costly and time-consuming. S. Gupta and C. D. Manning [2] designed matching rules for identifying research problems, for example matching on the word "applied", and then used Bootstrapping to find new rule templates from the newly matched vocabulary. K. Heffernan and S. Teufel [3] treated scientific method identification as a classification task, applying algorithms such as support vector machines, Naive Bayes, and logistic regression, and introducing features such as N-grams, sentiment polarity, part of speech, negation, and discourse information to improve performance. SemEval-2018 Task 7 [4] likewise addressed the extraction of various entity types from academic papers; in that task, many teams used convolutional neural networks and Long Short-Term Memory networks to outperform traditional machine learning methods (such as SVM), demonstrating the usefulness of deep learning models. Among deep learning methods, Xuesi Li et al. [5] designed a sentence classification model based on a BERT-CNN architecture and automatically identified research-problem sentences in scientific papers with an F1 value of 94.8%. Z. Zhong and D. Chen [6] compared two pre-trained language models, BERT and SciBERT, on relation extraction from academic papers and found that SciBERT performed better.

Joint Workshop of the 5th Extraction and Evaluation of Knowledge Entities from Scientific Documents and the 4th AI + Informetrics (EEKE-AII2024), April 23-24, 2024, Changchun, China and Online
∗ Corresponding author: wangyanpeng@mail.las.ac.cn (Y. Wang)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Since 2020, large language models (LLMs) have exhibited remarkable few-shot performance on information extraction tasks given only a few demonstrations and well-designed prompts. Under the prevalent "Language-Model-as-a-Service" setting (Sun et al. 2022), users are required to feed in their own data, potentially including sensitive or private information, which increases the risk of data leakage. To exploit the abundant unlabeled corpus, an alternative is to employ LLMs as annotators that generate labels in a zero-shot or few-shot manner.

In this paper, we subdivide S&T entities into multiple fine-grained categories. Depending on the type of scientific solution sought, they can be divided into identification and judgment of the research object, and the inherent mechanisms and laws under study. Correspondingly, the research objects include "technological methods", "system devices", "scientific experiments", "scientific materials", and "database names"; examples include "cell-based cancer immunotherapy and gene therapy", "ferrosilicon alloy latent heat photovoltaic cells", "deep underground neutrino experiments", and "two-dimensional materials for future heterogeneous electronic devices". The underlying mechanisms of things are exemplified by "the principle of evolution controlled from top to bottom".

2. Data and Methods

2.1. Data

The selected data source is high-quality strategic dynamic briefing data monitored and compiled by various departments of the Chinese Academy of Sciences and the State Council, available on the agencies' websites¹. The data source includes: (1) trends in top scientific journals, showcasing the latest research achievements in disciplines such as physics, Earth science, and biology; (2) the latest strategic deployments of various countries in the field of S&T, representing the direction of national S&T development.

These contents can to some extent represent the will of the country and of scientists [7]. In total, we crawled all the information from the three sites from 2018 to 2023: 42,984 reports, with an average of about 12 sentences per report.

¹ http://www.casisd.cn/zkcg/ydkb/kjqykb/
  https://news.sciencenet.cn/AInews/newlist.aspx?
  http://www.globaltechmap.com/document/index

2.2. Main Framework

Based on the above dataset, the work of this paper comprises three parts: initial annotation based on syntactic rules, active annotation based on an LLM, and training an extraction model during the active learning process, as shown in Figure 1.

[Figure 1: Main research framework]

2.2.1. Initial annotation based on syntactic rules

For the initial annotation based on syntactic rules, we mainly use a rule-based extraction method as a cold start, combined with manual correction, to obtain a small amount of high-quality contextual S&T entities. So far, 162 lexical and syntactic rules have been defined. Combined with the dependency parsing function of a pretrained HanLP model, candidate S&T entity phrases with contextual semantics are obtained.
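As an illustration of this cold-start step, the sketch below pairs a single dependency-based rule with a pretrained HanLP multi-task model. It is a minimal example under stated assumptions: the model constant is one of HanLP 2.x's published checkpoints, and the "noun plus contiguous left modifiers" rule with its relation labels merely stands in for the paper's 162 rules, which are not reproduced here.

```python
import hanlp

# Assumed HanLP 2.x multi-task checkpoint (tokenization, POS, dependency parsing).
parser = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_SMALL_ZH)

def candidate_phrases(sentence):
    """One toy rule: a noun anchor plus its contiguous left-side modifiers."""
    doc = parser([sentence], tasks=['tok/fine', 'pos/ctb', 'dep'])
    tokens = doc['tok/fine'][0]
    pos = doc['pos/ctb'][0]
    deps = doc['dep'][0]          # per token: (1-based head index, relation)
    candidates = []
    for i, tag in enumerate(pos):
        if not tag.startswith('NN'):   # anchor on common/proper nouns (CTB tags)
            continue
        start = i
        for j in range(i - 1, -1, -1):
            head, rel = deps[j]
            # modifier labels depend on the model's label set; these are typical
            if head - 1 == i and rel in ('nn', 'amod', 'compound', 'nmod'):
                start = j
            else:
                break
        candidates.append(''.join(tokens[start:i + 1]))  # Chinese: join w/o spaces
    return candidates
```

Candidates produced this way are then manually corrected to form the small seed set of contextual S&T entities.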
2.2.2. Extraction model based on LLM active annotation

In the LLM-based active annotation part, the main goal is to gradually screen a small set of samples to annotate out of a large amount of unlabeled data, using a large language model as the annotation model. At the same time, the S&T entity extraction model RoBERTa-BiLSTM-CRF is trained.

1) Optimizing the LLM as a better annotator. Literature research shows that current GPT-series models are highly sensitive to prompt wording: when different annotators use different prompt formulations, GPT's responses differ significantly, so the robustness of the model on NLP tasks is relatively weak [8]. Previous studies also show that, depending on the design of task-specific prompts, performance can vary between near state-of-the-art and random guessing [9]. Therefore, finding the best prompts for given tasks and given data points is critical.

This paper adopts a Chain-of-Thought (CoT) prompting strategy, which gradually generates label sequences that meet expectations by imposing conditions at each step. Guided by the CoT approach, we transform the task into multi-round Q&A, enabling the GPT model to progressively locate the fine-grained categories of S&T entities contained in the text through conversation and finally annotate them. Specifically, this section focuses on the construction of prompts for the different categories of S&T problems, as shown in Figure 2.

[Figure 2: Flowchart of GPT annotation under CoT]
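To make the multi-round Q&A concrete, the sketch below drives a staged dialogue through OpenAI's chat-completions API. The four stage questions, the tagging format, and the gpt-3.5-turbo model name are illustrative assumptions (GPT-3.5 is the model evaluated in Section 3.2); the paper's actual prompt wording is shown only schematically in Figure 2.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative stage questions, not the paper's exact PROMPT text.
STAGES = [
    "Does the following text describe a science or technology problem? "
    "Answer yes or no.\nText: {text}",
    "Which fine-grained category does it involve: technological method, "
    "system device, scientific experiment, scientific material, or database name?",
    "List the exact entity phrases in the text that belong to that category.",
    "Return the sentence with every extracted phrase wrapped in [ENT]...[/ENT] tags.",
]

def annotate(text, model="gpt-3.5-turbo"):
    """Run the staged dialogue; each answer stays in context for the next round."""
    messages, answer = [], ""
    for i, stage in enumerate(STAGES):
        messages.append({"role": "user",
                         "content": stage.format(text=text) if i == 0 else stage})
        answer = client.chat.completions.create(
            model=model, messages=messages, temperature=0
        ).choices[0].message.content
        messages.append({"role": "assistant", "content": answer})
    return answer
```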
2) Active data acquisition. Active learning (AL) seeks to reduce labeling effort by strategically choosing which examples to annotate. We consider the standard pool-based setting, assuming that a large pool of unlabeled data D_pool is available. The AL loop starts with a seed labeled set D_labeled. At each iteration, we train a model M on D_labeled and then use an acquisition function f(·, M) to acquire a batch B of b examples from D_pool. We then query the LLM annotator to label B. The labeled batch is removed from D_pool and added to D_labeled, where it serves as training data for the next iteration. The process is repeated m times.

Active acquisition strategies generally maximize either uncertainty or diversity. On the one hand, uncertainty-based methods (such as Maximum Entropy and Least Confidence) leverage model predictions to select hard examples; on the other hand, diversity-based methods (such as K-Means) exploit the heterogeneity of the sampled data.
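The loop just described can be written down compactly. In the schematic sketch below, train, acquire, and llm_annotate are placeholders for the extraction model trainer, the acquisition function f(·, M), and the GPT annotator; the defaults of 100 samples per batch and 10 iterations follow the setup reported in Section 3.3.

```python
def active_learning_loop(d_pool, d_labeled, train, acquire, llm_annotate,
                         batch_size=100, iterations=10):
    """Pool-based active learning with an LLM annotator (schematic).

    train(d_labeled) -> model             fit the extraction model
    acquire(model, d_pool, k) -> batch    f(., M): pick k informative samples
    llm_annotate(batch) -> labeled batch  query the LLM for labels
    """
    model = None
    for _ in range(iterations):                          # repeat m times
        model = train(d_labeled)                         # retrain on current labels
        batch = acquire(model, d_pool, batch_size)       # acquire b hard examples
        d_pool = [x for x in d_pool if x not in batch]   # remove batch from pool
        d_labeled = d_labeled + llm_annotate(batch)      # grow the labeled set
    return model, d_labeled
```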
2.2.3. Extraction model based on RoBERTa-BiLSTM-CRF

The target model is trained on the labeled data obtained so far, and the data to annotate in the next iteration are selected through the acquisition function mechanism. The target model uses the Chinese RoBERTa-WWM [10] model as the embedding layer, with a BiLSTM and a CRF as the label sequence prediction layers, to obtain the label sequence of S&T entities and complete the automatic extraction of fine-grained S&T problems.
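A minimal PyTorch sketch of this stack follows. The HuggingFace checkpoint name hfl/chinese-roberta-wwm-ext and the third-party pytorch-crf package are our assumptions for the RoBERTa-WWM embeddings [10] and the CRF layer; the hidden size is illustrative rather than the paper's setting.

```python
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF  # pip install pytorch-crf

class RobertaBiLstmCrf(nn.Module):
    def __init__(self, num_labels, lstm_hidden=256):
        super().__init__()
        # Chinese RoBERTa-WWM embeddings; checkpoint name is an assumption
        self.encoder = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext")
        self.lstm = nn.LSTM(self.encoder.config.hidden_size, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_labels)  # per-char BIO logits
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        hidden, _ = self.lstm(hidden)
        emissions = self.classifier(hidden)
        mask = attention_mask.bool()
        if labels is not None:   # training: negative CRF log-likelihood as the loss
            return -self.crf(emissions, labels, mask=mask, reduction='mean')
        return self.crf.decode(emissions, mask=mask)  # inference: best BIO paths
```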
Finally, the model results are evaluated based on a soft matching strategy.
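The paper does not spell out its soft matching criterion. One common reading, assumed in the sketch below, is to credit a predicted span that overlaps a not-yet-matched gold span of the same category, instead of requiring exact boundary agreement.

```python
def soft_match_f1(gold_spans, pred_spans):
    """Overlap-based F1 for entity extraction (one plausible 'soft match').

    gold_spans / pred_spans: per-sentence lists of (category, start, end)
    triples with half-open [start, end) character offsets.
    """
    tp = fp = fn = 0
    for gold, pred in zip(gold_spans, pred_spans):
        unmatched = set(gold)
        for cat, start, end in pred:
            hit = next((g for g in unmatched
                        if g[0] == cat and g[1] < end and start < g[2]), None)
            if hit is not None:
                unmatched.discard(hit)   # each gold span is matched at most once
                tp += 1
            else:
                fp += 1
        fn += len(unmatched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```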




3. Results and discussion

3.1. Analysis of data annotation results

Initial supervised data based on rule annotation. In the rule-based data annotation, the extraction performance was: precision 0.36, recall 0.82, F1 0.50. In other words, most of the S&T phrases produced by the rule-based annotation method are not actually within the S&T categories, and their S&T character cannot be judged accurately. Therefore, an AI model that can deeply understand and analyze semantics is needed for annotation and extraction.

3.2. Analysis of LLM annotation results

First, we randomly selected 20 texts from the annotated dataset as a test set to determine the number of examples for the few-shot strategy. The results showed that 5-shot had the highest accuracy, at 76.3%. The test shows that the more relevant and semantically similar the given examples are to the test text, the better GPT-3.5's annotation performance. In the 1-shot scenario, GPT-3.5 is sensitive and unstable with respect to the particular example given; overall, the 5-shot prompt performs better because combining multiple random examples reduces the impact of noise.
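Since similarity of demonstrations drives few-shot quality, one natural way to assemble the five shots is to retrieve the labeled examples closest to the input in embedding space. The sketch below assumes the sentence-transformers library and a generic multilingual encoder; the paper does not specify how its examples were chosen, so this is an illustration rather than its procedure.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical retriever for few-shot demonstrations; encoder choice is ours.
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def pick_demonstrations(query, labeled_pool, k=5):
    """Return the k labeled examples most semantically similar to the query."""
    texts = [ex["text"] for ex in labeled_pool]
    vecs = encoder.encode(texts + [query], normalize_embeddings=True)
    sims = vecs[:-1] @ vecs[-1]      # cosine similarity via normalized dot products
    top = np.argsort(-sims)[:k]      # indices of the k nearest examples
    return [labeled_pool[i] for i in top]
```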
After determining the number of examples for the few-shot strategy, we conducted multiple tests to select the most effective example for each stage of the prompt. The performance of each prompt stage in the final iteration is shown in Table 1.

Table 1
Performance of the large model at each prompt stage in the final iteration

  Prompt stage    Precision (%)    Recall (%)
  Stage I             100.0            -
  Stage II             90.87           -
  Stage III            71.20         88.41
  Stage IV             92.01           -

(1) In terms of category judgment, GPT's performance is almost perfect. That is to say, for classification tasks with more familiar query semantics and more obvious semantic differences, the GPT model performs better.

(2) In terms of information extraction, GPT has lower precision and higher recall. The extracted S&T entities are mainly noun phrases and are often incomplete; for example, only part of "natural language processing algorithms will be used to study the principle of virus gene mutation" is extracted.

Finally, after multiple rounds of annotation and manual proofreading, a total of 19,745 sentences formed the supervised training dataset.
3.3. Analysis of model extraction effect

Dataset. We chose 2,680 fine-grained S&T entity samples from the initially annotated data as the seed labeled set D_labeled, used the whole 19,745 sentences as D_pool, and randomly acquired 100 samples per batch for 10 iterations, generating 9,921 annotated samples in D_labeled in total.

Baselines. We compare RoBERTa-BiLSTM-CRF with the following baselines: (1) in-context learning (PROMPTING), which lets the LLM perform few-shot inference without fine-tuning; (2) SUPERVISED (BERT-BiLSTM-CRF), a supervised model trained on the whole clean-labeled dataset D_labeled.
Accelerating with Active Learning. The last layer of the extraction model is the CRF, whose output is a probability score for the BIO label of each character. We use this probability score as the confidence score and feed it into two uncertainty-based active learning strategies. The results show that the Maximum Entropy strategy makes the extraction model more efficient and more capable. The results of the S&T entity extraction task are shown in Table 2.

Table 2
Model comparison experiment results

  Method              Precision (%)    Recall (%)    F1 (%)
  PROMPTING               67.72           76.72       67.72
  BERT-BiLSTM-CRF         70.54           75.66       73.00
  Our model               82.20           90.23       86.02
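As one concrete realization of the Maximum Entropy strategy, the sketch below scores each unlabeled sentence by the mean entropy of its per-character BIO label distributions and acquires the highest-scoring ones. Reducing the CRF's confidence to per-token probability vectors in this way is our simplification of the scheme described above.

```python
import numpy as np

def entropy_scores(prob_seqs):
    """prob_seqs: one (seq_len, num_labels) array of per-character BIO label
    probabilities per sentence. Returns a mean-token-entropy score for each."""
    scores = []
    for probs in prob_seqs:
        token_entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
        scores.append(float(token_entropy.mean()))   # sentence-level uncertainty
    return scores

def acquire_max_entropy(d_pool, prob_seqs, k=100):
    """Pick the k most uncertain sentences from the pool."""
    order = np.argsort(entropy_scores(prob_seqs))[::-1]   # descending entropy
    return [d_pool[i] for i in order[:k]]
```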
4. Conclusions

Automatically extracting S&T entities with contextual semantics from think-tank reports can capture key research development directions more efficiently. In this paper, GPT is used as the teacher model and RoBERTa-BiLSTM-CRF as the student model; through active learning, the training data generated by GPT are used to fine-tune the local extraction model, forming a feasible framework for fine-grained S&T entity recognition.

The main limitation of our study is that it focuses on demonstrating the effectiveness of the method; the accuracy required by practical engineering applications has not yet been reached, and the model will continue to be optimized in future work.

References

[1] H. Chu, Q. Ke, Research methods: What's in the name?, Library & Information Science Research 39 (2017) 284–294.
[2] S. Gupta, C. D. Manning, Analyzing the dynamics of research by extracting key aspects of scientific papers, in: Proceedings of 5th International Joint Conference on Natural Language Processing, 2011, pp. 1–9.
[3] K. Heffernan, S. Teufel, Identifying problems and solutions in scientific text, Scientometrics 116 (2018) 1367–1382.
[4] D. Buscaldi, A.-K. Schumann, B. Qasemizadeh, H. Zargayouna, T. Charnois, SemEval-2018 task 7: Semantic relation extraction and classification in scientific papers, in: International Workshop on Semantic Evaluation (SemEval-2018), 2018, pp. 679–688.
[5] Xuesi Li, Zhixiong Zhang, Y. L., Y. W., Research on problem sentence recognition methods in scientific literature research, Library and Information Service 67 (2023) 132–140.
[6] Z. Zhong, D. Chen, A frustratingly easy approach for entity and relation extraction, arXiv preprint arXiv:2010.12812 (2020).
[7] Yanpeng Wang, Xuezhao Wang, X. C., Y. L., X. L., Analysis of key technologies and initiatives of the fourth industrial revolution based on science and technology policy and frontier dynamics, Journal of the China Society for Scientific and Technical Information 41 (2022) 29–37.
[8] J. Gao, H. Zhao, C. Yu, R. Xu, Exploring the feasibility of ChatGPT for event extraction, arXiv preprint arXiv:2303.03836 (2023).
[9] T. Gao, A. Fisch, D. Chen, Making pre-trained language models better few-shot learners, arXiv preprint arXiv:2012.15723 (2020).
[10] Y. Cui, W. Che, T. Liu, B. Qin, Z. Yang, Pre-training with whole word masking for Chinese BERT, IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021) 3504–3514.