<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Research on Fine-grained S&amp;T Entity Identification with Contextual Semantics in Think-Tank Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mengge Sun</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yanpeng Wang</string-name>
          <email>wangyanpeng@mail.las.ac.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yang Zhao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences Beijing</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Science Library, Chinese Academy of science Beijing 100190</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Automatically extracting fine-grained S&amp;T problems from think-tank reports written by numerous experts has become one of the effective ways to perceive the global trend of S&amp;T development. We transform the automatic identification task for fine-grained S&amp;T problems into a multi-category S&amp;T entity extraction task with contextual semantics. To address the shortage of high-quality datasets and fully exploit the potential of LLMs, we take LLMs as annotators and put them into an active learning loop to determine efficiently which samples to annotate. During the cyclic data annotation process, we simultaneously train the target entity extraction model, "RoBERTa-BiLSTM-CRF". Finally, the model achieved an F1 value of 86.02% on our task. The effectiveness and reliability of the model were verified experimentally by comparing it with the benchmark model. To some extent, this study reduces the dependence on manually annotated datasets, while providing high-quality data support and effective models and methods for mining and analyzing fine-grained S&amp;T problems.</p>
      </abstract>
      <kwd-group>
        <kwd>S&amp;T entity with contextual semantics</kwd>
        <kwd>LLM annotators</kwd>
        <kwd>active learning</kwd>
        <kwd>RoBERTa-BiLSTM-CRF</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>A think tank is composed of multidisciplinary experts in a country and gathers national intellectual resources; it is an important force that influences government decision-making and promotes social development. Think tank reports usually focus on major issues of great concern to the national government or the public. They serve as indicators and weather vanes of national policies and scientific research, and have high intelligence value. Therefore, the automatic extraction of scientific and technological problems mentioned in think tank reports can further clarify policy and public concerns efficiently and objectively. This paper defines "fine-grained S&amp;T problems" as "research directions or problems with limited conditions such as application scenarios, technological solutions, and technological routes", and further analogizes them as "S&amp;T entities with contextual semantics".</p>
      <sec id="sec-2-1">
        <p>Researchers have extracted S&amp;T problem representations with several kinds of methods: manual annotation, rule-based matching, machine learning-based methods, hybrid model-based methods, and deep learning-based methods. H. Chu and Q. Ke [<xref ref-type="bibr" rid="ref1">1</xref>] used manual annotation to analyze the distribution of methods in different academic journals. Such expert annotation methods are relatively accurate, but costly [...] speech into the algorithm to enhance its performance. [...] Long Short-Term Memory networks to achieve perfor[...] methods (such as SVM), which also proved the usefulness of [...] methods, Xuesi Li et al. [<xref ref-type="bibr" rid="ref5">5</xref>] designed a sentence classification model based on the BERT-CNN architecture and automatically identified research issue sentences in scientific papers with an F1 value of 94.8%. Z. Zhong and D. Chen [<xref ref-type="bibr" rid="ref6">6</xref>] compared the performance of BERT and SciBERT, two pre-trained language models, in the extraction of relations in academic papers, and found that SciBERT performed better than BERT.</p>
        <p>Joint Workshop of the 5th Extraction and Evaluation of Knowledge Entities from Scientific Documents and the 4th AI + Informetrics (EEKE [...] ∗ Corresponding author.</p>
        <p>Since 2020, large language models (LLMs) have exhibited remarkable few-shot performance in information extraction tasks, given only a few demonstrations and well-designed prompts. Under the prevalent "Language-[...] required to feed their own data, potentially including sensitive or private information, which increases the risk of data leakage. To exploit the abundant unlabeled corpus, an alternative is to employ LLMs as annotators, which generate labels in a zero-shot or few-shot manner.</p>
        <p>In this paper, we subdivide S&amp;T entities into multiple fine-grained categories. Depending on the type of scientific solution sought, they can be divided into two groups: identification and judgment of the research object, and the inherent mechanisms and laws of research. Correspondingly, the research objects include "technological methods", "system devices", "scientific experiments", "scientific materials", and "database names".</p>
        <p>Examples of research objects include "cell-based cancer immunotherapy and gene therapy", "ferrosilicon alloy latent heat photovoltaic cells", "deep underground neutrino experiments" and "two-dimensional materials for future heterogeneous electronic devices". The second group covers the underlying mechanisms of things, such as "the principle of evolution controlled from top to bottom".</p>
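        <p>As a minimal sketch of how such categories could be mapped onto a sequence-tagging scheme, the snippet below builds a BIO label set from the category names given above. The short tag names (METHOD, DEVICE and so on) and the helper build_bio_labels are hypothetical and are not taken from this paper.</p>
        <preformat><![CDATA[
```python
# Hypothetical short tags for the entity categories named in the text.
CATEGORIES = {
    "METHOD": "technological methods",
    "DEVICE": "system devices",
    "EXPERIMENT": "scientific experiments",
    "MATERIAL": "scientific materials",
    "DATABASE": "database names",
    "MECHANISM": "underlying mechanisms and laws",
}


def build_bio_labels(categories):
    """Return the BIO label list: O plus a B- and an I- tag per category."""
    labels = ["O"]
    for tag in categories:
        labels.append("B-" + tag)
        labels.append("I-" + tag)
    return labels


LABELS = build_bio_labels(CATEGORIES)
LABEL_TO_ID = {label: index for index, label in enumerate(LABELS)}
```
]]></preformat>
        <p>Under this assumed scheme, each character of a sentence would receive exactly one of the labels in LABELS, with B- marking the first character of an entity and I- its continuation.</p>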
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2. Data and Methods</title>
      <p>2.1. Data</p>
      <sec id="sec-3-1">
        <title>2.2.1. Initial annotation based on syntactic rules</title>
        <sec id="sec-3-1-1">
          <p>In the initial annotation based on syntactic rules, we mainly use a rule-based extraction method as a cold start, combined with manual correction, to obtain a small, high-quality dataset of S&amp;T entities with contextual semantics. So far, there are 162 lexical and syntactic rules in total. Combined with the dependency parsing function of a pretrained HanLP model, these rules yield candidate scientific entity phrases with contextual semantics.</p>
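          <p>The lexical part of such a cold start can be sketched as follows. This is only an illustration under stated assumptions: the trigger words, the pattern and the function extract_candidates are hypothetical, the 162 rules themselves are not listed in the text, English is used purely for readability, and the dependency-parsing step with HanLP is not reproduced here.</p>
          <preformat><![CDATA[
```python
import re

# Hypothetical trigger words; a phrase following one of them is a candidate.
TRIGGERS = ("develop", "developed", "propose", "proposed", "study", "studied")

# The candidate phrase runs from the trigger up to the next , . or ;
PATTERN = re.compile(
    r"\b(?:" + "|".join(TRIGGERS) + r")\s+(?P<phrase>[^,.;]+)",
    flags=re.IGNORECASE,
)


def extract_candidates(sentence):
    """Return the candidate phrases that follow a trigger word, in order."""
    return [match.group("phrase").strip() for match in PATTERN.finditer(sentence)]
```
]]></preformat>
          <p>Candidates produced this way would still need manual correction, as the text describes, because a purely lexical rule cannot tell whether the phrase is actually an S&amp;T entity.</p>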
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>2.2.2. Extraction model based on LLM active annotation</title>
        <sec id="sec-3-2-1">
          <p>The selected data source is high-quality strategic dynamic briefing data monitored and compiled by various departments of the Chinese Academy of Sciences and the State Council, which is available on the agency's website<sup>1</sup>.</p>
          <p>The data source includes: (1) the trends of top scientific journals, showcasing the latest scientific research achievements in disciplines such as physics, Earth science, and biology; (2) the latest strategic deployments of various countries in the field of S&amp;T, representing the direction of national S&amp;T development.</p>
          <p>To some extent, this information represents the will of the country and of scientists [<xref ref-type="bibr" rid="ref7">7</xref>]. Finally, we crawled all the information from the three sites from 2018 to 2023, totaling 42,984 reports, with an average of about 12 sentences per report.</p>
          <p>In the extraction part based on LLM active annotation, the main goal is to gradually screen out a small-scale annotated dataset from a large amount of unlabeled data, using a large language model as the annotation model. At the same time, an S&amp;T entity extraction model called "RoBERTa-BiLSTM-CRF" is trained.</p>
          <p>2.2. Main Framework</p>
          <p>Based on the above datasets, the research work of this paper mainly includes three parts: initial annotation based on syntactic rules, active annotation based on the LLM, and training of the extraction model during the active learning process, as shown in Figure 1.</p>
          <p>1) Optimizing LLM as better annotator. According to the literature, the current GPT series models are highly sensitive to different PROMPT expressions. When different annotators use different PROMPT expressions, GPT's responses differ significantly. The robustness of the model on NLP tasks is relatively weak [<xref ref-type="bibr" rid="ref8">8</xref>]. Previous studies show that, depending on the design of task-specific prompts, performance varies between near state-of-the-art and random guessing [<xref ref-type="bibr" rid="ref9">9</xref>]. Therefore, finding the best prompts for the given tasks and data points is critical.</p>
          <p>This paper adopts the Chain-of-Thought (CoT) prompting strategy, which gradually generates label sequences that meet expectations by setting some conditions in each model. Guided by the CoT approach, this article transforms the task into a multi-round Q&amp;A problem, enabling the GPT model to gradually locate the fine-grained categories of S&amp;T entities contained in the text through conversation, and finally annotate them. Specifically, this chapter focuses on the construction process of the PROMPT for different categories of S&amp;T problems, as shown in Figure 2.</p>
          <p><sup>1</sup> http://www.casisd.cn/zkcg/ydkb/kjqykb/, https://news.sciencenet.cn/AInews/newlist.aspx?, http://www.globaltechmap.com/document/index</p>
          <p>2) Active data acquisition. Active learning (AL) seeks to reduce labeling effort by strategically choosing which examples to annotate. We consider the standard pool-based setting, assuming that a large pool of unlabeled data <italic>D</italic><sub>pool</sub> is available. The AL loop starts with a seed labeled set <italic>D</italic><sub>labeled</sub>. At each iteration, we train a model <italic>M</italic> on <italic>D</italic><sub>labeled</sub> and then use an acquisition function <italic>f</italic>(·, <italic>M</italic>) to acquire a batch <italic>B</italic> consisting of <italic>b</italic> examples from <italic>D</italic><sub>pool</sub>. We then query the LLM annotator to label <italic>B</italic>. The labeled batch is then removed from the pool <italic>D</italic><sub>pool</sub> and added to the labeled set <italic>D</italic><sub>labeled</sub>, where it serves as training data for the next iteration. The process is repeated <italic>T</italic> times.</p>
          <p>Active acquisition strategies generally maximize either uncertainty or diversity. On the one hand, uncertainty-based methods (such as Maximum Entropy and Least Confidence) leverage model predictions to select hard examples. On the other hand, diversity-based methods (such as K-Means) exploit the heterogeneity of the sampled data.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Results and discussion</title>
      <p>3.1. Analysis of data annotation results</p>
      <p>Initial supervised data based on rule annotation. In the data annotation based on statistical rules, the extraction performance of the model was: precision: 0.36; recall: 0.82; F1 value: 0.50. That is to say, most of the phrases annotated by the statistical rule-based method do not actually belong to the S&amp;T category, and the method cannot accurately judge how S&amp;T-related a phrase is. Therefore, an AI model that can deeply understand and analyze semantics is particularly needed for annotation and extraction.</p>
      <sec id="sec-4-1">
        <title>2.2.3. Extraction model based on Roberta-BiLSTM-CRF</title>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Analysis of LLM annotation results</title>
        <p>First, we randomly selected 20 texts from the annotated dataset as the test set to determine the number of examples in the few-shot strategy. The results showed that 5-shot had the highest accuracy, at 76.3%. The test shows that the more relevant and semantically similar the given examples are to the test text, the better the annotation quality of GPT-3.5. In the 1-shot scenario, where a single example is given, the output of GPT-3.5 is sensitive to the choice of that example and unstable. Overall, the 5-shot prompt performs better because combining multiple random examples can reduce the impact of noise.</p>
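        <p>The observation that more similar demonstrations help can be illustrated with a small sketch of k-shot prompt assembly. Everything here is an assumption made for illustration: token-overlap (Jaccard) similarity stands in for semantic similarity, and the instruction wording and the function names jaccard, select_examples and build_prompt are hypothetical rather than the prompts used in this paper.</p>
        <preformat><![CDATA[
```python
def jaccard(a, b):
    """Token-overlap similarity between two whitespace-tokenized texts."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def select_examples(query, annotated, k=5):
    """Pick the k annotated (text, entities) pairs most similar to the query."""
    ranked = sorted(annotated, key=lambda pair: jaccard(query, pair[0]), reverse=True)
    return ranked[:k]


def build_prompt(query, annotated, k=5):
    """Assemble a k-shot prompt: demonstrations first, then the text to annotate."""
    lines = ["Extract the S&T entities from the text."]
    for text, entities in select_examples(query, annotated, k):
        lines.append("Text: " + text)
        lines.append("Entities: " + "; ".join(entities))
    lines.append("Text: " + query)
    lines.append("Entities:")
    return "\n".join(lines)
```
]]></preformat>
        <p>Setting k=5 corresponds to the 5-shot configuration reported above; an embedding-based similarity could replace jaccard without changing the rest of the sketch.</p>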
        <p>After determining the number of examples in the few-shot strategy, we conducted multiple tests to select the most effective example for each stage of the PROMPT.</p>
        <p>The performance of each prompt stage is shown in Table 1.</p>
        <sec id="sec-4-2-1">
          <p>(1) In terms of category judgment, GPT's performance is almost perfect. That is to say, for classification tasks with more common query semantics and more obvious semantic differences, the GPT model performs better.</p>
          <p>(2) In terms of information extraction, GPT has lower precision and higher recall. The extracted S&amp;T entities are mainly noun phrases, which are not comprehensive, such as "natural language processing algorithms will be used to study the principle of virus gene mutation".</p>
          <p>The target model is trained on the labeled data obtained, and the data to be annotated in the next iteration are selected using the acquisition function. The target model uses the Chinese RoBERTa-WWM [<xref ref-type="bibr" rid="ref10">10</xref>] model as the embedding model, and the BiLSTM model and CRF model as the label sequence prediction layer to obtain the label sequence of [...]</p>
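          <p>How a CRF layer turns per-character scores into a label sequence can be sketched with Viterbi decoding. This is a generic, pure-Python illustration rather than the implementation used in this paper: the function viterbi_decode and its argument layout are assumptions, the emission scores would in practice come from the BiLSTM layer, and the transition scores would be learned CRF parameters.</p>
          <preformat><![CDATA[
```python
def viterbi_decode(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions[t][j]: score of label j at position t.
    transitions[i][j]: score of moving from label i to label j.
    """
    n_labels = len(labels)
    scores = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_scores, pointers = [], []
        for j in range(n_labels):
            # Best previous label for ending at label j in position t.
            best_i = max(range(n_labels), key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_i] + transitions[best_i][j] + emissions[t][j])
            pointers.append(best_i)
        scores = new_scores
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label.
    best = max(range(n_labels), key=lambda j: scores[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return [labels[i] for i in path]
```
]]></preformat>
          <p>A strongly negative transition score, for example from O to an I- label, is what lets the CRF rule out label sequences that a purely per-character classifier could still produce.</p>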
          <p>Finally, after multiple rounds of annotation and manual proofreading, a total of 19,745 sentences formed a supervised training dataset.</p>
          <p>3.3. Analysis of model extraction effect</p>
        </sec>
        <sec id="sec-4-2-2">
          <p>Dataset. We chose 2,680 fine-grained S&amp;T entity samples from the initially annotated dataset as the seed labeled set <italic>D</italic><sub>labeled</sub>, used the whole 19,745 sentences as <italic>D</italic><sub>pool</sub>, and randomly acquired 100 samples per batch for 10 iterations, which generated 9,921 annotated samples in total.</p>
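          <p>The pool-based loop behind this setup can be sketched as below. It is a minimal illustration, not the authors' code: the function names and the callables annotate, train and acquire are hypothetical placeholders for the LLM annotator, the training routine of the extraction model and the acquisition function.</p>
          <preformat><![CDATA[
```python
import random


def active_learning_loop(seed_labeled, pool, annotate, train, acquire,
                         batch_size, iterations):
    """Pool-based active learning with an external annotator (e.g. an LLM).

    annotate(example) returns a label, train(labeled) returns a model and
    acquire(model, pool, batch_size) returns a list of examples; all three
    are supplied by the caller.
    """
    labeled = list(seed_labeled)
    pool = list(pool)
    model = train(labeled)
    for _ in range(iterations):
        if not pool:
            break
        batch = acquire(model, pool, batch_size)
        # Query the annotator, then move the batch from the pool to the labeled set.
        labeled.extend((example, annotate(example)) for example in batch)
        pool = [example for example in pool if example not in batch]
        model = train(labeled)
    return model, labeled, pool


def random_acquire(model, pool, batch_size):
    """Random acquisition: ignore the model and sample uniformly from the pool."""
    return random.sample(pool, min(batch_size, len(pool)))
```
]]></preformat>
          <p>Calling the loop with batch_size=100 and iterations=10 would mirror the batch configuration stated above; an uncertainty-based function can be passed in place of random_acquire.</p>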
          <p>Baselines. We compare RoBERTa-BiLSTM-CRF with the following baselines: (1) In-context learning (i.e., PROMPTING), which enables the LLM to conduct few-shot inference without fine-tuning. (2) SUPERVISED (i.e., BERT-BiLSTM-CRF), a supervised model trained on the whole clean-labeled data.</p>
          <p>Accelerating with active learning. The last layer of the above extraction model is the CRF model, whose output is the probability score of the BIO label for each character. We use this probability score as the confidence score and feed it into two uncertainty-based active learning strategies. The results show that the maximum entropy strategy makes the extraction model more efficient and more capable. The results of the S&amp;T entity extraction tasks are shown in Table 2.</p>
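          <p>Two common uncertainty scores can be computed from per-character label probabilities as sketched below. This is an illustration under stated assumptions: the paper does not say how character-level scores are aggregated per sentence, so the mean is used here, and all function names are hypothetical.</p>
          <preformat><![CDATA[
```python
import math


def token_entropy(probs):
    """Shannon entropy of one label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def max_entropy_score(sentence_probs):
    """Mean per-character entropy; higher means the model is less certain."""
    return sum(token_entropy(p) for p in sentence_probs) / len(sentence_probs)


def least_confidence_score(sentence_probs):
    """One minus the mean probability of the top label per character."""
    return 1.0 - sum(max(p) for p in sentence_probs) / len(sentence_probs)


def select_most_uncertain(pool_probs, score_fn, batch_size):
    """Return the indices of the batch_size sentences with the highest score."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: score_fn(pool_probs[i]), reverse=True)
    return ranked[:batch_size]
```
]]></preformat>
          <p>Each element of pool_probs is one sentence, given as a list of per-character probability distributions over the BIO labels.</p>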
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Conclusions</title>
      <sec id="sec-5-1">
        <p>Automatic extraction of technology entities with contextual semantics from think tank reports can capture key research development directions more efficiently. In this paper, GPT is used as the teacher model and RoBERTa-BiLSTM-CRF as the student model. Through active learning, the training data generated by GPT are used to fine-tune the local extraction model, forming a feasible framework for fine-grained S&amp;T entity recognition.</p>
        <p>The biggest limitation of our study is that it mainly discusses the effectiveness of the method; the accuracy required for practical engineering applications has not yet been reached, and we will continue to optimize the model in the future.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ke</surname>
          </string-name>
          ,
          <article-title>Research methods: What's in the name?</article-title>
          ,
          <source>Library &amp; Information Science Research</source>
          <volume>39</volume>
          (
          <year>2017</year>
          )
          <fpage>284</fpage>
          -
          <lpage>294</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>Analyzing the dynamics of research by extracting key aspects of scientific papers</article-title>
          ,
          <source>in: Proceedings of 5th international joint conference on natural language processing</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Heffernan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Teufel</surname>
          </string-name>
          ,
          <article-title>Identifying problems and solutions in scientific text</article-title>
          ,
          <source>Scientometrics</source>
          <volume>116</volume>
          (
          <year>2018</year>
          )
          <fpage>1367</fpage>
          -
          <lpage>1382</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Buscaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-K.</given-names>
            <surname>Schumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Qasemizadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zargayouna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Charnois</surname>
          </string-name>
          ,
          <article-title>SemEval-2018 task 7: Semantic relation extraction and classification in scientific papers</article-title>
          , in: International Workshop on Semantic Evaluation (SemEval-2018),
          <year>2017</year>
          , pp.
          <fpage>679</fpage>
          -
          <lpage>688</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Xuesi</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zhixiong</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Y. L., Y. W.,
          <article-title>Research on problem sentence recognition methods in scientific literature research</article-title>
          ,
          <source>Library and Information Service</source>
          <volume>67</volume>
          (
          <year>2023</year>
          )
          <fpage>132</fpage>
          -
          <lpage>140</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>A frustratingly easy approach for entity and relation extraction</article-title>
          ,
          <source>arXiv preprint arXiv:2010.12812</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Yanpeng</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xuezhao</given-names>
            <surname>Wang</surname>
          </string-name>
          , X. C., Y. L., X. L.
          ,
          <article-title>Analysis of key technologies and initiatives of the fourth industrial revolution based on science and technology policy and frontier dynamics</article-title>
          ,
          <source>Journal of the China Society for Scientific and Technical Information</source>
          <volume>41</volume>
          (
          <year>2022</year>
          )
          <fpage>29</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Exploring the feasibility of chatgpt for event extraction</article-title>
          ,
          <source>arXiv preprint arXiv:2303.03836</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fisch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Making pre-trained language models better few-shot learners</article-title>
          ,
          <source>arXiv preprint arXiv:2012.15723</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Che</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Pretraining with whole word masking for chinese bert</article-title>
          ,
          <source>IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>
          <volume>29</volume>
          (
          <year>2021</year>
          )
          <fpage>3504</fpage>
          -
          <lpage>3514</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>