Research on Fine-grained S&T Entity Identification with Contextual Semantics in Think-Tank Text

Mengge Sun1,2, Yanpeng Wang1,2,∗ and Yang Zhao1,2

1 National Science Library, Chinese Academy of Sciences, Beijing 100190
2 Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190

Joint Workshop of the 5th Extraction and Evaluation of Knowledge Entities from Scientific Documents and the 4th AI + Informetrics (EEKE-AII2024), April 23-24, 2024, Changchun, China and Online
∗ Corresponding author: wangyanpeng@mail.las.ac.cn (Y. Wang)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings, ISSN 1613-0073.

Abstract
Automatically extracting fine-grained S&T problems from think-tank reports written by numerous experts has become one of the effective ways to perceive global trends in S&T development. We transform the automatic identification of fine-grained S&T problems into a multi-category S&T entity extraction task with contextual semantics. To address the shortage of high-quality datasets and fully exploit the potential of LLMs, we use LLMs as annotators and put them into an active learning loop that determines which samples to annotate efficiently. During this cyclic annotation process, we simultaneously train the target entity extraction model, RoBERTa-BiLSTM-CRF. The model ultimately achieves an F1 value of 86.02% on our task, and comparison experiments against benchmark models verify its effectiveness and reliability. This study alleviates the dependence on manually annotated datasets while providing high-quality data support and an effective modeling method for mining and analyzing fine-grained S&T problems.

Keywords
S&T entity with contextual semantics, LLM annotators, active learning, RoBERTa-BiLSTM-CRF

1. Introduction

Think tanks are composed of multidisciplinary experts and gather a country's intellectual resources; they are an important force in influencing government decision-making and promoting social development. Think-tank reports tend to focus on major issues of concern to governments or the public, serve as indicators and weather vanes of national policy and scientific research, and therefore have high intelligence value. Automatically extracting the scientific and technological problems mentioned in think-tank reports can thus clarify policy and public concerns efficiently and objectively. This paper defines "fine-grained S&T problems" as "research directions or problems constrained by conditions such as application scenarios, technological solutions, and technological routes", and further operationalizes them as "S&T entities with contextual semantics".

Past work on extracting S&T problem representations has adopted manual annotation, rule-based matching, machine learning, hybrid models, and deep learning. H. Chu and Q. Ke [1] used manual annotation to analyze the distribution of methods across academic journals; such expert annotation is highly accurate but costly and time-consuming. S. Gupta and C. D. Manning [2] designed matching rules for identifying research problems, including rule matching on the word "applied", and then used bootstrapping to derive new rule templates from the newly matched vocabulary. K. Heffernan and S. Teufel [3] treated scientific method identification as a classification task, applying support vector machines, Naive Bayes, and logistic regression, and enriching the algorithms with features such as N-grams, sentiment polarity, part of speech, negation, and discourse information. SemEval 2018 Task 7 [4] likewise extracted various types of entities from academic papers; in that task, many teams used convolutional neural networks and Long Short-Term Memory networks to outperform traditional machine learning methods (such as SVM), demonstrating the usefulness of deep learning models.
In terms of deep learning methods, Xuesi Li et al. [5] designed a sentence classification model based on a BERT-CNN architecture and automatically identified research-issue sentences in scientific papers with an F1 value of 94.8%. Z. Zhong and D. Chen [6] compared two pre-trained language models, BERT and SciBERT, on relation extraction from academic papers and found that SciBERT performed better than BERT.

Since 2020, large language models (LLMs) have exhibited remarkable few-shot performance on information extraction tasks, given only a few demonstrations and well-designed prompts. Under the prevalent "Language-Model-as-a-Service" setting (Sun et al. 2022), users are required to feed in their own data, potentially including sensitive or private information, which increases the risk of data leakage. To exploit the abundant unlabeled corpus, an alternative is to employ LLMs as annotators that generate labels in a zero-shot or few-shot manner.

In this paper, we subdivide S&T entities into multiple fine-grained categories. Depending on the type of scientific solution sought, they can be divided into identification and judgment of the research object, and of the inherent mechanisms and laws under study. The research objects include "technological methods", "system devices", "scientific experiments", "scientific materials", and "database names"; examples include "cell-based cancer immunotherapy and gene therapy", "ferrosilicon alloy latent heat photovoltaic cells", "deep underground neutrino experiments", and "two-dimensional materials for future heterogeneous electronic devices". The underlying mechanisms of things include, for example, "the principle of evolution controlled from top to bottom".

2. Data and Methods

2.1. Data

The selected data source is high-quality strategic dynamic briefing data monitored and compiled by various departments of the Chinese Academy of Sciences and the State Council, available on the agencies' websites (http://www.casisd.cn/zkcg/ydkb/kjqykb/, https://news.sciencenet.cn/AInews/newlist.aspx?, http://www.globaltechmap.com/document/index). The data source includes: (1) trends from top scientific journals, showcasing the latest research achievements in disciplines such as physics, earth science, and biology; and (2) the latest strategic deployments of various countries in the S&T field, representing the direction of national S&T development. These contents can, to some extent, represent the will of the country and of scientists [7]. We crawled all information from the three sites from 2018 to 2023, totaling 42,984 reports with an average of about 12 sentences per report.

2.2. Main Framework

Based on the above dataset, the research work of this paper comprises three parts: initial annotation based on syntactic rules, active annotation based on an LLM, and training an extraction model during the active learning process, as shown in Figure 1.

Figure 1: Main research framework

2.2.1. Initial annotation based on syntactic rules

For the initial annotation, we mainly use a rule-based extraction method as a cold start, combined with manual correction, to obtain a small, high-quality database of contextual S&T entities. To date there are a total of 162 lexical and syntactic rules. Combined with the dependency parsing function of a pretrained HanLP model, these rules yield candidate scientific entity phrases with contextual semantics.
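The paper does not list the 162 rules themselves. As an illustration only, the sketch below shows the general shape of such a rule-based cold start, assuming the dependency parse (e.g., from a pretrained HanLP pipeline) is available as (token, POS, head index, relation) tuples; the trigger-verb rule and the relation labels ("dobj", "amod", ...) are hypothetical and depend on the parser's tag set, not taken from the authors' rule base.

```python
from typing import List, NamedTuple

class Token(NamedTuple):
    text: str    # surface form
    pos: str     # part-of-speech tag
    head: int    # 1-based index of the syntactic head; 0 means root
    deprel: str  # dependency relation to the head

# Hypothetical trigger verbs signalling a research action ("study",
# "develop", "propose", "solve"); the paper's 162 rules are not published.
TRIGGER_VERBS = {"研究", "开发", "提出", "解决"}

def candidate_entities(sent: List[Token]) -> List[str]:
    """Collect the object noun phrase of a trigger verb, keeping its
    contiguous pre-modifiers so contextual constraints (application
    scenario, technological route, etc.) stay attached."""
    candidates = []
    for i, tok in enumerate(sent, start=1):
        head = sent[tok.head - 1] if tok.head > 0 else None
        if (head is not None and head.text in TRIGGER_VERBS
                and tok.deprel == "dobj" and tok.pos.startswith("N")):
            span = [tok.text]
            # walk left over modifiers that attach to this object head
            for j in range(i - 1, 0, -1):
                prev = sent[j - 1]
                if prev.head == i and prev.deprel in {"amod", "nn", "compound", "nmod"}:
                    span.insert(0, prev.text)
                else:
                    break
            candidates.append("".join(span))
    return candidates
```

In this shape, each of the 162 rules would contribute a trigger lexicon or a syntactic pattern, and manual correction would then filter the candidate phrases into the seed database.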
2.2.2. Extraction model based on LLM active annotation

In the LLM-based active annotation part, the main goal is to progressively screen a small-scale set of samples to annotate out of a large amount of unlabeled data, while using a large language model as the annotation model. At the same time, an S&T entity extraction model, RoBERTa-BiLSTM-CRF, is trained.

1) Optimizing the LLM as a better annotator. The literature shows that current GPT-series models are highly sensitive to prompt phrasing: when different annotators use different prompt formulations, GPT's responses differ significantly, and the models' robustness on NLP tasks is relatively weak [8]. Previous studies show that task-specific prompt design can swing performance between near state-of-the-art and random guessing [9]. Finding the best prompts for a given task and given data points is therefore critical.

This paper adopts a chain-of-thought (CoT) prompting strategy, which gradually generates label sequences that meet expectations by imposing conditions at each step. Guided by the CoT approach, we transform the task into a multi-round Q&A dialogue, enabling the GPT model to progressively locate the fine-grained categories of S&T entities contained in the text through conversation, and finally annotate them. The construction process of the prompts for the different categories of S&T problems is shown in Figure 2.

Figure 2: Flowchart of GPT annotation under CoT

2) Active data acquisition. Active learning (AL) seeks to reduce labeling effort by strategically choosing which examples to annotate. We consider the standard pool-based setting, assuming that a large pool of unlabeled data D_pool is available. The AL loop starts with a seed labeled set D_labeled. At each iteration, we train a model M on D_labeled and then use an acquisition function f(·, M) to acquire a batch B of b examples from D_pool. We then query the LLM annotator to label B; the labeled batch is removed from D_pool, added to D_labeled, and serves as training data for the next iteration. The process is repeated m times.

Active acquisition strategies generally maximize either uncertainty or diversity. Uncertainty-based methods (such as maximum entropy and least confidence) leverage model predictions to select hard examples, while diversity-based methods (such as K-Means) exploit the heterogeneity of the sampled data.
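The loop described above can be summarized in a few lines. The following sketch is schematic: `train_tagger`, `acquire`, and `llm_annotate` are assumed stand-ins for the RoBERTa-BiLSTM-CRF training routine, the acquisition function f(·, M), and the CoT-prompted GPT annotator, none of which the paper specifies as code.

```python
def active_learning_loop(d_pool, d_labeled, train_tagger, acquire, llm_annotate,
                         batch_size=100, iterations=10):
    """Pool-based active learning with an LLM annotator (Section 2.2.2).

    d_pool       -- list of unlabeled sentences (strings)
    d_labeled    -- seed labeled set: list of (sentence, tag_sequence) pairs
    train_tagger -- callable training the RoBERTa-BiLSTM-CRF student
    acquire      -- acquisition function f(pool, model, b) -> batch of b sentences
    llm_annotate -- callable querying the CoT-prompted GPT annotator
    """
    model = train_tagger(d_labeled)
    for _ in range(iterations):
        batch = acquire(d_pool, model, batch_size)   # pick informative examples
        labels = [llm_annotate(s) for s in batch]    # LLM acts as the annotator
        d_labeled.extend(zip(batch, labels))         # grow the labeled set
        batch_set = set(batch)
        d_pool = [s for s in d_pool if s not in batch_set]  # shrink the pool
        model = train_tagger(d_labeled)              # retrain for the next round
    return model, d_labeled
```

The paper runs this loop with b = 100 and m = 10, starting from the seed set described in Section 3.3.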
2.2.3. Extraction model based on RoBERTa-BiLSTM-CRF

We train the target model on the labeled data obtained so far and select the data to annotate in the next iteration via the acquisition function mechanism. The target model uses the Chinese RoBERTa-WWM model [10] as the embedding model, with a BiLSTM layer and a CRF layer as the label sequence prediction layers, to produce the label sequence of S&T entities and complete the automatic extraction of fine-grained S&T problems. Finally, the model results are evaluated with a soft-matching strategy.
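The paper gives no implementation details for this model. The following PyTorch sketch shows one conventional way to wire the three components, assuming the Hugging Face `transformers` checkpoint `hfl/chinese-roberta-wwm-ext` (the whole-word-masking model of Cui et al. [10]) and the `pytorch-crf` package; both are toolchain assumptions, not the authors' stated stack.

```python
import torch
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF  # pip install pytorch-crf

class RobertaBiLstmCrf(nn.Module):
    def __init__(self, num_tags, lstm_hidden=256,
                 pretrained="hfl/chinese-roberta-wwm-ext"):
        super().__init__()
        # Chinese RoBERTa-WWM embeddings (BERT architecture class on the HF hub)
        self.encoder = BertModel.from_pretrained(pretrained)
        self.bilstm = nn.LSTM(self.encoder.config.hidden_size, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * lstm_hidden, num_tags)  # per-character tag scores
        self.crf = CRF(num_tags, batch_first=True)        # BIO sequence layer

    def forward(self, input_ids, attention_mask, tags=None):
        x = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        x, _ = self.bilstm(x)
        emissions = self.emit(x)
        mask = attention_mask.bool()
        if tags is not None:  # training: negative CRF log-likelihood as the loss
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)  # inference: best BIO path
```

The per-character label scores produced by the CRF layer are what Section 3.3 later reuses as confidence values for the uncertainty-based acquisition strategies; how those scores are extracted depends on the CRF implementation chosen.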
3. Results and discussion

3.1. Analysis of data annotation results

Initial supervised data based on rule annotation. In the annotation based on syntactic rules, the extraction performance was: accuracy 0.36, recall 0.82, F1 0.50. That is, the majority of S&T phrases annotated by the rule-based method are not actually within the S&T category, and their S&T relevance cannot be accurately judged by rules alone. An AI model that can deeply understand and analyze semantics is therefore needed for annotation and extraction.

3.2. Analysis of LLM annotation results

First, we randomly selected 20 texts from the annotated dataset as a test set to determine the number of examples for the few-shot strategy. The results showed that 5-shot had the highest accuracy, at 76.3%. The tests show that the more relevant and semantically similar the given examples are to the test text, the better the annotation quality of GPT-3.5. In the 1-shot scenario, GPT-3.5's performance is sensitive to, and unstable with respect to, the particular example given; overall, the 5-shot prompt performs better because combining multiple random examples reduces the impact of noise. After fixing the number of examples in the few-shot strategy, we ran multiple tests to select the most effective example for each stage of the prompt. The performance of each prompt stage is shown in Table 1.

Table 1
Performance of the large model at each prompt stage in the final iteration

Stage      Precision   Recall
Stage I    100.00      –
Stage II   90.87       –
Stage III  71.20       88.41
Stage IV   92.01       –

(1) In terms of category judgment, GPT's performance is almost perfect. That is, for classification tasks with common query semantics and clear semantic differences, the GPT model performs well.

(2) In terms of information extraction, GPT has lower precision and higher recall. The extracted S&T entities are mainly in the form of noun phrases and are not always well delimited, e.g., "natural language processing algorithms will be used to study the principle of virus gene mutation".

Finally, after multiple rounds of annotation and manual proofreading, a total of 19,745 sentences formed the supervised training dataset.

3.3. Analysis of model extraction effect

Dataset. We chose 2,680 fine-grained S&T entity samples from the initially annotated data as the seed labeled set D_labeled, used the full 19,745 sentences as D_pool, and randomly acquired 100 samples per batch for 10 iterations, generating 9,921 annotated samples in D_labeled in total.

Baselines. We compare RoBERTa-BiLSTM-CRF with the following baselines: (1) in-context learning (PROMPTING), which lets the LLM perform few-shot inference without fine-tuning; and (2) a supervised model (BERT-BiLSTM-CRF) trained on the whole clean-labeled dataset D_labeled.

Accelerating with active learning. The last layer of the extraction model is the CRF, whose output is a probability score for the BIO label of each character. We use this probability score as the confidence score and feed it into two uncertainty-based active learning strategies. The results show that the maximum entropy strategy makes the extraction model more efficient and more capable. The results of the S&T entity extraction task are shown in Table 2.

Table 2
Model comparison experiment results

METHOD            Precision   Recall   F1 Value
PROMPTING         67.72       76.72    67.72
BERT-BiLSTM-CRF   70.54       75.66    73.00
Our model         82.20       90.23    86.02

4. Conclusions

Automatically extracting S&T entities with contextual semantics from think-tank reports can capture key research development directions more efficiently. In this paper, GPT is used as the teacher model and RoBERTa-BiLSTM-CRF as the student model; through active learning, the training data generated by GPT is used to fine-tune the local extraction model, forming a feasible framework for fine-grained S&T entity recognition.

The main limitation of our study is that it focuses on the effectiveness of the method: the accuracy required for practical engineering applications has not yet been reached, and the model will continue to be optimized in future work.

References

[1] H. Chu, Q. Ke, Research methods: What's in the name?, Library & Information Science Research 39 (2017) 284–294.
[2] S. Gupta, C. D. Manning, Analyzing the dynamics of research by extracting key aspects of scientific papers, in: Proceedings of the 5th International Joint Conference on Natural Language Processing, 2011, pp. 1–9.
[3] K. Heffernan, S. Teufel, Identifying problems and solutions in scientific text, Scientometrics 116 (2018) 1367–1382.
[4] D. Buscaldi, A.-K. Schumann, B. Qasemizadeh, H. Zargayouna, T. Charnois, SemEval-2018 task 7: Semantic relation extraction and classification in scientific papers, in: International Workshop on Semantic Evaluation (SemEval-2018), 2018, pp. 679–688.
[5] Xuesi Li, Zhixiong Zhang, et al., Research on problem sentence recognition methods in scientific literature, Library and Information Service 67 (2023) 132–140.
[6] Z. Zhong, D. Chen, A frustratingly easy approach for entity and relation extraction, arXiv preprint arXiv:2010.12812 (2020).
[7] Yanpeng Wang, Xuezhao Wang, et al., Analysis of key technologies and initiatives of the fourth industrial revolution based on science and technology policy and frontier dynamics, Journal of the China Society for Scientific and Technical Information 41 (2022) 29–37.
[8] J. Gao, H. Zhao, C. Yu, R. Xu, Exploring the feasibility of ChatGPT for event extraction, arXiv preprint arXiv:2303.03836 (2023).
[9] T. Gao, A. Fisch, D. Chen, Making pre-trained language models better few-shot learners, arXiv preprint arXiv:2012.15723 (2020).
[10] Y. Cui, W. Che, T. Liu, B. Qin, Z. Yang, Pre-training with whole word masking for Chinese BERT, IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021) 3504–3514.