=Paper=
{{Paper
|id=Vol-3004/paper12
|storemode=property
|title=Extraction of Thesis Research Conclusion Sentences in Academic Literature
|pdfUrl=https://ceur-ws.org/Vol-3004/paper12.pdf
|volume=Vol-3004
|authors=Litao Lin,Dongbo Wang,Si Shen
|dblpUrl=https://dblp.org/rec/conf/jcdl/LinWS21
}}
==Extraction of Thesis Research Conclusion Sentences in Academic Literature==
EEKE 2021 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents

Litao Lin, College of Information Management, Nanjing Agricultural University, Nanjing, Jiangsu, China, 2020114016@njau.edu.cn
Dongbo Wang, College of Information Management, Nanjing Agricultural University, Nanjing, Jiangsu, China, db.wang@njau.edu.cn
Si Shen, School of Economics & Management, Nanjing University of Science and Technology, Nanjing, Jiangsu, China, shensi@njust.edu.cn

ABSTRACT

The extraction of sentences with specific meanings from academic literature is an important task in academic full-text bibliometrics. This research attempts to establish a practical model for extracting conclusion sentences from academic literature. SVM and SciBERT models were trained and tested on academic papers published in JASIST from 2017 to 2020. The experimental results show that SciBERT is more suitable for extracting thesis conclusion sentences, with an optimal F1-value of 77.51%.

CCS CONCEPTS

Theory of computation~Theory and algorithms for application domains~Machine learning theory~Models of learning

KEYWORDS

SVM, BERT, Academic full text, Thesis research conclusion, Text mining, Deep learning

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION

Full-text data of academic literature contains both external characteristics and content characteristics. Since Garfield created the citation index, citation analysis based on the external characteristics of literature has been widely applied in many fields. However, due to the limitations of data and technology, earlier bibliometric studies had notable defects, including coarse statistical methods and limited indicator ability [1]. Today, increasingly rich full-text data and evolving machine learning and deep learning techniques allow researchers to investigate the content characteristics of academic literature in depth.

Entity extraction and sentence extraction are two important basic tasks of full-text bibliometric analysis. At the entity level, the relevant research mainly includes theory and method entity extraction [2], algorithm entity extraction [3], and software entity extraction [4]. At the sentence level, there are four main research directions: extraction of experimental result sentences, research question sentences, research conclusion sentences, and future work sentences. At present, there is more research on entity extraction and less on sentence extraction.

A research conclusion sentence is a sentence that contains a research conclusion. In academic full text, research conclusion sentences are divided into citation research conclusion sentences and thesis research conclusion sentences. Citation research conclusion sentences report experimental results and conclusions from cited work, such as 'Taylor's work shows that the special purpose syntactic parsers perform well on morphological descriptions.'. Thesis research conclusion sentences are the author's statements of his or her own research results, such as 'In this way, we extended earlier work to the case that the impact factor can have a value lower than one.'.

Automatically extracting thesis research conclusion sentences can promote the development of automatic summarization and originality evaluation of academic papers. Therefore, this research attempts to construct an automatic recognition model for thesis research conclusion sentences based on deep learning techniques.

2 CORPUS AND METHOD

2.1 Data Source and Data Annotation

This research obtained the full texts of all academic papers published in JASIST (Journal of the Association for Information Science and Technology) from 2017 to 2020 using a self-made Python program.

For data annotation, we first used Python's NLTK module to segment the full text of each paper into sentence units (a minimal sketch of this step follows). Then, 7 postgraduates majoring in information science manually annotated the sentences.
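The paper does not publish its preprocessing code; the following is a minimal sketch of the NLTK sentence segmentation described above, assuming each paper's full text is already available as a plain string. The function name and sample text are illustrative, not taken from the paper's own program.

```python
# A minimal sketch of the sentence-segmentation step, assuming each
# paper's full text is already available as a plain string.
import nltk

nltk.download("punkt")  # one-time download of the Punkt sentence model

def split_into_sentences(paper_text: str) -> list[str]:
    """Segment a paper's full text into sentence units with NLTK."""
    return nltk.sent_tokenize(paper_text)

sentences = split_into_sentences(
    "We propose a new model. It performs well on JASIST papers."
)
print(sentences)
# ['We propose a new model.', 'It performs well on JASIST papers.']
```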
For sentences whose labels were uncertain, the decision was made after group discussion, and the experimenters completed a final review. The discriminant criteria for a thesis research conclusion sentence are as follows: (1) Semantically, the sentence content is a summary of the author's own work experience, observations, or actual research results. (2) The content of the sentence can be a reasoning and qualitative interpretation of the experimental results, but it cannot be a straightforward description of the data of the experimental results.

Data imbalance, that is, an excessive gap between the numbers of positive and negative samples used to train a model, is one of the most widespread problems in contemporary machine learning [5]. After annotation was completed, thesis research conclusion sentences accounted for only 3% of the total corpus (more than 130 thousand sentences in total). To alleviate the data imbalance, we negatively sampled the non-conclusion sentences to raise the proportion of thesis research conclusion sentences to 8.9% (a sketch of this step follows Table 1). The basic information of the final corpus is shown in Table 1.

Table 1. Basic Information of the Corpus

Num.  Type                                           Count
1     Total articles                                 502
2     Total sentences                                54,479
3     Thesis research conclusion sentences           4,870
4     Average marked sentences per article           9.7
5     Average words per sentence                     27.99
6     Words in the longest sentence                  255
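The paper does not detail how the negative sampling was implemented; the following is a minimal sketch of random undersampling that would produce the reported positive proportion, assuming the annotated corpus is a list of (text, label) pairs. All names and the fixed seed are illustrative.

```python
# A minimal sketch of the negative-sampling (undersampling) step, assuming
# `sentences` is a list of (text, label) pairs with label 1 for thesis
# research conclusion sentences and 0 otherwise.
import random

def undersample_negatives(sentences, positive_ratio=0.089, seed=42):
    """Randomly drop negative samples until positives reach the target ratio."""
    positives = [s for s in sentences if s[1] == 1]
    negatives = [s for s in sentences if s[1] == 0]
    # Number of negatives that leaves positives at `positive_ratio` overall.
    n_keep = round(len(positives) * (1 - positive_ratio) / positive_ratio)
    rng = random.Random(seed)
    rng.shuffle(negatives)
    sampled = positives + negatives[:n_keep]
    rng.shuffle(sampled)
    return sampled
```

With 4,870 positive sentences, a target ratio of 8.9% keeps roughly 50 thousand negatives, consistent with the 54,479-sentence corpus in Table 1.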
2.2 Method

SVM and SciBERT are used in this research. SVM (support vector machine) is a classic model for text classification; in its simplest form, an SVM performs binary classification by finding the 'best' separating hyperplane between two linearly separable classes. SciBERT [6] is a deep learning model based on the BERT architecture [7], trained on the full-text corpus of 1.14 million scientific and technological documents. SciBERT uses the same configuration and size as BERT-Base [7], and it performs better than BERT-Base on natural language processing tasks in scientific literature.

3 EXPERIMENT

Before the formal experiment, we tested different hyperparameter combinations on a small part of the experimental corpus to explore the optimal settings for SVM and SciBERT. Considering the performance of the computer hardware used in the experiment, the final hyperparameters were set as follows. SciBERT (scibert-scivocab-uncased): maximum sequence length 256, batch size 64, learning rate 2e-5, 3 training epochs, case-insensitive. For SVM, the penalty parameter is set to 2, the kernel function is RBF, and TF-IDF is used to vectorize the text. The research uses a ten-fold cross-validation strategy, and model performance is measured by Precision, Recall, and F1-value; a sketch of the SVM pipeline and evaluation protocol follows.
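As an illustration of this setup, here is a minimal scikit-learn sketch of the SVM pipeline (TF-IDF features, RBF kernel, penalty C = 2, ten-fold cross-validation). The use of scikit-learn and any TF-IDF settings beyond the defaults are assumptions, since the paper names only the techniques and hyperparameters; `texts` and `labels` are assumed to hold the sampled corpus.

```python
# A minimal sketch of the SVM pipeline and ten-fold evaluation described
# above. TF-IDF settings beyond the defaults are not reported in the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def evaluate_svm(texts: list[str], labels: list[int]) -> dict:
    """Run ten-fold cross-validation and report Precision, Recall, F1."""
    # TF-IDF vectorization followed by an RBF-kernel SVM with penalty C = 2,
    # matching the hyperparameters reported in the paper.
    model = make_pipeline(TfidfVectorizer(), SVC(C=2, kernel="rbf"))
    scores = cross_validate(model, texts, labels, cv=10,
                            scoring=("precision", "recall", "f1"))
    return {metric: scores[f"test_{metric}"].mean()
            for metric in ("precision", "recall", "f1")}
```

Fine-tuning SciBERT would follow an analogous train/evaluate loop with the reported hyperparameters (scibert-scivocab-uncased, sequence length 256, batch size 64, learning rate 2e-5, 3 epochs).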
The results of the experiment are shown in Table 2.

Table 2. Results of 10-Fold Cross-Validation

Model         Precision   Recall    F1-Value
SciBERT  MAX  85.86%      78.61%    77.51%
         MIN  73.41%      44.13%    58.22%
         AVG  79.86%      64.51%    70.74%
SVM      MAX  98.19%      64.51%    77.03%
         MIN  90.37%      37.08%    53.80%
         AVG  95.97%      52.14%    67.24%

Table 2 shows that SVM has a high precision rate but a low recall rate when extracting thesis research conclusion sentences, while SciBERT's precision and recall are more balanced. In terms of average F1-value, SciBERT reached 70.74%, more than three percentage points higher than SVM. In summary, SciBERT performs relatively better.

Comparing the sentences extracted by the SciBERT model with the manually annotated sentences, the recognition errors discovered so far are as follows: (1) Recognizing sentences that describe figures as thesis research conclusion sentences. A possible reason is that such sentences normally begin with phrases such as "as shown in", and these words are also important features of thesis research conclusion sentences. (2) Recognizing research hypothesis sentences as thesis research conclusion sentences; by observation, thesis conclusion sentences are similar to hypothesis sentences in grammar and semantics. (3) Recognizing citation conclusion sentences without quotation marks as thesis research conclusion sentences, which indicates that some special words or symbols may affect the judgment of the model.

4 CONCLUSION & FUTURE WORK

This research provides a practical method for extracting thesis research conclusion sentences from academic literature and shows that SciBERT is superior to SVM for this task. This research uses a negative-sampling strategy to alleviate the problem of data imbalance and to enable faster model optimization, which may reduce the complexity of the negative samples. Therefore, data augmentation by adding more positive samples should be pursued in the future. In addition, the position of a sentence within the article needs to be considered to optimize the performance of the model. Finally, some extracted research conclusion sentences contain pronouns and do not have complete semantics when read alone; therefore, research on co-reference resolution should be carried out.

ACKNOWLEDGMENTS

The authors acknowledge the National Natural Science Foundation of China (Grant Number: 71974094) for financial support.

REFERENCES

[1] C. Lu, Y. Ding and C. Zhang, Understanding the impact change of a highly cited article: a content-based citation analysis, Scientometrics, vol. 112, pp. 927-945, 2017.
[2] H. Zhang and C. Zhang, Using Full-text Content of Academic Articles to Build a Methodology Taxonomy of Information Science in China, ArXiv, vol. abs/2101.07924, 2021.
[3] Y. Wang and C. Zhang, Using the full-text content of academic articles to identify and evaluate algorithm entities in the domain of natural language processing, Journal of Informetrics, vol. 14, 101091, 2020.
[4] X. Pan, E. Yan, Q. Wang, and W. Hua, Assessing the impact of software on science: A bootstrapped learning of software entities in full-text papers, Journal of Informetrics, vol. 9, pp. 860-871, 2015.
[5] M. Koziarski, Radial-Based Undersampling for imbalanced data classification, Pattern Recognition, vol. 102, 2020.
[6] I. Beltagy, A. Cohan and K. Lo, SciBERT: Pretrained Contextualized Embeddings for Scientific Text, ArXiv, vol. abs/1903.10676, 2019.
[7] J. Devlin, M.W. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2018.