=Paper=
{{Paper
|id=Vol-3004/paper12
|storemode=property
|title=Extraction of Thesis Research Conclusion Sentences in Academic Literature
|pdfUrl=https://ceur-ws.org/Vol-3004/paper12.pdf
|volume=Vol-3004
|authors=Litao Lin,Dongbo Wang,Si Shen
|dblpUrl=https://dblp.org/rec/conf/jcdl/LinWS21
}}
==Extraction of Thesis Research Conclusion Sentences in Academic Literature==
EEKE 2021 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents

Litao Lin, College of Information Management, Nanjing Agricultural University, Nanjing, Jiangsu, China, 2020114016@njau.edu.cn
Dongbo Wang, College of Information Management, Nanjing Agricultural University, Nanjing, Jiangsu, China, db.wang@njau.edu.cn
Si Shen, School of Economics & Management, Nanjing University of Science and Technology, Nanjing, Jiangsu, China, shensi@njust.edu.cn

ABSTRACT

The extraction of sentences with specific meanings from academic literature is an important task in academic full-text bibliometrics. This research attempts to establish a practical model for extracting conclusion sentences from academic literature. SVM and SciBERT models were trained and tested on academic papers published in JASIST from 2017 to 2020. The experimental results show that SciBERT is more suitable for extracting thesis conclusion sentences, with an optimal F1-value of 77.51%.

CCS CONCEPTS

Theory of computation~Theory and algorithms for application domains~Machine learning theory~Models of learning

KEYWORDS

SVM, BERT, Academic full text, Thesis research conclusion, Text mining, Deep learning

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION

Full-text data of academic literature contains both external characteristics and content characteristics. Since Garfield created the citation index, citation analysis based on the external characteristics of literature has been widely applied in many fields. However, due to the limitations of data and technology, earlier bibliometric studies had notable defects, including coarse statistical methods and limited indicator ability [1]. Today, increasingly rich full-text data and evolving machine learning and deep learning techniques allow researchers to investigate the content characteristics of academic literature in depth.

Entity extraction and sentence extraction are two important basic tasks of full-text bibliometric analysis. At the entity level, the relevant research mainly includes theory and method entity extraction [2], algorithm entity extraction [3], and software entity extraction [4]. At the sentence level, there are four main research directions: extraction of experimental result sentences, research question sentences, research conclusion sentences, and future work sentences. At present, there is more research on entity extraction and less on sentence extraction.

A research conclusion sentence is a sentence that contains a research conclusion. In academic full text, research conclusion sentences are divided into citation research conclusion sentences and thesis research conclusion sentences. Citation research conclusion sentences report experimental results and conclusions from cited work, such as 'Taylor's work shows that the special purpose syntactic parsers perform well on morphological descriptions.'. Thesis research conclusion sentences are the author's statements of his or her own research results, such as 'In this way, we extended earlier work to the case that the impact factor can have a value lower than one.'.

Automatically extracting thesis research conclusion sentences can promote the development of automatic summarization and originality evaluation of academic papers. Therefore, this research attempts to construct an automatic recognition model for thesis research conclusion sentences based on deep learning techniques.

2 CORPUS AND METHOD

2.1 Data Source and Data Annotation

This research obtained the full texts of all academic papers published in JASIST (Journal of the Association for Information Science and Technology) from 2017 to 2020 using a self-made Python program.

For data annotation, we first used Python's NLTK module to segment the full text of each paper into sentence units (a minimal sketch of this step follows). Then, 7 postgraduates majoring in information science manually annotated the sentences.
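The paper does not publish its preprocessing code; the following is a minimal sketch of the NLTK sentence segmentation described above, assuming each paper's full text is already available as a plain string. The function name and sample text are illustrative, not taken from the paper's own program.

```python
# A minimal sketch of the sentence-segmentation step, assuming each
# paper's full text is already available as a plain string.
import nltk

nltk.download("punkt")  # one-time download of the Punkt sentence model

def split_into_sentences(paper_text: str) -> list[str]:
    """Segment a paper's full text into sentence units with NLTK."""
    return nltk.sent_tokenize(paper_text)

sentences = split_into_sentences(
    "We propose a new model. It performs well on JASIST papers."
)
print(sentences)
# ['We propose a new model.', 'It performs well on JASIST papers.']
```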
For sentences whose labels were uncertain, the decision was made after group discussion, and the experimenters completed a final review. The discriminant criteria for a thesis research conclusion sentence are as follows: (1) Semantically, the sentence content is a summary of the author's own work experience, observations, or actual research results. (2) The content of the sentence can be a reasoning and qualitative interpretation of the experimental results, but it cannot be a straightforward description of the data of the experimental results.

Data imbalance, that is, an excessive gap between the numbers of positive and negative samples used to train a model, is one of the most widespread problems in contemporary machine learning [5]. After annotation was completed, thesis research conclusion sentences accounted for only 3% of the total corpus (more than 130 thousand sentences in total). To alleviate the data imbalance, we negatively sampled the non-conclusion sentences to raise the proportion of thesis research conclusion sentences to 8.9% (a sketch of this step follows Table 1). The basic information of the final corpus is shown in Table 1.

Table 1. Basic Information of the Corpus

Num.  Type                                           Count
1     Total articles                                 502
2     Total sentences                                54,479
3     Thesis research conclusion sentences           4,870
4     Average marked sentences per article           9.7
5     Average words per sentence                     27.99
6     Words in the longest sentence                  255
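The paper does not detail how the negative sampling was implemented; the following is a minimal sketch of random undersampling that would produce the reported positive proportion, assuming the annotated corpus is a list of (text, label) pairs. All names and the fixed seed are illustrative.

```python
# A minimal sketch of the negative-sampling (undersampling) step, assuming
# `sentences` is a list of (text, label) pairs with label 1 for thesis
# research conclusion sentences and 0 otherwise.
import random

def undersample_negatives(sentences, positive_ratio=0.089, seed=42):
    """Randomly drop negative samples until positives reach the target ratio."""
    positives = [s for s in sentences if s[1] == 1]
    negatives = [s for s in sentences if s[1] == 0]
    # Number of negatives that leaves positives at `positive_ratio` overall.
    n_keep = round(len(positives) * (1 - positive_ratio) / positive_ratio)
    rng = random.Random(seed)
    rng.shuffle(negatives)
    sampled = positives + negatives[:n_keep]
    rng.shuffle(sampled)
    return sampled
```

With 4,870 positive sentences, a target ratio of 8.9% keeps roughly 50 thousand negatives, consistent with the 54,479-sentence corpus in Table 1.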
2.2 Method

SVM and SciBERT are used in this research. SVM (support vector machine) is a classic model for text classification; in its simplest form, an SVM performs binary classification by finding the 'best' separating hyperplane between two linearly separable classes. SciBERT [6] is a deep learning model based on the BERT architecture [7], trained on the full-text corpus of 1.14 million scientific and technological documents. SciBERT uses the same configuration and size as BERT-Base [7], and it performs better than BERT-Base on natural language processing tasks in scientific literature.

3 EXPERIMENT

Before the formal experiment, we tested different hyperparameter combinations on a small part of the experimental corpus to explore the optimal settings for SVM and SciBERT. Considering the performance of the computer hardware used in the experiment, the final hyperparameters were set as follows. SciBERT (scibert-scivocab-uncased): maximum sequence length 256, batch size 64, learning rate 2e-5, 3 training epochs, case-insensitive. For SVM, the penalty parameter is set to 2, the kernel function is RBF, and TF-IDF is used to vectorize the text. The research uses a ten-fold cross-validation strategy, and model performance is measured by Precision, Recall, and F1-value; a sketch of the SVM pipeline and evaluation protocol follows.
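As an illustration of this setup, here is a minimal scikit-learn sketch of the SVM pipeline (TF-IDF features, RBF kernel, penalty C = 2, ten-fold cross-validation). The use of scikit-learn and any TF-IDF settings beyond the defaults are assumptions, since the paper names only the techniques and hyperparameters; `texts` and `labels` are assumed to hold the sampled corpus.

```python
# A minimal sketch of the SVM pipeline and ten-fold evaluation described
# above. TF-IDF settings beyond the defaults are not reported in the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def evaluate_svm(texts: list[str], labels: list[int]) -> dict:
    """Run ten-fold cross-validation and report Precision, Recall, F1."""
    # TF-IDF vectorization followed by an RBF-kernel SVM with penalty C = 2,
    # matching the hyperparameters reported in the paper.
    model = make_pipeline(TfidfVectorizer(), SVC(C=2, kernel="rbf"))
    scores = cross_validate(model, texts, labels, cv=10,
                            scoring=("precision", "recall", "f1"))
    return {metric: scores[f"test_{metric}"].mean()
            for metric in ("precision", "recall", "f1")}
```

Fine-tuning SciBERT would follow an analogous train/evaluate loop with the reported hyperparameters (scibert-scivocab-uncased, sequence length 256, batch size 64, learning rate 2e-5, 3 epochs).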
The results of the experiment are shown in Table 2.

Table 2. Results of 10-Fold Cross-Validation

Model         Precision   Recall    F1-Value
SciBERT  MAX  85.86%      78.61%    77.51%
         MIN  73.41%      44.13%    58.22%
         AVG  79.86%      64.51%    70.74%
SVM      MAX  98.19%      64.51%    77.03%
         MIN  90.37%      37.08%    53.80%
         AVG  95.97%      52.14%    67.24%

Table 2 shows that SVM has a high precision rate but a low recall rate when extracting thesis research conclusion sentences, while SciBERT's precision and recall are more balanced. In terms of average F1-value, SciBERT reached 70.74%, more than three percentage points higher than SVM. In summary, SciBERT performs relatively better.

Comparing the sentences extracted by the SciBERT model with the manually annotated sentences, the recognition errors discovered so far are as follows: (1) Recognizing sentences that describe figures as thesis research conclusion sentences. A possible reason is that such sentences normally begin with phrases such as "as shown in", and these words are also important features of thesis research conclusion sentences. (2) Recognizing research hypothesis sentences as thesis research conclusion sentences; by observation, thesis conclusion sentences are similar to hypothesis sentences in grammar and semantics. (3) Recognizing citation conclusion sentences without quotation marks as thesis research conclusion sentences, which indicates that some special words or symbols may affect the judgment of the model.

4 CONCLUSION & FUTURE WORK

This research provides a practical method for extracting thesis research conclusion sentences from academic literature and shows that SciBERT is superior to SVM for this task. This research uses a negative-sampling strategy to alleviate the problem of data imbalance and to enable faster model optimization, which may reduce the complexity of the negative samples. Therefore, data augmentation by adding more positive samples should be pursued in the future. In addition, the position of a sentence within the article needs to be considered to optimize the performance of the model. Finally, some extracted research conclusion sentences contain pronouns and do not have complete semantics when read alone; therefore, research on co-reference resolution should be carried out.

ACKNOWLEDGMENTS

The authors acknowledge the National Natural Science Foundation of China (Grant Number: 71974094) for financial support.

REFERENCES

[1] C. Lu, Y. Ding and C. Zhang, Understanding the impact change of a highly cited article: a content-based citation analysis, Scientometrics, vol. 112, pp. 927-945, 2017.
[2] H. Zhang and C. Zhang, Using Full-text Content of Academic Articles to Build a Methodology Taxonomy of Information Science in China, ArXiv, vol. abs/2101.07924, 2021.
[3] Y. Wang and C. Zhang, Using the full-text content of academic articles to identify and evaluate algorithm entities in the domain of natural language processing, Journal of Informetrics, vol. 14, 101091, 2020.
[4] X. Pan, E. Yan, Q. Wang, and W. Hua, Assessing the impact of software on science: A bootstrapped learning of software entities in full-text papers, Journal of Informetrics, vol. 9, pp. 860-871, 2015.
[5] M. Koziarski, Radial-Based Undersampling for imbalanced data classification, Pattern Recognition, vol. 102, 2020.
[6] I. Beltagy, A. Cohan and K. Lo, SciBERT: Pretrained Contextualized Embeddings for Scientific Text, ArXiv, vol. abs/1903.10676, 2019.
[7] J. Devlin, M.W. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2018.