=Paper=
{{Paper
|id=Vol-2871/paper5
|storemode=property
|title=Extraction of Sentences Describing Originality from Conclusion in Academic Papers
|pdfUrl=https://ceur-ws.org/Vol-2871/paper5.pdf
|volume=Vol-2871
|authors=Bolin Hua,Youngkug Shin
|dblpUrl=https://dblp.org/rec/conf/iconference/HuaS21
}}
==Extraction of Sentences Describing Originality from Conclusion in Academic Papers==
1st Workshop on AI + Informetrics - AII2021
Extraction of Sentences Describing Originality from
Conclusion in Academic Papers
Bolin Hua, YoungKug Shin
Department of Information Management, Peking University, China
Abstract. Citation analysis-based strategies such as SCI, impact factor, and h-index
reveal the influence of scientific papers, but they struggle to demonstrate
originality. With the advancement of text mining technology and deep learning
algorithms, it is feasible to extract the segments that illustrate originality (hereafter
“originality points”) from a paper and compare them with the originality
points in previous literature so as to assess the originality of a focal paper.
Extracting originality points is the first step in judging the
originality of a paper. Based on a summary of the writing conventions of
conclusion sections, this paper characterizes how sentences describing originality (SDO)
are expressed in conclusions, builds a vocabulary of guiding words for SDO,
and then uses rules to identify and extract SDO. In the experiment, we downloaded
the full text of artificial intelligence papers from arXiv; the
recognition accuracy and recall are 83.3% and 72.2%, respectively.
Keywords: academic literature, originality point recognition, originality feature
words, knowledge extraction, originality evaluation
1 Introduction
For decades, scientometricians have proposed many sophisticated measurements to
characterize the impact of scientific publications, such as the number of citations of a
specific publication (Bornmann 2008) and the impact factor of the journal in which the
paper is published (Garfield 1955). Yet these measures often fail to reflect the originality
and innovation of publications. Although later science-of-science researchers
employed citing relations to estimate these (e.g., Uzzi et al., 2013; Wang et al., 2020),
current practice mainly relies on peer review.
Text mining techniques can be employed for evaluating the originality of a paper,
which requires much less time and human effort compared to peer reviewing. The judg-
ment of the originality of a paper includes subjective and objective reviews. Subjective
reviews may come from the authors themselves (i.e., self-evaluation) or other scholars:
The former is embodied in the description of originality and research conclusion of a
paper, while the latter is mainly distributed in the citing content of citations. According
to whether the cited literature appears in the reference or the main body of the citing
literature, Ding and colleagues (2013) defined the “count one” and “count x” indices.
He et al. (2010) presented a prototype system, CiteSeerX, which aims to build a context-aware
citation recommendation system that recommends a set of high-quality citations for a paper.
Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).
Although measurements such as the number of citations, impact factor, and h-index
have been introduced to reflect the influence or popularity of research papers, they hardly
reflect originality. To detect the originality of a research work or paper,
current practice mainly relies on peer review. Peer review is subjective, however, and it
scales poorly to the considerable number of scientific papers that need evaluation. While
citation content analyses have been proposed to address this issue, most existing practices
have focused purely on the motivation and sentiment of citations rather than on
detecting the originality of a paper.
In the current paper, we address this gap and aim to develop methods for the
automatic identification of SDO (“originality points”) in scientific publications.
Extracting originality points is the first step in judging the originality of
a paper. This paper uses full-text data from arXiv for the experiment and studies the
recognition and extraction of SDO in the conclusion section of scientific
publications.
2 Related work
The expression of originality in academic literature is diverse, and originality may
appear in various parts of a paper in different forms. Therefore, it is necessary to
identify and extract SDO in academic literature. Current methods of extracting
information about originality in academic literature can be divided into rule-based
methods and machine learning-based methods.
2.1 Rule-based methods
The core idea of rule-based methods is to analyze the language features of originality
points and then either select sentence feature items for extraction or specify explicit
extraction rules. Kirschner et al. (2015) present the results of an annotation study that
focused on the fine-grained analysis of argumentation structures in scientific publications
by specifying four types of binary argumentative relations between sentences.
Zhang and Le (2014) proposed a method of extracting sentence-level originality from
scientific and technological literature based on the relationship between domain
vocabulary and the ontology. Wen (2019) proposed a method of semantic
recognition and classification. Specifically, he divided scientific and technological
abstracts into 6 categories according to syntax and semantic functions, and then
performed a statistical analysis of the distributions of categories and sentence positions,
sentence types, and sentence semantic positions. Li (2005) proposed an approach to
originality detection based on the identification of sentence-level patterns. Zhang et al. (2011)
addressed the problem of multilingual sentence categorization and originality mining.
2.2 Machine learning-based methods
With the substantial increase in computing power and the rapid expansion of data scale,
it has become possible to use deep learning methods in big data environments to
expand the semantics of text features and calculate content similarity. Computational
efficiency has also improved significantly, and scholars have adopted deep
learning methods for originality detection research. Markou and Singh (2003) reviewed, at a
theoretical level, various neural network methods (such as MLP, ART, and RBF) that can be
used for novelty detection. Amplayo et al. (2018) presented a network-based
method to detect the originality of a research paper, using an autoencoder neural
network as the originality detection model. Among the constructed networks,
keyword-level graph features exhibit the best performance using regression analysis as the
metric. Safder et al. (2020) proposed a set of methods to automatically identify and extract
algorithmic pseudo-codes and the sentences that convey related algorithmic metadata
using a set of machine-learning techniques.
These studies advance the extraction and evaluation of originality in papers, but some
shortcomings remain:
1. Machine learning-based methods usually need labeled training data, but there is
no corresponding dataset on the originality of papers;
2. Rule-based methods target small amounts of data, and designing a method that
processes a large amount of data is quite challenging;
3. Most existing studies focus on the abstract of the paper, with only a few exploring
the full text of scientific publications.
To tackle these problems, we design a method that combines rules and statistics to extract
SDO from the conclusion section. The method finds features through statistical
analysis of how originality points are described and then transforms these features
into regular expressions, which avoids the large-scale annotated data required by
machine learning.
3 Methodology
3.1 Research framework
Our technical framework for extracting SDO from academic literature
includes three modules, namely data preparation, text processing, and SDO
extraction, as shown in Figure 1.
Fig. 1. Workflow of SDO recognition.
The main processing steps are as follows:
1. Using a web crawler, we obtained the full text of the papers from arXiv.
2. We converted the papers from PDF to TXT.
3. We summarized the characteristics of conclusion sections in the literature in order
to extract them.
4. We split each conclusion section into sentences according to the characteristics of
the text or the "full stop" character.
5. We processed the sentence set with word segmentation, stemming, stop-word removal,
part-of-speech tagging, and synonym merging.
6. In the SDO recognition module, we collected and organized the words
that characterize SDO through literature research, word frequency analysis,
a domain dictionary, and literature keyword collection.
7. We labeled the originality-related words and serialized the sentences of the conclusion
section according to the originality guide vocabulary.
8. Based on the serialized sentences, we constructed SDO extraction rules
and implemented them as regular expressions.
9. We applied the rules to extract SDO from the conclusion of each paper.
Steps 1 and 2 belong to the data acquisition module, steps 3, 4, and 5 constitute the
data preprocessing module, and steps 6, 7, 8, and 9 belong to the extraction
module.
3.2 Data collection and processing
3.2.1 Document format conversion and preprocessing
To extract SDO, the PDF documents must first be converted
into TXT format. In practice, we adopted the pdfminer3k library in Python, an open-source
package that converts PDF files into manageable TXT or Microsoft Word documents.
Parsing a PDF into a TXT document often introduces
issues such as missing paragraph marks, lost
first-line indentation, and words forcibly broken across lines. Therefore, we combined
line breaks, punctuation marks, hyphenation symbols, and sentence
length to identify the paragraphs in the TXT output.
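The paragraph reconstruction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `rebuild_paragraphs` and the `min_line_len` threshold are assumptions, and the heuristic (a short line ending in sentence punctuation closes a paragraph) is only one plausible way to combine line breaks, punctuation, hyphenation, and line length.

```python
import re

def rebuild_paragraphs(raw_text, min_line_len=60):
    """Rejoin hard-wrapped lines from PDF-to-TXT output into paragraphs.

    Hypothetical heuristic: a blank line, or a short line that ends in
    sentence punctuation, is treated as the end of a paragraph.
    """
    paragraphs, current = [], ""
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:                       # blank line: explicit paragraph break
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if current.endswith("-"):          # forced word break: glue the halves
            # Note: this also merges genuine hyphenated compounds split at a line end.
            current = current[:-1] + line
        elif current:
            current += " " + line
        else:
            current = line
        # Short line ending in ., ! or ? is assumed to close the paragraph.
        if len(line) < min_line_len and re.search(r"[.!?]$", line):
            paragraphs.append(current)
            current = ""
    if current:
        paragraphs.append(current)
    return paragraphs
```

Calling `rebuild_paragraphs` on text containing `"tech-"` at the end of one line and `"niques"` at the start of the next yields the rejoined word `"techniques"` inside a single paragraph string.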
After extracting the conclusion section from each paper, we used the
spaCy natural language processing package for word segmentation, part-of-speech
tagging, stemming, and stop-word processing. To improve the accuracy of
word segmentation, we introduced a keyword list and a domain glossary before
segmentation and used bi-gram and tri-gram methods to identify phrases in the
literature.
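As a rough illustration of n-gram based phrase identification, the sketch below counts bi-grams or tri-grams and merges frequent ones into single tokens. The function names, the `min_count` threshold, and the underscore-joined output are assumptions made for this example; the paper does not specify how phrases were scored or merged.

```python
from collections import Counter

def frequent_ngrams(tokens, n, min_count=2):
    """Return the n-grams (as tuples) that occur at least min_count times."""
    grams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(grams)
    return {gram: c for gram, c in counts.items() if c >= min_count}

def merge_phrases(tokens, phrases):
    """Greedily merge known phrases (longest first) into single tokens."""
    lengths = sorted({len(p) for p in phrases}, reverse=True)
    merged, i = [], 0
    while i < len(tokens):
        for n in lengths:
            gram = tuple(tokens[i:i + n])
            if len(gram) == n and gram in phrases:
                merged.append("_".join(gram))
                i += n
                break
        else:                              # no phrase starts at position i
            merged.append(tokens[i])
            i += 1
    return merged
```

With this sketch, a term such as "transfer learning" that recurs in the token stream becomes the single token `transfer_learning`, so that later labeling treats it as one subject term.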
3.2.2 Extracting conclusions from academic literature
We recognized the conclusion or summary based on the chapter title and then
split the text into sentences according to sentence length and punctuation. After
this, we divided sentences into words using professional dictionaries, a keyword
vocabulary, N-grams, and other word segmentation methods, and finally generated a
dataset in sentence units. The structure and function of most academic texts can be
identified by chapter titles. For example, "Introduction" and "Introduction and Motivation"
can be directly judged as the introduction; "Related Work" and "Context of this
Research" can be directly judged as related research. Because the conclusion section is
titled in different ways across the literature, we manually screened the chapter titles
of the experimental data and derived the characteristic vocabulary of conclusion
chapter titles shown in Table 1.
Table 1. Characteristic vocabulary of the title of the conclusion chapter
Chapter: Conclusion
Chapter title featured vocabulary: Conclusion, conclusions, discussion, summary, future, perspective, limitations, outlook, work, directions, results, concluding, remarks, suggestions, recommendations, comments, discussions
Chapter end featured vocabulary: Acknowledgement[s], acknowledge, reference[s], \n\n
Based on the starting and ending feature vocabularies above,
we constructed the conclusion chapter extraction rules. Applying these rules to the
experimental data yielded 18,563 conclusions, of which 17,653 were retained after
manual inspection.
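A conclusion extractor built on start and end vocabularies of this kind might look like the sketch below. It is an assumption-laden illustration: the word lists are only a subset of Table 1, the heading pattern (optional section number, a start word, at most 40 trailing characters) is hypothetical, and the choice to use the last matching heading is a design guess rather than the authors' documented rule.

```python
import re

# Subset of the Table 1 vocabulary, written as regex fragments (assumption).
START_WORDS = ["conclusions?", "concluding", "summary", "discussions?"]
END_WORDS = ["acknowledgements?", "acknowledge", "references?"]

# Hypothetical heading shape: optional section number, then a feature word.
HEADING_PREFIX = r"^[ \t]*(?:\d+(?:\.\d+)*\.?[ \t]+)?"
START_RE = re.compile(
    HEADING_PREFIX + r"(?:" + "|".join(START_WORDS) + r")\b[^\n]{0,40}$",
    re.IGNORECASE | re.MULTILINE,
)
# The section ends at an end heading or at a blank line (the "\n\n" marker).
END_RE = re.compile(
    HEADING_PREFIX + r"(?:" + "|".join(END_WORDS) + r")\b[^\n]{0,40}$|\n[ \t]*\n",
    re.IGNORECASE | re.MULTILINE,
)

def extract_conclusion(text):
    """Return the body of the last conclusion-like section, or None."""
    starts = list(START_RE.finditer(text))
    if not starts:
        return None
    body_start = starts[-1].end()
    # Skip the whitespace between the heading and the first body line.
    while body_start < len(text) and text[body_start].isspace():
        body_start += 1
    end = END_RE.search(text, body_start)
    body_end = end.start() if end else len(text)
    return text[body_start:body_end].strip()
```

Because short body lines can also begin with words such as "Summary", rules like these over-generate, which is consistent with the manual inspection step reported above.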
3.3 SDO of the paper extraction
3.3.1 Constructing a dictionary of originative guiding words
The originality of academic literature is mainly reflected in two aspects: characteristic
words (guiding words) and common expressions. Given the linguistic characteristics
and style of scientific literature, rule-based extraction methods can
accurately identify the "knowledge claims" in papers. Approximately 95% of originality
points in papers are guided by characteristic words (Wen, 2014). Therefore, this paper
combines previous research results, domain keywords, a domain terminology database,
and word frequency statistics to obtain, through manual screening, a preliminary
vocabulary list of originality point feature guiding words.
We then use WordNet to expand synonyms and select the final originative feature word
sets.
The guiding words of originality in this paper are selected mainly on the basis of
the usefulness, originality, enlightenment, scientificity, and other elements
that scholars use to define scientific originality. Drawing on
the research results of Dahl (2009), Trine (2008), and Parkinson (2011), we selected
originative linguistic feature guiding words and divided them into the following types:
referring to the author, referring to the article, iconic verbs, iconic nouns, and iconic
adjectives. Since the subject terms of a field reveal its research focus, and the
content of originality is closely related to the research subject, we also introduced
the field glossary and the keywords of the literature as the subject terms of the
literature collection when constructing the originality guide vocabulary.
In addition, the word frequencies of the conclusion text showed that most
of the originative guiding words fall in the high-frequency range. Therefore,
we computed word frequencies for nouns and verbs and selected originative
feature words from them to construct the originative guiding vocabulary table.
After initially identifying the originative feature words, we used WordNet to expand
synonyms and obtain the final originative feature guiding words.
According to their linguistic features, the originative feature guiding words are
divided into 6 categories. The final originality point feature guiding
words are shown in Table 2.
Table 2. Guiding words of originality features
Type                     Marking symbol   Word examples
Refers to authors        RF               I, We, Our
Refers to papers         TP               [In this|this|our|the] [paper|article|study|work|]
Verb                     VB               Use, show, propose, provide, present, improve, observe, describe, investigate, prove, define, obtain, represent, design, aim, address, find, analyze, illustrate, conduct, appear, try, drive, and so on
Nouns                    NN               problem, method, approach, work, result, performance, experiment, finding, insight, notion, and so on
Adjective                AD               new, novel, unused, caused, resulting, considered, known, observed, predicted, and so on
Keywords/subject terms   TW               algorithm, data, information, framework, knowledge, Acoustic, Bayesian network, beam search, CNN, RNN, LSTM, ontology, optimization, cluster, bi-lstm, classifier, crf, dnn, deep q-learning, embedding, robotic, transfer learning, recommender system, and so on
3.3.2 Identification of SDO of the papers
Recognition rules are constructed based on the relationship between the domain thesaurus
and the ontology, and a redundancy measure based on the degree of overlap
of subject words is used to filter the candidate set of originality points (Zhang and Le,
2014). The words in each sentence are labeled with the marking symbols in
Table 2, and the symbols are then extracted to form
a space-separated sequence of labels (in the example in Figure
2, the labeling sequence of the sentence is: TP VB TW TW NN TW).
Fig. 2. An example of labeling originative feature guiding words
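The labeling step can be illustrated with a small dictionary lookup. The sketch below is hypothetical: `LEXICON` holds only a few words per category from Table 2, the input is assumed to be a list of lemmas already produced by preprocessing, and words outside the lexicon are simply dropped. Table 2 lists "work" under both TP and NN; this sketch sidesteps that ambiguity by omitting it.

```python
# A few example words per marking symbol, taken from Table 2 (subset only).
LEXICON = {
    "RF": {"i", "we", "our"},
    "TP": {"paper", "article", "study"},
    "VB": {"use", "show", "propose", "present", "improve"},
    "NN": {"problem", "method", "approach", "result"},
    "AD": {"new", "novel"},
    "TW": {"algorithm", "framework", "lstm", "classifier"},
}
# Invert the lexicon so each word maps to its marking symbol.
WORD_TO_LABEL = {word: label for label, words in LEXICON.items() for word in words}

def label_sequence(lemmas):
    """Map lemmas to marking symbols, dropping words that are not guiding words.

    Every symbol is followed by one space, so the result can be matched
    against rules written with tokens such as "(VB )".
    """
    labels = [WORD_TO_LABEL[w.lower()] for w in lemmas if w.lower() in WORD_TO_LABEL]
    return "".join(label + " " for label in labels)
```

For the lemmas of "We propose a novel LSTM classifier", this produces the sequence `RF VB AD TW TW ` (with a trailing space).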
When constructing rules, we consider the labeling sequence and structure of SDO as a whole,
including the positions of the different types of clue words and their combinations.
We also set limited (bounded) matching for some parts of the rules.
The final rule, written as a regular expression, is as follows:
((RF )|(NN )|(TW )|(AD )|(TP )){0,3}(RF )((TP )|(AD )|(TW )|(NN )){0,3}(VB )((AD ){0,1}(TW ){0,6}(NN ){0,2}){0,3}
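The rule above can be applied to a label sequence with Python's `re` module, as in the following sketch. The pattern is copied from the text; whether the authors matched it anywhere in the sequence (as `search` does here) or required a full match is not stated, so that choice is an assumption, as is the function name.

```python
import re

# The rule from the text, split over two string literals for readability.
SDO_RULE = re.compile(
    r"((RF )|(NN )|(TW )|(AD )|(TP )){0,3}(RF )((TP )|(AD )|(TW )|(NN )){0,3}(VB )"
    r"((AD ){0,1}(TW ){0,6}(NN ){0,2}){0,3}"
)

def matches_sdo_rule(label_seq):
    """Return True if the rule matches somewhere in the label sequence.

    label_seq is expected to carry one trailing space per symbol,
    for example "RF VB AD TW TW ".
    """
    return SDO_RULE.search(label_seq) is not None
```

Read literally, the rule requires an author reference (RF) followed, after at most three other symbols, by a verb (VB), so `matches_sdo_rule("RF VB AD TW TW ")` is true while a sequence with no RF symbol is rejected.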
4 Evaluation and Results
4.1 Experiment Data
We used a web crawler to obtain publications in the field of "artificial intelligence"
under CS (computer science) on arXiv, which are labelled “cs.AI”. These
experimental data were used for the extraction of SDO. We collected basic
information about the papers, including title, author, publication time, and URL (document
PDF location). The requests library was then used to fetch each URL, finally
yielding 22,213 academic papers in PDF format. Figure 3 shows the annual distribution
of the literature in the field of artificial intelligence.
Fig. 3. Number of documents issued per year from 1993 to 2019
4.2 Experimental results
We randomly chose 200 papers from the collected documents and manually annotated the
sentences in their conclusion chapters, obtaining 346 SDO out of 1,227 sentences.
To test the performance of the SDO recognition rules constructed in this paper,
we verify the recognition results with the accuracy and recall measures used in
information retrieval. The results are shown in Table 3.
Table 3. Results of rules on experimental data
         Accuracy   Recall   F1-score
Rules    0.833      0.722    0.698
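For readers who want to reproduce this kind of evaluation, the sketch below computes precision, recall, and F1 from raw counts. It assumes that the "accuracy" reported in Table 3 corresponds to what information retrieval calls precision; the function name and the counts used in the usage note are purely illustrative and are not taken from the paper.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall and F1 from sentence-level counts.

    true_positives:  sentences extracted by the rules and annotated as SDO
    false_positives: sentences extracted by the rules but not annotated as SDO
    false_negatives: annotated SDO that the rules missed
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

With made-up counts of 8 true positives, 2 false positives, and 2 false negatives, all three measures equal 0.8.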
According to Table 3, the originality point identification rules constructed in this paper
achieve an accuracy of 83.3% on the conclusion section. We then applied the
recognition rules to the experimental dataset. A total of 14,234 sentences
matched the rules and are used as input for originality object and topic mining.
Part of the extracted results is shown in Figure 4.
Fig. 4. Results of innovative sentence extraction in conclusion chapter
We carried out a qualitative analysis of the content of SDO by examining
approximately 100 randomly selected papers and summarized the SDO patterns commonly
used in the conclusion chapter. Part of the results is shown in Table 4.
Table 4. Examples of SDO patterns in the conclusion section.
Type                  SDO pattern
New method class      [This paper|We] [propose|introduce|present|develop] a [new|first|novel] [model|solution|algorithm|method……] that ……
Methodology           We [presented|introduced] a methodology [for|to] ……
                      We demonstrated …… methodology for ……
Concept/Viewpoint     In this paper we have [redefined|defined|proposed] the [notion|concept] of ……
                      The [concept|notion] of …… is defined ……
Proof class           [This paper|We] [demonstrate|prove] that
Problem class         We considered …… problem ……
Application class     In this paper, we [shown|studied] the application of ……
From Table 4, we can see that there are mainly seven kinds of descriptions of the
originality of a paper: new method class, methodology class, concept class,
viewpoint class, proof class, problem class, and application class. Among them, the first
two describe method originality, the last two refer to application originality, and
the middle three belong to theory originality. We will make a detailed analysis of the
theme, object, and pattern of SDO in what follows.
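Surface patterns of the kind listed in Table 4 translate naturally into regular expressions over raw sentences. The sketch below encodes only the "new method class" pattern; the optional third-person "s" on the verb, the case-insensitive matching, and the function name are assumptions added for illustration.

```python
import re

# "[This paper|We] [propose|introduce|present|develop] a
#  [new|first|novel] [model|solution|algorithm|method]" from Table 4.
NEW_METHOD_RE = re.compile(
    r"\b(?:this paper|we)\s+"
    r"(?:propose|introduce|present|develop)s?\s+"   # optional "s": assumption
    r"a\s+(?:new|first|novel)\s+"
    r"(?:model|solution|algorithm|method)\b",
    re.IGNORECASE,
)

def is_new_method_sentence(sentence):
    """Return True if the sentence matches the 'new method class' pattern."""
    return NEW_METHOD_RE.search(sentence) is not None
```

A sentence such as "We propose a novel algorithm that scales to large graphs." matches, whereas "We evaluate an existing model." does not.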
4.3 Analysis of experimental results
We extracted SDO from the conclusion chapters according to the SDO recognition
rules. High-frequency innovation objects and subject terms are analyzed
below.
4.3.1 Analysis of core nodes in SDO
Based on the results of dependency syntax analysis, we counted the core nodes (ROOT)
of the SDO; their proportions are shown
in Table 5. In the SDO from the conclusion chapter, the words present, propose, and
introduce account for about 24%, 22%, and 12%, respectively, which together amount to
more than 50% of all core nodes, while each of the remaining core words accounts for a
relatively small proportion. This shows that in the conclusion chapter, researchers
mainly use these words to summarize or introduce the main points and originality of
the article.
Table 5. Proportion of core nodes in SDO (%)
Core node in SDO   Proportion   Core node in SDO   Proportion
present            23.937844    develop            2.871755
propose            22.324941    provide            2.517703
introduce          11.801731    demonstrate        1.848938
show               3.599528     investigate        1.809599
study              3.127459     consider           1.298190
describe           3.029111     ……                 ……
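Proportions like those in Table 5 can be computed with a simple frequency count once the core node of each SDO is known. The sketch below assumes the ROOT lemmas have already been extracted by a dependency parser (for example, tokens whose dependency label is "ROOT" in spaCy); the function name and the sample input are hypothetical.

```python
from collections import Counter

def proportions(items):
    """Return each item's share of the total, in percent, most frequent first."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: 100.0 * count / total for item, count in counts.most_common()}
```

For a toy input of `["present", "present", "propose", "introduce"]`, "present" receives 50.0 and the other two words 25.0 each. The same function can be applied to the direct objects of the core nodes to obtain innovation object proportions.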
4.3.2 Analysis of Innovation Objects
We take the direct object of the core node as the innovation object of an SDO
and count the frequency of each innovation object. The proportions of innovation
objects are shown in Table 6.
Among the innovation objects, approach, method, way, and
other method-related words have relatively high proportions, which suggests that
method innovation is a key research direction in the field of artificial intelligence.
However, methodology accounts for only about 0.6% of the total, which
shows that there is relatively little research on methodological innovation.
Table 6. Proportion of innovation objects in SDO (%)
Innovation object in SDO   Proportion   Innovation object in SDO   Proportion
approach                   8.477577     methodology                0.609756
framework                  7.317073     concept                    0.590087
method                     6.805665     solution                   0.550747
model                      4.956727     notion                     0.531078
algorithm                  4.661684     scheme                     0.531078
problem                    4.346971     dataset                    0.49173
system                     2.163651     mechanism                  0.472069
architecture               1.593234     application                0.432730
technique                  1.258851     strategy                   0.432730
network                    1.121164     information                0.393391
performance                0.983478     complexity                 0.373721
way                        0.668765     ……                         ……
In addition to methodological innovation, the word frequency proportions indicate that
framework, model, algorithm, system, problem, architecture, network, concept,
and dataset are key innovation objects in the field of artificial intelligence.
In short, the above experimental results show that research in the field of artificial
intelligence focuses mainly on method innovation (approach, method, etc.) and on
specific application innovation (model, algorithm, application, etc.), while there is less
innovation in the theory itself (methodology, idea, theory, etc.).
5 Discussion and Conclusion
Using arXiv scientific publications, this paper constructs recognition rules
for SDO based on originative guiding words, aiming to recognize the
sentence-level originality points of academic literature. SDO recognition
experiments on arXiv literature on artificial intelligence topics show
that the proposed method is effective at extracting SDO from papers.
Once SDO are obtained, papers can be evaluated by comparing their SDO
with those of other papers.
The results of this paper show that the constructed method is feasible and effective
for identifying and mining sentence-level originality points. Yet, as this is a
research-in-progress paper, several limitations remain, and we plan to carry out
the following related studies in the future:
1. Although the SDO recognition rules constructed in this article are effective
at recognizing SDO, the rules, which are formulated and maintained manually,
cannot cover all papers. Therefore, to improve the accuracy of
SDO recognition, we will use the labeled sequences as
training data and apply machine learning methods, converting the extraction problem
into a classification problem.
2. The current paper only identifies sentences that reflect the originative content of a
paper. Next, the SDO will be analyzed and mined, and the originative
objects, topics, and specific methods will be extracted.
3. In this paper, we only extract the descriptions of originality in the conclusion
section. In future research, we will extract information describing originality from the
research objectives, related work, and methodology sections of papers.
6 Acknowledgements
This work was supported in part by The National Social Science Foundation of
China (Number: 17BTQ066).
References
1. Amplayo R.K., Hong S.L., Song M. (2018). Network-based approach to detect novelty of
scholarly literature. Information Sciences, 422, 542-557.
2. Dahl T. (2009). The Linguistic Representation of Rhetorical Function: A Study of How
Economists Present Their Knowledge Claims. Written Communication, 29(4), 370-391.
3. Ding Y., Liu X.Z., Guo C., Cronin B. (2013). The distribution of references across texts: Some
implications for citation analysis. Journal of Informetrics, 7(3), 583-592.
4. Garfield E. (1955). Citation indexes for science: A new dimension in documentation through
association of ideas. Science, 122(3159), 108-111.
5. He Q., Pei J., Kifer D., et al. (2010). Context-aware citation recommendation. In Proceedings
of the 19th international conference on World wide web (WWW '10). Association for
Computing Machinery, New York, NY, USA, 421–430.
6. Kirschner C., Eckle-Kohler J., Gurevych I. (2015). Linking the Thoughts: Analysis of
Argumentation Structures in Scientific Publications. 2nd Workshop on Argumentation Mining
(ARG-MINING 2015), Denver, Colorado, USA, June 4.
7. Markou M., Singh S. (2003). Novelty detection: A review-part 2: Neural network based
approaches. Signal Processing, 83(12), 2499–2521.
8. Parkinson J. (2011). The Discussion section as argument: The language used to prove
knowledge claims. English for Specific Purposes, 30(3), 164-175.
9. Safder I., Hassan S.U., Visvizi A., Noraset T., Tuarob S. (2020). Deep Learning-based
Extraction of Algorithmic Metadata in Full-Text Scholarly Documents. Information
Processing and Management, 57, 102269.
10. Shibayama S., Wang J. (2020). Measuring originality in science. Scientometrics, 122, 409-427.
11. Trine D. (2008). Contributing to the Academic Conversation: A Study of New Knowledge
Claims in Economics and Linguistics. Journal of Pragmatics, 40(7), 1184-1201.
12. Uzzi B., Mukherjee S., Stringer M., Jones B. (2013). Atypical Combinations and Scientific
Impact. Science, 342, 468-472.
13. Wen H. (2019). Semantic Recognition and Classification Method of originality Points in
Scientific and Technological. Journal of The China Society for Scientific and Technical
Information, 38(3), 249-256.
14. Wen Y.K., Wu G.Y. (2014). Dynamic Mining of Fragmented Scientific Research originality
Points. Digital Library Forum, 7, 25-32.
15. Zhang F., Le X.Q. (2014). Research on originality Points Extraction from Scientific
Research Paper Based on Field Thesaurus. New Technology of Library and Information
Service, (9), 15-21.
16. Zhang Y., Tsai F.S., Kwee A.T. (2011). Multilingual sentence categorization and novelty
mining. Information Processing and Management, 47, 667–675.