NJU@CL-SciSumm-19. Hyonil Kim, Shiyan Ou. CEUR-WS Vol-2414, paper 25: https://ceur-ws.org/Vol-2414/paper25.pdf
                            NJU@CL-SciSumm-19

            Hyonil Kim1,3[0000-0002-6380-9507] and Shiyan Ou1,2 [0000-0001-8617-6987]
          1 School of Information Management, Nanjing University, Nanjing, China
                                2 oushiyan@nju.edu.cn
                                  3 kimhyonil@126.com




       Abstract. Cited text identification is helpful for meaningful scientific literature
       summarization. In this paper, we introduce our system submitted to the CL-
       SciSumm 2019 Shared Task 1A. Our system has two stages: similarity-based
       ranking and supervised listwise ranking. First, we select the top-5 sentences per
       citation text according to a modified Jaccard similarity. Second, these top-5 se-
       lected sentences are re-ranked by CiteListNet, a listwise ranking model based
       on deep learning. Our experiments showed that the proposed method outper-
       formed prior methods on the CL-SciSumm 2017 test dataset.

       Keywords: Cited Text Identification, Cited Text, Listwise Ranking, Citation
       Content Analysis, Text Similarity.


1      INTRODUCTION

Automatic summarization of academic papers can be an effective way to alleviate
researchers' information overload and to help them grasp the state of the art of a
research topic.
    The CL-SciSumm shared tasks explore solutions for making a comprehensible
summary of an academic paper given its citation texts. These tasks focus on sentence-
level cited text to perform the summarization of a paper. To this end, the cited text
must first be identified. CL-SciSumm Shared Task 1A is to identify the spans of cited
text in a reference paper (RP) that correspond to a given citation text from its citing
paper (CP).
    In this paper, we use various similarity metrics to evaluate the similarity between
a citation text and each candidate cited sentence, and adopt a listwise ranking algo-
rithm to train our ranking model.
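
As a sketch of the similarity step, plain token-level Jaccard similarity between a citation sentence and a candidate cited sentence could be computed as below. This is a minimal illustration only: the whitespace tokenizer is a simplifying assumption, and the exact modification of Jaccard similarity used in our system is not shown here.

```python
import re

def tokenize(text):
    # Lowercase and split on non-alphanumeric characters (simplifying assumption).
    return set(t for t in re.split(r"\W+", text.lower()) if t)

def jaccard(citation, candidate):
    # |A ∩ B| / |A ∪ B| over the two sentences' token sets.
    a, b = tokenize(citation), tokenize(candidate)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def top_k(citation, sentences, k=5):
    # Rank all candidate sentences of the reference paper and keep the top k.
    ranked = sorted(sentences, key=lambda s: jaccard(citation, s), reverse=True)
    return ranked[:k]
```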


2      RELATED WORK

Most previous studies regarded the identification of cited texts as a classification
problem and thus used machine learning algorithms such as SVM, Random Forest and
CNN to train text classifiers. To build the classifiers, various features were explored by
researchers. Ma et al. (2018) chose Jaccard similarity, cosine similarity and some posi-
tion information as features, and trained four classifiers including Decision Tree, Lo-
gistic Regression and SVM.[9] Finally, they used a weighted voting method to combine
the categorization results of the four classifiers and achieved the best performance in
the CL-SciSumm 2017 competition. Yeh et al. (2017) combined lexical features,
knowledge-based features, corpus-based features, syntactic features and surface features
into a feature vector and adopted a majority voting method to combine the re-
sults of six classifiers: KNN, Decision Tree, Logistic Regression, Naive Bayes,
SVM and Random Forest. They obtained an F-score of 14.9% by running their system on
the corpus of the CL-SciSumm 2016 competition.[3]
   There are two main issues with the classification-based methods: local ranking and
class-imbalanced data. On the one hand, cited text identification should be regarded
as a ranking problem rather than a classification one, because we only intend to
choose the sentence(s) whose content is most similar to the citation sentence(s)
compared with the other sentences. On the other hand, only a few sentences (usually
no more than five) in a target paper are cited sentences. Sometimes the ratio of
negative to positive samples in a corpus even exceeds 150:1. Ma et al. (2018) used the
Nearest Neighbor (NN) rule (Wilson, 1972) to reduce the data imbalance and in-
creased their F1-score from 11.8% to 12.5%.[9]
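
The NN-rule editing cited above can be sketched as follows. This is our reading of Wilson's (1972) edited nearest-neighbour rule using plain Euclidean distance over feature vectors, not the exact configuration used by Ma et al.:

```python
import math
from collections import Counter

def wilson_edit(X, y, k=3):
    # Wilson's edited nearest-neighbour rule (sketch): drop every sample
    # whose k nearest neighbours (by Euclidean distance) mostly carry a
    # different class label. This trims noisy majority-class samples.
    def dist(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

    kept = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: dist(xi, X[j]),
        )[:k]
        vote = Counter(y[j] for j in neighbours).most_common(1)[0][0]
        if vote == yi:
            kept.append(i)
    return [X[i] for i in kept], [y[i] for i in kept]
```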
   With respect to ranking-based cited text identification, only a few studies have been
done. Dipankar et al. (2017) ranked the sentences in a target paper according to the
cosine similarity between each candidate sentence and the citation sentences, and
selected the top five sentences as the cited sentences.[2] However, this unsupervised
method did not achieve reasonable performance. Therefore, we propose a listwise
ranking method for identifying cited sentences, which is a supervised method trained
with a deep learning mechanism.
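
The cosine-similarity baseline described above might be sketched as follows, assuming a simple term-frequency bag-of-words representation; the actual preprocessing used by Dipankar et al. is not specified here.

```python
import math
import re
from collections import Counter

def tf_vector(text):
    # Term-frequency vector over lowercase word tokens.
    return Counter(t for t in re.split(r"\W+", text.lower()) if t)

def cosine(u, v):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_by_cosine(citation, sentences, top_n=5):
    # Rank candidate sentences by cosine similarity to the citation text
    # and return the top_n as the predicted cited sentences.
    cit = tf_vector(citation)
    scored = sorted(sentences, key=lambda s: cosine(cit, tf_vector(s)), reverse=True)
    return scored[:top_n]
```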


3      Methodology

In this study, we regarded cited text identification as a ranking problem and proposed
a ranking-based method based on deep learning to identify cited sentences. This
method comprises two ranking stages: a similarity-based unsupervised ranking and a
supervised listwise ranking. Since a cited sentence was deemed to contain content
more similar to the citation text than the other sentences in the same paper, we first
ranked all the sentences in a reference paper according to each sentence's similarity
with the citation text. We then chose the top-K sentences to create a subset of the
given training corpus for the second-stage ranking, where K was set to the value that
achieved the best F-score on the training corpus. In the second stage, a listwise
ranking model was trained on the
subset training corpus to rank the K sentences and then top N sentences (N