NJU@CL-SciSumm-19

Hyonil Kim [0000-0002-6380-9507] and Shiyan Ou [0000-0001-8617-6987]

School of Information Management, Nanjing University, Nanjing, China
oushiyan@nju.edu.cn, kimhyonil@126.com

Abstract. Cited text identification is helpful for meaningful scientific literature summarization. In this paper, we introduce our system submitted to CL-SciSumm 2019 Shared Task 1A. The system has two stages: similarity-based ranking and supervised listwise ranking. First, we select the top-5 sentences for each citation text according to a modified Jaccard similarity. Second, these top-5 selected sentences are re-ranked by CiteListNet, a listwise ranking model based on deep learning. Our experiments showed that the proposed method outperformed prior methods on the CL-SciSumm 2017 test dataset.

Keywords: Cited Text Identification, Cited Text, Listwise Ranking, Citation Content Analysis, Text Similarity.

1 INTRODUCTION

Automatic summarization of academic papers is an effective way to reduce researchers' information overload and to help them grasp the state of the art of a research topic.

The CL-SciSumm shared tasks explore solutions for producing a comprehensible summary of an academic paper from its citation texts. These tasks focus on sentence-level cited text to summarize a paper, which first requires the cited text to be identified. CL-SciSumm Shared Task 1A is to identify the spans of cited text in a reference paper (RP) that correspond to a given citation text in its citing paper (CP).

In this paper, we use various similarity metrics to measure the similarity between a citation text and each candidate cited sentence, and adopt a listwise ranking algorithm to train our ranking model.

2 RELATED WORK

Most previous studies regarded the identification of cited text as a classification problem and used machine learning algorithms such as SVM, Random Forest and CNN to train text classifiers. To build the classifiers, researchers explored various features. Ma et al. (2018) chose Jaccard similarity, cosine similarity and some position information as features, and trained four classifiers including Decision Tree, Logistic Regression and SVM [9]. Finally, they used a weighted voting method to combine the classification results of the four classifiers and achieved the best performance in the CL-SciSumm 2017 competition. Yeh et al. (2017) used lexical, knowledge-based, corpus-based, syntactic and surface features to represent each sentence, and adopted a majority voting method to combine the results of six classifiers: KNN, Decision Tree, Logistic Regression, Naive Bayes, SVM and Random Forest. They obtained an F-score of 14.9% on the corpus of the CL-SciSumm 2016 competition [3].

There are two main issues with these classification-based methods: they make only local, per-sentence decisions rather than ranking decisions, and they suffer from class-imbalanced data. On the one hand, cited text identification should be regarded as a ranking problem rather than a classification problem, because we only intend to choose the sentence(s) whose content is most similar to the citation sentence(s) compared with the other sentences. On the other hand, only a few sentences (usually no more than five) in a target paper are cited sentences; sometimes the ratio of negative to positive samples in a corpus is even greater than 150.
Ma et al. (2018) used the Nearest Neighbor (NN) rule (Wilson, 1972) to reduce data imbalance and thereby increased the F1-score from 11.8% to 12.5% [9].

With respect to ranking-based cited text identification, only a few studies have been done. Dipankar et al. (2017) ranked the sentences in a target paper according to the cosine similarity between each candidate sentence and the citation sentences, and selected the top five sentences as the cited sentences [2]. However, this unsupervised method did not achieve reasonable performance. We therefore propose a listwise ranking method for identifying cited sentences, which is a supervised method trained with a deep learning mechanism.

3 METHODOLOGY

In this study, we regarded cited text identification as a ranking problem and proposed a ranking-based method for identifying cited sentences based on deep learning. The method consists of two ranking stages: a similarity-based unsupervised ranking and a supervised listwise ranking.

Since the cited text was assumed to be more similar in content to the citation text than the other sentences in the same paper, we first ranked all the sentences in a reference paper according to each sentence's similarity to the citation text. We then kept the top K sentences to create a subset of the given training corpus for the second-stage ranking, where K was set to the cut-off that obtained the best F-score on the given training corpus. In the second stage, a listwise ranking model was trained on this subset to rank the K candidate sentences, and the top N sentences (N ≤ K) were then selected as the cited text.
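To make the first stage concrete, the sketch below (not taken from the paper) ranks the sentences of a reference paper by their similarity to a citation text and keeps the top-K candidates. The exact modification of the Jaccard similarity is not specified in this excerpt, so plain token-set Jaccard is used as a stand-in, and all function and variable names are illustrative.

```python
# Minimal sketch of the first-stage, similarity-based ranking.
# Plain token-set Jaccard is used here as a stand-in for the paper's
# modified Jaccard similarity, whose details are not given in this excerpt.

import re

def tokenize(text):
    """Lowercase a sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(citation_text, candidate_sentence):
    """Jaccard similarity between the token sets of two sentences."""
    a, b = tokenize(citation_text), tokenize(candidate_sentence)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def select_top_k(citation_text, rp_sentences, k=5):
    """Rank all reference-paper sentences by similarity to the citation text
    and keep the top-K candidates for the second-stage ranking."""
    scored = [(jaccard(citation_text, s), i, s) for i, s in enumerate(rp_sentences)]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]
```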
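CiteListNet itself is described in this excerpt only as a deep-learning-based listwise ranking model, so its architecture and features are not reproduced here. The following sketch merely illustrates the general shape of a listwise (ListNet-style) training objective over the K candidate sentences; the scoring network, feature dimension and all names are placeholders, not the authors' implementation.

```python
# Illustrative listwise (ListNet-style) training step in PyTorch.
# The scorer and features are placeholders; the point is only the listwise
# loss, computed over the whole candidate list at once.

import torch
import torch.nn as nn

class ListwiseRanker(nn.Module):
    """Placeholder scorer: maps each candidate's feature vector to a score."""
    def __init__(self, feature_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, features):               # features: (K, feature_dim)
        return self.net(features).squeeze(-1)  # scores:   (K,)

def listnet_loss(pred_scores, true_relevance):
    """ListNet top-one cross entropy: compare the predicted and ground-truth
    top-one probability distributions over the K candidates."""
    true_dist = torch.softmax(true_relevance, dim=-1)
    log_pred_dist = torch.log_softmax(pred_scores, dim=-1)
    return -(true_dist * log_pred_dist).sum()

# Dummy usage: 5 candidates with 10-dimensional features; the 2nd one is cited.
model = ListwiseRanker(feature_dim=10)
features = torch.randn(5, 10)
labels = torch.tensor([0.0, 1.0, 0.0, 0.0, 0.0])
loss = listnet_loss(model(features), labels)
loss.backward()
```

Unlike the per-sentence classifiers discussed in Section 2, such a loss is computed over the whole candidate list at once, which is what makes the second stage a listwise ranking rather than a classification step.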