=Paper=
{{Paper
|id=Vol-2002/editorial
|storemode=property
|title=The CL-SciSumm Shared Task 2017: Results and Key Insights
|pdfUrl=https://ceur-ws.org/Vol-2002/editorial.pdf
|volume=Vol-2002
|authors=Kokil Jaidka,Muthu Kumar Chandrasekaran,Devanshu Jain,Min-Yen Kan
|dblpUrl=https://dblp.org/rec/conf/sigir/JaidkaCJK17
}}
==The CL-SciSumm Shared Task 2017: Results and Key Insights==
Kokil Jaidka (1), Muthu Kumar Chandrasekaran (2), Devanshu Jain (1), and Min-Yen Kan (2,3)

(1) University of Pennsylvania, USA
(2) School of Computing, National University of Singapore, Singapore
(3) Smart Systems Institute, National University of Singapore, Singapore
jaidka@sas.upenn.edu

Abstract. The CL-SciSumm Shared Task is the first medium-scale shared task on scientific document summarization in the computational linguistics (CL) domain. In 2017, it comprised three tasks: (1A) identifying relationships between citing documents and the referred document, (1B) classifying the discourse facets, and (2) generating the abstractive summary. The dataset comprised 40 annotated sets of citing and reference papers from open-access research papers in the CL domain. This overview describes the participation and the official results of the CL-SciSumm 2017 Shared Task, organized as a part of the 40th Annual Conference of the Special Interest Group on Information Retrieval (SIGIR), held in Tokyo, Japan in August 2017. We compare the participating systems in terms of two evaluation metrics and discuss the use of ROUGE as an evaluation metric. The annotated dataset used for this shared task and the scripts used for evaluation can be accessed and used by the community at: https://github.com/WING-NUS/scisumm-corpus.

===1 Introduction===
CL-SciSumm explores summarization of scientific research in the domain of computational linguistics. It encourages the incorporation of new kinds of information in automatic scientific paper summarization, such as the facets of research information being summarized in the research paper. CL-SciSumm also encourages the use of citing mini-summaries written in other papers, by other scholars, when they refer to the paper. The Shared Task dataset comprises the set of citation sentences (i.e., “citances”) that reference a specific paper as a (community-created) summary of a topic or paper [19]. Citances for a reference paper are considered synopses of its key points, its key contributions, and its importance within an academic community [17]. The advantage of using citances is that they are embedded with meta-commentary and offer a contextual, interpretative layer to the cited text. Citances offer a view of the cited paper which could complement the reader's context, possibly as a scholar [7] or as a writer of a literature review [6].

The CL-SciSumm Shared Task is aimed at bringing together the summarization community to address challenges in scientific communication summarization. Over time, we anticipate that the Shared Task will spur the creation of new resources, tools and evaluation frameworks. A pilot CL-SciSumm task was conducted at TAC 2014, as part of the larger BioMedSumm Task (http://www.nist.gov/tac/2014). In 2016, a second CL-SciSumm Shared Task [5] was held as part of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL) [16] at the Joint Conference on Digital Libraries (JCDL, http://www.jcdl2016.org/). This paper provides the results and insights from CL-SciSumm 2017, which was held as part of the subsequent BIRNDL 2017 workshop [15] at the annual ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR, http://sigir.org/sigir2017/).
===2 Task===
CL-SciSumm defined two serially dependent tasks that participants could attempt, given a canonical training and testing set of papers.

Given: A topic consists of a Reference Paper (RP) and ten or more Citing Papers (CPs) that all contain citations to the RP. In each CP, the text spans (i.e., citances) have been identified that pertain to a particular citation to the RP. Additionally, the dataset provides three types of summaries for each RP:
– the abstract, written by the authors of the research paper;
– the community summary, collated from the reference spans of its citances;
– a human-written summary, written by the annotators of the CL-SciSumm annotation effort.

Task 1A: For each citance, identify the spans of text (cited text spans) in the RP that most accurately reflect the citance. These are of the granularity of a sentence fragment, a full sentence, or several consecutive sentences (no more than five).

Task 1B: For each cited text span, identify which facet of the paper it belongs to, from a predefined set of facets.

Task 2: Finally, generate a structured summary of the RP from the cited text spans of the RP. The length of the summary should not exceed 250 words. This was an optional bonus task.
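To make Task 1A concrete, the sketch below shows a minimal, hypothetical baseline that ranks the RP's sentences by tf-idf cosine similarity to a citance and returns the top-scoring ones as the cited text span. It only illustrates the task setup; it is not the official baseline or any participant's system, and the function names and toy sentences are invented for the example.

```python
# Hypothetical Task 1A baseline: rank Reference Paper (RP) sentences by
# tf-idf cosine similarity to a citance and return the best-matching ones.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_reference_spans(citance: str, rp_sentences: list[str], top_k: int = 1):
    """Return indices of the top_k RP sentences most similar to the citance."""
    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
    # Fit on the RP sentences plus the citance so both share one vocabulary.
    matrix = vectorizer.fit_transform(rp_sentences + [citance])
    rp_vectors, citance_vector = matrix[:-1], matrix[-1]
    scores = cosine_similarity(citance_vector, rp_vectors).ravel()
    return sorted(range(len(rp_sentences)), key=lambda i: scores[i], reverse=True)[:top_k]

# Toy usage: the citance paraphrases a method sentence of the RP.
rp = ["We propose a graph-based model for citation linkage.",
      "Experiments were run on thirty annotated topics.",
      "Results improve over a tf-idf baseline."]
print(rank_reference_spans("Their graph-based linkage model outperforms baselines", rp, top_k=2))
```

Most of the participating systems described in Section 4 can be read as refinements of this retrieval step, adding richer similarity features, supervised ranking, or ensembling.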
===3 Development===
We built the CL-SciSumm corpus by randomly sampling research papers (Reference Papers, RPs) from the ACL Anthology corpus and then downloading the citing papers (CPs) for those RPs which had at least ten citations. The prepared dataset then comprised annotated citing sentences for a research paper, mapped to the sentences in the RP which they referenced. Summaries of the RP were also included.

The CL-SciSumm 2017 corpus included a refined version of the CL-SciSumm 2016 corpus of 30 RPs as a training set, in order to encourage teams from the previous edition to participate. The test set was an additional corpus of 10 RPs. Based on feedback from CL-SciSumm 2016 task participants, we refined the training set as follows:
– In cases where the annotators could not place the citance at a sentence in the referred paper, the citance was discarded. In prior versions of the task, annotators were required to reference the title (Reference Offset: ['0']), but participants complained that this resulted in a drop in system performance.
– Citances were deleted if they mentioned the referred paper in a clause as part of multiple references and did not cite specific information about it.

For details of the general procedure followed to construct the CL-SciSumm corpus, and the changes made to the procedure in CL-SciSumm 2016, please see [5].

====3.1 Annotation====
The annotation scheme was unchanged from the one followed in previous editions of the task and in the original BiomedSumm task developed by Cohen et al. (http://www.nist.gov/tac/2014): given each RP and its associated CPs, the annotation group was instructed to find citations to the RP in each CP. Specifically, the citation text, citation marker, reference text, and discourse facet were identified for each citation of the RP found in the CP.

===4 Overview of Approaches===
Nine systems participated in Task 1 and a subset of five also participated in Task 2. The following paragraphs discuss the approaches followed by the participating systems, in lexicographic order by team name.

The Beijing University of Posts and Telecommunications team from the Center for Intelligence Science and Technology (CIST, [11]) followed an approach similar to their 2016 system submission [10]. They calculated a set of similarity metrics between reference spans and citances: idf similarity, Jaccard similarity, and context similarity. They submitted six system runs which combined the similarity scores using a fusion method, a Jaccard Cascade method, a Jaccard Focused method, an SVM method, and two ensemble methods using voting.

The Jadavpur University team (Jadavpur, [3]) participated in all of the tasks. For Task 1A, they defined a cosine similarity between texts; the reference paper sentence with the highest score is selected as the reference span. For Task 1B, they represent each discourse facet as a bag of words of all the sentences having that facet, keeping only the words with the highest tf.idf values. To identify the facet of a sentence, they calculated the cosine similarity between a candidate sentence vector and each bag's vector; the bag with the highest similarity is deemed the chosen facet. For Task 2, a similarity score was calculated between pairs of sentences belonging to the same facet; if the resultant score is high, only one of the two sentences is added to the summary.

The Nanjing University of Science and Technology team (NJUST, [14]) participated in all of the tasks (1A, 1B and 2). For Task 1A, they used a weighted voting-based ensemble of classifiers (a linear support vector machine (SVM), an SVM with a radial basis function kernel, a Decision Tree and Logistic Regression) to identify the reference span. For Task 1B, they created a dictionary for each discourse facet and labeled the reference span with a facet if its dictionary contained any of the words in the span. For Task 2, they used bisecting K-means to group sentences into clusters and then used maximal marginal relevance to extract sentences from each cluster and combine them into a summary.

The National University of Singapore WING team (NUS WING, [18]) participated in Tasks 1A and 1B. They followed a joint-scoring approach, weighting surface-level similarity using tf.idf and the longest common subsequence (LCS), and semantic relatedness using a pairwise neural network ranking model. For Task 1B, they retrofitted their neural network approach, applying it to the output of Task 1A.

The Peking University team (PKU, [21]) participated in Task 1A. They computed features based on sentence-level and character-level tf.idf scores and word2vec similarity, and used logistic regression to classify sentences as being reference spans or not.

The Graz University of Technology team (TUGRAZ, [4]) participated in Tasks 1A and 1B. They followed an information retrieval style approach for Task 1A, creating an index of the reference papers and treating each citance as a query; results were ranked according to a vector space model and BM25. For Task 1B, they created an index of cited text along with the discourse facet(s); to identify the discourse facet of a query, a majority vote was taken among the discourse facets found in the top five results.

The University of Houston team (UHouston, [8]) used a combination of lexical and syntactic features for Task 1A, based on the position of text and textual entailment. They tackled Task 1B using WordNet expansion.

The University of Mannheim team (UniMA, [9]) also participated in all of the tasks. For Task 1A, they used a supervised learning-to-rank paradigm to rank the sentences in the reference paper, using features such as lexical similarity, semantic similarity, entity similarity and others. They formulated Task 1B as a one-versus-all multi-class classification, using an SVM and a trained convolutional neural network (CNN) for each of the five binary classification tasks. For Task 2, they clustered the sentences with a single-pass clustering algorithm using a Word Mover's similarity measure and sorted the sentences in each cluster according to their TextRank score. They then ranked the clusters according to the average TextRank score. Top sentences were picked from the clusters and added to the summary until the word limit of 250 words was reached.
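Several of the Task 2 approaches above (clustering plus maximal marginal relevance, or TextRank-sorted clusters) reduce to scoring candidate sentences and then filling the 250-word budget while avoiding redundancy. The following sketch is a generic illustration of that budgeted selection step, not any team's actual system; the Jaccard redundancy threshold and the function names are assumptions made for the example.

```python
# Generic budgeted extractive selection for Task 2 (illustrative only):
# greedily add the highest-scoring sentence that is not too redundant with
# the sentences already selected, until the 250-word cap is reached.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def build_summary(sentences, scores, word_budget=250, redundancy_cap=0.5):
    """sentences: candidate cited text spans; scores: any relevance score."""
    chosen, used_words = [], 0
    for idx in sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True):
        tokens = sentences[idx].lower().split()
        if used_words + len(tokens) > word_budget:
            continue  # sentence would exceed the 250-word limit
        if any(jaccard(set(tokens), set(s.lower().split())) > redundancy_cap for s in chosen):
            continue  # skip near-duplicates, in the spirit of MMR-style selection
        chosen.append(sentences[idx])
        used_words += len(tokens)
    return " ".join(chosen)
```

A real system would plug cluster-aware, TextRank, or regression-based scores into `scores`, as the teams above did.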
Finally, the Universitat Pompeu Fabra team (UPF, [1]) participated in Tasks 1A, 1B and 2. For Task 1A, they used a weighted voting ensemble of systems that used word embedding distance, modified Jaccard distance and BabelNet embedding distance. They formulated Task 1B as a one-versus-all multi-class classification. For Task 2, they trained a linear regression model to learn a scoring function for each sentence, approximated as the cosine similarity between the reference paper's sentence vector and the summary vector.

===5 Evaluation===
An automatic evaluation script was used to measure system performance for Task 1A, in terms of the overlap of sentence IDs between the sentences identified in the system output and the gold standard created by the human annotators. The raw number of overlapping sentences was used to calculate the precision, recall and F1 score for each system. We followed the approach of most SemEval tasks in reporting overall system performance as the micro-averaged performance over all topics in the blind test set. Additionally, we calculated lexical overlaps in terms of the ROUGE-2 and ROUGE-SU4 scores [12] between the system output and the human-annotated gold standard reference spans. ROUGE scoring was used in CL-SciSumm 2017 for both Task 1A and Task 2.

Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is a set of metrics used to automatically evaluate summarization systems [12] by measuring the overlap between computer-generated summaries and multiple human-written reference summaries. In previous studies, ROUGE scores have correlated significantly with human judgments of summary quality [13]. The variants of ROUGE differ in the granularity at which overlap is calculated. For instance, ROUGE-2 measures the bigram overlap between the candidate computer-generated summary and the reference summaries; more generally, ROUGE-N measures the n-gram overlap. ROUGE-L measures the overlap in the longest common subsequence (LCS). ROUGE-S measures the overlap in skip-bigrams, i.e., bigrams with arbitrary gaps in between, and ROUGE-SU uses skip-bigram plus unigram overlaps. CL-SciSumm 2017 uses ROUGE-2 and ROUGE-SU4 for its evaluation.

Task 1B was evaluated as the proportion of discourse facets correctly classified by the system, contingent on the expected response of Task 1A. As it is a multi-label classification, this task was also scored in terms of precision, recall and F1.

Task 2 was optional, and was also evaluated using the ROUGE-2 and ROUGE-SU4 scores between the system output and three types of gold standard summaries of the research paper: the reference paper's abstract, a community summary, and a human summary.

The evaluation scripts are provided at the CL-SciSumm Github repository (https://github.com/WING-NUS/scisumm-corpus), where participants may run their own evaluation and report the results.
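For reference, the listing below is a minimal sketch of the sentence-overlap scoring described above. It assumes, hypothetically, that the gold standard and a system run are represented as dictionaries mapping a citance identifier to the set of RP sentence IDs; the scripts in the Github repository are the authoritative implementation.

```python
# Minimal sketch of Task 1A overlap scoring (illustrative; assumes a
# hypothetical format: citance id -> set of RP sentence ids).
def micro_f1(gold: dict, system: dict):
    """Micro-averaged precision, recall and F1 over all citances."""
    tp = fp = fn = 0
    for citance, gold_ids in gold.items():
        sys_ids = system.get(citance, set())
        tp += len(gold_ids & sys_ids)   # correctly retrieved sentences
        fp += len(sys_ids - gold_ids)   # retrieved but not in the gold span
        fn += len(gold_ids - sys_ids)   # gold sentences that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Micro-averaging pools the raw counts over all citances and topics before computing the ratios, so topics with many citances carry proportionally more weight.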
===6 Results===
This section compares the participating systems in terms of their performance. Five of the nine systems that attempted Task 1 also attempted the bonus Task 2; their Task 2 performance was measured by ROUGE-2 and ROUGE-SU4 against the three gold standard summary types. The results are provided in Tables 1 and 2 and Figures 1 and 2; the detailed implementations of the individual runs are described in the system papers included in this proceedings volume.

For Task 1A, the best performance was shown by three of the five runs from NJUST [14]. Their performance was closely followed by TUGRAZ [4]. The third best system was CIST [11], which was also the best performer for Task 1B; the next best performers for Task 1B were PKU [21] and NJUST [14]. For Task 2, CIST had the best performance against the abstract, community and human summaries [11]. UPF [1] had the next best performances against the abstract and community summaries, while NJUST [14] and UniMA [9] were close runners-up against the human summaries.

Fig. 1 (per-run bar charts, omitted here): Performances on (a) Task 1A in terms of sentence overlap F1 and ROUGE-2 F1, and (b) Task 1B F1 conditional on Task 1A.

System | Task 1A: Sentence Overlap (F1) | Task 1A: ROUGE F1 (rank) | Task 1B (rank)
NJUST [14] Run 2 | 0.124 | 0.100 (3) | 0.339 (7)
NJUST [14] Run 5 | 0.123 | 0.114 (1) | 0.328 (10)
NJUST [14] Run 4 | 0.114 | 0.090 (7) | 0.306 (13)
TUGRAZ [4] Run 2 | 0.110 | 0.108 (2) | 0.337 (8)
CIST [11] Run 1 | 0.107 | 0.047 (44) | 0.373 (3)
CIST [11] Run 4 | 0.105 | 0.053 (42) | 0.408 (1)
PKU [21] Run 4 | 0.102 | 0.081 (11) | 0.370 (4)
CIST [11] Run 5 | 0.100 | 0.050 (43) | 0.392 (2)
NJUST [14] Run 1 | 0.097 | 0.100 (3) | 0.294 (16)
NJUST [14] Run 3 | 0.094 | 0.090 (7) | 0.209 (28)
UHouston [8] Run 1 | 0.091 | 0.087 (9) | 0.271 (20)
TUGRAZ [4] Run 1 | 0.088 | 0.093 (5) | 0.269 (21)
UPF [1] Run 1 | 0.088 | 0.063 (36) | 0.293 (17)
UPF [1] Run 2 | 0.088 | 0.063 (36) | 0.293 (17)
CIST [11] Run 3 | 0.086 | 0.058 (38) | 0.365 (5)
CIST [11] Run 7 | 0.084 | 0.045 (45) | 0.351 (6)
PKU [21] Run 3 | 0.083 | 0.070 (19) | 0.315 (12)
CIST [11] Run 6 | 0.077 | 0.042 (46) | 0.330 (10)
UHouston [8] Run 1 | 0.074 | 0.070 (19) | 0.304 (14)
UHouston [8] Run 2 | 0.074 | 0.070 (19) | 0.241 (24)
UHouston [8] Run 3 | 0.074 | 0.070 (19) | 0.050 (40)
UHouston [8] Run 4 | 0.074 | 0.070 (19) | 0.272 (19)
UHouston [8] Run 5 | 0.074 | 0.070 (19) | 0.299 (13)
UHouston [8] Run 6 | 0.074 | 0.070 (19) | 0.266 (22)
UHouston [8] Run 7 | 0.074 | 0.070 (19) | 0 (43)
UHouston [8] Run 8 | 0.074 | 0.070 (19) | 0 (43)
PKU [21] Run 1 | 0.071 | 0.067 (33) | 0 (43)
UPF [1] Run 4 | 0.071 | 0.091 (6) | 0.220 (27)
UHouston [8] Run 9 | 0.068 | 0.070 (16) | 0.226 (25)
UHouston [8] Run 10 | 0.068 | 0.070 (16) | 0.226 (25)
UHouston [8] Run 13 | 0.068 | 0.077 (12) | 0.246 (23)
CIST [11] Run 2 | 0.067 | 0.058 (38) | 0.326 (11)
PKU [21] Run 2 | 0.066 | 0.058 (38) | 0 (43)
NUS [18] | 0.055 | 0.084 (10) | 0.026 (42)
UniMA [9] Run 1 | 0.053 | 0.068 (30) | 0.114 (31)
UniMA [9] Run 2 | 0.053 | 0.068 (30) | 0.100 (36)
UniMA [9] Run 3 | 0.053 | 0.068 (30) | 0.103 (34)
UPF [1] Run 3 | 0.052 | 0.055 (41) | 0.134 (29)
UniMA [9] Run 7 | 0.051 | 0.075 (13) | 0.111 (33)
UniMA [9] Run 8 | 0.051 | 0.075 (13) | 0.112 (32)
UniMA [9] Run 9 | 0.051 | 0.075 (13) | 0.116 (30)
UniMA [9] Run 4 | 0.049 | 0.072 (16) | 0.096 (39)
UniMA [9] Run 5 | 0.049 | 0.072 (16) | 0.097 (38)
UniMA [9] Run 6 | 0.049 | 0.072 (16) | 0.102 (35)
Jadavpur [3] Run 2 | 0.042 | 0.065 (34) | 0.100 (36)
Jadavpur [3] Run 1 | 0.037 | 0.065 (34) | 0 (43)
UHouston [8] Run 12 | 0.014 | 0.034 (47) | 0.044 (41)
Table 1: Systems' performance in Task 1A and 1B, ordered by their F1-scores for sentence overlap on Task 1A. Each system's rank by its ROUGE performance on Task 1A and by its Task 1B performance is shown in parentheses.
System | Abstract: R-2 | Abstract: R-SU4 | Human: R-2 | Human: R-SU4 | Community: R-2 | Community: R-SU4
CIST [11] Run 4 | 0.351 | 0.185 (3) | 0.156 (22) | 0.101 (23) | 0.184 (9) | 0.136 (16)
CIST [11] Run 1 | 0.341 | 0.167 (8) | 0.173 (16) | 0.111 (21) | 0.187 (7) | 0.137 (15)
CIST [11] Run 6 | 0.331 | 0.172 (6) | 0.184 (13) | 0.110 (22) | 0.185 (8) | 0.141 (12)
CIST [11] Run 3 | 0.327 | 0.171 (7) | 0.275 (1) | 0.178 (1) | 0.204 (1) | 0.168 (4)
CIST [11] Run 2 | 0.322 | 0.163 (9) | 0.225 (3) | 0.147 (9) | 0.195 (2) | 0.155 (7)
CIST [11] Run 5 | 0.318 | 0.178 (5) | 0.153 (23) | 0.118 (18) | 0.192 (3) | 0.146 (10)
UPF [1] summa abs | 0.297 | 0.158 (11) | 0.168 (19) | 0.147 (9) | 0.190 (5) | 0.153 (8)
UPF [1] acl abs | 0.289 | 0.163 (9) | 0.214 (7) | 0.161 (5) | 0.191 (4) | 0.167 (5)
UniMA [9] Runs 1,2,3 | 0.265 | 0.184 (4) | 0.197 (9) | 0.157 (6) | 0.181 (10) | 0.169 (2)
NJUST [14] Run 4 | 0.258 | 0.152 (14) | 0.206 (8) | 0.131 (15) | 0.167 (13) | 0.126 (19)
UniMA [9] Runs 4,5,6 | 0.257 | 0.191 (1) | 0.221 (5) | 0.166 (3) | 0.178 (11) | 0.174 (1)
UniMA [9] Runs 7,8,9 | 0.256 | 0.187 (2) | 0.224 (4) | 0.169 (2) | 0.167 (13) | 0.167 (5)
UPF [1] summa com | 0.247 | 0.153 (13) | 0.168 (19) | 0.142 (12) | 0.178 (11) | 0.143 (11)
CIST [11] Run 7 | 0.240 | 0.154 (12) | 0.170 (18) | 0.133 (13) | 0.163 (15) | 0.141 (12)
NJUST [14] Run 2 | 0.214 | 0.138 (15) | 0.229 (2) | 0.154 (7) | 0.152 (16) | 0.114 (21)
NJUST [14] Run 1 | 0.198 | 0.114 (18) | 0.190 (10) | 0.114 (20) | 0.147 (17) | 0.101 (23)
NJUST [14] Run 5 | 0.192 | 0.108 (19) | 0.178 (15) | 0.127 (17) | 0.119 (23) | 0.098 (24)
Jadavpur [3] Run 1 | 0.191 | 0.133 (16) | 0.181 (14) | 0.129 (16) | 0.132 (19) | 0.119 (20)
NJUST [14] Run 3 | 0.187 | 0.119 (17) | 0.162 (21) | 0.115 (19) | 0.141 (19) | 0.127 (17)
UPF [1] google abs | 0.170 | 0.108 (19) | 0.173 (16) | 0.132 (14) | 0.143 (18) | 0.139 (14)
UPF [1] acl com | 0.161 | 0.099 (22) | 0.217 (6) | 0.166 (3) | 0.189 (6) | 0.169 (2)
UPF [1] summa hum | 0.144 | 0.091 (23) | 0.189 (11) | 0.148 (8) | 0.131 (21) | 0.147 (9)
UPF [1] acl hum | 0.124 | 0.102 (21) | 0.188 (12) | 0.147 (9) | 0.132 (19) | 0.127 (17)
UPF [1] google hum | 0.071 | 0.071 (24) | 0.127 (24) | 0.101 (23) | 0.103 (24) | 0.109 (22)
UPF [1] google com | 0.052 | 0.065 (25) | 0.120 (25) | 0.092 (25) | 0.075 (25) | 0.096 (25)
Mean Score | 0.237 | 0.150 | 0.193 | 0.141 | 0.164 | 0.145
Table 2: Systems' performance for Task 2 in terms of ROUGE-2 (R-2) and ROUGE-SU4 (R-SU4) F1-scores, ordered by R-2 against the abstract. Each system's rank on the corresponding evaluation is shown in parentheses.

Fig. 2 (bar charts, omitted here): Task 2 performances (ROUGE-2 and ROUGE-SU4 F1) against (a) abstract, (b) community and (c) human summaries. Plots correspond to the numbers in Table 2.

In this edition of the task, we used ROUGE as a more lenient way to evaluate Task 1A; however, as Figure 1 shows, many systems' performance on the ROUGE scores was lower than on the exact-match F1. The reasons for this aberration are discussed in Section 7.

===7 Error Analysis===
We carefully considered participant feedback from the CL-SciSumm 2016 task [5] and made a few changes to the annotation rules and evaluation procedure. We discuss the key insights from Task 1A, followed by Task 2.
Task 1A: In 2017, we introduced the ROUGE metric to evaluate Task 1A, which we anticipated would be a more lenient way to score the system runs, especially since it also considers bigrams separated by up to four words. However, we found that ROUGE did not always score systems more leniently than the sentence overlap F1. Table 3 provides some examples to demonstrate how the ROUGE score is biased to prefer shorter sentences over longer ones. ROUGE scores are calculated for candidate reference spans (RS) from system submissions against the gold standard (GS) reference span (row 1 of Table 3). Here, we consider three examples, each comparing a pair of RSs. In the first example, the RS of Submission 2 is shorter than that of Submission 1, and both systems retrieve one correct sentence (overlapping with the GS) and one incorrect sentence. Although the exact-match F1 is the same for both, the ROUGE score for Submission 2 (shorter) is greater than that of Submission 1. In the next example, neither system retrieves a correct match and Submission 1 is shorter than Submission 2. The exact-match score for both systems is the same, 0; however, the ROUGE score for Submission 1 (shorter) is higher than that of Submission 2. In the last example, both submissions correctly retrieve the GS, but each also retrieves an additional false positive sentence. Submission 1's RS is longer than Submission 2's, and as in the previous examples, the ROUGE score for Submission 1 (longer) is lower than that of Submission 2 (shorter).

Evaluation on ROUGE recall instead of ROUGE F1 would prevent longer candidate spans from being penalized. However, there is a caveat: a system could retrieve the entire article (RP) as the reference span and achieve the highest ROUGE recall. On sorting all the system runs by their average recall, we find that the submission by [3] ranked first; considering the overall standing of this system, we infer that there were probably many false positives due to the lack of a stringent word limit. In future tasks, we will impose a limit on the length of the reference span that can be retrieved. Although our documentation advised participants to return reference spans of three sentences or fewer, we did not penalize longer outputs in our evaluation. On the other hand, evaluation on ROUGE precision would encourage systems to return single sentences with high information overlap. A large body of work in information retrieval and summarization has measured system performance in terms of task precision. In fact, as argued by Felber and Kern [4], Task 1A can be considered akin to an information retrieval or question answering task. We could then use standard IR performance measures such as Mean Average Precision (MAP) over all the reference spans. We plan to pilot this measure in the next edition of the Shared Task.
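As an illustration of that proposal (it was not part of the official 2017 evaluation), MAP could be computed roughly as follows, treating each citance as a query and assuming a hypothetical output format in which a system returns a ranked list of RP sentence IDs per citance.

```python
# Illustrative MAP over reference spans: each citance is treated as a query.
# `ranked` maps citance id -> ranked list of RP sentence ids from a system;
# `gold` maps citance id -> set of correct sentence ids.
def average_precision(ranked_ids, gold_ids):
    hits, precisions = 0, []
    for rank, sid in enumerate(ranked_ids, start=1):
        if sid in gold_ids:
            hits += 1
            precisions.append(hits / rank)  # precision at each correct hit
    return sum(precisions) / len(gold_ids) if gold_ids else 0.0

def mean_average_precision(ranked: dict, gold: dict) -> float:
    aps = [average_precision(ranked.get(c, []), g) for c, g in gold.items()]
    return sum(aps) / len(aps) if aps else 0.0
```

Unlike ROUGE F1, such a measure rewards placing the correct sentences early in the ranking rather than returning shorter outputs.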
Task 1A topic-level meta-analysis: We conducted a meta-analysis of system performances for Task 1A over all the topics in the test set. We observed that for only one of the ten test topics (specifically, W09-0621), the average F1 score was one standard deviation away from the mean average F1 score of all the topics taken together. At the topic level, we observed that the largest variances in system performance were for W11-0815, W09-0621, D10-1058 and P07-1040, for which nearly two-thirds of all the submitted runs had a ROUGE or overlap F1 score that was more than one standard deviation from the average F1 score for that topic. We note that since most of the participants submitted multiple runs, some of these variances are isolated to all the runs submitted by a couple of teams (specifically, NJUST and UniMA) and may not necessarily reflect an aberration with the topic itself. All participants were encouraged to closely examine the outputs for these and other topics during their error analysis; they can refer to the topic-level results posted in the Github repository of the CL-SciSumm dataset (https://github.com/WING-NUS/scisumm-corpus).

Task 1B: Systems reported difficulty in classifying discourse facets (classes) with few datapoints. The class distribution, in general, is skewed towards the 'Method' facet. Systems reported that the class imbalance could not be countered effectively by class weights. This suggests that the 'Method' facet is composed of other sub-facets which need to be identified and annotated as ground truth.

Task 2: Considering the results for Task 2, we observed that ensemble approaches were the most effective against all three sets of gold standard summaries. Some systems, for instance the system by Abura'Ed et al. [1], tailored their summary generation approach to improve on one type of summary at a time. We plan to discourage this approach in future tasks, as we envision that systems should converge towards a general, optimal method for generating salient scientific summaries. Based on the results from CL-SciSumm 2016, we had expected that approaches that did well against human summaries would also do well against community summaries. However, no such inferences could be made from the results of CL-SciSumm 2017. In the case of NJUST [14], one of their approaches (Run 2) was among the top approaches against the abstract and human summaries, but was a poor performer against the community summaries. On the other hand, different runs by UPF [1] performed well against different summaries: one of their runs ('UPF acl com') was among the top against the human and community summaries but was near the bottom against the abstract summaries.

Table 3: Error analysis: why ROUGE did not improve systems' performance.
Summary type | Sentence IDs | Text | ROUGE-F
Gold Standard | 36, 37 | 'Identifying semantic relations in a text can be a useful indicator of its conceptual structure.' 'Lexical cohesion is expressed through the vocabulary used in text and the semantic relations between those words.' | –
Submission 1 | 36, 45 | 'To automatically detect lexical cohesion tics between pairwise words, three linguistic features were considered: word repetition, collocation and relation weights.' 'Lexical cohesion is expressed through the vocabulary used in text and the semantic relations between those words.' | 0.20
Submission 2 | 36, 119 | 'Each text was only 500 words in length and was related to a specific subject area.' 'Lexical cohesion is expressed through the vocabulary used in text and the semantic relations between those words.' | 0.39
Submission 1 | 45, 119 | 'To automatically detect lexical cohesion tics between pairwise words, three linguistic features were considered: word repetition, collocation and relation weights.' 'Each text was only 500 words in length and was related to a specific subject area.' | 0.07
Submission 2 | 45, 118 | 'To automatically detect lexical cohesion tics between pairwise words, three linguistic features were considered: word repetition, collocation and relation weights.' 'In this investigation, recall rates tended to be lower than precision rates because the algorithm identified fewer segments (4.1 per text) than the test subjects (4.5).' | 0.03
Submission 1 | 36, 37, 118 | 'Identifying semantic relations in a text can be a useful indicator of its conceptual structure.' 'Lexical cohesion is expressed through the vocabulary used in text and the semantic relations between those words.' 'In this investigation, recall rates tended to be lower than precision rates because the algorithm identified fewer segments (4.1 per text) than the test subjects (4.5).' | 0.33
Submission 2 | 36, 37, 119 | 'Each text was only 500 words in length and was related to a specific subject area.' 'Identifying semantic relations in a text can be a useful indicator of its conceptual structure.' 'Lexical cohesion is expressed through the vocabulary used in text and the semantic relations between those words.' | 0.58
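The length bias summarized in Table 3 can be reproduced with a rough bigram-overlap F1 on toy strings. The snippet below is a simplification of ROUGE-2 (no stemming, stopword handling, or the official toolkit's options), using invented sentences; it only demonstrates why the extra material in a longer span depresses F1 through lower precision, even when recall is identical.

```python
# Rough bigram-overlap F1 (a simplification of ROUGE-2) on toy strings:
# a shorter candidate containing the same correct material outscores a
# longer one, because the extra tokens inflate the precision denominator.
from collections import Counter

def bigrams(text):
    toks = text.lower().split()
    return Counter(zip(toks, toks[1:]))

def bigram_f1(candidate, reference):
    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum((cand & ref).values())  # clipped bigram matches
    p = overlap / max(sum(cand.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    return 2 * p * r / (p + r) if p + r else 0.0

gold = "lexical cohesion is expressed through the vocabulary used in text"
short = gold + " each text was only five hundred words"
long_ = gold + " three linguistic features were considered word repetition collocation and relation weights"
print(bigram_f1(short, gold) > bigram_f1(long_, gold))  # True: the shorter span wins
```

The official scores were computed with the ROUGE toolkit [12]; this toy version only mirrors its precision/recall trade-off.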
===8 Conclusion===
Nine systems participated in the CL-SciSumm 2017 shared tasks. The tasks provided a larger corpus with further refinements over 2016. Compared with 2016, the task attracted additional submissions that attempted neural network-based methods. Participants also experimented with word embeddings trained on the shared task corpus, as well as on other domain corpora. We recommend that future approaches go beyond off-the-shelf deep learning methods and also exploit the structural and semantic characteristics that are unique to scientific documents, perhaps as an enrichment device for word embeddings. The results from 2016 suggest that the scientific summarization task lends itself as a suitable problem for transfer learning [2]. For CL-SciSumm 2018, we are planning to collaborate with Yale University and introduce semantic concepts from the ACL Anthology Network [20].

Acknowledgement. We would like to thank Microsoft Research Asia for their generous funding. We would also like to thank Vasudeva Varma and colleagues at IIIT-Hyderabad, India and the University of Hyderabad for their efforts in convening and organizing our annotation workshops. We acknowledge the continued advice of Hoa Dang, Lucy Vanderwende and Anita de Waard from the pilot stage of this task. We would also like to thank Rahul Jha and Dragomir Radev for sharing their software to prepare the XML versions of papers. We are grateful to Kevin B. Cohen and colleagues for their support, and for sharing their annotation schema, export scripts and the Knowtator package implementation on the Protege software, all of which have been indispensable for this shared task.

===References===
1. Abura'Ed, A., Chiruzzo, L., Saggion, H., Accuosto, P., Bravo, A.: LaSTUS/TALN @ CL-SciSumm-17: Cross-document Sentence Matching and Scientific Text Summarization Systems. In: Proc. of the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017). Tokyo, Japan (August 2017)
2. Conroy, J., Davis, S.: Vector space and language models for scientific document summarization. In: NAACL-HLT. pp. 186–191. Association for Computational Linguistics, Newark, NJ, USA (2015)
3. Dipankar Das, S.M., Pramanick, A.: Employing Word Vectors for Identifying, Classifying and Summarizing Scientific Documents. In: Proc. of the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017). Tokyo, Japan (August 2017)
4. Felber, T., Kern, R.: Query Generation Strategies for CL-SciSumm 2017 Shared Task. In: Proc. of the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017). Tokyo, Japan (August 2017)
5. Jaidka, K., Chandrasekaran, M.K., Rustagi, S., Kan, M.Y.: Insights from CL-SciSumm 2016: the faceted scientific document summarization shared task. International Journal on Digital Libraries pp. 1–9 (2017)
6. Jaidka, K., Khoo, C.S., Na, J.C.: Deconstructing human literature reviews – a framework for multi-document summarization. In: Proc. of ENLG. pp. 125–135 (2013)
7. Jones, K.S.: Automatic summarising: The state of the art. Information Processing and Management 43(6), 1449–1481 (2007)
8. Karimi, S., Verma, R., Moraes, L., Das, A.: University of Houston at CL-SciSumm 2017. In: Proc. of the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017). Tokyo, Japan (August 2017)
9. Lauscher, A., Glavas, G., Eckert, K.: Citation-Based Summarization of Scientific Articles Using Semantic Textual Similarity. In: Proc. of the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017). Tokyo, Japan (August 2017)
10. Li, L., Mao, L., Zhang, Y., Chi, J., Huang, T., Cong, X., Peng, H.: CIST System for CL-SciSumm 2016 Shared Task. In: Proc. of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2016). pp. 156–167. Newark, NJ, USA (June 2016)
11. Li, L., Zhang, Y., Mao, L., Chi, J., Chen, M., Huang, Z.: CIST@CLSciSumm-17: Multiple Features Based Citation Linkage, Classification and Summarization. In: Proc. of the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017). Tokyo, Japan (August 2017)
12. Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out: Proceedings of the ACL-04 Workshop 8 (2004)
13. Liu, F., Liu, Y.: Correlation between ROUGE and human evaluation of extractive meeting summaries. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers. pp. 201–204. Association for Computational Linguistics (2008)
14. Ma, S., Xu, J., Wang, J., Zhang, C.: NJUST@CLSciSumm-17. In: Proc. of the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017). Tokyo, Japan (August 2017)
15. Mayr, P., Chandrasekaran, M.K., Jaidka, K.: Editorial for the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL) at SIGIR 2017. In: Proceedings of the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017) co-located with the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017), Tokyo, Japan, August 11, 2017. pp. 1–6 (2017), http://ceur-ws.org/Vol-1888/editorial.pdf
16. Mayr, P., Frommholz, I., Cabanac, G., Wolfram, D.: Editorial for the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL) at JCDL 2016. In: Proc. of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2016). pp. 1–5. Newark, NJ, USA (June 2016)
17. Nakov, P.I., Schwartz, A.S., Hearst, M.: Citances: Citation sentences for semantic analysis of bioscience text. In: Proceedings of the SIGIR'04 Workshop on Search and Discovery in Bioinformatics. pp. 81–88 (2004)
18. Prasad, A.: WING-NUS at CL-SciSumm 2017: Learning from Syntactic and Semantic Similarity for Citation Contextualization. In: Proc. of the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017). Tokyo, Japan (August 2017)
19. Qazvinian, V., Radev, D.: Scientific paper summarization using citation summary networks. In: Proceedings of the 22nd International Conference on Computational Linguistics, Volume 1. pp. 689–696. ACL (2008)
20. Radev, D.R., Muthukrishnan, P., Qazvinian, V.: The ACL Anthology Network corpus. In: Proceedings of the 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries. pp. 54–61. Association for Computational Linguistics (2009)
21. Zhang, D.: PKU @ CLSciSumm-17: Citation Contextualization. In: Proc. of the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017). Tokyo, Japan (August 2017)