ATHENA@CL-SciSumm 2019: Siamese recurrent bi-directional neural network for identifying cited text spans

Aris Fergadis¹,³, Dimitris Pappas²,³, and Haris Papageorgiou³

¹ School of Electrical and Computer Engineering, National Technical University of Athens, Greece
² Department of Informatics, Athens University of Economics and Business, Greece
³ Athena Research and Innovation Center, Greece

Abstract. In this paper we describe our participation in Task 1 of CL-SciSumm 2019. The task addresses automatic paper summarization in the research area of Computational Linguistics. Our approach is a two-step binary sentence pair classification between the so-called citances and candidate sentences. First, we classify the sentences of the abstract into predefined classes we call “zones”. These zones capture the discourse structure of a scientific publication. We then expand these zones with additional, similar sentences found in the main sections of the publication body. We train a Siamese bi-directional GRU neural network with a logistic regression layer to decide whether a citance alludes to a candidate sentence. The cited sentences are also assigned one or more discourse facets (i.e., categories defined in the Task) using a multi-class SVM. We ran extensive experiments on three different datasets, achieving promising results.

Keywords: Candidate Cited Sentence Selection · Siamese Neural Network · Discourse Facet

1 Introduction

Researchers are confronted with a continuously increasing volume of scientific publications, which is difficult to monitor and track [13]. The ability to create a synopsis of the key points, contribution and importance of a paper within an academic community is an important step [12]. This synopsis can be created from citation sentences (i.e., the citances) that reference a specific paper and can be considered a community-created summary of a topic or a paper.
Scientific summaries offer an overview of the cited paper that is useful to scholars, writers and literature reviewers [7, 10].

The CL-SciSumm Shared Task focuses on the scientific summarization of papers [6] and is organized into two tasks. For both tasks the organizers provide several Reference Papers (RPs) called “topics”.

Task 1A: For each citance, identify the spans of text (cited text spans) in the RP that most accurately reflect the citance. These spans are of the granularity of a sentence fragment, a full sentence, or several consecutive sentences (no more than 5).

Task 1B: For each cited text span, identify what facet of the paper it belongs to, from a predefined set of facets.

Task 2 (optional): Generate a structured summary of the RP from the cited text spans of the RP. The length of the summary should not exceed 250 words.

We participated in Tasks 1A and 1B of the 2019 shared task [1] and present our methodology here.

2 Methodology

We approach Task 1A as a binary sentence pair classification problem. We create pairs of a citance and a candidate sentence, extracted from the Citing Papers (CPs) and the RPs respectively. Word embeddings are used to select candidate sentences. A Siamese neural network processes these pairs to decide whether or not the candidate sentence is a cited text span of the citance. For Task 1B, a multi-class SVM [3] model assigns a discourse facet to the cited text spans.

2.1 System Components

Word Embeddings. We use word embeddings both for the selection of candidate sentences and for the embedding layer of our network. Embedding vectors are trained on the ACL Corpus dump⁴ using the CBOW implementation of word2vec [11] in the gensim⁵ tool, with negative sampling set to 5 and a word-vector dimensionality of 100. All words are converted to lowercase.

Candidate Sentence Selection. We select sentences from the RP as candidate sentences. The intuition is that not all sentences are equally important as cited text spans.
Thus, we try to select sentences that are about methodology, results and conclusions, and discard sentences about background and related work. This is also supported by the fact that the cited text is assigned a discourse facet. Our approach tries to eliminate sentences that would potentially be false positives.

To select the candidate sentences of the RP we split the abstract into zones [9]. Each sentence is classified into one of the following zones: Background, Method, Result and Conclusion. We keep only the sentences that belong to the Method, Result and Conclusion zones (referred to as zone sentences). Sentences are split into words⁶, punctuation and numbers are removed, and each word is assigned its embedding vector. For each zone sentence and each of the remaining RP sentences we calculate a sentence embedding by averaging its word embedding vectors.

⁴ http://acl-arc.comp.nus.edu.sg/archives/acl-arc-160301-parscit/
⁵ https://radimrehurek.com/gensim/, version 3.7.3
⁶ Using the tokenization tools of the gensim module

Using the embedding vectors of the N zone sentences and those of the other M RP sentences, we calculate a similarity matrix $S \in \mathbb{R}^{N \times M}$ using the cosine similarity measure. To get the sentences most similar to the zone sentences we define a threshold $t_s$. The RP sentences whose similarity $S_{i,j}$ passes the threshold $t_s$ are kept, together with the zone sentences, as candidate sentences. The choice of the $t_s$ value is discussed in Section 3.

Siamese Neural Network. The Siamese neural network is composed of two bi-directional GRUs (biGRU) [14, 2] and a logistic regression layer, as depicted in Figure 1. Each biGRU processes one sentence at a time. For each citance and a set of candidate sentences of the RP, the left biGRU takes as input the citance and the right biGRU takes as input one of the candidate sentences.

We use $w_{1:n}$ to denote a sequence of words $w_{1:n} = w_1, \ldots, w_n$, each with its corresponding $d_{emb}$-dimensional word embedding $e_i = E[w_i]$. The embedding matrix $E \in \mathbb{R}^{|V| \times d_{emb}}$ associates words from the vocabulary $V$ with $d_{emb}$-dimensional dense vectors. As a regularization layer during training, the left biGRU applies additive zero-centered Gaussian noise [4] with $\sigma = 0.05$ to the word embeddings. The outputs $y_1^b$ and $y_n^f$ of the backward $\mathrm{GRU}_b$ and the forward $\mathrm{GRU}_f$, respectively, are concatenated into one vector. For the left (citance) input:

\[
y_1^b = \mathrm{GRU}_b(e_{n:1}), \qquad
y_n^f = \mathrm{GRU}_f(e_{1:n}), \qquad
x_l = [y_1^b ; y_n^f]
\]

and for the right (candidate sentence) input, whose embeddings are written $e'$ as in Figure 1:

\[
y_1'^{b} = \mathrm{GRU}_b(e'_{n:1}), \qquad
y_n'^{f} = \mathrm{GRU}_f(e'_{1:n}), \qquad
x_r = [y_1'^{b} ; y_n'^{f}]
\]

We use $x_l$ to denote the output for the left input, $x_r$ the output for the right input, and $[\cdot\,;\cdot]$ to denote concatenation. The two output vectors are multiplied element-wise to give a vector $x$. A logistic regression layer (LR) with a sigmoid activation function $\sigma(\cdot)$ is used to make the final prediction $\hat{y}$. To summarize the architecture:

\[
\begin{aligned}
p(y = k \mid w_{1:n}) &= \hat{y}, \quad k \in \{0, 1\} \\
\hat{y} &= \mathrm{LR}(x), \quad \text{with } \sigma(\cdot) \text{ activation} \\
x &= [x_l \odot x_r]
\end{aligned}
\]

where $\odot$ denotes element-wise multiplication.

The described model considers one sentence at a time. To find out whether a citance references more than one sentence in the RP, we take the predictions for all the candidate sentences and keep the maximum score $s_{max}$. We define a threshold $s_t = 0.98 \cdot s_{max}$. Any candidate sentence with a score $s$ such that $s_t \le s \le s_{max}$ is selected as a cited sentence.

[Figure 1: diagram of the network. From bottom to top: word embeddings e_1, ..., e_n of the Citance and e'_1, ..., e'_n of the Candidate Sentence; Gaussian noise (GN) on the citance embeddings; forward and backward GRUs on each side; concatenation of their outputs; element-wise product; LR; prediction ŷ.]

Fig. 1. Siamese bi-directional GRU network. The left input is the Citance and the right input a Candidate Sentence. The outputs of each biGRU network are concatenated, the two resulting vectors are multiplied element-wise, and a Logistic Regression (LR) layer with sigmoid activation predicts whether the Citance cites the Candidate Sentence. GN denotes the Gaussian Noise layer.
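The two thresholding steps described in this section can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `select_candidates` and `select_cited` are hypothetical, sentence embeddings are assumed to be given as non-zero vectors, and the rule that an RP sentence is kept when it reaches the threshold for at least one zone sentence is an assumption, since the text only states that a sentence must pass the similarity threshold.

```python
import numpy as np


def select_candidates(zone_vecs, rp_vecs, ts=0.5):
    """Indices of RP sentences similar enough to some zone sentence.

    Assumption: a sentence is kept if its cosine similarity to at least
    one zone sentence is >= ts.
    """
    zone = np.asarray(zone_vecs, dtype=float)
    rp = np.asarray(rp_vecs, dtype=float)
    # Normalise rows so that a dot product equals the cosine similarity.
    zone_n = zone / np.maximum(np.linalg.norm(zone, axis=1, keepdims=True), 1e-12)
    rp_n = rp / np.maximum(np.linalg.norm(rp, axis=1, keepdims=True), 1e-12)
    S = zone_n @ rp_n.T  # similarity matrix of shape (N, M)
    return np.flatnonzero((S >= ts).any(axis=0)).tolist()


def select_cited(scores, ratio=0.98):
    """Indices of candidates whose score s satisfies s_t <= s <= s_max,
    with s_t = ratio * s_max. Scores are assumed to be sigmoid outputs
    in (0, 1), one per candidate sentence."""
    s = np.asarray(scores, dtype=float)
    s_t = ratio * s.max()
    return np.flatnonzero(s >= s_t).tolist()


# Toy usage with made-up 2-dimensional "sentence embeddings" and scores.
zone = [[1.0, 0.0]]
rp = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(select_candidates(zone, rp, ts=0.5))      # [0, 2]
print(select_cited([0.10, 0.90, 0.89, 0.50]))   # [1, 2]
```

In the toy example the third RP sentence has cosine similarity of about 0.71 to the zone sentence and is therefore kept, and the candidate scored 0.89 is selected alongside the top-scoring one because 0.89 is above 0.98 · 0.90 = 0.882.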
Discourse Facet. Task 1B asks “for each cited text span, identify what facet of the paper it belongs to, from a predefined set of facets”. The predefined facets are Aim, Hypothesis, Method, Result and Implication. We approach this task as a multi-class classification problem in which a span may receive more than one label, since some cited text spans have up to two facets. We build a bag-of-terms representation of all n-grams with n = 1, 2, 3 and calculate their tf-idf values using the L1 norm. Five one-vs-rest SVM classifiers were trained, one per facet, each deciding whether a cited text span belongs to that facet.

3 Experiments and Results

The dataset provided was split into a training and a development set. The training set consists of 40 RPs with their CPs annotated by humans, and 1000 RPs with their corresponding CPs that were automatically annotated. For our experiments we used only the human-annotated part of the training set (the TR-H set).

As a first step, we selected candidate sentences from the RPs. By keeping only the candidate sentences we might miss cited sentences in the RP that were not selected by our method. In Table 1, coverage is the percentage of RP sentences we kept, and hits is the percentage of cited sentences that appear in our candidate list. Our target is minimum coverage with maximum hits. Minimum coverage means that we keep only the good candidates, while maximum hits means that the cited sentences are within our candidate list.

Table 1 displays the average coverage and hits for the 40 RPs of the training set and the 10 RPs of the development set for different thresholds. Based on the results, $t_s$ was set to 0.5. Using this threshold, we keep on average about 70% of the RP sentences on the training set and 60% on the development set. Although we discard about 30% and 40% of the RP sentences, we lose only about 16% and 19% of the cited text spans, respectively.

Table 1. Average coverage and hits of the selected candidate sentences for the 2018 training and development sets using two thresholds.

Threshold   Set                        Coverage   Hits
ts = 0.5    Training Set Average       69.87%     84.16%
ts = 0.5    Development Set Average    61.58%     81.15%
ts = 0.7    Training Set Average       21.89%     38.50%
ts = 0.7    Development Set Average    15.14%     33.02%

We evaluated our system on three different versions of the dataset; for testing we used the development set (Dev), the 2016 test set (2016) and the 2017 test set (2017), respectively. For training, we used the TR-H set, excluding all papers that appear in the corresponding test set so that training and test data do not overlap. The results shown in Table 2 are comparable to those of the previous shared tasks [6, 5, 8].

Table 2. Results on the three test sets, reporting Micro and Macro average scores for Tasks 1A and 1B.

                      Task 1A                      Task 1B
Test Set   Average    Precision  Recall  F1        Precision  Recall  F1
Dev        Micro      0.137      0.090   0.108     0.950      0.114   0.203
Dev        Macro      0.125      0.087   0.102     0.750      0.104   0.183
2016       Micro      0.077      0.055   0.064     0.941      0.076   0.140
2016       Macro      0.103      0.102   0.102     0.100      0.100   0.100
2017       Micro      0.136      0.094   0.112     1.000      0.135   0.238
2017       Macro      0.182      0.156   0.168     0.600      0.182   0.279

4 Conclusions and Future Work

Scientific summarization is a challenging task, as is evident from the results of the previous shared tasks [6, 5, 8]. In our methodology we create pairs of a citance and a candidate sentence, extracted from the CP and the RP respectively. These pairs are classified by a Siamese neural network as positive if the citance indeed cites the sentence, and as negative otherwise. The cited sentences are also assigned one or more discourse facets. We applied our methods to the dataset of the 2019 CL-SciSumm shared task. The evaluation of our system indicates that the Siamese neural network performs comparably to other machine learning methods.
In future work we will investigate the impact of replacing the logistic regression layer with other similarity functions, such as cosine similarity. We also plan to select the best value for the $s_t$ threshold via hyper-parameter tuning. Finally, we will experiment with different methods for cited sentence selection which take into account the scores of neighboring sentences.

5 Acknowledgement

We acknowledge support of this work by the Data4Impact Project, which received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 770531.

References

1. Chandrasekaran, M.K., Yasunaga, M., Radev, D., Kan, M.Y.: Overview and results: CL-SciSumm shared task 2019. In: Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2019) @ SIGIR 2019, Paris, France (2019)
2. Cho, K., Van Merriënboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259 (2014)
3. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3), 273–297 (1995)
4. Hinton, G., Van Camp, D.: Keeping neural networks simple by minimizing the description length of the weights. In: Proc. of the 6th Ann. ACM Conf. on Computational Learning Theory. Citeseer (1993)
5. Jaidka, K., Chandrasekaran, M.K., Jain, D., Kan, M.Y.: The CL-SciSumm shared task 2017: Results and key insights. In: BIRNDL@SIGIR (2). pp. 1–15 (2017)
6. Jaidka, K., Chandrasekaran, M.K., Rustagi, S., Kan, M.Y.: Overview of the CL-SciSumm 2016 shared task. In: Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL). pp. 93–102 (2016)
7. Jaidka, K., Khoo, C., Na, J.C.: Deconstructing human literature reviews–a framework for multi-document summarization. In: Proceedings of the 14th European Workshop on Natural Language Generation. pp. 125–135 (2013)
8. Jaidka, K., Yasunaga, M., Chandrasekaran, M.K., Radev, D.R., Kan, M.Y.: The CL-SciSumm shared task 2018: Results and key insights. In: BIRNDL@SIGIR (2018)
9. Jin, D., Szolovits, P.: Hierarchical neural networks for sequential sentence classification in medical scientific abstracts. ArXiv abs/1808.06161 (2018)
10. Jones, K.S.: Automatic summarising: The state of the art. Inf. Process. Manage. 43, 1449–1481 (2007)
11. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems. pp. 3111–3119 (2013)
12. Nakov, P.I., Schwartz, A.S., Hearst, M.: Citances: Citation sentences for semantic analysis of bioscience text. In: Proceedings of the SIGIR. vol. 4, pp. 81–88 (2004)
13. Qazvinian, V., Radev, D.R.: Scientific paper summarization using citation summary networks. In: Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1. pp. 689–696. Association for Computational Linguistics (2008)
14. Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11), 2673–2681 (1997)