University of Houston @ CL-SciSumm 2017: Positional Language Models, Structural Correspondence Learning and Textual Entailment

Samaneh Karimi1,2, Luis Moraes2, Avisha Das2, and Rakesh Verma2
1 School of Electrical and Computer Engineering, University of Tehran, Iran
2 Computer Science Department, University of Houston, TX 77204

Abstract. This paper introduces the methods employed by the University of Houston team participating in the CL-SciSumm 2017 Shared Task at BIRNDL 2017 to identify reference spans in a reference document given sentences from citing papers. The following approaches were investigated: structural correspondence learning, positional language models, and textual entailment. In addition, we refined our methods from BIRNDL 2016. Furthermore, we analyzed the results of each method to find the best performing system.

1 Introduction

The CL-SciSumm 2017 shared task [11] focuses on the problem of automatic summarization of scientific papers in the Computational Linguistics domain. In this problem, the inputs are a set of reference documents and sets of citing documents associated with each reference document. Moreover, in each citing document, the sentences that refer to the reference document (called citances) are marked. Several tasks are defined in the shared task. Task 1a is, given a citance, to identify the span of reference text that best reflects what has been cited. Task 1b asks us to classify the cited span according to a predefined set of facets: hypothesis, aim, method, results, and implication. Finally, Task 2 is to generate a structured summary.

Three main approaches are investigated for Task 1a: positional language models, structural correspondence learning, and textual entailment systems. The details of each method are explained in the following sections. Two methods are employed to address Task 1b: a rule-based method, which is essentially a comparison-based method augmented by WordNet expansion, and a classification method.

2 Related Work

Citations are considered an important source of information in many text mining areas [9]. For example, citations can be used in summarization to improve a summary [23]. It is thought that citations embody the community's perspective on the content of the cited paper [22]. In [26], the authors illustrate the importance of citations for summarization purposes. They built summaries from three sources of information: only the reference article, only the abstract, and only citations. They showed that citations produced the best results. In another study, Mohammad et al. [20] also showed that the information from citations is different from that which can be gleaned from just the abstract or reference article. However, it is cautioned that citations often focus on very specific aspects of a paper [8].

Properly tagging and marking the actual citation has also attracted a great deal of attention in this area of research. Powley and Dale [25] give insight into recognizing text that is a citation. Siddharthan and Teufel also introduce a new concept called "scientific attribution", which can help in discourse classification. The importance of discourse classification is further developed in [1]; in that paper, the authors showed the importance of discourse facet identification for producing good summaries. In terms of what has been attempted at CL-SciSumm in past years, the methods are diverse.
Aggarwal and Sharma [2] use bag-of-words bigrams and rank reference sentences by their relevance to the citance, computing scores from bigram overlap counts between the citance and reference sentences with some heuristics. In [12], the authors combine an unsupervised graph-based sentence ranking approach with a supervised classification approach in three different ways. Cao et al. [5] model Task 1a as a ranking problem and apply SVM Rank for this purpose. In [16], the citance is treated as a query over the sentences of the reference document; the authors then used learning-to-rank algorithms (RankBoost, RankNet, AdaRank, and Coordinate Ascent) with lexical and topic features, in addition to TextRank scores, for ranking sentences. Lei et al. [14] used SVMs and rule-based methods with lexicon features and similarities (IDF, Jaccard, and context similarity). In [24], the author proposes a linear combination of a TFIDF model and a single-layer neural network model. Saggion et al. [28] used supervised algorithms with feature vectors representing the citance and reference document sentences; features include positional and rhetorical features, in addition to WordNet similarity measures.

3 Dataset

The dataset for CL-SciSumm 2017 [11] is divided into 30 training documents and 10 testing documents, each with multiple citances. In the rest of this section, some statistics about the raw dataset (with no preprocessing) are reported. The dataset contains 148,669 words and 11,114 unique words among the reference documents. There are 6,700 reference sentences, with an average length of 23 words. The average reference document length is 4,955 words, and the average number of sentences per reference document is approximately 223.

4 Task 1a Individual Methods

In this task, we are asked to identify the reference sentences referred to by a given citance. In general, we rank the sentences in the reference document according to some method, then return the top 3. This year, the following new methods were attempted by our team: positional language models, structural correspondence learning, and textual entailment techniques.

4.1 Positional Language Model Approach

The positional language model (PLM) was proposed with the idea of employing proximity information in documents to retrieve better results in response to a query [17]. For Task 1a, we consider each reference sentence a document and each citance a query. In the PLM approach, a separate language model is constructed for each position of the reference sentence, and the score of the reference sentence is computed from the similarity between its positional language models and the citance's language model. The elements of a PLM are the propagated counts of all words within the reference sentence, estimated using a density function; the closer a word is to a position, the higher its weight in that position's PLM. The PLM of reference sentence d at position i is therefore estimated as

p(w | d, i) = c'(w, i) / Σ_{w' ∈ V} c'(w', i)

where V is the vocabulary and c'(w, i) is the propagated count of word w at position i, accumulated from all of its occurrences in the reference sentence.
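For concreteness, the following is a minimal sketch of how the propagated counts c'(w, i) and the resulting positional language model could be computed. The Gaussian propagation kernel and the sigma value are illustrative assumptions; they stand in for whichever density function and bandwidth are actually configured.

```python
import math
from collections import defaultdict

def positional_language_model(sentence_tokens, position, sigma=2.0):
    """Estimate p(w | d, i): a unigram model at `position` built from the
    counts of every token, discounted by its distance to `position`."""
    propagated = defaultdict(float)
    for j, word in enumerate(sentence_tokens):
        # Gaussian kernel: occurrences close to `position` contribute more.
        propagated[word] += math.exp(-((j - position) ** 2) / (2 * sigma ** 2))
    total = sum(propagated.values())
    return {word: count / total for word, count in propagated.items()}

# Example: the PLM at position 1 of a short reference sentence.
tokens = "the cat ate the rat".split()
plm = positional_language_model(tokens, position=1)
```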
Finally, the PLM of each position in the reference sentence is compared with the language model of the citance using KL-divergence to obtain a position-specific similarity score:

S(q, d, i) = − Σ_{w ∈ V} p(w | q) log( p(w | q) / p(w | d, i) )

where p(w | q) is the language model of the citance q, p(w | d, i) is the positional language model of reference sentence d at position i, and S(q, d, i) is the similarity score between position i of the reference sentence and the citance. These position scores are used to compute the final similarity score of the reference sentence (as a document) in response to the citance (as a query). Thus, we can apply the PLM approach as a retrieval process to find the most relevant reference sentences for each citance.

4.2 Structural Correspondence Learning Approach

SCL is a transfer learning method that attempts to learn a joint representation for two different domains [4]. The reasoning behind using SCL for this task is that citances and the sentences to which they refer belong to different domains, yet correspond to each other; it therefore seemed plausible that such a method would be beneficial.

Structural Correspondence Learning seeks a joint representation by focusing on pivot features, i.e., features that are frequent in both domains. The key to SCL is to predict the occurrence of pivot features from the non-pivot features of an example. One can learn a machine learning model, such as an SVM, for this purpose. The next step is to reduce the dimensionality of these predictors, which forces some generalization. The joint representation consists of the predicted pivot features (non-pivot features are discarded). For our purposes, these new feature vectors are used to calculate cosine similarity scores with the citance.

4.3 Textual Entailment Approach

Textual entailment between two pieces of text is a directional relationship that holds only when the information contained in one text fragment can be directly or indirectly derived from the other. The derived fragment is then said to be textually entailed by the other. In Textual Entailment³, the entailing fragment is termed the text and the possibly entailed fragment is the hypothesis. For example, the following pair of text fragments demonstrates entailment:

Text: The cat ate the rat.
Hypothesis: The cat is not hungry.

The task of deriving such inferences from pairs of text is called Recognizing Textual Entailment (RTE)⁴. Our approach uses textual entailment as a measure for extracting the reference sentences relevant to a given citance. We build textual pairs from the given citance (text) and the sentences extracted from the reference document (hypothesis). We use the RTE system TIFMO [7, 29] to measure textual entailment between a given citance-reference pair. TIFMO represents a text body with trees based on Dependency-based Compositional Semantics (DCS) [29]. The system derives an inference for entailment prediction by considering logic-based relations between 'abstract denotations', i.e., relational expressions generated from queries over the DCS trees. A further improvement to the system was proposed in [7], where Generalized Quantifiers (GQs) present in the text are taken into account: lexical and/or syntactic relations between the pair of sentences (text and hypothesis) are evaluated to predict both the presence and the type of entailment.
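As an illustration of how the problem is framed, the sketch below pairs a citance (text) with each reference sentence (hypothesis) and ranks sentences by an entailment score. The entailment_score function is a hypothetical placeholder standing in for the TIFMO pipeline, not its actual interface.

```python
def rank_by_entailment(citance, reference_sentences, entailment_score):
    """Rank reference sentences by how strongly the citance (text)
    entails each of them (hypothesis); higher score = more relevant."""
    scored = [(entailment_score(text=citance, hypothesis=ref), ref)
              for ref in reference_sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ref for _, ref in scored[:3]]  # return the top 3, as in Task 1a
```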
We have used the TIFMO system proposed in [7] for our evaluation of citances and the extraction of their relevant reference sentences.

³ https://aclweb.org/aclwiki/Textual_Entailment_Portal
⁴ https://aclweb.org/aclwiki/Recognizing_Textual_Entailment

4.4 Previous Methods

We present an overview of the methods that were previously employed on this task in [21].

TFIDF. In this method we rank sentences in the reference document according to the cosine similarity between each sentence and the citance. Both the sentences and the citance are represented as TFIDF vectors, i.e., word vectors whose weights are the TFIDF values calculated from the reference document. Although TFIDF was evaluated before in [21], this time we experiment with more than just unigrams. We include variations that make use of bigrams and trigrams as well. Our naming convention for these systems includes the range of n-grams they use (for example, tfidf-1:3 uses unigrams, bigrams, and trigrams).

LDA. Latent Dirichlet Allocation is a topic modeling method [3] that models the interaction between topics and words as a statistical process. Topics within this model are drawn from a multinomial distribution. In turn, every topic has its own multinomial distribution over the words in the vocabulary. Thus, the model can capture the fact that certain topics favor certain words. For our task, we represent each sentence by its topic membership vector, which assigns to the sentence a probability of membership for each topic. These vectors are then ranked by cosine similarity, as with TFIDF.

Word Embeddings. Word embeddings assign to each word a real-valued vector [18]. Through continuous iteration, the similarity between these vectors comes to approximate the similarity between the words they represent. Thus, since 'dog' and 'pet' are similar, their respective vectors will be similar as well. Our task, however, concerns the similarity between sentences. To derive sentence similarities from word similarities, we employ the Word Mover's Distance [13]. In addition to embeddings learned from the ACL Anthology, we tested the performance of embeddings pretrained on the Google News corpus [18].

4.5 Evaluation

The evaluation of our systems for Task 1a uses the following metrics: Precision@3, Recall@3, and F1-score. In addition to the PLM method, two well-known information retrieval methods, KL-divergence and Okapi, are employed for comparison. In all of these retrieval methods, reference sentences are treated as documents and citances as queries. Okapi is a ranking function based on the probabilistic retrieval framework. KL-divergence is a language modeling retrieval approach that compares the language model of each document with that of the query and ranks documents by their KL-divergence score. The results for Task 1a on the 2017 training and test sets are reported in Table 1. Runs with an asterisk (*) were submitted.

As Table 1 shows, TFIDF is still a top performer. A few of the results differ from previous work because these results are obtained from all 30 training documents; for instance, in comparison to the results in [21], LDA and word embeddings show worse performance. It is surprising that SCL performs better than LDA. TIFMO does not perform as well as expected. The positional language model performs better than KL-divergence and Okapi. However, none of these methods perform desirably well.
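For concreteness, here is a minimal sketch of how Precision@3, Recall@3, and F1 can be computed for a single citance from a system's ranked sentence IDs and the annotator's gold IDs. The official task scorer aggregates over citances and may match spans by overlap, so this is only an illustration of the per-citance metrics.

```python
def precision_recall_f1_at_k(ranked_ids, gold_ids, k=3):
    """Compute P@k, R@k, and F1 for one citance.

    ranked_ids: reference sentence IDs ordered by the system's score.
    gold_ids:   set of sentence IDs chosen by the annotator.
    """
    top_k = ranked_ids[:k]
    hits = sum(1 for sid in top_k if sid in gold_ids)
    precision = hits / k
    recall = hits / len(gold_ids) if gold_ids else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# Example: the system returned sentences 12, 40, 7; the annotator marked 7 and 8.
print(precision_recall_f1_at_k([12, 40, 7], {7, 8}))  # (0.333..., 0.5, 0.4)
```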
One important reason for this poor performance is the difficulty of the queries, which in our setting are citances. Since a citance may not include any of the reference text's words, the retrieval process becomes more difficult.

Method       P@3 (Train)  R@3 (Train)  F1 (Train)  F1 (Test)
tfidf-1:1*   11.05%       21.20%       14.53%      6.84%
tfidf-1:2    11.39%       21.85%       14.97%      7.70%
tfidf-1:3*   11.05%       21.20%       14.53%      6.84%
tfidf-2:3    10.86%       16.57%       13.12%      7.13%
word2vec*    10.88%       20.88%       14.31%      9.12%
LDA           2.63%        5.05%        3.46%      1.99%
SCL           4.03%        6.02%        4.13%      2.28%
TIFMO         2.02%        3.88%        2.66%      1.99%
PLM           3.03%        5.81%        3.98%      0.84%
KL-div        2.63%        5.05%        3.46%      0.84%
Okapi         3.03%        5.81%        3.98%      1.13%

Table 1. Scores for individual systems on the 2017 dataset.

5 Task 1b

In Task 1b, for each cited text span, we pick the facet to which it belongs from a predefined set of facets. Two different approaches are employed in this task: a rule-based approach and a machine learning approach.

Rule-based Approach. The rule-based approach consists of three consecutive steps; each step attempts to find the correct facet through comparisons and is invoked only if no match was found in the previous steps. In the first step, citance words are compared with all five facet labels: Method, Implication, Result, Hypothesis, and Aim. If none of the words in the citance match a facet label, we move on to the second step. In the second step, an expanded form of the citance is compared with the facet labels; we expand the citance by adding all WordNet synsets [19] of each word found in the citance. In the third step, if no matching facet label was found in steps one and two, we expand the facet labels with their synsets and once again compare them with the words in the citance. (A code sketch of the first two steps is given after Table 2.)

Machine Learning Approach. In this approach, each citance is represented by a feature vector containing the TFIDF values of its words, and a classification model is learned on our training set. The trained model is then used to classify citances in the testing set. The machine learning methods used in this approach are Support Vector Machines (SVMs) [6], Random Forests [15], Decision Trees [27], MLP, and AdaBoost [10].

5.1 Evaluation

As explained in Section 5, we employed two different approaches for Task 1b: a rule-based approach and a machine learning approach. The rule-based approach has three variations: 1) Rule based-V1: all three sets of comparisons (comparing citance words with facet labels, comparing the expanded citance with facet labels, and comparing the expanded facet labels with citance words) are performed, with non-relevant synsets of all facets excluded. 2) Rule based-V2: all three sets of comparisons are performed, with only the non-relevant synsets of the "Method" facet excluded. 3) Rule based-V3: only the first and second comparisons are performed. The results of the rule-based approach on the 2017 training and test sets are presented in Table 2. A "Method only" baseline, which assigns "method" to every citance, is also included for comparison.

Method          P (Train)  R (Train)  F1 (Train)  F1 (Test)
Rule based-V1   34.34%     31.43%     32.82%      28.84%
Rule based-V2   58.41%     53.46%     55.83%      68.33%
Rule based-V3   67.50%     61.70%     64.50%      78.99%
Method only     69.36%     63.48%     66.29%      95.29%

Table 2. Precision, Recall, and F1 score of rule-based method variations.

As Table 2 shows, the third variation of the rule-based approach outperforms the other variations on both the training and test sets.
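To make the rule-based steps concrete, the sketch below implements the first two comparisons (direct label match, then a WordNet-expanded citance), assuming NLTK's WordNet interface. The filtering of non-relevant synsets used in variations V1 and V2, and the third step of expanding the facet labels themselves, are omitted here.

```python
from nltk.corpus import wordnet as wn

FACETS = ["method", "implication", "result", "hypothesis", "aim"]

def expand_with_synonyms(words):
    """Add the WordNet synonyms (lemma names) of each citance word."""
    expanded = set(words)
    for word in words:
        for synset in wn.synsets(word):
            expanded.update(lemma.name().lower() for lemma in synset.lemmas())
    return expanded

def rule_based_facet(citance_words, default="method"):
    words = {w.lower() for w in citance_words}
    # Step 1: direct match between citance words and facet labels.
    for facet in FACETS:
        if facet in words:
            return facet
    # Step 2: match against the WordNet-expanded citance.
    expanded = expand_with_synonyms(words)
    for facet in FACETS:
        if facet in expanded:
            return facet
    return default  # fallback (step 3 / most frequent facet, omitted here)
```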
The superior performance of Rule based-V3 indicates that expanding the facet labels does not help in finding the correct facet label for citances. Furthermore, the higher performance of Rule based-V2 over Rule based-V1 shows that excluding only the non-relevant synsets of the "Method" facet has a positive impact on the final results. This is likely because "Method" is the most frequent facet label in both the 2017 training and test sets; the results of the Method-only baseline also support this.

Table 3 shows the results of Task 1b for the machine learning methods on the training and test sets. For the classification experiments on the training set, the training set is split into two parts: a subset of 20 documents is used as training data and the remaining 10 documents are used as test data. For the classification experiments on the test set, the whole training set is used for the learning phase.

Method          P (Train)  R (Train)  F1 (Train)  F1 (Test)
SVM             66.7%      59.0%      62.7%       73.35%
Random Forest   61.6%      54.5%      57.8%       72.50%
Decision Tree   50.0%      53.4%      51.6%       56.89%
MLP             61.4%      54.5%      57.7%       65.83%
AdaBoost        54.0%      54.1%      54.1%       61.72%
Rule based-V1   47.82%     42.30%     44.89%      28.84%
Rule based-V2   63.24%     55.94%     59.36%      68.33%
Rule based-V3   68.37%     60.48%     64.19%      78.99%
Method only     69.16%     61.18%     64.93%      95.29%

Table 3. Precision, Recall, and F1 score of classification methods.

As Table 3 shows, SVM outperforms the other classification methods in Task 1b, and the lowest results among the classification methods belong to Decision Tree. Furthermore, a comparison between Tables 2 and 3 shows that the third variation of the rule-based approach is our best performing method for Task 1b among all rule-based and classification methods.

6 Task 1a Method Combinations

In this section, we attempt to improve on the performance of the methods found in Section 4.5 by combining them. We combine methods in three ways: 1) through a linear combination of the methods, 2) through the use of one method as a "filter" for another, and 3) through the use of learning-to-rank algorithms that are fed the scores of our individual methods.

Linear Combination. A linear combination of two methods divides the importance given to the scores of the two systems. An optimal tradeoff is calculated, which normally produces better rankings than either system independently. For more details see [21].

λ · sys1 + (1 − λ) · sys2    (1)

Filtering. The scores of one system are used to select the top N sentences from the reference document. These N sentences are then re-ranked according to another system. For N = 3 there is no difference from the system that filters, since we always return the top 3. However, as N increases, the rankings start to diverge. (A brief sketch of these two combination schemes is given below.)

Learning-to-Rank. We used a library of learning-to-rank algorithms, RankLib⁵, to combine the scores generated by the other methods. We construct a modified dataset for use with RankLib. For each citance, we construct three different queries by subsampling the irrelevant sentences in the reference document. Each query therefore consists of all relevant sentences (chosen by the annotator) and 10 irrelevant sentences chosen at random. This helps emphasize learning the ranking of the relevant sentences. The scores of the following systems were used in conjunction: tfidf-1:1, tfidf-1:2, tfidf-1:3, tfidf-2:3, word2vec (ACL), word2vec (pretrained GoogleNews), and SCL. These systems were chosen in an ad hoc manner to provide a diverse set of competing rankings.
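As an illustration of the first two combination schemes (Equation 1 and filtering), the following is a minimal sketch assuming each system supplies one score per reference sentence ID; the interpolation weight lam and the cutoff n are parameters here, whereas in our experiments the tradeoff is tuned on the training set.

```python
def linear_combination(scores_a, scores_b, lam=0.5):
    """Equation (1): lam * sys1 + (1 - lam) * sys2, per reference sentence."""
    return {sid: lam * scores_a[sid] + (1 - lam) * scores_b[sid]
            for sid in scores_a}

def filter_then_rerank(filter_scores, rerank_scores, n=10):
    """Keep the top-n sentences of the filtering system, re-rank them with
    the second system, and return the top 3."""
    candidates = sorted(filter_scores, key=filter_scores.get, reverse=True)[:n]
    reranked = sorted(candidates, key=lambda sid: rerank_scores[sid], reverse=True)
    return reranked[:3]
```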
Even though some of these component systems underperform in general, they can occasionally provide better rankings for specific citances. No attempt was made to tune the hyperparameters of the algorithms.

Since the learning-to-rank methods showed a considerable jump in performance (as can be seen in Table 4), we had to test whether overfitting was occurring. We chose LambdaMART since it is similar to MART and obtained the second highest score when fed the whole training set. We sorted the training set documents by the number of annotated citances; every third document became part of a validation set. The performance gains measured were much more modest in this scenario.

6.1 Evaluation

The results obtained by combining individual systems are found in Table 4. Runs with an asterisk (*) were submitted. LambdaMART was chosen as the representative for the learning-to-rank algorithms and, thus, is the only such algorithm with test set results.

⁵ https://sourceforge.net/p/lemur/wiki/RankLib/

Method            P@3 (Train)  R@3 (Train)  F1 (Train)  F1 (Test)
Linear Comb.*     11.79%       22.65%       15.51%      7.13%
Filtering*⁶       11.85%       22.76%       15.58%      7.41%
LambdaMART        21.71%       41.65%       28.55%      6.84%
Val. LambdaMART   13.08%       25.13%       17.21%      6.27%
MART              23.27%       44.66%       30.59%      –
Random Forest     16.01%       30.74%       21.05%      –
RankBoost         11.74%       22.54%       15.44%      –
ListNet           11.69%       22.43%       15.37%      –
Coord. Ascent     11.35%       21.79%       14.92%      –
RankNet           11.18%       21.46%       14.70%      –
Linear Regres.    11.07%       21.25%       14.56%      –
LambdaRank         0.00%        0.00%        0.00%      –

Table 4. Scores for combinations on the 2017 dataset.

7 Discussion

The results on the training set indicate that semantic methods by themselves do not perform well, yet the test set results directly contradict that claim: although TFIDF is the clear winner on the training set, the best method on the test set relied solely on word embeddings.

Task 1b also raises questions, since the skewed facet distribution of the test set inflates the effectiveness of a simple baseline such as always choosing "Method". Regardless, we can choose better features for the classifiers that would permit us to reach a comparable level of performance.

In regard to the combination methods, the results were less surprising but still leave many unanswered questions. Our experiments with learning-to-rank methods hint at overfitting, but the test set provided no evidence that it was occurring. The linear combination of two systems, explored in [21], had similar performance to filtering; the filtering method was slightly more robust.

8 Future Work

We would like to investigate why there was such a drastic difference between the performance measured on the training and test sets. A comprehensive study could contrast the characteristics of the training and test sets from a linguistic and statistical point of view. The differences between the training and test sets may reveal what types of citances benefit most from semantic information. In a similar vein, we would like to find out why there is a considerable difference between the performance of TFIDF and Okapi even though they have similar formulations. Finally, we have not exhausted our exploration of textual entailment and would like to investigate newer methods that have been developed.

⁶ The submitted run was generated erroneously, which led to an F1 score of 1.4%.

9 Acknowledgements

We would like to thank the NSF for Grants CNS 1319212, DGE 1433817 and DUE 1241772.

References

1. Amjad Abu-Jbara and Dragomir Radev. Coherent citation-based summarization of scientific papers.
In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 500–509. Association for Computational Linguistics, 2011.
2. Peeyush Aggarwal and Richa Sharma. Lexical and syntactic cues to identify reference scope of citance. In BIRNDL@JCDL, pages 103–112, 2016.
3. David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
4. John Blitzer, Ryan T. McDonald, and Fernando Pereira. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), 22-23 July 2006, Sydney, Australia, pages 120–128, 2006.
5. Ziqiang Cao, Wenjie Li, and Dapeng Wu. PolyU at CL-SciSumm 2016. In BIRNDL@JCDL, pages 132–138, 2016.
6. Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
7. Yubing Dong, Ran Tian, and Yusuke Miyao. Encoding generalized quantifiers in dependency-based compositional semantics. 2014.
8. Aaron Elkiss, Siwei Shen, Anthony Fader, Güneş Erkan, David States, and Dragomir Radev. Blind men and elephants: What do citation summaries tell us about a research article? Journal of the American Society for Information Science and Technology, 59(1):51–62, 2008.
9. Aaron Elkiss, Siwei Shen, Anthony Fader, Güneş Erkan, David States, and Dragomir Radev. Blind men and elephants: What do citation summaries tell us about a research article? Journal of the American Society for Information Science and Technology, 59(1):51–62, 2008.
10. Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
11. Kokil Jaidka, Muthu Kumar Chandrasekaran, Devanshu Jain, and Min-Yen Kan. Overview of the CL-SciSumm 2017 shared task. In Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and NLP for Digital Libraries (BIRNDL 2017), Tokyo, Japan, CEUR, 2017.
12. Stefan Klampfl, Andi Rexha, and Roman Kern. Identifying referenced text in scientific publications by summarisation and classification techniques. In BIRNDL@JCDL, pages 122–131, 2016.
13. Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger. From word embeddings to document distances. In ICML, volume 15, pages 957–966, 2015.
14. Lei Li, Liyuan Mao, Yazhao Zhang, Junqi Chi, Taiwen Huang, Xiaoyue Cong, and Heng Peng. CIST system for CL-SciSumm 2016 shared task. In BIRNDL@JCDL, pages 156–167, 2016.
15. Andy Liaw and Matthew Wiener. Classification and regression by randomForest. R News, 2(3):18–22, 2002.
16. Kun Lu, Jin Mao, Gang Li, and Jian Xu. Recognizing reference spans and classifying their discourse facets. In BIRNDL@JCDL, pages 139–145, 2016.
17. Yuanhua Lv and ChengXiang Zhai. Positional language models for information retrieval. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '09), page 299, 2009.
18. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
19. George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
20. Saif Mohammad, Bonnie Dorr, Melissa Egan, Ahmed Hassan, Pradeep Muthukrishan, Vahed Qazvinian, Dragomir Radev, and David Zajic.
Using citations to generate surveys of scientific paradigms. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 584–592. Association for Computational Linguistics, 2009.
21. Luis Moraes, Shahryar Baki, Rakesh Verma, and Daniel Lee. Identifying reference spans: topic modeling and word embeddings help IR. International Journal on Digital Libraries, 2017.
22. Preslav I. Nakov, Ariel S. Schwartz, and Marti Hearst. Citances: Citation sentences for semantic analysis of bioscience text. In Proceedings of SIGIR, volume 4, pages 81–88, 2004.
23. Hidetsugu Nanba, Noriko Kando, and Manabu Okumura. Classification of research papers using citation links and citation types: Towards automatic review article generation. Advances in Classification Research Online, 11(1):117–134, 2000.
24. Tadashi Nomoto. NEAL: A neurally enhanced approach to linking citation and reference. In BIRNDL@JCDL, pages 168–174, 2016.
25. Brett Powley and Robert Dale. Evidence-based information extraction for high accuracy citation and author name identification. In Large Scale Semantic Access to Content (Text, Image, Video, and Sound), pages 618–632. Le Centre de Hautes Etudes Internationales d'Informatique Documentaire, 2007.
26. Vahed Qazvinian, Dragomir R. Radev, Saif Mohammad, Bonnie J. Dorr, David M. Zajic, Michael Whidby, and Taesun Moon. Generating extractive summaries of scientific paradigms. Journal of Artificial Intelligence Research (JAIR), 46:165–201, 2013.
27. J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81–106, 1986.
28. Horacio Saggion, Ahmed AbuRaed, and Francesco Ronzano. Trainable citation-enhanced summarization of scientific articles. In Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2016), Newark, United States, pages 175–186. CEUR Workshop Proceedings, 2016.
29. Ran Tian, Yusuke Miyao, and Takuya Matsuzaki. Logical inference on dependency-based compositional semantics. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), June 22-27, 2014, Baltimore, MD, USA, Volume 1: Long Papers, pages 79–89, 2014.