An Initial Analysis of Topic-based Similarity among Scientific Documents Based on their Rhetorical Discourse Parts

Carlos Badenes-Olmedo¹, José Luis Redondo-García², and Oscar Corcho¹

¹ Universidad Politécnica de Madrid, Ontology Engineering Group, Spain. {cbadenes, ocorcho}@fi.upm.es
² Amazon Research, Cambridge, UK. jluisred@amazon.com

Abstract. Summaries and abstracts of research papers have traditionally been used for many purposes by scientists, research practitioners, editors, programme committee members and reviewers (e.g. to identify relevant papers to read, publish or cite, or to explore new fields and disciplines). As a result, many paper repositories only store or expose abstracts, which may limit the capacity of finding the right paper for a specific research purpose. Given their size limitations and concise nature, abstracts usually omit explicit references to some contributions and impacts of the paper. Therefore, for certain information retrieval tasks they cannot be considered the most appropriate excerpt of the paper on which to base these operations. In this paper we study other kinds of summaries, built from textual fragments falling under certain categories of the scientific discourse, such as outcome, background or approach, in order to decide which one is more appropriate as a substitute for the original text. In particular, two novel measures are proposed: (1) internal-representativeness, which evaluates how well a summary describes what the full text is about, and (2) external-representativeness, which evaluates the potential of a summary to discover related texts. Results suggest that summaries explaining the method of a scientific article express a more accurate description of the full content than the others. In addition, more relevant related articles are also discovered from summaries describing the method, together with those containing the background knowledge or the outcomes of the research paper.

1 Introduction

In this paper we present our first steps in the analysis of the quality of research article summaries. Our goal is to find the strengths and weaknesses of approaches relying exclusively on abstracts against those based on scientific discourse categories such as approach, challenge, background, outcomes and future work. Since the main contributions and impacts of a research article are not always included explicitly in the abstract, as in the case of [9], whose abstract omits details about the model architectures it describes, abstracts cannot always be considered the most adequate scientific summary of a research paper. In order to judge this adequacy, two novel measures are proposed, based on the capability of the summary to substitute the original paper: (1) internal-representativeness, which evaluates how well the summary represents the original full text, and (2) external-representativeness, which evaluates how well the set of related texts produced by the summary matches the set produced by the original full text.

The paper is organized as follows: Section 2 highlights recent studies on text mining research articles and presents the steps followed to measure the representativeness of abstracts and of research article summaries based on rhetorical categories. It describes both the classifier used to identify those categories in papers and the representational model and similarity metric used to compare textual units. Experimental results comparing the different kinds of summaries are shown in Section 3. Finally, Section 4 presents conclusions from the experiments.
2 Background and Approach

Recent studies [16][13] have shown that text mining of full research articles gives consistently better results than using only their corresponding abstracts. Given the size limitations and concise nature of abstracts, they often omit descriptions or results that are considered less relevant but are still important in certain Information Retrieval (IR) tasks. Indeed, when other researchers cite a particular paper, 20% of the keywords they mention are not present in its abstract [6].

In this paper, we present our initial analysis of the representativeness of research article summaries, considering both those based exclusively on abstracts and those based on the papers' discursive structure (approach, challenge, background, outcomes and future work) [14]. The representativeness of a summary with respect to the original full text is defined as the degree of relation to the original text (internal-representativeness), along with the capacity to mimic the full text when finding related items (external-representativeness). In order to quantify these notions, a probabilistic topic model is trained over the entire set of papers to obtain a vectorial representation of each text retrieved from a paper, whether based on the full content or on a summary. The vectorial representations of the full papers are used to measure the distance to those derived from abstracts or summaries (internal-representativeness), and also to find similar documents (external-representativeness) based on the distance between their vectorial representations. An upper distance threshold is specified to filter out less similar pairs and compose a set of related papers for each paper. Then, a comparison in terms of precision and recall is performed between the sets obtained by using only the vectorial representation of the full papers and the sets produced by using the other kinds of summaries.

2.1 Annotation of Rhetorical Discourse Parts

First of all, we need to identify the rhetorical parts of a research paper. Some approaches have been proposed to summarize scientific articles [4] by taking advantage of the citation context and the document discourse model. We have used the scientific discourse annotator proposed by [11] to automatically create summaries from scientific articles by classifying each sentence as belonging to one of the following scientific discourse categories: approach, challenge, background, outcomes and future work. These categories were identified from the schemata proposed by [15] with the original purpose of characterizing the content of Computer Graphics papers. The annotator is based on a Support Vector Machine classifier that combines lexical and syntactic features to model each sentence in a paper. This tool (http://backingdata.org/dri/library/) was integrated in the librAIry [1] Rhetoric Module (https://github.com/librairy/annotator-rhetoric) to automatically annotate research papers with their rhetorical content.
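The sketch below illustrates the general shape of this sentence-level classification step. It is not the actual Dr. Inventor implementation [11], which combines richer lexical and syntactic features; it is a minimal stand-in using TF-IDF features and a linear SVM, with toy training data, showing how a category-specific summary can be assembled from classified sentences.

```python
# Illustrative stand-in for the rhetorical sentence classifier; the real
# annotator [11] uses richer lexical and syntactic features. The training
# data below is a toy placeholder, not the annotator's actual corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_sentences = [
    "We propose a novel method for entity linking.",
    "Previous work has addressed this task with rule-based systems.",
    "Our experiments show a significant improvement over the baseline.",
    "Scaling this approach to larger corpora remains an open problem.",
    "We plan to extend the model with multilingual support.",
]
train_labels = ["approach", "background", "outcome", "challenge", "futureWork"]

classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
classifier.fit(train_sentences, train_labels)

def summary_for(sentences, category):
    """Concatenate the sentences predicted to belong to one rhetorical category."""
    predicted = classifier.predict(sentences)
    return " ".join(s for s, c in zip(sentences, predicted) if c == category)
```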
2.2 Representational Model

A representational model is required not only to measure distances between text fragments but, more importantly, to help understand the differences in their content. Topic models are widely used to uncover the latent semantic structure of text corpora. In particular, probabilistic topic models represent documents as mixtures of topics, where topics are probability distributions over words. Latent Dirichlet Allocation (LDA) [2] is the simplest generative topic model: it adds Dirichlet priors on the document-specific topic mixtures, making it possible to characterize documents not used during training. This is a key feature for our evaluations because, although the model used for the experiments is trained on the full content of the papers, it is also used to describe the text summaries. Thus, we have used an LDA model to describe the inherent topic distribution of the papers in the corpus. Some hyper-parameters need to be estimated: the number of topics (k), the concentration parameter (α) of the prior placed on the documents' distributions over topics, and the concentration parameter (β) of the prior placed on the topics' distributions over terms. Since the target of this experiment is not to evaluate the quality of the representational model, but to compare topic distributions, we accepted as valid the values widely used in the literature: α = 0.1, β = 0.1, and k = 2√(n/2) ≈ 44, where n is the size of the corpus.

Similarity Measure. Feature vectors in topic models are topic distributions, expressed as vectors of probabilities. Hence we opt for the Jensen-Shannon divergence (JSD) [8] instead of the commonly used Kullback-Leibler divergence (KLD). The reason is that KLD (1) is not defined when a topic distribution is zero and (2) is not symmetric, which does not fit well with semantic similarity measures, which are in general symmetric [12]. JSD considers the average of the distributions as follows:

\[ \mathrm{JSD}(p, q) = \sum_{i=1}^{T} p_i \log \frac{2\,p_i}{p_i + q_i} + \sum_{i=1}^{T} q_i \log \frac{2\,q_i}{q_i + p_i} \tag{1} \]

where T is the number of topics and p, q are the topic distributions. The similarity measure used in our analysis is the JSD transformed into a similarity score as follows [5]:

\[ \mathrm{similarity}(D_i, D_j) = 10^{-\mathrm{JSD}(p, q)} \tag{2} \]

where Di, Dj are the documents and p, q their respective topic distributions.
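A minimal sketch of this representational model and similarity measure follows. The paper does not name an LDA implementation, so gensim is an assumed choice, tokenization is simplified to whitespace splitting, and the logarithm in equation (1) is assumed to be natural, since its base is not specified.

```python
# Sketch of the representational model (Section 2.2): an LDA model with
# alpha = beta = 0.1 and k = 44 topics, plus the JSD-based similarity of
# equations (1) and (2). Gensim and the tokenization are assumptions.
import math
from gensim import corpora, models

def train_lda(full_texts, k=44, alpha=0.1, beta=0.1):
    """Train LDA over the full content of the papers."""
    tokens = [text.lower().split() for text in full_texts]
    dictionary = corpora.Dictionary(tokens)
    bows = [dictionary.doc2bow(t) for t in tokens]
    lda = models.LdaModel(bows, num_topics=k, id2word=dictionary,
                          alpha=alpha, eta=beta)
    return lda, dictionary

def topic_distribution(lda, dictionary, text):
    """Infer a dense topic distribution for any text (full paper or summary)."""
    bow = dictionary.doc2bow(text.lower().split())
    probs = dict(lda.get_document_topics(bow, minimum_probability=0.0))
    return [probs.get(i, 0.0) for i in range(lda.num_topics)]

def jsd(p, q):
    """Equation (1); zero-probability terms contribute nothing (x log x -> 0)."""
    left = sum(pi * math.log(2 * pi / (pi + qi)) for pi, qi in zip(p, q) if pi > 0)
    right = sum(qi * math.log(2 * qi / (qi + pi)) for pi, qi in zip(p, q) if qi > 0)
    return left + right

def similarity(p, q):
    """Equation (2): turn the divergence into a similarity score."""
    return 10 ** (-jsd(p, q))
```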
3 Experiments

The corpus used in the experiments was created by combining journals from different scientific domains: Advances in Space Research, Procedia Chemistry, Journal of Pharmaceutical Analysis and Journal of Web Semantics. In total, 1,000 papers were included, 250 from each journal. Both the abstract and the full content of these documents were retrieved directly from the Elsevier API (https://dev.elsevier.com) by using the librAIry [1] Harvester module (https://github.com/librairy/harvester-elsevier). The code used to perform the analysis, along with the results obtained, is available on GitHub (https://github.com/librairy/study-semantic-similarity).

Since the annotation process that automatically discovers the rhetorical parts of a research paper (Section 2.1) is sensitive to the structure of the phrases used when writing the text, only 20% of the papers in the corpus could be fully annotated with all the considered fragments. In fact, the categories are not present in the same proportion in the corpus: approach (90%), background (78%), outcome (73%), challenge (57%) and future work (21%).

3.1 Internal Representativeness

The internal-representativeness of a summary measures the similarity of the summary to the original full-text research paper. This similarity is based on the JSD between their topic distributions. Since LDA treats documents as bags of words, the text length (e.g. of the full content or of the summaries) affects the accuracy of the topic distributions inferred by the topic model described in Section 2.2. The occurrences of words in short texts are less discriminative than in long texts, where the model has more word counts from which to learn how words are related [7]. In view of the above, the approach, background and outcome content of a paper generate more accurate topic distributions than those created from other, shorter texts such as the abstract.

Fig. 1. Length of summaries.
Fig. 2. Relative size of the parts of an article.

Also, the relative presence of each part in a paper (figure 2) shows an unexpected result when compared to the IMRaD format [10]. This style proposes to distribute the content of an abstract, and by extension the full paper, as follows: Introduction (25%), Methods (25%), Results (35%) and Discussion (15%). However, the results (figure 2) show that in our corpus the Methods section (approach content) is more extensive than the Results section (outcome content).

All pairwise similarities between full papers, abstracts and rhetorical-based summaries are calculated to measure the internal-representativeness of a summary with respect to the original text, i.e. the topic-based similarity value (equation 2) between the probability distributions of the full text and of each of the summaries. The results (table 1) suggest that summaries created from the approach content are more representative than the others, i.e. the distribution of topics describing the text created from the approach content is the most similar to the one corresponding to the full content of the paper.

             Min     Lower Quartile  Upper Quartile  Max     Std. Dev.  Median
abstract     0.0489  0.9109          0.9840          1.0000  0.1443     0.9741
approach     0.0499  0.9969          1.0000          1.0000  0.0872     0.9998
background   0.0463  0.8967          0.9937          0.9988  0.2037     0.9822
challenge    0.0426  0.7503          0.9517          0.9940  0.2224     0.8829
futureWork   0.0000  0.6003          0.9435          0.9948  0.2842     0.8814
outcome      0.0485  0.9267          0.9925          0.9990  0.1721     0.9835

Table 1. Internal-representativeness of each kind of summary.
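Reusing topic_distribution() and similarity() from the previous sketch, the per-summary statistics of table 1 could be gathered as follows. The `papers` structure, holding each paper's full text and its per-category summaries, is a hypothetical stand-in for the corpus.

```python
# Sketch of the internal-representativeness computation behind Table 1,
# reusing topic_distribution() and similarity() from the previous sketch.
# `papers` is a hypothetical list of dicts such as:
#   {"full_text": "...", "summaries": {"abstract": "...", "approach": "...", ...}}
from statistics import median

def internal_representativeness(lda, dictionary, papers, category):
    """Median similarity between each full text and its `category` summary."""
    scores = []
    for paper in papers:
        p = topic_distribution(lda, dictionary, paper["full_text"])
        q = topic_distribution(lda, dictionary, paper["summaries"][category])
        scores.append(similarity(p, q))
    return median(scores)
```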
3.2 External Representativeness

The external-representativeness metric tries to measure how different the set of related documents obtained from a summary is from the set derived from the original full text. A comparison in terms of precision, recall and f-measure has been performed to analyze how well the summaries discover related content compared to using the full text of the article. Using the same topic model created previously, similarities among all pairs of documents were calculated according to equation 2. Then, a minimum score, or similarity threshold, is required to define when a pair of papers is related. Each threshold is used to create a gold standard that relates articles to others based on their similarity values. In order to discover that lower bound of similarity, we studied the trends in the similarity scores (fig 3) as well as the distributions of topics in the corpus (fig 4). We can see that topics are not equally balanced across papers. This fact generates separate groups of strongly related papers. We think this phenomenon is due to our usage of a corpus created from journals where different domains are equally balanced. We therefore considered a similarity score of 0.99 (fig 3) as the threshold from which strong relations appear. However, to cover different interpretations of similarity, from those based on sharing general ideas or themes to those that imply sharing more specific content, the following list of thresholds was considered in the experiments: 0.5, 0.6, 0.7, 0.8, 0.9, 0.95 and 0.99.

Fig. 3. Number of pairs by similarity score (rounded to two decimals).
Fig. 4. Topics per article with value above 0.5.

For each similarity threshold, a gold standard was created by considering as related those papers with a similarity value above the selected threshold (the sketch at the end of this section illustrates this evaluation loop). The results (figure 5) comparing the related papers inferred from the full content with those inferred from the partial-content representations (i.e. abstract or rhetorical parts) suggest that strongly related papers are mainly discovered by using the summary created from the approach section. The reason for this may be the average size of this type of summary or the particular content included in this part of a paper. While other summaries include more general-domain words, the approach content includes more specific words that describe the method or the final objective of the paper. So, for higher similarity thresholds, i.e. for strongly related papers, the recommendations discovered by using the approach are more precise than those discovered by using the abstract.

In terms of recall (figure 6), the upward trend followed by the approach, outcome and background content supports the assumption that summaries containing key words make it possible to discover more similar papers. Moreover, since recall overlooks false-negative classifications, it suggests that these parts of a research paper share more words with strongly related papers than the other parts do, but they may also present commonalities with merely highly related papers, except in the case of approach, which still exhibits higher precision.

Fig. 5. Precision at different similarity thresholds.
Fig. 6. Recall at different similarity thresholds.

As expected, only summaries created from the approach, outcome and background content maintain high accuracy values (fig 7) even for high similarity thresholds. Together with the results shown in figure 8, where the same three rhetorical classes present the lowest standard deviation of the f-measure, they can be considered the most robust summaries, containing the ideas that best characterize the paper.

Fig. 7. F-measure at different similarity thresholds.
Fig. 8. Standard deviation (σ) of the f-measure.
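The evaluation loop behind figures 5-8 can be sketched as follows, again reusing similarity() from the earlier sketch. The exact pairing is our reading of this section and thus an assumption: for each paper, the set retrieved with its full-text distribution is the gold standard, and the set retrieved with one of its summary distributions is scored against it; `full_dists` and `summary_dists`, mapping paper ids to topic distributions, are hypothetical stand-ins.

```python
# Sketch of the external-representativeness evaluation (figures 5-8).
# For each threshold, the full-text related set is the gold standard and
# the summary-based related set is scored against it. The data structures
# are hypothetical stand-ins; similarity() comes from the earlier sketch.

THRESHOLDS = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]

def related_set(target_dist, candidate_dists, threshold):
    """Ids of the candidates whose similarity to the target is above the threshold."""
    return {pid for pid, dist in candidate_dists.items()
            if similarity(target_dist, dist) > threshold}

def precision_recall_f1(gold, retrieved):
    tp = len(gold & retrieved)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

def evaluate(pid, full_dists, summary_dists):
    """Score one paper's summary-based related sets at every threshold."""
    candidates = {o: d for o, d in full_dists.items() if o != pid}
    results = {}
    for t in THRESHOLDS:
        gold = related_set(full_dists[pid], candidates, t)
        retrieved = related_set(summary_dists[pid], candidates, t)
        results[t] = precision_recall_f1(gold, retrieved)
    return results
```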
4 Conclusions and Future Work

We have studied topic-based similarities among scientific documents, comparing their abstract sections with summaries corresponding to their scientific discourse categories. For this purpose, two novel measures have been proposed: (1) internal-representativeness and (2) external-representativeness. Results show that summaries created from the approach, outcome or background content of a paper describe its full content, in terms of overall ideas and related documents, more accurately than abstracts. Although these summaries are more extensive in number of characters than others with similar precision, such as the abstract, they have proven to be particularly helpful for discovering strongly related papers, i.e. papers with a similarity value close to 1.0. In order to avoid an influence of the size of the summaries on the accuracy of the results, in future work we plan to describe the texts with probabilistic topic model algorithms oriented to handling short texts, such as BTM [3].

References

1. Badenes-Olmedo, C., Redondo-García, J.L., Corcho, O.: Distributing Text Mining tasks with librAIry. In: Proceedings of the 17th ACM Symposium on Document Engineering (DocEng) (2017)
2. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. Journal of Machine Learning Research 3, 993–1022 (2003)
3. Cheng, X., Yan, X., Lan, Y., Guo, J.: BTM: Topic Modeling over Short Texts. IEEE Transactions on Knowledge and Data Engineering PP(99), 1 (2014)
4. Cohan, A., Goharian, N.: Scientific Article Summarization Using Citation-Context and Article's Discourse Structure. In: Conference on Empirical Methods in Natural Language Processing. pp. 390–400 (2015)
5. Dagan, I., Lee, L., Pereira, F.C.N.: Similarity-Based Models of Word Cooccurrence Probabilities. Machine Learning 34(1-3), 43–69 (1999)
6. Divoli, A., Nakov, P., Hearst, M.A.: Do peers see more in a paper than its authors? Advances in Bioinformatics (2012)
7. Hong, L., Davison, B.D.: Empirical study of topic modeling in Twitter. In: Proceedings of the First Workshop on Social Media Analytics (SOMA '10). pp. 80–88 (2010)
8. Lin, J.: Divergence Measures Based on the Shannon Entropy. IEEE Transactions on Information Theory 37(1), 145–151 (1991)
9. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient Estimation of Word Representations in Vector Space. In: Proceedings of the International Conference on Learning Representations (ICLR 2013). pp. 1–12 (2013)
10. Nair, P.R., Nair, V.D.: Organization of a Research Paper: The IMRAD Format. In: Scientific Writing and Communication in Agriculture and Natural Resources, p. 150 (2014)
11. Ronzano, F., Saggion, H.: Dr. Inventor Framework: Extracting Structured Information from Scientific Publications. In: Discovery Science: 18th International Conference. pp. 209–220 (2015)
12. Rus, V., Niraula, N., Banjade, R.: Similarity Measures Based on Latent Dirichlet Allocation. In: Computational Linguistics and Intelligent Text Processing. pp. 459–470 (2013)
13. Elsevier R&D Solutions for Pharma & Life Sciences: Harnessing the power of content - Extracting value from scientific literature: the power of mining full-text articles for pathway analysis (2016)
14. Teufel, S.: The Structure of Scientific Articles: Applications to Citation Indexing and Summarization. CSLI Studies in Computational Linguistics (2010)
15. Teufel, S., Siddharthan, A., Batchelor, C.: Towards discipline-independent Argumentative Zoning: Evidence from chemistry and computational linguistics. In: Conference on Empirical Methods in Natural Language Processing. pp. 1493–1502 (2009)
16. Westergaard, D., Stærfeldt, H.H., Tønsberg, C., Jensen, L.J., Brunak, S.: Text mining of 15 million full-text scientific articles. bioRxiv (2017)