Identifying Citation Contexts: a Review of Strategies and Goals

Agata Rotondi, Angelo Di Iorio, Freddy Limpens
Department of Computer Science and Engineering
University of Bologna, Italy
agata.rotondi@unibo.it angelo.diiorio@unibo.it freddy.limpens@unibo.it

Abstract

English. The Citation Contexts of a cited entity can be seen as little tesserae that, fitted together, can be exploited to follow the opinion of the scientific community towards that entity as well as to summarize its most important contents. This mosaic is also an excellent resource of information for identifying topic-specific synonyms, indexing terms and citers' motivations, i.e. the reasons why authors cite other works. Is a paper cited for comparison, as a source of data or just for additional info? What is the polarity of a citation? Different reasons for citing also reveal different weights of the citations and different impacts of the cited authors that go beyond mere citation count metrics. Identifying the appropriate Citation Context is the first step toward a multitude of possible analyses and research. So far, Citation Contexts have been defined in several ways in the literature, related to different purposes, domains and applications. In this paper we present different dimensions of Citation Context investigated by researchers through the years, in order to provide an introductory review of the topic to anyone approaching this subject.

Italiano. Possiamo pensare ai Contesti Citazionali come tante tessere che, unite, possono essere sfruttate per seguire l'opinione della comunità scientifica riguardo ad un determinato lavoro o per riassumerne i contenuti più importanti. Questo mosaico di informazioni può essere utilizzato per identificare sinonimi specifici e Index Terms nonché per individuare i motivi degli autori dietro le citazioni. Identificare il Contesto Citazionale ottimale è il primo passo per numerose analisi e ricerche. Il Contesto Citazionale è stato definito in diversi modi in letteratura, in relazione a differenti scopi, domini e applicazioni. In questo paper presentiamo le principali dimensioni testuali di Contesto Citazionale investigate dai ricercatori nel corso degli anni.

1 Introduction and Background

Researchers consider as Citation Context (CC) different snippets of text around a citation marker. These differences of width influence the applications that exploit the CC as a source of information. For example, Qazvinian and Radev (2010) showed that, when generating surveys, also using implicit citations (i.e. sentences that contain information about a specific secondary source but do not explicitly cite it), rather than citing sentences alone, improves the results. Ritchie et al. (2008) compared different widths of CC in order to find the most appropriate window for identifying Index Terms. They showed that varying the context from which the Index Terms are gathered has a significant effect on retrieval effectiveness. Aljaber et al. (2010) tested different sizes of CC in a document clustering experiment. They claimed that a window of 50 words on either side of the citation marker works better than taking 10 or 30 terms or the citing sentence alone, whatever its size. According to their analysis, relevant synonymous and related vocabulary extracted from this window of text, in combination with an original full-text representation of the cited document, is effective for document clustering. Finding the optimal CC for a specific application is thus a challenging task that interests researchers and underlies every study that exploits the CC as a source of information.
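The word windows compared in the studies above (for instance, 10, 30 or 50 words on either side of the citation marker) can be illustrated with a minimal sketch. It is not the implementation used by any of the cited authors: the function name `word_window`, the `MARKER` pattern (which only recognizes author-year markers in parentheses) and the whitespace tokenization are all assumptions made for illustration.

```python
import re
from typing import List

# Hypothetical pattern: matches author-year markers such as "(Smith et al., 2010)".
# Real papers use many other marker styles (e.g. "[12]"), which this sketch ignores.
MARKER = re.compile(r"\([A-Z][^()]*?\d{4}\)")


def word_window(text: str, n_words: int = 50) -> List[str]:
    """Return one context per citation marker: up to n_words
    whitespace-separated tokens before and after the marker."""
    contexts = []
    for match in MARKER.finditer(text):
        tokens_before = text[:match.start()].split()
        tokens_after = text[match.end():].split()
        # Guard n_words == 0: a slice of [-0:] would return the whole list.
        before = tokens_before[-n_words:] if n_words > 0 else []
        after = tokens_after[:n_words]
        contexts.append(" ".join(before + [match.group()] + after))
    return contexts


sample = "Alpha beta gamma (Smith et al., 2010) delta epsilon zeta."
print(word_window(sample, n_words=2))
```

Because the window is counted in tokens and ignores sentence boundaries, the extracted context may start or end mid-sentence, which is one reason why the choice of width matters for downstream applications.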
With the purpose of providing a useful background to anyone approaching this question, in the following sections we give an overview of the different dimensions of textual CC investigated in the literature. We classified them in 3 main categories: a) fixed number of characters, b) citing sentence, c) extended context (fixed and adaptive), and we summarized our analysis in Figure 1. We focus on the strategies to identify the correct textual CC of a citation; nevertheless, other CC-related topics have been investigated in the literature, for example citation recommendation (see Farber et al. (2018) and Ebesu and Fang (2017)).

Figure 1: Survey Summary

We became convinced of the need for a clear introductory survey about how the CC has been differently shaped in the literature when we faced the problem of defining the optimal CC for the Semantic Coloring of Academic References (SCAR) project (http://dasplab.cs.unibo.it/index.php/scar/) (Di Iorio et al., 2018). The goal of the SCAR project is to enrich the bibliographies of scientific articles by adding explicit metadata about individual bibliographic entries and to characterize these entries according to multiple criteria. With this purpose, we are studying a set of properties to support the automatic characterization of bibliographic entries, and one of our primary sources of information is the textual content around citation markers, i.e. the CC. We are currently investigating the best span of text for our needs. By reviewing the literature, we realized that different approaches correspond to different tasks and are also related to the linguistic domain of application. The SCAR project, as well as this review, is focused on the English language, but it would be interesting to extend this study to other languages.

2 Fixed Number of Characters

A good way to start exploring how the CC can be diversely defined is to look at well-known examples. One of these is CiteSeerX (http://citeseerx.ist.psu.edu/index), the public search engine and digital library for scientific and academic papers. This web platform allows users to browse papers' references and to read the context in which a reference is cited. The function shows 200 characters before and after the citation marker. Here the choice of the CC width is not directly related to further analysis and applications, as the purpose is the mere reading of text by users. As Ii et al. (2014) describe, CiteSeerX uses ParsCit (Councill et al., 2008) for citation extraction. ParsCit is a freely available, open-source implementation of a reference string parsing package which performs reference string segmentation and CC extraction. The size of the context is configurable, but by default extends to 200 characters on either side of the match. ParsCit is a well-known piece of software and is used in different projects. For example, the Association for Computational Linguistics (ACL) Anthology Network (http://aan.how/index.php/home/about) uses ParsCit for curation. Doslu and Bingol (2016) also used ParsCit in their work on how to rank articles for a given topic. The authors exploited the information contained in the CC of a certain paper for detecting important articles and providing focused directions to access the literature about a topic. They stated that the words used to describe a cited paper stand close to the citation marker, and this is their motivation for choosing a fixed window size context. Before Doslu and Bingol, Bradshaw (2003) also used the CC to index cited papers for specific topics. He designed Reference Directed Indexing, in which measures of relevance and impact are joined in a single retrieval metric based on the comparison of the terms authors use in multiple CC of a document. The CC Bradshaw used to index the documents are directly gathered from CiteSeerX. The tool presented by Knoth et al. (2017), who address the problem of automatically retrieving and collecting CC for a given unstructured research paper, also extracts a CC window of fixed length, corresponding to 300 characters before and after a citation marker. Considering as CC a fixed-length snippet around the citation marker is a naive baseline method. It can be used to retrieve terms related to a cited entity, and the accuracy of applications that employ it might be improved, for example, by considering sentence or paragraph boundaries (Aljaber et al., 2010). This kind of context is unsuitable if the CC needs to be further analyzed, for example by using syntactic parsers, or if its content has to be represented in a coherent formal way where the meaning and structure of sentences have to be preserved.

3 Citing Sentence

Another famous platform among scholars is Semantic Scholar (https://www.semanticscholar.org). This subjective search service for journal articles provides several functions for browsing papers, among which is the possibility of quickly reading the CC of each citation. This service allows reading more than one excerpt of text for each entity (when available). Each CC shown corresponds exactly to a citing sentence, i.e. the sentence that contains the targeted reference marker. Implicit citations are also investigated by exploiting lexical hooks, and in these cases too the CC excerpts shown are in the form of a full sentence. (In more detail, by implicit citations we refer to those mentions of a work where the relation between cited entity and citing entity is not provided by a citation marker but rather by a lexical object related to the cited entity. E.g., "The heuristics based on WordNet and Wikipedia ontologies are very sensitive to preprocessing" is an implicit citation of George A. Miller (1995), WordNet: A Lexical Database for English, Communications of the ACM, Vol. 38, No. 11: 39-41.)

The same CC window has been adopted in several projects. Nakov et al. (2004) investigated the use of CC for the semantic interpretation of bioscience articles. Starting from the collection of the citing sentences related to a specific cited entity (which they call citances), they used the output of a dependency parser to build paraphrases expressing relations between two named entities. As noted above, parsers need to be fed with full sentences in order to provide proper representations, and this work is a clear example where a fixed-length CC would not have been an appropriate input. Elkiss et al. (2008) also focused their research on the set of citing sentences of a given article (named citation summaries by the authors), testing the biomedical domain. Although the study by Elkiss et al. did not rely on any strictly sentence-based technique (they employed cosine similarity and tf-idf), both their hypotheses are grounded on the importance of citing sentence boundaries. Sula and Miller (2014) presented an experimental tool for extracting and classifying citation contexts in the humanities. Their approach is based on citing sentences, from which they extracted features (e.g. location in document) and polarity (evaluating n-grams with a naive Bayes classifier). Bertin et al. (2016) followed a similar approach to identify n-grams and sentiment in CC. They chose to work on a sentence basis, stating that sentences are the natural building blocks of text and likely to include the context of a specific reference. Starting from citing sentences, they extracted 3-grams containing verbs, together with position in the paper and type of section according to the IMRaD structure, in order to analyze the combination and distribution of these features in the biomedical domain.

The citing sentence as a base unit for CC is mostly chosen in hard science domains. In fact, scientific communities have particular ways of using language and specific conventions that reveal clear disciplinary differences. Hyland (2009) describes some of these language variations, which go from terminology differences to different citation practices and rhetorical preferences. Writers use different sets of reporting verbs to refer to others' work (engineers show, philosophers argue, biologists find and linguists suggest); frequencies of hedges and self-citations, directives and n-grams also diverge across fields. In the humanities writers tend to include extensive referencing and build a background for a heterogeneous readership, while in hard sciences most of the readers share a common context with writers. This attitude clarifies citers' behaviors in different domains and makes us presume that CC in the humanities might be more complex than in hard sciences. Following these considerations, it is reasonable to conclude that, to choose the appropriate CC width, one needs to take into account not only the task at hand but also the domain of application and the specificity of the language. In this sense, the CC as citing sentence might not always correspond to the entire fragment of text referring to a targeted citation marker.

4 Extended Context

Extending the CC beyond the citing sentence can prove useful in many cases, as illustrated by ResearchGate (https://www.researchgate.net), the social networking site for researchers. Every document in this platform's database can be inspected according to different perspectives. Among them, readers can browse a document's citation lists and access CC (when available) displayed in the form of: 1 sentence before the citing sentence + citing sentence + 1 sentence after the citing sentence. This window size allows users to better understand the full context of a citation without losing any information contained in the nearby sentences. This is particularly relevant for the task of polarity identification of citations. Athar and Teufel (2012) have shown that authors' sentiments are most likely expressed outside the citing sentences. Sentiment in citations is often hidden, and criticism especially might be hedged both for politeness and for political reasons (MacRoberts and MacRoberts, 1984). Citing sentences are typically neutral, and negative polarity in particular occurs in the following sentences (Teufel et al., 2006); see for example (from (Platt, 1990)):

"In [19, sec. 11.11], Vapnik suggests a method for mapping the output of SVM to probabilities by decomposing the feature space [...]. Preliminary results for this method, are promising. However, there are some limitations that are overcome by the method of this chapter."

Particularly for, but not limited to, polarity identification tasks, a context extended to the nearby sentences can supply the complete set of information about a citation to applications and readers. Sentences near a citing sentence can be added as part of the CC according to a fixed schema or by following an adaptive approach.

4.1 Fixed Extended Context

Besides ResearchGate and the aforementioned work of Ritchie et al. (2008), who studied different window sizes of CC for identifying Index Terms, Mei and Zhai (2008) also implemented a fixed extended context for their study on summarizing the influence of articles. For their impact-based summarization task they used a window of 5 sentences, with 2 sentences before and 2 after the citing sentence. This technique allows more information to be included in the CC, but at the same time the risk of adding noise is high. This is why most of the literature concerning extended CC provides adaptive methods instead. The work of Fujiwara and Yamamoto (2015) deserves a mention, more for their overall project than for the CC retrieval approach, which relies on a very basic technique (they include the sentence after the citing one if the reference marker is at the end of the citing sentence, and limit long citing sentences to 240 characters before and after citation markers). The authors built the Colil database, where CC of the life sciences domain are stored, and made it available to users through a web-based search service. For each resource stored in the database, a list of CC in which the resource has been cited is returned to the user, who can easily read how a work is perceived and used by different authors.

4.2 Adaptive Extended Context

O'Connor (1982) was the first to investigate the CC as a sequence of sentences, i.e. a multi-sentence citing statement. His purpose was to study the words of the CC as a possible improvement for the retrieval of the related cited entities. He wrote 16 complex and detailed computer rules (not completely computer procedures at that time) with linguistic, structural and more general features for the selection of citing statements. Nanba and Okumura (1999) presented a system to support writing surveys of a specific domain. They see the CC as a succession of sentences where the possible connections are indicated by 6 kinds of cue words (anaphora, negative expression, 1st and 3rd person pronoun, adverb, other), which they use for retrieving the suitable CC for their system. To identify the full span of the CC, Kaplan et al. (2009) presented a different method based on co-reference chains. They built an SVM (Cortes and Vapnik, 1995) classifier with 13 features (among which: cosine similarity, gender and number agreement, semantic class agreement etc.) that are tested in order to find the best configuration. Results of the classifier alone and in combination with cue-based techniques are promising. Despite the small amount of data analyzed for the project, Kaplan et al. raised some interesting remarks about the CC. In particular, they stated that the sentences of a CC are not necessarily contiguous. Qazvinian and Radev (2010) explored the task of retrieving background information close to explicit citations by implementing a probabilistic inference model (Markov Random Field). Like previous authors, they observed that the majority of sentences related to a citation occur directly after or before the citation or another context sentence; however, they also confirmed the intuition of Kaplan et al. about possible gaps between sentences describing a cited paper. Athar and Teufel (2012) tried to go further by attempting to retrieve all the mentions of a cited entity within the full text of the citing paper. As claimed by the authors, mentions of a cited entity can occur anywhere in the article and are necessary to identify the real sentiment toward the cited work. Their first experiment of manual annotation confirmed the insight that retrieving all the mentions of a cited entity increases citation sentiment coverage. The SVM framework implemented by the authors, despite being limited to a 4-sentence window, also outperformed a single-sentence baseline system. Abu-Jbara et al. (2013), with the purpose of adding qualitative aspects to standard quantitative bibliometrics (H-Index, G-Index, etc.), analyzed the text surrounding a citation in order to define the citer's purposes and polarity. This piece of text (the CC) is retrieved with a sequence labeling method. Starting from the citing sentence, Abu-Jbara's team used CRF (Lafferty et al., 2001) to determine whether the sentence before and the two sentences after the citing sentence have to be included in the CC. The features for the CRF model are both structural (e.g. position of the current sentence with respect to the citing sentence) and lexical (e.g. presence of demonstrative determiners). Kaplan et al. (2016) named Citation Block Determination (CBD) the task of detecting non-explicit citing sentences and addressed it by testing various features representing different aspects of textual coherence. Non-local mentions are excluded from what they formalized as a binary classification task of sentences from the citing one. They tested different relational and entity coherence features and their combinations. Experiments showed that the CRF method fits the task better than the SVM approach.

The different works briefly described so far give an overview of the most interesting techniques explored by researchers. From rule-based approaches to probabilistic methods, the implemented features are most of the time domain-specific, relying on particular vocabulary and on stylistic and rhetorical habits.

4.2.1 Citation Scope

Related to the Adaptive Extended Context topic is the identification of the Scope of a citation. So far we have discussed different ways of including in the CC what is outside the citing sentence but at the same time related to it. The idea is to extend the context. However, there are cases in which the citing sentence does not completely refer to the targeted citation or where the contexts of multiple citations overlap. In these cases the aforementioned approaches to CC extraction would include noise and affect the results of applications. See for instance the following example, where the whole citing sentence might produce a negative polarity despite the neutral value of the citation:

"The negative results produced by the BoW approach led our team to change direction and we tested a SVM (CORTES, 1995) classifier."

Finding a procedure to cut out the precise scope of a citation is a tricky and challenging task for which few experiments have been done. Athar (2011) suggested trimming the parse tree of each citing sentence and keeping only the deepest clause in the subtree of which the citation is a part. Abu-Jbara and Radev (2012) explored 3 different methods for identifying the scope: word classification, sequence labeling and segment classification. Results showed that the scope of a given reference consists of units of higher granularity than words. In fact, the segment classification technique achieved the best performance. Despite the interesting results, we agree with Hernandez-Alvarez and Gomez (2016), who stated that additional work is required to improve the citation scope identification task. The need for further research in this field is also supported by the analysis of Jha et al. (2017), who performed an annotation experiment on a sample of the ACL Anthology Network revealing that, on average, the reference scope for a given target reference contains only 57.63 per cent of the original citing sentence.

5 Conclusion

We have reviewed what we consider the most interesting works about CC identification in order to provide a solid background to anyone interested in the topic, and especially to those researchers who are facing the task of identifying the best approach for their studies. We did not compare the different strategies with the purpose of ranking them; rather, we showed that there exist various relations between a methodology and the usage, domain, and language specificity of its possible applications.

References

Abu-Jbara, A., and Radev, D. 2012. Reference Scope Identification in Citing Sentences. In Proc. of NAACL HLT, (p. 80-90).

Abu-Jbara, A., Ezra, J., and Radev, D. 2013. Purpose and Polarity of Citation: Towards NLP-based Bibliometrics. In Proc. of NAACL HLT, (p. 596-606).

Aljaber, B., Stokes, N., Bailey, J., and Pei, J. 2010. Document Clustering of Scientific Texts Using Citation Contexts. Information Retrieval, 13, (p. 101-131).

Athar, A. 2011. Sentiment Analysis of Citations using Sentence Structure-Based Features. In Proc. of NAACL-HLT, (p. 81-87).

Athar, A., and Teufel, S. 2012. Detection of Implicit Citations for Sentiment Detection. In Proc. of DSSD, (p. 18-26).

Bertin, M., Atanassova, I., Sugimoto, C., and Lariviere, V. 2016. The Linguistic Patterns and Rhetorical Structure of Citation Context: an Approach Using N-Grams. Scientometrics, 109(3).

Bradshaw, S. 2003. Reference Directed Indexing: Redeeming Relevance for Subject Search in Citation Indexes. In Proc. of ECDL, (p. 499-510).

Cortes, C., and Vapnik, V. 1995. Support-Vector Networks. Machine Learning, 20(3), (p. 273-297).

Councill, I., Giles, C., and Kan, M. 2008. ParsCit: An Open-Source CRF Reference String Parsing Package. In Proc. of LREC, (p. 661-667).

Di Iorio, A., Limpens, F., Peroni, S., Rotondi, A., and Tsatsaronis, G. 2018. Investigating Facets to Characterise Citations for Scholars. In Proc. of SAVE-SD Workshop.

Doslu, M., and Bingol, H. 2016. Context Sensitive Article Ranking with Citation Context Analysis. Scientometrics, 108(2), (p. 653-671).

Ebesu, T., and Fang, Y. 2017. Neural Citation Network for Context-Aware Citation Recommendation. In Proc. of SIGIR, (p. 1093-1096).

Elkiss, A., Shen, S., Fader, A., Erkan, G., States, D., and Radev, D. 2008. Blind Men and Elephants: What Do Citation Summaries Tell Us About a Research Article? American Society for Information Science and Technology, 59(1), (p. 51-62).

Farber, M., Thiemann, A., and Jatowt, A. 2018. To Cite, or Not to Cite? Detecting Citation Contexts in Text. In Proc. of ECIR: Advances in Information Retrieval, (p. 598-603).

Fujiwara, T., and Yamamoto, Y. 2015. Colil: a Database and Search Service for Citation Contexts in the Life Sciences Domain. Biomedical Semantics, 6(38).

Hernandez-Alvarez, M., and Gomez, J. 2016. Survey About Citation Context Analysis: Tasks, Techniques, and Resources. Natural Language Engineering, 22(3), (p. 327-349).

Hyland, K. 2009. Writing in the Disciplines: Research Evidence for Specificity. Taiwan International ESP Journal, 1(1), (p. 5-22).

Jha, R., Jbara, A., Qazvinian, V., and Radev, D. 2017. NLP-driven Citation Analysis for Scientometrics. Natural Language Engineering, 23(1), (p. 93-130).

Lafferty, J., McCallum, A., and Pereira, F.C.N. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. of ICML, (p. 282-289).

Ii, A., Wu, J., and Giles, C. 2014. CiteSeerX: Intelligent Information Extraction and Knowledge Creation from Web-Based Data. In Proc. of AKBC, (p. 1-7).

Kaplan, D., Iida, R., and Tokunaga, T. 2009. Automatic Extraction of Citation Contexts for Research Paper Summarization: A Coreference-Chain Based Approach. In Proc. of NLPIR4DL, (p. 88-95).

Kaplan, D., Tokunaga, T., and Teufel, S. 2016. Citation Block Determination Using Textual Coherence. Information Processing, 24(3), (p. 540-553).

Knoth, P., Gooch, P., and Jack, K. 2017. What Others Say About This Work? Scalable Extraction of Citation Contexts from Research Papers. In Proc. of TPDL, (p. 287-299).

Mei, Q., and Zhai, C. 2008. Generating Impact-Based Summaries for Scientific Literature. In Proc. of ACL-HLT, (p. 816-824).

MacRoberts, M.H., and MacRoberts, B.R. 1984. The Negational Reference: or the Art of Dissembling. Social Studies of Science, 14, (p. 91-94).

Nakov, P., Schwartz, A., and Hearst, M. 2004. Citances: Citation Sentences for Semantic Analysis of Bioscience Text. In Proc. of SIGIR.

Nanba, H., and Okumura, M. 1999. Towards Multi-paper Summarization Using Reference Information. In Proc. of IJCAI, (p. 926-931).

O'Connor, J. 1982. Citing Statements: Computer Recognition and Use to Improve Retrieval. Information Processing and Management, 18(3), (p. 125-131).

Platt, J.C. 1990. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. Advances in Large Margin Classifiers, (p. 61-74).

Qazvinian, V., and Radev, D. 2010. Identifying Non-explicit Citing Sentences for Citation-based Summarization. In Proc. of ACL, (p. 555-564).

Ritchie, A., Robertson, S., and Teufel, S. 2008. Comparing Citation Contexts for Information Retrieval. In Proc. of ACM-CIKM, (p. 213-222).

Sula, C., and Miller, M. 2014. Citations, Contexts, and Humanistic Discourse: Toward Automatic Extraction and Classification. Literary and Linguistic Computing, 29, (p. 453-464).

Teufel, S., Siddharthan, A., and Tidhar, D. 2006. Automatic Classification of Citation Function. In Proc. of EMNLP, (p. 103-110).