Importance Assessment in Scholarly Networks Saurav Manchanda and George Karypis University of Minnesota, Twin Cities, USA {manch043, karypis}@umn.edu Abstract etc. (Aguinis et al. 2012)). However, most of these tradi- tional metrics, such as citation counts and h-index treat all We present approaches to estimate content-aware bibliomet- rics to quantitatively measure the scholarly impact of a publi- citations and publications equally, and do not take into ac- cation. Traditional measures to assess quality-related aspects count the content of the publications and the context in such as citation counts and h-index, do not take into account which a prior scholarly work was cited. Another related the content of the publications, which limits their ability to line of work, such as PageRank (Page et al. 1999) and provide rigorous quality-related metrics and can significantly HITS (Kleinberg 1999) takes the node centrality into con- skew the results. Our proposed metric, denoted by Content In- sideration (as a proxy for publication influence), but still op- formed Index (CII), uses the content of the paper as a source erate in an content-agnostic manner. These content-agnostic of distant-supervision, to weight the edges of a citation net- metrics fail to reliably measure the scholarly impact of an work. These content-aware weights quantify the information article as they do not differentiate between the possible rea- in the citation i.e., these weights quantify the extent to which sons a scholarly work is being cited. Being content-agnostic, the cited-node informs the citing-node. The weights convert the original unweighted citation network to a weighted one. these metrics can be easily manipulated by the presence of Consequently, this weighted network can be used to derive malicious entities, such as publication venues indulging in impact metrics for the various entities involved, like the pub- self-citations, which leads to high impact factor, or a group lications, authors etc. We evaluate the weights estimated by of scholars citing each others’ work. For example, Journal our approach on three manually annotated datasets, where Citation Reports (JCR)1 routinely suppresses many journals the annotations quantify the extent of information in the cita- that indulge in citation stacking, a practice where the review- tion. Particularly, we evaluate how well the ranking imposed ers and journal editors pressure authors to cite papers that ei- by our approach associates with the ranking imposed by the ther they wrote or that are published in “their” journal. Thus, manual annotations. The proposed approach achieves up to there is a need to establish content-aware metrics to accu- 103% improvement in performance as compared to second rately and quantitatively measure various innovation-related best performing approach. aspects such as their significance, novelty, impact, and mar- ket value. Such metrics are essential for ensuring that SET- 1 Introduction driven innovations will play an ever more significant role in Scientific, engineering, and technological (SET) innovations the future. have been the drivers behind many of the significant pos- In this paper, we propose machine-learning-driven ap- itive advances in our modern economy, society, and life. proaches, that automatically estimate the weights of the To measure various impact-related aspects of these innova- edges in a citation network, such that edges with higher tions various quantitative metrics have been developed and weights correspond to higher-impact citations. There has deployed. These metrics play an important role as they are been considerable effort in the past to identify important used to influence how resources are allocated, assess the per- citations (Valenzuela, Ha, and Etzioni 2015; Jurgens et al. formance of personnel, identify intellectual property (IP)- 2018; Cohan et al. 2019). These approaches treat this task related takeover targets, value a company’s intangible assets as a supervised text-classification problem, and thus, require (IP is such an asset), and identify strategic and/or emerging the availability of training data with ground truth annota- competitors. tions. However, generating such labeled data is difficult and Citation networks of peered-reviewed scholarly publi- time consuming, especially when the meaning of the labels cations (e.g., journal/conference articles and patents) have is user-defined. In contrast, our approaches are distant su- widely been used and studied in order to derive such metrics pervised, that require no manual annotation. The proposed for the various entities involved (e.g., articles, researchers, approaches leverage the readily available content of the pa- institutions, companies, journals, conferences, countries, pers as a source of distant-supervision. Specifically, we for- Copyright © 2021for this paper by its authors. Use permitted under 1 Creative Commons License Attribution 4.0 International (CC BY http://help.incites.clarivate.com/incitesLiveJCR/JCRGroup/ 4.0). titleSuppressions.html mulate the problem as how well the linear combination of life sciences and biomedical topics, and Astrophysics Data the representations of the cited publication explains the rep- System8 which covers astronomy and physics. resentation of the citing publication. The weights in this linear-combination quantify the extent to which the cited- Citation recommendation publication informs the citing-publication. We evaluate the weights estimated by our approach on three manually an- Citation recommendation describes the task of recommend- notated datasets, where the annotations quantify the extent ing citations for a given text. It is an essential task, as of information in the citation. Particularly, we evaluate how all claims written by the authors need to be backed up well the ranking imposed by our approach associates with in order to ensure reliability and truthfulness. The ap- the ranking imposed by the manual annotations. The pro- proaches developed for citation recommendation can be posed approach achieves up to 103% improvement in per- grouped into 4 groups as follows(Färber and Jatowt 2020): formance as compared to second best performing approach. hand-crafted feature based approaches, topic-modelling While our discussion and evaluation focuses on iden- based approaches, machine-translation based approaches, tifying informing citations, our approach is not restricted and neural-network based approaches. Hand-crafted feature to this domain, and can be used to derive impact metrics based approaches are based on features are are manually for the various involved entities. For example, the content- engineered by the developers. For example, text similarity aware weights estimated by the proposed approach convert between the citation context and the candidate papers can the original unweighted citation network to a weighted one. be used as one of the text-based features. Examples of pa- Consequently, this weighted network can be used to derive pers that propose hand-crafted feature based approaches in- impact metrics for the various involved entities, like the pub- clude (Färber and Jatowt 2020; He et al. 2011; LIU, YAN, lications, authors etc. For example, to find the impact of and YAN 2016; Livne et al. 2014; Rokach et al. 1978). a publication, the sum of weights outgoing from its corre- Topic modeling based approaches represent the candidate sponding node can be used to quantify the impact of the papers’ text and the citation contexts by means of abstract publication, instead of using vanilla citation count. topics, and thereby exploiting the latent semantic structure The reminder of the paper is organized as follows. Section of texts. Examples of topic modeling based approaches in- 2 presents the related literature review. The paper discusses clude (He et al. 2010; Kataria, Mitra, and Bhatia 2010). the proposed method in Section 3 followed by the experi- The machine-translation based approaches apply the idea ments in Section 4. Section 5 discusses the results. Finally, of translating the citation context into the cited document Section 6 corresponds to the conclusions. to find the candidate-papers worth citing. Examples in this category include (He et al. 2012; Huang et al. 2012). Fi- nally, the popular examples of neural-network based mod- 2 Related Work els include (Ebesu and Fang 2017; Han et al. 2018; Huang The research areas relevant to the work present in this paper et al. 2015; Kobayashi, Shimbo, and Matsumoto 2018; Tang, belong to citation indexing, citation recommendation, link Wan, and Zhang 2014; Yin and Li 2017). prediction approaches, distant-supervised credit attribution approaches and citation-intent classification approaches. We Link-prediction briefly discuss these areas below: A link is a connection between two nodes in a network. Citation Indexing As such, link-prediction is the problem of predicting the A citation index indexes the links between publications that existence of a link between two nodes in a network. A authors make when they cite other publications. Citation in- good link-prediction model predicts the likelihood of a link dexes aim to improve the dissemination and retrieval of sci- between two nodes, so it can not only be used to pre- entific literature. CiteSeer (Giles, Bollacker, and Lawrence dict new links, but to also curate the graph by filtering 1998; Li et al. 2006) is a first automated citation indexing less-likely links that are already present. Thus, the link- system that works by downloading publications from the prediction can be a useful tool to find likely citations in Web and converting them to text. It then parses the papers to a citation network. The citation recommendation task de- extract the citations and the context in which the citations are scribed previously can be thought of as a special case of link- made in the body of the paper, storing this information in a prediction. Following the taxonomy described in (Martı́nez, database. Other examples of popular citation indices include Berzal, and Cubero 2016), link-prediction approaches can be Google Scholar2 , Web of Science3 by Clarivate Analytics, broadly categorized into three categories: similarity-based Scopus4 by Elsevier and Semantic Scholar5 . Some examples approaches, probabilistic and statistical approaches and al- of subject-specific citation indices include INSPIRE-HEP6 gorithmic approaches. The similarity based approaches as- which covers high energy physics, PubMed7 , which covers sume that nodes tend to form links with other similar nodes, and that two nodes are similar if they are connected to simi- 2 https://scholar.google.com/ lar nodes or are near in the network according to a given sim- 3 http://www.webofknowledge.com/ ilarity function. Examples of popular similarity functions 4 https://www.scopus.com/ include number of common neighbors (Liben-Nowell and 5 https://www.semanticscholar.org/ Kleinberg 2007), Adamic-Adar index (Adamic and Adar 6 https://inspirehep.net/ 7 8 https://pubmed.ncbi.nlm.nih.gov/ http://ads.harvard.edu/ Paper 2 Paper 3 Paper 4 2003), etc. The probabilistic and statistical approaches as- sume that the network has a known structure. These ap- Paper 1 proaches estimates the model parameters of the network (Citing paper) structure using statistical methods, and use these parame- ters to calculate the likelihood of the presence of a link ---[Paper 4]- ---[Paper 3]- between two nodes. Examples of probabilistic and statis- ---[Paper 2]- tical approaches include (Guimerà and Sales-Pardo 2009; Unit Normalization Huang 2010; Wang, Satuluri, and Parthasarathy 2007). Al- gorithmic approaches directly uses the link-prediction as su- pervision to build the model. For example, link-prediction task can be formulated as a binary classification task where the positive instances are the pair of nodes which are con- Representation of the historical concepts nected in the network, and negative instances are the uncon- Minimize the explanation loss nected nodes. Examples include (Menon and Elkan 2011; Bliss et al. 2014). Unsupervised or self-supervised node em- Figure 1: Overview of Content-Informed Index. Paper P1 bedding (such as DeepWalk (Perozzi, Al-Rfou, and Skiena cites papers P2 , P3 and P4 . The weights w21 , w31 , and w41 2014), node2vec (Grover and Leskovec 2016)), followed by quantifies the extent to which P2 , P3 and P4 informs P1 , training a binary classifier and Graph Neural network ap- respectively. The function f is implemented as a Multilayer proaches such as GraphSage (Hamilton, Ying, and Leskovec Perceptron. 2017) belong to this category. Distant-supervised credit-attribution driven approaches (Valenzuela, Ha, and Etzioni 2015; Jur- Various distant-supervised approaches have been developed gens et al. 2018; Cohan et al. 2019). Generating labeled data for credit-attribution, but the prior have primarily focused on for for these supervised approaches is difficult and time con- text documents. A document may be associated with multi- suming, especially when the meaning of the labels is user- ple labels but all the labels do not apply with equal speci- defined. In contrast, our approaches are distant supervised, ficity to the individual parts of the documents. Credit at- that require no manual annotation. tribution problem refers to identifying the specificity of la- bels to different parts of the document. Various probabilis- 3 Content-Informed Index (CII) tic and neural-network based approaches have been devel- In the absence of labels that define the impact, we assume oped to address the credit-attribution problem, such as La- that the extent to which a cited paper informs the citing pa- beled Latent Dirichlet Allocation (LLDA) (Ramage et al. per is an indication of the citation’s impact. Specifically, we 2009), Partially Labeled Dirichlet Allocation (PLDA) (Ra- assume that each paper Pi can be represented as a set of con- mage, Manning, and Dumais 2011), Multi-Label Topic cepts Ci . Further, we assume that each paper Pi is build on Model (MLTM) (Soleimani and Miller 2017), Segmentation top of a set of historical concepts Hi , and its novelty Ni is with Refinement (SEG-REFINE) (Manchanda and Karypis the new set of concepts it proposes. The contribution of a 2018), and Credit Attribution with Attention (CAWA) (Man- cited paper Pj towards the citing paper Pi is the set of con- chanda and Karypis 2020). cepts Cji = Cj ∩ Hi . In other terms, the set of concepts Ci Another line of work uses distant-supervised credit- is given by: attribution for query-understanding in product search. Ex- Ci = Ni ∪ Hi = Ni ∪ [∪Pi citesPj Cji ]. amples include, (i) using the reformulation logs as a source of distant-supervision to estimate a weight for each term in The task at hand is to quantify the extent to which Cji con- the query that indicates the importance of the term towards tributes towards Hi . To achieve this task, we look into the expressing the query’s product intent (Manchanda, Sharma, following directions: and Karypis 2019a,b); and (ii) annotating individual terms • How do we supervise the exercise? We minimize the in a query with the corresponding intended product charac- novelty of paper Pi , by trying to explain the concepts in teristics, using the characteristics of the engaged products paper Pi (denoted by Ci ) using the historical concepts, as a source of distant-supervision (Manchanda, Sharma, and i.e., the concepts of the papers it cites (Cj ). We call the Karypis 2020). loss associated with this minimization as the explanation loss. This gives rise to the following optimization prob- Citation-intent classification lem: There is a large body of work studying the intent of cita- X X minimize Ni = minimize Ci − Hi . tions and devising categorization systems. In general, these i i approaches treat citation-intent classification as a text clas- sification problem, and require the availability of training To proceed in this direction, we need to answer two ques- data with ground truth annotations. Representative examples tions, (i) How to represent the the set of concepts associ- include rule based approaches (Pham and Hoffmann 2003; ated with the paper Pi ?, and (ii) How do we represent the Garzone and Mercer 2000) as well as machine-learning set of historical concepts Hi ? As we show next, we use the textual content of the papers to estimate the represen- The max-bound constraint (wji ≤ b) is introduced to limit tations of Ci and Hi . Thus, we formulate our problem as a the projection space of the weights wji . This is because, distant-supervised problem, and the content of the papers without this constraint, for a given citing paper Pi , if the acts as a source of distant-supervision. set of weights wji minimize Equation (1), then so will any • How to represent the set of concepts associated with scalar multiplication of the weights wji . This can potentially a paper? For simplicity, we represent the set of con- lead to the estimated weights being incomparable across cepts associated with a paper (Ci ) as a pretrained vec- different citing papers. Having a max bound on the esti- tor representation (embedding) of its abstract, such as mated weights helps avoid this scenario. To take care of Word2Vec (Mikolov et al. 2013), GloVe (Pennington, the constraints, the function f (·) can be implemented as a Socher, and Manning 2014), BERT (Devlin et al. 2018), L2−regularized multilayer perceptron, with a single output ELMo (Peters et al. 2018), etc. In this paper, we use the node, and a non-negative mapping at the output node. Not pretrained representations pretrained on scientific docu- that we do not explicitly set the max-bound b, but it is im- ments provided by ScispaCy (Neumann et al. 2019). The plicitly set by the L2 regularization of the weights of the representation of Ci is denoted by r(Ci ). function f . The L2 regularization parameter is treated as a hyperparameter. Figure 1 shows an overview of Content- • How do we represent the set of historical concepts Hi ? Informed Index (CII). As the set of historical concepts Hi is a union of the bor- rowed concepts from the cited papers (Cj ), we simply represent the set of historical concepts as a weighted lin- 4 Experimental methodology ear combination of the representation of the concepts of Evaluation methodology and metrics the cited papers, i.e., X We need to evaluate how well the weights estimated by our r(Hi ) = w̃ji r(Cj ) proposed approach quantifies the extent to which a cited Pi cites Pj paper informs the citing paper. To this extent, we leverage various manually annotated datasets (explained later in Sec- X 2 subject to w̃ji =1 Pi cites Pj tion 4), where the annotations quantify the extent of infor- mation in the citation. The task inherently becomes an or- w̃ji ≥ 0; ∀(i, j). dinal association, and we need to evaluate how well the We have the constrained norm condition ( P 2 w̃ji = ranking imposed by our proposed method associates with Pi cites Pj the ranking imposed by the manual annotations. As a mea- 1) to make the representation of r(Hi ) agnostic to the sure of rank correlation, we use the non-parametric Somers’ number of cited-papers (a paper can cite multiple papers Delta (Somers 1962) (denoted by ∆). Values of ∆ range to reference the same borrowed concepts). from −1(100% negative association, or perfect inversion) The weights w̃ji can be thought of as normalized simi- to +1(100% positive association, or perfect agreement). A larity measure between the concepts of the cited paper, value of zero indicates the absence of association. Formally, and the citation context. Thus, to estimate w̃ji , we first given a dependent variable (i.e., the predicted weights by our estimate unnormalized w̃ji , denoted by wji , and then nor- model) and an independent variable (i.e., the manually anno- malize wji so as to have unit norm. The unnormalized tated ground truth), ∆ is the difference between the number weight wji is precisely the extent to which Cj contributes of concordant and discordant pairs, divided by the number towards Hi (and hence Ci ), i.e., the weight that we wish to of pairs with independent variable values in the pair being estimate in this paper. We estimate wji as a multilayer per- unequal. ceptron, that takes as input the representations of the cited Relation of ∆ to other metrics: When the independent paper and the citation context. We use the representation variable has only two distinct classes (binary variable), the associated with the corresponding concepts as the repre- area under the receiver operating characteristic curve (AUC sentations of the cited papers (r(Cj )). Similar to r(Cj ), ROC) statistic is equivalent to ∆ (Newson 2002). Thus, ∆ we use the ScispaCy vector representation for the citation can also be visualized as a generalization of AUC ROC to- context as the representation of the context, and denote it wards ordinal classification with multiple classes. Further, by r(j → i). as the dependent variable (the weights estimated by our pro- The above discussion leads to the following formulation: posed approach) is real valued, having two tied values on the X X independent variable is very difficult. Thus, for our case, ∆ minimize ||r(Ci ) − w̃ji r(Cj )||2 f is equivalent to Goodman and Kruskal’s Gamma (Goodman i Pi cites Pj and Kruskal 1959, 1963, 1972, 1979), and just a scaled vari- wji ant of Kendall’s τ coefficient (Kendall 1938), with are other subject to w̃ji = r ; ∀(i, j), 2 P wji popular measures of ordinal association. Pi cites Pj (1) wji = f (r(Cj ), r(Cji )); ∀(i, j), Baselines wji ≥ 0; ∀(i, j), We choose representative baselines from diverse categories wji ≤ b; ∀(i, j). as discussed below: Link-prediction approaches: The citation weights that citing paper. This assumption has also been used as a feature we estimate in this paper can also looked from the link- in prior supervised approaches (Valenzuela, Ha, and Etzioni prediction perspective, i.e., assigning a score to every cita- 2015). The absolute frequency of referencing a cited-paper tion (link) in the citation graph, the score portraying the like- may provide a good signal regarding the information bor- lihood of the existence of a link. Thus, the citations that are rowed from the cited paper, when comparing with other pa- noisy, i.e., the edges that do not make sense with the respect pers being cited by the same citing paper. However, as the to underlying link-prediction model get smaller weights. We citation-behavior differs between papers, the absolute fre- compare against two link-prediction methods, one based on quency may not be comparable across different citing pa- classic network embedding approach, and other belonging pers. Thus, we also provide results after doing normaliza- to recent Graph Neural Network (GNN) based approaches. tion of the absolute frequency of the citation references for • DeepWalk (Perozzi, Al-Rfou, and Skiena 2014): Deep- each citing paper. We provide results for mean, max, and Walk is a popular method to learn node embeddings. Deep- min normalization. Specifically, given a citation and the cor- Walk borrows ideas from language modeling and incor- responding citing paper, the information weight for a cita- porates them with network concepts. Its main proposition tion is calculated by dividing the number of references of is that linked nodes tend to be similar and they should that citation, by the mean, max, and min of references of all have similar embeddings as well. Once we have node the citations in that citing paper, respectively. embeddings as the output of DeepWalk, we train a bi- nary classifier, with the positive instances as the pairs of Datasets nodes which are connected in the network, and negative in- The Semantic Scholar Open Research Corpus (S2ORC): stances are the unconnected nodes (generated using nega- The S2ORC (Lo et al. 2020) dataset is a citation graph of tive sampling). We provide results using two different clas- 81.1 million academic publications and 380.5 million cita- sifiers: Logistic Regression (denoted by DeepWalk+LR) tion edges. We only consider the publications for which full- and Multilayer Perceptron (denoted by DeepWalk+MLP). text is available and abstract contains at least 50 words. This Note that Deepwalk is a transductive model, and only con- leaves us with a total of 5, 653, 297 papers, and 30, 533, 111 siders the network topology, i.e., DeepWalk does not use edges (citations). the content of the papers to estimate the model. • GraphSage (Hamilton, Ying, and Leskovec 2017): Graph- ACL-2015: The ACL-2015 (Valenzuela, Ha, and Etzioni SAGE is a Graph Concolutional Network (GCN) based 2015) dataset contains 465 citations gathered from the ACL framework for inductive representation learning on large anthology9 , represented as tuples of (cited paper, citing pa- graphs. GraphSage is trained with the link-prediction loss, per), with ordinal labels ranging from 0 to 3, in increasing so we do not use a second step (as in DeepWalk) to train order of importance. The citations were annotated by one ex- separate classifier. Note that, GraphSage is an inductive pert, followed by annotation by another expert on a subset of model, so also considers the content of the papers in addi- the dataset, to verify the inter-annotator agreement. We only tion to topology of the network to estimate the model. use the citations for which we have the inter-annotator agree- ment, and the citations are present in the S2ORC dataset we Text-similarity based baselines: We can think of the described before. The selected dataset contains 300 citations function f as a similarity measure between the cited pa- among 316 unique publications. The total number of unique per and the citation context. Thus, we consider the following citing publications are 283 and the total number of unique similarity measures as our baselines: We use the same pre- cited publications are 38. trained representations as we used as an input to CII, and cosine similarity as the similarity measure, which is a popu- ACL-ARC: The ACL-ARC (Jurgens et al. 2018) is a lar similarity measure for text data. dataset of citation intents based on a sample of papers from • Similarity-Abstract-Context: Similarity between the cited the ACL Anthology Reference Corpus (Bird et al. 2008) abstract and the citation context. and includes 1,941 citation instances from 186 papers and • Similarity-Context-Abstract: Similarity between the cit- is annotated by domain experts. The dataset provides ACL ing abstract and the citation context. IDs for the papers in the ACL corpus, but does not provide an identifier to the papers outside the ACL corpus, mak- • Similarity-Abstract-Abstract: Similarity between the ing it difficult to map many citations to the S2ORC cor- cited abstract and citing abstract. pus. However, it provided the titles of those papers, and To calculate each of the above similarity measures, we use we used these titles to map these papers to the papers in the same pretrained representations as we used as an input to the S2ORC dataset, if we found matching titles. The an- CII, and cosine similarity as the similarity measure, which is notations in ACL-ARC are provided at individual citation- a popular similarity measure for text data. The baselines be- context level, leading to multiple annotations for some of the longing to this category can also be thought of as similarity- (cited paper, citing paper) pair. If this is the case, we chose based link prediction approaches. the highest-informing annotation for such (cited paper, cit- In addition, we also consider another simple baseline, re- ing paper) pairs. The selected dataset contains 460 citations ferred to as Reference Frequency, where we assume that among 547 unique publications. The total number of unique more frequently the cited paper is referenced in the citing pa- 9 per, the higher the chances of the cited paper informing the https://www.aclweb.org/anthology/ Table 1: Results on the Somers’ ∆ metric. Model ACL-2015 ACL-ARC SciCite Content-Informed Index (CII) 0.428 ± 0.013 0.308 ± 0.010 0.296 ± 0.006 Ref. Frequency (Absolute) 0.325 ± 0.000 0.308 ± 0.000 0.144 ± 0.000 Ref. Frequency (Mean-normalized) 0.351 ± 0.000 0.300 ± 0.000 0.120 ± 0.000 Ref. Frequency (Min-normalized) 0.321 ± 0.000 0.298 ± 0.000 0.145 ± 0.000 Ref. Frequency (Max-normalized) 0.270 ± 0.000 0.172 ± 0.000 0.035 ± 0.000 Similarity-Abstract-Abstract −0.041 ± 0.000 0.091 ± 0.000 −0.003 ± 0.000 Similarity-Abstract-Context −0.147 ± 0.000 0.090 ± 0.000 −0.125 ± 0.000 Similarity-Context-Abstract 0.013 ± 0.000 −0.062 ± 0.000 −0.202 ± 0.000 Deepwalk+LR −0.071 ± 0.016 0.190 ± 0.006 −0.037 ± 0.018 Deepwalk+MLP −0.026 ± 0.011 0.205 ± 0.024 −0.047 ± 0.015 GraphSage 0.023 ± 0.045 0.132 ± 0.024 0.049 ± 0.019 citing publications are 145 and the total number of unique respectively. We use the same network architecture for the cited publications are 413. MLP that we train on top of DeepWalk representations. We train the logistic regression and MLP parts of Deepwalk, SciCite (Cohan et al. 2019) SciCite is a dataset of cita- GraphSage, and CII for a maximum of 50 epochs, and do tion intents based on a sample of papers from the Semantic early-stopping if the validation performance does not im- Scholar corpus10 , consisting of papers in general computer prove for 5 epochs. For GraphSage, we use the implemen- science and medicine domains. Citation intent was labeled tation provided by DGL12 . We used mini-batch size of 1024 using crowdsourcing. The annotators were asked to identify for training the models. the intent of a citation, and were directed to select among three citation intent options: Method, Result/Comparison and Background. This resulted in a total 9, 159 crowd- 5 Results and discussion sourced instances. We use the citations that are present in Quantitative analysis the S2ORC dataset we described before. Similar to ACL- ARC, the annotations are provided at individual citation- Table 1 shows the performance of the various approaches on context level, leading to multiple annotations for some of the the Somers’ Delta (∆) for each of the datasets ACL-2015, (cited paper, citing paper) pair. For such cases, we chose the ACL-ARC and SciCite. For ACL-2015 and SciCite, the pro- highest-informing annotation for the (cited paper, citing pa- posed approach CII outperforms the competing approaches; per) pairs. The selected dataset contains 352 citations among while for the ACL-ARC dataset, CII performs at par with 704 unique publications. There is no repeated citing or cited the best performing approach. The improvement of CII over publication in this dataset, thus, the total number of unique the second best performing approach is 22% and 103%, on citing publications as well as unique citing publications are the ACL-2015 and SciCite datasets, respectively. 352 each. Interestingly, the simplest baseline, Reference-frequency and its normalized forms are the second best performing ap- Parameter selection proaches. While Reference-frequency performs at par with We treat one of the evaluation datasets (ACL-ARC) as the the CII on the ACL-ARC dataset, it does not perform as validation set, and chose the hyperparameters of our ap- good on the other two datasets. This can be attributed to proaches and baselines with respect to best performance the fact that the number of unique citing papers in ACL- on this dataset. For DeepWalk, we use the implementation ARC dataset are relatively small. Thus, many citations in provided here11 , with the default parameters, except the di- ACL-ARC are shared by the same citing paper, which is not mensionality of the estimated representations, which is set the case with the other two datasets. Thus, as mentioned in to 200 (for the sake of fairness, as the used 200 dimen- Section 4, absolute frequency of referencing a cited-paper sional text representations for CII). For the models that re- may provide a good signal regarding the information bor- quire learning, i.e., the logistic regression part of Deepwalk, rowed from the cited paper, when comparing with other pa- MLP part of Deepwalk, GraphSage, and CII, we used the pers being cited by the same citing paper. Further, even the ADAM (Kingma and Ba 2015) optimizer, with initial learn- normalized forms of the Reference-frequency lead to only ing rate of 0.0001, and further use step learning rate sched- marginal increase in performance for the ACL-2015 and uler, by exponentially decaying the learning rate by a factor SciCite datasets. Thus, the simple normalizations (such as of 0.2 every epoch. We use L2 regularization of 0.0001. The mean, max and min normalization used in this paper), are function f in CII was implemented as a multilayer percep- not sufficient to address the difference in citation-behavior tron, with three hidden layers, with 256, 64, and 8 neurons, that occurs between different papers. 10 12 https://www.semanticscholar.org/ https://github.com/dmlc/dgl/blob/master/examples/pytorch/ 11 https://github.com/xgfs/deepwalk-c graphsage Furthermore, we observe that simple similarity based ap- proaches, such as cosine-similarity between pairs of various entities (each combination of citing abstract, citing abstract, and citation-context) performs close to random scoring (∆ value of close to zero). This validates that the simple sim- ilarity measures, like cosine similarity are not sufficient to manifest the the information that a cited-paper lends to the citing-paper; thus, showing the necessity of more expressive approaches, like CII. In addition, the other learning-based link-prediction- based approaches perform considerably worse than the sim- Figure 2: Word-cloud (Frequently occurring words) that ap- ple baseline reference-frequency. While on ACL-2015 and pear in the citation context of the citations with the highest SciCite datasets, they perform close to random scoring, the predict importance weights. performance on ACL-ARC dataset is better than the random baseline. Qualitative analysis In order to understand the patterns that the proposed ap- proach CII learns, we look into the data instances with the highest and lowest predicted weights. As the function f takes as input both the abstract of the cited paper and the ci- tation context, the learnt patterns can be a complex function of the cited paper abstract and the citation context. Thus, for simplicity, we limit the discussion in this section to under- Figure 3: Word-cloud (Frequently occurring words) that ap- stand the linguistic patterns in the citation context, and how pear in the citation context of the citations with the least pre- these patterns associate with the weights predicted for them. dict importance weights. In this direction, we select 10, 000 citation-contexts cor- responding to citations with highest predicted weights, and plot the word clouds for these contexts. We repeat the same 6 Conclusion exercise for the citation-contexts with the lowest predicted In this paper, we presented approaches to estimate content- weights. Figures 2 and 3 shows the wordclouds for the high- aware bibliometrics to accurately quantitatively measure the est weighted citations and lowest weighted citations, respec- scholarly impact of a publication. Our distant-supervised tively. These figures show some clear discriminatory pat- approaches use the content of the publications to weight ters between the highest-weighted and lowest-weighted ci- the edges of a citation network, where the weights quantify tations, that relate well with the information carried by a ci- the extent to which the cited-publication informs the citing- tation. For example, the words such as ‘used’ and ‘using’ are publication. Experiments on the three manually annotated very frequent in the citation contexts of the highest weighted datasets show the advantage of using the proposed method citations. This is expected, as such verbs provide a strong on the competing approaches. Our work makes a step to- signal that the cited work was indeed employed by the citing wards developing content-aware bibliometrics, and envision paper, and hence the cited paper informed the citing work. that the proposed method will serve as a motivation to de- Another interesting pattern in the highest weighted citations velop other rigorous quality-related metrics. is the presence of words like ‘fig’, ‘figure’ and ‘table’. Such words are usually present when the authors present or de- References scribe important concepts, such as methods and results. As such, citations in these important sections indicates that the Adamic, L. A.; and Adar, E. 2003. Friends and neighbors on cited work is used or extended in the citing paper, which the web. Social networks 25(3): 211–230. signals importance. Aguinis, H.; Suárez-González, I.; Lannelongue, G.; and Joo, On the other hand, the wordcloud for the least weighted H. 2012. Scholarly impact revisited. Academy of Manage- citations (Figure 3) is dominated by weasle words such as ment Perspectives 26(2): 105–132. ‘may’, ‘many’, ‘however’, etc. The words such as ‘many’ Bird, S.; Dale, R.; Dorr, B. J.; Gibson, B.; Joseph, M. T.; commonly occur in the related work section of the paper, Kan, M.-Y.; Lee, D.; Powley, B.; Radev, D. R.; and Tan, where the paper presents some examples of other related Y. F. 2008. The acl anthology reference corpus: A reference works to emphasize the problem that the citing paper is solv- dataset for bibliographic research in computational linguis- ing. The words like ‘may’, ‘however’, ‘but’ etc are com- tics . monly used to describe some limitation of the cited work. Such citations are expected to be incidental, carrying less Bliss, C. A.; Frank, M. R.; Danforth, C. M.; and Dodds, P. S. information, as compared to other citations. 2014. An evolutionary algorithm approach to link prediction in dynamic social networks. Journal of Computational Sci- He, J.; Nie, J.-Y.; Lu, Y.; and Zhao, W. X. 2012. Position- ence 5(5): 750–764. aligned translation model for citation recommendation. In Cohan, A.; Ammar, W.; van Zuylen, M.; and Cady, F. 2019. International symposium on string processing and informa- Structural Scaffolds for Citation Intent Classification in Sci- tion retrieval, 251–263. Springer. entific Publications. In Proceedings of the 2019 Conference He, Q.; Kifer, D.; Pei, J.; Mitra, P.; and Giles, C. L. 2011. Ci- of the North American Chapter of the Association for Com- tation recommendation without author supervision. In Pro- putational Linguistics: Human Language Technologies, Vol- ceedings of the fourth ACM international conference on Web ume 1 (Long and Short Papers), 3586–3596. search and data mining, 755–764. Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. He, Q.; Pei, J.; Kifer, D.; Mitra, P.; and Giles, L. 2010. Bert: Pre-training of deep bidirectional transformers for lan- Context-aware citation recommendation. In Proceedings of guage understanding. arXiv preprint arXiv:1810.04805 . the 19th international conference on World wide web, 421– 430. Ebesu, T.; and Fang, Y. 2017. Neural citation network for context-aware citation recommendation. In Proceedings of Huang, W.; Kataria, S.; Caragea, C.; Mitra, P.; Giles, C. L.; the 40th international ACM SIGIR conference on research and Rokach, L. 2012. Recommending citations: translating and development in information retrieval, 1093–1096. papers into references. In Proceedings of the 21st ACM in- ternational conference on Information and knowledge man- Färber, M.; and Jatowt, A. 2020. Citation Recommendation: agement, 1910–1914. Approaches and Datasets. arXiv preprint arXiv:2002.06961 Huang, W.; Wu, Z.; Liang, C.; Mitra, P.; and Giles, C. L. . 2015. A neural probabilistic model for context based cita- Garzone, M.; and Mercer, R. E. 2000. Towards an automated tion recommendation. In Twenty-ninth AAAI conference on citation classifier. In Conference of the canadian society for artificial intelligence. computational studies of intelligence, 337–346. Springer. Huang, Z. 2010. Link prediction based on graph topology: Giles, C. L.; Bollacker, K. D.; and Lawrence, S. 1998. Cite- The predictive value of generalized clustering coefficient. Seer: An automatic citation indexing system. In Proceedings Available at SSRN 1634014 . of the third ACM conference on Digital libraries, 89–98. Jurgens, D.; Kumar, S.; Hoover, R.; McFarland, D.; and Ju- Goodman, L. A.; and Kruskal, W. H. 1959. Measures of as- rafsky, D. 2018. Measuring the evolution of a scientific field sociation for cross classifications. II: Further discussion and through citation frames. Transactions of the Association for references. Journal of the American Statistical Association Computational Linguistics 6: 391–406. 54(285): 123–163. Kataria, S.; Mitra, P.; and Bhatia, S. 2010. Utilizing Context Goodman, L. A.; and Kruskal, W. H. 1963. Measures of in Generative Bayesian Models for Linked Corpus. In Aaai, association for cross classifications III: Approximate sam- volume 10, 1. Citeseer. pling theory. Journal of the American Statistical Association Kendall, M. G. 1938. A new measure of rank correlation. 58(302): 310–364. Biometrika 30(1/2): 81–93. Goodman, L. A.; and Kruskal, W. H. 1972. Measures of Kingma, D. P.; and Ba, J. 2015. Adam: A Method association for cross classifications, IV: Simplification of for Stochastic Optimization URL http://arxiv.org/abs/1412. asymptotic variances. Journal of the American Statistical 6980. Association 67(338): 415–421. Kleinberg, J. M. 1999. Authoritative sources in a hyper- Goodman, L. A.; and Kruskal, W. H. 1979. Measures of linked environment. Journal of the ACM (JACM) 46(5): association for cross classifications. In Measures of associ- 604–632. ation for cross classifications, 2–34. Springer. Kobayashi, Y.; Shimbo, M.; and Matsumoto, Y. 2018. Cita- tion recommendation using distributed representation of dis- Grover, A.; and Leskovec, J. 2016. node2vec: Scalable fea- course facets in scientific articles. In Proceedings of the 18th ture learning for networks. In Proceedings of the 22nd ACM ACM/IEEE on joint conference on digital libraries, 243– SIGKDD international conference on Knowledge discovery 251. and data mining, 855–864. Li, H.; Councill, I.; Lee, W.-C.; and Giles, C. L. 2006. Cite- Guimerà, R.; and Sales-Pardo, M. 2009. Missing and spu- Seerx: an architecture and web service design for an aca- rious interactions and the reconstruction of complex net- demic document search engine. In Proceedings of the 15th works. Proceedings of the National Academy of Sciences international conference on World Wide Web, 883–884. 106(52): 22073–22078. Liben-Nowell, D.; and Kleinberg, J. 2007. The link- Hamilton, W.; Ying, Z.; and Leskovec, J. 2017. Inductive prediction problem for social networks. Journal of the Amer- representation learning on large graphs. In Advances in neu- ican society for information science and technology 58(7): ral information processing systems, 1024–1034. 1019–1031. Han, J.; Song, Y.; Zhao, W. X.; Shi, S.; and Zhang, H. 2018. LIU, Y.; YAN, R.; and YAN, H. 2016. Personalized Citation hyperdoc2vec: Distributed representations of hypertext doc- Recommendation Based on User’s Preference and Language uments. arXiv preprint arXiv:1805.03793 . Model. Journal of Chinese Information Processing (2): 18. Livne, A.; Gokuladas, V.; Teevan, J.; Dumais, S. T.; and the 2014 conference on empirical methods in natural lan- Adar, E. 2014. CiteSight: supporting contextual citation rec- guage processing (EMNLP), 1532–1543. ommendation using differential search. In Proceedings of Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2014. Deepwalk: the 37th international ACM SIGIR conference on Research Online learning of social representations. In Proceedings of & development in information retrieval, 807–816. the 20th ACM SIGKDD international conference on Knowl- Lo, K.; Wang, L. L.; Neumann, M.; Kinney, R.; and Weld, edge discovery and data mining, 701–710. D. 2020. S2ORC: The Semantic Scholar Open Research Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, Corpus. In Proceedings of the 58th Annual Meeting of C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized the Association for Computational Linguistics, 4969–4983. word representations. arXiv preprint arXiv:1802.05365 . Online: Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.447. URL https://www.aclweb. Pham, S. B.; and Hoffmann, A. 2003. A new approach for org/anthology/2020.acl-main.447. scientific citation classification using cue phrases. In Aus- tralasian Joint Conference on Artificial Intelligence, 759– Manchanda, S.; and Karypis, G. 2018. Text segmentation 771. Springer. on multilabel documents: A distant-supervised approach. In 2018 IEEE International Conference on Data Mining Ramage, D.; Hall, D.; Nallapati, R.; and Manning, C. D. (ICDM), 1170–1175. IEEE. 2009. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the Manchanda, S.; and Karypis, G. 2020. CAWA: An 2009 Conference on Empirical Methods in Natural Lan- Attention-Network for Credit Attribution. In AAAI, 8472– guage Processing: Volume 1-Volume 1, 248–256. Associa- 8479. tion for Computational Linguistics. Manchanda, S.; Sharma, M.; and Karypis, G. 2019a. Intent Ramage, D.; Manning, C. D.; and Dumais, S. 2011. Par- term selection and refinement in e-commerce queries. arXiv tially labeled topic models for interpretable text mining. In preprint arXiv:1908.08564 . Proceedings of the 17th ACM SIGKDD international con- Manchanda, S.; Sharma, M.; and Karypis, G. 2019b. Intent ference on Knowledge discovery and data mining, 457–465. term weighting in e-commerce queries. In Proceedings of ACM. the 28th ACM International Conference on Information and Rokach, L.; Mitra, P.; Kataria, S.; Huang, W.; and Giles, Knowledge Management, 2345–2348. L. 1978. A supervised learning method for context-aware Manchanda, S.; Sharma, M.; and Karypis, G. 2020. Distant- citation recommendation in a large corpus. INVITED Supervised Slot-Filling for E-Commerce Queries. arXiv SPEAKER: Analyzing the Performance of Top-K Retrieval preprint arXiv:2012.08134 . Algorithms 1978. Martı́nez, V.; Berzal, F.; and Cubero, J.-C. 2016. A survey Soleimani, H.; and Miller, D. J. 2017. Semisupervised, mul- of link prediction in complex networks. ACM computing tilabel, multi-instance learning for structured data. Neural surveys (CSUR) 49(4): 1–33. computation 29(4): 1053–1102. Menon, A. K.; and Elkan, C. 2011. Link prediction via ma- Somers, R. H. 1962. A new asymmetric measure of asso- trix factorization. In Joint european conference on machine ciation for ordinal variables. American sociological review learning and knowledge discovery in databases, 437–452. 799–811. Springer. Tang, X.; Wan, X.; and Zhang, X. 2014. Cross-language Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Ef- context-aware citation recommendation in scientific articles. ficient estimation of word representations in vector space. In Proceedings of the 37th international ACM SIGIR confer- arXiv preprint arXiv:1301.3781 . ence on Research & development in information retrieval, 817–826. Neumann, M.; King, D.; Beltagy, I.; and Ammar, W. 2019. ScispaCy: Fast and Robust Models for Biomedical Natural Valenzuela, M.; Ha, V.; and Etzioni, O. 2015. Identifying Language Processing. In Proceedings of the 18th BioNLP meaningful citations. In Workshops at the twenty-ninth AAAI Workshop and Shared Task, 319–327. Florence, Italy: As- conference on artificial intelligence. sociation for Computational Linguistics. doi:10.18653/v1/ Wang, C.; Satuluri, V.; and Parthasarathy, S. 2007. Local W19-5034. URL https://www.aclweb.org/anthology/W19- probabilistic models for link prediction. In Seventh IEEE 5034. international conference on data mining (ICDM 2007), 322– Newson, R. 2002. Parameters behind “nonparametric” 331. IEEE. statistics: Kendall’s tau, Somers’ D and median differences. Yin, J.; and Li, X. 2017. Personalized citation recommenda- The Stata Journal 2(1): 45–64. tion via convolutional neural networks. In Asia-Pacific web (APWeb) and web-age information management (WAIM) Page, L.; Brin, S.; Motwani, R.; and Winograd, T. 1999. The joint conference on web and big data, 285–293. Springer. PageRank citation ranking: Bringing order to the web. Tech- nical report, Stanford InfoLab. Pennington, J.; Socher, R.; and Manning, C. D. 2014. Glove: Global vectors for word representation. In Proceedings of