Citation Intent Classification Through Weakly Supervised Knowledge Graphs

Xinwei Du1,2, Kian Ahrabian1,2,*, Arun Baalaaji Sankar Ananthan1,2, Richard Delwin Myloth1,2 and Jay Pujara1,2
1 Information Sciences Institute, Marina del Rey, CA, USA
2 University of Southern California, Los Angeles, CA, USA

Abstract
Citations are scientists' tools for grounding their innovations and findings in the existing collective knowledge. They serve semantically distinct purposes, as scientists use them at different parts of their work to convey specific information. As a result, a crucial aspect of scientific document understanding is recognizing the authorial intent associated with citations. Current state-of-the-art methods rely on the contextual sentences surrounding each citation to classify its intent; in the absence of textual content, these approaches become unusable. In this work, we propose a text-free citation intent classification method built on relational information among scholarly works. To this end, we introduce a large-scale knowledge graph built from the publications in the SciCite dataset and their multi-hop neighborhood extracted from the Semantic Scholar Open Research Corpus (S2ORC). We also augment this knowledge graph by adding weakly labeled links based on the intent information available in S2ORC. Finally, we cast the intent classification task as a link prediction problem on the newly created knowledge graph. We study this problem in both transductive and inductive settings. Our experimental results show that we can achieve a macro F1 score comparable to word embedding content-based methods by relying only on features and relations derived from this knowledge graph. Specifically, we achieve macro F1 scores of 62.16 and 59.81 in the transductive and inductive settings, respectively, on the link-level SciCite dataset.
Moreover, by combining our method with the state-of-the-art NLP-based model, we achieve improvements across all metrics.

Keywords
Citation Intent Classification, Knowledge Graphs, Graph Neural Networks, Large Language Models, Weakly Supervised Learning

The Third AAAI Workshop on Scientific Document Understanding 2023, February 14th, 2023, Washington, DC, USA
* Corresponding author.
xinweidu@usc.edu (X. Du); ahrabian@usc.edu (K. Ahrabian); arunbaal@usc.edu (A. B. S. Ananthan); myloth@usc.edu (R. D. Myloth); jpujara@usc.edu (J. Pujara)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Citations are the primary way of identifying past contributions and connecting progress in new publications to existing literature. Nevertheless, not all citations carry the same meaning. Authors use citations with specific intents behind them. For example, some papers are cited to provide background information in a domain, while others are cited when adopting or adapting a previously used methodology. There are also scenarios where the same paper is simultaneously used as background information and as a methodology use case in different contexts. Understanding citation intent is crucial to studying scholarly works, given the universality of citations. Current state-of-the-art citation intent classification models [17, 1, 4] rely heavily on textual information, e.g., the sentences surrounding the citation. However, such information is expensive to obtain and in some scenarios altogether inaccessible. Consequently, we need models that can operate without access to textual information. Previous works [3, 26, 6] have shown the importance of the relational and structural information available in links among publications for various tasks. In this work, we propose a general citation intent classification method that relies purely on structural information.

Besides helping researchers better understand the relationships among publications, citation intent analysis has been used to study various other aspects of scientific works, such as research domain evolution [10], scientific impact analysis [19], scientific document summarization [5], and retrieval of related scientific works [16]. The three main categories of citations are "Result," "Method," and "Background" [4]. These categories describe the reasons behind making a scientific connection, i.e., referencing one publication in another. Classifying citations into these groups has traditionally required a high level of expertise in the respective scientific domains. This constraint, combined with the high cost of expert human labor, has resulted in highly scarce datasets, which makes the task even more difficult.

Previous works have proposed classifying citation intent through feature engineering-based [10] and representation learning-based [1] methods. However, most of these methods depend on textual information. As a result, they require a complex multi-stage pipeline of parsing documents, identifying citation contexts, and predicting citation intent [13]. Besides being prone to error propagation across pipeline stages, these models are limited to situations where the full text is available in a proper format. This work introduces a purely graph-based approach to classifying citation intent. We extend the existing SciCite dataset with 2-hop neighborhoods extracted from the Semantic Scholar Open Research Corpus (S2ORC). To further enrich the graph, we utilize the intent information provided in S2ORC to create a weakly supervised knowledge graph (KG) consisting of the publications and the relations that match the provided intents.
Our main idea is to use contextualized relational patterns to make predictions, obviating the need for textual context. Given the newly built KG, we cast the intent classification problem as the common link prediction problem on KGs. Specifically, we train a model to learn representations for entities and relations. Using these representations, we run the following query on the KG: (𝑠, ?, 𝑜), where 𝑠 cites 𝑜. Converting this problem into a link prediction task allows us to adapt and extend widely used KG embedding models. We study the link prediction problem in both transductive and inductive settings. Our experimental results show that although our KG-based method underperforms compared to the large language model-based approaches, it is comparable or even superior to the word embedding-based methods. Moreover, our experiments with combining the NLP-based and graph-based methods show slight improvements over the current state-of-the-art model. These findings further signify the importance of relational patterns for citation intent classification.

The contributions of this work are as follows:
1. Extending the SciCite dataset using the S2ORC dataset to generate a large-scale weakly supervised KG.
2. Introducing a novel graph-based approach for citation intent classification built on top of the newly built KG.
3. Presenting benchmarks for both transductive and inductive settings.
4. Presenting analyses on the effect of different parts of the methodology, such as weak supervision and feature engineering.

2. Related Work

2.1. Citation Function/Intent Schemes

Many prior works have studied the problem of creating categorical schemes for citation intent, which in some works is referred to as citation function [9]. Earlier works focused on creating fine-grained categories, going as far as defining 35 [7] and 12 [21] fine-grained schemes for scientific arguments. More recent works, however, have focused on creating more concise categories. For example, ACL-ARC [10] proposes a 6-class intent categorization scheme: Background, Motivation, Uses, Extension, Comparison or Contrast, and Future. SciCite [4] is even more restrictive and drops or combines small fine-grained classes to provide a more concise 3-class annotation scheme: Background, Method, and Result.

2.2. Citation Intent Classification Methods

Before the explosion of deep learning approaches, most methods relied on a combination of hand-crafted features and classic machine learning models. For example, in one instance [23], the authors propose 12 different features, including citation count, PageRank value, and author overlap, and use classic machine learning models such as SVM and Random Forest for classification. In another instance [10], the authors define pattern-based, topic-based, and prototypical argument features and use an SVM to make predictions. With the advent of deep learning models and the emergence of large language models in recent years, representation learning-based methods have outperformed the hand-crafted methods, achieving higher accuracy by considering the textual information. Recent works have proposed the use of structural scaffolds [4], BERT-based models trained on scientific corpora (SciBERT) [1], word embedding-based approaches [17], and a heterogeneous context graph based on an academic network [26].

2.3. Knowledge Graph Embedding Models

KGs are structured information repositories consisting of a set of nodes representing entities and a set of typed edges representing relations. Since, in most cases, the KG nodes and edges are not attributed, KG embedding (KGE) models aim to learn low-dimensional representations for all entities and relations. The most common traditional shallow KGE methods are TransE [2], ComplEx [22], and RotatE [20]. More recent GNN-based KGE methods leverage the message-passing scheme of GNNs, enabling more complex multi-hop reasoning. Examples of these methods are GCN [11], which leverages spectral information for propagation but is limited to mono-relational KGs, R-GCN [18], which extends GCN to support multi-relational KGs, and GraphSAGE [8], which introduces an inductive framework to handle unseen nodes.

3. Dataset

The SciCite dataset focuses on individual citation links and ignores the significance of broader relational connections and features. To overcome this issue, we construct a knowledge graph by mapping each entity in the SciCite dataset to S2ORC and adding their 2-hop citation neighborhoods. S2ORC contains more than 206 million publications and 2.49 billion citation links. Apart from the regular citation links, this corpus provides partial intent labels for citations using a 3-class scheme as follows:
1. Background: Describes a problem, topic, or concept
2. Method: Provides a method, tool, or dataset
3. Result: Makes a comparison

Moreover, the SciCite dataset is tailored for sentence classification methods, where the input features are textual excerpts and the output labels are citation intents. We reformulate this task as link prediction on KGs, where the input features are representations of the source (citing) paper and the target (cited) paper, and the output is the label of the citation link between the source and target. We release all our datasets under a CC-BY-SA license at TBD.

Table 1
Statistics of the SciCite dataset and the reconstructed datasets.

Dataset        SciCite    SciCiteorigin   SciCiteresplit
Level          Sentence   Link            Link
# Samples      11,020     10,379          5,766
# Train        8,243      7,602           4,122
# Validation   916        916             822
# Test         1,861      1,861           822

3.1. Entity Mapping

We first map each paper in the SciCite dataset to S2ORC by matching SciCite's IDs to Semantic Scholar's SHA IDs. Since a publication could have many SHA IDs but only one Corpus ID, we then map each SHA ID to the unique Corpus ID to extract unique entities. Of the 13,080 papers with unique IDs in SciCite, we successfully map 13,019 to valid SHA IDs in Semantic Scholar, while the remaining 61 papers do not have any corresponding records. We believe this is due to publication removals, as the SciCite dataset was created from S2ORC in 2019. After converting SHA IDs to Corpus IDs, we end up with 13,011 unique entities and 8 duplicate entities.

3.2. Dataset Splitting

The original SciCite dataset contains 11,020 human-labeled samples. To adapt it to our link prediction setting, we reconstruct two datasets: SciCiteorigin and SciCiteresplit. SciCiteorigin adheres to the same benchmarks reported in prior works but is modified to remove overlapping citation links in the training and test sets. To maximize usage of the training data while removing artifacts, we create SciCiteresplit, which performs additional cleaning, provides a stronger separation of the training and test sets, and avoids multi-intent citations. Table 1 showcases the statistics of these datasets.

SciCiteorigin:
To make methods comparable, we use the same validation and test sets as SciCite for this dataset and keep the training set as close as possible. We convert each publication in the SciCite dataset to a Semantic Scholar entity using the mapped Corpus IDs and drop the contextual sentence-level information. We assign a random unique ID to publications without a Corpus ID. After this procedure, we end up with a set of links for our link prediction task. Due to the removal of the contextual information, some of the training links appear exactly the same in the test set. Hence, we remove 641 training set samples that also appear in the test set to prevent data leakage. Moreover, since only one link in the test set has multiple intents, we treat the link prediction problem as a multi-class task rather than a multi-label task. In this scenario, the multi-intent links are represented as separate samples with the same inputs and different outputs. Multi-label methods may be a promising future extension of our work.

SciCiteresplit:
Even though we convert the SciCite dataset to SciCiteorigin, problems such as duplicate citations and multi-label links still exist. Therefore, we further tailor the SciCite dataset to create a better link prediction dataset for graph-based models. First, we remove all the entities, and their related samples, that do not have a mapped Corpus ID. Then, similar to SciCiteorigin, we convert the remaining samples to a set of links. Following this, we drop all duplicate samples. Among the remaining 6,458 unique links, 5,886 have only one intent, 489 have two intents, and 83 have all three intents. We remove all the multi-intent links and resplit the dataset with ratios of 70%/15%/15% for the training, validation, and test sets, respectively.

4. Method

Throughout the rest of this work, for simplicity, we use the term publication to denote all types of academic publications, e.g., books and papers. Moreover, we use the terms citation and reference to denote incoming and outgoing links, respectively.

4.1. Knowledge Graph Construction

Given the S2ORC dataset, we expand the SciCite dataset using the mapped entities to construct a KG containing the 2-hop neighborhoods of the publications. Figure 1 illustrates an overview of the expanded KG. This work uses the 2022-09-13 version of the corpus downloaded from the bulk API. Formally, given the set of mapped entities 𝒱0, the set of 𝑘-hop nodes 𝒱𝑘 is defined as

𝒱𝑘 = 𝒱𝑘−1 ∪ {𝑦 | ∃𝑥 ∈ 𝒱𝑘−1 : 𝑦 ∈ 𝒩𝑥}   (1)

where for a given entity 𝑥, 𝒩𝑥 denotes all the entities that cite or are cited by 𝑥, i.e., the set of neighboring entities. Given the set of unlabeled links 𝒰 and the set of weakly labeled links ℒ, the set of 𝑘-hop edges ℰ𝑘 is defined as

ℰ𝑘^𝒰 = {(𝑥, 𝑦, UNK) | 𝑥, 𝑦 ∈ 𝒱𝑘, (𝑥, 𝑦) ∈ 𝒰}   (2)
ℰ𝑘^ℒ = ∪𝑟 {(𝑥, 𝑦, 𝑟) | 𝑥, 𝑦 ∈ 𝒱𝑘, (𝑥, 𝑦) ∈ ℒ𝑟}   (3)
ℰ𝑘 = ℰ𝑘^𝒰 ∪ ℰ𝑘^ℒ   (4)

where 𝑟 ∈ {Background, Method, Result} and ℒ𝑟 denotes the set of all weakly labeled links with label 𝑟. Consequently, given the sets of 𝑘-hop nodes 𝒱𝑘 and edges ℰ𝑘, the extracted 𝑘-hop KG, 𝒢𝑘, is defined as

𝒢𝑘 = (𝒱𝑘, ℰ𝑘)   (5)

Figure 1: Overview of the extracted multi-hop KG. The set of 0-hop nodes 𝒱0 includes all the orange nodes. The set of 1-hop nodes 𝒱1 includes all the orange and blue nodes. Similarly, the graph could be expanded to include 𝑘-hop nodes 𝒱𝑘. The annotated set on each edge represents that specific link's intent; the empty set denotes that the citation link has no intent label.

The specific statistics of the extracted KGs and the original Semantic Scholar corpus are reported in Table 2. Since not every link has a weakly labeled intent, this table also provides the percentage of weakly labeled links for each corresponding graph. Although we extract 𝒢2, given its scale, we opt to run our current experiments only on 𝒢1 and leave the larger-scale experiments for future work.

Table 2
Statistics of the extracted KGs along with the original S2ORC dataset.

Dataset         # Nodes      # Citation Links  # Background  # Method     # Result    Weak Labels
Zero-Hop (𝒢0)   13,011       10,733            5,479         4,403        1,335       79.04%
One-Hop (𝒢1)    5,862,261    119,776,090       39,202,086    16,830,665   16,830,665  43.18%
Two-Hop (𝒢2)    57,535,880   1,621,293,902     467,860,523   121,877,053  35,283,718  34.41%
S2ORC           206,159,629  2,495,513,737     643,955,457   169,472,164  45,779,793  31.90%

4.2. Weak Supervision

In order to enrich our data and provide more information to the models, we extract the set of intents provided in the S2ORC dataset for each citation link. The intent labels in S2ORC are extracted using the structural scaffolds model [4] at the sentence level.
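The graph construction above (Equations 1-5) amounts to a breadth-first widening of the SciCite seed set over an undirected citation adjacency, with weakly labeled links keeping their intent relation and the rest typed as UNK. The following is a minimal Python sketch over a toy in-memory citation list; the IDs, intents, and function name are invented for illustration and are not the released code:

```python
from collections import defaultdict

# Hypothetical toy corpus: (citing_id, cited_id, weak_intent_or_None) triples.
# In the paper, the weak intents come from S2ORC's noisy sentence-level labels.
CITATIONS = [
    (1, 2, "Background"),
    (2, 3, "Method"),
    (3, 4, None),          # citation link without a weak intent label
    (5, 2, "Result"),
]

def k_hop_kg(seed_nodes, citations, k):
    """Expand V_0 into V_k (Eq. 1) and collect the typed edge set E_k (Eqs. 2-4)."""
    # Undirected adjacency: N_x contains papers that cite or are cited by x.
    neighbors = defaultdict(set)
    for src, dst, _ in citations:
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    nodes = set(seed_nodes)                      # V_0
    for _ in range(k):                           # V_k = V_{k-1} ∪ {y | x ∈ V_{k-1}, y ∈ N_x}
        nodes |= {y for x in nodes for y in neighbors[x]}

    # E_k: keep links whose endpoints both fall inside V_k; unlabeled links get
    # the placeholder relation "UNK" (Eq. 2), weakly labeled ones keep their
    # intent relation (Eq. 3).
    edges = {(s, d, intent or "UNK")
             for s, d, intent in citations if s in nodes and d in nodes}
    return nodes, edges

nodes, edges = k_hop_kg({1}, CITATIONS, k=2)
```

With seed {1} and k=2, the expansion reaches papers 2, 3, and 5, while paper 4 stays outside the 2-hop neighborhood, so its link is dropped from the edge set.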
In this scenario, we implicitly use the existing data derived from the content to bootstrap our approach. We refer to these links as weakly labeled because they are labeled by a noisy model rather than a human expert. Since the intent labels are partial and at the sentence level, citation links could have zero intents in the absence of text or several intents in an abundance of use cases.

4.3. Feature Engineering

Since none of the publications in our KGs have any features or pre-defined representations, we propose to represent them through their references, citations, and graph-based features. More specifically, from S2ORC we extract the in-degrees and out-degrees of citations (or references), background links, method links, and result links. As a result, each paper is represented by an 8-dimensional feature vector: 4 in-degree and 4 out-degree features.

Table 3
Intent classification results on the SciCiteorigin and SciCiteresplit datasets. All the metrics are macro averaged. Bold values represent the highest performance within the metric and dataset scope.
                                     SciCiteorigin                        SciCiteresplit
Method                Setting        Accuracy Precision Recall F1         Accuracy Precision Recall F1
Random                Universal      33.05    33.05     33.83  31.22      32.99    32.88     33.85  31.89
Most Common           Universal      53.57    17.86     33.33  23.26      42.63    14.21     33.33  19.93
TransE                Transductive   40.41    37.09     37.81  36.52      39.57    35.96     35.70  35.59
ComplEx               Transductive   49.01    44.11     37.94  33.30      40.25    41.85     35.64  28.78
RotatE                Transductive   23.54    32.97     32.74  22.98      28.12    36.88     36.31  27.88
Random + MLP          Transductive   49.60    30.58     35.17  32.42      45.35    30.26     35.83  32.78
TransE + MLP          Transductive   54.16    45.77     45.21  45.24      51.93    45.68     44.16  43.89
ComplEx + MLP         Transductive   55.72    47.80     45.19  44.77      48.64    43.46     43.15  43.24
RotatE + MLP          Transductive   56.37    48.79     46.15  46.55      51.81    46.92     45.46  45.63
Infersent-KMeans      Universal      -        58        64     60         -        -         -      -
Infersent-HDBSCAN     Universal      -        57        63     58         -        -         -      -
Glove-KMeans          Universal      -        51        56     51         -        -         -      -
Glove-HDBSCAN         Universal      -        52        57     52         -        -         -      -
MHLP (Ours)           Transductive   66.20    62.18     56.13  57.88      66.10    63.69     61.33  62.16
MHLP (Ours)           Inductive      63.94    58.36     55.05  56.13      64.17    59.86     59.83  59.81
Structural Scaffolds  Universal      -        84.7      83.6   84.0       -        -         -      -
SciBERT               Universal      86.94    85.30     85.92  85.58      86.39    85.51     85.14  85.28
SciBERT + MHLP        Universal      87.53    85.56     87.07  86.25      86.85    86.80     85.96  86.35

For the publications where the content is unavailable, the out-degree intent-based features will be zero, since those features are based on the noisy sentence-level model that Semantic Scholar uses. However, the in-degree features may be non-zero as long as the citing paper's content is available. For new publications, i.e., unseen nodes in the inductive setting, the only known non-zero feature is the reference count.

We normalize the reference and citation features by a biased log factor defined as

ℎ̄𝑥 = log10(ℎ𝑥 + 1 + 𝛼)   (6)

where 𝛼 is a bias hyperparameter. We specifically set 𝛼 = −0.9 to get a normalized value of −1 for zero-reference and zero-citation situations. Moreover, we normalize the non-zero in-degree intent-based features into a [0, 1] probability distribution as follows:

ℎ̄𝑥 = ℎ𝑥 / (ℎBackground + ℎMethod + ℎResult)   (7)

The same normalization step is applied to the out-degree features separately.

4.4. Baselines

Knowledge Graph Embedding Models:
Traditional KGE models consist of two shallow embeddings as entity and relation encoders and a score function as a decoder to predict the likelihood of a link. These models are trained in a contrastive way by masking either one of the entities in a given triplet (head, relation, tail) and sampling a set of negative entities to contrast against the positive entity. Since the traditional KGE methods rely on shallow embeddings for encoding entities and relations, they can only be used in the transductive setting and cannot operate on unseen nodes. For our experiments, we use the available implementations of TransE, ComplEx, and RotatE in the DGL-KE toolkit [27]. In the evaluation phase, we calculate the likelihood of each relation type for every link and take the relation with the highest likelihood as the model's intent prediction.

Hybrid Models:
To increase the reasoning power of the traditional KGE models, we devise a two-stage approach based on a multilayer perceptron (MLP). We first use the traditional KGE models to learn embeddings for entities and relations. Then, instead of relying on the produced likelihood scores, we concatenate the vectors of the two entities and pass the result through an MLP to get logit values. Formally, given a link (𝑢, 𝑣) and the respective learned representations (𝑧𝑢, 𝑧𝑣), we calculate the logit values as

𝑝 = MLP([𝑧𝑢 ‖ 𝑧𝑣])   (8)

where 𝑝 ∈ ℝ^|𝒞| contains the unnormalized logits for each class. The predicted class 𝑐 is then calculated as

argmax𝑐 sigmoid(𝑝)   (9)

Figure 2: Overview of the composite model. The model consists of two encoders, for the citation phrase and for the citation graph around the citation link. During the training phase, we freeze the SciBERT model in the first two epochs as a warm-up step for the graph encoder; then, we jointly train both encoders along with the final prediction module.

Natural Language Processing Models:
We include the reported results of several state-of-the-art Natural Language Processing (NLP) methods. Specifically, we include results from the word embedding-based methods Infersent-KMeans, Infersent-HDBSCAN, Glove-KMeans, and Glove-HDBSCAN [17], the BiLSTM-based method Structural Scaffolds [4], and the large language model-based method SciBERT [1]. Moreover, we report the results of fine-tuning a pre-trained SciBERT model on both datasets. All these methods use textual information and are evaluated on the SciCite dataset.

4.5. Multi-Hop Link Prediction (MHLP)

Transductive and inductive settings are the most common link prediction evaluation schemes for KGs. The main difference between the two settings is having a fixed set of nodes in both the training and evaluation phases (transductive) versus allowing the addition of unseen nodes in the evaluation phase (inductive). This work refers to citation intent prediction on unseen publications as the inductive setting, whereas the transductive setting refers to citation intent prediction on already seen publications.

We propose an adaptable graph-based model for citation intent prediction in both the transductive and inductive settings. The primary basis of this approach is that a node, i.e., a publication, can be represented as a combination of its neighboring nodes' representations. Let ℎ𝑥^(0) be the extracted feature vector for an arbitrary node 𝑥. We calculate the representation of an arbitrary node 𝑣 at layer 𝑙 + 1 of a multilayer model as

ℎ𝒩𝑣^(𝑙+1) = (1 / |𝒩𝑣|) Σ_{𝑢∈𝒩𝑣} ℎ𝑢^(𝑙)   (10)

ℎ𝑣^(𝑙+1) = 𝜎(𝑊^(𝑙+1) [ℎ𝑣^(𝑙) ‖ ℎ𝒩𝑣^(𝑙+1)])   (11)

where 𝜎 is a non-linear function. Throughout our experiments, we specifically use ReLU to introduce non-linearity. Given the node representations from an 𝐿-layer model and a link (𝑢, 𝑣), we calculate the logit values as

𝑝 = MLP([ℎ𝑢^(𝐿) ‖ ℎ𝑣^(𝐿)])   (12)

where 𝑝 ∈ ℝ^|𝒞| contains the unnormalized logits for each class and 𝒞 is the set of all classes. The predicted class 𝑐 is then calculated as

argmax𝑐 sigmoid(𝑝)   (13)

The main disadvantage of the inductive setting is that unseen nodes have only one available feature, i.e., the reference count. This absence of information makes the task extremely difficult, as the feature vectors are highly sparse.
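The layer in Equations 10-11 and the link scoring in Equation 12 can be sketched in PyTorch as follows; the dimensions, class names, and toy graph are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class MHLPLayer(nn.Module):
    """One layer of Eqs. 10-11: mean-aggregate neighbor features, then apply a
    linear map plus ReLU to the concatenated self/neighbor representations."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(2 * in_dim, out_dim)

    def forward(self, h, neighbors):
        # h: (num_nodes, in_dim); neighbors[v]: list of node ids adjacent to v.
        rows = []
        for v in range(h.size(0)):
            nbrs = neighbors[v]
            h_nv = h[nbrs].mean(dim=0) if nbrs else torch.zeros(h.size(1))  # Eq. 10
            rows.append(torch.cat([h[v], h_nv]))                            # [h_v ‖ h_Nv]
        return torch.relu(self.lin(torch.stack(rows)))                      # Eq. 11

def link_logits(mlp, h, u, v):
    """Eq. 12: score a link (u, v) from the final-layer node representations."""
    return mlp(torch.cat([h[u], h[v]]))

# Toy usage with the paper's 8-dimensional engineered features (values made up).
layer = MHLPLayer(in_dim=8, out_dim=16)
mlp = nn.Linear(2 * 16, 3)            # 3 intent classes
h0 = torch.randn(4, 8)                # 4 papers
adj = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
h1 = layer(h0, adj)
p = link_logits(mlp, h1, 0, 1)        # unnormalized logits for one citation link
```

Because the aggregation only reads neighbor features, the same layer applies to unseen nodes at evaluation time, which is what enables the inductive setting discussed above.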
However, our model diminishes this effect by using the message-passing scheme, as defined in Equation 11, to aggregate information through connected entities, i.e., cited papers, creating a denser representation for the unseen nodes.

All our models are trained using the cross-entropy loss defined as

𝑙𝑛 = −log(exp(𝑝𝑦𝑛) / Σ_{𝑖=1}^{|𝒞|} exp(𝑝𝑖))   (14)

where 𝑝𝑥 is the logit value for class 𝑥 given the prediction vector 𝑝.

Composite Model:
To further test the capabilities of our proposed model and use both structural and textual information, we devise a multi-modal model comprising encoders for both the graph structure and the citation context. Specifically, we use a pre-trained SciBERT model for encoding the citation phrase text and our MHLP model for encoding the citation graph around the citation link. Figure 2 illustrates an overview of the composite model.

5. Experiments

In this section, we report our experimental results on both the SciCiteorigin and SciCiteresplit datasets. All the graph-based experiments are carried out on the 𝒢1 KG. For the traditional KGE methods, we tune their hyperparameters as described in Appendix A.1 and train them using the hyperparameters showcased in Table 4. For the hybrid methods, the KGE component is first trained to generate node features using the hyperparameters described in Table 4; then, the MLP component is trained using the procedure described in Appendix A.2 to predict the citation intent. For the MHLP-based methods, in both the transductive and inductive settings, we use a 1-layer variation on top of the normalized features extracted as described in Section 4.3; moreover, we tune their hyperparameters and train them as described in Appendix A.3. For the SciBERT method, we freeze the pre-trained model and add an MLP module on top of the 768-dimensional [CLS] token output. Similar to the other models, the MLP module is tuned using the parameters described in Appendix A.2. For the composite model, during the training phase, we freeze the SciBERT model in the first two epochs as a warm-up step for the graph encoder; then, we jointly train both encoders along with the final prediction module.

To control for the effect of pre-training with traditional KGE models, we also run a variation with randomly initialized node features, designated as "Random + MLP." For the NLP models, we use the previously reported results [17] to compare our models on the test set-aligned SciCiteorigin dataset. Finally, we also include the results of random and most-common-class predictions as sanity checks. All the models are implemented using PyTorch [14] and trained on a machine with a single Quadro RTX 8000 GPU, 72 CPU cores, and 768GB of RAM. Implementations are available under a CC-BY-SA license at TBD.

Figure 3: The statistics of citation intent for all publications in the Semantic Scholar corpus: (a) the number of different citation intents; (b) the percentage of different citation intents. The temporal trends stay steady over time, suggesting a lack of information in the time elapsed between publication and citation.

5.1. Results

Table 3 illustrates our experimental results on both datasets. As evident from Table 3, traditional KGE methods perform poorly on both datasets, only slightly beating the random baseline on the macro F1 metric. Interestingly, both ComplEx and RotatE perform worse than TransE on both datasets. This finding is surprising, as both ComplEx and RotatE are more expressive than TransE [20]. However, when combined with MLP models, all exhibit a significant performance boost, up to more than 100% in the case of RotatE. After this addition, we can see the expected expressivity trend in the model results, i.e., the more powerful the model, the better the result. Moreover, the control "Random + MLP" experiment shows results very similar to the random baseline, indicating the importance of both components for the hybrid model to perform well. Altogether, it is evident that the reasoning power of shallow traditional KGE models is not enough to capture the complexity of this task, and we require models with more reasoning power.

As for the MHLP method, in the transductive setting, it achieves macro F1 scores of 57.88 and 62.16 on the SciCiteorigin and SciCiteresplit datasets, respectively. Moreover, its inductive results showcase the robustness of our approach in an extreme out-of-distribution setting, achieving macro F1 scores of 56.13 and 59.81. Compared to previously reported results [17], our model achieves performance superior to the Glove-based models while slightly lagging behind the Infersent-based models. Looking into the precision and recall comparison, our method has better precision scores in both the transductive and inductive settings than all word embedding-based models; for recall, however, it performs better than the Glove-based models and worse than the Infersent-based models, which might stem from the imbalance in the links illustrated in Figure 3a. Further experimentation to address the class imbalance problem in future work might help improve the overall performance of MHLP.
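As a side note on evaluation, the macro-averaged scores reported in Table 3 weight all three intent classes equally, which matters given the rarity of the "Result" class. A minimal sketch of this bookkeeping (generic code, not the authors' evaluation script):

```python
def macro_scores(y_true, y_pred, classes=("Background", "Method", "Result")):
    """Macro averaging: compute precision/recall/F1 per class, then take the
    unweighted mean so that rare classes count as much as frequent ones."""
    precs, recs, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precs.append(prec)
        recs.append(rec)
        f1s.append(f1)
    n = len(classes)
    return sum(precs) / n, sum(recs) / n, sum(f1s) / n

# Toy predictions over four citation links.
p, r, f1 = macro_scores(
    ["Background", "Background", "Method", "Result"],
    ["Background", "Method", "Method", "Result"],
)
```

Under macro averaging, a classifier that always predicts the majority class scores poorly (as the "Most Common" row in Table 3 shows), because it earns zero recall on the two classes it never predicts.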
The significance of these results is that we show structural and relational information could be used to achieve relatively high performance without using textual information. Moreover, although our mod- els underperform compared to language model-based approaches such as Structural Scaffolds and SciBERT, (a) Publication features (both sides) we showcase interesting future directions for combining graph-based and NLP-based methods. Finally, the composite model denoted as SciBERT + MHLP in Table 3, achieves the best performance among all models, even beating the fine-tuned SciBERT. When considering MHLP’s standalone performance, these re- sults showcase the potential improvements that could be achieved through the use of structural information that is not available in citation phrases. The presented ex- periments are a stepping stone for better understanding and using the structural information at scale for citation intent classification. 6. Analysis 6.1. Temporal Analysis This analysis studies the relationship between the time (b) Averaged neighborhood features (both sides) that has passed since publication and citation intent. We hypothesize that a paper is more likely to be cited as Figure 4: The calculated MI values for publication features “Result” or “Method” right after its publication, and as and averaged neighborhood features. On average, the publica- time passes, it will be more likely to be cited as “Back- tion features show stronger connections to the target variable. ground.” If this is proven accurate, we could get a rela- tively strong signal from the temporal information for each citation. We plotted the years after publication analysis or studies of temporal information for citation against intent counts and ratios for all papers in the se- intent classification. mantic scholar corpus to test our hypothesis. Figure 3a and 3b illustrate the results of our analysis. As evident 6.2. 
As evident from these figures, and contrary to our original hypothesis, the ratio of intent classes stays almost the same as time passes, with insignificant fluctuations. As a result, using temporal information in our models is unlikely to provide any significant improvement. Note that these results are based on the weakly labeled links we obtained from S2ORC; these links are generated by another noisy model that could potentially be biased. Hence, they should not discourage further analysis or studies of temporal information for citation intent classification.

6.2. Mutual Information Analysis

In this analysis, we study the quality of the engineered features described in Section 4.3 with respect to the weakly labeled intent classes. To this end, we use the well-known mutual information (MI) [12] measure to quantify the importance of each feature. Formally, the MI between two discrete random variables 𝑋 and 𝑌 is defined as

    𝐼(𝑋, 𝑌) = Σ_{𝑦∈𝒴} Σ_{𝑥∈𝒳} 𝑃_{𝑋,𝑌}(𝑥, 𝑦) log( 𝑃_{𝑋,𝑌}(𝑥, 𝑦) / (𝑃_𝑋(𝑥) 𝑃_𝑌(𝑦)) )    (15)

where 𝒴 is the value space of 𝑌, 𝒳 is the value space of 𝑋, 𝑃_{𝑋,𝑌} is the joint probability distribution, and 𝑃_𝑋 and 𝑃_𝑌 are the marginal probability distributions. Note that MI is non-negative, and higher values indicate a stronger dependence between the two random variables. For our analysis, we calculate MI for both sides of the 5,886 unique citation links in the SciCiteresplit dataset. Moreover, to study these features in the graph context, we also calculate MI for the average of these features over the neighborhood of each publication, i.e., all citing and cited publications, from both sides of the citation links. Figures 4a and 4b present the results of our experiments. As evident from these results, while the publication features generally show stronger connections to the target variable, the neighborhood-averaged features seem to show complementary connections, further emphasizing the importance of using both sets of features.
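Equation (15) can be estimated for two discrete variables by plugging empirical co-occurrence counts into the formula. A stdlib sketch in nats (this simple plug-in estimator is an assumption; the cited work [12] describes more sophisticated estimators):

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) = sum_{x,y} P(x,y) log(P(x,y) / (P(x) P(y)))."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))  # joint counts
    px = Counter(xs)            # marginal counts of X
    py = Counter(ys)            # marginal counts of Y
    return sum(
        (nxy / n) * log((nxy / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), nxy in pxy.items()
    )

# A perfectly dependent pair yields I(X;Y) = H(X) = log 2 nats;
# an independent pair yields 0.
print(mutual_information([0, 1, 0, 1], [1, 0, 1, 0]))  # → 0.6931...
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # → 0.0
```

Ranking features by this quantity against the weak labels is exactly the comparison Figures 4a and 4b visualize.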
6.3. Feature Quality Analysis

In this analysis, we study the effect of the normalization described in Equations 6 and 7. To this end, we project the extracted features of the 5,886 unique citation links in the SciCiteresplit dataset into a 2-dimensional space using t-SNE [24]. Figures 5a and 5b illustrate the projected space for the unnormalized and normalized features, respectively. As evident from Figure 5a, it is challenging to distinguish the different intent types in the unnormalized space. However, after normalization, as evident from Figure 5b, the "Method" intent more or less forms a distinguishable cluster. This result shows that normalization is potentially helpful for the model. Further studies on different types of normalization and their effects are left for future work.

[Figure 5: The t-SNE visualizations of the features (a) before and (b) after normalization.]
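Intuitively, normalization helps here because raw count features scale with a publication's citation volume, whereas ratios make publications of different degrees comparable. A sketch assuming simple L1 (ratio) normalization — the paper's Equations 6 and 7 are not reproduced in this section, so this specific form is an assumption:

```python
def intent_count_features(citations):
    """citations: list of intent labels attached to one publication's links.
    Returns raw counts and an L1-normalized version (a sketch; the paper's
    Equations 6 and 7 define the actual normalization)."""
    classes = ["Background", "Method", "Result"]
    counts = [sum(1 for c in citations if c == cls) for cls in classes]
    total = sum(counts)
    ratios = [c / total for c in counts] if total else [0.0] * len(classes)
    return counts, ratios

counts, ratios = intent_count_features(["Method", "Method", "Background"])
print(counts, ratios)  # counts=[1, 2, 0], ratios≈[0.33, 0.67, 0.0]
```

Under this view, a heavily cited and a lightly cited methods paper land near each other in ratio space even though their raw counts differ by orders of magnitude.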
6.4. Robustness Analysis

In this analysis, we study the robustness of our proposed graph-based method through two ablation studies. In the first study, we randomly corrupt a percentage of the weak labels by replacing the correct label with a random one; this study aims to better understand the model's resilience to noise. In the second study, we randomly remove a percentage of the weak labels; the idea here is to better understand the effect of weak supervision on the model's performance. These studies are carried out by running the MHLP method in the transductive setting on both the SciCiteorigin and SciCiteresplit datasets. The feature vectors for the publications are calculated by counting the number of citations and intents; these vectors are then normalized using Equations 6 and 7.

To analyze the relationship between the model's performance and the amount of available data, we create ten variations of the dataset, each using a portion of the available weak labels, ranging from all of them down to only 10%. Figure 6a presents the result of this study. As evident from Figure 6a, the more weakly labeled links are available, the better our method performs. The other significant observation is the robustness of the model, even in the extreme scenario of having access to only 10% of the labels. Note that only 31.90% of the links in S2ORC have at least one weakly labeled intent, which means that even at 100% utilization, only 31.90% of the citation links are weakly labeled.

Figure 6b showcases the relationship between model performance and the percentage of corrupted data. Following our intuition, the model's performance monotonically decreases as we add more noisy labels. However, two interesting observations can be made from this figure. First, the performance of our method drops by less than five macro F1 points when half (50%) of the weak labels are replaced with randomly assigned noisy labels, showing that the proposed method is exceptionally resilient to labeling mistakes. Second, even when all the labels are replaced with random ones (100%), the model still performs better than the random baselines. This indicates that the model learns to make inferences from purely structural information, which further solidifies our hypothesis regarding the importance of structural information.

[Figure 6: The macro F1 score of MHLP (Transductive) on the SciCiteorigin and SciCiteresplit datasets for (a) varying percentages of utilized weak labels and (b) varying percentages of corrupted data.]
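The two ablations above can be sketched as follows; the function names and seeding scheme are illustrative, not the paper's implementation:

```python
import random

def corrupt_labels(labels, frac, classes, seed=0):
    """Ablation 1: replace a fraction of the weak labels with uniformly
    random classes (a replacement may coincide with the original by chance)."""
    rng = random.Random(seed)
    labels = list(labels)
    for i in rng.sample(range(len(labels)), k=int(frac * len(labels))):
        labels[i] = rng.choice(classes)
    return labels

def drop_labels(labels, keep_frac, seed=0):
    """Ablation 2: keep only a fraction of the weak labels; the rest
    become unlabeled (None)."""
    rng = random.Random(seed)
    keep = set(rng.sample(range(len(labels)), k=int(keep_frac * len(labels))))
    return [lab if i in keep else None for i, lab in enumerate(labels)]

labels = ["Background"] * 10
print(corrupt_labels(labels, 0.5, ["Background", "Method", "Result"]))
print(drop_labels(labels, 0.3))
```

Sweeping `frac` (respectively `keep_frac`) over a range of percentages and retraining at each point produces curves like those in Figures 6a and 6b.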
7. Conclusions and Future Work

In this work, we first introduced an expansion of the SciCite dataset by extracting scholarly information from the S2ORC dataset and creating an extended citation graph. Then, we gathered a large-scale weakly labeled dataset to augment the extracted citation graph with citation intents and create a multi-relational knowledge graph. Following this, we adapted sentence-based intent classification into a citation-based link prediction task on graphs. We then introduced a set of engineered graph-based and citation-based features and, built on top of these features, a graph-based multi-hop reasoning approach for the newly introduced task. Our approach achieves macro F1 scores of 62.16 and 59.81 in the transductive and inductive settings, respectively. The experimental results in the inductive setting further showcase the robustness of the proposed approach in an information-deprived, out-of-distribution environment. Compared to NLP-based models, we reached performance comparable to, and in some cases better than, the word embedding-based methods that rely on contextual sentences to make predictions. Moreover, with a composite model comprising our method as the graph encoder and the state-of-the-art NLP-based model as the text encoder, we outperformed all the other models we experimented with. These results further signify the strong signal in relational information and highlight the importance of future analysis and studies in this domain. Finally, our presented analyses further support our methodological choices.

For future work, one straightforward idea is to extend the knowledge graph with more scholarly information, such as authors, venues, and fields of study; open repositories such as OpenAlex [15] and the Microsoft Academic Graph (MAG) [25] already contain this information. Another direction is further investigation of temporal signals. Last but not least, although we achieved improved performance through a fusion of textual and structural information, more investigation and analysis could be done in this setting in future work.

Acknowledgments

This work was funded by the Defense Advanced Research Projects Agency with award W911NF-19-20271 and with support from a Keston Exploratory Research Award.

References

[1] Iz Beltagy, Kyle Lo, and Arman Cohan. 2019.
SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 3615–3620. https://doi.org/10.18653/v1/D19-1371
[2] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Translating Embeddings for Modeling Multi-Relational Data. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (Lake Tahoe, Nevada) (NIPS'13). Curran Associates Inc., Red Hook, NY, USA, 2787–2795.
[3] Lutz Bornmann and Hans-Dieter Daniel. 2008. What do citation counts measure? A review of studies on citing behavior. Journal of Documentation 64 (2008), 45–80.
[4] Arman Cohan, Waleed Ammar, Madeleine van Zuylen, and Field Cady. 2019. Structural Scaffolds for Citation Intent Classification in Scientific Publications. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 3586–3596. https://doi.org/10.18653/v1/N19-1361
[5] Arman Cohan and Nazli Goharian. 2015. Scientific Article Summarization Using Citation-Context and Article's Discourse Structure. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, 390–400. https://doi.org/10.18653/v1/D15-1045
[6] Daniel Cummings and Marcel Nassar. 2020. Structured Citation Trend Prediction Using Graph Neural Networks. In ICASSP. IEEE, Barcelona, Spain, 3897–3901. http://dblp.uni-trier.de/db/conf/icassp/icassp2020.html#CummingsN20
[7] M.A. Garzone. 1997. Automated Classification of Citations Using Linguistic Semantic Grammars. Thesis (M.Sc.), University of Western Ontario, London, Canada. https://books.google.com/books?id=V-bwSgAACAAJ
[8] William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems (Long Beach, California, USA) (NIPS'17). Curran Associates Inc., Red Hook, NY, USA, 1025–1035.
[9] Myriam Hernández-Alvarez and José M Gomez. 2016. Survey about citation context analysis: Tasks, techniques, and resources. Natural Language Engineering 22, 3 (2016), 327–349.
[10] David Jurgens, Srijan Kumar, Raine Hoover, Dan McFarland, and Dan Jurafsky. 2018. Measuring the Evolution of a Scientific Field through Citation Frames. Transactions of the Association for Computational Linguistics 6 (2018), 391–406. https://doi.org/10.1162/tacl_a_00028
[11] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the 5th International Conference on Learning Representations (ICLR '17). OpenReview.net, Palais des Congrès Neptune, Toulon, France, 14 pages. https://openreview.net/forum?id=SJU4ayYgl
[12] Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. 2004. Estimating mutual information. Physical Review E 69, 6 (2004), 066138.
[13] Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld. 2020. S2ORC: The Semantic Scholar Open Research Corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 4969–4983. https://doi.org/10.18653/v1/2020.acl-main.447
[14] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic Differentiation in PyTorch. In NIPS 2017 Workshop on Autodiff. OpenReview.net, Long Beach, California, USA, 4 pages. https://openreview.net/forum?id=BJJsrmfCZ
[15] Jason Priem, Heather Piwowar, and Richard Orr. 2022. OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. arXiv preprint arXiv:2205.01833 (2022), 5 pages.
[16] Anna Ritchie. 2009. Citation context analysis for information retrieval. Technical Report. University of Cambridge, Computer Laboratory.
[17] Muhammad Roman, Abdul Shahid, Shafiullah Khan, Anis Koubaa, and Lisu Yu. 2021. Citation intent classification using word embedding. IEEE Access 9 (2021), 9982–9995.
[18] Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling Relational Data with Graph Convolutional Networks. In The Semantic Web, Aldo Gangemi, Roberto Navigli, Maria-Esther Vidal, Pascal Hitzler, Raphaël Troncy, Laura Hollink, Anna Tordai, and Mehwish Alam (Eds.). Springer International Publishing, Cham, 593–607.
[19] Henry Small. 2018. Characterizing highly cited method and non-method papers using citation contexts: The role of uncertainty. Journal of Informetrics 12, 2 (2018), 461–480.
[20] Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. 2019. RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space. In ICLR (Poster). OpenReview.net, New Orleans, LA, 18 pages. http://dblp.uni-trier.de/db/conf/iclr/iclr2019.html#SunDNT19
[21] Simone Teufel, Advaith Siddharthan, and Dan Tidhar. 2006. An annotation scheme for citation function. In Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue. Association for Computational Linguistics, Sydney, Australia, 80–87. https://aclanthology.org/W06-1312
[22] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex Embeddings for Simple Link Prediction. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48 (ICML'16). JMLR.org, New York, NY, USA, 2071–2080.
[23] Marco Valenzuela, Vu Ha, and Oren Etzioni. 2015. Identifying Meaningful Citations. In Scholarly Big Data: AI Perspectives, Challenges, and Ideas, Papers from the 2015 AAAI Workshop (Technical Report WS-15-13), Cornelia Caragea, C. Lee Giles, Narayan Bhamidipati, Doina Caragea, Sujatha Das Gollapalli, Saurabh Kataria, Huan Liu, and Feng Xia (Eds.). AAAI Press, Menlo Park, CA, 21–26. http://www.aaai.org/Library/Workshops/ws15-13.php
[24] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing Data using t-SNE. Journal of Machine Learning Research 9, 86 (2008), 2579–2605. http://jmlr.org/papers/v9/vandermaaten08a.html
[25] Kuansan Wang, Zhihong Shen, Chiyuan Huang, Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia. 2020. Microsoft Academic Graph: When experts are not enough. Quantitative Science Studies 1, 1 (2020), 396–413.
[26] Wenhao Yu, Mengxia Yu, Tong Zhao, and Meng Jiang. 2020. Identifying Referential Intention with Heterogeneous Contexts. In Proceedings of The Web Conference 2020 (Taipei, Taiwan) (WWW '20). Association for Computing Machinery, New York, NY, USA, 962–972. https://doi.org/10.1145/3366423.3380175
[27] Da Zheng, Xiang Song, Chao Ma, Zeyuan Tan, Zihao Ye, Jin Dong, Hao Xiong, Zheng Zhang, and George Karypis. 2020. DGL-KE: Training Knowledge Graph Embeddings at Scale. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20). Association for Computing Machinery, New York, NY, USA, 739–748.

A. Hyperparameters

A.1. Knowledge Graph Embedding

We use a randomized search to tune our models and find near-optimal hyperparameters over the following ranges: embedding dimension ∈ {50, 100, 200}, learning rate ∈ {0.03, 0.1, 0.3}, regularization coefficient ∈ {0.0, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5}, number of negative samples ∈ {64, 128, 256, 512, 1024}, 𝛼 ∈ {0.25, 0.5, 1}, and 𝛾 ∈ {6, 12, 24}. Note that 𝛼 and 𝛾 are the adversarial temperature and the margin value (RotatE only), respectively.

Table 4: Hyperparameters of the KGE algorithms.

Hyperparameter               TransE   ComplEx   RotatE
embedding dimension          100      100       50
learning rate                0.1      0.3       0.1
regularization coefficient   1e-6     1e-6      1e-6
negative sample size         128      512       64
𝛼                            0        0.25      1
𝛾                            -        -         6

A.2. Multilayer Perceptron

To simplify the model tuning process, we find the optimal hyperparameters of "ComplEx + MLP" on SciCiteorigin using grid search and reuse them for the rest of our experiments. Specifically, we run a grid search over the following ranges: number of layers ∈ {0, 1, 2, 3}, dropout ∈ {0, 0.1, 0.2, 0.3, 0.4, 0.5}, and dimension ∈ {32, 64, 128}. The optimal hyperparameters are: number of layers = 2, dropout = 0.2, and dimensions = [64, 32]. We use ReLU as the activation function for all layers.

A.3. Multi-Hop Link Prediction

We run a grid search over the following ranges: number of layers ∈ {0, 1, 2, 3}, dimension ∈ {10, 50, 100, 200}, and learning rate ∈ {0.03, 0.01, 0.003, 0.001}. The optimal hyperparameters are: number of layers = 1, dimension = 100, and learning rate = 0.01. We use Adam as the optimizer throughout the tuning process.
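The grid and randomized searches described in Appendix A can be sketched as follows. Here `evaluate` is a hypothetical stand-in for training a model with a given configuration and returning its validation score; the ranges mirror the MLP grid above:

```python
import itertools
import random

# Hypothetical evaluation stub; in practice this would train the model with
# the given configuration and return its validation macro F1 score.
def evaluate(config):
    return -abs(config["num_layers"] - 1) - abs(config["dropout"] - 0.2)

grid = {
    "num_layers": [0, 1, 2, 3],
    "dropout": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    "dimension": [32, 64, 128],
}

# Exhaustive grid search (Appendix A.2/A.3 style): try every combination.
keys = list(grid)
configs = [dict(zip(keys, vals)) for vals in itertools.product(*grid.values())]
best = max(configs, key=evaluate)
print(best)

# Randomized search (Appendix A.1 style): evaluate a random subset instead,
# trading exhaustiveness for far fewer training runs.
rng = random.Random(0)
best_random = max(rng.sample(configs, k=10), key=evaluate)
print(best_random)
```

The trade-off shown here is the usual one: grid search costs the product of all range sizes (72 runs for this grid), while randomized search caps the budget at a fixed number of samples.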