<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Citation Intent Classification Through Weakly Supervised Knowledge Graphs</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Xinwei</forename><surname>Du</surname></persName>
							<email>xinweidu@usc.edu</email>
							<affiliation key="aff0">
								<orgName type="department">Information Sciences Institute</orgName>
								<address>
									<addrLine>Marina del Rey</addrLine>
									<region>CA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">University of Southern California</orgName>
								<address>
									<settlement>Los Angeles</settlement>
									<region>CA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Kian</forename><surname>Ahrabian</surname></persName>
							<email>ahrabian@usc.edu</email>
							<affiliation key="aff0">
								<orgName type="department">Information Sciences Institute</orgName>
								<address>
									<addrLine>Marina del Rey</addrLine>
									<region>CA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">University of Southern California</orgName>
								<address>
									<settlement>Los Angeles</settlement>
									<region>CA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Arun</forename><surname>Baalaaji</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Sankar</forename><surname>Ananthan</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Information Sciences Institute</orgName>
								<address>
									<addrLine>Marina del Rey</addrLine>
									<region>CA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">University of Southern California</orgName>
								<address>
									<settlement>Los Angeles</settlement>
									<region>CA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Richard</forename><forename type="middle">Delwin</forename><surname>Myloth</surname></persName>
							<email>myloth@usc.edu</email>
							<affiliation key="aff0">
								<orgName type="department">Information Sciences Institute</orgName>
								<address>
									<addrLine>Marina del Rey</addrLine>
									<region>CA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">University of Southern California</orgName>
								<address>
									<settlement>Los Angeles</settlement>
									<region>CA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jay</forename><surname>Pujara</surname></persName>
							<email>jpujara@usc.edu</email>
							<affiliation key="aff0">
								<orgName type="department">Information Sciences Institute</orgName>
								<address>
									<addrLine>Marina del Rey</addrLine>
									<region>CA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">University of Southern California</orgName>
								<address>
									<settlement>Los Angeles</settlement>
									<region>CA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Citation Intent Classification Through Weakly Supervised Knowledge Graphs</title>
					</analytic>
					<monogr>
						<title level="m">The Third AAAI Workshop on Scientific Document Understanding 2023</title>
						<idno type="ISSN">1613-0073</idno>
						<meeting>The Third AAAI Workshop on Scientific Document Understanding<address><settlement>Washington</settlement><region>DC</region><country key="US">USA</country></address></meeting>
						<imprint>
							<date type="published" when="2023-02-14">February 14, 2023</date>
						</imprint>
					</monogr>
					<idno type="MD5">24F756E5A22E9A7D1B4D4382D5FB6C31</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:26+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Citation Intent Classification</term>
					<term>Knowledge Graphs</term>
					<term>Graph Neural Networks</term>
					<term>Large Language Models</term>
					<term>Weakly supervised learning</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Citations are scientists' tools for grounding their innovations and findings in the existing collective knowledge. They are used for semantically distinct purposes as scientists utilize them at different parts of their work to convey specific information. As a result, a crucial aspect of scientific document understanding is recognizing the authorial intent associated with citations. Current state-of-the-art methods rely on contextual sentences surrounding each citation to classify the intent. However, in the absence of textual content, these approaches become unusable. In this work, we propose a text-free citation intent classification method built on relational information among scholarly works in this work. To this end, we introduce a large-scale knowledge graph built from the publications in the SciCite dataset and their multi-hop neighborhood extracted from The Semantic Scholar Open Research Corpus (S2ORC). We also augment this knowledge graph by adding weakly-labeled links based on the intent information available in the S2ORC. Finally, we cast the intent classification task as a link prediction problem on the newly created knowledge graph. We study this problem in both transductive and inductive settings. Our experimental results show that we can achieve a comparable macro F1 score to word embedding content-based methods by only relying on features and relations derived from this knowledge graph. Specifically, we achieve macro F1 scores of 62.16 and 59.81 in the transductive and inductive settings, respectively, on the link-level SciCite dataset. Moreover, by combining our method with the state-of-the-art NLP-based model, we achieve improvements across all metrics.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Citations are the primary way of identifying past contributions and connecting progress in new publications to existing literature. Nevertheless, not all citations indicate the same meaning. Authors use citations sparingly with specific intent behind them. For example, some papers are cited for providing background information in a domain, while others are cited when adopting or adapting a previously-used methodology. There are also scenarios where the same paper is used as background information and methodology use-case in different contexts simultaneously. Understanding citation intent is crucial to studying scholarly works, given the universality of using citations. Current state-of-the-art citation intent classification models <ref type="bibr" target="#b17">[17,</ref><ref type="bibr" target="#b0">1,</ref><ref type="bibr" target="#b4">4]</ref> rely heavily on textual information, e.g., the sentences surrounding the citation. However, such information is expensive to obtain and in some scenarios inaccessible altogether. Consequently, we need models that could operate without having access to textual information. Previous works <ref type="bibr" target="#b3">[3,</ref><ref type="bibr" target="#b26">26,</ref><ref type="bibr" target="#b6">6]</ref> have shown the importance of relational and structural information available in links among publications for various tasks. In this work, we propose a general citation intent classification method that relies purely on structural information.</p><p>Besides helping researchers better understand the relationship among publications, citation intent analysis has been used for studying various other aspects of scientific works such as research domain evolution <ref type="bibr" target="#b10">[10]</ref>, scientific impact analysis <ref type="bibr" target="#b19">[19]</ref>, scientific document summarization <ref type="bibr" target="#b5">[5]</ref>, and retrieving related scientific works <ref type="bibr" target="#b16">[16]</ref>. The main three categories of citations are "Result, " "Method, " and "Background" <ref type="bibr" target="#b4">[4]</ref>. These categories describe the reasons behind making a scientific connection, referencing a publication in another publication. Classifying citations into these groups has traditionally required a high level of expertise in the respective scientific domains. This constraint, combined with the high cost of expert human labor, has resulted in highly scarce datasets, which makes the task even more difficult.</p><p>Previous works have proposed classifying citation intent through feature engineering-based <ref type="bibr" target="#b10">[10]</ref> and representation learning-based <ref type="bibr" target="#b0">[1]</ref> methods. However, most of these methods depend on textual information. As a result, they require a complex multi-stage pipeline of parsing documents, identifying citation contexts, and predicting citation intent <ref type="bibr" target="#b13">[13]</ref>. Besides being prone to error propagation from various pipeline stages, the use of these models is limited to situations where the full text is available in a proper format. This work introduces a pure graph-based approach to classifying citation intent. We extend the existing SciCite dataset with 2-hop neighborhoods extracted from The Semantic Scholar Open Research Corpus (S2ORC). 
To further enrich the graph, we utilize the intent information provided in the S2ORC to create a weakly supervised knowledge graph (KG) consisting of the publications and the relations that match the provided intents. Our main idea is to use contextualized relational patterns to make predictions, obviating the need for textual context. Given the newly built KG, we cast the intent classification problem as a link prediction problem on KGs. Specifically, we train a model to learn representations for entities and relations. Using these representations, we run the following query on the KG: (𝑠, ?, 𝑜), where 𝑠 cites 𝑜.</p><p>Converting this problem into a link prediction task allows us to adapt and extend widely used KG embedding models. We study the link prediction problem in both transductive and inductive settings. Our experimental results show that although our KG-based method underperforms compared to the large language model-based approaches, it is comparable or even superior to the word embedding-based methods. Moreover, our experiments with combining the NLP-based and graph-based methods show slight improvements over the current state-of-the-art model. These findings further signify the importance of relational patterns for citation intent classification.</p><p>The contributions of this work are as follows:</p><p>1. Extending the SciCite dataset using the S2ORC dataset to generate a large-scale weakly supervised KG. 2. Introducing a novel graph-based approach for citation intent classification built on top of the newly built KG. 3. Presenting benchmarks for both transductive and inductive settings. 4. Presenting analyses on the effect of different parts of the methodology, such as weak supervision and feature engineering.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Citation Function/Intent Schemes</head><p>Many prior works have studied the problem of creating categorical schemes for citation intent which in some works is referred to as citation function <ref type="bibr" target="#b9">[9]</ref>. Earlier works were focused on creating more fine-grained categories, going as far as defining 35 <ref type="bibr" target="#b7">[7]</ref> and 12 <ref type="bibr" target="#b21">[21]</ref> fine-grained schemes for scientific arguments. The more recent works however have focused on creating more concise categories. For example, ACL-ARC <ref type="bibr" target="#b10">[10]</ref> proposes a 6-class intent categorization scheme: Background, Motivation, Uses, Extension, Comparison or Contrast, and Future. SciCite <ref type="bibr" target="#b4">[4]</ref> is even more restrictive and drops or combines small fine-grained classes to provide a more concise 3-class annotation scheme: Background, Method, and Result.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Citation Intent Classification Methods</head><p>Before the explosion of deep learning approaches, most methods relied on a combination of hand-crafted features and classic machine learning models. For example, in one instance <ref type="bibr" target="#b23">[23]</ref>, authors propose 12 different features, including citation count, PageRank value, and author overlap, and use classic machine learning models such as SVM and Random Forest for classification. In another instance <ref type="bibr" target="#b10">[10]</ref>, authors define pattern-based, topic-based, and prototypical argument features and use SVM to make predictions.</p><p>With the advent of deep learning models and the emergence of large language models in recent years, representation learning-based methods have outperformed the hand-crafted methods achieving a higher accuracy by considering the textual information. Recent works have proposed the use of structural scaffolds <ref type="bibr" target="#b4">[4]</ref>, BERT-based models trained on the scientific corpus (SciBERT) <ref type="bibr" target="#b0">[1]</ref>, word embedding-based approaches <ref type="bibr" target="#b17">[17]</ref>, and creating a heterogeneous context graph based on an academic </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Knowledge Graph Embedding Models</head><p>KGs are structured information repositories consisting of a set of nodes representing entities and a set of typed edges representing relations. Since, in most cases, the KG nodes and edges are not attributed, KG embedding (KGE) models aim to learn low-dimensional representations for all entities and relations. The most common traditional shallow KGE methods are TransE <ref type="bibr" target="#b1">[2]</ref>, Com-plEx <ref type="bibr" target="#b22">[22]</ref>, and RotatE <ref type="bibr" target="#b20">[20]</ref>. More recent GNN-based KGE methods leverage the message-passing scheme of GNNs, enabling more complex multi-hop reasoning. Examples of these methods are GCN <ref type="bibr" target="#b11">[11]</ref>, which leverages the spectral information for information propagation but is limited to mono-relational KGs, R-GCN <ref type="bibr" target="#b18">[18]</ref>, which extends GCN to support multi-relational KGs, and Graph-SAGE <ref type="bibr" target="#b8">[8]</ref> which introduces an inductive framework to handle unseen nodes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Dataset</head><p>The SciCite dataset focuses on individual citation links and ignores the significance of broader relational connections and features. To overcome this issue, we construct a knowledge graph by mapping each entity in the SciCite dataset to the S2ORC and adding their 2-hop citation neighborhoods. The S2ROC contains more than 206 million publications and 2.49 billion citation links. Apart from the regular citation links, this corpus provides partial intent labels for citations using a 3-class scheme as follows:</p><p>1. Background: Describe a problem, topic, or concept 2. Method: Provide a method, tool, or dataset 3. Result: To make a comparison Moreover, the SciCite dataset is tailored for sentence classification methods, where input features are textual excerpts and the output labels are citation intents. We reformulate this task as link prediction on KGs, where the input features are a representation of the source (citing) paper and the target (cited) paper, and the output is the label of a citation link between the source and target. We release all our datasets under a CC-BY-SA license at TBD</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Entity Mapping</head><p>We first map each paper in the SciCite dataset to the S2ORC by matching SciCite's IDs to Semantic Scholar's SHA IDs. Since a publication could have many SHA IDs and only one Corpus ID, we then map each SHA ID to the unique Corpus ID to extract unique entities. From the 13,080 papers with unique IDs in SciCite, we successfully map 13,019 of them to valid SHA IDs in semantic scholar, while the remaining 61 papers do not have any corresponding records. We believe this is due to publication removals, as the SciCite dataset was created from the S2ORC in 2019. After converting SHA IDs to Corpus IDs, we end up with 13,011 unique entities and 8 duplicate entities.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Dataset Splitting</head><p>The original SciCite dataset contains 11,020 humanlabeled samples. Hence, to adapt it to our link prediction setting, we reconstruct two datasets: SciCiteorigin and SciCite resplit . SciCiteorigin adheres to the same benchmarks reported in prior works but is modified to remove overlapping citation links in the training and test sets. To maximize usage of the training data while removing artifacts, we create SciCite resplit that performs additional cleaning, provides a stronger separation of training and test sets, and avoids multi-intent citations. Table <ref type="table" target="#tab_0">1</ref> showcases the statistic of these datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>SciCiteorigin:</head><p>To make methods comparable, we use the same validation and test sets as SciCite for this dataset and try to keep the training set as close as possible. We convert each publication in the SciCite dataset to a Semantic Scholar entity using the mapped Corpus IDs and drop the contextual sentence-level information. We assign a random unique ID to publications without a Corpus ID. After this procedure, we end up with a set of links for our link prediction task.</p><p>Due to the removal of the contextual information, some of the training links appear exactly the same in the test set. Hence, we remove 641 training set samples that also appear in the test set to prevent data leakage. Moreover, since only one link in the test set has multiple intents, we treat the link prediction problem as a multi-class task rather than a multi-label task. In this scenario, the multi-intent links are represented as separate samples with the same inputs and different outputs. Multi-label methods may be a promising future extension of our work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>SciCite resplit :</head><p>Even though we convert the SciCite dataset to the SciCiteorigin, problems, such as duplicate citations and multi-label links, still exist. Therefore, we further tailor the SciCite dataset to create a better link prediction dataset for graph-based models. First, we remove all the entities, and their related samples, that do not have a mapped Corpus ID. Then, similar to SciCiteorigin, we convert the remaining samples to a set of links. Following this, we drop all duplicate samples. Among the remaining 6,458 unique links, 5,886 only have one intent, 489 have two intents, and 83 have all three intents. We remove all the multi-intent links and resplit the dataset with ratios of 70%/15%/15% for training, validation, and test sets, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Method</head><p>Throughout the rest of this work, for simplicity, we use the term publication to denote all types of academic publications, e.g., books and papers. Moreover, we use the terms citation and reference to denote incoming and outgoing links, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Weak Supervision</head><p>In order to enrich our data and provide more information to the models, we extract the set of intents provided in the S2ORC dataset for each citation link. The intent labels in S2ORC are extracted using the structural scaffolds model <ref type="bibr" target="#b4">[4]</ref> at a sentence level. In this scenario, we implicitly use the existing data derived from the content for bootstrapping our approach. We refer to these links as weakly labeled due to being labeled by a noisy model rather than a human expert. Since the intent labels are partial at a sentence level, citation links could have zero intent in the absence of text or several intents in an abundance of use cases.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Knowledge Graph Construction</head><p>Given the S2ORC dataset, we expand the SciCite dataset using the mapped entities to construct a KG containing 2-hop neighborhoods of the publications. Figure <ref type="figure" target="#fig_0">1</ref> illustrates an overview of the expanded KG. This work uses the 2022-09-13 version of the corpus downloaded from the bulk API. Formally, given the set of mapped entities 𝒱0, the set of 𝑘-hop nodes 𝒱 𝑘 is defined as</p><formula xml:id="formula_0">𝒱 𝑘 = 𝒱 𝑘−1 ∪ {𝑦 | ∃𝑥 ∈ 𝒱 𝑘−1 : 𝑦 ∈ 𝒩𝑥}<label>(1)</label></formula><p>where for a given entity 𝑥, 𝒩𝑥 denotes all the entities that cite or are cited by 𝑥, i.e., the set of neighboring entities. Given the sets of unlabeled links 𝒰 and weakly labeled links ℒ, the set of 𝑘-hop edges ℰ 𝑘 is defined as</p><formula xml:id="formula_1">ℰ 𝒰 𝑘 = {(𝑥, 𝑦, UNK) | 𝑥, 𝑦 ∈ 𝒱 𝑘 , (𝑥, 𝑦) ∈ 𝒰} (2) ℰ ℒ 𝑘 = ∪𝑟{(𝑥, 𝑦, 𝑟) | 𝑥, 𝑦 ∈ 𝒱 𝑘 , (𝑥, 𝑦) ∈ ℒ𝑟} (3) ℰ 𝑘 = ℰ 𝒰 𝑘 ∪ ℰ ℒ 𝑘<label>(4)</label></formula><p>where 𝑟 ∈ {Background, Method, Result} and ℒ𝑟 denotes the set of all weakly labeled links with label 𝑟. Consequently, given the sets of 𝑘-hop nodes 𝒱 𝑘 and edges ℰ 𝑘 , the extracted 𝑘-hop KG, 𝒢 𝑘 , is defined as</p><formula xml:id="formula_2">𝒢 𝑘 = (𝒱 𝑘 , ℰ 𝑘 )<label>(5)</label></formula><p>The specific statistics of the extracted KG and the original semantic scholar corpus are reported in Table <ref type="table" target="#tab_1">2</ref>. Since not every link has weakly labeled intent, this table also provides the percentage of weakly labeled links for each corresponding graph. Although we extract 𝒢2, given its scale, we opt to run our current experiment only on 𝒢1 and leave the larger-scale experiments for future works.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Feature Engineering</head><p>Since none of the publications in our KGs have any features or pre-defined representation, we propose to represent them through their references, citations, and graphbased features. More specifically, from S2ROC we extract the in-degrees and out-degrees of citations (or references), background links, method links, and result links. As a result, each paper is represented with an 8-dimensional feature vector, 4 for each in-degree and out-degree feature. For the publications where the content is unavailable, the out-degree intent-based features will be zero since those features are based on the noisy sentence-level model that the Semantic Scholar uses. However, the in-degree features may not be zero as long as the citing paper's content is available. For the new publications, i.e., unseen nodes in the inductive setting, the only known non-zero feature is the reference count. We normalize the reference and citation features by a biased log factor defined as</p><formula xml:id="formula_3">ℎ ¯𝑥 = log 10 (ℎ𝑥 + 1 + 𝛼)<label>(6)</label></formula><p>where 𝛼 is a bias hyperparameter. We specifically set 𝛼 = −0.9 to get a normalized value of −1 for zeroreference and zero-citation situations. Moreover, we normalize the non-zero in-degree intentbased features into a [0, 1] probability distribution as follows:</p><formula xml:id="formula_4">ℎ ¯𝑥 = ℎ𝑥 ℎ Background + ℎ Method + ℎ Result (7)</formula><p>The same normalization step is used for out-degree features separately.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Baselines Knowledge Graph Embedding Models:</head><p>Traditional KGE models consist of two shallow embeddings as entity and relation encoders and a score function as a decoder to predict the likelihood of a link. These models are trained in a contrastive way by masking either one of the entities in a given triplet (head, relation, tail) and sampling a set of negative entities, contrasting the positive entity.</p><p>Since the traditional KGE methods rely on shallow embeddings for encoding entities and relations, they can only be used in the transductive setting and cannot operate on unseen nodes. For our experiments, we use the available implementations of TransE, ComplEx, and Ro-tatE in the DGL-KE toolkit <ref type="bibr" target="#b27">[27]</ref>. In the evaluation phase, we calculate the likelihood of all different relation types for each link and consider the highest likelihood as the model's intent prediction.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Hybrid Models:</head><p>To increase the reasoning power of the traditional KGE models, we devise a two-stage approach based on multilayer perceptron (MLP). We first use the traditional KGE models to learn embeddings for entities and relations. Then, instead of relying on the produced likelihood scores, we concatenate the vectors of two entities and pass that through an MLP to get logit values. Formally, given a link (𝑢, 𝑣) and their respective learned representation (𝑧𝑢, 𝑧𝑣), we calculate the logit values as</p><formula xml:id="formula_5">𝑝 = MLP([𝑧𝑢‖𝑧𝑣])<label>(8)</label></formula><p>where 𝑝 ∈ R 𝒞 contains the unnormalized logits for each class. The predicted class 𝑐 is then calculated as argmax 𝑐 sigmoid(𝑝). ( <ref type="formula">9</ref>)</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Natural Language Processing Models:</head><p>We include the reported results of several state-of-the-art Natural Language Processing (NLP) methods. Specifically, we include results from the word embedding-based methods such as Infersent-KMeans, Infersent-HDBSCAN, Glove-KMeans, and Glove-HDBSCAN <ref type="bibr" target="#b17">[17]</ref>, BiLSTMbased method Structural Scaffolds <ref type="bibr" target="#b4">[4]</ref>, and large language model-based method SciBERT <ref type="bibr" target="#b0">[1]</ref>. Moreover, we report the results of fine-tuning a pre-trained SciBERT model on both datasets. All these methods use textural information and are evaluated on the SciCite dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5.">Multi-Hop Link Prediction (MHLP)</head><p>Transductive and inductive settings are the most common link prediction evaluating schemes for KGs. The main difference between these two settings is having a fixed set of nodes in both the training and evaluation phases (transductive) versus allowing the addition of unseen nodes in the evaluation phase (inductive). This work refers to citation intent prediction on unseen publications as the inductive setting, whereas the transductive setting refers to citation intent prediction on already seen publications.</p><p>We propose an adaptable graph-based model for citation intent prediction in both the transductive and inductive settings. The primary basis of this approach is that a node, i.e., publication, could be represented as a combination of the neighboring nodes' representations. Let ℎ (0) 𝑥 be the extracted feature vector for any arbitrary node 𝑥. We calculate the representation of an arbitrary node 𝑣 at layer 𝑙 + 1 of a multilayer model as</p><formula xml:id="formula_6">ℎ (𝑙+1) 𝒩𝑣 = 1 |𝒩𝑣| ∑︁ 𝑢∈𝒩𝑣 ℎ (𝑙) 𝑢 (10) ℎ (𝑙+1) 𝑣 = 𝜎(𝑊 (𝑙+1) [ℎ (𝑙) 𝑣 ‖ℎ (𝑙+1) 𝒩𝑣 ])<label>(11)</label></formula><p>where 𝜎 is a non-linear function. Throughout our experiments, we specifically use ReLU to introduce nonlinearity. Given the node representation from a 𝐿-layer model and a link (𝑢, 𝑣), we calculate the logit values as</p><formula xml:id="formula_7">𝑝 = MLP([ℎ (𝐿) 𝑢 ‖ℎ (𝐿) 𝑣 ])<label>(12)</label></formula><p>where 𝑝 ∈ R 𝒞 contains the unnormalized logits for each class and 𝒞 is the set of all classes. The predicted class 𝑐 is then calculated as</p><formula xml:id="formula_8">argmax 𝑐 sigmoid(𝑝). (<label>13</label></formula><formula xml:id="formula_9">)</formula><p>The main disadvantage of the inductive settings is that the unseen nodes only have one available feature, i.e., reference count. This absence of information makes the task extremely difficult, as the feature vectors are highly sparse. However, our model tries to diminish this effect by using the message-passing scheme, as defined in Equation 11, to aggregate information through connected entities, i.e., cited papers, creating a denser representation for the unseen nodes.</p><p>All our models are trained using the cross-entropy loss defined as</p><formula xml:id="formula_10">𝑙𝑛 = − log exp(𝑝𝑦 𝑛 ) ∑︀ |𝒞| 𝑖=1 exp(𝑝𝑖)<label>(14)</label></formula><p>where and 𝑝𝑥 is the logit value for class 𝑥 given the prediction vector 𝑝.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Composite Model:</head><p>To further test the capabilities of our proposed model and use both structural and textual information, we devise a multi-modal model comprising encoders for both the graph structure and the citation context. Specifically, we use a pre-trained SciBERT model for encoding the citation phrase text and our MHLP model for encoding the citation graph around the citation link. Figure <ref type="figure" target="#fig_1">2</ref> illustrates an overview of the composite model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experiments</head><p>In this section, we report our experimental results on both of the SciCiteorigin and SciCite resplit datasets. All the graphbased experiments are carried out on the 𝒢1 KG. For the traditional KGE methods, we tune their hyperparameters as described in Appendix A.1 and train them using the hyperparameters showcased in Table <ref type="table">4</ref>. For the hybrid methods, the KGE component is first trained to generate node features using the hyperparameters described in Table <ref type="table">4</ref>. Then, the MLP component is trained using the procedure described in A.2 to predict the citation intent. For the MHLP-based methods, in both transductive and inductive settings, we use a 1-layer variation on top of the normalized features extracted as described in Section 4.3. Moreover, we tune their hyperparameters and train them as described in Appendix A.3. For the SciBERT method, we freeze the pre-trained model and add an MLP module on top of the 768-dimensional [CLS] token output. Similar to the other models, the MLP module is tuned using the parameters described in A.2. For the composite model, during the training phase, we freeze the SciBERT model in the first two epochs as a warm-up step for the graph encoder; then, we jointly train both encoders along with the final prediction module.</p><p>To control for the effect of the pre-training using traditional KGE models, we also run a variation with randomly initialized node features and designate it as "Random + MLP." For the NLP models, we use the previously reported results <ref type="bibr" target="#b17">[17]</ref> to compare our models on the test set-aligned SciCiteorigin dataset. Finally, we also include the results from random and most common class predictions as sanity checks. All the models are implemented using PyTorch <ref type="bibr" target="#b14">[14]</ref> and trained on a machine with a single Quadro RTX 8000 GPU, 72 CPU cores, and 768GB of RAM. Implementations are available under a CC-BY-SA license at TBD.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Results</head><p>Table <ref type="table" target="#tab_2">3</ref> illustrates our experimental results on both datasets. As evident from Table <ref type="table" target="#tab_2">3</ref>, traditional KGE methods perform poorly on both datasets, only slightly beating the random baseline on the macro F1 metric. Interestingly, both ComplEx and RotatE perform worse than TransE on both datasets. This finding is surprising as both ComplEx and RotatE are more expressive than TransE <ref type="bibr" target="#b20">[20]</ref>. However, when combined with MLP models, all exhibit significant performance boost, up to more than 100% in the case of RotatE. After this addition, we can see the same expressivity trend in the model results, i.e., the more powerful the model, the better the result. Moreover, the control "Random + MLP" experiment showcases very similar results to the random baseline, indicating the importance of both components for the hybrid model to perform well. Altogether, it is evident that the reasoning power of shallow traditional KGE models is not enough to capture the complexity of this task, and we require models with more reasoning power.</p><p>As for the MHLP method, in the transductive setting, it achieves 57.88 and 62.16 macro F1 scores on SciCiteorigin and SciCite resplit datasets, respectively. Moreover, its inductive results showcase the robustness of our approach in an extreme out-of-distribution setting, achieving 56.13 and 59.81 macro F1 scores. Compared to previously re-ported results <ref type="bibr" target="#b17">[17]</ref>, our model achieves superior performance to Glove-based models while slightly lagging behind Infersent-based models. Looking into the precision and recall comparison, our method has better precision scores on both transductive and inductive settings compared to all word embedding-based models; however, for recall, it performs better than Glove-based models and worse than the Infersent-based models which might stem from the imbalance in the links as illustrated by Figure <ref type="figure" target="#fig_3">3a</ref>. Further experimentation to address the class imbalance problem in future works might help improve the overall performance of MHLP. The significance of these results is that we show structural and relational information could be used to achieve relatively high performance without using textual information. Moreover, although our models underperform compared to language model-based approaches such as Structural Scaffolds and SciBERT, we showcase interesting future directions for combining graph-based and NLP-based methods.</p><p>Finally, the composite model denoted as SciBERT + MHLP in Table <ref type="table" target="#tab_2">3</ref>, achieves the best performance among all models, even beating the fine-tuned SciBERT. When considering MHLP's standalone performance, these results showcase the potential improvements that could be achieved through the use of structural information that is not available in citation phrases. The presented experiments are a stepping stone for better understanding and using the structural information at scale for citation intent classification.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Analysis</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">Temporal Analysis</head><p>This analysis studies the relationship between the time that has passed since publication and citation intent. We hypothesize that a paper is more likely to be cited as "Result" or "Method" right after its publication, and as time passes, it will be more likely to be cited as "Background." If this is proven accurate, we could get a relatively strong signal from the temporal information for each citation. We plotted the years after publication against intent counts and ratios for all papers in the semantic scholar corpus to test our hypothesis. Figure <ref type="figure" target="#fig_3">3a</ref> and 3b illustrate the results of our analysis. As evident from these figures and contrary to our original hypothesis, we find out that the ratio of intent classes almost stays the same as time passes with insignificant fluctuations. As a result, using temporal information in our models is unlikely to provide any significant improvement. Note that these results are based on the weakly labeled links that we obtained from S2ORC. Consequently, these links are generated by another noisy model that could potentially be biased. Hence, it should not discourage further analysis or studies of temporal information for citation intent classification.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">Mutual Information Analysis</head><p>In this analysis, we study the quality of the engineered features as described in Section 4.3 concerning the weakly labeled intent classes. To this end, we use the well-known mutual information (MI) <ref type="bibr" target="#b12">[12]</ref> measurement to quantify the importance of each feature. Formally, the MI between    </p><p>where 𝒴 is the value space for 𝑌 , 𝒳 is the value space for 𝑋, 𝑃𝑋,𝑌 is the joint probability distribution, and 𝑃𝑋 and 𝑃𝑌 are the marginal probability distributions. Note that MI is a non-negative value, and higher values indicate more correlation between the two random variables. For our analysis, we calculate MI for both sides of the 5,886 unique citation links in the SciCite resplit dataset. Moreover, to study these features in the graph context, we also calculate MI for the average of these features over the neighborhood of each publication, i.e., all citing and cited publications, from both sides of the citation links. Figures <ref type="figure" target="#fig_4">4a and 4b</ref> present the results of our experiments. As evident from these results, while the publication-averaged features generally show stronger connections to the target variable, the neighborhood-averaged features seem to show complementary connections, further emphasizing the importance of using both sets of features.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3.">Feature Quality Analysis</head><p>In this analysis, we study the effect of normalization as described in Equations 6 and 7. To this end, we project the extracted features of the 5,886 unique citation links in the SciCite resplit dataset to a 2-dimensional space using t-SNE <ref type="bibr" target="#b24">[24]</ref>. Figure <ref type="figure" target="#fig_6">5a</ref> and 5b illustrate the projected space for the unnormalized and normalized features, respectively. As evident from Figure <ref type="figure" target="#fig_6">5a</ref>, it is challenging to distinguish different intent types in the unnormalized space. However, after normalization, as evident from Figure <ref type="figure" target="#fig_6">5b</ref>, we can see that the "Method" intention more or less creates a distinguishable cluster. This result shows that the use of normalization is potentially helpful for the model. Further studies on different types of normalization and their effects are left for future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4.">Robustness Analysis</head><p>In this analysis, we focus on studying the robustness of our proposed graph-based method. To this end, we devise two ablation studies. In the first study, we randomly corrupt a percentage of the weak labels by replacing the correct label with a random label. This study aims to understand the model's resilience to noise better.</p><p>In the second study, we randomly remove a percentage of the weak labels. This study's idea is to understand better the effect of weak supervision on the model's performance. These studies are carried out by running the MHLP method in the transductive setting on both SciCiteorigin and SciCite resplit datasets. The feature vectors for the publications are calculated by counting the number of citations and intents. These vectors are normalized then using Equation 6 and 7. To analyze the relationship between the model's performance and the amount of available data, we create ten variations of the dataset by only using a portion of the available weak labels, varying from using all the available weak labels to only using 10% of them. Figure <ref type="figure" target="#fig_8">6a</ref> presents the result of this study.</p><p>As evident from Figure <ref type="figure" target="#fig_8">6a</ref>, the more weakly labeled links are available, the better our method performs. The other significant observation is the robustness of the model, even in the extreme scenario of having access to only 10% of the labels. Note that only 31.90% of links in the S2ROC have at least one weakly labeled intent, which means, even if the utilization percentage is 100%, only 31.90% citation links are weakly labeled.</p><p>Figure <ref type="figure" target="#fig_8">6b</ref> showcases the relationship between the model performance and the percentage of corrupted data. Following our intuition, the model's performance monotonically decreases as we add more noisy labels to the data. However, two interesting observations could be made from this figure. First, the performance of our method only drops less than five macro F1 scores when half (50%) of the weak labels are replaced with randomly assigned noisy labels. This observation shows that the proposed method is exceptionally resilient when faced with mistakes. Second, even when all the labels are replaced with random ones (100%), the model performs better than the random baselines. This observation indicates that the model is learning to make inferences based on purely structural information, which further solidifies our hypothesis regarding the importance of structural information.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusions and Future Work</head><p>In this work, we first introduced an expansion to the Sci-Cite dataset by extracting scholarly information from the S2ORC dataset and creating an extended citation graph. Then, we gathered a large-scale weakly labeled dataset to augment the extracted citation graph with citation intents and create a multi-relational knowledge graph. Following this, we adapted the sentence-based intent classification into a citation-based link prediction task on graphs. We then introduced a set of engineered graph-based and citation-based features. Built on top of these features, we introduced a graph-based multi-hop reasoning approach for the newly introduced task. Our approach achieves 62.16 and 59.81 macro F1 scores in the transductive and inductive settings, respectively. The experimental results in the inductive setting further showcase the robustness of the proposed approach in the information-deprived outof-distribution environment. Compared to NLP-based models, we reached a comparable performance to, and in some cases outperform, the word embedding-based methods that rely on contextual sentences to make predictions. Moreover, with a composite model comprising our method as the graph encoder and the state-of-the-art NLP-based model as the text encoder, we outperformed all the other models we experimented with. These results further signify the strong signal in relational information and highlight the importance of future analysis and studies in this domain. Finally, our presented analyses further support our methodological choices.</p><p>For future works, one straightforward idea is to extend the knowledge graph with more scholarly information, such as authors, venues, and fields of study. There already exist some open repositories such as OpenAlex <ref type="bibr" target="#b15">[15]</ref> and Microsoft Academic Graph (MAG) <ref type="bibr" target="#b25">[25]</ref> that contain this information. Another direction is further investigation into the temporal signals. Last but not least, although we achieved an improved performance through a fusion of textual and structural information, more investigation and analysis could be done in this setting in future works.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Overview of the extracted multi-hop KG. The set of 0-hop nodes 𝒱 0 includes all the orange nodes. The set of 1-hop nodes 𝒱 1 includes all the orange and blue nodes. Similarly, the graph could be expanded to include 𝑘-hop nodes 𝒱 𝑘 . The annotated set on each edge represents that specific link's intent. Specifically, the empty set denotes that the citation link has no intent label.</figDesc><graphic coords="2,312.79,84.19,183.02,133.53" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Overview of the composite model. The model consists of two encoders for the citation phrase and the citation graph around the citation link. During the training phase, we freeze the SciBERT model in the first two epochs as a warm-up step for the graph encoder; then, we jointly train both encoders along with the final prediction module.</figDesc><graphic coords="6,89.29,84.19,416.68,128.79" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head></head><label></label><figDesc>(a) The number of different citation intents. (b) The percentage of different citation intents.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: The statistic of citation intent for all publications in the Semantic Scholar corpus. The temporal trends stay steady over time, suggesting a lack of information in the elapsed time from the time of publication to the time of citing.</figDesc><graphic coords="7,89.29,234.23,203.37,125.49" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: The calculated MI values for publication features and averaged neighborhood features. On average, the publication features show stronger connections to the target variable.</figDesc><graphic coords="8,302.62,277.88,203.36,176.90" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head></head><label></label><figDesc>(a) Features before normalization (b) Features after normalization</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: The t-SNE visualizations for the unnormalized and normalized features.</figDesc><graphic coords="9,89.29,276.04,203.36,124.33" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head></head><label></label><figDesc>(a) The percentage of utilized weak labels. (b) The percentage of corrupted data.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: The macro F1 score of MHLP (Transductive) on SciCite origin and SciCite resplit dataset</figDesc><graphic coords="9,89.29,417.17,203.37,127.35" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>The statistic of the SciCite dataset and reconstructed datasets.</figDesc><table><row><cell>Dataset</cell><cell>SciCite</cell><cell>SciCite origin</cell><cell>SciCite resplit</cell></row><row><cell>Level</cell><cell>Sentence</cell><cell>Link</cell><cell>Link</cell></row><row><cell># Samples</cell><cell>11,020</cell><cell>10,379</cell><cell>5,766</cell></row><row><cell># Train</cell><cell>8,243</cell><cell>7,602</cell><cell>4,122</cell></row><row><cell># Validation</cell><cell>916</cell><cell>916</cell><cell>822</cell></row><row><cell># Test</cell><cell>1,861</cell><cell>1,861</cell><cell>822</cell></row><row><cell>network [26]</cell><cell></cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Statistics of the extracted KGs along with the original S2ORC dataset.</figDesc><table><row><cell>Dataset</cell><cell cols="3"># Nodes # Citation Links # Background</cell><cell># Method</cell><cell cols="2"># Result Weak Labels</cell></row><row><cell>Zero-Hop (𝒢 0 )</cell><cell>13,011</cell><cell>10,733</cell><cell>5,479</cell><cell>4,403</cell><cell>1,335</cell><cell>79.04%</cell></row><row><cell>One-Hop (𝒢 1 )</cell><cell>5,862,261</cell><cell>119,776,090</cell><cell>39,202,086</cell><cell cols="2">16,830,665 16,830,665</cell><cell>43.18%</cell></row><row><cell>Two-Hop (𝒢 2 )</cell><cell>57,535,880</cell><cell>1,621,293,902</cell><cell cols="3">467,860,523 121,877,053 35,283,718</cell><cell>34.41%</cell></row><row><cell>S2ORC</cell><cell>206,159,629</cell><cell>2,495,513,737</cell><cell cols="3">643,955,457 169,472,164 45,779,793</cell><cell>31.90%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Intent classification results on SciCite origin and SciCite resplit datasets. All the metrics are macro averaged. Bold values represent the highest performance within the metric and dataset scope.</figDesc><table><row><cell>SciCite origin</cell><cell>SciCite resplit</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work was funded by the Defense Advanced Research Projects Agency with award W911NF-19-20271 and with support from a Keston Exploratory Research Award.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Hyperparameters</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.1. Knowledge Graph Embedding</head><p>We use a randomized search to tune our models and find near-optimal hyperparameters using the following ranges: embedding dimensions ∈ {50, 100, 200}, learning rate ∈ {0.03, 0.1, 0.3}, regularization coefficient ∈ {0.0, 1e-9, 1e-8, 1e- </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2. Multilayer Perceptron</head><p>To simplify the model tuning process, we find the optimal hyperparameters of "ComplEx + MLP" on SciCiteorigin using grid search and reuse them for the rest of our experiments. Specifically, we run a grid search over the following ranges: number of layers ∈ {0, 1, 2, 3}, dropout ∈ {0, 0. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.3. Multi-Hop Link Prediction</head><p>We run a grid search over the following ranges: number of layers ∈ {0, 1, 2, 3}, dimension ∈ {10, 50, 100, 200}, learning rate ∈ {0.03, 0.01, 0.003, 0.001}, The optimal hyperparameters are as follows: number of layers = 1, dimension = 100, learning rate = 0.01. We use Adam as the optimizer through the tuning process.</p><p>We use a randomized search to tune our models and find near-optimal hyperparameters using the following ranges: embedding dimensions ∈ {50, 100, 200}, learning rate ∈ {0.03, 0.1, 0.3}, regularization coefficient ∈ {0.0, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5}, number of negative samples ∈ {64, 128, 256, 512, 1024}, 𝛼 ∈ {0.25, 0.5, 1}, 𝛾 ∈ {6, 12, 24}. Note that 𝛼 and 𝛾 are the adversarial temperature and the margin value (RotatEonly), respectively.</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">SciB-ERT: A Pretrained Language Model for Scientific Text</title>
		<author>
			<persName><forename type="first">Iz</forename><surname>Beltagy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kyle</forename><surname>Lo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Arman</forename><surname>Cohan</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/D19-1371</idno>
		<ptr target="https://doi.org/10.18653/v1/D19-1371" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics</title>
				<meeting>the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics<address><addrLine>Hong Kong, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="3615" to="3620" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title/>
		<author>
			<persName><forename type="first">Antoine</forename><surname>Bordes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nicolas</forename><surname>Usunier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alberto</forename><surname>Garcia-Durán</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jason</forename><surname>Weston</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Oksana</forename><surname>Yakhnenko</surname></persName>
		</author>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Translating Embeddings for Modeling Multi-Relational Data</title>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 26th International Conference on Neural Information Processing Systems -Volume 2</title>
				<meeting>the 26th International Conference on Neural Information Processing Systems -Volume 2<address><addrLine>Lake Tahoe, Nevada; Red Hook, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Curran Associates Inc</publisher>
			<biblScope unit="page" from="2787" to="2795" />
		</imprint>
	</monogr>
	<note>NIPS&apos;13</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">What do citation counts measure? A review of studies on citing behavior</title>
		<author>
			<persName><forename type="first">Lutz</forename><surname>Bornmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hans-Dieter</forename><surname>Daniel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Documentation</title>
		<imprint>
			<biblScope unit="volume">64</biblScope>
			<biblScope unit="page" from="45" to="80" />
			<date type="published" when="2008">2008. 2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Structural Scaffolds for Citation Intent Classification in Scientific Publications</title>
		<author>
			<persName><forename type="first">Arman</forename><surname>Cohan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Waleed</forename><surname>Ammar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Madeleine</forename><surname>Van Zuylen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Field</forename><surname>Cady</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/N19-1361</idno>
		<ptr target="https://doi.org/10.18653/v1/N19-1361" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long and Short Papers</title>
		<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>Minneapolis, Minnesota</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="3586" to="3596" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Scientific Article Summarization Using Citation-Context and Article&apos;s Discourse Structure</title>
		<author>
			<persName><forename type="first">Arman</forename><surname>Cohan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nazli</forename><surname>Goharian</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/D15-1045</idno>
		<ptr target="https://doi.org/10.18653/v1/D15-1045" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2015 Conference on Empirical Methods in Natural Language Processing<address><addrLine>Lisbon, Portugal</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="390" to="400" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Structured Citation Trend Prediction Using Graph Neural Networks</title>
		<author>
			<persName><forename type="first">Daniel</forename><surname>Cummings</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marcel</forename><surname>Nassar</surname></persName>
		</author>
		<ptr target="http://dblp.uni-trier.de/db/conf/icassp/icassp2020.html#CummingsN20" />
	</analytic>
	<monogr>
		<title level="m">ICASSP. IEEE</title>
				<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="3897" to="3901" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Automated Classification of Citations Using Linguistic Semantic Grammars</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Garzone</surname></persName>
		</author>
		<ptr target="https://books.google.com/books?id=V-bwSgAACAAJ" />
	</analytic>
	<monogr>
		<title level="m">Thesis</title>
				<meeting><address><addrLine>London, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1997">1997</date>
		</imprint>
		<respStmt>
			<orgName>University of Western Ontario</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Inductive Representation Learning on Large Graphs</title>
		<author>
			<persName><forename type="first">William</forename><forename type="middle">L</forename><surname>Hamilton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rex</forename><surname>Ying</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jure</forename><surname>Leskovec</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 31st International Conference on Neural Information Processing Systems</title>
				<meeting>the 31st International Conference on Neural Information Processing Systems<address><addrLine>Long Beach, California, USA; Red Hook, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Curran Associates Inc</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1025" to="1035" />
		</imprint>
	</monogr>
	<note>NIPS&apos;17</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Survey about citation context analysis: Tasks, techniques, and resources</title>
		<author>
			<persName><forename type="first">Myriam</forename><surname>Hernández-Alvarez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">José</forename><forename type="middle">M</forename><surname>Gomez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Natural Language Engineering</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="327" to="349" />
			<date type="published" when="2016">2016. 2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Measuring the Evolution of a Scientific Field through Citation Frames</title>
		<author>
			<persName><forename type="first">David</forename><surname>Jurgens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Srijan</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Raine</forename><surname>Hoover</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dan</forename><surname>Mcfarland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dan</forename><surname>Jurafsky</surname></persName>
		</author>
		<idno type="DOI">10.1162/tacl_a_00028</idno>
		<ptr target="https://doi.org/10.1162/tacl_a_00028" />
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="391" to="406" />
			<date type="published" when="2018">2018. 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Semi-Supervised Classification with Graph Convolutional Networks</title>
		<author>
			<persName><forename type="first">Thomas</forename><forename type="middle">N</forename><surname>Kipf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Max</forename><surname>Welling</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=SJU4ayYgl" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 5th International Conference on Learning Representations (ICLR &apos;17). OpenReview.net</title>
				<meeting>the 5th International Conference on Learning Representations (ICLR &apos;17). OpenReview.net<address><addrLine>Palais des Congrès Neptune, Toulon, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">2017. 14</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Estimating mutual information</title>
		<author>
			<persName><forename type="first">Alexander</forename><surname>Kraskov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Harald</forename><surname>Stögbauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Peter</forename><surname>Grassberger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Physical review E</title>
		<imprint>
			<biblScope unit="volume">69</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page">66138</biblScope>
			<date type="published" when="2004">2004. 2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">S2ORC: The Semantic Scholar Open Research Corpus</title>
		<author>
			<persName><forename type="first">Kyle</forename><surname>Lo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lucy</forename><forename type="middle">Lu</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mark</forename><surname>Neumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rodney</forename><surname>Kinney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Daniel</forename><surname>Weld</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.447</idno>
		<ptr target="https://doi.org/10.18653/v1/2020.acl-main.447" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics</title>
				<meeting>the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="4969" to="4983" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Automatic Differentiation in PyTorch</title>
		<author>
			<persName><forename type="first">Adam</forename><surname>Paszke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sam</forename><surname>Gross</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Soumith</forename><surname>Chintala</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Gregory</forename><surname>Chanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Edward</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zachary</forename><surname>Devito</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zeming</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alban</forename><surname>Desmaison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Luca</forename><surname>Antiga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Adam</forename><surname>Lerer</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=BJJsrmfCZ" />
	</analytic>
	<monogr>
		<title level="m">NIPS 2017 Workshop on Autodiff. OpenReview.net</title>
				<meeting><address><addrLine>Long Beach, California, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">Jason</forename><surname>Priem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Heather</forename><surname>Piwowar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Richard</forename><surname>Orr</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2205.01833abs/2205.01833</idno>
		<title level="m">OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts</title>
				<imprint>
			<date type="published" when="2022">2022. 2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
	<note>5 pages</note>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Citation context analysis for information retrieval</title>
		<author>
			<persName><forename type="first">Anna</forename><surname>Ritchie</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
		<respStmt>
			<orgName>University of Cambridge, Computer Laboratory</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Citation intent classification using word embedding</title>
		<author>
			<persName><forename type="first">Muhammad</forename><surname>Roman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Abdul</forename><surname>Shahid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shafiullah</forename><surname>Khan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Anis</forename><surname>Koubaa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lisu</forename><surname>Yu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Ieee Access</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="9982" to="9995" />
			<date type="published" when="2021">2021. 2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Modeling Relational Data with Graph Convolutional Networks</title>
		<author>
			<persName><forename type="first">Michael</forename><surname>Schlichtkrull</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Thomas</forename><forename type="middle">N</forename><surname>Kipf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Peter</forename><surname>Bloem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rianne</forename><surname>Van Den Berg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ivan</forename><surname>Titov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Max</forename><surname>Welling</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Semantic Web</title>
				<editor>
			<persName><forename type="first">Aldo</forename><surname>Gangemi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Roberto</forename><surname>Navigli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Maria-Esther</forename><surname>Vidal</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Pascal</forename><surname>Hitzler</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Raphaël</forename><surname>Troncy</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Laura</forename><surname>Hollink</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Anna</forename><surname>Tordai</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Mehwish</forename><surname>Alam</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="593" to="607" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Characterizing highly cited method and non-method papers using citation contexts: The role of uncertainty</title>
		<author>
			<persName><forename type="first">Henry</forename><surname>Small</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Informetrics</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="461" to="480" />
			<date type="published" when="2018">2018. 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space</title>
		<author>
			<persName><forename type="first">Zhiqing</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zhi-Hong</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jian-Yun</forename><surname>Nie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jian</forename><surname>Tang</surname></persName>
		</author>
		<ptr target="http://dblp.uni-trier.de/db/conf/iclr/iclr2019.html#SunDNT19" />
	</analytic>
	<monogr>
		<title level="m">ICLR (Poster). OpenReview.net</title>
				<meeting><address><addrLine>New Orleans, LA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2019. 18</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">An annotation scheme for citation function</title>
		<author>
			<persName><forename type="first">Simone</forename><surname>Teufel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Advaith</forename><surname>Siddharthan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dan</forename><surname>Tidhar</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/W06-1312" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue. Association for Computational Linguistics</title>
				<meeting>the 7th SIGdial Workshop on Discourse and Dialogue. Association for Computational Linguistics<address><addrLine>Sydney, Australia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="80" to="87" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Complex Embeddings for Simple Link Prediction</title>
		<author>
			<persName><forename type="first">Théo</forename><surname>Trouillon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Johannes</forename><surname>Welbl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sebastian</forename><surname>Riedel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Éric</forename><surname>Gaussier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Guillaume</forename><surname>Bouchard</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 33rd International Conference on International Conference on Machine Learning -Volume 48 (ICML&apos;16</title>
				<meeting>the 33rd International Conference on International Conference on Machine Learning -Volume 48 (ICML&apos;16<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>JMLR.org</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="2071" to="2080" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Identifying Meaningful Citations</title>
		<author>
			<persName><forename type="first">Marco</forename><surname>Valenzuela</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vu</forename><surname>Ha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Oren</forename><surname>Etzioni</surname></persName>
		</author>
		<ptr target="http://www.aaai.org/Library/Workshops/ws15-13.php" />
	</analytic>
	<monogr>
		<title level="m">Scholarly Big Data: AI Perspectives, Challenges, and Ideas, Papers from the 2015 AAAI Workshop (Technical Report</title>
				<editor>
			<persName><forename type="first">Cornelia</forename><surname>Caragea</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><forename type="middle">Lee</forename><surname>Giles</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Narayan</forename><surname>Bhamidipati</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Doina</forename><surname>Caragea</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Sujatha</forename><surname>Das Gollapalli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Saurabh</forename><surname>Kataria</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Huan</forename><surname>Liu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Feng</forename><surname>Xia</surname></persName>
		</editor>
		<meeting><address><addrLine>Menlo Park, CA</addrLine></address></meeting>
		<imprint>
			<publisher>AAAI Press</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="21" to="26" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Visualizing Data using t-SNE</title>
		<author>
			<persName><forename type="first">Laurens</forename><surname>Van Der Maaten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Geoffrey</forename><surname>Hinton</surname></persName>
		</author>
		<ptr target="http://jmlr.org/papers/v9/vandermaaten08a.html" />
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="2579" to="2605" />
			<date type="published" when="2008">2008. 2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Microsoft academic graph: When experts are not enough</title>
		<author>
			<persName><forename type="first">Kuansan</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zhihong</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chiyuan</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chieh-Han</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yuxiao</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Anshul</forename><surname>Kanakia</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Quantitative Science Studies</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="396" to="413" />
			<date type="published" when="2020">2020. 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Identifying Referential Intention with Heterogeneous Contexts</title>
		<author>
			<persName><forename type="first">Wenhao</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mengxia</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tong</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Meng</forename><surname>Jiang</surname></persName>
		</author>
		<idno type="DOI">10.1145/3366423.3380175</idno>
		<ptr target="https://doi.org/10.1145/3366423.3380175" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of The Web Conference 2020</title>
				<meeting>The Web Conference 2020<address><addrLine>Taipei, Taiwan; New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="962" to="972" />
		</imprint>
	</monogr>
	<note>WWW &apos;20</note>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">DGL-KE: Training Knowledge Graph Embeddings at Scale</title>
		<author>
			<persName><forename type="first">Da</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiang</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chao</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zeyuan</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zihao</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jin</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hao</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zheng</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">George</forename><surname>Karypis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR &apos;20)</title>
				<meeting>the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR &apos;20)<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="739" to="748" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
