=Paper=
{{Paper
|id=Vol-3656/paper8
|storemode=property
|title=Citation Intent Classification Through Weakly Supervised Knowledge Graphs
|pdfUrl=https://ceur-ws.org/Vol-3656/paper8.pdf
|volume=Vol-3656
|authors=Xinwei Du,Kian Ahrabian,Arun Baalaaji Sankar Ananthan,Richard Delwin Myloth,Jay Pujara
|dblpUrl=https://dblp.org/rec/conf/aaai/DuAAMP23
}}
==Citation Intent Classification Through Weakly Supervised Knowledge Graphs==
Xinwei Du¹·², Kian Ahrabian¹·²·*, Arun Baalaaji Sankar Ananthan¹·², Richard Delwin Myloth¹·² and Jay Pujara¹·²

¹ Information Sciences Institute, Marina del Rey, CA, USA
² University of Southern California, Los Angeles, CA, USA
Abstract
Citations are scientists’ tools for grounding their innovations and findings in the existing collective knowledge. They are used
for semantically distinct purposes as scientists utilize them at different parts of their work to convey specific information. As
a result, a crucial aspect of scientific document understanding is recognizing the authorial intent associated with citations.
Current state-of-the-art methods rely on contextual sentences surrounding each citation to classify the intent. However,
in the absence of textual content, these approaches become unusable. In this work, we propose a text-free citation intent
classification method built on relational information among scholarly works. To this end, we introduce a
large-scale knowledge graph built from the publications in the SciCite dataset and their multi-hop neighborhood extracted
from The Semantic Scholar Open Research Corpus (S2ORC). We also augment this knowledge graph by adding weakly-labeled
links based on the intent information available in the S2ORC. Finally, we cast the intent classification task as a link prediction
problem on the newly created knowledge graph. We study this problem in both transductive and inductive settings. Our
experimental results show that we can achieve a comparable macro F1 score to word embedding content-based methods by
only relying on features and relations derived from this knowledge graph. Specifically, we achieve macro F1 scores of 62.16
and 59.81 in the transductive and inductive settings, respectively, on the link-level SciCite dataset. Moreover, by combining
our method with the state-of-the-art NLP-based model, we achieve improvements across all metrics.
Keywords
Citation Intent Classification, Knowledge Graphs, Graph Neural Networks, Large Language Models, Weakly Supervised Learning
1. Introduction

Citations are the primary way of identifying past contributions and connecting progress in new publications to existing literature. Nevertheless, not all citations indicate the same meaning. Authors use citations sparingly, with specific intent behind them. For example, some papers are cited for providing background information in a domain, while others are cited when adopting or adapting a previously-used methodology. There are also scenarios where the same paper is used as background information and methodology use-case in different contexts simultaneously. Understanding citation intent is crucial to studying scholarly works, given the universality of using citations. Current state-of-the-art citation intent classification models [17, 1, 4] rely heavily on textual information, e.g., the sentences surrounding the citation. However, such information is expensive to obtain and in some scenarios inaccessible altogether. Consequently, we need models that can operate without access to textual information. Previous works [3, 26, 6] have shown the importance of relational and structural information available in links among publications for various tasks. In this work, we propose a general citation intent classification method that relies purely on structural information.

Besides helping researchers better understand the relationship among publications, citation intent analysis has been used for studying various other aspects of scientific works, such as research domain evolution [10], scientific impact analysis [19], scientific document summarization [5], and retrieving related scientific works [16]. The three main categories of citations are "Result," "Method," and "Background" [4]. These categories describe the reasons behind making a scientific connection, i.e., referencing a publication in another publication. Classifying citations into these groups has traditionally required a high level of expertise in the respective scientific domains. This constraint, combined with the high cost of expert human labor, has resulted in highly scarce datasets, which makes the task even more difficult.
The Third AAAI Workshop on Scientific Document Understanding 2023, February 14th, 2023, Washington, DC, USA
* Corresponding author.
xinweidu@usc.edu (X. Du); ahrabian@usc.edu (K. Ahrabian); arunbaal@usc.edu (A. B. S. Ananthan); myloth@usc.edu (R. D. Myloth); jpujara@usc.edu (J. Pujara)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

Previous works have proposed classifying citation intent through feature engineering-based [10] and representation learning-based [1] methods. However, most of these methods depend on textual information. As a result, they require a complex multi-stage pipeline of parsing documents, identifying citation contexts, and predicting citation intent [13]. Besides being prone to error propagation from various pipeline stages, the use of these models is limited to situations where the full text is available in a proper format. This work introduces a pure graph-based approach to classifying citation intent. We extend the existing SciCite dataset with 2-hop neighborhoods extracted from The Semantic Scholar Open Research Corpus (S2ORC). To further enrich the graph, we utilize the intent information provided in the S2ORC to create a weakly supervised knowledge graph (KG) consisting of the publications and the relations that match the provided intents. Our main idea is to use contextualized relational patterns to make predictions, obviating the need for textual context. Given the newly built KG, we cast the intent classification problem into the common link prediction problem on KGs. Specifically, we train a model to learn representations for entities and relations. Using these representations, we run the following query on the KG: (s, ?, o), where s cites o.

Figure 1: Overview of the extracted multi-hop KG. The set of 0-hop nodes 𝒱_0 includes all the orange nodes. The set of 1-hop nodes 𝒱_1 includes all the orange and blue nodes. Similarly, the graph could be expanded to include k-hop nodes 𝒱_k. The annotated set on each edge represents that specific link's intent. Specifically, the empty set denotes that the citation link has no intent label.
Converting this problem into a link prediction task allows us to adapt and extend widely used KG embedding models. We study the link prediction problem in both transductive and inductive settings. Our experimental results show that although our KG-based method underperforms compared to the large language model-based approaches, it is comparable or even superior to the word embedding-based methods. Moreover, our experiments with combining the NLP-based and graph-based methods show slight improvements over the current state-of-the-art model. These findings further signify the importance of relational patterns for citation intent classification.

The contributions of this work are as follows:

1. Extending the SciCite dataset using the S2ORC dataset to generate a large-scale weakly supervised KG.
2. Introducing a novel graph-based approach for citation intent classification built on top of the newly built KG.
3. Presenting benchmarks for both transductive and inductive settings.
4. Presenting analyses of the effect of different parts of the methodology, such as weak supervision and feature engineering.

2. Related Work

2.1. Citation Function/Intent Schemes

Many prior works have studied the problem of creating categorical schemes for citation intent, which in some works is referred to as citation function [9]. Earlier works were focused on creating more fine-grained categories, going as far as defining 35 [7] and 12 [21] fine-grained schemes for scientific arguments. More recent works, however, have focused on creating more concise categories. For example, ACL-ARC [10] proposes a 6-class intent categorization scheme: Background, Motivation, Uses, Extension, Comparison or Contrast, and Future. SciCite [4] is even more restrictive and drops or combines small fine-grained classes to provide a more concise 3-class annotation scheme: Background, Method, and Result.

2.2. Citation Intent Classification Methods

Before the explosion of deep learning approaches, most methods relied on a combination of hand-crafted features and classic machine learning models. For example, in one instance [23], the authors propose 12 different features, including citation count, PageRank value, and author overlap, and use classic machine learning models such as SVM and Random Forest for classification. In another instance [10], the authors define pattern-based, topic-based, and prototypical argument features and use SVM to make predictions.

With the advent of deep learning models and the emergence of large language models in recent years, representation learning-based methods have outperformed the hand-crafted methods, achieving higher accuracy by considering the textual information. Recent works have proposed the use of structural scaffolds [4], BERT-based models trained on the scientific corpus (SciBERT) [1], word embedding-based approaches [17], and creating a heterogeneous context graph based on an academic network [26].

2.3. Knowledge Graph Embedding Models

KGs are structured information repositories consisting of a set of nodes representing entities and a set of typed edges representing relations. Since, in most cases, the KG nodes and edges are not attributed, KG embedding (KGE) models aim to learn low-dimensional representations for all entities and relations. The most common traditional shallow KGE methods are TransE [2], ComplEx [22], and RotatE [20]. More recent GNN-based KGE methods leverage the message-passing scheme of GNNs, enabling more complex multi-hop reasoning. Examples of these methods are GCN [11], which leverages the spectral information for information propagation but is limited to mono-relational KGs, R-GCN [18], which extends GCN to support multi-relational KGs, and GraphSAGE [8], which introduces an inductive framework to handle unseen nodes.

3. Dataset

The SciCite dataset focuses on individual citation links and ignores the significance of broader relational connections and features. To overcome this issue, we construct a knowledge graph by mapping each entity in the SciCite dataset to the S2ORC and adding their 2-hop citation neighborhoods. The S2ORC contains more than 206 million publications and 2.49 billion citation links. Apart from the regular citation links, this corpus provides partial intent labels for citations using a 3-class scheme as follows:

1. Background: Describe a problem, topic, or concept
2. Method: Provide a method, tool, or dataset
3. Result: To make a comparison

Moreover, the SciCite dataset is tailored for sentence classification methods, where input features are textual excerpts and the output labels are citation intents. We reformulate this task as link prediction on KGs, where the input features are a representation of the source (citing) paper and the target (cited) paper, and the output is the label of a citation link between the source and target. We release all our datasets under a CC-BY-SA license at TBD

Table 1
Statistics of the SciCite dataset and the reconstructed datasets.

| Dataset | SciCite | SciCite_origin | SciCite_resplit |
|---|---|---|---|
| Level | Sentence | Link | Link |
| # Samples | 11,020 | 10,379 | 5,766 |
| # Train | 8,243 | 7,602 | 4,122 |
| # Validation | 916 | 916 | 822 |
| # Test | 1,861 | 1,861 | 822 |

3.1. Entity Mapping

We first map each paper in the SciCite dataset to the S2ORC by matching SciCite's IDs to Semantic Scholar's SHA IDs. Since a publication could have many SHA IDs and only one Corpus ID, we then map each SHA ID to the unique Corpus ID to extract unique entities. From the 13,080 papers with unique IDs in SciCite, we successfully map 13,019 of them to valid SHA IDs in Semantic Scholar, while the remaining 61 papers do not have any corresponding records. We believe this is due to publication removals, as the SciCite dataset was created from the S2ORC in 2019. After converting SHA IDs to Corpus IDs, we end up with 13,011 unique entities and 8 duplicate entities.

3.2. Dataset Splitting

The original SciCite dataset contains 11,020 human-labeled samples. Hence, to adapt it to our link prediction setting, we reconstruct two datasets: SciCite_origin and SciCite_resplit. SciCite_origin adheres to the same benchmarks reported in prior works but is modified to remove overlapping citation links in the training and test sets. To maximize usage of the training data while removing artifacts, we create SciCite_resplit, which performs additional cleaning, provides a stronger separation of training and test sets, and avoids multi-intent citations. Table 1 showcases the statistics of these datasets.

SciCite_origin:
To make methods comparable, we use the same validation and test sets as SciCite for this dataset and try to keep the training set as close as possible. We convert each publication in the SciCite dataset to a Semantic Scholar entity using the mapped Corpus IDs and drop the contextual sentence-level information. We assign a random unique ID to publications without a Corpus ID. After this procedure, we end up with a set of links for our link prediction task.

Due to the removal of the contextual information, some of the training links appear exactly the same in the test set. Hence, we remove 641 training set samples that also appear in the test set to prevent data leakage. Moreover, since only one link in the test set has multiple intents, we treat the link prediction problem as a multi-class task rather than a multi-label task. In this scenario, the multi-intent links are represented as separate samples with the same inputs and different outputs.
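The leakage-removal step can be sketched as follows. This is a minimal illustration, assuming each link is a (source, target, intent) triple after the contextual sentences are dropped; the function name and data layout are ours, not the authors'.

```python
# Sketch of the dedup step: once the contextual sentences are dropped,
# identical (source, target) pairs can appear in both train and test,
# so such training links are removed to prevent data leakage.

def remove_leaked_links(train, test):
    """Drop training links whose (source, target) pair also occurs in test."""
    test_pairs = {(s, t) for s, t, _ in test}
    return [link for link in train if (link[0], link[1]) not in test_pairs]

train = [(1, 2, "method"), (3, 4, "background"), (5, 6, "result")]
test = [(3, 4, "background")]
clean = remove_leaked_links(train, test)  # the (3, 4) link is removed
```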
Multi-label methods may be a promising future extension of our work.

SciCite_resplit:
Even though we convert the SciCite dataset to SciCite_origin, problems such as duplicate citations and multi-label links still exist. Therefore, we further tailor the SciCite dataset to create a better link prediction dataset for graph-based models. First, we remove all the entities, and their related samples, that do not have a mapped Corpus ID. Then, similar to SciCite_origin, we convert the remaining samples to a set of links. Following this, we drop all duplicate samples. Among the remaining 6,458 unique links, 5,886 have only one intent, 489 have two intents, and 83 have all three intents. We remove all the multi-intent links and resplit the dataset with ratios of 70%/15%/15% for the training, validation, and test sets, respectively.

4. Method

Throughout the rest of this work, for simplicity, we use the term publication to denote all types of academic publications, e.g., books and papers. Moreover, we use the terms citation and reference to denote incoming and outgoing links, respectively.

4.1. Weak Supervision

In order to enrich our data and provide more information to the models, we extract the set of intents provided in the S2ORC dataset for each citation link. The intent labels in S2ORC are extracted using the structural scaffolds model [4] at a sentence level. In this scenario, we implicitly use the existing data derived from the content for bootstrapping our approach. We refer to these links as weakly labeled due to being labeled by a noisy model rather than a human expert. Since the intent labels are partial at a sentence level, citation links could have zero intents in the absence of text or several intents in an abundance of use cases.

4.2. Knowledge Graph Construction

Given the S2ORC dataset, we expand the SciCite dataset using the mapped entities to construct a KG containing 2-hop neighborhoods of the publications. Figure 1 illustrates an overview of the expanded KG. This work uses the 2022-09-13 version of the corpus downloaded from the bulk API. Formally, given the set of mapped entities 𝒱_0, the set of k-hop nodes 𝒱_k is defined as

𝒱_k = 𝒱_{k−1} ∪ { y | ∃x ∈ 𝒱_{k−1} : y ∈ 𝒩_x }    (1)

where, for a given entity x, 𝒩_x denotes all the entities that cite or are cited by x, i.e., the set of neighboring entities. Given the sets of unlabeled links 𝒰 and weakly labeled links ℒ, the set of k-hop edges ℰ_k is defined as

ℰ_k^𝒰 = { (x, y, UNK) | x, y ∈ 𝒱_k, (x, y) ∈ 𝒰 }    (2)

ℰ_k^ℒ = ∪_r { (x, y, r) | x, y ∈ 𝒱_k, (x, y) ∈ ℒ_r }    (3)

ℰ_k = ℰ_k^𝒰 ∪ ℰ_k^ℒ    (4)

where r ∈ {Background, Method, Result} and ℒ_r denotes the set of all weakly labeled links with label r. Consequently, given the sets of k-hop nodes 𝒱_k and edges ℰ_k, the extracted k-hop KG, 𝒢_k, is defined as

𝒢_k = (𝒱_k, ℰ_k)    (5)

The specific statistics of the extracted KGs and the original Semantic Scholar corpus are reported in Table 2. Since not every link has a weakly labeled intent, this table also provides the percentage of weakly labeled links for each corresponding graph. Although we extract 𝒢_2, given its scale, we opt to run our current experiments only on 𝒢_1 and leave the larger-scale experiments for future works.

Table 2
Statistics of the extracted KGs along with the original S2ORC dataset.

| Dataset | # Nodes | # Citation Links | # Background | # Method | # Result | Weak Labels |
|---|---|---|---|---|---|---|
| Zero-Hop (𝒢_0) | 13,011 | 10,733 | 5,479 | 4,403 | 1,335 | 79.04% |
| One-Hop (𝒢_1) | 5,862,261 | 119,776,090 | 39,202,086 | 16,830,665 | 16,830,665 | 43.18% |
| Two-Hop (𝒢_2) | 57,535,880 | 1,621,293,902 | 467,860,523 | 121,877,053 | 35,283,718 | 34.41% |
| S2ORC | 206,159,629 | 2,495,513,737 | 643,955,457 | 169,472,164 | 45,779,793 | 31.90% |

4.3. Feature Engineering

Since none of the publications in our KGs have any features or pre-defined representations, we propose to represent them through their references, citations, and graph-based features. More specifically, from the S2ORC we extract the in-degrees and out-degrees of citations (or references), background links, method links, and result links. As a result, each paper is represented with an 8-dimensional feature vector, 4 for each in-degree and out-degree feature.
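The degree-based features above can be sketched as follows. This is a minimal sketch assuming the KG is available as (source, target, label) triples, with label None for links that have no weak intent label; the function name and slot layout are ours.

```python
from collections import defaultdict

# Sketch of the 8-dimensional feature vector: for each publication we count
# the out-degree and in-degree over all citation links plus over each of the
# three weakly labeled intents.
LABELS = ("background", "method", "result")

def build_features(edges):
    # Slots: [out_all, out_bg, out_method, out_result,
    #         in_all,  in_bg,  in_method,  in_result]
    feats = defaultdict(lambda: [0.0] * 8)
    for src, dst, label in edges:
        feats[src][0] += 1          # out-degree over all citation links
        feats[dst][4] += 1          # in-degree over all citation links
        if label in LABELS:
            i = 1 + LABELS.index(label)
            feats[src][i] += 1      # out-degree for this weak intent
            feats[dst][4 + i] += 1  # in-degree for this weak intent
    return dict(feats)
```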
Table 3
Intent classification results on the SciCite_origin and SciCite_resplit datasets. The first block of four metrics is on SciCite_origin and the second on SciCite_resplit. All metrics are macro-averaged. Bold values represent the highest performance within the metric and dataset scope.

| Method | Setting | Accuracy | Precision | Recall | F1 | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|---|
| Random | Universal | 33.05 | 33.05 | 33.83 | 31.22 | 32.99 | 32.88 | 33.85 | 31.89 |
| Most Common | Universal | 53.57 | 17.86 | 33.33 | 23.26 | 42.63 | 14.21 | 33.33 | 19.93 |
| TransE | Transductive | 40.41 | 37.09 | 37.81 | 36.52 | 39.57 | 35.96 | 35.70 | 35.59 |
| ComplEx | Transductive | 49.01 | 44.11 | 37.94 | 33.30 | 40.25 | 41.85 | 35.64 | 28.78 |
| RotatE | Transductive | 23.54 | 32.97 | 32.74 | 22.98 | 28.12 | 36.88 | 36.31 | 27.88 |
| Random + MLP | Transductive | 49.60 | 30.58 | 35.17 | 32.42 | 45.35 | 30.26 | 35.83 | 32.78 |
| TransE + MLP | Transductive | 54.16 | 45.77 | 45.21 | 45.24 | 51.93 | 45.68 | 44.16 | 43.89 |
| ComplEx + MLP | Transductive | 55.72 | 47.80 | 45.19 | 44.77 | 48.64 | 43.46 | 43.15 | 43.24 |
| RotatE + MLP | Transductive | 56.37 | 48.79 | 46.15 | 46.55 | 51.81 | 46.92 | 45.46 | 45.63 |
| Infersent-KMeans | Universal | - | 58 | 64 | 60 | - | - | - | - |
| Infersent-HDBSCAN | Universal | - | 57 | 63 | 58 | - | - | - | - |
| Glove-KMeans | Universal | - | 51 | 56 | 51 | - | - | - | - |
| Glove-HDBSCAN | Universal | - | 52 | 57 | 52 | - | - | - | - |
| MHLP (Ours) | Transductive | 66.20 | 62.18 | 56.13 | 57.88 | 66.10 | 63.69 | 61.33 | 62.16 |
| MHLP (Ours) | Inductive | 63.94 | 58.36 | 55.05 | 56.13 | 64.17 | 59.86 | 59.83 | 59.81 |
| Structural Scaffolds | Universal | - | 84.7 | 83.6 | 84.0 | - | - | - | - |
| SciBERT | Universal | 86.94 | 85.30 | 85.92 | 85.58 | 86.39 | 85.51 | 85.14 | 85.28 |
| SciBERT + MHLP | Universal | 87.53 | 85.56 | 87.07 | 86.25 | 86.85 | 86.80 | 85.96 | 86.35 |

For the publications where the content is unavailable, the out-degree intent-based features will be zero, since those features are based on the noisy sentence-level model that Semantic Scholar uses. However, the in-degree features may not be zero as long as the citing paper's content is available. For the new publications, i.e., unseen nodes in the inductive setting, the only known non-zero feature is the reference count.

We normalize the reference and citation features by a biased log factor defined as

h̄_x = log_10(h_x + 1 + α)    (6)

where α is a bias hyperparameter. We specifically set α = −0.9 to get a normalized value of −1 for zero-reference and zero-citation situations.

Moreover, we normalize the non-zero in-degree intent-based features into a [0, 1] probability distribution as follows:

h̄_x = h_x / (h_Background + h_Method + h_Result)    (7)

The same normalization step is applied to the out-degree features separately.

4.4. Baselines

Knowledge Graph Embedding Models:
Traditional KGE models consist of two shallow embeddings as entity and relation encoders and a score function as a decoder to predict the likelihood of a link. These models are trained in a contrastive way by masking either one of the entities in a given triplet (head, relation, tail) and sampling a set of negative entities to contrast with the positive entity.

Since the traditional KGE methods rely on shallow embeddings for encoding entities and relations, they can only be used in the transductive setting and cannot operate on unseen nodes. For our experiments, we use the available implementations of TransE, ComplEx, and RotatE in the DGL-KE toolkit [27]. In the evaluation phase, we calculate the likelihood of all different relation types for each link and consider the relation with the highest likelihood as the model's intent prediction.

Hybrid Models:
To increase the reasoning power of the traditional KGE models, we devise a two-stage approach based on a multilayer perceptron (MLP). We first use the traditional KGE models to learn embeddings for entities and relations. Then, instead of relying on the produced likelihood scores, we concatenate the vectors of the two entities and pass them through an MLP to get logit values. Formally, given a link (u, v) and their respective learned representations (z_u, z_v), we calculate the logit values as

p = MLP([z_u ‖ z_v])    (8)

where p ∈ ℝ^|𝒞| contains the unnormalized logits for each class. The predicted class c is then calculated as

argmax_c sigmoid(p).    (9)

Natural Language Processing Models:
We include the reported results of several state-of-the-art Natural Language Processing (NLP) methods. Specifically, we include results from the word embedding-based methods Infersent-KMeans, Infersent-HDBSCAN, Glove-KMeans, and Glove-HDBSCAN [17], the BiLSTM-based method Structural Scaffolds [4], and the large language model-based method SciBERT [1]. Moreover, we report the results of fine-tuning a pre-trained SciBERT model on both datasets. All these methods use textual information and are evaluated on the SciCite dataset.

Figure 2: Overview of the composite model. The model consists of two encoders for the citation phrase and the citation graph around the citation link. During the training phase, we freeze the SciBERT model in the first two epochs as a warm-up step for the graph encoder; then, we jointly train both encoders along with the final prediction module.

4.5. Multi-Hop Link Prediction (MHLP)

Transductive and inductive settings are the most common link prediction evaluation schemes for KGs. The main difference between these two settings is having a fixed set of nodes in both the training and evaluation phases (transductive) versus allowing the addition of unseen nodes in the evaluation phase (inductive). This work refers to citation intent prediction on unseen publications as the inductive setting, whereas the transductive setting refers to citation intent prediction on already seen publications.

We propose an adaptable graph-based model for citation intent prediction in both the transductive and inductive settings. The primary basis of this approach is that a node, i.e., publication, could be represented as a combination of the neighboring nodes' representations. Let h_x^(0) be the extracted feature vector for any arbitrary node x. We calculate the representation of an arbitrary node v at layer l + 1 of a multilayer model as

h_{𝒩_v}^(l+1) = (1 / |𝒩_v|) Σ_{u ∈ 𝒩_v} h_u^(l)    (10)

h_v^(l+1) = σ(W^(l+1) [h_v^(l) ‖ h_{𝒩_v}^(l+1)])    (11)

where σ is a non-linear function. Throughout our experiments, we specifically use ReLU to introduce non-linearity. Given the node representations from an L-layer model and a link (u, v), we calculate the logit values as

p = MLP([h_u^(L) ‖ h_v^(L)])    (12)

where p ∈ ℝ^|𝒞| contains the unnormalized logits for each class and 𝒞 is the set of all classes. The predicted class c is then calculated as

argmax_c sigmoid(p).    (13)

The main disadvantage of the inductive setting is that the unseen nodes only have one available feature, i.e., the reference count. This absence of information makes the task extremely difficult, as the feature vectors are highly sparse. However, our model tries to diminish this effect by using the message-passing scheme, as defined in Equation 11, to aggregate information through connected entities, i.e., cited papers, creating a denser representation for the unseen nodes.

All our models are trained using the cross-entropy loss defined as

l_n = −log(exp(p_{y_n}) / Σ_{i=1}^{|𝒞|} exp(p_i))    (14)

where p_x is the logit value for class x given the prediction vector p.
Composite Model:
To further test the capabilities of our proposed model and use both structural and textual information, we devise a multi-modal model comprising encoders for both the graph structure and the citation context. Specifically, we use a pre-trained SciBERT model for encoding the citation phrase text and our MHLP model for encoding the citation graph around the citation link. Figure 2 illustrates an overview of the composite model.

5. Experiments

In this section, we report our experimental results on both the SciCite_origin and SciCite_resplit datasets. All the graph-based experiments are carried out on the 𝒢_1 KG. For the traditional KGE methods, we tune their hyperparameters as described in Appendix A.1 and train them using the hyperparameters showcased in Table 4. For the hybrid methods, the KGE component is first trained to generate node features using the hyperparameters described in Table 4. Then, the MLP component is trained using the procedure described in A.2 to predict the citation intent. For the MHLP-based methods, in both the transductive and inductive settings, we use a 1-layer variation on top of the normalized features extracted as described in Section 4.3. Moreover, we tune their hyperparameters and train them as described in Appendix A.3. For the SciBERT method, we freeze the pre-trained model and add an MLP module on top of the 768-dimensional [CLS] token output. Similar to the other models, the MLP module is tuned using the parameters described in A.2. For the composite model, during the training phase, we freeze the SciBERT model in the first two epochs as a warm-up step for the graph encoder; then, we jointly train both encoders along with the final prediction module.

To control for the effect of the pre-training using traditional KGE models, we also run a variation with randomly initialized node features and designate it as "Random + MLP." For the NLP models, we use the previously reported results [17] to compare our models on the test set-aligned SciCite_origin dataset. Finally, we also include the results from random and most common class predictions as sanity checks. All the models are implemented using PyTorch [14] and trained on a machine with a single Quadro RTX 8000 GPU, 72 CPU cores, and 768 GB of RAM. Implementations are available under a CC-BY-SA license at TBD.

Figure 3: The statistics of citation intent for all publications in the Semantic Scholar corpus: (a) the number of different citation intents; (b) the percentage of different citation intents. The temporal trends stay steady over time, suggesting a lack of information in the elapsed time from the time of publication to the time of citing.

5.1. Results

Table 3 illustrates our experimental results on both datasets. As evident from Table 3, the traditional KGE methods perform poorly on both datasets, only slightly beating the random baseline on the macro F1 metric. Interestingly, both ComplEx and RotatE perform worse than TransE on both datasets. This finding is surprising, as both ComplEx and RotatE are more expressive than TransE [20]. However, when combined with MLP models, all exhibit a significant performance boost, up to more than 100% in the case of RotatE. After this addition, we can see the expected expressivity trend in the model results, i.e., the more powerful the model, the better the result. Moreover, the control "Random + MLP" experiment showcases very similar results to the random baseline, indicating the importance of both components for the hybrid model to perform well. Altogether, it is evident that the reasoning power of shallow traditional KGE models is not enough to capture the complexity of this task, and we require models with more reasoning power.

As for the MHLP method, in the transductive setting, it achieves 57.88 and 62.16 macro F1 scores on the SciCite_origin and SciCite_resplit datasets, respectively. Moreover, its inductive results showcase the robustness of our approach in an extreme out-of-distribution setting, achieving 56.13 and 59.81 macro F1 scores. Compared to previously reported results [17], our model achieves superior performance to the Glove-based models while slightly lagging behind the Infersent-based models. Looking into the precision and recall comparison, our method has better precision scores in both the transductive and inductive settings compared to all word embedding-based models; however, for recall, it performs better than the Glove-based models and worse than the Infersent-based models, which might stem from the imbalance in the links, as illustrated by Figure 3a. Further experimentation to address the class imbalance problem in future works might help improve the overall performance of MHLP. The significance of these results is that we show structural and relational information could be used to achieve relatively high performance without using textual information. Moreover, although our models underperform compared to language model-based approaches such as Structural Scaffolds and SciBERT, we showcase interesting future directions for combining graph-based and NLP-based methods.

Finally, the composite model, denoted as SciBERT + MHLP in Table 3, achieves the best performance among all models, even beating the fine-tuned SciBERT. When considering MHLP's standalone performance, these results showcase the potential improvements that could be achieved through the use of structural information that is not available in citation phrases. The presented experiments are a stepping stone for better understanding and using the structural information at scale for citation intent classification.
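The two-epoch SciBERT warm-up used when training the composite model can be sketched as follows. This is a hypothetical trainer skeleton: `set_trainable`, `train_composite`, and `run_epoch` are our names, not the authors', and the actual training loop is not shown in the paper.

```python
# Sketch of the warm-up schedule: the text encoder stays frozen for the
# first two epochs while the graph encoder warms up, then all modules are
# trained jointly. Duck-typed against PyTorch-style modules exposing
# .parameters() with .requires_grad flags.

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def train_composite(text_encoder, epochs, run_epoch):
    for epoch in range(epochs):
        set_trainable(text_encoder, epoch >= 2)  # frozen during epochs 0-1
        run_epoch(epoch)  # hypothetical callback: one optimization epoch
```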
6. Analysis
6.1. Temporal Analysis
This analysis studies the relationship between the time that has passed since publication and citation intent. We hypothesize that a paper is more likely to be cited as "Result" or "Method" right after its publication and, as time passes, more likely to be cited as "Background." If this proves accurate, we could get a relatively strong signal from the temporal information for each citation. To test our hypothesis, we plotted the years after publication against intent counts and ratios for all papers in the Semantic Scholar corpus. Figures 3a and 3b illustrate the results of our analysis. As evident from these figures, and contrary to our original hypothesis, we find that the ratio of intent classes stays almost the same as time passes, with only insignificant fluctuations. As a result, using temporal information in our models is unlikely to provide any significant improvement. Note that these results are based on the weakly labeled links that we obtained from S2ORC; these links are generated by another noisy model that could potentially be biased. Hence, these results should not discourage further analysis or studies of temporal information for citation intent classification.

Figure 4: The calculated MI values for publication features and averaged neighborhood features. (a) Publication features (both sides); (b) Averaged neighborhood features (both sides). On average, the publication features show stronger connections to the target variable.

6.2. Mutual Information Analysis

In this analysis, we study the quality of the engineered features described in Section 4.3 with respect to the weakly labeled intent classes. To this end, we use the well-known mutual information (MI) [12] measure to quantify the importance of each feature. Formally, the MI between two discrete random variables 𝑋 and 𝑌 is defined as

𝐼(𝑋, 𝑌) = ∑_{𝑦∈𝒴} ∑_{𝑥∈𝒳} 𝑃_{𝑋,𝑌}(𝑥, 𝑦) log(𝑃_{𝑋,𝑌}(𝑥, 𝑦) / (𝑃_𝑋(𝑥) 𝑃_𝑌(𝑦)))    (15)

where 𝒴 is the value space for 𝑌, 𝒳 is the value space for 𝑋, 𝑃_{𝑋,𝑌} is the joint probability distribution, and 𝑃_𝑋 and 𝑃_𝑌 are the marginal probability distributions. Note that MI is non-negative, and higher values indicate a stronger dependence between the two random variables. For our analysis, we calculate MI for both sides of the 5,886 unique citation links in the SciCiteresplit dataset. Moreover, to study these features in the graph context, we also calculate MI for the average of these features over the neighborhood of each publication, i.e., all citing and cited publications, from both sides of the citation links. Figures 4a and 4b present the results of our experiments. As evident from these results, while the publication-averaged features generally show stronger connections to the target variable, the neighborhood-averaged features seem to show complementary connections, further emphasizing the importance of using both sets of features.

Figure 5: The t-SNE visualizations for the unnormalized and normalized features. (a) Features before normalization; (b) Features after normalization.

Figure 6: The macro F1 score of MHLP (Transductive) on the SciCiteorigin and SciCiteresplit datasets. (a) The percentage of utilized weak labels; (b) The percentage of corrupted data.

6.3. Feature Quality Analysis

In this analysis, we study the effect of normalization as described in Equations 6 and 7. To this end, we project the extracted features of the 5,886 unique citation links in the SciCiteresplit dataset to a 2-dimensional space using t-SNE [24]. Figures 5a and 5b illustrate the projected space for the unnormalized and normalized features, respectively. As evident from Figure 5a, it is challenging to distinguish different intent types in the unnormalized space. However, after normalization, as evident from Figure 5b, the "Method" intention more or less forms a distinguishable cluster. This result shows that the use of normalization is potentially helpful for the model. Further studies on different types of normalization and their effects are left for future work.
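As a concrete companion to Equation 15, the sketch below computes the MI of two discrete variables directly from co-occurrence counts (the standard plug-in estimate). It is an illustrative implementation of ours, not the paper's code; in practice, `xs` would be a discretized feature column and `ys` the weak intent labels.

```python
# Plug-in estimate of Equation 15: I(X, Y) for two discrete variables,
# with joint and marginal probabilities estimated from raw counts.
from collections import Counter
from math import log

def mutual_information(xs, ys):
    n = len(xs)
    p_xy = Counter(zip(xs, ys))  # joint counts over (x, y) pairs
    p_x = Counter(xs)            # marginal counts over x
    p_y = Counter(ys)            # marginal counts over y
    mi = 0.0
    for (x, y), c in p_xy.items():
        pxy = c / n
        # Each term: P(x, y) * log(P(x, y) / (P(x) * P(y)))
        mi += pxy * log(pxy / ((p_x[x] / n) * (p_y[y] / n)))
    return mi
```

For independent variables the estimate is 0; for identical variables it equals the entropy of the shared distribution.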
6.4. Robustness Analysis

In this analysis, we focus on studying the robustness of our proposed graph-based method. To this end, we devise two ablation studies. In the first study, we randomly corrupt a percentage of the weak labels by replacing the correct label with a random label. This study aims to better understand the model's resilience to noise. In the second study, we randomly remove a percentage of the weak labels. The idea of this study is to better understand the effect of weak supervision on the model's performance. These studies are carried out by running the MHLP method in the transductive setting on both the SciCiteorigin and SciCiteresplit datasets.

The feature vectors for the publications are calculated by counting the number of citations and intents. These vectors are then normalized using Equations 6 and 7. To analyze the relationship between the model's performance and the amount of available data, we create ten variations of the dataset by using only a portion of the available weak labels, varying from all of the available weak labels down to only 10% of them. Figure 6a presents the result of this study.

As evident from Figure 6a, the more weakly labeled links are available, the better our method performs. The other significant observation is the robustness of the model, even in the extreme scenario of having access to only 10% of the labels. Note that only 31.90% of the links in S2ORC have at least one weakly labeled intent, which means that even at a utilization percentage of 100%, only 31.90% of the citation links are weakly labeled.

Figure 6b showcases the relationship between the model's performance and the percentage of corrupted data. Following our intuition, the model's performance monotonically decreases as we add more noisy labels to the data. However, two interesting observations can be made from this figure. First, the performance of our method drops by less than five macro F1 points when half (50%) of the weak labels are replaced with randomly assigned noisy labels. This observation shows that the proposed method is exceptionally resilient to labeling mistakes. Second, even when all the labels are replaced with random ones (100%), the model still performs better than the random baselines. This observation indicates that the model is learning to make inferences based on purely structural information, which further solidifies our hypothesis regarding the importance of structural information.

7. Conclusions and Future Work

In this work, we first introduced an expansion to the SciCite dataset by extracting scholarly information from the S2ORC dataset and creating an extended citation graph. Then, we gathered a large-scale weakly labeled dataset to augment the extracted citation graph with citation intents and create a multi-relational knowledge graph. Following this, we adapted sentence-based intent classification into a citation-based link prediction task on graphs. We then introduced a set of engineered graph-based and citation-based features. Built on top of these features, we introduced a graph-based multi-hop reasoning approach for the newly introduced task. Our approach achieves 62.16 and 59.81 macro F1 scores in the transductive and inductive settings, respectively. The experimental results in the inductive setting further showcase the robustness of the proposed approach in an information-deprived, out-of-distribution environment. Compared to NLP-based models, we reached performance comparable to, and in some cases better than, the word embedding-based methods that rely on contextual sentences to make predictions. Moreover, with a composite model comprising our method as the graph encoder and the state-of-the-art NLP-based model as the text encoder, we outperformed all the other models we experimented with. These results further signify the strong signal in relational information and highlight the importance of future analysis and studies in this domain. Finally, our presented analyses further support our methodological choices.

For future work, one straightforward idea is to extend the knowledge graph with more scholarly information, such as authors, venues, and fields of study. There already exist open repositories, such as OpenAlex [15] and the Microsoft Academic Graph (MAG) [25], that contain this information. Another direction is further investigation into temporal signals. Last but not least, although we achieved improved performance through a fusion of textual and structural information, more investigation and analysis could be done in this setting in future work.

Acknowledgments

This work was funded by the Defense Advanced Research Projects Agency with award W911NF-19-20271 and with support from a Keston Exploratory Research Award.

References

[1] Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 3615–3620. https://doi.org/10.18653/v1/D19-1371
[2] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Durán, Jason Weston, and Oksana Yakhnenko.
2013. Translating Embeddings for Modeling Multi-Relational Data. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (Lake Tahoe, Nevada) (NIPS'13). Curran Associates Inc., Red Hook, NY, USA, 2787–2795.
[3] Lutz Bornmann and Hans-Dieter Daniel. 2008. What do citation counts measure? A review of studies on citing behavior. J. Documentation 64 (2008), 45–80.
[4] Arman Cohan, Waleed Ammar, Madeleine van Zuylen, and Field Cady. 2019. Structural Scaffolds for Citation Intent Classification in Scientific Publications. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 3586–3596. https://doi.org/10.18653/v1/N19-1361
[5] Arman Cohan and Nazli Goharian. 2015. Scientific Article Summarization Using Citation-Context and Article's Discourse Structure. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, 390–400. https://doi.org/10.18653/v1/D15-1045
[6] Daniel Cummings and Marcel Nassar. 2020. Structured Citation Trend Prediction Using Graph Neural Networks. In ICASSP. IEEE, Barcelona, Spain, 3897–3901. http://dblp.uni-trier.de/db/conf/icassp/icassp2020.html#CummingsN20
[7] M.A. Garzone. 1997. Automated Classification of Citations Using Linguistic Semantic Grammars. Thesis (M.Sc.), University of Western Ontario, London, Canada. https://books.google.com/books?id=V-bwSgAACAAJ
[8] William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems (Long Beach, California, USA) (NIPS'17). Curran Associates Inc., Red Hook, NY, USA, 1025–1035.
[9] Myriam Hernández-Alvarez and José M Gomez. 2016. Survey about citation context analysis: Tasks, techniques, and resources. Natural Language Engineering 22, 3 (2016), 327–349.
[10] David Jurgens, Srijan Kumar, Raine Hoover, Dan McFarland, and Dan Jurafsky. 2018. Measuring the Evolution of a Scientific Field through Citation Frames. Transactions of the Association for Computational Linguistics 6 (2018), 391–406. https://doi.org/10.1162/tacl_a_00028
[11] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the 5th International Conference on Learning Representations (ICLR '17). OpenReview.net, Palais des Congrès Neptune, Toulon, France, 14 pages. https://openreview.net/forum?id=SJU4ayYgl
[12] Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. 2004. Estimating mutual information. Physical Review E 69, 6 (2004), 066138.
[13] Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld. 2020. S2ORC: The Semantic Scholar Open Research Corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 4969–4983. https://doi.org/10.18653/v1/2020.acl-main.447
[14] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic Differentiation in PyTorch. In NIPS 2017 Workshop on Autodiff. OpenReview.net, Long Beach, California, USA, 4 pages. https://openreview.net/forum?id=BJJsrmfCZ
[15] Jason Priem, Heather Piwowar, and Richard Orr. 2022. OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. arXiv preprint arXiv:2205.01833 (2022), 5 pages.
[16] Anna Ritchie. 2009. Citation context analysis for information retrieval. Technical Report. University of Cambridge, Computer Laboratory.
[17] Muhammad Roman, Abdul Shahid, Shafiullah Khan, Anis Koubaa, and Lisu Yu. 2021. Citation intent classification using word embedding. IEEE Access 9 (2021), 9982–9995.
[18] Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling Relational Data with Graph Convolutional Networks. In The Semantic Web, Aldo Gangemi, Roberto Navigli, Maria-Esther Vidal, Pascal Hitzler, Raphaël Troncy, Laura Hollink, Anna Tordai, and Mehwish Alam (Eds.). Springer International Publishing, Cham, 593–607.
[19] Henry Small. 2018. Characterizing highly cited method and non-method papers using citation contexts: The role of uncertainty. Journal of Informetrics 12, 2 (2018), 461–480.
[20] Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. 2019. RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space. In ICLR (Poster). OpenReview.net, New Orleans, LA, 18 pages. http://dblp.uni-trier.de/db/conf/iclr/iclr2019.html#SunDNT19
[21] Simone Teufel, Advaith Siddharthan, and Dan Tidhar. 2006. An annotation scheme for citation function. In Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue. Association for Computational Linguistics, Sydney, Australia, 80–87. https://aclanthology.org/W06-1312
[22] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex Embeddings for Simple Link Prediction. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48 (ICML'16). JMLR.org, New York, NY, USA, 2071–2080.
[23] Marco Valenzuela, Vu Ha, and Oren Etzioni. 2015. Identifying Meaningful Citations. In Scholarly Big Data: AI Perspectives, Challenges, and Ideas, Papers from the 2015 AAAI Workshop (Technical Report, WS-15-13), Cornelia Caragea, C. Lee Giles, Narayan Bhamidipati, Doina Caragea, Sujatha Das Gollapalli, Saurabh Kataria, Huan Liu, and Feng Xia (Eds.). AAAI Press, Menlo Park, CA, 21–26. http://www.aaai.org/Library/Workshops/ws15-13.php
[24] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing Data using t-SNE. Journal of Machine Learning Research 9, 86 (2008), 2579–2605. http://jmlr.org/papers/v9/vandermaaten08a.html
[25] Kuansan Wang, Zhihong Shen, Chiyuan Huang, Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia. 2020. Microsoft Academic Graph: When experts are not enough. Quantitative Science Studies 1, 1 (2020), 396–413.
[26] Wenhao Yu, Mengxia Yu, Tong Zhao, and Meng Jiang. 2020. Identifying Referential Intention with Heterogeneous Contexts. In Proceedings of The Web Conference 2020 (Taipei, Taiwan) (WWW '20). Association for Computing Machinery, New York, NY, USA, 962–972. https://doi.org/10.1145/3366423.3380175
[27] Da Zheng, Xiang Song, Chao Ma, Zeyuan Tan, Zihao Ye, Jin Dong, Hao Xiong, Zheng Zhang, and George Karypis. 2020. DGL-KE: Training Knowledge Graph Embeddings at Scale. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20). Association for Computing Machinery, New York, NY, USA, 739–748.

A. Hyperparameters

A.1. Knowledge Graph Embedding

We use a randomized search to tune our models and find near-optimal hyperparameters using the following ranges: embedding dimension ∈ {50, 100, 200}, learning rate ∈ {0.03, 0.1, 0.3}, regularization coefficient ∈ {0.0, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5}, number of negative samples ∈ {64, 128, 256, 512, 1024}, 𝛼 ∈ {0.25, 0.5, 1}, and 𝛾 ∈ {6, 12, 24}. Note that 𝛼 and 𝛾 are the adversarial temperature and the margin value (RotatE-only), respectively.

Table 4: Hyperparameters of KGE algorithms.

Hyperparameter               TransE   ComplEx   RotatE
embedding dimension             100       100       50
learning rate                   0.1       0.3      0.1
regularization coefficient     1e-6      1e-6     1e-6
negative samples size           128       512       64
𝛼                                 0      0.25        1
𝛾                                 -         -        6

A.2. Multilayer Perceptron

To simplify the model tuning process, we find the optimal hyperparameters of "ComplEx + MLP" on SciCiteorigin using grid search and reuse them for the rest of our experiments. Specifically, we run a grid search over the following ranges: number of layers ∈ {0, 1, 2, 3}, dropout ∈ {0, 0.1, 0.2, 0.3, 0.4, 0.5}, and dimension ∈ {32, 64, 128}. The optimal hyperparameters are as follows: number of layers = 2, dropout = 0.2, and dimension = [64, 32]. We use ReLU as the activation function for all layers.

A.3. Multi-Hop Link Prediction

We run a grid search over the following ranges: number of layers ∈ {0, 1, 2, 3}, dimension ∈ {10, 50, 100, 200}, and learning rate ∈ {0.03, 0.01, 0.003, 0.001}. The optimal hyperparameters are as follows: number of layers = 1, dimension = 100, and learning rate = 0.01. We use Adam as the optimizer throughout the tuning process.