=Paper=
{{Paper
|id=Vol-3656/paper8
|storemode=property
|title=Citation Intent Classification Through Weakly Supervised Knowledge Graphs
|pdfUrl=https://ceur-ws.org/Vol-3656/paper8.pdf
|volume=Vol-3656
|authors=Xinwei Du,Kian Ahrabian,Arun Baalaaji Sankar Ananthan,Richard Delwin Myloth,Jay Pujara
|dblpUrl=https://dblp.org/rec/conf/aaai/DuAAMP23
}}
==Citation Intent Classification Through Weakly Supervised Knowledge Graphs==
Xinwei Du¹·², Kian Ahrabian¹·²·*, Arun Baalaaji Sankar Ananthan¹·², Richard Delwin Myloth¹·² and Jay Pujara¹·²

¹ Information Sciences Institute, Marina del Rey, CA, USA
² University of Southern California, Los Angeles, CA, USA
Abstract
Citations are scientists’ tools for grounding their innovations and findings in the existing collective knowledge. They are used
for semantically distinct purposes as scientists utilize them at different parts of their work to convey specific information. As
a result, a crucial aspect of scientific document understanding is recognizing the authorial intent associated with citations.
Current state-of-the-art methods rely on contextual sentences surrounding each citation to classify the intent. However,
in the absence of textual content, these approaches become unusable. In this work, we propose a text-free citation intent
classification method built on relational information among scholarly works. To this end, we introduce a
large-scale knowledge graph built from the publications in the SciCite dataset and their multi-hop neighborhood extracted
from The Semantic Scholar Open Research Corpus (S2ORC). We also augment this knowledge graph by adding weakly-labeled
links based on the intent information available in the S2ORC. Finally, we cast the intent classification task as a link prediction
problem on the newly created knowledge graph. We study this problem in both transductive and inductive settings. Our
experimental results show that we can achieve a comparable macro F1 score to word embedding content-based methods by
only relying on features and relations derived from this knowledge graph. Specifically, we achieve macro F1 scores of 62.16
and 59.81 in the transductive and inductive settings, respectively, on the link-level SciCite dataset. Moreover, by combining
our method with the state-of-the-art NLP-based model, we achieve improvements across all metrics.
Keywords
Citation Intent Classification, Knowledge Graphs, Graph Neural Networks, Large Language Models, Weakly Supervised Learning
1. Introduction

Citations are the primary way of identifying past contributions and connecting progress in new publications to existing literature. Nevertheless, not all citations indicate the same meaning. Authors use citations sparingly, with specific intent behind them. For example, some papers are cited for providing background information in a domain, while others are cited when adopting or adapting a previously-used methodology. There are also scenarios where the same paper is used as background information and methodology use-case in different contexts simultaneously. Understanding citation intent is crucial to studying scholarly works, given the universality of using citations. Current state-of-the-art citation intent classification models [17, 1, 4] rely heavily on textual information, e.g., the sentences surrounding the citation. However, such information is expensive to obtain and in some scenarios inaccessible altogether. Consequently, we need models that can operate without access to textual information. Previous works [3, 26, 6] have shown the importance of relational and structural information available in links among publications for various tasks. In this work, we propose a general citation intent classification method that relies purely on structural information.

Besides helping researchers better understand the relationship among publications, citation intent analysis has been used for studying various other aspects of scientific works, such as research domain evolution [10], scientific impact analysis [19], scientific document summarization [5], and retrieving related scientific works [16]. The three main categories of citations are "Result," "Method," and "Background" [4]. These categories describe the reasons behind making a scientific connection, i.e., referencing a publication in another publication. Classifying citations into these groups has traditionally required a high level of expertise in the respective scientific domains. This constraint, combined with the high cost of expert human labor, has resulted in highly scarce datasets, which makes the task even more difficult.
The Third AAAI Workshop on Scientific Document Understanding 2023, February 14th, 2023, Washington, DC, USA
* Corresponding author.
xinweidu@usc.edu (X. Du); ahrabian@usc.edu (K. Ahrabian); arunbaal@usc.edu (A. B. S. Ananthan); myloth@usc.edu (R. D. Myloth); jpujara@usc.edu (J. Pujara)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

Previous works have proposed classifying citation intent through feature engineering-based [10] and representation learning-based [1] methods. However, most of these methods depend on textual information. As a result, they require a complex multi-stage pipeline of parsing documents, identifying citation contexts, and predicting citation intent [13]. Besides being prone to error propagation from various pipeline stages, the use of these models is limited to situations where the full text is available in a proper format. This work introduces a pure graph-based approach to classifying citation intent. We extend the existing SciCite dataset with 2-hop neighborhoods extracted from The Semantic Scholar Open Research Corpus (S2ORC). To further enrich the graph, we utilize the intent information provided in the S2ORC to create a weakly supervised knowledge graph (KG) consisting of the publications and the relations that match the provided intents. Our main idea is to use contextualized relational patterns to make predictions, obviating the need for textual context. Given the newly built KG, we cast the intent classification problem into the common link prediction problem on KGs. Specifically, we train a model to learn representations for entities and relations. Using these representations, we run the following query on the KG: (s, ?, o), where s cites o.

Figure 1: Overview of the extracted multi-hop KG. The set of 0-hop nodes 𝒱_0 includes all the orange nodes. The set of 1-hop nodes 𝒱_1 includes all the orange and blue nodes. Similarly, the graph could be expanded to include k-hop nodes 𝒱_k. The annotated set on each edge represents that specific link's intent. Specifically, the empty set denotes that the citation link has no intent label.
Converting this problem into a link prediction task allows us to adapt and extend widely used KG embedding models. We study the link prediction problem in both transductive and inductive settings. Our experimental results show that although our KG-based method underperforms compared to the large language model-based approaches, it is comparable or even superior to the word embedding-based methods. Moreover, our experiments with combining the NLP-based and graph-based methods show slight improvements over the current state-of-the-art model. These findings further signify the importance of relational patterns for citation intent classification.

The contributions of this work are as follows:

1. Extending the SciCite dataset using the S2ORC dataset to generate a large-scale weakly supervised KG.
2. Introducing a novel graph-based approach for citation intent classification built on top of the newly built KG.
3. Presenting benchmarks for both transductive and inductive settings.
4. Presenting analyses of the effect of different parts of the methodology, such as weak supervision and feature engineering.

2. Related Work

2.1. Citation Function/Intent Schemes

Many prior works have studied the problem of creating categorical schemes for citation intent, which in some works is referred to as citation function [9]. Earlier works were focused on creating more fine-grained categories, going as far as defining 35 [7] and 12 [21] fine-grained schemes for scientific arguments. More recent works, however, have focused on creating more concise categories. For example, ACL-ARC [10] proposes a 6-class intent categorization scheme: Background, Motivation, Uses, Extension, Comparison or Contrast, and Future. SciCite [4] is even more restrictive and drops or combines small fine-grained classes to provide a more concise 3-class annotation scheme: Background, Method, and Result.

2.2. Citation Intent Classification Methods

Before the explosion of deep learning approaches, most methods relied on a combination of hand-crafted features and classic machine learning models. For example, in one instance [23], the authors propose 12 different features, including citation count, PageRank value, and author overlap, and use classic machine learning models such as SVM and Random Forest for classification. In another instance [10], the authors define pattern-based, topic-based, and prototypical argument features and use SVM to make predictions.

With the advent of deep learning models and the emergence of large language models in recent years, representation learning-based methods have outperformed the hand-crafted methods, achieving higher accuracy by considering the textual information. Recent works have proposed the use of structural scaffolds [4], BERT-based models trained on the scientific corpus (SciBERT) [1], word embedding-based approaches [17], and creating a heterogeneous context graph based on an academic network [26].

2.3. Knowledge Graph Embedding Models

KGs are structured information repositories consisting of a set of nodes representing entities and a set of typed edges representing relations. Since, in most cases, the KG nodes and edges are not attributed, KG embedding (KGE) models aim to learn low-dimensional representations for all entities and relations. The most common traditional shallow KGE methods are TransE [2], ComplEx [22], and RotatE [20]. More recent GNN-based KGE methods leverage the message-passing scheme of GNNs, enabling more complex multi-hop reasoning. Examples of these methods are GCN [11], which leverages the spectral information for information propagation but is limited to mono-relational KGs, R-GCN [18], which extends GCN to support multi-relational KGs, and GraphSAGE [8], which introduces an inductive framework to handle unseen nodes.

3. Dataset

The SciCite dataset focuses on individual citation links and ignores the significance of broader relational connections and features. To overcome this issue, we construct a knowledge graph by mapping each entity in the SciCite dataset to the S2ORC and adding their 2-hop citation neighborhoods. The S2ORC contains more than 206 million publications and 2.49 billion citation links. Apart from the regular citation links, this corpus provides partial intent labels for citations using a 3-class scheme as follows:

1. Background: Describe a problem, topic, or concept
2. Method: Provide a method, tool, or dataset
3. Result: To make a comparison

Moreover, the SciCite dataset is tailored for sentence classification methods, where input features are textual excerpts and the output labels are citation intents. We reformulate this task as link prediction on KGs, where the input features are a representation of the source (citing) paper and the target (cited) paper, and the output is the label of a citation link between the source and target. We release all our datasets under a CC-BY-SA license at TBD

Table 1
Statistics of the SciCite dataset and the reconstructed datasets.

| Dataset | SciCite | SciCite_origin | SciCite_resplit |
|---|---|---|---|
| Level | Sentence | Link | Link |
| # Samples | 11,020 | 10,379 | 5,766 |
| # Train | 8,243 | 7,602 | 4,122 |
| # Validation | 916 | 916 | 822 |
| # Test | 1,861 | 1,861 | 822 |

3.1. Entity Mapping

We first map each paper in the SciCite dataset to the S2ORC by matching SciCite's IDs to Semantic Scholar's SHA IDs. Since a publication could have many SHA IDs and only one Corpus ID, we then map each SHA ID to the unique Corpus ID to extract unique entities. From the 13,080 papers with unique IDs in SciCite, we successfully map 13,019 of them to valid SHA IDs in Semantic Scholar, while the remaining 61 papers do not have any corresponding records. We believe this is due to publication removals, as the SciCite dataset was created from the S2ORC in 2019. After converting SHA IDs to Corpus IDs, we end up with 13,011 unique entities and 8 duplicate entities.

3.2. Dataset Splitting

The original SciCite dataset contains 11,020 human-labeled samples. Hence, to adapt it to our link prediction setting, we reconstruct two datasets: SciCite_origin and SciCite_resplit. SciCite_origin adheres to the same benchmarks reported in prior works but is modified to remove overlapping citation links in the training and test sets. To maximize usage of the training data while removing artifacts, we create SciCite_resplit, which performs additional cleaning, provides a stronger separation of training and test sets, and avoids multi-intent citations. Table 1 showcases the statistics of these datasets.

SciCite_origin:
To make methods comparable, we use the same validation and test sets as SciCite for this dataset and try to keep the training set as close as possible. We convert each publication in the SciCite dataset to a Semantic Scholar entity using the mapped Corpus IDs and drop the contextual sentence-level information. We assign a random unique ID to publications without a Corpus ID. After this procedure, we end up with a set of links for our link prediction task.

Due to the removal of the contextual information, some of the training links appear exactly the same in the test set. Hence, we remove 641 training set samples that also appear in the test set to prevent data leakage. Moreover, since only one link in the test set has multiple intents, we treat the link prediction problem as a multi-class task rather than a multi-label task. In this scenario, the multi-intent links are represented as separate samples with the same inputs and different outputs.
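The leakage-removal step can be sketched as follows. This is a minimal illustration, assuming each link is a (source, target, intent) triple after the contextual sentences are dropped; the function name and data layout are ours, not the authors'.

```python
# Sketch of the dedup step: once the contextual sentences are dropped,
# identical (source, target) pairs can appear in both train and test,
# so such training links are removed to prevent data leakage.

def remove_leaked_links(train, test):
    """Drop training links whose (source, target) pair also occurs in test."""
    test_pairs = {(s, t) for s, t, _ in test}
    return [link for link in train if (link[0], link[1]) not in test_pairs]

train = [(1, 2, "method"), (3, 4, "background"), (5, 6, "result")]
test = [(3, 4, "background")]
clean = remove_leaked_links(train, test)  # the (3, 4) link is removed
```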
Multi-label methods may be a promising future extension of our work.

SciCite_resplit:
Even though we convert the SciCite dataset to SciCite_origin, problems such as duplicate citations and multi-label links still exist. Therefore, we further tailor the SciCite dataset to create a better link prediction dataset for graph-based models. First, we remove all the entities, and their related samples, that do not have a mapped Corpus ID. Then, similar to SciCite_origin, we convert the remaining samples to a set of links. Following this, we drop all duplicate samples. Among the remaining 6,458 unique links, 5,886 have only one intent, 489 have two intents, and 83 have all three intents. We remove all the multi-intent links and resplit the dataset with ratios of 70%/15%/15% for the training, validation, and test sets, respectively.

4. Method

Throughout the rest of this work, for simplicity, we use the term publication to denote all types of academic publications, e.g., books and papers. Moreover, we use the terms citation and reference to denote incoming and outgoing links, respectively.

4.1. Weak Supervision

In order to enrich our data and provide more information to the models, we extract the set of intents provided in the S2ORC dataset for each citation link. The intent labels in S2ORC are extracted using the structural scaffolds model [4] at a sentence level. In this scenario, we implicitly use the existing data derived from the content for bootstrapping our approach. We refer to these links as weakly labeled due to being labeled by a noisy model rather than a human expert. Since the intent labels are partial at a sentence level, citation links could have zero intents in the absence of text or several intents in an abundance of use cases.

4.2. Knowledge Graph Construction

Given the S2ORC dataset, we expand the SciCite dataset using the mapped entities to construct a KG containing 2-hop neighborhoods of the publications. Figure 1 illustrates an overview of the expanded KG. This work uses the 2022-09-13 version of the corpus downloaded from the bulk API. Formally, given the set of mapped entities 𝒱_0, the set of k-hop nodes 𝒱_k is defined as

𝒱_k = 𝒱_{k−1} ∪ { y | ∃x ∈ 𝒱_{k−1} : y ∈ 𝒩_x }    (1)

where, for a given entity x, 𝒩_x denotes all the entities that cite or are cited by x, i.e., the set of neighboring entities. Given the sets of unlabeled links 𝒰 and weakly labeled links ℒ, the set of k-hop edges ℰ_k is defined as

ℰ_k^𝒰 = { (x, y, UNK) | x, y ∈ 𝒱_k, (x, y) ∈ 𝒰 }    (2)

ℰ_k^ℒ = ∪_r { (x, y, r) | x, y ∈ 𝒱_k, (x, y) ∈ ℒ_r }    (3)

ℰ_k = ℰ_k^𝒰 ∪ ℰ_k^ℒ    (4)

where r ∈ {Background, Method, Result} and ℒ_r denotes the set of all weakly labeled links with label r. Consequently, given the sets of k-hop nodes 𝒱_k and edges ℰ_k, the extracted k-hop KG, 𝒢_k, is defined as

𝒢_k = (𝒱_k, ℰ_k)    (5)

The specific statistics of the extracted KGs and the original Semantic Scholar corpus are reported in Table 2. Since not every link has a weakly labeled intent, this table also provides the percentage of weakly labeled links for each corresponding graph. Although we extract 𝒢_2, given its scale, we opt to run our current experiments only on 𝒢_1 and leave the larger-scale experiments for future works.

Table 2
Statistics of the extracted KGs along with the original S2ORC dataset.

| Dataset | # Nodes | # Citation Links | # Background | # Method | # Result | Weak Labels |
|---|---|---|---|---|---|---|
| Zero-Hop (𝒢_0) | 13,011 | 10,733 | 5,479 | 4,403 | 1,335 | 79.04% |
| One-Hop (𝒢_1) | 5,862,261 | 119,776,090 | 39,202,086 | 16,830,665 | 16,830,665 | 43.18% |
| Two-Hop (𝒢_2) | 57,535,880 | 1,621,293,902 | 467,860,523 | 121,877,053 | 35,283,718 | 34.41% |
| S2ORC | 206,159,629 | 2,495,513,737 | 643,955,457 | 169,472,164 | 45,779,793 | 31.90% |

4.3. Feature Engineering

Since none of the publications in our KGs have any features or pre-defined representations, we propose to represent them through their references, citations, and graph-based features. More specifically, from the S2ORC we extract the in-degrees and out-degrees of citations (or references), background links, method links, and result links. As a result, each paper is represented with an 8-dimensional feature vector, 4 for each in-degree and out-degree feature.
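The degree-based features above can be sketched as follows. This is a minimal sketch assuming the KG is available as (source, target, label) triples, with label None for links that have no weak intent label; the function name and slot layout are ours.

```python
from collections import defaultdict

# Sketch of the 8-dimensional feature vector: for each publication we count
# the out-degree and in-degree over all citation links plus over each of the
# three weakly labeled intents.
LABELS = ("background", "method", "result")

def build_features(edges):
    # Slots: [out_all, out_bg, out_method, out_result,
    #         in_all,  in_bg,  in_method,  in_result]
    feats = defaultdict(lambda: [0.0] * 8)
    for src, dst, label in edges:
        feats[src][0] += 1          # out-degree over all citation links
        feats[dst][4] += 1          # in-degree over all citation links
        if label in LABELS:
            i = 1 + LABELS.index(label)
            feats[src][i] += 1      # out-degree for this weak intent
            feats[dst][4 + i] += 1  # in-degree for this weak intent
    return dict(feats)
```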
Table 3
Intent classification results on the SciCite_origin and SciCite_resplit datasets. The first block of four metrics is on SciCite_origin and the second on SciCite_resplit. All metrics are macro-averaged. Bold values represent the highest performance within the metric and dataset scope.

| Method | Setting | Accuracy | Precision | Recall | F1 | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|---|
| Random | Universal | 33.05 | 33.05 | 33.83 | 31.22 | 32.99 | 32.88 | 33.85 | 31.89 |
| Most Common | Universal | 53.57 | 17.86 | 33.33 | 23.26 | 42.63 | 14.21 | 33.33 | 19.93 |
| TransE | Transductive | 40.41 | 37.09 | 37.81 | 36.52 | 39.57 | 35.96 | 35.70 | 35.59 |
| ComplEx | Transductive | 49.01 | 44.11 | 37.94 | 33.30 | 40.25 | 41.85 | 35.64 | 28.78 |
| RotatE | Transductive | 23.54 | 32.97 | 32.74 | 22.98 | 28.12 | 36.88 | 36.31 | 27.88 |
| Random + MLP | Transductive | 49.60 | 30.58 | 35.17 | 32.42 | 45.35 | 30.26 | 35.83 | 32.78 |
| TransE + MLP | Transductive | 54.16 | 45.77 | 45.21 | 45.24 | 51.93 | 45.68 | 44.16 | 43.89 |
| ComplEx + MLP | Transductive | 55.72 | 47.80 | 45.19 | 44.77 | 48.64 | 43.46 | 43.15 | 43.24 |
| RotatE + MLP | Transductive | 56.37 | 48.79 | 46.15 | 46.55 | 51.81 | 46.92 | 45.46 | 45.63 |
| Infersent-KMeans | Universal | - | 58 | 64 | 60 | - | - | - | - |
| Infersent-HDBSCAN | Universal | - | 57 | 63 | 58 | - | - | - | - |
| Glove-KMeans | Universal | - | 51 | 56 | 51 | - | - | - | - |
| Glove-HDBSCAN | Universal | - | 52 | 57 | 52 | - | - | - | - |
| MHLP (Ours) | Transductive | 66.20 | 62.18 | 56.13 | 57.88 | 66.10 | 63.69 | 61.33 | 62.16 |
| MHLP (Ours) | Inductive | 63.94 | 58.36 | 55.05 | 56.13 | 64.17 | 59.86 | 59.83 | 59.81 |
| Structural Scaffolds | Universal | - | 84.7 | 83.6 | 84.0 | - | - | - | - |
| SciBERT | Universal | 86.94 | 85.30 | 85.92 | 85.58 | 86.39 | 85.51 | 85.14 | 85.28 |
| SciBERT + MHLP | Universal | 87.53 | 85.56 | 87.07 | 86.25 | 86.85 | 86.80 | 85.96 | 86.35 |

For the publications where the content is unavailable, the out-degree intent-based features will be zero, since those features are based on the noisy sentence-level model that Semantic Scholar uses. However, the in-degree features may not be zero as long as the citing paper's content is available. For the new publications, i.e., unseen nodes in the inductive setting, the only known non-zero feature is the reference count.

We normalize the reference and citation features by a biased log factor defined as

h̄_x = log_10(h_x + 1 + α)    (6)

where α is a bias hyperparameter. We specifically set α = −0.9 to get a normalized value of −1 for zero-reference and zero-citation situations.

Moreover, we normalize the non-zero in-degree intent-based features into a [0, 1] probability distribution as follows:

h̄_x = h_x / (h_Background + h_Method + h_Result)    (7)

The same normalization step is applied to the out-degree features separately.

4.4. Baselines

Knowledge Graph Embedding Models:
Traditional KGE models consist of two shallow embeddings as entity and relation encoders and a score function as a decoder to predict the likelihood of a link. These models are trained in a contrastive way by masking either one of the entities in a given triplet (head, relation, tail) and sampling a set of negative entities to contrast with the positive entity.

Since the traditional KGE methods rely on shallow embeddings for encoding entities and relations, they can only be used in the transductive setting and cannot operate on unseen nodes. For our experiments, we use the available implementations of TransE, ComplEx, and RotatE in the DGL-KE toolkit [27]. In the evaluation phase, we calculate the likelihood of all different relation types for each link and consider the relation with the highest likelihood as the model's intent prediction.

Hybrid Models:
To increase the reasoning power of the traditional KGE models, we devise a two-stage approach based on a multilayer perceptron (MLP). We first use the traditional KGE models to learn embeddings for entities and relations. Then, instead of relying on the produced likelihood scores, we concatenate the vectors of the two entities and pass them through an MLP to get logit values. Formally, given a link (u, v) and their respective learned representations (z_u, z_v), we calculate the logit values as

p = MLP([z_u ‖ z_v])    (8)

where p ∈ ℝ^|𝒞| contains the unnormalized logits for each class. The predicted class c is then calculated as

argmax_c sigmoid(p).    (9)

Natural Language Processing Models:
We include the reported results of several state-of-the-art Natural Language Processing (NLP) methods. Specifically, we include results from the word embedding-based methods Infersent-KMeans, Infersent-HDBSCAN, Glove-KMeans, and Glove-HDBSCAN [17], the BiLSTM-based method Structural Scaffolds [4], and the large language model-based method SciBERT [1]. Moreover, we report the results of fine-tuning a pre-trained SciBERT model on both datasets. All these methods use textual information and are evaluated on the SciCite dataset.

Figure 2: Overview of the composite model. The model consists of two encoders for the citation phrase and the citation graph around the citation link. During the training phase, we freeze the SciBERT model in the first two epochs as a warm-up step for the graph encoder; then, we jointly train both encoders along with the final prediction module.

4.5. Multi-Hop Link Prediction (MHLP)

Transductive and inductive settings are the most common link prediction evaluation schemes for KGs. The main difference between these two settings is having a fixed set of nodes in both the training and evaluation phases (transductive) versus allowing the addition of unseen nodes in the evaluation phase (inductive). This work refers to citation intent prediction on unseen publications as the inductive setting, whereas the transductive setting refers to citation intent prediction on already seen publications.

We propose an adaptable graph-based model for citation intent prediction in both the transductive and inductive settings. The primary basis of this approach is that a node, i.e., publication, could be represented as a combination of the neighboring nodes' representations. Let h_x^(0) be the extracted feature vector for any arbitrary node x. We calculate the representation of an arbitrary node v at layer l + 1 of a multilayer model as

h_{𝒩_v}^(l+1) = (1 / |𝒩_v|) Σ_{u ∈ 𝒩_v} h_u^(l)    (10)

h_v^(l+1) = σ(W^(l+1) [h_v^(l) ‖ h_{𝒩_v}^(l+1)])    (11)

where σ is a non-linear function. Throughout our experiments, we specifically use ReLU to introduce non-linearity. Given the node representations from an L-layer model and a link (u, v), we calculate the logit values as

p = MLP([h_u^(L) ‖ h_v^(L)])    (12)

where p ∈ ℝ^|𝒞| contains the unnormalized logits for each class and 𝒞 is the set of all classes. The predicted class c is then calculated as

argmax_c sigmoid(p).    (13)

The main disadvantage of the inductive setting is that the unseen nodes only have one available feature, i.e., the reference count. This absence of information makes the task extremely difficult, as the feature vectors are highly sparse. However, our model tries to diminish this effect by using the message-passing scheme, as defined in Equation 11, to aggregate information through connected entities, i.e., cited papers, creating a denser representation for the unseen nodes.

All our models are trained using the cross-entropy loss defined as

l_n = −log(exp(p_{y_n}) / Σ_{i=1}^{|𝒞|} exp(p_i))    (14)

where p_x is the logit value for class x given the prediction vector p.
Composite Model:
To further test the capabilities of our proposed model and use both structural and textual information, we devise a multi-modal model comprising encoders for both the graph structure and the citation context. Specifically, we use a pre-trained SciBERT model for encoding the citation phrase text and our MHLP model for encoding the citation graph around the citation link. Figure 2 illustrates an overview of the composite model.

5. Experiments

In this section, we report our experimental results on both the SciCite_origin and SciCite_resplit datasets. All the graph-based experiments are carried out on the 𝒢_1 KG. For the traditional KGE methods, we tune their hyperparameters as described in Appendix A.1 and train them using the hyperparameters showcased in Table 4. For the hybrid methods, the KGE component is first trained to generate node features using the hyperparameters described in Table 4. Then, the MLP component is trained using the procedure described in A.2 to predict the citation intent. For the MHLP-based methods, in both the transductive and inductive settings, we use a 1-layer variation on top of the normalized features extracted as described in Section 4.3. Moreover, we tune their hyperparameters and train them as described in Appendix A.3. For the SciBERT method, we freeze the pre-trained model and add an MLP module on top of the 768-dimensional [CLS] token output. Similar to the other models, the MLP module is tuned using the parameters described in A.2. For the composite model, during the training phase, we freeze the SciBERT model in the first two epochs as a warm-up step for the graph encoder; then, we jointly train both encoders along with the final prediction module.

To control for the effect of the pre-training using traditional KGE models, we also run a variation with randomly initialized node features and designate it as "Random + MLP." For the NLP models, we use the previously reported results [17] to compare our models on the test set-aligned SciCite_origin dataset. Finally, we also include the results from random and most common class predictions as sanity checks. All the models are implemented using PyTorch [14] and trained on a machine with a single Quadro RTX 8000 GPU, 72 CPU cores, and 768 GB of RAM. Implementations are available under a CC-BY-SA license at TBD.

Figure 3: The statistics of citation intent for all publications in the Semantic Scholar corpus: (a) the number of different citation intents; (b) the percentage of different citation intents. The temporal trends stay steady over time, suggesting a lack of information in the elapsed time from the time of publication to the time of citing.

5.1. Results

Table 3 illustrates our experimental results on both datasets. As evident from Table 3, the traditional KGE methods perform poorly on both datasets, only slightly beating the random baseline on the macro F1 metric. Interestingly, both ComplEx and RotatE perform worse than TransE on both datasets. This finding is surprising, as both ComplEx and RotatE are more expressive than TransE [20]. However, when combined with MLP models, all exhibit a significant performance boost, up to more than 100% in the case of RotatE. After this addition, we can see the expected expressivity trend in the model results, i.e., the more powerful the model, the better the result. Moreover, the control "Random + MLP" experiment showcases very similar results to the random baseline, indicating the importance of both components for the hybrid model to perform well. Altogether, it is evident that the reasoning power of shallow traditional KGE models is not enough to capture the complexity of this task, and we require models with more reasoning power.

As for the MHLP method, in the transductive setting, it achieves 57.88 and 62.16 macro F1 scores on the SciCite_origin and SciCite_resplit datasets, respectively. Moreover, its inductive results showcase the robustness of our approach in an extreme out-of-distribution setting, achieving 56.13 and 59.81 macro F1 scores. Compared to previously reported results [17], our model achieves superior performance to the Glove-based models while slightly lagging behind the Infersent-based models. Looking into the precision and recall comparison, our method has better precision scores in both the transductive and inductive settings compared to all word embedding-based models; however, for recall, it performs better than the Glove-based models and worse than the Infersent-based models, which might stem from the imbalance in the links, as illustrated by Figure 3a. Further experimentation to address the class imbalance problem in future works might help improve the overall performance of MHLP. The significance of these results is that we show structural and relational information could be used to achieve relatively high performance without using textual information. Moreover, although our models underperform compared to language model-based approaches such as Structural Scaffolds and SciBERT, we showcase interesting future directions for combining graph-based and NLP-based methods.

Finally, the composite model, denoted as SciBERT + MHLP in Table 3, achieves the best performance among all models, even beating the fine-tuned SciBERT. When considering MHLP's standalone performance, these results showcase the potential improvements that could be achieved through the use of structural information that is not available in citation phrases. The presented experiments are a stepping stone for better understanding and using the structural information at scale for citation intent classification.
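The two-epoch SciBERT warm-up used when training the composite model can be sketched as follows. This is a hypothetical trainer skeleton: `set_trainable`, `train_composite`, and `run_epoch` are our names, not the authors', and the actual training loop is not shown in the paper.

```python
# Sketch of the warm-up schedule: the text encoder stays frozen for the
# first two epochs while the graph encoder warms up, then all modules are
# trained jointly. Duck-typed against PyTorch-style modules exposing
# .parameters() with .requires_grad flags.

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def train_composite(text_encoder, epochs, run_epoch):
    for epoch in range(epochs):
        set_trainable(text_encoder, epoch >= 2)  # frozen during epochs 0-1
        run_epoch(epoch)  # hypothetical callback: one optimization epoch
```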
6. Analysis
6.1. Temporal Analysis
This analysis studies the relationship between the time that has passed since publication and citation intent. We hypothesize that a paper is more likely to be cited as "Result" or "Method" right after its publication and, as time passes, more likely to be cited as "Background." If this proves accurate, we could get a relatively strong signal from the temporal information for each citation. To test our hypothesis, we plotted the years after publication against intent counts and ratios for all papers in the Semantic Scholar corpus. Figures 3a and 3b illustrate the results of our analysis. As evident from these figures, and contrary to our original hypothesis, we find that the ratio of intent classes stays almost the same as time passes, with only insignificant fluctuations. As a result, using temporal information in our models is unlikely to provide any significant improvement. Note that these results are based on the weakly labeled links that we obtained from S2ORC; these links are generated by another noisy model that could potentially be biased. Hence, these results should not discourage further analysis or studies of temporal information for citation intent classification.

Figure 4: The calculated MI values for publication features and averaged neighborhood features. (a) Publication features (both sides); (b) Averaged neighborhood features (both sides). On average, the publication features show stronger connections to the target variable.

6.2. Mutual Information Analysis

In this analysis, we study the quality of the engineered features described in Section 4.3 with respect to the weakly labeled intent classes. To this end, we use the well-known mutual information (MI) [12] measure to quantify the importance of each feature. Formally, the MI between two discrete random variables 𝑋 and 𝑌 is defined as

𝐼(𝑋, 𝑌) = ∑_{𝑦∈𝒴} ∑_{𝑥∈𝒳} 𝑃_{𝑋,𝑌}(𝑥, 𝑦) log(𝑃_{𝑋,𝑌}(𝑥, 𝑦) / (𝑃_𝑋(𝑥) 𝑃_𝑌(𝑦)))    (15)

where 𝒴 is the value space for 𝑌, 𝒳 is the value space for 𝑋, 𝑃_{𝑋,𝑌} is the joint probability distribution, and 𝑃_𝑋 and 𝑃_𝑌 are the marginal probability distributions. Note that MI is non-negative, and higher values indicate a stronger dependence between the two random variables. For our analysis, we calculate MI for both sides of the 5,886 unique citation links in the SciCiteresplit dataset. Moreover, to study these features in the graph context, we also calculate MI for the average of these features over the neighborhood of each publication, i.e., all citing and cited publications, from both sides of the citation links. Figures 4a and 4b present the results of our experiments. As evident from these results, while the publication-averaged features generally show stronger connections to the target variable, the neighborhood-averaged features seem to show complementary connections, further emphasizing the importance of using both sets of features.

Figure 5: The t-SNE visualizations for the unnormalized and normalized features. (a) Features before normalization; (b) Features after normalization.

Figure 6: The macro F1 score of MHLP (Transductive) on the SciCiteorigin and SciCiteresplit datasets. (a) The percentage of utilized weak labels; (b) The percentage of corrupted data.

6.3. Feature Quality Analysis

In this analysis, we study the effect of normalization as described in Equations 6 and 7. To this end, we project the extracted features of the 5,886 unique citation links in the SciCiteresplit dataset to a 2-dimensional space using t-SNE [24]. Figures 5a and 5b illustrate the projected space for the unnormalized and normalized features, respectively. As evident from Figure 5a, it is challenging to distinguish different intent types in the unnormalized space. However, after normalization, as evident from Figure 5b, the "Method" intention more or less forms a distinguishable cluster. This result shows that the use of normalization is potentially helpful for the model. Further studies on different types of normalization and their effects are left for future work.
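As a concrete companion to Equation 15, the sketch below computes the MI of two discrete variables directly from co-occurrence counts (the standard plug-in estimate). It is an illustrative implementation of ours, not the paper's code; in practice, `xs` would be a discretized feature column and `ys` the weak intent labels.

```python
# Plug-in estimate of Equation 15: I(X, Y) for two discrete variables,
# with joint and marginal probabilities estimated from raw counts.
from collections import Counter
from math import log

def mutual_information(xs, ys):
    n = len(xs)
    p_xy = Counter(zip(xs, ys))  # joint counts over (x, y) pairs
    p_x = Counter(xs)            # marginal counts over x
    p_y = Counter(ys)            # marginal counts over y
    mi = 0.0
    for (x, y), c in p_xy.items():
        pxy = c / n
        # Each term: P(x, y) * log(P(x, y) / (P(x) * P(y)))
        mi += pxy * log(pxy / ((p_x[x] / n) * (p_y[y] / n)))
    return mi
```

For independent variables the estimate is 0; for identical variables it equals the entropy of the shared distribution.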
6.4. Robustness Analysis

In this analysis, we focus on studying the robustness of our proposed graph-based method. To this end, we devise two ablation studies. In the first study, we randomly corrupt a percentage of the weak labels by replacing the correct label with a random label. This study aims to better understand the model's resilience to noise. In the second study, we randomly remove a percentage of the weak labels. The idea of this study is to better understand the effect of weak supervision on the model's performance. These studies are carried out by running the MHLP method in the transductive setting on both the SciCiteorigin and SciCiteresplit datasets.

The feature vectors for the publications are calculated by counting the number of citations and intents. These vectors are then normalized using Equations 6 and 7. To analyze the relationship between the model's performance and the amount of available data, we create ten variations of the dataset by using only a portion of the available weak labels, varying from all of the available weak labels down to only 10% of them. Figure 6a presents the result of this study.

As evident from Figure 6a, the more weakly labeled links are available, the better our method performs. The other significant observation is the robustness of the model, even in the extreme scenario of having access to only 10% of the labels. Note that only 31.90% of the links in S2ORC have at least one weakly labeled intent, which means that even at a utilization percentage of 100%, only 31.90% of the citation links are weakly labeled.

Figure 6b showcases the relationship between the model's performance and the percentage of corrupted data. Following our intuition, the model's performance monotonically decreases as we add more noisy labels to the data. However, two interesting observations can be made from this figure. First, the performance of our method drops by less than five macro F1 points when half (50%) of the weak labels are replaced with randomly assigned noisy labels. This observation shows that the proposed method is exceptionally resilient to labeling mistakes. Second, even when all the labels are replaced with random ones (100%), the model still performs better than the random baselines. This observation indicates that the model is learning to make inferences based on purely structural information, which further solidifies our hypothesis regarding the importance of structural information.

7. Conclusions and Future Work

In this work, we first introduced an expansion to the SciCite dataset by extracting scholarly information from the S2ORC dataset and creating an extended citation graph. Then, we gathered a large-scale weakly labeled dataset to augment the extracted citation graph with citation intents and create a multi-relational knowledge graph. Following this, we adapted sentence-based intent classification into a citation-based link prediction task on graphs. We then introduced a set of engineered graph-based and citation-based features. Built on top of these features, we introduced a graph-based multi-hop reasoning approach for the newly introduced task. Our approach achieves 62.16 and 59.81 macro F1 scores in the transductive and inductive settings, respectively. The experimental results in the inductive setting further showcase the robustness of the proposed approach in an information-deprived, out-of-distribution environment. Compared to NLP-based models, we reached performance comparable to, and in some cases better than, the word embedding-based methods that rely on contextual sentences to make predictions. Moreover, with a composite model comprising our method as the graph encoder and the state-of-the-art NLP-based model as the text encoder, we outperformed all the other models we experimented with. These results further signify the strong signal in relational information and highlight the importance of future analysis and studies in this domain. Finally, our presented analyses further support our methodological choices.

For future work, one straightforward idea is to extend the knowledge graph with more scholarly information, such as authors, venues, and fields of study. There already exist open repositories, such as OpenAlex [15] and the Microsoft Academic Graph (MAG) [25], that contain this information. Another direction is further investigation into temporal signals. Last but not least, although we achieved improved performance through a fusion of textual and structural information, more investigation and analysis could be done in this setting in future work.

Acknowledgments

This work was funded by the Defense Advanced Research Projects Agency with award W911NF-19-20271 and with support from a Keston Exploratory Research Award.

References

[1] Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 3615–3620. https://doi.org/10.18653/v1/D19-1371
[2] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Durán, Jason Weston, and Oksana Yakhnenko.
2013. Translating Embeddings for Modeling Multi-Relational Data. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (Lake Tahoe, Nevada) (NIPS'13). Curran Associates Inc., Red Hook, NY, USA, 2787–2795.
[3] Lutz Bornmann and Hans-Dieter Daniel. 2008. What do citation counts measure? A review of studies on citing behavior. J. Documentation 64 (2008), 45–80.
[4] Arman Cohan, Waleed Ammar, Madeleine van Zuylen, and Field Cady. 2019. Structural Scaffolds for Citation Intent Classification in Scientific Publications. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 3586–3596. https://doi.org/10.18653/v1/N19-1361
[5] Arman Cohan and Nazli Goharian. 2015. Scientific Article Summarization Using Citation-Context and Article's Discourse Structure. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, 390–400. https://doi.org/10.18653/v1/D15-1045
[6] Daniel Cummings and Marcel Nassar. 2020. Structured Citation Trend Prediction Using Graph Neural Networks. In ICASSP. IEEE, Barcelona, Spain, 3897–3901. http://dblp.uni-trier.de/db/conf/icassp/icassp2020.html#CummingsN20
[7] M.A. Garzone. 1997. Automated Classification of Citations Using Linguistic Semantic Grammars. Thesis (M.Sc.), University of Western Ontario, London, Canada. https://books.google.com/books?id=V-bwSgAACAAJ
[8] William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems (Long Beach, California, USA) (NIPS'17). Curran Associates Inc., Red Hook, NY, USA, 1025–1035.
[9] Myriam Hernández-Alvarez and José M Gomez. 2016. Survey about citation context analysis: Tasks, techniques, and resources. Natural Language Engineering 22, 3 (2016), 327–349.
[10] David Jurgens, Srijan Kumar, Raine Hoover, Dan McFarland, and Dan Jurafsky. 2018. Measuring the Evolution of a Scientific Field through Citation Frames. Transactions of the Association for Computational Linguistics 6 (2018), 391–406. https://doi.org/10.1162/tacl_a_00028
[11] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the 5th International Conference on Learning Representations (ICLR '17). OpenReview.net, Palais des Congrès Neptune, Toulon, France, 14 pages. https://openreview.net/forum?id=SJU4ayYgl
[12] Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. 2004. Estimating mutual information. Physical Review E 69, 6 (2004), 066138.
[13] Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld. 2020. S2ORC: The Semantic Scholar Open Research Corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 4969–4983. https://doi.org/10.18653/v1/2020.acl-main.447
[14] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic Differentiation in PyTorch. In NIPS 2017 Workshop on Autodiff. OpenReview.net, Long Beach, California, USA, 4 pages. https://openreview.net/forum?id=BJJsrmfCZ
[15] Jason Priem, Heather Piwowar, and Richard Orr. 2022. OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. arXiv preprint arXiv:2205.01833 (2022), 5 pages.
[16] Anna Ritchie. 2009. Citation context analysis for information retrieval. Technical Report. University of Cambridge, Computer Laboratory.
[17] Muhammad Roman, Abdul Shahid, Shafiullah Khan, Anis Koubaa, and Lisu Yu. 2021. Citation intent classification using word embedding. IEEE Access 9 (2021), 9982–9995.
[18] Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling Relational Data with Graph Convolutional Networks. In The Semantic Web, Aldo Gangemi, Roberto Navigli, Maria-Esther Vidal, Pascal Hitzler, Raphaël Troncy, Laura Hollink, Anna Tordai, and Mehwish Alam (Eds.). Springer International Publishing, Cham, 593–607.
[19] Henry Small. 2018. Characterizing highly cited method and non-method papers using citation contexts: The role of uncertainty. Journal of Informetrics 12, 2 (2018), 461–480.
[20] Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. 2019. RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space. In ICLR (Poster). OpenReview.net, New Orleans, LA, 18 pages. http://dblp.uni-trier.de/db/conf/iclr/iclr2019.html#SunDNT19
[21] Simone Teufel, Advaith Siddharthan, and Dan Tidhar. 2006. An annotation scheme for citation function. In Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue. Association for Computational Linguistics, Sydney, Australia, 80–87. https://aclanthology.org/W06-1312
[22] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex Embeddings for Simple Link Prediction. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48 (ICML'16). JMLR.org, New York, NY, USA, 2071–2080.
[23] Marco Valenzuela, Vu Ha, and Oren Etzioni. 2015. Identifying Meaningful Citations. In Scholarly Big Data: AI Perspectives, Challenges, and Ideas, Papers from the 2015 AAAI Workshop (Technical Report, WS-15-13), Cornelia Caragea, C. Lee Giles, Narayan Bhamidipati, Doina Caragea, Sujatha Das Gollapalli, Saurabh Kataria, Huan Liu, and Feng Xia (Eds.). AAAI Press, Menlo Park, CA, 21–26. http://www.aaai.org/Library/Workshops/ws15-13.php
[24] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing Data using t-SNE. Journal of Machine Learning Research 9, 86 (2008), 2579–2605. http://jmlr.org/papers/v9/vandermaaten08a.html
[25] Kuansan Wang, Zhihong Shen, Chiyuan Huang, Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia. 2020. Microsoft Academic Graph: When experts are not enough. Quantitative Science Studies 1, 1 (2020), 396–413.
[26] Wenhao Yu, Mengxia Yu, Tong Zhao, and Meng Jiang. 2020. Identifying Referential Intention with Heterogeneous Contexts. In Proceedings of The Web Conference 2020 (Taipei, Taiwan) (WWW '20). Association for Computing Machinery, New York, NY, USA, 962–972. https://doi.org/10.1145/3366423.3380175
[27] Da Zheng, Xiang Song, Chao Ma, Zeyuan Tan, Zihao Ye, Jin Dong, Hao Xiong, Zheng Zhang, and George Karypis. 2020. DGL-KE: Training Knowledge Graph Embeddings at Scale. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20). Association for Computing Machinery, New York, NY, USA, 739–748.

A. Hyperparameters

A.1. Knowledge Graph Embedding

We use a randomized search to tune our models and find near-optimal hyperparameters using the following ranges: embedding dimension ∈ {50, 100, 200}, learning rate ∈ {0.03, 0.1, 0.3}, regularization coefficient ∈ {0.0, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5}, number of negative samples ∈ {64, 128, 256, 512, 1024}, 𝛼 ∈ {0.25, 0.5, 1}, and 𝛾 ∈ {6, 12, 24}. Note that 𝛼 and 𝛾 are the adversarial temperature and the margin value (RotatE-only), respectively.

Table 4: Hyperparameters of KGE algorithms.

Hyperparameter               TransE   ComplEx   RotatE
embedding dimension             100       100       50
learning rate                   0.1       0.3      0.1
regularization coefficient     1e-6      1e-6     1e-6
negative samples size           128       512       64
𝛼                                 0      0.25        1
𝛾                                 -         -        6

A.2. Multilayer Perceptron

To simplify the model tuning process, we find the optimal hyperparameters of "ComplEx + MLP" on SciCiteorigin using grid search and reuse them for the rest of our experiments. Specifically, we run a grid search over the following ranges: number of layers ∈ {0, 1, 2, 3}, dropout ∈ {0, 0.1, 0.2, 0.3, 0.4, 0.5}, and dimension ∈ {32, 64, 128}. The optimal hyperparameters are as follows: number of layers = 2, dropout = 0.2, and dimension = [64, 32]. We use ReLU as the activation function for all layers.

A.3. Multi-Hop Link Prediction

We run a grid search over the following ranges: number of layers ∈ {0, 1, 2, 3}, dimension ∈ {10, 50, 100, 200}, and learning rate ∈ {0.03, 0.01, 0.003, 0.001}. The optimal hyperparameters are as follows: number of layers = 1, dimension = 100, and learning rate = 0.01. We use Adam as the optimizer throughout the tuning process.