<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Citation Intent Classification Through Weakly Supervised Knowledge Graphs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xinwei Du</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kian Ahrabian</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arun Baalaaji Sankar Ananthan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richard Delwin Myloth</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jay Pujara</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Sciences Institute</institution>
          ,
          <addr-line>Marina del Rey, CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Southern California</institution>
          ,
          <addr-line>Los Angeles, CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>Citations are scientists' tools for grounding their innovations and findings in the existing collective knowledge. They are used for semantically distinct purposes, as scientists utilize them at different parts of their work to convey specific information. As a result, a crucial aspect of scientific document understanding is recognizing the authorial intent associated with citations. Current state-of-the-art methods rely on contextual sentences surrounding each citation to classify the intent. However, in the absence of textual content, these approaches become unusable. In this work, we propose a text-free citation intent classification method built on relational information among scholarly works. To this end, we introduce a large-scale knowledge graph built from the publications in the SciCite dataset and their multi-hop neighborhood extracted from The Semantic Scholar Open Research Corpus (S2ORC). We also augment this knowledge graph by adding weakly labeled links based on the intent information available in the S2ORC. Finally, we cast the intent classification task as a link prediction problem on the newly created knowledge graph. We study this problem in both transductive and inductive settings. Our experimental results show that we can achieve a comparable macro F1 score to word embedding content-based methods by only relying on features and relations derived from this knowledge graph. Specifically, we achieve macro F1 scores of 62.16 and 59.81 in the transductive and inductive settings, respectively, on the link-level SciCite dataset. Moreover, by combining our method with the state-of-the-art NLP-based model, we achieve improvements across all metrics.</p>
      </abstract>
      <kwd-group>
        <kwd>Citation Intent Classification</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Graph Neural Networks</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Weakly supervised learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Citations are the primary way of identifying past contributions and connecting progress in new publications to existing literature. Nevertheless, not all citations indicate the same meaning. Authors use citations sparingly with specific intent behind them. For example, some papers are cited for providing background information in a domain, while others are cited when adopting or adapting a previously-used methodology. There are also scenarios where the same paper is used as background information and methodology use-case in different contexts simultaneously. Understanding citation intent is crucial to studying scholarly works, given the universality of using citations. Current state-of-the-art citation intent classification models [
        <xref ref-type="bibr" rid="ref1">17, 1, 4</xref>
        ] rely heavily on textual information, e.g., the sentences surrounding the citation. However, such information is expensive to obtain and in some scenarios inaccessible altogether. Consequently, we need models that could operate without having access to textual information. Previous works [
        <xref ref-type="bibr" rid="ref7">3, 26, 6</xref>
        ] have shown the importance of relational and structural information available in links among publications for various tasks. In this work, we propose a general citation intent classification method that relies purely on structural information.
      </p>
      <p>
        Besides helping researchers better understand the relationship among publications, citation intent analysis has been used for studying various other aspects of scientific works such as research domain evolution [10], scientific impact analysis [19], scientific document summarization [5], and retrieving related scientific works [16]. The three main categories of citations are “Result,” “Method,” and “Background” [4]. These categories describe the reasons behind making a scientific connection, referencing a publication in another publication. Classifying citations into these groups has traditionally required a high level of expertise in the respective scientific domains. This constraint, combined with the high cost of expert human labor, has resulted in highly scarce datasets, which makes the task even more difficult.
      </p>
      <p>
        The Third AAAI Workshop on Scientific Document Understanding 2023, February 14th, 2023, Washington, DC, USA. * Corresponding author: xinweidu@usc.edu (X. Du); ahrabian@usc.edu (K. Ahrabian). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).
      </p>
      <p>
        Previous works have proposed classifying citation intent through feature engineering-based [10] and representation learning-based [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] methods. However, most of these methods depend on textual information. As a result, they require a complex multi-stage pipeline of parsing documents, identifying citation contexts, and predicting citation intent [13]. Besides being prone to error propagation from various pipeline stages, the use of these models is limited to situations where the full text is available in a proper format. This work introduces a pure graph-based approach to classifying citation intent.
      </p>
      <p>We extend the existing SciCite dataset with 2-hop neighborhoods extracted from The Semantic Scholar Open Research Corpus (S2ORC). To further enrich the graph, we utilize the intent information provided in the S2ORC to create a weakly supervised knowledge graph (KG) consisting of the publications and the relations that match the provided intents. Our main idea is to use contextualized relational patterns to make predictions, obviating the need for textual context. Given the newly built KG, we cast the intent classification problem into the common link prediction problem on KGs. Specifically, we train a model to learn representations for entities and relations. Using these representations, we run the following query on the KG: (u, ?, v), where u cites v.</p>
      <p>Figure 1: Overview of the extracted multi-hop KG. The set of 0-hop nodes includes all the orange nodes. Similarly, the set of 1-hop nodes includes all the orange and blue nodes. The graph could be expanded to include n-hop nodes. The annotated set on each edge represents that specific link’s intent. Specifically, the empty set denotes that the citation link has no intent label.</p>
      <p>Converting this problem into a link prediction task allows us to adapt and extend widely used KG embedding models to this problem. We study the link prediction problem in both transductive and inductive settings. Our experimental results show that although our KG-based method underperforms compared to the large language model-based approaches, it is comparable or even superior to the word embedding-based methods. Moreover, our experiments with combining the NLP-based and graph-based methods show slight improvements over the current state-of-the-art model. These findings further signify the importance of relational patterns for citation intent classification.</p>
      <p>Earlier works went as far as defining 35 [7] and 12 [21] fine-grained schemes for scientific arguments. More recent works, however, have focused on creating more concise categories. For example, ACL-ARC [10] proposes a 6-class intent categorization scheme: Background, Motivation, Uses, Extension, Comparison or Contrast, and Future. SciCite [4] is even more restrictive and drops or combines small fine-grained classes to provide a more concise 3-class annotation scheme: Background, Method, and Result.</p>
      <p>The contributions of this work are as follows:</p>
      <sec id="sec-1-1">
        <title>2.2. Citation Intent Classification Methods</title>
        <p>1. Extending the SciCite dataset using the S2ORC dataset to generate a large-scale weakly supervised KG.
2. Introducing a novel graph-based approach for citation intent classification built on top of the newly built KG.
3. Presenting benchmarks for both transductive and inductive settings.
4. Presenting analyses on the effect of different parts of the methodology, such as weak supervision and feature engineering.</p>
        <p>
          Before the explosion of deep learning approaches, most
methods relied on a combination of hand-crafted features
and classic machine learning models. For example, in
one instance [
          <xref ref-type="bibr" rid="ref3">23</xref>
          ], the authors propose 12 different features,
including citation count, PageRank value, and author
overlap, and use classic machine learning models such
as SVM and Random Forest for classification. In another
instance [10], authors define pattern-based, topic-based,
and prototypical argument features and use SVM to make
predictions.
        </p>
        <p>
          2. Related Work. 2.1. Citation Function/Intent Schemes. Many prior works have studied the problem of creating categorical schemes for citation intent, which in some works is referred to as citation function [9]. Earlier works focused on creating more fine-grained categories.
        </p>
        <p>
          With the advent of deep learning models and the emergence of large language models in recent years, representation learning-based methods have outperformed the hand-crafted methods, achieving higher accuracy by considering the textual information. Recent works have proposed the use of structural scaffolds [4], BERT-based models trained on the scientific corpus (SciBERT) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], word embedding-based approaches [17], and creating a heterogeneous context graph based on an academic corpus.
        </p>
        <sec id="sec-1-1-1">
          <p>
            KGs are structured information repositories consisting of a set of nodes representing entities and a set of typed edges representing relations. Since, in most cases, the KG nodes and edges are not attributed, KG embedding (KGE) models aim to learn low-dimensional representations for all entities and relations. The most common traditional shallow KGE methods are TransE [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ], ComplEx [22], and RotatE [20]. More recent GNN-based KGE methods leverage the message-passing scheme of GNNs, enabling more complex multi-hop reasoning. Examples of these methods are GCN [11], which leverages the spectral information for information propagation but is limited to mono-relational KGs, R-GCN [18], which extends GCN to support multi-relational KGs, and GraphSAGE [8], which introduces an inductive framework to handle unseen nodes.
          </p>
          <p>
            3.2. Dataset Splitting. The original SciCite dataset contains 11,020 human-labeled samples. Hence, to adapt it to our link prediction setting, we reconstruct two datasets: SciCiteorigin and SciCiteresplit. SciCiteorigin adheres to the same benchmarks reported in prior works but is modified to remove overlapping citation links in the training and test sets. To maximize usage of the training data while removing artifacts, we create SciCiteresplit, which performs additional cleaning, provides a stronger separation of training and test sets, and avoids multi-intent citations. Table 1 showcases the statistics of these datasets.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Dataset</title>
      <sec id="sec-2-1">
        <title>3.1. Entity Mapping</title>
        <sec id="sec-2-1-1">
          <p>We first map each paper in the SciCite dataset to the S2ORC by matching SciCite’s IDs to Semantic Scholar’s SHA IDs. Since a publication could have many SHA IDs and only one Corpus ID, we then map each SHA ID to the unique Corpus ID to extract unique entities.</p>
          <p>From the 13,080 papers with unique IDs in SciCite, we
successfully map 13,019 of them to valid SHA IDs in
semantic scholar, while the remaining 61 papers do not
have any corresponding records. We believe this is due to
publication removals, as the SciCite dataset was created
from the S2ORC in 2019. After converting SHA IDs to
Corpus IDs, we end up with 13,011 unique entities and 8
duplicate entities.
The SciCite dataset focuses on individual citation links and ignores the significance of broader relational connections and features. To overcome this issue, we construct a knowledge graph by mapping each entity in the SciCite dataset to the S2ORC and adding their 2-hop citation neighborhoods. The S2ORC contains more than 206 million publications and 2.49 billion citation links. Apart from the regular citation links, this corpus provides partial intent labels for citations using a 3-class scheme as follows:
1. Background: describes a problem, topic, or concept;
2. Method: provides a method, tool, or dataset;
3. Result: makes a comparison.</p>
          <p>SciCiteorigin: To make methods comparable, we use the same validation and test sets as SciCite for this dataset and try to keep the training set as close as possible. We convert each publication in the SciCite dataset to a Semantic Scholar entity using the mapped Corpus IDs and drop the contextual sentence-level information. We assign a random unique ID to publications without a Corpus ID. After this procedure, we end up with a set of links for our link prediction task.</p>
          <p>Due to the removal of the contextual information, some of the training links appear exactly the same in the test set. Hence, we remove 641 training set samples that also appear in the test set to prevent data leakage. Moreover, since only one link in the test set has multiple intents, we treat the link prediction problem as a multi-class task rather than a multi-label task. In this scenario, the multi-intent links are represented as separate samples with the same inputs and different outputs.</p>
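          <p>The leakage-removal step above can be sketched as follows. This is a minimal illustration with toy links, not the authors' actual preprocessing code; the tuple layout (citing ID, cited ID, intent) is an assumption.</p>

```python
# Sketch of the train/test leakage cleanup: any (citing, cited) pair that also
# appears in the test set is dropped from training. Toy data, assumed layout.

def remove_leaked_links(train_links, test_links):
    """Drop training links whose (citing, cited) pair occurs in the test set."""
    test_pairs = {(u, v) for u, v, _intent in test_links}
    return [link for link in train_links if (link[0], link[1]) not in test_pairs]

train = [(1, 2, "Background"), (3, 4, "Method"), (5, 6, "Result")]
test = [(3, 4, "Background")]
clean = remove_leaked_links(train, test)  # the (3, 4) link is removed
```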
        </sec>
        <sec id="sec-2-1-2">
          <p>Multi-label methods may be a promising future extension of our work.</p>
        </sec>
        <sec id="sec-2-1-3">
          <p>Given the S2ORC dataset, we expand the SciCite dataset using the mapped entities to construct a KG containing 2-hop neighborhoods of the publications. Figure 1 illustrates an overview of the expanded KG. This work uses the 2022-09-13 version of the corpus downloaded from the bulk API. Formally, given the set of mapped entities V_0, the set of k-hop nodes V_k is defined as
V_k = V_{k-1} ∪ { v | ∃ u ∈ V_{k-1} : v ∈ N(u) } (1)
where for a given entity u, N(u) denotes all the entities that cite or are cited by u, i.e., the set of neighboring entities. Given the sets of unlabeled links U and weakly labeled links L, the sets of k-hop edges are defined as
E_k^U = { (u, v, UNK) | u, v ∈ V_k, (u, v) ∈ U } (2)
E_k^L = ∪_r { (u, v, r) | u, v ∈ V_k, (u, v) ∈ L_r } (3)</p>
          <p>SciCiteresplit: Even though we convert the SciCite dataset to the SciCiteorigin, problems such as duplicate citations and multi-label links still exist. Therefore, we further tailor the SciCite dataset to create a better link prediction dataset for graph-based models. First, we remove all the entities, and their related samples, that do not have a mapped Corpus ID. Then, similar to SciCiteorigin, we convert the remaining samples to a set of links. Following this, we drop all duplicate samples. Among the remaining 6,458 unique links, 5,886 only have one intent, 489 have two intents, and 83 have all three intents. We remove all the multi-intent links and resplit the dataset with ratios of 70%/15%/15% for training, validation, and test sets, respectively.</p>
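          <p>The k-hop expansion of Equation 1 amounts to a bounded breadth-first growth of the node set. The following is a minimal sketch on a toy citation graph; the neighbor map and node IDs are illustrative, not data from S2ORC.</p>

```python
# Sketch of the k-hop node expansion in Equation 1:
# V_k = V_{k-1} ∪ { v | ∃ u ∈ V_{k-1} : v ∈ N(u) },
# where N(u) holds the papers citing or cited by u. Toy graph below.

def expand_k_hop(seed_nodes, neighbors, k):
    """Return the set of nodes reachable within k citation hops of the seeds."""
    nodes = set(seed_nodes)  # V_0
    for _ in range(k):
        frontier = set()
        for u in nodes:
            frontier.update(neighbors.get(u, ()))  # cites or is cited by u
        nodes |= frontier  # V_i = V_{i-1} ∪ neighbors(V_{i-1})
    return nodes

# Toy citation neighborhood: a 1-2-3-4 chain.
nbrs = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
v1 = expand_k_hop({1}, nbrs, 1)  # {1, 2}
v2 = expand_k_hop({1}, nbrs, 2)  # {1, 2, 3}
```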
        </sec>
        <sec id="sec-2-1-4">
          <p>The specific statistics of the extracted KG and the original Semantic Scholar corpus are reported in Table 2. Since not every link has weakly labeled intent, this table also provides the percentage of weakly labeled links for each corresponding graph. Although we extract G_2, given its scale, we opt to run our current experiments only on G_1 and leave the larger-scale experiments for future works.</p>
          <p>4.1. Weak Supervision. In order to enrich our data and provide more information to the models, we extract the set of intents provided in the S2ORC dataset for each citation link. The intent labels in S2ORC are extracted using the structural scaffolds model [4] at a sentence level. In this scenario, we implicitly use the existing data derived from the content for bootstrapping our approach. We refer to these links as weakly labeled due to being labeled by a noisy model rather than a human expert. Since the intent labels are partial at a sentence level, citation links could have zero intents in the absence of text or several intents in an abundance of use cases.</p>
          <p>4.3. Feature Engineering. Since none of the publications in our KGs have any features or pre-defined representation, we propose to represent them through their references, citations, and graph-based features. More specifically, from S2ORC we extract the in-degrees and out-degrees of citations (or references), background links, method links, and result links. As a result, each paper is represented with an 8-dimensional feature vector, 4 for each in-degree and out-degree feature.</p>
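          <p>The 8-dimensional feature vector described above can be sketched as follows. The edge-list layout (citing, cited, intent-or-None) and the ordering of the eight entries are assumptions for illustration; the original paper only specifies four in-degree and four out-degree counts.</p>

```python
# Sketch of the 8-dimensional publication features: in-degree and out-degree
# counts for total citations plus Background/Method/Result links (toy edges).

from collections import Counter

INTENTS = ["Background", "Method", "Result"]

def paper_features(paper, edges):
    """edges: (citing, cited, intent-or-None). Returns the 8-dim count vector."""
    cites_in = Counter()   # links pointing at `paper` (citations)
    cites_out = Counter()  # links leaving `paper` (references)
    for u, v, intent in edges:
        if v == paper:
            cites_in["total"] += 1
            if intent:
                cites_in[intent] += 1
        if u == paper:
            cites_out["total"] += 1
            if intent:
                cites_out[intent] += 1
    return ([cites_in["total"]] + [cites_in[i] for i in INTENTS]
            + [cites_out["total"]] + [cites_out[i] for i in INTENTS])

edges = [(1, 9, "Method"), (2, 9, None), (9, 3, "Result")]
vec = paper_features(9, edges)  # in: 2 total, Method 1; out: 1 total, Result 1
```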
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Method</title>
      <p>Throughout the rest of this work, for simplicity, we use the term publication to denote all types of academic publications, e.g., books and papers. Moreover, we use the terms citation and reference to denote incoming and outgoing links, respectively.</p>
      <p>The full edge set combines the unlabeled and weakly labeled links:
E_k = E_k^U ∪ (∪_r E_k^r) (4)
where r ∈ {Background, Method, Result} and L_r denotes the set of all weakly labeled links with label r. Consequently, given the sets of k-hop nodes V_k and edges E_k, the extracted k-hop KG G_k is defined as
G_k = (V_k, E_k) (5)</p>
      <p>Table 3: Intent classification results on the SciCiteorigin and SciCiteresplit datasets. All the metrics are macro averaged. Bold values represent the highest performance within the metric and dataset scope.</p>
      <sec id="sec-3-1">
        <p>
          Moreover, we normalize the non-zero in-degree intent-based features into a [0, 1] probability distribution as follows:
h̄_r = h_r / (h_Background + h_Method + h_Result) (6)
The same normalization step is used for the out-degree features separately (7). Zero-valued features are assigned a normalized value of −1.
        </p>
        <p>
          To obtain scores, we concatenate the vectors of two entities and pass that through an MLP to get logit values. Formally, given a link (u, v) and their respective learned representations (e_u, e_v), we calculate the logit values as
y = MLP([e_u ‖ e_v]) (8)
where y ∈ R^|C| contains the unnormalized logits for each class. The predicted class c is then calculated as
c = argmax sigmoid(y) (9)
        </p>
        <p>
          In the multi-hop model, a node is represented as a combination of the neighboring nodes’ representations. Let h_v^(0) be the extracted feature vector for an arbitrary node v. We calculate the representation of node v at layer l+1 of a multilayer model as
h_N(v)^(l+1) = (1 / |N(v)|) Σ_{u ∈ N(v)} h_u^(l) (10)
h_v^(l+1) = σ( W^(l+1) [h_v^(l) ‖ h_N(v)^(l+1)] ) (11)
        </p>
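        <p>The intent-feature normalization in Equation 6 can be sketched as below. The −1 sentinel for all-zero feature vectors follows the zero-handling convention described above, but its exact form in the original implementation is an assumption here.</p>

```python
# Sketch of Equation 6: non-zero in-degree intent counts are scaled into a
# probability distribution; all-zero features map to a sentinel (assumed -1).

def normalize_intent_counts(background, method, result):
    """Return [h_Background, h_Method, h_Result] normalized to sum to 1."""
    total = background + method + result
    if total == 0:
        return [-1.0, -1.0, -1.0]  # assumed sentinel for all-zero features
    return [background / total, method / total, result / total]

probs = normalize_intent_counts(2, 1, 1)   # [0.5, 0.25, 0.25]
empty = normalize_intent_counts(0, 0, 0)   # [-1.0, -1.0, -1.0]
```

<p>The same function would be applied separately to the out-degree counts, per Equation 7.</p>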
      </sec>
      <sec id="sec-3-2">
        <p>
          Natural Language Processing Models: We include the reported results of several state-of-the-art Natural Language Processing (NLP) methods. Specifically, we include results from word embedding-based methods such as Infersent-KMeans, Infersent-HDBSCAN, Glove-KMeans, and Glove-HDBSCAN [17], the BiLSTM-based method Structural Scaffolds [4], and the large language model-based method SciBERT [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Moreover, we report the results of fine-tuning a pre-trained SciBERT model on both datasets. All these methods use textual information and are evaluated on the SciCite dataset.
        </p>
      </sec>
      <sec id="sec-3-3">
        <p>4.5. Multi-Hop Link Prediction (MHLP). Transductive and inductive settings are the most common link prediction evaluation schemes for KGs. The main difference between these two settings is having a fixed set of nodes in both the training and evaluation phases (transductive) versus allowing the addition of unseen nodes in the evaluation phase (inductive). This work refers to citation intent prediction on unseen publications as the inductive setting, whereas the transductive setting refers to citation intent prediction on already-seen publications. We propose an adaptable graph-based model for citation intent prediction in both the transductive and inductive settings. The primary basis of this approach is that a node, i.e., publication, could be represented as a combination of the representations of its neighboring nodes.</p>
        <p>In Equation 11, σ is a non-linear function; throughout our experiments, we specifically use ReLU to introduce nonlinearity. Given the node representations from a K-layer model and a link (u, v), we calculate the logit values as
y = MLP([h_u^(K) ‖ h_v^(K)]) (12)
where y ∈ R^|C| contains the unnormalized logits for each class and C is the set of all classes. The predicted class c is then calculated as
c = argmax sigmoid(y) (13)</p>
        <p>The main disadvantage of the inductive setting is that the unseen nodes only have one available feature, i.e., the reference count. This absence of information makes the task extremely difficult, as the feature vectors are highly sparse. However, our model tries to diminish this effect by using the message-passing scheme, as defined in Equation 11, to aggregate information through connected entities, i.e., cited papers, creating a denser representation for the unseen nodes.</p>
        <p>All our models are trained using the cross-entropy loss defined as
L = −log( exp(y_c) / Σ_{i=1}^{|C|} exp(y_i) ) (14)
where y_c is the logit value for the target class c given the prediction vector y.</p>
        <p>Figure 3: (a) The number of different citation intents. (b) The percentage of different citation intents.</p>
        <p>To further test the capabilities of our proposed model and use both structural and textual information, we devise a multi-modal model comprising encoders for both the graph structure and the citation context. Specifically, we use a pre-trained SciBERT model for encoding the citation phrase text and our MHLP model for encoding the citation graph around the citation link. Figure 2 illustrates an overview of the composite model.</p>
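        <p>One MHLP step (Equations 10 through 13) can be sketched in numpy as follows. The weight matrices are random stand-ins, the toy graph and dimensions are assumptions, and training (Equation 14) is omitted; this is an illustration of the forward computation, not the authors' implementation.</p>

```python
# Minimal numpy sketch of one MHLP layer and link scoring:
# Eq. 10: mean-aggregate neighbor features; Eq. 11: ReLU of a linear map over
# [own ‖ aggregated]; Eq. 12: MLP over concatenated endpoint representations.

import numpy as np

rng = np.random.default_rng(0)

def mhlp_layer(h, neighbors, W):
    """h: (n, d) node features; returns (n, d_out) per Equations 10-11."""
    agg = np.stack([h[list(nb)].mean(axis=0) for nb in neighbors])  # Eq. 10
    return np.maximum(0.0, np.concatenate([h, agg], axis=1) @ W)    # Eq. 11

def link_logits(h, u, v, W_mlp):
    """Eq. 12: logits for link (u, v); here a single linear layer stands in."""
    return np.concatenate([h[u], h[v]]) @ W_mlp

h0 = rng.normal(size=(4, 8))            # 8-dim engineered features (toy)
neighbors = [{1}, {0, 2}, {1, 3}, {2}]  # toy citation neighborhoods
W1 = rng.normal(size=(16, 8))           # layer weight: 2d -> d
W_mlp = rng.normal(size=(16, 3))        # 3 intent classes
h1 = mhlp_layer(h0, neighbors, W1)
logits = link_logits(h1, 0, 1, W_mlp)
pred = int(np.argmax(logits))           # Eq. 13 (argmax over class scores)
```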
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Experiments</title>
      <p>In this section, we report our experimental results on both the SciCiteorigin and SciCiteresplit datasets. All the graph-based experiments are carried out on the G_1 KG. For the traditional KGE methods, we tune their hyperparameters as described in Appendix A.1 and train them using the hyperparameters showcased in Table 4. For the hybrid methods, the KGE component is first trained to generate node features using the hyperparameters described in the appendix. Compared to the reported results [17], our model achieves superior performance to Glove-based models while slightly lagging behind Infersent-based models. Looking into the precision and recall comparison, our method has better precision scores in both transductive and inductive settings compared to all word embedding-based models; however, for recall, it performs better than Glove-based models and worse than the Infersent-based models, which might stem from the imbalance in the links as illustrated by Figure 3a. Further experimentation to address the class imbalance problem in future works might help improve the overall performance of MHLP. The significance of these results is that we show structural and relational information could be used to achieve relatively high performance without using textual information. Moreover, although our models underperform compared to language model-based approaches such as Structural Scaffolds and SciBERT, we showcase interesting future directions for combining graph-based and NLP-based methods.</p>
      <p>Finally, the composite model, denoted as SciBERT +
MHLP in Table 3, achieves the best performance among
all models, even beating the fine-tuned SciBERT. When
considering MHLP’s standalone performance, these
results showcase the potential improvements that could be
achieved through the use of structural information that
is not available in citation phrases. The presented
experiments are a stepping stone for better understanding
and using the structural information at scale for citation
intent classification.</p>
    </sec>
    <sec id="sec-5">
      <title>6. Analysis</title>
      <sec id="sec-5-1">
        <title>6.1. Temporal Analysis</title>
        <p>This analysis studies the relationship between the time that has passed since publication and citation intent. We hypothesize that a paper is more likely to be cited as “Result” or “Method” right after its publication, and as time passes, it will be more likely to be cited as “Background.” If this is proven accurate, we could get a relatively strong signal from the temporal information for each citation. We plotted the years after publication against intent counts and ratios for all papers in the Semantic Scholar corpus to test our hypothesis. Figures 3a and 3b illustrate the results of our analysis. As evident from these figures and contrary to our original hypothesis, we find out that the ratio of intent classes almost stays the same as time passes, with insignificant fluctuations. As a result, using temporal information in our models is unlikely to provide any significant improvement. Note that these results are based on the weakly labeled links that we obtained from S2ORC. Consequently, these links are generated by another noisy model that could potentially be biased. Hence, it should not discourage further analysis or studies of temporal information for citation intent classification.</p>
        <p>Figure 4: The calculated MI values for (a) publication features (both sides) and (b) averaged neighborhood features (both sides). On average, the publication features show stronger connections to the target variable.</p>
        <p>6.2. Mutual Information Analysis. In this analysis, we study the quality of the engineered features as described in Section 4.3 concerning the weakly labeled intent classes. To this end, we use the well-known mutual information (MI) [12] measurement to quantify the importance of each feature. Formally, the MI between two random variables X and Y is defined as
I(X; Y) = Σ_x Σ_y p_{X,Y}(x, y) log( p_{X,Y}(x, y) / (p_X(x) p_Y(y)) ) (15)
where the sums range over the value spaces of X and Y, p_{X,Y} is the joint probability distribution, and p_X and p_Y are the marginal probability distributions. Note that MI is a non-negative value, and higher values indicate more correlation between the two random variables. For our analysis, we calculate MI for both sides of the 5,886 unique citation links in the SciCiteresplit dataset. Moreover, to study these features in the graph context, we also calculate MI for the average of these features over the neighborhood of each publication, i.e., all citing and cited publications, from both sides of the citation links. Figures 4a and 4b present the results of our experiments. As evident from these results, while the publication-averaged features generally show stronger connections to the target variable, the neighborhood-averaged features seem to show complementary connections, further emphasizing the importance of using both sets of features.</p>
        <p>Figure 5: The extracted features (a) before normalization and (b) after normalization.</p>
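        <p>Equation 15 for discrete variables can be computed directly from co-occurrence counts, as in the sketch below. The feature/label pairs are toy data, not the SciCiteresplit features; natural logarithm is assumed.</p>

```python
# Sketch of the discrete mutual information in Equation 15:
# I(X;Y) = sum over (x, y) of p(x, y) * log( p(x, y) / (p(x) p(y)) ),
# estimated from observed (feature, label) pairs. Toy data below.

import math
from collections import Counter

def mutual_information(pairs):
    """Estimate I(X;Y) in nats from a list of observed (x, y) pairs."""
    n = len(pairs)
    pxy = Counter(pairs)              # joint counts
    px = Counter(x for x, _ in pairs)  # marginal counts of X
    py = Counter(y for _, y in pairs)  # marginal counts of Y
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Perfectly dependent pairs give I = log(2); independent pairs give 0.
dep = [(0, "Background"), (0, "Background"), (1, "Method"), (1, "Method")]
ind = [(0, "Background"), (0, "Method"), (1, "Background"), (1, "Method")]
mi_dep, mi_ind = mutual_information(dep), mutual_information(ind)
```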
      </sec>
      <sec id="sec-5-2">
        <title>6.3. Feature Quality Analysis</title>
        <p>
          In this analysis, we study the effect of normalization as described in Equations 6 and 7. To this end, we project the extracted features of the 5,886 unique citation links in the SciCiteresplit dataset to a 2-dimensional space using t-SNE [
          <xref ref-type="bibr" rid="ref4">24</xref>
          ]. Figures 5a and 5b illustrate the projected space for the unnormalized and normalized features, respectively. As evident from Figure 5a, it is challenging to distinguish different intent types in the unnormalized space. However, after normalization, as evident from Figure 5b, we can see that the “Method” intention more or less creates a distinguishable cluster. This result shows that the use of normalization is potentially helpful for the model. Further studies on different types of normalization and their effects are left for future work.
        </p>
        <p>6.4. Robustness Analysis. In this analysis, we focus on studying the robustness of our proposed graph-based method. To this end, we devise two ablation studies. In the first study, we randomly corrupt a percentage of the weak labels by replacing the correct label with a random label. This study aims to better understand the model’s resilience to noise. In the second study, we randomly remove a percentage of the weak labels. The idea of this study is to better understand the effect of weak supervision on the model’s performance. These studies are carried out by running the MHLP method in the transductive setting on both the SciCiteorigin and SciCiteresplit datasets. The feature vectors for the publications are calculated by counting the number of citations and intents. These vectors are then normalized using Equations 6 and 7. To analyze the relationship between the model’s performance and the amount of available data, we create ten variations of the dataset by only using a portion of the available weak labels, varying from using all the available weak labels to only using 10% of them. Figure 6a presents the result of this study.</p>
        <p>7. Conclusion. In this work, we used the weak intent labels available in S2ORC to augment the extracted citation graph with citation intents and create a multi-relational knowledge graph. Following this, we adapted the sentence-based intent classification into a citation-based link prediction task on graphs. We then introduced a set of engineered graph-based and citation-based features. Built on top of these features, we introduced a graph-based multi-hop reasoning approach for the newly introduced task. Our approach achieves 62.16 and 59.81 macro F1 scores in the transductive and inductive settings, respectively. The experimental results in the inductive setting further showcase the robustness of the proposed approach in the information-deprived out-of-distribution environment. Compared to NLP-based models, we reached a comparable performance to, and in some cases outperformed, the word embedding-based methods that rely on contextual sentences to make predictions. Moreover, with a composite model comprising our method as the graph encoder and the state-of-the-art NLP-based model as the text encoder, we outperformed all the other models we experimented with. These results further signify the strong signal in relational information and highlight the importance of future analysis and studies in this domain. Finally, our presented analyses further support our methodological choices.</p>
        <p>
          As evident from Figure 6a, the more weakly labeled further support our methodological choices.
links are available, the better our method performs. The For future works, one straightforward idea is to extend
other significant observation is the robustness of the the knowledge graph with more scholarly information,
model, even in the extreme scenario of having access to such as authors, venues, and fields of study. There already
only 10% of the labels. Note that only 31.90% of links in exist some open repositories such as OpenAlex [15] and
the S2ROC have at least one weakly labeled intent, which Microsoft Academic Graph (MAG) [
          <xref ref-type="bibr" rid="ref6">25</xref>
          ] that contain this
means, even if the utilization percentage is 100%, only information. Another direction is further investigation
31.90% citation links are weakly labeled. into the temporal signals. Last but not least, although we
        </p>
        <p>Figure 6b showcases the relationship between the achieved an improved performance through a fusion of
model performance and the percentage of corrupted data. textual and structural information, more investigation
Following our intuition, the model’s performance mono- and analysis could be done in this setting in future works.
tonically decreases as we add more noisy labels to the
data. However, two interesting observations could be
made from this figure. First, the performance of our Acknowledgments
method only drops less than five macro F1 scores when
half (50%) of the weak labels are replaced with randomly This work was funded by the Defense Advanced Research
assigned noisy labels. This observation shows that the Projects Agency with award W911NF-19-20271 and with
proposed method is exceptionally resilient when faced support from a Keston Exploratory Research Award.
with mistakes. Second, even when all the labels are
replaced with random ones (100%), the model performs
better than the random baselines. This observation indi- References
cates that the model is learning to make inferences based
on purely structural information, which further solidifies
our hypothesis regarding the importance of structural
information.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>7. Conclusions and Future Work</title>
      <sec id="sec-6-1">
        <title>In this work, we first introduced an expansion to the Sci</title>
        <p>Cite dataset by extracting scholarly information from the
S2ORC dataset and creating an extended citation graph.
Then, we gathered a large-scale weakly labeled dataset
plex Embeddings for Simple Link Prediction. In
Proceedings of the 33rd International Conference on
International Conference on Machine Learning -
Volume 48 (ICML’16). JMLR.org, New York, NY, USA,
2071–2080.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Identifying Meaningful Citations. In Scholarly Big</title>
        <p>Data: AI Perspectives, Challenges, and Ideas, Papers
from the 2015 AAAI Workshop (Technical Report,</p>
      </sec>
      <sec id="sec-6-3">
        <title>WS-15-13), Cornelia Caragea, C. Lee Giles, Narayan</title>
      </sec>
      <sec id="sec-6-4">
        <title>Bhamidipati, Doina Caragea, Sujatha Das Gollapalli,</title>
      </sec>
      <sec id="sec-6-5">
        <title>Saurabh Kataria, Huan Liu, and Feng Xia (Eds.).</title>
      </sec>
      <sec id="sec-6-6">
        <title>AAAI Press, Menlo Park, CA, 21–26. http://www.</title>
      </sec>
      <sec id="sec-6-7">
        <title>Visualizing Data using t-SNE. Journal of Machine</title>
        <p>http:</p>
      </sec>
      <sec id="sec-6-8">
        <title>Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia.</title>
      </sec>
      <sec id="sec-6-9">
        <title>Heterogeneous Contexts. In Proceedings of The Web</title>
        <p>ciation for Computing Machinery, New York, NY,
USA, 962–972.</p>
        <p>3380175
hao Ye, Jin Dong, Hao Xiong, Zheng Zhang, and
of the 43rd International ACM SIGIR Conference on
Research and Development in Information Retrieval
(SIGIR ’20). Association for Computing Machinery,
New York, NY, USA, 739–748.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>A. Hyperparameters</title>
      <sec id="sec-7-1">
        <title>A.1. Knowledge Graph Embedding</title>
        <p>We use a randomized search to tune our models and find near-optimal hyperparameters using the following ranges: negative samples ∈ {64, 128, 256, 512, 1024}, together with ranges over the embedding dimension and the learning rate, as well as the adversarial temperature and the margin value (RotatE only). Table 4 lists the selected hyperparameters of the KGE algorithms.</p>
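        <p>The randomized search described above can be sketched as follows. This is a minimal, dependency-free illustration, not the authors’ tuning code: the scoring function and any ranges not quoted in the text are placeholder assumptions, with only the negative-sample range taken from the paper.</p>

```python
import random

def random_search(score_fn, space, n_trials=20, seed=0):
    # Sample n_trials configurations uniformly at random from `space`
    # (a dict mapping hyperparameter name -> list of candidate values)
    # and keep the best-scoring one.
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {name: rng.choice(values) for name, values in space.items()}
        score = score_fn(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Hypothetical search space; only `negative_samples` is quoted in the text.
space = {
    "negative_samples": [64, 128, 256, 512, 1024],
    "embedding_dim": [100, 200, 400],      # assumption
    "learning_rate": [0.1, 0.01, 0.001],   # assumption
}

# Stand-in for the validation performance of a trained KGE model.
def toy_score(cfg):
    return cfg["negative_samples"] / cfg["embedding_dim"]

best_cfg, best_score = random_search(toy_score, space, n_trials=30)
```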
      </sec>
      <sec id="sec-7-2">
        <title>A.2. Multilayer Perceptron</title>
        <sec id="sec-7-2-1">
          <p>To simplify the model tuning process, we find the optimal hyperparameters of “ComplEx + MLP” on SciCite<sub>origin</sub> using grid search and reuse them for the rest of our experiments. Specifically, we run a grid search over the following ranges: number of layers ∈ {0, 1, 2, 3}, dropout ∈ {0, 0.1, 0.2, 0.3, 0.4, 0.5}, and dimension ∈ {32, 64, 128}.</p>
        </sec>
        <sec id="sec-7-2-2">
          <p>The optimal hyperparameters are as follows: number of layers = 2, dropout = 0.2, and dimension = [64, 32]. We use ReLU as the activation function for all layers.</p>
        </sec>
      </sec>
      <sec id="sec-7-3">
        <title>A.3. Multi-Hop Link Prediction</title>
        <sec id="sec-7-3-1">
          <p>We run a grid search over the number of layers, the hidden dimension, and the learning rate ∈ {0.03, 0.01, 0.003, 0.001}. The optimal hyperparameters are as follows: number of layers = 1, dimension = 100, and learning rate = 0.01. We use Adam as the optimizer throughout the tuning process.</p>
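          <p>The grid search above can be sketched generically. The scoring function and the candidate values for the number of layers and the dimension are illustrative assumptions; only the learning-rate range is taken from the text:</p>

```python
import itertools

def grid_search(score_fn, grid):
    # Exhaustively evaluate every combination in `grid` (a dict of
    # hyperparameter name -> list of candidate values) and return the
    # best-scoring configuration.
    names = sorted(grid)
    best_cfg, best_score = None, float("-inf")
    for values in itertools.product(*(grid[n] for n in names)):
        cfg = dict(zip(names, values))
        score = score_fn(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

grid = {
    "learning_rate": [0.03, 0.01, 0.003, 0.001],  # from the text
    "num_layers": [1, 2, 3],                      # assumption
    "dimension": [50, 100, 200],                  # assumption
}

# Stand-in score that peaks at the configuration reported as optimal.
def toy_score(cfg):
    return (-abs(cfg["learning_rate"] - 0.01)
            - abs(cfg["dimension"] - 100)
            - abs(cfg["num_layers"] - 1))

best_cfg, _ = grid_search(toy_score, grid)
```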
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Iz</given-names>
            <surname>Beltagy</surname>
          </string-name>
          , Kyle Lo, and
          <string-name>
            <given-names>Arman</given-names>
            <surname>Cohan</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>SciBERT: A Pretrained Language Model for Scientific Text</article-title>
          .
          <source>In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          .
          Association for Computational Linguistics, Hong Kong, China,
          <fpage>3615</fpage>
          -
          <lpage>3620</lpage>
          . https://doi.org/10.18653/v1/D19-1371
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Antoine</given-names>
            <surname>Bordes</surname>
          </string-name>
          , Nicolas Usunier,
          <string-name>
            <given-names>Alberto</given-names>
            <surname>García-Durán</surname>
          </string-name>
          , Jason Weston, and
          <string-name>
            <given-names>Oksana</given-names>
            <surname>Yakhnenko</surname>
          </string-name>
          . Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and
          <string-name>
            <given-names>Guillaume</given-names>
            <surname>Bouchard</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Complex Embeddings for Simple Link Prediction</article-title>
          .
          <source>In Proceedings of the 33rd International Conference on Machine Learning - Volume 48 (ICML'16)</source>
          . JMLR.org, New York, NY, USA,
          <fpage>2071</fpage>
          -
          <lpage>2080</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Marco</given-names>
            <surname>Valenzuela</surname>
          </string-name>
          , Vu Ha, and
          <string-name>
            <given-names>Oren</given-names>
            <surname>Etzioni</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Identifying Meaningful Citations</article-title>
          .
          <source>In Scholarly Big Data: AI Perspectives, Challenges, and Ideas, Papers from the 2015 AAAI Workshop (Technical Report WS-15-13)</source>
          , Cornelia Caragea, C. Lee Giles, Narayan Bhamidipati, Doina Caragea, Sujatha Das Gollapalli, Saurabh Kataria, Huan Liu, and Feng Xia (Eds.). AAAI Press, Menlo Park, CA,
          <fpage>21</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Laurens</given-names>
            <surname>van der Maaten</surname>
          </string-name>
          and
          <string-name>
            <given-names>Geoffrey</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Visualizing Data using t-SNE</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>9</volume>
          ,
          <issue>86</issue>
          (
          <year>2008</year>
          ),
          <fpage>2579</fpage>
          -
          <lpage>2605</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Kuansan</given-names>
            <surname>Wang</surname>
          </string-name>
          , Zhihong Shen,
          <string-name>
            <given-names>Chiyuan</given-names>
            <surname>Huang</surname>
          </string-name>
          , Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia.
          <year>2020</year>
          .
          <article-title>Microsoft academic graph: When experts are not enough</article-title>
          .
          <source>Quantitative Science Studies</source>
          <volume>1</volume>
          ,
          <issue>1</issue>
          (
          <year>2020</year>
          ),
          <fpage>396</fpage>
          -
          <lpage>413</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Wenhao</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mengxia</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tong</given-names>
            <surname>Zhao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Meng</given-names>
            <surname>Jiang</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Identifying Referential Intention with Heterogeneous Contexts</article-title>
          .
          <source>In Proceedings of The Web Conference 2020 (Taipei, Taiwan) (WWW '20)</source>
          . Association for Computing Machinery, New York, NY, USA,
          <fpage>962</fpage>
          -
          <lpage>972</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          Da Zheng, Xiang Song, Chao Ma, Zeyuan Tan, Zihao Ye, Jin Dong, Hao Xiong, Zheng Zhang, and George Karypis.
          <year>2020</year>
          .
          <article-title>DGL-KE: Training Knowledge Graph Embeddings at Scale</article-title>
          .
          <source>In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20)</source>
          . Association for Computing Machinery, New York, NY, USA,
          <fpage>739</fpage>
          -
          <lpage>748</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>