<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">2377-844X</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1007/978-3-319-06028-6_26</article-id>
      <title-group>
        <article-title>Longitudinal Citation Prediction using Temporal Graph Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andreas Nugaard Holm</string-name>
          <email>aholm@di.ku.dk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Barbara Plank</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dustin Wright</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Isabelle Augenstein</string-name>
          <email>augenstein@di.ku.dk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IT University of Copenhagen</institution>
          ,
          <addr-line>Rued Langgaards Vej 7, 2300 København</addr-line>
          ,
          <country country="DK">Denmark</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Copenhagen</institution>
          ,
          <addr-line>Universitetsparken 1, 2100 København</addr-line>
          ,
          <country country="DK">Denmark</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1973</year>
      </pub-date>
      <volume>2414</volume>
      <issue>3</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Citation count prediction is the task of predicting the number of citations a paper has gained after a period of time. Prior work viewed this as a static prediction task. As papers and their citations evolve over time, considering the dynamics of the number of citations over time seems the logical next step. Here, we introduce the task of sequence citation prediction. The goal is to accurately predict the trajectory of the number of citations a scholarly work receives over time. We propose to view papers as a structured network of citations, allowing us to use topological information as a learning signal. Additionally, we learn how this dynamic citation network changes over time and the impact of paper meta-data such as authors, venues and abstracts. To approach the new task, we derive a dynamic citation network from Semantic Scholar spanning over 42 years. We present a model which exploits topological and temporal information using graph convolution networks paired with sequence prediction, and compare it against multiple baselines, testing the importance of topological and temporal information and analyzing model performance. Our experiments show that leveraging both the temporal and topological information greatly increases the performance of predicting citation counts over time.</p>
        <p>Keywords: citation count prediction, graph neural network, citation network, dynamic graph generation</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The problem of predicting citation counts of papers has
been a long-standing research problem. Predicting
citation counts allows us to better understand the
relationship between a paper and its impact. However, prior
research has viewed this as a static prediction problem,
i.e. only predicting a single citation count at a static point
in time. This ignores the natural development of the data
as new papers are being published. Here, we propose
to view the problem as a sequence prediction task, with
models then having the ability to capture the evolving
nature of citations.</p>
      <p>
        This, in turn, requires a dataset to contain the papers’ citation counts over a period of time, which adds a temporal element to the data that can be encoded by sequential machine learning models, such as Long short-term memory models (LSTM) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Additionally, scholarly documents exhibit a natural graph-like structure in their citation networks. Given recent developments in modeling such data [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] and prior research showing that modeling input as graphs can be beneficial, we hypothesize that modeling a paper’s citation network, together with its temporal information, can improve citation count prediction.
      </p>
      <p>
        We use the well-established Semantic Scholar dataset [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to construct our citation network.
      </p>
      <p>
        Its meta-data allows us to construct a dynamic citation network which covers a 42 year time-line, with an updated graph for each year. The Semantic Scholar dataset’s meta-data also contains information about each paper’s authors, venue, and topics, allowing us to study the correlation between these features and the citation count of a paper when considering the evolving nature of the citation network. The correlation between these features and citation counts is well-known and studied by prior work [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>Prior studies show that citations are correlated, and that there is a strong correlation with features such as authors; however, these studies are limited to predicting a single citation count, and do not predict the natural evolution of a paper’s growth.</p>
      <p>
        We propose to use the constructed dynamic citation network (see Section 4.2) to predict the trajectory of the number of citations papers will receive over time, a new sequence prediction task introduced in this work. Furthermore, we propose an encoder-decoder model to solve the proposed task, which uses graph convolutional layers [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to exploit the graphs’ topological features and an LSTM to model the temporal component of the graphs.
      </p>
      <p>We compare our model against a vanilla graph convolutional neural network (GCN) and a vanilla LSTM, which individually incorporate either the topological information or the temporal information, but not both.</p>
      <p>Our contributions are as follows: 1) A dynamic citation network based on the Semantic Scholar dataset. The dynamic citation network contains 42 time-steps, with an updated graph at each time-step, based on yearly information. 2) We introduce the task of sequence citation count prediction. 3) A novel encoder-decoder model based on a GCN and LSTM to extract the dynamic graph’s topological and temporal components. 4) A thorough study of the correlation between citation counts and temporal components.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The task of predicting a paper’s citations aims to predict the number of citations which a paper has obtained either by a given year or after n years. The task itself is not new and has been researched throughout the years, and multiple different approaches have been tried and shown to be effective. Some of these studies have focused on feature vectors [
        <xref ref-type="bibr" rid="ref5 ref7">5, 7</xref>
        ] and explored distinct feature vectors’ performance, where they primarily rely on meta-data, e.g. venue and authors. As peer review data has become available [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], recent research has focused on using non-meta-data information, such as peer reviews [9, 10], to predict a paper’s citation count.
      </p>
      <p>
        What is common in existing research is the target: predicting a single citation count. This count can be set as one of the following years, or the citation count n years in the future. To predict these citation counts, we see a variety of different neural network models with distinct architectures [10, 11], as well as papers which focus on deeper feature vector analysis, where regression models are used [
        <xref ref-type="bibr" rid="ref7">12, 7</xref>
        ]. A side effect of prior research’s focus on predicting single citation counts is that the utilized citation networks are static graphs, based on paper databases such as ArnetMiner [13], Arxiv HEP-TH [14] and CiteSeerX [15]. These static citation networks are not suitable for our proposed task because they only contain the topological information at a single point in time. As longitudinal citation datasets are rare, we derive a dataset from Semantic Scholar.
      </p>
      <p>
        Citation networks are not exclusively used for citation count prediction. Other citation networks such as Cora [16], CiteSeer [17] or PubMed [16], all well-known benchmark graphs, are used for node classification tasks, where the task is to predict a paper’s topic. These networks are provided with minimal content. They consist of an adjacency matrix, the connections between citations, and a simple feature vector for each node, either a 0/1-valued vector or a tf-idf vector based on the dictionary of the paper content. These existing datasets do not fit our purpose, hence we derive our own, described in Sec. 4.2.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3. Temporal Graph Neural Network</title>
      <p>Our model is an encoder-decoder model and therefore consists of two major components. The first component is the encoder, which takes an adjacency matrix of node connections and a node feature matrix as input, where the node feature matrix can e.g. consist of author information (illustrated in Figure 2). It uses the topological information from the graphs and creates feature vectors containing the topological node features via a GCN. It should be noted that due to the use of dynamic graphs, the encoder generates a sequence of graph embeddings, one for each graph in the sequence. The second component, the decoder, utilizes the sequence of graph embeddings created by the encoder. Using an LSTM, we extract the temporal elements and create a sequence of citation count predictions (CCP) for each node in the dynamic graph.</p>
      <sec id="sec-5-1">
        <title>3.1. Problem Definition</title>
        <p>While the task of CCP has been researched before, in
this paper, we are interested in predicting a sequence of
citation counts, which to our knowledge is so far
unexplored.</p>
        <p>Let us start by introducing our graph notation. We denote our dynamic graph as G = {G_0 … G_(T−1)}, where G_t is a graph at the given time t. Each graph in the dynamic graph set is defined as G_t = (V_t, E_t), where V_t is the set of vertices at time t and E_t is the set of edges at time t. With a given dynamic graph, we aim to predict the sequence of citations for a given paper. We formalize this as C = {c_1 … c_T}, where c_t is the number of citations for v ∈ V_t and T = |G|. For our proposed task, we are given the dynamic graph G, and are to predict the sequence of citation counts C.</p>
      </sec>
      <sec id="sec-5-2">
        <title>3.2. Topological Feature Extraction</title>
        <p>
          One of the central hypotheses we want to examine is if
complex structural dependencies in a citation network
can help predict the citation count of a paper. To test this,
we employ a GCN to extract topological dependencies
from the graphs. We choose a GCN over other methods
as they work in Euclidean space, and are thus easy to
use with other neural architectures such as convolutional
neural networks (CNN) [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>The GCN uses the data flow between edges in the graph to create a graph embedding. As such, we can create an embedding influenced by all of the neighboring nodes in the graph. In this, we hypothesize that there is a relationship between the number of citations a given paper receives and that of its neighbors. The connections between the papers are described by an adjacency matrix A. Using our notation, we describe the GCN as follows:</p>
        <p>H^(l+1) = σ(D̃^(−1/2) Ã D̃^(−1/2) H^(l) W^(l)),  (1)</p>
        <p>where Ã = A + I; I is the identity matrix (which enables self-loops in Ã); D̃_ii = Σ_j Ã_ij; l is the l’th layer in the model; σ is an activation function; and H^(l+1) is the output of GCN layer l. We can then simplify the above equation:</p>
        <p>H^(l+1) = σ(Â H^(l) W^(l)),  (2)</p>
        <p>where Â is defined as Â = D̃^(−1/2) Ã D̃^(−1/2), and t is the time step in the dynamic graph. It should be noted that t has been left out in the first equation for simplicity. We also observe here that by adding multiple GCN layers, we allow the graph embeddings to be affected by extended neighbours.</p>
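        <p>The propagation rule in Eqs. (1)–(2) can be sketched numerically. The following is a minimal illustration, not the authors’ implementation; the toy adjacency matrix, feature matrix, and the choice of ReLU as the activation σ are assumptions for the example:</p>
        <preformat>
```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: sigma(D^(-1/2) (A + I) D^(-1/2) H W), with ReLU as sigma."""
    A_tilde = A + np.eye(A.shape[0])           # add self-loops: A~ = A + I
    d = A_tilde.sum(axis=1)                    # D~_ii = sum_j A~_ij
    D_inv_sqrt = np.diag(d ** -0.5)            # D~^(-1/2)
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # normalized adjacency A^
    return np.maximum(0.0, A_hat @ H @ W)      # ReLU activation

# toy citation graph with 3 papers and 2-dimensional node features
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.eye(3, 2)          # simple node features
W = np.ones((2, 2))       # untrained layer weights
Z = gcn_layer(A, H, W)
print(Z.shape)            # (3, 2)
```
        </preformat>
        <p>Stacking two such layers, as done in our experiments, lets each node’s embedding be influenced by its two-hop neighbourhood.</p>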
        <p>Since we work on a dynamic citation network, we have T distinct adjacency matrices, and we have to create a graph embedding for each graph in the sequence:</p>
        <p>Z = {Z_0 … Z_T} = { f(A_t, X) | A_t ∈ G },  (3)</p>
        <p>where the function f is the GCN network, Z_t ∈ ℝ^(N×d) is a single graph embedding of dimensionality d with N nodes, and Z is the set of graph embeddings created by the GCN. It should be noted that X is shown as being independent of time, which is true for some of our node embeddings. However, some of our node embeddings are based on citations, which change through time, which makes X dependent on time. We explore the distinct node embeddings in a later section. As shown in the equation, we also keep the same model over time, and do not change the GCN even though the graph changes. We instead try to generalize the model, working on all the graphs in the dynamic graph.</p>
        <p>In the final model, we combine the GCN and LSTM in an encoder-decoder model. The primary challenge in combining these two models, though, is that they operate on vastly different inputs. The GCN operates on entire graphs and needs all the nodes to appear in the graphs, including nodes which it intends to predict. The LSTM, however, does not have this requirement and can work on batches. To solve this issue in a simple yet effective way, we embed the entire graph prior to the LSTM steps, so that in the LSTM step we can still split the data into batches for training, validation and testing. While other approaches have been researched, like embedding the GCN into the LSTM [18], we found the simple approach to perform better.</p>
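        <p>The data flow of this encode-first, batch-later scheme can be sketched as follows. This is only an illustration of the batching idea, with the GCN encoder replaced by a single untrained propagation step; the function name encode_graph and the toy sizes are assumptions for the example:</p>
        <preformat>
```python
import numpy as np

def encode_graph(A, X):
    """Stand-in for the GCN encoder: one normalized propagation step."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(A_tilde.sum(axis=1) ** -0.5)
    return d_inv_sqrt @ A_tilde @ d_inv_sqrt @ X

# a dynamic graph: T = 3 yearly snapshots over the same N = 4 nodes
rng = np.random.default_rng(0)
N, T, d = 4, 3, 5
graphs = [(rng.integers(0, 2, (N, N)).astype(float), rng.normal(size=(N, d)))
          for _ in range(T)]

# 1) encode every graph first, so all nodes are embedded jointly
Z = np.stack([encode_graph(A, X) for A, X in graphs])  # shape (T, N, d)

# 2) regroup into per-node sequences, then split those into LSTM batches
sequences = Z.transpose(1, 0, 2)                       # shape (N, T, d)
batch_size = 2
batches = [sequences[i:i + batch_size] for i in range(0, N, batch_size)]
print(len(batches), batches[0].shape)                  # 2 (2, 3, 5)
```
        </preformat>
        <p>Because the whole graph is embedded before the recurrent step, the decoder only ever sees fixed-size per-node sequences, which is what makes mini-batching possible.</p>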
      </sec>
    </sec>
    <sec id="sec-6">
      <title>4. Dynamic Citation Count Prediction</title>
      <p>However, existing citation networks are not usable for our task due to the graph of the citation network being static in those works, i.e., the citation network does not evolve over time. Given this, we construct a dataset in which we reconstruct the citation network at each time-step, for the purpose of studying citation count prediction over time.</p>
      <p>Algorithm 1: Dynamic Graph Construction.</p>
      <sec id="sec-6-1">
        <title>4.1. Dataset</title>
        <p>
          The dataset which we used to create our dynamic graph is based on Semantic Scholar [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].1 The dataset is a collection of close to 200,000,000 scientific papers; a graph of this size requires an immense system to run experiments on (recall the size of A ∈ ℝ^(N×N), where N is the number of papers). To reduce the dataset to a manageable size, we only kept papers from the following venues related to AI, Machine Learning and Natural Language Processing: ACL, COLING, NAACL, EMNLP, AAAI, NeurIPS and CoNLL. With the dataset only containing papers from the listed venues, we reduced the dataset’s size to 47,091 papers. Furthermore, the Semantic Scholar dataset also holds an extensive collection of meta-data for each paper. We use this meta-data to construct our dynamic graph, as well as the graph’s node embeddings.
        </p>
        <p>1 https://api.semanticscholar.org/</p>
      </sec>
      <sec id="sec-6-2">
        <title>4.2. Graph Construction</title>
        <p>With the dataset reduced to a more manageable size, we search for an ideal dynamic graph of the citation network. We do this because working with graphs can be computationally heavy, and the size of the graph based on the full Semantic Scholar dataset can make some computations near unfeasible. We define an ideal dynamic graph as the sequence of graphs which has the largest connected graph in the final graph and has the most significant increase of nodes over time. We do not use the largest connected graph at each time step, as it can trick us into selecting a sub-optimal dynamic graph. A sub-optimal dynamic graph may present itself as the largest connected graph at a point in time, but will not stay the largest connected graph through time, and will contain fewer nodes through time, compared to the ideal dynamic graph. To avoid being tricked into selecting a less ideal dynamic graph, we have to probe each node in the data to observe the graphs’ evolution. We define probing as the process of observing the evolution of the graph connected to the probed node. This process is automatically performed on all nodes of the largest connected graph in the final step. By probing all the nodes, we can choose the sequence of graphs which contains the most nodes over time. In Algorithm 1, we describe the process in the form of pseudo-code for a more precise insight into the process of constructing the ideal dynamic graph.</p>
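        <p>The core of the probing step can be sketched as follows. This is our reading of the idea, not the paper’s actual Algorithm 1: for a probed node, track the size of its connected component in each yearly snapshot; the helper component_size and the toy snapshots are assumptions for the example:</p>
        <preformat>
```python
from collections import deque

def component_size(node, edges):
    """Size of the connected component containing `node`, given undirected edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, queue = {node}, deque([node])
    while queue:  # breadth-first search from the probed node
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen)

# yearly snapshots of a toy citation graph (edges accumulate over time)
snapshots = [
    [("a", "b")],
    [("a", "b"), ("b", "c")],
    [("a", "b"), ("b", "c"), ("c", "d"), ("e", "f")],
]
growth = [component_size("a", edges) for edges in snapshots]
print(growth)  # [2, 3, 4]
```
        </preformat>
        <p>Probing every node of the final largest connected graph in this way yields, per candidate, a growth curve over time, from which the sequence with the most nodes over time can be chosen.</p>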
        <p>In Table 1, we show some of the properties of the last 10 graphs in the dynamic graph. It is clear how the graph evolves over time, as can be seen in how both the number of vertices and edges increase, and how the degree d increases, indicating that the nodes in the graph obtain more citations over time. This indicates that the dynamic graph reflects the natural growth of a paper’s citations.</p>
        <p>By only using a subset of the nodes from the full graph to construct the dynamic graph, we ablate some of the full graph’s properties. One notable property of the full graph is that the citation count of a paper is tied to the degree of a node; by using a subset of the full graph this property does not hold anymore, as a paper’s citation count can only be greater than or equal to its degree in the subgraph. (Table 1 reports, per time step: |V|, |E|, the mean and maximum degree d, the maximum citation count, and the average citation count.)</p>
        <p>Another important point is that removing edges from the graph removes some of the information contained in the full graph (e.g. links to papers in other fields). Such edges are usually connected to more prominent papers, because it is often the high-impact papers which obtain citations from papers outside the main field.</p>
      </sec>
      <sec id="sec-6-3">
        <title>4.3. Feature Generation</title>
        <p>The created dynamic graph nodes are not dependent on a set of specific features, and we can therefore select and create a set of features for each node containing our desired information. With a wide variety of meta-data fields available, we created a set of distinct features which we used for our predictions. Furthermore, we studied how each of these features affects the performance of the model.</p>
        <p>
          The choice of using authors and venues as features for our model is based on the hypothesis that the authors listed on a paper have a major impact on the number of citations gained. We assume the same goes for venues: if a paper is published at a more highly ranked venue, it is more likely to gain a large number of citations compared to a paper published at a lower-ranking venue. We further motivate the choice of these two features based on prior work [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], which shows that author rank and venue rank are indeed two of the three features that are most predictive.
        </p>
        <p>Author rank: To include the author information, we created a feature vector which ranks the authors based on their number of citations, sorted from highest to lowest. Because many authors have the same number of citations, we allow authors to share the same rank. As the final step of the feature calculation, we normalize the rankings by r′ = (r − r_min) / (r_max − r_min).</p>
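        <p>The rank-then-normalize step can be sketched as follows; this is a small illustration under our reading of the description (ties share a rank, then ranks are min-max normalized), and the function name author_ranks is an assumption for the example:</p>
        <preformat>
```python
def author_ranks(citations):
    """Map each author to a min-max normalized rank (0.0 = highest-cited)."""
    # authors with the same citation count share the same rank
    unique_counts = sorted(set(citations.values()), reverse=True)
    rank = {count: r for r, count in enumerate(unique_counts)}
    ranks = {a: rank[c] for a, c in citations.items()}
    # normalize: r' = (r - r_min) / (r_max - r_min)
    lo, hi = min(ranks.values()), max(ranks.values())
    span = (hi - lo) or 1   # avoid division by zero if all ranks are equal
    return {a: (r - lo) / span for a, r in ranks.items()}

counts = {"ann": 120, "bob": 40, "cho": 120, "dee": 5}
print(author_ranks(counts))  # ann/cho share rank 0.0, dee gets 1.0
```
        </preformat>
        <p>The venue rank described below is computed identically, only over venues instead of authors.</p>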
        <p>Venue rank: Together with the author rank, we also hypothesize that the venue has an impact on the number of citations of a paper. Therefore, we also created a feature ranking for the venues. The feature is calculated identically to the author rank. It should be mentioned that the meta-data contains a high number of different labels for each of the venues which we are using. We reduce all the different labels of the same venue down to a single label for each venue, but keep each venue separated by year.</p>
        <p>
          We motivate the choice of using the abstract based on the assumption that the abstract of a paper contains information on the topics discussed in the paper, which can be used to identify if a paper’s topic is currently popular [19]. We further motivate the choice of using author and venue rank, as prior work shows them to be the most descriptive features [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. The preceding paragraphs provide short descriptions of the meta-data used to create these feature vectors and how each of them is calculated.
        </p>
        <p>5. Experiments: In this section, we present our experiments and results, and explore the importance of exploiting topological and temporal information.</p>
        <p>Abstract: To base our model on more than meta-data, we use the abstract of the papers to create a feature vector. To create an embedding of the abstract, we utilize BERT [20], specifically the pre-trained SciBERT [21] model. SciBERT is a contextualized embedding model trained using a masked language modeling objective on a large amount of scholarly literature. Representations from SciBERT have been shown to be useful for learning downstream tasks with scientific text, which is why we use them here. To obtain a feature vector of a given abstract, we tokenize the abstract text and pass it through SciBERT. SciBERT prepends a special [CLS] token for performing classification tasks, so we use the output representation of this token as the final feature vector for an abstract.</p>
        <p>5.1. Data: We use the constructed dynamic graph for our experiments and test each of the three distinct feature vectors.</p>
        <p>A detailed description of the feature vectors and the
dynamic graph’s construction can be found in Section 4.</p>
        <p>We split our data into a training, validation, and test set, with the following splits: 60%, 20%, and 20%. With these splits, we obtain a training set of 22,900 papers, and a validation and test set of 7,634 papers each. The training, validation and test sets are generated randomly, but are kept fixed throughout the experiments.</p>
        <p>Table 2: The performance of our 3 models over a 10 year period (feature sets: Author, Venue, Author + Venue, All). The results are reported as the MAE of the log citations. For the 10-year period, our deterministic approach has an MAE of 1.6378.</p>
        <p>Table 3: The performance of our 3 models over a 20 year period. The results are reported as the MAE of the log citations. For the 20-year period, our deterministic approach has an MAE of 2.0796.</p>
        <p>Due to the large number of time-steps in the dynamic graph, we chose to create two different setups for our experiments: one which uses the last 10 years, and another which uses the last 20 years of the dynamic graph. We use the later years in the dynamic graph, as these years contain the most papers and the graph has evolved the most.</p>
        <p>We compare to a simple deterministic baseline: predicting the mean citation count of the training and validation sets at each time step.</p>
        <p>While not mentioned in Section 4.3, we perform some further pre-processing of the data. For the feature vectors of author rank and venue rank, we perform a normalization of the values. We also pre-process the labels, due to the high fluctuation of the number of citations: we take the log(c + 1) of the citation count of a paper as the label [22]. Taking the log of the citations increases the stability of the model during training.</p>
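        <p>The label transformation can be illustrated directly; log(c + 1) (i.e. log1p) compresses the heavy-tailed citation counts while keeping zero citations at zero. The function name to_label is an assumption for the example:</p>
        <preformat>
```python
import math

def to_label(citations):
    """Training label: log(c + 1) of the raw citation count."""
    return math.log1p(citations)

raw = [0, 9, 99, 999]
labels = [round(to_label(c), 3) for c in raw]
print(labels)  # [0.0, 2.303, 4.605, 6.908]
```
        </preformat>
        <p>Three orders of magnitude in raw counts collapse to a small, stable range of labels, which is what stabilizes training.</p>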
      </sec>
      <sec id="sec-6-4">
        <title>5.2. Experimental Setup</title>
        <p>We perform experiments with three distinct models: 1) our proposed model, consisting of a GCN and an LSTM; 2) a standard LSTM; 3) a standard GCN. All hyper-parameters are shared across the models. We evaluate the models at specific times and over time.</p>
        <p>For our selected models, we used the Adam [23] optimizer with a learning rate of 0.001. For the GCN, we used two layers, each consisting of 256 hidden units. Both the GCN and the GCN with LSTM used this setup. The LSTM was set to have a single uni-directional layer of 128 hidden units, with the output being reduced to 1 dimension by a linear layer. For the models using an LSTM, we set the batch size to 256. We ran the models for 1000 epochs, and if no update to the best validation score was observed over 10 epochs, we terminated training early. As mentioned, we used SciBERT to encode the abstracts, with an output vector of size 768. The models have been run using random seeds, and each experiment has been executed 10 times. In the results section, we report the mean and the standard deviation of the 10 runs.</p>
      </sec>
      <sec id="sec-6-5">
        <title>5.3. Evaluation Metric</title>
        <p>To evaluate the performance of the models, we measure the mean absolute error, defined as MAE = (1/n) Σ_(i=1)^(n) |c_i − ĉ_i|,  (4) where c_i are the citation counts and ĉ_i are the predicted values. We chose to use MAE instead of mean squared error (MSE) to mitigate the influence of outlier papers which have a high number of citations. We additionally use MAE as the training objective for the same reason.</p>
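        <p>The metric in Eq. (4) is simple enough to state in code; a minimal version, applied here to hypothetical log-citation values:</p>
        <preformat>
```python
def mae(y_true, y_pred):
    """Mean absolute error: (1/n) * sum of |c_i - c_hat_i|."""
    assert len(y_true) == len(y_pred)
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

print(mae([2.0, 4.0, 6.0], [2.5, 3.0, 6.5]))  # 0.6666666666666666
```
        </preformat>
        <p>Unlike MSE, each error contributes linearly, so a single heavily cited outlier paper cannot dominate the score or the training objective.</p>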
      </sec>
      <sec id="sec-6-6">
        <title>5.4. Results</title>
        <p>As previously mentioned, we ran our experiments on dynamic graphs of 10 years and 20 years. The results of the 10 year experiment are shown in Table 2, and the results of the 20 year experiment are shown in Table 3.</p>
        <p>The tables show that our models outperform the simple
deterministic approach. Figure 3 shows results over time.</p>
        <p>By inspecting the results, one can clearly observe that the GCN-LSTM has the best performance among the three models. We further observe that the GCN-LSTM improves on the performance of the pure GCN and LSTM individually, indicating that it learns from both the temporal and the topological information provided by the dynamic citation network. Furthermore, the GCN increases in error going from a 10 year interval to a 20 year interval, where we see the other models slightly improve. To further study this, we plot the error at the different time steps in Figure 3, which shows the models’ performance over time. By inspecting the plots, we observe that the pure models, i.e. the GCN and the LSTM, struggle and deteriorate over time, compared to the combined GCN-LSTM model, which keeps improving over time until it starts plateauing. Comparing the 10-year and 20-year plots, one can observe that the deterioration continues where the 10-year plot stops. It can also be seen that the GCN-LSTM keeps improving up until year 10, where it levels out. All of the models decrease drastically in error up until two time-steps; afterward, the pure models start deteriorating.</p>
      </sec>
      <sec id="sec-6-7">
        <title>5.5. Discussion</title>
        <p>Tables 2 and 3 show the impact of single feature types. We hypothesize that author information is very predictive, as shown by prior work. Inspecting the results from the different feature ablations, we observe that the author features perform best, confirming our hypothesis. Figure 3 further confirms this, showing that large parts of the gain of the model over time stem from author information.</p>
        <p>The feature vector created by the venues performs the worst in both experiments. We hypothesize that the venues’ performance could be increased if a more generalized notation for venue meta-data were available. They are noisy (also due to OCR errors) and have many spelling variants.</p>
        <p>To further study the impact of features, we calculate the average MAE for each distinct author and venue, using the predictions made by the GCN-LSTM trained on the author feature vectors over 20 years. We show the results for the venues in Table 4 and those for the authors in Table 5. One can observe that the difference between the top and the bottom venue is drastically lower than the difference between the top and bottom author. This further indicates that author features are strongly predictive of citation counts.</p>
        <p>We also show the average degree and the number of papers for each of the venues in Table 4. With a higher representation of papers in the collection, we expect a more reliable prediction. This is indeed the case: we observe that the top venues often have a higher number of papers in the collection. To further analyse this, we observe the average degree of the papers in the collection; however, we do not notice a higher performance where the degree is higher. This indicates that the model is better at predicting papers with higher citation counts, because the degree of a node is tightly bound to the number of citations.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusions</title>
      <p>In this paper, we propose the task of citation sequence prediction. We introduce a new dataset of scholarly documents for this task, based on a dynamic citation graph evolving over 42 years, starting from a single node and growing to a large graph. We further study the effect of temporal and topological information, and propose a model to benefit from both kinds of information (GCN+LSTM). Our results show that utilizing both the temporal and topological information is superior to utilizing only the temporal or only the topological information. Using the proposed model, we study the effect of different features, to identify which information is most predictive of a paper’s citation count over time. We find author information to be the most predictive and informative over time.</p>
      <p>In future work, the impact of training a single GCN on the dynamic graph could be explored, since the error of the GCN deteriorates quickly over time.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>We would like to thank Johannes Bjerva for the fruitful discussions in the early stages. This work is partly funded by the Independent Research Fund Denmark under grant agreement number 9065-00131B.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>S.</given-names> <surname>Hochreiter</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Schmidhuber</surname></string-name>,
          <article-title>Long Short-Term Memory</article-title>,
          <source>Neural Computation</source>
          <volume>9</volume>
          (<year>1997</year>)
          <fpage>1735</fpage>-<lpage>1780</lpage>.
          URL: https://www.mitpressjournals.org/doi/abs/10.1162/neco.1997.9.8.1735. doi:10.1162/neco.1997.9.8.1735.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>J.</given-names> <surname>Zhou</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Cui</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Hu</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Sun</surname></string-name>,
          <article-title>Graph neural networks: A review of methods and applications</article-title>,
          <source>AI Open</source>
          <volume>1</volume>
          (<year>2020</year>)
          <fpage>57</fpage>-<lpage>81</lpage>.
          URL: https://www.sciencedirect.com/science/article/pii/S2666651021000012. doi:10.1016/j.aiopen.2021.01.001.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>Z.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Pan</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Long</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>P. S.</given-names> <surname>Yu</surname></string-name>,
          <article-title>A Comprehensive Survey on Graph Neural Networks</article-title>,
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          <volume>32</volume>
          (<year>2021</year>)
          <fpage>4</fpage>-<lpage>24</lpage>.
          doi:10.1109/TNNLS.2020.2978386.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>W.</given-names> <surname>Ammar</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Groeneveld</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Bhagavatula</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Beltagy</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Crawford</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Downey</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Dunkelberger</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Elgohary</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Feldman</surname></string-name>,
          <string-name><given-names>V. A.</given-names> <surname>Ha</surname></string-name>,
          <string-name><given-names>R. M.</given-names> <surname>Kinney</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Kohlmeier</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Lo</surname></string-name>,
          <string-name><given-names>T. C.</given-names> <surname>Murray</surname></string-name>,
          <string-name><given-names>H.-H.</given-names> <surname>Ooi</surname></string-name>,
          <string-name><given-names>M. E.</given-names> <surname>Peters</surname></string-name>,
          <string-name><given-names>J. L.</given-names> <surname>Power</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Skjonsberg</surname></string-name>,
          <string-name><given-names>L. L.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Wilhelm</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Yuan</surname></string-name>,
          <string-name><given-names>M. v.</given-names> <surname>Zuylen</surname></string-name>,
          <string-name><given-names>O.</given-names> <surname>Etzioni</surname></string-name>,
          <article-title>Construction of the Literature Graph in Semantic Scholar</article-title>,
          in: NAACL-HLT,
          <year>2018</year>.
          doi:10.18653/v1/N18-3011.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>R.</given-names> <surname>Yan</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Tang</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Shan</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Li</surname></string-name>,
          <article-title>Citation count prediction: learning to estimate future citations for literature</article-title>,
          in: <source>Proceedings of the 20th ACM International Conference on Information and Knowledge Management - CIKM '11</source>,
          ACM Press, Glasgow, Scotland, UK,
          <year>2011</year>,
          p. <fpage>1247</fpage>.
          URL: http://dl.acm.org/citation.cfm?doid=2063576.2063757. doi:10.1145/2063576.2063757.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>T. N.</given-names> <surname>Kipf</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Welling</surname></string-name>,
          <article-title>Semi-Supervised Classification with Graph Convolutional Networks</article-title>,
          in: <source>5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings</source>,
          OpenReview.net,
          <year>2017</year>.
          URL: https://openreview.net/forum?id=SJU4ayYgl.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>T.</given-names> <surname>Yu</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Yu</surname></string-name>,
          <string-name><given-names>P.-Y.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Wang</surname></string-name>,
          <article-title>Citation impact prediction for scientific papers using stepwise regression analysis</article-title>,
          <source>Scientometrics</source>
          <volume>101</volume>
          (<year>2014</year>)
          <fpage>1233</fpage>-<lpage>1252</lpage>.
          URL: http://link.springer.com/10.1007/s11192-014-1279-6. doi:10.1007/s11192-014-1279-6.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>D.</given-names> <surname>Kang</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Ammar</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Dalvi</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>van Zuylen</surname></string-name>,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>