Longitudinal Citation Prediction using Temporal Graph
Neural Networks
Andreas Nugaard Holm1 , Barbara Plank2 , Dustin Wright1 and Isabelle Augenstein1
1 University of Copenhagen, Universitetsparken 1, 2100 København, Denmark
2 IT University of Copenhagen, Rued Langgaards Vej 7, 2300 København, Denmark


Abstract
Citation count prediction is the task of predicting the number of citations a paper has gained after a period of time. Prior work has viewed this as a static prediction task. As papers and their citations evolve over time, considering the dynamics of citation counts over time is the logical next step. Here, we introduce the task of sequence citation prediction, where the goal is to accurately predict the trajectory of the number of citations a scholarly work receives over time. We propose to view papers as a structured network of citations, allowing us to use topological information as a learning signal. Additionally, we learn how this dynamic citation network changes over time and the impact of paper meta-data such as authors, venues and abstracts. To approach the new task, we derive a dynamic citation network from Semantic Scholar spanning 42 years. We present a model which exploits topological and temporal information using graph convolution networks paired with sequence prediction, and compare it against multiple baselines, testing the importance of topological and temporal information and analyzing model performance. Our experiments show that leveraging both the temporal and topological information greatly increases the performance of predicting citation counts over time.

Keywords
citation count prediction, graph neural network, citation network, dynamic graph generation

SDU@AAAI'22: Workshop on Scientific Document Understanding, March 01, 2022
aholm@di.ku.dk (A. N. Holm); bapl@itu.dk (B. Plank); dw@di.ku.dk (D. Wright); augenstein@di.ku.dk (I. Augenstein)
ORCID: 0000-0002-2006-5894 (A. N. Holm); 0000-0002-4394-1965 (B. Plank); 0000-0001-6514-8733 (D. Wright); 0000-0003-1562-7909 (I. Augenstein)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).



1. Introduction
The problem of predicting the citation counts of papers is a long-standing research problem. Predicting citation counts allows us to better understand the relationship between a paper and its impact. However, prior research has viewed this as a static prediction problem, i.e. only predicting a single citation count at a static point in time. This ignores the natural development of the data as new papers are being published. Here, we propose to view the problem as a sequence prediction task, giving models the ability to capture the evolving nature of citations.

Figure 1: Illustration of the development of the dynamic graph through three time steps. Each node represents a paper; edges are citations between papers. Red nodes represent new papers in the current time step.

This, in turn, requires a dataset containing papers' citation counts over a period of time, which adds a temporal element to the data that can then be encoded by sequential machine learning models such as long short-term memory (LSTM) networks [1]. Additionally, scholarly documents exhibit a natural graph-like structure in their citation networks. Given recent developments in modeling such data [2, 3] and prior research showing that modeling input as graphs can be beneficial, we hypothesize that modeling a paper's citation network is useful for predicting citation counts over time. In this paper, we consider citation networks: dynamic graphs which evolve over time as new citations and papers are added. Leveraging the structured data in the graph allows us to discover complex relationships between papers. We want to tap into that knowledge and treat the citation data as a network, such that we can exploit topological information in addition to temporal information. In doing so, we investigate the hypothesis that paper citation counts are correlated with features such as authors, venue, and topics.

We use the well-established Semantic Scholar dataset [4] to construct our citation network. Its meta-data allows us to construct a dynamic citation network which covers a 42-year time-line, with an updated graph for each year. The Semantic Scholar meta-data also contains information about each paper's authors, venue, and topics, allowing us to study the correlation between these features and the citation count of a paper when considering the evolving nature of the citation network.
The correlation between these features and citation counts is well-known and has been studied in prior work [5]. Prior studies show a strong correlation between citation counts and features such as authors, but are limited to predicting a single citation count, and do not predict the natural evolution of a paper's citations.

We propose to use the constructed dynamic citation network (see Section 4.2) to predict the trajectory of the number of citations papers will receive over time, a new sequence prediction task introduced in this work. Furthermore, we propose an encoder-decoder model to solve the proposed task, which uses graph convolutional layers [6] to exploit the graphs' topological features and an LSTM to model the temporal component of the graphs. We compare our model against a vanilla graph convolutional network (GCN) and a vanilla LSTM, which individually incorporate either the topological information or the temporal information, but not both.

Our contributions are as follows: 1) A dynamic citation network based on the Semantic Scholar dataset. The dynamic citation network contains 42 time-steps, with an updated graph at each time-step, based on yearly information. 2) We introduce the task of sequence citation count prediction. 3) A novel encoder-decoder model based on a GCN and LSTM to extract the dynamic graph's topological and temporal components. 4) A thorough study of the correlation between citation counts and temporal components.

2. Related Work

The task of predicting a paper's citations aims to predict the number of citations which a paper has obtained either by a given year or after 𝑛 years. The task itself is not new and has been researched over the years, and multiple different approaches have been tried and shown to be effective. Some of these studies have focused on feature vectors [5, 7] and explored distinct feature vectors' performance, primarily relying on meta-data, e.g. venue and authors. As peer review data has become available [8], recent research has focused on using non-meta-data information, such as peer reviews [9, 10], to predict a paper's citation count.

What is common in existing research is the target: predicting a single citation count. This count can be set as one of the following years, or the citation count 𝑛 years in the future. To predict these citation counts, we see a variety of neural network models with distinct architectures [10, 11], as well as papers which focus on deeper feature vector analysis, where regression models are used [12, 7]. A side effect of prior research's focus on predicting single citation counts is that the utilized citation networks are static graphs, based on paper databases such as ArnetMiner [13], Arxiv HEP-TH [14] and CiteSeerX [15]. These static citation networks are not suitable for our proposed task because they only contain the topological information at a single point in time. As longitudinal citation datasets are rare, we derive a dataset from Semantic Scholar.

Citation networks are not exclusively used for citation count prediction. Other citation networks such as Cora [16], CiteSeer [17] or PubMed [16], all well-known benchmark graphs, are used for node classification tasks, where the task is to predict a paper's topic. These networks are provided with minimal content. They consist of an adjacency matrix describing the connections between papers, and a simple feature vector for each node, either a 0/1-valued vector or a tf-idf vector based on the dictionary of the paper content. These existing datasets do not fit our purpose, hence we derive our own, described in Sec. 4.2.

3. Temporal Graph Neural Network

Our model is an encoder-decoder model and therefore consists of two major components. The first component is the encoder, which takes an adjacency matrix of node connections and a node feature matrix as input, where the node feature matrix can e.g. consist of author information (illustrated in Figure 2). Via a GCN, it uses the topological information from the graphs to create feature vectors combining topological and node features. It should be noted that due to the use of dynamic graphs, the encoder generates a sequence of graph embeddings, one for each graph in the sequence. The second component, the decoder, utilizes the sequence of graph embeddings created by the encoder. Using an LSTM, we extract the temporal elements and create a sequence of citation count predictions (CCP) for each node in the dynamic graph.

3.1. Problem Definition

While the task of CCP has been researched before, in this paper we are interested in predicting a sequence of citation counts, which to our knowledge is so far unexplored.

Let us start by introducing our graph notation. We denote our dynamic graph as 𝐺 = {𝐺_0 … 𝐺_(𝑇−1)}, where 𝐺_𝑡 is a graph at the given time 𝑡. Each graph in the dynamic graph set is defined as 𝐺_𝑡 = (𝑉_𝑡, 𝐸_𝑡), where 𝑉_𝑡 is the set of vertices at time 𝑡 and 𝐸_𝑡 is the set of edges at time 𝑡. With a given dynamic graph, we aim to predict the sequence of citations for a given paper. We formalize this as 𝑦^𝑣 = {𝑦_1^𝑣 … 𝑦_𝑇^𝑣}, where 𝑦_𝑡^𝑣 is the number of citations for 𝑣_𝑡 ∈ 𝑉_𝑡 and 𝑦_𝑡^𝑣 = |𝐸_𝑡^𝑣|. For our proposed task, we are given the dynamic graph 𝐺, and are to predict the sequence of citation counts 𝑦.
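To make this notation concrete, the following minimal sketch (plain Python; the function and variable names are ours, not from the paper) mirrors the definitions above: the dynamic graph is a list of yearly snapshots (𝑉_𝑡, 𝐸_𝑡), and a paper's citation sequence is read off as its in-degree per snapshot.

    from collections import Counter

    def citation_sequence(snapshots, v):
        # y^v = {y_1^v ... y_T^v}: citations of paper v at each time step,
        # i.e. the in-degree of v per snapshot (edges are (citing, cited)).
        sequence = []
        for vertices, edges in snapshots:
            in_degree = Counter(cited for _, cited in edges)
            sequence.append(in_degree[v] if v in vertices else 0)
        return sequence

    # Three toy snapshots: paper "a" gains one citation per year.
    snapshots = [
        ({"a", "b"}, {("b", "a")}),
        ({"a", "b", "c"}, {("b", "a"), ("c", "a")}),
        ({"a", "b", "c", "d"}, {("b", "a"), ("c", "a"), ("d", "a")}),
    ]
    print(citation_sequence(snapshots, "a"))  # [1, 2, 3]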

3.2. Topological Feature Extraction

One of the central hypotheses we want to examine is whether complex structural dependencies in a citation network can help predict the citation count of a paper. To test this, we employ a GCN to extract topological dependencies from the graphs. We choose a GCN over other methods as GCNs work in Euclidean space and are thus easy to combine with other neural architectures such as convolutional neural networks (CNNs) [3].

The GCN uses the data flow between edges in the graph to create a graph embedding. As such, we can create an embedding influenced by all of the neighboring nodes in the graph. In this, we hypothesize that there is a relationship between the number of citations a given paper receives and that of its neighbors. The connections between the papers are described by an adjacency matrix 𝐴. Using our notation, we describe the GCN as follows:

    𝐻^(𝑙+1) = 𝜎(𝐷̃^(−1/2) 𝐴̃ 𝐷̃^(−1/2) 𝐻^(𝑙) 𝑊^(𝑙)),    (1)

where 𝐴̃ = 𝐴 + 𝐼; 𝐼 is the identity matrix (which enables self-loops in 𝐴̃); 𝐷̃_𝑖𝑖 = ∑_𝑗 𝐴̃_𝑖𝑗; 𝑙 is the 𝑙'th layer in the model; 𝜎 is an activation function; and 𝐻^(𝑙+1) is the output of the GCN layer given input 𝐻^(𝑙). We can then simplify the above equation:

    𝐻^(𝑙+1) = 𝜎(𝐴̂_𝑡 𝐻_𝑡^(𝑙) 𝑊^(𝑙)),    (2)

where 𝐴̂ is defined as 𝐴̂ = 𝐷̃^(−1/2) 𝐴̃ 𝐷̃^(−1/2) and 𝑡 is the time step in the dynamic graph. It should be noted that 𝑡 has been left out of the first equation for simplicity. We also observe here that by stacking multiple GCN layers, we allow the graph embeddings to be affected by extended neighbours.
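Equation (2) translates almost directly into code. As a minimal sketch (PyTorch; the helper and class names are ours), one propagation step with the renormalized adjacency 𝐴̂ can be written as:

    import torch

    def normalize_adjacency(A):
        # A_hat = D~^(-1/2) A~ D~^(-1/2), with A~ = A + I adding self-loops.
        A_tilde = A + torch.eye(A.size(0))
        d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)
        return d_inv_sqrt.unsqueeze(1) * A_tilde * d_inv_sqrt.unsqueeze(0)

    class GCNLayer(torch.nn.Module):
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.W = torch.nn.Linear(in_dim, out_dim, bias=False)

        def forward(self, A_hat, H):
            # H^(l+1) = sigma(A_hat H^(l) W^(l)), cf. Equation (2).
            return torch.relu(self.W(A_hat @ H))

Stacking two such layers, as in our model, lets each node's embedding be influenced by its two-hop neighbourhood.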
Since we work on a dynamic citation network, we have 𝑇 distinct adjacency matrices, and we have to create a graph embedding for each graph in the sequence:

    𝑍 = {𝑍_0 … 𝑍_𝑇} = {𝑓(𝑋, 𝐴_𝑡) | 𝐴_𝑡 ∈ 𝐴},    (3)

where the function 𝑓 is the GCN network, 𝑍_𝑡 ∈ ℝ^(𝑚×𝑛) is a single graph embedding of dimensionality 𝑛 with 𝑚 nodes, and 𝑍 is the set of graph embeddings created by the GCN. It should be noted that 𝑋 is shown as being independent of time, which is true for some of our node embeddings. However, some of our node embeddings are based on citations, which change over time, making 𝑋 time-dependent. We will explore the distinct node embeddings in a later section. As shown in the equation, we also keep the same model over time, and do not change the GCN even though the graph changes. We instead aim for a model that generalizes across all the graphs in the dynamic graph.

Figure 2: Our proposed encoder-decoder model.

3.2.1. Temporal Feature Extraction

The constructed graph embeddings contain both topological and node information. From this sequence of graph embeddings, we then want to extract the temporal information. To do so, we utilize an LSTM, where we can formalize the input and output as 𝑌 = 𝑙(𝑍), where the function 𝑙 is the LSTM and 𝑌 ∈ ℝ^(𝑚×𝑇) are the CCPs.

3.2.2. Encoder-Decoder

In the final model, we combine the GCN and LSTM in an encoder-decoder model. The primary challenge in combining these two models is that they operate on vastly different inputs. The GCN operates on entire graphs and needs all the nodes to appear in the graphs, including the nodes whose citation counts it is to predict. The LSTM, however, does not have this requirement and can work on batches. To solve this issue in a simple yet effective way, we embed the entire graph prior to the LSTM steps, so that in the LSTM step we can still split the data into batches for training, validation and testing. While other approaches have been researched, like embedding the GCN into the LSTM [18], we found the simple approach to perform better.

Figure 2 shows the architecture of our model. The GCN uses two layers to create the graph embedding. The LSTM is a single one-directional layer whose outputs are reduced to a sequence of scalars through a linear layer.
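Putting the two components together, a compact sketch of the encoder-decoder follows (PyTorch, reusing the GCNLayer sketch above; the hidden sizes follow Section 5.2, the class and variable names are ours):

    import torch

    class TemporalGNN(torch.nn.Module):
        # GCN encoder applied to every yearly snapshot, LSTM decoder over time.
        def __init__(self, feat_dim, gcn_dim=256, lstm_dim=128):
            super().__init__()
            self.gcn1 = GCNLayer(feat_dim, gcn_dim)
            self.gcn2 = GCNLayer(gcn_dim, gcn_dim)
            self.lstm = torch.nn.LSTM(gcn_dim, lstm_dim, batch_first=True)
            self.out = torch.nn.Linear(lstm_dim, 1)

        def forward(self, A_hats, X):
            # Encoder: one graph embedding Z_t per snapshot, cf. Equation (3).
            Z = torch.stack([self.gcn2(A, self.gcn1(A, X)) for A in A_hats], dim=1)
            # Decoder: nodes act as the batch dimension; one prediction per step.
            hidden, _ = self.lstm(Z)              # (num_nodes, T, lstm_dim)
            return self.out(hidden).squeeze(-1)   # (num_nodes, T)

Because the whole graph is embedded before the LSTM step, the rows of Z can be freely batched for training, validation and testing.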
4. Dynamic Citation Count Prediction

As discussed earlier, we differentiate ourselves from prior work by predicting a sequence of citation counts over time as opposed to a single final citation count. Datasets for the latter exist, but are based on paper databases. Existing citation networks are not usable for our task because the graph of the citation network is static in those works, i.e., the citation network does not evolve over time. Given this, we construct a dataset in which we reconstruct the citation network at each time-step, for the purpose of studying citation count prediction over time.

4.1. Dataset

The dataset which we use to create our dynamic graph is based on Semantic Scholar [4] (https://api.semanticscholar.org/). The dataset is a collection of close to 200,000,000 scientific papers; a graph of this size requires an immense system to run experiments on (recall the size of 𝑌 ∈ ℝ^(𝑚×𝑇), where 𝑚 is the number of papers). To reduce the dataset to a manageable size, we only keep papers from the following venues related to AI, Machine Learning and Natural Language Processing: ACL, COLING, NAACL, EMNLP, AAAI, NeurIPS and CoNLL. With the dataset only containing papers from the listed venues, we reduce its size to 47,091 papers. Furthermore, the Semantic Scholar dataset holds an extensive collection of meta-data for each paper. We use this meta-data to construct our dynamic graph, as well as the graph's node embeddings.
4.2. Graph Construction

With the dataset reduced to a more manageable size, we search for an ideal dynamic graph of the citation network. We do this because working with graphs can be computationally heavy, and the size of the graph based on the full Semantic Scholar dataset can make some computations near infeasible. We define an ideal dynamic graph as the sequence of graphs which has the largest connected graph in the final time step and the most significant increase in nodes over time. We do not simply use the largest connected graph at each time step, as this can trick us into selecting a sub-optimal dynamic graph. A sub-optimal dynamic graph may present itself as the largest connected graph at a point in time, but will not stay the largest connected graph through time, and will contain fewer nodes over time compared to the ideal dynamic graph. To avoid selecting such a less ideal dynamic graph, we probe each node in the data to observe the graphs' evolution. We define probing as the process of observing the evolution of the graph connected to the probed node. This process is automatically performed on all nodes of the largest connected graph in the final step. By probing all the nodes, we can choose the sequence of graphs which contains the most nodes over time. Algorithm 1 describes the construction of the ideal dynamic graph in pseudo-code.

Algorithm 1: Dynamic Graph Construction
Input: data
Output: G

    connected_graphs ← dict()
    for y ∈ years do
        gs ← find_connected_graphs(data[y])
        connected_graphs[y] ← sort(gs)
    end
    for paper ∈ data[min(years)] do
        key_size[paper] ← 0
        for y ∈ years do
            best ← 0
            for g ∈ connected_graphs[y] do
                if paper ∈ g and |g| > best then
                    best ← |g|
                end
            end
            key_size[paper] ← key_size[paper] + best
        end
    end
    best_paper ← argmax(key_size)
    G ← dict()
    for y ∈ years do
        for g ∈ connected_graphs[y] do
            if best_paper ∈ g then
                G[y] ← g
                break
            end
        end
    end
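As a rough runnable counterpart to Algorithm 1, the following sketch uses networkx for the connected-component search; it assumes data[y] is the list of citation edges present up to year y, which is our reading of the pseudo-code.

    import networkx as nx

    def build_dynamic_graph(data, years):
        # Connected components (as node sets) of every yearly snapshot.
        components = {}
        for y in years:
            g = nx.Graph(data[y])
            components[y] = sorted(nx.connected_components(g), key=len, reverse=True)

        # Probing: score each paper from the first year by the summed size
        # of the largest component containing it at every time step.
        key_size = {}
        for paper in nx.Graph(data[min(years)]).nodes:
            key_size[paper] = sum(
                max((len(c) for c in components[y] if paper in c), default=0)
                for y in years)
        best_paper = max(key_size, key=key_size.get)

        # The ideal dynamic graph: the component around best_paper per year.
        return {y: next((c for c in components[y] if best_paper in c), set())
                for y in years}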
In Table 1, we show some of the properties of the last 10 graphs in the dynamic graph. It is clear that the graph evolves over time: both the number of vertices and edges increase, and the degree 𝐷 increases, indicating that the nodes in the graph obtain more citations over time. This indicates that the dynamic graph reflects the natural growth of a paper's citations.

By only using a subset of the nodes from the full graph to construct the dynamic graph, we lose some of the full graph's properties. One notable property of the full graph is that the citation count of a paper is tied to the degree of a node; on a subset of the full graph this property no longer holds, so the definition of the size of the set of edges, 𝑦_𝑡^𝑣 = |𝐸_𝑡^𝑣|, changes to 𝑦_𝑡^𝑣 ≥ |𝐸_𝑡^𝑣| for a given node. Another important point is that removing edges from the graph removes some of the information contained in the full graph (e.g. links to papers in other fields). Such edges are usually connected to more prominent papers, because it is often the high-impact papers which obtain citations from papers outside the main field.
                         2011     2012     2013     2014     2015     2016     2017     2018     2019     2020
  |𝑉|                  14,584   16,603   18,529   20,760   23,327   26,529   29,293   33,759   38,080   38,168
  |𝐸|                 103,519  127,277  152,869  181,666  217,807  267,940  308,186  387,738  475,007  476,015
  Mean 𝐷                 7.1     7.67     8.25     8.75     9.34     10.1    10.52    11.49    12.47    12.47
  Max 𝐷                  614      761      923    1,072    1,220    1,371    1,496    1,763    2,084    2,086
  Max citation count   2,584    3,110    3,637    4,186    4,740    5,403   11,385   20,893   32,278   35,200
  Avg. citation count  26.33    27.48    28.87    30.15    31.49    32.94    35.84    38.31    43.0     45.41

Table 1
Key values of the graphs.



4.3. Feature Generation

The nodes of the created dynamic graph are not dependent on a set of specific features, and we can therefore select and create a set of features for each node containing our desired information. With a wide variety of meta-data fields available, we created a set of distinct features which we use for our predictions. Furthermore, we study how each of these features affects the performance of the model.

The choice of using authors and venues as features for our model is based on the hypothesis that the authors listed on a paper have a major impact on the number of citations gained. We assume the same goes for venues: if a paper is published at a more highly ranked venue, it is more likely to gain a large number of citations compared to a paper published at a lower-ranking venue. We further motivate the choice of these two features based on prior work [5], which shows that author rank and venue rank are indeed two of the three most predictive features. We motivate the choice of using the abstract based on the assumption that the abstract of a paper contains information on the topics discussed in the paper, which can be used to identify whether a paper's topic is currently popular [19]. The following paragraphs provide short descriptions of the meta-data used to create these feature vectors and how each of them is calculated.

Abstract: To base our model on more than meta-data, we use the abstracts of the papers to create a feature vector. To create an embedding of the abstract, we utilize BERT [20], specifically the pre-trained SciBERT [21] model. SciBERT is a contextualized embedding model trained using a masked language modeling objective on a large amount of scholarly literature. Representations from SciBERT have been shown to be useful for learning downstream tasks with scientific text, which is why we use them here. To obtain a feature vector for a given abstract, we tokenize the abstract text and pass it through SciBERT. SciBERT prepends a special [CLS] token for performing classification tasks, so we use the output representation of this token as the final feature vector for an abstract.

Author rank: To include the author information, we created a feature vector which ranks the authors by their number of citations, sorted from highest to lowest. As many authors have the same number of citations, we allow authors to share a rank. As the final step of the feature calculation, we normalize the rankings by 𝑋′ = (𝑋 − 𝑋_min)/(𝑋_max − 𝑋_min).

Venue rank: Together with the author rank, we also hypothesize that the venue has an impact on the number of citations of a paper. Therefore, we also created a feature ranking for the venues, calculated identically to the author rank. It should be mentioned that the meta-data contains a large number of different labels for each of the venues we use. We reduce all the different labels of the same venue down to a single label per venue, but keep each venue separated by year.
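As an illustration, the abstract and rank features could be computed as follows (a sketch using the Hugging Face transformers SciBERT checkpoint; the helper names and details such as truncation are our assumptions):

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
    scibert = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

    def abstract_feature(abstract):
        # 768-dimensional [CLS] representation of the abstract text.
        inputs = tokenizer(abstract, truncation=True, return_tensors="pt")
        with torch.no_grad():
            return scibert(**inputs).last_hidden_state[0, 0]

    def rank_feature(citation_counts):
        # Rank by citations (highest first); ties share a rank. Then min-max
        # normalize: X' = (X - X_min) / (X_max - X_min). Assumes at least
        # two distinct counts, otherwise the denominator is zero.
        distinct = sorted(set(citation_counts), reverse=True)
        rank_of = {c: r for r, c in enumerate(distinct)}
        x = torch.tensor([rank_of[c] for c in citation_counts], dtype=torch.float)
        return (x - x.min()) / (x.max() - x.min())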
5. Experiments

In this section we present our experiments and results, and explore the importance of exploiting topological and temporal information.

5.1. Data

We use the constructed dynamic graph for our experiments and test each of the three distinct feature vectors. A detailed description of the feature vectors and the dynamic graph's construction can be found in Section 4. We split our data into training, validation, and test sets of 60%, 20%, and 20%, respectively. With these splits, we obtain a training set of 22,900 papers, and validation and test sets of 7,634 papers each. The training, validation and test sets are generated randomly, but are kept fixed throughout the experiments.
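The fixed random split can be reproduced with a seeded shuffle, e.g. (a minimal sketch; the seed value is arbitrary):

    import random

    def split_nodes(nodes, seed=42):
        # Fixed 60/20/20 train/validation/test split over the graph's nodes.
        nodes = sorted(nodes)            # deterministic starting order
        random.Random(seed).shuffle(nodes)
        n_train, n_val = int(0.6 * len(nodes)), int(0.2 * len(nodes))
        return (nodes[:n_train],
                nodes[n_train:n_train + n_val],
                nodes[n_train + n_val:])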
                                             GCN + LSTM                 LSTM              GCN
                               Abstract    0.8284 ± 0.0162     1.0164 ± 0.0140    1.279 ± 0.1350
                                Author     0.7477 ± 0.0166     1.0184 ± 0.0273   1.1089 ± 0.0357
                                 Venue     0.9259 ± 0.1161     1.0414 ± 0.0197   1.0828 ± 0.0030
                        Author + Venue     0.7572 ± 0.0131     1.0186 ± 0.0240   1.1248 ± 0.0271
                                    All    0.7940 ± 0.0138     1.0152 ± 0.0157   1.3115 ± 0.1681

Table 2
The performance of our 3 models over a 10-year period. The results are reported as the MAE of the log citations. For the 10-year period, our deterministic approach has an MAE of 1.6378.

                                             GCN + LSTM                 LSTM              GCN
                               Abstract    0.8001 ± 0.0147     1.0149 ± 0.0414   1.6690 ± 0.4404
                                Author     0.7462 ± 0.0911     1.0179 ± 0.0536   1.3756 ± 0.0334
                                 Venue     0.8525 ± 0.1348     1.0156 ± 0.0388   1.3212 ± 0.0039
                        Author + Venue     0.7515 ± 0.0889     1.0132 ± 0.0480   1.3598 ± 0.0461
                                    All    0.7803 ± 0.0167     1.0165 ± 0.0383   1.5177 ± 0.1892

Table 3
The performance of our 3 models over a 20-year period. The results are reported as the MAE of the log citations. For the 20-year period, our deterministic approach has an MAE of 2.0796.



Due to the large number of time-steps in the dynamic graph, we create two different setups for our experiments: one which uses the last 10 years, and another which uses the last 20 years of the dynamic graph. We use the later years of the dynamic graph as these years contain the most papers and the graph has evolved the most.

While not mentioned in Section 4.3, we perform some further pre-processing of the data. For the feature vectors of author rank and venue rank, we normalize the values. We also pre-process the labels due to the high fluctuation of citation counts: we take log(𝑐 + 1) of the citation count of a paper as the label [22]. Taking the log of the citation counts increases the stability of the model during training.
5.2. Experimental Setup

We perform experiments with three distinct models: 1) our proposed model, consisting of a GCN and LSTM; 2) a standard LSTM; 3) a standard GCN. All hyper-parameters are shared across the models. We evaluate models at specific times and over time.

For our selected models, we use the Adam [23] optimizer with a learning rate of 0.001. For the GCN we use two layers, each consisting of 256 hidden units; both the pure GCN and the GCN with LSTM use this setup. The LSTM has a single uni-directional layer of 128 hidden units, with the output reduced to 1 dimension by a linear layer. For the models using an LSTM, we set the batch size to 256. We run the models for 1000 epochs, and if no improvement of the best validation score is observed over 10 epochs, we terminate training early. As mentioned, we use SciBERT to encode the abstracts, with an output vector of size 768. The models are run with random seeds, and each experiment is executed 10 times. In the results section, we report the mean and standard deviation over the 10 runs.

We compare to a simple deterministic baseline: predicting the mean citation count of the training and validation data at each time step.
5.3. Evaluation Metric

To evaluate the performance of the models, we measure the mean absolute error, defined as

    𝑀𝐴𝐸 = (1/𝑁) ∑_(𝑖=1)^𝑁 |𝑦_𝑖 − 𝑦̂_𝑖|,    (4)

where 𝑦_𝑖 are the citation counts and 𝑦̂_𝑖 the predicted values. We chose MAE instead of mean squared error (MSE) to mitigate the influence of outlier papers with a very high number of citations, and we additionally use MAE as the training objective for the same reason.
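Combined with the log(𝑐 + 1) label transform from Section 5.1, the training objective amounts to an L1 loss on log-counts. A minimal sketch (PyTorch; the function name is ours):

    import torch

    def citation_loss(pred_log_counts, raw_counts):
        # MAE (Equation 4) against the log(c + 1)-transformed citation counts.
        targets = torch.log1p(raw_counts.float())
        return torch.nn.functional.l1_loss(pred_log_counts, targets)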
5.4. Results

As previously mentioned, we run our experiments on dynamic graphs of 10 and 20 years. The results of the 10-year experiment are shown in Table 2, and the results of the 20-year experiment in Table 3. The tables show that our models outperform the simple deterministic approach. Figure 3 shows results over time.
Figure 3: MAE at each time-step; the left plot shows the MAE for our 10-year experiments and the right plot for our 20-year experiments. The 𝑥-axis shows the time and the 𝑦-axis the MAE.



By inspecting the results, one can clearly observe that the GCN-LSTM has the best performance among the three models. We further observe that the GCN-LSTM improves on the performance of the pure GCN and LSTM individually, indicating that it learns from both the temporal and the topological information provided by the dynamic citation network. Furthermore, the GCN increases in error going from a 10-year interval to a 20-year interval, whereas we see the other models slightly improve. To further study this, we plot the error at the different time steps in Figure 3, which shows the models' performance over time. By inspecting the plots, we observe a trend of the pure models, i.e. the GCN and LSTM, struggling and deteriorating over time, compared to the combined GCN-LSTM model, which keeps improving over time until it starts plateauing. Comparing the 10-year and 20-year plots, one can observe that the deterioration continues where the 10-year plot stops. It can also be seen that the GCN-LSTM keeps improving up until year 10, where it levels out. All of the models decrease drastically in error up until two time-steps; afterwards, the pure models start deteriorating.

5.5. Discussion

Tables 2 and 3 show the impact of single feature types. We hypothesize that author information is very predictive, as shown by prior work. Inspecting the results of the different feature ablations, we observe that the author features perform best, confirming our hypothesis. Figure 3 further confirms this, showing that a large part of the model's gain over time stems from author information.

The feature vector created from the venues performs the worst in both experiments. We hypothesize that the venues' performance could be improved if a more standardized notation for venue meta-data were available: the venue labels are noisy (also due to OCR errors) and have many spelling variants.

To further study the impact of features, we calculate the average MAE for each distinct author and venue, using the predictions made by the GCN-LSTM trained on the author feature vectors over 20 years. We show the results for venues in Table 4 and for authors in Table 5. One can observe that the difference between the top and the bottom venue is drastically lower than the difference between the top and bottom author. This further indicates that author features are strongly predictive of citation counts.

We also show the average degree and the number of papers for each of the venues in Table 4. With a higher representation of papers in the collection, we expect a more reliable prediction. This is indeed the case: we observe that the top venues often have a higher number of papers in their collection. To further analyse this, we observe the average degree of the papers in the collection; however, we do not notice higher performance where the degree is higher. This indicates that the model is better at predicting papers with higher citation counts, because the degree of a node is tightly bound to the number of citations.
        Venue         MAE      Avg. degree   𝑛
  1     COLING 1973   0.04295  1             20
  2     AAAI 2020     0.06397  4.67          240
  3     NAACL 2019    0.0863   15.25         2160
  ⋮
  185   ACL 1983      0.7714   2             20
  186   ACL 1988      0.7794   19.6          100
  187   EMNLP 1998    0.8917   4.5           40

Table 4
The top 3 and bottom 3 venues, sorted by the mean MAE, going from lowest to highest.

          Author ID   MAE      Avg. degree   𝑛
  1       32968       0.0131   14            1
  2       22037       0.0131   14            1
  3       32969       0.0131   14            1
  ⋮
  24536   1375        2.6356   5             1
  24537   807         2.6356   5             1
  24358   4290        2.6356   5             1

Table 5
The top 3 and bottom 3 authors, sorted by the mean MAE, going from lowest to highest.

6. Conclusions

In this paper, we propose the task of citation sequence prediction. We introduce a new dataset of scholarly documents for this task based on a dynamic citation graph evolving over 42 years, starting from a single node and growing into a large graph. We further study the effect of temporal and topological information, and propose a model that benefits from both types of information (GCN+LSTM). Our results show that utilizing both the temporal and topological information is superior to utilizing either the temporal or topological information alone. Using the proposed model, we study the effect of different features to identify which information is most predictive of a paper's citation count over time. We find author information to be the most predictive and informative over time.

In future work, the impact of training a single GCN on the dynamic graph could be explored, since the error of the GCN deteriorates quickly over time.

Acknowledgments

We would like to thank Johannes Bjerva for the fruitful discussions in the early stages. This work is partly funded by the Independent Research Fund Denmark under grant agreement number 9065-00131B.

References

[1] S. Hochreiter, J. Schmidhuber, Long Short-Term Memory, Neural Computation 9 (1997) 1735–1780. URL: https://www.mitpressjournals.org/doi/abs/10.1162/neco.1997.9.8.1735. doi:10.1162/neco.1997.9.8.1735.

[2] J. Zhou, G. Cui, S. Hu, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, M. Sun, Graph neural networks: A review of methods and applications, AI Open 1 (2020) 57–81. URL: https://www.sciencedirect.com/science/article/pii/S2666651021000012. doi:10.1016/j.aiopen.2021.01.001.

[3] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, P. S. Yu, A Comprehensive Survey on Graph Neural Networks, IEEE Transactions on Neural Networks and Learning Systems 32 (2021) 4–24. doi:10.1109/TNNLS.2020.2978386.

[4] W. Ammar, D. Groeneveld, C. Bhagavatula, I. Beltagy, M. Crawford, D. Downey, J. Dunkelberger, A. Elgohary, S. Feldman, V. A. Ha, R. M. Kinney, S. Kohlmeier, K. Lo, T. C. Murray, H.-H. Ooi, M. E. Peters, J. L. Power, S. Skjonsberg, L. L. Wang, C. Wilhelm, Z. Yuan, M. v. Zuylen, O. Etzioni, Construction of the Literature Graph in Semantic Scholar, in: NAACL-HLT, 2018. doi:10.18653/v1/N18-3011.

[5] R. Yan, J. Tang, X. Liu, D. Shan, X. Li, Citation count prediction: learning to estimate future citations for literature, in: Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM '11), ACM Press, Glasgow, Scotland, UK, 2011, p. 1247. URL: http://dl.acm.org/citation.cfm?doid=2063576.2063757. doi:10.1145/2063576.2063757.

[6] T. N. Kipf, M. Welling, Semi-Supervised Classification with Graph Convolutional Networks, in: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, OpenReview.net, 2017. URL: https://openreview.net/forum?id=SJU4ayYgl.

[7] T. Yu, G. Yu, P.-Y. Li, L. Wang, Citation impact prediction for scientific papers using stepwise regression analysis, Scientometrics 101 (2014) 1233–1252. URL: http://link.springer.com/10.1007/s11192-014-1279-6. doi:10.1007/s11192-014-1279-6.

[8] D. Kang, W. Ammar, B. Dalvi, M. van Zuylen, S. Kohlmeier, E. Hovy, R. Schwartz, A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications, arXiv:1804.09635 [cs] (2018). URL: http://arxiv.org/abs/1804.09635.

[9] B. Plank, R. v. Dalen, CiteTracked: A Longitudinal Dataset of Peer Reviews and Citations, in: Proceedings of BIRNDL at ACM SIGIR, Paris, France, July 25, 2019, volume 2414, CEUR-WS.org, 2019, pp. 116–122.

[10] S. Li, W. X. Zhao, E. J. Yin, J.-R. Wen, A Neural Citation Count Prediction Model based on Peer Review Text, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 4913–4923. URL: https://www.aclweb.org/anthology/D19-1497. doi:10.18653/v1/D19-1497.

[11] J. Wen, L. Wu, J. Chai, Paper Citation Count Prediction Based on Recurrent Neural Network with Gated Recurrent Unit, in: 2020 IEEE 10th International Conference on Electronics Information and Emergency Communication (ICEIEC), 2020, pp. 303–306. doi:10.1109/ICEIEC49280.2020.9152330. ISSN: 2377-844X.

[12] F. Davletov, A. S. Aydin, A. Cakmak, High Impact Academic Paper Prediction Using Temporal and Topological Features, in: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management (CIKM '14), ACM Press, Shanghai, China, 2014, pp. 491–498. URL: http://dl.acm.org/citation.cfm?doid=2661829.2662066. doi:10.1145/2661829.2662066.

[13] J. Tang, D. Zhang, L. Yao, Social Network Extraction of Academic Researchers, in: Proceedings of the 2007 Seventh IEEE International Conference on Data Mining, ICDM '07, IEEE Computer Society, USA, 2007, pp. 292–301. URL: https://doi.org/10.1109/ICDM.2007.30. doi:10.1109/ICDM.2007.30.

[14] J. N. Manjunatha, K. R. Sivaramakrishnan, R. K. Pandey, M. N. Murthy, Citation prediction using time series approach KDD Cup 2003 (task 1), ACM SIGKDD Explorations Newsletter 5 (2003) 152–153. URL: https://doi.org/10.1145/980972.980993. doi:10.1145/980972.980993.

[15] C. Caragea, J. Wu, A. Ciobanu, K. Williams, J. Fernández-Ramírez, H.-H. Chen, Z. Wu, L. Giles, CiteSeerX: A Scholarly Big Dataset, in: M. de Rijke, T. Kenter, A. P. de Vries, C. Zhai, F. de Jong, K. Radinsky, K. Hofmann (Eds.), Advances in Information Retrieval, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2014, pp. 311–322. doi:10.1007/978-3-319-06028-6_26.

[16] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, T. Eliassi-Rad, Collective Classification in Network Data, AI Magazine 29 (2008) 93–93. URL: https://ojs.aaai.org/index.php/aimagazine/article/view/2157. doi:10.1609/aimag.v29i3.2157.

[17] C. L. Giles, K. D. Bollacker, S. Lawrence, CiteSeer: an automatic citation indexing system, in: Proceedings of the ACM International Conference on Digital Libraries, ACM, 1998, pp. 89–98. URL: https://pennstate.pure.elsevier.com/en/publications/citeseer-an-automatic-citation-indexing-system.

[18] L. Zhao, Y. Song, C. Zhang, Y. Liu, P. Wang, T. Lin, M. Deng, H. Li, T-GCN: A Temporal Graph Convolutional Network for Traffic Prediction, IEEE Transactions on Intelligent Transportation Systems (2019) 1–11. doi:10.1109/TITS.2019.2935152.

[19] S. M. Gerrish, D. M. Blei, A language-based approach to measuring scholarly impact, in: Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML'10, Omnipress, Madison, WI, USA, 2010, pp. 375–382.

[20] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, arXiv:1810.04805 [cs] (2019). URL: http://arxiv.org/abs/1810.04805.

[21] I. Beltagy, K. Lo, A. Cohan, SciBERT: A Pretrained Language Model for Scientific Text, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3615–3620. URL: https://www.aclweb.org/anthology/D19-1371. doi:10.18653/v1/D19-1371.

[22] G. Maillette de Buy Wenniger, T. van Dongen, E. Aedmaa, H. T. Kruitbosch, E. A. Valentijn, L. Schomaker, Structure-Tags Improve Text Classification for Scholarly Document Quality Prediction, in: Proceedings of the First Workshop on Scholarly Document Processing, Association for Computational Linguistics, Online, 2020, pp. 158–167. URL: https://www.aclweb.org/anthology/2020.sdp-1.18. doi:10.18653/v1/2020.sdp-1.18.

[23] D. P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, in: Y. Bengio, Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL: http://arxiv.org/abs/1412.6980.