=Paper=
{{Paper
|id=Vol-3164/paper3
|storemode=property
|title=Longitudinal Citation Prediction using Temporal Graph Neural Networks
|pdfUrl=https://ceur-ws.org/Vol-3164/paper3.pdf
|volume=Vol-3164
|authors=Andreas Nugaard Holm,Barbara Plank,Dustin Wright,Isabelle Augenstein
|dblpUrl=https://dblp.org/rec/conf/aaai/HolmP0A22
}}
==Longitudinal Citation Prediction using Temporal Graph Neural Networks==
Andreas Nugaard Holm¹, Barbara Plank², Dustin Wright¹ and Isabelle Augenstein¹
¹ University of Copenhagen, Universitetsparken 1, 2100 København, Denmark
² IT University of Copenhagen, Rued Langgaards Vej 7, 2300 København, Denmark
Abstract
Citation count prediction is the task of predicting the number of citations a paper has gained after a period of time. Prior work has viewed this as a static prediction task. As papers and their citations evolve over time, considering the dynamics of the number of citations over time is the logical next step. Here, we introduce the task of sequence citation prediction, where the goal is to accurately predict the trajectory of the number of citations a scholarly work receives over time. We propose to view papers as a structured network of citations, allowing us to use topological information as a learning signal. Additionally, we learn how this dynamic citation network changes over time and the impact of paper meta-data such as authors, venues and abstracts. To approach the new task, we derive a dynamic citation network from Semantic Scholar spanning 42 years. We present a model which exploits topological and temporal information using graph convolution networks paired with sequence prediction, and compare it against multiple baselines, testing the importance of topological and temporal information and analyzing model performance. Our experiments show that leveraging both the temporal and topological information greatly increases the performance of predicting citation counts over time.
Keywords
citation count prediction, graph neural network, citation network, dynamic graph generation
SDU@AAAI'22: Workshop on Scientific Document Understanding, March 01, 2022
aholm@di.ku.dk (A. N. Holm); bapl@itu.dk (B. Plank); dw@di.ku.dk (D. Wright); augenstein@di.ku.dk (I. Augenstein)
ORCID: 0000-0002-2006-5894 (A. N. Holm); 0000-0002-4394-1965 (B. Plank); 0000-0001-6514-8733 (D. Wright); 0000-0003-1562-7909 (I. Augenstein)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction
The problem of predicting the citation counts of papers is a long-standing research problem. Predicting citation counts allows us to better understand the relationship between a paper and its impact. However, prior research has viewed this as a static prediction problem, i.e. predicting a single citation count at a single point in time. This ignores the natural development of the data as new papers are published. Here, we propose to view the problem as a sequence prediction task, so that models can capture the evolving nature of citations.

Figure 1: Illustration of the development of the dynamic graph through three time steps. Each node represents a paper; edges are citations between papers. Red nodes represent new papers in the current time step.

This, in turn, requires a dataset containing the papers' citation counts over a period of time, which adds a temporal element to the data that can be encoded by sequential machine learning models such as Long Short-Term Memory (LSTM) models [1]. Additionally, scholarly documents exhibit a natural graph-like structure in their citation networks. Given recent developments in modeling such data [2, 3] and prior research showing that modeling input as graphs can be beneficial, we hypothesize that modeling a paper's citation network is useful for predicting citation counts over time. In this paper, we consider citation networks: dynamic graphs which evolve over time as new citations and papers are added to the network. Leveraging the structured data in the graph allows us to discover complex relationships between papers. We want to tap into that knowledge and treat the citation data as a network, such that we can exploit topological information and not just temporal information. By doing so, we investigate the hypothesis that paper citation counts are correlated with features such as authors, venue, and topics.

We use the well-established Semantic Scholar dataset [4] to construct our citation network. Its meta-data allows us to construct a dynamic citation network covering a 42-year time-line, with an updated graph for each year. The Semantic Scholar meta-data also contains information about each paper's authors, venue, and topics, allowing us to study the correlation between these features and the citation count of a paper when considering the evolving nature of the citation network. The correlation between these features and citation counts is well known and has been studied in prior work [5]. Prior studies show a strong correlation between citation counts and features such as authors, but are limited to predicting a single citation count, rather than the natural evolution of a paper's citations.

We propose to use the constructed dynamic citation network (see Section 4.2) to predict the trajectory of the number of citations papers will receive over time, a new sequence prediction task introduced in this work. Furthermore, we propose an encoder-decoder model to solve the proposed task, which uses graph convolutional layers [6] to exploit the graphs' topological features and an LSTM to model the temporal component of the graphs. We compare our model against a vanilla graph convolutional network (GCN) and a vanilla LSTM, which individually incorporate either the topological or the temporal information, but not both.

Our contributions are as follows: 1) A dynamic citation network based on the Semantic Scholar dataset, containing 42 time-steps with an updated graph at each time-step, based on yearly information. 2) We introduce the task of sequence citation count prediction. 3) A novel encoder-decoder model based on a GCN and an LSTM to extract the dynamic graph's topological and temporal components. 4) A thorough study of the correlation between citation counts and temporal components.

2. Related Work
The task of predicting a paper's citations aims to predict the number of citations a paper has obtained either by a given year or after 𝑛 years. The task itself is not new and has been researched throughout the years, and multiple different approaches have been tried and shown to be effective. Some of these studies have focused on feature vectors [5, 7] and explored the performance of distinct feature vectors, primarily relying on meta-data, e.g. venue and authors. As peer review data has become available [8], recent research has focused on using non-meta-data information, such as peer reviews [9, 10], to predict a paper's citation count.

What is common in existing research is the target: predicting a single citation count. This count can be set for one of the following years, or for the citation count 𝑛 years in the future. To predict these citation counts, we see a variety of neural network models with distinct architectures [10, 11], as well as papers which focus on deeper feature vector analysis, where regression models are used [12, 7]. A side effect of prior research's focus on predicting single citation counts is that the utilized citation networks are static graphs, based on paper databases such as ArnetMiner [13], Arxiv HEP-TH [14] and CiteSeerX [15]. These static citation networks are not suitable for our proposed task because they only contain the topological information at a single point in time. As longitudinal citation datasets are rare, we derive a dataset from Semantic Scholar.

Citation networks are not exclusively used for citation count prediction. Other citation networks such as Cora [16], CiteSeer [17] or PubMed [16], all well-known benchmark graphs, are used for node classification tasks, where the task is to predict a paper's topic. These networks are provided with minimal content: they consist of an adjacency matrix describing the connections between papers, and a simple feature vector for each node, either a 0/1-valued vector or a tf-idf vector based on the dictionary of the paper content. These existing datasets do not fit our purpose, hence we derive our own, described in Sec. 4.2.

3. Temporal Graph Neural Network
Our model is an encoder-decoder model and therefore consists of two major components. The first component is the encoder, which takes an adjacency matrix of node connections and a node feature matrix as input, where the node feature matrix can e.g. consist of author information (illustrated in Figure 2). Via a GCN, it uses the topological information from the graphs to create feature vectors containing the topological node features. It should be noted that, due to the use of dynamic graphs, the encoder generates a sequence of graph embeddings, one for each graph in the sequence. The second component, the decoder, utilizes the sequence of graph embeddings created by the encoder. Using an LSTM, we extract the temporal elements and create a sequence of citation count predictions (CCP) for each node in the dynamic graph.

3.1. Problem Definition
While the task of CCP has been researched before, in this paper we are interested in predicting a sequence of citation counts, which to our knowledge is so far unexplored.

Let us start by introducing our graph notation. We denote our dynamic graph as 𝐺 = {𝐺_0 … 𝐺_(𝑇−1)}, where 𝐺_𝑡 is a graph at the given time 𝑡. Each graph in the dynamic graph set is defined as 𝐺_𝑡 = (𝑉_𝑡, 𝐸_𝑡), where 𝑉_𝑡 is the set of vertices at time 𝑡 and 𝐸_𝑡 is the set of edges at time 𝑡. With a given dynamic graph, we aim to predict the sequence of citations for a given paper. We formalize this as 𝑦^𝑣 = {𝑦_1^𝑣 … 𝑦_𝑇^𝑣}, where 𝑦_𝑡^𝑣 is the number of citations for 𝑣_𝑡 ∈ 𝑉_𝑡 and 𝑦_𝑡^𝑣 = |𝐸_𝑡^𝑣|. For our proposed task, we are given the dynamic graph 𝐺 and are to predict the sequence of citation counts 𝑦^𝑣.
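For concreteness, this notation maps onto a small data structure. The following is a minimal sketch; the type and field names are our own, not from the paper, and assume Python 3.9+:

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    """One yearly graph G_t = (V_t, E_t) of the dynamic graph G = {G_0 ... G_(T-1)}."""
    vertices: set[str]               # paper ids present at time t
    edges: set[tuple[str, str]]      # (citing, cited) pairs at time t

def citation_sequence(paper: str, graphs: list[Snapshot]) -> list[int]:
    """Target y^v = {y_1^v ... y_T^v}, with y_t^v = |E_t^v|: citations of paper v at time t."""
    return [sum(1 for _, cited in g.edges if cited == paper) for g in graphs]
```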
3.2. Topological Feature Extraction
One of the central hypotheses we want to examine is whether complex structural dependencies in a citation network can help predict the citation count of a paper. To test this, we employ a GCN to extract topological dependencies from the graphs. We choose a GCN over other methods as GCNs work in Euclidean space and are thus easy to combine with other neural architectures such as convolutional neural networks (CNNs) [3].

The GCN uses the data flow between edges in the graph to create a graph embedding. As such, we can create an embedding influenced by all of the neighboring nodes in the graph. In this, we hypothesize that there is a relationship between the number of citations a given paper receives and that of its neighbors. The connections between the papers are described by an adjacency matrix 𝐴. Using our notation, we describe the GCN as follows:

𝐻^(𝑙+1) = 𝜎(𝐷̃^(−1/2) 𝐴̃ 𝐷̃^(−1/2) 𝐻^(𝑙) 𝑊^(𝑙)),   (1)

where 𝐴̃ = 𝐴 + 𝐼, with 𝐼 the identity matrix (which adds self-loops to 𝐴̃); 𝐷̃_𝑖𝑖 = ∑_𝑗 𝐴̃_𝑖𝑗; 𝑙 is the 𝑙'th layer in the model; 𝜎 is an activation function; and 𝐻^(𝑙+1) is the output of GCN layer 𝑙. We can then simplify the above equation to

𝐻^(𝑙+1) = 𝜎(𝐴̂_𝑡 𝐻_𝑡^(𝑙) 𝑊^(𝑙)),   (2)

where 𝐴̂ is defined as 𝐴̂ = 𝐷̃^(−1/2) 𝐴̃ 𝐷̃^(−1/2) and 𝑡 is the time step in the dynamic graph. It should be noted that 𝑡 has been left out of the first equation for simplicity. We also observe that by stacking multiple GCN layers, we allow the graph embeddings to be affected by extended neighbourhoods.

Since we work on a dynamic citation network, we have 𝑇 distinct adjacency matrices, and we have to create a graph embedding for each graph in the sequence:

𝑍 = {𝑍_0 … 𝑍_𝑇} = {𝑓(𝑋, 𝐴_𝑡) | 𝐴_𝑡 ∈ 𝐴},   (3)

where the function 𝑓 is the GCN network, 𝑍_𝑡 ∈ ℝ^(𝑚×𝑛) is a single graph embedding of dimensionality 𝑛 with 𝑚 nodes, and 𝑍 is the set of graph embeddings created by the GCN. It should be noted that 𝑋 is shown as being independent of time, which is true for some of our node embeddings. However, some of our node embeddings are based on citations, which change through time, making 𝑋 time-dependent; we explore the distinct node embeddings in a later section. As the equation shows, we also keep the same model over time and do not change the GCN even though the graph changes. We instead aim for a model which generalizes across all the graphs in the dynamic graph.

3.2.1. Temporal Feature Extraction
The constructed graph embeddings contain both topological and node information. We then want to extract the temporal information, which we do using the sequence of graph embeddings. To extract the temporal information, we utilize an LSTM, where we can formalize the input and output as 𝑌 = 𝑙(𝑍), where the function 𝑙 is the LSTM and 𝑌 ∈ ℝ^(𝑚×𝑇) are the CCPs.

3.2.2. Encoder-Decoder
In the final model, we combine the GCN and the LSTM in an encoder-decoder model. The primary challenge in combining these two models is that they operate on vastly different inputs. The GCN operates on entire graphs and needs all the nodes to appear in the graphs, including the nodes whose citation counts it intends to predict. The LSTM, however, does not have this requirement and can work on batches. To solve this issue in a simple yet effective way, we embed the entire graph prior to the LSTM steps, so that in the LSTM step we can still split the data into batches for training, validation and testing. While other approaches have been researched, like embedding the GCN into the LSTM [18], we found the simple approach to perform better.

Figure 2: Our proposed encoder-decoder model.

Figure 2 shows the architecture of our model. The GCN uses two layers to create the graph embedding. The LSTM is a single one-directional layer whose outputs are reduced to a sequence of scalars through a linear layer.
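To make the combination concrete, the following is a minimal PyTorch sketch of such an encoder-decoder, reconstructed from Equations (1)-(3) and the layer sizes in Section 5.2. It is our illustration under those assumptions, not the authors' released code:

```python
import torch
import torch.nn as nn

def normalize_adjacency(A: torch.Tensor) -> torch.Tensor:
    """A_hat = D~^(-1/2) (A + I) D~^(-1/2), as in Equations (1)-(2)."""
    A_tilde = A + torch.eye(A.size(0), device=A.device)
    d_inv_sqrt = torch.diag(A_tilde.sum(dim=1).pow(-0.5))
    return d_inv_sqrt @ A_tilde @ d_inv_sqrt

class GCNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, A_hat: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        # H^(l+1) = sigma(A_hat H^(l) W^(l))
        return torch.relu(self.weight(A_hat @ H))

class GCNLSTM(nn.Module):
    """Encoder: a two-layer GCN embeds every yearly graph; decoder: an LSTM
    turns the embedding sequence into per-node citation count predictions."""
    def __init__(self, feat_dim: int, gcn_dim: int = 256, lstm_dim: int = 128):
        super().__init__()
        self.gcn1 = GCNLayer(feat_dim, gcn_dim)
        self.gcn2 = GCNLayer(gcn_dim, gcn_dim)
        self.lstm = nn.LSTM(gcn_dim, lstm_dim, batch_first=True)
        self.out = nn.Linear(lstm_dim, 1)

    def forward(self, A_seq: list, X_seq: list) -> torch.Tensor:
        # Z_t = f(X_t, A_t) for every snapshot, Equation (3); all snapshots
        # must share the same node set, as noted in Section 3.2.2.
        Z = []
        for A, X in zip(A_seq, X_seq):
            A_hat = normalize_adjacency(A)
            Z.append(self.gcn2(A_hat, self.gcn1(A_hat, X)))
        H, _ = self.lstm(torch.stack(Z, dim=1))   # (num_nodes, T, lstm_dim)
        return self.out(H).squeeze(-1)            # (num_nodes, T) predictions
```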
4. Dynamic Citation Count Prediction
As discussed earlier, we differentiate ourselves from prior work by predicting a sequence of citation counts over time, as opposed to a single final citation count. Datasets for the latter exist, but are based on paper databases. Existing citation networks are not usable for our task because the graph of the citation network is static in those works, i.e., the citation network does not evolve over time. Given this, we construct a dataset in which we reconstruct the citation network at each time-step, for the purpose of studying citation count prediction over time.

4.1. Dataset
The dataset which we use to create our dynamic graph is based on Semantic Scholar [4].¹ The dataset is a collection of close to 200,000,000 scientific papers; a graph of this size requires an immense system to run experiments on (recall the size of 𝑌 ∈ ℝ^(𝑚×𝑇), where 𝑚 is the number of papers). To reduce the dataset to a manageable size, we only keep papers from the following venues related to AI, Machine Learning and Natural Language Processing: ACL, COLING, NAACL, EMNLP, AAAI, NeurIPS and CoNLL. With the dataset only containing papers from the listed venues, we reduce its size to 47,091 papers. Furthermore, the Semantic Scholar dataset also holds an extensive collection of meta-data for each paper. We use this meta-data to construct our dynamic graph, as well as the graph's node embeddings.

¹ https://api.semanticscholar.org/

4.2. Graph Construction
With the dataset reduced to a more manageable size, we search for an ideal dynamic graph of the citation network. We do this because working with graphs can be computationally heavy, and the size of the graph based on the full Semantic Scholar dataset can make some computations nearly infeasible. We define an ideal dynamic graph as the sequence of graphs which has the largest connected graph in the final graph and the most significant increase in nodes over time. We do not simply use the largest connected graph at each time step, as this can trick us into selecting a sub-optimal dynamic graph: a sub-optimal dynamic graph may present itself as the largest connected graph at one point in time, but will not stay the largest connected graph over time, and will contain fewer nodes over time compared to the ideal dynamic graph. To avoid selecting such a less ideal dynamic graph, we probe each node in the data to observe the graphs' evolution. We define probing as the process of observing the evolution of the graph connected to the probed node. This process is automatically performed on all nodes of the largest connected graph in the final step. By probing all the nodes, we can choose the sequence of graphs which contains the most nodes over time. Algorithm 1 describes the process of constructing the ideal dynamic graph in pseudo-code.

Algorithm 1: Dynamic Graph Construction
Input: data
Output: G
  connected_graphs = dict()
  for y ∈ years do
      gs ← find_connected_graphs(data[y])
      connected_graphs[y] ← sort(gs)
  end
  for paper ∈ data[min(years)] do
      key_size[paper] = 0
      for y ∈ years do
          best = 0
          for g ∈ connected_graphs[y] do
              if paper ∈ g and |g| > best then
                  best = |g|
              end
          end
          key_size[paper] += best
      end
  end
  best_paper = argmax(key_size)
  G = dict()
  for y ∈ years do
      for g ∈ connected_graphs[y] do
          if best_paper ∈ g then
              G[y] = g
              break
          end
      end
  end
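For readers who prefer an executable form, the following is a rough Python equivalent of Algorithm 1, assuming undirected networkx snapshots per year; find_connected_graphs and sort are mapped onto nx.connected_components and a size-based sort (our reading, not the authors' code):

```python
import networkx as nx

def build_dynamic_graph(snapshots: dict) -> dict:
    """snapshots: year -> nx.Graph of the citation network in that year.
    Returns year -> node set of the chosen connected graph."""
    years = sorted(snapshots)
    # Connected components per year, largest first (find_connected_graphs + sort).
    components = {
        y: sorted(nx.connected_components(snapshots[y]), key=len, reverse=True)
        for y in years
    }
    # Probe every paper of the earliest year: accumulate, over all years, the size
    # of the largest component containing it (the key_size bookkeeping).
    key_size = {
        paper: sum(
            max((len(g) for g in components[y] if paper in g), default=0)
            for y in years
        )
        for paper in snapshots[years[0]].nodes
    }
    best_paper = max(key_size, key=key_size.get)
    # The ideal dynamic graph follows best_paper's component through time;
    # this assumes best_paper appears in every yearly snapshot.
    return {y: next(g for g in components[y] if best_paper in g) for y in years}
```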
In Table 1, we show some of the properties of the last 10 graphs in the dynamic graph. It is clear that the graph evolves over time: both the number of vertices and edges increase, and the degree 𝐷 increases, indicating that the nodes in the graph obtain more citations over time. This indicates that the dynamic graph reflects the natural growth of a paper's citations.

                      2011     2012     2013     2014     2015     2016     2017     2018     2019     2020
|V|                 14,584   16,603   18,529   20,760   23,327   26,529   29,293   33,759   38,080   38,168
|E|                103,519  127,277  152,869  181,666  217,807  267,940  308,186  387,738  475,007  476,015
Mean D                 7.1     7.67     8.25     8.75     9.34     10.1    10.52    11.49    12.47    12.47
Max D                  614      761      923    1,072    1,220    1,371    1,496    1,763    2,084    2,086
Max citation count   2,584    3,110    3,637    4,186    4,740    5,403   11,385   20,893   32,278   35,200
Avg. citation count  26.33    27.48    28.87    30.15    31.49    32.94    35.84    38.31    43.0     45.41

Table 1: Key values of the last 10 graphs in the dynamic graph.

By only using a subset of the nodes from the full graph to construct the dynamic graph, we lose some of the full graph's properties. One notable property of the full graph is that the citation count of a paper is tied to the degree of a node; in a subset of the full graph this property no longer holds, which changes the definition of the size of the set of edges from 𝑦_𝑡^𝑣 = |𝐸_𝑡^𝑣| to 𝑦_𝑡^𝑣 ≥ |𝐸_𝑡^𝑣| for a given node. Another important point is that removing edges from the graph removes some of the information contained in the full graph (e.g. links to papers in other fields). Such edges are usually connected to more prominent papers, because it is often the high-impact papers which obtain citations from papers outside the main field.
4.3. Feature Generation
The created dynamic graph nodes are not dependent on a set of specific features, and we can therefore select and create a set of features for each node containing our desired information. With a wide variety of meta-data fields available, we created a set of distinct features which we use for our predictions. Furthermore, we study how each of these features affects the performance of the model.

The choice of using authors and venues as features for our model is based on the hypothesis that the authors listed on a paper have a major impact on the number of citations gained. We assume the same goes for venues: if a paper is published at a more highly ranked venue, it is more likely to gain a large number of citations compared to a paper published at a lower-ranked venue. We further motivate the choice of these two features based on prior work [5], which shows that author rank and venue rank are indeed two of the three most predictive features. We motivate the choice of using the abstract based on the assumption that the abstract of a paper contains information on the topics discussed in the paper, which can be used to identify whether a paper's topic is currently popular [19]. The following sections provide short descriptions of the meta-data used to create these feature vectors and how each of them is calculated.

Abstract: To base our model on more than meta-data, we use the abstract of the papers to create a feature vector. To create an embedding of the abstract, we utilize BERT [20], specifically the pre-trained SciBERT [21] model. SciBERT is a contextualized embedding model trained using a masked language modeling objective on a large amount of scholarly literature. Representations from SciBERT have been shown to be useful for learning downstream tasks with scientific text, which is why we use them here. To obtain a feature vector for a given abstract, we tokenize the abstract text and pass it through SciBERT. SciBERT prepends a special [CLS] token for performing classification tasks, so we use the output representation of this token as the final feature vector for an abstract.
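As an illustration, the abstract embedding could look as follows. This is a minimal sketch using the Hugging Face transformers library; the checkpoint name allenai/scibert_scivocab_uncased is our assumption, since the paper only cites SciBERT [21]:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

def abstract_feature(abstract: str) -> torch.Tensor:
    """Embed an abstract as the [CLS] output representation (size 768)."""
    inputs = tokenizer(abstract, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[0, 0]  # [CLS] is the first token
```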
Author rank: To include the author information, we create a feature vector which ranks the authors by their number of citations, sorted from highest to lowest. As many authors have the same number of citations, we allow authors to share a rank. As the final step of the feature calculation, we normalize the rankings by 𝑋′ = (𝑋 − 𝑋_min) / (𝑋_max − 𝑋_min).

Venue rank: Together with the author rank, we also hypothesize that the venue has an impact on the number of citations of a paper. Therefore, we also create a feature ranking for the venues, calculated identically to the author rank. It should be mentioned that the meta-data contains a large number of different labels for each of the venues we use. We reduce all the different labels of the same venue to a single label per venue, but keep each venue separated by year.
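Both rank features can be sketched in a few lines of NumPy; the names are illustrative, and ties are assumed to share a rank, as described above:

```python
import numpy as np

def rank_feature(citations: dict) -> dict:
    """Map author (or venue) -> min-max-normalized citation rank."""
    counts = sorted(set(citations.values()), reverse=True)
    rank_of = {c: r for r, c in enumerate(counts)}     # equal counts share a rank
    ranks = np.array([rank_of[c] for c in citations.values()], dtype=float)
    span = ranks.max() - ranks.min()                   # assumes >= 2 distinct counts
    normalized = (ranks - ranks.min()) / span          # X' = (X - X_min) / (X_max - X_min)
    return dict(zip(citations, normalized))
```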
5. Experiments
In this section we present our experiments and results, and explore the importance of exploiting topological and temporal information.

5.1. Data
We use the constructed dynamic graph for our experiments and test each of the three distinct feature vectors. A detailed description of the feature vectors and the dynamic graph's construction can be found in Section 4. We split our data into a training, validation, and test set, with the following splits: 60%, 20%, and 20%. With these splits, we obtain a training set of 22,900 nodes, and a validation and a test set of 7,634 nodes each. The training, validation and test sets are generated randomly, but are kept fixed throughout the experiments.
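A minimal sketch of such a fixed random split follows; the seed and the exact mechanism are our assumptions, as the paper only states that the split is random and kept fixed:

```python
import numpy as np

num_nodes = 38_168                      # nodes in the final (2020) graph
rng = np.random.default_rng(0)          # any fixed seed keeps the split stable across runs
perm = rng.permutation(num_nodes)
train, val, test = np.split(perm, [int(0.6 * num_nodes), int(0.8 * num_nodes)])
# len(train) == 22_900, len(val) == len(test) == 7_634
```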
Due to the large number of time-steps in the dynamic graph, we chose to create two different setups for our experiments: one which uses the last 10 years and another which uses the last 20 years of the dynamic graph. We use the later years of the dynamic graph, as these years contain the most papers and the graph has evolved the most.

While not mentioned in Section 4.3, we perform some further pre-processing of the data. For the feature vectors of author rank and venue rank, we perform a normalization of the values. We also pre-process the labels, due to the high fluctuation of the number of citations: we take log(𝑐 + 1) of the citation count of a paper as the label [22]. Taking the log of the citations increases the stability of the model during training.
5.2. Experimental Setup
We perform experiments with three distinct models: 1) our proposed model, consisting of a GCN and an LSTM; 2) a standard LSTM; 3) a standard GCN. All hyper-parameters are shared across the models. We evaluate the models at specific times and over time.

For our selected models, we used the Adam [23] optimizer with a learning rate of 0.001. For the GCN we used two layers, each consisting of 256 hidden units; both the GCN and the GCN with LSTM used this setup. The LSTM was set to a single uni-directional layer of 128 hidden units, with the output reduced to 1 dimension by a linear layer. For the models using an LSTM, we set the batch size to 256. We ran the models for 1000 epochs; if no improvement of the best validation score was observed over 10 epochs, we terminated training early. As mentioned, we used SciBERT to encode the abstracts, with an output vector of size 768. The models were run using random seeds, and each of the experiments was executed 10 times. In the results section, we report the mean and the standard deviation of the 10 runs.

We compare to a simple deterministic baseline: predicting the mean citation count of the training and validation sets at each time step.

5.3. Evaluation Metric
To evaluate the performance of the models, we measure the mean absolute error, defined as

MAE = (1/𝑁) ∑_(𝑖=1)^(𝑁) |𝑦_𝑖 − 𝑦̂_𝑖|,   (4)

where 𝑦 are the citation counts and 𝑦̂ are the predicted values. We chose to use MAE, instead of mean squared error (MSE), to mitigate the effect of outlier papers which have a very high number of citations; for the same reason, we also use MAE as the training objective.
5.4. Results
As previously mentioned, we ran our experiments on dynamic graphs of 10 years and of 20 years. The results of the 10 year experiments are shown in Table 2, and the results of the 20 year experiments in Table 3. The tables show that our models outperform the simple deterministic approach. Figure 3 shows results over time.

                  GCN + LSTM         LSTM               GCN
Abstract          0.8284 ± 0.0162    1.0164 ± 0.0140    1.279 ± 0.1350
Author            0.7477 ± 0.0166    1.0184 ± 0.0273    1.1089 ± 0.0357
Venue             0.9259 ± 0.1161    1.0414 ± 0.0197    1.0828 ± 0.0030
Author + Venue    0.7572 ± 0.0131    1.0186 ± 0.0240    1.1248 ± 0.0271
All               0.7940 ± 0.0138    1.0152 ± 0.0157    1.3115 ± 0.1681

Table 2: The performance of our 3 models over a 10 year period, reported as the MAE of the log citations. For the 10-year period, our deterministic approach has an MAE of 1.6378.

                  GCN + LSTM         LSTM               GCN
Abstract          0.8001 ± 0.0147    1.0149 ± 0.0414    1.6690 ± 0.4404
Author            0.7462 ± 0.0911    1.0179 ± 0.0536    1.3756 ± 0.0334
Venue             0.8525 ± 0.1348    1.0156 ± 0.0388    1.3212 ± 0.0039
Author + Venue    0.7515 ± 0.0889    1.0132 ± 0.0480    1.3598 ± 0.0461
All               0.7803 ± 0.0167    1.0165 ± 0.0383    1.5177 ± 0.1892

Table 3: The performance of our 3 models over a 20 year period, reported as the MAE of the log citations. For the 20-year period, our deterministic approach has an MAE of 2.0796.
Figure 3: MAE at each time-step, for the 10 year experiments (left) and the 20 year experiments (right); the 𝑥-axis shows time and the 𝑦-axis the MAE.
By inspecting the results, one can clearly observe that the GCN-LSTM has the best performance among the three models. We further observe that the GCN-LSTM improves on the performance of the pure GCN and LSTM individually, indicating that it learns from both the temporal and the topological information provided by the dynamic citation network. Furthermore, the GCN increases in error going from a 10 year interval to a 20 year interval, whereas the other models slightly improve.

To further study this, we plot the error at the different time steps in Figure 3, which shows the models' performance over time. By inspecting the plots, we observe that the pure models, i.e. the GCN and the LSTM, struggle and deteriorate over time, compared to the combined GCN-LSTM model, which keeps improving over time until it starts plateauing. Comparing the 10-year and 20-year plots, one can observe that the deterioration continues where the 10-year plot stops. It can also be seen that the GCN-LSTM keeps improving up until year 10, where it levels out. All of the models decrease drastically in error up until two time-steps; afterward, the pure models start deteriorating.

5.5. Discussion
Tables 2 and 3 show the impact of single feature types. We hypothesize that author information is very predictive, as shown by prior work. Inspecting the results of the different feature ablations, we can observe that the author features perform best, confirming our hypothesis. Figure 3 further confirms this, showing that large parts of the model's gain over time stem from author information.

The feature vector created from the venues performs the worst in both experiments. We hypothesize that the venues' performance could be increased if a more generalized notation for venue meta-data were available: the venue labels are noisy (also due to OCR errors) and have many spelling variants.

To further study the impact of features, we calculate the average MAE for each distinct author and venue, using the predictions made by the GCN-LSTM trained on the author feature vectors over 20 years. We show the results for the venues in Table 4 and those for the authors in Table 5. One can observe that the difference between the top and the bottom venue is drastically smaller than the difference between the top and the bottom author. This further indicates that author features are strongly predictive of citation counts.

We also show the average degree and the number of papers 𝑛 for each of the venues in Table 4. With a higher representation of papers in the collection, we expect a more reliable prediction. This is indeed the case: we observe that the top venues often have a higher number of papers in their collection. To further analyse this, we observe the average degree of the papers in the collection; however, we do not notice a higher performance where the degree is higher. This indicates that the model is better at predicting papers with higher citation counts, because the degree of a node is tightly bound to the number of citations.
     Venue         MAE      Avg. degree   𝑛
1    COLING 1973   0.04295  1             20
2    AAAI 2020     0.06397  4.67          240
3    NAACL 2019    0.0863   15.25         2160
⋮
185  ACL 1983      0.7714   2             20
186  ACL 1988      0.7794   19.6          100
187  EMNLP 1998    0.8917   4.5           40

Table 4: The top 3 and bottom 3 venues, sorted by the mean MAE, from lowest to highest.

       Author ID   MAE      Avg. degree   𝑛
1      32968       0.0131   14            1
2      22037       0.0131   14            1
3      32969       0.0131   14            1
⋮
24536  1375        2.6356   5             1
24537  807         2.6356   5             1
24358  4290        2.6356   5             1

Table 5: The top 3 and bottom 3 authors, sorted by the mean MAE, from lowest to highest.
6. Conclusions
In this paper, we propose the task of citation sequence prediction. We introduce a new dataset of scholarly documents for this task, based on a dynamic citation graph evolving over 42 years, starting from a single node and growing to a large graph. We further study the effect of temporal and topological information, and propose a model (GCN+LSTM) which benefits from both. Our results show that utilizing both the temporal and topological information is superior to utilizing either alone. Using the proposed model, we study the effect of different features, to identify which information is most predictive of a paper's citation count over time. We find author information to be the most predictive and informative over time.

In future work, the impact of training a single GCN on the dynamic graph could be explored, since the error of the GCN deteriorates quickly over time.

Acknowledgments
We would like to thank Johannes Bjerva for the fruitful discussions in the early stages. This work is partly funded by the Independent Research Fund Denmark under grant agreement number 9065-00131B.
References
[1] S. Hochreiter, J. Schmidhuber, Long Short-Term Memory, Neural Computation 9 (1997) 1735–1780. URL: https://www.mitpressjournals.org/doi/abs/10.1162/neco.1997.9.8.1735. doi:10.1162/neco.1997.9.8.1735.
[2] J. Zhou, G. Cui, S. Hu, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, M. Sun, Graph neural networks: A review of methods and applications, AI Open 1 (2020) 57–81. URL: https://www.sciencedirect.com/science/article/pii/S2666651021000012. doi:10.1016/j.aiopen.2021.01.001.
[3] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, P. S. Yu, A Comprehensive Survey on Graph Neural Networks, IEEE Transactions on Neural Networks and Learning Systems 32 (2021) 4–24. doi:10.1109/TNNLS.2020.2978386.
[4] W. Ammar, D. Groeneveld, C. Bhagavatula, I. Beltagy, M. Crawford, D. Downey, J. Dunkelberger, A. Elgohary, S. Feldman, V. A. Ha, R. M. Kinney, S. Kohlmeier, K. Lo, T. C. Murray, H.-H. Ooi, M. E. Peters, J. L. Power, S. Skjonsberg, L. L. Wang, C. Wilhelm, Z. Yuan, M. v. Zuylen, O. Etzioni, Construction of the Literature Graph in Semantic Scholar, in: NAACL-HLT, 2018. doi:10.18653/v1/N18-3011.
[5] R. Yan, J. Tang, X. Liu, D. Shan, X. Li, Citation count prediction: learning to estimate future citations for literature, in: Proceedings of the 20th ACM International Conference on Information and Knowledge Management - CIKM '11, ACM Press, Glasgow, Scotland, UK, 2011, p. 1247. URL: http://dl.acm.org/citation.cfm?doid=2063576.2063757. doi:10.1145/2063576.2063757.
[6] T. N. Kipf, M. Welling, Semi-Supervised Classification with Graph Convolutional Networks, in: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, OpenReview.net, 2017. URL: https://openreview.net/forum?id=SJU4ayYgl.
[7] T. Yu, G. Yu, P.-Y. Li, L. Wang, Citation impact prediction for scientific papers using stepwise regression analysis, Scientometrics 101 (2014) 1233–1252. URL: http://link.springer.com/10.1007/s11192-014-1279-6. doi:10.1007/s11192-014-1279-6.
[8] D. Kang, W. Ammar, B. Dalvi, M. van Zuylen, S. Kohlmeier, E. Hovy, R. Schwartz, A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications, arXiv:1804.09635 [cs] (2018). URL: http://arxiv.org/abs/1804.09635.
[9] B. Plank, R. v. Dalen, CiteTracked: A Longitudinal Dataset of Peer Reviews and Citations, in: Proceedings of BIRNDL at ACM SIGIR, Paris, France, July 25, 2019, volume 2414, CEUR-WS.org, 2019, pp. 116–122.
[10] S. Li, W. X. Zhao, E. J. Yin, J.-R. Wen, A Neural Citation Count Prediction Model based on Peer Review Text, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 4913–4923. URL: https://www.aclweb.org/anthology/D19-1497. doi:10.18653/v1/D19-1497.
[11] J. Wen, L. Wu, J. Chai, Paper Citation Count Prediction Based on Recurrent Neural Network with Gated Recurrent Unit, in: 2020 IEEE 10th International Conference on Electronics Information and Emergency Communication (ICEIEC), 2020, pp. 303–306. doi:10.1109/ICEIEC49280.2020.9152330.
[12] F. Davletov, A. S. Aydin, A. Cakmak, High Impact Academic Paper Prediction Using Temporal and Topological Features, in: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management - CIKM '14, ACM Press, Shanghai, China, 2014, pp. 491–498. URL: http://dl.acm.org/citation.cfm?doid=2661829.2662066. doi:10.1145/2661829.2662066.
[13] J. Tang, D. Zhang, L. Yao, Social Network Extraction of Academic Researchers, in: Proceedings of the 2007 Seventh IEEE International Conference on Data Mining, ICDM '07, IEEE Computer Society, USA, 2007, pp. 292–301. URL: https://doi.org/10.1109/ICDM.2007.30. doi:10.1109/ICDM.2007.30.
[14] J. N. Manjunatha, K. R. Sivaramakrishnan, R. K. Pandey, M. N. Murthy, Citation prediction using time series approach KDD Cup 2003 (task 1), ACM SIGKDD Explorations Newsletter 5 (2003) 152–153. URL: https://doi.org/10.1145/980972.980993. doi:10.1145/980972.980993.
[15] C. Caragea, J. Wu, A. Ciobanu, K. Williams, J. Fernández-Ramírez, H.-H. Chen, Z. Wu, L. Giles, CiteSeerx: A Scholarly Big Dataset, in: M. de Rijke, T. Kenter, A. P. de Vries, C. Zhai, F. de Jong, K. Radinsky, K. Hofmann (Eds.), Advances in Information Retrieval, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2014, pp. 311–322. doi:10.1007/978-3-319-06028-6_26.
[16] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, T. Eliassi-Rad, Collective Classification in Network Data, AI Magazine 29 (2008) 93–93. URL: https://ojs.aaai.org/index.php/aimagazine/article/view/2157. doi:10.1609/aimag.v29i3.2157.
[17] C. L. Giles, K. D. Bollacker, S. Lawrence, CiteSeer: an automatic citation indexing system, in: Proceedings of the ACM International Conference on Digital Libraries, ACM, 1998, pp. 89–98. URL: https://pennstate.pure.elsevier.com/en/publications/citeseer-an-automatic-citation-indexing-system.
[18] L. Zhao, Y. Song, C. Zhang, Y. Liu, P. Wang, T. Lin, M. Deng, H. Li, T-GCN: A Temporal Graph Convolutional Network for Traffic Prediction, IEEE Transactions on Intelligent Transportation Systems (2019) 1–11. doi:10.1109/TITS.2019.2935152.
[19] S. M. Gerrish, D. M. Blei, A language-based approach to measuring scholarly impact, in: Proceedings of the 27th International Conference on Machine Learning, ICML'10, Omnipress, Madison, WI, USA, 2010, pp. 375–382.
[20] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, arXiv:1810.04805 [cs] (2019). URL: http://arxiv.org/abs/1810.04805.
[21] I. Beltagy, K. Lo, A. Cohan, SciBERT: A Pretrained Language Model for Scientific Text, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3615–3620. URL: https://www.aclweb.org/anthology/D19-1371. doi:10.18653/v1/D19-1371.
[22] G. Maillette de Buy Wenniger, T. van Dongen, E. Aedmaa, H. T. Kruitbosch, E. A. Valentijn, L. Schomaker, Structure-Tags Improve Text Classification for Scholarly Document Quality Prediction, in: Proceedings of the First Workshop on Scholarly Document Processing, Association for Computational Linguistics, Online, 2020, pp. 158–167. URL: https://www.aclweb.org/anthology/2020.sdp-1.18. doi:10.18653/v1/2020.sdp-1.18.
[23] D. P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, in: Y. Bengio, Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL: http://arxiv.org/abs/1412.6980.