Longitudinal Citation Prediction using Temporal Graph Neural Networks Andreas Nugaard Holm1 , Barbara Plank2 , Dustin Wright1 and Isabelle Augenstein1 1 University of Copenhagen, Universitetsparken 1, 2100 København, Denmark 2 IT University of Copenhagen, Rued Langgaards Vej 7, 2300 København, Denmark Abstract Citation count prediction is the task of predicting the number of citations a paper has gained after a period of time. Prior work viewed this as a static prediction task. As papers and their citations evolve over time, considering the dynamics of the number of citations over time seems the logical next step. Here, we introduce the task of sequence citation prediction. The goal is to accurately predict the trajectory of the number of citations a scholarly work receives over time. We propose to view papers as a structured network of citations, allowing us to use topological information as a learning signal. Additionally, we learn how this dynamic citation network changes over time and the impact of paper meta-data such as authors, venues and abstracts. To approach the new task, we derive a dynamic citation network from Semantic Scholar spanning over 42 years. We present a model which exploits topological and temporal information using graph convolution networks paired with sequence prediction, and compare it against multiple baselines, testing the importance of topological and temporal information and analyzing model performance. Our experiments show that leveraging both the temporal and topological information greatly increases the performance of predicting citation counts over time. Keywords citation count prediction, graph neural network, citation network, dynamic graph generation 1. Introduction The problem of predicting citation counts of papers has been a long-standing research problem. Predicting cita- tion counts allows us to better understand the relation- ship between a paper and its impact. However, prior Figure 1: Illustration of the development of the dynamic research has viewed this as a static prediction problem, graph through three time steps. Each node represents a paper; i.e. only predicting a single citation count at a static point edges are citations between papers. Red nodes represent new papers in the current time step. in time. This ignores the natural development of the data as new papers are being published. Here, we propose to view the problem as a sequence prediction task, with models then having the ability to capture the evolving is useful for predicting citation counts over time. In this nature of citations. paper, we consider citation networks, a dynamic graph This, in turn, requires a dataset to contain the papers’ which evolves over time as new citations and papers are citation counts over a period of time, which adds a tem- added to the network. Leveraging the structured data poral element to the data, which can then be encoded in the graph allows us to discover complex relationships by sequential machine learning models, such as Long between papers. We want to tap into that knowledge short-term memory models (LSTM) [1]. Additionally, and treat the citation data as a network, such that we scholarly documents exhibit a natural graph-like struc- can further exploit topological information and not just ture in their citation networks. Given recent develop- temporal information. By doing so, we investigate the ments in modeling such data [2, 3] and prior research hypothesis of paper citation counts being correlated with showing that modeling input as graphs can be beneficial, features such as authors, venue, and topics. we hypothesize that modeling a paper’s citation network We use the well-established Semantic Scholar dataset [4] to construct our citation network. Its SDU@AAAI’22: Workshop on Scientific Document Understanding, meta-data allows us to construct a dynamic citation March 01, 2022 network which covers a 42 year time-line, with an Envelope-Open aholm@di.ku.dk (A. N. Holm); bapl@itu.dk (B. Plank); updated graph for each year. The Semantic Scholar dw@di.ku.dk (D. Wright); augenstein@di.ku.dk (I. Augenstein) Orcid 0000-0002-2006-5894 (A. N. Holm); 0000-0002-4394-1965 dataset’s meta-data also contains information about (B. Plank); 0000-0001-6514-8733 (D. Wright); 0000-0003-1562-7909 each paper’s authors, venue, and topics, allowing us to (I. Augenstein) study the correlation between these features and the © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). citation count of a paper when considering the evolving CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) nature of the citation network. The correlation between databases such as ArnetMiner [13], Arxiv HEP-TH [14] these features and citation counts is well-known and and CiteSeerX [15]. These static citation networks are studied by prior work [5]. Prior studies show that not suitable for our proposed task because they only con- citations are correlated and there is a strong correlation tain the topological information at a single point in time. between features such as authors, but are limited by As longitudinal citation datasets are rare, we derive a only predicting a single citation, and not predicting the dataset from Semantic Scholar. natural evolution of a papers growth. Citation networks are not exclusively used for cita- We propose to use the constructed dynamic citation tion count prediction. Other citation networks such as network (see Section 4.2) to predict the trajectory of the Cora [16], CiteSeer [17] or PubMed [16], all well known number of citations papers will receive over time, a new benchmark graphs, are used for node classification tasks, sequence prediction task introduced in this work. Fur- where the task is to predict a paper’s topic. These net- thermore, we propose an encoder-decoder model to solve works are provided with minimal content. They consist the proposed task, which uses graph convolutional lay- of an adjacency matrix, the connections between cita- ers [6] to exploit the graphs’ topological features and an tions, and a simple feature vector for each node of either LSTM to model the temporal component of the graphs. 0/1-valued vector or a tf-idf vector, based on the dictio- We compare our model against a vanilla graph convolu- nary of the paper content. These existing datasets do not tional neural network (GCN) and a vanilla LSTM, which fit our purpose, hence we derive our own, described in individually incorporate either the topological informa- Sec. 4.2. tion or the temporal information, but not both. Our contributions are as follows: 1) A dynamic cita- tion network based on the Semantic Scholar dataset. The 3. Temporal Graph Neural dynamic citation network contains 42 time-steps, with an Network updated graph at each time-step, based on yearly informa- tion. 2) We introduce the task of sequence citation count Our model is an encoder-decoder model and therefore prediction. 3) A novel encoder-decoder model based on consists of two major components. The first component a GCN and LSTM to extract the dynamic graph’s topo- is the encoder, which takes an adjacency matrix of node logical and temporal components. 4) A thorough study connections and a node feature matrix as input, where of the correlation between citation counts and temporal the node feature matrix can e.g. consist of author infor- components. mation (illustrated in Figure 2). It uses the topological information from the graphs and creates feature vectors containing both the topological node features via a GCN. 2. Related Work It should be noted that due to the use of dynamic graphs, the encoder generates a sequence of graph embeddings, The task of predicting a paper’s citations aims to predict one for each graph in the sequence. The second com- the number of citations which a paper has obtained either ponent, the decoder, utilizes the sequence of graph em- by a given year or after 𝑛 years. The task itself is not beddings created by the encoder. By using an LSTM, we new and has been researched throughout the years, and extract the temporal elements and create a sequence of multiple different approaches have been tried and shown citation count predictions (CCP) for each node in the to be effective. Some of these studies, have focused on dynamic graph. feature vectors [5, 7] and explored distinct feature vectors’ performance, where they primarily rely on meta-data, e.g. venue and authors. As peer review data has become 3.1. Problem Definition available [8], recent research has focused on using non- While the task of CCP has been researched before, in meta-data information, such as peer-reviews [9, 10] to this paper, we are interested in predicting a sequence of predict a paper’s citation count. citation counts, which to our knowledge is so far unex- What is common in existing research is the target: pre- plored. dicting a single citation count. This count can be set as Let us start by introducing our graph notation. We one of the following years, or the citation count 𝑛 years denote our dynamic graph as 𝐺 = {𝐺0 … 𝐺𝑇 −1 }, where 𝐺𝑡 in the future. To predict these citation counts, we see a is a graph, at the given time 𝑡. Each graph in the dynamic variety of different neural network models with distinct graph set is defined as 𝐺𝑡 = (𝑉𝑡 , 𝐸𝑡 ), where 𝑉𝑡 is the set architectures [10, 11], as well as papers which focus on of vertices at time 𝑡 and 𝐸𝑡 is the set of edges at time deeper feature vector analysis, where regression mod- 𝑡. With a given dynamic graph, we aim to predict the els are used [12, 7]. A side effect from prior research’s sequence of citations for a given paper. We formalize this focus on predicting single citation counts is that the uti- as 𝑦 𝑣 = {𝑦1𝑣 … 𝑦𝑇𝑣 }, where 𝑦𝑡𝑣 is the number of citations for lized citation networks are static graphs, based on paper 𝑣𝑡 ∈ 𝑉𝑡 and 𝑦𝑡𝑣 = |𝐸𝑡𝑣 |. For our proposed task, we are given the dynamic graph 𝐺, and are to predict the sequence of citation counts 𝑦. 3.2. Topological Feature Extraction One of the central hypotheses we want to examine is if complex structural dependencies in a citation network can help predict the citation count of a paper. To test this, we employ a GCN to extract topological dependencies from the graphs. We choose a GCN over other methods as they work in Euclidean space, and are thus easy to use with other neural architectures such as convolutional neural networks (CNN) [3]. The GCN uses the data flow between edges in the graph to create a graph embedding. As such, we can create an embedding influenced by all of the neighboring nodes in the graph. In this, we hypothesize that there is a relationship between the number of citations a given paper receives and that of its neighbors. The connections between the papers is described by an adjacency matrix 𝐴. Using our notation, we describe the GCN as follows: −1 −1 Figure 2: Our proposed encoder-decoder model 𝐻 (𝑙+1) = 𝜎 (𝐷̃ 2 𝐴̃ 𝐷̃ 2 𝐻 (𝑙) 𝑊 (𝑙) ) , (1) where 𝐴̃ = 𝐴 + 𝐼; 𝐼 is the identity matrix (which enables self-loops in 𝐴);̃ 𝐷̃ 𝑖𝑖 = ∑𝑗 𝐴̃ 𝑖𝑗 , 𝑙 is the 𝑙’th layer in the 3.2.1. Temporal Feature Extraction model; 𝜎 is an activation function; and 𝐻 (𝑙+1) is the output of the GCN layer 𝐻 (𝑙) . We can then simplify the above With the constructed graph embeddings, containing both equation: topological information and node information. We want to extract the temporal information, which we use the (𝑙) 𝐻 (𝑙+1) = 𝜎 (𝐴̂ 𝑡 𝐻𝑡 𝑊 (𝑙) ) (2) sequence of graph embeddings to do. To extract the 1 1 temporal information, we utilize an LSTM, where we can − − where 𝐴̂ is defined as 𝐴̂ = 𝐷̃ 2 𝐴̃ 𝐷̃ 2 and 𝑡 is the time formalize the input and output as 𝑌 = 𝑙(𝑍 ), where the step in the dynamic graph. It should be noted that 𝑡 function 𝑙 is the LSTM and 𝑌 ∈ ℝ𝑚∗𝑇 are the CCPs. has been left out in the first equation for simplicity. We also observe here that by adding multiple GCN layers, 3.2.2. Encoder-Decoder we allow the the graph embeddings to be affected by extended neighbours. In the final model, we combine the GCN and LSTM in Since we work on a dynamic citation network, we have an encoder-decoder model. The primary challenge in 𝑇 distinct adjacency matrices, and we have to create a combining these two models though is that they operate graph embedding for each graph in the sequence: on vastly different inputs. The GCN operates on entire graphs and needs all the nodes to appear in the graphs, 𝑍 = {𝑍0 … 𝑍𝑇 } = {𝑓 (𝑋 , 𝐴𝑡 ) | 𝐴𝑡 ∈ 𝐴}, (3) including nodes which it intends to predict. The LSTM, where the function 𝑓 is the GCN network, 𝑍𝑡 ∈ ℝ𝑚×𝑛 is however, does not have this requirement and can work a single graph embedding of dimensionality 𝑛 with 𝑚 on batches. To solve this issue in a simple yet effective nodes, and 𝑍 is the set of graph embeddings created by approach, we embed the entire graph prior to the LSTM the GCN. It should be noted that 𝑋 is shown as being steps so that in the LSTM step, we can still split the data independent of time, which is true for some of our node into batches for training, validation and testing. While embeddings. However, some of our node embeddings are other approaches have been researched, like embedding based on citations, which change through time, which the GCN into the LSTM [18], we found the simple ap- makes 𝑋 dependent on time. We will explore the distinct proach to perform better. node embeddings in a later section. As shown in the Figure 2 shows the architecture of our model. The equation, we also keep the same model over time, and GCN uses two layers to create the graph embedding. The do not change the GCN even though the graph changes. LSTM is a single one-directional layer whose outputs are We instead try to generalize the model, working on all reduced to a sequence of scalars through a linear layer. the graphs in the dynamic graph. 4. Dynamic Citation Count Algorithm 1: Dynamic Graph Construction Prediction Input: data Output: G As discussed earlier, we differentiate ourselves from prior 1 𝑐𝑜𝑛𝑛𝑒𝑐𝑡𝑒𝑑_𝑔𝑟𝑎𝑝ℎ𝑠 = dict() work by predicting a sequence of citation counts over 2 for 𝑦 ∈ years do time as opposed to a single final citation count. Datasets 3 𝑔𝑠 ← find_connected_graphs(𝑑𝑎𝑡𝑎[𝑦]) for the latter exist, but are based on paper databases. 4 𝑐𝑜𝑛𝑛𝑒𝑐𝑡𝑒𝑑_𝑔𝑟𝑎𝑝ℎ𝑠[𝑦] ← sort(𝑔𝑠) However, existing citation networks are not usable for 5 end our task due to the graph of the citation network being 6 for 𝑝𝑎𝑝𝑒𝑟 ∈ 𝑑𝑎𝑡𝑎[min](𝑦𝑒𝑎𝑟𝑠) do static in those works, i.e., the citation network does not 7 𝑘𝑒𝑦_𝑠𝑖𝑧𝑒[𝑝𝑎𝑝𝑒𝑟] = 0 evolve over time. Given this, we construct a dataset, 8 for 𝑦 ∈ years do where we reconstruct the citation networks, at each time- 9 𝑏𝑒𝑠𝑡 = 0 step, for the purpose of studying citation count prediction 10 for 𝑔 ∈ 𝑐𝑜𝑛𝑛𝑒𝑐𝑡𝑒𝑑_𝑔𝑟𝑎𝑝ℎ𝑠[𝑦] do 11 if 𝑝𝑎𝑝𝑒𝑟 ∈ 𝑔 and |𝑔| > 𝑏𝑒𝑠𝑡 then over time. 12 𝑏𝑒𝑠𝑡 = |𝑔| 13 end 4.1. Dataset 14 end 15 𝑘𝑒𝑦_𝑠𝑖𝑧𝑒[𝑝𝑎𝑝𝑒𝑟]+ = 𝑏𝑒𝑠𝑡 The dataset which we used to create our dynamic graph 16 end 1 is based on Semantic Scholar [4]. The dataset is a col- 17 end lection of close to 200, 000, 000 scientific papers; the size 18 𝑏𝑒𝑠𝑡_𝑝𝑎𝑝𝑒𝑟 = argmax(𝑘𝑒𝑦_𝑠𝑖𝑧𝑒) of a graph of this size requires an immense system to 19 𝐺 = dict() run experiments on (recall the size of 𝑌 ∈ ℝ𝑚∗𝑇 where 20 for 𝑦 ∈ years do 𝑚 is the number of papers). To reduce the dataset to a 21 for 𝑔 ∈ 𝑐𝑜𝑛𝑛𝑒𝑐𝑡𝑒𝑑_𝑔𝑟𝑎𝑝ℎ𝑠[𝑦] do manageable size, we only kept papers from the following 22 if 𝑏𝑒𝑠𝑡_𝑝𝑎𝑝𝑒𝑟 ∈ 𝑔 then venues related to AI, Machine Learning and Natural Lan- 23 𝐺[𝑦] = 𝑔 guage Processing: ACL, COLING, NAACL, EMNLP, AAAI, 24 break NeurIPS and CoNLL. With the dataset only containing 25 end papers from the listed venues, we reduced the dataset’s 26 end size to 47, 091 papers. Furthermore, the Semantic Scholar 27 end dataset also holds an extensive collection of meta-data for each paper. We use this meta-data to construct our dynamic graph, as well as the graph’s node embeddings. probing as the process of observing the evolution of the graph connected to the probed node. This process is auto- 4.2. Graph Construction matically performed on all nodes of the largest connected graph in the final step. By probing all the nodes, we can With the dataset reduced to a more manageable size, we choose the sequence of graphs which contains the most search for an ideal dynamic graph of the citation net- nodes over time. In Algorithm 1, we describe the process work. We do this because working with graphs can be in the form of pseudo-code for a more precise insight in computationally heavy and the size of the graph based the process of constructing the ideal dynamic graph. on the full semantic scholar dataset, can make some com- In Table 1, we show some of the properties of the putations near unfeasible. We define an ideal dynamic last 10 graphs in the dynamic graph. It is clear how the graph as the sequence of graphs which has the largest graph is evolving over time, as can be seen in how both connected graph in the final graph and has the most sig- the number of vertices and edges increases, and how the nificant increase of nodes over time. We do not use the degree 𝐷 increases, indicating that the nodes in the graph largest connected graph at each time step, as it can trick obtains more citations over time. This indicates that the us into selecting a sub-optimal dynamic graph. A sub- dynamic graph reflects the natural growth of a paper’s optimal dynamic graph may present itself as the largest citations. connected graph at a point in time, but will not stay as the By only using a subset of the nodes from the full graph largest connected graph through time, and will contain to construct the dynamic graph, we ablate some of the less nodes through time, compared to the ideal dynamic full graph’s properties. One notable property of the full graph. To solve the issue of being tricked into selecting graph is that the citation count of a paper is tied to the a less ideal dynamic graph, we have to probe each node degree of a node; by using a subset of the full graph in the data to observe the graphs’ evolution. We define this property does not hold anymore, which leads to the following definition of the size of the set of edges 𝑦𝑡𝑣 = |𝐸𝑡𝑣 | 1 https://api.semanticscholar.org/ 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 |𝑉 | 14, 584 16, 603 18, 529 20, 760 23, 327 26, 529 29, 293 33, 759 38, 080 38, 168 |𝐸| 103, 519 127, 277 152, 869 181, 666 217, 807 267, 940 308, 186 387, 738 475, 007 476, 015 Mean 𝐷 7.1 7.67 8.25 8.75 9.34 10.1 10.52 11.49 12.47 12.47 Max 𝐷 614 761 923 1, 072 1, 220 1, 371 1, 496 1, 763 2, 084 2, 086 Max citation count 2, 584 3, 110 3, 637 4, 186 4, 740 5, 403 11, 385 20, 893 32, 278 35, 200 Avg. citation count 26.33 27.48 28.87 30.15 31.49 32.94 35.84 38.31 43.0 45.41 Table 1 Key values of the graphs. changing to the following for a given node 𝑦𝑡𝑣 ≥ |𝐸𝑡𝑣 |. have been shown to be useful for learning downstream Another important point is that removing edges from tasks with scientific text, this is why we use them here. the graph removes some of the information contained in To obtain a feature vector of a given abstract, we tok- the full graph (e.g. links to papers in other fields). Such enize the abstract text and pass this through SciBERT. edges are usually connected to more prominent papers SciBERT prepends a special [CLS] token for performing because it is often the high impact papers, which obtain classification tasks, so we use the output representation citations from papers outside the main field. of this token as the final feature vector for an abstract. Author rank: To include the author information, we 4.3. Feature Generation created a feature vector which ranks the authors based on their number of citations sorted by highest to lowest. Due The created dynamic graph nodes are not dependent on to many authors having the same amount of citations, a set of specific features, and we can therefore select we allow authors to be of the same rank. As the final step and create a set of features for each node containing our for the feature calculation, we normalize the rankings by desired information. With a wide variety of meta-data 𝑋 −𝑋 𝑋 ′ = 𝑋 −𝑋min . fields available, we created a set of distinct features which max min Venue rank: Together with the author rank, we also we used for our predictions. Furthermore, we studied hypothesize that the venue has an impact on the num- how each of these features affect the performance of the ber of citations of a paper. Therefore, we also created a model. feature ranking for the venues. The feature is calculated The choice of using authors and venues as features identically to the author rank. It should be mentioned for our model is based on the hypothesis that authors that the meta-data contains a high amount of different listed on a paper have a major impact on the number of labels for each of the venues which we are using. We citations gained. We assume the same goes for venues: if reduce all the different labels of the same venue down a paper is published at a more highly ranked venue, it is to a single label for each venue, but keep each venue more likely to gain a large amount of citations compared separated by year. to a paper published at a lower ranking venue. We fur- ther motivate the choice of these two features based on prior work [5], who shows that author rank and venue 5. Experiments rank are indeed two of the three features that are most predictive. We motivate the choice of using the abstract In this section we present our experiments and results, based on the assumption that the abstract of a paper con- and explore the importance of exploiting topological and tains information on the topics discussed in the paper, temporal information. which can be used to identify if paper’s topic is currently popular [19]. We further motivate the choice of using 5.1. Data author and venue rank, as prior work shows them to be the most descriptive features [5]. The following sections We use the constructed dynamic graph for our experi- provide short descriptions of the meta-data used to create ments and test each of the three distinct feature vectors. these feature vectors and how each of them is calculated. A detailed description of the feature vectors and the dy- Abstract: To base our model on more than meta-data, we namic graph’s construction can be found in Section 4. use the abstract of the papers to create a feature vector. To We split our data into a training, validation, and test set, create an embedding of the abstract, we utilize BERT [20], with the following splits: 60%, 20%, and 20%. With the specifically the pre-trained SciBERT [21] model. SciB- splits, we achieve a training set consisting of 22, 900, and ERT is a contextualized embedding model trained using a validation and test set of 7, 634. The training, validation a masked language modeling objective on a large amount and test sets are generated randomly, but are kept fixed of scholarly literature. Representations from SciBERT throughout the experiments. GCN + LSTM LSTM GCN Abstract 0.8284 ± 0.0162 1.0164 ± 0.0140 1.279 ± 0.1350 Author 0.7477 ± 0.0166 1.0184 ± 0.0273 1.1089 ± 0.0357 Venue 0.9259 ± 0.1161 1.0414 ± 0.0197 1.0828 ± 0.0030 Author + Venue 0.7572 ± 0.0131 1.0186 ± 0.0240 1.1248 ± 0.0271 All 0.7940 ± 0.0138 1.0152 ± 0.0157 1.3115 ± 0.1681 Table 2 The performance of our 3 models over a 10 year period. The results are reported as the MAE of the log citations. For the 10-year period, our deterministic approach have a MAE of 1.6378. GCN + LSTM LSTM GCN Abstract 0.8001 ± 0.0147 1.0149 ± 0.0414 1.6690 ± 0.4404 Author 0.7462 ± 0.0911 1.0179 ± 0.0536 1.3756 ± 0.0334 Venue 0.8525 ± 0.1348 1.0156 ± 0.0388 1.3212 ± 0.0039 Author + Venue 0.7515 ± 0.0889 1.0132 ± 0.0480 1.3598 ± 0.0461 All 0.7803 ± 0.0167 1.0165 ± 0.0383 1.5177 ± 0.1892 Table 3 The performance of our 3 models over a 20 year period. The results are reported as the MAE of the log citations. For the 20-year period, our deterministic approach have a MAE of 2.0796. Due to the large number of time-steps in the dynamic training early. As mentioned, we used SciBERT to encode graph, we chose to create two different setups for our ex- the abstracts, with an output vector of size 768. The mod- periments. One which uses the last 10 years and another, els have been run using random seeds, and each of the which uses the last 20 years of the dynamic graph. We experiments have been executed 10 times. In the results use the later years in the dynamic graph as these years section, we report the mean and the standard deviation contain the most papers and the graph has evolved the of the 10 runs. most. We compare to a simple deterministic baseline: predict- While not mentioned in Section 4.3, we perform some ing the mean citation count of the training and validation further pre-processing of the data. For the feature vectors at each time step. of author rank and venue rank, we perform a normal- ization of the values. We also perform pre-processing of 5.3. Evaluation Metric the labels due to the high fluctuation of the number of citations. We take the 𝑙𝑜𝑔(𝑐 + 1) of the citation of a paper To evaluate the performance of the models, we measure as the labels [22]. Taking the log of the citation increases the mean absolute error, defined as the stability of the model during training. 𝑁 1 𝑀𝐴𝐸 = ∑ |𝑦 − 𝑦|, ̂ (4) 𝑁 𝑖=1 5.2. Experimental Setup We perform experiments with three distinct models: 1) where 𝑌 are the citation counts and 𝑌̂ are the predicted Our proposed model, consisting of a GCN and LSTM; 2) a values. We also use the MAE to optimize the model. We standard LSTM; 3) a standard GCN. All hyper-parameters chose to use MAE, instead of mean squared error (MSE), are shared across the models. We evaluate models at to mitigate outlier papers which have a high amount specific times and over time. of citations. We additionally use MAE as the training For our selected models, we used the Adam [23] op- objective for the same reason. timizer, with a learning rate of 0.001. For the GCN we used two layers, with each layer consisting of 256 hidden 5.4. Results units. Both the GCN and the GCN with LSTM used this setup. The LSTM was set to have a single uni-directional As previously mentioned, we ran our experiments on layer of 128 hidden units, with the output being reduced dynamic graphs of 10 years and 20 years. The results to 1 dimension by a linear layer. For the models using an of the 10 year experiment is shown in Table 2, and the LSTM, we its batch size to 256. We ran the models for results of the 20 years experiment is shown in Table 3. 1000 epochs and if no update to the best validation score The tables show that our models outperform the simple have been observed over 10 epochs, we terminate the deterministic approach. Figure 3 shows results over time. Figure 3: MAE at each time-step, where left show the MAE for our 10 year experiments and right show the MAE at each time-step for our 20 year experiments, where the 𝑥-axis shows the time and 𝑦-axis the MAE. By inspecting the results, one can clearly observe that different feature ablations, we can observe that the author the GCN-LSTM has the best performance among the features performs the best, confirming our hypothesis. three models. We further observe that the GCN-LSTM Figure 3 further confirms this, showing that large parts improves on the performance of the pure GCN and LSTM of the gain of the model over time stems from author individually, indicating that it learns from both the tem- information. poral and the topological information provided by the The feature vector created by the venues performs dynamic citation network. Furthermore, the GCN in- the worst in both experiments. We hypothesize that creases in error going from a 10 year interval to a 20 yearthe venues’ performance could be increased if a more interval, where we see the other models slightly improve. generalized notation for venue meta-data were available. To further study this, we plot the error of the different They are noisy (also due to OCR errors) and have many time steps in Figure 3, which show the models’ perfor- spelling variants. mances over time. By inspecting the plots, we observe a To further study the impact of features, we calculate trend of the pure models i.e. the GCN and LSTM models, the average MAE for each distinct author and venue, struggle and deteriorate over time, compared to the com- where we use the predictions made by the GCN-LSTM, bined GCN-LSTM model, which keeps improving over trained on the author feature vectors over 20 years. We time until it starts plateauing. Comparing the 10-year show the result for the venues in Table 4 and the ones for and 20-year plots, one can observe that the deterioration authors in Table 5. One can observe that the difference continues, where the 10-year plot stops. It can also be between the top and the bottom venue is drastically lower seen, that the GCN-LSTM keeps improving up until year than the difference between the top and bottom author. 10, where it levels out. All of the models decrease drasti-This further indicates that author features are a strongly cally in error up until two time-steps; afterward, the purepredictive feature for citation counts. models start deteriorating. We also show the average degree and the number of papers for each of the venues in Table 4. With a higher 5.5. Discussion representation of papers in the collection, we expect a more reliable prediction. This is indeed the case – we Tables 2 and 3 show the impact of single feature types. We observe the top venues often have a higher number of hypothesize that author information is very predictive, papers in their collection. To further analyse this, we as shown by prior work. Inspecting the results from the Venue MAE Avg. degree 𝑛 Acknowledgments 1 COLING 1973 0.04295 1 20 2 AAAI 2020 0.06397 4.67 240 We like to thank Johannes Bjerva for the fruitful discus- 3 NAACL 2019 0.0863 15.25 2160 sions in the early stages. This work is partly funded ⋮ by Independent Research Fund Denmark under grant 185 ACL 1983 0.7714 2 20 agreement number 9065-00131B. 186 ACL 1988 0.7794 19.6 100 187 EMNLP 1998 0.8917 4.5 40 Table 4 References The top 3 and bottom 3 venues, sorted by the mean MAE, going from lowest to highest. [1] S. Hochreiter, J. Schmidhuber, Long Short-Term Memory, Neural Computation 9 (1997) 1735–1780. URL: https : / / www.mitpressjournals.org / Author ID MAE Avg. degree 𝑛 doi / abs / 10.1162 / neco.1997.9.8.1735. 1 32968 0.0131 14 1 doi:10.1162/neco.1997.9.8.1735 . 2 22037 0.0131 14 1 [2] J. Zhou, G. Cui, S. Hu, Z. Zhang, C. Yang, Z. Liu, 3 32969 0.0131 14 1 L. Wang, C. Li, M. Sun, Graph neural networks: A ⋮ 24536 1375 2.6356 5 1 review of methods and applications, AI Open 1 24537 807 2.6356 5 1 (2020) 57–81. URL: https://www.sciencedirect.com/ 24358 4290 2.6356 5 1 science / article / pii / S2666651021000012. Table 5 doi:10.1016/j.aiopen.2021.01.001 . The top 3 and bottom 3 authors, sorted by the mean MAE, [3] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, P. S. going from lowest to highest. Yu, A Comprehensive Survey on Graph Neural Networks, IEEE Transactions on Neural Networks and Learning Systems 32 (2021) 4–24. doi:10.1109/ TNNLS.2020.2978386 . observe the average degree of the papers in the collection, [4] W. Ammar, D. Groeneveld, C. Bhagavatula, I. Belt- however, we do not notice a higher performance where agy, M. Crawford, D. Downey, J. Dunkelberger, the degree is higher. This indicates that the model is A. Elgohary, S. Feldman, V. A. Ha, R. M. Kinney, better at predicting papers with higher citation counts, S. Kohlmeier, K. Lo, T. C. Murray, H.-H. Ooi, M. E. because the degree of a node is tightly bound to the Peters, J. L. Power, S. Skjonsberg, L. L. Wang, C. Wil- number of citations. helm, Z. Yuan, M. v. Zuylen, O. Etzioni, Construc- tion of the Literature Graph in Semantic Scholar, in: 6. Conclusions NAACL-HLT, 2018. doi:10.18653/v1/N18-3011 . [5] R. Yan, J. Tang, X. Liu, D. Shan, X. Li, Citation In this paper, we propose the task of citation sequence count prediction: learning to estimate future cita- prediction. We introduce a new dataset of scholarly doc- tions for literature, in: Proceedings of the 20th uments for this task based on a dynamic citation graph ACM international conference on Information and evolving of 42 years, starting from a single node growing knowledge management - CIKM ’11, ACM Press, to a large graph. We further study the effect of tempo- Glasgow, Scotland, UK, 2011, p. 1247. URL: http: ral and topological information, and propose a model to //dl.acm.org/citation.cfm?doid=2063576.2063757. benefit from both information (GCN+LSTM). Our results doi:10.1145/2063576.2063757 . show that utilizing both the temporal and topological in- [6] T. N. Kipf, M. Welling, Semi-Supervised Classifi- formation is superior to only utilizing either the temporal cation with Graph Convolutional Networks, in: or topological information. Using the proposed model, 5th International Conference on Learning Repre- we study the effect of different features, to identify which sentations, ICLR 2017, Toulon, France, April 24- information is most predictive of a paper’s citation count 26, 2017, Conference Track Proceedings, Open- over time. We find author information to be the most Review.net, 2017. URL: https://openreview.net/ predictive and informative over time. forum?id=SJU4ayYgl. In future work, the impact of training a single GCN [7] T. Yu, G. Yu, P.-Y. Li, L. Wang, Citation impact pre- on the dynamic graph could be explored, since the error diction for scientific papers using stepwise regres- over time of the GCN is deteriorates fast. sion analysis, Scientometrics 101 (2014) 1233–1252. URL: http://link.springer.com/10.1007/s11192-014- 1279-6. doi:10.1007/s11192-014-1279-6 . [8] D. Kang, W. Ammar, B. Dalvi, M. van Zuylen, S. Kohlmeier, E. Hovy, R. Schwartz, A Dataset 311–322. doi:10.1007/978-3-319-06028-6_26 . of Peer Reviews (PeerRead): Collection, Insights [16] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and NLP Applications, arXiv:1804.09635 [cs] T. Eliassi-Rad, Collective Classification in Network (2018). URL: http://arxiv.org/abs/1804.09635, arXiv: Data, AI Magazine 29 (2008) 93–93. URL: https: 1804.09635. //ojs.aaai.org/index.php/aimagazine/article/view/ [9] B. Plank, R. v. Dalen, CiteTracked: A Longitu- 2157. doi:10.1609/aimag.v29i3.2157 , number: 3. dinal Dataset of Peer Reviews and Citations, in: [17] C. L. Giles, K. D. Bollacker, S. Lawrence, CiteSeer: Proceedings of BIRNDL ACM SIGIR, Paris, France, an automatic citation indexing system, in: Pro- July 25, 2019, volume 2414, CEUR-WS.org, 2019, pp. ceedings of the ACM International Conference on 116–122. Digital Libraries, ACM, 1998, pp. 89–98. URL: https: [10] S. Li, W. X. Zhao, E. J. Yin, J.-R. Wen, A Neural / / pennstate.pure.elsevier.com / en / publications / Citation Count Prediction Model based on Peer Re- citeseer-an-automatic-citation-indexing-system. view Text, in: Proceedings of the 2019 Confer- [18] L. Zhao, Y. Song, C. Zhang, Y. Liu, P. Wang, T. Lin, ence on Empirical Methods in Natural Language M. Deng, H. Li, T-GCN: A Temporal Graph Con- Processing and the 9th International Joint Con- volutional Network for Traffic Prediction, IEEE ference on Natural Language Processing (EMNLP- Transactions on Intelligent Transportation Systems IJCNLP), Association for Computational Linguis- (2019) 1–11. doi:10.1109/TITS.2019.2935152 . tics, Hong Kong, China, 2019, pp. 4913–4923. [19] S. M. Gerrish, D. M. Blei, A language-based ap- URL: https://www.aclweb.org/anthology/D19-1497. proach to measuring scholarly impact, in: Pro- doi:10.18653/v1/D19-1497 . ceedings of the 27th International Conference on [11] J. Wen, L. Wu, J. Chai, Paper Citation International Conference on Machine Learning, Count Prediction Based on Recurrent Neural Net- ICML’10, Omnipress, Madison, WI, USA, 2010, pp. work with Gated Recurrent Unit, in: 2020 375–382. IEEE 10th International Conference on Elec- [20] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: tronics Information and Emergency Communica- Pre-training of Deep Bidirectional Transformers for tion (ICEIEC), 2020, pp. 303–306. doi:10.1109 / Language Understanding, arXiv:1810.04805 [cs] ICEIEC49280.2020.9152330 , iSSN: 2377-844X. (2019). URL: http://arxiv.org/abs/1810.04805, arXiv: [12] F. Davletov, A. S. Aydin, A. Cakmak, High Im- 1810.04805. pact Academic Paper Prediction Using Temporal [21] I. Beltagy, K. Lo, A. Cohan, SciBERT: A Pretrained and Topological Features, in: Proceedings of the Language Model for Scientific Text, in: Proceed- 23rd ACM International Conference on Confer- ings of the 2019 Conference on Empirical Meth- ence on Information and Knowledge Management ods in Natural Language Processing and the 9th - CIKM ’14, ACM Press, Shanghai, China, 2014, pp. International Joint Conference on Natural Lan- 491–498. URL: http://dl.acm.org/citation.cfm?doid= guage Processing (EMNLP-IJCNLP), Association 2661829.2662066. doi:10.1145/2661829.2662066 . for Computational Linguistics, Hong Kong, China, [13] J. Tang, D. Zhang, L. Yao, Social Network Ex- 2019, pp. 3615–3620. URL: https://www.aclweb.org/ traction of Academic Researchers, in: Proceed- anthology/D19-1371. doi:10.18653/v1/D19-1371 . ings of the 2007 Seventh IEEE International Con- [22] G. Maillette de Buy Wenniger, T. van Dongen, ference on Data Mining, ICDM ’07, IEEE Com- E. Aedmaa, H. T. Kruitbosch, E. A. Valentijn, puter Society, USA, 2007, pp. 292–301. URL: https: L. Schomaker, Structure-Tags Improve Text Classi- / / doi.org / 10.1109 / ICDM.2007.30. doi:10.1109 / fication for Scholarly Document Quality Prediction, ICDM.2007.30 . in: Proceedings of the First Workshop on Scholarly [14] J. N. Manjunatha, K. R. Sivaramakrishnan, R. K. Document Processing, Association for Computa- Pandey, M. N. Murthy, Citation prediction us- tional Linguistics, Online, 2020, pp. 158–167. URL: ing time series approach KDD Cup 2003 (task https://www.aclweb.org/anthology/2020.sdp-1.18. 1), ACM SIGKDD Explorations Newsletter 5 doi:10.18653/v1/2020.sdp-1.18 . (2003) 152–153. URL: https : / / doi.org / 10.1145 / [23] D. P. Kingma, J. Ba, Adam: A Method for Stochastic 980972.980993. doi:10.1145/980972.980993 . Optimization, in: Y. Bengio, Y. LeCun (Eds.), 3rd [15] C. Caragea, J. Wu, A. Ciobanu, K. Williams, International Conference on Learning Representa- J. Fernández-Ramírez, H.-H. Chen, Z. Wu, L. Giles, tions, ICLR 2015, San Diego, CA, USA, May 7-9, CiteSeerx: A Scholarly Big Dataset, in: M. de Rijke, 2015, Conference Track Proceedings, 2015. URL: T. Kenter, A. P. de Vries, C. Zhai, F. de Jong, K. Radin- http://arxiv.org/abs/1412.6980. sky, K. Hofmann (Eds.), Advances in Informa- tion Retrieval, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2014, pp.