<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">2377-844X</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1007/978-3-319-06028-6_26</article-id>
      <title-group>
        <article-title>Longitudinal Citation Prediction using Temporal Graph Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andreas Nugaard Holm</string-name>
          <email>aholm@di.ku.dk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Barbara Plank</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dustin Wright</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Isabelle Augenstein</string-name>
          <email>augenstein@di.ku.dk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IT University of Copenhagen</institution>
          ,
          <addr-line>Rued Langgaards Vej 7, 2300 København</addr-line>
          ,
          <country country="DK">Denmark</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Copenhagen</institution>
          ,
          <addr-line>Universitetsparken 1, 2100 København</addr-line>
          ,
          <country country="DK">Denmark</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1973</year>
      </pub-date>
      <volume>2414</volume>
      <issue>3</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Citation count prediction is the task of predicting the number of citations a paper has gained after a period of time. Prior work viewed this as a static prediction task. As papers and their citations evolve over time, considering the dynamics of the number of citations over time seems the logical next step. Here, we introduce the task of sequence citation prediction. The goal is to accurately predict the trajectory of the number of citations a scholarly work receives over time. We propose to view papers as a structured network of citations, allowing us to use topological information as a learning signal. Additionally, we learn how this dynamic citation network changes over time and the impact of paper meta-data such as authors, venues and abstracts. To approach the new task, we derive a dynamic citation network from Semantic Scholar spanning over 42 years. We present a model which exploits topological and temporal information using graph convolution networks paired with sequence prediction, and compare it against multiple baselines, testing the importance of topological and temporal information and analyzing model performance. Our experiments show that leveraging both the temporal and topological information greatly increases the performance of predicting citation counts over time.</p>
        <p>Keywords: citation count prediction, graph neural network, citation network, dynamic graph generation</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The problem of predicting citation counts of papers has
been a long-standing research problem. Predicting
citation counts allows us to better understand the
relationship between a paper and its impact. However, prior
research has viewed this as a static prediction problem,
i.e. only predicting a single citation count at a static point
in time. This ignores the natural development of the data
as new papers are being published. Here, we propose
to view the problem as a sequence prediction task, with
models then having the ability to capture the evolving
nature of citations.</p>
      <p>
        This, in turn, requires a dataset to contain the papers’ citation counts over a period of time, which adds a temporal element to the data that can be encoded by sequential machine learning models, such as Long short-term memory models (LSTM) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Additionally, scholarly documents exhibit a natural graph-like structure in their citation networks. Given recent developments in modeling such data [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] and prior research showing that modeling input as graphs can be beneficial, we hypothesize that modeling a paper’s citation network, together with its temporal information, can improve citation count prediction.
      </p>
      <p>
        We use the well-established Semantic Scholar dataset [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to construct our citation network.
      </p>
      <p>
        Its meta-data allows us to construct a dynamic citation network which covers a 42 year time-line, with an updated graph for each year. The Semantic Scholar dataset’s meta-data also contains information about each paper’s authors, venue, and topics, allowing us to study the correlation between these features and the citation count of a paper when considering the evolving nature of the citation network. The correlation between these features and citation counts is well-known and studied by prior work [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>Prior studies show that citations are correlated, and that there is a strong correlation with features such as authors; however, these studies are limited to predicting a single citation count, and do not predict the natural evolution of a paper’s growth.</p>
      <p>
        We propose to use the constructed dynamic citation network (see Section 4.2) to predict the trajectory of the number of citations papers will receive over time, a new sequence prediction task introduced in this work. Furthermore, we propose an encoder-decoder model to solve the proposed task, which uses graph convolutional layers [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to exploit the graphs’ topological features and an LSTM to model the temporal component of the graphs.
      </p>
      <p>We compare our model against a vanilla graph convolutional neural network (GCN) and a vanilla LSTM, which individually incorporate either the topological information or the temporal information, but not both.</p>
      <p>Our contributions are as follows: 1) A dynamic citation network based on the Semantic Scholar dataset. The dynamic citation network contains 42 time-steps, with an updated graph at each time-step, based on yearly information. 2) We introduce the task of sequence citation count prediction. 3) A novel encoder-decoder model based on a GCN and LSTM to extract the dynamic graph’s topological and temporal components. 4) A thorough study of the correlation between citation counts and temporal components.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The task of predicting a paper’s citations aims to predict the number of citations which a paper has obtained either by a given year or after n years. The task itself is not new and has been researched throughout the years, and multiple different approaches have been tried and shown to be effective. Some of these studies have focused on feature vectors [
        <xref ref-type="bibr" rid="ref5 ref7">5, 7</xref>
        ] and explored distinct feature vectors’ performance, where they primarily rely on meta-data, e.g. venue and authors. As peer review data has become available [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], recent research has focused on using non-meta-data information, such as peer reviews [9, 10], to predict a paper’s citation count.
      </p>
      <p>
        What is common in existing research is the target: predicting a single citation count. This count can be set as one of the following years, or the citation count n years in the future. To predict these citation counts, we see a variety of different neural network models with distinct architectures [10, 11], as well as papers which focus on deeper feature vector analysis, where regression models are used [
        <xref ref-type="bibr" rid="ref7">12, 7</xref>
        ]. A side effect of prior research’s focus on predicting single citation counts is that the utilized citation networks are static graphs, based on paper databases such as ArnetMiner [13], Arxiv HEP-TH [14] and CiteSeerX [15]. These static citation networks are not suitable for our proposed task because they only contain the topological information at a single point in time. As longitudinal citation datasets are rare, we derive a dataset from Semantic Scholar.
      </p>
      <p>
        Citation networks are not exclusively used for citation count prediction. Other citation networks such as Cora [16], CiteSeer [17] or PubMed [16], all well-known benchmark graphs, are used for node classification tasks, where the task is to predict a paper’s topic. These networks are provided with minimal content. They consist of an adjacency matrix, the connections between citations, and a simple feature vector for each node, either a 0/1-valued vector or a tf-idf vector based on the dictionary of the paper content. These existing datasets do not fit our purpose, hence we derive our own, described in Sec. 4.2.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3. Temporal Graph Neural Network</title>
      <p>Our model is an encoder-decoder model and therefore consists of two major components. The first component is the encoder, which takes an adjacency matrix of node connections and a node feature matrix as input, where the node feature matrix can e.g. consist of author information (illustrated in Figure 2). It uses the topological information from the graphs and creates feature vectors containing the topological node features via a GCN. It should be noted that due to the use of dynamic graphs, the encoder generates a sequence of graph embeddings, one for each graph in the sequence. The second component, the decoder, utilizes the sequence of graph embeddings created by the encoder. Using an LSTM, we extract the temporal elements and create a sequence of citation count predictions (CCP) for each node in the dynamic graph.</p>
      <sec id="sec-5-1">
        <title>3.1. Problem Definition</title>
        <p>While the task of CCP has been researched before, in
this paper, we are interested in predicting a sequence of
citation counts, which to our knowledge is so far
unexplored.</p>
        <p>Let us start by introducing our graph notation. We denote our dynamic graph as G = {G_0 … G_(T−1)}, where G_t is a graph at the given time t. Each graph in the dynamic graph set is defined as G_t = (V_t, E_t), where V_t is the set of vertices at time t and E_t is the set of edges at time t. With a given dynamic graph, we aim to predict the sequence of citations for a given paper. We formalize this as C = {c_1 … c_T}, where c_t is the number of citations for v ∈ V_t and T = |G|. For our proposed task, we are given the dynamic graph G, and are to predict the sequence of citation counts C.</p>
      </sec>
      <sec id="sec-5-2">
        <title>3.2. Topological Feature Extraction</title>
        <p>
          One of the central hypotheses we want to examine is if
complex structural dependencies in a citation network
can help predict the citation count of a paper. To test this,
we employ a GCN to extract topological dependencies
from the graphs. We choose a GCN over other methods
as they work in Euclidean space, and are thus easy to
use with other neural architectures such as convolutional
neural networks (CNN) [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>The GCN uses the data flow between edges in the graph to create a graph embedding. As such, we can create an embedding influenced by all of the neighboring nodes in the graph. In this, we hypothesize that there is a relationship between the number of citations a given paper receives and that of its neighbors. The connections between the papers are described by an adjacency matrix A. Using our notation, we describe the GCN as follows:</p>
        <p>H^(l+1) = σ(D̃^(−1/2) Ã D̃^(−1/2) H^(l) W^(l)),  (1)</p>
        <p>where Ã = A + I; I is the identity matrix (which enables self-loops in Ã); D̃_ii = Σ_j Ã_ij; l is the l’th layer in the model; σ is an activation function; and H^(l+1) is the output of GCN layer l. We can then simplify the above equation:</p>
        <p>H^(l+1) = σ(Â H^(l) W^(l)),  (2)</p>
        <p>where Â is defined as Â = D̃^(−1/2) Ã D̃^(−1/2), and t is the time step in the dynamic graph. It should be noted that t has been left out in the first equation for simplicity. We also observe here that by adding multiple GCN layers, we allow the graph embeddings to be affected by extended neighbours.</p>
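        <p>The propagation rule in Eqs. (1)–(2) can be sketched numerically. The following is a minimal illustration, not the authors’ implementation; the toy adjacency matrix, feature matrix, and the choice of ReLU as the activation σ are assumptions for the example:</p>
        <preformat>
```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: sigma(D^(-1/2) (A + I) D^(-1/2) H W), with ReLU as sigma."""
    A_tilde = A + np.eye(A.shape[0])           # add self-loops: A~ = A + I
    d = A_tilde.sum(axis=1)                    # D~_ii = sum_j A~_ij
    D_inv_sqrt = np.diag(d ** -0.5)            # D~^(-1/2)
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # normalized adjacency A^
    return np.maximum(0.0, A_hat @ H @ W)      # ReLU activation

# toy citation graph with 3 papers and 2-dimensional node features
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.eye(3, 2)          # simple node features
W = np.ones((2, 2))       # untrained layer weights
Z = gcn_layer(A, H, W)
print(Z.shape)            # (3, 2)
```
        </preformat>
        <p>Stacking two such layers, as done in our experiments, lets each node’s embedding be influenced by its two-hop neighbourhood.</p>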
        <p>Since we work on a dynamic citation network, we have T distinct adjacency matrices, and we have to create a graph embedding for each graph in the sequence:</p>
        <p>Z = {Z_0 … Z_T} = { f(A_t, X) | A_t ∈ G },  (3)</p>
        <p>where the function f is the GCN network, Z_t ∈ ℝ^(N×d) is a single graph embedding of dimensionality d with N nodes, and Z is the set of graph embeddings created by the GCN. It should be noted that X is shown as being independent of time, which is true for some of our node embeddings. However, some of our node embeddings are based on citations, which change through time, which makes X dependent on time. We explore the distinct node embeddings in a later section. As shown in the equation, we also keep the same model over time, and do not change the GCN even though the graph changes. We instead try to generalize the model, working on all the graphs in the dynamic graph.</p>
        <p>In the final model, we combine the GCN and LSTM in an encoder-decoder model. The primary challenge in combining these two models, though, is that they operate on vastly different inputs. The GCN operates on entire graphs and needs all the nodes to appear in the graphs, including nodes which it intends to predict. The LSTM, however, does not have this requirement and can work on batches. To solve this issue in a simple yet effective way, we embed the entire graph prior to the LSTM steps, so that in the LSTM step we can still split the data into batches for training, validation and testing. While other approaches have been researched, like embedding the GCN into the LSTM [18], we found the simple approach to perform better.</p>
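        <p>The data flow of this encode-first, batch-later scheme can be sketched as follows. This is only an illustration of the batching idea, with the GCN encoder replaced by a single untrained propagation step; the function name encode_graph and the toy sizes are assumptions for the example:</p>
        <preformat>
```python
import numpy as np

def encode_graph(A, X):
    """Stand-in for the GCN encoder: one normalized propagation step."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(A_tilde.sum(axis=1) ** -0.5)
    return d_inv_sqrt @ A_tilde @ d_inv_sqrt @ X

# a dynamic graph: T = 3 yearly snapshots over the same N = 4 nodes
rng = np.random.default_rng(0)
N, T, d = 4, 3, 5
graphs = [(rng.integers(0, 2, (N, N)).astype(float), rng.normal(size=(N, d)))
          for _ in range(T)]

# 1) encode every graph first, so all nodes are embedded jointly
Z = np.stack([encode_graph(A, X) for A, X in graphs])  # shape (T, N, d)

# 2) regroup into per-node sequences, then split those into LSTM batches
sequences = Z.transpose(1, 0, 2)                       # shape (N, T, d)
batch_size = 2
batches = [sequences[i:i + batch_size] for i in range(0, N, batch_size)]
print(len(batches), batches[0].shape)                  # 2 (2, 3, 5)
```
        </preformat>
        <p>Because the whole graph is embedded before the recurrent step, the decoder only ever sees fixed-size per-node sequences, which is what makes mini-batching possible.</p>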
      </sec>
    </sec>
    <sec id="sec-6">
      <title>4. Dynamic Citation Count Prediction</title>
      <p>However, existing citation networks are not usable for our task due to the graph of the citation network being static in those works, i.e., the citation network does not evolve over time. Given this, we construct a dataset in which we reconstruct the citation network at each time-step, for the purpose of studying citation count prediction over time.</p>
      <p>Algorithm 1: Dynamic Graph Construction.</p>
      <sec id="sec-6-1">
        <title>4.1. Dataset</title>
        <p>
          The dataset which we used to create our dynamic graph is based on Semantic Scholar [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].1 The dataset is a collection of close to 200,000,000 scientific papers; a graph of this size requires an immense system to run experiments on (recall the size of A ∈ ℝ^(N×N), where N is the number of papers). To reduce the dataset to a manageable size, we only kept papers from the following venues related to AI, Machine Learning and Natural Language Processing: ACL, COLING, NAACL, EMNLP, AAAI, NeurIPS and CoNLL. With the dataset only containing papers from the listed venues, we reduced the dataset’s size to 47,091 papers. Furthermore, the Semantic Scholar dataset also holds an extensive collection of meta-data for each paper. We use this meta-data to construct our dynamic graph, as well as the graph’s node embeddings.
        </p>
        <p>1 https://api.semanticscholar.org/</p>
      </sec>
      <sec id="sec-6-2">
        <title>4.2. Graph Construction</title>
        <p>With the dataset reduced to a more manageable size, we search for an ideal dynamic graph of the citation network. We do this because working with graphs can be computationally heavy, and the size of the graph based on the full Semantic Scholar dataset can make some computations near unfeasible. We define an ideal dynamic graph as the sequence of graphs which has the largest connected graph in the final graph and has the most significant increase of nodes over time. We do not use the largest connected graph at each time step, as it can trick us into selecting a sub-optimal dynamic graph. A sub-optimal dynamic graph may present itself as the largest connected graph at a point in time, but will not stay the largest connected graph through time, and will contain fewer nodes through time, compared to the ideal dynamic graph. To avoid being tricked into selecting a less ideal dynamic graph, we have to probe each node in the data to observe the graphs’ evolution. We define probing as the process of observing the evolution of the graph connected to the probed node. This process is automatically performed on all nodes of the largest connected graph in the final step. By probing all the nodes, we can choose the sequence of graphs which contains the most nodes over time. In Algorithm 1, we describe the process in the form of pseudo-code for a more precise insight into the process of constructing the ideal dynamic graph.</p>
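        <p>The core of the probing step can be sketched as follows. This is our reading of the idea, not the paper’s actual Algorithm 1: for a probed node, track the size of its connected component in each yearly snapshot; the helper component_size and the toy snapshots are assumptions for the example:</p>
        <preformat>
```python
from collections import deque

def component_size(node, edges):
    """Size of the connected component containing `node`, given undirected edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, queue = {node}, deque([node])
    while queue:  # breadth-first search from the probed node
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen)

# yearly snapshots of a toy citation graph (edges accumulate over time)
snapshots = [
    [("a", "b")],
    [("a", "b"), ("b", "c")],
    [("a", "b"), ("b", "c"), ("c", "d"), ("e", "f")],
]
growth = [component_size("a", edges) for edges in snapshots]
print(growth)  # [2, 3, 4]
```
        </preformat>
        <p>Probing every node of the final largest connected graph in this way yields, per candidate, a growth curve over time, from which the sequence with the most nodes over time can be chosen.</p>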
        <p>In Table 1, we show some of the properties of the last 10 graphs in the dynamic graph. It is clear how the graph evolves over time, as can be seen in how both the number of vertices and edges increase, and how the degree d increases, indicating that the nodes in the graph obtain more citations over time. This indicates that the dynamic graph reflects the natural growth of a paper’s citations.</p>
        <p>By only using a subset of the nodes from the full graph to construct the dynamic graph, we ablate some of the full graph’s properties. One notable property of the full graph is that the citation count of a paper is tied to the degree of a node; by using a subset of the full graph this property does not hold anymore, as a paper’s citation count can only be greater than or equal to its degree in the subgraph. (Table 1 reports, per time step: |V|, |E|, the mean and maximum degree d, the maximum citation count, and the average citation count.)</p>
        <p>Another important point is that removing edges from the graph removes some of the information contained in the full graph (e.g. links to papers in other fields). Such edges are usually connected to more prominent papers, because it is often the high-impact papers which obtain citations from papers outside the main field.</p>
      </sec>
      <sec id="sec-6-3">
        <title>4.3. Feature Generation</title>
        <p>The created dynamic graph nodes are not dependent on a set of specific features, and we can therefore select and create a set of features for each node containing our desired information. With a wide variety of meta-data fields available, we created a set of distinct features which we used for our predictions. Furthermore, we studied how each of these features affects the performance of the model.</p>
        <p>
          The choice of using authors and venues as features for our model is based on the hypothesis that the authors listed on a paper have a major impact on the number of citations gained. We assume the same goes for venues: if a paper is published at a more highly ranked venue, it is more likely to gain a large number of citations compared to a paper published at a lower-ranking venue. We further motivate the choice of these two features based on prior work [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], which shows that author rank and venue rank are indeed two of the three features that are most predictive.
        </p>
        <p>Author rank: To include the author information, we created a feature vector which ranks the authors based on their number of citations, sorted from highest to lowest. Because many authors have the same number of citations, we allow authors to share the same rank. As the final step of the feature calculation, we normalize the rankings by r′ = (r − r_min) / (r_max − r_min).</p>
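        <p>The rank-then-normalize step can be sketched as follows; this is a small illustration under our reading of the description (ties share a rank, then ranks are min-max normalized), and the function name author_ranks is an assumption for the example:</p>
        <preformat>
```python
def author_ranks(citations):
    """Map each author to a min-max normalized rank (0.0 = highest-cited)."""
    # authors with the same citation count share the same rank
    unique_counts = sorted(set(citations.values()), reverse=True)
    rank = {count: r for r, count in enumerate(unique_counts)}
    ranks = {a: rank[c] for a, c in citations.items()}
    # normalize: r' = (r - r_min) / (r_max - r_min)
    lo, hi = min(ranks.values()), max(ranks.values())
    span = (hi - lo) or 1   # avoid division by zero if all ranks are equal
    return {a: (r - lo) / span for a, r in ranks.items()}

counts = {"ann": 120, "bob": 40, "cho": 120, "dee": 5}
print(author_ranks(counts))  # ann/cho share rank 0.0, dee gets 1.0
```
        </preformat>
        <p>The venue rank described below is computed identically, only over venues instead of authors.</p>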
        <p>Venue rank: Together with the author rank, we also hypothesize that the venue has an impact on the number of citations of a paper. Therefore, we also created a feature ranking for the venues. The feature is calculated identically to the author rank. It should be mentioned that the meta-data contains a high number of different labels for each of the venues which we are using. We reduce all the different labels of the same venue down to a single label for each venue, but keep each venue separated by year.</p>
        <p>
          We motivate the choice of using the abstract based on the assumption that the abstract of a paper contains information on the topics discussed in the paper, which can be used to identify if a paper’s topic is currently popular [19]. We further motivate the choice of using author and venue rank, as prior work shows them to be the most descriptive features [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. The preceding paragraphs provide short descriptions of the meta-data used to create these feature vectors and how each of them is calculated.
        </p>
        <p>5. Experiments: In this section, we present our experiments and results, and explore the importance of exploiting topological and temporal information.</p>
        <p>Abstract: To base our model on more than meta-data, we use the abstract of the papers to create a feature vector. To create an embedding of the abstract, we utilize BERT [20], specifically the pre-trained SciBERT [21] model. SciBERT is a contextualized embedding model trained using a masked language modeling objective on a large amount of scholarly literature. Representations from SciBERT have been shown to be useful for learning downstream tasks with scientific text, which is why we use them here. To obtain a feature vector of a given abstract, we tokenize the abstract text and pass it through SciBERT. SciBERT prepends a special [CLS] token for performing classification tasks, so we use the output representation of this token as the final feature vector for an abstract.</p>
        <p>5.1. Data: We use the constructed dynamic graph for our experiments and test each of the three distinct feature vectors.</p>
        <p>A detailed description of the feature vectors and the
dynamic graph’s construction can be found in Section 4.</p>
        <p>We split our data into a training, validation, and test set, with the following splits: 60%, 20%, and 20%. With these splits, we obtain a training set of 22,900 papers, and a validation and test set of 7,634 papers each. The training, validation and test sets are generated randomly, but are kept fixed throughout the experiments.</p>
        <p>Table 2: The performance of our 3 models over a 10 year period (feature sets: Author, Venue, Author + Venue, All). The results are reported as the MAE of the log citations. For the 10-year period, our deterministic approach has an MAE of 1.6378.</p>
        <p>Table 3: The performance of our 3 models over a 20 year period. The results are reported as the MAE of the log citations. For the 20-year period, our deterministic approach has an MAE of 2.0796.</p>
        <p>Due to the large number of time-steps in the dynamic graph, we chose to create two different setups for our experiments: one which uses the last 10 years, and another which uses the last 20 years of the dynamic graph. We use the later years in the dynamic graph, as these years contain the most papers and the graph has evolved the most.</p>
        <p>We compare to a simple deterministic baseline: predicting the mean citation count of the training and validation sets at each time step.</p>
        <p>While not mentioned in Section 4.3, we perform some further pre-processing of the data. For the feature vectors of author rank and venue rank, we perform a normalization of the values. We also pre-process the labels, due to the high fluctuation of the number of citations: we take the log(c + 1) of the citation count of a paper as the label [22]. Taking the log of the citations increases the stability of the model during training.</p>
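        <p>The label transformation can be illustrated directly; log(c + 1) (i.e. log1p) compresses the heavy-tailed citation counts while keeping zero citations at zero. The function name to_label is an assumption for the example:</p>
        <preformat>
```python
import math

def to_label(citations):
    """Training label: log(c + 1) of the raw citation count."""
    return math.log1p(citations)

raw = [0, 9, 99, 999]
labels = [round(to_label(c), 3) for c in raw]
print(labels)  # [0.0, 2.303, 4.605, 6.908]
```
        </preformat>
        <p>Three orders of magnitude in raw counts collapse to a small, stable range of labels, which is what stabilizes training.</p>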
      </sec>
      <sec id="sec-6-4">
        <title>5.2. Experimental Setup</title>
        <p>We perform experiments with three distinct models: 1) our proposed model, consisting of a GCN and an LSTM; 2) a standard LSTM; 3) a standard GCN. All hyper-parameters are shared across the models. We evaluate the models at specific times and over time.</p>
        <p>For our selected models, we used the Adam [23] optimizer with a learning rate of 0.001. For the GCN, we used two layers, each consisting of 256 hidden units. Both the GCN and the GCN with LSTM used this setup. The LSTM was set to have a single uni-directional layer of 128 hidden units, with the output being reduced to 1 dimension by a linear layer. For the models using an LSTM, we set the batch size to 256. We ran the models for 1000 epochs, and if no update to the best validation score was observed over 10 epochs, we terminated training early. As mentioned, we used SciBERT to encode the abstracts, with an output vector of size 768. The models have been run using random seeds, and each experiment has been executed 10 times. In the results section, we report the mean and the standard deviation of the 10 runs.</p>
      </sec>
      <sec id="sec-6-5">
        <title>5.3. Evaluation Metric</title>
        <p>To evaluate the performance of the models, we measure the mean absolute error, defined as MAE = (1/n) Σ_(i=1)^(n) |c_i − ĉ_i|,  (4) where c_i are the citation counts and ĉ_i are the predicted values. We chose to use MAE instead of mean squared error (MSE) to mitigate the influence of outlier papers which have a high number of citations. We additionally use MAE as the training objective for the same reason.</p>
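        <p>The metric in Eq. (4) is simple enough to state in code; a minimal version, applied here to hypothetical log-citation values:</p>
        <preformat>
```python
def mae(y_true, y_pred):
    """Mean absolute error: (1/n) * sum of |c_i - c_hat_i|."""
    assert len(y_true) == len(y_pred)
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

print(mae([2.0, 4.0, 6.0], [2.5, 3.0, 6.5]))  # 0.6666666666666666
```
        </preformat>
        <p>Unlike MSE, each error contributes linearly, so a single heavily cited outlier paper cannot dominate the score or the training objective.</p>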
      </sec>
      <sec id="sec-6-6">
        <title>5.4. Results</title>
        <p>As previously mentioned, we ran our experiments on dynamic graphs of 10 years and 20 years. The results of the 10 year experiment are shown in Table 2, and the results of the 20 year experiment are shown in Table 3.</p>
        <p>The tables show that our models outperform the simple
deterministic approach. Figure 3 shows results over time.</p>
        <p>By inspecting the results, one can clearly observe that the GCN-LSTM has the best performance among the three models. We further observe that the GCN-LSTM improves on the performance of the pure GCN and LSTM individually, indicating that it learns from both the temporal and the topological information provided by the dynamic citation network. Furthermore, the GCN increases in error going from a 10 year interval to a 20 year interval, where we see the other models slightly improve. To further study this, we plot the error at the different time steps in Figure 3, which shows the models’ performance over time. By inspecting the plots, we observe that the pure models, i.e. the GCN and the LSTM, struggle and deteriorate over time, compared to the combined GCN-LSTM model, which keeps improving over time until it starts plateauing. Comparing the 10-year and 20-year plots, one can observe that the deterioration continues where the 10-year plot stops. It can also be seen that the GCN-LSTM keeps improving up until year 10, where it levels out. All of the models decrease drastically in error up until two time-steps; afterward, the pure models start deteriorating.</p>
      </sec>
      <sec id="sec-6-7">
        <title>5.5. Discussion</title>
        <p>Tables 2 and 3 show the impact of single feature types. We hypothesize that author information is very predictive, as shown by prior work. Inspecting the results from the different feature ablations, we observe that the author features perform best, confirming our hypothesis. Figure 3 further confirms this, showing that large parts of the gain of the model over time stem from author information.</p>
        <p>The feature vector created by the venues performs the worst in both experiments. We hypothesize that the venues’ performance could be increased if a more generalized notation for venue meta-data were available. They are noisy (also due to OCR errors) and have many spelling variants.</p>
        <p>To further study the impact of features, we calculate the average MAE for each distinct author and venue, using the predictions made by the GCN-LSTM trained on the author feature vectors over 20 years. We show the results for the venues in Table 4 and those for the authors in Table 5. One can observe that the difference between the top and the bottom venue is drastically lower than the difference between the top and bottom author. This further indicates that author features are strongly predictive of citation counts.</p>
        <p>We also show the average degree and the number of papers for each of the venues in Table 4. With a higher representation of papers in the collection, we expect a more reliable prediction. This is indeed the case: we observe that the top venues often have a higher number of papers in the collection. To further analyse this, we observe the average degree of the papers in the collection; however, we do not notice a higher performance where the degree is higher. This indicates that the model is better at predicting papers with higher citation counts, because the degree of a node is tightly bound to the number of citations.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusions</title>
      <p>In this paper, we propose the task of citation sequence prediction. We introduce a new dataset of scholarly documents for this task, based on a dynamic citation graph evolving over 42 years, starting from a single node and growing to a large graph. We further study the effect of temporal and topological information, and propose a model to benefit from both kinds of information (GCN+LSTM). Our results show that utilizing both the temporal and topological information is superior to utilizing only the temporal or only the topological information. Using the proposed model, we study the effect of different features, to identify which information is most predictive of a paper’s citation count over time. We find author information to be the most predictive and informative over time.</p>
      <p>In future work, the impact of training a single GCN on the dynamic graph could be explored, since the error of the GCN deteriorates quickly over time.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>We would like to thank Johannes Bjerva for the fruitful discussions in the early stages. This work is partly funded by the Independent Research Fund Denmark under grant agreement number 9065-00131B.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>S.</given-names> <surname>Hochreiter</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Schmidhuber</surname></string-name>,
          <article-title>Long Short-Term Memory</article-title>,
          <source>Neural Computation</source>
          <volume>9</volume>
          (<year>1997</year>)
          <fpage>1735</fpage>-<lpage>1780</lpage>.
          URL: https://www.mitpressjournals.org/doi/abs/10.1162/neco.1997.9.8.1735. doi:10.1162/neco.1997.9.8.1735.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>J.</given-names> <surname>Zhou</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Cui</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Hu</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Sun</surname></string-name>,
          <article-title>Graph neural networks: A review of methods and applications</article-title>,
          <source>AI Open</source>
          <volume>1</volume>
          (<year>2020</year>)
          <fpage>57</fpage>-<lpage>81</lpage>.
          URL: https://www.sciencedirect.com/science/article/pii/S2666651021000012. doi:10.1016/j.aiopen.2021.01.001.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>Z.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Pan</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Long</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>P. S.</given-names> <surname>Yu</surname></string-name>,
          <article-title>A Comprehensive Survey on Graph Neural Networks</article-title>,
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          <volume>32</volume>
          (<year>2021</year>)
          <fpage>4</fpage>-<lpage>24</lpage>.
          doi:10.1109/TNNLS.2020.2978386.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>W.</given-names> <surname>Ammar</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Groeneveld</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Bhagavatula</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Beltagy</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Crawford</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Downey</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Dunkelberger</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Elgohary</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Feldman</surname></string-name>,
          <string-name><given-names>V. A.</given-names> <surname>Ha</surname></string-name>,
          <string-name><given-names>R. M.</given-names> <surname>Kinney</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Kohlmeier</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Lo</surname></string-name>,
          <string-name><given-names>T. C.</given-names> <surname>Murray</surname></string-name>,
          <string-name><given-names>H.-H.</given-names> <surname>Ooi</surname></string-name>,
          <string-name><given-names>M. E.</given-names> <surname>Peters</surname></string-name>,
          <string-name><given-names>J. L.</given-names> <surname>Power</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Skjonsberg</surname></string-name>,
          <string-name><given-names>L. L.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Wilhelm</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Yuan</surname></string-name>,
          <string-name><given-names>M. v.</given-names> <surname>Zuylen</surname></string-name>,
          <string-name><given-names>O.</given-names> <surname>Etzioni</surname></string-name>,
          <article-title>Construction of the Literature Graph in Semantic Scholar</article-title>,
          in: NAACL-HLT,
          <year>2018</year>.
          doi:10.18653/v1/N18-3011.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>R.</given-names> <surname>Yan</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Tang</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Shan</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Li</surname></string-name>,
          <article-title>Citation count prediction: learning to estimate future citations for literature</article-title>,
          in: <source>Proceedings of the 20th ACM International Conference on Information and Knowledge Management - CIKM '11</source>,
          ACM Press, Glasgow, Scotland, UK,
          <year>2011</year>,
          p. <fpage>1247</fpage>.
          URL: http://dl.acm.org/citation.cfm?doid=2063576.2063757. doi:10.1145/2063576.2063757.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>T. N.</given-names> <surname>Kipf</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Welling</surname></string-name>,
          <article-title>Semi-Supervised Classification with Graph Convolutional Networks</article-title>,
          in: <source>5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings</source>,
          OpenReview.net,
          <year>2017</year>.
          URL: https://openreview.net/forum?id=SJU4ayYgl.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>T.</given-names> <surname>Yu</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Yu</surname></string-name>,
          <string-name><given-names>P.-Y.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Wang</surname></string-name>,
          <article-title>Citation impact prediction for scientific papers using stepwise regression analysis</article-title>,
          <source>Scientometrics</source>
          <volume>101</volume>
          (<year>2014</year>)
          <fpage>1233</fpage>-<lpage>1252</lpage>.
          URL: http://link.springer.com/10.1007/s11192-014-1279-6. doi:10.1007/s11192-014-1279-6.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>D.</given-names> <surname>Kang</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Ammar</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Dalvi</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>van Zuylen</surname></string-name>,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>