Scalable Graph Size Reduction for Efficient GNN Application

Pavel Procházka¹, Michal Mareš¹,² and Marek Dědič¹,³

¹ Cisco Systems, Inc., Karlovo náměstí 10, Prague, Czech Republic
² Czech Technical University in Prague, Technická 2, Prague, Czech Republic
³ Czech Technical University in Prague, Trojanova 13, Prague, Czech Republic

Abstract
Graph neural networks (GNNs) have been the dominant framework for representation learning on graphs for the past several years. The main strength of GNNs lies in the fact that they can simultaneously learn from both node-related attributes and the relations between nodes represented by edges. In tasks leading to large graphs, a GNN often requires significant computational resources to achieve its superior performance. In order to reduce the computational cost, methods allowing for a flexible balance between complexity and performance are useful. In this work, we propose a simple, scalable, task-aware graph preprocessing procedure that produces a reduced graph on which a GNN achieves a given desired performance on the downstream task. In addition, the proposed preprocessing allows the reduced graph and the GNN to fit into given memory and computational resources. The proposed preprocessing is evaluated and compared with several reference scenarios on conventional GNN benchmark datasets.

ITAT'22: Information Technologies – Applications and Theory, September 23–27, 2022, Zuberec, Slovakia
paprocha@cisco.com (P. Procházka); mimares@cisco.com (M. Mareš); madedic@cisco.com (M. Dědič)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Graph neural networks (GNNs) have proven to achieve superior performance on a number of graph datasets and are adopted in industrial applications across many fields. Superior GNN performance is, however, often paid for by a numerically intensive training procedure with a significant memory footprint. In addition, fine-tuning the parameters of a complex GNN can prove challenging as well. The issue is emphasised whenever the problem modelled by the graph evolves in time and retraining is required to keep the model accurate.

In many industrial applications, the high computational cost caused by training complex models might be an issue. Moreover, some real-world problems lead to amounts of data that do not fit the memory of a single machine. In order to apply GNNs to such big data, either some data reduction or distributed processing is required.

Traditional simple structure-agnostic baselines like the Naive Bayes classifier usually scale very well at a reasonable cost, making them useful in scenarios where such a simple model performs well enough.

In this paper, we address the problem of the complexity-performance trade-off and propose a method operating on an arbitrary performance level between a simple base model and the fine-tuned GNN. The method consists in a systematic reduction of the graph for the GNN, implying complexity reduction at the expense of performance.

As a side product, the graph reduction procedure produces a hierarchical clustering of nodes in the graph that is optimised for a given task. We explore these clusters and demonstrate how to use them to better understand the coarsening procedure.

1.1. Related work

Graph neural networks arose independently in several works. The authors of [1] derived the graph convolutional network (GCN) as a spectral generalisation of image convolution. Node2vec, a node embedding trained to minimise the distance of similar nodes within a graph, is proposed in [2]. A "bridge" between Bayesian message passing and GNNs is presented in [3], where the authors derive a GNN from probabilistic assumptions.

Several high-level concepts of GNN complexity reduction appear in the literature. To mention some of them: graph batching proposed in [4] introduces graph sampling (batching) to enable stochastic gradient descent. Next, feature propagation through the network layers is avoided in [5]; instead, the products of the feature matrix and powers of the adjacency matrix are concatenated and passed to a multi-layer perceptron without any negative impact on performance. Another approach is to identify and remove redundant operations, as proposed in [6]. One can also consider decomposing the graph as in [7] in order to enable distributed training of GNNs, since each cluster represents an independent graph component – this idea was adopted in [8]. A combination of batching with distributed processing is considered in [9]. In contrast with these approaches, the proposed complexity reduction is achieved by graph coarsening.
A systematic way of graph coarsening associated with node2vec is considered in the HARP framework [10]. The idea is to reduce the graph step by step by means of edge and star collapsing, to train a node2vec model on the coarsened graph and then to step-by-step refine the embedding back to the original graph. The motivation behind HARP is to provide an embedding incorporating both local and global graph properties. Using HARP to study the complexity-performance trade-off is proposed in [11], where the graph coarsening is optimised with respect to graph properties and the prolonging procedure is then driven by the downstream task. In contrast to that work, our coarsening is driven in a different way, nonetheless respecting the downstream task.

Graph clustering / partitioning [7] can be used for distributed processing as described above. Another use-case could be organising the data (typically in an unsupervised way) in order to better understand its structure. On top of traditional clustering, hierarchical clustering enables a tree-structured organisation of the clusters, in contrast to conventional flat clusters. In [12], the authors proposed a hierarchical clustering of words in a corpus that is based on a bi-gram language model, with the hierarchy driven by the mutual information between clusters. In [13], the authors propose a hierarchical clustering of nodes in a graph that is driven by a pair sampling ratio derived from the weights of the adjacency matrix. To the best of our knowledge, there is no available hierarchical clustering of nodes in a graph that is aware of the downstream task, as we propose in this work.

1.2. Goals and Contribution

The work reported in this paper presents the following contributions:

1. We motivate and formalise the problem of the complexity-performance trade-off and propose a general way of systematically solving and evaluating this problem. The core of this concept is GNN graph size reduction based on edge contraction.
2. We adapt the edge-contraction-based coarsening procedure to a particular task by means of a proper order of the edges for contraction. This order of edges is given by a similarity measure of the predictive posterior distributions of adjacent nodes, where a simple reference model on the given task is used to obtain the posterior distributions for all nodes.
3. We experimentally compare various similarity measures of probability distributions driving the coarsening procedure with respect to the target complexity-performance trade-off. In particular, we show that the computational graph size can be almost halved with only a small impact on the performance.
4. As a side product, we observed that graph coarsening based on edge contraction naturally produces a hierarchical clustering. This clustering combines information about the predictions of the base model with the graph structure, hence the clustering is driven by the downstream task.

1.3. Structure of the Paper

The paper is organised as follows: a formal description of the complexity-performance trade-off and a general framework for solving this problem are presented in Section 2. A coarsening procedure based on edge contraction, as a particular instance of the proposed framework, is presented in Section 3. Section 4 then discusses a critical part of the edge contraction procedure, which is the order in which the edges are contracted. The hierarchical clustering view of the considered graph coarsening procedure is the topic of Section 5. Experiments are presented in Section 6 and the work is concluded in Section 7, where we also outline future research directions.
2. Graph Complexity Reduction

We consider a collection of nodes V, edges E ⊆ V × V, real-valued node feature vectors X and categorical node labels Y constituting an undirected graph 𝐺 = (V, E, X, Y). We denote by 𝑒 = (𝜈𝐴, 𝜈𝐵) the edge connecting nodes 𝜈𝐴 and 𝜈𝐵. The neighbourhood of the node 𝜈 is denoted N(𝜈) and the tuple (⋆, 𝜈) stands for the set of all edges incident on 𝜈, that is (⋆, 𝜈) = {(𝜈̃, 𝜈) ∶ 𝜈̃ ∈ N(𝜈)}.

The downstream task is transductive node classification, where labels are available for the nodes in the training set. A model 𝑀 is assumed to provide a prediction 𝑦̂ for each node. The performance of the model 𝑃(Y, Ŷ) is then given by a statistic evaluated on the test nodes (e.g. classification accuracy), where the pair (Y, Ŷ) denotes the true and predicted node labels. We also consider a complexity of the model 𝑀 applied on the graph 𝐺, given by 𝐶(𝐺, 𝑀).

2.1. Problem Formulation

Given a (GNN) model 𝑀 and a graph 𝐺, the goal of this paper is to find a graph 𝐺⋆ = (V⋆, E⋆, X⋆, Y⋆) jointly with a refinement mapping 𝐿 ∶ Ŷ⋆ → Ŷ in order to either

1. maximise the performance 𝑃(Y, 𝐿(Ŷ⋆)) subject to the complexity restriction 𝐶(𝐺⋆, 𝑀) ≤ 𝐶𝑚 for a given complexity constraint 𝐶𝑚, or
2. minimise the complexity 𝐶(𝐺⋆, 𝑀) subject to a given minimal required performance 𝑃(Y, 𝐿(Ŷ⋆)) ≥ 𝑃𝑚.

2.2. Graph Algorithm Complexity

There are two main aspects of GNN complexity. First, the underlying graph 𝐺 needs to be stored in memory. Second, the training procedure typically runs through the graph and needs to access it in order to drive the learning process. The authors of [4] report both the memory and the computational complexity of various GCN training techniques (Table 1 therein). In principle, both complexities, or even any function of them, can be covered by the complexity term 𝐶(𝐺, 𝑀).
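To make the notation above concrete, the following minimal sketch (our illustration, not code from the paper) shows one possible in-memory representation of 𝐺 = (V, E, X, Y) together with the neighbourhood operator N(𝜈); all class and field names, as well as the toy graph itself, are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Graph:
    """Illustrative container for an undirected graph G = (V, E, X, Y)."""
    nodes: set       # V: node identifiers
    edges: set       # E: frozenset({u, v}) pairs, undirected
    features: dict   # X: node -> feature vector of length F
    labels: dict     # Y: node -> class label, training nodes only

    def neighbourhood(self, v):
        """N(v): all nodes sharing an edge with v."""
        return {u for e in self.edges if v in e for u in e if u != v}

# A five-node toy graph, loosely following the example of Figure 3.
G1 = Graph(
    nodes=set("ABCDE"),
    edges={frozenset(p) for p in [("A", "B"), ("A", "D"), ("B", "C"),
                                  ("B", "E"), ("C", "D"), ("D", "E")]},
    features={v: [0.0, 0.0] for v in "ABCDE"},
    labels={"A": "green", "B": "green", "E": "blue"},
)
print(sorted(G1.neighbourhood("B")))   # ['A', 'C', 'E']
```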
2.3. The Performance-Complexity Trade-Off

Given a size-decreasing sequence of graphs

S = [𝐺1, …, 𝐺𝑖, …, 𝐺𝑁],   (1)

where 𝐺𝑖 = (V𝑖, E𝑖, X𝑖, Y𝑖) and |V𝑖| > |V𝑗| ⟺ 𝑖 < 𝑗, we expect the performance of a model (GNN) to be roughly proportional to the graph size. The main focus of this paper is the construction of such a size-decreasing graph sequence that enables the fulfilment of the optimisation problem described in Section 2.1. Evaluating the performance of the model 𝑀 on each graph in the sequence, one can plot the dependency of the performance on the complexity (graph size). Denoting 𝑃𝑖 the performance of model 𝑀 achieved using 𝐺𝑖 with refinement into the original graph 𝐺, our goal can be achieved according to the illustration¹ in Figure 1:

• Given a maximum available complexity 𝐶𝑚, we select a graph such that the complexity given by the number of nodes in the graph is smaller than 𝐶𝑚, that is |V𝑖| ≤ 𝐶𝑚, and achieve performance 𝑃3 – blue arrows.
• Given a minimal required performance 𝑃𝑚, we select a graph such that 𝑃𝑖 ≥ 𝑃𝑚, with complexity 𝐶2 = |V2| – green arrows.

Figure 1: Achieving the target operating point by means of the complexity-performance trade-off. In case of a minimal required performance (green arrows), we find the graph 𝐺2 on which the model achieves at least performance 𝑃𝑚. In case of maximal available resources (blue arrows), we select the model achieving performance 𝑃3.

¹ The illustration captures the intuition that the performance grows with the graph complexity. As shown in Section 6, this is not always strictly true for a particular model. Nevertheless, if we focus on practical usage (Section 2.4), the non-decreasing property is not required in a real application.

2.4. Finding a Working Point Using the Performance-Complexity Trade-Off in Practice

In practice, it could be very expensive to calculate the performance for all graphs in the sequence S according to Figure 1, and hence no resources would be saved, contrary to our main motivation. In order to overcome this limitation, we can consider one or more of the following options:

• In order to find the proper working point, we do not need to have the complete curve as in Figure 1. One possible strategy could be to start with a small graph, where the complexity is low, and to continue with larger graphs until the desired performance is achieved (see the sketch after this list).
• Another option that can be applied in some cases could be to rely on a generalisation of the provided information, that is, a transfer of the proper graph size among tasks, time, models, etc.
• Finally, if we accept some restrictions according to Section 3.4, we can use an upper bound (or its estimate) as described in Section 4.1 instead of the true GNN performance.
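The first option above can be made concrete as follows. The sketch below walks the candidate graph sizes from the smallest upwards and stops as soon as either the performance target 𝑃𝑚 is met or the complexity budget 𝐶𝑚 is exceeded; `coarsen_to_size` and `train_and_evaluate` are hypothetical helpers standing in for the coarsening of Section 3 and for GNN training with refinement to the original graph, respectively.

```python
def find_working_point(graph_sizes, coarsen_to_size, train_and_evaluate,
                       p_min=None, c_max=None):
    """Walk the complexity-performance curve only as far as needed.

    graph_sizes        -- candidate node counts, ordered from smallest to largest
    coarsen_to_size    -- hypothetical: returns a coarsened graph with ~n nodes
    train_and_evaluate -- hypothetical: trains the model on the coarsened graph
                          and returns the performance refined to the original graph
    p_min              -- minimal required performance (green arrows in Figure 1)
    c_max              -- maximal allowed complexity in nodes (blue arrows)
    """
    best = None
    for n in graph_sizes:
        if c_max is not None and n > c_max:
            break                      # complexity budget exhausted
        graph = coarsen_to_size(n)
        performance = train_and_evaluate(graph)
        best = (n, performance)
        if p_min is not None and performance >= p_min:
            break                      # required performance reached
    return best
```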
3. Graph Size Reduction Strategy Based on Edge Collapsing

This section describes a way of obtaining the graph sequence from Equation (1), which is then used for the target complexity-performance trade-off evaluation. The approach is based on the edge contraction procedure described within this section. The order of edges for contraction, which drives the size-decreasing graph sequence from Equation (1), is described in greater detail in Section 4. Finally, the hierarchical clustering induced by the edge contracting procedure can be found in Section 5.

We can write the sequence S from Equation (1) as

S = [(V1, E1, X1, Y1), …, (V𝑁, E𝑁, X𝑁, Y𝑁)],   (2)

where 𝐺1 = (V1, E1, X1, Y1) stands for the original graph. The considered graph-size-reduction process is driven step by step. We first describe a single graph coarsening step, that is, how to get from 𝐺𝑖 to 𝐺𝑖+1 in Equation (1), and then define the overall graph reduction procedure.

3.1. Single Edge Contraction Step

Given an edge to contract, its incident nodes are merged (removed) and a new node is established – see Figure 2 for an illustration. More precisely, given a graph 𝐺𝑖 = (V𝑖, E𝑖, X𝑖, Y𝑖) and an edge 𝑒 = (𝜈𝑋, 𝜈𝑌), 𝑒 ∈ E𝑖, 𝜈𝑋 ∈ V𝑖, 𝜈𝑌 ∈ V𝑖 to remove, the new nodes and edges describing 𝐺𝑖+1 are given by

V𝑖+1 = V𝑖 ∖ {𝜈𝑋, 𝜈𝑌} ∪ {𝜈𝑋𝑌},   (3)

where 𝜈𝑋𝑌 is the new (merged) node, and

E𝑖+1 = E𝑖 ∖ ([𝜈𝑋, 𝜈𝑌], ⋆) ∪ {(𝜈𝑋𝑌, 𝜈) ∶ 𝜈 ∈ N𝑖(𝜈𝑋, 𝜈𝑌)},   (4)

where

([𝜈𝑋, 𝜈𝑌], ⋆) = (𝜈𝑋, ⋆) ∪ (𝜈𝑌, ⋆),
N𝑖(𝜈𝑋, 𝜈𝑌) = N𝑖(𝜈𝑋) ∪ N𝑖(𝜈𝑌) ∖ {𝜈𝑋, 𝜈𝑌},

and the neighbourhood is taken according to 𝐺𝑖.

Figure 2: Coarsening given by contracting the edge (𝜈𝑋, 𝜈𝑌), where the nodes 𝜈𝑋 and 𝜈𝑌 are replaced by a new node 𝜈𝑋𝑌 and the edges connecting 𝜈𝑋 and 𝜈𝑌 are rewired to 𝜈𝑋𝑌. As discussed in Section 5, this atomic merge can also be seen as one step in a hierarchical clustering, where the clusters 𝜈𝑋 and 𝜈𝑌 become children of the cluster 𝜈𝑋𝑌.

In the context of our setup, that is, in the situation where we want to apply a GNN on the new graph 𝐺𝑖+1 and evaluate its performance on the original one (𝐺1), the following issues arise and must be resolved:

1. Based on the nodes 𝜈𝑋 and 𝜈𝑌, we need to aggregate the training labels (if available) to the new node 𝜈𝑋𝑌 and decide whether this label should be used for training on the coarsened graph, that is, to define Y𝑖+1 based on Y𝑖. This problem is discussed in Section 3.3.
2. Apart from labels, node features must be aggregated as well (transforming X𝑖 into X𝑖+1) – see Section 3.2 for details².
3. Finally, once we have a prediction for the node 𝜈𝑋𝑌, we need to establish predictions for all the nodes from V1 that were merged into 𝜈𝑋𝑌. We introduced the refinement mapping 𝐿 for this purpose in Section 2.1. More details and a simple refinement mapping are presented in Section 3.4.

Within this paper, we only show a very simple possible solution to each of the mentioned issues. A deeper investigation of them is not within the scope of this paper and is reserved for future work.

² Handling edge features would also need to be considered if they were available. Since this is not the case in our assumptions, we leave this problem for future work.

3.2. Feature Aggregation

The GNN model assumes the feature vectors of all nodes to have the same dimension, hence some aggregation is required when merging two nodes. In particular, we have two nodes 𝜈𝑋, 𝜈𝑌 with feature vectors 𝑥𝑋 and 𝑥𝑌 of length 𝐹 that must be aggregated by some aggregation 𝑎 into one vector 𝑥𝑋𝑌 = 𝑎(𝑥𝑋, 𝑥𝑌) of length 𝐹 corresponding to the newly merged node 𝜈𝑋𝑌.

Although a number of solutions can be considered, we assume a simple weighted mean aggregation within this work, that is

𝑥𝑋𝑌 = 𝑎(𝑥𝑋, 𝑥𝑌) = (𝑤𝑋 𝑥𝑋 + 𝑤𝑌 𝑥𝑌) / (𝑤𝑋 + 𝑤𝑌),   (5)

where the weight 𝑤𝐴 is given by the number of nodes from the original graph 𝐺1 that were merged into 𝜈𝐴 in the preceding steps.

3.3. Label Aggregation

Similarly to the case of features, we also need to aggregate the labels associated with the nodes 𝜈𝑋, 𝜈𝑌 that are to be merged. We (ad hoc) consider a similar approach as in Equation (5), where we convert the labels to a prior distribution 𝑝𝑦 over labels and obtain

𝑝𝑦𝑋𝑌 = (𝜔𝑋 𝑝𝑦𝑋 + 𝜔𝑌 𝑝𝑦𝑌) / (𝜔𝑋 + 𝜔𝑌),   (6)

where 𝜔𝐴 denotes³ the number of training labels merged into 𝜈𝐴.

In multi-class classification, the prior distribution 𝑝𝑦 has the form of a probability vector with a single non-zero (unit) element placed at the position of the true label, if available. Since the weight 𝜔𝐴 is non-zero only if at least one training sample was merged into 𝜈𝐴, only probability vectors related to the training samples are reflected in Equation (6).

³ Recall that 𝜔𝐴 denotes the number of training labels merged into 𝜈𝐴, while 𝑤𝐴 from Equation (5) stands for the number of all nodes merged into 𝜈𝐴.
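The single contraction step of Equations (3)-(6) can be sketched as follows (an illustrative implementation under our own representation choices, not the authors' code): the edge (𝜈𝑋, 𝜈𝑌) is replaced by a merged node, its incident edges are rewired, and the feature vectors and label priors are combined with the weights 𝑤 and 𝜔.

```python
import numpy as np

def contract_edge(nodes, edges, feats, priors, w, omega, x, y):
    """Contract the edge (x, y) into a new merged node, following Eq. (3)-(6).

    nodes  -- set of node ids
    edges  -- set of frozenset({u, v}) undirected edges
    feats  -- node -> feature vector (array-like of length F)
    priors -- node -> label prior distribution (array-like) or None if unlabeled
    w      -- node -> number of original nodes merged into it (weights of Eq. 5)
    omega  -- node -> number of training labels merged into it (weights of Eq. 6)
    """
    xy = (x, y)                                            # id of the new node (Eq. 3)
    neigh = {u for e in edges if e & {x, y} for u in e} - {x, y}

    nodes = nodes - {x, y} | {xy}                          # Eq. (3)
    edges = ({e for e in edges if not e & {x, y}}          # drop ([x, y], *)
             | {frozenset((xy, u)) for u in neigh})        # rewire to xy (Eq. 4)

    # Eq. (5): weighted mean of the feature vectors.
    feats[xy] = (w[x] * np.asarray(feats[x]) + w[y] * np.asarray(feats[y])) / (w[x] + w[y])
    w[xy] = w[x] + w[y]

    # Eq. (6): weighted mean of the label priors; only training labels count.
    omega[xy] = omega[x] + omega[y]
    if omega[xy] > 0:
        px = np.asarray(priors[x]) if priors.get(x) is not None else 0.0
        py = np.asarray(priors[y]) if priors.get(y) is not None else 0.0
        priors[xy] = (omega[x] * px + omega[y] * py) / omega[xy]
    else:
        priors[xy] = None                                  # no training label merged in
    return nodes, edges, xy
```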
3.4. Label Refinement

Label refinement is related to the node embedding prolonging in [10]: we need to transform a prediction about a cluster into predictions about its constituent nodes. Within this work, we assume a trivial copy-paste method, where all merged nodes share the same prediction.

More precisely, running a (GNN) model on 𝐺𝑖, we receive a prediction for each node from V𝑖 that needs to be prolonged to the nodes from V1 for the target accuracy evaluation. The relation between the nodes in V𝑖 and V1 is hierarchical, so that each node 𝜈 ∈ V𝑖 can be seen as a hierarchical composition of nodes V̄ ⊆ V1 such that the node 𝜈 arose by merging the nodes from V̄ according to Equation (3) in the preceding steps. The considered simple prolonging mapping 𝐿 then takes the prediction related to 𝜈 and applies it to all nodes from V̄.

3.5. Graph Coarsening Sequence

We have already described a single edge contraction step from 𝐺𝑖 to 𝐺𝑖+1, which assumed a given edge to remove. Given an ordered list of edges to remove, we can step by step obtain the sequence from Equation (1) in order to evaluate the performance-complexity trade-off for a given graph and model. An example of the graph decomposition is illustrated in Figure 3, while the final missing piece of this procedure, that is, ordering the edges to be contracted, is described in Section 4.

Figure 3: Graph sequence determined by a given sequence of edges to contract, where the numbers on the edges stand for the edge order for contraction. We use colors to denote labels and to indicate the problem of label aggregation – their aggregation is clear until the final node in the last step.
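A possible way to realise both the coarsening sequence of Section 3.5 and the copy-paste refinement mapping 𝐿 of Section 3.4 is to maintain, for every original node, the merged node that currently represents it. The sketch below (illustrative names, simplified bookkeeping; feature and label aggregation as in Sections 3.2 and 3.3 is omitted) does exactly that.

```python
def refine_predictions(merged_into, cluster_predictions):
    """Copy-paste refinement (Section 3.4): every original node inherits the
    prediction of the merged node (cluster) it currently belongs to."""
    return {v: cluster_predictions[merged_into[v]] for v in merged_into}

def coarsening_sequence(original_nodes, ordered_edges, num_steps):
    """Track the mapping V_1 -> V_i while contracting edges in the given order
    (Section 3.5).  Returns one node-to-cluster mapping per coarsening step."""
    merged_into = {v: v for v in original_nodes}    # initially every node is its own cluster
    snapshots = []
    for u, v in ordered_edges[:num_steps]:
        cu, cv = merged_into[u], merged_into[v]
        if cu == cv:
            continue                                # the edge already lies inside one cluster
        new_cluster = (cu, cv)                      # id of the merged node
        for node, c in merged_into.items():
            if c in (cu, cv):
                merged_into[node] = new_cluster
        snapshots.append(dict(merged_into))         # mapping for the graph G_{i+1}
    return snapshots
```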
4. Ordered Sequence of Edges for Removal

This section focuses on the last missing piece of the complexity-performance trade-off evaluation, which is obtaining the sequence of edges driving the edge contraction procedure. First, a simple performance upper bound for a graph reduced according to Section 3 is presented. The suggested edge sorting algorithm then aims to provide an ordering of edges such that this upper bound is maximised at each step when the edge contraction algorithm is applied in this order.

4.1. Label Refinement Induced Performance Upper Bound

Considering all nodes in the testing set, in Figure 3 we can see that only nodes of the same color (label) are merged together in the first three steps. In that case, 100% accuracy can theoretically be achieved if the merged nodes are predicted correctly. However, the final edge contraction merges together nodes with different colors (labels), and the performance from this step forward cannot be greater than 60%, which is achieved if the node 𝜈𝐴𝐵𝐶𝐷𝐸 is predicted to be a blue one. If it is predicted as a green one, only 40% accuracy is achieved. In this case, 60% presents a performance upper bound, since no model can achieve better performance on the final single-node graph. Note that the upper bound is 100% for the first three steps.

More formally, according to Section 3.4, we assume all the merged nodes⁴ V̄ ⊆ V1 to share the prediction of the corresponding node 𝜈 ∈ V𝑖. Obviously, if the true test labels differ within V̄, some prediction errors are unavoidable. Given the node 𝜈 ∈ V𝑖, we calculate a histogram of the test labels from V̄. Denoting 𝑦0 the most common test label within V̄, the minimal number of errors on the test nodes from V̄ is given by the number of test labels that differ from 𝑦0. Denoting this number of errors as 𝑈(𝜈) for each node 𝜈 ∈ V𝑖, the performance upper bound can be written as

1 − (1/|𝑇|) ∑𝜈∈V𝑖 𝑈(𝜈),   (7)

where |𝑇| denotes the number of nodes from V1 in the test set. We also note that considering this upper bound as the performance in the performance-complexity trade-off (Section 2.1), we obtain a non-increasing sequence of performances for the sequence S in Equation (1).

⁴ Recall that V̄ is the set of nodes from the original graph that are merged into the node 𝜈, for a given 𝜈 ∈ V𝑖.

4.2. Edge Ordering Maximizing the Performance Upper Bound

We can keep the upper bound in Equation (7) equal to 1 while merging only nodes with the same label. Once we start to merge nodes with varying labels, the upper bound starts to decrease. Keeping the bound high would be a trivial operation if all labels were available. Nevertheless, since only the training labels are available, we need to somehow estimate the labels of the nodes for which the ground truth is not available.

Instead of comparing true labels, we propose to compare the predictive posterior distributions related to the downstream task, where the prediction is provided by an auxiliary (base) model. We then propose to calculate a similarity measure on the pair of posterior distributions for every edge (i.e. the corresponding node pair) and to sort the edges according to this similarity to establish the order of edges to be contracted.

4.3. Base Model

The base model serves to evaluate a posterior probability for each node in the graph, determining the pairwise node similarity used for sorting the edges. Since the model is assumed to process the entire graph, it is expected to be a simple one.

4.4. Similarity Measures

Within this paper, we experimentally evaluate several similarity measures of the probability distributions. In particular, we consider cross-entropy and the KL divergence as similarity measures of the posterior probability distributions. The asymmetry of these measures does not have any effect, since we assume an undirected graph: the similarity is calculated for both directions and the smaller value is used.
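The two ingredients of this section can be sketched as follows (our illustration; the function names and the numerical safeguards are assumptions). The bound of Equation (7) counts, inside every cluster, the test labels that disagree with the cluster majority, while the contraction order sorts the edges by a symmetrised KL divergence or cross-entropy of the base-model posteriors of their endpoints; edges with the most similar endpoints are contracted first.

```python
import numpy as np
from collections import Counter

def accuracy_upper_bound(clusters, test_labels):
    """Eq. (7): within each cluster the best any model can do is predict the
    majority test label, so the residual disagreements U(v) bound the accuracy.

    clusters    -- iterable of sets of original nodes (one set per node of V_i)
    test_labels -- original node -> true label, for test nodes only
    """
    n_test = len(test_labels)
    errors = 0
    for cluster in clusters:
        labels = [test_labels[v] for v in cluster if v in test_labels]
        if labels:
            errors += len(labels) - Counter(labels).most_common(1)[0][1]  # U(v)
    return 1.0 - errors / n_test

def edge_contraction_order(edges, posterior, measure="kl"):
    """Sort edges by the similarity of the base-model posteriors of their
    endpoints (Sections 4.2-4.4).  Both directions are evaluated and the
    smaller value is used, removing the asymmetry of the measures."""
    eps = 1e-12                                   # numerical safeguard (our choice)
    def cross_entropy(p, q):
        return -np.sum(np.asarray(p) * np.log(np.asarray(q) + eps))
    def kl(p, q):
        p, q = np.asarray(p), np.asarray(q)
        return np.sum(p * np.log((p + eps) / (q + eps)))
    d = kl if measure == "kl" else cross_entropy
    def score(edge):
        u, v = tuple(edge)
        return min(d(posterior[u], posterior[v]), d(posterior[v], posterior[u]))
    return sorted(edges, key=score)               # most similar endpoints first
```

Feeding the returned ordering into the sequence construction of Section 3.5 yields the coarsening sequence whose upper bound (7) decays as slowly as the base model allows.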
5. Hierarchical Clustering of Nodes Induced by Edge Contraction

As demonstrated in Figure 2, the edge-contraction-based coarsening leads to a hierarchical clustering represented by a tree, where the nodes of the original graph 𝜈 ∈ V1 form the leaves of the tree and each edge contraction defines a relation between a parent and its children according to Equation (3). As an example, the tree for the procedure from Figure 3 is shown in Figure 4.

Figure 4: Hierarchical clustering tree corresponding to the edge contraction in Figure 3.

Although this clustering is only a side-effect of the edge contraction procedure, it brings⁵ a useful insight into the quality of the edge ordering. The label refinement (Section 3.4) basically operates within each hierarchical cluster, since the (GNN) model is assumed to make a prediction about this cluster that then needs to be refined into a prediction for the nodes in the original graph 𝐺.

⁵ We discuss other potential applications in Section 7.

5.1. Multi-Step Edge Contraction

The edge contraction described in Section 3.1 can be easily extended by considering multiple edges to be contracted simultaneously. In that case, the sub-graph defined by the collection of edges to contract is considered, and each component of this sub-graph then forms a new merged node.

The motivation for this multi-step consideration is to decrease the depth of the hierarchical tree on the one hand, or to reduce the number of graphs in the sequence (1) on the other. The depth of the tree representing the hierarchical clustering can otherwise be as high as the number of nodes in the graph. The multi-edge contraction procedure and its relationship with the single edge contraction step described in Section 3.1 is illustrated on a particular example in Figure 5.

Figure 5: Illustration of a multi-edge coarsening provided in one step and its relation to the one-by-one edge coarsening, where we use the hierarchical clustering as described in Section 5. A graph of five nodes given by the edges E = [(𝜈𝐷, 𝜈𝐸), (𝜈𝐶, 𝜈𝐷), (𝜈𝐵, 𝜈𝐶), (𝜈𝐴, 𝜈𝐵)] is coarsened in four steps (left). If these steps are performed simultaneously, a flat tree is obtained (right).
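The tree of Figure 4 and the multi-step variant of Section 5.1 can be built with a few parent pointers, as in the following sketch (illustrative names; for simplicity each batch of simultaneously contracted edges is assumed to form a single connected component, whereas Section 5.1 would create one merged node per component).

```python
def build_hierarchy(leaves, contraction_steps):
    """Build the clustering tree induced by (multi-)edge contraction.

    leaves            -- original nodes of G_1 (the leaves of the tree)
    contraction_steps -- list of steps; each step is a list of edges (u, v)
                         contracted simultaneously (Section 5.1)
    Returns parent pointers: child cluster -> parent cluster.
    """
    current = {v: v for v in leaves}          # original node -> its current cluster
    parent = {}
    for step, edge_batch in enumerate(contraction_steps):
        touched = {current[u] for e in edge_batch for u in e}
        if len(touched) < 2:
            continue                          # nothing new is merged in this step
        new_cluster = ("step", step)          # the parent created by this step
        for child in touched:
            parent[child] = new_cluster
        for node, c in current.items():
            if c in touched:
                current[node] = new_cluster
    return parent

# Contracting all four edges of the Figure 5 example in one step yields a flat
# tree: every leaf becomes a direct child of the single root ("step", 0).
print(build_hierarchy("ABCDE", [[("D", "E"), ("C", "D"), ("B", "C"), ("A", "B")]]))
```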
6. Experiments

This section provides experimental validation of the proposed complexity-performance trade-off on traditional real-world graph datasets.

6.1. Experiments Objectives

Since our claims are not as straightforward as stating that one method overcomes another, we first state what we try to achieve within the experiments.

First of all, we would like to find a sweet spot of the complexity-performance trade-off for a given task, enabling us to outperform the baseline with reduced complexity. We found this point in all cases shown in Figure 6a.

The next objective is to compare the performance upper bound from Section 4.1 with the true performance – Figure 6a. Since the performance upper bound can be calculated cheaply compared to the true performance, it can be considered as an approximation of the true performance in the complexity-performance trade-off. This would save the significant computational resources needed for the true performance evaluation of all graphs in the sequence S, but drawing conclusions from the upper bound needs to be validated.

We further investigate the impact of the various choices driving the edge order for removal described in Section 4 on the complexity-performance trade-off, and we also attempt to validate using the posterior distribution for edge sorting (Section 4.2) instead of the prior (labels) used in the upper bound (Figure 6b). Finally, we demonstrate the node clustering as a useful visualisation of a particular graph coarsening.

6.2. Experiment Setup

We consider a logistic regression model as the base model in all experiments. Based on the posterior distribution obtained from the logistic regression, we calculate the ordering of edges for removal using the KL-divergence and cross-entropy similarity measures on the predictive node distributions. We also consider a random ordering as a reference.

We first calculate the graph sequence according to Section 3.5. This sequence inherently provides the performance upper bound (Section 4.1). A GCN [1] is then trained on each graph in the sequence, where the labels and features are aggregated according to Sections 3.2 and 3.3, and the performance (accuracy) is obtained on the test set refined to the original graph (Section 3.4). We use the cross-entropy loss function within training, where the cross-entropy is calculated against the prior distribution from Equation (6). A node is considered a training node if it contains at least one aggregated label from 𝐺1.

We consider the complexity to be given by the number of nodes in the graph within our experiments. This simplified complexity measure takes only the graph into account; it is not suitable for a comparison of multiple models, since their complexity is not reflected.

We evaluated the performance-complexity trade-off for the Cora, PubMed [14] and DBLP [15] datasets, where we sub-sampled the training labels – see the results in Figure 6a. In order to validate the use of the posterior instead of the prior distribution, we used the distribution given by the base model if the label was not available and, if it was, the distribution given by 𝑟𝑝𝑦 + (1 − 𝑟)𝑝𝑦̂, where 𝑝𝑦 denotes the prior distribution of the data determined by the training labels, 𝑝𝑦̂ the posterior distribution (the base model output) and 𝑟 is a coefficient enabling mixing between the two distributions. The results are shown in Figure 6b.

The complexity-performance trade-off curves in Figures 6a and 6b are equipped with 90% symmetric confidence intervals, where the interval is given by the corresponding quantiles of the Beta distribution constructed as the conjugate prior to the binomial distribution given by 𝑥 errors out of 𝑁𝑡 testing examples (so that 1 − 𝑥/𝑁𝑡 stands for the accuracy).
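Two computational details of this setup can be written out explicitly, as in the sketch below; the use of scipy and the uniform Beta(1, 1) prior are our assumptions rather than choices stated in the paper.

```python
import numpy as np
from scipy.stats import beta

def sorting_distribution(prior, posterior, r):
    """Distribution used for edge sorting in Figure 6b: r*p_y + (1-r)*p_yhat when
    a training label is available, otherwise the base-model posterior alone."""
    if prior is None:
        return np.asarray(posterior)
    return r * np.asarray(prior) + (1.0 - r) * np.asarray(posterior)

def accuracy_interval(n_correct, n_test, level=0.90):
    """90% symmetric interval for the accuracy from Beta quantiles, treating the
    Beta distribution as the conjugate of the binomial (uniform Beta(1, 1) prior
    assumed here)."""
    a = n_correct + 1                 # successes + 1
    b = (n_test - n_correct) + 1      # failures + 1
    lo, hi = beta.ppf([(1 - level) / 2, (1 + level) / 2], a, b)
    return lo, hi

print(accuracy_interval(n_correct=870, n_test=1000))   # interval around 0.87
```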
Figure 6: Evaluation of the performance-complexity trade-off on the DBLP, Cora and Pubmed datasets (performance measured as test accuracy, complexity as the number of nodes in the graph). (a) Performance comparison of the various distributions used for sorting the edges for contraction (cross-entropy, KL divergence, random ordering, the base model, and the corresponding upper bounds), plotted for different fractions of training labels (20%, 10% and 2%). (b) Performance comparison of the distributions used for sorting the edges for contraction when mixing the prior and the posterior with coefficients 𝑟 ∈ {0, 0.4, 0.8, 1}.

6.3. Performance Upper-Bound versus GNN Performance

The comparison of the upper bound and the true performance can be found in Figures 6a and 6b. The upper bound does not suffer from the fluctuations of the true performance. In cases where the base model is strong enough, the working point selection according to the upper bound could therefore bring reasonable results.

6.4. Impact of Edge Ordering on the Performance-Complexity Trade-Off

The impact of the chosen distribution similarity measure on the performance is shown in Figure 6a. Surprisingly, even the random baseline gives good results in some cases. While no metric is the best choice in case of the Pubmed and DBLP datasets, the KL divergence seems to be a superior choice in case of the Cora dataset. In case of 2% training labels, the base model is so weak that the qualitative difference between the edge sorting strategies disappears for all datasets. Interestingly, the observations made on the true performance can also be deduced from the upper bounds.

As can be seen in Figure 6b, adding the prior information makes the results consistently worse. This is probably caused by the fact that neighbouring nodes with the same label in the training set are merged first, and the resulting lack of training data then causes the observed performance drop. In contrast with the evaluation of the similarity measures, the upper bound provides an inverse observation here.
6.5. Hierarchical Clustering

The hierarchical clustering described in Section 5 presents a neat way of manually validating the graph coarsening procedure. We evaluated the base model on the Karate Club dataset and used the proposed coarsening procedure according to Section 5.1, where two edges were contracted in each step. The resulting tree is shown in Figure 7.

Figure 7: The hierarchical tree corresponding to the edge contracting procedure driven by the base model on the Karate Club dataset. The true labels of the leaf nodes are color-coded and the training labels are denoted by a double circle. The number in the circle denotes the step in the edge contraction order – in each step, two edges are simultaneously contracted according to Section 5.1.

Observing the tree, we can recognise splits (or coarsening steps) resulting from merging nodes with the same label, which is the desired behaviour, because the graph size is reduced without any performance loss in such a case. However, merging differently labelled nodes unavoidably leads to a performance loss, because our label refinement procedure guarantees the same prediction for each node within the cluster (see Section 4.1).

In case of the Karate Club tree (Figure 7), the first four steps of the procedure merge only nodes with a matching label. However, from the fifth step onward, at least two errors occur and affect the total accuracy (one error in case of the test accuracy).

7. Conclusion and Discussion

We formalised the performance-complexity trade-off and described a framework for smoothly adjusting either the desired model complexity for given performance requirements or the performance for the available computational resources. The framework consists of a methodical step-by-step graph coarsening by way of edge contraction, where we addressed the practical aspects of feature and label aggregation within node merging, as well as the refinement of the prediction from a merged node back to the original nodes. We have also shown that this edge coarsening naturally produces a hierarchical clustering of the nodes in the graph.

The ordering of edges for the contracting procedure was then studied in greater detail, where we suggested employing a base model to produce a predictive posterior distribution for each node and then sorting the edges according to the distance between the probability distributions of adjacent nodes.

We applied the proposed framework to widely used, publicly available graph datasets. In particular, we explored the coarsening procedure on the Karate Club dataset by means of a hierarchical tree. Finally, we compared the KL-divergence and cross-entropy similarity measures for driving the edge ordering.

Although our primary focus was the complexity-performance trade-off, we believe that this work opens several research directions for the future.

The considered label refinement is extremely simple; however, we can alternatively consider an arbitrary model within each cluster. This will incur some additional cost, but distributed processing may be utilised since the clusters do not overlap. Moreover, the information from this "local" model can be combined with the global one, which could lead to superior performance since both the local and the global graph contexts would be considered. In addition, we would still avoid running the GNN on the entire graph.

If we focus on the hierarchical clustering, we can relax our assumption of applying a simple model on the entire graph to produce the predictive posterior distribution. Instead, we could consider as strong a model as possible to achieve the best possible clustering for a given use case.

Within this work, we used the hierarchical clustering as a complementary/inspection tool for the graph coarsening. Nevertheless, it would be interesting to explore its suitability for other purposes – e.g. distributed processing as an alternative to [7]. In addition, the clustering is built on a supervised setup. An unsupervised approach based on e.g. the cosine distance of node2vec embeddings could be an interesting alternative to [13].
References

[1] T. N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, arXiv preprint arXiv:1609.02907 (2016).
[2] A. Grover, J. Leskovec, node2vec: Scalable feature learning for networks, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 855–864.
[3] H. Dai, B. Dai, L. Song, Discriminative embeddings of latent variable models for structured data, in: International Conference on Machine Learning, PMLR, 2016, pp. 2702–2711.
[4] W.-L. Chiang, X. Liu, S. Si, Y. Li, S. Bengio, C.-J. Hsieh, Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 257–266.
[5] E. Rossi, F. Frasca, B. Chamberlain, D. Eynard, M. Bronstein, F. Monti, SIGN: Scalable inception graph neural networks, arXiv preprint arXiv:2004.11198 7 (2020) 15.
[6] Z. Jia, S. Lin, R. Ying, J. You, J. Leskovec, A. Aiken, Redundancy-free computation for graph neural networks, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 997–1005.
[7] G. Karypis, V. Kumar, METIS: A software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices (1997).
[8] D. Zheng, C. Ma, M. Wang, J. Zhou, Q. Su, X. Song, Q. Gan, Z. Zhang, G. Karypis, DistDGL: Distributed graph neural network training for billion-scale graphs, in: 2020 IEEE/ACM 10th Workshop on Irregular Applications: Architectures and Algorithms (IA3), IEEE, 2020, pp. 36–44.
[9] C. Zheng, H. Chen, Y. Cheng, Z. Song, Y. Wu, C. Li, J. Cheng, H. Yang, S. Zhang, ByteGNN: Efficient graph neural network training at large scale, Proceedings of the VLDB Endowment 15 (2022) 1228–1242.
[10] H. Chen, B. Perozzi, Y. Hu, S. Skiena, HARP: Hierarchical representation learning for networks, Proceedings of the AAAI Conference on Artificial Intelligence 32 (2018).
[11] M. Dědič, L. Bajer, J. Repický, P. Procházka, M. Holeňa, Adaptive graph coarsening in the context of local graph quality, accepted to CEUR Workshop Proceedings (2022).
[12] P. F. Brown, V. J. Della Pietra, P. V. Desouza, J. C. Lai, R. L. Mercer, Class-based n-gram models of natural language, Computational Linguistics 18 (1992) 467–480.
[13] T. Bonald, B. Charpentier, A. Galland, A. Hollocou, Hierarchical graph clustering using node pair sampling, arXiv preprint arXiv:1806.01664 (2018).
[14] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, T. Eliassi-Rad, Collective classification in network data, AI Magazine 29 (2008) 93.
[15] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, Z. Su, ArnetMiner: Extraction and mining of academic social networks, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, Association for Computing Machinery, New York, NY, USA, 2008, pp. 990–998.