Scalable Graph Size Reduction for Efficient GNN Application

Pavel Procházka¹, Michal Mareš¹,² and Marek Dědič¹,³

¹ Cisco Systems, Inc., Karlovo náměstí 10, Prague, Czech Republic
² Czech Technical University in Prague, Technická 2, Prague, Czech Republic
³ Czech Technical University in Prague, Trojanova 13, Prague, Czech Republic

Abstract
Graph neural networks (GNNs) have been the dominant framework for representation learning on graphs for the past several years. The main strength of GNNs lies in the fact that they can simultaneously learn from both node-related attributes and the relations between nodes represented by edges. In tasks leading to large graphs, a GNN often requires significant computational resources to achieve its superior performance. In order to reduce the computational cost, methods allowing for a flexible balance between complexity and performance are useful. In this work, we propose a simple, scalable, task-aware graph preprocessing procedure that produces a reduced graph on which a GNN achieves a given desired performance on the downstream task. In addition, the proposed preprocessing allows the reduced graph and the GNN to fit into given memory and computational resources. The proposed preprocessing is evaluated and compared with several reference scenarios on conventional GNN benchmark datasets.

ITAT'22: Information Technologies – Applications and Theory, September 23–27, 2022, Zuberec, Slovakia
paprocha@cisco.com (P. Procházka); mimares@cisco.com (M. Mareš); madedic@cisco.com (M. Dědič)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Graph neural networks (GNNs) have proven to achieve superior performance on a number of graph datasets and are adopted in industrial applications across many fields. Superior GNN performance is, however, often paid for by a numerically intensive training procedure with a significant memory footprint. In addition, fine-tuning the parameters of a complex GNN can prove challenging as well. The issue is emphasised whenever the problem modelled by the graph evolves in time and retraining is required to keep the model accurate.

In many industrial applications, the high computational cost caused by training complex models might be an issue. Moreover, some real-world problems lead to amounts of data that do not fit the memory of a single machine. In order to apply GNNs to such big data, either some data reduction or distributed processing is required.

Traditional simple structure-agnostic baselines like the Naive Bayes classifier usually scale very well at a reasonable cost, making them useful in scenarios where such a simple model performs well enough.

In this paper, we address the problem of the complexity-performance trade-off and propose a method operating on an arbitrary performance level between a simple base model and the fine-tuned GNN. The method consists in a systematic reduction of the graph for the GNN, implying complexity reduction at the expense of performance.

As a side product, the graph reduction procedure produces a hierarchical clustering of nodes in the graph that is optimised for a given task. We explore these clusters and demonstrate how to use them to better understand the coarsening procedure.

1.1. Related work

Graph neural networks arose independently in several works. The authors of [1] derived the graph convolutional network (GCN) as a spectral generalisation of image convolution. Node2vec, a node embedding trained to minimise the distance of similar nodes within a graph, is proposed in [2]. A "bridge" between Bayesian message passing and GNNs is presented in [3], where the authors derive a GNN from probabilistic assumptions.

Several high-level concepts of GNN complexity reduction appear in the literature. To mention some of them: graph batching proposed in [4] introduces graph sampling (batching) to enable stochastic gradient descent. Next, feature propagation through the network layers is avoided in [5]; instead, the products of the feature matrix and powers of the adjacency matrix are concatenated and passed to a multi-layer perceptron without any negative impact on performance. Another approach is to identify and remove redundant operations, as proposed in [6]. One can also consider decomposing the graph as in [7] in order to enable distributed training of GNNs, since each cluster represents an independent graph component – this idea was adopted in [8]. A combination of batching with distributed processing is considered in [9]. In contrast with these approaches, the proposed complexity reduction is achieved by graph coarsening.
A systematic way of graph coarsening associated with node2vec is considered in the HARP framework [10]. The idea is to reduce the graph step by step by means of edge and star collapsing, to train a node2vec model on the coarsened graph and then to step-by-step refine the embedding back to the original graph. The motivation behind HARP is to provide an embedding incorporating both local and global graph properties. Using HARP to study the complexity-performance trade-off is proposed in [11], where the graph coarsening is optimised with respect to graph properties and the prolonging procedure is then driven by the downstream task. In contrast to that work, our coarsening is driven in a different way, nonetheless respecting the downstream task.

Graph clustering / partitioning [7] can be used for distributed processing as described above. Another use-case could be organising the data (typically in an unsupervised way) in order to better understand its structure. On top of traditional clustering, hierarchical clustering enables a tree-structured organisation of the clusters, in contrast to conventional flat clusters. In [12], the authors proposed a hierarchical clustering of words in a corpus that is based on a bi-gram language model, with the hierarchy driven by the mutual information between clusters. In [13], the authors propose a hierarchical clustering of nodes in a graph that is driven by a pair sampling ratio derived from the weights of the adjacency matrix. To the best of our knowledge, there is no available hierarchical clustering of nodes in a graph that is aware of the downstream task, as we propose in this work.

1.2. Goals and Contribution

The work reported in this paper presents the following contributions:

1. We motivate and formalise the problem of the complexity-performance trade-off and propose a general way of systematically solving and evaluating this problem. The core of this concept is GNN graph size reduction based on edge contraction.
2. We adapt the edge-contraction-based coarsening procedure to a particular task by means of a proper order of the edges for contraction. This order of edges is given by a similarity measure of the predictive posterior distributions of adjacent nodes, where a simple reference model on the given task is used to obtain the posterior distributions for all nodes.
3. We experimentally compare various similarity measures of probability distributions driving the coarsening procedure with respect to the target complexity-performance trade-off. In particular, we show that the computational graph size can be almost halved with only a small impact on the performance.
4. As a side product, we observed that graph coarsening based on edge contraction naturally produces a hierarchical clustering. This clustering combines information about the predictions of the base model with the graph structure, hence the clustering is driven by the downstream task.

1.3. Structure of the Paper

The paper is organised as follows: a formal description of the complexity-performance trade-off and a general framework for solving this problem are presented in Section 2. A coarsening procedure based on edge contraction, as a particular instance of the proposed framework, is presented in Section 3. Section 4 then discusses a critical part of the edge contraction procedure, which is the order in which the edges are contracted. The hierarchical clustering view of the considered graph coarsening procedure is the topic of Section 5. Experiments are presented in Section 6 and the work is concluded in Section 7, where we also outline future research directions.
2. Graph Complexity Reduction

We consider a collection of nodes V, edges E ⊆ V × V, real-valued node feature vectors X and categorical node labels Y constituting an undirected graph 𝐺 = (V, E, X, Y). We denote by 𝑒 = (𝜈𝐴, 𝜈𝐵) the edge connecting nodes 𝜈𝐴 and 𝜈𝐵. The neighbourhood of the node 𝜈 is denoted N(𝜈) and the tuple (⋆, 𝜈) stands for the set of all edges incident on 𝜈, that is (⋆, 𝜈) = {(𝜈̃, 𝜈) ∶ 𝜈̃ ∈ N(𝜈)}.

The downstream task is transductive node classification, where labels are available for the nodes in the training set. A model 𝑀 is assumed to provide a prediction 𝑦̂ for each node. The performance of the model 𝑃(Y, Ŷ) is then given by a statistic evaluated on the test nodes (e.g. classification accuracy), where the pair (Y, Ŷ) denotes the true and predicted node labels. We also consider a complexity of the model 𝑀 applied on the graph 𝐺, given by 𝐶(𝐺, 𝑀).

2.1. Problem Formulation

Given a (GNN) model 𝑀 and a graph 𝐺, the goal of this paper is to find a graph 𝐺⋆ = (V⋆, E⋆, X⋆, Y⋆) jointly with a refinement mapping 𝐿 ∶ Ŷ⋆ → Ŷ in order to either

1. maximise the performance 𝑃(Y, 𝐿(Ŷ⋆)) subject to the complexity restriction 𝐶(𝐺⋆, 𝑀) ≤ 𝐶𝑚 for a given complexity constraint 𝐶𝑚, or
2. minimise the complexity 𝐶(𝐺⋆, 𝑀) subject to a given minimal required performance 𝑃(Y, 𝐿(Ŷ⋆)) ≥ 𝑃𝑚.

2.2. Graph Algorithm Complexity

There are two main aspects of GNN complexity. First, the underlying graph 𝐺 needs to be stored in memory. Second, the training procedure typically runs through the graph and needs to access it in order to drive the learning process. The authors of [4] report both the memory and the computational complexity of various GCN training techniques (Table 1 therein). In principle, both complexities, or even any function of them, can be covered by the complexity term 𝐶(𝐺, 𝑀).
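To make the notation above concrete, the following minimal sketch (our illustration, not code from the paper) shows one possible in-memory representation of 𝐺 = (V, E, X, Y) together with the neighbourhood operator N(𝜈); all class and field names, as well as the toy graph itself, are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Graph:
    """Illustrative container for an undirected graph G = (V, E, X, Y)."""
    nodes: set       # V: node identifiers
    edges: set       # E: frozenset({u, v}) pairs, undirected
    features: dict   # X: node -> feature vector of length F
    labels: dict     # Y: node -> class label, training nodes only

    def neighbourhood(self, v):
        """N(v): all nodes sharing an edge with v."""
        return {u for e in self.edges if v in e for u in e if u != v}

# A five-node toy graph, loosely following the example of Figure 3.
G1 = Graph(
    nodes=set("ABCDE"),
    edges={frozenset(p) for p in [("A", "B"), ("A", "D"), ("B", "C"),
                                  ("B", "E"), ("C", "D"), ("D", "E")]},
    features={v: [0.0, 0.0] for v in "ABCDE"},
    labels={"A": "green", "B": "green", "E": "blue"},
)
print(sorted(G1.neighbourhood("B")))   # ['A', 'C', 'E']
```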
2.3. The Performance-Complexity Trade-Off

Given a size-decreasing sequence of graphs

S = [𝐺1, …, 𝐺𝑖, …, 𝐺𝑁],   (1)

where 𝐺𝑖 = (V𝑖, E𝑖, X𝑖, Y𝑖) and |V𝑖| > |V𝑗| ⟺ 𝑖 < 𝑗, we expect the performance of a model (GNN) to be roughly proportional to the graph size. The main focus of this paper is the construction of such a size-decreasing graph sequence that enables the fulfilment of the optimisation problem described in Section 2.1. Evaluating the performance of the model 𝑀 on each graph in the sequence, one can plot the dependency of the performance on the complexity (graph size). Denoting 𝑃𝑖 the performance of model 𝑀 achieved using 𝐺𝑖 with refinement into the original graph 𝐺, our goal can be achieved according to the illustration¹ in Figure 1:

• Given a maximum available complexity 𝐶𝑚, we select a graph such that the complexity given by the number of nodes in the graph is smaller than 𝐶𝑚, that is |V𝑖| ≤ 𝐶𝑚, and achieve performance 𝑃3 – blue arrows.
• Given a minimal required performance 𝑃𝑚, we select a graph such that 𝑃𝑖 ≥ 𝑃𝑚, with complexity 𝐶2 = |V2| – green arrows.

Figure 1: Achieving the target operating point by means of the complexity-performance trade-off. In case of a minimal required performance (green arrows), we find the graph 𝐺2 on which the model achieves at least performance 𝑃𝑚. In case of maximal available resources (blue arrows), we select the model achieving performance 𝑃3.

¹ The illustration captures the intuition that the performance grows with the graph complexity. As shown in Section 6, this is not always strictly true for a particular model. Nevertheless, if we focus on practical usage (Section 2.4), the non-decreasing property is not required in a real application.

2.4. Finding a Working Point Using the Performance-Complexity Trade-Off in Practice

In practice, it could be very expensive to calculate the performance for all graphs in the sequence S according to Figure 1, and hence no resources would be saved, contrary to our main motivation. In order to overcome this limitation, we can consider one or more of the following options:

• In order to find the proper working point, we do not need to have the complete curve as in Figure 1. One possible strategy could be to start with a small graph, where the complexity is low, and to continue with larger graphs until the desired performance is achieved (see the sketch after this list).
• Another option that can be applied in some cases could be to rely on a generalisation of the provided information, that is, a transfer of the proper graph size among tasks, time, models, etc.
• Finally, if we accept some restrictions according to Section 3.4, we can use an upper bound (or its estimate) as described in Section 4.1 instead of the true GNN performance.
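The first option above can be made concrete as follows. The sketch below walks the candidate graph sizes from the smallest upwards and stops as soon as either the performance target 𝑃𝑚 is met or the complexity budget 𝐶𝑚 is exceeded; `coarsen_to_size` and `train_and_evaluate` are hypothetical helpers standing in for the coarsening of Section 3 and for GNN training with refinement to the original graph, respectively.

```python
def find_working_point(graph_sizes, coarsen_to_size, train_and_evaluate,
                       p_min=None, c_max=None):
    """Walk the complexity-performance curve only as far as needed.

    graph_sizes        -- candidate node counts, ordered from smallest to largest
    coarsen_to_size    -- hypothetical: returns a coarsened graph with ~n nodes
    train_and_evaluate -- hypothetical: trains the model on the coarsened graph
                          and returns the performance refined to the original graph
    p_min              -- minimal required performance (green arrows in Figure 1)
    c_max              -- maximal allowed complexity in nodes (blue arrows)
    """
    best = None
    for n in graph_sizes:
        if c_max is not None and n > c_max:
            break                      # complexity budget exhausted
        graph = coarsen_to_size(n)
        performance = train_and_evaluate(graph)
        best = (n, performance)
        if p_min is not None and performance >= p_min:
            break                      # required performance reached
    return best
```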
3. Graph Size Reduction Strategy Based on Edge Collapsing

This section describes a way of obtaining the graph sequence from Equation (1), which is then used for the target complexity-performance trade-off evaluation. The approach is based on the edge contraction procedure described within this section. The order of edges for contraction, which drives the size-decreasing graph sequence from Equation (1), is described in greater detail in Section 4. Finally, the hierarchical clustering induced by the edge contracting procedure can be found in Section 5.

We can write the sequence S from Equation (1) as

S = [(V1, E1, X1, Y1), …, (V𝑁, E𝑁, X𝑁, Y𝑁)],   (2)

where 𝐺1 = (V1, E1, X1, Y1) stands for the original graph. The considered graph-size-reduction process is driven step by step. We first describe a single graph coarsening step, that is, how to get from 𝐺𝑖 to 𝐺𝑖+1 in Equation (1), and then define the overall graph reduction procedure.

3.1. Single Edge Contraction Step

Given an edge to contract, its incident nodes are merged (removed) and a new node is established – see Figure 2 for an illustration. More precisely, given a graph 𝐺𝑖 = (V𝑖, E𝑖, X𝑖, Y𝑖) and an edge 𝑒 = (𝜈𝑋, 𝜈𝑌), 𝑒 ∈ E𝑖, 𝜈𝑋 ∈ V𝑖, 𝜈𝑌 ∈ V𝑖 to remove, the new nodes and edges describing 𝐺𝑖+1 are given by

V𝑖+1 = V𝑖 ∖ {𝜈𝑋, 𝜈𝑌} ∪ {𝜈𝑋𝑌},   (3)

where 𝜈𝑋𝑌 is the new (merged) node, and

E𝑖+1 = E𝑖 ∖ ([𝜈𝑋, 𝜈𝑌], ⋆) ∪ {(𝜈𝑋𝑌, 𝜈) ∶ 𝜈 ∈ N𝑖(𝜈𝑋, 𝜈𝑌)},   (4)

where

([𝜈𝑋, 𝜈𝑌], ⋆) = (𝜈𝑋, ⋆) ∪ (𝜈𝑌, ⋆),
N𝑖(𝜈𝑋, 𝜈𝑌) = N𝑖(𝜈𝑋) ∪ N𝑖(𝜈𝑌) ∖ {𝜈𝑋, 𝜈𝑌},

and the neighbourhood is taken according to 𝐺𝑖.

Figure 2: Coarsening given by contracting the edge (𝜈𝑋, 𝜈𝑌), where the nodes 𝜈𝑋 and 𝜈𝑌 are replaced by a new node 𝜈𝑋𝑌 and the edges connecting 𝜈𝑋 and 𝜈𝑌 are rewired to 𝜈𝑋𝑌. As discussed in Section 5, this atomic merge can also be seen as one step in a hierarchical clustering, where the clusters 𝜈𝑋 and 𝜈𝑌 become children of the cluster 𝜈𝑋𝑌.

In the context of our setup, that is, in the situation where we want to apply a GNN on the new graph 𝐺𝑖+1 and evaluate its performance on the original one (𝐺1), the following issues arise and must be resolved:

1. Based on the nodes 𝜈𝑋 and 𝜈𝑌, we need to aggregate the training labels (if available) to the new node 𝜈𝑋𝑌 and decide whether this label should be used for training on the coarsened graph, that is, to define Y𝑖+1 based on Y𝑖. This problem is discussed in Section 3.3.
2. Apart from labels, node features must be aggregated as well (transforming X𝑖 into X𝑖+1) – see Section 3.2 for details².
3. Finally, once we have a prediction for the node 𝜈𝑋𝑌, we need to establish predictions for all the nodes from V1 that were merged into 𝜈𝑋𝑌. We introduced the refinement mapping 𝐿 for this purpose in Section 2.1. More details and a simple refinement mapping are presented in Section 3.4.

Within this paper, we only show a very simple possible solution to each of the mentioned issues. A deeper investigation of them is not within the scope of this paper and is reserved for future work.

² Handling edge features would also need to be considered if they were available. Since this is not the case in our assumptions, we leave this problem for future work.

3.2. Feature Aggregation

The GNN model assumes the feature vectors of all nodes to have the same dimension, hence some aggregation is required when merging two nodes. In particular, we have two nodes 𝜈𝑋, 𝜈𝑌 with feature vectors 𝑥𝑋 and 𝑥𝑌 of length 𝐹 that must be aggregated by some aggregation 𝑎 into one vector 𝑥𝑋𝑌 = 𝑎(𝑥𝑋, 𝑥𝑌) of length 𝐹 corresponding to the newly merged node 𝜈𝑋𝑌.

Although a number of solutions can be considered, we assume a simple weighted mean aggregation within this work, that is

𝑥𝑋𝑌 = 𝑎(𝑥𝑋, 𝑥𝑌) = (𝑤𝑋 𝑥𝑋 + 𝑤𝑌 𝑥𝑌) / (𝑤𝑋 + 𝑤𝑌),   (5)

where the weight 𝑤𝐴 is given by the number of nodes from the original graph 𝐺1 that were merged into 𝜈𝐴 in the preceding steps.

3.3. Label Aggregation

Similarly to the case of features, we also need to aggregate the labels associated with the nodes 𝜈𝑋, 𝜈𝑌 that are to be merged. We (ad hoc) consider a similar approach as in Equation (5), where we convert the labels to a prior distribution 𝑝𝑦 over labels and obtain

𝑝𝑦𝑋𝑌 = (𝜔𝑋 𝑝𝑦𝑋 + 𝜔𝑌 𝑝𝑦𝑌) / (𝜔𝑋 + 𝜔𝑌),   (6)

where 𝜔𝐴 denotes³ the number of training labels merged into 𝜈𝐴.

In multi-class classification, the prior distribution 𝑝𝑦 has the form of a probability vector with a single non-zero (unit) element placed at the position of the true label, if available. Since the weight 𝜔𝐴 is non-zero only if at least one training sample was merged into 𝜈𝐴, only probability vectors related to the training samples are reflected in Equation (6).

³ Recall that 𝜔𝐴 denotes the number of training labels merged into 𝜈𝐴, while 𝑤𝐴 from Equation (5) stands for the number of all nodes merged into 𝜈𝐴.
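The single contraction step of Equations (3)-(6) can be sketched as follows (an illustrative implementation under our own representation choices, not the authors' code): the edge (𝜈𝑋, 𝜈𝑌) is replaced by a merged node, its incident edges are rewired, and the feature vectors and label priors are combined with the weights 𝑤 and 𝜔.

```python
import numpy as np

def contract_edge(nodes, edges, feats, priors, w, omega, x, y):
    """Contract the edge (x, y) into a new merged node, following Eq. (3)-(6).

    nodes  -- set of node ids
    edges  -- set of frozenset({u, v}) undirected edges
    feats  -- node -> feature vector (array-like of length F)
    priors -- node -> label prior distribution (array-like) or None if unlabeled
    w      -- node -> number of original nodes merged into it (weights of Eq. 5)
    omega  -- node -> number of training labels merged into it (weights of Eq. 6)
    """
    xy = (x, y)                                            # id of the new node (Eq. 3)
    neigh = {u for e in edges if e & {x, y} for u in e} - {x, y}

    nodes = nodes - {x, y} | {xy}                          # Eq. (3)
    edges = ({e for e in edges if not e & {x, y}}          # drop ([x, y], *)
             | {frozenset((xy, u)) for u in neigh})        # rewire to xy (Eq. 4)

    # Eq. (5): weighted mean of the feature vectors.
    feats[xy] = (w[x] * np.asarray(feats[x]) + w[y] * np.asarray(feats[y])) / (w[x] + w[y])
    w[xy] = w[x] + w[y]

    # Eq. (6): weighted mean of the label priors; only training labels count.
    omega[xy] = omega[x] + omega[y]
    if omega[xy] > 0:
        px = np.asarray(priors[x]) if priors.get(x) is not None else 0.0
        py = np.asarray(priors[y]) if priors.get(y) is not None else 0.0
        priors[xy] = (omega[x] * px + omega[y] * py) / omega[xy]
    else:
        priors[xy] = None                                  # no training label merged in
    return nodes, edges, xy
```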
3.4. Label Refinement

Label refinement is related to the node embedding prolonging in [10]: we need to transform a prediction about a cluster into predictions about its constituent nodes. Within this work, we assume a trivial copy-paste method, where all merged nodes share the same prediction.

More precisely, running a (GNN) model on 𝐺𝑖, we receive a prediction for each node from V𝑖 that needs to be prolonged to the nodes from V1 for the target accuracy evaluation. The relation between the nodes in V𝑖 and V1 is hierarchical, so that each node 𝜈 ∈ V𝑖 can be seen as a hierarchical composition of nodes V̄ ⊆ V1 such that the node 𝜈 arose by merging the nodes from V̄ according to Equation (3) in the preceding steps. The considered simple prolonging mapping 𝐿 then takes the prediction related to 𝜈 and applies it to all nodes from V̄.

3.5. Graph Coarsening Sequence

We have already described a single edge contraction step from 𝐺𝑖 to 𝐺𝑖+1, which assumed a given edge to remove. Given an ordered list of edges to remove, we can step by step obtain the sequence from Equation (1) in order to evaluate the performance-complexity trade-off for a given graph and model. An example of the graph decomposition is illustrated in Figure 3, while the final missing piece of this procedure, that is, ordering the edges to be contracted, is described in Section 4.

Figure 3: Graph sequence determined by a given sequence of edges to contract, where the numbers on the edges stand for the edge order for contraction. We use colors to denote labels and to indicate the problem of label aggregation – their aggregation is clear until the final node in the last step.
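A possible way to realise both the coarsening sequence of Section 3.5 and the copy-paste refinement mapping 𝐿 of Section 3.4 is to maintain, for every original node, the merged node that currently represents it. The sketch below (illustrative names, simplified bookkeeping; feature and label aggregation as in Sections 3.2 and 3.3 is omitted) does exactly that.

```python
def refine_predictions(merged_into, cluster_predictions):
    """Copy-paste refinement (Section 3.4): every original node inherits the
    prediction of the merged node (cluster) it currently belongs to."""
    return {v: cluster_predictions[merged_into[v]] for v in merged_into}

def coarsening_sequence(original_nodes, ordered_edges, num_steps):
    """Track the mapping V_1 -> V_i while contracting edges in the given order
    (Section 3.5).  Returns one node-to-cluster mapping per coarsening step."""
    merged_into = {v: v for v in original_nodes}    # initially every node is its own cluster
    snapshots = []
    for u, v in ordered_edges[:num_steps]:
        cu, cv = merged_into[u], merged_into[v]
        if cu == cv:
            continue                                # the edge already lies inside one cluster
        new_cluster = (cu, cv)                      # id of the merged node
        for node, c in merged_into.items():
            if c in (cu, cv):
                merged_into[node] = new_cluster
        snapshots.append(dict(merged_into))         # mapping for the graph G_{i+1}
    return snapshots
```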
4. Ordered Sequence of Edges for Removal

This section focuses on the last missing piece of the complexity-performance trade-off evaluation, which is obtaining the sequence of edges driving the edge contraction procedure. First, a simple performance upper bound for a graph reduced according to Section 3 is presented. The suggested edge sorting algorithm then aims to provide an ordering of edges such that this upper bound is maximised at each step when the edge contraction algorithm is applied in this order.

4.1. Label Refinement Induced Performance Upper Bound

Considering all nodes in the testing set, in Figure 3 we can see that only nodes of the same color (label) are merged together in the first three steps. In that case, 100% accuracy can theoretically be achieved if the merged nodes are predicted correctly. However, the final edge contraction merges together nodes with different colors (labels), and the performance from this step forward cannot be greater than 60%, which is achieved if the node 𝜈𝐴𝐵𝐶𝐷𝐸 is predicted to be a blue one. If it is predicted as a green one, only 40% accuracy is achieved. In this case, 60% presents a performance upper bound, since no model can achieve better performance on the final single-node graph. Note that the upper bound is 100% for the first three steps.

More formally, according to Section 3.4, we assume all the merged nodes⁴ V̄ ⊆ V1 to share the prediction of the corresponding node 𝜈 ∈ V𝑖. Obviously, if the true test labels differ within V̄, some prediction errors are unavoidable. Given the node 𝜈 ∈ V𝑖, we calculate a histogram of the test labels from V̄. Denoting 𝑦0 the most common test label within V̄, the minimal number of errors on the test nodes from V̄ is given by the number of test labels that differ from 𝑦0. Denoting this number of errors as 𝑈(𝜈) for each node 𝜈 ∈ V𝑖, the performance upper bound can be written as

1 − (1/|𝑇|) ∑𝜈∈V𝑖 𝑈(𝜈),   (7)

where |𝑇| denotes the number of nodes from V1 in the test set. We also note that considering this upper bound as the performance in the performance-complexity trade-off (Section 2.1), we obtain a non-increasing sequence of performances for the sequence S in Equation (1).

⁴ Recall that V̄ is the set of nodes from the original graph that are merged into the node 𝜈, for a given 𝜈 ∈ V𝑖.

4.2. Edge Ordering Maximizing the Performance Upper Bound

We can keep the upper bound in Equation (7) equal to 1 while merging only nodes with the same label. Once we start to merge nodes with varying labels, the upper bound starts to decrease. Keeping the bound high would be a trivial operation if all labels were available. Nevertheless, since only the training labels are available, we need to somehow estimate the labels of the nodes for which the ground truth is not available.

Instead of comparing true labels, we propose to compare the predictive posterior distributions related to the downstream task, where the prediction is provided by an auxiliary (base) model. We then propose to calculate a similarity measure on the pair of posterior distributions for every edge (i.e. the corresponding node pair) and to sort the edges according to this similarity to establish the order of edges to be contracted.

4.3. Base Model

The base model serves to evaluate a posterior probability for each node in the graph, determining the pairwise node similarity used for sorting the edges. Since the model is assumed to process the entire graph, it is expected to be a simple one.

4.4. Similarity Measures

Within this paper, we experimentally evaluate several similarity measures of the probability distributions. In particular, we consider cross-entropy and the KL divergence as similarity measures of the posterior probability distributions. The asymmetry of these measures does not have any effect, since we assume an undirected graph: the similarity is calculated for both directions and the smaller value is used.
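The two ingredients of this section can be sketched as follows (our illustration; the function names and the numerical safeguards are assumptions). The bound of Equation (7) counts, inside every cluster, the test labels that disagree with the cluster majority, while the contraction order sorts the edges by a symmetrised KL divergence or cross-entropy of the base-model posteriors of their endpoints; edges with the most similar endpoints are contracted first.

```python
import numpy as np
from collections import Counter

def accuracy_upper_bound(clusters, test_labels):
    """Eq. (7): within each cluster the best any model can do is predict the
    majority test label, so the residual disagreements U(v) bound the accuracy.

    clusters    -- iterable of sets of original nodes (one set per node of V_i)
    test_labels -- original node -> true label, for test nodes only
    """
    n_test = len(test_labels)
    errors = 0
    for cluster in clusters:
        labels = [test_labels[v] for v in cluster if v in test_labels]
        if labels:
            errors += len(labels) - Counter(labels).most_common(1)[0][1]  # U(v)
    return 1.0 - errors / n_test

def edge_contraction_order(edges, posterior, measure="kl"):
    """Sort edges by the similarity of the base-model posteriors of their
    endpoints (Sections 4.2-4.4).  Both directions are evaluated and the
    smaller value is used, removing the asymmetry of the measures."""
    eps = 1e-12                                   # numerical safeguard (our choice)
    def cross_entropy(p, q):
        return -np.sum(np.asarray(p) * np.log(np.asarray(q) + eps))
    def kl(p, q):
        p, q = np.asarray(p), np.asarray(q)
        return np.sum(p * np.log((p + eps) / (q + eps)))
    d = kl if measure == "kl" else cross_entropy
    def score(edge):
        u, v = tuple(edge)
        return min(d(posterior[u], posterior[v]), d(posterior[v], posterior[u]))
    return sorted(edges, key=score)               # most similar endpoints first
```

Feeding the returned ordering into the sequence construction of Section 3.5 yields the coarsening sequence whose upper bound (7) decays as slowly as the base model allows.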
5. Hierarchical Clustering of Nodes Induced by Edge Contraction

As demonstrated in Figure 2, the edge-contraction-based coarsening leads to a hierarchical clustering represented by a tree, where the nodes of the original graph 𝜈 ∈ V1 form the leaves of the tree and each edge contraction defines a relation between a parent and its children according to Equation (3). As an example, the tree for the procedure from Figure 3 is shown in Figure 4.

Figure 4: Hierarchical clustering tree corresponding to the edge contraction in Figure 3.

Although this clustering is only a side-effect of the edge contraction procedure, it brings⁵ a useful insight into the quality of the edge ordering. The label refinement (Section 3.4) basically operates within each hierarchical cluster, since the (GNN) model is assumed to make a prediction about this cluster that then needs to be refined into a prediction for the nodes in the original graph 𝐺.

⁵ We discuss other potential applications in Section 7.

5.1. Multi-Step Edge Contraction

The edge contraction described in Section 3.1 can be easily extended by considering multiple edges to be contracted simultaneously. In that case, the sub-graph defined by the collection of edges to contract is considered, and each component of this sub-graph then forms a new merged node.

The motivation for this multi-step consideration is to decrease the depth of the hierarchical tree on the one hand, or to reduce the number of graphs in the sequence (1) on the other. The depth of the tree representing the hierarchical clustering can otherwise be as high as the number of nodes in the graph. The multi-edge contraction procedure and its relationship with the single edge contraction step described in Section 3.1 is illustrated on a particular example in Figure 5.

Figure 5: Illustration of a multi-edge coarsening provided in one step and its relation to the one-by-one edge coarsening, where we use the hierarchical clustering as described in Section 5. A graph of five nodes given by the edges E = [(𝜈𝐷, 𝜈𝐸), (𝜈𝐶, 𝜈𝐷), (𝜈𝐵, 𝜈𝐶), (𝜈𝐴, 𝜈𝐵)] is coarsened in four steps (left). If these steps are performed simultaneously, a flat tree is obtained (right).
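The tree of Figure 4 and the multi-step variant of Section 5.1 can be built with a few parent pointers, as in the following sketch (illustrative names; for simplicity each batch of simultaneously contracted edges is assumed to form a single connected component, whereas Section 5.1 would create one merged node per component).

```python
def build_hierarchy(leaves, contraction_steps):
    """Build the clustering tree induced by (multi-)edge contraction.

    leaves            -- original nodes of G_1 (the leaves of the tree)
    contraction_steps -- list of steps; each step is a list of edges (u, v)
                         contracted simultaneously (Section 5.1)
    Returns parent pointers: child cluster -> parent cluster.
    """
    current = {v: v for v in leaves}          # original node -> its current cluster
    parent = {}
    for step, edge_batch in enumerate(contraction_steps):
        touched = {current[u] for e in edge_batch for u in e}
        if len(touched) < 2:
            continue                          # nothing new is merged in this step
        new_cluster = ("step", step)          # the parent created by this step
        for child in touched:
            parent[child] = new_cluster
        for node, c in current.items():
            if c in touched:
                current[node] = new_cluster
    return parent

# Contracting all four edges of the Figure 5 example in one step yields a flat
# tree: every leaf becomes a direct child of the single root ("step", 0).
print(build_hierarchy("ABCDE", [[("D", "E"), ("C", "D"), ("B", "C"), ("A", "B")]]))
```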
6. Experiments

This section provides experimental validation of the proposed complexity-performance trade-off on traditional real-world graph datasets.

6.1. Experiments Objectives

Since our claims are not as straightforward as stating that one method overcomes another, we first state what we try to achieve within the experiments.

First of all, we would like to find a sweet spot of the complexity-performance trade-off for a given task, enabling us to outperform the baseline with reduced complexity. We found this point in all cases shown in Figure 6a.

The next objective is to compare the performance upper bound from Section 4.1 with the true performance – Figure 6a. Since the performance upper bound can be calculated cheaply compared to the true performance, it can be considered as an approximation of the true performance in the complexity-performance trade-off. This would save the significant computational resources needed for the true performance evaluation of all graphs in the sequence S, but drawing conclusions from the upper bound needs to be validated.

We further investigate the impact of the various choices driving the edge order for removal described in Section 4 on the complexity-performance trade-off, and we also attempt to validate using the posterior distribution for edge sorting (Section 4.2) instead of the prior (labels) used in the upper bound (Figure 6b). Finally, we demonstrate the node clustering as a useful visualisation of a particular graph coarsening.

6.2. Experiment Setup

We consider a logistic regression model as the base model in all experiments. Based on the posterior distribution obtained from the logistic regression, we calculate the ordering of edges for removal using the KL-divergence and cross-entropy similarity measures on the predictive node distributions. We also consider a random ordering as a reference.

We first calculate the graph sequence according to Section 3.5. This sequence inherently provides the performance upper bound (Section 4.1). A GCN [1] is then trained on each graph in the sequence, where the labels and features are aggregated according to Sections 3.2 and 3.3, and the performance (accuracy) is obtained on the test set refined to the original graph (Section 3.4). We use the cross-entropy loss function within training, where the cross-entropy is calculated against the prior distribution from Equation (6). A node is considered a training node if it contains at least one aggregated label from 𝐺1.

We consider the complexity to be given by the number of nodes in the graph within our experiments. This simplified complexity measure takes only the graph into account; it is not suitable for a comparison of multiple models, since their complexity is not reflected.

We evaluated the performance-complexity trade-off for the Cora, PubMed [14] and DBLP [15] datasets, where we sub-sampled the training labels – see the results in Figure 6a. In order to validate the use of the posterior instead of the prior distribution, we used the distribution given by the base model if the label was not available and, if it was, the distribution given by 𝑟𝑝𝑦 + (1 − 𝑟)𝑝𝑦̂, where 𝑝𝑦 denotes the prior distribution of the data determined by the training labels, 𝑝𝑦̂ the posterior distribution (the base model output) and 𝑟 is a coefficient enabling mixing between the two distributions. The results are shown in Figure 6b.

The complexity-performance trade-off curves in Figures 6a and 6b are equipped with 90% symmetric confidence intervals, where the interval is given by the corresponding quantiles of the Beta distribution constructed as the conjugate prior to the binomial distribution given by 𝑥 errors out of 𝑁𝑡 testing examples (so that 1 − 𝑥/𝑁𝑡 stands for the accuracy).
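Two computational details of this setup can be written out explicitly, as in the sketch below; the use of scipy and the uniform Beta(1, 1) prior are our assumptions rather than choices stated in the paper.

```python
import numpy as np
from scipy.stats import beta

def sorting_distribution(prior, posterior, r):
    """Distribution used for edge sorting in Figure 6b: r*p_y + (1-r)*p_yhat when
    a training label is available, otherwise the base-model posterior alone."""
    if prior is None:
        return np.asarray(posterior)
    return r * np.asarray(prior) + (1.0 - r) * np.asarray(posterior)

def accuracy_interval(n_correct, n_test, level=0.90):
    """90% symmetric interval for the accuracy from Beta quantiles, treating the
    Beta distribution as the conjugate of the binomial (uniform Beta(1, 1) prior
    assumed here)."""
    a = n_correct + 1                 # successes + 1
    b = (n_test - n_correct) + 1      # failures + 1
    lo, hi = beta.ppf([(1 - level) / 2, (1 + level) / 2], a, b)
    return lo, hi

print(accuracy_interval(n_correct=870, n_test=1000))   # interval around 0.87
```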
Figure 6: Evaluation of the performance-complexity trade-off on the DBLP, Cora and Pubmed datasets (performance measured as test accuracy, complexity as the number of nodes in the graph). (a) Performance comparison of the various distributions used for sorting the edges for contraction (cross-entropy, KL divergence, random ordering, the base model, and the corresponding upper bounds), plotted for different fractions of training labels (20%, 10% and 2%). (b) Performance comparison of the distributions used for sorting the edges for contraction when mixing the prior and the posterior with coefficients 𝑟 ∈ {0, 0.4, 0.8, 1}.

6.3. Performance Upper-Bound versus GNN Performance

The comparison of the upper bound and the true performance can be found in Figures 6a and 6b. The upper bound does not suffer from the fluctuations of the true performance. In cases where the base model is strong enough, the working point selection according to the upper bound could therefore bring reasonable results.

6.4. Impact of Edge Ordering on the Performance-Complexity Trade-Off

The impact of the chosen distribution similarity measure on the performance is shown in Figure 6a. Surprisingly, even the random baseline gives good results in some cases. While no metric is the best choice in case of the Pubmed and DBLP datasets, the KL divergence seems to be a superior choice in case of the Cora dataset. In case of 2% training labels, the base model is so weak that the qualitative difference between the edge sorting strategies disappears for all datasets. Interestingly, the observations made on the true performance can also be deduced from the upper bounds.

As can be seen in Figure 6b, adding the prior information makes the results consistently worse. This is probably caused by the fact that neighbouring nodes with the same label in the training set are merged first, and the resulting lack of training data then causes the observed performance drop. In contrast with the evaluation of the similarity measures, the upper bound provides an inverse observation here.
6.5. Hierarchical Clustering

The hierarchical clustering described in Section 5 presents a neat way of manually validating the graph coarsening procedure. We evaluated the base model on the Karate Club dataset and used the proposed coarsening procedure according to Section 5.1, where two edges were contracted in each step. The resulting tree is shown in Figure 7.

Figure 7: The hierarchical tree corresponding to the edge contracting procedure driven by the base model on the Karate Club dataset. The true labels of the leaf nodes are color-coded and the training labels are denoted by a double circle. The number in the circle denotes the step in the edge contraction order – in each step, two edges are simultaneously contracted according to Section 5.1.

Observing the tree, we can recognise splits (or coarsening steps) resulting from merging nodes with the same label, which is the desired behaviour, because the graph size is reduced without any performance loss in such a case. However, merging differently labelled nodes unavoidably leads to a performance loss, because our label refinement procedure guarantees the same prediction for each node within the cluster (see Section 4.1).

In case of the Karate Club tree (Figure 7), the first four steps of the procedure merge only nodes with a matching label. However, from the fifth step onward, at least two errors occur and affect the total accuracy (one error in case of the test accuracy).

7. Conclusion and Discussion

We formalised the performance-complexity trade-off and described a framework for smoothly adjusting either the desired model complexity for given performance requirements or the performance for the available computational resources. The framework consists of a methodical step-by-step graph coarsening by way of edge contraction, where we addressed the practical aspects of feature and label aggregation within node merging, as well as the refinement of the prediction from a merged node back to the original nodes. We have also shown that this edge coarsening naturally produces a hierarchical clustering of the nodes in the graph.

The ordering of edges for the contracting procedure was then studied in greater detail, where we suggested employing a base model to produce a predictive posterior distribution for each node and then sorting the edges according to the distance between the probability distributions of adjacent nodes.

We applied the proposed framework to widely used, publicly available graph datasets. In particular, we explored the coarsening procedure on the Karate Club dataset by means of a hierarchical tree. Finally, we compared the KL-divergence and cross-entropy similarity measures for driving the edge ordering.

Although our primary focus was the complexity-performance trade-off, we believe that this work opens several research directions for the future.

The considered label refinement is extremely simple; however, we can alternatively consider an arbitrary model within each cluster. This will incur some additional cost, but distributed processing may be utilised since the clusters do not overlap. Moreover, the information from this "local" model can be combined with the global one, which could lead to superior performance since both the local and the global graph contexts would be considered. In addition, we would still avoid running the GNN on the entire graph.

If we focus on the hierarchical clustering, we can relax our assumption of applying a simple model on the entire graph to produce the predictive posterior distribution. Instead, we could consider as strong a model as possible to achieve the best possible clustering for a given use case.

Within this work, we used the hierarchical clustering as a complementary/inspection tool for the graph coarsening. Nevertheless, it would be interesting to explore its suitability for other purposes – e.g. distributed processing as an alternative to [7]. In addition, the clustering is built on a supervised setup. An unsupervised approach based on e.g. the cosine distance of node2vec embeddings could be an interesting alternative to [13].
References

[1] T. N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, arXiv preprint arXiv:1609.02907 (2016).
[2] A. Grover, J. Leskovec, node2vec: Scalable feature learning for networks, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 855–864.
[3] H. Dai, B. Dai, L. Song, Discriminative embeddings of latent variable models for structured data, in: International Conference on Machine Learning, PMLR, 2016, pp. 2702–2711.
[4] W.-L. Chiang, X. Liu, S. Si, Y. Li, S. Bengio, C.-J. Hsieh, Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 257–266.
[5] E. Rossi, F. Frasca, B. Chamberlain, D. Eynard, M. Bronstein, F. Monti, SIGN: Scalable inception graph neural networks, arXiv preprint arXiv:2004.11198 7 (2020) 15.
[6] Z. Jia, S. Lin, R. Ying, J. You, J. Leskovec, A. Aiken, Redundancy-free computation for graph neural networks, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 997–1005.
[7] G. Karypis, V. Kumar, METIS: A software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices (1997).
[8] D. Zheng, C. Ma, M. Wang, J. Zhou, Q. Su, X. Song, Q. Gan, Z. Zhang, G. Karypis, DistDGL: Distributed graph neural network training for billion-scale graphs, in: 2020 IEEE/ACM 10th Workshop on Irregular Applications: Architectures and Algorithms (IA3), IEEE, 2020, pp. 36–44.
[9] C. Zheng, H. Chen, Y. Cheng, Z. Song, Y. Wu, C. Li, J. Cheng, H. Yang, S. Zhang, ByteGNN: Efficient graph neural network training at large scale, Proceedings of the VLDB Endowment 15 (2022) 1228–1242.
[10] H. Chen, B. Perozzi, Y. Hu, S. Skiena, HARP: Hierarchical representation learning for networks, Proceedings of the AAAI Conference on Artificial Intelligence 32 (2018).
[11] M. Dědič, L. Bajer, J. Repický, P. Procházka, M. Holeňa, Adaptive graph coarsening in the context of local graph quality, accepted to CEUR Workshop Proceedings (2022).
[12] P. F. Brown, V. J. Della Pietra, P. V. Desouza, J. C. Lai, R. L. Mercer, Class-based n-gram models of natural language, Computational Linguistics 18 (1992) 467–480.
[13] T. Bonald, B. Charpentier, A. Galland, A. Hollocou, Hierarchical graph clustering using node pair sampling, arXiv preprint arXiv:1806.01664 (2018).
[14] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, T. Eliassi-Rad, Collective classification in network data, AI Magazine 29 (2008) 93.
[15] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, Z. Su, ArnetMiner: Extraction and mining of academic social networks, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, Association for Computing Machinery, New York, NY, USA, 2008, pp. 990–998.