<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Graph Size Reduction for Efficient GNN Application</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pavel Procházka</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michal Mareš</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marek Dědič</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cisco Systems, Inc.</institution>
          ,
          <addr-line>Karlovo náměstí 10, Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Czech Technical University in Prague</institution>
          ,
          <addr-line>Technická 2, Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Czech Technical University in Prague</institution>
          ,
          <addr-line>Trojanova 13, Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <fpage>3</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>Graph neural networks (GNNs) have been the dominant framework for representation learning on graphs for the past several years. The main strength of GNNs lies in the fact that they can simultaneously learn from both node-related attributes and the relations between nodes, represented by edges. In tasks leading to large graphs, a GNN often requires significant computational resources to achieve its superior performance. In order to reduce the computational cost, methods allowing for a flexible balance between complexity and performance could be useful. In this work, we propose a simple, scalable, task-aware graph preprocessing procedure allowing us to obtain a reduced graph such that a GNN achieves a given desired performance on the downstream task. In addition, the proposed preprocessing allows fitting the reduced graph and the GNN into given memory and computational resources. The proposed preprocessing is evaluated and compared with several reference scenarios on conventional GNN benchmark datasets.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>Graph neural networks (GNNs) have proven to achieve</title>
        <p>superior performance on a number of graph datasets and
are adopted in industrial applications across many fields.
Superior GNN performance is, however, often paid for
by a numerically intensive training procedure with a
significant memory footprint. In addition, fine tuning
parameters of a complex GNN can prove challenging as
well. The issue is emphasised whenever the problem
modeled by the graph is evolving in time and retraining
is required to keep the model accurate.</p>
      </sec>
      <sec id="sec-1-2">
        <title>In many industrial applications, high computational</title>
        <p>cost caused by training complex models might be an issue.</p>
      </sec>
      <sec id="sec-1-3">
        <title>Moreover, some real-world problems lead to high amount</title>
        <p>of data that does not fit the memory of a single machine.</p>
      </sec>
      <sec id="sec-1-4">
        <title>In order to apply GNNs to such big data, either some</title>
        <p>data reduction or their processing in a distributive way
is required.</p>
        <p>Traditional simple structure-agnostic baselines like
the Naive Bayes classifier usually scale very well at a
reasonable cost, making them useful in scenarios where
such a simple model performs well enough.
performance trade-of and propose a method operating
on an arbitrary performance level between a simple base
model and the fine-tuned GNN. The method consists in
a systematic reduction of the graph for GNN, implying
complexity reduction at the expense of performance.
ITAT’22: Information technologies – Applications and Theory,
Septem</p>
      </sec>
      <sec id="sec-1-5">
        <title>Graph batching proposed in [4] introduces graph sam</title>
        <p>pling (batching) to enable stochastic gradient descent.
Next, feature propagation through the network layers is
avoided in [5]. Instead, the product of the feature matrix
and powers of the adjacency matrix are concatenated
negative impact on performance. Another approach is to
identify and remove redundant operations as proposed in
[6]. One can also consider decomposing the graph as in
[7] in order to enable distributive training of GNNs, since
each cluster represents an independent graph component
– this idea was adopted in [8]. Connection of batching
with distributive processing is considered in [9]. In
contrast with these approaches, the proposed complexity
reduction is achieved by graph coarsening.</p>
      </sec>
      <sec id="sec-1-6">
        <title>A systematic way of graph coarsening associated with</title>
        <p>In this paper, we address the problem of complexity- and passed to the multi-layer perceptron without any
node2vec is considered in the HARP framework [10].</p>
      </sec>
      <sec id="sec-1-7">
        <title>The idea is to reduce the graph step-by-step by means of</title>
        <p>edge and star collapsing, training a node2vec model on
the coarsened graph and then to step-by-step refine the
embedding to the original graph. The motivation behind
HARP is to provide an embedding incorporating both
local and global graph properties. Using HARP on
studying the complexity-performance trade-of is proposed
in [11], where the graph coarsening is optimised with
respect to graph properties. The prolonging procedure
is then driven by the downstream task. In contrast to
this work, our coarsening is driven in a diferent way,
nonetheless respecting the downstream task.</p>
      </sec>
      <sec id="sec-1-8">
        <title>Graph clustering / partitioning [7] can be used for</title>
        <p>distributive processing as described above. Another
usecase could be organising the data (typically in an
unsupervised way) in order to better understand its structure.
On top of traditional clustering, hierarchical clustering
enables a tree-structured organisation of the clusters, in
contrast to conventional flat clusters. In [ 12], the authors
proposed hierarchical clustering of words in a corpus that
is based on a bi-gram language model and the hierarchy
is driven by mutual information between clusters. In [13],
the authors propose a hierarchical clustering of nodes in
a graph that is driven by a pair sampling ratio derived
from weights of the adjacency matrix. To the best of our
knowledge, there is no available hierarchical clustering
of nodes in a graph that is aware of the downstream task
as we propose in this work.</p>
        <sec id="sec-1-8-1">
          <title>1.3. Structure of the Paper</title>
          <p>The paper is organised as follows: A formal description of the complexity-performance trade-off and a general framework for solving this problem are presented in Section 2. A coarsening procedure based on edge contraction, as a particular instance of the proposed framework, is presented in Section 3. Section 4 then discusses a critical part of the edge contraction procedure, which is the order in which the edges are contracted. A hierarchical clustering view of the considered graph coarsening procedure is the topic of Section 5. Experiments are presented in Section 6 and the work is concluded in Section 7, where we also outline future research directions.</p>
        </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Graph Complexity Reduction</title>
      <sec id="sec-2-1">
        <title>1.2. Goals and Contribution</title>
        <sec id="sec-2-1-1">
          <title>The work reported in this paper presents the following contributions:</title>
          <p>1. maximize performance  ( Y, ( Ŷ⋆)) subject to
complexity restriction ( ⋆,  ) ≤   for a given
complexity constrain   or
2. minimize complexity ( ⋆,  ) subject given
minimal required performance  ( Y, ( Ŷ⋆)) ≥   .</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Graph Algorithm Complexity</title>
        <sec id="sec-2-2-1">
          <title>There are two main aspects of GNN complexity. First,</title>
          <p>the underlying graph  needs to be stored in memory.
Second, the training procedure typically runs through
the graph and needs to access it in order to drive the
learning process. The authors report both memory and
computational complexity of various training techniques
of GCN in [4] – Table 1. In principle, both
complexities or even any function of them can be covered by the
complexity term (,  )</p>
          <p>.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. The Performance Complexity</title>
      </sec>
      <sec id="sec-2-4">
        <title>Trade-Of</title>
        <p>
          Given a size-decreasing sequence of graphs
 = [ 1, … ,   , … ,   ],
(
          <xref ref-type="bibr" rid="ref1">1</xref>
          )
where   = (V , E , X , Y ) and |V | &gt; |V |
⟺  &lt;  , we
expect a model (GNN) performance to be proportional
to the graph size. The main focus of this paper is on
the construction of such size-decreasing graph sequence
that would enable the fulfillment of our ultimate
optimisation problem described in Section 2.1. Evaluating
the performance of the model 
on each graph in the
sequence, one can plot dependency of performance on
the complexity (graph size). Denoting   performance
of model
        </p>
        <p>achieved using   with refinement into the
original graph  , our goal can be achieved according to
illustration1 in Figure 1:</p>
        <p>– blue arrows.
• Given maximum available complexity   , we
select such a graph that the complexity given by
the number of nodes in the graph is smaller than
  that is |V | ≤   and achieve performance  3
• Given minimal required performance complexity
complexity  2 = |V2| – green arrows.</p>
        <p>, we select such a graph that |P | ≥   with</p>
      </sec>
      <sec id="sec-2-5">
        <title>2.4. Finding Working Point Using</title>
      </sec>
      <sec id="sec-2-6">
        <title>Performance Complexity Trade-Of in Practice</title>
        <sec id="sec-2-6-1">
          <title>In practice, it could be very expensive to calculate the</title>
          <p>performance for all graphs in the sequence  according to
1The illustration captures an intuition that the performance would
our main motivation. In order to overcome this limitation, contraction driving the size-decreasing graph sequence
performance (green arrows), we find the graph  2, where the
model achieves at least performance   . In case of maximal
available resources (blue arrows), we select the model
achieving performance  3.</p>
      <p>• In order to find the proper working point, we do not need to have the complete curve as in Figure 1. One possible strategy could be to start with a small graph, where the complexity would be low, and to continue until the desired performance is achieved (see the sketch after this list).</p>
      <p>• Another option that can be applied in some cases could be to rely on a generalization of the provided information, that is, a transfer of the proper graph size among tasks, time, models, etc.</p>
      <p>• Finally, if we provide some restrictions according to Section 3.4, we can use an upper-bound (or its estimation) as described in Section 4.1 instead of the true GNN performance.</p>
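      <p>For illustration, the following minimal Python sketch implements the first strategy under the assumption that a coarsening sequence and an evaluation routine are available; the helper names (graphs_small_to_large, train_and_evaluate) are ours and only stand in for the procedures described in Sections 3 and 6.</p>
      <preformat>
# Minimal sketch of the first strategy: walk the size-decreasing sequence
# from the smallest graph towards the original one and stop as soon as the
# desired performance p_min is reached. `graphs_small_to_large` (the reversed
# sequence S) and `train_and_evaluate` (trains a GNN on the coarsened graph
# and returns the accuracy refined to the original graph) are placeholders.

def find_working_point(graphs_small_to_large, train_and_evaluate, p_min):
    """Return the first (smallest) graph whose refined accuracy reaches p_min."""
    for graph in graphs_small_to_large:
        accuracy = train_and_evaluate(graph)
        if accuracy >= p_min:
            return graph, accuracy
    return None, None  # even the original graph does not reach p_min
      </preformat>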
    </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Graph Size Reduction Strategy</title>
    </sec>
    <sec id="sec-4">
      <title>Based on Edge Collapsing</title>
      <p>This section describes a way of obtaining the graph sequence from Equation (1), which is then used for the target complexity-performance trade-off evaluation. This approach is based on the edge contraction procedure that is described within this section. The order of edges for contraction driving the size-decreasing graph sequence from Equation (1) is described in Section 4 in greater detail. Finally, a hierarchical clustering induced by the edge contraction procedure can be found in Section 5.</p>
      <p>We can write the sequence S from Equation (1) as</p>
      <p>S = [(V_1, E_1, X_1, Y_1), … , (V_n, E_n, X_n, Y_n)], (2)</p>
      <p>where G_1 = (V_1, E_1, X_1, Y_1) stands for the original graph.</p>
      <sec id="sec-4-1">
        <title>The considered graph-size-reduction process is driven</title>
        <p>
          step by step. We first describe a single graph coarsening
step, that is, how to get from   to  +1 , in Equation (
          <xref ref-type="bibr" rid="ref1">1</xref>
          )
and then define an overall graph reduction procedure.
        </p>
        <sec id="sec-4-1-1">
          <title>3.1. Single Edge Contraction Step</title>
          <p>are given by</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>Given an edge to contract, its incident nodes are merged</title>
        <p>(removed) and a new node is established – see Figure 2
for an illustration. More precisely, given a graph   =
(V , E , X , Y ) and an edge  = (  ,   ),  ∈
E ,   ∈ V ,   ∈
V to remove, the new nodes and edges describing  +1</p>
        <p>V+1 = V ⧵ {  ,   } ∪ {   },
where    is a new (merged) node and</p>
        <p>E+1 = E ⧵ ([  ,   ], ⋆) ∪ {(   , )} ∈ N (  ,  ),
where</p>
        <p>([  ,   ], ⋆) = (  , ⋆) ∪ (  , ⋆)</p>
        <p>N (  ,   ) = N (  ) ∪ N (  ) ⧵ {  ,   },
where the neighborhood is taken according to   .</p>
      </sec>
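        <p>As an illustration of the contraction step in Equations (3) and (4), the following minimal Python sketch operates on an undirected graph stored as an adjacency dictionary; the representation and the function name are our own choices and not part of the paper.</p>
        <preformat>
# Minimal sketch of a single edge contraction step (Equations (3) and (4))
# on an undirected graph stored as {node: set of neighbours}. The merged node
# is represented by a fresh tuple so that repeated contractions keep working;
# feature and label aggregation are handled separately (Sections 3.2 and 3.3).

def contract_edge(adj, u, v):
    """Merge the incident nodes of the edge (u, v) into a new node w."""
    w = (u, v)                               # new (merged) node
    neighbours = (adj[u] | adj[v]) - {u, v}  # N(u, v) = N(u) + N(v) without u, v
    adj[w] = set(neighbours)
    for n in neighbours:                     # rewire edges incident to u or v
        adj[n] -= {u, v}
        adj[n].add(w)
    del adj[u], adj[v]
    return w

# toy usage
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
w = contract_edge(adj, 1, 2)
print(w, adj)  # the merged node (1, 2) is now adjacent to node 3 only
        </preformat>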
      <sec id="sec-4-3">
        <title>In the context of our setup, that is, in the situation</title>
        <p>where we want to apply a GNN on the new graph  +1
and evaluate its performance on the original one ( 1),
the following issues arise and must be resolved:</p>
      </sec>
      <sec id="sec-4-4">
        <title>Section 3.3.</title>
        <p>gate training labels (if available) to the new node</p>
        <p>and decide if this label should be used for
training on the coarsened graph, that is to define
Y+1 based on Y . This problem is discussed in</p>
      </sec>
      <sec id="sec-4-5">
        <title>2. Apart from labels, node features must be aggre</title>
        <p>Section 3.2 for details2.</p>
        <p>gated as well (transforming X into X+1 ) – see
3. Finally, once we have a prediction for the node
   , we need to establish predictions for all the
nodes from V1 that were merged into    . We
introduced the refinement mapping  for this
purpose in Section 2.1. More details and a simple
refinement mapping is presented Section
3.4.</p>
        <p>Within this paper, we only show a very simple possible solution to address each of the mentioned issues. A deeper investigation of each of them is not within the scope of this paper and is reserved for future work. Handling edge features would also need to be considered if they were available; since this is not the case in our assumptions, we leave this problem for future work.</p>
        <sec id="sec-4-5-1">
          <title>3.2. Feature Aggregation</title>
        </sec>
      </sec>
      <sec id="sec-4-6">
        <title>The GNN model assumes all feature vectors to have the</title>
        <p>same dimension for each node, hence some aggregation
is required when merging two nodes. In particular, we
have two nodes   ,   with feature vectors   and   of
length  that must be aggregated by some aggregation 
to the newly merged node    .</p>
        <p>Although a number of solutions can be considered, we
assume a simple weighted mean aggregation within this
work, that is
  
= (  ,   ) =
    +    
  +  
preceding steps.
where the weight   is given by the number of nodes
from the original graph  1 that were merged to   in the</p>
        <sec id="sec-4-6-1">
          <title>3.3. Label Aggregation</title>
          <p>
            Similarly as in the case of features, we also need to
aggregate labels associated with the nodes   ,   that are
supposed to be merged. We (ad-hoc) consider a similar
approach as in Equation (
            <xref ref-type="bibr" rid="ref5">5</xref>
            ), where we convert the labels
to a prior distribution   on labels and we obtain
   
=
     +     
  +  
(
            <xref ref-type="bibr" rid="ref5">5</xref>
            )
(
            <xref ref-type="bibr" rid="ref6">6</xref>
            )
1. Based on the nodes   and   , we need to aggre- into one vector   
= (  ,   ) of length  corresponding
edge order for contraction. We use colors to denote labels and to indicate the problem of label aggregation – their aggregation
is clear until the final node in the last step.
where   denotes3 the number of training labels merged
into   .
          </p>
          <p>In multi-class classification, the prior distribution  
is in the form of a probability vector with a single
nonzero (unit) element placed on the position of true label,
if available. Since the weight   is non-zero only for at
least one training sample, only probability vectors related
to the training samples are reflected in Equation 6.</p>
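        <p>The following minimal Python sketch illustrates the weighted-mean aggregations of Equations (5) and (6); the function names and the handling of nodes without any training label are our own assumptions.</p>
        <preformat>
import numpy as np

# Minimal sketch of the weighted-mean aggregation from Equations (5) and (6).
# n_u, n_v are the numbers of original nodes already merged into u and v;
# m_u, m_v are the numbers of training labels merged into them; y_u, y_v are
# one-hot (or previously aggregated) label distributions.

def aggregate_features(x_u, x_v, n_u, n_v):
    """Equation (5): weighted mean of the two feature vectors."""
    return (n_u * x_u + n_v * x_v) / (n_u + n_v)

def aggregate_labels(y_u, y_v, m_u, m_v):
    """Equation (6): weighted mean of the label prior distributions."""
    if m_u + m_v == 0:
        return np.zeros_like(y_u)  # no training label merged into w
    return (m_u * y_u + m_v * y_v) / (m_u + m_v)

# toy usage: two nodes, three classes
x_w = aggregate_features(np.array([1.0, 0.0]), np.array([0.0, 1.0]), n_u=3, n_v=1)
y_w = aggregate_labels(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]), m_u=1, m_v=1)
print(x_w, y_w)  # [0.75 0.25] [0.5 0.5 0. ]
        </preformat>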
        </sec>
        <sec id="sec-4-6-2">
          <title>3.4. Label Refinement</title>
          <p>Label refinement is related to the node embedding prolonging in [10], where we need to transform a prediction about a cluster into predictions about its constituent nodes. Within this work, we assume a trivial copy-paste method, where all merged nodes share the same prediction.</p>
          <p>More precisely, running a (GNN) model on G_i, we receive a prediction for each node from V_i that needs to be prolonged to the nodes from V_1 for the target accuracy evaluation. The relation between nodes in V_i and V_1 is hierarchical, so that each node v ∈ V_i can be seen as a hierarchical composition of nodes V̄_v ⊆ V_1 such that the node v arose by merging the nodes from V̄_v according to Equation (3) in the preceding steps. The considered simple prolonging mapping r then takes the prediction related to v and applies it to all nodes from V̄_v.</p>
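          <p>A minimal Python sketch of this copy-paste refinement is given below; the mapping cluster_of, which assigns each original node to its merged node, is our own bookkeeping device and would be collected during the contraction steps.</p>
          <preformat>
# Minimal sketch of the copy-paste refinement mapping r: each original node
# inherits the prediction of the merged node it belongs to. `cluster_of`
# maps an original node of G_1 to its merged node in the coarsened graph.

def refine_predictions(predictions_coarse, cluster_of):
    """Prolong coarse predictions to all nodes of the original graph G_1."""
    return {node: predictions_coarse[cluster] for node, cluster in cluster_of.items()}

# toy usage: nodes 1 and 2 were merged into the cluster "a", node 3 into "b"
print(refine_predictions({"a": "blue", "b": "green"}, {1: "a", 2: "a", 3: "b"}))
# {1: 'blue', 2: 'blue', 3: 'green'}
          </preformat>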
        </sec>
        <sec id="sec-4-6-3">
          <title>3.5. Graph Coarsening Sequence</title>
        </sec>
      </sec>
      <sec id="sec-4-7">
        <title>We have already described a single edge contraction step</title>
          <p>We have already described a single edge contraction step from G_i to G_{i+1}, which assumed a given edge to remove. Given an ordered list of edges to remove, we can step by step obtain the sequence from Equation (1) in order to evaluate the performance-complexity trade-off for a given graph and model. An example of the graph decomposition is illustrated in Figure 3, while the final missing piece of this procedure, that is, ordering the edges to be contracted, is described in Section 4.</p>
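          <p>The following minimal Python sketch assembles the sequence from an ordered list of edges; it reuses the contract_edge sketch from Section 3.1 and tracks which merged node each original node currently belongs to. The bookkeeping names are ours, and a practical implementation would also carry the aggregated features and labels along.</p>
          <preformat>
import copy

# Minimal sketch of building the sequence S from Equation (1): given the
# original adjacency and an ordered list of edges to contract, apply the
# single contraction step from Section 3.1 repeatedly and snapshot the
# intermediate graphs.

def coarsening_sequence(adj, ordered_edges, contract_edge):
    graphs = [copy.deepcopy(adj)]        # G_1 is the original graph
    merged_into = {}                     # maps removed nodes to their merged node
    for u, v in ordered_edges:
        u = merged_into.get(u, u)        # follow earlier merges
        v = merged_into.get(v, v)
        if u == v or v not in adj.get(u, set()):
            continue                     # the edge vanished in a previous step
        w = contract_edge(adj, u, v)
        for old, new in list(merged_into.items()):
            if new in (u, v):
                merged_into[old] = w
        merged_into[u] = merged_into[v] = w
        graphs.append(copy.deepcopy(adj))
    return graphs
          </preformat>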
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Ordered Sequence of Edges for</title>
    </sec>
    <sec id="sec-6">
      <title>Removal</title>
      <p>This section focuses on the last missing piece of the complexity-performance trade-off evaluation, which is obtaining a sequence of edges driving the edge contraction procedure. First, a simple performance upper-bound of the graph that was reduced according to Section 3 is presented. The suggested edge sorting algorithm aims to provide an ordering of edges such that the upper bound would be maximised at each step when the edge contraction algorithm is applied in this order.</p>
      <sec id="sec-6-1">
        <title>4.1. Label Refinement Induced</title>
      </sec>
      <sec id="sec-6-2">
        <title>Performance Upper Bound</title>
        <p>Considering all nodes in the testing set, in Figure 3 we can see that the nodes of the same color (label) are merged together in the first three steps. In that case, 100% accuracy can theoretically be achieved if the merged nodes are predicted correctly. However, the final edge contraction merges together nodes with different colors (labels) and the performance from this step forward cannot be greater than 60%, which is achieved if the node w is predicted to be a blue one. If it is predicted as a green one, only 40% accuracy is achieved. In this case, 60% presents a performance upper-bound since no model can achieve better performance in the final single-node graph. Note that the upper bound is 100% for the first three steps.</p>
        <sec id="sec-6-2-1">
          <title>More formally, according to Section 3.4, we assume all</title>
          <p>the merged nodes4 V̄ ⊆ V1 to share the prediction of the
corresponding node  ∈ V . Obviously, if true test labels</p>
          <p>V̄, some prediction errors are unavoidable.
difer within
Given the node  ∈</p>
          <p>V , we calculate a histogram of the
test labels from V̄. Denoting  0 the most common testing
label within V̄, a minimal number of errors on testing
nodes from V̄ is given by the number of test labels that
difers to  0. Denoting this number of errors as  ()
each node  ∈ V , the performance upper-bound can be
for
4Recall V̄ as a set of nodes from the original graph that are merged
into node  for a given  ∈ V .</p>
          <p>written as:
set.
where | | denotes the number of nodes from  1 in test</p>
        </sec>
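        <p>Assuming the upper-bound indeed equals one minus the minimal achievable error rate on the test set, it can be computed with the following minimal Python sketch; the input format (a list of test-label lists, one per merged node) is our own choice.</p>
        <preformat>
from collections import Counter

# Minimal sketch of the upper-bound from Section 4.1: for each merged node,
# the minimal number of errors equals the number of test labels that differ
# from the most common test label inside that node.

def performance_upper_bound(clusters):
    test_size = sum(len(labels) for labels in clusters)
    errors = sum(len(labels) - max(Counter(labels).values())
                 for labels in clusters if labels)
    return 1.0 - errors / test_size

# toy usage matching the example above: merging three blue and two green test
# nodes into a single node gives a 60% upper bound
print(performance_upper_bound([["blue", "blue", "blue", "green", "green"]]))  # 0.6
        </preformat>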
        <sec id="sec-6-2-2">
          <title>We also note that considering this upper-bound as</title>
          <p>
            performance in the performance-complexity trade-of
(Section 2.1), we achieve a non-increasing sequence of
performances for the sequence  in Equation (
            <xref ref-type="bibr" rid="ref1">1</xref>
            ).
          </p>
        </sec>
      </sec>
      <sec id="sec-6-3">
        <title>4.2. Edge Ordering Maximizing the Performance Upper Bound</title>
      </sec>
      <sec id="sec-6-4">
        <title>4.3. Base Model</title>
        <sec id="sec-6-4-1">
          <title>The base model serves to evaluate a posterior probability for each node in graph determining the pairwise node sumed to process the entire graph, it is expected to be a simple one.</title>
        </sec>
      </sec>
      <sec id="sec-6-5">
        <title>4.4. Similarity Measures</title>
        <sec id="sec-6-5-1">
          <title>Within this paper, we experimentally evaluate several</title>
          <p>similarity measures of the probability density functions.</p>
        <p>In particular, we consider the cross-entropy and the KL divergence as similarity measures of the posterior probability distributions. The asymmetry of these measures does not have any effect, since we assume an undirected graph, that is, the similarity is calculated in both directions and the smaller value is used.</p>
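        <p>The following minimal Python sketch shows the two measures and the symmetrisation by taking the smaller value; the epsilon smoothing is our own numerical safeguard.</p>
        <preformat>
import numpy as np

# Minimal sketch of the considered similarity measures on the posterior
# distributions p and q of two adjacent nodes. Both measures are asymmetric,
# so for an undirected edge they are evaluated in both directions and the
# smaller value is used, as described above.

def kl_divergence(p, q, eps=1e-12):
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def cross_entropy(p, q, eps=1e-12):
    return float(-np.sum(np.asarray(p) * np.log(np.asarray(q) + eps)))

def edge_similarity(p, q, measure=kl_divergence):
    return min(measure(p, q), measure(q, p))

# edges with a small value connect nodes with similar posteriors
# and are contracted first
print(edge_similarity([0.9, 0.1], [0.8, 0.2]))
        </preformat>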
        </sec>
      </sec>
    <sec id="sec-7">
      <title>5. Hierarchical Clustering of</title>
    </sec>
    <sec id="sec-8">
      <title>Nodes Induced by Edge</title>
    </sec>
    <sec id="sec-9">
      <title>Contraction</title>
      <p>As demonstrated in Figure 2, the edge-contraction-based coarsening leads to a hierarchical clustering represented by a tree, where the nodes of the original graph v ∈ V_1 form the leaves of the tree and each edge contraction defines a relation between a parent and its children according to Equation (3). As an example, the tree for the procedure from Figure 3 is shown in Figure 4.</p>
      <p>Although this clustering is only a side-effect of the edge contraction procedure, it brings a useful insight into the quality of the edge ordering (we discuss other potential applications in Section 7). The label refinement from Section 3.4 basically operates within each hierarchical cluster, since the (GNN) model is assumed to make a prediction about this cluster that needs to be refined into a prediction for the nodes in the original graph G_1.</p>
      <sec id="sec-9-1">
        <title>The edge contraction described in Section 3.1 can be</title>
        <p>easily extended by considering multiple edges to be
contracted simultaneously. In that case, a sub-graph defined
by a collection of edges to contract is considered. Each
component of this graph then forms a new merged node.</p>
        <p>The motivation for this multi-step consideration is to decrease the depth of the hierarchical tree on the one hand, or to reduce the number of graphs in Sequence (1) on the other. The depth of the tree referring to the hierarchical clustering can be as high as the number of nodes in the graph. The multi-edge contraction procedure and its relationship with the edge contraction step described in Section 3.1 is illustrated in Figure 5 on a particular example.</p>
        <p>[Figure 5: four selected edges of a small graph are contracted one by one in four steps (left); if these steps are performed simultaneously, a flat tree is obtained (right).]</p>
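        <p>A minimal Python sketch of this multi-step variant is given below: the endpoints of the selected edges are grouped with a union-find structure, and every resulting connected component becomes one merged node. The union-find formulation is our own implementation choice.</p>
        <preformat>
# Minimal sketch of the multi-step contraction: the edges selected for
# contraction define a sub-graph, and every connected component of that
# sub-graph is merged into a single new node.

def contraction_components(edges_to_contract):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges_to_contract:         # union the endpoints of every edge
        parent[find(u)] = find(v)

    components = {}
    for node in parent:
        components.setdefault(find(node), set()).add(node)
    return list(components.values())

# contracting a path of four edges in one step yields a single component
print(contraction_components([(1, 2), (2, 3), (3, 4), (4, 5)]))  # [{1, 2, 3, 4, 5}]
        </preformat>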
      </sec>
    </sec>
    <sec id="sec-10">
      <title>6. Experiments</title>
      <p>This section provides experimental validation of the proposed complexity-performance trade-off on traditional real-world graph datasets.</p>
      <sec id="sec-10-1">
        <title>6.1. Experiments Objectives</title>
        <sec id="sec-10-1-1">
          <title>Since our claims are not as straightforward as stating</title>
          <p>that one method overcomes another one, we first claim,
what we try to achieve within the experiments.</p>
        </sec>
        <sec id="sec-10-1-2">
          <title>First of all, we would like to find a sweet spot of the</title>
          <p>complexity-performance trade-of for a given task
enabling us to outperform the baseline with reduced
complexity. We found this point in all cases shown in Figure
6a.</p>
          <p>The next objective is to compare the performance
upper-bound from Section 4.1 with true performance
– Figure 6a. Since the performance upper-bound can be
calculated simply compared to the true performance, it
can be considered as an approximation of the true
performance in the complexity-performance trade-of. This
would save significant computational resources needed
for true performance evaluation for all graphs in the
sequence  , but making conclusions on upper-bound should
be validated.</p>
          <p>We further investigate impact of various steps driving
the edge order for removal described in Section 4 on the
complexity performance trade-of and we also attempt
to validate using a posterior distribution for edge sorting
(4.2) instead of the prior (labels) used in the upper-bound
(Figure 6b). Finally, we demonstrate the node clustering
as a vital visualization of a particular graph coarsening.</p>
          <p>This section provides experimental validation of the pro- is not suitable for comparison of multiple models, since
We consider a logistic regression model as the base model
in all experiments. Based on the posterior distribution
obtained from the logistic regression, we calculate the
ordering of edges for removal using the KL-divergence
and cross-entropy similarity measures on predictive node
distribution. We also consider random ordering as a
reference.</p>
        </sec>
        <sec id="sec-10-1-3">
          <title>We first calculate the graph sequence according to</title>
          <p>Section 3.5. This sequence inherently provides the
performance upper-bound (Section 4.1). A GCN [1] is then
trained on each graph in the sequence, where the labels
and features are aggregated according to Sections 3.2 and
3.3 and performance (accuracy) is obtained on the test set
refined to the original graph (Section</p>
        </sec>
        <sec id="sec-10-1-4">
          <title>3.4). We consider</title>
          <p>
            the cross entropy loss function within training, where the
cross entropy is calculated against the prior distribution
calculated in Equation (
            <xref ref-type="bibr" rid="ref6">6</xref>
            ). The node is assumed to be a
training one if it contains at least one aggregated label
from  1.
          </p>
        </sec>
        <sec id="sec-10-1-5">
          <title>We consider complexity given by the number of nodes</title>
          <p>in the graph within our experiments. This simplified
complexity measure takes only graph into account, it
their complexity is not taken into account.</p>
          <p>We evaluated the performance-complexity trade-of
for Cora, PubMed [14] and DBLP [15] datasets, where
we sub-sampled training labels – see results in Figure
6a. In order to validate the use of the posterior instead
of prior distribution, we used a distribution given by the
base model if the label was not available and if it was, a
distribution given by    + (1 −  )  ̂, where   denotes
the prior distribution of data determined by the training
labels,   ̂ the posterior distribution (base model output)
and  is a coeficient enabling mixing between the two
distributions. The results are shown in Figure 6b.</p>
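        <p>The following minimal Python sketch shows the mixed distribution used for edge sorting in this experiment; the function name and the fallback to the plain posterior for unlabelled nodes reflect our reading of the setup described above.</p>
        <preformat>
import numpy as np

# Minimal sketch of the distribution used for edge sorting in Figure 6b:
# for a labelled node the prior (one-hot training label) is mixed with the
# base-model posterior as r * prior + (1 - r) * posterior; for an unlabelled
# node the posterior alone is used.

def sorting_distribution(posterior, prior=None, r=0.4):
    posterior = np.asarray(posterior, dtype=float)
    if prior is None:                 # no training label available
        return posterior
    return r * np.asarray(prior, dtype=float) + (1.0 - r) * posterior

print(sorting_distribution([0.6, 0.3, 0.1], prior=[1.0, 0.0, 0.0], r=0.4))
# [0.76 0.18 0.06]
        </preformat>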
        <p>The complexity-performance trade-off curves in Figures 6a and 6b are equipped with 90% symmetric confidence intervals, where the interval is given by the corresponding quantiles of the Beta distribution constructed as the conjugate prior to the binomial distribution given by e errors out of n testing examples (note that (n − e)/n stands for the accuracy).</p>
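        <p>The confidence interval can be reproduced with the following minimal Python sketch, assuming a uniform Beta(1, 1) prior; SciPy's beta distribution provides the required quantiles.</p>
        <preformat>
from scipy.stats import beta

# Minimal sketch of the 90% symmetric confidence interval used for the curves:
# with e errors out of n test examples, the accuracy is modelled through a
# Beta distribution (the conjugate prior of the binomial), here with a uniform
# Beta(1, 1) prior, and the interval is given by its 5% and 95% quantiles.

def accuracy_confidence_interval(errors, n, level=0.90):
    successes = n - errors
    lo = (1.0 - level) / 2.0
    return (beta.ppf(lo, successes + 1, errors + 1),
            beta.ppf(1.0 - lo, successes + 1, errors + 1))

print(accuracy_confidence_interval(errors=20, n=200))  # roughly (0.86, 0.93)
        </preformat>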
      </sec>
      <sec id="sec-10-2">
        <title>6.3. Performance Upper-Bound versus</title>
      </sec>
      <sec id="sec-10-3">
        <title>GNN Performance</title>
        <sec id="sec-10-3-1">
          <title>The comparison of the upper-bound and true perfor</title>
          <p>mance can be found in Figures 6a and 6b. The
upperbound does not sufer from the fluctuations as the true
performance. In case when the base model is strong
enough, the working point selection according to the
upper-bound could bring reasonable results.
0
500
1000
1500
for diferent number of training labels.
case of Cora dataset. In case of 2% training labels, the The hierarchical clustering described in Section 5 presents
base model is so weak that qualitative diference between
a neat way of manually validating the graph coarsening
the edge sorting strategies disappears for all datasets. procedure. We evaluated the base model on the Karate</p>
        </sec>
        <sec id="sec-10-3-2">
          <title>Interestingly, the provided observations made on true</title>
        </sec>
        <sec id="sec-10-3-3">
          <title>Club dataset and used the proposed coarsening proce</title>
          <p>performance can be also deduced from the upper-bounds. dure according to Section 5.1. Two edges were contracted</p>
          <p>As can be seen in Figure 6b, adding prior information in each step. The resulting tree is shown in Figure 7.
makes the results consistently worse. This is probably</p>
          <p>Observing the tree, we can recognise splits (or
coarsr=0
r=0.4
r=0.8
r=1
r=0.0 (UB)
r=0.4 (UB)
r=0.8 (UB)
r=1.0 (UB)
Base Model
r=0
r=0.4
r=0.8
r=1
r=0.0 (UB)
r=0.4 (UB)
r=0.8 (UB)
r=1.0 (UB)
Base Model
r=0
r=0.4
r=0.8
r=1
r=0.0 (UB)
r=0.4 (UB)
r=0.8 (UB)
r=1.0 (UB)
Base Model
2000 4000 6000 8000 10000 12000 14000 16000</p>
          <p>Graph complexity - Number of nodes in the graph
1000 1500 2000
Graph complexity - Number of nodes in the graph
2500
4000 6000 8000 10000 12000 14000 16000 18000 20000</p>
          <p>Graph complexity - Number of nodes in the graph
(b) Performance comparison of various distribution used for sorting the edges for contracting.</p>
        </sec>
      </sec>
      <sec id="sec-10-4">
        <title>6.4. Impact of Edges Ordering on</title>
      </sec>
      <sec id="sec-10-5">
        <title>Performance Complexity Trade-Of</title>
        <p>The impact of the chosen distribution similarity measure on the performance is shown in Figure 6a. Surprisingly, even the random baseline gives good results in some cases. While no metric is the best choice in case of the PubMed and DBLP datasets, the KL divergence seems to be a superior choice in case of the Cora dataset. In case of 2% training labels, the base model is so weak that the qualitative difference between the edge sorting strategies disappears for all datasets. Interestingly, the observations made on the true performance can also be deduced from the upper-bounds.</p>
        <p>In contrast with evaluating the similarity measures, the upper-bound provides an inverse observation. As can be seen in Figure 6b, adding prior information makes the results consistently worse. This is probably caused by the fact that neighbouring nodes with the same label in the training set are merged first, and the lack of training data then causes the observed performance drop.</p>
      </sec>
      <sec id="sec-10-5">
        <title>6.5. Hierarchical Clustering</title>
        <p>The hierarchical clustering described in Section 5 presents a neat way of manually validating the graph coarsening procedure. We evaluated the base model on the Karate Club dataset and used the proposed coarsening procedure according to Section 5.1. Two edges were contracted in each step. The resulting tree is shown in Figure 7.</p>
        <p>Observing the tree, we can recognise splits (or coarsening steps) resulting from merging nodes with the same label, which is the desired behavior, because the graph size is reduced without any performance loss in such a case. However, merging differently labelled nodes unavoidably leads to a performance loss, because our label refinement procedure guarantees the same prediction for each node within the cluster (see Section 4.1).</p>
        <p>In case of the Karate Club dataset tree (Figure 7), the first four steps of the procedure merge only nodes with a matching label. However, from the fifth step onward, at least two errors occur and affect the total accuracy (one error in case of the test accuracy).</p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>7. Conclusion and Discussion</title>
      <p>We formalized the performance-complexity trade-off and described a framework for smoothly adjusting either the desired model complexity for given performance requirements, or the performance for the available computational resources. The framework consists of a methodical step-by-step graph coarsening by way of edge contraction, where we addressed practical aspects of feature and label aggregation within node merging, and also the refinement of predictions from the merged node back to the original nodes.</p>
      <sec id="sec-11-1">
        <title>We have also shown that this edge coarsening naturally produces a hierarchical clustering of the nodes in the graph.</title>
      </sec>
      <sec id="sec-11-2">
        <title>The ordering of edges for the contracting procedure</title>
        <p>is then studied in greater detail, where we suggest to
employ a base model to produce a predictive posterior
distribution for each node and then sort the edges
according to the distance of probability distributions of adjacent
nodes.</p>
      </sec>
      <sec id="sec-11-3">
        <title>We applied the proposed framework to widely used</title>
        <p>publicly available graph datasets. In particular, we
explored the coarsening procedure on the Karate Club
dataset by means of a hierarchical tree. Finally we
compared KL-divergence and cross-entropy similarity
measures for driving the edge ordering.</p>
      </sec>
      <sec id="sec-11-4">
        <title>Although our primary focus was the complexity per</title>
        <p>formance trade-of, we believe that we explored several
research directions for future works.</p>
        <p>The considered label refinement is extremely simple,
however, we can alternatively consider an arbitrary model
within each cluster. This will incur some additional cost,
but a distributive processing may be utilized since the
clusters do not overlap. Moreover, the information from
this ”local” model can be combined with the global one,
which could lead to superior performance since both
local and global graph contexts would be considered. In
addition, we would still avoid running the GNN on the
entire graph.</p>
      </sec>
      <sec id="sec-11-5">
        <title>If we focus on the hierarchical clustering, we can relax</title>
        <p>our assumption of applying a simple model on the entire
graph to produce a predictive posterior distribution.
Instead, we could consider as strong model as possible to
achieve the best possible clustering for a given use case.</p>
        <p>Within this work, we used the hierarchical clustering
as a complementary/inspection tool for the graph
coarsening. Nevertheless, it would be interesting to compare
its suitability for other purposes – e.g. distributive
processing as an alternative to [7]. In addition, the clustering
is built on a supervised setup. An unsupervised approach
based on e.g. the cosine distance of the node2vec
embedding could be an interesting alternative to [13].</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T. N.</given-names>
            <surname>Kipf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          ,
          <article-title>Semi-supervised classification with graph convolutional networks</article-title>
          ,
          <source>arXiv preprint arXiv:1609.02907</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Grover</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          , node2vec:
          <article-title>Scalable feature learning for networks</article-title>
          ,
          <source>in: Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>855</fpage>
          -
          <lpage>864</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dai</surname>
          </string-name>
          , L. Song,
          <article-title>Discriminative embeddings of latent variable models for structured data</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>2702</fpage>
          -
          <lpage>2711</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.-L.</given-names>
            <surname>Chiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Si</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-J. Hsieh</surname>
          </string-name>
          ,
          <article-title>Cluster-gcn: An efficient algorithm for training deep and large graph convolutional networks</article-title>
          ,
          <source>in: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery &amp; data mining</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>257</fpage>
          -
          <lpage>266</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Frasca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chamberlain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Eynard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bronstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Monti</surname>
          </string-name>
          , Sign:
          <article-title>Scalable inception graph neural networks</article-title>
          ,
          <source>arXiv preprint arXiv:2004.11198 7</source>
          (
          <year>2020</year>
          )
          <fpage>15</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ying</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>You</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aiken</surname>
          </string-name>
          ,
          <article-title>Redundancy-free computation for graph neural networks</article-title>
          ,
          <source>in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>997</fpage>
          -
          <lpage>1005</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Karypis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Metis: A software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices (</article-title>
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zheng</surname>
          </string-name>
          , C. Ma,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , G. Karypis,
          <article-title>Distdgl: distributed graph neural network training for billion-scale graphs</article-title>
          ,
          <source>in: 2020 IEEE/ACM 10th Workshop on Irregular Applications: Architectures and Algorithms (IA3)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>36</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          , Y. Cheng,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          , J. Cheng, H.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>S. Zhang,</given-names>
          </string-name>
          <article-title>Bytegnn: efficient graph neural network training at large scale</article-title>
          ,
          <source>Proceedings of the VLDB Endowment</source>
          <volume>15</volume>
          (
          <year>2022</year>
          )
          <fpage>1228</fpage>
          -
          <lpage>1242</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Perozzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          , S. Skiena, HARP:
          <article-title>Hierarchical Representation Learning for Networks</article-title>
          ,
          <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
          <volume>32</volume>
          (
          <year>2018</year>
          ).
          <source>Number: 1.</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dedič</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bajer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Repický</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Procházka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Holeňa</surname>
          </string-name>
          ,
          <article-title>Adaptive graph coarsening in the context of local graph quality</article-title>
          , Accepted to CEUR Workshop Proceedings (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P. F.</given-names>
            <surname>Brown</surname>
          </string-name>
          , V. J. Della Pietra,
          <string-name>
            <given-names>P. V.</given-names>
            <surname>Desouza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Mercer</surname>
          </string-name>
          ,
          <article-title>Class-based n-gram models of natural language</article-title>
          ,
          <source>Computational linguistics 18</source>
          (
          <year>1992</year>
          )
          <fpage>467</fpage>
          -
          <lpage>480</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Charpentier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Galland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hollocou</surname>
          </string-name>
          ,
          <article-title>Hierarchical graph clustering using node pair sampling</article-title>
          , arXiv preprint arXiv:
          <year>1806</year>
          .
          <volume>01664</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sen</surname>
          </string-name>
          , G. Namata,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bilgic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Getoor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Galligher</surname>
          </string-name>
          ,
          <string-name>
            <surname>T.</surname>
          </string-name>
          Eliassi-Rad,
          <article-title>Collective classification in network data</article-title>
          ,
          <source>AI</source>
          Magazine
          <volume>29</volume>
          (
          <year>2008</year>
          )
          <fpage>93</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , L. Yao,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <article-title>Arnetminer: Extraction and mining of academic social networks</article-title>
          ,
          <source>in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          , KDD '08,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2008</year>
          , p.
          <fpage>990</fpage>
          -
          <lpage>998</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>