<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Graph Size Reduction for Efficient GNN Application</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pavel Procházka</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michal Mareš</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marek Dědič</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cisco Systems, Inc.</institution>
          ,
          <addr-line>Karlovo náměstí 10, Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Czech Technical University in Prague</institution>
          ,
          <addr-line>Technická 2, Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Czech Technical University in Prague</institution>
          ,
          <addr-line>Trojanova 13, Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <fpage>3</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>Graph neural networks (GNNs) have been the dominant framework for representation learning on graphs for the past several years. The main strength of GNNs lies in the fact that they can simultaneously learn from both node-related attributes and the relations between nodes, represented by edges. In tasks leading to large graphs, a GNN often requires significant computational resources to achieve its superior performance. In order to reduce the computational cost, methods allowing for a flexible balance between complexity and performance could be useful. In this work, we propose a simple, scalable, task-aware graph preprocessing procedure allowing us to obtain a reduced graph such that a GNN achieves a given desired performance on the downstream task. In addition, the proposed preprocessing allows fitting the reduced graph and the GNN into given memory and computational resources. The proposed preprocessing is evaluated and compared with several reference scenarios on conventional GNN benchmark datasets.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>Graph neural networks (GNNs) have proven to achieve</title>
        <p>superior performance on a number of graph datasets and
are adopted in industrial applications across many fields.
Superior GNN performance is, however, often paid for
by a numerically intensive training procedure with a
significant memory footprint. In addition, fine tuning
parameters of a complex GNN can prove challenging as
well. The issue is emphasised whenever the problem
modeled by the graph is evolving in time and retraining
is required to keep the model accurate.</p>
      </sec>
      <sec id="sec-1-2">
        <title>In many industrial applications, high computational</title>
        <p>cost caused by training complex models might be an issue.</p>
      </sec>
      <sec id="sec-1-3">
        <title>Moreover, some real-world problems lead to high amount</title>
        <p>of data that does not fit the memory of a single machine.</p>
      </sec>
      <sec id="sec-1-4">
        <title>In order to apply GNNs to such big data, either some</title>
        <p>data reduction or their processing in a distributive way
is required.</p>
        <p>Traditional simple structure-agnostic baselines like
the Naive Bayes classifier usually scale very well at a
reasonable cost, making them useful in scenarios where
such a simple model performs well enough.
performance trade-of and propose a method operating
on an arbitrary performance level between a simple base
model and the fine-tuned GNN. The method consists in
a systematic reduction of the graph for GNN, implying
complexity reduction at the expense of performance.
ITAT’22: Information technologies – Applications and Theory,
Septem</p>
      </sec>
      <sec id="sec-1-5">
        <title>Graph batching proposed in [4] introduces graph sam</title>
        <p>pling (batching) to enable stochastic gradient descent.
Next, feature propagation through the network layers is
avoided in [5]. Instead, the product of the feature matrix
and powers of the adjacency matrix are concatenated
negative impact on performance. Another approach is to
identify and remove redundant operations as proposed in
[6]. One can also consider decomposing the graph as in
[7] in order to enable distributive training of GNNs, since
each cluster represents an independent graph component
– this idea was adopted in [8]. Connection of batching
with distributive processing is considered in [9]. In
contrast with these approaches, the proposed complexity
reduction is achieved by graph coarsening.</p>
      </sec>
      <sec id="sec-1-6">
        <title>A systematic way of graph coarsening associated with</title>
        <p>In this paper, we address the problem of complexity- and passed to the multi-layer perceptron without any
node2vec is considered in the HARP framework [10].</p>
      </sec>
      <sec id="sec-1-7">
        <title>The idea is to reduce the graph step-by-step by means of</title>
        <p>edge and star collapsing, training a node2vec model on
the coarsened graph and then to step-by-step refine the
embedding to the original graph. The motivation behind
HARP is to provide an embedding incorporating both
local and global graph properties. Using HARP on
studying the complexity-performance trade-of is proposed
in [11], where the graph coarsening is optimised with
respect to graph properties. The prolonging procedure
is then driven by the downstream task. In contrast to
this work, our coarsening is driven in a diferent way,
nonetheless respecting the downstream task.</p>
      </sec>
      <sec id="sec-1-8">
        <title>Graph clustering / partitioning [7] can be used for</title>
        <p>distributive processing as described above. Another
usecase could be organising the data (typically in an
unsupervised way) in order to better understand its structure.
On top of traditional clustering, hierarchical clustering
enables a tree-structured organisation of the clusters, in
contrast to conventional flat clusters. In [ 12], the authors
proposed hierarchical clustering of words in a corpus that
is based on a bi-gram language model and the hierarchy
is driven by mutual information between clusters. In [13],
the authors propose a hierarchical clustering of nodes in
a graph that is driven by a pair sampling ratio derived
from weights of the adjacency matrix. To the best of our
knowledge, there is no available hierarchical clustering
of nodes in a graph that is aware of the downstream task
as we propose in this work.</p>
        <sec id="sec-1-8-1">
          <title>1.3. Structure of the Paper</title>
          <p>The paper is organised as follows: A formal description of the complexity-performance trade-off and a general framework for solving this problem are presented in Section 2. A coarsening procedure based on edge contraction, as a particular instance of the proposed framework, is presented in Section 3. Section 4 then discusses a critical part of the edge contraction procedure, which is the order in which the edges are contracted. A hierarchical clustering view of the considered graph coarsening procedure is the topic of Section 5. Experiments are presented in Section 6 and the work is concluded in Section 7, where we also outline future research directions.</p>
        </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Graph Complexity Reduction</title>
      <sec id="sec-2-1">
        <title>1.2. Goals and Contribution</title>
        <sec id="sec-2-1-1">
          <title>The work reported in this paper presents the following contributions:</title>
          <p>1. maximize performance  ( Y, ( Ŷ⋆)) subject to
complexity restriction ( ⋆,  ) ≤   for a given
complexity constrain   or
2. minimize complexity ( ⋆,  ) subject given
minimal required performance  ( Y, ( Ŷ⋆)) ≥   .</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Graph Algorithm Complexity</title>
        <sec id="sec-2-2-1">
          <title>There are two main aspects of GNN complexity. First,</title>
          <p>the underlying graph  needs to be stored in memory.
Second, the training procedure typically runs through
the graph and needs to access it in order to drive the
learning process. The authors report both memory and
computational complexity of various training techniques
of GCN in [4] – Table 1. In principle, both
complexities or even any function of them can be covered by the
complexity term (,  )</p>
          <p>.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. The Performance Complexity</title>
      </sec>
      <sec id="sec-2-4">
        <title>Trade-Of</title>
        <p>
          Given a size-decreasing sequence of graphs
 = [ 1, … ,   , … ,   ],
(
          <xref ref-type="bibr" rid="ref1">1</xref>
          )
where   = (V , E , X , Y ) and |V | &gt; |V |
⟺  &lt;  , we
expect a model (GNN) performance to be proportional
to the graph size. The main focus of this paper is on
the construction of such size-decreasing graph sequence
that would enable the fulfillment of our ultimate
optimisation problem described in Section 2.1. Evaluating
the performance of the model 
on each graph in the
sequence, one can plot dependency of performance on
the complexity (graph size). Denoting   performance
of model
        </p>
        <p>achieved using   with refinement into the
original graph  , our goal can be achieved according to
illustration1 in Figure 1:</p>
        <p>– blue arrows.
• Given maximum available complexity   , we
select such a graph that the complexity given by
the number of nodes in the graph is smaller than
  that is |V | ≤   and achieve performance  3
• Given minimal required performance complexity
complexity  2 = |V2| – green arrows.</p>
        <p>, we select such a graph that |P | ≥   with</p>
      </sec>
      <sec id="sec-2-5">
        <title>2.4. Finding Working Point Using</title>
      </sec>
      <sec id="sec-2-6">
        <title>Performance Complexity Trade-Of in Practice</title>
        <sec id="sec-2-6-1">
          <title>In practice, it could be very expensive to calculate the</title>
          <p>performance for all graphs in the sequence  according to
1The illustration captures an intuition that the performance would
our main motivation. In order to overcome this limitation, contraction driving the size-decreasing graph sequence
performance (green arrows), we find the graph  2, where the
model achieves at least performance   . In case of maximal
available resources (blue arrows), we select the model
achieving performance  3.</p>
      <p>• In order to find the proper working point, we do not need to have the complete curve as in Figure 1. One possible strategy could be to start with a small graph, where the complexity would be low, and to continue until the desired performance is achieved (see the sketch after this list).</p>
      <p>• Another option that can be applied in some cases could be to rely on a generalization of the provided information, that is, a transfer of the proper graph size among tasks, time, models, etc.</p>
      <p>• Finally, if we provide some restrictions according to Section 3.4, we can use an upper-bound (or its estimation) as described in Section 4.1 instead of the true GNN performance.</p>
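      <p>For illustration, the following minimal Python sketch implements the first strategy under the assumption that a coarsening sequence and an evaluation routine are available; the helper names (graphs_small_to_large, train_and_evaluate) are ours and only stand in for the procedures described in Sections 3 and 6.</p>
      <preformat>
# Minimal sketch of the first strategy: walk the size-decreasing sequence
# from the smallest graph towards the original one and stop as soon as the
# desired performance p_min is reached. `graphs_small_to_large` (the reversed
# sequence S) and `train_and_evaluate` (trains a GNN on the coarsened graph
# and returns the accuracy refined to the original graph) are placeholders.

def find_working_point(graphs_small_to_large, train_and_evaluate, p_min):
    """Return the first (smallest) graph whose refined accuracy reaches p_min."""
    for graph in graphs_small_to_large:
        accuracy = train_and_evaluate(graph)
        if accuracy >= p_min:
            return graph, accuracy
    return None, None  # even the original graph does not reach p_min
      </preformat>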
    </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Graph Size Reduction Strategy</title>
    </sec>
    <sec id="sec-4">
      <title>Based on Edge Collapsing</title>
      <p>This section describes a way of obtaining the graph sequence from Equation (1), which is then used for the target complexity-performance trade-off evaluation. This approach is based on the edge contraction procedure that is described within this section. The order of edges for contraction driving the size-decreasing graph sequence from Equation (1) is described in Section 4 in greater detail. Finally, a hierarchical clustering induced by the edge contraction procedure can be found in Section 5.</p>
      <p>We can write the sequence S from Equation (1) as</p>
      <p>S = [(V_1, E_1, X_1, Y_1), … , (V_n, E_n, X_n, Y_n)], (2)</p>
      <p>where G_1 = (V_1, E_1, X_1, Y_1) stands for the original graph.</p>
      <sec id="sec-4-1">
        <title>The considered graph-size-reduction process is driven</title>
        <p>
          step by step. We first describe a single graph coarsening
step, that is, how to get from   to  +1 , in Equation (
          <xref ref-type="bibr" rid="ref1">1</xref>
          )
and then define an overall graph reduction procedure.
        </p>
        <sec id="sec-4-1-1">
          <title>3.1. Single Edge Contraction Step</title>
          <p>are given by</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>Given an edge to contract, its incident nodes are merged</title>
        <p>(removed) and a new node is established – see Figure 2
for an illustration. More precisely, given a graph   =
(V , E , X , Y ) and an edge  = (  ,   ),  ∈
E ,   ∈ V ,   ∈
V to remove, the new nodes and edges describing  +1</p>
        <p>V+1 = V ⧵ {  ,   } ∪ {   },
where    is a new (merged) node and</p>
        <p>E+1 = E ⧵ ([  ,   ], ⋆) ∪ {(   , )} ∈ N (  ,  ),
where</p>
        <p>([  ,   ], ⋆) = (  , ⋆) ∪ (  , ⋆)</p>
        <p>N (  ,   ) = N (  ) ∪ N (  ) ⧵ {  ,   },
where the neighborhood is taken according to   .</p>
      </sec>
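        <p>As an illustration of the contraction step in Equations (3) and (4), the following minimal Python sketch operates on an undirected graph stored as an adjacency dictionary; the representation and the function name are our own choices and not part of the paper.</p>
        <preformat>
# Minimal sketch of a single edge contraction step (Equations (3) and (4))
# on an undirected graph stored as {node: set of neighbours}. The merged node
# is represented by a fresh tuple so that repeated contractions keep working;
# feature and label aggregation are handled separately (Sections 3.2 and 3.3).

def contract_edge(adj, u, v):
    """Merge the incident nodes of the edge (u, v) into a new node w."""
    w = (u, v)                               # new (merged) node
    neighbours = (adj[u] | adj[v]) - {u, v}  # N(u, v) = N(u) + N(v) without u, v
    adj[w] = set(neighbours)
    for n in neighbours:                     # rewire edges incident to u or v
        adj[n] -= {u, v}
        adj[n].add(w)
    del adj[u], adj[v]
    return w

# toy usage
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
w = contract_edge(adj, 1, 2)
print(w, adj)  # the merged node (1, 2) is now adjacent to node 3 only
        </preformat>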
      <sec id="sec-4-3">
        <title>In the context of our setup, that is, in the situation</title>
        <p>where we want to apply a GNN on the new graph  +1
and evaluate its performance on the original one ( 1),
the following issues arise and must be resolved:</p>
      </sec>
      <sec id="sec-4-4">
        <title>Section 3.3.</title>
        <p>gate training labels (if available) to the new node</p>
        <p>and decide if this label should be used for
training on the coarsened graph, that is to define
Y+1 based on Y . This problem is discussed in</p>
      </sec>
      <sec id="sec-4-5">
        <title>2. Apart from labels, node features must be aggre</title>
        <p>Section 3.2 for details2.</p>
        <p>gated as well (transforming X into X+1 ) – see
3. Finally, once we have a prediction for the node
   , we need to establish predictions for all the
nodes from V1 that were merged into    . We
introduced the refinement mapping  for this
purpose in Section 2.1. More details and a simple
refinement mapping is presented Section
3.4.</p>
        <p>Within this paper, we only show a very simple possible solution to address each of the mentioned issues. A deeper investigation of each of them is not within the scope of this paper and is reserved for future work. Handling edge features would also need to be considered if they were available; since this is not the case in our assumptions, we leave this problem for future work.</p>
        <sec id="sec-4-5-1">
          <title>3.2. Feature Aggregation</title>
        </sec>
      </sec>
      <sec id="sec-4-6">
        <title>The GNN model assumes all feature vectors to have the</title>
        <p>same dimension for each node, hence some aggregation
is required when merging two nodes. In particular, we
have two nodes   ,   with feature vectors   and   of
length  that must be aggregated by some aggregation 
to the newly merged node    .</p>
        <p>Although a number of solutions can be considered, we
assume a simple weighted mean aggregation within this
work, that is
  
= (  ,   ) =
    +    
  +  
preceding steps.
where the weight   is given by the number of nodes
from the original graph  1 that were merged to   in the</p>
        <sec id="sec-4-6-1">
          <title>3.3. Label Aggregation</title>
          <p>
            Similarly as in the case of features, we also need to
aggregate labels associated with the nodes   ,   that are
supposed to be merged. We (ad-hoc) consider a similar
approach as in Equation (
            <xref ref-type="bibr" rid="ref5">5</xref>
            ), where we convert the labels
to a prior distribution   on labels and we obtain
   
=
     +     
  +  
(
            <xref ref-type="bibr" rid="ref5">5</xref>
            )
(
            <xref ref-type="bibr" rid="ref6">6</xref>
            )
1. Based on the nodes   and   , we need to aggre- into one vector   
= (  ,   ) of length  corresponding
edge order for contraction. We use colors to denote labels and to indicate the problem of label aggregation – their aggregation
is clear until the final node in the last step.
where   denotes3 the number of training labels merged
into   .
          </p>
          <p>In multi-class classification, the prior distribution  
is in the form of a probability vector with a single
nonzero (unit) element placed on the position of true label,
if available. Since the weight   is non-zero only for at
least one training sample, only probability vectors related
to the training samples are reflected in Equation 6.</p>
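        <p>The following minimal Python sketch illustrates the weighted-mean aggregations of Equations (5) and (6); the function names and the handling of nodes without any training label are our own assumptions.</p>
        <preformat>
import numpy as np

# Minimal sketch of the weighted-mean aggregation from Equations (5) and (6).
# n_u, n_v are the numbers of original nodes already merged into u and v;
# m_u, m_v are the numbers of training labels merged into them; y_u, y_v are
# one-hot (or previously aggregated) label distributions.

def aggregate_features(x_u, x_v, n_u, n_v):
    """Equation (5): weighted mean of the two feature vectors."""
    return (n_u * x_u + n_v * x_v) / (n_u + n_v)

def aggregate_labels(y_u, y_v, m_u, m_v):
    """Equation (6): weighted mean of the label prior distributions."""
    if m_u + m_v == 0:
        return np.zeros_like(y_u)  # no training label merged into w
    return (m_u * y_u + m_v * y_v) / (m_u + m_v)

# toy usage: two nodes, three classes
x_w = aggregate_features(np.array([1.0, 0.0]), np.array([0.0, 1.0]), n_u=3, n_v=1)
y_w = aggregate_labels(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]), m_u=1, m_v=1)
print(x_w, y_w)  # [0.75 0.25] [0.5 0.5 0. ]
        </preformat>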
        </sec>
        <sec id="sec-4-6-2">
          <title>3.4. Label Refinement</title>
          <p>Label refinement is related to the node embedding prolonging in [10], where we need to transform a prediction about a cluster into predictions about its constituent nodes. Within this work, we assume a trivial copy-paste method, where all merged nodes share the same prediction.</p>
          <p>More precisely, running a (GNN) model on G_i, we receive a prediction for each node from V_i that needs to be prolonged to the nodes from V_1 for the target accuracy evaluation. The relation between nodes in V_i and V_1 is hierarchical, so that each node v ∈ V_i can be seen as a hierarchical composition of nodes V̄_v ⊆ V_1 such that the node v arose by merging the nodes from V̄_v according to Equation (3) in the preceding steps. The considered simple prolonging mapping r then takes the prediction related to v and applies it to all nodes from V̄_v.</p>
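          <p>A minimal Python sketch of this copy-paste refinement is given below; the mapping cluster_of, which assigns each original node to its merged node, is our own bookkeeping device and would be collected during the contraction steps.</p>
          <preformat>
# Minimal sketch of the copy-paste refinement mapping r: each original node
# inherits the prediction of the merged node it belongs to. `cluster_of`
# maps an original node of G_1 to its merged node in the coarsened graph.

def refine_predictions(predictions_coarse, cluster_of):
    """Prolong coarse predictions to all nodes of the original graph G_1."""
    return {node: predictions_coarse[cluster] for node, cluster in cluster_of.items()}

# toy usage: nodes 1 and 2 were merged into the cluster "a", node 3 into "b"
print(refine_predictions({"a": "blue", "b": "green"}, {1: "a", 2: "a", 3: "b"}))
# {1: 'blue', 2: 'blue', 3: 'green'}
          </preformat>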
        </sec>
        <sec id="sec-4-6-3">
          <title>3.5. Graph Coarsening Sequence</title>
        </sec>
      </sec>
      <sec id="sec-4-7">
        <title>We have already described a single edge contraction step</title>
          <p>We have already described a single edge contraction step from G_i to G_{i+1}, which assumed a given edge to remove. Given an ordered list of edges to remove, we can step by step obtain the sequence from Equation (1) in order to evaluate the performance-complexity trade-off for a given graph and model. An example of the graph decomposition is illustrated in Figure 3, while the final missing piece of this procedure, that is, ordering the edges to be contracted, is described in Section 4.</p>
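          <p>The following minimal Python sketch assembles the sequence from an ordered list of edges; it reuses the contract_edge sketch from Section 3.1 and tracks which merged node each original node currently belongs to. The bookkeeping names are ours, and a practical implementation would also carry the aggregated features and labels along.</p>
          <preformat>
import copy

# Minimal sketch of building the sequence S from Equation (1): given the
# original adjacency and an ordered list of edges to contract, apply the
# single contraction step from Section 3.1 repeatedly and snapshot the
# intermediate graphs.

def coarsening_sequence(adj, ordered_edges, contract_edge):
    graphs = [copy.deepcopy(adj)]        # G_1 is the original graph
    merged_into = {}                     # maps removed nodes to their merged node
    for u, v in ordered_edges:
        u = merged_into.get(u, u)        # follow earlier merges
        v = merged_into.get(v, v)
        if u == v or v not in adj.get(u, set()):
            continue                     # the edge vanished in a previous step
        w = contract_edge(adj, u, v)
        for old, new in list(merged_into.items()):
            if new in (u, v):
                merged_into[old] = w
        merged_into[u] = merged_into[v] = w
        graphs.append(copy.deepcopy(adj))
    return graphs
          </preformat>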
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Ordered Sequence of Edges for</title>
    </sec>
    <sec id="sec-6">
      <title>Removal</title>
      <p>This section focuses on the last missing piece of the complexity-performance trade-off evaluation, which is obtaining a sequence of edges driving the edge contraction procedure. First, a simple performance upper-bound of the graph that was reduced according to Section 3 is presented. The suggested edge sorting algorithm aims to provide an ordering of edges such that the upper bound would be maximised at each step when the edge contraction algorithm is applied in this order.</p>
      <sec id="sec-6-1">
        <title>4.1. Label Refinement Induced</title>
      </sec>
      <sec id="sec-6-2">
        <title>Performance Upper Bound</title>
        <p>Considering all nodes in the testing set, in Figure 3 we can see that the nodes of the same color (label) are merged together in the first three steps. In that case, 100% accuracy can theoretically be achieved if the merged nodes are predicted correctly. However, the final edge contraction merges together nodes with different colors (labels) and the performance from this step forward cannot be greater than 60%, which is achieved if the node w is predicted to be a blue one. If it is predicted as a green one, only 40% accuracy is achieved. In this case, 60% presents a performance upper-bound since no model can achieve better performance in the final single-node graph. Note that the upper bound is 100% for the first three steps.</p>
        <sec id="sec-6-2-1">
          <title>More formally, according to Section 3.4, we assume all</title>
          <p>the merged nodes4 V̄ ⊆ V1 to share the prediction of the
corresponding node  ∈ V . Obviously, if true test labels</p>
          <p>V̄, some prediction errors are unavoidable.
difer within
Given the node  ∈</p>
          <p>V , we calculate a histogram of the
test labels from V̄. Denoting  0 the most common testing
label within V̄, a minimal number of errors on testing
nodes from V̄ is given by the number of test labels that
difers to  0. Denoting this number of errors as  ()
each node  ∈ V , the performance upper-bound can be
for
4Recall V̄ as a set of nodes from the original graph that are merged
into node  for a given  ∈ V .</p>
          <p>written as:
set.
where | | denotes the number of nodes from  1 in test</p>
        </sec>
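        <p>Assuming the upper-bound indeed equals one minus the minimal achievable error rate on the test set, it can be computed with the following minimal Python sketch; the input format (a list of test-label lists, one per merged node) is our own choice.</p>
        <preformat>
from collections import Counter

# Minimal sketch of the upper-bound from Section 4.1: for each merged node,
# the minimal number of errors equals the number of test labels that differ
# from the most common test label inside that node.

def performance_upper_bound(clusters):
    test_size = sum(len(labels) for labels in clusters)
    errors = sum(len(labels) - max(Counter(labels).values())
                 for labels in clusters if labels)
    return 1.0 - errors / test_size

# toy usage matching the example above: merging three blue and two green test
# nodes into a single node gives a 60% upper bound
print(performance_upper_bound([["blue", "blue", "blue", "green", "green"]]))  # 0.6
        </preformat>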
        <sec id="sec-6-2-2">
          <title>We also note that considering this upper-bound as</title>
          <p>
            performance in the performance-complexity trade-of
(Section 2.1), we achieve a non-increasing sequence of
performances for the sequence  in Equation (
            <xref ref-type="bibr" rid="ref1">1</xref>
            ).
          </p>
        </sec>
      </sec>
      <sec id="sec-6-3">
        <title>4.2. Edge Ordering Maximizing the Performance Upper Bound</title>
      </sec>
      <sec id="sec-6-4">
        <title>4.3. Base Model</title>
        <sec id="sec-6-4-1">
          <title>The base model serves to evaluate a posterior probability for each node in graph determining the pairwise node sumed to process the entire graph, it is expected to be a simple one.</title>
        </sec>
      </sec>
      <sec id="sec-6-5">
        <title>4.4. Similarity Measures</title>
        <sec id="sec-6-5-1">
          <title>Within this paper, we experimentally evaluate several</title>
          <p>similarity measures of the probability density functions.</p>
        <p>In particular, we consider the cross-entropy and the KL divergence as similarity measures of the posterior probability distributions. The asymmetry of these measures does not have any effect, since we assume an undirected graph, that is, the similarity is calculated in both directions and the smaller value is used.</p>
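        <p>The following minimal Python sketch shows the two measures and the symmetrisation by taking the smaller value; the epsilon smoothing is our own numerical safeguard.</p>
        <preformat>
import numpy as np

# Minimal sketch of the considered similarity measures on the posterior
# distributions p and q of two adjacent nodes. Both measures are asymmetric,
# so for an undirected edge they are evaluated in both directions and the
# smaller value is used, as described above.

def kl_divergence(p, q, eps=1e-12):
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def cross_entropy(p, q, eps=1e-12):
    return float(-np.sum(np.asarray(p) * np.log(np.asarray(q) + eps)))

def edge_similarity(p, q, measure=kl_divergence):
    return min(measure(p, q), measure(q, p))

# edges with a small value connect nodes with similar posteriors
# and are contracted first
print(edge_similarity([0.9, 0.1], [0.8, 0.2]))
        </preformat>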
        </sec>
      </sec>
    <sec id="sec-7">
      <title>5. Hierarchical Clustering of</title>
    </sec>
    <sec id="sec-8">
      <title>Nodes Induced by Edge</title>
    </sec>
    <sec id="sec-9">
      <title>Contraction</title>
      <p>As demonstrated in Figure 2, the edge-contraction-based coarsening leads to a hierarchical clustering represented by a tree, where the nodes of the original graph v ∈ V_1 form the leaves of the tree and each edge contraction defines a relation between a parent and its children according to Equation (3). As an example, the tree for the procedure from Figure 3 is shown in Figure 4.</p>
      <p>Although this clustering is only a side-effect of the edge contraction procedure, it brings a useful insight into the quality of the edge ordering (we discuss other potential applications in Section 7). The label refinement from Section 3.4 basically operates within each hierarchical cluster, since the (GNN) model is assumed to make a prediction about this cluster that needs to be refined into a prediction for the nodes in the original graph G_1.</p>
      <sec id="sec-9-1">
        <title>The edge contraction described in Section 3.1 can be</title>
        <p>easily extended by considering multiple edges to be
contracted simultaneously. In that case, a sub-graph defined
by a collection of edges to contract is considered. Each
component of this graph then forms a new merged node.</p>
        <p>The motivation for this multi-step consideration is to decrease the depth of the hierarchical tree on the one hand, or to reduce the number of graphs in Sequence (1) on the other. The depth of the tree referring to the hierarchical clustering can be as high as the number of nodes in the graph. The multi-edge contraction procedure and its relationship with the edge contraction step described in Section 3.1 is illustrated in Figure 5 on a particular example.</p>
        <p>[Figure 5: four selected edges of a small graph are contracted one by one in four steps (left); if these steps are performed simultaneously, a flat tree is obtained (right).]</p>
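        <p>A minimal Python sketch of this multi-step variant is given below: the endpoints of the selected edges are grouped with a union-find structure, and every resulting connected component becomes one merged node. The union-find formulation is our own implementation choice.</p>
        <preformat>
# Minimal sketch of the multi-step contraction: the edges selected for
# contraction define a sub-graph, and every connected component of that
# sub-graph is merged into a single new node.

def contraction_components(edges_to_contract):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges_to_contract:         # union the endpoints of every edge
        parent[find(u)] = find(v)

    components = {}
    for node in parent:
        components.setdefault(find(node), set()).add(node)
    return list(components.values())

# contracting a path of four edges in one step yields a single component
print(contraction_components([(1, 2), (2, 3), (3, 4), (4, 5)]))  # [{1, 2, 3, 4, 5}]
        </preformat>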
      </sec>
    </sec>
    <sec id="sec-10">
      <title>6. Experiments</title>
      <p>This section provides experimental validation of the proposed complexity-performance trade-off on traditional real-world graph datasets.</p>
      <sec id="sec-10-1">
        <title>6.1. Experiments Objectives</title>
        <sec id="sec-10-1-1">
          <title>Since our claims are not as straightforward as stating</title>
          <p>that one method overcomes another one, we first claim,
what we try to achieve within the experiments.</p>
        </sec>
        <sec id="sec-10-1-2">
          <title>First of all, we would like to find a sweet spot of the</title>
          <p>complexity-performance trade-of for a given task
enabling us to outperform the baseline with reduced
complexity. We found this point in all cases shown in Figure
6a.</p>
          <p>The next objective is to compare the performance
upper-bound from Section 4.1 with true performance
– Figure 6a. Since the performance upper-bound can be
calculated simply compared to the true performance, it
can be considered as an approximation of the true
performance in the complexity-performance trade-of. This
would save significant computational resources needed
for true performance evaluation for all graphs in the
sequence  , but making conclusions on upper-bound should
be validated.</p>
          <p>We further investigate impact of various steps driving
the edge order for removal described in Section 4 on the
complexity performance trade-of and we also attempt
to validate using a posterior distribution for edge sorting
(4.2) instead of the prior (labels) used in the upper-bound
(Figure 6b). Finally, we demonstrate the node clustering
as a vital visualization of a particular graph coarsening.</p>
          <p>This section provides experimental validation of the pro- is not suitable for comparison of multiple models, since
We consider a logistic regression model as the base model
in all experiments. Based on the posterior distribution
obtained from the logistic regression, we calculate the
ordering of edges for removal using the KL-divergence
and cross-entropy similarity measures on predictive node
distribution. We also consider random ordering as a
reference.</p>
        </sec>
        <sec id="sec-10-1-3">
          <title>We first calculate the graph sequence according to</title>
          <p>Section 3.5. This sequence inherently provides the
performance upper-bound (Section 4.1). A GCN [1] is then
trained on each graph in the sequence, where the labels
and features are aggregated according to Sections 3.2 and
3.3 and performance (accuracy) is obtained on the test set
refined to the original graph (Section</p>
        </sec>
        <sec id="sec-10-1-4">
          <title>3.4). We consider</title>
          <p>
            the cross entropy loss function within training, where the
cross entropy is calculated against the prior distribution
calculated in Equation (
            <xref ref-type="bibr" rid="ref6">6</xref>
            ). The node is assumed to be a
training one if it contains at least one aggregated label
from  1.
          </p>
        </sec>
        <sec id="sec-10-1-5">
          <title>We consider complexity given by the number of nodes</title>
          <p>in the graph within our experiments. This simplified
complexity measure takes only graph into account, it
their complexity is not taken into account.</p>
          <p>We evaluated the performance-complexity trade-of
for Cora, PubMed [14] and DBLP [15] datasets, where
we sub-sampled training labels – see results in Figure
6a. In order to validate the use of the posterior instead
of prior distribution, we used a distribution given by the
base model if the label was not available and if it was, a
distribution given by    + (1 −  )  ̂, where   denotes
the prior distribution of data determined by the training
labels,   ̂ the posterior distribution (base model output)
and  is a coeficient enabling mixing between the two
distributions. The results are shown in Figure 6b.</p>
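        <p>The following minimal Python sketch shows the mixed distribution used for edge sorting in this experiment; the function name and the fallback to the plain posterior for unlabelled nodes reflect our reading of the setup described above.</p>
        <preformat>
import numpy as np

# Minimal sketch of the distribution used for edge sorting in Figure 6b:
# for a labelled node the prior (one-hot training label) is mixed with the
# base-model posterior as r * prior + (1 - r) * posterior; for an unlabelled
# node the posterior alone is used.

def sorting_distribution(posterior, prior=None, r=0.4):
    posterior = np.asarray(posterior, dtype=float)
    if prior is None:                 # no training label available
        return posterior
    return r * np.asarray(prior, dtype=float) + (1.0 - r) * posterior

print(sorting_distribution([0.6, 0.3, 0.1], prior=[1.0, 0.0, 0.0], r=0.4))
# [0.76 0.18 0.06]
        </preformat>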
        <p>The complexity-performance trade-off curves in Figures 6a and 6b are equipped with 90% symmetric confidence intervals, where the interval is given by the corresponding quantiles of the Beta distribution constructed as the conjugate prior to the binomial distribution given by e errors out of n testing examples (note that (n − e)/n stands for the accuracy).</p>
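        <p>The confidence interval can be reproduced with the following minimal Python sketch, assuming a uniform Beta(1, 1) prior; SciPy's beta distribution provides the required quantiles.</p>
        <preformat>
from scipy.stats import beta

# Minimal sketch of the 90% symmetric confidence interval used for the curves:
# with e errors out of n test examples, the accuracy is modelled through a
# Beta distribution (the conjugate prior of the binomial), here with a uniform
# Beta(1, 1) prior, and the interval is given by its 5% and 95% quantiles.

def accuracy_confidence_interval(errors, n, level=0.90):
    successes = n - errors
    lo = (1.0 - level) / 2.0
    return (beta.ppf(lo, successes + 1, errors + 1),
            beta.ppf(1.0 - lo, successes + 1, errors + 1))

print(accuracy_confidence_interval(errors=20, n=200))  # roughly (0.86, 0.93)
        </preformat>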
      </sec>
      <sec id="sec-10-2">
        <title>6.3. Performance Upper-Bound versus</title>
      </sec>
      <sec id="sec-10-3">
        <title>GNN Performance</title>
        <sec id="sec-10-3-1">
          <title>The comparison of the upper-bound and true perfor</title>
          <p>mance can be found in Figures 6a and 6b. The
upperbound does not sufer from the fluctuations as the true
performance. In case when the base model is strong
enough, the working point selection according to the
upper-bound could bring reasonable results.
0
500
1000
1500
for diferent number of training labels.
case of Cora dataset. In case of 2% training labels, the The hierarchical clustering described in Section 5 presents
base model is so weak that qualitative diference between
a neat way of manually validating the graph coarsening
the edge sorting strategies disappears for all datasets. procedure. We evaluated the base model on the Karate</p>
        </sec>
        <sec id="sec-10-3-2">
          <title>Interestingly, the provided observations made on true</title>
        </sec>
        <sec id="sec-10-3-3">
          <title>Club dataset and used the proposed coarsening proce</title>
          <p>performance can be also deduced from the upper-bounds. dure according to Section 5.1. Two edges were contracted</p>
          <p>As can be seen in Figure 6b, adding prior information in each step. The resulting tree is shown in Figure 7.
makes the results consistently worse. This is probably</p>
          <p>Observing the tree, we can recognise splits (or
coarsr=0
r=0.4
r=0.8
r=1
r=0.0 (UB)
r=0.4 (UB)
r=0.8 (UB)
r=1.0 (UB)
Base Model
r=0
r=0.4
r=0.8
r=1
r=0.0 (UB)
r=0.4 (UB)
r=0.8 (UB)
r=1.0 (UB)
Base Model
r=0
r=0.4
r=0.8
r=1
r=0.0 (UB)
r=0.4 (UB)
r=0.8 (UB)
r=1.0 (UB)
Base Model
2000 4000 6000 8000 10000 12000 14000 16000</p>
          <p>Graph complexity - Number of nodes in the graph
1000 1500 2000
Graph complexity - Number of nodes in the graph
2500
4000 6000 8000 10000 12000 14000 16000 18000 20000</p>
          <p>Graph complexity - Number of nodes in the graph
(b) Performance comparison of various distribution used for sorting the edges for contracting.</p>
        </sec>
      </sec>
      <sec id="sec-10-4">
        <title>6.4. Impact of Edges Ordering on</title>
      </sec>
      <sec id="sec-10-5">
        <title>Performance Complexity Trade-Of</title>
        <p>The impact of the chosen distribution similarity measure on the performance is shown in Figure 6a. Surprisingly, even the random baseline gives good results in some cases. While no metric is the best choice in case of the PubMed and DBLP datasets, the KL divergence seems to be a superior choice in case of the Cora dataset. In case of 2% training labels, the base model is so weak that the qualitative difference between the edge sorting strategies disappears for all datasets. Interestingly, the observations made on the true performance can also be deduced from the upper-bounds.</p>
        <p>In contrast with evaluating the similarity measures, the upper-bound provides an inverse observation. As can be seen in Figure 6b, adding prior information makes the results consistently worse. This is probably caused by the fact that neighbouring nodes with the same label in the training set are merged first, and the lack of training data then causes the observed performance drop.</p>
      </sec>
      <sec id="sec-10-5">
        <title>6.5. Hierarchical Clustering</title>
        <p>The hierarchical clustering described in Section 5 presents a neat way of manually validating the graph coarsening procedure. We evaluated the base model on the Karate Club dataset and used the proposed coarsening procedure according to Section 5.1. Two edges were contracted in each step. The resulting tree is shown in Figure 7.</p>
        <p>Observing the tree, we can recognise splits (or coarsening steps) resulting from merging nodes with the same label, which is the desired behavior, because the graph size is reduced without any performance loss in such a case. However, merging differently labelled nodes unavoidably leads to a performance loss, because our label refinement procedure guarantees the same prediction for each node within the cluster (see Section 4.1).</p>
        <p>In case of the Karate Club dataset tree (Figure 7), the first four steps of the procedure merge only nodes with a matching label. However, from the fifth step onward, at least two errors occur and affect the total accuracy (one error in case of the test accuracy).</p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>7. Conclusion and Discussion</title>
      <p>We formalized the performance-complexity trade-off and described a framework for smoothly adjusting either the desired model complexity for given performance requirements, or the performance for the available computational resources. The framework consists of a methodical step-by-step graph coarsening by way of edge contraction, where we addressed practical aspects of feature and label aggregation within node merging, and also the refinement of predictions from the merged node back to the original nodes.</p>
      <sec id="sec-11-1">
        <title>We have also shown that this edge coarsening naturally produces a hierarchical clustering of the nodes in the graph.</title>
      </sec>
      <sec id="sec-11-2">
        <title>The ordering of edges for the contracting procedure</title>
        <p>is then studied in greater detail, where we suggest to
employ a base model to produce a predictive posterior
distribution for each node and then sort the edges
according to the distance of probability distributions of adjacent
nodes.</p>
      </sec>
      <sec id="sec-11-3">
        <title>We applied the proposed framework to widely used</title>
        <p>publicly available graph datasets. In particular, we
explored the coarsening procedure on the Karate Club
dataset by means of a hierarchical tree. Finally we
compared KL-divergence and cross-entropy similarity
measures for driving the edge ordering.</p>
      </sec>
      <sec id="sec-11-4">
        <title>Although our primary focus was the complexity per</title>
        <p>formance trade-of, we believe that we explored several
research directions for future works.</p>
        <p>The considered label refinement is extremely simple,
however, we can alternatively consider an arbitrary model
within each cluster. This will incur some additional cost,
but a distributive processing may be utilized since the
clusters do not overlap. Moreover, the information from
this ”local” model can be combined with the global one,
which could lead to superior performance since both
local and global graph contexts would be considered. In
addition, we would still avoid running the GNN on the
entire graph.</p>
      </sec>
      <sec id="sec-11-5">
        <title>If we focus on the hierarchical clustering, we can relax</title>
        <p>our assumption of applying a simple model on the entire
graph to produce a predictive posterior distribution.
Instead, we could consider as strong model as possible to
achieve the best possible clustering for a given use case.</p>
        <p>Within this work, we used the hierarchical clustering
as a complementary/inspection tool for the graph
coarsening. Nevertheless, it would be interesting to compare
its suitability for other purposes – e.g. distributive
processing as an alternative to [7]. In addition, the clustering
is built on a supervised setup. An unsupervised approach
based on e.g. the cosine distance of the node2vec
embedding could be an interesting alternative to [13].</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T. N.</given-names>
            <surname>Kipf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          ,
          <article-title>Semi-supervised classification with graph convolutional networks</article-title>
          ,
          <source>arXiv preprint arXiv:1609.02907</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Grover</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          , node2vec:
          <article-title>Scalable feature learning for networks</article-title>
          ,
          <source>in: Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>855</fpage>
          -
          <lpage>864</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dai</surname>
          </string-name>
          , L. Song,
          <article-title>Discriminative embeddings of latent variable models for structured data</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>2702</fpage>
          -
          <lpage>2711</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.-L.</given-names>
            <surname>Chiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Si</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-J. Hsieh</surname>
          </string-name>
          ,
          <article-title>Cluster-gcn: An efficient algorithm for training deep and large graph convolutional networks</article-title>
          ,
          <source>in: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery &amp; data mining</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>257</fpage>
          -
          <lpage>266</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Frasca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chamberlain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Eynard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bronstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Monti</surname>
          </string-name>
          , Sign:
          <article-title>Scalable inception graph neural networks</article-title>
          ,
          <source>arXiv preprint arXiv:2004.11198 7</source>
          (
          <year>2020</year>
          )
          <fpage>15</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ying</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>You</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aiken</surname>
          </string-name>
          ,
          <article-title>Redundancy-free computation for graph neural networks</article-title>
          ,
          <source>in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>997</fpage>
          -
          <lpage>1005</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Karypis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Metis: A software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices (</article-title>
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zheng</surname>
          </string-name>
          , C. Ma,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , G. Karypis,
          <article-title>Distdgl: distributed graph neural network training for billion-scale graphs</article-title>
          ,
          <source>in: 2020 IEEE/ACM 10th Workshop on Irregular Applications: Architectures and Algorithms (IA3)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>36</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          , Y. Cheng,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          , J. Cheng, H.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>S. Zhang,</given-names>
          </string-name>
          <article-title>Bytegnn: efficient graph neural network training at large scale</article-title>
          ,
          <source>Proceedings of the VLDB Endowment</source>
          <volume>15</volume>
          (
          <year>2022</year>
          )
          <fpage>1228</fpage>
          -
          <lpage>1242</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Perozzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          , S. Skiena, HARP:
          <article-title>Hierarchical Representation Learning for Networks</article-title>
          ,
          <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
          <volume>32</volume>
          (
          <year>2018</year>
          ).
          <source>Number: 1.</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dedič</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bajer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Repický</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Procházka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Holeňa</surname>
          </string-name>
          ,
          <article-title>Adaptive graph coarsening in the context of local graph quality</article-title>
          , Accepted to CEUR Workshop Proceedings (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P. F.</given-names>
            <surname>Brown</surname>
          </string-name>
          , V. J. Della Pietra,
          <string-name>
            <given-names>P. V.</given-names>
            <surname>Desouza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Mercer</surname>
          </string-name>
          ,
          <article-title>Class-based n-gram models of natural language</article-title>
          ,
          <source>Computational linguistics 18</source>
          (
          <year>1992</year>
          )
          <fpage>467</fpage>
          -
          <lpage>480</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Charpentier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Galland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hollocou</surname>
          </string-name>
          ,
          <article-title>Hierarchical graph clustering using node pair sampling</article-title>
          , arXiv preprint arXiv:
          <year>1806</year>
          .
          <volume>01664</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sen</surname>
          </string-name>
          , G. Namata,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bilgic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Getoor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Galligher</surname>
          </string-name>
          ,
          <string-name>
            <surname>T.</surname>
          </string-name>
          Eliassi-Rad,
          <article-title>Collective classification in network data</article-title>
          ,
          <source>AI</source>
          Magazine
          <volume>29</volume>
          (
          <year>2008</year>
          )
          <fpage>93</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , L. Yao,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <article-title>Arnetminer: Extraction and mining of academic social networks</article-title>
          ,
          <source>in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          , KDD '08,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2008</year>
          , p.
          <fpage>990</fpage>
          -
          <lpage>998</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>