<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Performance for a Given Downstream Task?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pavel Procházka</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michal Mareš</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marek Dědič</string-name>
          <email>marek@dedic.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cisco Systems, Inc.</institution>
          ,
          <addr-line>Karlovo náměstí 10, Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Czech Technical University in Prague</institution>
          ,
          <addr-line>Technická 2, Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Czech Technical University in Prague</institution>
          ,
          <addr-line>Trojanova 13, Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Machine learning algorithms on graphs, in particular graph neural networks, have become a popular framework for solving various tasks on graphs, attracting significant interest in the research community in recent years. As presented, however, these algorithms usually assume that the input graph is fixed and well-defined and do not consider the problem of constructing the graph for a given practical task. This work proposes a methodical way of linking graph properties with the performance of a GNN solving a given task on such a graph, via a surrogate regression model that is trained to predict the performance of the GNN from the properties of the graph dataset. Furthermore, the GNN model hyper-parameters are optionally added as additional features of the surrogate model, and it is shown that this technique can be used to solve the practical problem of hyper-parameter tuning. We experimentally evaluate the importance of graph properties as features of the surrogate model with regard to the node classification task for several common graph datasets and discuss how these results can be used for graph composition tailored to the given task. Finally, our experiments indicate a significant gain of the proposed hyper-parameter tuning method compared to the reference grid-search method. Keywords: graph neural networks; model performance prediction; hyper-parameter tuning; node classification.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>1. Introduction</title>
      <p>Across a wide variety of applications and domains, graphs emerge as a ubiquitous way of organizing data. Consequently, machine learning on graphs has, in recent years, seen an explosion in popularity, breadth and depth of both research and applications. At the same time, the underlying graph topology has, until recent works [1, 2], received much less attention. Specifically, the organization of data points into nodes and edges of a graph is usually assumed to be given, unambiguous, and well-defined, especially in works utilizing common, publicly available graph datasets that have a pre-defined topology. While in the research environment this may be considered beneficial in simplifying the comparison of various graph-based methods, in many practical applications, the mapping from data to graphs is a non-trivial and open problem. An example of such an application domain is computer network security, where a graph representation of network telemetry may contain entities of various types (users, servers, IP addresses), the edges may represent either a physical connection between two entities or a more general similarity or distance measure, both the nodes and edges may have associated with them additional features, the full dataset may be prohibitively large to efficiently process, and some data points may not …</p>
      <p>In our work, we investigate the problem of creating a … To solve the problem of predicting the suitability of a representation to a given task, we represent the graph dataset by its properties (Section 2.2), that are aggregate values representing the whole graph dataset instead of individual nodes or edges. A GNN model [3] is trained to solve the task, and its performance is measured using several metrics. The main aim of this work is to extract information about the usefulness of individual graph properties from a meta-model that is trained on the graph dataset properties to predict the GNN performance represented by the performance metrics. In the reported research, we propose and evaluate … however, it forms a useful and general tool for evaluating the suitability of graph datasets to tasks on them, which is a basic prerequisite to solving the more general problem of constructing an advantageous graph representation.</p>
      <p>ITAT’23: Information technologies – Applications and Theory, September … ∗Corresponding author. ORCID: 0000-0003-1021-8428 (M. Dědič)</p>
      <p>1.2. Related work
Machine learning model performance prediction is commonly used to avoid the expensive evaluation of the original model on the test set [4]. However, the problem of trust in these meta-models limits their applicability in real-world scenarios. To address this problem, the authors in [5] propose attaching prediction uncertainty to the meta-models and suggest a method for evaluating this uncertainty. In [6], the authors observe that state-of-the-art shift detection metrics (referred to as graph properties in our paper) do not generalize well across datasets, and they propose incorporating error predictors. In this paper, we address both the trust and generalization problems. The novelty of this paper lies in our use of the meta-model: firstly, for interpreting the graph properties that drive the model’s performance, and secondly, for hyper-parameter tuning. To the best of our knowledge, there is no existing use of the meta-model for these purposes in the current state of the art.</p>
      <p>Graph theory encompasses various numeric graph properties, ranging from basic ones such as the number of nodes, to more sophisticated metrics like graph curvature [1]. In this paper, we select a subset of these metrics, listed in Table 1, as features for the meta-model.</p>
      <p>Graph Neural Networks (GNNs) [3] achieve superior performance on graph datasets. However, this performance often comes at the cost of high computational resources required for training. Additionally, the large configuration space of these models necessitates non-trivial resources for fine-tuning. Our research aims to reduce the required computational resources in two ways. Firstly, we attempt to construct a graph dataset from the source data with favorable properties for GNN execution. Secondly, the proposed hyper-parameter search aims to reduce computational resources during fine-tuning.</p>
      <p>Shapley Additive Explanations (SHAP) [7] is a framework for explaining predictions of any model based on coalition game theory concepts introduced in [8]. An additional benefit of this framework is its ability to show whether low or high values of the input variables contribute to low/high predictions of the model. In this paper, we adopt the SHAP framework for model explanation.</p>
      <p>1.3. Contribution
• We propose a method to identify important graph dataset properties using the meta-model.
• We present a hyper-parameter tuning method based on the meta-model.
• We experimentally verify the generalization capability of the meta-model.
• We evaluate the importance of graph properties and their impact on GNN performance.
• We experimentally validate the hyper-parameter tuning approach with very promising results.</p>
      <p>2. Graph representation for GNN performance prediction
2.1. Notation and definitions
Consider an undirected graph G = (V, E, X) with nodes V, edges E ⊆ V², and real-valued node features X ∈ ℝ^(N×d), where N = |V|. In this work, we limit the definition of a graph task to be one of transductive node classification; however, the method as defined is general and can be applied to other tasks such as inductive node classification or link prediction. In the transductive setting, a task on graph G can be viewed as an assignment of node labels y (belonging to one of c classes) to the graph. Using a model M, a prediction ŷ = M(G) is obtained for the task and compared to the ground truth using a performance metric m(y, ŷ).</p>
      <p>2.2. Graph representation
Our goal is to find a set of graph dataset properties P such that those properties would only keep global-level information about the graph G and at the same time provide as much information as possible about the performance m obtainable on G. We offer a range of graph dataset properties (see Table 1). We categorize these properties into three types of information. Specifically, these properties can convey information regarding: 1) node attributes, 2) graph structure, 3) the specified task, or any combination thereof (awareness in Table 1).</p>
      <p>Apart from basic graph properties and well-established metrics on graphs, we consider some additional graph properties for better description. In order to define these additional non-standard properties formally, we denote V_c ⊆ V the set of nodes belonging to the class c and |V_c| its size. The mean attribute vector over the class c is then given as x̄_c = (1/|V_c|) Σ_{v∈V_c} x_v, where x_v denotes the attribute vector of the corresponding node v. Finally, we define the mean squared distance between attributes in class c₁ and the mean of attributes in class c₂ (attribute similarity) as

s(c₁, c₂) = (1/|V_{c₁}|) Σ_{v∈V_{c₁}} (x_v − x̄_{c₂})².   (1)

This asymmetric quantity is used to express similarity between attributes based on the task (see Table 1).</p>
    </sec>
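<p>Equation (1) can be made concrete with a small plain-Python sketch (the function and variable names here are illustrative, not from the paper):

```python
def class_mean(attrs, labels, c):
    # Mean attribute vector over the nodes of class c.
    members = [a for a, y in zip(attrs, labels) if y == c]
    dim = len(members[0])
    return [sum(a[i] for a in members) / len(members) for i in range(dim)]

def attribute_similarity(attrs, labels, c1, c2):
    # s(c1, c2): mean squared distance between the attributes of nodes in
    # class c1 and the mean attribute vector of class c2, per Equation (1).
    mean_c2 = class_mean(attrs, labels, c2)
    members = [a for a, y in zip(attrs, labels) if y == c1]
    return sum(
        sum((ai - mi) ** 2 for ai, mi in zip(a, mean_c2)) for a in members
    ) / len(members)
```

Note the asymmetry: attribute_similarity(attrs, labels, 1, 0) and attribute_similarity(attrs, labels, 0, 1) generally differ, which is exactly the property used by the last rows of Table 1.</p>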
<sec id="sec-2">
      <title>Table 1: Graph dataset properties used as meta-model features</title>
      <p>Each property is listed with its awareness of (node attributes / graph structure / task).
Node count (No/No/No): number of nodes – dataset size.
Class ratio (No/No/Yes): ratio between the number of positive and negative nodes.
Number of components (No/Yes/No): number of connected components of the graph.
Average node degree (No/Yes/No): average node degree in the graph.
Global assortativity (No/Yes/No): measure of the tendency of nodes to connect with other similar nodes, rather than dissimilar nodes [9].
Attribute similarity (Yes/Yes/No): average cosine similarity of attributes across all edges in the graph.
Attribute homophily (No/Yes/Yes): measure of how clustered together are nodes with similar attributes [10].
Edge homophily (No/Yes/Yes): fraction of edges connecting nodes of the same class [11].
Node homophily (No/Yes/Yes): fraction of node neighbours having the same class as the node in question, averaged over all nodes [12].
Class homophily (No/Yes/Yes): a modification of node homophily that is invariant to the number of classes [13].
Ratio of positive nodes of degree &gt; 1 (No/Yes/Yes): ratio of positive nodes with degree greater than one.
Fraction of positive nodes of degree &gt; 2 (No/Yes/Yes): the fraction of positive nodes with degree greater than two, out of those with degree greater than one.
Average positive node degree (No/Yes/Yes): average node degree in the sub-graph restricted to nodes from V_1.
Relative presence of positive edges (No/Yes/Yes): number of edges connecting positive nodes, divided by the number of edges that would be present in a theoretical clique constructed of all positive nodes.
Positive attribute similarity (Yes/No/Yes): s(1, 1) – see Equation (1).
Positive to negative attribute similarity (Yes/No/Yes): s(1, 0)/s(1, 1) – see Equation (1).
Negative to positive attribute similarity (Yes/No/Yes): s(0, 1)/s(1, 1) – see Equation (1).</p>
    </sec>
    <sec id="sec-3">
      <title>2.3. GNN performance prediction</title>
      <p>Based on the graph dataset properties, we consider a meta-model M_meta, which makes a prediction m̂ of the true performance m based on the properties {P}.</p>
    </sec>
    <sec id="sec-4">
      <title>2.4. Multiple binary classification</title>
      <p>To train and evaluate the meta-model, a sufficient dataset is needed. Given that the individual points in this dataset themselves correspond to graph-task pairs and models trained on them, obtaining such a dataset for the meta-model is computationally expensive. To aid with its creation, only binary classification tasks were considered, where for datasets with more than 2 classes, multiple tasks were constructed by taking one class as positive and the other classes as negative, for each class in the original dataset. This procedure has its motivation in applications, where e.g. in the domain of computer security, a classifier distinguishing each particular kind of malware is a useful addition to a general malware classifier.</p>
    </sec>
    <sec id="sec-5">
      <title>2.5. Measuring graph property usefulness</title>
      <p>We train a regression meta-model on a dataset consisting of graph properties (features of the meta-model) and the corresponding GNN performance metric (label for the meta-model). If the regression model generalizes well, we consider the graph properties that are important for the meta-model prediction to also be important for the GNN performance on a graph with the given properties.</p>
      <p>By evaluating the meta-model’s performance on the test set and determining the important features of the meta-model (e.g., using SHAP), we propose applying the meta-model explanation to determine the impact of individual graph properties on the GNN performance. The validity of this claim is assessed through the meta-model’s performance on the test set.</p>
    </sec>
    <sec id="sec-6">
      <title>2.6. Hyper-parameter optimization</title>
      <p>In order to apply the meta-model to hyper-parameter optimization, the hyper-parameter values are added to the features of the meta-model.</p>
    </sec>
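<p>Two of the task-aware properties in Table 1 are cheap to compute directly from an edge list; a plain-Python sketch (the edge list and label list are illustrative inputs, not the paper's code):

```python
from collections import defaultdict

def edge_homophily(edges, labels):
    # Fraction of edges connecting nodes of the same class [11].
    return sum(1 for u, v in edges if labels[u] == labels[v]) / len(edges)

def node_homophily(edges, labels):
    # Fraction of a node's neighbours sharing its class, averaged over
    # all nodes that have at least one neighbour [12].
    neighbours = defaultdict(list)
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    per_node = [
        sum(labels[n] == labels[v] for n in ns) / len(ns)
        for v, ns in neighbours.items()
    ]
    return sum(per_node) / len(per_node)
```

The two measures deliberately differ: edge homophily weights high-degree nodes more, while node homophily averages per node.</p>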
<sec id="sec-17">
      <title>-</title>
      <p>[Figure 1: Construction of the meta-model dataset. For each task and hyper-parameter setup, the basic graph properties, the task-specific properties and the GNN hyper-parameters form one data-point, labelled by the measured GNN performance.]</p>
      <p>3. Experimental evaluation
3.1. Experiment description</p>
    </sec>
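<p>The multiple-binary-classification construction of Section 2.4 (one task per class, with that class positive and all others negative) can be sketched as:

```python
def binary_tasks(labels):
    # For a dataset with k classes, build k binary label vectors:
    # class c becomes positive (1), every other class negative (0).
    return {
        c: [1 if y == c else 0 for y in labels]
        for c in sorted(set(labels))
    }
```

For a 40-class dataset such as ArXiv this yields 40 binary tasks, which is how a single graph contributes many data-points to the meta-model dataset.</p>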
<sec id="sec-18">
      <title>Table 2: Datasets used in the experiments, with the number of classes</title>
      <p>ArXiv [14] – 40 classes; Flickr [15] – 7; Computers [16] – 10; Pubmed [17] – 3; DBLP [18] – 4; Squirrel [19] – 5; Cora [17] – 70.</p>
    </sec>
    <sec id="sec-22">
      <title>-</title>
<p>We use the design space [20] to run the GNN for each dataset, with h_v^(0) = x_v being the feature vector corresponding to the node v ∈ V. The parameters of the design space are described in Table 3 and N(v) denotes the 1-hop neighbourhood of the node v.</p>
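<p>The GNN layer equation itself did not survive extraction; purely as an illustration, a generic mean-aggregation step over the 1-hop neighbourhood N(v), starting from h_v^(0) = x_v, might look like (a sketch, not the paper's exact layer):

```python
def mean_aggregation_step(h, neighbours):
    # h: node -> feature vector (h_v^(0) = x_v for the first layer);
    # neighbours: node -> list of 1-hop neighbours N(v).
    out = {}
    for v, ns in neighbours.items():
        agg = [sum(h[u][i] for u in ns) / len(ns) for i in range(len(h[v]))]
        # Combine self and neighbourhood information (a plain average here;
        # real GNN layers apply learned weights and a non-linearity).
        out[v] = [(a + b) / 2 for a, b in zip(h[v], agg)]
    return out
```
</p>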
<sec id="sec-22-1">
        <title>-</title>
        <p>[Figure 3: True log loss of the GNN vs. the meta-model prediction, for the datasets ArXiv, Computers, CoraFull, DBLP, Flickr, PubMed and Squirrel. Train: RMSE = 1.32e-02, Corr = 0.99; Test: RMSE = 6.10e-02, Corr = 0.92.]</p>
        <p>The tuple (basic graph properties, target class specific
properties, hyper-parameters, performance measured on
the test nodes) constitutes a datapoint in the final dataset
(see Figure 1).</p>
<p>Our meta-model is a random forest regression model
with 100 trees, a mean-squared-error split criterion,
and at most 30% of features considered at each split.
3.2. GNN performance prediction based
on graph properties
Given a dataset described in Section 3.1, we train a
regression meta-model predicting the performance metric
using the graph properties. In this experiment, we
consider a selection of the best performance over all available
hyper-parameter settings for each graph and target class,
where only graph properties are used for training.</p>
<p>The dataset was split randomly into training and testing subsets of 93 and 46 data-points, respectively. The meta-model was optimised to minimize the MSE between the model prediction and the true performance of the GNN.</p>
        <p>We evaluated Spearman correlation of the considered
graph properties with the target metrics within the dataset
(see Figure 2). The results show very high correlation
between the log loss and the Brier Score and also high
correlation between the ROC AUC and precision at 10%
of positive nodes. Additionally, graph properties
correlate better with log loss and Brier score, indicating better
performance in predicting these metrics (which is later
confirmed in the experiments).</p>
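<p>The Spearman correlation used here is just the Pearson correlation of ranks; a dependency-free sketch:

```python
def rank(xs):
    # Average ranks; tied values share the mean of their positions.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    groups = {}
    for pos, idx in enumerate(order):
        groups.setdefault(xs[idx], []).append((pos, idx))
    for members in groups.values():
        avg = sum(pos for pos, _ in members) / len(members) + 1
        for _, idx in members:
            ranks[idx] = avg
    return ranks

def spearman(xs, ys):
    # Pearson correlation computed on the ranks of xs and ys.
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```
</p>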
        <p>The results of our meta-model predicting the true log
loss of the GNN are shown in Figure 3. We can see quite
decent performance on the testing set. Although we do
not use this performance directly for any task, it provides
us with the important information that the meta-model
does not just memorize the training set and indeed uses
the graph properties to model the true performance of
the GNN. Based on this generalization ability, we claim
that graph properties are driving the decision of both the
meta-model as well as the underlying GNN (Section 2.5).</p>
<p>Based on the SHAP explanation of the meta-model, we
evaluated how individual graph properties affect the final
GNN performance (see results in Figures 4 and 5 for
log loss and ROC AUC). We can observe that the most
important graph properties differ for each task.</p>
        <p>As expected, a higher homophily (all of its variants)
contributes to better performance. Interestingly, a higher
class ratio leads to better performance in ROC-AUC
prediction but worsens the performance in log-loss
prediction. Although these observations are very interesting
and essentially answer our research question, we should
also consider the limitations of these results. Firstly, their
validity is conditioned by the validity of our hypothesis
assuming that the explanation of the meta-model holds
for the task itself. Secondly, as the graph properties are
not independent of each other (see Figure 2), the impact of
one particular property can be reflected in the importance
of multiple correlated properties. We leave a deeper
investigation of these limitations and their impact for future
work.
3.3. Hyper-parameter optimization
In this experiment, we use the dataset generation method
described in Figure 1 for each graph from Table 2. We
randomly split the data into train and test sets with a ratio r,
so that the training set size is given by round(rN), where N is the
dataset size. We provide 100 realizations of this split for
each ratio  . In each realization, we train the meta-model
on the training set and calculate predictions on the test
set. We consider the performance based on the following
hyper-parameter selection procedures:
• Reference (random search): We select the best
performance on the training set plus one sample
from the test set to ensure fair comparison.
• Ours: We find the hyper-parameter setup
achieving the best performance prediction on the test
set, evaluate the true corresponding performance,
and select the best performance on the training
set along with the evaluated one.
• Optimum: We select the best performance from
both the test and training sets.
• Ours - Cross-datasets: Similar to “ours” method,
but we consider all graph datasets except the
evaluated one for training.</p>
        <p>The mean of the resulting performance over
realisations is reported in Figure 6. In addition to the
aforementioned hyper-parameter selection procedures, we
consider one more reference (best hyper-parameter) by
selecting a single hyper-parameter setup with the best
average performance over all binary tasks for each dataset,
ensuring that the model does more than just learn the
best setup for a specific dataset.</p>
      </sec>
<sec id="sec-22-2">
        <title>-</title>
        <p>[Figure 4: SHAP values (impact on model output) of the graph properties for the meta-model predicting log loss. Features from most to least important: node homophily, edge homophily, class ratio, relative presence of positive edges, positive to negative attribute covariance, average positive node degree, number of components, fraction of positive nodes of degree &gt; 2, positive class attribute variance, and the sum of 6 other features.]</p>
        <p>[Figure 5: SHAP values of the graph properties for the meta-model predicting ROC AUC. Features from most to least important: class homophily, node count, positive to negative attribute covariance, number of components, average positive node degree, average node degree, edge homophily, positive class attribute variance, class ratio, and the sum of 6 other features.]</p>
      </sec>
      <sec id="sec-22-22">
<title>-</title>
<p>As we can see, the suggested method (“ours”) outperforms the reference in almost all cases, resulting in a significant difference, for example, in the Cora dataset. However, the most interesting result is achieved by the “ours cross-datasets” method. This method is evidently able to learn the optimal parameter setup from the graph properties, since it achieves nearly optimal performance across all datasets. The comparison to the best hyper-parameter reference method ensures that the meta-model did not simply learn the global solution for all datasets.</p>
        <p>4. Conclusion
We propose a systematic approach to linking graph properties with the corresponding GNN performance using a simple meta-model. This meta-model is trained to predict the true performance based on the graph properties. We experimentally validated the generalization capability of this meta-model on common datasets in the graph research community. By interpreting the meta-model’s explanations, we identified graph properties that influence the meta-model’s behavior and claim that this interpretation also applies to the impact on GNN performance. We evaluated these properties and found that they align with our expectations.</p>
        <p>The meta-model predictions were also utilized to solve the hyper-parameter optimization problem. Leveraging the fact that the meta-model is computationally cheaper compared to the GNN, we demonstrated that relying on the meta-model’s predictions can lead to superior performance compared to the reference random search method. Specifically, when the meta-model incorporates knowledge from other graph datasets, we achieved almost optimal performance even without seeing any data points from the target dataset. This indicates that the model is capable of learning solely from the graph properties.</p>
        <p>The proposed hyper-parameter search method can potentially be extended beyond graph datasets, where we train the meta-model on suitable properties of the given dataset. However, in this paper, we only scratched the surface of this topic, which warrants further research and an in-depth survey of available works on hyper-parameter optimization. In the context of this paper, we view it as a validation of the concept of learning a meta-model based on graph properties. Nonetheless, the presented results offer a practical approach to solving the hyper-parameter search problem for graph datasets.</p>
        <p>References
[7] S. M. Lundberg, S.-I. Lee, A Unified Approach to Interpreting Model Predictions, in: Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017.
[8] L. S. Shapley, Notes on the N-Person Game — II: The Value of an N-Person Game, Technical Report, RAND Corporation, 1951.
[9] M. E. J. Newman, Mixing patterns in networks, Physical Review E 67 (2003) 026126.
[10] L. Yang, et al., Diverse Message Passing for Attribute with Heterophily, in: Advances in Neural Information Processing Systems, volume 34, Curran Associates, Inc., 2021, pp. 4751–4763.
[11] J. Zhu, et al., Beyond Homophily in Graph Neural Networks: Current Limitations and Effective Designs, in: Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 7793–7804.
[12] H. Pei, et al., Geom-GCN: Geometric Graph Convolutional Networks, 2020. ArXiv:2002.05287 [cs, stat].
[13] D. Lim, et al., Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods, in: Advances in Neural Information Processing Systems, volume 34, Curran Associates, Inc., 2021, pp. 20887–20902.
[14] W. Hu, et al., Open Graph Benchmark: Datasets for Machine Learning on Graphs, 2021. ArXiv:2005.00687 [cs, stat].
[15] H. Zeng, et al., GraphSAINT: Graph Sampling Based Inductive Learning Method, in: International Conference on Learning Representations, 2019.
[16] O. Shchur, et al., Pitfalls of Graph Neural Network Evaluation, 2019. ArXiv:1811.05868 [cs, stat].
[17] Z. Yang, W. Cohen, R. Salakhudinov, Revisiting Semi-Supervised Learning with Graph Embeddings, in: Proceedings of The 33rd International Conference on Machine Learning, PMLR, New York, NY, USA, 2016, pp. 40–48.
[18] A. Bojchevski, S. Günnemann, Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking, in: 6th International Conference on Learning Representations, 2018.
[19] B. Rozemberczki, C. Allen, R. Sarkar, Multi-Scale attributed node embedding, Journal of Complex Networks 9 (2021) cnab014.
[20] J. You, Z. Ying, J. Leskovec, Design Space for Graph Neural Networks, in: Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 17009–17021.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Topping</surname>
          </string-name>
          , et al.,
          <article-title>Understanding over-squashing and bottlenecks on graphs via curvature</article-title>
          ,
          <source>in: The Tenth International Conference on Learning Representations</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Veličković</surname>
          </string-name>
, Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Hamilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ying</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          ,
          <article-title>Inductive representation learning on large graphs</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Guerra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B.</given-names>
            <surname>Prudêncio</surname>
          </string-name>
,
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Ludermir</surname>
          </string-name>
          ,
          <article-title>Predicting the performance of learning algorithms using support vector machines as meta-regressors</article-title>
          ,
          <source>in: Artificial Neural Networks-ICANN</source>
          <year>2008</year>
          : 18th International Conference, Prague, Czech Republic,
          <source>September 3-6</source>
          ,
          <year>2008</year>
          , Proceedings,
          <source>Part I 18</source>
          , Springer,
          <year>2008</year>
          , pp.
          <fpage>523</fpage>
          -
          <lpage>532</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Elder</surname>
          </string-name>
          , et al.,
          <source>Learning Prediction Intervals for Model Performance, Proceedings of the AAAI Conference on Artificial Intelligence</source>
          <volume>35</volume>
          (
          <year>2021</year>
          )
          <fpage>7305</fpage>
          -
          <lpage>7313</lpage>
          . Number:
          <volume>8</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Maggio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bouvier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dreyfus-Schmidt</surname>
          </string-name>
          ,
          <article-title>Performance Prediction Under Dataset Shift</article-title>
          , in:
          <year>2022</year>
26th International Conference on Pattern Recognition (ICPR), pp. 2466-2474. ISSN: 2831-7475.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>