=Paper=
{{Paper
|id=Vol-2962/paper50
|storemode=property
|title=Experimental Investigation of Neural and Weisfeiler-Lehman-Kernel Graph Representations for Downstream Classification
|pdfUrl=https://ceur-ws.org/Vol-2962/paper50.pdf
|volume=Vol-2962
|authors=Sergej Borisov,Marek Dědič,Martin Holeňa
|dblpUrl=https://dblp.org/rec/conf/itat/BorisovDH21
}}
==Experimental Investigation of Neural and Weisfeiler-Lehman-Kernel Graph Representations for Downstream Classification ==
Experimental Investigation of Neural and Weisfeiler-Lehman-Kernel Graph Representations for Downstream SVM-Based Classification

Sergej Borisov (1), Marek Dědič (2,3), and Martin Holeňa (4)

(1) Faculty of Mathematics and Physics, Charles University, Prague, borisov@cs.cas.cz
(2) Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University, Prague, marek@dedic.eu
(3) Cisco Cognitive Intelligence, Cisco Systems, Inc., Prague
(4) Institute of Computer Science, Academy of Sciences, Prague, martin@cs.cas.cz

Abstract: Graphs are one of the most ubiquitous kinds of data. However, data analysis methods have been developed primarily for numerical data, and to make use of them, graphs need to be represented as elements of some Euclidean space. An increasingly popular way of representing them in this way are graph neural networks (GNNs). Because data analysis applications typically require identical results for isomorphic graphs, the representations learned by GNNs also need to be invariant with respect to graph isomorphism. That motivated recent research into the possibilities of recognizing non-isomorphic pairs of graphs by GNNs, primarily based on the Weisfeiler-Lehman (WL) isomorphism test. This paper reports the results of a first experimental comparison of four variants of two important GNNs based on the WL test from the point of view of graph representation for downstream classification by means of a support vector machine (SVM). Those methods are compared not only with each other, but also with a recent generalization of the WL subtree kernel. For all GNN variants, two different representations are included in the comparison. The comparison revealed that the four considered representations of the same kind of GNN never significantly differ. On the other hand, there was always a statistically significant difference between representations originating from different kinds of GNNs, as well as between any representation originating from any of the considered GNNs and the representation originating from the generalized WL kernel.

Keywords: graph representation learning, graph neural networks, message-passing networks, Weisfeiler-Lehman isomorphism test, Weisfeiler-Lehman subtree kernel

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

At the present time, graph-structured data are one of the most ubiquitous kinds of data. However, from the point of view of data analysis and knowledge discovery in data, they have received attention only during the last decades. Therefore, data analysis methods for common tasks such as classification, regression and clustering have been developed primarily for numerical data, and if they are to be used for graph-structured data, graphs need to be represented as vectors in some Euclidean space. This requirement is not specific to graphs, but is also encountered with other kinds of non-numerical data, the best-known example probably being textual data. The seminal papers about graph representation [20] and [9], which introduced, respectively, the neighbourhood sampling strategies DeepWalk and node2vec, were strongly influenced by the Skip-gram model for text, implemented by the word2vec algorithm [16, 17].

An increasingly popular way of representing graphs by vectors is using graph neural networks [10]. Typically, downstream applications are desired to provide identical results for isomorphic graphs. Consequently, the representations learned by GNNs need to be invariant with respect to isomorphism. That motivated recent research into the possibilities of recognizing non-isomorphic pairs of graphs by GNNs [1, 3, 6, 11, 15, 18, 29], primarily based on the classical WL isomorphism test, which is capable of revealing such pairs in many situations [27].

The WL test iteratively constructs neighbourhood subtrees rooted in graph vertices. Such a construction can also be used for graph kernels [22]. In particular, [24] introduced the WL graph kernels, of which the most relevant to our work is the WL subtree kernel. That kernel was recently generalized to the relaxed WL kernel [23]. Whereas the kernel itself evaluates the graph with a scalar value, the rooted subtrees used in the kernel definition can easily be represented with vectors of non-negative numbers.
Therefore, the WL kernels can be viewed as a kernel counterpart of representation learning with WL-test-based GNNs.

This work-in-progress paper reports a first experimental comparison of two important WL-test-based GNNs with the general relaxed WL kernel from the point of view of graph representation for downstream classification by means of an SVM. Of each GNN, four variants were available, and for all employed GNN variants, two different representations were included in the comparison. The comparison was performed on 20 graphsets created from benchmark datasets, and its results were tested for statistical significance.

The next section gives an overview of message-passing neural networks, which are the most common kind of GNNs, and at the same time the kind to which the known WL-test-based GNNs belong. It also recalls the principles of the relaxed WL kernel. The key part of the paper is Section 3, in which the performed experimental comparison is described and its first results are presented. Finally, Section 4 concludes the paper and indicates possible further research.

2 Methodology

2.1 Preliminaries

In this comparison, we consider undirected graphs with labelled and coloured vertices. More precisely, graphs G = (V, E, λ) such that

- E ⊆ {X ⊆ V : #X = 2}, where V is the set of vertices, E is the set of edges, and #X denotes the cardinality of X;
- λ : V → R^{d_λ} denotes a general labelling;
- ∃ c : V → {0,1}^{d_c}, called a colouring, where d_c ≤ d_λ, ∀v ∈ V : (c(v))_1 + ... + (c(v))_{d_c} = 1, and c fulfils c = (λ_1, ..., λ_{d_c}), i.e. the colouring is part of the general labelling, which may in addition also contain other components.

Using this notation and the symbol N(v) for the neighbours of a vertex v in the graph G, the WL isomorphism test [27] can be described. The test consists in an iterative algorithm applied to pairs G, G' of graphs with equal numbers of vertices and such that if G and G' are isomorphic, then the isomorphism preserves their colourings. In each iteration, the colourings are updated, taking into account the neighbours of all vertices, in such a way that an isomorphism preserves also the updated colourings. Hence, if after a finite number of iterations some value of the updated colouring occurs in the two graphs with different frequencies, then the isomorphism of G and G' can be rejected. However, it can never be definitely confirmed. The pseudocode of the WL test is given in Algorithm 1.

Algorithm 1: Weisfeiler-Lehman isomorphism test

Require: graphs G, G' with sets of vertices V, V', sets of edges E, E', and colourings c, c' such that V ∩ V' = ∅ and #V = #V'; a maximal number of iterations h ≤ #V.
1: Define the colouring c_0 : V ∪ V' → {0,1}^{d_c} ∪ {0,1}^{d_{c'}} by c_0(v) = c(v) if v ∈ V, and c_0(v) = c'(v) if v ∈ V'.   (1)
2: Set Σ_0 = c_0(V ∪ V').
3: Order Σ_0 as σ_0^1, ..., σ_0^{#Σ_0}.
4: if there exists j = 1, ..., #Σ_0 such that #{v ∈ V : c_0(v) = σ_0^j} ≠ #{v ∈ V' : c_0(v) = σ_0^j} then
5:     return the fact that G and G' are not isomorphic
6: else
7:     set i = 1
8:     define the colouring c_1 : V ∪ V' → Σ_0 × N_0^{#Σ_0} by c_1(v) = (c_0(v), (s_1, ..., s_{#Σ_0})), where s_j = #{u ∈ N(v) : c_0(u) = σ_0^j} for j = 1, ..., #Σ_0   (2)
9:     set Σ_1 = c_1(V ∪ V')
10:    order Σ_1 as σ_1^1, ..., σ_1^{#Σ_1}
11: end if
12: while i < h and #{v ∈ V : c_i(v) = σ_i^j} = #{v ∈ V' : c_i(v) = σ_i^j} for j = 1, ..., #Σ_i do
13:    increment i = i + 1
14:    define the colouring c_i : V ∪ V' → Σ_{i-1} × N_0^{#Σ_{i-1}} by c_i(v) = (c_{i-1}(v), (s_1, ..., s_{#Σ_{i-1}})), where s_j = #{u ∈ N(v) : c_{i-1}(u) = σ_{i-1}^j} for j = 1, ..., #Σ_{i-1}   (3)
15:    set Σ_i = c_i(V ∪ V')
16:    order Σ_i as σ_i^1, ..., σ_i^{#Σ_i}
17: end while
18: if there exists j = 1, ..., #Σ_i such that #{v ∈ V : c_i(v) = σ_i^j} ≠ #{v ∈ V' : c_i(v) = σ_i^j} then
19:    return the information that G and G' are not isomorphic
20: else
21:    return the information that the WL test did not reject the isomorphism of G and G' within h iterations
22: end if
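The colour-refinement loop of Algorithm 1 is straightforward to prototype. The following is a minimal Python sketch of the test for a pair of graphs given as adjacency lists with hashable initial colours; it illustrates the refinement idea and is not the implementation used in the experiments reported below.

```python
from collections import Counter

def wl_test(adj_g, col_g, adj_h, col_h, h):
    """One-dimensional WL test on two graphs given as adjacency lists.

    adj_g, adj_h: dict vertex -> list of neighbouring vertices
    col_g, col_h: dict vertex -> hashable initial colour
    h: maximal number of refinement iterations
    Returns False if non-isomorphism is detected, True if the test
    cannot reject isomorphism within h iterations.
    """
    cg, ch = dict(col_g), dict(col_h)
    for _ in range(h + 1):
        # Compare the colour histograms of the two graphs (lines 4 and 18).
        if Counter(cg.values()) != Counter(ch.values()):
            return False
        # Refine: each vertex keeps its old colour and appends the sorted
        # multiset of neighbour colours (lines 8 and 14).
        cg = {v: (cg[v], tuple(sorted(cg[u] for u in adj_g[v]))) for v in adj_g}
        ch = {v: (ch[v], tuple(sorted(ch[u] for u in adj_h[v]))) for v in adj_h}
    return True
```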
Denote A_G and A_{G'} the adjacency matrices of G and G', respectively, and 1_{#V} the vector (1, ..., 1) of length #V. Then the situation that the WL isomorphism test cannot reject the isomorphism of G and G' can be characterized by the following three equivalent conditions [5]:

(i) the WL test does not reject the isomorphism of G and G' within #V iterations;
(ii) for each tree T, the number of homomorphisms from T to G equals the number of homomorphisms from T to G';
(iii) there exists a matrix X ∈ [0,1]^{#V × #V} solving the following system of linear equations:

A_G X = X A_{G'},   X 1_{#V} = 1_{#V},   1_{#V}^T X = 1_{#V}^T.   (4)

Observe that if G and G' are really isomorphic, then the equations (4) are solved by the permutation matrix X fulfilling A_{G'} = X^T A_G X, which transforms A_G into A_{G'}.
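Condition (iii) is a linear feasibility problem, so it can be checked numerically. Below is a small sketch using NumPy and SciPy; it assumes symmetric 0/1 adjacency matrices of equal size and simply asks whether a doubly stochastic X with A_G X = X A_{G'} exists. The vectorization convention and the solver choice are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def fractionally_isomorphic(a_g, a_h):
    """Check condition (iii): existence of a doubly stochastic X with
    a_g @ X == X @ a_h, via a feasibility LP over vec(X) (column-stacked)."""
    n = a_g.shape[0]
    eye, ones = np.eye(n), np.ones((1, n))
    # vec(A X) = (I kron A) vec(X),  vec(X B) = (B^T kron I) vec(X)
    commute = np.kron(eye, a_g) - np.kron(a_h.T, eye)   # A_G X - X A_G' = 0
    row_sums = np.kron(ones, eye)                       # X 1 = 1
    col_sums = np.kron(eye, ones)                       # 1^T X = 1^T
    a_eq = np.vstack([commute, row_sums, col_sums])
    b_eq = np.concatenate([np.zeros(n * n), np.ones(n), np.ones(n)])
    res = linprog(c=np.zeros(n * n), A_eq=a_eq, b_eq=b_eq,
                  bounds=(0.0, 1.0), method="highs")
    return res.status == 0   # 0 means a feasible point was found
```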
Finally, a feedforward artificial neural network NN will, for the purpose of this paper, be a stationary connectionist feedforward structure

NN = (I, O, H, C, F, (Φ_e)_{e∈C}, (Ψ_ν)_{ν∈H∪O}),   (5)

where

- I, O and H are mutually disjoint sets of input, output and hidden neurons;
- C ⊆ I × H ∪ H × H ∪ H × O is a set of connections;
- ∀ν ∈ I : out(ν) = {ξ ∈ H ∪ O : (ν, ξ) ∈ C} ≠ ∅, where out(ν) is the output set of the neuron ν;
- ∀ν ∈ O : inp(ν) = {ξ ∈ I ∪ H : (ξ, ν) ∈ C} ≠ ∅, where inp(ν) is the input set of the neuron ν;
- ∀ν ∈ H : inp(ν) ≠ ∅ and out(ν) ≠ ∅;
- ∃ Dom ⊆ R^{#I} : F ⊆ {F : Dom → R^{#O}}, where F is the set of mappings computable by the NN;
- ∀ν ∈ H ∪ O ∃ Dom_ν ⊆ R^{#inp(ν)} : Ψ_ν ⊆ {ψ : Dom_ν → R}; the elements of Ψ_ν are called somatic operators available for the neuron ν;
- ∀e ∈ C ∃ Dom_e ⊆ R : Φ_e ⊆ {φ : Dom_e → R}; the elements of Φ_e are called synaptic operators available for the connection e;
- all mappings computable by the NN fulfil the following condition of recursive composability: ∀F ∈ F ∀ν ∈ H ∪ O ∀ξ ∈ inp(ν) ∃ψ_ν ∈ Ψ_ν ∃φ_{(ξ,ν)} ∈ Φ_{(ξ,ν)} ∀X ∈ Dom ∃ A_X : I ∪ H ∪ O → R, the activation state of the NN corresponding to the input X, fulfilling

A_X|_I = X,   A_X|_O = F(X),   (6)
∀ν ∈ H ∪ O : A_X(ν) = ψ_ν((φ_{(ξ,ν)}(A_X(ξ)))_{ξ∈inp(ν)}).   (7)

The above definition covers in particular multilayer perceptrons and radial basis function networks, for which a plethora of theoretical results is available, such as the classical universal approximation results concerning the density of F in general function spaces [4, 12–14, 19], and results concerning the applicability of the laws of large numbers and of the central limit theorem [28]. As usual, the set F will be assumed parametrizable with a finite-dimensional set W of parameters. Hence,

∃q ∈ N ∃W ⊆ R^q ∃ω : W → {F : Dom → R^{#O}} such that F = ω(W).   (8)

Needless to say, the parametrizability of F also induces the parametrizability of the sets of somatic operators Ψ_ν, ν ∈ H ∪ O, and of synaptic operators Φ_e, e ∈ C.

2.2 Message-passing Neural Networks

Message-passing neural networks (MPNN) are probably the kind of neural networks most often used for graph data. They are characterized by the following properties [10]:

1. Their input and hidden neurons are structured into a sequence of layers H_0 = I, H_1, ..., H_L, as in the case of multilayer perceptrons (MLPs) and radial basis function networks:

k ≠ ℓ ⇒ H_k ∩ H_ℓ = ∅,   C ∩ (H × H) ⊆ ⋃_{1≤ℓ≤L} H_{ℓ-1} × H_ℓ.   (9)

2. The structure of their connections as well as their somatic and synaptic operators attempt to reflect the structure and properties of the graphs to which the network is applied. The approach that is most typical, and at the same time perfectly suitable for the kind of graphs considered in this paper, consists in the following:

- the network connections attempt to reflect the structure of graph edges; in particular, input and output sets of neurons attempt to reflect the neighbourhoods of graph vertices;
- in each layer ℓ = 0, ..., L, a group of neurons ν_{v,1}, ..., ν_{v,d_ℓ} ∈ H_ℓ can be assigned to a vertex v ∈ V, with d_0 = d_λ, and the vector ψ_v^{(ℓ)} = (ψ_{ν_{v,1}}, ..., ψ_{ν_{v,d_ℓ}}) of their somatic operators for a layer ℓ < L fulfils

ψ_v^{(ℓ+1)} = f_up^{(ℓ)}(ψ_v^{(ℓ)}, f_ag^{(ℓ)}(ψ_{u_1}^{(ℓ)}, ..., ψ_{u_{#N(v)}}^{(ℓ)})),   (10)

where f_up^{(ℓ)} : R^{d_ℓ} × R^{d_ag} → R^{d_{ℓ+1}} is called the update function, {u_1, ..., u_{#N(v)}} = N(v), f_ag^{(ℓ)} : R^{d_ℓ × #N(v)} → R^{d_ag} is called the aggregation function, and d_ag ∈ N, with the usual choice d_ag = d_ℓ or d_ag = d_{ℓ+1};
- synaptic operators are not used.

3. The resulting mapping F composed according to (6) is required to be either invariant or equivariant with respect to permutations of the d_0 columns of its input matrix. Needless to say, equivariance – preserving in the output the permutation of the input columns – is possible only if #O = d_0.

An important concept relevant to MPNNs as well as to other kinds of GNNs is consistency with graph colouring. An MPNN is consistent with the colouring c of a graph G if it fulfils

ψ_u^{(0)} = ψ_v^{(0)} ⇔ c(u) = c(v).   (11)

Because (ψ_v^{(ℓ)})_{v∈V} is for each ℓ = 1, ..., L a mapping into the Euclidean space R^{d_ℓ × #V}, or equivalently, into the Euclidean space R^{d_ℓ #V}, it can be used for graph representation. Usually, however, only (ψ_v^{(L)})_{v∈V} is used to this end. In our experiments reported in Section 3, both possibilities were used.
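Equation (10) describes one round of message passing: each vertex aggregates the states of its neighbours and combines the result with its own state. A minimal sketch of such a layer in plain NumPy follows; the sum aggregation and the tanh update in the example are placeholder choices, not the operators of any particular network discussed below.

```python
import numpy as np

def message_passing_layer(psi, neighbours, f_up, f_ag):
    """One round of equation (10).

    psi:        dict vertex -> current state vector psi_v^(l)
    neighbours: dict vertex -> list of neighbouring vertices
    f_up, f_ag: update and aggregation functions chosen by the model
    """
    return {v: f_up(psi[v], f_ag([psi[u] for u in neighbours[v]]))
            for v in psi}

# Example with sum aggregation and a purely illustrative update function.
psi0 = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0]), 2: np.array([1.0, 0.0])}
nbrs = {0: [1], 1: [0, 2], 2: [1]}
psi1 = message_passing_layer(psi0, nbrs,
                             f_up=lambda own, agg: np.tanh(own + agg),
                             f_ag=lambda msgs: np.sum(msgs, axis=0))
```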
The network 1-GNN was proposed in [18] as part of a study of the relationships between GNNs and the WL test. That study actually considered more general GNNs, called k-GNNs, with a general k ∈ N, and investigated their relationships to k-WL tests, which are generalizations of the WL test from vertices to k-tuples of vertices. The 1-GNN is not only the simplest network of this kind, but at the same time also a specific kind of MPNN, for which the functions f_up^{(ℓ)} and f_ag^{(ℓ)} in (10) are defined as

f_up^{(ℓ)}(ψ_v^{(ℓ)}, f_ag^{(ℓ)}(ψ_{u_1}^{(ℓ)}, ..., ψ_{u_{#N(v)}}^{(ℓ)})) = σ(W_1^{(ℓ)} ψ_v^{(ℓ)} + b^{(ℓ)} + f_ag^{(ℓ)}(ψ_{u_1}^{(ℓ)}, ..., ψ_{u_{#N(v)}}^{(ℓ)})),
f_ag^{(ℓ)}(ψ_{u_1}^{(ℓ)}, ..., ψ_{u_{#N(v)}}^{(ℓ)}) = Σ_{j=1}^{#N(v)} W_2^{(ℓ)} ψ_{u_j}^{(ℓ)},   (12)

where W_1^{(ℓ)}, W_2^{(ℓ)} ∈ R^{d_{ℓ+1} × d_ℓ}, b^{(ℓ)} ∈ R^{d_{ℓ+1}}, and σ is a component-wise non-linear activation function, e.g. a sigmoid or ReLU.

From the point of view of a relationship between this kind of networks and the WL test, it is important that for a 1-GNN consistent with the graph colouring, and for the colouring c_i constructed in the i-th iteration of Algorithm 1, i ≤ min(h, L), it was proven in [18] that:

(i) ∀u, v ∈ V : c_i(u) = c_i(v) ⇒ ψ_u^{(i)} = ψ_v^{(i)};
(ii) there exists a sequence of weight matrices and bias vectors (W_1^{(0)}, W_2^{(0)}, b^{(0)}), ..., (W_1^{(i-1)}, W_2^{(i-1)}, b^{(i-1)}) such that if the functions f_up^{(0)}, f_ag^{(0)}, ..., f_up^{(i-1)}, f_ag^{(i-1)} are defined using this sequence, then for any i' = 1, ..., i and u, v ∈ V : ψ_u^{(i')} = ψ_v^{(i')} ⇒ c_{i'}(u) = c_{i'}(v).

The Graph Isomorphism Network (GIN) was proposed in [29] and defines the functions f_up^{(ℓ)} and f_ag^{(ℓ)} in (10) as

f_up^{(ℓ)}(ψ_v^{(ℓ)}, f_ag^{(ℓ)}(ψ_{u_1}^{(ℓ)}, ..., ψ_{u_{#N(v)}}^{(ℓ)})) = f_mlp^{(ℓ)}((1 + ε_ℓ) ψ_v^{(ℓ)} + f_ag^{(ℓ)}(ψ_{u_1}^{(ℓ)}, ..., ψ_{u_{#N(v)}}^{(ℓ)})),
f_ag^{(ℓ)}(ψ_{u_1}^{(ℓ)}, ..., ψ_{u_{#N(v)}}^{(ℓ)}) = Σ_{j=1}^{#N(v)} ψ_{u_j}^{(ℓ)},   (13)

where ε_ℓ > 0 and f_mlp^{(ℓ)} : R^{d_ℓ} → R^{d_{ℓ+1}} is an MLP producing the representation of the graph in the (ℓ+1)-st layer.

A relationship between GIN and the WL test is based on applying the universal approximation results for MLPs [4, 12–14] to f_mlp^{(ℓ)}, in combination with a result proven in [29] as Corollary 6: Let X ⊆ R^{d_ℓ} be countable and bounded, and p_max ∈ N. Then there exists a function f : X → R^{d_{ℓ+1}} such that for infinitely many choices of ε, including all irrational numbers, the function h defined on X × ⋃_{p=1}^{p_max} X^p by

∀c ∈ X ∀p = 1, ..., p_max ∀X = (x_1, ..., x_p) ∈ X^p : h(c, X) = (1 + ε) f(c) + Σ_{j=1}^{p} f(x_j)   (14)

is unique with respect to multisets, which means that

∀c ∈ X ∀p = 1, ..., p_max ∀X, X' ∈ X^p : [∀x ∈ X : #{j = 1, ..., p : x_j = x} = #{j = 1, ..., p : x'_j = x}] ⇒ h(c, X) = h(c, X').   (15)
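For concreteness, here is a sketch of the two update rules (12) and (13) in PyTorch; the layer widths, the ReLU nonlinearity and the two-layer MLP inside the GIN update are illustrative choices, not the exact configuration used in the experiments of Section 3.

```python
import torch
import torch.nn as nn

class OneGNNLayer(nn.Module):
    """Update rule (12): sigma(W1 psi_v + b + sum_u W2 psi_u)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin_self = nn.Linear(d_in, d_out, bias=True)    # W1, b
        self.lin_neigh = nn.Linear(d_in, d_out, bias=False)  # W2

    def forward(self, psi, adj):
        # psi: (n_vertices, d_in), adj: dense 0/1 adjacency matrix as float tensor
        return torch.relu(self.lin_self(psi) + self.lin_neigh(adj @ psi))

class GINLayer(nn.Module):
    """Update rule (13): MLP((1 + eps) psi_v + sum_u psi_u)."""
    def __init__(self, d_in, d_out, eps=0.1):
        super().__init__()
        self.eps = eps
        self.mlp = nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU(),
                                 nn.Linear(d_out, d_out))

    def forward(self, psi, adj):
        return self.mlp((1.0 + self.eps) * psi + adj @ psi)
```

In the PyTorch Geometric library used in Section 3, closely related updates are provided by the GraphConv and GINConv layers, respectively.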
2.3 Relaxed Weisfeiler-Lehman Kernel

The relaxed WL kernel [23] is a recent relaxation of the WL subtree kernel, which was proposed in [24] and is based on the WL isomorphism test described in Algorithm 1. Taking into account the number of iterations h of the WL test, which is in the context of the WL subtree kernel called depth, the value of this kernel for a pair of graphs G = (V, E, λ), G' = (V', E', λ') is defined as

k_WLsubtree^{(h)}(G, G') = Σ_{i=0}^{h} Σ_{v∈V} Σ_{v'∈V'} δ(c_i(v), c_i(v')),   (16)

where δ denotes the Kronecker delta, also known as the Dirac kernel. Consequently, the WL subtree kernel reflects only the exact match of the colourings produced for both graphs in every iteration of the WL test, although in many real-world problems, the similarity of those colourings is more important than their exact match.

To overcome this drawback, the exact match is in [23] weakened, for i = 0, ..., h, to the equivalence with respect to the clusters C_1^ρ, ..., C_{k_ρ}^ρ produced by a partitioning ρ of Σ_i. Hence,

∀i, j = 1, ..., k_ρ : i ≠ j ⇒ C_i^ρ ∩ C_j^ρ = ∅,   C_1^ρ ∪ ... ∪ C_{k_ρ}^ρ = Σ_i,
ρ : Σ_i → {1, ..., k_ρ},   ∀k = 1, ..., k_ρ ∀σ ∈ C_k^ρ : ρ(σ) = k.   (17)

Moreover, for each Σ_i, not only one such partitioning ρ is used, but a finite set Θ_i = {ρ_1^i, ..., ρ_{#Θ_i}^i} of them. This finally turns (16) into the definition of the relaxed WL kernel:

k_R-WL^{(h)}(G, G') = Σ_{i=0}^{h} Σ_{ρ∈Θ_i} Σ_{v∈V} Σ_{v'∈V'} δ(ρ(c_i(v)), ρ(c_i(v'))).   (18)

Its name originates from the fact that k_R-WL^{(h)} is more general than k_WLsubtree^{(h)}, to which it reduces if each Θ_i = {ρ_i}, where ρ_i partitions Σ_i into the singletons {σ}, σ ∈ Σ_i, for i = 1, ..., h.

The construction of the partitionings ρ ∈ Θ_i, i = 1, ..., h, in (18) is based on replacing each σ ∈ Σ_i, for i = 1, ..., h, with a set of isomorphic unfolding trees. An unfolding tree T^i(G, v) of depth i in a graph G rooted in a vertex v ∈ V will be defined, in accordance with [5], as a rooted tree T = (V(T), E(T)) with root r such that:

- there exists a homomorphism f from T to G;
- f(r) = v;
- for each non-leaf t ∈ V(T), f induces a bijection between the set of children of t in T and N(f(t)) in G.

For unfolding trees, a natural distance is a tree-edit distance. The definition of the tree-edit distance employed in the context of the relaxed WL kernel can be found in [23].

From the point of view of graph representation, it is important that (16) and (18) are actually scalar products of vectors, which suggests using those vectors as Euclidean-space representations of the graphs G and G' in downstream data analysis applications. In particular for the WL subtree kernel, define a vector φ_WL(G) ∈ R^{n_WL} with n_WL = Σ_{i=0}^{h} #Σ_i as

φ_WL(G) = (φ_{0,1}^{WL}, ..., φ_{0,#Σ_0}^{WL}, φ_{1,1}^{WL}, ..., φ_{h,#Σ_h}^{WL}),   (19)

where for i = 0, ..., h and j = 1, ..., #Σ_i,

φ_{i,j}^{WL} = #{v ∈ V : c_i(v) = σ_i^j}.   (20)

Then (16) can indeed be rewritten as the scalar product

k_WLsubtree^{(h)}(G, G') = φ_WL(G)^T φ_WL(G').   (21)

Similarly for the relaxed WL kernel, define a vector φ_R-WL(G) ∈ R^{n_R-WL} with n_R-WL = Σ_{i=0}^{h} Σ_{j=1}^{#Θ_i} k_{ρ_j^i} as

φ_R-WL(G) = (φ_{0,1}^{R-WL}, ..., φ_{0,#Σ_0}^{R-WL}, φ_{1,1,1}^{R-WL}, ..., φ_{1,1,k_{ρ_1^1}}^{R-WL}, φ_{1,2,1}^{R-WL}, ..., φ_{1,#Θ_1,k_{ρ_{#Θ_1}^1}}^{R-WL}, φ_{2,1,1}^{R-WL}, ..., φ_{h,#Θ_h,k_{ρ_{#Θ_h}^h}}^{R-WL}),   (22)

where for i = 0, ..., h, j = 1, ..., #Θ_i, and k = 1, ..., k_{ρ_j^i},

φ_{i,j,k}^{R-WL} = #{v ∈ V : ρ_j^i(c_i(v)) = k}.   (23)

Then (18) can be rewritten as the scalar product

k_R-WL^{(h)}(G, G') = φ_R-WL(G)^T φ_R-WL(G').   (24)

It is useful to realize that the label sets Σ_i for i = 0, ..., h, on which the definitions of φ_WL and φ_R-WL rely, were defined only for a pair of graphs (G, G'), as Σ_i = c_i(V ∪ V'), cf. line 15 of Algorithm 1. For dealing with a whole set 𝒢 of graphs such that each G ∈ 𝒢 has its own sets of vertices V^G and edges E^G as well as its own labelling λ^G and colouring c^G, it is convenient to generalize their definition to

Σ_i = ⋃_{G∈𝒢} c_i^G(V^G),   i = 0, ..., h.   (25)

It is this definition that we used in our experiments.
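As an illustration of (19)–(21) together with the pooled label sets (25), the following sketch builds WL subtree feature vectors for a whole set of graphs from the colourings produced by a simple refinement routine, and evaluates the kernel as a dot product. It is a simplification for illustration only, not the GenWL implementation used in the experiments.

```python
from collections import Counter
import numpy as np

def wl_colourings(adj, col, h):
    """Colourings c_0, ..., c_h of one graph given as an adjacency list
    (dict vertex -> neighbours) with initial colours (dict vertex -> colour)."""
    history, c = [dict(col)], dict(col)
    for _ in range(h):
        c = {v: (c[v], tuple(sorted(c[u] for u in adj[v]))) for v in adj}
        history.append(c)
    return history

def wl_feature_vectors(graphs, h):
    """phi_WL vectors (19)-(20) for a set of graphs, with the label
    sets Sigma_i pooled over all graphs as in (25)."""
    histories = [wl_colourings(adj, col, h) for adj, col in graphs]
    sigmas = [sorted({lab for hist in histories for lab in hist[i].values()},
                     key=repr)
              for i in range(h + 1)]
    vectors = []
    for hist in histories:
        counts = []
        for i, sigma in enumerate(sigmas):
            freq = Counter(hist[i].values())
            counts.extend(freq[lab] for lab in sigma)
        vectors.append(np.array(counts, dtype=float))
    return vectors

# The WL subtree kernel value (21) of two graphs is then simply the
# scalar product of their feature vectors, e.g. vectors[0] @ vectors[1].
```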
3 Experimental Comparison

For the experimental comparison, we implemented the 1-GNN and GIN networks using the layers and neural network learning methods available in the PyTorch Geometric library [21]. We also made use of the GenWL implementation [8] of the relaxed WL kernel by the authors of the paper [23]. Both network implementations were used in their decay variants, and the GenWL implementation was used in the R-WL* variant, in accordance with the experiments reported in [23].

3.1 Employed Graphsets

To be able to assess the statistical significance of the comparison results, we performed the comparison on 20 mutually disjoint graphsets GS1–GS20, adopting the usual assumption that for the employed data, disjointness is a sufficient condition for statistical independence. Those graphsets were obtained from real-world benchmark sets of graph data. Due to our objective of investigating the suitability of the compared graph representation methods for downstream classification, we used datasets from binary graph classification tasks to this end.

We have chosen four benchmark sets of graph data, which are available both in the PyTorch Geometric library [21] and in the GenWL implementation of the relaxed WL kernel [8]:

1. BZR is a set of 405 ligands for the benzodiazepine receptor, classified with respect to their activity in benzodiazepine binding [25].
2. COX-2 is a set of 467 cyclooxygenase-2 inhibitors, classified with respect to their activity against the human recombinant enzyme [25].
3. DHFR is a set of 756 inhibitors of dihydrofolate reductase, classified with respect to their activity in the inhibition of the enzymatic reduction that converts dihydrofolate to tetrahydrofolate [25].
4. NCI1 is a set of 4110 compounds evaluated in bioassays of the National Cancer Institute on non-small cells of lung tumour, classified with respect to growth inhibition of this kind of human tumour [26].

The precise origin of the graphsets GS1–GS20 in those four benchmark sets is listed in Table 1.

Table 1: Origin of the employed graphsets.

Graphset | Original benchmark data
GS1  | BZR, graphs with nr. 1–200
GS2  | BZR, graphs with nr. 201–405
GS3  | COX2, graphs with nr. 1–250
GS4  | COX2, graphs with nr. 251–467
GS5  | DHFR, graphs with nr. 1–250
GS6  | DHFR, graphs with nr. 251–500
GS7  | DHFR, graphs with nr. 501–756
GS8  | NCI1, graphs with nr. mod 13 = 0
GS9  | NCI1, graphs with nr. mod 13 = 1
GS10 | NCI1, graphs with nr. mod 13 = 2
GS11 | NCI1, graphs with nr. mod 13 = 3
GS12 | NCI1, graphs with nr. mod 13 = 4
GS13 | NCI1, graphs with nr. mod 13 = 5
GS14 | NCI1, graphs with nr. mod 13 = 6
GS15 | NCI1, graphs with nr. mod 13 = 7
GS16 | NCI1, graphs with nr. mod 13 = 8
GS17 | NCI1, graphs with nr. mod 13 = 9
GS18 | NCI1, graphs with nr. mod 13 = 10
GS19 | NCI1, graphs with nr. mod 13 = 11
GS20 | NCI1, graphs with nr. mod 13 = 12
3.2 Experimental Setup

Before starting the experiments, we needed to make two kinds of decisions. The first concerns the compared GNNs: what network topology to use, and how to obtain a graph representation from a trained network. The second concerns the comparison as a whole, including the relaxed WL kernel: how to evaluate the suitability of each representation for downstream classification. We now address each of those design decisions in some detail.

Topology of the compared GNNs. Due to the fact that the GNNs were trained on classification data, and also due to the default settings of the employed GenWL implementation of the relaxed WL kernel, both of them contained the following layers:

(i) an input layer, which receives the colourings of vertices, i.e. the components (λ_1, ..., λ_{d_c}) of the label λ, no matter whether the label possibly has still other components, i.e. whether d_c = d_λ or d_c < d_λ;
(ii) 5 MPNN layers of the same size, specific for 1-GNN and for GIN;
(iii) an average-pooling layer averaging over the vertices of the graph;
(iv) a cross-entropy classification layer.

As to the MPNN layers, we investigated 2 variants of them, differing in size:

(iia) the size of all 5 layers equals the average number of vertices among the graphs in the graphset;
(iib) the size of all 5 layers equals the maximal number of vertices among the graphs in the graphset.

The GNN values used for graph representation were obtained as the activities, i.e. the results of somatic operators, in the neurons of the MPNN layers. In both compared GNNs, the activities are obtained separately for each vertex of each graph. Therefore, the activities for each graph were first averaged over all of its vertices, in accordance with using an average-pooling layer during their training. According to the neurons that were actually used to this end, two kinds of graph representation were considered:

(i) representation restricted to the activities of the neurons of the last MPNN layer, which is a restriction commonly encountered in representation learning by artificial neural networks;
(ii) representation with the activities of the neurons of all MPNN layers, which is more similar to the representation (24) based on the relaxed WL kernel.

Apart from the GNN topology, the hyperparameter values of all compared methods were set to their defaults in the employed implementation, and if no default was available, to values obtained through slight tuning.

Evaluation of the graph representations with respect to their suitability for downstream classification was by means of the accuracy on test data obtained in classification using each of the representations as input. To this end, a linear support vector machine (SVM) classifier was employed, in accordance with the default setting in the GenWL implementation of the relaxed WL kernel [8]. To increase the reliability of the accuracy assessment, a 10-fold cross-validation was used.
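This evaluation step can be reproduced with standard tooling. The sketch below assumes that a matrix of graph representations and the binary labels are already available, and mirrors the linear-SVM, 10-fold cross-validation protocol described above; the regularization constant and iteration limit are illustrative, not the defaults of the GenWL implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def evaluate_representation(features, labels):
    """Mean and standard deviation of the validation-fold accuracy of a
    linear SVM over a 10-fold cross-validation.

    features: (n_graphs, n_features) array, e.g. averaged MPNN activities
              or the phi_WL / phi_R-WL vectors
    labels:   (n_graphs,) array of 0/1 class labels
    """
    clf = LinearSVC(C=1.0, max_iter=10000)
    scores = cross_val_score(clf, features, labels, cv=10, scoring="accuracy")
    return scores.mean(), scores.std()
```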
3.3 First Results

The results for both kinds of GNN-based representations for the 4 considered networks, i.e. 1-GNN and GIN with both variants (iia) and (iib) of the MPNN layers, as well as the results for the R-WL*-based representation, are presented in Table 2. For each of those representations, the mean and standard deviation of the accuracies on the validation folds are reported, from a 10-fold cross-validation on each of the 20 graphsets.

Table 2: Mean and standard deviation of the validation-fold accuracy [%] from a 10-fold cross-validation of classification by an SVM, using as input the representations of graphs obtained from the 1-GNN and GIN networks, as well as from the R-WL* variant of the relaxed WL kernel. Columns are labelled by the representation kind (last = activities of the last MPNN layer, all = activities of all MPNN layers), the MPNN-layer variant, and the network; MPNN layers (iia) refer to 5 message-passing layers of size equal to the average number of vertices among the graphs in the graphset on which the network is being trained, MPNN layers (iib) to 5 message-passing layers of size equal to the maximal number of vertices among the graphs in that graphset. For each graphset, the representation(s) achieving the highest mean accuracy are marked with *.

Graphset | last/(iia)/1-GNN | last/(iia)/GIN | last/(iib)/1-GNN | last/(iib)/GIN | all/(iia)/1-GNN | all/(iia)/GIN | all/(iib)/1-GNN | all/(iib)/GIN | relaxed WL kernel
GS1  | 96±4  | 94±4  | 98±4*  | 94±4  | 96±5  | 94±3  | 98±3*  | 94±5  | 90±9
GS2  | 86±8  | 82±8  | 88±7   | 83±9  | 85±7  | 86±6  | 89±5*  | 84±9  | 81±9
GS3  | 82±7  | 81±3  | 84±4   | 82±4  | 84±5  | 81±4  | 86±6*  | 85±6  | 82±4
GS4  | 88±6  | 81±8  | 89±6   | 82±7  | 89±5  | 83±3  | 90±8*  | 82±6  | 76±7
GS5  | 81±9  | 79±8  | 84±6*  | 74±8  | 80±5  | 76±7  | 84±5*  | 76±8  | 66±5
GS6  | 96±5* | 85±7  | 92±5   | 84±6  | 96±4* | 84±11 | 94±3   | 84±5  | 83±8
GS7  | 98±3* | 93±5  | 92±3   | 91±5  | 98±3* | 92±4  | 92±5   | 92±5  | 89±5
GS8  | 83±6* | 79±6  | 79±7   | 80±7  | 83±4* | 78±6  | 79±9   | 79±6  | 65±8
GS9  | 80±6  | 75±8  | 93±6   | 75±7  | 79±10 | 76±5  | 94±5*  | 73±7  | 75±7
GS10 | 83±8* | 81±5  | 66±8   | 76±7  | 83±6* | 78±6  | 69±6   | 77±7  | 74±7
GS11 | 73±11 | 79±7* | 75±4   | 71±8  | 75±6  | 78±8  | 77±7   | 71±6  | 71±8
GS12 | 93±4* | 80±6  | 90±5   | 80±5  | 91±4  | 84±8  | 90±7   | 79±7  | 64±6
GS13 | 85±4  | 75±10 | 85±4   | 78±8  | 86±7* | 71±9  | 85±8   | 74±10 | 71±8
GS14 | 85±7* | 73±9  | 81±5   | 72±9  | 85±6* | 76±9  | 82±3   | 71±9  | 69±7
GS15 | 91±6* | 66±11 | 79±7   | 71±6  | 91±4* | 67±7  | 78±7   | 67±12 | 64±6
GS16 | 81±8  | 86±5* | 75±6   | 80±7  | 81±7  | 85±4  | 77±7   | 79±7  | 69±6
GS17 | 87±6  | 78±7  | 91±6*  | 74±8  | 88±6  | 78±8  | 91±5*  | 72±5  | 68±10
GS18 | 92±4  | 78±8  | 94±4   | 75±10 | 93±3  | 75±7  | 95±3*  | 75±8  | 63±8
GS19 | 88±7* | 72±7  | 82±6   | 72±7  | 87±5  | 75±6  | 81±8   | 73±9  | 69±6
GS20 | 85±4  | 73±7  | 90±5*  | 71±9  | 82±6  | 75±8  | 89±9   | 70±6  | 66±7

According to Table 2, the highest accuracy has been achieved most frequently, namely 8 times among the 20 employed graphsets, with the representation by activities of the last layer of the variant (iia) of the MPNN layers of a 1-GNN, as well as with the representation by activities of all layers of the variant (iib) of the MPNN layers of a 1-GNN. However, the differences between the accuracies in Table 2 are not only due to essential differences between the considered representations, but also due to random influences. Separating the former from the latter requires assessing the statistical significance of those differences.
3.4 Assessment of Statistical Significance

To assess the statistical significance of the obtained results, we first tested the basic null hypothesis that the mean classification accuracy for all 9 representations coincides. To this end, we applied the Friedman test with 10 replicates to the results for all 200 validation folds from the cross-validation of the SVM classification on the 20 graphsets. This basic null hypothesis was strongly rejected, with the achieved significance p = 4·10^-92. For the post-hoc analysis, we employed the Wilcoxon signed rank test with a two-sided alternative for all 36 pairs of the investigated representations, because of the inconsistency of the more commonly used mean-ranks post-hoc test, to which Benavoli et al. recently pointed [2]. For the correction for multiple hypotheses testing, we used the Holm method [7].

The results of the pairwise comparisons of the 9 investigated representations and of their significance testing are presented in Table 3. A number n_{a,b} in a row of the table corresponding to a representation a and a column corresponding to a representation b states in how many of the graphsets GS1–GS20 the representation a led to a higher mean classification accuracy than the representation b. If n_{a,b} > n_{b,a} and the difference between the representations a and b is, according to the Wilcoxon signed rank test, significant at the familywise level 5% after the Holm correction, then n_{a,b} is in bold.

Table 3: Comparison of the investigated representations. A number n_{a,b} in a row corresponding to a representation a and a column corresponding to a representation b states in how many of the graphsets GS1–GS20 the representation a led to a higher mean classification accuracy than the representation b. That number is in bold if n_{a,b} > n_{b,a} and the difference between the representations a and b is, according to the Wilcoxon signed rank test, significant at the familywise level 5% after the Holm correction. MPNN layers (iia) refer to 5 MPNN layers of size equal to the average number of vertices among the graphs in the graphset on which the network is being trained, MPNN layers (iib) to 5 MPNN layers of size equal to the maximal number of vertices among the graphs in that graphset; last/all distinguishes representations by activities of the last MPNN layer and of all MPNN layers.

Representation a \ b | last/(iia)/1-GNN | last/(iia)/GIN | last/(iib)/1-GNN | last/(iib)/GIN | all/(iia)/1-GNN | all/(iia)/GIN | all/(iib)/1-GNN | all/(iib)/GIN | relaxed WL kernel
last/(iia)/1-GNN  | –  | 18 | 9  | 19 | 6  | 17 | 9  | 19 | 19
last/(iia)/GIN    | 2  | –  | 4  | 10 | 2  | 9  | 4  | 13 | 17
last/(iib)/1-GNN  | 10 | 15 | –  | 17 | 7  | 16 | 3  | 15 | 19
last/(iib)/GIN    | 0  | 6  | 3  | –  | 0  | 4  | 3  | 9  | 17
all/(iia)/1-GNN   | 6  | 18 | 10 | 20 | –  | 17 | 10 | 19 | 20
all/(iia)/GIN     | 2  | 8  | 3  | 13 | 3  | –  | 3  | 11 | 18
all/(iib)/1-GNN   | 10 | 15 | 10 | 17 | 10 | 16 | –  | 16 | 19
all/(iib)/GIN     | 1  | 5  | 3  | 6  | 1  | 3  | 2  | –  | 18
relaxed WL kernel | 0  | 1  | 1  | 0  | 0  | 1  | 1  | 1  | –

The significant differences in Table 3 reveal that representations originating from the same kind of GNN never lead to significantly different accuracies, neither those originating from a 1-GNN, nor those originating from a GIN. On the other hand, if one of the representations originates from a 1-GNN and the other from a GIN, then the difference between the accuracies is significant, and similarly if one of them is GNN-based and the other originates from the relaxed WL kernel. Combined with the results in Table 2, this means that any of the four representations originating from the 1-GNN most frequently yields the highest accuracy, whereas the differences between those four representations are not significant. The fact that the accuracies of the four representations originating from the same kind of GNN were significantly different neither for 1-GNN nor for GIN suggests testing the stronger hypothesis that the mean classification accuracy of all those four representations coincides. For 1-GNN, the Friedman test indeed did not reject that hypothesis, with a quite high achieved significance p = 0.19. For GIN, it rejected the hypothesis at the usual significance level 5%, but already at the significance level 4% it did not reject it any more (achieved significance p = 0.043).
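The significance analysis follows a standard recipe: an omnibus Friedman test over all representations, followed by pairwise two-sided Wilcoxon signed-rank tests with a Holm correction. A compact sketch with SciPy is given below; it is a simplified variant (a plain Friedman test without replicates) and assumes `acc` is an array of fold accuracies with one column per representation.

```python
from itertools import combinations
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

def significance_analysis(acc, alpha=0.05):
    """acc: (n_folds, n_representations) array of validation-fold accuracies."""
    # Omnibus test of the null hypothesis that all mean accuracies coincide.
    _, p_friedman = friedmanchisquare(*[acc[:, j] for j in range(acc.shape[1])])
    # Pairwise post-hoc Wilcoxon signed-rank tests, two-sided alternative.
    pairs = list(combinations(range(acc.shape[1]), 2))
    p_values = np.array([wilcoxon(acc[:, a], acc[:, b]).pvalue for a, b in pairs])
    # Holm step-down correction: compare sorted p-values to alpha / (m - rank).
    order = np.argsort(p_values)
    m, significant = len(pairs), np.zeros(len(pairs), dtype=bool)
    for rank, idx in enumerate(order):
        if p_values[idx] > alpha / (m - rank):
            break
        significant[idx] = True
    return p_friedman, dict(zip(pairs, zip(p_values, significant)))
```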
4 Conclusion and Further Research

This work-in-progress paper experimentally compared 8 variants of graph representations based on two recently proposed graph neural networks inspired by the WL isomorphism test, as well as a representation based on a recent generalization of the WL subtree kernel. They were compared not with respect to their distance in the embedding space, but with respect to downstream classification by means of an SVM. The results of that comparison indicate that the highest classification accuracy is achieved with representations originating from 1-GNN networks. At the same time, the comparison revealed that any two representations originating from different kinds of GNNs always lead to significantly different accuracies, and also the accuracies of a GNN-based representation and of the representation originating from the relaxed WL kernel differ significantly. On the other hand, the four accuracies of representations originating from the same kind of GNN are never significantly different. Moreover, the data do not even contradict the stronger hypothesis that the mean classification accuracy of all those representations coincides.

The result that the representation based on the relaxed WL kernel was inferior to the GNN-based representations is explainable by the fact that the kernel was designed for dense and structurally more diverse graphs, whereas the reported investigation was performed on simple molecular graphs.

However, the paper presents really only first results that, in our opinion, indicate the usefulness of possible further research in this direction, which we consider interesting due to the crucial role nowadays played by representation learning. Needless to say, such research would have to be more comprehensive with respect to the employed graph data, as well as with respect to the involved representation methods.

As to the employed data, we used 20 comparatively small disjoint graphsets from only four different benchmark datasets, in order to be able to assess the statistical significance of the differences between the compared representations. Using instead a similar number of separate benchmarks would lead to much larger graphsets, would allow covering a broader spectrum of applications, and would allow dropping the assumption that for the employed data, disjointness is a sufficient condition for statistical independence. The benchmarks for future investigation should also include dense and structurally diverse graphs, i.e. the kind of graphs for which the relaxed WL kernel is intended. In addition, it would be interesting to know how the GNN-based representations change if, instead of the colouring components (λ_1, ..., λ_{d_c}) of a labelling λ, the complete labelling is used.

As to the involved representation methods, it would be interesting to compare the relaxed WL kernel with methods based on other GNNs inspired by the WL test. Indeed, other variants of the WL subtree kernel have been sufficiently compared with the relaxed WL kernel in [23] and shown to be inferior to it, and several of them have also already been compared with GNN-based methods inspired by the WL test [1, 11, 15]. On the other hand, comparisons among different methods are sporadic [3, 11, 15], and a comparison of some of them with the relaxed WL kernel is, to the best of our knowledge, reported for the first time in this paper.

Acknowledgement

The authors are deeply indebted to Till Hendrik Schulz for repeated discussions and for his help with the relaxed WL kernel and with its implementation GenWL. The research reported in this paper has been supported by the Charles University SVV project 260575 and partially supported by the Czech Science Foundation (GAČR) grant 18-18080S.
References

[1] R. Al-Rfou, D. Zelle, and B. Perozzi. DDGK: Learning graph representations for deep divergence graph kernels. In International World Wide Web Conference, pages 37–48, 2019.
[2] A. Benavoli, G. Corani, and F. Mangili. Should we really use post-hoc tests based on mean-ranks? Journal of Machine Learning Research, 17:1–10, 2016.
[3] Z. Chen, S. Villar, L. Chen, and J. Bruna. On the equivalence between graph isomorphism testing and function approximation with GNNs. In 33rd Conference on Neural Information Processing Systems, pages 1–19, 2019.
[4] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2:303–314, 1989.
[5] H. Dell, M. Grohe, and G. Rattan. Lovász meets Weisfeiler and Leman. In 45th International Colloquium on Automata, Languages, and Programming, article no. 40, 2018.
[6] F. Errica, D. Bacciu, and A. Micheli. Theoretically expressive and edge-aware graph learning. In European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pages 175–180, 2020.
[7] S. Garcia and F. Herrera. An extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all pairwise comparisons. Journal of Machine Learning Research, 9:2677–2694, 2008.
[8] GenWL: A Generalized Weisfeiler-Lehman Graph Kernel, 2021. https://github.com/mlai-bonn/GenWL.
[9] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864, 2016.
[10] W.L. Hamilton. Graph Representation Learning. Morgan & Claypool Publishers, San Rafael, 2020.
[11] N.T. Hoang and T. Maehara. Graph homomorphism convolution. In 37th International Conference on Machine Learning, pages 1–15, 2020.
[12] K. Hornik. Approximation capabilities of multilayer neural networks. Neural Networks, 4:251–257, 1991.
[13] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359–366, 1989.
[14] V. Kůrková. Kolmogorov's theorem and multilayer neural networks. Neural Networks, 5:501–506, 1992.
[15] H. Maron, H. Ben-Hamu, H. Serviansky, and Y. Lipman. Provably powerful graph networks. In 33rd Conference on Neural Information Processing Systems, pages 1–12, 2019.
[16] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[17] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In 27th Conference on Neural Information Processing Systems, pages 3111–3119, 2013.
[18] C. Morris, M. Ritzert, M. Fey, W.L. Hamilton, J.E. Lenssen, et al. Weisfeiler and Leman go neural: Higher-order graph neural networks. In 33rd AAAI Conference on Artificial Intelligence, pages 4602–4609, 2019.
[19] J. Park and I.W. Sandberg. Approximation and radial-basis-function networks. Neural Computation, 5:305–316, 1993.
[20] B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: Online learning of social representations. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710, 2014.
[21] PyTorch Geometric Documentation, 2021. https://pytorch-geometric.readthedocs.io/.
[22] B. Schölkopf and A.J. Smola. Learning with Kernels. MIT Press, Cambridge, 2002.
[23] T.H. Schulz, T. Horváth, P. Welke, and S. Wrobel. A generalized Weisfeiler-Lehman graph kernel. arXiv preprint arXiv:2101.08104v1, 2021.
[24] N. Shervashidze, P. Schweitzer, E.J. van Leeuwen, K. Mehlhorn, and K.M. Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12:2539–2561, 2011.
[25] J.J. Sutherland, L.A. O'Brien, and D.F. Weaver. Spline-fitting with a genetic algorithm: A method for developing classification structure-activity relationships. Journal of Chemical Information and Computer Sciences, 43:1906–1915, 2003.
[26] N. Wale, I.A. Watson, and G. Karypis. Comparison of descriptor spaces for chemical compound retrieval and classification. Knowledge and Information Systems, 14:347–375, 2008.
[27] B. Weisfeiler and A.A. Lehman. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia, 2:12–16, 1968.
[28] H. White. Artificial Neural Networks: Approximation and Learning Theory. Blackwell Publishers, Cambridge, 1992.
[29] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, pages 1–17, 2019.