=Paper=
{{Paper
|id=Vol-2600/short2
|storemode=property
|title=Vec2Struc: A Method Towards Explainable Structural-Based Node Embeddings
|pdfUrl=https://ceur-ws.org/Vol-2600/short2.pdf
|volume=Vol-2600
|authors=Srishti Tomar,Ryan Compton
|dblpUrl=https://dblp.org/rec/conf/aaaiss/TomarC20
}}
==Vec2Struc: A Method Towards Explainable Structural-Based Node Embeddings==
<pdf width="1500px">https://ceur-ws.org/Vol-2600/short2.pdf</pdf>
<pre>
 Vec2Struc: A Method Towards Explainable Structural-Based Node Embeddings

                                                 Srishti Tomar, Ryan Compton
                                    Quantiply, 333 W San Carlos St. #1650, San Jose, California,
                                                 {srishti, compton}@quantiply.com


                            Abstract                                 determine the real-world representation of the node asso-
                                                                     ciations in such an ambiguous space? There is a need to
  Network embeddings are a popular new method for encapsu-
                                                                     map the embedding space to real-world representations if
  lating the complexity of networks in a reduced feature space.
  However, embeddings obfuscates the complexity of networks          we are to better understand models that rely on network em-
  and make explainability of such models difficult. We propose       beddings.
  Vec2Struc, a new method aimed at providing more explain-              This short paper proposes Vec2Struc, an algorithmic ap-
  able representations of network models. This approach exam-        proach for discovering common structures or topologies be-
  ines nodes similar to one another in the generated embedding       tween nodes in an embedded space. While this is not a uni-
  space and highlights common structures among them. This            versal approach for network embeddings (more on that be-
  method provides a means to explore embeddings distribution-        low), this provides a first step toward interpreting network
  ally and visually, leading towards more transparent and inter-     embeddings, paving the way for such state-of-the-art tech-
  pretable AI systems utilizing state-of-the-art structural-based    niques to be more usable within AI systems that are depen-
  node embeddings.
                                                                     dent on explainability.

                        Introduction                                 Network/Node Embeddings
Networks have long been used in analysing complex sys-               Network embeddings are formally defined from (Goyal and
tems with entity-to-entity relationships. For example, biolo-        Ferrara 2018):
gists have used networks to represent Protein-to-Protein in-            Given a network G = (V,E), a network embedding is a
teractions whereas social scientist’s usage has been in under-       mapping f : vi → yi ∈ Rd ; ∀i ∈ [n] such that d  |V | and
standing of friendship or community relationships (Goyal             the function f preserves some proximity measure defined on
and Ferrara 2018). A growing consensus on the lack of com-           network G
mon metrics and modeling of networks has led to a more                  Embeddings have been oriented to two different vector-
representative approach in building condensed vectorized             ized representations. 1) a representation of a network as a
representations of network properties.                               whole (referred to as network or graph embedding) and 2) a
   Network (or Node) embeddings are an approach that pro-            representation of node properties within a network (referred
vide scalable learning within networks. Typically, networks          to as node embeddings). This work focuses on the node level
are very rich in information and the size of them explored by        representation and hence forth in this paper when referring
researchers has increased into the billions of nodes. Through        to network embeddings this is the prospective taken.
embeddings, a reduced feature space can be created to en-               (Goyal and Ferrara 2018) categorize embedding methods
capsulate both network and node information. With this re-           into taxonomy of three categories: Factorization, Random
duced space comes more scalable and generalizable methods            Walk, and Deep Learning. But these methods miss the im-
without needing to reduce the amount of information being            portance of the local structures surrounding nodes. Under-
stored in the network (Goyal and Ferrara 2018).                      standing local structure properties is vital for analyzing com-
   There are many variants to embedding approaches (Goyal            plex networks as they showcase higher level properties be-
and Ferrara 2018; Ribeiro, Saverese, and Figueiredo 2017),           yond the one-to-one interactions commonly examined.
but there remains to be a method for assessing what these               Additional methods have been implemented in which
embeddings represent as they are an ambiguation of the orig-         the structural identity of a node’s neighbor is considered
inal network, thus leaving the question how can researchers          (Ribeiro, Saverese, and Figueiredo 2017). Struc2Vec, pro-
Copyright c 2020 held by the author(s). In S. Tomar, R. Comp-
                                                                     posed by (Ribeiro, Saverese, and Figueiredo 2017), con-
ton, Proceedings of the AAAI 2020 Spring Symposium on Com-           structs a structural context for nodes through a multi-layer
bining Machine Learning and Knowledge Engineering in Practice        network which encodes structural similarities within the
(AAAI-MAKE 2020). Stanford University, Palo Alto, California,        embeddings. Previous network embedding methods tend to
USA, March 23-25, 2020. Use permitted under Creative Commons         encapsulate information around node distances, however
License Attribution 4.0 International (CC BY 4.0).                   Struc2Vec is noted to be distance agnostic and can embed
structural information between nodes very far apart in the
network (Ribeiro, Saverese, and Figueiredo 2017). While
other types of structural based embeddings exist (Goyal and
Ferrara 2018), this work utilizes Struc2Vec embeddings.

Need for Explainable AI
The need for such complex methods to be explainable is ar-
gued by (Weld and Bansal 2019) that most computer-based
produced behavior is actually alien to humans and can fail in
many unexpected ways. People can’t trust nor control com-
plex systems that lack the ability to verify why they pro-
duced a specific output. The authors propose multiple crite-
rion for which an AI system to reach such an ability:
1. It is clear what factors caused the system’s action, allow-
   ing users to predict how changes to the situation would
   have led to alternative behaviors
2. Permits effective control of the AI by enabling interaction
To approach such a system, they propose the need to en-
sure that the underlying reasoning is interpretable (Weld and
Bansal 2019).
  Vec2Struc fulfills the first criterion proposed by (Weld and
Bansal 2019) through visual representations of the network
topologies. This allows for high-level heuristic evaluation
of the embedded space, highlighting the factors for which
nodes are associated in structural-based embeddings.

                           Method
The summary of Vec2Struc is as follows: with an embedding
representation generated, the nodes similar to each other
need to be gathered. As a pair-wise comparison is exhaustive
and computationally heavy, a filter is needed to find which                       Figure 1: Vec2Struc Procedure.
pairs are best to compare neighborhoods. Then with these
pairs, find if any common structure exists between them and
measure the frequency of such structures to associate them          a cluster that are above a similarity threshold. Using cosine
with each embedding. The procedure of Vec2Struc is shown            similarity, we found pairs of nodes that are above a similarity
in Fig. 1 and is conducted in four stages.                          of 0.9, providing a reduced search space of node pairs.

1. Node Embedding Generation                                        3. Ego Structure Extraction of Similar Pairs in
Embeddings are generated using an approach oriented to-
                                                                    Clusters
ward representing the structural information of a node’s            Next, using the pairs of nodes that are most similar in their
neighborhood. While any structural based embedding                  clusters, pair-wise comparisons were conducted on their lo-
should work theoretically, we only tested this with                 cal neighbor networks. The problem of examining two sep-
Struc2Vec (Ribeiro, Saverese, and Figueiredo 2017).                 arate networks and finding common structures has been de-
                                                                    fined as the Maximum Common Edge Subgraph problem
2. Cluster Modeling on Embedded Space +                             and has been found to be an NP-Hard problem (Bahiense et
Pair-wise Filtering                                                 al. 2012). Hence the need for a reduced space as algorithmic
                                                                    complexity of this problem makes the runtime of current so-
Once embeddings are generated, a similarity heuristic is
                                                                    lutions potentially intractable given a large enough network.
used to find those pairs worth comparing. This can be done
                                                                    To further assist with this performance issue, we only exam-
through clustering nodes using the embeddings. A Gaussian
                                                                    ine 2-hop ego networks centered on nodes of interest before
Mixture model was used in our experiment with the number
                                                                    conducting the maximum common subgraph search.
of clusters chosen through minimizing the Bayesian Infor-
mation Criterion of the number of clusters. Thus nodes have         Maximum Common Subgraphs To tackle finding com-
been binned into groups for which pair-wise comparison can          mon subgraphs, we incorporated the algorithm SailMCS
be made more efficiently.                                           provided by (Larsen et al. 2016). This a heuristic algorithm
   However, even with the clusters there is still a potential for   using a combination of iterative local search and a perturba-
the pair-wise search space to be too large. We conducted fur-       tion strategy. This algorithm is parallelized and handled the
ther filtering through only examining pairs of nodes within         ego-network sizes we found efficiently enough to make this
            Cluster 1      Tot. Structs       979                 strict isomorphic checks within step 3 to fail as the exact
            Struc ID          Freq.         Total %               number of nodes and edges is needed for these checks to re-
          Fan + Tri             84           8.58%                turn true. This was a good indication that isomorphic meth-
          Fan                   45           4.59%                ods are too strict.
          Single Match          24           2.45%                   To showcase similar structures, we manually examined
          Path                  15           1.53%                the commonly found structures and categorized them. Tables
          Fan + Path             3           0.30%                1 and 2 show the frequency in which each category of struc-
          No Match             808          82.53%                ture was found. The primary category found was what we
                                                                  refer to as a Fan (shown within Fig. 2a), which is a central
                                                                  node acting as the only connection between the surround-
 Table 1: Distribution of found structures within Cluster 1.      ing neighborhood. Iterations of the Fan structure was seen
                                                                  in many other types, Fan+Triangle (shown in Fig. 2b) has
           Cluster 2         Tot. Structs       1026              the inclusion of a triangle clique between two nodes and the
            Struc ID            Freq.         Total %             central node. The number of triangles present were domi-
        Fan                      126          12.28%              nantly one, but some structures were found to have up to 3.
        Node bt. Fan              49           4.77%                 The next primary structure was a node acting as a bridge
        Single Match              44           4.28%              between two Fans, referred to as Node Between Fan (shown
        Fan + Path                24           2.34%              in 2c). These highlight a more complex structure as a bridge
        Fan + Tri                 19           1.85%              node is the primary connection between two neighborhoods.
        Path                      19           1.85%              Lastly, we found a fair amount of path structures which were
        Fan + Path + Tri           4           0.38%              only a continuous sequence of nodes with those in the mid-
        No Match                 741          72.22%              dle of the path having exactly 2 edges. Fans were found to
                                                                  have paths extending outside of the central node, which are
                                                                  classified separately from the Fans shown in Fig. 2a.
 Table 2: Distribution of found structures within Cluster 2.
                                                                     The most common structures overall tended to have a Fan
                                                                  structure, which isn’t surprising given that this is a com-
                                                                  mon pattern made from random connections within AML-
method tractable. Examples of common subgraphs found              Sim (Weber et al. 2018). Cluster 1 has a higher frequency
can be seen within Fig. 2.                                        of Fans with a Triangle, indicating more nuanced relation-
                                                                  ships. As Cluster 2’s second most frequent structure was a
4. Frequency Distribution of Structures per Cluster               node acting as a bridge between two fans (shown within Fig.
With the extracted substructures found, we bring this in-         2c) which wasn’t present within Cluster 1, indicating nodes
formation back to the cluster level by examining the dis-         in Cluster 2 may be acting as neighbor connectors.
tribution of structures across clusters. Through equivalence         While unique structures were found across each cluster,
checks between each common subgraph (using a graph iso-           many of the them were unable to find a match (82.53%
morphism algorithm (Cordella et al. 2001)), we count how          within Cluster 1 and 72.22% within Cluster 2). However,
often they occur in each cluster. This distributional represen-   after conducting a visual inspection, many of the structures
tation is used to describe the clusters visually through find-    matched similar high level categorization as the categories
ing the boundaries of the clusters in the embedded space.         already reported within each Cluster. Due to the high amount
   To test this method, we implemented it on a network built      of no matches we were unable to categorize them into the
from a network simulator on financial transactions called         found types. This further indicates a need for a less strict
AMLSim by (Weber et al. 2018). AMLSim works well for              checks on isomorphism and allow for more higher level sim-
a testing environment as we were able to specify network          ilarity measures of substructures to automate this process.
size and density as well as incorporate specific substructures       There were structural types found within the no match cat-
within the network. We built a network of ∼10,000 nodes           egory that were too complex and unique to match any other.
with an average degree of 23.25.                                  Fig. 3 shows an interesting structure as it shows a higher
                                                                  connected pair of Fans as there exist multiple bridges in-
                          Results                                 stead of the more common one bridge within Fig 2c. Again
Using the AMLSim built network, we found 4 clusters us-           highlighting the need for better similarity measures.
ing Struct2Vec embeddings as an input to a Gaussian Mix-
ture Model. Examining the two dominant clusters (Cluster                                 Discussion
1 contains 53% of nodes and Cluster 2 contains 38% of             This work presents Vec2Struc, a new method pushing
nodes), we explored a subset of structures within each clus-      more interpretable representations of state-of-the-art net-
ter. We examined around 1000 structures for each cluster, as      work models. As embeddings have continued to highlight
the number of possible pair-wise comparisons in each clus-        their potential for graphical representations in a more com-
ter, even after filtering, was high (25 million possible pair-    pact space (Goyal and Ferrara 2018), this method provides
wise comparison were possible within Cluster 1). While ex-        a step forward in being able to explain the associations of
amining these, many can be considered similar but still vary      nodes. This is through addressing the first criterion men-
in the number of nodes and edges, which would cause the           tioned by (Weld and Bansal 2019) by bringing forward the
(a) Fan Structures commonly found across (b) Fan + Triangle structures more often found (c) Node between Fan structures commonly
both clusters.                           within cluster 1.                              found within cluster 2.

                      Figure 2: Examples of commonly found structures across the dominant two clusters.


                                                                   (i.e how does density of the network affect runtime?). Real-
                                                                   world networks need to be explored, as the most common
                                                                   type of structure found (Fan shown in Fig. 2a) is easily
                                                                   produced from random edge generation. Lastly, additional
                                                                   work is needed in understanding the dynamics of embed-
                                                                   dings through these visualizations to reach the second crite-
                                                                   rion proposed by (Weld and Bansal 2019).

                                                                                           References
Figure 3: Structures within Cluster 2 containing multiple          Bahiense, L.; Manić, G.; Piva, B.; and De Souza, C. C.
bridges.                                                           2012. The maximum common edge subgraph problem:
                                                                   A polyhedral investigation. Discrete Applied Mathematics
                                                                   160(18):2523–2541.
structural factors influencing decisions made using embed-         Cordella, L. P.; Foggia, P.; Sansone, C.; and Vento, M. 2001.
dings with visualizations of such structures. Visualizations       An improved algorithm for matching large graphs. In 3rd
have been a common representation of other prediction ex-          IAPR-TC15 workshop on graph-based representations in
planations (Ribeiro, Singh, and Guestrin 2016) as well as          pattern recognition, 149–159.
a common representation of networks. Applications of this
work can be within the areas mentioned above in the Intro-         Goyal, P., and Ferrara, E.         2018.       Graph embed-
duction of discovering new Protein-to-Protein interactions         ding techniques, applications, and performance: A survey.
as well as social network structures, but directly this can        Knowledge-Based Systems 151:78–94.
be applied along with AMLSim to be used in financial net-          Larsen, S. J.; Alkærsig, F. G.; Ditzel, H. J.; Jurisica, I.; Al-
works for the discovery of new criminal financial activities       caraz, N.; and Baumbach, J. 2016. A simulated annealing
in money laundering (Weber et al. 2018).                           algorithm for maximum common edge subgraph detection
   While this work showcases a test case of this method, it        in biological networks. In Proc. of the Genetic and Evolu-
still contains many limits in its ability. First being the com-    tionary Computation Conference 2016, 341–348. ACM.
putational complexity of the algorithm and the main bot-           Ribeiro, L. F.; Saverese, P. H.; and Figueiredo, D. R. 2017.
tleneck being the Maximum Common Subgraph discovery                struc2vec: Learning node representations from structural
procedure, which is an NP-Hard problem (Bahiense et al.            identity. In Proc. of the 23rd ACM SIGKDD Interna-
2012). To work around this, filter techniques were used to         tional Conference on Knowledge Discovery and Data Min-
only examine the most common pairs, hence the typically            ing, 385–394. ACM.
assumptions of clustering apply (Clusters within the embed-        Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. Why
ded space are not guaranteed to be separable and goodness          should i trust you?: Explaining the predictions of any clas-
of fit measures are needed to audit the cluster model’s per-       sifier. In Proc. of the 22nd ACM SIGKDD international
formance), as well as the thresholds of our similarity check       conference on knowledge discovery and data mining, 1135–
need to be further examined within each space as to what           1144. ACM.
threshold is appropriate. Another issue lies within our ex-        Weber, M.; Chen, J.; Suzumura, T.; Pareja, A.; Ma, T.;
periment, as we needed to conduct further sampling (1000           Kanezashi, H.; Kaler, T.; Leiserson, C. E.; and Schardl, T. B.
common subgraphs found per cluster) to allow for a reason-         2018. Scalable graph learning for anti-money laundering: A
able runtime of SailMCS.                                           first look. In NeurIPS 2018 Workshop on Challenges and
   Future work could expand on this experiment through             Opportunities for AI in Financial Services: the Impact of
exploring higher frequency of specific structures of inter-        Fairness, Explainability, Accuracy, and Privacy.
est, as well as explore other network types as this is un-
clear of its performance given global network properties,          Weld, D. S., and Bansal, G. 2019. The challenge of crafting
                                                                   intelligible intelligence. Commun. ACM 62(6):70–79.

</pre>