<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Vec2Struc: A Method Towards Explainable Structural-Based Node Embeddings</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Srishti Tomar</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ryan Compton Quantiply</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>W San Carlos St.</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>San Jose</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>California</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>srishti</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>comptong@quantiply.com</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>23</fpage>
      <lpage>25</lpage>
      <abstract>
        <p>Network embeddings are a popular new method for encapsulating the complexity of networks in a reduced feature space. However, embeddings obfuscates the complexity of networks and make explainability of such models difficult. We propose Vec2Struc, a new method aimed at providing more explainable representations of network models. This approach examines nodes similar to one another in the generated embedding space and highlights common structures among them. This method provides a means to explore embeddings distributionally and visually, leading towards more transparent and interpretable AI systems utilizing state-of-the-art structural-based node embeddings.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Networks have long been used in analysing complex
systems with entity-to-entity relationships. For example,
biologists have used networks to represent Protein-to-Protein
interactions whereas social scientist’s usage has been in
understanding of friendship or community relationships
        <xref ref-type="bibr" rid="ref4">(Goyal
and Ferrara 2018)</xref>
        . A growing consensus on the lack of
common metrics and modeling of networks has led to a more
representative approach in building condensed vectorized
representations of network properties.
      </p>
      <p>
        Network (or Node) embeddings are an approach that
provide scalable learning within networks. Typically, networks
are very rich in information and the size of them explored by
researchers has increased into the billions of nodes. Through
embeddings, a reduced feature space can be created to
encapsulate both network and node information. With this
reduced space comes more scalable and generalizable methods
without needing to reduce the amount of information being
stored in the network
        <xref ref-type="bibr" rid="ref4">(Goyal and Ferrara 2018)</xref>
        .
      </p>
      <p>
        There are many variants to embedding approaches
        <xref ref-type="bibr" rid="ref4 ref7">(Goyal
and Ferrara 2018; Ribeiro, Saverese, and Figueiredo 2017)</xref>
        ,
but there remains to be a method for assessing what these
embeddings represent as they are an ambiguation of the
original network, thus leaving the question how can researchers
determine the real-world representation of the node
associations in such an ambiguous space? There is a need to
map the embedding space to real-world representations if
we are to better understand models that rely on network
embeddings.
      </p>
      <p>This short paper proposes Vec2Struc, an algorithmic
approach for discovering common structures or topologies
between nodes in an embedded space. While this is not a
universal approach for network embeddings (more on that
below), this provides a first step toward interpreting network
embeddings, paving the way for such state-of-the-art
techniques to be more usable within AI systems that are
dependent on explainability.</p>
      <sec id="sec-1-1">
        <title>Network/Node Embeddings</title>
        <p>
          Network embeddings are formally defined from
          <xref ref-type="bibr" rid="ref4">(Goyal and
Ferrara 2018)</xref>
          :
        </p>
        <p>Given a network G = (V,E), a network embedding is a
mapping f : vi ! yi 2 Rd; 8i 2 [n] such that d jV j and
the function f preserves some proximity measure defined on
network G</p>
        <p>Embeddings have been oriented to two different
vectorized representations. 1) a representation of a network as a
whole (referred to as network or graph embedding) and 2) a
representation of node properties within a network (referred
to as node embeddings). This work focuses on the node level
representation and hence forth in this paper when referring
to network embeddings this is the prospective taken.</p>
        <p>
          <xref ref-type="bibr" rid="ref4">(Goyal and Ferrara 2018)</xref>
          categorize embedding methods
into taxonomy of three categories: Factorization, Random
Walk, and Deep Learning. But these methods miss the
importance of the local structures surrounding nodes.
Understanding local structure properties is vital for analyzing
complex networks as they showcase higher level properties
beyond the one-to-one interactions commonly examined.
        </p>
        <p>
          Additional methods have been implemented in which
the structural identity of a node’s neighbor is considered
          <xref ref-type="bibr" rid="ref7">(Ribeiro, Saverese, and Figueiredo 2017)</xref>
          . Struc2Vec,
proposed by
          <xref ref-type="bibr" rid="ref7">(Ribeiro, Saverese, and Figueiredo 2017)</xref>
          ,
constructs a structural context for nodes through a multi-layer
network which encodes structural similarities within the
embeddings. Previous network embedding methods tend to
encapsulate information around node distances, however
Struc2Vec is noted to be distance agnostic and can embed
structural information between nodes very far apart in the
network
          <xref ref-type="bibr" rid="ref7">(Ribeiro, Saverese, and Figueiredo 2017)</xref>
          . While
other types of structural based embeddings exist
          <xref ref-type="bibr" rid="ref4">(Goyal and
Ferrara 2018)</xref>
          , this work utilizes Struc2Vec embeddings.
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>Need for Explainable AI</title>
        <p>
          The need for such complex methods to be explainable is
argued by
          <xref ref-type="bibr" rid="ref11">(Weld and Bansal 2019)</xref>
          that most computer-based
produced behavior is actually alien to humans and can fail in
many unexpected ways. People can’t trust nor control
complex systems that lack the ability to verify why they
produced a specific output. The authors propose multiple
criterion for which an AI system to reach such an ability:
1. It is clear what factors caused the system’s action,
allowing users to predict how changes to the situation would
have led to alternative behaviors
2. Permits effective control of the AI by enabling interaction
To approach such a system, they propose the need to
ensure that the underlying reasoning is interpretable
          <xref ref-type="bibr" rid="ref11">(Weld and
Bansal 2019)</xref>
          .
        </p>
        <p>
          Vec2Struc fulfills the first criterion proposed by
          <xref ref-type="bibr" rid="ref11">(Weld and
Bansal 2019)</xref>
          through visual representations of the network
topologies. This allows for high-level heuristic evaluation
of the embedded space, highlighting the factors for which
nodes are associated in structural-based embeddings.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Method</title>
      <p>The summary of Vec2Struc is as follows: with an embedding
representation generated, the nodes similar to each other
need to be gathered. As a pair-wise comparison is exhaustive
and computationally heavy, a filter is needed to find which
pairs are best to compare neighborhoods. Then with these
pairs, find if any common structure exists between them and
measure the frequency of such structures to associate them
with each embedding. The procedure of Vec2Struc is shown
in Fig. 1 and is conducted in four stages.</p>
      <sec id="sec-2-1">
        <title>1. Node Embedding Generation</title>
        <p>
          Embeddings are generated using an approach oriented
toward representing the structural information of a node’s
neighborhood. While any structural based embedding
should work theoretically, we only tested this with
Struc2Vec
          <xref ref-type="bibr" rid="ref7">(Ribeiro, Saverese, and Figueiredo 2017)</xref>
          .
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2. Cluster Modeling on Embedded Space +</title>
      </sec>
      <sec id="sec-2-3">
        <title>Pair-wise Filtering</title>
        <p>Once embeddings are generated, a similarity heuristic is
used to find those pairs worth comparing. This can be done
through clustering nodes using the embeddings. A Gaussian
Mixture model was used in our experiment with the number
of clusters chosen through minimizing the Bayesian
Information Criterion of the number of clusters. Thus nodes have
been binned into groups for which pair-wise comparison can
be made more efficiently.</p>
        <p>However, even with the clusters there is still a potential for
the pair-wise search space to be too large. We conducted
further filtering through only examining pairs of nodes within
a cluster that are above a similarity threshold. Using cosine
similarity, we found pairs of nodes that are above a similarity
of 0.9, providing a reduced search space of node pairs.</p>
      </sec>
      <sec id="sec-2-4">
        <title>3. Ego Structure Extraction of Similar Pairs in</title>
      </sec>
      <sec id="sec-2-5">
        <title>Clusters</title>
        <p>Next, using the pairs of nodes that are most similar in their
clusters, pair-wise comparisons were conducted on their
local neighbor networks. The problem of examining two
separate networks and finding common structures has been
defined as the Maximum Common Edge Subgraph problem
and has been found to be an NP-Hard problem (Bahiense et
al. 2012). Hence the need for a reduced space as algorithmic
complexity of this problem makes the runtime of current
solutions potentially intractable given a large enough network.
To further assist with this performance issue, we only
examine 2-hop ego networks centered on nodes of interest before
conducting the maximum common subgraph search.</p>
      </sec>
      <sec id="sec-2-6">
        <title>Maximum Common Subgraphs To tackle finding com</title>
        <p>
          mon subgraphs, we incorporated the algorithm SailMCS
provided by
          <xref ref-type="bibr" rid="ref6">(Larsen et al. 2016)</xref>
          . This a heuristic algorithm
using a combination of iterative local search and a
perturbation strategy. This algorithm is parallelized and handled the
ego-network sizes we found efficiently enough to make this
        </p>
      </sec>
      <sec id="sec-2-7">
        <title>4. Frequency Distribution of Structures per Cluster</title>
        <p>
          With the extracted substructures found, we bring this
information back to the cluster level by examining the
distribution of structures across clusters. Through equivalence
checks between each common subgraph (using a graph
isomorphism algorithm
          <xref ref-type="bibr" rid="ref2">(Cordella et al. 2001)</xref>
          ), we count how
often they occur in each cluster. This distributional
representation is used to describe the clusters visually through
finding the boundaries of the clusters in the embedded space.
        </p>
        <p>To test this method, we implemented it on a network built
from a network simulator on financial transactions called
AMLSim by (Weber et al. 2018). AMLSim works well for
a testing environment as we were able to specify network
size and density as well as incorporate specific substructures
within the network. We built a network of 10,000 nodes
with an average degree of 23.25.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>Using the AMLSim built network, we found 4 clusters
using Struct2Vec embeddings as an input to a Gaussian
Mixture Model. Examining the two dominant clusters (Cluster
1 contains 53% of nodes and Cluster 2 contains 38% of
nodes), we explored a subset of structures within each
cluster. We examined around 1000 structures for each cluster, as
the number of possible pair-wise comparisons in each
cluster, even after filtering, was high (25 million possible
pairwise comparison were possible within Cluster 1). While
examining these, many can be considered similar but still vary
in the number of nodes and edges, which would cause the
strict isomorphic checks within step 3 to fail as the exact
number of nodes and edges is needed for these checks to
return true. This was a good indication that isomorphic
methods are too strict.</p>
      <p>To showcase similar structures, we manually examined
the commonly found structures and categorized them. Tables
1 and 2 show the frequency in which each category of
structure was found. The primary category found was what we
refer to as a Fan (shown within Fig. 2a), which is a central
node acting as the only connection between the
surrounding neighborhood. Iterations of the Fan structure was seen
in many other types, Fan+Triangle (shown in Fig. 2b) has
the inclusion of a triangle clique between two nodes and the
central node. The number of triangles present were
dominantly one, but some structures were found to have up to 3.</p>
      <p>The next primary structure was a node acting as a bridge
between two Fans, referred to as Node Between Fan (shown
in 2c). These highlight a more complex structure as a bridge
node is the primary connection between two neighborhoods.
Lastly, we found a fair amount of path structures which were
only a continuous sequence of nodes with those in the
middle of the path having exactly 2 edges. Fans were found to
have paths extending outside of the central node, which are
classified separately from the Fans shown in Fig. 2a.</p>
      <p>The most common structures overall tended to have a Fan
structure, which isn’t surprising given that this is a
common pattern made from random connections within
AMLSim (Weber et al. 2018). Cluster 1 has a higher frequency
of Fans with a Triangle, indicating more nuanced
relationships. As Cluster 2’s second most frequent structure was a
node acting as a bridge between two fans (shown within Fig.
2c) which wasn’t present within Cluster 1, indicating nodes
in Cluster 2 may be acting as neighbor connectors.</p>
      <p>While unique structures were found across each cluster,
many of the them were unable to find a match (82.53%
within Cluster 1 and 72.22% within Cluster 2). However,
after conducting a visual inspection, many of the structures
matched similar high level categorization as the categories
already reported within each Cluster. Due to the high amount
of no matches we were unable to categorize them into the
found types. This further indicates a need for a less strict
checks on isomorphism and allow for more higher level
similarity measures of substructures to automate this process.</p>
      <p>There were structural types found within the no match
category that were too complex and unique to match any other.
Fig. 3 shows an interesting structure as it shows a higher
connected pair of Fans as there exist multiple bridges
instead of the more common one bridge within Fig 2c. Again
highlighting the need for better similarity measures.</p>
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>
        This work presents Vec2Struc, a new method pushing
more interpretable representations of state-of-the-art
network models. As embeddings have continued to highlight
their potential for graphical representations in a more
compact space
        <xref ref-type="bibr" rid="ref4">(Goyal and Ferrara 2018)</xref>
        , this method provides
a step forward in being able to explain the associations of
nodes. This is through addressing the first criterion
mentioned by
        <xref ref-type="bibr" rid="ref11">(Weld and Bansal 2019)</xref>
        by bringing forward the
(a) Fan Structures commonly found across (b) Fan + Triangle structures more often found (c) Node between Fan structures commonly
both clusters. within cluster 1. found within cluster 2.
structural factors influencing decisions made using
embeddings with visualizations of such structures. Visualizations
have been a common representation of other prediction
explanations
        <xref ref-type="bibr" rid="ref9">(Ribeiro, Singh, and Guestrin 2016)</xref>
        as well as
a common representation of networks. Applications of this
work can be within the areas mentioned above in the
Introduction of discovering new Protein-to-Protein interactions
as well as social network structures, but directly this can
be applied along with AMLSim to be used in financial
networks for the discovery of new criminal financial activities
in money laundering (Weber et al. 2018).
      </p>
      <p>While this work showcases a test case of this method, it
still contains many limits in its ability. First being the
computational complexity of the algorithm and the main
bottleneck being the Maximum Common Subgraph discovery
procedure, which is an NP-Hard problem (Bahiense et al.
2012). To work around this, filter techniques were used to
only examine the most common pairs, hence the typically
assumptions of clustering apply (Clusters within the
embedded space are not guaranteed to be separable and goodness
of fit measures are needed to audit the cluster model’s
performance), as well as the thresholds of our similarity check
need to be further examined within each space as to what
threshold is appropriate. Another issue lies within our
experiment, as we needed to conduct further sampling (1000
common subgraphs found per cluster) to allow for a
reasonable runtime of SailMCS.</p>
      <p>
        Future work could expand on this experiment through
exploring higher frequency of specific structures of
interest, as well as explore other network types as this is
unclear of its performance given global network properties,
(i.e how does density of the network affect runtime?).
Realworld networks need to be explored, as the most common
type of structure found (Fan shown in Fig. 2a) is easily
produced from random edge generation. Lastly, additional
work is needed in understanding the dynamics of
embeddings through these visualizations to reach the second
criterion proposed by
        <xref ref-type="bibr" rid="ref11">(Weld and Bansal 2019)</xref>
        .
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <year>2012</year>
          .
          <article-title>The maximum common edge subgraph problem: A polyhedral investigation</article-title>
          .
          <source>Discrete Applied Mathematics</source>
          <volume>160</volume>
          (
          <issue>18</issue>
          ):
          <fpage>2523</fpage>
          -
          <lpage>2541</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Cordella</surname>
            ,
            <given-names>L. P.</given-names>
          </string-name>
          ; Foggia,
          <string-name>
            <given-names>P.</given-names>
            ;
            <surname>Sansone</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          ; and Vento,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>An improved algorithm for matching large graphs</article-title>
          .
          <source>In 3rd IAPR-TC15 workshop on graph-based representations in pattern recognition</source>
          ,
          <fpage>149</fpage>
          -
          <lpage>159</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ferrara</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Graph embedding techniques, applications, and performance: A survey</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>Knowledge-Based Systems</source>
          <volume>151</volume>
          :
          <fpage>78</fpage>
          -
          <lpage>94</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Larsen</surname>
            ,
            <given-names>S. J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Alkaersig</surname>
            ,
            <given-names>F. G.</given-names>
          </string-name>
          ; Ditzel,
          <string-name>
            <given-names>H. J.</given-names>
            ;
            <surname>Jurisica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            ; Alcaraz, N.; and
            <surname>Baumbach</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.</surname>
          </string-name>
          <year>2016</year>
          .
          <article-title>A simulated annealing algorithm for maximum common edge subgraph detection in biological networks</article-title>
          .
          <source>In Proc. of the Genetic and Evolutionary Computation Conference</source>
          <year>2016</year>
          ,
          <fpage>341</fpage>
          -
          <lpage>348</lpage>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Ribeiro</surname>
            ,
            <given-names>L. F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Saverese</surname>
            ,
            <given-names>P. H.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Figueiredo</surname>
            ,
            <given-names>D. R.</given-names>
          </string-name>
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <article-title>struc2vec: Learning node representations from structural identity</article-title>
          .
          <source>In Proc. of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          ,
          <fpage>385</fpage>
          -
          <lpage>394</lpage>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Ribeiro</surname>
          </string-name>
          , M. T.;
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Guestrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Why should i trust you?: Explaining the predictions of any classifier</article-title>
          .
          <source>In Proc. of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining</source>
          ,
          <fpage>1135</fpage>
          -
          <lpage>1144</lpage>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          2018.
          <article-title>Scalable graph learning for anti-money laundering: A first look</article-title>
          . In NeurIPS 2018 Workshop on Challenges and
          <article-title>Opportunities for AI in Financial Services: the Impact of Fairness, Explainability</article-title>
          , Accuracy, and Privacy.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Weld</surname>
            ,
            <given-names>D. S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Bansal</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>The challenge of crafting intelligible intelligence</article-title>
          .
          <source>Commun. ACM</source>
          <volume>62</volume>
          (
          <issue>6</issue>
          ):
          <fpage>70</fpage>
          -
          <lpage>79</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>