=Paper= {{Paper |id=Vol-3630/paper10 |storemode=property |title=SuMExplorer: Summarisation-based Frequent Subgraph Mining for Visual Exploratory Subgraph Searching |pdfUrl=https://ceur-ws.org/Vol-3630/LWDA2023-paper10.pdf |volume=Vol-3630 |authors=Chimi Wangmo,Lena Wiese |dblpUrl=https://dblp.org/rec/conf/lwa/WangmoW23 }} ==SuMExplorer: Summarisation-based Frequent Subgraph Mining for Visual Exploratory Subgraph Searching== https://ceur-ws.org/Vol-3630/LWDA2023-paper10.pdf

SuMExplorer: Summarisation-based Frequent
Subgraph Mining for Visual Exploratory Subgraph
Searching
Chimi Wangmo1 , Lena Wiese1
1
Goethe University Frankfurt, Robert-Mayer-Str. 10, 60629, Frankfurt am Main, Germany

Abstract
Graphs are ubiquitous with their vast application in various disciplines, including bioinformatics, security,
and social sciences. One of the most popular queries involves searching for subgraphs and similarities
on large numbers of small-to-medium-sized graphs. The visual exploratory subgraph search is proposed
to be a search paradigm to support visual analysis and to ease user interaction. In this paper, we address
the problem of reducing computational complexity during the indexing stage of visual exploratory
subgraph searching. We propose (1) a compact summarisation of our input graph to reduce the index size
and construction time. In addition, (2) we introduce weighted frequent subgraph mining for our novel
graph summarisation-based visual subgraph searching. Based on these, (3) we propose a novel index
structure to represent the subgraphs extracted from the summary graphs in a compact form. Finally, (4)
we empirically demonstrate that the proposed framework provides higher efficiency than the baseline by
benchmarking on three real-world biological datasets.

Keywords
visual exploratory subgraph searching, graph summarisation, graph indexing, graph database

1. Introduction
Subgraph containment and similarity searching are critical problems in analysing social net-
works, computational biology, collaboration networks, etc. In the case of subgraph containment,
our interest lies in finding all the labelled graphs that contain a given query graph i.e. a subgraph
to a given supergraph. In subgraph similarity, earlier works such as [1] have investigated the
concept of missing edges and the need for a distance measure to identify similar graphs. The
naive approach to solving this problem involves adopting a subgraph isomorphism algorithm.
However, [2] proved that performing vertex-to-vertex mapping in the subgraph isomorphism
problem is NP-complete. As a result, we introduce subgraph-based graph indexing to prune
unrelated candidate labelled graphs and perform subgraph isomorphism (also known as “verifi-
cation”). We call this process subgraph-based graph indexing because it requires embedding
subgraphs within the graph database to filter non-candidate labelled graphs.
LWDA’23: Lernen, Wissen, Daten, Analysen. October 09–11, 2023, Marburg, Germany
∗
Corresponding author.
†
These authors contributed equally.
Envelope-Open wangmo@uni-frankfurt.de (C. Wangmo); lwiese@cs.uni-frankfurt.de (L. Wiese)
GLOBE http://wiese.free.fr/ (L. Wiese)
Orcid 0000-0002-8921-398X (C. Wangmo); 0000-0003-3515-9209 (L. Wiese)
© 2022 Copyright © 2023 by the paper’s authors. Copying permitted only for private and academic purposes. In: M. Leyer, Wichmann, J. (Eds.): Proceedings of
the LWDA 2023 Workshops: BIA, DB, IR, KDML and WM. Marburg, Germany, 09.-11. October 2023, published at http://ceur‐ws.org.
CEUR
Workshop
Proceedings
http://ceur-ws.org
ISSN 1613-0073
CEUR Workshop Proceedings (CEUR-WS.org)

CEUR
ceur-ws.org
Workshop ISSN 1613-0073
Proceedings
The use of current graph databases require knowledge of either programming or query
language; hence the usage of modern databases is limited to specific, knowledgeable users.
On the other hand, users such as chemists may lack an understanding of complex knowledge
regarding query formulation using a particular query language. As a result, visual exploratory
subgraph searching helps to overcome such issues [3]. To summarise, the manipulation opera-
tions available to users in visual exploratory subgraph searching comprise: (1) the addition of
a new vertex and its connection to an existing vertex or the addition of a new edge between
existing vertices, (2) the deletion of either an edge or of both a vertex and the corresponding
edges, and (3) finally, the execution of query to find a matching subgraph in the transactional
graph database. Operations (1) and (2) lead to modifying a query graph and, hence, require a
maintenance algorithm for an index structure. Visual exploratory subgraph searching relies on
the inner workings of the index structure to perform the subgraph query processing efficiently.
We have looked, in particular, at solving issues related to the efficiency of index construction
and memory usage and improving the effectiveness of filtering power and query processing.
In this paper, we propose a new framework for efficiently representing the graph as a summary
graph and improving pruning power using weighted-frequent subgraph mining (Sec. 4). We
introduce frequent subgraph mining on the summary graph with weighting functions. We
devise a new index structure, called the Weighted-Index (W-Index), for compactly storing
frequent and discriminative in-frequent subgraphs in the summary graph (Sec. 5 and 6). Finally,
we verify the efficiency of the proposed solutions with extensive experiments (Sec. 7).

2. Notations and Problem Statement
In this section, we first present the fundamental concepts related to graph theory, focussing
on identifying the topological components that correspond to real-world entities. To maintain
consistency, we use the term “vertex” to refer to an entity in the graph, while the term “node” is
reserved for objects in the index graph. We now formally define the labelled graph as follows.

Definition 1 (Labelled Graph). A labelled graph is a triplet 𝐺 = (𝑉𝐺 , 𝐸𝐺 , 𝐿𝐺 ) where 𝑉𝐺 denotes
the set of vertices, 𝐸𝐺 ⊆ {{𝑢, 𝑣} |𝑢, 𝑣 ∈ 𝑉𝐺 } is the set of edges, and 𝐿𝐺 ∶ 𝑉𝐺 ↦ 𝜎𝐺 assigns a label to
(1) (2)
a vertex where 𝜎𝐺 = {𝜎𝐺 , 𝜎𝐺 , …} is the set of labels. We refer to the number of edges |𝐸𝐺 | in the
given graph 𝐺 as its size and the number of vertices and labels as |𝑉𝐺 | and |𝜎𝐺 |, respectively.

We work with a transactional graph database, which contains a set of 𝑛 medium-sized labelled
graphs, denoted as 𝒢 = {𝐺1 , 𝐺2 , 𝐺3 , … , 𝐺𝑛 }. Further, we denote the cardinality of the graph
database as |𝒢 | = 𝑛, the average number of vertices, edges, and labels of all the labelled graphs in
the transactional graph database as |𝑉𝒢 |, |𝐸𝒢 |, and |𝜎𝒢 |, respectively. In this paper, our motivation
is to query for a query graph; this is a subgraph of a labelled graph in the transactional graph
database. Hence, it is necessary to understand and conduct subgraph isomorphism iteratively
for each labelled graph in the graph database to solve the query. Yet, the major downside of
subgraph isomorphism is its expensive operation time.

Definition 2 (Subgraph Isomorphism [4]). Given a query graph, 𝑄, and a labelled graph,
𝐺, an embedding of 𝑄 in 𝐺 is a mapping, where 𝑀 ∶ 𝑉𝑄 → 𝑉𝐺 such that: (1) 𝑀 is injective, i.e.
𝑀(𝑢) ≠ 𝑀(𝑣) for 𝑢 ≠ 𝑣 ∈ 𝑉𝑄 , (2) the labels match, i.e. 𝐿𝑄 (𝑢) = 𝐿𝐺 (𝑀(𝑢)) for every 𝑢 ∈ 𝑉𝑄 , and
(3) and all edges of 𝑄 exist in 𝐺, i.e. (𝑀(𝑢), 𝑀(𝑣))(or(𝑀(𝑣), 𝑀(𝑢))) ∈ 𝐸𝐺 ∀(𝑢, 𝑣)(or(𝑣, 𝑢)) ∈ 𝐸𝑄 .
Graph 𝑄 is subgraph-isomorphic to 𝐺, denoted by 𝑄 ⊆ 𝐺 if there exists an embedding of 𝑄 in 𝐺.

Example 1. In Fig. 1 (a), we illustrate an example of labelled graphs representing molecules in
the transactional graph database, denoted by 𝒢. The transactional graph database contains 5
molecular graphs, 𝒢 = {𝐺1 , 𝐺2 , 𝐺3 , 𝐺4 , 𝐺5 }. Each labelled graph contains several vertices (12 in 𝐺1 ,
5 in 𝐺2 , 6 in 𝐺3 , 8 in 𝐺4 , and 3 in 𝐺5 ) denoted by 𝑣𝑙 where 1 ≤ 𝑙 ≤ |𝑉𝐺 |, and its corresponding label
denoted by a letter, 𝜎𝑙 ∈ 𝜎; here, 𝜎 = {𝐶, 𝐻 , 𝑂}. The vertex 𝑣1 in 𝐺1 contains label 𝐶. Furthermore,
there exists an edge from one node to another, such as an edge from 𝑣1 to 𝑣2 .

2.1. Problem Statement
Prior to providing a formal definition of visual exploratory subgraph search (𝑉 𝐸𝑆𝑆), we formally
define the query graph in VESS, 𝑄𝑡 , as follows.

Definition 3 (Query Graph in VESS). The query graph, 𝑄𝑡 , is a labelled graph (see Definition
1), which we start with an insertion of a single edge, 𝑄1 = (𝑉𝑄1 , 𝐸𝑄1 , 𝐿𝑄1 ), where 𝐸𝑄1 = {𝑒𝑄1 }.
We then sequentially perform incremental updates based on the user’s update operation, 𝑜𝑝, on
the source and destination vertices, denoted by 𝑣𝑠 and 𝑣𝑑 , respectively. The user can perform an
operation on the previous instance of the query graph at time 𝑡 − 1, where 1 ≤ 𝑡 ≤ 𝒩. Here, 𝑜𝑝 is
either “+” or “-”, indicating an add or remove operation, respectively. The resulting graph from the
addition of a 𝑣𝑄𝑡 and an edge, 𝑒𝑄𝑡 , to 𝑄𝑡−1 is 𝑄𝑡 = {𝑉𝑄𝑡 , 𝐸𝑄𝑡 , 𝐿𝑄𝑡 }. It has 𝑉𝑄𝑡 nodes and 𝐸𝑄𝑡 edges.

Definition 4 (Visual Exploratory Subgraph Search). Given a batch of update operations, 𝐵,
on an initial query graph 𝑄1 , and a transactional graph database, 𝒢, the visual exploratory
subgraph search (VESS) finds all the graphs, 𝐺𝑖 ∈ 𝒢, that contain the query graph, 𝑄𝑡 = {𝐵, 𝑄1 }.

Example 2. In Fig. 1 (b), we provide an example of exploratory subgraph searching in a trans-
actional graph database. In this figure, we have denoted time with 𝑡𝑖 , where 1 ≤ 𝑖 ≤ 𝒩. Initially,
the user will begin the exploratory search with a single-edge addition at time 𝑡1 . The exploratory
search involves a series of add() operations on new vertices and connections to existing vertices, as
observed in time 𝑡2 , 𝑡3 , 𝑡5 , 𝑡8 , 𝑡9 , 𝑡10 , 𝑡11 . In addition, the add() operation may be simply an addition
of a new edge between two existing vertices, as observed at time 𝑡6 . Furthermore, the user may also
remove certain edges, as seen at time 𝑡4 and 𝑡7 . Finally, after some time, say 𝑡11 , the user may issue
the 𝑟𝑢𝑛() operation to find answers to the searched query.

In this paper, we focus on the subgraph containment problem formally defined as follows.

Definition 5 (Subgraph Containment). Given a query graph 𝑄 and labelled graphs, 𝐺𝑖 ∈ 𝒢,
the subgraph containment (also called “Subgraph search”) problem is to find all the labelled
graphs that contain 𝑄.
V6
H V7 H H O O
G2 G3
V5 V3
V8 V2
V12 V1 C C C
C H
H V6 V1
V1 V2
C C H H O O
V2 V4 V3
V5
G1 V4
V5
V3 V6 V1 O
C C H O
C H G4 G5
H V3 V8
V4 H C
V11 V9 H C C
V4
H V1 V2 V2 V3
O O
H
V10 V5 V7
(a)
H H H

H H H H H H H H H H H H H
H
C C C C C C C C C
C
C H
C C C C C C C C C

C C
H H O O O O O O O O O

t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11
(b)

Figure 1: Running example of (a) a transactional graph database and performing (b) visual exploratory
searching through add(), modify(), and run() operations. We have highlighted (red for add() and grey
for modify()) the vertices and edges to indicate the current modification made.

3. Related work
In a graph database system, one of the main components of management is query processing,
which includes an operation to facilitate efficient searching of the query graph. The ProgressIve
Connected SubgrAph Substructure Search TOol (PICASSO) [3] platform is the first experiment
that supports the VESS. The framework was mainly utilised for lookup operations, primarily
emphasising streams of edges forming the query graph. PICASSO proposed two index structures
namely, the action-aware frequent index and action-aware infrequent index, denoted as 𝐴2 𝐹 and
𝐴2 𝐼, respectively. 𝐴2 𝐹 indexes frequent subgraphs while 𝐴2 𝐼 indexes discriminative infrequent
subgraphs. Furthermore, the author(s) of PICASSO maintain another online index known as
𝑆𝑃𝐼 𝐺. The work requires the construction of 𝑆𝑃𝐼 𝐺 sets, which are mainly time and memory
intensive due to the underlying index structure, for every query formulation.
The PICASSO indexes exhibits a large size and requires a significant amount of time for
construction. This is because of the many subgraphs (ranging from a few thousand to millions of
molecular compounds) in the transactional graph database. In particular, memory consumption
increases greatly with the increasing numbers of labelled graphs present in the transactional
graph database. Additionally, the index construction time is substantial due to the requirement
of detecting numerous frequent subgraphs within each labeled graph. In the case of pruning
power, the prior index structure treats each subgraph as a disconnected component. As a
result, we lose the neighbourhood information associated with a frequent subgraph, despite
it containing structural information. To avoid expensive graph operations and space usage,
we adopt graph summarisation to represent a general graph and perform weighted frequent
subgraph mining on the summary graphs.
4. The SuMExplorer Architecture
4.1. The Overall Framework
Our novel framework, Summarisation-based frequent subgraph mining for visual exploratory
subgraph searching, SuMExplorer, consists of six modules: (1) a user-interface module, where
we provide a visual interface for a user to formulate queries through add(), modify() and run()
operations, (2) a graph loader, representing the textual format in an appropriate graph structure,
(3) a graph summariser, representing the input graph structures in the summarised form, (4) a
frequent subgraph miner, mining frequent subgraphs efficiently, (5) a graph Indexer, storing the
frequent and in-frequent subgraphs of the summarised graphs, and (6) a manipulator, performing
update operations on a query graph. In Section 5 and 6, we will provide the discussion on
modules (5) and (6). The SuMExplorer is depicted in the Fig. 2. In this paper, we propose a
1. User Interface

2. Graph
g1 Loader

3. Graph 4. Frequent
5. W-Index
Summarizer Pattern Miner

Offline
Online

Subgraph
2. Graph Loader M-Index Filter
Isomorphism

q0 6. Add/Delete Result

Figure 2: SuMExplorer Framework.

novel method of utilising a summary graph for minimal memory usage by indexes and faster
query processing. Firstly, we will provide a formal definition of the intermediate summary graph
(see Definition 6), which we obtain from the general graph represented by the graph loader
module. The graph summariser module converts a simple, undirected, labelled graph to its
intermediate summary graph through “vertex-based summarisation”.

Definition 6 (Intermediate summary graph). Let 𝐺 = (𝑉𝐺 , 𝐸𝐺 , 𝐿𝐺 ) be a labelled graph. We
denote an intermediate summary graph of 𝐺 as 𝐼 𝐺 ′ = (𝑉𝐼 𝐺 ′ , 𝐸𝐼 𝐺 ′ , 𝐿𝐼 𝐺 ′ , 𝑊𝐼 𝐺 ′ ). The set 𝑉𝐼 𝐺 ′
comprises the supernodes containing distinct vertices, 𝑉𝐼 𝐺 ′ ⊂ 𝑉𝐺 , and we refer to the edges between
these supernodes as superedges, 𝐸𝐼 𝐺 ′ . 𝐿𝐼 𝐺 ′ ∶ 𝑉𝐼 𝐺 ′ ↦ 𝜎𝐼 𝐺 ′ assigns a label to a vertex where
(1) (2)
𝜎𝐼 𝐺 ′ = {𝜎𝐼 𝐺 ′ , 𝜎𝐼 𝐺 ′ , …} is the set of labels. With the graph summarisation algorithm, we assign the
weights on the superedges as 𝑊𝐼 𝐺 ′ . The weight on edge, 𝑊𝐼 𝐺 ′ is a real number that indicates
the number of merged vertices sharing a common property and that are connected to the same
destination vertex. The 𝑊𝐼 𝐺 ′ is a metric that shows the interconnectivity and, thus, the importance
of the nodes in the graph.

We generate an intermediate summary graph for each graph in the transactional graph
database. Given a labeled graph in the transactional graph database, 𝐺 ∈ 𝒢, we can now prove
the following result.
Lemma 1. The size of an intermediate summary graph 𝐼 𝐺 ′ is at the most the size of the original
graph, |𝐼 𝐺 ′ | ≤ |𝐺|.
We will use SSumM, the Sparse Summarisation of Massive Graphs [5] proposed to obtain a
space-efficient summary graph with a lower reconstruction error. The reconstruction error is the
difference between the original graph 𝐺 and the reconstructed graph from the summary graph
𝐼 𝐺 ′ . The intermediate summary graph can provide compactness and structure preservation
by reducing nodes and edges, however, the result of the intermediate summary graph is an
edge-weighted multi-graph. This is an issue since there need to be more studies for frequent
subgraph mining on the multi-weighted graph. Due to the space constraints, we omit computing
the weighted frequent mining here.
Our index structure is responsible for maintaining the extracted weighted frequent and
in-frequent subgraphs (or edges). We have two index structures designed to facilitate efficient
query processing. The first is an offline-based index structure, which we call a Weighted-Index,
denoted by W-Index. In W-Index, we index the subgraphs extracted from the labelled graphs
within the transactional graph database. We build this index only once. The second index is
an online-based index structure, which we call a Maintainable-Index, denoted by M-Index.
In the case of M-Index, the input graph is incrementally built from a single edge through the
add() operation induced by the end user. Therefore, we require an index structure to maintain
the existing result in order to facilitate efficient query processing efficiently. We provide the
detailed discussion in Section 6.

5. W-Index
In this section, we will discuss the offline-based index structure, W-Index, and its index con-
struction process. The W-Index is a hybrid Trie-based data structure that efficiently supports
subgraph search queries by using subgraph containment searches and the intersection of
graph identifiers. Each indexed node, denoted by 𝛾𝑢 ∈ 𝛾, represents frequent (discriminative
in-frequent) subgraphs. We formally define the indexed node, 𝛾𝑢 ∈ 𝛾, as follows.

Definition 7 (Index node in W-Index). The indexed node in the W-Index is a quadruple 𝛾𝑢 =
⟨𝑖𝑑, 𝐶ℎ , 𝑝, 𝐺𝐼 ⟩. In order to identify uniquely the indexed node, we assign a unique identifier to the
indexed node, denoted as 𝑖𝑑, based on its insertion order. Now, to support the storing of the frequent
(or in-frequent) subgraphs, we store the label, denoted as 𝐶ℎ . 𝐶ℎ is a canonical labelling of the
extracted subgraphs, which we represent using the Depth First Search (DFS) code. DFS code is a
graph sequentialisation method, which represents a graph as a string sequence translating to the
edge sequences in the graph. We can obtain DFS code through the depth-first search method and
is ordered based on the lexicographic order of the vertices label. We chose the minimal DFS code
for this study. Furthermore, we store a boolean value 𝑝 in each node, which we use to differentiate
between frequent and in-frequent subgraphs (or edge). Finally, we store the encoded list of graph
identifiers 𝐺𝐼 associated with the extracted subgraphs.

In order to define the tree formally, we must first define the necessary conditions that
constitute the W-Index. Therefore, let us denote the set of indexed nodes in the tree, denoted as
𝒯, as 𝑈 ∈ 𝒯. We denote the total number of index nodes in the W-Index as 𝑁𝑤 .
Definition 8 (W-Index). We call a tree 𝒯 a W-Index iff 𝒯 satisfies the six conditions: (1) The
root of 𝒯 has 𝑖𝑑 = 0 and 𝐶ℎ as an empty sequence ( ) with no 𝐺𝐼 . (2) The subgraphs ℎ𝑗 inserted
in the 𝒯 contain weights and are frequent, which we denote by 𝐷𝐹, and in-frequent subgraphs,
denoted by 𝐷𝐼. (3) The index nodes contain 𝑝 as either 0 or 1, which we can use to differentiate
between frequent and in-frequent subgraphs. (4) The indexed nodes {𝑢1 , 𝑢2 , … , 𝑢𝑠 ∈ 𝑈 } follow the
lexicographic order, where 1 ≤ 𝑠 ≤ 𝑁𝑤 . (5) The size of the subgraph in the indexed node is level−1.
(6) Similar to general trees, leaf nodes have no children.

By default, we have the W-Index as a rooted tree as described earlier. The insertion of the
new indexed nodes includes the checking of whether the indexed nodes are either frequent
or in-frequent subgraphs; hence, we will assign their 𝑝 as either 0 or 1. We can create a new
indexed node in the W-Index by adding a single edge in the level 1 of the tree and we update
the W-Index by adding more labelled graphs. Indexing includes the canonical labelling of the
frequent (or in-frequent) edges and insertion of the entries to the identifier of the labelled graph
(denoted as 𝐺𝐼 ).

Theorem 1. Given an intermediate summary graph 𝐼 𝐺𝑖′ for each labelled graph in the trans-
actional graph database, assumes that there are discriminative 𝑁 𝑢𝑚𝐷𝐹 frequent and 𝑁 𝑢𝑚𝐷𝐼
infrequent subgraphs. Assume that the time complexity of generating an intermediate summary
graph is 𝑇𝐼 𝐺𝑖′ , then consider that the time complexity for frequent subgraph mining is 𝑇ℎ , and the
time complexity for canonical labelling is 𝑇𝑐 . We denote the maximum number of edges in the
intermediate summary graph by 𝐸𝐼 𝐺𝑖′ . Now, consider that the maximum number of vertices in
the frequent subgraph is 𝑀𝐷𝐹 . Therefore, the overall time complexity for building the W-Index is
2 ) + 𝑁 𝑢𝑚 ⋅ 𝑇 ).
𝒪(|𝒢 | ⋅ 𝑇𝐼 𝐺𝑖′ + |𝐸𝐼 𝐺𝑖′ | + 𝑇ℎ + 𝑁 𝑢𝑚𝐷𝐹 ⋅ 𝑇𝑐 ⋅ (𝑀𝐷𝐹 𝐷𝐼 𝑐

6. M-Index
In this section, we describe an auxiliary structure called the M-Index, which is an online-based
structured organisation that we build dynamically through various graph update operations
such as add() and modify(). M-Index is a maintainable index that aims to enhance the query
processing during the refinement of an evolving query graph and numerous query executions.
The main idea is to store compactly the partial results in the context of the currently executed
query, which will then aid in the faster processing of future queries. Now, let us turn to the
structural representation of an index node in the M-Index. Note here that the query graph
undergoes refinement on one edge at a time. We formally define an index node in the M-Index
as follows.

Definition 9 (Index node in M-Index). The indexed node 𝜆𝑡 ∈ 𝜆 in the M-Index is a tuple
𝜆𝑡 = 𝑄𝑡 = ⟨𝜀(𝑄𝑡 ), 𝑒𝑖𝑑⟩, where 𝑄𝑡 is a new query graph resulting after the addition or modification
of a new edge at timestamp 𝑡. We denote the total number of indexed nodes in the M-Index
as 𝒩. Each indexed node stores the identifier of the indexed node in the W-Index, whose 𝐶ℎ
matches the canonical labelling of the query graph. As a result, we need a node matching function
𝜀(𝑄𝑡 ) = 𝜒 (𝑄𝑡−1 , 𝑜𝑝, 𝑣𝑠 , 𝑣𝑑 ) = 𝛾𝑖𝑑 . Furthermore, each indexed node in the M-Index contains a set of
identifiers of the edges denoted as 𝑒𝑖𝑑.
Observe that the complete set of an indexed node generates the M-Index, which is merely the
whole query graph at the time stamp 𝑄𝑡 . We now present the formal definition of the M-Index.

Definition 10 (M-Index). M-Index is a rooted tree, 𝒯, with a tuple ⟨𝜆, 𝑀⟩. The tree 𝒯 is an
M-Index, iff 𝒯 satisfies the two conditions: (1) The index node 𝜆 is a set of unique nodes that
correspond to the instance of the query graph 𝑄𝑡 . (2) The edge 𝑚 ∈ 𝑀 indicates the parent-child
relationship where we state that the child node derives from the parent due to an update operation
on the parent node.

We then run the operation to re-examine the subgraph query problem in the context of the
partial results stored in the M-Index. The visual exploratory subgraph search problem, VESS, is
now a variant of the workload-aware subgraph search. Formally, we define the workload-aware
subgraph search as follows.

Definition 11 (Workload-aware subgraph search). First, let us consider a new query 𝑄𝑡 ob-
tained after the exploration process, an offline-based index, the W-Index, and an M-Index. The
workload-aware subgraph search is to find the identifier of the indexed node in the W-Index using
the matching node function, 𝜒 (𝑄𝑡−1 , 𝑜𝑝, 𝑣𝑠 , 𝑣𝑑 ), in the presence of 𝑄𝑡−1 .

Therefore, we can interpret the execution of the query graph at a particular time as the
intersection of all the graph identifiers contained in each leaf node of the M-Index tree.

Theorem 2. Given the graph index W-Index, for each new edge addition to the query graph, which
is 𝑄𝑡−1 , the time complexity of building the query index, M-Index is 𝒪(𝑙𝑜𝑔|𝜆|⋅𝑇𝑐 +𝑚𝑖𝑛(|𝜆|, 𝑁𝑄 )(|𝑉𝑄𝑡 |+
|𝐸𝑄𝑡 |)). Assume that a user adds a new edge, denoted as 𝑒𝑄𝑡 , to the current query graph, denoted
as 𝑄𝑡 . We will have to check for the matching node in the W-Index that has the same canonical
label 𝐶ℎ as the DFS code generated for the new edge, 𝐶ℎ (𝑒𝑄𝑡 ). Hence, we would have to perform a
𝒪(|𝜆| ⋅ 𝑇𝑐 ) canonical label comparison. Then, we would need to add an identifier of the matching
index node present in the W-Index, denoted as 𝑖𝑑, to a new index node 𝜆𝑡 in the M-Index. Following
that, we extend the 𝜆𝑡 by searching for connections from 𝑄𝑡 , which is the upper bound of the frequent
fragments, 𝑚𝑖𝑛(|𝜆|, 𝑁𝑄 ). Now, it requires the addition or updating of a matching index node in
M-Index with the complexity of (|𝑉𝑄𝑡 | + |𝐸𝑄𝑡 |).

6.1. Comparison with PICASSO and gIndex
In contrast to PICASSO, our SuMExplorer exploits the graph summarisation method to reduce
the computational cost associated with large numbers of redundant vertices having the same
properties (such as labels) and their connecting edges. Further, PICASSO utilise unweighted
frequent subgraph mining, whereas SuMExplorer introduces the concept of weighted subgraph
mining which improves the pruning power of frequent subgraph mining. In our index structure,
W-Index, we capture the weighted frequent and infrequent subgraphs in a suffix tree. In addition,
we employ a compressed bitmap to reduce the filtering time and indexing size. We observe that
PICASSO does not address these issues. While PICASSO are limited to look-up operations, our
SuMExplorer stores meta information regarding query formulation in the offline-based index
structure, W-Index. Since PICASSO does not consider this meta information, they have to build
a query index for each edge addition; this increases both storage size and building time.
7. Evaluation
In this section, we evaluate the performance of competing algorithms on some chemical datasets.
Specifically, our experimental study aimed to answer the research question: how do the index
size and construction time of our SuMExplorer compare to the state-of-the-art PICASSO?
We obtained the source code of 𝑆𝑆𝑢𝑀 from [5]. Our own SuMExplorer is implemented in
Java, and its source code is available online [6]. Hence, we compare our work to PICASSO,
which is also implemented in Java. We ran our experiment on a machine with Seven Intel Core
1185G7@ 3.00 GHz 1.80 GHz CPUs and 32 GB of RAM under Windows.
We used three real datasets, namely antiviral screen (AIDS) dataset (|𝒢 | = 40, 000, |𝑉𝒢 | =
45, |𝐸𝐺 | = 46.95, |𝜎𝒢 | = 62) [7], P388 dataset (|𝒢 | = 41, 472, |𝑉𝒢 | = 45, |𝐸𝒢 | = 41.8, |𝜎𝒢 | = 73),
and Yeast dataset (|𝒢 | = 79, 601, |𝑉𝒢 | = 45, |𝐸𝒢 | = 40.7, |𝜎𝒢 | = 75) to show the efficiency
of our indexing algorithm. We received the remaining datasets (P388 and Yeast) from [8].
First, we parsed the chemical graphs using SmilesGraphParser and then serialise them into
LineGraphParser format. Both the graph parsers are available in the ParMol package [9].
Then, we obtained frequent and infrequent subgraphs using our weighted frequent subgraph
mining algorithm. In contrast to our approach of weighted frequent subgraph mining, PICASSO
employed gSpan algorithm without assigning weights to the edge labels.
Now, we will describe the workload used for the evaluation of query processing. We utilised
queries 𝑄1 → 𝑄4 for the AIDS dataset, 𝑄5 → 𝑄6 for the P388 dataset, and 𝑄7 → 𝑄8 for the Yeast
dataset. The queries are illustrated in Fig 3. Regarding the parameter selection for benchmarking,
in the case of the graph summarisation module, we set the reconstruction parameter as the
default value, as suggested in the original paper [5]. Furthermore, we set the minimum support,
𝑚𝑖𝑛𝑠𝑢𝑝 = 0.1 ⋅ |𝒢 |, as suggested by PICASSO.

2 10 7
1 Hg 1 5 1 N N 3 7 C C C 1 2 C C
1 8 5 5 Hg O 2 5 N 2 O C 1 4 N N
Ga C 2 C C 2 S 3 4 S C 2 C 3 4 N N 3 C
6 C 4 C C C C C 3 6 N 3
5 O C 6 O C
C C 6 6 N
1 1 3 6 N C C 3 4 N C 2 8 N N 4 8 C C C 4 O C
3 7 8 C 8 N O 7 N 5 9 5 8
C C C C Hg C Hg N 7
4 Q1 9 Q2 7 Q3 7 Q4 Q5 Q6 Q7 Q8

Figure 3: Query workload for visual exploratory subgraph searching. The edge label indicates the
sequence of insertion.

Next, we present the results using two metrics: index construction time, measured in millisec-
onds (ms), and index size, in megabytes (MB). To verify our claim that our W-Index is an efficient
approach for reducing computational resources, we compared its index size and time to that of
PICASSO. Our method utilises a summary graph with fewer vertices and edges and employs
weighted frequent subgraph mining to improve its effectiveness. The results of our comparison
demonstrate that our approach is, indeed, efficient, as evidenced by its smaller index size and
faster building time. By demonstrating the efficiency of our W-Index, we can highlight that
using an efficient graph summarisation and an effective weighted frequent subgraph mining can
significantly impact the index building time. In Fig. 4 (a), we have shown the index time of our
W-Index compared to PICASSO’s 𝐴2 𝐹 and 𝐴2 𝐼 for the above-mentioned datasets. As observed in
Fig. 4 (a), the index building time for W-Index is faster than 𝐴2 𝐹 and 𝐴2 𝐼 for all the datasets. The
usage of optimisation such as graph summarisation and weighted frequent mining led to faster
construction time and memory cost for W-Index in comparison to PICASSO. In Fig. 4 (b), we
D&RQVWUXFWLRQ7LPH 6X0([SORUHU 3,&$662 E&RQVWUXFWLRQ6L]H

,QGH[VL]H0%
,QGH[WLPHPV

$,'6 3 <HDVW $,'6 3 <HDVW
'DWDVHW 'DWDVHW

Figure 4: Comparison of: (a) construction time (ms) and (b) index size (MB) for various techniques
(SuMExplorer and PICASSO) using a log-scale.

6X0([SORUHU 3,&$662
4XHU\3URFHVVLQJ7LPHPV

4 4 4 4 4 4 4 4
4XHU\7\SH

Figure 5: Comparison of query processing time (ms) for various techniques (SuMExplorer and PICASSO)
on different query workloads using a log-scale.

presented the memory cost of the aforementioned index structures. We observed that W-Index
requires noticeably less memory than PICASSO. Finally, we evaluated the performance of the
query processing time when the user invoked the 𝑟𝑢𝑛() operation (Fig. 5).

8. Conclusions
In this paper, we have presented SuMExplorer, a framework for performing iterative subgraph
searching on the summary graph. We have also introduced the notion of embedding subgraph-
neighbourhood information to improve the pruning power of the frequent subgraph mining
algorithm. Our results from diverse biological datasets show that the W-Index outperform
PICASSO in terms of efficiency and scalability for all the three datasets. Specifically, we have
extensively discussed generating summary graphs and indexing them in a compact represen-
tation known as the W-Index. In addition, we have provided an algorithm to update partial
results and to detect changes in order to maintain an iterative query graph. In future work,
we plan to extend the current framework for a similarity search query and use the notion of
a graph neural network for continuous subgraph searching. An extensive evaluation of the
M-Index will be produced in the extended work.
Acknowledgments
The authors would like to thank Deutscher Akademischer Austauschdienst (DAAD) for provid-
ing funds for the research on this project.

References
[1] R. H. Choi, C. Chung, Efficient processing of graph similarity search, World Wide Web 18
(2015) 633–659. doi:10.1007/s11280- 014- 0274- 4 .
[2] J. R. Ullmann, An algorithm for subgraph isomorphism, Journal of the ACM 23 (1976) 31–42.
doi:10.1145/321921.321925 .
[3] K. Huang, S. S. Bhowmick, S. Zhou, B. Choi, PICASSO: exploratory search of connected
subgraph substructures in graph databases, Proceedings of the VLDB Endowment 10 (2017)
1861–1864. doi:10.14778/3137765.3137794 .
[4] H. Kim, Y. Choi, K. Park, X. Lin, S. Hong, W. Han, Versatile equivalences: Speeding up
subgraph query processing and subgraph matching, in: G. Li, Z. Li, S. Idreos, D. Srivastava
(Eds.), SIGMOD ’21: International Conference on Management of Data, Virtual Event, China,
June 20-25, 2021, ACM, 2021, pp. 925–937. doi:10.1145/3448016.3457265 .
[5] K. Lee, H. Jo, J. Ko, S. Lim, K. Shin, Ssumm: Sparse summarization of massive graphs, in:
R. Gupta, Y. Liu, J. Tang, B. A. Prakash (Eds.), KDD ’20: The 26th ACM SIGKDD Conference
on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020,
ACM, 2020, pp. 144–154. doi:10.1145/3394486.3403057 .
[6] C. Wangmo, L. Wiese, Sumexplorer, 2023. URL: https://anonymous.4open.science/r/
SuMExplorer-1137.
[7] AIDS, 2004. URL: https://wiki.nci.nih.gov/display/NCIDTPdata/AIDS+Antiviral+Screen+
Data.
[8] R. Ayed, Aggregated search in Distributed Graph Databases. (Recherche d’information
agrégative dans des bases de graphes distribuées), Ph.D. thesis, University of Lyon, France,
2019.
[9] T. Meinl, M. Wörlein, O. Urzova, I. Fischer, M. Philippsen, The parmol package for frequent
subgraph mining, Electronic Communication of the European Association of Software
Science and Technology 1 (2006). doi:10.14279/tuj.eceasst.1.85 .