SuMExplorer: Summarisation-based Frequent Subgraph Mining for Visual Exploratory Subgraph Searching Chimi Wangmo1 , Lena Wiese1 1 Goethe University Frankfurt, Robert-Mayer-Str. 10, 60629, Frankfurt am Main, Germany Abstract Graphs are ubiquitous with their vast application in various disciplines, including bioinformatics, security, and social sciences. One of the most popular queries involves searching for subgraphs and similarities on large numbers of small-to-medium-sized graphs. The visual exploratory subgraph search is proposed to be a search paradigm to support visual analysis and to ease user interaction. In this paper, we address the problem of reducing computational complexity during the indexing stage of visual exploratory subgraph searching. We propose (1) a compact summarisation of our input graph to reduce the index size and construction time. In addition, (2) we introduce weighted frequent subgraph mining for our novel graph summarisation-based visual subgraph searching. Based on these, (3) we propose a novel index structure to represent the subgraphs extracted from the summary graphs in a compact form. Finally, (4) we empirically demonstrate that the proposed framework provides higher efficiency than the baseline by benchmarking on three real-world biological datasets. Keywords visual exploratory subgraph searching, graph summarisation, graph indexing, graph database 1. Introduction Subgraph containment and similarity searching are critical problems in analysing social net- works, computational biology, collaboration networks, etc. In the case of subgraph containment, our interest lies in finding all the labelled graphs that contain a given query graph i.e. a subgraph to a given supergraph. In subgraph similarity, earlier works such as [1] have investigated the concept of missing edges and the need for a distance measure to identify similar graphs. The naive approach to solving this problem involves adopting a subgraph isomorphism algorithm. However, [2] proved that performing vertex-to-vertex mapping in the subgraph isomorphism problem is NP-complete. As a result, we introduce subgraph-based graph indexing to prune unrelated candidate labelled graphs and perform subgraph isomorphism (also known as “verifi- cation”). We call this process subgraph-based graph indexing because it requires embedding subgraphs within the graph database to filter non-candidate labelled graphs. LWDA’23: Lernen, Wissen, Daten, Analysen. October 09–11, 2023, Marburg, Germany ∗ Corresponding author. † These authors contributed equally. Envelope-Open wangmo@uni-frankfurt.de (C. Wangmo); lwiese@cs.uni-frankfurt.de (L. Wiese) GLOBE http://wiese.free.fr/ (L. Wiese) Orcid 0000-0002-8921-398X (C. Wangmo); 0000-0003-3515-9209 (L. Wiese) © 2022 Copyright © 2023 by the paper’s authors. Copying permitted only for private and academic purposes. In: M. Leyer, Wichmann, J. (Eds.): Proceedings of the LWDA 2023 Workshops: BIA, DB, IR, KDML and WM. Marburg, Germany, 09.-11. October 2023, published at http://ceur‐ws.org. CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings The use of current graph databases require knowledge of either programming or query language; hence the usage of modern databases is limited to specific, knowledgeable users. On the other hand, users such as chemists may lack an understanding of complex knowledge regarding query formulation using a particular query language. As a result, visual exploratory subgraph searching helps to overcome such issues [3]. To summarise, the manipulation opera- tions available to users in visual exploratory subgraph searching comprise: (1) the addition of a new vertex and its connection to an existing vertex or the addition of a new edge between existing vertices, (2) the deletion of either an edge or of both a vertex and the corresponding edges, and (3) finally, the execution of query to find a matching subgraph in the transactional graph database. Operations (1) and (2) lead to modifying a query graph and, hence, require a maintenance algorithm for an index structure. Visual exploratory subgraph searching relies on the inner workings of the index structure to perform the subgraph query processing efficiently. We have looked, in particular, at solving issues related to the efficiency of index construction and memory usage and improving the effectiveness of filtering power and query processing. In this paper, we propose a new framework for efficiently representing the graph as a summary graph and improving pruning power using weighted-frequent subgraph mining (Sec. 4). We introduce frequent subgraph mining on the summary graph with weighting functions. We devise a new index structure, called the Weighted-Index (W-Index), for compactly storing frequent and discriminative in-frequent subgraphs in the summary graph (Sec. 5 and 6). Finally, we verify the efficiency of the proposed solutions with extensive experiments (Sec. 7). 2. Notations and Problem Statement In this section, we first present the fundamental concepts related to graph theory, focussing on identifying the topological components that correspond to real-world entities. To maintain consistency, we use the term “vertex” to refer to an entity in the graph, while the term “node” is reserved for objects in the index graph. We now formally define the labelled graph as follows. Definition 1 (Labelled Graph). A labelled graph is a triplet 𝐺 = (𝑉𝐺 , 𝐸𝐺 , 𝐿𝐺 ) where 𝑉𝐺 denotes the set of vertices, 𝐸𝐺 ⊆ {{𝑢, 𝑣} |𝑢, 𝑣 ∈ 𝑉𝐺 } is the set of edges, and 𝐿𝐺 ∶ 𝑉𝐺 ↦ 𝜎𝐺 assigns a label to (1) (2) a vertex where 𝜎𝐺 = {𝜎𝐺 , 𝜎𝐺 , …} is the set of labels. We refer to the number of edges |𝐸𝐺 | in the given graph 𝐺 as its size and the number of vertices and labels as |𝑉𝐺 | and |𝜎𝐺 |, respectively. We work with a transactional graph database, which contains a set of 𝑛 medium-sized labelled graphs, denoted as 𝒢 = {𝐺1 , 𝐺2 , 𝐺3 , … , 𝐺𝑛 }. Further, we denote the cardinality of the graph database as |𝒢 | = 𝑛, the average number of vertices, edges, and labels of all the labelled graphs in the transactional graph database as |𝑉𝒢 |, |𝐸𝒢 |, and |𝜎𝒢 |, respectively. In this paper, our motivation is to query for a query graph; this is a subgraph of a labelled graph in the transactional graph database. Hence, it is necessary to understand and conduct subgraph isomorphism iteratively for each labelled graph in the graph database to solve the query. Yet, the major downside of subgraph isomorphism is its expensive operation time. Definition 2 (Subgraph Isomorphism [4]). Given a query graph, 𝑄, and a labelled graph, 𝐺, an embedding of 𝑄 in 𝐺 is a mapping, where 𝑀 ∶ 𝑉𝑄 → 𝑉𝐺 such that: (1) 𝑀 is injective, i.e. 𝑀(𝑢) ≠ 𝑀(𝑣) for 𝑢 ≠ 𝑣 ∈ 𝑉𝑄 , (2) the labels match, i.e. 𝐿𝑄 (𝑢) = 𝐿𝐺 (𝑀(𝑢)) for every 𝑢 ∈ 𝑉𝑄 , and (3) and all edges of 𝑄 exist in 𝐺, i.e. (𝑀(𝑢), 𝑀(𝑣))(or(𝑀(𝑣), 𝑀(𝑢))) ∈ 𝐸𝐺 ∀(𝑢, 𝑣)(or(𝑣, 𝑢)) ∈ 𝐸𝑄 . Graph 𝑄 is subgraph-isomorphic to 𝐺, denoted by 𝑄 ⊆ 𝐺 if there exists an embedding of 𝑄 in 𝐺. Example 1. In Fig. 1 (a), we illustrate an example of labelled graphs representing molecules in the transactional graph database, denoted by 𝒢. The transactional graph database contains 5 molecular graphs, 𝒢 = {𝐺1 , 𝐺2 , 𝐺3 , 𝐺4 , 𝐺5 }. Each labelled graph contains several vertices (12 in 𝐺1 , 5 in 𝐺2 , 6 in 𝐺3 , 8 in 𝐺4 , and 3 in 𝐺5 ) denoted by 𝑣𝑙 where 1 ≤ 𝑙 ≤ |𝑉𝐺 |, and its corresponding label denoted by a letter, 𝜎𝑙 ∈ 𝜎; here, 𝜎 = {𝐶, 𝐻 , 𝑂}. The vertex 𝑣1 in 𝐺1 contains label 𝐶. Furthermore, there exists an edge from one node to another, such as an edge from 𝑣1 to 𝑣2 . 2.1. Problem Statement Prior to providing a formal definition of visual exploratory subgraph search (𝑉 𝐸𝑆𝑆), we formally define the query graph in VESS, 𝑄𝑡 , as follows. Definition 3 (Query Graph in VESS). The query graph, 𝑄𝑡 , is a labelled graph (see Definition 1), which we start with an insertion of a single edge, 𝑄1 = (𝑉𝑄1 , 𝐸𝑄1 , 𝐿𝑄1 ), where 𝐸𝑄1 = {𝑒𝑄1 }. We then sequentially perform incremental updates based on the user’s update operation, 𝑜𝑝, on the source and destination vertices, denoted by 𝑣𝑠 and 𝑣𝑑 , respectively. The user can perform an operation on the previous instance of the query graph at time 𝑡 − 1, where 1 ≤ 𝑡 ≤ 𝒩. Here, 𝑜𝑝 is either “+” or “-”, indicating an add or remove operation, respectively. The resulting graph from the addition of a 𝑣𝑄𝑡 and an edge, 𝑒𝑄𝑡 , to 𝑄𝑡−1 is 𝑄𝑡 = {𝑉𝑄𝑡 , 𝐸𝑄𝑡 , 𝐿𝑄𝑡 }. It has 𝑉𝑄𝑡 nodes and 𝐸𝑄𝑡 edges. Definition 4 (Visual Exploratory Subgraph Search). Given a batch of update operations, 𝐵, on an initial query graph 𝑄1 , and a transactional graph database, 𝒢, the visual exploratory subgraph search (VESS) finds all the graphs, 𝐺𝑖 ∈ 𝒢, that contain the query graph, 𝑄𝑡 = {𝐵, 𝑄1 }. Example 2. In Fig. 1 (b), we provide an example of exploratory subgraph searching in a trans- actional graph database. In this figure, we have denoted time with 𝑡𝑖 , where 1 ≤ 𝑖 ≤ 𝒩. Initially, the user will begin the exploratory search with a single-edge addition at time 𝑡1 . The exploratory search involves a series of add() operations on new vertices and connections to existing vertices, as observed in time 𝑡2 , 𝑡3 , 𝑡5 , 𝑡8 , 𝑡9 , 𝑡10 , 𝑡11 . In addition, the add() operation may be simply an addition of a new edge between two existing vertices, as observed at time 𝑡6 . Furthermore, the user may also remove certain edges, as seen at time 𝑡4 and 𝑡7 . Finally, after some time, say 𝑡11 , the user may issue the 𝑟𝑢𝑛() operation to find answers to the searched query. In this paper, we focus on the subgraph containment problem formally defined as follows. Definition 5 (Subgraph Containment). Given a query graph 𝑄 and labelled graphs, 𝐺𝑖 ∈ 𝒢, the subgraph containment (also called “Subgraph search”) problem is to find all the labelled graphs that contain 𝑄. V6 H V7 H H O O G2 G3 V5 V3 V8 V2 V12 V1 C C C C H H V6 V1 V1 V2 C C H H O O V2 V4 V3 V5 G1 V4 V5 V3 V6 V1 O C C H O C H G4 G5 H V3 V8 V4 H C V11 V9 H C C V4 H V1 V2 V2 V3 O O H V10 V5 V7 (a) H H H H H H H H H H H H H H H H H C C C C C C C C C C C H C C C C C C C C C C C H H O O O O O O O O O t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 (b) Figure 1: Running example of (a) a transactional graph database and performing (b) visual exploratory searching through add(), modify(), and run() operations. We have highlighted (red for add() and grey for modify()) the vertices and edges to indicate the current modification made. 3. Related work In a graph database system, one of the main components of management is query processing, which includes an operation to facilitate efficient searching of the query graph. The ProgressIve Connected SubgrAph Substructure Search TOol (PICASSO) [3] platform is the first experiment that supports the VESS. The framework was mainly utilised for lookup operations, primarily emphasising streams of edges forming the query graph. PICASSO proposed two index structures namely, the action-aware frequent index and action-aware infrequent index, denoted as 𝐴2 𝐹 and 𝐴2 𝐼, respectively. 𝐴2 𝐹 indexes frequent subgraphs while 𝐴2 𝐼 indexes discriminative infrequent subgraphs. Furthermore, the author(s) of PICASSO maintain another online index known as 𝑆𝑃𝐼 𝐺. The work requires the construction of 𝑆𝑃𝐼 𝐺 sets, which are mainly time and memory intensive due to the underlying index structure, for every query formulation. The PICASSO indexes exhibits a large size and requires a significant amount of time for construction. This is because of the many subgraphs (ranging from a few thousand to millions of molecular compounds) in the transactional graph database. In particular, memory consumption increases greatly with the increasing numbers of labelled graphs present in the transactional graph database. Additionally, the index construction time is substantial due to the requirement of detecting numerous frequent subgraphs within each labeled graph. In the case of pruning power, the prior index structure treats each subgraph as a disconnected component. As a result, we lose the neighbourhood information associated with a frequent subgraph, despite it containing structural information. To avoid expensive graph operations and space usage, we adopt graph summarisation to represent a general graph and perform weighted frequent subgraph mining on the summary graphs. 4. The SuMExplorer Architecture 4.1. The Overall Framework Our novel framework, Summarisation-based frequent subgraph mining for visual exploratory subgraph searching, SuMExplorer, consists of six modules: (1) a user-interface module, where we provide a visual interface for a user to formulate queries through add(), modify() and run() operations, (2) a graph loader, representing the textual format in an appropriate graph structure, (3) a graph summariser, representing the input graph structures in the summarised form, (4) a frequent subgraph miner, mining frequent subgraphs efficiently, (5) a graph Indexer, storing the frequent and in-frequent subgraphs of the summarised graphs, and (6) a manipulator, performing update operations on a query graph. In Section 5 and 6, we will provide the discussion on modules (5) and (6). The SuMExplorer is depicted in the Fig. 2. In this paper, we propose a 1. User Interface 2. Graph g1 Loader 3. Graph 4. Frequent 5. W-Index Summarizer Pattern Miner g0 Offline Online Subgraph 2. Graph Loader M-Index Filter Isomorphism q0 6. Add/Delete Result Figure 2: SuMExplorer Framework. novel method of utilising a summary graph for minimal memory usage by indexes and faster query processing. Firstly, we will provide a formal definition of the intermediate summary graph (see Definition 6), which we obtain from the general graph represented by the graph loader module. The graph summariser module converts a simple, undirected, labelled graph to its intermediate summary graph through “vertex-based summarisation”. Definition 6 (Intermediate summary graph). Let 𝐺 = (𝑉𝐺 , 𝐸𝐺 , 𝐿𝐺 ) be a labelled graph. We denote an intermediate summary graph of 𝐺 as 𝐼 𝐺 ′ = (𝑉𝐼 𝐺 ′ , 𝐸𝐼 𝐺 ′ , 𝐿𝐼 𝐺 ′ , 𝑊𝐼 𝐺 ′ ). The set 𝑉𝐼 𝐺 ′ comprises the supernodes containing distinct vertices, 𝑉𝐼 𝐺 ′ ⊂ 𝑉𝐺 , and we refer to the edges between these supernodes as superedges, 𝐸𝐼 𝐺 ′ . 𝐿𝐼 𝐺 ′ ∶ 𝑉𝐼 𝐺 ′ ↦ 𝜎𝐼 𝐺 ′ assigns a label to a vertex where (1) (2) 𝜎𝐼 𝐺 ′ = {𝜎𝐼 𝐺 ′ , 𝜎𝐼 𝐺 ′ , …} is the set of labels. With the graph summarisation algorithm, we assign the weights on the superedges as 𝑊𝐼 𝐺 ′ . The weight on edge, 𝑊𝐼 𝐺 ′ is a real number that indicates the number of merged vertices sharing a common property and that are connected to the same destination vertex. The 𝑊𝐼 𝐺 ′ is a metric that shows the interconnectivity and, thus, the importance of the nodes in the graph. We generate an intermediate summary graph for each graph in the transactional graph database. Given a labeled graph in the transactional graph database, 𝐺 ∈ 𝒢, we can now prove the following result. Lemma 1. The size of an intermediate summary graph 𝐼 𝐺 ′ is at the most the size of the original graph, |𝐼 𝐺 ′ | ≤ |𝐺|. We will use SSumM, the Sparse Summarisation of Massive Graphs [5] proposed to obtain a space-efficient summary graph with a lower reconstruction error. The reconstruction error is the difference between the original graph 𝐺 and the reconstructed graph from the summary graph 𝐼 𝐺 ′ . The intermediate summary graph can provide compactness and structure preservation by reducing nodes and edges, however, the result of the intermediate summary graph is an edge-weighted multi-graph. This is an issue since there need to be more studies for frequent subgraph mining on the multi-weighted graph. Due to the space constraints, we omit computing the weighted frequent mining here. Our index structure is responsible for maintaining the extracted weighted frequent and in-frequent subgraphs (or edges). We have two index structures designed to facilitate efficient query processing. The first is an offline-based index structure, which we call a Weighted-Index, denoted by W-Index. In W-Index, we index the subgraphs extracted from the labelled graphs within the transactional graph database. We build this index only once. The second index is an online-based index structure, which we call a Maintainable-Index, denoted by M-Index. In the case of M-Index, the input graph is incrementally built from a single edge through the add() operation induced by the end user. Therefore, we require an index structure to maintain the existing result in order to facilitate efficient query processing efficiently. We provide the detailed discussion in Section 6. 5. W-Index In this section, we will discuss the offline-based index structure, W-Index, and its index con- struction process. The W-Index is a hybrid Trie-based data structure that efficiently supports subgraph search queries by using subgraph containment searches and the intersection of graph identifiers. Each indexed node, denoted by 𝛾𝑢 ∈ 𝛾, represents frequent (discriminative in-frequent) subgraphs. We formally define the indexed node, 𝛾𝑢 ∈ 𝛾, as follows. Definition 7 (Index node in W-Index). The indexed node in the W-Index is a quadruple 𝛾𝑢 = ⟨𝑖𝑑, 𝐶ℎ , 𝑝, 𝐺𝐼 ⟩. In order to identify uniquely the indexed node, we assign a unique identifier to the indexed node, denoted as 𝑖𝑑, based on its insertion order. Now, to support the storing of the frequent (or in-frequent) subgraphs, we store the label, denoted as 𝐶ℎ . 𝐶ℎ is a canonical labelling of the extracted subgraphs, which we represent using the Depth First Search (DFS) code. DFS code is a graph sequentialisation method, which represents a graph as a string sequence translating to the edge sequences in the graph. We can obtain DFS code through the depth-first search method and is ordered based on the lexicographic order of the vertices label. We chose the minimal DFS code for this study. Furthermore, we store a boolean value 𝑝 in each node, which we use to differentiate between frequent and in-frequent subgraphs (or edge). Finally, we store the encoded list of graph identifiers 𝐺𝐼 associated with the extracted subgraphs. In order to define the tree formally, we must first define the necessary conditions that constitute the W-Index. Therefore, let us denote the set of indexed nodes in the tree, denoted as 𝒯, as 𝑈 ∈ 𝒯. We denote the total number of index nodes in the W-Index as 𝑁𝑤 . Definition 8 (W-Index). We call a tree 𝒯 a W-Index iff 𝒯 satisfies the six conditions: (1) The root of 𝒯 has 𝑖𝑑 = 0 and 𝐶ℎ as an empty sequence ( ) with no 𝐺𝐼 . (2) The subgraphs ℎ𝑗 inserted in the 𝒯 contain weights and are frequent, which we denote by 𝐷𝐹, and in-frequent subgraphs, denoted by 𝐷𝐼. (3) The index nodes contain 𝑝 as either 0 or 1, which we can use to differentiate between frequent and in-frequent subgraphs. (4) The indexed nodes {𝑢1 , 𝑢2 , … , 𝑢𝑠 ∈ 𝑈 } follow the lexicographic order, where 1 ≤ 𝑠 ≤ 𝑁𝑤 . (5) The size of the subgraph in the indexed node is level−1. (6) Similar to general trees, leaf nodes have no children. By default, we have the W-Index as a rooted tree as described earlier. The insertion of the new indexed nodes includes the checking of whether the indexed nodes are either frequent or in-frequent subgraphs; hence, we will assign their 𝑝 as either 0 or 1. We can create a new indexed node in the W-Index by adding a single edge in the level 1 of the tree and we update the W-Index by adding more labelled graphs. Indexing includes the canonical labelling of the frequent (or in-frequent) edges and insertion of the entries to the identifier of the labelled graph (denoted as 𝐺𝐼 ). Theorem 1. Given an intermediate summary graph 𝐼 𝐺𝑖′ for each labelled graph in the trans- actional graph database, assumes that there are discriminative 𝑁 𝑢𝑚𝐷𝐹 frequent and 𝑁 𝑢𝑚𝐷𝐼 infrequent subgraphs. Assume that the time complexity of generating an intermediate summary graph is 𝑇𝐼 𝐺𝑖′ , then consider that the time complexity for frequent subgraph mining is 𝑇ℎ , and the time complexity for canonical labelling is 𝑇𝑐 . We denote the maximum number of edges in the intermediate summary graph by 𝐸𝐼 𝐺𝑖′ . Now, consider that the maximum number of vertices in the frequent subgraph is 𝑀𝐷𝐹 . Therefore, the overall time complexity for building the W-Index is 2 ) + 𝑁 𝑢𝑚 ⋅ 𝑇 ). 𝒪(|𝒢 | ⋅ 𝑇𝐼 𝐺𝑖′ + |𝐸𝐼 𝐺𝑖′ | + 𝑇ℎ + 𝑁 𝑢𝑚𝐷𝐹 ⋅ 𝑇𝑐 ⋅ (𝑀𝐷𝐹 𝐷𝐼 𝑐 6. M-Index In this section, we describe an auxiliary structure called the M-Index, which is an online-based structured organisation that we build dynamically through various graph update operations such as add() and modify(). M-Index is a maintainable index that aims to enhance the query processing during the refinement of an evolving query graph and numerous query executions. The main idea is to store compactly the partial results in the context of the currently executed query, which will then aid in the faster processing of future queries. Now, let us turn to the structural representation of an index node in the M-Index. Note here that the query graph undergoes refinement on one edge at a time. We formally define an index node in the M-Index as follows. Definition 9 (Index node in M-Index). The indexed node 𝜆𝑡 ∈ 𝜆 in the M-Index is a tuple 𝜆𝑡 = 𝑄𝑡 = ⟨𝜀(𝑄𝑡 ), 𝑒𝑖𝑑⟩, where 𝑄𝑡 is a new query graph resulting after the addition or modification of a new edge at timestamp 𝑡. We denote the total number of indexed nodes in the M-Index as 𝒩. Each indexed node stores the identifier of the indexed node in the W-Index, whose 𝐶ℎ matches the canonical labelling of the query graph. As a result, we need a node matching function 𝜀(𝑄𝑡 ) = 𝜒 (𝑄𝑡−1 , 𝑜𝑝, 𝑣𝑠 , 𝑣𝑑 ) = 𝛾𝑖𝑑 . Furthermore, each indexed node in the M-Index contains a set of identifiers of the edges denoted as 𝑒𝑖𝑑. Observe that the complete set of an indexed node generates the M-Index, which is merely the whole query graph at the time stamp 𝑄𝑡 . We now present the formal definition of the M-Index. Definition 10 (M-Index). M-Index is a rooted tree, 𝒯, with a tuple ⟨𝜆, 𝑀⟩. The tree 𝒯 is an M-Index, iff 𝒯 satisfies the two conditions: (1) The index node 𝜆 is a set of unique nodes that correspond to the instance of the query graph 𝑄𝑡 . (2) The edge 𝑚 ∈ 𝑀 indicates the parent-child relationship where we state that the child node derives from the parent due to an update operation on the parent node. We then run the operation to re-examine the subgraph query problem in the context of the partial results stored in the M-Index. The visual exploratory subgraph search problem, VESS, is now a variant of the workload-aware subgraph search. Formally, we define the workload-aware subgraph search as follows. Definition 11 (Workload-aware subgraph search). First, let us consider a new query 𝑄𝑡 ob- tained after the exploration process, an offline-based index, the W-Index, and an M-Index. The workload-aware subgraph search is to find the identifier of the indexed node in the W-Index using the matching node function, 𝜒 (𝑄𝑡−1 , 𝑜𝑝, 𝑣𝑠 , 𝑣𝑑 ), in the presence of 𝑄𝑡−1 . Therefore, we can interpret the execution of the query graph at a particular time as the intersection of all the graph identifiers contained in each leaf node of the M-Index tree. Theorem 2. Given the graph index W-Index, for each new edge addition to the query graph, which is 𝑄𝑡−1 , the time complexity of building the query index, M-Index is 𝒪(𝑙𝑜𝑔|𝜆|⋅𝑇𝑐 +𝑚𝑖𝑛(|𝜆|, 𝑁𝑄 )(|𝑉𝑄𝑡 |+ |𝐸𝑄𝑡 |)). Assume that a user adds a new edge, denoted as 𝑒𝑄𝑡 , to the current query graph, denoted as 𝑄𝑡 . We will have to check for the matching node in the W-Index that has the same canonical label 𝐶ℎ as the DFS code generated for the new edge, 𝐶ℎ (𝑒𝑄𝑡 ). Hence, we would have to perform a 𝒪(|𝜆| ⋅ 𝑇𝑐 ) canonical label comparison. Then, we would need to add an identifier of the matching index node present in the W-Index, denoted as 𝑖𝑑, to a new index node 𝜆𝑡 in the M-Index. Following that, we extend the 𝜆𝑡 by searching for connections from 𝑄𝑡 , which is the upper bound of the frequent fragments, 𝑚𝑖𝑛(|𝜆|, 𝑁𝑄 ). Now, it requires the addition or updating of a matching index node in M-Index with the complexity of (|𝑉𝑄𝑡 | + |𝐸𝑄𝑡 |). 6.1. Comparison with PICASSO and gIndex In contrast to PICASSO, our SuMExplorer exploits the graph summarisation method to reduce the computational cost associated with large numbers of redundant vertices having the same properties (such as labels) and their connecting edges. Further, PICASSO utilise unweighted frequent subgraph mining, whereas SuMExplorer introduces the concept of weighted subgraph mining which improves the pruning power of frequent subgraph mining. In our index structure, W-Index, we capture the weighted frequent and infrequent subgraphs in a suffix tree. In addition, we employ a compressed bitmap to reduce the filtering time and indexing size. We observe that PICASSO does not address these issues. While PICASSO are limited to look-up operations, our SuMExplorer stores meta information regarding query formulation in the offline-based index structure, W-Index. Since PICASSO does not consider this meta information, they have to build a query index for each edge addition; this increases both storage size and building time. 7. Evaluation In this section, we evaluate the performance of competing algorithms on some chemical datasets. Specifically, our experimental study aimed to answer the research question: how do the index size and construction time of our SuMExplorer compare to the state-of-the-art PICASSO? We obtained the source code of 𝑆𝑆𝑢𝑀 from [5]. Our own SuMExplorer is implemented in Java, and its source code is available online [6]. Hence, we compare our work to PICASSO, which is also implemented in Java. We ran our experiment on a machine with Seven Intel Core 1185G7@ 3.00 GHz 1.80 GHz CPUs and 32 GB of RAM under Windows. We used three real datasets, namely antiviral screen (AIDS) dataset (|𝒢 | = 40, 000, |𝑉𝒢 | = 45, |𝐸𝐺 | = 46.95, |𝜎𝒢 | = 62) [7], P388 dataset (|𝒢 | = 41, 472, |𝑉𝒢 | = 45, |𝐸𝒢 | = 41.8, |𝜎𝒢 | = 73), and Yeast dataset (|𝒢 | = 79, 601, |𝑉𝒢 | = 45, |𝐸𝒢 | = 40.7, |𝜎𝒢 | = 75) to show the efficiency of our indexing algorithm. We received the remaining datasets (P388 and Yeast) from [8]. First, we parsed the chemical graphs using SmilesGraphParser and then serialise them into LineGraphParser format. Both the graph parsers are available in the ParMol package [9]. Then, we obtained frequent and infrequent subgraphs using our weighted frequent subgraph mining algorithm. In contrast to our approach of weighted frequent subgraph mining, PICASSO employed gSpan algorithm without assigning weights to the edge labels. Now, we will describe the workload used for the evaluation of query processing. We utilised queries 𝑄1 → 𝑄4 for the AIDS dataset, 𝑄5 → 𝑄6 for the P388 dataset, and 𝑄7 → 𝑄8 for the Yeast dataset. The queries are illustrated in Fig 3. Regarding the parameter selection for benchmarking, in the case of the graph summarisation module, we set the reconstruction parameter as the default value, as suggested in the original paper [5]. Furthermore, we set the minimum support, 𝑚𝑖𝑛𝑠𝑢𝑝 = 0.1 ⋅ |𝒢 |, as suggested by PICASSO. 2 10 7 1 Hg 1 5 1 N N 3 7 C C C 1 2 C C 1 8 5 5 Hg O 2 5 N 2 O C 1 4 N N Ga C 2 C C 2 S 3 4 S C 2 C 3 4 N N 3 C 6 C 4 C C C C C 3 6 N 3 5 O C 6 O C C C 6 6 N 1 1 3 6 N C C 3 4 N C 2 8 N N 4 8 C C C 4 O C 3 7 8 C 8 N O 7 N 5 9 5 8 C C C C Hg C Hg N 7 4 Q1 9 Q2 7 Q3 7 Q4 Q5 Q6 Q7 Q8 Figure 3: Query workload for visual exploratory subgraph searching. The edge label indicates the sequence of insertion. Next, we present the results using two metrics: index construction time, measured in millisec- onds (ms), and index size, in megabytes (MB). To verify our claim that our W-Index is an efficient approach for reducing computational resources, we compared its index size and time to that of PICASSO. Our method utilises a summary graph with fewer vertices and edges and employs weighted frequent subgraph mining to improve its effectiveness. The results of our comparison demonstrate that our approach is, indeed, efficient, as evidenced by its smaller index size and faster building time. By demonstrating the efficiency of our W-Index, we can highlight that using an efficient graph summarisation and an effective weighted frequent subgraph mining can significantly impact the index building time. In Fig. 4 (a), we have shown the index time of our W-Index compared to PICASSO’s 𝐴2 𝐹 and 𝐴2 𝐼 for the above-mentioned datasets. As observed in Fig. 4 (a), the index building time for W-Index is faster than 𝐴2 𝐹 and 𝐴2 𝐼 for all the datasets. The usage of optimisation such as graph summarisation and weighted frequent mining led to faster construction time and memory cost for W-Index in comparison to PICASSO. In Fig. 4 (b), we D &RQVWUXFWLRQ7LPH 6X0([SORUHU 3,&$662 E &RQVWUXFWLRQ6L]H       ,QGH[VL]H 0% ,QGH[WLPH PV             $,'6 3 <HDVW $,'6 3 <HDVW 'DWDVHW 'DWDVHW Figure 4: Comparison of: (a) construction time (ms) and (b) index size (MB) for various techniques (SuMExplorer and PICASSO) using a log-scale. 6X0([SORUHU 3,&$662 4XHU\3URFHVVLQJ7LPH PV       4 4 4 4 4 4 4 4 4XHU\7\SH Figure 5: Comparison of query processing time (ms) for various techniques (SuMExplorer and PICASSO) on different query workloads using a log-scale. presented the memory cost of the aforementioned index structures. We observed that W-Index requires noticeably less memory than PICASSO. Finally, we evaluated the performance of the query processing time when the user invoked the 𝑟𝑢𝑛() operation (Fig. 5). 8. Conclusions In this paper, we have presented SuMExplorer, a framework for performing iterative subgraph searching on the summary graph. We have also introduced the notion of embedding subgraph- neighbourhood information to improve the pruning power of the frequent subgraph mining algorithm. Our results from diverse biological datasets show that the W-Index outperform PICASSO in terms of efficiency and scalability for all the three datasets. Specifically, we have extensively discussed generating summary graphs and indexing them in a compact represen- tation known as the W-Index. In addition, we have provided an algorithm to update partial results and to detect changes in order to maintain an iterative query graph. In future work, we plan to extend the current framework for a similarity search query and use the notion of a graph neural network for continuous subgraph searching. An extensive evaluation of the M-Index will be produced in the extended work. Acknowledgments The authors would like to thank Deutscher Akademischer Austauschdienst (DAAD) for provid- ing funds for the research on this project. References [1] R. H. Choi, C. Chung, Efficient processing of graph similarity search, World Wide Web 18 (2015) 633–659. doi:10.1007/s11280- 014- 0274- 4 . [2] J. R. Ullmann, An algorithm for subgraph isomorphism, Journal of the ACM 23 (1976) 31–42. doi:10.1145/321921.321925 . [3] K. Huang, S. S. Bhowmick, S. Zhou, B. Choi, PICASSO: exploratory search of connected subgraph substructures in graph databases, Proceedings of the VLDB Endowment 10 (2017) 1861–1864. doi:10.14778/3137765.3137794 . [4] H. Kim, Y. Choi, K. Park, X. Lin, S. Hong, W. Han, Versatile equivalences: Speeding up subgraph query processing and subgraph matching, in: G. Li, Z. Li, S. Idreos, D. Srivastava (Eds.), SIGMOD ’21: International Conference on Management of Data, Virtual Event, China, June 20-25, 2021, ACM, 2021, pp. 925–937. doi:10.1145/3448016.3457265 . [5] K. Lee, H. Jo, J. Ko, S. Lim, K. Shin, Ssumm: Sparse summarization of massive graphs, in: R. Gupta, Y. Liu, J. Tang, B. A. Prakash (Eds.), KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020, ACM, 2020, pp. 144–154. doi:10.1145/3394486.3403057 . [6] C. Wangmo, L. Wiese, Sumexplorer, 2023. URL: https://anonymous.4open.science/r/ SuMExplorer-1137. [7] AIDS, 2004. URL: https://wiki.nci.nih.gov/display/NCIDTPdata/AIDS+Antiviral+Screen+ Data. [8] R. Ayed, Aggregated search in Distributed Graph Databases. (Recherche d’information agrégative dans des bases de graphes distribuées), Ph.D. thesis, University of Lyon, France, 2019. [9] T. Meinl, M. Wörlein, O. Urzova, I. Fischer, M. Philippsen, The parmol package for frequent subgraph mining, Electronic Communication of the European Association of Software Science and Technology 1 (2006). doi:10.14279/tuj.eceasst.1.85 .