Ensuring Consistency in Graph Cache for Graph-Pattern Queries

Jing Wang, Nikos Ntarmos, Peter Triantafillou
School of Computing Science, University of Glasgow, UK
j.wang.3@research.gla.ac.uk, nikos.ntarmos@glasgow.ac.uk, peter.triantafillou@glasgow.ac.uk

©2017, Copyright is with the authors. Published in the Workshop Proceedings of the EDBT/ICDT 2017 Joint Conference (March 21, 2017, Venice, Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

ABSTRACT
Graph queries are costly, as they entail the NP-Complete subgraph isomorphism problem. Graph caching has recently been suggested, showing the potential to significantly accelerate subgraph/supergraph queries. Subsequently, GraphCache, the first full-fledged graph caching system, was put forth. However, when the underlying dataset changes concurrently with the query workload, ensuring the consistency of the graph cache becomes an issue. This work provides a systematic solution to this problem, presenting an upgraded GraphCache system coined GraphCache+ (abbreviated as GC+). We develop two GC+-exclusive cache models that take different approaches to the consistency issue. Moreover, we present the logic by which GC+ expedites queries, together with a formal proof of its correctness. We evaluate the performance of GC+ using a real-world graph dataset and a number of query workloads with different characteristics, highlighting the considerable speedup in terms of quantified benefit and overhead.

CCS Concepts
• Information systems → Database query processing;

Keywords
Graph queries; Caching system; Cache consistency

1. INTRODUCTION
Graph-structured data is increasingly prevalent in modern big data applications. Central to high-performance graph analytics is locating patterns in dataset graphs. Informally, given a query graph g, the system is called to return the set of dataset graphs that contain g (subgraph query) or are contained in g (supergraph query), named the answer set of g. These operations can be time consuming, as they entail the NP-Complete problem of subgraph isomorphism [4]. Hence, the community has contributed a number of innovative solutions in recent years. One research thread follows the "filter-then-verify" (FTV) paradigm: first, dataset graphs are decomposed into substructures and indexed; then, the substructures of an incoming query graph g are looked up in the dataset index, producing the set of dataset graphs that contain all substructures of g (resp. the dataset graphs whose substructures are all contained in g), coined the candidate set of g; finally, the graphs in the candidate set are verified for subgraph isomorphism (abbreviated as sub-iso or SI), returning the answer set of g. Similarly, [23] provides a solution for subgraph queries against historical (i.e., snapshotted) graphs – a variation of typical graph queries where snapshots can be viewed as different graphs. However, extensive evaluations of FTV methods [7, 8] show significant performance limitations. Another research thread is concerned with heuristic SI algorithms, which can operate either against a single very large graph or against a dataset containing a large number of graphs. These algorithms are capable of pruning away (parts of) dataset graphs that cannot contain the query, expediting sub-iso testing without specialized graph indexes. [14, 9] provide insightful evaluations of such SI algorithms.
Unfortunately, both FTV and SI methods suffer from executing unnecessary sub-iso tests. For example, if a previously issued query is submitted again to the system, it still has to undergo sub-iso tests, as all laboriously derived knowledge (i.e., the answer sets of previous queries) has been thrown away. Moreover, in many applications, it is natural for submitted graph queries to share subgraph or supergraph relations with future queries – in protein datasets, there is a hierarchy of queries for aminoacids, proteins, protein mixtures, unicell bacteria, all the way to multi-cell organisms; in social network exploration, queries could start off broad (e.g., all people in a geographic location) and become gradually narrower (e.g., by homing in on specific demographics). Based on these observations, we proposed in [25] a fresh graph query processing method, where queries (and their answers) are indexed to expedite future query processing with FTV methods. Subsequently, we presented GraphCache [26], the first full-fledged caching system for general subgraph/supergraph query processing, applicable to all current FTV and SI methods. Following established research, GraphCache handles graph queries against a dataset that remains static throughout the continuous query stream.
In real-world applications, however, the graph dataset naturally evolves over time. For example, in social networking, newly added groups (graph-modeled relations/interactions among people), break-ups of existing groups, and changed relations/interactions among group members happen frequently. Likewise, biochemical datasets keep being refreshed with newly translated, disregarded or transformed proteins, for application/research reasons. Such changes can be modeled as graph addition (ADD), graph deletion (DEL), graph update by edge addition (UA) and graph update by edge removal (UR).
SI algorithms can accommodate these changes on the fly, as each dataset graph eventually undergoes a subgraph isomorphism test, whereas FTV methods additionally require an updatable indexing mechanism to produce a proper candidate set. To the best of our knowledge, none of the proposed FTV algorithms so far has an updatable index or a similar solution to tackle dataset changes. As a result, one has to turn to SI methods, where each dataset graph is painstakingly verified. On the other hand, caching has proven efficient in accelerating graph queries [26].
It thus naturally follows to use a graph cache to alleviate the cost of SI methods when processing subgraph/supergraph queries over a changing dataset – a topic that has not been addressed yet. Therefore, underpinned by our previous work in [25, 26], this paper contributes:
• A systematic solution to expedite subgraph/supergraph queries while the dataset changes throughout the query stream, by presenting an upgraded graph caching system, GC+, featuring newly plugged-in subsystems that deal with the resulting cache consistency issue.
• Two cache models, named EVI and CON, proposed exclusively for GC+, representing different designs for ensuring graph cache consistency.
• The logic of GC+ for pruning the candidate set of general subgraph/supergraph queries, together with a formal proof of its correctness.
• Performance evaluations using well-established SI methods, a real-world dataset and a number of workloads, emphasizing the significant speedup of the CON cache.

2. RELATED WORK
The subgraph isomorphism problem has two versions: (i) the decision problem answers yes/no as to whether the query is contained in each graph in the dataset; (ii) the matching problem locates all occurrences of the query graph within a large graph (or a dataset of graphs). To address the decision and matching problems, the brute-force approach is to execute a sub-iso test of the query against each dataset graph (the SI method). SI algorithms deteriorate when the dataset contains a large number of graphs, which prompted the prevalence of FTV methods. GC+ can benefit both SI and FTV solutions for general subgraph/supergraph queries.
The community has also looked into subgraph queries against a single massive graph via scale-out architectures [12] or large memory clusters with massive parallelism [24]. For the time being, GC+ does not target such use cases, which are left for future work.
Using a cache to expedite queries is prevalent in relational databases, whereas little work has been done for graph-structured query processing. XML datasets have used views to expedite path/tree queries [15]; [13] first proposed the MCR (maximally contained rewriting) approach for tree-pattern queries, but introduced false negatives. GC+ produces neither false negatives nor false positives (see §6). Also, GC+ deals with much more complex graph queries, which entail the NP-Complete subgraph isomorphism problem.
Caching has also been leveraged to expedite SPARQL query processing in RDF graphs. [18] proposed the first cache for SPARQL queries. [22] contributed a cache for SPARQL queries with a novel canonical labelling scheme to identify cache hits and a popular dynamic programming planner. Similar to GC+, the query processing optimization in [22] does not require any a priori knowledge of datasets/workloads and is workload adaptive. However, like XML queries, SPARQL queries are less expressive than general graph queries and thus less challenging [10, 24]: SPARQL query processing consists of solving the subgraph homomorphism problem, which differs from the subgraph isomorphism problem in that the former drops the injectivity requirement of the latter. Furthermore, GC+ discovers subgraph, supergraph, and exact-match relationships between the incoming query and cached queries, which the canonical labelling scheme in [22] fails to achieve. SPARQL query processing also targets optimized join execution plans [5] (based on join selectivity estimator statistics and related cost functions), and the cache in [22] focuses on this goal, whereas GC+ aims to avoid/reduce the cost of executing SI heuristics, whose execution time can be highly unpredictable and much higher. Therefore, the overall rationale of GC+ and the way it utilizes cache contents differ from those in [22] and in related caching solutions for SPARQL queries.
Pertaining to cache consistency, [6] first explicitly specified the consistency constraint in a query-centric approach and showed how it can be expressed succinctly in SQL. There are also studies of cache consistency for XML datasets [2] and SPARQL query processing [16]. However, the topic of ensuring graph cache consistency for general subgraph/supergraph queries has not been discussed yet.
3. DEFINITIONS
GC+ is implemented for undirected labelled graphs, following established studies in the literature [1, 11]. We assume that only vertices have labels; all our results straightforwardly generalize to directed graphs and/or graphs with edge labels.
A labeled graph G = (V, E, l) consists of a set of vertices V(G), a set of edges E(G) = {(u, v) : u ∈ V, v ∈ V}, and a function l : V → U, where U is the label set, defining the domain of vertex labels. A graph G_i = (V_i, E_i, l_i) is subgraph-isomorphic to a graph G_j = (V_j, E_j, l_j), denoted (by abuse of notation) G_i ⊆ G_j, when there exists an injection φ : V_i → V_j such that ∀(u, v) ∈ E_i ⇒ (φ(u), φ(v)) ∈ E_j and ∀u ∈ V_i, l_i(u) = l_j(φ(u)).
As is common in the literature, we focus on non-induced subgraph isomorphism. Informally, there is a subgraph isomorphism G_i ⊆ G_j if G_j contains a subgraph that is isomorphic to G_i. In this case, we say that G_i is a subgraph of (or is contained in) G_j, or inversely that G_j is a supergraph of (contains) G_i (denoted by G_j ⊇ G_i). The subgraph querying problem entails a set D = {G_1, ..., G_n} of n graphs and a query graph g, and determines all graphs G_i ∈ D such that g ⊆ G_i. The supergraph querying problem entails a set D = {G_1, ..., G_n} of n graphs and a query graph g, and determines all graphs G_i ∈ D such that g ⊇ G_i.
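To make the two querying problems concrete, the following minimal Java sketch computes their answer sets with the brute-force SI approach of §2, i.e., one sub-iso test per dataset graph. It is only an illustration of the problem statement, not of GC+ itself; the generic Graph type and the subIso predicate (subIso.test(a, b) holding iff a ⊆ b) are hypothetical stand-ins for an actual SI implementation such as VF2.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.BiPredicate;

    // Illustrative sketch of the subgraph/supergraph querying problems of Section 3,
    // answered by brute-force sub-iso testing. The Graph type and subIso predicate are
    // hypothetical placeholders for a real SI algorithm.
    final class BruteForceQueries<Graph> {
        private final BiPredicate<Graph, Graph> subIso;  // subIso.test(a, b) == true iff a is a subgraph of b

        BruteForceQueries(BiPredicate<Graph, Graph> subIso) { this.subIso = subIso; }

        // Subgraph query: all dataset graphs Gi with g contained in Gi.
        List<Graph> subgraphQuery(Graph g, List<Graph> dataset) {
            List<Graph> answer = new ArrayList<>();
            for (Graph gi : dataset) if (subIso.test(g, gi)) answer.add(gi);
            return answer;
        }

        // Supergraph query: all dataset graphs Gi contained in g.
        List<Graph> supergraphQuery(Graph g, List<Graph> dataset) {
            List<Graph> answer = new ArrayList<>();
            for (Graph gi : dataset) if (subIso.test(gi, g)) answer.add(gi);
            return answer;
        }
    }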
4. SYSTEM ARCHITECTURE

Figure 1: GC+ System Architecture

GC+ is a scalable semantic cache for subgraph/supergraph queries, consisting of four subsystems: Dataset Manager, Cache Manager, Query Processing Runtime and Method M, as shown in Figure 1. The first three are GC+ internal; Method M is the external SI method that GC+ is called to expedite. The Method M subsystem includes an SI implementation, denoted M_verifier, which sub-iso tests the candidate set M_CS (the whole dataset when GC+ is not used).
The Dataset Manager subsystem is GC+ specific, containing a component, the Log Analyzer, that handles dataset logs. The Cache Manager is responsible for the management of the data and metadata stored in the cache. Besides the cache replacement mechanisms, a Window Manager for cache admission control and a Statistics Manager for metadata, which work as in GC, the Cache Manager of GC+ boasts an additional component, the Cache Validator, which cooperates with the Log Analyzer to ensure cache consistency. Both the Log Analyzer and the Cache Validator launch cache-model-dependent mechanisms to cope with cache consistency (see §5).
The Query Processing Runtime subsystem takes charge of query execution and metrics monitoring. As in GC, it comprises a resource/thread manager, the GC+ internal subgraph/supergraph query processors, the logic for candidate set pruning, and a statistics monitor – these components communicate with Method M and the Cache Manager via well-defined APIs. Please note that GC+'s Candidate Set Pruner logic is different from and more complicated than that in GC, though the former can be viewed as an adaptation of the latter. The logic of GC+ is presented and proved in §6.
When a query g arrives, the Dataset Manager subsystem first identifies whether the dataset has changed recently. If so, the Cache Validator is triggered to reflect these changes onto the previous queries residing in the cache and the window (where queries are batched before entering the cache; in this work, cached graphs/queries by default cover the previous queries in both cache and window). Then, g is sent to the Query Processing Runtime subsystem for execution. Specifically, GC+ calls its processors (GC+sub and GC+super) to discover whether g shares subgraph/supergraph relations with cached graphs – each hit graph in the cache then contributes the valid part of its answer set to prune g's candidate set. Finally, the issued query g, together with its answer set and statistics pertaining to the contribution of cached graphs, is fed to the Cache Manager subsystem, where admission control and cache replacement are performed, concurrently with the Query Processing Runtime subsystem executing subsequent queries.
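The following minimal Java sketch outlines the per-query control flow just described. It is illustrative only, under assumed interfaces: the type and method names (DatasetManager.hasPendingChanges, CacheValidator.validate, CandidateSetPruner.prune, Verifier.verify, CacheManager.admit) are hypothetical and do not correspond to the actual GC+ code base.

    import java.util.BitSet;

    // Hypothetical interfaces mirroring the subsystems of Figure 1; names are illustrative.
    interface DatasetManager { boolean hasPendingChanges(); Object analyzeLog(); }
    interface CacheValidator { void validate(Object logSummary); }
    interface Verifier { BitSet verify(Object query, BitSet candidates); }   // Method M (M_verifier)
    interface CacheManager { void admit(Object query, BitSet answer, Object stats); }

    final class PruneResult {
        final BitSet prunedCandidates;  // graphs that still need sub-iso testing
        final BitSet safeAnswers;       // graphs proven to contain g via valid cached results (see Section 6.1)
        PruneResult(BitSet prunedCandidates, BitSet safeAnswers) {
            this.prunedCandidates = prunedCandidates;
            this.safeAnswers = safeAnswers;
        }
    }
    interface CandidateSetPruner { PruneResult prune(Object query, BitSet candidateSet); }

    final class QueryProcessingRuntime {
        private final DatasetManager datasetManager;
        private final CacheValidator cacheValidator;
        private final CandidateSetPruner pruner;
        private final Verifier mVerifier;
        private final CacheManager cacheManager;

        QueryProcessingRuntime(DatasetManager dm, CacheValidator cv, CandidateSetPruner p,
                               Verifier v, CacheManager cm) {
            datasetManager = dm; cacheValidator = cv; pruner = p; mVerifier = v; cacheManager = cm;
        }

        // Per-query flow of Section 4: validate the cache if the dataset changed, prune the
        // candidate set using cached queries, verify the remainder with Method M, then admit.
        BitSet process(Object query, BitSet initialCandidateSet) {
            if (datasetManager.hasPendingChanges()) {
                cacheValidator.validate(datasetManager.analyzeLog());  // refresh validity indicators
            }
            PruneResult pruned = pruner.prune(query, initialCandidateSet);
            BitSet answer = (BitSet) mVerifier.verify(query, pruned.prunedCandidates).clone();
            answer.or(pruned.safeAnswers);                             // cf. formula (3) in Section 6.1
            cacheManager.admit(query, answer, null);                   // admission control / replacement
            return answer;
        }
    }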
5. CACHE CONSISTENCY

5.1 EVI Cache Model
To address the said challenge arising from a dynamic graph dataset, a straightforward solution is to abandon the cache wholesale, i.e., evicting (EVI) the entire graph cache whenever the dataset changes. In this case, the Log Analyzer has to do nothing but raise a flag indicating that the dataset has changed, and the Cache Validator then clears the cached contents indiscriminately. In this way, the caching system in [26] can be easily adapted to tackle graph queries with dataset changes, as the cleared cache can never produce errors for future query processing. But the limitation is obvious – the EVI cache has to warm up from scratch upon each dataset change.
The problem of the EVI cache lies in its failure to differentiate the validity of cached results. When a dataset graph undergoes some change, its relations with all previous queries in the cache (reflected by the cached query results) are affected – some still hold, while others fail. Of course, invalid contents cannot be leveraged to accelerate future queries and should be abandoned. However, the purge in EVI throws away the valid contents as well, truncating the cache's efficiency.

5.2 CON Cache Model
To solve the aforementioned problem of EVI, we develop another cache model, named CON, which works hard to identify the valid contents in the cache. We use an example to illustrate how the CON model works for subgraph queries; the mechanism for supergraph queries is similar and is omitted for space reasons.

Figure 2: A CON Cache Running Example with Timeline

Figure 2 depicts an example, where GC+ starts off with a dataset containing four graphs {G0, G1, G2, G3} and an empty CON cache. At time T1, query g' arrived and was executed; assume that g' passed admission control and entered the cache later. The CON cache thus recorded that g' holds validity towards its relation with all graphs in the current dataset (i.e., CGvalid covers the dataset graphs with ids 0, 1, 2, 3), among which g' ⊄ G0, g' ⊄ G1, g' ⊆ G2 and g' ⊆ G3. Then, at time T2, the dataset was changed by adding a new graph G4 and updating G3 with edge removals. Obviously, there is no clue as to whether G4 contains g' or not, i.e., g' does not hold validity regarding its relation with the newly arrived G4. As for G3, we had g' ⊆ G3, which becomes unknown as removing edges may result in g' ⊄ G3; hence, the validity of g' pertaining to G3 is turned off. Subsequently, at time T3, another query g'' was executed and later entered the cache, holding validity towards each graph in the up-to-date dataset of five graphs. Again, the dataset changed at time T4 – graph G0 was deleted and G1 was updated by edge additions. Regarding the current dataset {G1, G2, G3, G4}, g' ⊄ G1 is no longer guaranteed, since adding edges may introduce g' ⊆ G1; g', as well as g'', thus loses its validity on G1. Therefore, when the new query g arrives at T5, it is facilitated by the cached graphs g' and g'', each with up-to-date validity information pertaining to all current dataset graphs, ensuring the cache consistency of the CON model.
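The following Java sketch illustrates the per-cached-query bookkeeping behind this example: an Answer BitSet and a CGvalid BitSet, with validity bits flipped as dataset operations arrive. It is a simplified, eager variant for illustration only (GC+ itself batches the updates via Algorithms 1 and 2 below), and the class and method names are hypothetical.

    import java.util.BitSet;

    // Illustrative sketch of the per-cached-query state of Section 5.2; names are hypothetical.
    final class CachedQueryEntry {
        final BitSet answer  = new BitSet();  // bit i set => the query was found to be contained in graph i
        final BitSet cgValid = new BitSet();  // bit i set => the cached relation towards graph i still holds

        // A newly admitted query is valid w.r.t. every graph of the dataset it was executed against.
        void admit(BitSet answerAtExecutionTime, int datasetSize) {
            answer.or(answerAtExecutionTime);
            cgValid.set(0, datasetSize);
        }

        // ADD: nothing is known about the new graph, so its (fresh) bit simply stays false.
        // DEL: the graph is gone; drop its validity so it can never contribute.
        void onDelete(int graphId)       { cgValid.clear(graphId); }
        // UA (edge additions): a cached "not contained" result may become invalid.
        void onEdgeAddition(int graphId) { if (!answer.get(graphId)) cgValid.clear(graphId); }
        // UR (edge removals): a cached "contained" result may become invalid.
        void onEdgeRemoval(int graphId)  { if (answer.get(graphId)) cgValid.clear(graphId); }
    }

    // Example (Figure 2, time T2): the edge removal on G3 turns off g''s validity bit 3,
    //   gPrime.onEdgeRemoval(3);
    // while the newly added G4 simply leaves bit 4 of cgValid unset.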
5.2.1 Analyzing Dataset Log
GC+ is designed to ensure that the CON cache retains its full potential to benefit queries. To this end, both the Dataset Manager and the Cache Manager subsystems implement CON-specific mechanisms.
First, the Dataset Manager subsystem employs the Log Analyzer component to categorize dataset changes, acting as a preprocessing step for validating the CON cache. Algorithm 1 illustrates how this analysis is performed.

Algorithm 1 Analyzing Log for the CON Cache
1: Input: Dataset update log L
2: Output: A container C with counters categorizing the operations performed on dataset graphs
3:
4: Initialize C with an empty HashMap per counter (CT, CA and CR)
5: Extract the incremental records R from L
6: for all r ∈ R do
7:   i = id of the dataset graph G in r
8:   t = operation type in r
9:   switch t do
10:    case UA
11:      CA.get(i) += 1
12:      break
13:    case UR
14:      CR.get(i) += 1
15:      break
16:  CT.get(i) += 1
17: end for
18: return C

Briefly, the incremental records that have not yet been reflected in the cache are first extracted from the dataset log (line 5). The Log Analyzer maintains a container with three counters, each implemented as a HashMap whose key is a dataset graph id and whose value is the count of operations on that graph. The said three counters (CT, CA and CR) (line 4) count the total, UA and UR operations, respectively. Each record identifies the affected dataset graph and its operation type (lines 7–8). Exhausting these records (lines 9–16) hence yields the total counter CT, the UA-exclusive counter CA and the UR-exclusive counter CR.
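A direct Java rendering of Algorithm 1 is sketched below. The UpdateRecord type and the OpType names are hypothetical stand-ins for GC+'s actual log format; the counter container matches the CT/CA/CR HashMaps of the pseudocode.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative Java sketch of Algorithm 1; record and enum types are hypothetical.
    final class LogAnalyzerSketch {
        enum OpType { ADD, DEL, UA, UR }

        static final class UpdateRecord {
            final int graphId; final OpType type;
            UpdateRecord(int graphId, OpType type) { this.graphId = graphId; this.type = type; }
        }

        static final class CounterContainer {
            final Map<Integer, Integer> total = new HashMap<>();  // CT
            final Map<Integer, Integer> ua    = new HashMap<>();  // CA
            final Map<Integer, Integer> ur    = new HashMap<>();  // CR
        }

        // One pass over the not-yet-reflected records, counting operations per dataset graph id.
        static CounterContainer analyze(List<UpdateRecord> incrementalRecords) {
            CounterContainer c = new CounterContainer();
            for (UpdateRecord r : incrementalRecords) {
                if (r.type == OpType.UA)      c.ua.merge(r.graphId, 1, Integer::sum);
                else if (r.type == OpType.UR) c.ur.merge(r.graphId, 1, Integer::sum);
                c.total.merge(r.graphId, 1, Integer::sum);  // CT counts every operation type
            }
            return c;
        }
    }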
5.2.2 Validating CON Cache
Then, the counter container returned by the Dataset Manager subsystem is forwarded to the Cache Manager subsystem, where the Cache Validator refreshes the dataset-graph-validity indicator of the cached graphs. The CON cache targets the biggest possible benefit for query processing; the Cache Validator hence strives to exploit every useful previous query result. In GC+, once a query is executed, its answer set is finalized, which snapshots the query's relation to the dataset at execution time – even if the dataset undergoes changes later, GC+ will not repeat the processing of previous queries. Therefore, to deal with dataset changes, GC+ employs a BitSet indicator CGvalid per cached query, with each bit identifying the up-to-date validity of the query's relation towards a dataset graph.

Algorithm 2 Refreshing a cached graph's validity indicator
1: Input: Counter container C (containing CT, CA and CR), currently maximum graph id m in dataset, stored info of a cached graph (with its dataset-graph-validity indicator CGvalid and query result Answer, both structured as BitSets)
2: Function: Updating CGvalid
3:
4: if (m + 1) > CGvalid.size then
5:   Extend CGvalid to length (m + 1), assigning false to the extended bits
6: end if
7: for all i ∈ CT.keySet() do
8:   tc = CT.get(i)
9:   uac = !CA.containsKey(i) ? 0 : CA.get(i)
10:  urc = !CR.containsKey(i) ? 0 : CR.get(i)
11:
12:  if tc == uac ∧ CGvalid.get(i) ∧ Answer.get(i) then
13:    continue
14:  else if tc == urc ∧ CGvalid.get(i) ∧ !Answer.get(i) then
15:    continue
16:  else
17:    CGvalid.set(i, false)
18:  end if
19: end for

Algorithm 2 depicts how the CGvalid of a cached graph g is refreshed by the Cache Validator. To start with, CGvalid is checked as to whether it contains all the required bits (line 4; dataset graph ids start from 0). If not, there are newly added dataset graphs; the relation between g and those new graphs is obviously unknown, and the extended bits are thus assigned false (line 5). The idea is then to make the recent dataset changes take effect on the relevant bits of CGvalid (lines 7–19). Specifically, for each concerned dataset graph Gi (identified by i in line 7), its number of total operations (tc), UA operations (uac) and UR operations (urc) are retrieved from the input counters (lines 8–10). If the operations on dataset graph Gi are exclusively of the UA category (tc == uac in line 12), together with a valid (CGvalid.get(i) in line 12) query result g ⊆ Gi (Answer.get(i) in line 12), such validity still holds (g ⊆ Gi is not affected by adding edges to Gi). Similarly, if the operations on dataset graph Gi are exclusively of the UR category, the query result g ⊄ Gi (!Answer.get(i) in line 14) remains valid. Any other combination of operations voids g's validity on Gi (line 17). By harnessing the validity per cached query and dataset graph, as well as the optimizations of the UA-exclusive and UR-exclusive cases, the CON model manages to enhance the cache's efficiency in expediting graph queries.
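For concreteness, the following sketch expresses Algorithm 2 with java.util.BitSet. Note that java.util.BitSet grows on demand and unset bits read as false, so the explicit extension of lines 4–6 is implicit here; the method and parameter names are otherwise hypothetical.

    import java.util.BitSet;
    import java.util.Map;

    // Illustrative Java sketch of Algorithm 2; names are hypothetical.
    final class CacheValidatorSketch {
        static void refresh(Map<Integer, Integer> ct,   // CT: total operations per graph id
                            Map<Integer, Integer> ca,   // CA: UA (edge addition) operations
                            Map<Integer, Integer> cr,   // CR: UR (edge removal) operations
                            BitSet cgValid,             // validity indicator of one cached query
                            BitSet answer) {            // cached answer of that query
            for (Map.Entry<Integer, Integer> e : ct.entrySet()) {
                int i   = e.getKey();
                int tc  = e.getValue();
                int uac = ca.getOrDefault(i, 0);
                int urc = cr.getOrDefault(i, 0);
                boolean uaOnlyAndContained    = (tc == uac) && cgValid.get(i) && answer.get(i);
                boolean urOnlyAndNotContained = (tc == urc) && cgValid.get(i) && !answer.get(i);
                if (!uaOnlyAndContained && !urOnlyAndNotContained) {
                    cgValid.clear(i);  // any other combination voids the cached relation to graph i
                }
            }
        }
    }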
6. QUERY PROCESSING
This section illustrates the specific logic of the Candidate Set Pruner in GC+. For space reasons, we only present subgraph queries (supergraph queries follow the exact inverse logic). As shown in Figure 2, when a query g arrives, GC+ discovers whether g is a subgraph or supergraph of cached queries concurrently, via the processors GC+sub and GC+super, referred to as the subgraph and supergraph cases respectively.

Figure 3: GC+ Processing of a Query g – (a) Subgraph Case; (b) Supergraph Case

6.1 Subgraph Case
For example, in Figure 3(a), the new query g comes with candidate set CS_M(g) (the current dataset) containing four graphs {G1, G2, G3, G4}. The GC+sub Processor finds that there exists a previous query g' such that g ⊆ g'. Then g''s cached answer set {G2, G3}, as well as its up-to-date validity indicator CGvalid(g') = {G2}, is retrieved. (As mentioned in Algorithm 2, both Answer(g') and CGvalid(g') are BitSet structures; here, we use sets of dataset graph ids to ease the illustration.)
For graph G2 ∈ CS_M(g): considering g ⊆ g' and g' ⊆ G2 (which still holds for the current dataset, as G2 exists in CGvalid(g')), it must follow that g ⊆ G2. Hence, G2 can be safely removed from CS_M(g) and added directly to the final answer set of g. In contrast, G3 is not sub-iso exempted even though it appears in g''s cached answer set, as g' ⊆ G3 no longer holds for the up-to-date dataset (G3 does not appear in CGvalid(g')) – g' ⊆ G3 was established when executing the previous query g' but has been invalidated by subsequent dataset changes (e.g., G3 was updated by removing some edges).
Therefore, the logic of GC+ for the subgraph case is that only dataset graphs in CGvalid(g') ∩ Answer(g') are sub-iso test-free; they can be safely removed from CS_M(g) and added directly to the final answer set of query g. In the general case, g may be a subgraph of multiple previous query graphs g'_i. Then, the set of sub-iso test-free graphs Answer_sub(g) is:

Answer_sub(g) = ⋃_{g'_i ∈ Result_sub(g)} ( CGvalid(g'_i) ∩ Answer(g'_i) )    (1)

where Result_sub(g) contains all the currently cached queries of which g is a subgraph.
Hence, the set of dataset graphs for sub-iso testing is:

CS_GC+sub(g) = CS_M(g) \ Answer_sub(g)    (2)

Finally, if Answer_GC+sub(g) is the set of graphs verified to contain g through sub-iso tests on CS_GC+sub(g), the final answer set for query g is:

Answer(g) = Answer_GC+sub(g) ∪ Answer_sub(g)    (3)

Lemma 1. The final answer of GC+ in the subgraph case does not contain false positives.
Proof. Assume false positives are possible and consider the first ever false positive produced by GC+; i.e., for some query g, ∃G_FP such that g ⊄ G_FP and G_FP ∈ Answer(g). Note that G_FP cannot be in Answer_GC+sub(g), where each graph has passed the sub-iso test; it follows by formula (3) that G_FP ∈ Answer_sub(g). Furthermore, formula (1) implies ∃g' such that g ⊆ g', G_FP ∈ CGvalid(g') and G_FP ∈ Answer(g'). Hence, G_FP ∈ Answer(g') is valid for the up-to-date dataset, i.e., g' ⊆ G_FP. But g' ⊆ G_FP and g ⊆ g' ⇒ g ⊆ G_FP (a contradiction).

Lemma 2. The final answer of GC+ in the subgraph case does not introduce false negatives.
Proof. Assume false negatives are possible and consider the first ever false negative produced by GC+; i.e., for some query g, ∃G_FN such that g ⊆ G_FN and G_FN ∉ Answer(g). As G_FN ∈ CS_M(g) in GC+ by default and sub-iso testing does not introduce any false negative, the only possibility for error is that G_FN was removed by formula (2), i.e., G_FN ∉ CS_GC+sub(g). That implies G_FN ∈ Answer_sub(g), i.e., ∃g' such that g ⊆ g' and G_FN ∈ CGvalid(g') ∩ Answer(g'). But then, by formula (3), G_FN is added to the final answer via Answer_sub(g), and thus G_FN ∈ Answer(g) (a contradiction).

Theorem 3. The final answer of GC+ in the subgraph case is correct.
Proof. There are only two possibilities for error: GC+ can produce false negatives or false positives. The theorem then follows straightforwardly from Lemmas 1 and 2.
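To show how formulas (1)–(3) map onto the BitSet representation of §5.2.2, the following Java sketch computes the sub-iso test-free set, the pruned candidate set and the final answer for the subgraph case. The surrounding plumbing (how Result_sub(g) is obtained) is assumed; all names are illustrative.

    import java.util.BitSet;
    import java.util.List;

    // Illustrative sketch of formulas (1)-(3) on BitSets indexed by dataset graph id.
    // cgValids.get(k) and answers.get(k) hold CGvalid(g'_k) and Answer(g'_k) for the k-th
    // cached query of which g is a subgraph (i.e., Result_sub(g)).
    final class SubgraphCasePruner {
        // Answer_sub(g) = union over g'_k of (CGvalid(g'_k) AND Answer(g'_k))     -- formula (1)
        static BitSet answerSub(List<BitSet> cgValids, List<BitSet> answers) {
            BitSet answerSub = new BitSet();
            for (int k = 0; k < cgValids.size(); k++) {
                BitSet safe = (BitSet) cgValids.get(k).clone();
                safe.and(answers.get(k));
                answerSub.or(safe);
            }
            return answerSub;
        }

        // CS_GC+sub(g) = CS_M(g) minus Answer_sub(g)                               -- formula (2)
        static BitSet prunedCandidates(BitSet csM, BitSet answerSub) {
            BitSet cs = (BitSet) csM.clone();
            cs.andNot(answerSub);
            return cs;
        }

        // Answer(g) = Answer_GC+sub(g) union Answer_sub(g)                          -- formula (3)
        static BitSet finalAnswer(BitSet verifiedAnswers, BitSet answerSub) {
            BitSet answer = (BitSet) verifiedAnswers.clone();
            answer.or(answerSub);
            return answer;
        }
    }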
6.2 Supergraph Case
Figure 3(b) depicts an example of the supergraph case in GC+, where the GC+super Processor identifies that there exists a previous query g'' satisfying g'' ⊆ g. For g'', the cached answer set {G2, G3} and the dataset-graph-validity indicator CGvalid(g'') = {G2, G3, G4} are retrieved. Again, the new query g comes with candidate set CS_M(g) = {G1, G2, G3, G4}. Compared with the subgraph case, reasoning about the logic of GC+ in the supergraph case is more interesting. (i) Consider graph G1 ∈ CS_M(g): G1 does not hold validity regarding its relation with g'', hence no previous query result about G1 can be utilized; g has to rely on the sub-iso test to identify whether G1 is in the final answer set. (ii) For graph G2 ∈ CS_M(g): g'' ⊆ G2 holds, but this does not make G2 sub-iso test free even given g'' ⊆ g, as the relation between g and G2 is still unknown and has to be verified by a sub-iso test. Similarly, G3 is not free of its sub-iso test either. (iii) For graph G4 ∈ CS_M(g): G4 holds validity for its relation with g'', namely g'' ⊄ G4. Since g'' ⊆ g, if g ⊆ G4 were true, then g'' ⊆ G4 would follow, which is a contradiction. So it is safe to conclude that g ⊄ G4, and thus G4 can be removed from CS_M(g). Overall, among the graphs in CS_M(g), those failing to appear in ¬CGvalid(g'') ∪ Answer(g'') can never be in the final answer set of g and thus become sub-iso test free, where ¬CGvalid(g'') denotes the complement of CGvalid(g'') with respect to the up-to-date dataset. In other words, the set ¬CGvalid(g'') ∪ Answer(g'') covers all graphs that could possibly be in the final answer set of g; we denote it g''.Answer_super(g), i.e.,

g''.Answer_super(g) = ¬CGvalid(g'') ∪ Answer(g'')    (4)

Performing sub-iso tests on CS_M(g) ∩ g''.Answer_super(g) therefore yields the verified query answer Answer(g).
In the general case, g may be a supergraph of multiple previous query graphs g''_j. Then, the set of graphs tested for sub-iso by GC+ is:

CS_GC+super(g) = CS_M(g) ∩ ⋂_{g''_j ∈ Result_super(g)} g''_j.Answer_super(g)    (5)

where Result_super(g) contains all the currently cached queries of which g is a supergraph.
The final answer for query g, Answer(g), is the set of graphs in CS_GC+super(g) that pass the sub-iso test.

Lemma 4. The final answer of GC+ in the supergraph case does not contain false positives.
Proof. This follows trivially by construction, as all graphs in Answer(g) have passed subgraph isomorphism testing at the final stage of query processing.

Lemma 5. The final answer of GC+ in the supergraph case does not introduce false negatives.
Proof. Assume false negatives are possible and consider the first ever false negative produced by GC+; i.e., for some query g, ∃G_FN such that g ⊆ G_FN and G_FN ∉ Answer(g). Since G_FN ∈ CS_M(g), the only possibility for error is that G_FN is removed from CS_GC+super(g) by formula (5). This implies that ∃g'' such that g'' ⊆ g and G_FN ∉ g''.Answer_super(g). By formula (4), it follows that G_FN ∉ ¬CGvalid(g'') and G_FN ∉ Answer(g''), with G_FN ∉ ¬CGvalid(g'') ⇒ G_FN ∈ CGvalid(g''). Hence, for the up-to-date dataset, G_FN ∉ Answer(g'') is valid, i.e., g'' ⊄ G_FN. But g ⊆ G_FN and g'' ⊆ g ⇒ g'' ⊆ G_FN (a contradiction).

Theorem 6. The final answer of GC+ in the supergraph case is correct.
Proof. There are only two possibilities for error: GC+ can produce false negatives or false positives. The theorem then follows straightforwardly from Lemmas 4 and 5.

6.3 Putting It All Together and Optimal Cases
The Query Processing Runtime subsystem first applies formula (2) on CS_M and then applies formula (5) on the result of the previous operation. The end result is a reduced candidate set, which is then sub-iso tested.
Additionally, there are two optimal cases that warrant further performance gains. First, note that GC+ can easily recognize the case where a new query g is isomorphic to a previously cached query g'. For connected query graphs, this holds provided that (i) g ⊆ g' or g ⊇ g'; (ii) g and g' have the same number of nodes and edges; and (iii) g' holds validity on all the up-to-date dataset graphs. In this case, GC+ can return the cached result of g' directly, requiring no sub-iso test at all. Second, consider the case where, for a new query g, GC+ discovers a cached query g'' such that g'' ⊆ g, Answer(g'') = ∅, and g'' holds validity on all graphs currently in the dataset. Then GC+ can directly return an empty result set for g. The reasoning is that if there were a dataset graph G such that g ⊆ G, then since g'' ⊆ g we would conclude g'' ⊆ G ⇒ G ∈ Answer(g''), contradicting the fact that Answer(g'') = ∅; thus no such graph G can exist and the final result set of g is necessarily empty.
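As a companion to the sketch after §6.1, the following Java sketch illustrates formulas (4)–(5) and the pruning order of §6.3: the candidate set is assumed to have already been reduced by formula (2), and is then intersected with every cached supergraph-case bound. As before, BitSet bits are indexed by dataset graph id and all names are illustrative.

    import java.util.BitSet;
    import java.util.List;

    // Illustrative sketch of formulas (4)-(5) and the pruning order of Section 6.3.
    // datasetSize = m + 1, i.e., one bit per current dataset graph id.
    final class SupergraphCasePruner {
        // g''.Answer_super(g) = complement(CGvalid(g'')) union Answer(g'')       -- formula (4)
        static BitSet answerSuper(BitSet cgValid, BitSet answer, int datasetSize) {
            BitSet bound = (BitSet) cgValid.clone();
            bound.flip(0, datasetSize);   // complement w.r.t. the up-to-date dataset
            bound.or(answer);
            return bound;
        }

        // CS_GC+super(g): intersect the candidate set (already pruned by formula (2))
        // with every cached bound g''_j.Answer_super(g)                           -- formula (5)
        static BitSet prune(BitSet csAfterSubgraphPruning,
                            List<BitSet> cgValids, List<BitSet> answers, int datasetSize) {
            BitSet cs = (BitSet) csAfterSubgraphPruning.clone();
            for (int j = 0; j < cgValids.size(); j++) {
                cs.and(answerSuper(cgValids.get(j), answers.get(j), datasetSize));
            }
            return cs;   // the graphs that still require sub-iso testing by M_verifier
        }
    }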
7. PERFORMANCE EVALUATION

7.1 Experimental Setup
GC+ is implemented in Java, extending the graph caching system GC [26]. Experiments were performed on a Dell R920 host (4 Intel Xeon E7-4870 CPUs with 15 cores each, 320GB of RAM and 4×1TB disks), running Ubuntu Linux 14.04.4 LTS. Following GC, the default upper limits in GC+ on the sizes of the Cache and Window stores were 100 and 20 queries respectively.
Method M. We used three well-established SI methods: GraphQL (GQL) as provided by [14] and a modified VF2 [3] (denoted VF2+) as provided by [11], both being well-established and good performers [14, 8]; we also used vanilla VF2 [3], as it is extensively used in FTV methods.
Dataset. We employ the real-world dataset AIDS [19] (the Antiviral Screen Dataset of the National Cancer Institute), which is prevalently used in the literature [14, 7, 1, 11]. AIDS contains 40,000 graphs, each with on average ≈45 vertices (std.dev.: 22, max: 245) and ≈47 edges (std.dev.: 23, max: 250), whereby the few largest graphs have an order of magnitude more vertices and edges.
Dataset Change Plan. Dataset change operations are performed in batches, with occurrence times indicated by the id of queries in the workload (recall Figure 2). The plan we used for AIDS consists of 2,000 operations (100 batches of 20 operations each) during the processing of 10,000 queries. A batch of operations is generated as follows: first, an occurrence time for the batch is selected uniformly at random from the query ids; then, a specific operation is codetermined by a type selected uniformly from {ADD, DEL, UA, UR}, a graph selected uniformly from the dataset (ADD draws from the initial dataset instead of synthesizing additional graphs, so as to preserve the original dataset characteristics as much as possible; DEL, UA and UR use the up-to-date dataset at running time), and, when the selected type is UA or UR, a uniformly selected edge within that graph (UA adds an edge that is not yet in the graph; UR removes an existing edge of the graph). This process is repeated until the batch contains the required number of operations.
We follow the established principle for generating our workloads, using two different algorithms to synthesize queries from the initial dataset graphs. Each workload consists of 10,000 queries with sizes typical in the literature [11, 27] – 4, 8, 12, 16 and 20 edges.
Type A Workloads. Queries of these workloads are generated in the following manner: first, a source graph is randomly selected from the dataset graphs; then, a node is randomly selected in the said graph; finally, a query size is selected uniformly at random from the given sizes and a BFS is performed starting from the selected node. For each new node, all its edges connecting it to already visited nodes are added to the generated query, until the desired query size is reached. For the first two random selections above, we have used two different distributions, namely Uniform (U) and Zipf (Z), with the probability density function of the latter given by p(x) = x^(-α)/ζ(α), where ζ is the Riemann Zeta function [21]. Ultimately, we had three categories of Type A workloads: "UU", "ZU" and "ZZ", where the first letter in each pair denotes the distribution used for selecting the starting graph, and the second that for the starting node.
Type B Workloads (with no-answer queries). These workloads are generated as follows. For each of the query sizes, we first create two query pools: a 10,000-query pool with queries with non-empty answer sets against the initial dataset, and a second, 3,000-query pool with no match in any dataset graph (i.e., an empty result set). Queries for the first pool are extracted from dataset graphs by uniformly selecting a start node across all nodes in all dataset graphs, and then performing a random walk till the required query graph size is reached. The generation of no-answer queries has one extra step: we continuously relabel the nodes in the query with randomly selected labels from the dataset, until the resulting query has a non-empty candidate set but an empty answer set against the dataset graphs. Once the query pools are filled, we generate workloads by first flipping a biased coin to choose between the two pools (with the "no-answer" pool selected with probability 0%, 20% or 50%), then randomly (Zipf) selecting a query from the chosen pool. We thus have three categories of Type B workloads: "0%", "20%" and "50%", denoting the above probability.
We use Zipf α = 1.4 by default; as a reference point, web page popularities follow a Zipf distribution with α = 2.4 [21]. We allow a warm-up of only one Window (i.e., 20 queries) before we start measuring GC+'s performance. We report on both the benefits and the overheads of GC+. Reported metrics include query time and number of sub-iso tests per query, along with the speedups introduced by GC+. Speedup is defined as the ratio of the average performance (query time or number of sub-iso tests) of the base Method M over the average performance of GC+ when deployed over Method M (i.e., speedups >1 indicate improvements). As a yardstick, [17] (also a cache, but for XML databases) reports a query time speedup of 2.6× with 10,000-query workloads generated using Zipf α = 1.5 and a 1,500-query warm-up.
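Before turning to the replacement policy, the following Java sketch illustrates the Zipf (Z) selection used above for picking source graphs and nodes. Since the pools are finite, the sketch truncates and renormalizes the mass p(x) ∝ x^(-α) over ranks 1..n; this truncation is our assumption, as the paper only gives the general form p(x) = x^(-α)/ζ(α).

    import java.util.Random;

    // Illustrative sketch of a truncated Zipf selector over n items (graphs or nodes).
    final class ZipfSelector {
        private final double[] cumulative;  // cumulative distribution over ranks 1..n
        private final Random random;

        ZipfSelector(int n, double alpha, long seed) {
            double[] weights = new double[n];
            double sum = 0.0;
            for (int x = 1; x <= n; x++) {
                weights[x - 1] = Math.pow(x, -alpha);
                sum += weights[x - 1];
            }
            cumulative = new double[n];
            double acc = 0.0;
            for (int i = 0; i < n; i++) {
                acc += weights[i] / sum;
                cumulative[i] = acc;
            }
            random = new Random(seed);
        }

        // Returns a 0-based index; low indices (popular items) are drawn most often.
        int next() {
            double u = random.nextDouble();
            for (int i = 0; i < cumulative.length; i++) {
                if (u <= cumulative[i]) return i;
            }
            return cumulative.length - 1;
        }
    }

    // Example: pick source graphs for a "ZU" workload with the paper's default alpha = 1.4:
    //   ZipfSelector graphPicker = new ZipfSelector(40000, 1.4, 42);
    //   int sourceGraphId = graphPicker.next();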
Cache Replacement Policy. GC+ incorporates all the replacement policies developed in GC. Here, we use the hybrid (HD) policy for our experiments, as its performance is always better than or on par with the best alternative [26]. Specifically, HD coalesces two other GC/GC+-exclusive policies, PIN and PINC. PIN scores each graph in the cache by the total number of subgraph isomorphism tests alleviated by the said graph (coined R). PINC extends this ranking by taking into account the possibly vast differences in query execution time (denoted C), where cost is estimated by a heuristic [25]. As C is estimated, using it in PINC does not always lead to improvements in query time. Through a large number of experiments, we have observed that when the values of R exhibit high variability, they are discriminative enough on their own, and considering C can actually lead to lower time gains (i.e., PIN performs better than PINC). However, when the values of R exhibit low variability, adding in the C component leads to considerable query time improvements. When the HD policy is invoked, it first retrieves the R values from the Statistics Manager and computes their variability [20] using the (squared) coefficient of variation (CoV). CoV is defined as the ratio of the (square of the) standard deviation over the (square of the) mean of the distribution. When CoV > 1, the associated distribution is deemed of high variability, as exponential distributions have CoV = 1 and typically hyper-exponential distributions (which capture many high-variance, heavy-tailed distributions) have CoV > 1. In this case, HD performs cache eviction using PIN's scoring scheme; otherwise, it turns to PINC's scoring scheme.
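The HD dispatch rule above boils down to one statistic and one threshold; the short Java sketch below shows the squared-CoV computation over the R values and the resulting choice. It only illustrates the decision rule; the actual PIN/PINC scoring code of GC+ is not shown, and the names used here are labels rather than GC+ APIs.

    // Illustrative sketch of HD's dispatch rule: squared CoV of R > 1 selects PIN, else PINC.
    final class HybridPolicySketch {
        enum Scoring { PIN, PINC }

        // Squared CoV = variance / mean^2 (assumes a non-empty sample with non-zero mean).
        static double squaredCoV(double[] rValues) {
            double mean = 0.0;
            for (double r : rValues) mean += r;
            mean /= rValues.length;
            double variance = 0.0;
            for (double r : rValues) variance += (r - mean) * (r - mean);
            variance /= rValues.length;
            return variance / (mean * mean);
        }

        static Scoring choose(double[] rValues) {
            if (rValues.length == 0) return Scoring.PINC;  // degenerate case: no statistics yet
            // High variability (CoV > 1, as for hyper-exponential distributions): R alone is
            // discriminative enough, so use PIN; otherwise factor in the estimated cost C via PINC.
            return squaredCoV(rValues) > 1.0 ? Scoring.PIN : Scoring.PINC;
        }
    }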
7.2 Results and Insights

Figure 4: GC+ Speedup in Query Time

Figure 4 depicts the query time speedups of GC+ across all methods M and workloads. We can see that CON achieves considerable speedups with the meagre 100-query cache configuration, whereas the gains of EVI are limited. Note the interesting finding that the speedups for the UU workload are close to those of ZU (e.g., 4.77 vs 5.13 with the VF2 base method), whereas one might have expected a different outcome. Intuitively, the ZU workload bears more exact-match hits than UU, due to the skewness of selecting source graphs during query generation. And it does: we measured 2.5X the number of exact-match cache hits in ZU vs UU. However, in GC+, an exact-match cache hit does not necessarily render a query sub-iso test free (see §6.3) – queries resulting in zero sub-iso tests account for merely 4% of the exact-match cache hits of the said ZU workload (vs 11% in UU). Moreover, recall that GC+ also exploits subgraph/supergraph hits. Indeed, we measured circa 2X such matches for the UU workload vs ZU. Of course, the overall performance result is a very complex picture and depends on how large the benefit of each saved exact-match is compared to that of each saved subgraph/supergraph match. But the key insight here is that, by utilizing exact-matches and subgraph/supergraph matches, GC+ can benefit both skewed and non-skewed workloads.

Figure 5: GC+ Speedup in Number of Sub-iso Tests

Figure 5 shows the speedups in the number of sub-iso tests performed. Please note that under a given configuration (specified by dataset, dataset change plan, workload, replacement policy, cache model EVI/CON, and the upper limits on the sizes of the Cache and Window stores), whatever SI method serves as Method M, GC+ produces exactly the same pruned candidate set for each query. Therefore, Figure 5 is independent of the three methods considered in this work. Juxtaposing Figures 4 and 5 leads to an interesting insight: reductions in the number of sub-iso tests do not translate directly into reductions in query time, echoing the claim that graph cache hits render different benefits [26]. In all cases, though, GC+ achieves significant improvements in both query processing time and number of sub-iso tests performed.

Figure 6: Average Execution Time and Overhead per Query

Figure 6 depicts a break-down of query processing time for Method M, EVI and CON. EVI pays an overhead for updating the Window and Cache data stores, including executing the cache replacement algorithms and re-indexing the cached graphs. The overhead of CON additionally covers the time for analyzing the dataset log and validating the cache (see Algorithms 1 and 2) – this CON-specific cost is trivial, taking less than 1% of the CON overhead across all the aforementioned workloads and methods M. This confirms a significant conclusion: the CON-exclusive Algorithms 1 and 2 are efficient. The dominant part of the CON overhead (for updating the Window and Cache data stores) is higher than that of EVI, as the latter is frequently purged and hence bears less to be updated. Putting the speedup and overhead figures side by side, it is obvious that CON sweeps past EVI in query processing speedup at a negligible additional overhead.

8. CONCLUSIONS
We presented a systematic solution to handle graph cache consistency, by providing an upgraded system, GC+, capable of expediting general subgraph/supergraph queries under dataset changes. We developed two GC+-exclusive cache models, EVI and CON, with different mechanisms for ensuring cache consistency. We illustrated the specific logic of GC+ in reducing the candidate set for query execution and formally proved its correctness. Our performance evaluation has demonstrated the considerable speedup achieved by CON. Future work includes further optimizing the CON cache with retrospective validation mechanisms, developing a distributed/decentralized version of GC+, and extending GC+ to benefit subgraph queries that find all occurrences of a query graph within a single massive graph.
9. REFERENCES
[1] V. Bonnici et al. Enhancing graph database indexing by suffix tree structure. In Proc. IAPR PRIB, 2010.
[2] S. Bottcher. Cache consistency in mobile XML databases. In Proc. WAIM, pages 300–312, 2006.
[3] L. P. Cordella, P. Foggia, C. Sansone, and M. Vento. A (sub)graph isomorphism algorithm for matching large graphs. IEEE TPAMI, 26(10):1367–1372, 2004.
[4] M. Garey and D. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, 1979.
[5] A. Gubichev and T. Neumann. Exploiting the query structure for efficient join ordering in SPARQL queries. In Proc. EDBT, 2014.
[6] H. Guo et al. Relaxed currency and consistency: How to say "good enough" in SQL. In Proc. SIGMOD, 2004.
[7] W.-S. Han, J. Lee, M.-D. Pham, and J. X. Yu. iGraph: A framework for comparisons of disk-based graph indexing techniques. PVLDB, 3(1-2):449–459, 2010.
[8] F. Katsarou, N. Ntarmos, and P. Triantafillou. Performance and scalability of indexed subgraph query processing methods. PVLDB, 8(12):1566–1577, 2015.
[9] F. Katsarou, N. Ntarmos, and P. Triantafillou. Subgraph querying with parallel use of query rewritings and alternative algorithms. In Proc. EDBT, 2017.
[10] J. Kim, H. Shin, and W.-S. Han. Taming subgraph isomorphism for RDF query processing. PVLDB, 8(11):1238–1249, 2015.
[11] K. Klein, N. Kriege, and P. Mutzel. CT-index: Fingerprint-based graph indexing combining cycles and trees. In Proc. ICDE, 2011.
[12] L. Lai et al. Scalable subgraph enumeration in MapReduce. PVLDB, 8(10):974–985, 2015.
[13] L. V. S. Lakshmanan et al. Answering tree pattern queries using views. In Proc. VLDB, 2006.
[14] J. Lee et al. An in-depth comparison of subgraph isomorphism algorithms in graph databases. PVLDB, 6(2):133–144, 2012.
[15] K. Lillis and E. Pitoura. Cooperative XPath caching. In Proc. SIGMOD, 2008.
[16] J. Lorey et al. Caching and prefetching strategies for SPARQL queries. In Proc. USEWOD, 2013.
[17] B. Mandhani and D. Suciu. Query caching and view selection for XML databases. In Proc. VLDB, 2005.
[18] M. Martin, J. Unbehauen, and S. Auer. Improving the performance of semantic web applications with SPARQL query caching. In Proc. ESWC, 2010.
[19] NCI - DTP AIDS antiviral screen dataset. http://dtp.nci.nih.gov/docs/aids/aids_data.html.
[20] R. Nelson. Probability, Stochastic Processes, and Queueing Theory. Springer Verlag, 1995.
[21] M. Newman. Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46:323–351, 2005.
[22] N. Papailiou, D. Tsoumakos, P. Karras, and N. Koziris. Graph-aware, workload-adaptive SPARQL query caching. In Proc. SIGMOD, 2015.
[23] K. Semertzidis and E. Pitoura. Durable graph pattern queries on historical graphs. In Proc. ICDE, 2016.
[24] Z. Sun et al. Efficient subgraph matching on billion node graphs. PVLDB, 5(9):788–799, 2012.
[25] J. Wang, N. Ntarmos, and P. Triantafillou. Indexing query graphs to speedup graph query processing. In Proc. EDBT, 2016.
[26] J. Wang, N. Ntarmos, and P. Triantafillou. GraphCache: a caching system for graph queries. In Proc. EDBT, 2017.
[27] X. Yan et al. Graph indexing: a frequent structure-based approach. In Proc. SIGMOD, 2004.