<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ensuring Consistency in Graph Cache for Graph-Pattern Queries</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jing Wang</string-name>
          <email>j.wang.3@research.gla.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikos Ntarmos</string-name>
          <email>nikos.ntarmos@glasgow.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Triantafillou</string-name>
          <email>llou@glasgow.ac.uk</email>
          <email>peter.triantafillou@glasgow.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computing Science, University of Glasgow</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Graph queries are costly, as they entail the NP-Complete subgraph isomorphism problem. Graph caching had been recently suggested, showing the potential to signi cantly accelerate subgraph/supergraph queries. Subsequently, GraphCache, the rst full- edged graph caching system was put forth. However, when the underlying dataset changes concurrently with the query workload proceeding, how to ensure the graph cache consistency becomes an issue. The current work provides a systematic solution to address this problem, by presenting an upgraded GraphCache system coined GraphCache+ (abbreviated as GC+). We develop two GC+ exclusive models that employ di erent approaches to deal with the consistency issue. Moreover, we present the logic of GC+ in expediting queries, bundled with the formally proved correctness. We evaluate the performance of GC+ by a real-world graph dataset and a number of query workloads with di erent characteristics, highlighting the considerable speedup in term of quanti ed bene t and overhead.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Graph queries; Caching system; Cache consistency</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>
        Graph structured data is increasingly prevalent in modern
big data applications. Central to high performance graph
analytics is to locate patterns in dataset graphs. Informally,
given a query graph g, the system is called to return the
set of dataset graphs that contain g (subgraph query) or are
contained in g (resp. supergraph query), named the answer
set of g. These operations can be time consuming for the
NP-Complete problem of subgraph isomorphism [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Hence,
the community has contributed a number of innovative
solutions in recent years. One research thread follows the \
lterthen-verify" (FTV) paradigm: rst, dataset graphs are
decomposed to substructures and indexed; then, substructures
of coming query graph g are looked up in the dataset index,
producing the set of dataset graphs that contain all
substructures of g (resp. dataset graphs whose substructures are all
contained in g), coined the candidate set of g; nally, graphs
in the candidate set are veri ed for subgraph isomorphism
(abbreviated as sub-iso or SI), returning the answer set of
g. Similarly, [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] provides a solution for subgraph queries
against historical (i.e., snapshotted) graphs { a variation of
typical graph queries where snapshots can be viewed as
different graphs. However, extensive evaluations of FTV
methods [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ] show signi cant performance limitations. Another
research thread is concerned with heuristic SI algorithms,
which can operate either against a single very large graph
or a dataset containing a large number of graphs. These
algorithms are capable of pruning away (parts of) dataset
graphs that cannot contain the query, expediting sub-iso
testing without specialized graph indexes. [
        <xref ref-type="bibr" rid="ref14 ref9">14, 9</xref>
        ] provide
insightful evaluations for such SI algorithms.
      </p>
      <p>
        Unfortunately, both FTV and SI methods su er from
executing unnecessary sub-iso tests. For example, if a
previously issued query is submitted again to the system, it still
has to undergo sub-iso tests, as all laboriously derived
knowledge (i.e., answer set of previous queries) has been thrown
away. Moreover, in many applications, it is natural that
submitted graph queries share subgraph or supergraph relations
with future queries { in protein datasets, there is a hierarchy
of queries for aminoacids, proteins, protein mixtures,
unicell bacteria, all the way to multi-cell organisms; in social
network exploratory, queries could start o broad (e.g., all
people in a geographic location) and become gradually
narrower (e.g., by homing in on speci c demographics). Based
on these observations, we proposed in [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] a fresh graph
query processing method, where queries (and their answers)
are indexed to expedite future query processing with FTV
methods. Subsequently, we presented GraphCache [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], the
rst full- edged caching system for general
subgraph/supergraph query processing, applicable for all current FTV and
SI methods. Following established research, GraphCache
handles graph queries against a static dataset throughout
the continuous query streaming.
      </p>
      <p>In real-world applications, however, the graph dataset
naturally evolves/changes over time. For example, in social
networking, newly added groups (graph modeled
relations/interactions among people), break-up of existed groups, and
the changed relations/interactions among group members
are frequently happening. Also, biochemical datasets keep
refreshing by newly-translated, disregarded or transformed
proteins, for application/research reason. Such changes could
be modeled as graph addition (ADD), graph deletion (DEL),
graph update by edge addition (UA) and graph update by
edge removal (UR) in graph dataset analytics.</p>
      <p>
        SI algorithms could accommodate these changes on the y
as each dataset graph shall undergo subgraph isomorphism
test eventually, whereas FTV methods additionally require
an updatable indexing mechanism to evict proper candidate
set. To the best of our knowledge, none of the proposed
FTV algorithms so far has updatable index or similar
solutions to tackle dataset changes. As a result, the way turns
to SI methods, where each dataset graph is painstakingly
veri ed. On the other hand, caching is proved e cient in
accelerating graph queries [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. To this end, it naturally
follows an approach using graph cache to alleviate the costly
SI methods in processing subgraph/supergraph queries with
dataset changes { a topic that has not been discussed yet.
Therefore, underpinned by our previous work in [
        <xref ref-type="bibr" rid="ref25 ref26">25, 26</xref>
        ], we
contribute in this paper:
      </p>
      <p>A systematic solution to expedite subgraph/supergraph
queries with dataset changing throughout the query
streaming, by presenting an upgraded graph caching
system GC+ featured by newly plugged-in subsystems
to deal with the evoked cache consistency issue.</p>
      <p>Two cache models named EVI and CON that are
proposed exclusively for GC+, representing di erent
designs of ensuring graph cache consistency.</p>
      <p>The logic of GC+ in pruning candidate set for general
subgraph/supergraph query processing, together with
the formally proved correctness.</p>
      <p>Performance evaluations using well-established SI
methods, a real-world dataset and a number of workloads,
emphasizing the signi cant speedup of CON cache.
2.</p>
    </sec>
    <sec id="sec-3">
      <title>RELATED WORK</title>
      <p>Subgraph isomorphism problem has two versions: (i) the
decision problem answers Y/N to whether the query is
contained in each graph in the dataset; (ii) the matching
problem locates all occurrences of the query graph within a large
graph (or a dataset of graphs). To address the decision and
matching problems, the brute-force approach is to execute
sub-iso test of query against each dataset graph (SI method).
SI algorithms deteriorate when the dataset contains a large
number of graphs, which prompted the prevalence of FTV
methods. GC+ could bene t both SI and FTV solutions for
general subgraph/supergraph queries.</p>
      <p>
        The community has also looked into subgraph queries
against a single massive graph via scale-out architectures
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] or large memory clusters with massive parallelism [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ].
For the time being, GC+ does not target such use cases,
which are left for future work.
      </p>
      <p>
        Using cache to expedite queries is prevalent in relational
databases, whereas little work had been done for
graphstructured query processing. XML datasets have used views
to expedite path/tree queries [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]; [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] rst proposed the
MCR (maximally contained rewriting) approach for
treepattern queries, but introduced false negatives. GC+ does
not produce false negatives or false positives (see x6). Also,
GC+ deals with much more complex graph queries, which
entail the NP-Complete subgraph isomorphism problem.
      </p>
      <p>
        Caching has also been leveraged to expedite SPARQL
query processing in RDF graphs. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] proposed the rst
cache for SPARQL queries. [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] contributed a cache for
SPARQL queries with a novel canonical labelling scheme
to identify cache hits and a popular dynamic programming
planner. Similar to GC+, query processing optimization in
[
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] does not require any a priori knowledge on datasets/
workloads and is workload adaptive. However, like XML
queries, SPARQL queries are less expressive than general
graph queries and thus less challenging [
        <xref ref-type="bibr" rid="ref10 ref24">10, 24</xref>
        ]: SPARQL
query processing consists of solving the subgraph
homomorphism problem, which is di erent from the subgraph
isomorphism problem, as the former drops the injective property
of the latter. Furthermore, GC+ discovers subgraph,
supergraph, and exact-match relationships between the
coming query and cached queries, which the canonical labelling
scheme in [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] fails to achieve. SPARQL query processing
also targets at optimizing join execution plans [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] (based
on join selectivity estimator statistics and related cost
functions), and the cache in [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] is focusing on this goal, whereas
GC+ aims to avoid/reduce costs of executing SI heuristics
whose execution time can be highly unpredictable and much
higher. Therefore, the overall rationale of GC+ and its way
to utilize cache contents di er from that in [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] and in
related caching solutions for SPARQL queries.
      </p>
      <p>
        Pertaining to cache consistency, [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] rst explicitly
specied the consistency constraint in a query-centric approach
and presented how it could be expressed succinctly in SQL.
Also, there are studies of cache consistency regarding XML
datasets [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and SPARQL query processing [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. However,
the topic of ensuring graph cache consistency for general
subgraph/supergraph queries has not been discussed yet.
3.
      </p>
    </sec>
    <sec id="sec-4">
      <title>DEFINITIONS</title>
      <p>
        GC+ is implemented for undirected labelled graphs,
following established studies in literature [
        <xref ref-type="bibr" rid="ref1 ref11">1, 11</xref>
        ]. We assume
that only vertices have labels; all our results
straightforwardly generalize to directed graphs and/or graphs with
edge labels.
      </p>
      <p>A labeled graph G = (V; E; l) consists of a set of
vertices V (G) and edges E(G) = f(u; v); u 2 V; v 2 V g, and
a function l : V ! U , where U is the label set, de ning
the domain of labels of vertices. A graph Gi = (Vi; Ei; li) is
subgraph-isomorphic to a graph Gj = (Vj; Ej; lj), (by abuse
of notation) denoted by Gi Gj, when there exists an
injection : Vi ! Vj, such that 8(u; v) 2 Ei; u; v 2 Vi; )
( (u); (v)) 2 Ej and 8u 2 Vi; li(u) = lj( (u)).</p>
      <p>As is common in literature, we focus on non-induced
subgraph isomorphism. Informally, there is a subgraph
isomorphism Gi Gj if Gj contains a subgraph that is
isomorphic to Gi. In this case, we say that Gi is a subgraph of
(or contained in) Gj, or inversely that Gj is a supergraph of
(contains) Gi (denoted by Gj Gi). The subgraph querying
problem entails a set D = fG1; : : : ; Gng containing n graphs,
and a query graph g, and determines all graphs Gi 2 D such
that g Gi. The supergraph querying problem entails a set
D = fG1; : : : ; Gng containing n graphs, and a query graph
g, and determines all graphs Gi 2 D such that g Gi.</p>
    </sec>
    <sec id="sec-5">
      <title>4. SYSTEM ARCHITECTURE</title>
      <p>GC+ is a scalable semantic cache for
subgraph/supergraph queries, consisting of four subsystems Data Manager,
Cache Manager, Query Processing Runtime and Method M,
as shown by Figure 1. The rst three are GC+ internal and
Method M is the external SI method that GC+ is called to
expedite. Method M subsystem includes an SI
implementation, denoted Mverifier, sub-iso testing candidate set MCS
(the whole dataset when GC+ is not used).</p>
      <p>Dataset Manager subsystem is GC+ speci c, containing
a component Log Analyzer to handle dataset logs. Cache
Manager is responsible for the management of data and
metadata stored in the cache. Besides the cache
replacement mechanisms, a Window Manager for cache admission
control and a Statistics Manager for metadata that work
as usual as in GC, Cache Manager of GC+ boasts an
additional component Cache Validator, which cooperates with
Log Analyzer to ensure the cache consistency. Both Log
Analyzer and Cache Validator launch cache model dependent
mechanisms to cope with cache consistency (see x5).</p>
      <p>Query Processing Runtime subsystem takes charge of query
execution and metrics monitoring. Like in GC, it comprises
a resource/thread manager, the GC+ internal
subgraph/supergraph query processors, the logic for candidate set
pruning, and a statistics monitor { these components of Query
Processing Runtime communicate with Method M and the
Cache Manager via well-de ned APIs. Please note that
GC+'s logic of Candidate Set Pruner is di erent and more
complicated than that in GC, though the former could be
viewed as adapting from the latter. The logic of GC+ shall
be presented and proved in x6.</p>
      <p>When a query g arrives, Dataset Manager subsystem rst
identi es whether the dataset has been changed recently. If
so, Cache Validator is triggered to re ect these changes to
previous queries residing in cache and window (where queries
are batched to enter cache; in this work, cached
graphs/queries by default cover those previous queries in both cache
and window). Then, g is sent to Query Processing
Runtime subsystem for query execution. Speci cally, GC+ calls
processors (GC+sub and GC+super) to discover whether g
shares subgraph/supergraph relations with cached graphs {
each hit graph in cache then contribute its valid part of
answer set to prune g's candidate set. Finally, the issued query
g, together with its answer set and statistics pertaining to
the contribution of cached graphs, is fed to the Cache
Manager subsystem, where admission control and cache
replacement are performed, concurrently with the Query Processing
Runtime subsystem executing subsequent queries.
5.
5.1</p>
    </sec>
    <sec id="sec-6">
      <title>CACHE CONSISTENCY</title>
    </sec>
    <sec id="sec-7">
      <title>EVI Cache Model</title>
      <p>To address the said challenge arising from dynamic graph
dataset, a straightforward solution is to abandon the vague
Time
T0
T1
T2
T3
T4
T5
{G0, G1, G2, G3}
query g' execution</p>
      <p>g' entering cache
query g' execution</p>
      <p>g' entering cache
{G0, G1, G2, G3, G4}
{G1, G2, G3, G4}
ADD G4
UR G3
DEL G0
UA G1
query g execution</p>
      <p>
        CON Cache
cache, i.e., evicting (EVI) graph cache whenever dataset
changes. Therefore, Log Analyzer has to do nothing but
raising a ag indicating the dataset is changed, and Cache
Validator then clears cached contents indiscriminately. In
this way, the caching system in [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] could be easily adapted
to tackle graph queries with dataset changes, as the cleared
cache will never produce error for future query processing.
But the limitation is obvious { EVI cache has to warm from
scratch upon each dataset change.
      </p>
      <p>The problem of EVI cache lies in that it fails to di
erentiate the validity of cached results. For example, when
a dataset graph undergoes some change, its relations with
all previous queries in cache (re ected by the cached query
results) are a ected { some still hold, while others fail. Of
course, invalid contents cannot be leveraged to accelerate
future queries and should be abandoned. However, the purge
in EVI throws away those valid contents as well, making the
cache e ciency truncated.
5.2</p>
    </sec>
    <sec id="sec-8">
      <title>CON Cache Model</title>
      <p>To solve the aforementioned problem of EVI, we develop
another cache model named CON, which beavers away to
identify valid contents in cache. We shall use an example to
illustrate how the CON model works for subgraph queries.
Mechanism for supergraph queries is similar and is omitted
for space reason.</p>
      <p>Figure 2 depicts an example, where GC+ starts o with
a dataset containing four graphs fG0; G1; G2; G3g and an
empty CON cache. At time T1, query g0 arrived and was
executed. Assuming that g0 passed the admission control
and entered the cache later. CON cache thus recorded that
g0 holds validity towards its relation with all graphs in
current dataset (i.e., GCvalid covers dataset graphs with id 0,
1, 2, 3), among which g0 * G0, g0 * G1, g0 G2 and
g0 G3. Then, at time T2, dataset was changed by adding
a new graph G4, and an update on G3 of edge removals.
Obviously, there is no clue of G4 containing g0 or not, i.e., g0
does not hold validity regarding its relation with the newly
coming G4. As to G3, there was g0 G3, which becomes
unknown as removing edge may result g0 * G3. Hence, the
validity of g0 pertaining to G3 is turned o . Subsequently,
at time T3, another query g00 was executed and later entered
cache, holding the validity towards each graph in
state-ofthe-art dataset containing ve graphs. Again, the dataset
Algorithm 1 Analyzing Log for the CON Cache
1: Input: Dataset update log L
2: Output: A container C with counters to categorize
operations performed on dataset graphs
changed at time T4 { graph G0 was deleted and G1 was
updated by edge additions. Regarding the current dataset
fG1; G2; G3; G4g, g0 * G1 is not guaranteed, since adding
edges may introduce g0 G1. g0, as well as g00, thus loses the
validity on G1. Therefore, when the new query g comes, it
would be facilitated by cached graphs g0 and g00, each with
the up-to-date valid info pertaining to all current dataset
graphs, ensuring the cache consistency of CON model.
5.2.1</p>
      <sec id="sec-8-1">
        <title>Analyzing Dataset Log</title>
        <p>GC+ is designed to warrant CON cache possessing the
potential to bene t queries at full steam. To this end, both
Dataset Manager subsystem and Cache Manager subsystem
develop CON speci c mechanisms.</p>
        <p>First, Dataset Manager subsystem employs the
component Log Analyzer to categorize dataset changes, acting
as a preprocessing step for validating CON cache.
Algorithm 1 illustrates how the corresponding analysis is
performed. Brie y, the incremental records that have not been
re ected in cache are rst extracted from dataset log. Log
Analyzer then launches a container with three counters,
implemented by HashMap with key of dataset graph id and
value of count for the operations on this graph. The said
three counters (CT , CA and CR) (line 4) are responsible
for counting the total, UA and UR operations, respectively.
Each aforementioned record identi es the related dataset
graph and its operation type (lines 7{8). Exhausting these
records (lines 9{16) hence results the total-counter CT ,
UAexclusive-counter CA and UR-exclusive-counter CR.
5.2.2</p>
      </sec>
      <sec id="sec-8-2">
        <title>Validating CON Cache</title>
        <p>Then, the counter container returned by Dataset
Manager subsystem is forwarded to Cache Manager subsystem,
where Cache Validator refreshes the
dataset-graph-validityindicator for cached graphs. CON cache targets at the biggest
possible bene t for query processing. The Cache
Validator hence strives to exploit useful previous query result for
CON. In GC+, once a query is executed, its answer set
is nalized, which snapshots the query's relation against
dataset at the execution time { even the dataset would
undergo changes later, GC+ will not repeat processing
previAlgorithm 2 Refreshing a cached graph's validity indicator
1: Input: Counter container C (containing CT , CA and
CR), currently maximum graph id m in dataset, stored
info of a cached graph (with its
dataset-graph-validityindicator CGvalid and query result Answer, both
structured by BitSet)
2: Function: Updating CGvalid
3:
4: if (m + 1) &gt; CGvalid:size then
5: Extend CGvalid to length (m + 1) by assigning false
to extended bits
6: end if
7: for all i 2 CT :keySet() do
8: tc = CT :get(i)
9: uac = !CA:containsKey(i)? 0 : CA:get(i)
10: urc = !CR:containsKey(i)? 0 : CR:get(i)
11:
12:</p>
        <p>if tc == uac ^ CGvalid:get(i) ^ Answer:get(i)
then
13:
14:</p>
        <p>continue
else if tc == urc ^ CGvalid:get(i) ^ !Answer:get(i)
then
15: continue
16: else
17: CGvalid:set(i; f alse)
18: end if
19: end for
ous queries. Therefore, to deal with dataset changes, GC+
employs a BitSet indicator CGvalid per cached query, with
each bit identifying the up-to-date validity of the query's
relation towards a dataset graph.</p>
        <p>Algorithm 2 depicts how the CGvalid of a cached graph
g is refreshed by Cache Validator. To start with, CGvalid
is checked whether it contains all the bits required (line 4
where dataset graph id starts from 0). If not, it implies
that there are newly-added dataset graphs. Obviously, the
relation between g and those new dataset graphs is unknown
and those extended bits are thus assigned false (line 5). The
idea is to make recent dataset changes take e ect on relevant
bits of CGvalid (lines 7{19).</p>
        <p>Speci cally, for each concerned dataset graph Gi
(identi ed by i in line 7), its numbers of total operations (tc),
UA (uac) and UR (urc) are retrieved from the input
counters (lines 8{10). If operations on dataset graph Gi are
exclusively of UA category (tc == uac in line 12), together
with valid (CGvalid:get(i) in line 12) query result g Gi
(Answer:get(i) in line 12), such validity still holds (g Gi
is not bothered by adding edges to Gi). Similarly, if
operations on dataset graph Gi are exclusively of UR category,
the query result g * Gi (!Answer:get(i) in line 14) keeps
valid. Whereas other operations fade g's validity on Gi if
applicable (line 17). By harnessing the validity per cached
query and dataset graph, as well as the optimizations in
UAexclusive and UR-exclusive cases, CON model manages to
enhance the cache e ciency in expediting graph queries.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>QUERY PROCESSING</title>
      <p>This section shall illustrate the speci c logic of Candidate
Set Pruner in GC+. For space reason, we only present for
subgraph queries (supergraph queries follow the exact
inverse logic). As shown in Figure 2, when a query g arrives,
CSM(g) Answersub(g) =</p>
      <p>{G1, G3, G4}
g ✓ g0, g0 ✓ {G2, G3}
CGvalid(g0) = {G2}</p>
      <p>Answersub(g) = {G2}
(a) Subgraph Case
GC+ discovers whether g is a subgraph or supergraph of
cached queries concurrently by processors GC+sub/GC+super,
referred as subgraph/supergraph case.
6.1</p>
    </sec>
    <sec id="sec-10">
      <title>Subgraph Case</title>
      <p>For example, in Figure 3(a), the new query g comes with
candidate set CSM (g)(the current dataset) containing four
graphs fG1; G2; G3; G4g. GC+sub Processor nds that there
exists a previous query g0, such that g g0. Then g0's
cached answer set fG2; G3g, as well as its up-to-time validity
indicator CGvalid = fG2g, is retrieved. (As mentioned in
Algorithm 2, both Answer(g0) and CGvalid(g0) are BitSet
structures; here, we employ a set containing
dataset-graphid to help illustrate.)</p>
      <p>For graph G2 2 CSM (g), considering g g0 and g0 G2
(still holds for current dataset, as G2 exists in CGvalid(g0)),
it must follow that g G2. Hence, G2 can be safely removed
from CSM (g) and added directly to the nal answer set of
g. Whereas G3 is not sub-iso exempted though it appears
in g0's cached answer set, as g0 G3 fails to hold with
state-of-the-art dataset (G3 does not appear in CGvalid(g0))
{ g0 G3 was found when executing previous query g0 but
has been faded by subsequent dataset changes (e.g., G3 was
updated by removing some edges).</p>
      <p>Therefore, the logic of GC+ for subgraph case is that only
dataset graphs in CGvalid(g0) \Answer(g0) are sub-iso
testfree, which can be safely removed from CSM (g) and directly
added to the nal answer set of query g. In the general case,
g may be a subgraph of multiple previous query graphs gi0.
Then, the said sub-iso test-free graphs Answersub(g) is:
CGvalid(gi0) \ Answer(gi0)
Answersub(g) =</p>
      <p>[
gi02Resultsub(g)
(1)
where Resultsub(g) contains all the currently cached queries
of which g is a subgraph.</p>
      <p>Hence, the set of dataset graphs for sub-iso testing is:</p>
      <p>CSGC+sub (g) = CSM (g) n Answersub(g)
Finally, if AnswerGC+sub (g) is the set of graphs veri ed to
be containing g through sub-iso tests on CSGC+sub (g), the
nal answer set for query g will be:</p>
      <p>Answer(g) = AnswerGC+sub (g) [ Answersub(g)
(2)
(3)</p>
      <p>Lemma 1. The nal answer of GC+ in the subgraph case
does not contain false positives.</p>
      <p>Proof. Assume false positives are possible and consider
the rst ever false positive produced by GC+; i.e., for some
query g, 9GF P such that g * GF P and GF P 2 Answer(g).
Note that GF P cannot be in AnswerGC+sub (g) where each
graph has passed the sub-iso test, which follows that GF P 2
CSM(g) \ Answersuper(g) =</p>
      <p>{G1, G2, G3}
g ◆ g00, g00 ✓ {G2, G3}</p>
      <p>CGvalid(g00) = {G2, G3, G4}
Answersuper(g) = {G1, G2, G3}
(b) Supergraph Case</p>
      <p>Proof. Assume false negatives are possible and consider
the rst ever false negative produced by GC+ particularly;
i.e., for some query g, 9GF N such that g GF N and GF N 2=
Answer(g). As GF N 2 CSM (g) in GC+ by default and
sub-iso testing does not introduce any false negative, the
only possibility for error is that GF N was removed using
formula (2); i.e., GF N 2= CSGC+sub (g). That implies that
9g0 such that g g0 and GF N 2 Answer(g0). But then, by
formula (3), GF N will be added to Answersub(g) and thus
GF N 2 Answer(g) (a contradiction).</p>
      <p>Theorem 3. The nal answer of GC+ in the subgraph
case is correct.</p>
      <p>Proof. There are only two possibilities for error; GC+
can produce false negatives or false positives. The theorem
then follows straightforwardly from Lemmas 1 and 2.
6.2</p>
    </sec>
    <sec id="sec-11">
      <title>Supergraph Case</title>
      <p>Figure 3(b) depicts an example of supergraph case in GC+,
where GC+super Processor identi es there exists a
previous query g00 satisfying g00 g. For g00, the cached
answer set fG2; G3g and the dataset-graph-validity indicator
CGvalid(g00) (fG2; G3; G4g) are retrieved. Again, the new
query g comes with candidate set CSM (g) (fG1; G2; G3; G4g).</p>
      <p>Compared with subgraph case, reasoning the logic of GC+
in supergraph case is more interesting. (i) Consider graph
G1 2 CSM (g): G1 does not hold validity regarding its
relation with g00, hence no previous query result about G1
could be utilized. g has to rest on the sub-iso test to
identify whether G1 is in the nal answer set. (ii) For graph
G2 2 CSM (g): g00 G2 holds, which does not make G2
sub-iso test free though providing g00 g, as the relation
between g and G2 is still obscure and has to be veri ed by
sub-iso test. Similarly, G3 is not free of sub-iso test either.
(iii) For graph G4 2 CSM (g), G4 holds validity for its
relation with g00 as g00 * G4. Since g00 g, if g G4 were to be
true, then it should follow g00 G4, which is a contradiction.
So it is safe to conclude that g * G4 and thus G4 can be
removed from CSM (g). In overall, among graphs in CSM (g),
those failing to appear in (CGvalid(g00) [ Answer(g00) can
never exist in the nal answer set of g and thus become
sub-iso test free, where CGvalid(g0) is the complementary
set of CGvalid(g00) against state-of-the-art dataset. In other
words, the set (CGvalid(g00) [ Answer(g00) covers all graphs
that could possibly exist in the nal answer set of g, denoted
as g00:Answersuper(g), i.e.,
g00:Answersuper(g) = (CGvalid(g00) [ Answer(g00)
(4)
Performing sub-iso tests on CSM (g) \ g00:Answersuper(g)
therefore results the veri ed query answer Answer(g).</p>
      <p>In the general case, g may be a supergraph of multiple
previous query graphs gj00. Then, the set of graphs tested for
sub-iso by GC+ is:
CSGC+super (g) = CSM (g) \
gj00:Answersuper(g)
\
gj002Resultsuper(g)
The Query Processing Runtime subsystem rst applies
equation (2) on CSM and then applies (5) on the result of the
previous operation. The end result is a reduced candidate
set, which is then sub-iso tested.</p>
      <p>Additionally, there are two optimal cases that warrant
further performance gains. First, note that GC+ can easily
recognize the case where a new query, g, is isomorphic to a
previous cached query g0. For connected query graphs, this
holds providing that (i) g g0 or g g0; and (ii) g and
g0 have the same number of nodes and edges; and (iii) g0
holds validity on all the up-to-date dataset graphs. Hence,
GC+ can return the cached result of g0 directly, rendering
sub-iso test free. Second, consider the case: for a new query
g, a cached query g00 is discovered by GC+ that g00 g,
Answer(g00) = ; and g00 holds validity on all graphs
currently in dataset. Thus, GC+ can directly return an empty
result set for g. The idea is that if there were a dataset
graph G such that g G, since g00 g we would conclude
that g00 G ) G 2 Answer(g00), contradicting the fact
that Answer(g00) = ;; thus, no such graph G can exist and
the nal result set of g is necessarily empty.
7.
7.1</p>
    </sec>
    <sec id="sec-12">
      <title>PERFORMANCE EVALUATION</title>
    </sec>
    <sec id="sec-13">
      <title>Experimental Setup</title>
      <p>
        GC+ is implemented in Java, by extending the graph
caching system GC [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. Experiments were performed on a
Dell R920 host (4 Intel Xeon E7-4870 CPUs (15 cores each),
with 320GB of RAM and 4 1TB disks, running Ubuntu
Linux 14.04.4LTS. Following GC, the default value in GC+
for the upper limit on the sizes of Cache and the Window
stores were 100 and 20 respectively.
      </p>
      <p>
        Method M We used three well-established SI
methods, including GraphQL (GQL) provided by [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], a modi ed
VF2[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] (denoted VF2+) provided by [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], for being
wellestablished and good performers [
        <xref ref-type="bibr" rid="ref14 ref8">14, 8</xref>
        ]; we also used vanilla
VF2[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] as it is extensively used in FTV methods.
      </p>
      <p>
        Dataset We employ a real-world dataset AIDS [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] (the
Antiviral Screen Dataset of the National Cancer Institute)
that is prevalently used in literature [
        <xref ref-type="bibr" rid="ref1 ref11 ref14 ref7">14, 7, 1, 11</xref>
        ]. AIDS
contains 40,000 graphs, each with on average 45 vertices
(std.dev.: 22, max: 245) and 47 edges (std.dev.: 23, max:
250), whereby the few largest graphs have an order of
magnitude more vertices and edges.
      </p>
      <p>Dataset Change Plan Dataset change operations are
performed in batches, with occurrence time indicated by the
id of queries in workload (recall Figure 2). The plan we
used for AIDS consists of 2,000 operations (in 100 batches,
20 operations per batch), during the processing of 10,000
queries. A batch of operations are generated as following:
rst, an occurrence time for the batch is selected uniformly
at randomly from the id of queries; then, a type uniformly
selected from fADD, DEL, UA, URg, a graph uniformly
selected from dataset (ADD using the initial dataset instead
of synthesizing additional graphs so as to maximumly keep
the original dataset characteristics; DEL, UA and UR using
the up-to-date dataset at running time) and a uniformly
selected edge within the graph providing UA or UR being
the selected type (UA would add an edge that has not been
in graph yet; UR would remove an existed edge of graph)
codetermine a speci c operation { such process is repeated
until the batch contain the required number of operations.</p>
      <p>
        We follow the established principle for the generation of
our workloads, using two di erent algorithms to synthesize
queries from the initial dataset graphs. Each workload
consists of 10,000 queries with typical sizes in literature [
        <xref ref-type="bibr" rid="ref11 ref27">11, 27</xref>
        ]
{ 4, 8, 12, 16 and 20 edges.
      </p>
      <p>
        Type A Workloads Queries of these workloads are
generated in the following manner: rst, a source graph is
randomly selected from dataset graphs; then, a node is
selected randomly in the said graph; nally, a query size is
selected uniformly at randomly from given sizes and a BFS
is performed starting from the selected node. For each new
node, all its edges connecting it to already visited nodes are
added to the generated query, until the desired query size is
reached. For the rst two random selections above, we have
used two di erent distributions; namely, Uniform (U) and
Zipf (Z), with the probability density function of the latter
given by p(x) = x = ( ), where is the Riemann Zeta
function[
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. Ultimately, we had three categories of Type
A workloads: \UU", \ZU" and \ZZ", where the rst letter
in each pair denotes the distribution used for selecting the
starting graph, and the second for the starting node.
      </p>
      <p>Type B Workloads (with no-answer queries) These
workloads are generated as follows. For each of the query
sizes, we rst create two query pools: a 10,000-query pool
with queries with non-empty answer sets against the initial
dataset, and a second 3,000-query pool with no match in any
untreated dataset graph (i.e., empty result set). Queries for
the rst pool are extracted from dataset graphs by uniformly
selecting a start node across all nodes in all dataset graphs,
and then performing a random walk till the required query
graph size is reached. Generation of no-answer queries has
one extra step: we continuously relabel the nodes in the
query with randomly selected labels from the dataset,
until the resulting query has a non-empty candidate set but
an empty answer set against the dataset graphs. Once the
query pools are lled up, we generate workloads by rst
ipping a biased coin to choose between the two pools (with the
\no-answer" pool selected with probability 0%, 20% or 50%),
then randomly (Zipf) selecting a query from the chosen pool.
We thus have three categories of Type B workloads: \0%",
\20%" and \50%", denoting the above probability used.</p>
      <p>
        We use Zipf = 1:4 by default; as a reference point, web
page popularities follow a Zipf distribution with = 2:4 [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
We only allow one for one Window (i.e., 20 queries) before
starting measuring GC+'s performance. We report on both
the bene ts and the overheads of GC+. Reported metrics
include query time and number of sub-iso tests per query,
along with the speedups introduced by GC+. Speedup is
de ned as the ratio of the average performance (query time
or number of sub-iso tests) of the base Method M over the
average performance of GC+ when deployed over Method M
(i.e., speedups &gt;1 indicate improvements). As a yardstick,
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] (also a cache but for XML databases) report a query
time speedup of 2.6 with 10,000-query workloads generated
using Zipf = 1:5, and a 1,500-query warm-up.
      </p>
      <p>
        Cache Replacement Policy GC+ incorporates all the
replacement policies developed in GC. Here, we use the
coined hybrid (HD) policy for experiments, as its
performance is always better or on par with the best alternative
[
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. Speci cally, HD coalesces another two GC/GC+
exclusive policies PIN and PINC. PIN scores each graph in
cache by considering the total number of subgraph
isomorphism tests alleviated by the said graph (coined R). PINC
extends the mentioned ranking by taking into account the
possibly vast di erences in query execution time (denoted
C), where cost is estimated by a heuristic [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. As C is
estimated, using it in PINC does not always lead to
improvements in query time. And through a large number of
experiments, we have observed that when the values of the
R exhibit a high variability, they are discriminative enough
on their own, where considering C can actually lead to lower
time gains (i.e., PIN performs better than PINC). However,
when the values of R exhibit a low variability, adding in
the C component leads to considerable query time
improvements. When the HD policy is invoked, it rst retrieves the
R from Statistics Manager and computes its variability [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]
by using the (squared) coe cient of variation (CoV ). CoV
is de ned as the ratio of the (square of the) standard
deviation over the (square of the) mean of the distribution. When
CoV &gt; 1, the associated distribution is deemed of high
vari
      </p>
      <p>Figure 4 depicts the query time speedups of GC+ across
all method M and workloads. We can see that CON achieves
considerable speedup with the meagre 100-query cache
con guration whereas gains of EVI are limited. Note the
interesting nding that speedups for UU workload are close
to those of ZU (e.g., 4.77 vs 5.13 against VF2 base method),
whereas one might have expected a di erent outcome.
Intuitively, the ZU workload bears more exact-match hits than
UU, due to the skewness of selecting source graphs during
query generation. And it does: we measured 2.5X the
number of exact-match cache hits in ZU vs UU. However, in
GC+, exact-match cache hit does not necessarily introduce
sub-iso test free (see x6.3) { those resulting zero sub-iso test
merely account 4% among exact-match cache hits of the said
ZU workload (vs 11% in UU). Moreover, recall that GC+
exploits also subgraph/supergraph hits. Indeed, we
measured circa 2X such matches for the UU workload vs ZU.
Of course, the overall performance result is a very complex
picture and depends on how big bene t is each saved
exactmatch vs each saved subgraph/supergraph match. But the
key insight here is that by utilizing exact-matches and
subgraph/supergraph matches, GC+ can bene t both
skewed and non-skewed workloads.</p>
      <p>
        Figure 5 shows the speedups in the number of sub-iso tests
performed. Please note that under a given con guration
(speci ed by dataset, dataset change plan, workload,
replacement policy, cache model EVI/CON, the upper
limit on the sizes of Cache and the Window stores),
whatever SI method being the Method M, GC+
results exactly the same pruned candidate set for each
query. Therefore, Figure 5 is independent of the three
methods considered in this work. Juxtaposing Figures 4 and 5
1,627
VF2
4
698
leads to the interesting insight: Reductions in the number
of sub-iso tests do not translate directly into
reductions in query time, echoing the claim that graph cache
hits render di erent bene ts [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. In all cases, though, GC+
achieves signi cant improvements in both query
processing time and number of sub-iso tests performed.
      </p>
      <p>Figure 6 depicts a break-down of query processing time for
method M, EVI and CON. EVI pays overhead on updating
the Window and Cache data stores, including executing the
cache replacement algorithms and re-indexing the cached
graphs. Whereas the overhead of CON also covers the time
of analyzing dataset log and validating cache (see Algorithm
1 and 2) { such CON speci c cost is trial, taking less than
1% in CON overhead, across all the aforementioned
workloads and methods M. This con rms a signi cant conclusion
{ the CON exclusive algorithms 1 and 2 are e cient.
The dominant part of CON overhead (for updating the
Window and Cache data stores) is higher than that of EVI, as
the latter is frequently purged hence bearing less to be
updated. Putting the Figure 4 aside Figure 5, it is obvious that
CON sweeps EVI in query processing speedup with a
negligible additional overhead.</p>
    </sec>
    <sec id="sec-14">
      <title>CONCLUSIONS</title>
      <p>We presented a systematic solution to handle graph cache
consistency, by providing an upgraded system GC+, which is
capable of expediting general subgraph/supergraph queries
with dataset changes. We developed two GC+ exclusive
cache models EVI and CON with di erent mechanisms on
ensuring cache consistency. We illustrated the speci c logic
of GC+ in reducing candidate set for query execution and
formally proved its correctness. Our performance
evaluation has demonstrated the considerable speedup achieved by
CON. Future works include further optimizing CON cache
with retrospective validating mechanisms, developing a
distributed/decentralized version of GC+ and extending GC+
to bene t subgraph queries when nding all occurrences of
a query graph against a single massive graph.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>V.</given-names>
            <surname>Bonnici</surname>
          </string-name>
          et al.
          <article-title>Enhancing graph database indexing by su x tree structure</article-title>
          .
          <source>In Proc. IAPR PRIB</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bottcher</surname>
          </string-name>
          .
          <article-title>Cache consistency in mobile XML databases</article-title>
          .
          <source>In Proc. WAIM</source>
          , pages
          <volume>300</volume>
          {
          <fpage>312</fpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L. P.</given-names>
            <surname>Cordella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Foggia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sansone</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Vento</surname>
          </string-name>
          .
          <article-title>A (sub) graph isomorphism algorithm for matching large graphs</article-title>
          .
          <source>IEEE TPAMI</source>
          ,
          <volume>26</volume>
          (
          <issue>10</issue>
          ):
          <volume>1367</volume>
          {
          <fpage>1372</fpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Garey</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Johnson</surname>
          </string-name>
          . Computers and
          <article-title>Intractability: A Guide to the Theory of NP-Completeness</article-title>
          .
          <string-name>
            <given-names>W.H.</given-names>
            <surname>Freeman</surname>
          </string-name>
          ,
          <year>1979</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gubichev</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Neumann</surname>
          </string-name>
          .
          <article-title>Exploiting the query structure for e cient join ordering in SPARQL queries</article-title>
          .
          <source>In Proc. EDBT</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Guo</surname>
          </string-name>
          et al.
          <article-title>Relaxed currency and consistency: How to say \good enough" in SQL</article-title>
          .
          <source>In Proc. SIGMOD</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>W.-S.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.-D. Pham</surname>
            , and
            <given-names>J. X.</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>iGraph: A framework for comparisons of disk-based graph indexing techniques</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>3</volume>
          (
          <issue>1</issue>
          -2):
          <volume>449</volume>
          {
          <fpage>459</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Katsarou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ntarmos</surname>
          </string-name>
          , and
          <string-name>
            <surname>P.</surname>
          </string-name>
          <article-title>Trianta llou. Performance and scalability of indexed subgraph query processing methods</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>8</volume>
          (
          <issue>12</issue>
          ):
          <volume>1566</volume>
          {
          <fpage>1577</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Katsarou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ntarmos</surname>
          </string-name>
          , and
          <string-name>
            <surname>P.</surname>
          </string-name>
          <article-title>Trianta llou. Subgraph querying with parallel use of query rewritings and alternative algorithms</article-title>
          .
          <source>In Proc. EDBT</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shin</surname>
          </string-name>
          , and W.-S. Han.
          <article-title>Taming subgraph isomorphism for RDF query processing</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>8</volume>
          (
          <issue>11</issue>
          ):
          <volume>1238</volume>
          {
          <fpage>1249</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kriege</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Mutzel.</surname>
          </string-name>
          CT-index:
          <article-title>Fingerprint-based graph indexing combining cycles and trees</article-title>
          .
          <source>In Proc. ICDE</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Lai</surname>
          </string-name>
          et al.
          <article-title>Scalable subgraph enumeration in MapReduce</article-title>
          . PVLDB,
          <volume>8</volume>
          (
          <issue>10</issue>
          ):
          <volume>974</volume>
          {
          <fpage>985</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L. V. S.</given-names>
            <surname>Lakshmanan</surname>
          </string-name>
          et al.
          <article-title>Answering tree pattern queries using views</article-title>
          .
          <source>In Proc. VLDB</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          et al.
          <article-title>An in-depth comparison of subgraph isomorphism algorithms in graph databases</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>6</volume>
          (
          <issue>2</issue>
          ):
          <volume>133</volume>
          {
          <fpage>144</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>K.</given-names>
            <surname>Lillis</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Pitoura</surname>
          </string-name>
          .
          <article-title>Cooperative XPath caching</article-title>
          .
          <source>In Proc. SIGMOD</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lorey</surname>
          </string-name>
          et al.
          <article-title>Caching and prefetching strategies for SPARQL queries</article-title>
          .
          <source>In Proc. USEWOD</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>B.</given-names>
            <surname>Mandhani</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Suciu</surname>
          </string-name>
          .
          <article-title>Query caching and view selection for XML databases</article-title>
          .
          <source>In Proc. VLDB</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Unbehauen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          .
          <article-title>Improving the performance of semantic web applications with SPARQL query caching</article-title>
          .
          <source>In Proc. ESWC</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <article-title>NCI - DTP AIDS antiviral screen dataset</article-title>
          . http://dtp.nci.nih.gov/docs/aids/aids data.
          <source>html.</source>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nelson</surname>
          </string-name>
          . Probability, Stochastic Processes, and
          <source>Queueing Theory</source>
          . Springer Verlag,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Newman</surname>
          </string-name>
          .
          <article-title>Power laws, Pareto distributions and Zipf's law</article-title>
          .
          <source>Contemporary Physics</source>
          ,
          <volume>46</volume>
          :
          <fpage>323</fpage>
          {
          <fpage>351</fpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>N.</given-names>
            <surname>Papailiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tsoumakos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Karras</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Koziris</surname>
          </string-name>
          .
          <article-title>Graph-aware , workload-adaptive SPARQL query caching</article-title>
          .
          <source>In Proc. SIGMOD</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>K.</given-names>
            <surname>Semertzidis</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Pitoura</surname>
          </string-name>
          .
          <article-title>Durable graph pattern queries on historical graphs</article-title>
          .
          <source>In Proc. ICDE</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          et al.
          <article-title>E cient subgraph matching on billion node graphs</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>5</volume>
          (
          <issue>9</issue>
          ):
          <volume>788</volume>
          {
          <fpage>799</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ntarmos</surname>
          </string-name>
          , and
          <string-name>
            <surname>P.</surname>
          </string-name>
          <article-title>Trianta llou. Indexing query graphs to speedup graph query processing</article-title>
          .
          <source>In Proc. EDBT</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ntarmos</surname>
          </string-name>
          , and
          <string-name>
            <surname>P.</surname>
          </string-name>
          <article-title>Trianta llou. GraphCache: a caching system for graph queries</article-title>
          .
          <source>In Proc. EDBT</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yan</surname>
          </string-name>
          et al.
          <article-title>Graph indexing: a frequent structure-based approach</article-title>
          .
          <source>In Proc. SIGMOD</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>