-

Ensuring Consistency in Graph Cache for Graph-Pattern Queries

Jing Wang

j.wang.3@research.gla.ac.uk 0

Nikos Ntarmos

nikos.ntarmos@glasgow.ac.uk 0

Peter Triantafillou

llou@glasgow.ac.uk peter.triantafillou@glasgow.ac.uk 0 0 School of Computing Science, University of Glasgow , UK

Graph queries are costly, as they entail the NP-Complete subgraph isomorphism problem. Graph caching had been recently suggested, showing the potential to signi cantly accelerate subgraph/supergraph queries. Subsequently, GraphCache, the rst full- edged graph caching system was put forth. However, when the underlying dataset changes concurrently with the query workload proceeding, how to ensure the graph cache consistency becomes an issue. The current work provides a systematic solution to address this problem, by presenting an upgraded GraphCache system coined GraphCache+ (abbreviated as GC+). We develop two GC+ exclusive models that employ di erent approaches to deal with the consistency issue. Moreover, we present the logic of GC+ in expediting queries, bundled with the formally proved correctness. We evaluate the performance of GC+ by a real-world graph dataset and a number of query workloads with di erent characteristics, highlighting the considerable speedup in term of quanti ed bene t and overhead.

Graph queries; Caching system; Cache consistency

1. INTRODUCTION

Graph structured data is increasingly prevalent in modern big data applications. Central to high performance graph analytics is to locate patterns in dataset graphs. Informally, given a query graph g, the system is called to return the set of dataset graphs that contain g (subgraph query) or are contained in g (resp. supergraph query), named the answer set of g. These operations can be time consuming for the NP-Complete problem of subgraph isomorphism [ 4 ]. Hence, the community has contributed a number of innovative solutions in recent years. One research thread follows the \ lterthen-verify" (FTV) paradigm: rst, dataset graphs are decomposed to substructures and indexed; then, substructures of coming query graph g are looked up in the dataset index, producing the set of dataset graphs that contain all substructures of g (resp. dataset graphs whose substructures are all contained in g), coined the candidate set of g; nally, graphs in the candidate set are veri ed for subgraph isomorphism (abbreviated as sub-iso or SI), returning the answer set of g. Similarly, [ 23 ] provides a solution for subgraph queries against historical (i.e., snapshotted) graphs { a variation of typical graph queries where snapshots can be viewed as different graphs. However, extensive evaluations of FTV methods [ 7, 8 ] show signi cant performance limitations. Another research thread is concerned with heuristic SI algorithms, which can operate either against a single very large graph or a dataset containing a large number of graphs. These algorithms are capable of pruning away (parts of) dataset graphs that cannot contain the query, expediting sub-iso testing without specialized graph indexes. [ 14, 9 ] provide insightful evaluations for such SI algorithms.

Unfortunately, both FTV and SI methods su er from executing unnecessary sub-iso tests. For example, if a previously issued query is submitted again to the system, it still has to undergo sub-iso tests, as all laboriously derived knowledge (i.e., answer set of previous queries) has been thrown away. Moreover, in many applications, it is natural that submitted graph queries share subgraph or supergraph relations with future queries { in protein datasets, there is a hierarchy of queries for aminoacids, proteins, protein mixtures, unicell bacteria, all the way to multi-cell organisms; in social network exploratory, queries could start o broad (e.g., all people in a geographic location) and become gradually narrower (e.g., by homing in on speci c demographics). Based on these observations, we proposed in [ 25 ] a fresh graph query processing method, where queries (and their answers) are indexed to expedite future query processing with FTV methods. Subsequently, we presented GraphCache [ 26 ], the rst full- edged caching system for general subgraph/supergraph query processing, applicable for all current FTV and SI methods. Following established research, GraphCache handles graph queries against a static dataset throughout the continuous query streaming.

In real-world applications, however, the graph dataset naturally evolves/changes over time. For example, in social networking, newly added groups (graph modeled relations/interactions among people), break-up of existed groups, and the changed relations/interactions among group members are frequently happening. Also, biochemical datasets keep refreshing by newly-translated, disregarded or transformed proteins, for application/research reason. Such changes could be modeled as graph addition (ADD), graph deletion (DEL), graph update by edge addition (UA) and graph update by edge removal (UR) in graph dataset analytics.

SI algorithms could accommodate these changes on the y as each dataset graph shall undergo subgraph isomorphism test eventually, whereas FTV methods additionally require an updatable indexing mechanism to evict proper candidate set. To the best of our knowledge, none of the proposed FTV algorithms so far has updatable index or similar solutions to tackle dataset changes. As a result, the way turns to SI methods, where each dataset graph is painstakingly veri ed. On the other hand, caching is proved e cient in accelerating graph queries [ 26 ]. To this end, it naturally follows an approach using graph cache to alleviate the costly SI methods in processing subgraph/supergraph queries with dataset changes { a topic that has not been discussed yet. Therefore, underpinned by our previous work in [ 25, 26 ], we contribute in this paper:

A systematic solution to expedite subgraph/supergraph queries with dataset changing throughout the query streaming, by presenting an upgraded graph caching system GC+ featured by newly plugged-in subsystems to deal with the evoked cache consistency issue.

Two cache models named EVI and CON that are proposed exclusively for GC+, representing di erent designs of ensuring graph cache consistency.

The logic of GC+ in pruning candidate set for general subgraph/supergraph query processing, together with the formally proved correctness.

Performance evaluations using well-established SI methods, a real-world dataset and a number of workloads, emphasizing the signi cant speedup of CON cache. 2.

RELATED WORK

Subgraph isomorphism problem has two versions: (i) the decision problem answers Y/N to whether the query is contained in each graph in the dataset; (ii) the matching problem locates all occurrences of the query graph within a large graph (or a dataset of graphs). To address the decision and matching problems, the brute-force approach is to execute sub-iso test of query against each dataset graph (SI method). SI algorithms deteriorate when the dataset contains a large number of graphs, which prompted the prevalence of FTV methods. GC+ could bene t both SI and FTV solutions for general subgraph/supergraph queries.

The community has also looked into subgraph queries against a single massive graph via scale-out architectures [ 12 ] or large memory clusters with massive parallelism [ 24 ]. For the time being, GC+ does not target such use cases, which are left for future work.

Using cache to expedite queries is prevalent in relational databases, whereas little work had been done for graphstructured query processing. XML datasets have used views to expedite path/tree queries [ 15 ]; [ 13 ] rst proposed the MCR (maximally contained rewriting) approach for treepattern queries, but introduced false negatives. GC+ does not produce false negatives or false positives (see x6). Also, GC+ deals with much more complex graph queries, which entail the NP-Complete subgraph isomorphism problem.

Caching has also been leveraged to expedite SPARQL query processing in RDF graphs. [ 18 ] proposed the rst cache for SPARQL queries. [ 22 ] contributed a cache for SPARQL queries with a novel canonical labelling scheme to identify cache hits and a popular dynamic programming planner. Similar to GC+, query processing optimization in [ 22 ] does not require any a priori knowledge on datasets/ workloads and is workload adaptive. However, like XML queries, SPARQL queries are less expressive than general graph queries and thus less challenging [ 10, 24 ]: SPARQL query processing consists of solving the subgraph homomorphism problem, which is di erent from the subgraph isomorphism problem, as the former drops the injective property of the latter. Furthermore, GC+ discovers subgraph, supergraph, and exact-match relationships between the coming query and cached queries, which the canonical labelling scheme in [ 22 ] fails to achieve. SPARQL query processing also targets at optimizing join execution plans [ 5 ] (based on join selectivity estimator statistics and related cost functions), and the cache in [ 22 ] is focusing on this goal, whereas GC+ aims to avoid/reduce costs of executing SI heuristics whose execution time can be highly unpredictable and much higher. Therefore, the overall rationale of GC+ and its way to utilize cache contents di er from that in [ 22 ] and in related caching solutions for SPARQL queries.

Pertaining to cache consistency, [ 6 ] rst explicitly specied the consistency constraint in a query-centric approach and presented how it could be expressed succinctly in SQL. Also, there are studies of cache consistency regarding XML datasets [ 2 ] and SPARQL query processing [ 16 ]. However, the topic of ensuring graph cache consistency for general subgraph/supergraph queries has not been discussed yet. 3.

DEFINITIONS

GC+ is implemented for undirected labelled graphs, following established studies in literature [ 1, 11 ]. We assume that only vertices have labels; all our results straightforwardly generalize to directed graphs and/or graphs with edge labels.

A labeled graph G = (V; E; l) consists of a set of vertices V (G) and edges E(G) = f(u; v); u 2 V; v 2 V g, and a function l : V ! U , where U is the label set, de ning the domain of labels of vertices. A graph Gi = (Vi; Ei; li) is subgraph-isomorphic to a graph Gj = (Vj; Ej; lj), (by abuse of notation) denoted by Gi Gj, when there exists an injection : Vi ! Vj, such that 8(u; v) 2 Ei; u; v 2 Vi; ) ( (u); (v)) 2 Ej and 8u 2 Vi; li(u) = lj( (u)).

As is common in literature, we focus on non-induced subgraph isomorphism. Informally, there is a subgraph isomorphism Gi Gj if Gj contains a subgraph that is isomorphic to Gi. In this case, we say that Gi is a subgraph of (or contained in) Gj, or inversely that Gj is a supergraph of (contains) Gi (denoted by Gj Gi). The subgraph querying problem entails a set D = fG1; : : : ; Gng containing n graphs, and a query graph g, and determines all graphs Gi 2 D such that g Gi. The supergraph querying problem entails a set D = fG1; : : : ; Gng containing n graphs, and a query graph g, and determines all graphs Gi 2 D such that g Gi.

4. SYSTEM ARCHITECTURE

GC+ is a scalable semantic cache for subgraph/supergraph queries, consisting of four subsystems Data Manager, Cache Manager, Query Processing Runtime and Method M, as shown by Figure 1. The rst three are GC+ internal and Method M is the external SI method that GC+ is called to expedite. Method M subsystem includes an SI implementation, denoted Mverifier, sub-iso testing candidate set MCS (the whole dataset when GC+ is not used).

Dataset Manager subsystem is GC+ speci c, containing a component Log Analyzer to handle dataset logs. Cache Manager is responsible for the management of data and metadata stored in the cache. Besides the cache replacement mechanisms, a Window Manager for cache admission control and a Statistics Manager for metadata that work as usual as in GC, Cache Manager of GC+ boasts an additional component Cache Validator, which cooperates with Log Analyzer to ensure the cache consistency. Both Log Analyzer and Cache Validator launch cache model dependent mechanisms to cope with cache consistency (see x5).

Query Processing Runtime subsystem takes charge of query execution and metrics monitoring. Like in GC, it comprises a resource/thread manager, the GC+ internal subgraph/supergraph query processors, the logic for candidate set pruning, and a statistics monitor { these components of Query Processing Runtime communicate with Method M and the Cache Manager via well-de ned APIs. Please note that GC+'s logic of Candidate Set Pruner is di erent and more complicated than that in GC, though the former could be viewed as adapting from the latter. The logic of GC+ shall be presented and proved in x6.

When a query g arrives, Dataset Manager subsystem rst identi es whether the dataset has been changed recently. If so, Cache Validator is triggered to re ect these changes to previous queries residing in cache and window (where queries are batched to enter cache; in this work, cached graphs/queries by default cover those previous queries in both cache and window). Then, g is sent to Query Processing Runtime subsystem for query execution. Speci cally, GC+ calls processors (GC+sub and GC+super) to discover whether g shares subgraph/supergraph relations with cached graphs { each hit graph in cache then contribute its valid part of answer set to prune g's candidate set. Finally, the issued query g, together with its answer set and statistics pertaining to the contribution of cached graphs, is fed to the Cache Manager subsystem, where admission control and cache replacement are performed, concurrently with the Query Processing Runtime subsystem executing subsequent queries. 5. 5.1

CACHE CONSISTENCY EVI Cache Model

To address the said challenge arising from dynamic graph dataset, a straightforward solution is to abandon the vague Time T0 T1 T2 T3 T4 T5 {G0, G1, G2, G3} query g' execution

g' entering cache query g' execution

g' entering cache {G0, G1, G2, G3, G4} {G1, G2, G3, G4} ADD G4 UR G3 DEL G0 UA G1 query g execution

CON Cache cache, i.e., evicting (EVI) graph cache whenever dataset changes. Therefore, Log Analyzer has to do nothing but raising a ag indicating the dataset is changed, and Cache Validator then clears cached contents indiscriminately. In this way, the caching system in [ 26 ] could be easily adapted to tackle graph queries with dataset changes, as the cleared cache will never produce error for future query processing. But the limitation is obvious { EVI cache has to warm from scratch upon each dataset change.

The problem of EVI cache lies in that it fails to di erentiate the validity of cached results. For example, when a dataset graph undergoes some change, its relations with all previous queries in cache (re ected by the cached query results) are a ected { some still hold, while others fail. Of course, invalid contents cannot be leveraged to accelerate future queries and should be abandoned. However, the purge in EVI throws away those valid contents as well, making the cache e ciency truncated. 5.2

CON Cache Model

To solve the aforementioned problem of EVI, we develop another cache model named CON, which beavers away to identify valid contents in cache. We shall use an example to illustrate how the CON model works for subgraph queries. Mechanism for supergraph queries is similar and is omitted for space reason.

Figure 2 depicts an example, where GC+ starts o with a dataset containing four graphs fG0; G1; G2; G3g and an empty CON cache. At time T1, query g0 arrived and was executed. Assuming that g0 passed the admission control and entered the cache later. CON cache thus recorded that g0 holds validity towards its relation with all graphs in current dataset (i.e., GCvalid covers dataset graphs with id 0, 1, 2, 3), among which g0 * G0, g0 * G1, g0 G2 and g0 G3. Then, at time T2, dataset was changed by adding a new graph G4, and an update on G3 of edge removals. Obviously, there is no clue of G4 containing g0 or not, i.e., g0 does not hold validity regarding its relation with the newly coming G4. As to G3, there was g0 G3, which becomes unknown as removing edge may result g0 * G3. Hence, the validity of g0 pertaining to G3 is turned o . Subsequently, at time T3, another query g00 was executed and later entered cache, holding the validity towards each graph in state-ofthe-art dataset containing ve graphs. Again, the dataset Algorithm 1 Analyzing Log for the CON Cache 1: Input: Dataset update log L 2: Output: A container C with counters to categorize operations performed on dataset graphs changed at time T4 { graph G0 was deleted and G1 was updated by edge additions. Regarding the current dataset fG1; G2; G3; G4g, g0 * G1 is not guaranteed, since adding edges may introduce g0 G1. g0, as well as g00, thus loses the validity on G1. Therefore, when the new query g comes, it would be facilitated by cached graphs g0 and g00, each with the up-to-date valid info pertaining to all current dataset graphs, ensuring the cache consistency of CON model. 5.2.1

Analyzing Dataset Log

GC+ is designed to warrant CON cache possessing the potential to bene t queries at full steam. To this end, both Dataset Manager subsystem and Cache Manager subsystem develop CON speci c mechanisms.

First, Dataset Manager subsystem employs the component Log Analyzer to categorize dataset changes, acting as a preprocessing step for validating CON cache. Algorithm 1 illustrates how the corresponding analysis is performed. Brie y, the incremental records that have not been re ected in cache are rst extracted from dataset log. Log Analyzer then launches a container with three counters, implemented by HashMap with key of dataset graph id and value of count for the operations on this graph. The said three counters (CT , CA and CR) (line 4) are responsible for counting the total, UA and UR operations, respectively. Each aforementioned record identi es the related dataset graph and its operation type (lines 7{8). Exhausting these records (lines 9{16) hence results the total-counter CT , UAexclusive-counter CA and UR-exclusive-counter CR. 5.2.2

Validating CON Cache

Then, the counter container returned by Dataset Manager subsystem is forwarded to Cache Manager subsystem, where Cache Validator refreshes the dataset-graph-validityindicator for cached graphs. CON cache targets at the biggest possible bene t for query processing. The Cache Validator hence strives to exploit useful previous query result for CON. In GC+, once a query is executed, its answer set is nalized, which snapshots the query's relation against dataset at the execution time { even the dataset would undergo changes later, GC+ will not repeat processing previAlgorithm 2 Refreshing a cached graph's validity indicator 1: Input: Counter container C (containing CT , CA and CR), currently maximum graph id m in dataset, stored info of a cached graph (with its dataset-graph-validityindicator CGvalid and query result Answer, both structured by BitSet) 2: Function: Updating CGvalid 3: 4: if (m + 1) > CGvalid:size then 5: Extend CGvalid to length (m + 1) by assigning false to extended bits 6: end if 7: for all i 2 CT :keySet() do 8: tc = CT :get(i) 9: uac = !CA:containsKey(i)? 0 : CA:get(i) 10: urc = !CR:containsKey(i)? 0 : CR:get(i) 11: 12:

if tc == uac ^ CGvalid:get(i) ^ Answer:get(i) then 13: 14:

continue else if tc == urc ^ CGvalid:get(i) ^ !Answer:get(i) then 15: continue 16: else 17: CGvalid:set(i; f alse) 18: end if 19: end for ous queries. Therefore, to deal with dataset changes, GC+ employs a BitSet indicator CGvalid per cached query, with each bit identifying the up-to-date validity of the query's relation towards a dataset graph.

Algorithm 2 depicts how the CGvalid of a cached graph g is refreshed by Cache Validator. To start with, CGvalid is checked whether it contains all the bits required (line 4 where dataset graph id starts from 0). If not, it implies that there are newly-added dataset graphs. Obviously, the relation between g and those new dataset graphs is unknown and those extended bits are thus assigned false (line 5). The idea is to make recent dataset changes take e ect on relevant bits of CGvalid (lines 7{19).

Speci cally, for each concerned dataset graph Gi (identi ed by i in line 7), its numbers of total operations (tc), UA (uac) and UR (urc) are retrieved from the input counters (lines 8{10). If operations on dataset graph Gi are exclusively of UA category (tc == uac in line 12), together with valid (CGvalid:get(i) in line 12) query result g Gi (Answer:get(i) in line 12), such validity still holds (g Gi is not bothered by adding edges to Gi). Similarly, if operations on dataset graph Gi are exclusively of UR category, the query result g * Gi (!Answer:get(i) in line 14) keeps valid. Whereas other operations fade g's validity on Gi if applicable (line 17). By harnessing the validity per cached query and dataset graph, as well as the optimizations in UAexclusive and UR-exclusive cases, CON model manages to enhance the cache e ciency in expediting graph queries.

QUERY PROCESSING

This section shall illustrate the speci c logic of Candidate Set Pruner in GC+. For space reason, we only present for subgraph queries (supergraph queries follow the exact inverse logic). As shown in Figure 2, when a query g arrives, CSM(g) Answersub(g) =

{G1, G3, G4} g ✓ g0, g0 ✓ {G2, G3} CGvalid(g0) = {G2}

Answersub(g) = {G2} (a) Subgraph Case GC+ discovers whether g is a subgraph or supergraph of cached queries concurrently by processors GC+sub/GC+super, referred as subgraph/supergraph case. 6.1

Subgraph Case

For example, in Figure 3(a), the new query g comes with candidate set CSM (g)(the current dataset) containing four graphs fG1; G2; G3; G4g. GC+sub Processor nds that there exists a previous query g0, such that g g0. Then g0's cached answer set fG2; G3g, as well as its up-to-time validity indicator CGvalid = fG2g, is retrieved. (As mentioned in Algorithm 2, both Answer(g0) and CGvalid(g0) are BitSet structures; here, we employ a set containing dataset-graphid to help illustrate.)

For graph G2 2 CSM (g), considering g g0 and g0 G2 (still holds for current dataset, as G2 exists in CGvalid(g0)), it must follow that g G2. Hence, G2 can be safely removed from CSM (g) and added directly to the nal answer set of g. Whereas G3 is not sub-iso exempted though it appears in g0's cached answer set, as g0 G3 fails to hold with state-of-the-art dataset (G3 does not appear in CGvalid(g0)) { g0 G3 was found when executing previous query g0 but has been faded by subsequent dataset changes (e.g., G3 was updated by removing some edges).

Therefore, the logic of GC+ for subgraph case is that only dataset graphs in CGvalid(g0) \Answer(g0) are sub-iso testfree, which can be safely removed from CSM (g) and directly added to the nal answer set of query g. In the general case, g may be a subgraph of multiple previous query graphs gi0. Then, the said sub-iso test-free graphs Answersub(g) is: CGvalid(gi0) \ Answer(gi0) Answersub(g) =

[ gi02Resultsub(g) (1) where Resultsub(g) contains all the currently cached queries of which g is a subgraph.

Hence, the set of dataset graphs for sub-iso testing is:

CSGC+sub (g) = CSM (g) n Answersub(g) Finally, if AnswerGC+sub (g) is the set of graphs veri ed to be containing g through sub-iso tests on CSGC+sub (g), the nal answer set for query g will be:

Answer(g) = AnswerGC+sub (g) [ Answersub(g) (2) (3)

Lemma 1. The nal answer of GC+ in the subgraph case does not contain false positives.

Proof. Assume false positives are possible and consider the rst ever false positive produced by GC+; i.e., for some query g, 9GF P such that g * GF P and GF P 2 Answer(g). Note that GF P cannot be in AnswerGC+sub (g) where each graph has passed the sub-iso test, which follows that GF P 2 CSM(g) \ Answersuper(g) =

{G1, G2, G3} g ◆ g00, g00 ✓ {G2, G3}

CGvalid(g00) = {G2, G3, G4} Answersuper(g) = {G1, G2, G3} (b) Supergraph Case

Proof. Assume false negatives are possible and consider the rst ever false negative produced by GC+ particularly; i.e., for some query g, 9GF N such that g GF N and GF N 2= Answer(g). As GF N 2 CSM (g) in GC+ by default and sub-iso testing does not introduce any false negative, the only possibility for error is that GF N was removed using formula (2); i.e., GF N 2= CSGC+sub (g). That implies that 9g0 such that g g0 and GF N 2 Answer(g0). But then, by formula (3), GF N will be added to Answersub(g) and thus GF N 2 Answer(g) (a contradiction).

Theorem 3. The nal answer of GC+ in the subgraph case is correct.

Proof. There are only two possibilities for error; GC+ can produce false negatives or false positives. The theorem then follows straightforwardly from Lemmas 1 and 2. 6.2

Supergraph Case

Figure 3(b) depicts an example of supergraph case in GC+, where GC+super Processor identi es there exists a previous query g00 satisfying g00 g. For g00, the cached answer set fG2; G3g and the dataset-graph-validity indicator CGvalid(g00) (fG2; G3; G4g) are retrieved. Again, the new query g comes with candidate set CSM (g) (fG1; G2; G3; G4g).

Compared with subgraph case, reasoning the logic of GC+ in supergraph case is more interesting. (i) Consider graph G1 2 CSM (g): G1 does not hold validity regarding its relation with g00, hence no previous query result about G1 could be utilized. g has to rest on the sub-iso test to identify whether G1 is in the nal answer set. (ii) For graph G2 2 CSM (g): g00 G2 holds, which does not make G2 sub-iso test free though providing g00 g, as the relation between g and G2 is still obscure and has to be veri ed by sub-iso test. Similarly, G3 is not free of sub-iso test either. (iii) For graph G4 2 CSM (g), G4 holds validity for its relation with g00 as g00 * G4. Since g00 g, if g G4 were to be true, then it should follow g00 G4, which is a contradiction. So it is safe to conclude that g * G4 and thus G4 can be removed from CSM (g). In overall, among graphs in CSM (g), those failing to appear in (CGvalid(g00) [ Answer(g00) can never exist in the nal answer set of g and thus become sub-iso test free, where CGvalid(g0) is the complementary set of CGvalid(g00) against state-of-the-art dataset. In other words, the set (CGvalid(g00) [ Answer(g00) covers all graphs that could possibly exist in the nal answer set of g, denoted as g00:Answersuper(g), i.e., g00:Answersuper(g) = (CGvalid(g00) [ Answer(g00) (4) Performing sub-iso tests on CSM (g) \ g00:Answersuper(g) therefore results the veri ed query answer Answer(g).

In the general case, g may be a supergraph of multiple previous query graphs gj00. Then, the set of graphs tested for sub-iso by GC+ is: CSGC+super (g) = CSM (g) \ gj00:Answersuper(g) \ gj002Resultsuper(g) The Query Processing Runtime subsystem rst applies equation (2) on CSM and then applies (5) on the result of the previous operation. The end result is a reduced candidate set, which is then sub-iso tested.

Additionally, there are two optimal cases that warrant further performance gains. First, note that GC+ can easily recognize the case where a new query, g, is isomorphic to a previous cached query g0. For connected query graphs, this holds providing that (i) g g0 or g g0; and (ii) g and g0 have the same number of nodes and edges; and (iii) g0 holds validity on all the up-to-date dataset graphs. Hence, GC+ can return the cached result of g0 directly, rendering sub-iso test free. Second, consider the case: for a new query g, a cached query g00 is discovered by GC+ that g00 g, Answer(g00) = ; and g00 holds validity on all graphs currently in dataset. Thus, GC+ can directly return an empty result set for g. The idea is that if there were a dataset graph G such that g G, since g00 g we would conclude that g00 G ) G 2 Answer(g00), contradicting the fact that Answer(g00) = ;; thus, no such graph G can exist and the nal result set of g is necessarily empty. 7. 7.1

PERFORMANCE EVALUATION Experimental Setup

GC+ is implemented in Java, by extending the graph caching system GC [ 26 ]. Experiments were performed on a Dell R920 host (4 Intel Xeon E7-4870 CPUs (15 cores each), with 320GB of RAM and 4 1TB disks, running Ubuntu Linux 14.04.4LTS. Following GC, the default value in GC+ for the upper limit on the sizes of Cache and the Window stores were 100 and 20 respectively.

Method M We used three well-established SI methods, including GraphQL (GQL) provided by [ 14 ], a modi ed VF2[ 3 ] (denoted VF2+) provided by [ 11 ], for being wellestablished and good performers [ 14, 8 ]; we also used vanilla VF2[ 3 ] as it is extensively used in FTV methods.

Dataset We employ a real-world dataset AIDS [ 19 ] (the Antiviral Screen Dataset of the National Cancer Institute) that is prevalently used in literature [ 14, 7, 1, 11 ]. AIDS contains 40,000 graphs, each with on average 45 vertices (std.dev.: 22, max: 245) and 47 edges (std.dev.: 23, max: 250), whereby the few largest graphs have an order of magnitude more vertices and edges.

Dataset Change Plan Dataset change operations are performed in batches, with occurrence time indicated by the id of queries in workload (recall Figure 2). The plan we used for AIDS consists of 2,000 operations (in 100 batches, 20 operations per batch), during the processing of 10,000 queries. A batch of operations are generated as following: rst, an occurrence time for the batch is selected uniformly at randomly from the id of queries; then, a type uniformly selected from fADD, DEL, UA, URg, a graph uniformly selected from dataset (ADD using the initial dataset instead of synthesizing additional graphs so as to maximumly keep the original dataset characteristics; DEL, UA and UR using the up-to-date dataset at running time) and a uniformly selected edge within the graph providing UA or UR being the selected type (UA would add an edge that has not been in graph yet; UR would remove an existed edge of graph) codetermine a speci c operation { such process is repeated until the batch contain the required number of operations.

We follow the established principle for the generation of our workloads, using two di erent algorithms to synthesize queries from the initial dataset graphs. Each workload consists of 10,000 queries with typical sizes in literature [ 11, 27 ] { 4, 8, 12, 16 and 20 edges.

Type A Workloads Queries of these workloads are generated in the following manner: rst, a source graph is randomly selected from dataset graphs; then, a node is selected randomly in the said graph; nally, a query size is selected uniformly at randomly from given sizes and a BFS is performed starting from the selected node. For each new node, all its edges connecting it to already visited nodes are added to the generated query, until the desired query size is reached. For the rst two random selections above, we have used two di erent distributions; namely, Uniform (U) and Zipf (Z), with the probability density function of the latter given by p(x) = x = ( ), where is the Riemann Zeta function[ 21 ]. Ultimately, we had three categories of Type A workloads: \UU", \ZU" and \ZZ", where the rst letter in each pair denotes the distribution used for selecting the starting graph, and the second for the starting node.

Type B Workloads (with no-answer queries) These workloads are generated as follows. For each of the query sizes, we rst create two query pools: a 10,000-query pool with queries with non-empty answer sets against the initial dataset, and a second 3,000-query pool with no match in any untreated dataset graph (i.e., empty result set). Queries for the rst pool are extracted from dataset graphs by uniformly selecting a start node across all nodes in all dataset graphs, and then performing a random walk till the required query graph size is reached. Generation of no-answer queries has one extra step: we continuously relabel the nodes in the query with randomly selected labels from the dataset, until the resulting query has a non-empty candidate set but an empty answer set against the dataset graphs. Once the query pools are lled up, we generate workloads by rst ipping a biased coin to choose between the two pools (with the \no-answer" pool selected with probability 0%, 20% or 50%), then randomly (Zipf) selecting a query from the chosen pool. We thus have three categories of Type B workloads: \0%", \20%" and \50%", denoting the above probability used.

We use Zipf = 1:4 by default; as a reference point, web page popularities follow a Zipf distribution with = 2:4 [ 21 ]. We only allow one for one Window (i.e., 20 queries) before starting measuring GC+'s performance. We report on both the bene ts and the overheads of GC+. Reported metrics include query time and number of sub-iso tests per query, along with the speedups introduced by GC+. Speedup is de ned as the ratio of the average performance (query time or number of sub-iso tests) of the base Method M over the average performance of GC+ when deployed over Method M (i.e., speedups >1 indicate improvements). As a yardstick, [ 17 ] (also a cache but for XML databases) report a query time speedup of 2.6 with 10,000-query workloads generated using Zipf = 1:5, and a 1,500-query warm-up.

Cache Replacement Policy GC+ incorporates all the replacement policies developed in GC. Here, we use the coined hybrid (HD) policy for experiments, as its performance is always better or on par with the best alternative [ 26 ]. Speci cally, HD coalesces another two GC/GC+ exclusive policies PIN and PINC. PIN scores each graph in cache by considering the total number of subgraph isomorphism tests alleviated by the said graph (coined R). PINC extends the mentioned ranking by taking into account the possibly vast di erences in query execution time (denoted C), where cost is estimated by a heuristic [ 25 ]. As C is estimated, using it in PINC does not always lead to improvements in query time. And through a large number of experiments, we have observed that when the values of the R exhibit a high variability, they are discriminative enough on their own, where considering C can actually lead to lower time gains (i.e., PIN performs better than PINC). However, when the values of R exhibit a low variability, adding in the C component leads to considerable query time improvements. When the HD policy is invoked, it rst retrieves the R from Statistics Manager and computes its variability [ 20 ] by using the (squared) coe cient of variation (CoV ). CoV is de ned as the ratio of the (square of the) standard deviation over the (square of the) mean of the distribution. When CoV > 1, the associated distribution is deemed of high vari

Figure 4 depicts the query time speedups of GC+ across all method M and workloads. We can see that CON achieves considerable speedup with the meagre 100-query cache con guration whereas gains of EVI are limited. Note the interesting nding that speedups for UU workload are close to those of ZU (e.g., 4.77 vs 5.13 against VF2 base method), whereas one might have expected a di erent outcome. Intuitively, the ZU workload bears more exact-match hits than UU, due to the skewness of selecting source graphs during query generation. And it does: we measured 2.5X the number of exact-match cache hits in ZU vs UU. However, in GC+, exact-match cache hit does not necessarily introduce sub-iso test free (see x6.3) { those resulting zero sub-iso test merely account 4% among exact-match cache hits of the said ZU workload (vs 11% in UU). Moreover, recall that GC+ exploits also subgraph/supergraph hits. Indeed, we measured circa 2X such matches for the UU workload vs ZU. Of course, the overall performance result is a very complex picture and depends on how big bene t is each saved exactmatch vs each saved subgraph/supergraph match. But the key insight here is that by utilizing exact-matches and subgraph/supergraph matches, GC+ can bene t both skewed and non-skewed workloads.

Figure 5 shows the speedups in the number of sub-iso tests performed. Please note that under a given con guration (speci ed by dataset, dataset change plan, workload, replacement policy, cache model EVI/CON, the upper limit on the sizes of Cache and the Window stores), whatever SI method being the Method M, GC+ results exactly the same pruned candidate set for each query. Therefore, Figure 5 is independent of the three methods considered in this work. Juxtaposing Figures 4 and 5 1,627 VF2 4 698 leads to the interesting insight: Reductions in the number of sub-iso tests do not translate directly into reductions in query time, echoing the claim that graph cache hits render di erent bene ts [ 26 ]. In all cases, though, GC+ achieves signi cant improvements in both query processing time and number of sub-iso tests performed.

Figure 6 depicts a break-down of query processing time for method M, EVI and CON. EVI pays overhead on updating the Window and Cache data stores, including executing the cache replacement algorithms and re-indexing the cached graphs. Whereas the overhead of CON also covers the time of analyzing dataset log and validating cache (see Algorithm 1 and 2) { such CON speci c cost is trial, taking less than 1% in CON overhead, across all the aforementioned workloads and methods M. This con rms a signi cant conclusion { the CON exclusive algorithms 1 and 2 are e cient. The dominant part of CON overhead (for updating the Window and Cache data stores) is higher than that of EVI, as the latter is frequently purged hence bearing less to be updated. Putting the Figure 4 aside Figure 5, it is obvious that CON sweeps EVI in query processing speedup with a negligible additional overhead.

CONCLUSIONS

We presented a systematic solution to handle graph cache consistency, by providing an upgraded system GC+, which is capable of expediting general subgraph/supergraph queries with dataset changes. We developed two GC+ exclusive cache models EVI and CON with di erent mechanisms on ensuring cache consistency. We illustrated the speci c logic of GC+ in reducing candidate set for query execution and formally proved its correctness. Our performance evaluation has demonstrated the considerable speedup achieved by CON. Future works include further optimizing CON cache with retrospective validating mechanisms, developing a distributed/decentralized version of GC+ and extending GC+ to bene t subgraph queries when nding all occurrences of a query graph against a single massive graph.

[1]

Bonnici et al. Enhancing graph database indexing by su x tree structure . In Proc. IAPR PRIB , 2010 .

[2]

Bottcher . Cache consistency in mobile XML databases . In Proc. WAIM , pages 300 { 312 , 2006 .

[3]

L. P.

Cordella ,

Foggia ,

Sansone , and

Vento . A (sub) graph isomorphism algorithm for matching large graphs . IEEE TPAMI , 26 ( 10 ): 1367 { 1372 , 2004 .

[4]

Garey and

Johnson . Computers and Intractability: A Guide to the Theory of NP-Completeness .

W.H.

Freeman , 1979 .

[5]

Gubichev and

Neumann . Exploiting the query structure for e cient join ordering in SPARQL queries . In Proc. EDBT , 2014 .

[6]

Guo et al. Relaxed currency and consistency: How to say \good enough" in SQL . In Proc. SIGMOD , 2004 .

[7]

W.-S.

Han ,

Lee , M.-D. Pham , and J. X. Yu . iGraph: A framework for comparisons of disk-based graph indexing techniques . PVLDB , 3 ( 1 -2): 449 { 459 , 2010 .

[8]

Katsarou ,

Ntarmos , and P. Trianta llou. Performance and scalability of indexed subgraph query processing methods . PVLDB , 8 ( 12 ): 1566 { 1577 , 2015 .

[9]

Katsarou ,

Ntarmos , and P. Trianta llou. Subgraph querying with parallel use of query rewritings and alternative algorithms . In Proc. EDBT , 2017 .

[10]

Kim ,

Shin , and W.-S. Han. Taming subgraph isomorphism for RDF query processing . PVLDB , 8 ( 11 ): 1238 { 1249 , 2015 .

[11]

Klein ,

Kriege , and

Mutzel. CT-index: Fingerprint-based graph indexing combining cycles and trees . In Proc. ICDE , 2011 .

[12]

Lai et al. Scalable subgraph enumeration in MapReduce . PVLDB, 8 ( 10 ): 974 { 985 , 2015 .

[13]

L. V. S.

Lakshmanan et al. Answering tree pattern queries using views . In Proc. VLDB , 2006 .

[14]

Lee et al. An in-depth comparison of subgraph isomorphism algorithms in graph databases . PVLDB , 6 ( 2 ): 133 { 144 , 2012 .

[15]

Lillis and

Pitoura . Cooperative XPath caching . In Proc. SIGMOD , 2008 .

[16]

Lorey et al. Caching and prefetching strategies for SPARQL queries . In Proc. USEWOD , 2013 .

[17]

Mandhani and

Suciu . Query caching and view selection for XML databases . In Proc. VLDB , 2005 .

[18]

Martin ,

Unbehauen , and

Auer . Improving the performance of semantic web applications with SPARQL query caching . In Proc. ESWC , 2010 .

[19] NCI - DTP AIDS antiviral screen dataset . http://dtp.nci.nih.gov/docs/aids/aids data. html.

[20]

Nelson . Probability, Stochastic Processes, and Queueing Theory . Springer Verlag, 1995 .

[21]

Newman . Power laws, Pareto distributions and Zipf's law . Contemporary Physics , 46 : 323 { 351 , 2005 .

[22]

Papailiou ,

Tsoumakos ,

Karras , and

Koziris . Graph-aware , workload-adaptive SPARQL query caching . In Proc. SIGMOD , 2015 .

[23]

Semertzidis and

Pitoura . Durable graph pattern queries on historical graphs . In Proc. ICDE , 2016 .

[24]

Sun et al. E cient subgraph matching on billion node graphs . PVLDB , 5 ( 9 ): 788 { 799 , 2012 .

[25]

Wang ,

Ntarmos , and P. Trianta llou. Indexing query graphs to speedup graph query processing . In Proc. EDBT , 2016 .

[26]

Wang ,

Ntarmos , and P. Trianta llou. GraphCache: a caching system for graph queries . In Proc. EDBT , 2017 .

[27]

Yan et al. Graph indexing: a frequent structure-based approach . In Proc. SIGMOD , 2004 .