Ensuring Consistency in Graph Cache for Graph-Pattern Queries

Jing Wang, Nikos Ntarmos, Peter Triantafillou
School of Computing Science, University of Glasgow, UK
j.wang.3@research.gla.ac.uk, nikos.ntarmos@glasgow.ac.uk, peter.triantafillou@glasgow.ac.uk

©2017, Copyright is with the authors. Published in the Workshop Proceedings of the EDBT/ICDT 2017 Joint Conference (March 21, 2017, Venice, Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

ABSTRACT
Graph queries are costly, as they entail the NP-Complete subgraph isomorphism problem. Graph caching has recently been suggested, showing the potential to significantly accelerate subgraph/supergraph queries. Subsequently, GraphCache, the first full-fledged graph caching system, was put forth. However, when the underlying dataset changes concurrently with the query workload, ensuring the consistency of the graph cache becomes an issue. This work provides a systematic solution to this problem, presenting an upgraded GraphCache system coined GraphCache+ (abbreviated as GC+). We develop two GC+-exclusive cache models that take different approaches to the consistency issue. Moreover, we present the logic by which GC+ expedites queries, together with a formal proof of its correctness. We evaluate the performance of GC+ using a real-world graph dataset and a number of query workloads with different characteristics, highlighting the considerable speedup in terms of quantified benefit and overhead.

CCS Concepts
• Information systems → Database query processing;

Keywords
Graph queries; Caching system; Cache consistency

1. INTRODUCTION
Graph-structured data is increasingly prevalent in modern big data applications. Central to high-performance graph analytics is locating patterns in dataset graphs. Informally, given a query graph g, the system is called to return the set of dataset graphs that contain g (subgraph query) or are contained in g (supergraph query), named the answer set of g. These operations can be time consuming, as they entail the NP-Complete problem of subgraph isomorphism [4]. Hence, the community has contributed a number of innovative solutions in recent years. One research thread follows the "filter-then-verify" (FTV) paradigm: first, dataset graphs are decomposed into substructures and indexed; then, the substructures of an incoming query graph g are looked up in the dataset index, producing the set of dataset graphs that contain all substructures of g (resp. the dataset graphs whose substructures are all contained in g), coined the candidate set of g; finally, the graphs in the candidate set are verified for subgraph isomorphism (abbreviated as sub-iso or SI), returning the answer set of g. Similarly, [23] provides a solution for subgraph queries against historical (i.e., snapshotted) graphs – a variation of typical graph queries where snapshots can be viewed as different graphs. However, extensive evaluations of FTV methods [7, 8] show significant performance limitations. Another research thread is concerned with heuristic SI algorithms, which can operate either against a single very large graph or against a dataset containing a large number of graphs. These algorithms are capable of pruning away (parts of) dataset graphs that cannot contain the query, expediting sub-iso testing without specialized graph indexes. [14, 9] provide insightful evaluations of such SI algorithms.
Unfortunately, both FTV and SI methods suffer from executing unnecessary sub-iso tests. For example, if a previously issued query is submitted again to the system, it still has to undergo sub-iso tests, as all laboriously derived knowledge (i.e., the answer sets of previous queries) has been thrown away. Moreover, in many applications, it is natural for submitted graph queries to share subgraph or supergraph relations with future queries – in protein datasets, there is a hierarchy of queries for aminoacids, proteins, protein mixtures, unicell bacteria, all the way to multi-cell organisms; in social network exploration, queries could start off broad (e.g., all people in a geographic location) and become gradually narrower (e.g., by homing in on specific demographics). Based on these observations, we proposed in [25] a fresh graph query processing method, where queries (and their answers) are indexed to expedite future query processing with FTV methods. Subsequently, we presented GraphCache [26], the first full-fledged caching system for general subgraph/supergraph query processing, applicable to all current FTV and SI methods. Following established research, GraphCache handles graph queries against a dataset that remains static throughout the continuous query stream.
In real-world applications, however, the graph dataset naturally evolves over time. For example, in social networking, newly added groups (graph-modeled relations/interactions among people), break-ups of existing groups, and changed relations/interactions among group members happen frequently. Likewise, biochemical datasets keep being refreshed with newly translated, disregarded or transformed proteins, for application/research reasons. Such changes can be modeled as graph addition (ADD), graph deletion (DEL), graph update by edge addition (UA) and graph update by edge removal (UR).
SI algorithms can accommodate these changes on the fly, as each dataset graph eventually undergoes a subgraph isomorphism test, whereas FTV methods additionally require an updatable indexing mechanism to produce a proper candidate set. To the best of our knowledge, none of the proposed FTV algorithms so far has an updatable index or a similar solution to tackle dataset changes. As a result, one has to turn to SI methods, where each dataset graph is painstakingly verified. On the other hand, caching has proven efficient in accelerating graph queries [26].
It thus naturally follows to use a graph cache to alleviate the cost of SI methods when processing subgraph/supergraph queries over a changing dataset – a topic that has not been addressed yet. Therefore, underpinned by our previous work in [25, 26], this paper contributes:
• A systematic solution to expedite subgraph/supergraph queries while the dataset changes throughout the query stream, by presenting an upgraded graph caching system, GC+, featuring newly plugged-in subsystems that deal with the resulting cache consistency issue.
• Two cache models, named EVI and CON, proposed exclusively for GC+, representing different designs for ensuring graph cache consistency.
• The logic of GC+ for pruning the candidate set of general subgraph/supergraph queries, together with a formal proof of its correctness.
• Performance evaluations using well-established SI methods, a real-world dataset and a number of workloads, emphasizing the significant speedup of the CON cache.

2. RELATED WORK
The subgraph isomorphism problem has two versions: (i) the decision problem answers yes/no as to whether the query is contained in each graph in the dataset; (ii) the matching problem locates all occurrences of the query graph within a large graph (or a dataset of graphs). To address the decision and matching problems, the brute-force approach is to execute a sub-iso test of the query against each dataset graph (the SI method). SI algorithms deteriorate when the dataset contains a large number of graphs, which prompted the prevalence of FTV methods. GC+ can benefit both SI and FTV solutions for general subgraph/supergraph queries.
The community has also looked into subgraph queries against a single massive graph via scale-out architectures [12] or large memory clusters with massive parallelism [24]. For the time being, GC+ does not target such use cases, which are left for future work.
Using a cache to expedite queries is prevalent in relational databases, whereas little work has been done for graph-structured query processing. XML datasets have used views to expedite path/tree queries [15]; [13] first proposed the MCR (maximally contained rewriting) approach for tree-pattern queries, but introduced false negatives. GC+ produces neither false negatives nor false positives (see §6). Also, GC+ deals with much more complex graph queries, which entail the NP-Complete subgraph isomorphism problem.
Caching has also been leveraged to expedite SPARQL query processing in RDF graphs. [18] proposed the first cache for SPARQL queries. [22] contributed a cache for SPARQL queries with a novel canonical labelling scheme to identify cache hits and a popular dynamic programming planner. Similar to GC+, the query processing optimization in [22] does not require any a priori knowledge of datasets/workloads and is workload adaptive. However, like XML queries, SPARQL queries are less expressive than general graph queries and thus less challenging [10, 24]: SPARQL query processing consists of solving the subgraph homomorphism problem, which differs from the subgraph isomorphism problem in that the former drops the injectivity requirement of the latter. Furthermore, GC+ discovers subgraph, supergraph, and exact-match relationships between the incoming query and cached queries, which the canonical labelling scheme in [22] fails to achieve. SPARQL query processing also targets optimized join execution plans [5] (based on join selectivity estimator statistics and related cost functions), and the cache in [22] focuses on this goal, whereas GC+ aims to avoid/reduce the cost of executing SI heuristics, whose execution time can be highly unpredictable and much higher. Therefore, the overall rationale of GC+ and the way it utilizes cache contents differ from those in [22] and in related caching solutions for SPARQL queries.
Pertaining to cache consistency, [6] first explicitly specified the consistency constraint in a query-centric approach and showed how it can be expressed succinctly in SQL. There are also studies of cache consistency for XML datasets [2] and SPARQL query processing [16]. However, the topic of ensuring graph cache consistency for general subgraph/supergraph queries has not been discussed yet.
3. DEFINITIONS
GC+ is implemented for undirected labelled graphs, following established studies in the literature [1, 11]. We assume that only vertices have labels; all our results straightforwardly generalize to directed graphs and/or graphs with edge labels.
A labeled graph G = (V, E, l) consists of a set of vertices V(G), a set of edges E(G) = {(u, v) : u ∈ V, v ∈ V}, and a function l : V → U, where U is the label set, defining the domain of vertex labels. A graph G_i = (V_i, E_i, l_i) is subgraph-isomorphic to a graph G_j = (V_j, E_j, l_j), denoted (by abuse of notation) G_i ⊆ G_j, when there exists an injection φ : V_i → V_j such that ∀(u, v) ∈ E_i ⇒ (φ(u), φ(v)) ∈ E_j and ∀u ∈ V_i, l_i(u) = l_j(φ(u)).
As is common in the literature, we focus on non-induced subgraph isomorphism. Informally, there is a subgraph isomorphism G_i ⊆ G_j if G_j contains a subgraph that is isomorphic to G_i. In this case, we say that G_i is a subgraph of (or is contained in) G_j, or inversely that G_j is a supergraph of (contains) G_i (denoted by G_j ⊇ G_i). The subgraph querying problem entails a set D = {G_1, ..., G_n} of n graphs and a query graph g, and determines all graphs G_i ∈ D such that g ⊆ G_i. The supergraph querying problem entails a set D = {G_1, ..., G_n} of n graphs and a query graph g, and determines all graphs G_i ∈ D such that g ⊇ G_i.
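To make the two querying problems concrete, the following minimal Java sketch computes their answer sets with the brute-force SI approach of §2, i.e., one sub-iso test per dataset graph. It is only an illustration of the problem statement, not of GC+ itself; the generic Graph type and the subIso predicate (subIso.test(a, b) holding iff a ⊆ b) are hypothetical stand-ins for an actual SI implementation such as VF2.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.BiPredicate;

    // Illustrative sketch of the subgraph/supergraph querying problems of Section 3,
    // answered by brute-force sub-iso testing. The Graph type and subIso predicate are
    // hypothetical placeholders for a real SI algorithm.
    final class BruteForceQueries<Graph> {
        private final BiPredicate<Graph, Graph> subIso;  // subIso.test(a, b) == true iff a is a subgraph of b

        BruteForceQueries(BiPredicate<Graph, Graph> subIso) { this.subIso = subIso; }

        // Subgraph query: all dataset graphs Gi with g contained in Gi.
        List<Graph> subgraphQuery(Graph g, List<Graph> dataset) {
            List<Graph> answer = new ArrayList<>();
            for (Graph gi : dataset) if (subIso.test(g, gi)) answer.add(gi);
            return answer;
        }

        // Supergraph query: all dataset graphs Gi contained in g.
        List<Graph> supergraphQuery(Graph g, List<Graph> dataset) {
            List<Graph> answer = new ArrayList<>();
            for (Graph gi : dataset) if (subIso.test(gi, g)) answer.add(gi);
            return answer;
        }
    }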
4. SYSTEM ARCHITECTURE

Figure 1: GC+ System Architecture

GC+ is a scalable semantic cache for subgraph/supergraph queries, consisting of four subsystems: Dataset Manager, Cache Manager, Query Processing Runtime and Method M, as shown in Figure 1. The first three are GC+ internal; Method M is the external SI method that GC+ is called to expedite. The Method M subsystem includes an SI implementation, denoted M_verifier, which sub-iso tests the candidate set M_CS (the whole dataset when GC+ is not used).
The Dataset Manager subsystem is GC+ specific, containing a component, the Log Analyzer, that handles dataset logs. The Cache Manager is responsible for the management of the data and metadata stored in the cache. Besides the cache replacement mechanisms, a Window Manager for cache admission control and a Statistics Manager for metadata, which work as in GC, the Cache Manager of GC+ boasts an additional component, the Cache Validator, which cooperates with the Log Analyzer to ensure cache consistency. Both the Log Analyzer and the Cache Validator launch cache-model-dependent mechanisms to cope with cache consistency (see §5).
The Query Processing Runtime subsystem takes charge of query execution and metrics monitoring. As in GC, it comprises a resource/thread manager, the GC+ internal subgraph/supergraph query processors, the logic for candidate set pruning, and a statistics monitor – these components communicate with Method M and the Cache Manager via well-defined APIs. Please note that GC+'s Candidate Set Pruner logic is different from and more complicated than that in GC, though the former can be viewed as an adaptation of the latter. The logic of GC+ is presented and proved in §6.
When a query g arrives, the Dataset Manager subsystem first identifies whether the dataset has changed recently. If so, the Cache Validator is triggered to reflect these changes onto the previous queries residing in the cache and the window (where queries are batched before entering the cache; in this work, cached graphs/queries by default cover the previous queries in both cache and window). Then, g is sent to the Query Processing Runtime subsystem for execution. Specifically, GC+ calls its processors (GC+sub and GC+super) to discover whether g shares subgraph/supergraph relations with cached graphs – each hit graph in the cache then contributes the valid part of its answer set to prune g's candidate set. Finally, the issued query g, together with its answer set and statistics pertaining to the contribution of cached graphs, is fed to the Cache Manager subsystem, where admission control and cache replacement are performed, concurrently with the Query Processing Runtime subsystem executing subsequent queries.
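The following minimal Java sketch outlines the per-query control flow just described. It is illustrative only, under assumed interfaces: the type and method names (DatasetManager.hasPendingChanges, CacheValidator.validate, CandidateSetPruner.prune, Verifier.verify, CacheManager.admit) are hypothetical and do not correspond to the actual GC+ code base.

    import java.util.BitSet;

    // Hypothetical interfaces mirroring the subsystems of Figure 1; names are illustrative.
    interface DatasetManager { boolean hasPendingChanges(); Object analyzeLog(); }
    interface CacheValidator { void validate(Object logSummary); }
    interface Verifier { BitSet verify(Object query, BitSet candidates); }   // Method M (M_verifier)
    interface CacheManager { void admit(Object query, BitSet answer, Object stats); }

    final class PruneResult {
        final BitSet prunedCandidates;  // graphs that still need sub-iso testing
        final BitSet safeAnswers;       // graphs proven to contain g via valid cached results (see Section 6.1)
        PruneResult(BitSet prunedCandidates, BitSet safeAnswers) {
            this.prunedCandidates = prunedCandidates;
            this.safeAnswers = safeAnswers;
        }
    }
    interface CandidateSetPruner { PruneResult prune(Object query, BitSet candidateSet); }

    final class QueryProcessingRuntime {
        private final DatasetManager datasetManager;
        private final CacheValidator cacheValidator;
        private final CandidateSetPruner pruner;
        private final Verifier mVerifier;
        private final CacheManager cacheManager;

        QueryProcessingRuntime(DatasetManager dm, CacheValidator cv, CandidateSetPruner p,
                               Verifier v, CacheManager cm) {
            datasetManager = dm; cacheValidator = cv; pruner = p; mVerifier = v; cacheManager = cm;
        }

        // Per-query flow of Section 4: validate the cache if the dataset changed, prune the
        // candidate set using cached queries, verify the remainder with Method M, then admit.
        BitSet process(Object query, BitSet initialCandidateSet) {
            if (datasetManager.hasPendingChanges()) {
                cacheValidator.validate(datasetManager.analyzeLog());  // refresh validity indicators
            }
            PruneResult pruned = pruner.prune(query, initialCandidateSet);
            BitSet answer = (BitSet) mVerifier.verify(query, pruned.prunedCandidates).clone();
            answer.or(pruned.safeAnswers);                             // cf. formula (3) in Section 6.1
            cacheManager.admit(query, answer, null);                   // admission control / replacement
            return answer;
        }
    }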
5. CACHE CONSISTENCY

5.1 EVI Cache Model
To address the said challenge arising from a dynamic graph dataset, a straightforward solution is to abandon the cache wholesale, i.e., evicting (EVI) the entire graph cache whenever the dataset changes. In this case, the Log Analyzer has to do nothing but raise a flag indicating that the dataset has changed, and the Cache Validator then clears the cached contents indiscriminately. In this way, the caching system in [26] can be easily adapted to tackle graph queries with dataset changes, as the cleared cache can never produce errors for future query processing. But the limitation is obvious – the EVI cache has to warm up from scratch upon each dataset change.
The problem of the EVI cache lies in its failure to differentiate the validity of cached results. When a dataset graph undergoes some change, its relations with all previous queries in the cache (reflected by the cached query results) are affected – some still hold, while others fail. Of course, invalid contents cannot be leveraged to accelerate future queries and should be abandoned. However, the purge in EVI throws away the valid contents as well, truncating the cache's efficiency.

5.2 CON Cache Model
To solve the aforementioned problem of EVI, we develop another cache model, named CON, which works hard to identify the valid contents in the cache. We use an example to illustrate how the CON model works for subgraph queries; the mechanism for supergraph queries is similar and is omitted for space reasons.

Figure 2: A CON Cache Running Example with Timeline

Figure 2 depicts an example, where GC+ starts off with a dataset containing four graphs {G0, G1, G2, G3} and an empty CON cache. At time T1, query g' arrived and was executed; assume that g' passed admission control and entered the cache later. The CON cache thus recorded that g' holds validity towards its relation with all graphs in the current dataset (i.e., CGvalid covers the dataset graphs with ids 0, 1, 2, 3), among which g' ⊄ G0, g' ⊄ G1, g' ⊆ G2 and g' ⊆ G3. Then, at time T2, the dataset was changed by adding a new graph G4 and updating G3 with edge removals. Obviously, there is no clue as to whether G4 contains g' or not, i.e., g' does not hold validity regarding its relation with the newly arrived G4. As for G3, we had g' ⊆ G3, which becomes unknown as removing edges may result in g' ⊄ G3; hence, the validity of g' pertaining to G3 is turned off. Subsequently, at time T3, another query g'' was executed and later entered the cache, holding validity towards each graph in the up-to-date dataset of five graphs. Again, the dataset changed at time T4 – graph G0 was deleted and G1 was updated by edge additions. Regarding the current dataset {G1, G2, G3, G4}, g' ⊄ G1 is no longer guaranteed, since adding edges may introduce g' ⊆ G1; g', as well as g'', thus loses its validity on G1. Therefore, when the new query g arrives at T5, it is facilitated by the cached graphs g' and g'', each with up-to-date validity information pertaining to all current dataset graphs, ensuring the cache consistency of the CON model.
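The following Java sketch illustrates the per-cached-query bookkeeping behind this example: an Answer BitSet and a CGvalid BitSet, with validity bits flipped as dataset operations arrive. It is a simplified, eager variant for illustration only (GC+ itself batches the updates via Algorithms 1 and 2 below), and the class and method names are hypothetical.

    import java.util.BitSet;

    // Illustrative sketch of the per-cached-query state of Section 5.2; names are hypothetical.
    final class CachedQueryEntry {
        final BitSet answer  = new BitSet();  // bit i set => the query was found to be contained in graph i
        final BitSet cgValid = new BitSet();  // bit i set => the cached relation towards graph i still holds

        // A newly admitted query is valid w.r.t. every graph of the dataset it was executed against.
        void admit(BitSet answerAtExecutionTime, int datasetSize) {
            answer.or(answerAtExecutionTime);
            cgValid.set(0, datasetSize);
        }

        // ADD: nothing is known about the new graph, so its (fresh) bit simply stays false.
        // DEL: the graph is gone; drop its validity so it can never contribute.
        void onDelete(int graphId)       { cgValid.clear(graphId); }
        // UA (edge additions): a cached "not contained" result may become invalid.
        void onEdgeAddition(int graphId) { if (!answer.get(graphId)) cgValid.clear(graphId); }
        // UR (edge removals): a cached "contained" result may become invalid.
        void onEdgeRemoval(int graphId)  { if (answer.get(graphId)) cgValid.clear(graphId); }
    }

    // Example (Figure 2, time T2): the edge removal on G3 turns off g''s validity bit 3,
    //   gPrime.onEdgeRemoval(3);
    // while the newly added G4 simply leaves bit 4 of cgValid unset.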
5.2.1 Analyzing Dataset Log
GC+ is designed to ensure that the CON cache retains its full potential to benefit queries. To this end, both the Dataset Manager and the Cache Manager subsystems implement CON-specific mechanisms.
First, the Dataset Manager subsystem employs the Log Analyzer component to categorize dataset changes, acting as a preprocessing step for validating the CON cache. Algorithm 1 illustrates how this analysis is performed.

Algorithm 1 Analyzing Log for the CON Cache
1: Input: Dataset update log L
2: Output: A container C with counters categorizing the operations performed on dataset graphs
3:
4: Initialize C with an empty HashMap per counter (CT, CA and CR)
5: Extract the incremental records R from L
6: for all r ∈ R do
7:   i = id of the dataset graph G in r
8:   t = operation type in r
9:   switch t do
10:    case UA
11:      CA.get(i) += 1
12:      break
13:    case UR
14:      CR.get(i) += 1
15:      break
16:  CT.get(i) += 1
17: end for
18: return C

Briefly, the incremental records that have not yet been reflected in the cache are first extracted from the dataset log (line 5). The Log Analyzer maintains a container with three counters, each implemented as a HashMap whose key is a dataset graph id and whose value is the count of operations on that graph. The said three counters (CT, CA and CR) (line 4) count the total, UA and UR operations, respectively. Each record identifies the affected dataset graph and its operation type (lines 7–8). Exhausting these records (lines 9–16) hence yields the total counter CT, the UA-exclusive counter CA and the UR-exclusive counter CR.
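A direct Java rendering of Algorithm 1 is sketched below. The UpdateRecord type and the OpType names are hypothetical stand-ins for GC+'s actual log format; the counter container matches the CT/CA/CR HashMaps of the pseudocode.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative Java sketch of Algorithm 1; record and enum types are hypothetical.
    final class LogAnalyzerSketch {
        enum OpType { ADD, DEL, UA, UR }

        static final class UpdateRecord {
            final int graphId; final OpType type;
            UpdateRecord(int graphId, OpType type) { this.graphId = graphId; this.type = type; }
        }

        static final class CounterContainer {
            final Map<Integer, Integer> total = new HashMap<>();  // CT
            final Map<Integer, Integer> ua    = new HashMap<>();  // CA
            final Map<Integer, Integer> ur    = new HashMap<>();  // CR
        }

        // One pass over the not-yet-reflected records, counting operations per dataset graph id.
        static CounterContainer analyze(List<UpdateRecord> incrementalRecords) {
            CounterContainer c = new CounterContainer();
            for (UpdateRecord r : incrementalRecords) {
                if (r.type == OpType.UA)      c.ua.merge(r.graphId, 1, Integer::sum);
                else if (r.type == OpType.UR) c.ur.merge(r.graphId, 1, Integer::sum);
                c.total.merge(r.graphId, 1, Integer::sum);  // CT counts every operation type
            }
            return c;
        }
    }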
5.2.2 Validating CON Cache
Then, the counter container returned by the Dataset Manager subsystem is forwarded to the Cache Manager subsystem, where the Cache Validator refreshes the dataset-graph-validity indicator of the cached graphs. The CON cache targets the biggest possible benefit for query processing; the Cache Validator hence strives to exploit every useful previous query result. In GC+, once a query is executed, its answer set is finalized, which snapshots the query's relation to the dataset at execution time – even if the dataset undergoes changes later, GC+ will not repeat the processing of previous queries. Therefore, to deal with dataset changes, GC+ employs a BitSet indicator CGvalid per cached query, with each bit identifying the up-to-date validity of the query's relation towards a dataset graph.

Algorithm 2 Refreshing a cached graph's validity indicator
1: Input: Counter container C (containing CT, CA and CR), currently maximum graph id m in dataset, stored info of a cached graph (with its dataset-graph-validity indicator CGvalid and query result Answer, both structured as BitSets)
2: Function: Updating CGvalid
3:
4: if (m + 1) > CGvalid.size then
5:   Extend CGvalid to length (m + 1), assigning false to the extended bits
6: end if
7: for all i ∈ CT.keySet() do
8:   tc = CT.get(i)
9:   uac = !CA.containsKey(i) ? 0 : CA.get(i)
10:  urc = !CR.containsKey(i) ? 0 : CR.get(i)
11:
12:  if tc == uac ∧ CGvalid.get(i) ∧ Answer.get(i) then
13:    continue
14:  else if tc == urc ∧ CGvalid.get(i) ∧ !Answer.get(i) then
15:    continue
16:  else
17:    CGvalid.set(i, false)
18:  end if
19: end for

Algorithm 2 depicts how the CGvalid of a cached graph g is refreshed by the Cache Validator. To start with, CGvalid is checked as to whether it contains all the required bits (line 4; dataset graph ids start from 0). If not, there are newly added dataset graphs; the relation between g and those new graphs is obviously unknown, and the extended bits are thus assigned false (line 5). The idea is then to make the recent dataset changes take effect on the relevant bits of CGvalid (lines 7–19). Specifically, for each concerned dataset graph Gi (identified by i in line 7), its number of total operations (tc), UA operations (uac) and UR operations (urc) are retrieved from the input counters (lines 8–10). If the operations on dataset graph Gi are exclusively of the UA category (tc == uac in line 12), together with a valid (CGvalid.get(i) in line 12) query result g ⊆ Gi (Answer.get(i) in line 12), such validity still holds (g ⊆ Gi is not affected by adding edges to Gi). Similarly, if the operations on dataset graph Gi are exclusively of the UR category, the query result g ⊄ Gi (!Answer.get(i) in line 14) remains valid. Any other combination of operations voids g's validity on Gi (line 17). By harnessing the validity per cached query and dataset graph, as well as the optimizations of the UA-exclusive and UR-exclusive cases, the CON model manages to enhance the cache's efficiency in expediting graph queries.
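For concreteness, the following sketch expresses Algorithm 2 with java.util.BitSet. Note that java.util.BitSet grows on demand and unset bits read as false, so the explicit extension of lines 4–6 is implicit here; the method and parameter names are otherwise hypothetical.

    import java.util.BitSet;
    import java.util.Map;

    // Illustrative Java sketch of Algorithm 2; names are hypothetical.
    final class CacheValidatorSketch {
        static void refresh(Map<Integer, Integer> ct,   // CT: total operations per graph id
                            Map<Integer, Integer> ca,   // CA: UA (edge addition) operations
                            Map<Integer, Integer> cr,   // CR: UR (edge removal) operations
                            BitSet cgValid,             // validity indicator of one cached query
                            BitSet answer) {            // cached answer of that query
            for (Map.Entry<Integer, Integer> e : ct.entrySet()) {
                int i   = e.getKey();
                int tc  = e.getValue();
                int uac = ca.getOrDefault(i, 0);
                int urc = cr.getOrDefault(i, 0);
                boolean uaOnlyAndContained    = (tc == uac) && cgValid.get(i) && answer.get(i);
                boolean urOnlyAndNotContained = (tc == urc) && cgValid.get(i) && !answer.get(i);
                if (!uaOnlyAndContained && !urOnlyAndNotContained) {
                    cgValid.clear(i);  // any other combination voids the cached relation to graph i
                }
            }
        }
    }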
6. QUERY PROCESSING
This section illustrates the specific logic of the Candidate Set Pruner in GC+. For space reasons, we only present subgraph queries (supergraph queries follow the exact inverse logic). As shown in Figure 2, when a query g arrives, GC+ discovers whether g is a subgraph or supergraph of cached queries concurrently, via the processors GC+sub and GC+super, referred to as the subgraph and supergraph cases respectively.

Figure 3: GC+ Processing of a Query g – (a) Subgraph Case; (b) Supergraph Case

6.1 Subgraph Case
For example, in Figure 3(a), the new query g comes with candidate set CS_M(g) (the current dataset) containing four graphs {G1, G2, G3, G4}. The GC+sub Processor finds that there exists a previous query g' such that g ⊆ g'. Then g''s cached answer set {G2, G3}, as well as its up-to-date validity indicator CGvalid(g') = {G2}, is retrieved. (As mentioned in Algorithm 2, both Answer(g') and CGvalid(g') are BitSet structures; here, we use sets of dataset graph ids to ease the illustration.)
For graph G2 ∈ CS_M(g): considering g ⊆ g' and g' ⊆ G2 (which still holds for the current dataset, as G2 exists in CGvalid(g')), it must follow that g ⊆ G2. Hence, G2 can be safely removed from CS_M(g) and added directly to the final answer set of g. In contrast, G3 is not sub-iso exempted even though it appears in g''s cached answer set, as g' ⊆ G3 no longer holds for the up-to-date dataset (G3 does not appear in CGvalid(g')) – g' ⊆ G3 was established when executing the previous query g' but has been invalidated by subsequent dataset changes (e.g., G3 was updated by removing some edges).
Therefore, the logic of GC+ for the subgraph case is that only dataset graphs in CGvalid(g') ∩ Answer(g') are sub-iso test-free; they can be safely removed from CS_M(g) and added directly to the final answer set of query g. In the general case, g may be a subgraph of multiple previous query graphs g'_i. Then, the set of sub-iso test-free graphs Answer_sub(g) is:

Answer_sub(g) = ⋃_{g'_i ∈ Result_sub(g)} ( CGvalid(g'_i) ∩ Answer(g'_i) )    (1)

where Result_sub(g) contains all the currently cached queries of which g is a subgraph.
Hence, the set of dataset graphs for sub-iso testing is:

CS_GC+sub(g) = CS_M(g) \ Answer_sub(g)    (2)

Finally, if Answer_GC+sub(g) is the set of graphs verified to contain g through sub-iso tests on CS_GC+sub(g), the final answer set for query g is:

Answer(g) = Answer_GC+sub(g) ∪ Answer_sub(g)    (3)

Lemma 1. The final answer of GC+ in the subgraph case does not contain false positives.
Proof. Assume false positives are possible and consider the first ever false positive produced by GC+; i.e., for some query g, ∃G_FP such that g ⊄ G_FP and G_FP ∈ Answer(g). Note that G_FP cannot be in Answer_GC+sub(g), where each graph has passed the sub-iso test; it follows by formula (3) that G_FP ∈ Answer_sub(g). Furthermore, formula (1) implies ∃g' such that g ⊆ g', G_FP ∈ CGvalid(g') and G_FP ∈ Answer(g'). Hence, G_FP ∈ Answer(g') is valid for the up-to-date dataset, i.e., g' ⊆ G_FP. But g' ⊆ G_FP and g ⊆ g' ⇒ g ⊆ G_FP (a contradiction).

Lemma 2. The final answer of GC+ in the subgraph case does not introduce false negatives.
Proof. Assume false negatives are possible and consider the first ever false negative produced by GC+; i.e., for some query g, ∃G_FN such that g ⊆ G_FN and G_FN ∉ Answer(g). As G_FN ∈ CS_M(g) in GC+ by default and sub-iso testing does not introduce any false negative, the only possibility for error is that G_FN was removed by formula (2), i.e., G_FN ∉ CS_GC+sub(g). That implies G_FN ∈ Answer_sub(g), i.e., ∃g' such that g ⊆ g' and G_FN ∈ CGvalid(g') ∩ Answer(g'). But then, by formula (3), G_FN is added to the final answer via Answer_sub(g), and thus G_FN ∈ Answer(g) (a contradiction).

Theorem 3. The final answer of GC+ in the subgraph case is correct.
Proof. There are only two possibilities for error: GC+ can produce false negatives or false positives. The theorem then follows straightforwardly from Lemmas 1 and 2.
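To show how formulas (1)–(3) map onto the BitSet representation of §5.2.2, the following Java sketch computes the sub-iso test-free set, the pruned candidate set and the final answer for the subgraph case. The surrounding plumbing (how Result_sub(g) is obtained) is assumed; all names are illustrative.

    import java.util.BitSet;
    import java.util.List;

    // Illustrative sketch of formulas (1)-(3) on BitSets indexed by dataset graph id.
    // cgValids.get(k) and answers.get(k) hold CGvalid(g'_k) and Answer(g'_k) for the k-th
    // cached query of which g is a subgraph (i.e., Result_sub(g)).
    final class SubgraphCasePruner {
        // Answer_sub(g) = union over g'_k of (CGvalid(g'_k) AND Answer(g'_k))     -- formula (1)
        static BitSet answerSub(List<BitSet> cgValids, List<BitSet> answers) {
            BitSet answerSub = new BitSet();
            for (int k = 0; k < cgValids.size(); k++) {
                BitSet safe = (BitSet) cgValids.get(k).clone();
                safe.and(answers.get(k));
                answerSub.or(safe);
            }
            return answerSub;
        }

        // CS_GC+sub(g) = CS_M(g) minus Answer_sub(g)                               -- formula (2)
        static BitSet prunedCandidates(BitSet csM, BitSet answerSub) {
            BitSet cs = (BitSet) csM.clone();
            cs.andNot(answerSub);
            return cs;
        }

        // Answer(g) = Answer_GC+sub(g) union Answer_sub(g)                          -- formula (3)
        static BitSet finalAnswer(BitSet verifiedAnswers, BitSet answerSub) {
            BitSet answer = (BitSet) verifiedAnswers.clone();
            answer.or(answerSub);
            return answer;
        }
    }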
6.2 Supergraph Case
Figure 3(b) depicts an example of the supergraph case in GC+, where the GC+super Processor identifies that there exists a previous query g'' satisfying g'' ⊆ g. For g'', the cached answer set {G2, G3} and the dataset-graph-validity indicator CGvalid(g'') = {G2, G3, G4} are retrieved. Again, the new query g comes with candidate set CS_M(g) = {G1, G2, G3, G4}. Compared with the subgraph case, reasoning about the logic of GC+ in the supergraph case is more interesting. (i) Consider graph G1 ∈ CS_M(g): G1 does not hold validity regarding its relation with g'', hence no previous query result about G1 can be utilized; g has to rely on the sub-iso test to identify whether G1 is in the final answer set. (ii) For graph G2 ∈ CS_M(g): g'' ⊆ G2 holds, but this does not make G2 sub-iso test free even given g'' ⊆ g, as the relation between g and G2 is still unknown and has to be verified by a sub-iso test. Similarly, G3 is not free of its sub-iso test either. (iii) For graph G4 ∈ CS_M(g): G4 holds validity for its relation with g'', namely g'' ⊄ G4. Since g'' ⊆ g, if g ⊆ G4 were true, then g'' ⊆ G4 would follow, which is a contradiction. So it is safe to conclude that g ⊄ G4, and thus G4 can be removed from CS_M(g). Overall, among the graphs in CS_M(g), those failing to appear in ¬CGvalid(g'') ∪ Answer(g'') can never be in the final answer set of g and thus become sub-iso test free, where ¬CGvalid(g'') denotes the complement of CGvalid(g'') with respect to the up-to-date dataset. In other words, the set ¬CGvalid(g'') ∪ Answer(g'') covers all graphs that could possibly be in the final answer set of g; we denote it g''.Answer_super(g), i.e.,

g''.Answer_super(g) = ¬CGvalid(g'') ∪ Answer(g'')    (4)

Performing sub-iso tests on CS_M(g) ∩ g''.Answer_super(g) therefore yields the verified query answer Answer(g).
In the general case, g may be a supergraph of multiple previous query graphs g''_j. Then, the set of graphs tested for sub-iso by GC+ is:

CS_GC+super(g) = CS_M(g) ∩ ⋂_{g''_j ∈ Result_super(g)} g''_j.Answer_super(g)    (5)

where Result_super(g) contains all the currently cached queries of which g is a supergraph.
The final answer for query g, Answer(g), is the set of graphs in CS_GC+super(g) that pass the sub-iso test.

Lemma 4. The final answer of GC+ in the supergraph case does not contain false positives.
Proof. This follows trivially by construction, as all graphs in Answer(g) have passed subgraph isomorphism testing at the final stage of query processing.

Lemma 5. The final answer of GC+ in the supergraph case does not introduce false negatives.
Proof. Assume false negatives are possible and consider the first ever false negative produced by GC+; i.e., for some query g, ∃G_FN such that g ⊆ G_FN and G_FN ∉ Answer(g). Since G_FN ∈ CS_M(g), the only possibility for error is that G_FN is removed from CS_GC+super(g) by formula (5). This implies that ∃g'' such that g'' ⊆ g and G_FN ∉ g''.Answer_super(g). By formula (4), it follows that G_FN ∉ ¬CGvalid(g'') and G_FN ∉ Answer(g''), with G_FN ∉ ¬CGvalid(g'') ⇒ G_FN ∈ CGvalid(g''). Hence, for the up-to-date dataset, G_FN ∉ Answer(g'') is valid, i.e., g'' ⊄ G_FN. But g ⊆ G_FN and g'' ⊆ g ⇒ g'' ⊆ G_FN (a contradiction).

Theorem 6. The final answer of GC+ in the supergraph case is correct.
Proof. There are only two possibilities for error: GC+ can produce false negatives or false positives. The theorem then follows straightforwardly from Lemmas 4 and 5.

6.3 Putting It All Together and Optimal Cases
The Query Processing Runtime subsystem first applies formula (2) on CS_M and then applies formula (5) on the result of the previous operation. The end result is a reduced candidate set, which is then sub-iso tested.
Additionally, there are two optimal cases that warrant further performance gains. First, note that GC+ can easily recognize the case where a new query g is isomorphic to a previously cached query g'. For connected query graphs, this holds provided that (i) g ⊆ g' or g ⊇ g'; (ii) g and g' have the same number of nodes and edges; and (iii) g' holds validity on all the up-to-date dataset graphs. In this case, GC+ can return the cached result of g' directly, requiring no sub-iso test at all. Second, consider the case where, for a new query g, GC+ discovers a cached query g'' such that g'' ⊆ g, Answer(g'') = ∅, and g'' holds validity on all graphs currently in the dataset. Then GC+ can directly return an empty result set for g. The reasoning is that if there were a dataset graph G such that g ⊆ G, then since g'' ⊆ g we would conclude g'' ⊆ G ⇒ G ∈ Answer(g''), contradicting the fact that Answer(g'') = ∅; thus no such graph G can exist and the final result set of g is necessarily empty.
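As a companion to the sketch after §6.1, the following Java sketch illustrates formulas (4)–(5) and the pruning order of §6.3: the candidate set is assumed to have already been reduced by formula (2), and is then intersected with every cached supergraph-case bound. As before, BitSet bits are indexed by dataset graph id and all names are illustrative.

    import java.util.BitSet;
    import java.util.List;

    // Illustrative sketch of formulas (4)-(5) and the pruning order of Section 6.3.
    // datasetSize = m + 1, i.e., one bit per current dataset graph id.
    final class SupergraphCasePruner {
        // g''.Answer_super(g) = complement(CGvalid(g'')) union Answer(g'')       -- formula (4)
        static BitSet answerSuper(BitSet cgValid, BitSet answer, int datasetSize) {
            BitSet bound = (BitSet) cgValid.clone();
            bound.flip(0, datasetSize);   // complement w.r.t. the up-to-date dataset
            bound.or(answer);
            return bound;
        }

        // CS_GC+super(g): intersect the candidate set (already pruned by formula (2))
        // with every cached bound g''_j.Answer_super(g)                           -- formula (5)
        static BitSet prune(BitSet csAfterSubgraphPruning,
                            List<BitSet> cgValids, List<BitSet> answers, int datasetSize) {
            BitSet cs = (BitSet) csAfterSubgraphPruning.clone();
            for (int j = 0; j < cgValids.size(); j++) {
                cs.and(answerSuper(cgValids.get(j), answers.get(j), datasetSize));
            }
            return cs;   // the graphs that still require sub-iso testing by M_verifier
        }
    }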
7. PERFORMANCE EVALUATION

7.1 Experimental Setup
GC+ is implemented in Java, extending the graph caching system GC [26]. Experiments were performed on a Dell R920 host (4 Intel Xeon E7-4870 CPUs with 15 cores each, 320GB of RAM and 4×1TB disks), running Ubuntu Linux 14.04.4 LTS. Following GC, the default upper limits in GC+ on the sizes of the Cache and Window stores were 100 and 20 queries respectively.
Method M. We used three well-established SI methods: GraphQL (GQL) as provided by [14] and a modified VF2 [3] (denoted VF2+) as provided by [11], both being well-established and good performers [14, 8]; we also used vanilla VF2 [3], as it is extensively used in FTV methods.
Dataset. We employ the real-world dataset AIDS [19] (the Antiviral Screen Dataset of the National Cancer Institute), which is prevalently used in the literature [14, 7, 1, 11]. AIDS contains 40,000 graphs, each with on average ≈45 vertices (std.dev.: 22, max: 245) and ≈47 edges (std.dev.: 23, max: 250), whereby the few largest graphs have an order of magnitude more vertices and edges.
Dataset Change Plan. Dataset change operations are performed in batches, with occurrence times indicated by the id of queries in the workload (recall Figure 2). The plan we used for AIDS consists of 2,000 operations (100 batches of 20 operations each) during the processing of 10,000 queries. A batch of operations is generated as follows: first, an occurrence time for the batch is selected uniformly at random from the query ids; then, a specific operation is codetermined by a type selected uniformly from {ADD, DEL, UA, UR}, a graph selected uniformly from the dataset (ADD draws from the initial dataset instead of synthesizing additional graphs, so as to preserve the original dataset characteristics as much as possible; DEL, UA and UR use the up-to-date dataset at running time), and, when the selected type is UA or UR, a uniformly selected edge within that graph (UA adds an edge that is not yet in the graph; UR removes an existing edge of the graph). This process is repeated until the batch contains the required number of operations.
We follow the established principle for generating our workloads, using two different algorithms to synthesize queries from the initial dataset graphs. Each workload consists of 10,000 queries with sizes typical in the literature [11, 27] – 4, 8, 12, 16 and 20 edges.
Type A Workloads. Queries of these workloads are generated in the following manner: first, a source graph is randomly selected from the dataset graphs; then, a node is randomly selected in the said graph; finally, a query size is selected uniformly at random from the given sizes and a BFS is performed starting from the selected node. For each new node, all its edges connecting it to already visited nodes are added to the generated query, until the desired query size is reached. For the first two random selections above, we have used two different distributions, namely Uniform (U) and Zipf (Z), with the probability density function of the latter given by p(x) = x^(-α)/ζ(α), where ζ is the Riemann Zeta function [21]. Ultimately, we had three categories of Type A workloads: "UU", "ZU" and "ZZ", where the first letter in each pair denotes the distribution used for selecting the starting graph, and the second that for the starting node.
Type B Workloads (with no-answer queries). These workloads are generated as follows. For each of the query sizes, we first create two query pools: a 10,000-query pool with queries with non-empty answer sets against the initial dataset, and a second, 3,000-query pool with no match in any dataset graph (i.e., an empty result set). Queries for the first pool are extracted from dataset graphs by uniformly selecting a start node across all nodes in all dataset graphs, and then performing a random walk till the required query graph size is reached. The generation of no-answer queries has one extra step: we continuously relabel the nodes in the query with randomly selected labels from the dataset, until the resulting query has a non-empty candidate set but an empty answer set against the dataset graphs. Once the query pools are filled, we generate workloads by first flipping a biased coin to choose between the two pools (with the "no-answer" pool selected with probability 0%, 20% or 50%), then randomly (Zipf) selecting a query from the chosen pool. We thus have three categories of Type B workloads: "0%", "20%" and "50%", denoting the above probability.
We use Zipf α = 1.4 by default; as a reference point, web page popularities follow a Zipf distribution with α = 2.4 [21]. We allow a warm-up of only one Window (i.e., 20 queries) before we start measuring GC+'s performance. We report on both the benefits and the overheads of GC+. Reported metrics include query time and number of sub-iso tests per query, along with the speedups introduced by GC+. Speedup is defined as the ratio of the average performance (query time or number of sub-iso tests) of the base Method M over the average performance of GC+ when deployed over Method M (i.e., speedups >1 indicate improvements). As a yardstick, [17] (also a cache, but for XML databases) reports a query time speedup of 2.6× with 10,000-query workloads generated using Zipf α = 1.5 and a 1,500-query warm-up.
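Before turning to the replacement policy, the following Java sketch illustrates the Zipf (Z) selection used above for picking source graphs and nodes. Since the pools are finite, the sketch truncates and renormalizes the mass p(x) ∝ x^(-α) over ranks 1..n; this truncation is our assumption, as the paper only gives the general form p(x) = x^(-α)/ζ(α).

    import java.util.Random;

    // Illustrative sketch of a truncated Zipf selector over n items (graphs or nodes).
    final class ZipfSelector {
        private final double[] cumulative;  // cumulative distribution over ranks 1..n
        private final Random random;

        ZipfSelector(int n, double alpha, long seed) {
            double[] weights = new double[n];
            double sum = 0.0;
            for (int x = 1; x <= n; x++) {
                weights[x - 1] = Math.pow(x, -alpha);
                sum += weights[x - 1];
            }
            cumulative = new double[n];
            double acc = 0.0;
            for (int i = 0; i < n; i++) {
                acc += weights[i] / sum;
                cumulative[i] = acc;
            }
            random = new Random(seed);
        }

        // Returns a 0-based index; low indices (popular items) are drawn most often.
        int next() {
            double u = random.nextDouble();
            for (int i = 0; i < cumulative.length; i++) {
                if (u <= cumulative[i]) return i;
            }
            return cumulative.length - 1;
        }
    }

    // Example: pick source graphs for a "ZU" workload with the paper's default alpha = 1.4:
    //   ZipfSelector graphPicker = new ZipfSelector(40000, 1.4, 42);
    //   int sourceGraphId = graphPicker.next();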
Cache Replacement Policy. GC+ incorporates all the replacement policies developed in GC. Here, we use the hybrid (HD) policy for our experiments, as its performance is always better than or on par with the best alternative [26]. Specifically, HD coalesces two other GC/GC+-exclusive policies, PIN and PINC. PIN scores each graph in the cache by the total number of subgraph isomorphism tests alleviated by the said graph (coined R). PINC extends this ranking by taking into account the possibly vast differences in query execution time (denoted C), where cost is estimated by a heuristic [25]. As C is estimated, using it in PINC does not always lead to improvements in query time. Through a large number of experiments, we have observed that when the values of R exhibit high variability, they are discriminative enough on their own, and considering C can actually lead to lower time gains (i.e., PIN performs better than PINC). However, when the values of R exhibit low variability, adding in the C component leads to considerable query time improvements. When the HD policy is invoked, it first retrieves the R values from the Statistics Manager and computes their variability [20] using the (squared) coefficient of variation (CoV). CoV is defined as the ratio of the (square of the) standard deviation over the (square of the) mean of the distribution. When CoV > 1, the associated distribution is deemed of high variability, as exponential distributions have CoV = 1 and typically hyper-exponential distributions (which capture many high-variance, heavy-tailed distributions) have CoV > 1. In this case, HD performs cache eviction using PIN's scoring scheme; otherwise, it turns to PINC's scoring scheme.
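The HD dispatch rule above boils down to one statistic and one threshold; the short Java sketch below shows the squared-CoV computation over the R values and the resulting choice. It only illustrates the decision rule; the actual PIN/PINC scoring code of GC+ is not shown, and the names used here are labels rather than GC+ APIs.

    // Illustrative sketch of HD's dispatch rule: squared CoV of R > 1 selects PIN, else PINC.
    final class HybridPolicySketch {
        enum Scoring { PIN, PINC }

        // Squared CoV = variance / mean^2 (assumes a non-empty sample with non-zero mean).
        static double squaredCoV(double[] rValues) {
            double mean = 0.0;
            for (double r : rValues) mean += r;
            mean /= rValues.length;
            double variance = 0.0;
            for (double r : rValues) variance += (r - mean) * (r - mean);
            variance /= rValues.length;
            return variance / (mean * mean);
        }

        static Scoring choose(double[] rValues) {
            if (rValues.length == 0) return Scoring.PINC;  // degenerate case: no statistics yet
            // High variability (CoV > 1, as for hyper-exponential distributions): R alone is
            // discriminative enough, so use PIN; otherwise factor in the estimated cost C via PINC.
            return squaredCoV(rValues) > 1.0 ? Scoring.PIN : Scoring.PINC;
        }
    }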
7.2 Results and Insights

Figure 4: GC+ Speedup in Query Time

Figure 4 depicts the query time speedups of GC+ across all methods M and workloads. We can see that CON achieves considerable speedups with the meagre 100-query cache configuration, whereas the gains of EVI are limited. Note the interesting finding that the speedups for the UU workload are close to those of ZU (e.g., 4.77 vs 5.13 with the VF2 base method), whereas one might have expected a different outcome. Intuitively, the ZU workload bears more exact-match hits than UU, due to the skewness of selecting source graphs during query generation. And it does: we measured 2.5X the number of exact-match cache hits in ZU vs UU. However, in GC+, an exact-match cache hit does not necessarily render a query sub-iso test free (see §6.3) – queries resulting in zero sub-iso tests account for merely 4% of the exact-match cache hits of the said ZU workload (vs 11% in UU). Moreover, recall that GC+ also exploits subgraph/supergraph hits. Indeed, we measured circa 2X such matches for the UU workload vs ZU. Of course, the overall performance result is a very complex picture and depends on how large the benefit of each saved exact-match is compared to that of each saved subgraph/supergraph match. But the key insight here is that, by utilizing exact-matches and subgraph/supergraph matches, GC+ can benefit both skewed and non-skewed workloads.

Figure 5: GC+ Speedup in Number of Sub-iso Tests

Figure 5 shows the speedups in the number of sub-iso tests performed. Please note that under a given configuration (specified by dataset, dataset change plan, workload, replacement policy, cache model EVI/CON, and the upper limits on the sizes of the Cache and Window stores), whatever SI method serves as Method M, GC+ produces exactly the same pruned candidate set for each query. Therefore, Figure 5 is independent of the three methods considered in this work. Juxtaposing Figures 4 and 5 leads to an interesting insight: reductions in the number of sub-iso tests do not translate directly into reductions in query time, echoing the claim that graph cache hits render different benefits [26]. In all cases, though, GC+ achieves significant improvements in both query processing time and number of sub-iso tests performed.

Figure 6: Average Execution Time and Overhead per Query

Figure 6 depicts a break-down of query processing time for Method M, EVI and CON. EVI pays an overhead for updating the Window and Cache data stores, including executing the cache replacement algorithms and re-indexing the cached graphs. The overhead of CON additionally covers the time for analyzing the dataset log and validating the cache (see Algorithms 1 and 2) – this CON-specific cost is trivial, taking less than 1% of the CON overhead across all the aforementioned workloads and methods M. This confirms a significant conclusion: the CON-exclusive Algorithms 1 and 2 are efficient. The dominant part of the CON overhead (for updating the Window and Cache data stores) is higher than that of EVI, as the latter is frequently purged and hence bears less to be updated. Putting the speedup and overhead figures side by side, it is obvious that CON sweeps past EVI in query processing speedup at a negligible additional overhead.

8. CONCLUSIONS
We presented a systematic solution to handle graph cache consistency, by providing an upgraded system, GC+, capable of expediting general subgraph/supergraph queries under dataset changes. We developed two GC+-exclusive cache models, EVI and CON, with different mechanisms for ensuring cache consistency. We illustrated the specific logic of GC+ in reducing the candidate set for query execution and formally proved its correctness. Our performance evaluation has demonstrated the considerable speedup achieved by CON. Future work includes further optimizing the CON cache with retrospective validation mechanisms, developing a distributed/decentralized version of GC+, and extending GC+ to benefit subgraph queries that find all occurrences of a query graph within a single massive graph.
9. REFERENCES
[1] V. Bonnici et al. Enhancing graph database indexing by suffix tree structure. In Proc. IAPR PRIB, 2010.
[2] S. Bottcher. Cache consistency in mobile XML databases. In Proc. WAIM, pages 300–312, 2006.
[3] L. P. Cordella, P. Foggia, C. Sansone, and M. Vento. A (sub)graph isomorphism algorithm for matching large graphs. IEEE TPAMI, 26(10):1367–1372, 2004.
[4] M. Garey and D. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, 1979.
[5] A. Gubichev and T. Neumann. Exploiting the query structure for efficient join ordering in SPARQL queries. In Proc. EDBT, 2014.
[6] H. Guo et al. Relaxed currency and consistency: How to say "good enough" in SQL. In Proc. SIGMOD, 2004.
[7] W.-S. Han, J. Lee, M.-D. Pham, and J. X. Yu. iGraph: A framework for comparisons of disk-based graph indexing techniques. PVLDB, 3(1-2):449–459, 2010.
[8] F. Katsarou, N. Ntarmos, and P. Triantafillou. Performance and scalability of indexed subgraph query processing methods. PVLDB, 8(12):1566–1577, 2015.
[9] F. Katsarou, N. Ntarmos, and P. Triantafillou. Subgraph querying with parallel use of query rewritings and alternative algorithms. In Proc. EDBT, 2017.
[10] J. Kim, H. Shin, and W.-S. Han. Taming subgraph isomorphism for RDF query processing. PVLDB, 8(11):1238–1249, 2015.
[11] K. Klein, N. Kriege, and P. Mutzel. CT-index: Fingerprint-based graph indexing combining cycles and trees. In Proc. ICDE, 2011.
[12] L. Lai et al. Scalable subgraph enumeration in MapReduce. PVLDB, 8(10):974–985, 2015.
[13] L. V. S. Lakshmanan et al. Answering tree pattern queries using views. In Proc. VLDB, 2006.
[14] J. Lee et al. An in-depth comparison of subgraph isomorphism algorithms in graph databases. PVLDB, 6(2):133–144, 2012.
[15] K. Lillis and E. Pitoura. Cooperative XPath caching. In Proc. SIGMOD, 2008.
[16] J. Lorey et al. Caching and prefetching strategies for SPARQL queries. In Proc. USEWOD, 2013.
[17] B. Mandhani and D. Suciu. Query caching and view selection for XML databases. In Proc. VLDB, 2005.
[18] M. Martin, J. Unbehauen, and S. Auer. Improving the performance of semantic web applications with SPARQL query caching. In Proc. ESWC, 2010.
[19] NCI - DTP AIDS antiviral screen dataset. http://dtp.nci.nih.gov/docs/aids/aids_data.html.
[20] R. Nelson. Probability, Stochastic Processes, and Queueing Theory. Springer Verlag, 1995.
[21] M. Newman. Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46:323–351, 2005.
[22] N. Papailiou, D. Tsoumakos, P. Karras, and N. Koziris. Graph-aware, workload-adaptive SPARQL query caching. In Proc. SIGMOD, 2015.
[23] K. Semertzidis and E. Pitoura. Durable graph pattern queries on historical graphs. In Proc. ICDE, 2016.
[24] Z. Sun et al. Efficient subgraph matching on billion node graphs. PVLDB, 5(9):788–799, 2012.
[25] J. Wang, N. Ntarmos, and P. Triantafillou. Indexing query graphs to speedup graph query processing. In Proc. EDBT, 2016.
[26] J. Wang, N. Ntarmos, and P. Triantafillou. GraphCache: a caching system for graph queries. In Proc. EDBT, 2017.
[27] X. Yan et al. Graph indexing: a frequent structure-based approach. In Proc. SIGMOD, 2004.