<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SWARMGUIDE: Towards Multiple-Query Optimization in Graph Databases</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zahid Abul-Basherx Parke Godfreyy</string-name>
          <email>godfrey@cse.yorku.ca</email>
          <email>zahid@mie.utoronto.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikolay Yakovetsy Mark Chignellx</string-name>
          <email>chignell@mie.utoronto.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Graph Databases &amp; Queries</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Preliminaries. A graph database G is a nite, directed, edge-labeled, multigraph de ned by G = hN; ; Ei, where N is a nite set of nodes (vertices), is a set of labels, E is a set of directed, labeled edges, and E N N . A path p in G is de ned as a sequence of n0a0n1 nk 1ak 1nk where ni 2 N , ai 2 , and hni; ai; ni+1i 2 E for 0 i k. We call the sequence of edge labels of a particular path p the word, !(p) that p induces. A regular path rp is a path in the graph where !(rp) is a word in a given regular language L(reg) (e.g., !(rp) 2 L(reg)). A regular path query (RPQ) [2] is a triple hx; reg; yi in which x and y are free variables over the domain of nodes, and reg is a regular expression. An answer of an RPQ is a node-pair hs; ti (s; t 2 N ) such that there is a path p in G between s and t and !(p) 2 L(reg). The answer set of an RPQ is the set of all its answers. The RDF data model and SPARQL query language instantiate these concepts. With the introduction of property paths in SPARQL 1.1, the query language encompasses RPQs. For our examples in this paper, we adopt SPARQL syntax.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Analytic tasks over graph databases often require sequences of related queries to
be evaluated. While there are promising query optimization methods for graph
queries [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], these do not optimize globally for sets of queries. For relational,
multiple query optimization (MQO) has been studied [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We propose Swarmguide,
a framework for MQO for graph databases.
      </p>
      <p>
        Evaluation of RPQs. Evaluating an RPQ is a two-step procedure. Path
recognition determines whether the word of path p, !(p), is recognized by the regular
expression reg (i.e., !(p) 2 L(reg)); path search nds pairs of nodes, s; t 2 N
such that there exists such a path p from node s to node t. There has been
recent interest in how to evaluate RPQs e ciently in SPARQL over RDF stores.
Yakovets et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] demonstrate in their Waveguide framework that there is a
rich plan space for such queries, and that di erent plans for a given query can
di er by orders of magnitude in evaluation cost. A simple cost model for such
a plan is how many edges traversals (edge-walks) in the graph are made during
the plan's evaluation.
      </p>
      <p>Consider Q1 = hx; director of =actor=born in; yi over the movie graph in
Figure 1. This query can be evaluated in di erent ways, with di erent
edgewalk costs. We can evaluate it in forward-direction by retrieving all node-pairs
connected by edge label director of then appending with those linked by actor
and then by born in. We can also evaluate it in the backward direction by
retrieving all node-pairs connected by edge label born in then prepending with
those connected by actor and then by director of . A third way is from the middle
by retrieving all node-pairs connected by edge label actor then appending with
those connected by born in and then prepending by director of . Note that, the
edge-walk cost for the rst is 10, in the second is 9, and in the third is 7.</p>
      <p>In Waveguide, path search is performed e ciently while simultaneously
recognizing the path expressions. Waveguide's input is a graph database G and
a waveplan (WP) PQ which guides a number of search wavefronts that explore
the given graph. The term wavefront is introduced to refer to a part of the
plan that evaluates breadth- rst during the evaluation. This graph exploration,
driven by an iterative search procedure, is inspired by the semi-nave bottom-up
strategy used in the evaluation of linear recursive expressions based on a xpoint.</p>
      <p>The key idea is to expand repeatedly the search wavefronts until no new
answers are produced; i.e., we reach a xpoint. Each search wavefront is guided
by an automaton in the plan, a nite state machine based on an NFA.</p>
      <p>Consider query plans for Q1 = hx; director of =actor=born in; yi as in Figure
2. Plan A employs a WP corresponding to a single FA that directly encodes
a recognizer for the query's regular expression. This plan captures the notion
of \pipelining". Whereas Plan B employs a WP that consists of two subplan
automata, the rst subplan (SP) is used as a view for transition over states in
the second plan. This captures the notion of \materialization". Note that in the
second subplan automaton, we used a prepend transition ( director of ) over the
previous state.</p>
      <p>The Waveguide prototype implements guided graph search as
procedural SQL on top of PostgreSQL. However, any type of back-end physical graph
database model (e.g., triple store or adjacency lists) can be used.</p>
    </sec>
    <sec id="sec-2">
      <title>SWARMGUIDE: The Framework</title>
      <p>
        Queries running in a batch may have common subexpressions among them.
Instead of running each query individually, one could bene t from sharing the
results of common subexpressions [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Consider two queries, Q1 = hx; director of =
actor=born in; yi and Q2 = hx; actor=born in=located in; yi as in Figure 2. Clearly,
the optimal automaton plans (Plan A) do not share any common subplan for
Q1 and Q2. However, if we chose suboptimal plans (Plan B) then we could have
a common sub-plan automaton (actor=born in). This common subplan could be
shared among Q1 and Q2 without re-evaluating it.
      </p>
      <p>Evaluation of MRPQs. The evaluation of multiple regular path queries
(MRPQs) is two-step: identifying common subexpressions among queries; and
searching for a global optimal plan such that its cost is less than the summed cost of
the local optimal plans in the batch. We present our framework Swarmguide,
a generic framework for optimizing multiple regular path queries (MRPQs) in
graph databases. Figure 3 shows Swarmguide architecture.</p>
      <p>
        Clustering Queries. Clustering queries is a preprocessing step to nding the
commonalities among the RPQs in the batch. These can be detected by nding
the isomorphic subgraphs of their corresponding nite automata (FAs). This
process is known to be NP-hard, in general [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Therefore, we use heuristics to
group only queries that can have common sub-automata, and then do the hard
task of identifying the common sub-automata within each group.
      </p>
      <p>
        Finding Common Sub-automata. Finding common sub-automaton
resembles nding the maximum common subgraph among graphs. The problem
becomes more di cult here as we need to nd the largest common subgraphs
(sub-automata) for multiple graphs (FAs). Most existing solutions only consider
non-labeled edges and nodes in undirected graphs. We adopt the solution from
the maximal common-edge subgraph (MCES) problem [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to detect common
subautomata. This has three steps: transforming labeled-graphs into the equivalent
line-graphs; producing a product graph from the line-graphs; and detecting the
maximal cliques in the product graph, which corresponds to MCESs (therefore,
common sub-automata).
      </p>
      <p>Query Rewriting. A regular expression reg can be recognized equivalently by
many FAs. For instance, the two FAs for two regular expressions reg1 = (ab) ac
and reg2 = a(ba) c are equivalent; however, the plan space may di er, depending
on the chosen FA. In this step, we rewrite FAs according to their shared common
sub-automata.</p>
      <p>
        Global Optimization. A local optimal plan for a given query may not be the
best plan when optimizing a batch of queries. The goal of the global optimization
is then to search local plans of queries to nd a global plan by choosing one local
plan per query in a cluster of RP Qs. The cost of this global plan should be less
than, or equal to, the total cost of local optimal plans of queries in the batch.
In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the authors propose cost-based heuristic algorithms for this; we likewise
adapt their algorithms for here.
      </p>
      <p>Other Considerations. A single FA plan is often decomposed into several
subplans as views. When planning globally, one has to check if views from other
queries' plans can be shared. Intermediate node-pairs|along with their states
in FA|are cached to avoid unbounded computation over cyclic graphs. When
evaluating a global plan in MRPQs, these local caches need to be maintained
globally so subplans shared among queries can be leveraged. The global cache
also can be used to obtain useful statistics about the graph to obtain a more
accurate global optimal plan and choosing the initial nodes for search exploration.
4</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusions</title>
      <p>Graph database analytic applications are rapidly on the rise. These raise new
and arduous challenges. We can apply many of the lessons learned from the
relational domain|e.g., pipelining, materialized views, cost-based optimization,
query answering by views, and multiple query optimization|to these challenges
to great e ect. However, these techniques have to be carefully re-imagined to be
e ective for graph databases, their queries, and applications.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>W.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kementsietsidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Duan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Scalable multi-query optimization for SPARQL</article-title>
          .
          <source>In Data Engineering (ICDE)</source>
          ,
          <year>2012</year>
          IEEE 28th International Conference on, pages
          <volume>666</volume>
          {
          <fpage>677</fpage>
          . IEEE,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A. O.</given-names>
            <surname>Mendelzon</surname>
          </string-name>
          and
          <string-name>
            <given-names>P. T.</given-names>
            <surname>Wood</surname>
          </string-name>
          .
          <article-title>Finding regular simple paths in graph databases</article-title>
          .
          <source>SIAM Journal on Computing</source>
          ,
          <volume>24</volume>
          (
          <issue>6</issue>
          ):
          <volume>1235</volume>
          {
          <fpage>1258</fpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Raymond</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Willett</surname>
          </string-name>
          .
          <article-title>Maximum common subgraph isomorphism algorithms for the matching of chemical structures</article-title>
          .
          <source>Journal of computer-aided molecular design</source>
          ,
          <volume>16</volume>
          (
          <issue>7</issue>
          ):
          <volume>521</volume>
          {
          <fpage>533</fpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>P.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Seshadri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sudarshan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhobe</surname>
          </string-name>
          .
          <article-title>E cient and extensible algorithms for multi query optimization</article-title>
          .
          <source>In ACM SIGMOD Record</source>
          , volume
          <volume>29</volume>
          , pages
          <fpage>249</fpage>
          {
          <fpage>260</fpage>
          . ACM,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>T. K.</given-names>
            <surname>Sellis</surname>
          </string-name>
          .
          <article-title>Multiple-query optimization</article-title>
          .
          <source>ACM Transactions on Database Systems (TODS)</source>
          ,
          <volume>13</volume>
          (
          <issue>1</issue>
          ):
          <volume>23</volume>
          {
          <fpage>52</fpage>
          ,
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>N.</given-names>
            <surname>Yakovets</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Godfrey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Gryz</surname>
          </string-name>
          .
          <article-title>Query planning for evaluating SPARQL property paths</article-title>
          .
          <source>In Proceedings of the ACM SIGMOD International Conference on Management of Data</source>
          , pages
          <volume>1</volume>
          {
          <fpage>15</fpage>
          , San Francisco, CA, USA,
          <year>June 2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>