<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Workload-Aware Streaming Graph Partitioning</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Hugo</forename><surname>Firth</surname></persName>
							<email>h.firth@ncl.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="department">School of Computing Science</orgName>
								<orgName type="institution">Newcastle University</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Paolo</forename><surname>Missier</surname></persName>
							<email>paolo.missier@ncl.ac.uk</email>
							<affiliation key="aff1">
								<orgName type="department">School of Computing Science</orgName>
								<orgName type="institution">Newcastle University</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Workload-Aware Streaming Graph Partitioning</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">CF6FBE3F49789388558201D7165FF3A9</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T21:39+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>H</term>
					<term>2</term>
					<term>4 [Database Management]: Systems</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Partitioning large graphs, in order to balance storage and processing costs across multiple physical machines, is becoming increasingly necessary as the typical scale of graph data continues to increase. A partitioning, however, may introduce query processing latency due to inter-partition communication overhead, especially if the query workload exhibits skew, frequently traversing a limited subset of graph edges. Existing partitioners are typically workload agnostic and susceptible to such skew; they minimise the likelihood of any edge crossing partition boundaries.</p><p>We present our progress on LOOM: a streaming graph partitioner based upon efficient existing heuristics, which reduces inter-partition traversals when executing a stream of sub-graph pattern matching queries Q. We are able to continuously summarise the traversal patterns caused by queries within a window over Q. We do this using a generalisation over a trie data structure, which we call TPSTry++, to compactly encode frequent sub-graphs, or motifs, common to many query graphs in Q. When the graph-stream being partitioned contains a match for a motif, LOOM uses graph-stream pattern matching to capture it, and place it wholly within partition boundaries. This increases the likelihood that a random query q ∈ Q may be answered within a single partition, with no inter-partition communication to introduce additional latency.</p><p>Finally, we discuss the potential pitfalls and drawbacks which exist with our approach, and detail the work yet to be completed.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>Recently there has been a proliferation of web hyperlinks, social network users, protein interaction networks, and other content readily modelled as large graphs. Sub-graph pattern matching over these graphs is common to many modern applications, including fraud detection <ref type="bibr" target="#b18">[18]</ref>, recommender systems <ref type="bibr" target="#b7">[7]</ref> and genome analysis <ref type="bibr" target="#b4">[4]</ref>. Pattern matching can be simply discussed in terms of sub-graph isomorphism, where, given a labelled query graph Gq i and a labelled parent graph G, the answer to the query is all sub-graphs in G which are isomorphic to Gq i ; i.e. all sub-graphs which have the same structure (vertices, edges, and labels) as Gq i . For instance, given the graph and query workload in figure <ref type="figure" target="#fig_0">1</ref>, the answer to q1 would be the sub-graph of G containing the vertices 1, 2, 5, 6 and their interconnecting edges. Figure <ref type="figure" target="#fig_0">1</ref> also demonstrates query graphs which share common substructure. In this work, we exploit the existence of such frequently reoccurring sub-graphs within a query workload Q to improve Q's performance over large, distributed graphs. We refer to frequent sub-graphs of query graphs as motifs.</p><p>Pattern matching is computationally complex and, over "big-graph data", would prove prohibitively expensive to a single commodity machine. Distributed graph partitioning has long been seen as a viable approach to address such scalability issues in graph processing frameworks <ref type="bibr" target="#b11">[11,</ref><ref type="bibr" target="#b12">12]</ref>, and graph database management systems (GDBMS) <ref type="bibr" target="#b1">[1]</ref>. These systems distribute vertices and computation across multiple machines, using a simple hash function to determine vertex placement by default. Although a hash-partitioning is efficient to compute and creates partitions with even numbers of vertices, it ignores vertex locality, and is therefore to create a large number of inter-partition edges. This is undesirable, incurring a high communication overhead between partitions for many types of graph workload, including pattern matching queries.</p><p>The problem of minimising the number of inter-partition, or cut, edges in a distributed graph, whilst maintaining an even distribution of vertices, is known as k-balanced graph partitioning, which is NP-Hard <ref type="bibr" target="#b3">[3]</ref>. Despite this, there exist several practical solutions <ref type="bibr" target="#b8">[8,</ref><ref type="bibr" target="#b17">17,</ref><ref type="bibr" target="#b19">19,</ref><ref type="bibr" target="#b20">20]</ref> to the problem. Some, such as the state-of-the-art "offline" partitioner METIS <ref type="bibr" target="#b8">[8]</ref>, are memory intensive, their performance suffering over graphs with billions of vertices <ref type="bibr" target="#b19">[19]</ref>. They may also have to perform expensive repartitioning in the presence of graph changes. Others <ref type="bibr" target="#b17">[17,</ref><ref type="bibr" target="#b19">19,</ref><ref type="bibr" target="#b20">20]</ref> adopt a simpler streaming graph partitioning model. A graph-stream is an ordering over the elements of a dynamic, growing graph, often by creation time. Social networks are often viewed as graphstreams <ref type="bibr" target="#b16">[16]</ref>. Procedures on graph-streams will usually consider each graph element, in order, just once and therefore likely have very good complexity, regardless of graph size. Streaming graph partitioners typically produce more interpartition edges than METIS, but are much faster.</p><p>Although these streaming graph partitioners do reduce the number of cut edges and seamlessly handle graph updates, they are agnostic to the specific workload being executed over the graph partitioning: edges to be cut are computed purely based upon graph structure. For some workloads, which may traverse a limited subset of edges in a graph, such partitionings may incur unnecessary communication overhead <ref type="bibr" target="#b15">[15,</ref><ref type="bibr" target="#b20">20]</ref>. An example of such a workload is a one of pattern matching queries, where the topologies which are likely to be traversed are those which correspond to motifs defined in query graphs. In this work we focus on a different measure of partitioning quality: the probability of inter-partition traversals which is different from the number of inter-partition edges, given a workload Q.</p><p>Whilst systems such as LogGP <ref type="bibr" target="#b20">[20]</ref> do attempt to collect runtime statistics in order to improve graph partitioning for a given workload, they are focused on the Bulk-Synchronous-Parallel (BSP) model of computation used by Pregel-like systems for "offline" analytical workloads. Our goal, however, is to improve graph partitionings for a given "online" workload of pattern matching queries over a dynamic labelled graph, as would be common to GDBMS.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Contributions</head><p>We describe a novel extension to existing streaming graph partitioning methods <ref type="bibr" target="#b17">[17]</ref>, aimed at avoiding introducing unnecessary inter-partition communication overhead for a given query workload. More precisely, let Q be a workload of queries over G, along with the relative frequency of each query in Q. We are able to efficiently derive the most common motifs from the query graphs in Q. Using an approach to graph-stream pattern matching <ref type="bibr" target="#b16">[16]</ref>, we identify those sub-graphs in G which match these motifs, and are therefore likely to be traversed during the execution of a random q ∈Q. Having grouped a graph-stream into frequently traversed sub-graphs, we are then able to use the successful Linear Deterministic Greedy heuristic <ref type="bibr" target="#b17">[17]</ref> (LDG) to assign these to a partition, excepting some balance constraints. This increases the likelihood that a random q ∈ Q can be processed without causing inter-partition traversals and communication overhead.</p><p>Concretely, this work makes the following contributions:</p><p>• We extend an existing streaming graph partitioning approach <ref type="bibr" target="#b17">[17]</ref>, to account for the probabilities of crossing partition boundaries during execution of a query from a given workload Q.</p><p>• We present an efficient intensional representation of the probable edge traversals caused by a given workload of sub-graph pattern matching queries Q. • We propose a graph-stream pattern matching approach to transform a stream of vertices and edges G into a sequence of motifs, where each motif represents a subgraph in G likely to be traversed during execution of a random q ∈ Q.</p><p>The rest of this paper is organised as follows. In the subsequent section we discuss background material and related work. In Section 4.2 we present the TPSTry++ datastructure for capturing an intensional representation of graph traversals from a query workload. In Section 4.3 we present a detailed overview of the streaming graph partitioner which accounts for the workloads captured in Section 4.2. In the Conclusion we discuss our progress with this work, highlight its limitations, and present some potential avenues for future study.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">DEFINITIONS</head><p>A labelled graph G = (V, E, LV , f l ) is of the form: a set of vertices V = {v1, v2, ..., vn}, a set of pairwise relationships called edges e = (vi, vj) ∈ E and a set of vertex labels LV . The function f l : V → LV is a surjective mapping of vertices to labels.</p><p>A graph motif is simply a sub-graph structure which occurs repeatedly within a graph or graph-stream G.</p><p>A pattern matching query is defined in terms of subgraph isomorphism. Given a pattern graph Q = (VQ, EQ), a query should return G : a set of sub-graphs of G. For each returned sub-graph G i = (V i , E i ) there should exist a bijective function f such that: (a) for every vertex v ∈ V i , there exists a corresponding vertex f (v) ∈ VQ; (b) for every edge (v1, v2) ∈ E i , there exists a corresponding edge (f (v1), f (v2)) ∈ EQ; and (c) for every vertex v ∈ G i , the labels match those of the corresponding vertices in</p><formula xml:id="formula_0">Q, l(v) = l(f (v)).</formula><p>A graph partitioning is defined as an disjoint family of sets of vertices P k (V ) = {V1, V2, . . . , V k }. Each set Vi, together with its edges Ei (where ei ∈ Ei, ei = (vi, vj), and {vi, vj} ⊆ Vi), is referred to as a partition Si. A partition forms a proper sub-graph of G such that Si = (Vi, Ei), Vi ⊆ V and Ei ⊆ E.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">BACKGROUND &amp; RELATED WORK</head><p>The three main areas of work which relate to our own are: 1) graph partitioning, particularly when workload-aware; 2) frequent sub-graph mining; &amp; 3) graph-stream pattern matching. We provide an overview of the first below, but defer discussion of the latter two to sections 4.2 and 4.3 respectively, for context.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Graph partitioning</head><p>balanced graph partitioning is an NP-Hard problem with application to many areas across distributed systems and scientific computing; it has been exhaustively studied in literature since the 1970s <ref type="bibr">[6,</ref><ref type="bibr" target="#b8">8,</ref><ref type="bibr" target="#b9">9]</ref>, and several practical solutions exist <ref type="bibr" target="#b8">[8,</ref><ref type="bibr" target="#b19">19]</ref>. One such solution is METIS <ref type="bibr" target="#b8">[8]</ref>, a reliable standard for offline, fast partitioning. METIS is a multilevel technique: it computes a succession of recursively compressed graphs, partitions the smallest then "projects" that partitioning onto previous graphs in the sequence, applying local refinement techniques <ref type="bibr" target="#b9">[9]</ref> to the partitioning at each step. This produces a balanced k-way partitioning on the original graph, optimised for minimal edge cut.</p><p>Despite its prevalence, there are soome issues with METIS which makes it unsuitable for our goal of workload-aware partitioning of a large, dynamic graph. Firstly, the performance of METIS suffers in the presence of graphs with more than a few hundred million elements <ref type="bibr" target="#b19">[19]</ref>. Secondly, if a graph partitioning produced with METIS, or other offline techniques, grows over time then expensive full repartitioning operations will be required to maintain partition quality. Finally, METIS may account for a static query workload known a priori, using individual edge-weights to represent traversal frequency, however tracking this information is memory intensive, and otherwise non-trivial.</p><p>The streaming graph partitioning model <ref type="bibr" target="#b17">[17]</ref> addresses the first two of these shortcomings. By assigning vertices and edges to a partition as soon as they arrive and not storing them to perform introspection of graph structure, such partitioners are able to maintain a small memory footprint. Thus, streaming partitioners such as Fennel <ref type="bibr" target="#b19">[19]</ref>, created by Tsourakakis et al, are able to scale to large graphs unbounded by the main memory of a host machine. Also, because element placement is computed "on the fly", streaming partitioners adapt seamlessly to graph growth, applying the same placement operation for each new vertex and edge that arrives over time.</p><p>It is worth noting, however, that the heuristics used by streaming graph partitioners are sensitive to the order of graph elements in a stream <ref type="bibr" target="#b17">[17]</ref>. There are three categories of graph ordering commonly considered when evaluating streaming graph partitioners: random, adversarial and stochastic. Consider an 2-way partitioning for the graph in figure <ref type="figure" target="#fig_0">1</ref>, with a vertex ordering of V = (1, 3, 6, 8, 2, 4, 5, 7). Given no neighbours for the first half of vertices received, a naive partitioner might greedily place them in a single partition which, intuitively, causes a final balanced partitioning with the maximum edge cut: |E|. This is an adversarial ordering. For the streaming graph partitioning approach described in this work, we will consider stochastic ordering; that is, a graph-stream continuously generated by some stochastic process, such as user input.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Workload aware partitioning</head><p>To the best of our knowledge, existing streaming graph partitioning solutions do not satisfy our goal for this work: producing workload-aware graph partitionings, which account for the edge traversals patterns of a given "online" workload of sub-graph pattern matching queries.</p><p>Xu et al 's LogGP <ref type="bibr" target="#b20">[20]</ref> tackles workload aware partitioning improvement for graphs processed in Pregel-like systems <ref type="bibr" target="#b12">[12]</ref>, where operations are computed in a vertex-centric fashion across a series of supersteps. LogGP collects information about the set of vertices accessed in each superstep, using it to predicts the set to be accessed in the next. Subsequently, this meta-data is incorporated with the original graph, transforming it into a hypergraph. With a novel streaming hypergraph partitioning technique, LogGP then repartitions the graph after each superstep, reducing the communication overhead for the next and reducing the overall execution time of an operation. Though LogGP's approach is workload-aware, its dependence on supersteps and vertex-centric computation renders it unsatisfactory for our goal.</p><p>There are a number of works addressing the related problem of workload-aware data partitioning in distributed relational database systems <ref type="bibr" target="#b5">[5,</ref><ref type="bibr" target="#b13">13]</ref>. Schism <ref type="bibr" target="#b5">[5]</ref> and SWORD <ref type="bibr" target="#b13">[13]</ref> use an a priori workload to generate a hypergraph, where each edge represents a set of tuples involved in a single transaction. This hypergraph is then partitioned using a version of METIS to achieve a minimal edge-cut. Mapped back to the original database, the partitioning represents an arrangement of records which causes a minimal number of transactions in the captured workload to be distributed.</p><p>Though the goals of these works and our own are similar, they are focused on a relational data model, where typical workloads overwhelmingly consist of short 1-2 "hop" queries. It is unclear how the techniques described would perform given a workload containing many successions of JOIN operations, equivalent to the traversals required for sub-graph pattern matching. Furthermore, these works do not consider dynamic graphs at all.</p><p>In <ref type="bibr" target="#b21">[21]</ref>, Yang et al propose algorithms to efficiently analyse online query workloads and to dynamically replicate "hotspots" (clusters of vertices over 2 or more partitions which are being frequently traversed), thereby temporarily dissipating network load. Whilst highly effective at dealing with unbalanced query workloads, Yang et al focus solely upon the replication of vertices and edges using temporary secondary partitions. They do not consider workload characteristics when producing the initial partitioning, nor do they consider workload characteristics when producing it. This can result in replication mechanisms doing far more work than is necessary over time, adversely affecting the performance of a system. As a result, the partitioning technique we present here could effectively complement many workload aware replication approaches, such as this.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">LOOM PARTITIONING OVERVIEW</head><p>In this section we provide an intuition of how we are going to partition a graph-stream to account for a specific workload. Initially, we describe the efficient streaming-graph partitioning heuristic used as a base for our workload-aware extensions. The heuristic assesses characteristics of each individual vertex before placing them in an appropriate partition. However, by running efficient pattern matching procedures against a buffered window over the graph-stream, we are able to capture motifs; treating these motifs as single vertices, we may then use the same heuristic to place them wholly within beneficial partitions. If the motifs we capture correspond to those likely to be frequently traversed by a known workload Q, then this would increase the likelihood that a random q ∈Q is executed without inter-partition traversals. Thus, we subsequently present a method for continuously summarising the motifs most frequently traversed by a given stream of sub-graph pattern matching queries Q. Finally, we present our chosen pattern matching procedure and discuss the issues which currently exist with our partitioning approach.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Base partitioning heuristic</head><p>LOOM's partitioning is based upon the Linear Deterministic Greedy heuristic (LDG) proposed by Stanton and Kliot <ref type="bibr" target="#b17">[17]</ref>. LDG is a simple heuristic which seeks to assign a new vertex to the partition where it has the most edges, as this is efficient to compute, and greedily minimises the number of inter-partition edges for each vertex. In order to  <ref type="figure" target="#fig_0">1</ref> create a partitioning which is balanced in the number of vertices, each partition is given a capacity constraint C. For a given vertex v and partition Si = (Vi, Ei), the number of v's edges in Si is weighted by Si's free capacity 1</p><formula xml:id="formula_1">− |V i | C .</formula><p>In this way partitions are progressively more penalised the more vertices they contain. LDG is an effective heuristic <ref type="bibr" target="#b17">[17,</ref><ref type="bibr" target="#b19">19]</ref>, reducing the number of edges cut by up to 90%.</p><p>In LOOM we buffer a sliding window over a graph-stream, and use LDG to assign both connected sub-graphs<ref type="foot" target="#foot_0">1</ref> and single vertices from the buffer to partitions. Stanton and Kliot describe similar extensions in their original work <ref type="bibr" target="#b17">[17]</ref>. In particular, the Greedy EvoCut partitioning heuristic is closely related to our own. The local partitioning algorithm Evo-Cut <ref type="bibr" target="#b2">[2]</ref> is used to split sub-graphs which occur within the stream buffer into small pseudo-partitions, which are then wholly assigned to parent partitions using LDG. In LOOM however, we attempt to detect sub-graphs within the stream buffer which are likely to be frequently traversed by a workload Q, and greedily place those wholly within a partition.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Capturing a query workload</head><p>In order to detect the sub-graphs from stream G which are likely to be frequently traversed by a workload Q, we must first discover those motifs which occur frequently within the query graphs defined in Q. This act of discovery is a form of frequent sub-graph mining.</p><p>In a previous work by the authors which is submitted for publication elsewhere, we define the traversal pattern summary trie (TPSTry). The TPSTry datastructure, inspired by Li et al 's work <ref type="bibr" target="#b10">[10]</ref> to find common traversal paths amongst sessions of hyperlink click-streams, is an encoding of the frequent motifs in a workload of path queries. The encoding is intensional, encoding paths of vertex labels, rather than the vertices themselves, in order to save space. Each node n is associated with the set of queries which could cause the path of traversals which n represents. Each node n in the trie is additionally associated with a probability P (n), representing the likelihood of a traversal in graph G along a path whose vertex labels match those of the path ε → . . . → n in the TPSTry. Using these probabilities, we are able to estimate the probability of any traversal from a vertex v, given its v label and those of v's local neighbourhood.</p><p>In this work we extend the TPSTry data structure from a trie to a directed acyclic graph, which we call TPSTry++, capable of encoding the features of more complex motifs: branches, cycles etc. Encoding the motifs described by inexact pattern matching queries, such as those including variable length paths, is considered out of scope for this work.</p><p>The TPSTry++ is inspired by the work of Ribeiro and Silva, who propose G-Tries <ref type="bibr" target="#b14">[14]</ref>. A G-Trie is a trie data structure which stores unlabelled graphs in such a way that a parent node in the trie represents a sub-graph of its children. Each graph in a G-Trie node is represented in its canonical form. A canonical form is guaranteed to be equal for two graphs which are isomorphic to one another, avoiding multiple trie branches per graph. In order to discover frequent sub-graphs (motifs), Ribeiro and Silva traverse the elements of a graph, constructing the branches of a G-Trie as they encounter distinct motifs. A p-value is associated with each node, based upon the number of times a particular motif has been observed. This process is similar to how we construct a TPSTry++ from a stream of sub-graph pattern matching queries, however there are a number of differences. Firstly, as we capture labelled topologies, the TPSTry++ must be a directed acyclic graph (DAG), rather than a tree, as it may have multiple possible root nodes: one for each vertex with a distinct label. Secondly, we must use a different method for checking isomorphism between two motifs, as the unlabelled canonical form used when constructing G-Tries is no longer sufficient. To match two motifs, we use an efficient algorithm by Song et al <ref type="bibr" target="#b16">[16]</ref> for computing numerical signatures for graphs. This algorithm is proposed as part of a graph-stream pattern matching approach, which we use to detect matches for Q's motifs in a graph-stream G. Both algorithm and pattern matching approach are detailed in the next section 4.3. Note that signature equality constitutes a non-authoritative form of isomorphism checking. However, the probability of signature collisions, and therefore of mistakenly representing distinct motifs with a single TPSTry++ node, is shown to be very low.</p><p>Figure <ref type="figure" target="#fig_1">2</ref> shows a representation of a TPSTry++ for the workload Q in figure <ref type="figure" target="#fig_0">1</ref>, without p-values. We capture the motifs common to Q, along with their frequencies, by executing a simple co-recursive algorithm, presented in algorithm 1, for each query graph Gq i .</p><p>Algorithm 1 Recompute TPSTry++ for each query q ∈Q qG ← the query graph defined by q signature(g) ← the signature of a graph g support(g) ← a map of TPSTry++ nodes to p-values tpstry ← the TPSTry++ for Q g ← some sub-graph of qG, initially a single vertex weave(qg, tpstry) for v in vertices from qG do g ← new graph with just {v} corecurse(g, tpstry) sig ← signature(g) if sig not in tpstry then tpstry ← tpstry + g //Add a g node to TPSTry++ support(g) ← support(g) + 1 newEdges ← edges incident to g but not in g for e in newEdges corecurse(g + e, tpstry) //Traverse through qG return tpstry Any node in the TPSTry which has a p-value above a user-defined threshold T is denoted frequent, and the node's associated sub-graph is considered a motif in Q.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Detecting motif matches in a graph-stream</head><p>The TPSTry++ for Q provides a set of query motifs which, where they occur in G, are likely to be frequently traversed by a random q ∈Q. Given this information, we must attempt to identify sub-graphs in G which match these motifs, in order to make sure they cross partition boundaries as little as possible. As mentioned, global graph introspection or pattern matching operations are expensive and limit the scalability of a partitioner <ref type="bibr" target="#b19">[19]</ref>. Instead, we use a graphstream pattern matching algorithm to capture those subgraphs in G which match a query motif and occur within a given window 2 over the graph-stream.</p><p>Song et al <ref type="bibr" target="#b16">[16]</ref> propose a highly efficient algorithm based on number theoretic signatures. Offline, they construct a "signature" for each query graph Gq i in a workload. This signature is really a large integer hash, which captures key information about a graph, such as vertices, labels and their degree, as distinct factors. Subsequently, as an edge e arrives online, the signature for a sub-graph S which contains e is calculated by multiplying the previous signature of the subgraph S\e by the factor for e. If the signature for S is divisible by the signature for Gq i then there is likely to be a match for qi in S.</p><p>Song et al demonstrate that if a graph does not have a signature equal to that of a given query graph Gq i , then it cannot be a match for the query qi. Note that this is a weaker property than a signature match being equivalent to a graph match, and as such this pattern matching algorithm is non-authoritative; indeed, Song et al must use a secondary algorithm to verify matches. However, they also demonstrate that signature collision is highly unlikely, which should be sufficient for our purposes of heuristically improving a partitioning, without further verification.</p><p>Recall from algorithm 1 that a signature is computed for each motif represented by a node in the TPSTry++. As each edge arrives in the graph-stream, if it connects two vertices within the stream window to form a sub-graph S, then we compute a signature. If the signature is a match for a node n in the TPSTry++, then S is a match for a motif. For a subsequently added edge, the signature for sub-graph S must match a signature associated with a child of n, or else S is not a match for a motif. Note that the sub-graph S not being a match for a motif does not imply that the newly added edge is not part of a sub-graph which is a match.</p><p>Figure <ref type="figure" target="#fig_2">3</ref> presents an illustrative example of the above. The subgraph S is not a match for any node of the TPSTry++ in figure <ref type="figure" target="#fig_1">2</ref>, however it contains two distinct instances of the abc motif. The pattern matching algorithm does not detect this, because sub-graph signatures are iteratively recomputed with each update, and previous signatures discarded. As a result, we risk assigning the added c labelled vertex to a different partition than the sub-graph S, creating an inter-partition edge which is likely to be traversed. In order to avoid this, we adopt the following simple procedure: if an edge e is added to a sub-graph S, and the new sub-graph S is not a match for a motif in the TPSTry++, then we incrementally compute a new signature for S , starting with the new edge e. This computation is similar to algorithm 1, in that we traverse each edge in S , starting with those incident to vertices in e. After each step we recompute the signature for the sub-graph of S which we have traversed so far. If this recomputed signature is not in the TPSTry++ then we discard the most recent edge, and do not traverse to its neighbours. We eventually traverse and identify the largest sub-graph of S which both contains the edge e and is a match for a query motif<ref type="foot" target="#foot_1">3</ref> .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Assigning motif matches to partitions</head><p>The pattern matching algorithm described in the previous section will maintain the set of sub-graphs which are currently within the graph-stream window, and which match common motifs from a query workload Q. Over time, vertices and edges will leave the stream window and be assigned to partitions using the LDG heuristic, as mentioned in section 4.1. When the "oldest" vertex in a motif match is due to be assigned, we assign the whole matching sub-graph at once. Other matching sub-graphs which share common substructure with the sub-graph being assigned, as in figure <ref type="figure" target="#fig_2">3</ref>, will also be assigned to the same partition. This greedy approach is naive, as it risks assigning some motif matching sub-graphs to sub-optimal partitions because they share substructure with another sub-graph which was assigned earlier. Furthermore if a set of connected sub-graphs is very large, it is unclear what effect this would have on partition balance, even given LDG's penalty weighting. Evaluating alternative approaches, including local partitioning of motif matches to separate them across partitions, is a focus of our ongoing work.</p><p>As stated previously, isolated vertices, or sub-graphs which do match motifs from Q, are assigned according to the LDG heuristic.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">CONCLUSION AND FUTURE WORK</head><p>We have presented our ongoing work on LOOM: a workloadaware streaming graph partitioner. Our primary contribution is using a generalised trie data structure to identify query motifs, small sub-graphs common to many of the query graphs defined in a sub-graph pattern matching workload Q. We have also described how we use an efficient graph-stream pattern matching technique to identify matches for query motifs in a graph-stream G, and greedily assign these matches to partitions to reduce the probability of inter-partition traversals.</p><p>As future work we will perform extensive evaluation of the prototype LOOM architecture, specifically in the presence of a number of different graph-stream orderings, and different query workloads. Furthermore, our choice of greedy assignment semantics, never splitting sub-graphs in G which match query motifs, risks poor performance when large subgraphs are assigned to sub-optimal partitions in order to maintain partition balance. We must propose a local partitioning procedure for large matched sub-graphs which alleviates this. Finally it would be interesting to extend our base partitioning heuristic (LDG) to incorporate edge traversal probabilities from the TPSTry++ into the process of selecting assignment partitions.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>cFigure 1 :</head><label>1</label><figDesc>Figure 1: An example graph G with query workload Q</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: TPSTry++ for Q in fig.1</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>2 3 :</head><label>3</label><figDesc>Stream windows may be defined in terms of time, or element count graph-stream G Motif matching over the graph-stream</figDesc></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">When assigning sub-graphs, LDG considers the total edges from all vertices, to each partition.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1">This may be none!</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title/>
		<author>
			<persName><surname>References</surname></persName>
		</author>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<ptr target="http://thinkaurelius.github.io/titan/" />
		<title level="m">Titan -Distributed Graph Database</title>
				<imprint>
			<date type="published" when="2015-12-01">2015-12-01</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Finding sparse cuts locally using evolving sets</title>
		<author>
			<persName><forename type="first">R</forename><surname>Andersen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Peres</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 41st annual ACM symposium on Symposium on theory of computing -STOC &apos;09</title>
				<meeting>the 41st annual ACM symposium on Symposium on theory of computing -STOC &apos;09<address><addrLine>New York, New York, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM Press</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page">235</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Balanced Graph Partitioning</title>
		<author>
			<persName><forename type="first">K</forename><surname>Andreev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Racke</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Theory of Computing Systems</title>
		<imprint>
			<biblScope unit="volume">39</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="929" to="939" />
			<date type="published" when="2006-11">nov 2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Pattern Matching in Protein-Protein Interaction Graphs</title>
		<author>
			<persName><forename type="first">G</forename><surname>Brevier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rizzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Vialette</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Fundamentals of Computation Theory</title>
				<imprint>
			<biblScope unit="page" from="137" to="148" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Schism</title>
		<author>
			<persName><forename type="first">C</forename><surname>Curino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Madden</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the VLDB Endowment</title>
				<meeting>the VLDB Endowment</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="48" to="57" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">An Improved Spectral Graph Partitioning Algorithm for Mapping Parallel Computations</title>
		<author>
			<persName><forename type="first">B</forename><surname>Hendrickson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Leland</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SIAM Journal on Scientific Computing</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="452" to="469" />
			<date type="published" when="1995-03">mar 1995</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">A graph-based recommender system for digital library</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T.-H</forename><surname>Ong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2nd ACM/IEEE-CS joint conference on Digital libraries</title>
				<meeting>the 2nd ACM/IEEE-CS joint conference on Digital libraries</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="65" to="73" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Multilevel k -way Partitioning Scheme for Irregular Graphs</title>
		<author>
			<persName><forename type="first">G</forename><surname>Karypis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kumar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Parallel and Distributed Computing</title>
		<imprint>
			<biblScope unit="volume">47</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="109" to="124" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">An efficient heuristic procedure for partitioning graphs</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">W</forename><surname>Kernighan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Bell systems technical journal</title>
		<imprint>
			<biblScope unit="volume">49</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="291" to="307" />
			<date type="published" when="1970">1970</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Mining Top-K Path Traversal Patterns over Streaming Web Click-Sequences</title>
		<author>
			<persName><forename type="first">H</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Information Science and Engineering</title>
		<imprint>
			<biblScope unit="volume">1133</biblScope>
			<biblScope unit="issue">95</biblScope>
			<biblScope unit="page" from="1121" to="1133" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Distributed GraphLab</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Low</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Bickson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gonzalez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Guestrin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kyrola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Hellerstein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the VLDB Endowment</title>
				<meeting>the VLDB Endowment</meeting>
		<imprint>
			<date type="published" when="2012-04">apr 2012</date>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page" from="716" to="727" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Pregel</title>
		<author>
			<persName><forename type="first">G</forename><surname>Malewicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">H</forename><surname>Austern</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">J</forename><surname>Bik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>Dehnert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Horn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Leiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Czajkowski</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2010 international conference on Management of data -SIGMOD &apos;10</title>
				<meeting>the 2010 international conference on Management of data -SIGMOD &apos;10<address><addrLine>New York, New York, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM Press</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page">135</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">SWORD</title>
		<author>
			<persName><forename type="first">A</forename><surname>Quamar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">A</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Deshpande</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 16th International Conference on Extending Database Technology</title>
				<meeting>the 16th International Conference on Extending Database Technology</meeting>
		<imprint>
			<publisher>ACM Press</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page">430</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">G-Tries: a data structure for storing and finding subgraphs</title>
		<author>
			<persName><forename type="first">P</forename><surname>Ribeiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Silva</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Data Mining and Knowledge Discovery</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="337" to="377" />
			<date type="published" when="2014-03">mar 2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Catch the Wind: Graph workload balancing on cloud</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Shang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">X</forename><surname>Yu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE 29th International Conference on Data Engineering (ICDE)</title>
				<imprint>
			<date type="published" when="2013-04">2013. apr 2013</date>
			<biblScope unit="page" from="553" to="564" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Event pattern matching over graph streams</title>
		<author>
			<persName><forename type="first">C</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the VLDB Endowment</title>
				<meeting>the VLDB Endowment</meeting>
		<imprint>
			<date type="published" when="2014-12">dec 2014</date>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="413" to="424" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Streaming graph partitioning for large distributed graphs</title>
		<author>
			<persName><forename type="first">I</forename><surname>Stanton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Kliot</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining</title>
				<meeting>the 18th ACM SIGKDD international conference on Knowledge discovery and data mining</meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="1222" to="1230" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Fast best-effort pattern matching in large attributed graphs</title>
		<author>
			<persName><forename type="first">H</forename><surname>Tong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Gallagher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Faloutsos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Eliassi-Rad</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining</title>
				<meeting>the 13th ACM SIGKDD international conference on Knowledge discovery and data mining</meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page">737</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Tsourakakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gkantsidis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Radunovic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Vojnovic</surname></persName>
		</author>
		<author>
			<persName><surname>Fennel</surname></persName>
		</author>
		<title level="m">Proceedings of the 7th ACM international conference on Web search and data mining</title>
				<meeting>the 7th ACM international conference on Web search and data mining</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="333" to="342" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">LogGP</title>
		<author>
			<persName><forename type="first">N</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Cui</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the VLDB Endowment</title>
				<meeting>the VLDB Endowment</meeting>
		<imprint>
			<date type="published" when="2014-10">oct 2014</date>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="1917" to="1928" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Towards effective partition management for large graphs</title>
		<author>
			<persName><forename type="first">S</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Khan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2012 international conference on Management of Data</title>
				<meeting>the 2012 international conference on Management of Data</meeting>
		<imprint>
			<publisher>ACM Press</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="517" to="528" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
