<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Graph n-grams for Scientific Workflow Similarity Search</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David Luis Wiegandt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johannes Starlinger</string-name>
          <email>starling@informatik.hu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ulf Leser</string-name>
          <email>leser@informatik.hu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Humboldt-Universita ̈t zu Berlin, Institut fu ̈r Informatik Unter den Linden 6</institution>
          ,
          <addr-line>10099 Berlin, Germany https://</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>As scientific workflows increasingly gain popularity as a means of automated data analysis, the repositories such workflows are shared in have grown to sizes that require advanced methods for managing the workflows they contain. To facilitate clustering of similar workflows as well as reuse of existing components, a similarity measure for workflows is required. We explore a new structure-based approach to scientific workflow similarity assessment that measures similarity as the overlap in local structure patterns represented as n-grams. Our evaluation shows that this approach reaches state-of-the-art quality in scientific workflow comparison and outperforms some established scientific workflow similarity measures.</p>
      </abstract>
      <kwd-group>
        <kwd>Scientific Workflows</kwd>
        <kwd>DAGs</kwd>
        <kwd>n-grams</kwd>
        <kwd>graph theory</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Over the last years, scientific workflows (SWFs) have emerged as a useful means of data
analysis, particularly in the life sciences. SWFs are shared in (online) repositories to
optimally support non-experts who cannot develop SWFs by themselves. A problem
that arises with the growth of such repositories is the occurrence of (partial) duplicates.
To avoid duplicates in the first place and enforce the reuse of existing components
instead, and to allow for grouping of workflows that fulfil similar tasks (e.g. for result
presentation), a similarity measure for SWFs is required. The problem of defining
adequate similarity measures has been attacked in various ways, either based on the SWFs’
graph structure or on their associated description and annotations in repositories. In
this work, we focus on graph-based approaches as they are in principal applicable to
arbitrary workflows, in contrast to approaches that rely on the presence of annotations.</p>
      <p>
        Graphs can be compared using either global or local measures. A systematic
approach to categorising and evaluating various established measures [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] yields that
algorithms that consider a graph’s full structure are too strict in the context of SWF
similarity assessment. On the contrary, algorithms that fully ignore any structural
information and only perceive graphs as sets of vertices are too loose, as two workflows that
share a high amount of common tasks but with very different interconnections, should
not be declared as similar. Instead, approaches that retain substructures only to a certain
degree yield promising results [
        <xref ref-type="bibr" rid="ref1 ref15 ref17">1,15,17</xref>
        ].
      </p>
      <p>Since it is our aim to measure the similarity of SWFs that are stored in repositories,
we particularly focus on algorithms that decompose graphs into smaller units. These
units, encoding a certain amount of structure, can be computed once when the workflow
is added to the repository, and be reused in subsequent similarity assessments, making
the computational complexity of the decomposition process negligible. In information
retrieval, one way to represent such units are n-grams, a concept which has been applied
to graphs as well.</p>
      <p>
        For instance, in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], workflows are perceived as finite state automata and the
automata’s states are encoded into fixed-size words called n-grams. Two workflows’
similarity is measured by the overlap in their n-gram-sets. This approach was proven to
outperform other similarity measures in the more general field of workflow similarity
assessment, however, it cannot easily be transferred to SWFs, as, in general, SWFs lack
transition labels that are required to convert them to FSAs. A similar approach that
aims at similarity assessment of trees is presented in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], an n-gram-based
approach to measuring two graphs’ similarity in the context of graph similarity joins was
presented as an alternative to the computationally expensive graph edit distance. It is
among the state-of-the-art algorithms in this field [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Since n-gram-based approaches have proven to be successful in various domains, in
this work, we transfer the concept to SWF similarity assessment. To this end, we
particularly focus on n-grams that represent sub-paths of length n. For our evaluation, we
make use of the gold standard corpus of manually retrieved similarity ratings introduced
in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>In the following, we start with formal definitions of graphs and workflow graphs in
Section 2, that we refer to in the subsequent review of existing solutions in Section 3.
In Section 4, we introduce our novel approach which we evaluate in Section 5. In
Section 6, we make specific propositions on how to improve our approach in the context of
possible future work and close with some concluding remarks.
2
2.1</p>
    </sec>
    <sec id="sec-2">
      <title>Preliminaries</title>
      <sec id="sec-2-1">
        <title>Graphs</title>
        <p>A directed graph G = (V; E) is a tuple consisting of a vertex set V and an edge set
E V V . With [n] := fi 2 N : 1 i ng we usually assume V = [n] for some n 2 N.
A path of length n is a sequence of vertices v1; : : : ; vn 2 V that are connected by edges,
i.e. (vi 1; vi) 2 E; i = 2 : : : n. If v1 = vn, the path is called a cycle. Directed graphs without
cycles are called directed acyclic graphs (DAGs). For the rest of this work, when talking
about graphs we always mean DAGs if not explicitly stated otherwise.</p>
        <p>With Eˆ := ffv; wg : (v; w) 2 Eg, we call Gˆ := (V; Eˆ ) the underlying undirected graph
of G. A graph G is connected if there exists a path between any two vertices in V
and weakly connected if only Gˆ is connected. A vertex-labelled graph is a quadruple
G = (V; E; S ; l) with labels over a finite alphabet S and a function l : V ! S ?,
although we usually do not explicitly state the alphabet and the labelling function. An
edge-labelled graph is defined analogously. The ratio of jEj and the graph’s maximum
possible amount of edges is called its density.</p>
        <p>Given any v 2 V , we call suc(v) = fw 2 V : (v; w) 2 Eg its set of successors and
pre(v) = fw 2 V : (w; v) 2 Eg its set of predecessors. Furthermore, we denote a vertices’
in-degree by deg (v) = jpre(v)j and its out-degree by deg+(v) = jsuc(v)j. We refer to
b(G) = jjVEjj as the average branching factor of G, being the average out-degree (and
in-degree) of any vertex v 2 V . For some S V we call G[S] = (S; E[S]) with E[S] =
f(v; w) 2 E : v; w 2 Sg the vertex-induced subgraph of G.
2.2</p>
      </sec>
      <sec id="sec-2-2">
        <title>Workflow Graphs</title>
        <p>
          represent SWFs as DAGs [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]: We set E = Ec [ Ed and omit the explicit mentioning of
IN and OU T , because we are only interested in the underlying DAG’s structure, which
can be described by its set of vertices and edges as introduced in the previous section.
The problem of workflow similarity assessment has been attacked in various different
ways. As we focus on graph-based approaches, we group the approaches outlined in
this section wrt. their preservation of topology and structure and distinguish (1) local
approaches that completely disregard the graphs’ structure, (2) global approaches that
preserve the graphs’ structure (almost) entirely and (3) alternative approaches that aim
to compromise between local and global by preserving substructures.
        </p>
        <p>
          Local Measures. Considering only the workflows’ tasks, in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], the Vector Space
Model (VSM) as known from information retrieval [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] is adapted for workflow
comparison. Given two vertex-labelled workflow graphs Gi = (Vi; Ei; Si; li) for i = 1; 2,
each workflow is represented as a k-dimensional vector Di = (di;1; : : : ; di;k) with T =
l1[V1] [ l2[V2] = ft1; : : : ; tkg being the set of all vertex labels in both workflows and
di; j = jfv 2 Vi : l(v) = t jgj; j = 1::k being the amount of vertices in the i-th workflow
that are labelled with t j. In a last step, the similarity is computed as the cosine similarity
between both workflows’ vectors.
        </p>
        <p>
          Another approach called Module Sets (MS), that is similar to the VSM in that it
completely disregards substructures by perceiving a workflow graph only as a set of
tasks (called modules in the referenced work), proved to be comparably reliable and
fast [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]: Given SWF graphs Gi = (Vi; Ei) for i = 1; 2, all pairs of tasks in V1 V2 are
compared (e.g. with regard to their label’s string edit distances).
        </p>
        <p>simM : V1</p>
        <p>V2 ! [0; 1]</p>
        <p>R
Based on simM, a maximum weight matching mw
larity score is given by eq. (2).</p>
        <p>V1</p>
        <sec id="sec-2-2-1">
          <title>V2 is computed and the simi</title>
          <p>nnsimMS(V1;V2) =</p>
          <p>
            å
(v1;v2)2mw
simM(v1; v2)
In a last step, the similarity score is normalised over the workflows’ size using a
modified Jaccard index:
simMS(V1;V2) =
jV1j + jV2j
nnsimMS(V1;V2)
nnsimMS(V1;V2)
Global Measures. An alternative examined in the same work [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ] that turned out to
provide more stable results than module sets is to retain substructures by comparing
path sets. Given a DAG G = (V; E), we denote the set of all paths of length n that start
at a node with no inbound edges and end in a node with no outbound edges by Pn(G)
(1)
(2)
(3)
and the set of all such paths of arbitrary length by PS(G):
          </p>
          <p>
            Pn(G) = f(v1; : : : ; vn) 2 V n : (vi 1; vi) 2 E; i = 2 : : : n; deg (v1) = deg+(vn) = 0g (4)
jV j
PS(G) = [ Pn(G)
n=1
Given SWF graphs G1; G2 and their path sets PS1 := PS(G1); PS2 := PS(G2), the
computation of the pairwise similarity score between paths (simP), the non-normalised
(nnsimPS) and the normalised similarity score (simPS) are defined analogous to the
module sets approach by using paths instead of modules. To compute the similarity of two
paths P1 = v1; : : : ; vn; P2 = w1; : : : ; wm, a maximum weight non-crossing (mwnc)
matching is computed on their vertices, i.e. for all (vi; w j); (vk; wl ) 2 mwnc P1 P2 it holds
that k &gt; i =) l &gt; j and k &lt; i =) l &lt; j [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ].
          </p>
          <p>
            Another common graph distance measure is the graph edit distance. Given graphs
G1; G2 and the three edit operations insert, delete and substitute (all applicable to edges
and nodes), each with a certain cost assigned, the graph edit distance is defined as the
minimum cost sequence of edit operations to transform G1 into G2 or vice versa [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ].
Despite being optimal in terms of accuracy in finding structural differences between
graphs, the graph edit distance’s time complexity is exponential, making it inappropriate
for large graphs. Further, it turned out to be too strict for SWF similarity assessment and
yields comparably bad results [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ].
          </p>
          <p>
            Alternative Approaches. In [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ], it was shown that graph comparison algorithms that
consider the graphs full structure appear to be too strict for fine grained assessment
of SWF similarity, whereas the comparison of SWFs by their module sets turned out
to provide convincing results. Comparing substructures and focusing on the execution
order of the workflows’ tasks, on the other hand, turned out to lead to more reliable
results, but is computationally expensive.
          </p>
          <p>
            In [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ], a new approach to SWF similarity search, that aims to compromise between
local and global approaches, is introduced. Layer Decomposition (LD) performs a
topological decomposition of a given SWF graph, i.e. it decomposes the graph into disjoint
sets of vertices depending on the vertices relative position within their data flow strand.
This is done by repeatedly removing all vertices with no inbound edges and their
associated outbound edges. Given a graph G = (V; E), the i-th layer Li is defined recursively:
(5)
(6)
(7)
L1(G) = fv 2 V : deg (v) = 0g
          </p>
          <p>
            Li(G) = fv 2 G [Li 1(G)] : deg (v) = 0g
The ordered set LD(G) that contains all layers is called the graph’s layer decomposition.
To compare two graphs G1; G2 by their layer decompositions LD1 := LD(G1); LD2 :=
LD(G2), the approach of computing nnsimLD and simLD is analogous to the path and
module sets, except for the use of a maximum weight non-crossing matching between
layers and a maximum weight matching between two layers’ vertices. LD is able to
outperform other algorithms for SWF similarity assessment, including path sets, module
sets and the graph edit distance [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ].
          </p>
          <p>The distinctive feature of layer decomposition and path set comparison is that the
matching between two workflows’ tasks within the retrieved substructures is performed
as a part of the similarity assessment, while the other approaches assume a global
matching to be computed beforehand. Two workflows’ tasks are matched with respect
to their execution order, because tasks in one workflow can only be mapped to tasks
that are within the matched layer or path in the other workflow. Since the matching
is non-crossing, the overall similarity depends on the extent to which both workflows
overlap in their module sets but also on the similarity of the order the tasks appear in.
This is important as two workflows might share a great amount of common tasks (i.e.
they have similar module sets) but due to their differing execution order, the workflows
might implement different functionalities.</p>
          <p>
            In the related area of automata theory, an n-gram-based approach was introduced
[
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] in which a workflow with transition labels in S ? is perceived as a finite state
automaton (FSA) M = (Q; S ; d ; q0; F) with states Q, an alphabet S , a transition function
d : Q S ! Q, an initial state q0 2 Q and final states F Q. We define dˆ (q; w) as the
state of M after reading a word w, starting in state q:
dˆ : Q
          </p>
          <p>S ? ! Q
An n-gram is defined as a word w 2 S n with w = w1w2 : : : wn, s.t. there exists a state
from which on the word w is recognized, ending in a state q 2 Q:</p>
          <p>
            Sn(q) = fw 2 S n : 9r 2 Q : dˆ (r; w) = qg
The idea is to represent an automaton by the union of all states’ n-grams and to measure
two FSAs distance by the minimum sum of edit distances (the minimum cost sequence
of insertions, deletions and substitutions for transforming one given string into another
[
            <xref ref-type="bibr" rid="ref18">18</xref>
            ]) across all pairs of n-grams.
          </p>
          <p>
            It was shown in [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ] that this approach empirically outperforms certain
structureand language-based algorithms.
          </p>
          <p>In summary, it can be observed that it is important to partially retain a workflow’s
structural topology and thus the execution order, but the increase in computational
complexity should not be too high. Module and path set comparison, as well as the graph
edit distance and layer decomposition have all been evaluated in the context of SWF
comparison. In the following, we investigate the adaption of the concept of n-grams for
SWFs.
(8)
(9)
4</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Novel Approach</title>
      <p>
        We propose a new approach that takes local graph patterns into account to compare
graphs by measuring their commonalities regarding certain patterns. First, we introduce
the notion of n-grams as used in our approach (similar to [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]). We assume S to be the
set of all SWF graphs.
      </p>
      <p>Definition 1 (n-gram). Given a graph G = (V; E) 2 S, we define the function NGSn :
S ! P (V n) to compute a set of all paths of length n as</p>
      <p>NGSn(G) = f(v1; v2; : : : ; vn) 2 V nj(vi; vi+1) 2 E; i = 1 : : : n
1g
(10)
with NGSn(G) being a set of n-grams of vertices.</p>
      <p>Algorithm 2: Path n-gram
input : Graph G = (V; E), n
output: n-grams V n</p>
      <sec id="sec-3-1">
        <title>Example 1 (n-grams). We illustrate the concept of n</title>
        <p>grams by calculating all 3-grams for the graph on the
right. To simplify the notation, we denote the tuples by
concatenating their vertices. We obtain the following
3gram set:</p>
        <p>NGS3(G) =
fABD; ABE; BDF; BDK; BEI; CBD;
CBE; DFI; DKIg
F</p>
        <p>A
D</p>
        <p>B
K
I</p>
        <p>C</p>
        <p>E</p>
        <p>
          For the computation of a graph’s set of n-grams we propose algorithm 2, based on
a modified iterative deepening depth-first search (IDDFS) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The decision to propose
an IDDFS-based algorithm instead of a Floyd-Warshall-based algorithm was made
because the Floyd-Warshall algorithm is less efficient for graphs with a low density, which
is, in turn, correlated with the graph’s average branching factor. It has been shown in
[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], that most workflows have a rather low average branching factor, which also implies
a low density. Our observations regarding our evaluation corpus, where the median
density is approximately 0:3, support these findings. Our evaluation corpus (introduced
in [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]) contains 1483 of the 2123 SWFs in the myExperiment repository, the largest
public SWF repository to date.
        </p>
        <p>To derive a measure from two graphs’ sets of n-grams, we first introduce a method
to compare two n-grams. For the remainder of this section, we assume two graphs
G1 = (V1; E1); G2 = (V2; E2) and refer to their accompanying n-gram sets by NGS1 :=
f (G1) V1n; NGS2 := f (G2) V2n with f 2 Sn2N NGSn.</p>
        <p>Since an n-gram a = (a1; : : : ; an) 2 NGSn(G) for some graph G = (V; E) is a vector
in the n-dimensional vector space V n, we refer to the i-th component of an n-gram by
ai.</p>
        <p>Definition 2 (n-gram similarity). Given two graphs G1 = (V1; E1); G2 = (V2; E2), a
similarity function sim : V1 V2 ! [0; 1] and n-gram sets NGS1; NGS2, we define the
similarity of two n-grams a 2 NGS1; b 2 NGS2 as
simNG(a; b) =
1 n</p>
        <p>å sim(ai; bi)
n i=1
(11)</p>
        <p>
          In our case, as well as in [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], the sim-function is derived from the normalised string
edit distance between the tasks’ labels ai; bi. To retrieve a set of mappings of n-grams
that are present in both sets wrt. the global maximum weight matching, we introduce
the n-gram mapping operator:
        </p>
        <sec id="sec-3-1-1">
          <title>Definition 3 (n-gram mapping operator). Given two n-gram sets NGS1; NGS2 and</title>
          <p>a global maximum weight matching mw V1 V2 between both workflows’ tasks, we
define the set of common n-grams as</p>
          <p>NGS1
mw NGS2 := f(a; b) 2 NGS1</p>
          <p>NGS2 : (ai; bi) 2 mw; i = 1::ng
(12)</p>
          <p>The similarity of two n-gram sets NGS1; NGS2 is calculated as the sum of the
similarities of all n-gram mappings in NGS1 mw NGS2:
Definition 4 (n-gram set similarity measure). Given two n-gram sets NGS1; NGS2,
we define the non-normalised similarity as
nnsimNGS(NGS1; NGS2) :=
simNG(a; b)</p>
          <p>(13)
å
(a;b)2NGS1 mwNGS2</p>
          <p>
            To normalise this similarity value, we use a modified Jaccard index as described in
[
            <xref ref-type="bibr" rid="ref14">14</xref>
            ] and already used for layer decomposition [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]:
Definition 5 (Normalisation). We define the normalised similarity of two n-gram sets
NGS1; NGS2 as
simNGS(NGS1; NGS2) :=
jNGS1j + jNGS2j
nnsimNGS(NGS1; NGS2)
nnsimNGS(NGS1; NGS2)
(14)
          </p>
          <p>
            We also considered another type of n-grams that encode each vertices’
neighbourhood (i.e. its predecessors, successors and itself) similar to [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ]. Due to the low average
branching factor in SWFs, this type of pattern turned out to be too restrictive and its
detailed analysis is omitted here.
5
5.1
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <sec id="sec-4-1">
        <title>Setup</title>
        <p>
          For the evaluation of our novel approach, we make use of the gold standard of similarity
ratings utilised in [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]: From a corpus of 1485 Apache Taverna workflows, 24 query
workflows were chosen, each having another 10 workflows associated with it. 15 SWF
experts from 6 different institutions were shown a query workflow along with one of
its 10 associated workflows and had to decide whether they are very similar, similar,
related or dissimilar. A fifth option was to choose unsure, though these ratings were not
further considered in the evaluation. The result were multiple rankings of 10 workflows,
all ordered by their similarity to the respective query workflow.
        </p>
        <p>
          In a last step, a consensus ranking was computed from these rankings to
aggregate the single experts’ ratings. We use this consensus to compare it to the rankings
retrieved by our algorithm using the measures introduced in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]: We represent a
consensus ranking and a ranking obtained by running our algorithm as partial orders A?
and A, respectively, and call a pair of workflows W1;W2 concordant if it appears in the
same order in both rankings, i.e.:
        </p>
        <p>C = f(W1;W2) : (W1 A W2 ^ W1 A? W2) _ (W2 A W1 ^ W2 A? W1)g
(15)
If the order in our predicted ranking differs from the consensus ranking, we call it
discordant, i.e.:</p>
        <p>D = f(W1;W2) : (W1 A W2 ^ W2 A? W1) _ (W2 A W1 ^ W1 A? W2)g
(16)
To compare a ranking obtained using our algorithm to the respective consensus
ranking, we count the amount of concordant and discordant pairs and use the definition of
ranking correctness (eq. (17)) and completeness (eq. (18)):</p>
        <p>CR(A; A?) = jCj jDj
jCj + jDj
(17)</p>
        <p>CP(A) = jCj + jDj
j A? j
(18)</p>
        <p>The value of correctness is in [ 1; 1], with 1 indicating a perfectly negative
correlation between A? and A as in this case jCj = 0. Conversely, +1 indicates a perfectly
positive correlation. On the other hand, completeness measures the number of pairs
whose elements can be distinguished by the experts but not by our algorithm.
5.2</p>
      </sec>
      <sec id="sec-4-2">
        <title>Results</title>
        <p>Figure 2a shows the results: The coloured bars show the algorithms’ ranking
correctness, whereas the small black squares indicate ranking completeness. The black error
bars visualise the standard deviation of ranking correctness. All approaches present in
the diagrams use matchings of tasks based on the edit distance of the tasks’ labels.</p>
        <p>For n 2 f3; 4; 5g, n-grams increasingly outperform path sets in terms of correctness,
for n = 5 even with a slightly lower variance as can be seen in Figure 2a. Contrariwise,
there is an overall decrease in completeness as n grows, making n-grams the only
approach with a completeness far below 1. This is due to our definition of an n-gram: As
opposed to e.g. path sets, we require both graphs to contain at least one path of length n,
whereas in path sets, where graphs are compared by paths starting at a source and
ending in a sink without any constraints on the paths’ lengths, even a single vertex per graph
is sufficient to conduct a similarity assessment. The same applies to module sets, graph
edit distance and layer decomposition, as neither of them imposes any requirements on
the graphs’ size.
To analyse the impact of this de- p5 0.58 0.35
sign specificity, we first define cover- p4 0.48 0.45
age as the ratio of a graph’s vertices, p3 0.33 0.61
that are part of at least one n-gram. p2 0.13 0.85
The histogram in Figure 3 shows
the share of workflows in our corpus 0 0.1 [0,00].2 (0, 00.2.3) [0.02.,40.4) 0.[50.4, 0.60).6 [0.6,00.7.8) [00..88, 1] 0.9 1
whose coverage is within different
ranges for n-grams of size n = 2::5. Fig. 3: A histogram depicting the coverage of
For instance, in 45% of the work- the graphs in our corpus by n-grams of
differflows in our corpus, between 80% ent sizes.
and 100% of the vertices are covered by 4-grams, while 48% do not even contain a
single 4-gram, as can be seen in the green and light blue bars of p4. Further, only 67%
contain at least one 3-gram, which implies that only 45% of all possible pair-wise
comparisons can be conducted using 3-grams. For 4-grams, the share further decreases
to 27%.</p>
        <p>Observations regarding our corpus of 1483 SWFs yield that the median number of
(a) vertices is 5, (b) edges is 4 and (c) average branches per vertex is 0:8, i.e. the majority
of SWF graphs are comparably small, prohibiting a further increase in correctness by
the use of larger n-grams. This is the major problem that is inherent to our approach,
as we have to reject a comparison whenever at least one of the workflows that are
to be compared does not contain an n-gram of the required size, leading to a lower
completeness.</p>
        <p>An obvious remedy could be to let the value of n depend on the involved graphs’
size, instead of requiring both graphs to contain at least one path of a fixed length n. To
this end, we define the maximum n that a graph contains at least one n-gram for as:
maxn(G) = maxfn 2 N : NGSn(G) 6= 0/ g
Given workflow graphs G1; G2, we let n1 = maxn(G1); n2 = maxn(G2) and compute
the similarity as the (weighted) sum of similarity scores for all 2 n n? with n? =
min(n1; n2):
csim(G1; G2) :=</p>
        <p>2
n? (n? + 1)
n?
å i simNGS(NGSi(G1); NGSi(G2))
2 i=2
In eq. (20), the weighted sum is normalised by dividing it by the sum of all integers
2
from 2 to n?, with n? (n?+1) 2 being derived from the Gaussian summation formula.
(19)
(20)</p>
        <p>
          Unfortunately, the great increase in completeness ( 0:92 for all n 2 f2; 3; 4; 5g)
comes at the cost of bad correctness and variance as can be seen in Figure 2b. In fact, as
noted in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], there exists a negative correlation between correctness and completeness,
that is due to the definitions of the evaluation measures, i.e. the correctness can be
increased at the cost of more rejections and, in turn, a lower completeness. As a result, our
initial approach performs better as it simply rejects workflow pairs it can not compare,
instead of falling back to a lower n which would yield a less reliable similarity value.
6
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>
        In this work, we introduced n-grams in the context of SWF comparison as fixed-size
local graph patterns that cover a vertices context to different extents, depending on the
value of n. To compare two SWFs by their sets of n-grams, we adapted the similarity
measure from [
        <xref ref-type="bibr" rid="ref14 ref15">14,15</xref>
        ] and were able to show that n-grams for n 2 f3; 4; 5g outperform
path and module sets and are close to layer decomposition in terms of the mean
correctness. However, we still see potential for various optimisations based on these first
results on n-gram based SWF similarity.
      </p>
      <p>First of all, we see the evaluation of other local graph patterns as a possible
direction of further research. For instance, another interesting approach could be to compare
only data-flow splits or to retrieve similarity values for multiple patterns and to assign
a weight to each one. Regarding the approach presented in this work, we propose
optimisations specifically dedicated to either correctness or completeness:
Correctness. To further increase correctness, the matching strategy could be adjusted.
In particular, a more local matching such as a maximum weight non-crossing
matching within n-grams combined with a maximum weight matching across n-grams should
lead to an increase in correctness as the similarity measure would become more
sensitive to partly similar n-grams and thus would allow for a more fine-grained similarity
assessment.</p>
      <p>
        Also, some modules in an SWF perform trivial tasks with a low specificity for a
certain context [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], which is particularly a problem for higher n, which yield smaller
n-gram sets. To this end, an optimisation the correctness can be expected to benefit
from is importance projection introduced in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], where modules performing trivial
tasks are removed from the graph prior to the similarity assessment. As the workflow
graphs are possibly reduced in size by the use of importance projection, the n-gram
retrieval algorithm (Algorithm 2) would also benefit in terms of run time, as its run time
complexity has an exponential correlation with n.
      </p>
      <p>Completeness. The most obvious way to increase the completeness would be to relax
the definition of the n-gram similarity measure from Definition 2 to instead compare
n-grams of possibly different sizes. Formally, we compute the n-gram set for some
n1 = min(maxn(G1); n) (with maxn as defined in eq. (19)) for workflow G1 and some
n2 for G2 respectively. The overall comparison process becomes similar to the path
set comparison as presented in Section 3: To preserve the execution order within the
compared n-grams, a maximum weight non crossing matching is computed within the
n-grams and the overall similarity is the maximum weight matching of both workflows’
n-gram sets as described before regarding the correctness.</p>
      <p>Altogether, we have shown n-grams to be a powerful technique for SWF similarity
assessment, that still leaves room for further refinements to take on in future work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Augsten</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , Bo¨hlen,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Gamper</surname>
          </string-name>
          , J.:
          <article-title>Approximate matching of hierarchical data using pqgrams</article-title>
          .
          <source>In: PVLDB 2005</source>
          . pp.
          <fpage>301</fpage>
          -
          <lpage>312</lpage>
          . VLDB Endowment (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bux</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leser</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>Parallelization in scientific workflow management systems</article-title>
          .
          <source>arXiv preprint arXiv:1303.7195</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
          </string-name>
          , J.:
          <article-title>Efficient and scalable graph similarity joins in mapreduce</article-title>
          .
          <source>TheScientificWorldJournal</source>
          <year>2014</year>
          ,
          <volume>749028</volume>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Cheng, W.,
          <string-name>
            <surname>Rademaker</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Baets</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , Hu¨llermeier, E.:
          <article-title>Predicting partial orders: ranking with abstention</article-title>
          .
          <source>In: ECML-PKDD</source>
          . pp.
          <fpage>215</fpage>
          -
          <lpage>230</lpage>
          . Springer (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Kiepuszewski</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>ter Hofstede</surname>
          </string-name>
          , A.,
          <string-name>
            <surname>van der Aalst</surname>
          </string-name>
          , W.:
          <article-title>Fundamentals of control flow in workflows</article-title>
          .
          <source>Acta Informatica</source>
          <volume>39</volume>
          (
          <issue>3</issue>
          ),
          <fpage>143</fpage>
          -
          <lpage>209</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Korf</surname>
          </string-name>
          , R.E.:
          <article-title>Depth-first iterative-deepening: An optimal admissible tree search</article-title>
          .
          <source>Artificial intelligence</source>
          <volume>27</volume>
          (
          <issue>1</issue>
          ),
          <fpage>97</fpage>
          -
          <lpage>109</lpage>
          (
          <year>1985</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Copasi time simulation of sbml model</article-title>
          . http://www.myexperiment.org/ workflows/1202/versions/1/previews/full, accessed:
          <volume>05</volume>
          .
          <fpage>03</fpage>
          .2016
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Malucelli</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ottmann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pretolani</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Efficient labelling algorithms for the maximum noncrossing matching problem</article-title>
          .
          <source>Discrete Applied Mathematics</source>
          <volume>47</volume>
          (
          <issue>2</issue>
          ),
          <fpage>175</fpage>
          -
          <lpage>179</lpage>
          (
          <year>1993</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Ostermann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prodan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fahringer</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iosup</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Epema</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>On the characteristics of grid workflows</article-title>
          .
          <source>In: CoreGRID Symposium-Euro-Par</source>
          . vol.
          <year>2008</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Prodan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fahringer</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Grid Computing</article-title>
          , vol.
          <volume>4340</volume>
          . Springer Berlin Heidelberg, Berlin, Heidelberg (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Riesen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bunke</surname>
          </string-name>
          , H.:
          <article-title>Approximate graph edit distance computation by means of bipartite graph matching</article-title>
          .
          <source>Image and Vision Computing</source>
          <volume>27</volume>
          (
          <issue>7</issue>
          ),
          <fpage>950</fpage>
          -
          <lpage>959</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>C.S.:</given-names>
          </string-name>
          <article-title>A vector space model for automatic indexing</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>18</volume>
          (
          <issue>11</issue>
          ),
          <fpage>613</fpage>
          -
          <lpage>620</lpage>
          (
          <year>1975</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lins</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ahrens</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freire</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silva</surname>
          </string-name>
          , C.T.:
          <article-title>A first study on clustering collections of workflow graphs</article-title>
          .
          <source>In: Provenance and Annotation of Data and Processes</source>
          , pp.
          <fpage>160</fpage>
          -
          <lpage>173</lpage>
          . Springer (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Starlinger</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brancotte</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohen-Boulakia</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leser</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>Similarity search for scientific workflows</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          <volume>7</volume>
          (
          <issue>12</issue>
          ),
          <fpage>1143</fpage>
          -
          <lpage>1154</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Starlinger</surname>
          </string-name>
          , J.,
          <string-name>
            <surname>Cohen-Boulakia</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khanna</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davidson</surname>
            ,
            <given-names>S.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leser</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>Layer decomposition: An effective structure-based approach for scientific workflow similarity</article-title>
          .
          <source>In: 10th International Conference on e-Science</source>
          . vol.
          <volume>1</volume>
          , pp.
          <fpage>169</fpage>
          -
          <lpage>176</lpage>
          . IEEE (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Talia</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Workflow systems for science: concepts and tools</article-title>
          .
          <source>ISRN Software Engineering</source>
          <year>2013</year>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Wombacher</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Evaluation of technical measures for workflow similarity based on a pilot study</article-title>
          .
          <source>In: On the Move to Meaningful Internet Systems</source>
          <year>2006</year>
          :
          <article-title>CoopIS, DOA</article-title>
          , GADA, and ODBASE, pp.
          <fpage>255</fpage>
          -
          <lpage>272</lpage>
          . Springer (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Wombacher</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Alternative approaches for workflow similarity</article-title>
          .
          <source>In: Services Computing (SCC)</source>
          ,
          <year>2010</year>
          IEEE International Conference on. pp.
          <fpage>337</fpage>
          -
          <lpage>345</lpage>
          . IEEE (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Efficient graph similarity joins with edit distance constraints</article-title>
          .
          <source>In: 2012 IEEE International Conference on Data Engineering (ICDE</source>
          <year>2012</year>
          ). pp.
          <fpage>834</fpage>
          -
          <lpage>845</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>