1 Introduction

Graph n-grams for Scientific Workflow Similarity Search

David Luis Wiegandt

Johannes Starlinger

starling@informatik.hu-berlin.de 0

Ulf Leser

leser@informatik.hu-berlin.de 0 0 Humboldt-Universita ̈t zu Berlin, Institut fu ̈r Informatik Unter den Linden 6 , 10099 Berlin, Germany https://

As scientific workflows increasingly gain popularity as a means of automated data analysis, the repositories such workflows are shared in have grown to sizes that require advanced methods for managing the workflows they contain. To facilitate clustering of similar workflows as well as reuse of existing components, a similarity measure for workflows is required. We explore a new structure-based approach to scientific workflow similarity assessment that measures similarity as the overlap in local structure patterns represented as n-grams. Our evaluation shows that this approach reaches state-of-the-art quality in scientific workflow comparison and outperforms some established scientific workflow similarity measures.

Scientific Workflows DAGs n-grams graph theory

1 Introduction

Over the last years, scientific workflows (SWFs) have emerged as a useful means of data analysis, particularly in the life sciences. SWFs are shared in (online) repositories to optimally support non-experts who cannot develop SWFs by themselves. A problem that arises with the growth of such repositories is the occurrence of (partial) duplicates. To avoid duplicates in the first place and enforce the reuse of existing components instead, and to allow for grouping of workflows that fulfil similar tasks (e.g. for result presentation), a similarity measure for SWFs is required. The problem of defining adequate similarity measures has been attacked in various ways, either based on the SWFs’ graph structure or on their associated description and annotations in repositories. In this work, we focus on graph-based approaches as they are in principal applicable to arbitrary workflows, in contrast to approaches that rely on the presence of annotations.

Graphs can be compared using either global or local measures. A systematic approach to categorising and evaluating various established measures [ 14 ] yields that algorithms that consider a graph’s full structure are too strict in the context of SWF similarity assessment. On the contrary, algorithms that fully ignore any structural information and only perceive graphs as sets of vertices are too loose, as two workflows that share a high amount of common tasks but with very different interconnections, should not be declared as similar. Instead, approaches that retain substructures only to a certain degree yield promising results [ 1,15,17 ].

Since it is our aim to measure the similarity of SWFs that are stored in repositories, we particularly focus on algorithms that decompose graphs into smaller units. These units, encoding a certain amount of structure, can be computed once when the workflow is added to the repository, and be reused in subsequent similarity assessments, making the computational complexity of the decomposition process negligible. In information retrieval, one way to represent such units are n-grams, a concept which has been applied to graphs as well.

For instance, in [ 17 ], workflows are perceived as finite state automata and the automata’s states are encoded into fixed-size words called n-grams. Two workflows’ similarity is measured by the overlap in their n-gram-sets. This approach was proven to outperform other similarity measures in the more general field of workflow similarity assessment, however, it cannot easily be transferred to SWFs, as, in general, SWFs lack transition labels that are required to convert them to FSAs. A similar approach that aims at similarity assessment of trees is presented in [ 1 ]. In [ 19 ], an n-gram-based approach to measuring two graphs’ similarity in the context of graph similarity joins was presented as an alternative to the computationally expensive graph edit distance. It is among the state-of-the-art algorithms in this field [ 3 ].

Since n-gram-based approaches have proven to be successful in various domains, in this work, we transfer the concept to SWF similarity assessment. To this end, we particularly focus on n-grams that represent sub-paths of length n. For our evaluation, we make use of the gold standard corpus of manually retrieved similarity ratings introduced in [ 14 ].

In the following, we start with formal definitions of graphs and workflow graphs in Section 2, that we refer to in the subsequent review of existing solutions in Section 3. In Section 4, we introduce our novel approach which we evaluate in Section 5. In Section 6, we make specific propositions on how to improve our approach in the context of possible future work and close with some concluding remarks. 2 2.1

Preliminaries Graphs

A directed graph G = (V; E) is a tuple consisting of a vertex set V and an edge set E V V . With [n] := fi 2 N : 1 i ng we usually assume V = [n] for some n 2 N. A path of length n is a sequence of vertices v1; : : : ; vn 2 V that are connected by edges, i.e. (vi 1; vi) 2 E; i = 2 : : : n. If v1 = vn, the path is called a cycle. Directed graphs without cycles are called directed acyclic graphs (DAGs). For the rest of this work, when talking about graphs we always mean DAGs if not explicitly stated otherwise.

With Eˆ := ffv; wg : (v; w) 2 Eg, we call Gˆ := (V; Eˆ ) the underlying undirected graph of G. A graph G is connected if there exists a path between any two vertices in V and weakly connected if only Gˆ is connected. A vertex-labelled graph is a quadruple G = (V; E; S ; l) with labels over a finite alphabet S and a function l : V ! S ?, although we usually do not explicitly state the alphabet and the labelling function. An edge-labelled graph is defined analogously. The ratio of jEj and the graph’s maximum possible amount of edges is called its density.

Given any v 2 V , we call suc(v) = fw 2 V : (v; w) 2 Eg its set of successors and pre(v) = fw 2 V : (w; v) 2 Eg its set of predecessors. Furthermore, we denote a vertices’ in-degree by deg (v) = jpre(v)j and its out-degree by deg+(v) = jsuc(v)j. We refer to b(G) = jjVEjj as the average branching factor of G, being the average out-degree (and in-degree) of any vertex v 2 V . For some S V we call G[S] = (S; E[S]) with E[S] = f(v; w) 2 E : v; w 2 Sg the vertex-induced subgraph of G. 2.2

Workflow Graphs

represent SWFs as DAGs [ 16 ]: We set E = Ec [ Ed and omit the explicit mentioning of IN and OU T , because we are only interested in the underlying DAG’s structure, which can be described by its set of vertices and edges as introduced in the previous section. The problem of workflow similarity assessment has been attacked in various different ways. As we focus on graph-based approaches, we group the approaches outlined in this section wrt. their preservation of topology and structure and distinguish (1) local approaches that completely disregard the graphs’ structure, (2) global approaches that preserve the graphs’ structure (almost) entirely and (3) alternative approaches that aim to compromise between local and global by preserving substructures.

Local Measures. Considering only the workflows’ tasks, in [ 13 ], the Vector Space Model (VSM) as known from information retrieval [ 12 ] is adapted for workflow comparison. Given two vertex-labelled workflow graphs Gi = (Vi; Ei; Si; li) for i = 1; 2, each workflow is represented as a k-dimensional vector Di = (di;1; : : : ; di;k) with T = l1[V1] [ l2[V2] = ft1; : : : ; tkg being the set of all vertex labels in both workflows and di; j = jfv 2 Vi : l(v) = t jgj; j = 1::k being the amount of vertices in the i-th workflow that are labelled with t j. In a last step, the similarity is computed as the cosine similarity between both workflows’ vectors.

Another approach called Module Sets (MS), that is similar to the VSM in that it completely disregards substructures by perceiving a workflow graph only as a set of tasks (called modules in the referenced work), proved to be comparably reliable and fast [ 14 ]: Given SWF graphs Gi = (Vi; Ei) for i = 1; 2, all pairs of tasks in V1 V2 are compared (e.g. with regard to their label’s string edit distances).

simM : V1

V2 ! [0; 1]

R Based on simM, a maximum weight matching mw larity score is given by eq. (2).

V2 is computed and the simi

nnsimMS(V1;V2) =

å (v1;v2)2mw simM(v1; v2) In a last step, the similarity score is normalised over the workflows’ size using a modified Jaccard index: simMS(V1;V2) = jV1j + jV2j nnsimMS(V1;V2) nnsimMS(V1;V2) Global Measures. An alternative examined in the same work [ 14 ] that turned out to provide more stable results than module sets is to retain substructures by comparing path sets. Given a DAG G = (V; E), we denote the set of all paths of length n that start at a node with no inbound edges and end in a node with no outbound edges by Pn(G) (1) (2) (3) and the set of all such paths of arbitrary length by PS(G):

Pn(G) = f(v1; : : : ; vn) 2 V n : (vi 1; vi) 2 E; i = 2 : : : n; deg (v1) = deg+(vn) = 0g (4) jV j PS(G) = [ Pn(G) n=1 Given SWF graphs G1; G2 and their path sets PS1 := PS(G1); PS2 := PS(G2), the computation of the pairwise similarity score between paths (simP), the non-normalised (nnsimPS) and the normalised similarity score (simPS) are defined analogous to the module sets approach by using paths instead of modules. To compute the similarity of two paths P1 = v1; : : : ; vn; P2 = w1; : : : ; wm, a maximum weight non-crossing (mwnc) matching is computed on their vertices, i.e. for all (vi; w j); (vk; wl ) 2 mwnc P1 P2 it holds that k > i =) l > j and k < i =) l < j [ 8 ].

Another common graph distance measure is the graph edit distance. Given graphs G1; G2 and the three edit operations insert, delete and substitute (all applicable to edges and nodes), each with a certain cost assigned, the graph edit distance is defined as the minimum cost sequence of edit operations to transform G1 into G2 or vice versa [ 11 ]. Despite being optimal in terms of accuracy in finding structural differences between graphs, the graph edit distance’s time complexity is exponential, making it inappropriate for large graphs. Further, it turned out to be too strict for SWF similarity assessment and yields comparably bad results [ 14 ].

Alternative Approaches. In [ 14 ], it was shown that graph comparison algorithms that consider the graphs full structure appear to be too strict for fine grained assessment of SWF similarity, whereas the comparison of SWFs by their module sets turned out to provide convincing results. Comparing substructures and focusing on the execution order of the workflows’ tasks, on the other hand, turned out to lead to more reliable results, but is computationally expensive.

In [ 15 ], a new approach to SWF similarity search, that aims to compromise between local and global approaches, is introduced. Layer Decomposition (LD) performs a topological decomposition of a given SWF graph, i.e. it decomposes the graph into disjoint sets of vertices depending on the vertices relative position within their data flow strand. This is done by repeatedly removing all vertices with no inbound edges and their associated outbound edges. Given a graph G = (V; E), the i-th layer Li is defined recursively: (5) (6) (7) L1(G) = fv 2 V : deg (v) = 0g

Li(G) = fv 2 G [Li 1(G)] : deg (v) = 0g The ordered set LD(G) that contains all layers is called the graph’s layer decomposition. To compare two graphs G1; G2 by their layer decompositions LD1 := LD(G1); LD2 := LD(G2), the approach of computing nnsimLD and simLD is analogous to the path and module sets, except for the use of a maximum weight non-crossing matching between layers and a maximum weight matching between two layers’ vertices. LD is able to outperform other algorithms for SWF similarity assessment, including path sets, module sets and the graph edit distance [ 15 ].

The distinctive feature of layer decomposition and path set comparison is that the matching between two workflows’ tasks within the retrieved substructures is performed as a part of the similarity assessment, while the other approaches assume a global matching to be computed beforehand. Two workflows’ tasks are matched with respect to their execution order, because tasks in one workflow can only be mapped to tasks that are within the matched layer or path in the other workflow. Since the matching is non-crossing, the overall similarity depends on the extent to which both workflows overlap in their module sets but also on the similarity of the order the tasks appear in. This is important as two workflows might share a great amount of common tasks (i.e. they have similar module sets) but due to their differing execution order, the workflows might implement different functionalities.

In the related area of automata theory, an n-gram-based approach was introduced [ 17 ] in which a workflow with transition labels in S ? is perceived as a finite state automaton (FSA) M = (Q; S ; d ; q0; F) with states Q, an alphabet S , a transition function d : Q S ! Q, an initial state q0 2 Q and final states F Q. We define dˆ (q; w) as the state of M after reading a word w, starting in state q: dˆ : Q

S ? ! Q An n-gram is defined as a word w 2 S n with w = w1w2 : : : wn, s.t. there exists a state from which on the word w is recognized, ending in a state q 2 Q:

Sn(q) = fw 2 S n : 9r 2 Q : dˆ (r; w) = qg The idea is to represent an automaton by the union of all states’ n-grams and to measure two FSAs distance by the minimum sum of edit distances (the minimum cost sequence of insertions, deletions and substitutions for transforming one given string into another [ 18 ]) across all pairs of n-grams.

It was shown in [ 18 ] that this approach empirically outperforms certain structureand language-based algorithms.

In summary, it can be observed that it is important to partially retain a workflow’s structural topology and thus the execution order, but the increase in computational complexity should not be too high. Module and path set comparison, as well as the graph edit distance and layer decomposition have all been evaluated in the context of SWF comparison. In the following, we investigate the adaption of the concept of n-grams for SWFs. (8) (9) 4

Novel Approach

We propose a new approach that takes local graph patterns into account to compare graphs by measuring their commonalities regarding certain patterns. First, we introduce the notion of n-grams as used in our approach (similar to [ 17 ]). We assume S to be the set of all SWF graphs.

Definition 1 (n-gram). Given a graph G = (V; E) 2 S, we define the function NGSn : S ! P (V n) to compute a set of all paths of length n as

NGSn(G) = f(v1; v2; : : : ; vn) 2 V nj(vi; vi+1) 2 E; i = 1 : : : n 1g (10) with NGSn(G) being a set of n-grams of vertices.

Algorithm 2: Path n-gram input : Graph G = (V; E), n output: n-grams V n

Example 1 (n-grams). We illustrate the concept of n

grams by calculating all 3-grams for the graph on the right. To simplify the notation, we denote the tuples by concatenating their vertices. We obtain the following 3gram set:

NGS3(G) = fABD; ABE; BDF; BDK; BEI; CBD; CBE; DFI; DKIg F

A D

B K I

For the computation of a graph’s set of n-grams we propose algorithm 2, based on a modified iterative deepening depth-first search (IDDFS) [ 6 ]. The decision to propose an IDDFS-based algorithm instead of a Floyd-Warshall-based algorithm was made because the Floyd-Warshall algorithm is less efficient for graphs with a low density, which is, in turn, correlated with the graph’s average branching factor. It has been shown in [ 9 ], that most workflows have a rather low average branching factor, which also implies a low density. Our observations regarding our evaluation corpus, where the median density is approximately 0:3, support these findings. Our evaluation corpus (introduced in [ 14 ]) contains 1483 of the 2123 SWFs in the myExperiment repository, the largest public SWF repository to date.

To derive a measure from two graphs’ sets of n-grams, we first introduce a method to compare two n-grams. For the remainder of this section, we assume two graphs G1 = (V1; E1); G2 = (V2; E2) and refer to their accompanying n-gram sets by NGS1 := f (G1) V1n; NGS2 := f (G2) V2n with f 2 Sn2N NGSn.

Since an n-gram a = (a1; : : : ; an) 2 NGSn(G) for some graph G = (V; E) is a vector in the n-dimensional vector space V n, we refer to the i-th component of an n-gram by ai.

Definition 2 (n-gram similarity). Given two graphs G1 = (V1; E1); G2 = (V2; E2), a similarity function sim : V1 V2 ! [0; 1] and n-gram sets NGS1; NGS2, we define the similarity of two n-grams a 2 NGS1; b 2 NGS2 as simNG(a; b) = 1 n

å sim(ai; bi) n i=1 (11)

In our case, as well as in [ 15 ], the sim-function is derived from the normalised string edit distance between the tasks’ labels ai; bi. To retrieve a set of mappings of n-grams that are present in both sets wrt. the global maximum weight matching, we introduce the n-gram mapping operator:

Definition 3 (n-gram mapping operator). Given two n-gram sets NGS1; NGS2 and

a global maximum weight matching mw V1 V2 between both workflows’ tasks, we define the set of common n-grams as

NGS1 mw NGS2 := f(a; b) 2 NGS1

NGS2 : (ai; bi) 2 mw; i = 1::ng (12)

The similarity of two n-gram sets NGS1; NGS2 is calculated as the sum of the similarities of all n-gram mappings in NGS1 mw NGS2: Definition 4 (n-gram set similarity measure). Given two n-gram sets NGS1; NGS2, we define the non-normalised similarity as nnsimNGS(NGS1; NGS2) := simNG(a; b)

(13) å (a;b)2NGS1 mwNGS2

To normalise this similarity value, we use a modified Jaccard index as described in [ 14 ] and already used for layer decomposition [ 15 ]: Definition 5 (Normalisation). We define the normalised similarity of two n-gram sets NGS1; NGS2 as simNGS(NGS1; NGS2) := jNGS1j + jNGS2j nnsimNGS(NGS1; NGS2) nnsimNGS(NGS1; NGS2) (14)

We also considered another type of n-grams that encode each vertices’ neighbourhood (i.e. its predecessors, successors and itself) similar to [ 1 ]. Due to the low average branching factor in SWFs, this type of pattern turned out to be too restrictive and its detailed analysis is omitted here. 5 5.1

Evaluation Setup

For the evaluation of our novel approach, we make use of the gold standard of similarity ratings utilised in [ 15 ]: From a corpus of 1485 Apache Taverna workflows, 24 query workflows were chosen, each having another 10 workflows associated with it. 15 SWF experts from 6 different institutions were shown a query workflow along with one of its 10 associated workflows and had to decide whether they are very similar, similar, related or dissimilar. A fifth option was to choose unsure, though these ratings were not further considered in the evaluation. The result were multiple rankings of 10 workflows, all ordered by their similarity to the respective query workflow.

In a last step, a consensus ranking was computed from these rankings to aggregate the single experts’ ratings. We use this consensus to compare it to the rankings retrieved by our algorithm using the measures introduced in [ 4 ]: We represent a consensus ranking and a ranking obtained by running our algorithm as partial orders A? and A, respectively, and call a pair of workflows W1;W2 concordant if it appears in the same order in both rankings, i.e.:

C = f(W1;W2) : (W1 A W2 ^ W1 A? W2) _ (W2 A W1 ^ W2 A? W1)g (15) If the order in our predicted ranking differs from the consensus ranking, we call it discordant, i.e.:

D = f(W1;W2) : (W1 A W2 ^ W2 A? W1) _ (W2 A W1 ^ W1 A? W2)g (16) To compare a ranking obtained using our algorithm to the respective consensus ranking, we count the amount of concordant and discordant pairs and use the definition of ranking correctness (eq. (17)) and completeness (eq. (18)):

CR(A; A?) = jCj jDj jCj + jDj (17)

CP(A) = jCj + jDj j A? j (18)

The value of correctness is in [ 1; 1], with 1 indicating a perfectly negative correlation between A? and A as in this case jCj = 0. Conversely, +1 indicates a perfectly positive correlation. On the other hand, completeness measures the number of pairs whose elements can be distinguished by the experts but not by our algorithm. 5.2

Results

Figure 2a shows the results: The coloured bars show the algorithms’ ranking correctness, whereas the small black squares indicate ranking completeness. The black error bars visualise the standard deviation of ranking correctness. All approaches present in the diagrams use matchings of tasks based on the edit distance of the tasks’ labels.

For n 2 f3; 4; 5g, n-grams increasingly outperform path sets in terms of correctness, for n = 5 even with a slightly lower variance as can be seen in Figure 2a. Contrariwise, there is an overall decrease in completeness as n grows, making n-grams the only approach with a completeness far below 1. This is due to our definition of an n-gram: As opposed to e.g. path sets, we require both graphs to contain at least one path of length n, whereas in path sets, where graphs are compared by paths starting at a source and ending in a sink without any constraints on the paths’ lengths, even a single vertex per graph is sufficient to conduct a similarity assessment. The same applies to module sets, graph edit distance and layer decomposition, as neither of them imposes any requirements on the graphs’ size. To analyse the impact of this de- p5 0.58 0.35 sign specificity, we first define cover- p4 0.48 0.45 age as the ratio of a graph’s vertices, p3 0.33 0.61 that are part of at least one n-gram. p2 0.13 0.85 The histogram in Figure 3 shows the share of workflows in our corpus 0 0.1 [0,00].2 (0, 00.2.3) [0.02.,40.4) 0.[50.4, 0.60).6 [0.6,00.7.8) [00..88, 1] 0.9 1 whose coverage is within different ranges for n-grams of size n = 2::5. Fig. 3: A histogram depicting the coverage of For instance, in 45% of the work- the graphs in our corpus by n-grams of differflows in our corpus, between 80% ent sizes. and 100% of the vertices are covered by 4-grams, while 48% do not even contain a single 4-gram, as can be seen in the green and light blue bars of p4. Further, only 67% contain at least one 3-gram, which implies that only 45% of all possible pair-wise comparisons can be conducted using 3-grams. For 4-grams, the share further decreases to 27%.

Observations regarding our corpus of 1483 SWFs yield that the median number of (a) vertices is 5, (b) edges is 4 and (c) average branches per vertex is 0:8, i.e. the majority of SWF graphs are comparably small, prohibiting a further increase in correctness by the use of larger n-grams. This is the major problem that is inherent to our approach, as we have to reject a comparison whenever at least one of the workflows that are to be compared does not contain an n-gram of the required size, leading to a lower completeness.

An obvious remedy could be to let the value of n depend on the involved graphs’ size, instead of requiring both graphs to contain at least one path of a fixed length n. To this end, we define the maximum n that a graph contains at least one n-gram for as: maxn(G) = maxfn 2 N : NGSn(G) 6= 0/ g Given workflow graphs G1; G2, we let n1 = maxn(G1); n2 = maxn(G2) and compute the similarity as the (weighted) sum of similarity scores for all 2 n n? with n? = min(n1; n2): csim(G1; G2) :=

2 n? (n? + 1) n? å i simNGS(NGSi(G1); NGSi(G2)) 2 i=2 In eq. (20), the weighted sum is normalised by dividing it by the sum of all integers 2 from 2 to n?, with n? (n?+1) 2 being derived from the Gaussian summation formula. (19) (20)

Unfortunately, the great increase in completeness ( 0:92 for all n 2 f2; 3; 4; 5g) comes at the cost of bad correctness and variance as can be seen in Figure 2b. In fact, as noted in [ 4 ], there exists a negative correlation between correctness and completeness, that is due to the definitions of the evaluation measures, i.e. the correctness can be increased at the cost of more rejections and, in turn, a lower completeness. As a result, our initial approach performs better as it simply rejects workflow pairs it can not compare, instead of falling back to a lower n which would yield a less reliable similarity value. 6

Conclusion

In this work, we introduced n-grams in the context of SWF comparison as fixed-size local graph patterns that cover a vertices context to different extents, depending on the value of n. To compare two SWFs by their sets of n-grams, we adapted the similarity measure from [ 14,15 ] and were able to show that n-grams for n 2 f3; 4; 5g outperform path and module sets and are close to layer decomposition in terms of the mean correctness. However, we still see potential for various optimisations based on these first results on n-gram based SWF similarity.

First of all, we see the evaluation of other local graph patterns as a possible direction of further research. For instance, another interesting approach could be to compare only data-flow splits or to retrieve similarity values for multiple patterns and to assign a weight to each one. Regarding the approach presented in this work, we propose optimisations specifically dedicated to either correctness or completeness: Correctness. To further increase correctness, the matching strategy could be adjusted. In particular, a more local matching such as a maximum weight non-crossing matching within n-grams combined with a maximum weight matching across n-grams should lead to an increase in correctness as the similarity measure would become more sensitive to partly similar n-grams and thus would allow for a more fine-grained similarity assessment.

Also, some modules in an SWF perform trivial tasks with a low specificity for a certain context [ 14 ], which is particularly a problem for higher n, which yield smaller n-gram sets. To this end, an optimisation the correctness can be expected to benefit from is importance projection introduced in [ 14 ], where modules performing trivial tasks are removed from the graph prior to the similarity assessment. As the workflow graphs are possibly reduced in size by the use of importance projection, the n-gram retrieval algorithm (Algorithm 2) would also benefit in terms of run time, as its run time complexity has an exponential correlation with n.

Completeness. The most obvious way to increase the completeness would be to relax the definition of the n-gram similarity measure from Definition 2 to instead compare n-grams of possibly different sizes. Formally, we compute the n-gram set for some n1 = min(maxn(G1); n) (with maxn as defined in eq. (19)) for workflow G1 and some n2 for G2 respectively. The overall comparison process becomes similar to the path set comparison as presented in Section 3: To preserve the execution order within the compared n-grams, a maximum weight non crossing matching is computed within the n-grams and the overall similarity is the maximum weight matching of both workflows’ n-gram sets as described before regarding the correctness.

Altogether, we have shown n-grams to be a powerful technique for SWF similarity assessment, that still leaves room for further refinements to take on in future work.

1. Augsten , N. , Bo¨hlen, M. , Gamper , J.: Approximate matching of hierarchical data using pqgrams . In: PVLDB 2005 . pp. 301 - 312 . VLDB Endowment ( 2005 )

2. Bux , M. , Leser , U. : Parallelization in scientific workflow management systems . arXiv preprint arXiv:1303.7195 ( 2013 )

3. Chen , Y. , Zhao , X. , Xiao , C. , Zhang , W. , Tang , J.: Efficient and scalable graph similarity joins in mapreduce . TheScientificWorldJournal 2014 , 749028 ( 2014 )

4. Cheng, W., Rademaker , M. , De Baets , B. , Hu¨llermeier, E.: Predicting partial orders: ranking with abstention . In: ECML-PKDD . pp. 215 - 230 . Springer ( 2010 )

5. Kiepuszewski , B. , ter Hofstede , A., van der Aalst , W.: Fundamentals of control flow in workflows . Acta Informatica 39 ( 3 ), 143 - 209 ( 2003 )

6. Korf , R.E.: Depth-first iterative-deepening: An optimal admissible tree search . Artificial intelligence 27 ( 1 ), 97 - 109 ( 1985 )

7. Li , P. : Copasi time simulation of sbml model . http://www.myexperiment.org/ workflows/1202/versions/1/previews/full, accessed: 05 . 03 .2016

8. Malucelli , F. , Ottmann , T. , Pretolani , D. : Efficient labelling algorithms for the maximum noncrossing matching problem . Discrete Applied Mathematics 47 ( 2 ), 175 - 179 ( 1993 )

9. Ostermann , S. , Prodan , R. , Fahringer , T. , Iosup , R. , Epema , D. : On the characteristics of grid workflows . In: CoreGRID Symposium-Euro-Par . vol. 2008 , pp. 1 - 12 ( 2008 )

10. Prodan , R. , Fahringer , T. : Grid Computing , vol. 4340 . Springer Berlin Heidelberg, Berlin, Heidelberg ( 2007 )

11. Riesen , K. , Bunke , H.: Approximate graph edit distance computation by means of bipartite graph matching . Image and Vision Computing 27 ( 7 ), 950 - 959 ( 2009 )

12. Salton , G. , Wong , A. , Yang , C.S.: A vector space model for automatic indexing . Communications of the ACM 18 ( 11 ), 613 - 620 ( 1975 )

13. Santos , E. , Lins , L. , Ahrens , J.P. , Freire , J. , Silva , C.T.: A first study on clustering collections of workflow graphs . In: Provenance and Annotation of Data and Processes , pp. 160 - 173 . Springer ( 2008 )

14. Starlinger , J. , Brancotte , B. , Cohen-Boulakia , S. , Leser , U. : Similarity search for scientific workflows . Proceedings of the VLDB Endowment 7 ( 12 ), 1143 - 1154 ( 2014 )

15. Starlinger , J., Cohen-Boulakia , S. , Khanna , S. , Davidson , S.B. , Leser , U. : Layer decomposition: An effective structure-based approach for scientific workflow similarity . In: 10th International Conference on e-Science . vol. 1 , pp. 169 - 176 . IEEE ( 2014 )

16. Talia , D. : Workflow systems for science: concepts and tools . ISRN Software Engineering 2013 ( 2013 )

17. Wombacher , A. : Evaluation of technical measures for workflow similarity based on a pilot study . In: On the Move to Meaningful Internet Systems 2006 : CoopIS, DOA , GADA, and ODBASE, pp. 255 - 272 . Springer ( 2006 )

18. Wombacher , A. , Li , C. : Alternative approaches for workflow similarity . In: Services Computing (SCC) , 2010 IEEE International Conference on. pp. 337 - 345 . IEEE ( 2010 )

19. Zhao , X. , Xiao , C. , Lin , X. , Wang , W. : Efficient graph similarity joins with edit distance constraints . In: 2012 IEEE International Conference on Data Engineering (ICDE 2012 ). pp. 834 - 845 ( 2012 )