Introduction

Keyword Search in Workflow Repositories with Access Control

Susan Davidson

Soohyun M. Lee

Julia Stoyanovich

susan

soohyunl

jstoy}@cis.upenn.edu

0 0 Computer and Information Science Department University of Pennsylvania , Philadelphia, PA , USA

A number of on-line repositories of scientific workflows are emerging. These repositories enable sharing and reuse of workflows, and aid in the design of new workflows. The growing size of these repositories, the complex hierarchical structure of the workflows, and the need to incorporate access control mechanisms make information discovery in these repositories an interesting challenge. This paper formalizes keyword search in repositories of hierarchical workflows. We start by defining what it means for a single hierarchical workflow to match a keyword query while accounting for access control, and discuss options for displaying the resulting matches within the workflow. We extend this to search over workflow repositories, by proposing various ranking semantics that build on techniques from XML search and from information retrieval, and adapting them to our setting.

Introduction

Scientific workflows are increasingly important within the life sciences, where users deal with large amounts of data and perform sophisticated analyses. Workflows are an attractive paradigm for scientists because they make analyses repeatable and re-usable. A workflow is typically represented as a directed graph whose nodes represent analysis steps or modules and edges represent potential dataflow between modules. A module implements a certain functionality by transforming its data inputs into data outputs.

Workflow development platforms typically come equipped with libraries of pre-defined modules, which workflow designers may combine to implement a new analysis. Since workflows may be complex, sometimes consisting of hundreds of modules, it is natural to allow a sub-graph of the workflow graph – a sub-workflow – to be encapsulated in a composite module. A composite module implements the functionality of the sub-workflow, but reveals implementation details only when expanded. This naturally leads to a hierarchical workflow structure.

As an example, consider the hierarchical workflow in Fig. 1(a). This workflow, labeled W1, estimates individual susceptibility to genetic disorders based on genome-wide SNP array data. The input to W1, indicated by node I1, is a set of single nucleotide polymorphisms (SNPs), ethnicity, lifestyle, family history, and symptoms. The output of W1 is a prognosis, indicated by node O1. The first SNP disease I1 W1

M1 M2 M8 O1 τ τ τ W2

SNP PubMed OMIM

W3 parse W4 risk

M5 M9

τ join M6 prognosis

OMIM W5 OpaMrsIMe

M7 module in W1, named M1, determines a set of disorders the patient is genetically susceptible to based on the input SNPs and ethnicity information. The second module, M2, refines the set of disorders for which the patient is at risk, based on lifestyle, family history, and symptoms. Module M8 computes the prognosis. Modules M1, M2, M4, and M8 are composite, and expand to sub-workflows that implement the required functionality (indicated by τ edges in Fig. 1(a)). The presence of such composite modules gives rise to an expansion hierarchy. The expansion hierarchy for W1 is presented in Fig. 1(b).

On-line repositories of scientific workflows, such as myExperiment.org [ 11 ], are emerging in support of the sharing and re-use of workflows. To enable discovery of workflows of interest, several of these repositories support tagging — a workflow may be tagged by its author as well as by other users of the system. Workflow W1 in our running example is tagged with “SNP” and “disease”. Fig. 1(a) shows tags associated with workflows and modules as call-outs.

In a recent work [ 12 ] we considered how tags may be used to enable browsing of workflow repositories such as myExperiment.org. In the browsing scenario, the user is not fully aware of his information need; he is exploring the repository to better formulate his information need and identify items of potential interest. In this paper we shift focus, and consider how tags may be used in search — a task performed by a user who is well-aware of his information need, and is looking for objects that satisfy that need precisely. For example, a user looking for workflows that make disease susceptibility predictions based on SNP information may pose a keyword query Q1 = {“SNP”, “disease”} against the repository, expecting that workflows such as W1 be retrieved. If the user’s task is implemented fully by some workflow in the result, he may re-use the workflow without further modification.

However, users may also want to develop new functionality, possibly re-using existing workflows as building blocks. Suppose that the user wants to build a workflow that combines SNP-based disease susceptibility hypotheses with information about potentially affected genes, and retrieves the pathways in which these genes participate. Realizing that part of his task deals with SNP look-ups in OMIM, he may issue the query Q2 = {“SNP”, “OMIM”}.

Q2 matches workflow W2, and so W2 should be returned as an answer. However, the user may also want to understand how to re-use W2 in the scope of a larger workflow. Therefore, it is also valuable to return W1 as a result for Q2, and to show how W2 is used in the context of W1. Understanding this re-use context is important since it is often easier to learn by example, e.g., how to format the input parameters and what to expect as output.

Repositories of hierarchical workflows, while giving rise to exciting re-use opportunities, also bring about interesting challenges with respect to security. Security considerations arise naturally in domains where data objects are attributable to individual users (their authors) and have intrinsic intellectual or commercial value. We take the point of view that any information discovery approach should be mindful of these considerations, and develop a model of search in repositories of hierarchical workflows that incorporates access control.

Note that our approach leverages user-assigned tags to identify matching workflows and modules. Tags are thought to provide the emergent bottom-up semantics in many datasets on the Social Web, and have been used for search and data exploration in, e.g., Delicious, Flickr, and myExperiment.org, the largest public repository of scientific workflows. Note, however, that our approach is not specific to tags, and that it can also use the top-down semantics prescribed by ontologies or controlled vocabularies instead of, or in addition to, tags. Related Work. Our work builds on the WISE search engine over repositories of hierarchical workflows presented in [ 9 ]. Following the intuition of [ 9 ], we return a portion of the workflow that is both informative, in that it shows the expansion and dataflow relationships necessary to understand a keyword match, and concise, in that only the modules and edges necessary to understand the match are returned. However, we extend the model of [ 9 ] in several important ways. First, we formalize the ideas, leveraging hierarchical workflows with τ expansion. Second, our characterizations of informativeness and conciseness are different, in that we always make the context of the match w.r.t. the top-level workflow explicit. Third, we discuss what the result is when a query matches a workflow in multiple ways. Fourth, we incorporate access control.

Our work is also related to keyword search over tree-structured data (e.g., XML) [ 1, 4, 6–8, 13 ], as well as XML access control [ 2, 3 ]. We compare and contrast our work to these in the following sections.

This paper makes the following contributions. We formalize keyword search over hierarchical workflows, and define what it means for a workflow to match a keyword query, while also accounting for access control specifications (Section 2). We discuss various options for identifying and displaying one or more matches per workflow (Section 3). We consider search over workflow repositories, and propose various ranking semantics that build on techniques from XML search and from information retrieval, and adapt them to our setting (Section 4).

Model

Workflows and Keyword Search A workflow is hierarchical, in the sense that a module may be composite and expand to a sub-workflow. We start by defining a simple workflow in which composite modules are not expanded, and then extend to a system of workflow specifications that expand composite modules.

Definition 1. A simple workflow W is a labeled directed acyclic graph G = (V ∪ {I, O}, E) whose set of nodes, V , models the modules of the workflow. The set of edges, E, models potential dataflow between modules v ∈ V as well as input to and output from the workflow (I → v, v → O). Every v ∈ V must be on a path from I to O and has a module name, name(v).

For example, the box in Fig. 1(a) named W1 is a simple workflow. The input to W1 is a node labeled I1, and the output of W1 is a node labeled O1. The workflow consists of three modules, M1, M2, and M8. Each of these modules is composite, and may be expanded.

Definition 2. A workflow S is a pair (W, τ ), where W is a finite set of simple workflows and τ is a (partial) expansion function that maps some of the module names in the workflows in W to simple workflows in W.

Note that if a module is used multiple times in a workflow, i.e., v1 6= v2 but name(v1) = name(v2), its τ expansion will be the same in each occurrence. From here on, we will use N to denote either a workflow or a module, M to denote a module name, modules(S) to refer to the set of module names in the workflows in W for a workflow S, and wf(N ) to refer to the workflow in which N appears (if N is a module) or the workflow itself (if N is a workflow). Returning to Fig. 1(a), τ (M1) = W2, τ (M2) = W3, τ (M4) = W5, and τ (M8) = W4; wf(W1)=W1, wf(M1)=W1, and wf(M4)=W2.

Definition 2 allows a simple workflow that is part of two different workflows to have different expansion functions defined over its module names. For example, W2 in the workflow W1 has τ1(M4) = W5. However, in a different top level workflow W0 it could be that τ0(M4) is not defined, or maps to a different subworkflow. In a repository of workflows, we will disallow such inconsistencies. Definition 3. A repository of workflows is a set of workflows such that, for any two workflows W1 and W2, if M ∈ modules(W1) ∩ modules(W2) then either (1) τ1(M ) and τ2(M ) are both undefined; or (2) τ1(M ) = τ2(M ).

We now define the notion of an expansion hierarchy for a workflow. Definition 4. Given a workflow S = (W, τ ), if M is a module name in a workflow W ′ and τ ′(M ) = W ′′, we say that W ′′ is a sub-workflow of W ′. The expansion hierarchy of S is the directed graph denoted H(S) whose nodes are the workflows in W, and in which there is an edge from W ′ ∈ W to W ′′ ∈ W if and only if W ′′ is a sub-workflow of W ′. Each edge is labeled with the module name M whose expansion creates the sub-workflow relationship. A workflow is well-formed if its expansion hierarchy is a tree.

The expansion hierarchy for our running example is shown in Fig. 1(b). Since the hierarchy is a tree, W1 is well-formed. To simplify the discussion, we assume in the rest of this paper that all workflows are well-formed. Our results can be easily extended to expansion hierarchies that are DAGs. Extending our work to the (less common) recursive workflow definitions is an interesting open problem.

A module or workflow N is associated with a set of keywords, denoted keywords(N ), e.g., keywords(W1)= {“SNP”, “disease”}, keywords(M6)= {“join”}. Since the expansion of a module is its sub-workflow definition, we require that keywords(M )= keywords(τ (M )) whenever τ (M ) is defined, e.g., keywords(M1)= keywords(W2)= {“SNP”}. To reduce clutter, we omit the keywords associated with modules when they expand to a sub-workflow, and show only the keywords associated with the sub-workflow.

A keyword query Q is a set of keywords {k1, ...kn}. Informally, a query matches a workflow S if each ki can be found somewhere in S.

Definition 5. A keyword k matches a workflow S iff k ∈ keywords(S) or k ∈ keywords(M ) for some M ∈ modules(S). A query Q= {k1, ...kn} matches S iff each ki matches S. 2.2

Access Control and Search Inspired by conventions in Unix file systems, a workflow S has two actions associated with it: read, which allows an authorized user to perform a match against keywords(S); and expand, which allows an authorized user to see the structure of S, i.e., the module names and the dataflow links between the modules. Similarly, a module named M has two actions associated with it: read, which allows an authorized user to perform a match against keywords(M ); and expand, which allows an authorized user to expand its definition to a sub-workflow.

It is helpful to think of the following analogy between workflows and modules, and Unix file system objects. A workflow corresponds to a directory, and having expand permissions on a workflow is analogous to having permission to enter the directory. A module corresponds to a file, and read access on a module is analogous to read permissions on a file. A module with a defined τ expansion corresponds to a symbolic link to a Unix directory, and having expand permissions on such a module is analogous to having permission to enter the directory.

In an access control specification for a workflow, read and expand privileges are associated with a set of user groups for each module and sub-workflow N ; the special group world includes all users. A user u is authorized for an action on N iff u is in a group associated with N for that action.

We denote the set of users who are authorized to read (expand) N as readers(N ) (expanders(N )). Authorizations associated with a sub-workflow are inherited by the modules that expand to it, i.e., readers(M )= readers(τ (M )) and expanders(M )= expanders(τ (M )), whenever τ (M ) is defined.

Definition 6. An access control specification for a workflow S is an assignment of readers(N ), expanders(N ) for each N ∈ S ∪ modules(S).

In a repository of workflows we require that the access control specification be consistent across workflows.

Definition 7. A repository R of workflows, together with their access control specifications, is access control-consistent iff for any user u and for any two workflows W1 and W2 in R, if M ∈ modules(W1) ∩ modules(W2), then (1) u ∈ readers(M ) in both W1 and W2 or in neither W1 nor W2, and (2) u ∈ expanders(M ) in both W1 and W2 or in neither W1 nor W2.

We also revise the notion of a match to take access control into account: Definition 8. A keyword k matches a workflow S under access control specification A for user u iff 1. k ∈ keywords(S) and u ∈ readers(S); or 2. k ∈ keywords(M ) for some M ∈ modules(S), u ∈ readers(M ), and u ∈ expanders(W ) for every sub-workflow W on the path from S to wf(M ) in the expansion hierarchy H(S).

A query Q= {k1, ...kn} matches S iff each ki matches S.

Consider again the example in Fig. 1(a), and suppose that user u has the following access privileges: u ∈ readers(W1)∩expanders(W1), u ∈ expanders(W3), u ∈ readers(W5)∩expanders(W5), and u ∈ readers(M5). The query Q1 = {“SNP”, “parse”} matches W1 for u, because “SNP” ∈ keywords(W1) and “parse” ∈ keywords(M5). Both keywords are accessible to u: keywords(W1) is accessible because u ∈ readers(W1), and keywords(M5) is accessible because u ∈ expanders(W1) ∩ expanders(W3) ∩ readers(M5). However, Q2 ={“OMIM”} does not match W1 for u because u 6∈ readers(W3), preventing a match on keywords(W3). Further, while u ∈ readers(W5), he cannot reach W5 from W1, because u 6∈ expanders(W2). 3

Result Presentation for a Single Workflow

Given a workflow that matches the query, the goal of this section is to define a query result that is concise, informative, and non-redundant. First, we formalize the notion of a match assignment, which maps query keywords to nodes in the expansion hierarchy. Next, we define a projection over the expansion hierarchy, which contains only the relevant portion of the hierarchy, and is a concise and informative representation of the match assignment. We then eliminate redundant projections from the set of projections for each match assignment. Finally, we show how non-redundant projections can be used to produce a query result. Definition 9. A match assignment, μ, is a function mapping each keyword ki to the workflow or module name N ∈ S ∪ modules(S) such that ki ∈ keywords(N ).

Returning to our example, W1 matches the query Q = {“SNP”, “OMIM”} since “SNP” ∈ keywords(W1) and “OMIM” ∈ keywords(M2), yielding a match assignment μ1(“SNP”)= W1 and μ1(“OMIM”)= M2. Importantly, keyword matches occur only at the top-level workflow (W1 in our case) or at a module. In our example, the keyword “OMIM” is in keywords(W3) as well as in keywords(M2), but only M2 will be used in a match assignment; the match for W3 will be “caught” at M2 since keywords(M2)= keywords(τ (M2)) = keywords(W3).

Each keyword query that matches workflow S (as per Definition 5) will generate a non-empty set of match assignments {μ1, ..., μn}, since each keyword may occur in the workflow multiple times. In our running example, there are a total of nine match assignments, listed in Fig. 2(a): “SNP” is among the keywords for W1, M1 and M3, and “OMIM” is among the keywords of M2, M4, and M7.

Note that a match assignment may map multiple keywords to the same workflow or module N . For example, a valid match assignment for Q = {“SNP”, “disease”} is μ(“SNP”) = W1 and μ(“disease”) = W1.

In an access-controlled workflow, the set of match assignments for a user is restricted to those for which he has appropriate privileges (recall Definition 8). Definition 10. A match assignment μ(k1 = N1), . . . μ(kn = Nn) is valid for user u in an access-controlled workflow S iff u ∈ readers(Ni) and u ∈ expanders(W ) for every workflow W on the path from S to wf(M ) in the expansion hierarchy H(S).

This notion of match is used in most approaches to querying under access control in XML, see, e.g., [ 3 ] and the strongest approach presented in [ 2 ].1

Having defined a match assignment in workflow S, we next need to transform it into a query result. We start by placing the match assignment into the context of the expansion hierarchy H(S) (see Definition 4).

Definition 11. Given a workflow S that matches query Q, and a corresponding match assignment μ, let R= { wf(M ) | ∃ki ∈ Q. μ(ki)=M } be the relevant sub-workflows of S. Let Wi be the least common ancestor in H(S) of nodes that appear in R. Then the projection of H(S) over R is the tree T formed by the union of paths from Wi to Wj in H(S), for all Wj ∈ R.

Fig. 2 lists all match assignments for the query Q = {“SNP”, “OMIM”}, and shows the corresponding projections of H(W1). Nodes and edges in the expansion hierarchy that are projected out are greyed out; nodes at which a keyword match occurred are presented in red font and denoted by a “*”. Note that, while there are nine distinct match assignments for the query, there are only five distinct projections. To see why, consider the projection in Fig. 2(b) that corresponds to two match assignments μ1 and μ2: “OMIM” is matched by M2 and maps to W1 in the expansion hierarchy, and “SNP” is matched by either W1 or M1, both also mapping to W1 in the expansion hierarchy. 1 We do not consider the weaker (and nonstandard) notions discussed in [ 2 ], which allow matches to occur below “unexpandable” terms under certain conditions.

W1 W3

M2 M8

W M4 2 W * 5

W * 1

M2 M8 W3

W4 (d) T3 : μ6 (e) T4 : μ7, μ8

M1 W M4 2

W *

1 W3

M2 M8

W4 W5 (b) T1 : μ1, μ2

W1 M1

M2 M8 MW42* W3 W * 5 (f) T5 : μ9

W4 µ1("SNP") = W1,µ1("OMIM") = M2 µ2("SNP") = M1,µ2("OMIM") = M2 µ3("SNP") = W1,µ3("OMIM") = M4 µ4("SNP") = M1,µ4("OMIM") = M4 µ5("SNP") = M3,µ5("OMIM") = M2 µ6("SNP") = M3,µ6("OMIM") = M4 µ7("SNP") = W1,µ7("OMIM") = M7 µ8("SNP") = M1,µ8("OMIM") = M7 µ9("SNP") = M3,µ9("OMIM") = M7 (a) match assignments W *

1 M1

M2 M8 MW42* W3

W5 (c) T2 : μ3, μ4, μ5

M1 W * M4 2

Some of the sub-workflows appearing as nodes in the projection T may not appear in R, but are necessary to show the connection between the match assignments, and so must be included to make the result informative. E.g., W2 is not part of the match assignment, yet it is part of the projection in Fig. 2(e).

Projecting over the expansion hierarchy for a match assignment ultimately makes the query result concise, because T “trims” H(S): T may be rooted at a sub-workflow of S rather than at S, trimming off top levels of H(S) (e.g., Fig. 2(d) and 2(f)); subtrees of H(S) rooted at leaves of T are also trimmed (e.g., all subtrees rooted at W1 are trimmed in Fig. 2(b)).

Consider now all projections over H(W1) in Fig. 2, and note that some are more concise than others: Projection T2 in Fig. 2(c) is rooted at W1 but also contains W2, while projection T3 in Fig. 2(d) contains a complete match at W2. In fact, T3 is a subtree of T2. Similarly, observe that T5 is a subtree of T4. We use these relationships to define non-redundant projections.

Definition 12. Given query Q and workflow S, let T be the set of all projections of H(S) for Q. A projection T ∈ T is non-redundant if there does not exist another projection T ′ ∈ T , such that T 6= T ′ and T ′ is a sub-tree of T .

Non-redundant projections of H(W1) for Q are T1, T3, and T5, in Fig. 2. Definition 12 is similar to that used in search and ranking over XML documents (e.g. [ 6, 7 ]), where it is referred to as “conciseness”. We use “conciseness” w.r.t. a single projection, and “non-redundancy” w.r.t. a set of projections.

The node corresponding to the top-level workflow, W1, is not always part of the projection; e.g., this is the case in Fig. 2(d). Since a user interested in re-using workflows will want to see how the relevant workflows are invoked, we need to provide re-use context. To improve informativeness of a result, we return a projection that is rooted at the top-level workflow.

Definition 13. Let T = (V, E) be a projection of H(S) over R and let W be the root of T in H(S). Then the informative projection T ′ is formed by adding to T the path in H(S) from S to W .

We are now ready to define the representation of a query result: Definition 14. Let T = (V, E) be an informative projection of H(S) over R for workflow S = (W, τ ). Then the query result is the specification S′ = (V, τ ′), where τ ′(M ) = τ (M ) for all M that appear as labels on edges in E and undefined otherwise.

To generate a result for a set of match assignments, we compute the set of projections T , remove redundant projections from this set, resulting in a set of projections T ′, and then compute a query result for each T ∈ T ′. One option is to present these results to the user individually. For example, trees T1, T3 and T5 in Fig. 2 are non-redundant, and the corresponding query results are shown in Fig. 3(a), 3(b), and 3(c). Another option is to first generate a union of the projections in T ′, and then produce a single query result for this union. Note that, because we require that all projections be informative (Definition 13), they will all have the same root — the top-level workflow S. The union of T1, T3 and T5 is T5, and the corresponding query result is shown in Fig. 4(a). Fig. 4(b) is the combined query result for W2, which is a union of two results: one matching W2 (top level), and another matching “SNP” at W2 and “OMIM” at M7.

Before proceeding to the question of result generation from a set of matches, it is worth understanding the relationship between our notion of keyword search and that of keyword search in XML trees. Since the dataflow structure of a workflow is not used in matching, one can think of the workflow as an XML tree that is formed from the workflow expansion hierarchy and also includes atomic modules as children of the sub-workflow to which they belong. Each node in the XML tree is labeled with the concatenation of its keywords; a node matches a query if its label contains one or several query terms as a substring. Similarly to our scenario, in XML keyword search, the result returned is a subtree rooted at the least common ancestor of the keyword matches, whenever the result does not have a proper subtree which is also a keyword match (i.e., is non-redundant). In our scenario, the result is further trimmed by deleting subtrees that do not contain matches (i.e., is concise). Finally, our result is augmented with the path to the root from the least common ancestor (i.e., is informative). 4

Search and Ranking in Workflow Repositories

We now focus on search in workflow repositories, where queries will likely return many matches, motivating ranking. Suppose that a workflow repository is given, and consists of five access-controlled workflows (per Definition 10) W1, W2, W3, W4, and W5. Suppose that user u issuing query Q = {“SNP”, “OMIM”} is among the readers and expanders of all workflows and modules in the repository. Workflows W1 and W2 match Q, and one option is to present a ranked list of τ

W2 SNP OMIM

M3 M4

W1 τ

W2 SNP

M4 M1 M2 M8

O1 M4 (e) W22 τ

W5 OMIM

M7 (c) W13 τ O2 W5 OMIM

M7 SNP I1 W1

M1 M2 M8

O1 (a) W11

SNP OMIM

M1 M2 M8

I1 O1

(b) W12

SNP I2 W2

SNP

OMIM O2

M4 (d) W21

SNP I2 W2 SNP

M3 query results in Fig. 3. Another presentation option is to return the union of all query results for each workflow, and to then rank these combined results. Fig. 4 presents combined query results for W1 and W2. We discuss both options below.

Consider again query results in Fig. 3, and note that there is a correspondence between some of these results. In particular, Fig. 3(b) and 3(d) refer to the same actual match that occurs in W2, but in Fig. 3(b) this match is presented in the context of W1. A similar relationship holds between results in Fig. 3(c) and 3(e). We argued in the introduction that both types of results are of potential interest to the user. However, it is certainly reasonable to prioritize one type of a result over another in a ranked list, using a measure of quality.

Given our conjunctive semantics of matching, all query results are exact matches, and are therefore relevant. However, larger results are more difficult for the user to understand, and so smaller results may be preferred. A natural way to quantify result quality is to consider the number of modules in the result. Definition 15. Given an access-controlled workflow S, a query Q, and a result Si for Q in S, the size of Si, denoted size(Si), is the number of modules in Si.

Returning to our example in Fig. 3, we have: size(W11) = 3, size(W12) = 5, size(W13) = 6, size(W21) = 2, and size(W22) = 3. Using size as a measure of quality gives rise to the following ranking (rank position is recorded next to each entry): (W21, 1), (W22, 2), (W11, 2), (W12, 4),(W13, 5).

Another natural way to quantify result quality is by using hierarchical workflow structure, and we define one possible such measure below.

Definition 16. Given an access-controlled workflow S, a query Q, and a query result Si for Q in S, the depth of Si, denoted depth(Si), is the number of τ expansions in Si. W1

M1 M2 M8

I1 SNP

τ OMIM O1

W2 SNP OMIM

M3 M4 τ

W5 OMIM

In our example in Fig. 3, we have: depth(W11) = 0, depth(W12) = 1, depth(W13) = 2, depth(W21) = 0, and depth(W22) = 1. This gives rise to the following ranking: (W11, 1), (W21, 1), (W12, 3), (W22, 3), (W13, 5). Size and depth may be combined into a single quality measure, using, e.g., a (weighted) average of the two scores.

Another presentation option is to return the union of all query results for each workflow, and to then rank these combined results. Fig. 4 presents combined query results for W1 and W2. One natural way to discriminate between multiple combined results is to consider what portion of each result is query-relevant. We refer to this measure as specificity, and define it below. Our definition is in-line with the specificity measure proposed by INEX and used in XML retrieval [ 5 ]. Definition 17. Given an access-controlled workflow S, a query Q, and a combined query result S′ for Q in S, the specificity of S′, denoted spec(S′), is the proportion of query-relevant modules in S′ to size(S′).

For the combined results in Fig. 4, spec(W1′) = 56 and spec(W2′) = 1, and so W2′ would rank higher than W1.

′

Specificity treats occurrences of different keywords equally. However, there are cases when it is natural to assign higher importance to certain keywords. For example, “SNP” may be a very common keyword in the repository, while “OMIM” may be uncommon. A combined results containing many uncommon keywords should receive a high score. This semantics may be represented using, e.g., the popular tf-idf weighting scheme from information retrieval [ 10 ].

Keyword search and ranking over XML has been considered, e.g., [ 1, 4 ]. In [ 1 ], XML documents have a tree structure. Queries consist of keywords appearing at leaf nodes and of labels appearing at intermediate tree nodes. Subtrees of the document are returned, ranked on a combination of tf-idf-style weights and subtree structure, e.g., its size and the ancestor-descendant relationships between query-relevant nodes. While the semantics of a match are somewhat different in our case, where keywords may be assigned to tree nodes other than leaves, the ranking options of [ 1 ] may be applicable. In [ 4 ], a weight of each keyword is computed using authority propagation in hyperlinked XML documents. Keyword proximity is taken into account when computing the rank of a result. We formalized keyword search in access-controlled repositories of hierarchical workflows. We defined what it means for a single hierarchical workflow to match a keyword query, and discussed ranking semantics for workflow repositories.

Much interesting follow-up work remains to be done. The focus of this paper was on defining the formalism, and on making connections to search and ranking in XML and in information retrieval. A natural next step is to develop efficient query processing and ranking algorithms. This is non-trivial, because of the interesting interactions between query results and access control. For example, to support tf-idf ranking over a repository, one may pre-compute tf-idf weights for all (keyword, workflow) pairs. In an access-controlled setting, each user may potentially have a different view over the repository, requiring that tf-idf weights be computed per (user, keyword, workflow). This data structure may quickly become too large to compute and maintain, motivating optimizations. 6

Acknowledgments

This work was supported in part by NSF grants IIS-0803524, IIS-0629846, IIS0915438, CCF-0635084, IIS-0904314, and by grant 0937060 to the Computing Research Association for the CIFellows Project.

Cohen et al. XSEarch: A semantic search engine for XML . In

VLDB

, 2003 .

Damiani et al. A fine-grained access control system for XML documents . TISSEC , 5 , 2002 .

Fan ,

C. Y.

Chan , and

M. N.

Garofalakis . Secure XML querying with security views . In SIGMOD , 2004 .

Guo et al. XRANK: Ranked keyword search over XML documents . In SIGMOD , 2003 .

Kazai and

Lalmas . Inex 2005 evaluation measures . In INEX , 2005 .

Li et al. Effective keyword search for valuable LCAs over XML documents . In CIKM , 2007 .

Li ,

Yu , and

H. V.

Jagadish . Schema-free XQuery . In VLDB , 2004 .

Liu ,

Walker , and

Chen. XSeek : A semantic XML search engine using keywords . In VLDB , 2007 .

Lui ,

Shao , and

Chen . WISE: Searching workflow hierarchies . VLDB , 2010 .

10. C. D. Manning , P. Raghavan , and H. Schu¨tze. Introduction to Information Retrieval. Cambridge University Press, 2008 .

11. D. D. Roure , C. A.

Goble , and R.

Stevens . The design and realisation of the myExperiment virtual research environment for social sharing of workflows . FGCS , 25 ( 5 ), 2009 .

12. J. Stoyanovich , B.

Taskar , and S.

Davidson . Exploring repositories of scientific workflows . In WANDS , 2010 .

13.

Xu and

Papakonstantinou . Efficient keyword search for smallest LCAs in XML databases . In SIGMOD , 2005 .