Keyword Search in Workflow Repositories with Access Control

Keyword Search in Workflow Repositories with Access Control SusanDavidson Computer and Information Science Department University of Pennsylvania

Philadelphia PA USA

SoohyunMLee soohyunl@cis.upenn.edu Computer and Information Science Department University of Pennsylvania

Philadelphia PA USA

JuliaStoyanovich Computer and Information Science Department University of Pennsylvania

Philadelphia PA USA

Keyword Search in Workflow Repositories with Access Control 41DA41F5D4E6D82DD715F2B2AB1B6899 GROBID - A machine learning software for extracting information from scholarly documents

A number of on-line repositories of scientific workflows are emerging. These repositories enable sharing and reuse of workflows, and aid in the design of new workflows. The growing size of these repositories, the complex hierarchical structure of the workflows, and the need to incorporate access control mechanisms make information discovery in these repositories an interesting challenge. This paper formalizes keyword search in repositories of hierarchical workflows. We start by defining what it means for a single hierarchical workflow to match a keyword query while accounting for access control, and discuss options for displaying the resulting matches within the workflow. We extend this to search over workflow repositories, by proposing various ranking semantics that build on techniques from XML search and from information retrieval, and adapting them to our setting.

Introduction

Scientific workflows are increasingly important within the life sciences, where users deal with large amounts of data and perform sophisticated analyses. Workflows are an attractive paradigm for scientists because they make analyses repeatable and re-usable. A workflow is typically represented as a directed graph whose nodes represent analysis steps or modules and edges represent potential dataflow between modules. A module implements a certain functionality by transforming its data inputs into data outputs.

Workflow development platforms typically come equipped with libraries of pre-defined modules, which workflow designers may combine to implement a new analysis. Since workflows may be complex, sometimes consisting of hundreds of modules, it is natural to allow a sub-graph of the workflow graph -a sub-workflow -to be encapsulated in a composite module. A composite module implements the functionality of the sub-workflow, but reveals implementation details only when expanded. This naturally leads to a hierarchical workflow structure.

As an example, consider the hierarchical workflow in Fig. 1(a). This workflow, labeled W 1 , estimates individual susceptibility to genetic disorders based on genome-wide SNP array data. The input to W 1 , indicated by node I 1 , is a set of single nucleotide polymorphisms (SNPs), ethnicity, lifestyle, family history, and symptoms. The output of W 1 is a prognosis, indicated by node O 1 . The first On-line repositories of scientific workflows, such as myExperiment.org [11], are emerging in support of the sharing and re-use of workflows. To enable discovery of workflows of interest, several of these repositories support tagging -a workflow may be tagged by its author as well as by other users of the system. Workflow W 1 in our running example is tagged with "SNP" and "disease". Fig. 1(a) shows tags associated with workflows and modules as call-outs.

In a recent work [12] we considered how tags may be used to enable browsing of workflow repositories such as myExperiment.org. In the browsing scenario, the user is not fully aware of his information need; he is exploring the repository to better formulate his information need and identify items of potential interest. In this paper we shift focus, and consider how tags may be used in search -a task performed by a user who is well-aware of his information need, and is looking for objects that satisfy that need precisely. For example, a user looking for workflows that make disease susceptibility predictions based on SNP information may pose a keyword query Q 1 = {"SNP", "disease"} against the repository, expecting that workflows such as W 1 be retrieved. If the user's task is implemented fully by some workflow in the result, he may re-use the workflow without further modification.

However, users may also want to develop new functionality, possibly re-using existing workflows as building blocks. Suppose that the user wants to build a workflow that combines SNP-based disease susceptibility hypotheses with information about potentially affected genes, and retrieves the pathways in which these genes participate. Realizing that part of his task deals with SNP look-ups in OMIM, he may issue the query Q 2 = {"SNP", "OMIM"}.

Q 2 matches workflow W 2 , and so W 2 should be returned as an answer. However, the user may also want to understand how to re-use W 2 in the scope of a larger workflow. Therefore, it is also valuable to return W 1 as a result for Q 2 , and to show how W 2 is used in the context of W 1 . Understanding this re-use context is important since it is often easier to learn by example, e.g., how to format the input parameters and what to expect as output.

Repositories of hierarchical workflows, while giving rise to exciting re-use opportunities, also bring about interesting challenges with respect to security. Security considerations arise naturally in domains where data objects are attributable to individual users (their authors) and have intrinsic intellectual or commercial value. We take the point of view that any information discovery approach should be mindful of these considerations, and develop a model of search in repositories of hierarchical workflows that incorporates access control.

Note that our approach leverages user-assigned tags to identify matching workflows and modules. Tags are thought to provide the emergent bottom-up semantics in many datasets on the Social Web, and have been used for search and data exploration in, e.g., Delicious, Flickr, and myExperiment.org, the largest public repository of scientific workflows. Note, however, that our approach is not specific to tags, and that it can also use the top-down semantics prescribed by ontologies or controlled vocabularies instead of, or in addition to, tags. Related Work. Our work builds on the WISE search engine over repositories of hierarchical workflows presented in [9]. Following the intuition of [9], we return a portion of the workflow that is both informative, in that it shows the expansion and dataflow relationships necessary to understand a keyword match, and concise, in that only the modules and edges necessary to understand the match are returned. However, we extend the model of [9] in several important ways. First, we formalize the ideas, leveraging hierarchical workflows with τ expansion. Second, our characterizations of informativeness and conciseness are different, in that we always make the context of the match w.r.t. the top-level workflow explicit. Third, we discuss what the result is when a query matches a workflow in multiple ways. Fourth, we incorporate access control.

Our work is also related to keyword search over tree-structured data (e.g., XML) [1,4,[6][7][8]13], as well as XML access control [2,3]. We compare and contrast our work to these in the following sections.

This paper makes the following contributions. We formalize keyword search over hierarchical workflows, and define what it means for a workflow to match a keyword query, while also accounting for access control specifications (Section 2). We discuss various options for identifying and displaying one or more matches per workflow (Section 3). We consider search over workflow repositories, and propose various ranking semantics that build on techniques from XML search and from information retrieval, and adapt them to our setting (Section 4).

Model

Workflows and Keyword Search

A workflow is hierarchical, in the sense that a module may be composite and expand to a sub-workflow. We start by defining a simple workflow in which composite modules are not expanded, and then extend to a system of workflow specifications that expand composite modules. Definition 1. A simple workflow W is a labeled directed acyclic graph G = (V ∪ {I, O}, E) whose set of nodes, V , models the modules of the workflow. The set of edges, E, models potential dataflow between modules v ∈ V as well as input to and output from the workflow (I → v, v → O). Every v ∈ V must be on a path from I to O and has a module name, name(v).

For example, the box in Fig. 1(a) named W 1 is a simple workflow. The input to W 1 is a node labeled I 1 , and the output of W 1 is a node labeled O 1 . The workflow consists of three modules, M 1 , M 2 , and M 8 . Each of these modules is composite, and may be expanded. Definition 2. A workflow S is a pair (W, τ ), where W is a finite set of simple workflows and τ is a (partial) expansion function that maps some of the module names in the workflows in W to simple workflows in W.

Note that if a module is used multiple times in a workflow, i.e., v 1 = v 2 but name(v 1 ) = name(v 2 ), its τ expansion will be the same in each occurrence. From here on, we will use N to denote either a workflow or a module, M to denote a module name, modules(S) to refer to the set of module names in the workflows in W for a workflow S, and wf(N ) to refer to the workflow in which N appears (if N is a module) or the workflow itself (if N is a workflow). Returning to Fig. 1

(a), τ (M 1 ) = W 2 , τ (M 2 ) = W 3 , τ (M 4 ) = W 5 , and τ (M 8 ) = W 4 ; wf(W 1 )=W 1 , wf(M 1 )=W 1 , and wf(M 4 )=W 2 .

Definition 2 allows a simple workflow that is part of two different workflows to have different expansion functions defined over its module names. For example, W 2 in the workflow W 1 has τ 1 (M 4 ) = W 5 . However, in a different top level workflow W 0 it could be that τ 0 (M 4 ) is not defined, or maps to a different subworkflow. In a repository of workflows, we will disallow such inconsistencies. Definition 3. A repository of workflows is a set of workflows such that, for any two workflows

W 1 and W 2 , if M ∈ modules(W 1 ) ∩ modules(W 2 ) then either (1) τ 1 (M ) and τ 2 (M ) are both undefined; or (2) τ 1 (M ) = τ 2 (M ).

We now define the notion of an expansion hierarchy for a workflow. Definition 4. Given a workflow S = (W, τ ), if M is a module name in a workflow W ′ and τ ′ (M ) = W ′′ , we say that W ′′ is a sub-workflow of W ′ . The expansion hierarchy of S is the directed graph denoted H(S) whose nodes are the workflows in W, and in which there is an edge from W ′ ∈ W to W ′′ ∈ W if and only if W ′′ is a sub-workflow of W ′ . Each edge is labeled with the module name M whose expansion creates the sub-workflow relationship. A workflow is well-formed if its expansion hierarchy is a tree.

The expansion hierarchy for our running example is shown in Fig. 1(b). Since the hierarchy is a tree, W 1 is well-formed. To simplify the discussion, we assume in the rest of this paper that all workflows are well-formed. Our results can be easily extended to expansion hierarchies that are DAGs. Extending our work to the (less common) recursive workflow definitions is an interesting open problem.

A module or workflow N is associated with a set of keywords, denoted keywords(N ), e.g., keywords(W 1 )= {"SNP", "disease"}, keywords(M 6 )= {"join"}. Since the expansion of a module is its sub-workflow definition, we require that keywords(M )= keywords(τ (M )) whenever τ (M ) is defined, e.g., keywords(M 1 )= keywords(W 2 )= {"SNP"}. To reduce clutter, we omit the keywords associated with modules when they expand to a sub-workflow, and show only the keywords associated with the sub-workflow.

A keyword query Q is a set of keywords {k 1 , ...k n }. Informally, a query matches a workflow S if each k i can be found somewhere in S.

Definition 5. A keyword k matches a workflow S iff k ∈ keywords(S) or k ∈ keywords(M ) for some M ∈ modules(S). A query Q= {k 1 , ...k n } matches S iff each k i matches S.

Access Control and Search

Inspired by conventions in Unix file systems, a workflow S has two actions associated with it: read, which allows an authorized user to perform a match against keywords(S); and expand, which allows an authorized user to see the structure of S, i.e., the module names and the dataflow links between the modules. Similarly, a module named M has two actions associated with it: read, which allows an authorized user to perform a match against keywords(M ); and expand, which allows an authorized user to expand its definition to a sub-workflow.

It is helpful to think of the following analogy between workflows and modules, and Unix file system objects. A workflow corresponds to a directory, and having expand permissions on a workflow is analogous to having permission to enter the directory. A module corresponds to a file, and read access on a module is analogous to read permissions on a file. A module with a defined τ expansion corresponds to a symbolic link to a Unix directory, and having expand permissions on such a module is analogous to having permission to enter the directory.

In an access control specification for a workflow, read and expand privileges are associated with a set of user groups for each module and sub-workflow N ; the special group world includes all users. A user u is authorized for an action on N iff u is in a group associated with N for that action.

We denote the set of users who are authorized to read (expand) N as readers(N ) (expanders(N )). Authorizations associated with a sub-workflow are inherited by the modules that expand to it, i.e., readers(M )= readers(τ (M )) and expanders(M )= expanders(τ (M )), whenever τ (M ) is defined. Definition 6. An access control specification for a workflow S is an assignment of readers(N ), expanders(N ) for each N ∈ S ∪ modules(S).

In a repository of workflows we require that the access control specification be consistent across workflows. Definition 7. A repository R of workflows, together with their access control specifications, is access control-consistent iff for any user u and for any two workflows

W 1 and W 2 in R, if M ∈ modules(W 1 ) ∩ modules(W 2 ), then (1) u ∈ readers(M ) in both W 1 and W 2 or in neither W 1 nor W 2 , and (2) u ∈ expanders(M ) in both W 1 and W 2 or in neither W 1 nor W 2 .

We also revise the notion of a match to take access control into account: Definition 8. A keyword k matches a workflow S under access control specification A for user u iff 1. k ∈ keywords(S) and u ∈ readers(S); or 2. k ∈ keywords(M ) for some M ∈ modules(S), u ∈ readers(M ), and u ∈ expanders(W ) for every sub-workflow W on the path from S to wf(M ) in the expansion hierarchy H(S).

A query Q= {k 1 , ...k n } matches S iff each k i matches S.

Consider again the example in Fig. 1(a), and suppose that user u has the following access privileges: u ∈ readers(W 1 )∩expanders(W 1 ), u ∈ expanders(W 3 ), u ∈ readers(W 5 )∩expanders(W 5 ), and u ∈ readers(M 5 ). The query Q 1 = {"SNP", "parse"} matches W 1 for u, because "SNP" ∈ keywords(W 1 ) and "parse" ∈ keywords(M 5 ). Both keywords are accessible to u: keywords(W 1 ) is accessible because u ∈ readers(W 1 ), and keywords(M 5 ) is accessible because u ∈ expanders(W 1 ) ∩ expanders(W 3 ) ∩ readers(M 5 ). However, Q 2 ={"OMIM"} does not match W 1 for u because u ∈ readers(W 3 ), preventing a match on keywords(W 3 ). Further, while u ∈ readers(W 5 ), he cannot reach W 5 from W 1 , because u ∈ expanders(W 2 ).

Result Presentation for a Single Workflow

Given a workflow that matches the query, the goal of this section is to define a query result that is concise, informative, and non-redundant. First, we formalize the notion of a match assignment, which maps query keywords to nodes in the expansion hierarchy. Next, we define a projection over the expansion hierarchy, which contains only the relevant portion of the hierarchy, and is a concise and informative representation of the match assignment. We then eliminate redundant projections from the set of projections for each match assignment. Finally, we show how non-redundant projections can be used to produce a query result. Definition 9. A match assignment, µ, is a function mapping each keyword k i to the workflow or module name N ∈ S ∪ modules(S) such that k i ∈ keywords(N ).

Returning to our example, W 1 matches the query Q = {"SNP", "OMIM"} since "SNP" ∈ keywords(W 1 ) and "OMIM" ∈ keywords(M 2 ), yielding a match assignment µ 1 ("SNP")= W 1 and µ 1 ("OMIM")= M 2 . Importantly, keyword matches occur only at the top-level workflow (W 1 in our case) or at a module. In our example, the keyword "OMIM" is in keywords(W 3 ) as well as in keywords(M 2 ), but only M 2 will be used in a match assignment; the match for W 3 will be "caught" at M 2 since keywords(M 2 )= keywords(τ (M 2 )) = keywords(W 3 ).

Each keyword query that matches workflow S (as per Definition 5) will generate a non-empty set of match assignments {µ 1 , ..., µ n }, since each keyword may occur in the workflow multiple times. In our running example, there are a total of nine match assignments, listed in Fig. 2(a): "SNP" is among the keywords for W 1 , M 1 and M 3 , and "OMIM" is among the keywords of M 2 , M 4 , and M 7 .

Note that a match assignment may map multiple keywords to the same workflow or module N . For example, a valid match assignment for Q = {"SNP", "disease"} is µ("SNP") = W 1 and µ("disease") = W 1 .

In an access-controlled workflow, the set of match assignments for a user is restricted to those for which he has appropriate privileges (recall Definition 8). Definition 10. A match assignment µ(k 1 = N 1 ), . . . µ(k n = N n ) is valid for user u in an access-controlled workflow S iff u ∈ readers(N i ) and u ∈ expanders(W ) for every workflow W on the path from S to wf(M ) in the expansion hierarchy H(S).

This notion of match is used in most approaches to querying under access control in XML, see, e.g., [3] and the strongest approach presented in [2]. 1Having defined a match assignment in workflow S, we next need to transform it into a query result. We start by placing the match assignment into the context of the expansion hierarchy H(S) (see Definition 4). Definition 11. Given a workflow S that matches query Q, and a corresponding match assignment µ, let R= { wf(M ) | ∃k i ∈ Q. µ(k i )=M } be the relevant sub-workflows of S. Let W i be the least common ancestor in H(S) of nodes that appear in R. Then the projection of H(S) over R is the tree T formed by the union of paths from W i to W j in H(S), for all W j ∈ R. Fig. 2 lists all match assignments for the query Q = {"SNP", "OMIM"}, and shows the corresponding projections of H(W 1 ). Nodes and edges in the expansion hierarchy that are projected out are greyed out; nodes at which a keyword match occurred are presented in red font and denoted by a "*". Note that, while there are nine distinct match assignments for the query, there are only five distinct projections. To see why, consider the projection in Fig. 2(b) that corresponds to two match assignments µ 1 and µ 2 : "OMIM" is matched by M 2 and maps to W 1 the expansion hierarchy, and "SNP" is matched by either W 1 or M 1 , both also mapping to W 1 in the expansion hierarchy. Some of the sub-workflows appearing as nodes in the projection T may not appear in R, but are necessary to show the connection between the match assignments, and so must be included to make the result informative. E.g., W 2 is not part of the match assignment, yet it is part of the projection in Fig. 2(e).

µ 1 ("SNP") = W 1 ,µ 1 ("OMIM") = M 2 µ 2 ("SNP") = M 1 ,µ 2 ("OMIM") = M 2 µ 3 ("SNP") = W 1 ,µ 3 ("OMIM") = M 4 µ 4 ("SNP") = M 1 ,µ 4 ("OMIM") = M 4 µ 5 ("SNP") = M 3 ,µ 5 ("OMIM") = M 2 µ 6 ("SNP") = M 3 ,µ 6 ("OMIM") = M 4 µ 7 ("SNP") = W 1 ,µ 7 ("OMIM") = M 7 µ 8 ("SNP") = M 1 ,µ 8 ("OMIM") = M 7 µ 9 ("SNP") = M 3 ,µ 9 ("OMIM") = M 7 (a) match assignments M 1 M 2 M 4 W 4 M 8

(f) T5 : µ9 Fig. 2. Match assignments and projections of H(W1) for Q = {"SNP", "OMIM"}.

Projecting over the expansion hierarchy for a match assignment ultimately makes the query result concise, because T "trims" H(S): T may be rooted at a sub-workflow of S rather than at S, trimming off top levels of H(S) (e.g., Fig. 2(d) and 2(f)); subtrees of H(S) rooted at leaves of T are also trimmed (e.g., all subtrees rooted at W 1 are trimmed in Fig. 2(b)).

Consider now all projections over H(W 1 ) in Fig. 2, and note that some are more concise than others: Projection T 2 in Fig. 2(c) is rooted at W 1 but also contains W 2 , while projection T 3 in Fig. 2(d) contains a complete match at W 2 . In fact, T 3 is a subtree of T 2 . Similarly, observe that T 5 is a subtree of T 4 . We use these relationships to define non-redundant projections. Definition 12. Given query Q and workflow S, let T be the set of all projections of H(S) for Q. A projection T ∈ T is non-redundant if there does not exist another projection T ′ ∈ T , such that T = T ′ and T ′ is a sub-tree of T .

Non-redundant projections of H(W 1 ) for Q are T 1 , T 3 , and T 5 , in Fig. 2. Definition 12 is similar to that used in search and ranking over XML documents (e.g. [6,7]), where it is referred to as "conciseness". We use "conciseness" w.r.t. a single projection, and "non-redundancy" w.r.t. a set of projections.

The node corresponding to the top-level workflow, W 1 , is not always part of the projection; e.g., this is the case in Fig. 2(d). Since a user interested in re-using workflows will want to see how the relevant workflows are invoked, we need to provide re-use context. To improve informativeness of a result, we return a projection that is rooted at the top-level workflow. Definition 13. Let T = (V, E) be a projection of H(S) over R and let W be the root of T in H(S). Then the informative projection T ′ is formed by adding to T the path in H(S) from S to W .

We are now ready to define the representation of a query result: Definition 14. Let T = (V, E) be an informative projection of H(S) over R for workflow S = (W, τ ). Then the query result is the specification S ′ = (V, τ ′ ), where τ ′ (M ) = τ (M ) for all M that appear as labels on edges in E and undefined otherwise.

To generate a result for a set of match assignments, we compute the set of projections T , remove redundant projections from this set, resulting in a set of projections T ′ , and then compute a query result for each T ∈ T ′ . One option is to present these results to the user individually. For example, trees T 1 , T 3 and T 5 in Fig. 2 are non-redundant, and the corresponding query results are shown in Fig. 3(a), 3(b), and 3(c). Another option is to first generate a union of the projections in T ′ , and then produce a single query result for this union. Note that, because we require that all projections be informative (Definition 13), they will all have the same root -the top-level workflow S. The union of T 1 , T 3 and T 5 is T 5 , and the corresponding query result is shown in Fig. 4(a). Fig. 4(b) is the combined query result for W 2 , which is a union of two results: one matching W 2 (top level), and another matching "SNP" at W 2 and "OMIM" at M 7 .

Before proceeding to the question of result generation from a set of matches, it is worth understanding the relationship between our notion of keyword search and that of keyword search in XML trees. Since the dataflow structure of a workflow is not used in matching, one can think of the workflow as an XML tree that is formed from the workflow expansion hierarchy and also includes atomic modules as children of the sub-workflow to which they belong. Each node in the XML tree is labeled with the concatenation of its keywords; a node matches a query if its label contains one or several query terms as a substring. Similarly to our scenario, in XML keyword search, the result returned is a subtree rooted at the least common ancestor of the keyword matches, whenever the result does not have a proper subtree which is also a keyword match (i.e., is non-redundant). In our scenario, the result is further trimmed by deleting subtrees that do not contain matches (i.e., is concise). Finally, our result is augmented with the path to the root from the least common ancestor (i.e., is informative).

Search and Ranking in Workflow Repositories

We now focus on search in workflow repositories, where queries will likely return many matches, motivating ranking. Suppose that a workflow repository is given, and consists of five access-controlled workflows (per Definition 10) W 1 , W 2 , W 3 , W 4 , and W 5 . Suppose that user u issuing query Q = {"SNP", "OMIM"} is among the readers and expanders of all workflows and modules in the repository. Workflows W 1 and W 2 match Q, and one option is to present a ranked list of query results in Fig. 3. Another presentation option is to return the union of all query results for each workflow, and to then rank these combined results. Fig. 4 presents combined query results for W 1 and W 2 . We discuss both options below.

Consider again query results in Fig. 3, and note that there is a correspondence between some of these results. In particular, Fig. 3(b) and 3(d) refer to the same actual match that occurs in W 2 , but in Fig. 3(b) this match is presented in the context of W 1 . A similar relationship holds between results in Fig. 3(c) and 3(e). We argued in the introduction that both types of results are of potential interest to the user. However, it is certainly reasonable to prioritize one type of a result over another in a ranked list, using a measure of quality.

Given our conjunctive semantics of matching, all query results are exact matches, and are therefore relevant. However, larger results are more difficult for the user to understand, and so smaller results may be preferred. A natural way to quantify result quality is to consider the number of modules in the result.

Definition 15. Given an access-controlled workflow S, a query Q, and a result S i for Q in S, the size of S i , denoted size(S i ), is the number of modules in S i .

Returning to our example in Fig. 3, we have: size(W 1 1 ) = 3, size(W 2 1 ) = 5, size(W 3 1 ) = 6, size(W 1 2 ) = 2, and size(W 2 2 ) = 3. Using size as a measure of quality gives rise to the following ranking (rank position is recorded next to each entry

): (W 1 2 , 1), (W 2 2 , 2), (W 1 1 , 2), (W 2 1 , 4),(W 3 1 ,5

). Another natural way to quantify result quality is by using hierarchical workflow structure, and we define one possible such measure below. Definition 16. Given an access-controlled workflow S, a query Q, and a query result S i for Q in S, the depth of S i , denoted depth(S i ), is the number of τ expansions in S i . In our example in Fig. 3, we have:

depth(W 1 1 ) = 0, depth(W 2 1 ) = 1, depth(W 3 1 ) = 2, depth(W1

2 ) = 0, and depth(W 2 2 ) = 1. This gives rise to the following ranking: (W 1 1 , 1), (W 1 2 , 1), (W 2 1 , 3), (W 2 2 , 3), (W 3 1 , 5). Size and depth may be combined into a single quality measure, using, e.g., a (weighted) average of the two scores.

Another presentation option is to return the union of all query results for each workflow, and to then rank these combined results. Fig. 4 presents combined query results for W 1 and W 2 . One natural way to discriminate between multiple combined results is to consider what portion of each result is query-relevant. We refer to this measure as specificity, and define it below. Our definition is in-line with the specificity measure proposed by INEX and used in XML retrieval [5].

Definition 17. Given an access-controlled workflow S, a query Q, and a combined query result S ′ for Q in S, the specificity of S ′ , denoted spec(S ′ ), is the proportion of query-relevant modules in S ′ to size(S ′ ).

For the combined results in Fig. 4, spec(W ′ 1 ) = 5 6 and spec(W ′ 2 ) = 1, and so W ′ 2 would rank higher than W ′ 1 . Specificity treats occurrences of different keywords equally. However, there are cases when it is natural to assign higher importance to certain keywords. For example, "SNP" may be a very common keyword in the repository, while "OMIM" may be uncommon. A combined results containing many uncommon keywords should receive a high score. This semantics may be represented using, e.g., the popular tf-idf weighting scheme from information retrieval [10].

Keyword search and ranking over XML has been considered, e.g., [1,4]. In [1], XML documents have a tree structure. Queries consist of keywords appearing at leaf nodes and of labels appearing at intermediate tree nodes. Subtrees of the document are returned, ranked on a combination of tf-idf-style weights and subtree structure, e.g., its size and the ancestor-descendant relationships between query-relevant nodes. While the semantics of a match are somewhat different in our case, where keywords may be assigned to tree nodes other than leaves, the ranking options of [1] may be applicable. In [4], a weight of each keyword is computed using authority propagation in hyperlinked XML documents. Keyword proximity is taken into account when computing the rank of a result.

We formalized keyword search in access-controlled repositories of hierarchical workflows. We defined what it means for a single hierarchical workflow to match a keyword query, and discussed ranking semantics for workflow repositories.

Much interesting follow-up work remains to be done. The focus of this paper was on defining the formalism, and on making connections to search and ranking in XML and in information retrieval. A natural next step is to develop efficient query processing and ranking algorithms. This is non-trivial, because of the interesting interactions between query results and access control. For example, to support tf-idf ranking over a repository, one may pre-compute tf-idf weights for all (keyword, workflow) pairs. In an access-controlled setting, each user may potentially have a different view over the repository, requiring that tf-idf weights be computed per (user, keyword, workflow). This data structure may quickly become too large to compute and maintain, motivating optimizations.

Fig. 1 .1Fig. 1. Running example: workflow for estimating genetic disease susceptibility.

Fig. 3 .3Fig. 3. Query results for workflows W1 and W2 for Q = {"SNP", "OMIM"}.

2 Fig. 4 .24Fig. 4. Combined query results for workflows W1 and W2, for Q = {"SNP", "OMIM"}. We do not consider the weaker (and nonstandard) notions discussed in[2], which allow matches to occur below "unexpandable" terms under certain conditions.

Acknowledgments

This work was supported in part by NSF grants IIS-0803524, IIS-0629846, IIS-0915438, CCF-0635084, IIS-0904314, and by grant 0937060 to the Computing Research Association for the CIFellows Project.

XSEarch: A semantic search engine for XML SCohen VLDB 2003 A fine-grained access control system for XML documents EDamiani TIS-SEC 5 2002 Secure XML querying with security views WFan CYChan MNGarofalakis 2004 SIGMOD XRANK: Ranked keyword search over XML documents LGuo 2003 SIGMOD Inex 2005 evaluation measures GKazai MLalmas INEX 2005 Effective keyword search for valuable LCAs over XML documents GLi CIKM 2007 Schema-free XQuery YLi CYu HVJagadish 2004 VLDB XSeek: A semantic XML search engine using keywords ZLiu JWalker YChen VLDB 2007 WISE: Searching workflow hierarchies ZLui QShao YChen 2010 VLDB Introduction to Information Retrieval CDManning PRaghavan HSchütze 2008 Cambridge University Press The design and realisation of the myExperiment virtual research environment for social sharing of workflows DDRoure CAGoble RStevens FGCS 25 5 2009 Exploring repositories of scientific workflows JStoyanovich BTaskar SDavidson WANDS 2010 Efficient keyword search for smallest LCAs in XML databases YXu YPapakonstantinou SIGMOD 2005