<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Keyword Search in Workflow Repositories with Access Control</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Susan Davidson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Soohyun M. Lee</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julia Stoyanovich</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>susan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>soohyunl</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>jstoy}@cis.upenn.edu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer and Information Science Department University of Pennsylvania</institution>
          ,
          <addr-line>Philadelphia, PA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>A number of on-line repositories of scientific workflows are emerging. These repositories enable sharing and reuse of workflows, and aid in the design of new workflows. The growing size of these repositories, the complex hierarchical structure of the workflows, and the need to incorporate access control mechanisms make information discovery in these repositories an interesting challenge. This paper formalizes keyword search in repositories of hierarchical workflows. We start by defining what it means for a single hierarchical workflow to match a keyword query while accounting for access control, and discuss options for displaying the resulting matches within the workflow. We extend this to search over workflow repositories, by proposing various ranking semantics that build on techniques from XML search and from information retrieval, and adapting them to our setting.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Scientific workflows are increasingly important within the life sciences, where
users deal with large amounts of data and perform sophisticated analyses.
Workflows are an attractive paradigm for scientists because they make analyses
repeatable and re-usable. A workflow is typically represented as a directed graph
whose nodes represent analysis steps or modules and edges represent
potential dataflow between modules. A module implements a certain functionality by
transforming its data inputs into data outputs.</p>
      <p>Workflow development platforms typically come equipped with libraries of
pre-defined modules, which workflow designers may combine to implement a new
analysis. Since workflows may be complex, sometimes consisting of hundreds of
modules, it is natural to allow a sub-graph of the workflow graph – a sub-workflow
– to be encapsulated in a composite module. A composite module implements
the functionality of the sub-workflow, but reveals implementation details only
when expanded. This naturally leads to a hierarchical workflow structure.</p>
      <p>As an example, consider the hierarchical workflow in Fig. 1(a). This workflow,
labeled W1, estimates individual susceptibility to genetic disorders based on
genome-wide SNP array data. The input to W1, indicated by node I1, is a set
of single nucleotide polymorphisms (SNPs), ethnicity, lifestyle, family history,
and symptoms. The output of W1 is a prognosis, indicated by node O1. The first
SNP
disease I1
W1</p>
      <p>M1
M2
M8
O1
τ
τ
τ
W2</p>
      <p>M3</p>
      <p>SNP
PubMed
OMIM</p>
      <sec id="sec-1-1">
        <title>W3 parse</title>
      </sec>
      <sec id="sec-1-2">
        <title>W4 risk</title>
        <p>M5
M9</p>
        <p>M4</p>
        <p>τ
join
M6
prognosis</p>
        <p>OMIM
W5 OpaMrsIMe</p>
        <p>M7
module in W1, named M1, determines a set of disorders the patient is genetically
susceptible to based on the input SNPs and ethnicity information. The second
module, M2, refines the set of disorders for which the patient is at risk, based
on lifestyle, family history, and symptoms. Module M8 computes the prognosis.
Modules M1, M2, M4, and M8 are composite, and expand to sub-workflows that
implement the required functionality (indicated by τ edges in Fig. 1(a)). The
presence of such composite modules gives rise to an expansion hierarchy. The
expansion hierarchy for W1 is presented in Fig. 1(b).</p>
        <p>
          On-line repositories of scientific workflows, such as myExperiment.org [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], are
emerging in support of the sharing and re-use of workflows. To enable discovery of
workflows of interest, several of these repositories support tagging — a workflow
may be tagged by its author as well as by other users of the system. Workflow
W1 in our running example is tagged with “SNP” and “disease”. Fig. 1(a) shows
tags associated with workflows and modules as call-outs.
        </p>
        <p>
          In a recent work [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] we considered how tags may be used to enable browsing
of workflow repositories such as myExperiment.org. In the browsing scenario, the
user is not fully aware of his information need; he is exploring the repository to
better formulate his information need and identify items of potential interest. In
this paper we shift focus, and consider how tags may be used in search — a task
performed by a user who is well-aware of his information need, and is looking for
objects that satisfy that need precisely. For example, a user looking for workflows
that make disease susceptibility predictions based on SNP information may pose
a keyword query Q1 = {“SNP”, “disease”} against the repository, expecting that
workflows such as W1 be retrieved. If the user’s task is implemented fully by some
workflow in the result, he may re-use the workflow without further modification.
        </p>
        <p>However, users may also want to develop new functionality, possibly re-using
existing workflows as building blocks. Suppose that the user wants to build a
workflow that combines SNP-based disease susceptibility hypotheses with
information about potentially affected genes, and retrieves the pathways in which
these genes participate. Realizing that part of his task deals with SNP look-ups
in OMIM, he may issue the query Q2 = {“SNP”, “OMIM”}.</p>
        <p>Q2 matches workflow W2, and so W2 should be returned as an answer.
However, the user may also want to understand how to re-use W2 in the scope of a
larger workflow. Therefore, it is also valuable to return W1 as a result for Q2,
and to show how W2 is used in the context of W1. Understanding this re-use
context is important since it is often easier to learn by example, e.g., how to
format the input parameters and what to expect as output.</p>
        <p>Repositories of hierarchical workflows, while giving rise to exciting re-use
opportunities, also bring about interesting challenges with respect to security.
Security considerations arise naturally in domains where data objects are
attributable to individual users (their authors) and have intrinsic intellectual or
commercial value. We take the point of view that any information discovery
approach should be mindful of these considerations, and develop a model of search
in repositories of hierarchical workflows that incorporates access control.</p>
        <p>
          Note that our approach leverages user-assigned tags to identify matching
workflows and modules. Tags are thought to provide the emergent bottom-up
semantics in many datasets on the Social Web, and have been used for search and
data exploration in, e.g., Delicious, Flickr, and myExperiment.org, the largest
public repository of scientific workflows. Note, however, that our approach is
not specific to tags, and that it can also use the top-down semantics prescribed
by ontologies or controlled vocabularies instead of, or in addition to, tags.
Related Work. Our work builds on the WISE search engine over repositories of
hierarchical workflows presented in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Following the intuition of [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], we return
a portion of the workflow that is both informative, in that it shows the
expansion and dataflow relationships necessary to understand a keyword match, and
concise, in that only the modules and edges necessary to understand the match
are returned. However, we extend the model of [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] in several important ways.
First, we formalize the ideas, leveraging hierarchical workflows with τ expansion.
Second, our characterizations of informativeness and conciseness are different,
in that we always make the context of the match w.r.t. the top-level workflow
explicit. Third, we discuss what the result is when a query matches a workflow
in multiple ways. Fourth, we incorporate access control.
        </p>
        <p>
          Our work is also related to keyword search over tree-structured data (e.g.,
XML) [
          <xref ref-type="bibr" rid="ref1 ref13 ref4 ref6 ref7 ref8">1, 4, 6–8, 13</xref>
          ], as well as XML access control [
          <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
          ]. We compare and
contrast our work to these in the following sections.
        </p>
        <p>This paper makes the following contributions. We formalize keyword search
over hierarchical workflows, and define what it means for a workflow to match a
keyword query, while also accounting for access control specifications (Section 2).
We discuss various options for identifying and displaying one or more matches
per workflow (Section 3). We consider search over workflow repositories, and
propose various ranking semantics that build on techniques from XML search
and from information retrieval, and adapt them to our setting (Section 4).</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Model</title>
      <p>Workflows and Keyword Search
A workflow is hierarchical, in the sense that a module may be composite and
expand to a sub-workflow. We start by defining a simple workflow in which
composite modules are not expanded, and then extend to a system of workflow
specifications that expand composite modules.</p>
      <p>Definition 1. A simple workflow W is a labeled directed acyclic graph G =
(V ∪ {I, O}, E) whose set of nodes, V , models the modules of the workflow. The
set of edges, E, models potential dataflow between modules v ∈ V as well as
input to and output from the workflow (I → v, v → O). Every v ∈ V must be
on a path from I to O and has a module name, name(v).</p>
      <p>For example, the box in Fig. 1(a) named W1 is a simple workflow. The input
to W1 is a node labeled I1, and the output of W1 is a node labeled O1. The
workflow consists of three modules, M1, M2, and M8. Each of these modules is
composite, and may be expanded.</p>
      <p>Definition 2. A workflow S is a pair (W, τ ), where W is a finite set of simple
workflows and τ is a (partial) expansion function that maps some of the module
names in the workflows in W to simple workflows in W.</p>
      <p>Note that if a module is used multiple times in a workflow, i.e., v1 6= v2 but
name(v1) = name(v2), its τ expansion will be the same in each occurrence.
From here on, we will use N to denote either a workflow or a module, M to
denote a module name, modules(S) to refer to the set of module names in the
workflows in W for a workflow S, and wf(N ) to refer to the workflow in which N
appears (if N is a module) or the workflow itself (if N is a workflow). Returning
to Fig. 1(a), τ (M1) = W2, τ (M2) = W3, τ (M4) = W5, and τ (M8) = W4;
wf(W1)=W1, wf(M1)=W1, and wf(M4)=W2.</p>
      <p>Definition 2 allows a simple workflow that is part of two different workflows to
have different expansion functions defined over its module names. For example,
W2 in the workflow W1 has τ1(M4) = W5. However, in a different top level
workflow W0 it could be that τ0(M4) is not defined, or maps to a different
subworkflow. In a repository of workflows, we will disallow such inconsistencies.
Definition 3. A repository of workflows is a set of workflows such that, for any
two workflows W1 and W2, if M ∈ modules(W1) ∩ modules(W2) then either (1)
τ1(M ) and τ2(M ) are both undefined; or (2) τ1(M ) = τ2(M ).</p>
      <p>We now define the notion of an expansion hierarchy for a workflow.
Definition 4. Given a workflow S = (W, τ ), if M is a module name in a
workflow W ′ and τ ′(M ) = W ′′, we say that W ′′ is a sub-workflow of W ′. The
expansion hierarchy of S is the directed graph denoted H(S) whose nodes are
the workflows in W, and in which there is an edge from W ′ ∈ W to W ′′ ∈ W
if and only if W ′′ is a sub-workflow of W ′. Each edge is labeled with the module
name M whose expansion creates the sub-workflow relationship. A workflow is
well-formed if its expansion hierarchy is a tree.</p>
      <p>The expansion hierarchy for our running example is shown in Fig. 1(b). Since
the hierarchy is a tree, W1 is well-formed. To simplify the discussion, we assume
in the rest of this paper that all workflows are well-formed. Our results can be
easily extended to expansion hierarchies that are DAGs. Extending our work to
the (less common) recursive workflow definitions is an interesting open problem.</p>
      <p>A module or workflow N is associated with a set of keywords, denoted
keywords(N ), e.g., keywords(W1)= {“SNP”, “disease”}, keywords(M6)= {“join”}.
Since the expansion of a module is its sub-workflow definition, we require that
keywords(M )= keywords(τ (M )) whenever τ (M ) is defined, e.g., keywords(M1)=
keywords(W2)= {“SNP”}. To reduce clutter, we omit the keywords associated
with modules when they expand to a sub-workflow, and show only the keywords
associated with the sub-workflow.</p>
      <p>A keyword query Q is a set of keywords {k1, ...kn}. Informally, a query
matches a workflow S if each ki can be found somewhere in S.</p>
      <p>Definition 5. A keyword k matches a workflow S iff k ∈ keywords(S) or k ∈
keywords(M ) for some M ∈ modules(S). A query Q= {k1, ...kn} matches S iff
each ki matches S.
2.2</p>
      <p>Access Control and Search
Inspired by conventions in Unix file systems, a workflow S has two actions
associated with it: read, which allows an authorized user to perform a match against
keywords(S); and expand, which allows an authorized user to see the structure of
S, i.e., the module names and the dataflow links between the modules. Similarly,
a module named M has two actions associated with it: read, which allows an
authorized user to perform a match against keywords(M ); and expand, which
allows an authorized user to expand its definition to a sub-workflow.</p>
      <p>It is helpful to think of the following analogy between workflows and
modules, and Unix file system objects. A workflow corresponds to a directory, and
having expand permissions on a workflow is analogous to having permission to
enter the directory. A module corresponds to a file, and read access on a module
is analogous to read permissions on a file. A module with a defined τ expansion
corresponds to a symbolic link to a Unix directory, and having expand
permissions on such a module is analogous to having permission to enter the directory.</p>
      <p>In an access control specification for a workflow, read and expand privileges
are associated with a set of user groups for each module and sub-workflow N ;
the special group world includes all users. A user u is authorized for an action
on N iff u is in a group associated with N for that action.</p>
      <p>We denote the set of users who are authorized to read (expand) N as readers(N )
(expanders(N )). Authorizations associated with a sub-workflow are inherited by
the modules that expand to it, i.e., readers(M )= readers(τ (M )) and
expanders(M )= expanders(τ (M )), whenever τ (M ) is defined.</p>
      <p>Definition 6. An access control specification for a workflow S is an assignment
of readers(N ), expanders(N ) for each N ∈ S ∪ modules(S).</p>
      <p>In a repository of workflows we require that the access control specification
be consistent across workflows.</p>
      <p>Definition 7. A repository R of workflows, together with their access control
specifications, is access control-consistent iff for any user u and for any two
workflows W1 and W2 in R, if M ∈ modules(W1) ∩ modules(W2), then (1)
u ∈ readers(M ) in both W1 and W2 or in neither W1 nor W2, and (2) u ∈
expanders(M ) in both W1 and W2 or in neither W1 nor W2.</p>
      <p>We also revise the notion of a match to take access control into account:
Definition 8. A keyword k matches a workflow S under access control
specification A for user u iff
1. k ∈ keywords(S) and u ∈ readers(S); or
2. k ∈ keywords(M ) for some M ∈ modules(S), u ∈ readers(M ), and u ∈
expanders(W ) for every sub-workflow W on the path from S to wf(M ) in
the expansion hierarchy H(S).</p>
      <p>A query Q= {k1, ...kn} matches S iff each ki matches S.</p>
      <p>Consider again the example in Fig. 1(a), and suppose that user u has the
following access privileges: u ∈ readers(W1)∩expanders(W1), u ∈ expanders(W3),
u ∈ readers(W5)∩expanders(W5), and u ∈ readers(M5). The query Q1 = {“SNP”,
“parse”} matches W1 for u, because “SNP” ∈ keywords(W1) and “parse” ∈
keywords(M5). Both keywords are accessible to u: keywords(W1) is accessible
because u ∈ readers(W1), and keywords(M5) is accessible because u ∈ expanders(W1)
∩ expanders(W3) ∩ readers(M5). However, Q2 ={“OMIM”} does not match W1
for u because u 6∈ readers(W3), preventing a match on keywords(W3). Further,
while u ∈ readers(W5), he cannot reach W5 from W1, because u 6∈ expanders(W2).
3</p>
    </sec>
    <sec id="sec-3">
      <title>Result Presentation for a Single Workflow</title>
      <p>Given a workflow that matches the query, the goal of this section is to define a
query result that is concise, informative, and non-redundant. First, we formalize
the notion of a match assignment, which maps query keywords to nodes in the
expansion hierarchy. Next, we define a projection over the expansion hierarchy,
which contains only the relevant portion of the hierarchy, and is a concise and
informative representation of the match assignment. We then eliminate
redundant projections from the set of projections for each match assignment. Finally,
we show how non-redundant projections can be used to produce a query result.
Definition 9. A match assignment, μ, is a function mapping each keyword ki to
the workflow or module name N ∈ S ∪ modules(S) such that ki ∈ keywords(N ).</p>
      <p>Returning to our example, W1 matches the query Q = {“SNP”, “OMIM”}
since “SNP” ∈ keywords(W1) and “OMIM” ∈ keywords(M2), yielding a match
assignment μ1(“SNP”)= W1 and μ1(“OMIM”)= M2. Importantly, keyword matches
occur only at the top-level workflow (W1 in our case) or at a module. In our
example, the keyword “OMIM” is in keywords(W3) as well as in keywords(M2), but
only M2 will be used in a match assignment; the match for W3 will be “caught”
at M2 since keywords(M2)= keywords(τ (M2)) = keywords(W3).</p>
      <p>Each keyword query that matches workflow S (as per Definition 5) will
generate a non-empty set of match assignments {μ1, ..., μn}, since each keyword may
occur in the workflow multiple times. In our running example, there are a total
of nine match assignments, listed in Fig. 2(a): “SNP” is among the keywords for
W1, M1 and M3, and “OMIM” is among the keywords of M2, M4, and M7.</p>
      <p>Note that a match assignment may map multiple keywords to the same
workflow or module N . For example, a valid match assignment for Q = {“SNP”,
“disease”} is μ(“SNP”) = W1 and μ(“disease”) = W1.</p>
      <p>In an access-controlled workflow, the set of match assignments for a user is
restricted to those for which he has appropriate privileges (recall Definition 8).
Definition 10. A match assignment μ(k1 = N1), . . . μ(kn = Nn) is valid
for user u in an access-controlled workflow S iff u ∈ readers(Ni) and u ∈
expanders(W ) for every workflow W on the path from S to wf(M ) in the
expansion hierarchy H(S).</p>
      <p>
        This notion of match is used in most approaches to querying under access
control in XML, see, e.g., [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and the strongest approach presented in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].1
      </p>
      <p>Having defined a match assignment in workflow S, we next need to transform
it into a query result. We start by placing the match assignment into the context
of the expansion hierarchy H(S) (see Definition 4).</p>
      <p>Definition 11. Given a workflow S that matches query Q, and a corresponding
match assignment μ, let R= { wf(M ) | ∃ki ∈ Q. μ(ki)=M } be the relevant
sub-workflows of S. Let Wi be the least common ancestor in H(S) of nodes that
appear in R. Then the projection of H(S) over R is the tree T formed by the
union of paths from Wi to Wj in H(S), for all Wj ∈ R.</p>
      <p>
        Fig. 2 lists all match assignments for the query Q = {“SNP”, “OMIM”},
and shows the corresponding projections of H(W1). Nodes and edges in the
expansion hierarchy that are projected out are greyed out; nodes at which a
keyword match occurred are presented in red font and denoted by a “*”. Note
that, while there are nine distinct match assignments for the query, there are
only five distinct projections. To see why, consider the projection in Fig. 2(b)
that corresponds to two match assignments μ1 and μ2: “OMIM” is matched by
M2 and maps to W1 in the expansion hierarchy, and “SNP” is matched by either
W1 or M1, both also mapping to W1 in the expansion hierarchy.
1 We do not consider the weaker (and nonstandard) notions discussed in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which
allow matches to occur below “unexpandable” terms under certain conditions.
      </p>
      <p>W1
W3</p>
      <p>M2 M8</p>
      <p>W4</p>
      <p>M1</p>
      <p>W
M4 2
W *
5</p>
      <p>W *
1</p>
      <p>M2 M8
W3</p>
      <p>W4
(d) T3 : μ6
(e) T4 : μ7, μ8</p>
      <p>M1
W
M4 2</p>
      <p>W *</p>
      <p>1
W3</p>
      <p>M2 M8</p>
      <p>W4
W5
(b) T1 : μ1, μ2</p>
      <p>W1
M1</p>
      <p>M2 M8
MW42* W3
W *
5
(f) T5 : μ9</p>
      <p>W4
µ1("SNP") = W1,µ1("OMIM") = M2
µ2("SNP") = M1,µ2("OMIM") = M2
µ3("SNP") = W1,µ3("OMIM") = M4
µ4("SNP") = M1,µ4("OMIM") = M4
µ5("SNP") = M3,µ5("OMIM") = M2
µ6("SNP") = M3,µ6("OMIM") = M4
µ7("SNP") = W1,µ7("OMIM") = M7
µ8("SNP") = M1,µ8("OMIM") = M7
µ9("SNP") = M3,µ9("OMIM") = M7
(a) match assignments
W *</p>
      <p>1
M1</p>
      <p>M2 M8
MW42* W3</p>
      <p>W4</p>
      <p>W5
(c) T2 : μ3, μ4, μ5</p>
      <p>M1
W *
M4 2</p>
      <p>W5</p>
      <p>Some of the sub-workflows appearing as nodes in the projection T may not
appear in R, but are necessary to show the connection between the match
assignments, and so must be included to make the result informative. E.g., W2 is
not part of the match assignment, yet it is part of the projection in Fig. 2(e).</p>
      <p>Projecting over the expansion hierarchy for a match assignment ultimately
makes the query result concise, because T “trims” H(S): T may be rooted at
a sub-workflow of S rather than at S, trimming off top levels of H(S) (e.g.,
Fig. 2(d) and 2(f)); subtrees of H(S) rooted at leaves of T are also trimmed
(e.g., all subtrees rooted at W1 are trimmed in Fig. 2(b)).</p>
      <p>Consider now all projections over H(W1) in Fig. 2, and note that some are
more concise than others: Projection T2 in Fig. 2(c) is rooted at W1 but also
contains W2, while projection T3 in Fig. 2(d) contains a complete match at W2.
In fact, T3 is a subtree of T2. Similarly, observe that T5 is a subtree of T4. We
use these relationships to define non-redundant projections.</p>
      <p>Definition 12. Given query Q and workflow S, let T be the set of all projections
of H(S) for Q. A projection T ∈ T is non-redundant if there does not exist
another projection T ′ ∈ T , such that T 6= T ′ and T ′ is a sub-tree of T .</p>
      <p>
        Non-redundant projections of H(W1) for Q are T1, T3, and T5, in Fig. 2.
Definition 12 is similar to that used in search and ranking over XML documents
(e.g. [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]), where it is referred to as “conciseness”. We use “conciseness” w.r.t.
a single projection, and “non-redundancy” w.r.t. a set of projections.
      </p>
      <p>The node corresponding to the top-level workflow, W1, is not always part
of the projection; e.g., this is the case in Fig. 2(d). Since a user interested in
re-using workflows will want to see how the relevant workflows are invoked, we
need to provide re-use context. To improve informativeness of a result, we return
a projection that is rooted at the top-level workflow.</p>
      <p>Definition 13. Let T = (V, E) be a projection of H(S) over R and let W be
the root of T in H(S). Then the informative projection T ′ is formed by adding
to T the path in H(S) from S to W .</p>
      <p>We are now ready to define the representation of a query result:
Definition 14. Let T = (V, E) be an informative projection of H(S) over R
for workflow S = (W, τ ). Then the query result is the specification S′ = (V, τ ′),
where τ ′(M ) = τ (M ) for all M that appear as labels on edges in E and undefined
otherwise.</p>
      <p>To generate a result for a set of match assignments, we compute the set of
projections T , remove redundant projections from this set, resulting in a set of
projections T ′, and then compute a query result for each T ∈ T ′. One option is
to present these results to the user individually. For example, trees T1, T3 and
T5 in Fig. 2 are non-redundant, and the corresponding query results are shown
in Fig. 3(a), 3(b), and 3(c). Another option is to first generate a union of the
projections in T ′, and then produce a single query result for this union. Note
that, because we require that all projections be informative (Definition 13), they
will all have the same root — the top-level workflow S. The union of T1, T3 and
T5 is T5, and the corresponding query result is shown in Fig. 4(a). Fig. 4(b) is
the combined query result for W2, which is a union of two results: one matching
W2 (top level), and another matching “SNP” at W2 and “OMIM” at M7.</p>
      <p>Before proceeding to the question of result generation from a set of matches,
it is worth understanding the relationship between our notion of keyword search
and that of keyword search in XML trees. Since the dataflow structure of a
workflow is not used in matching, one can think of the workflow as an XML tree
that is formed from the workflow expansion hierarchy and also includes atomic
modules as children of the sub-workflow to which they belong. Each node in the
XML tree is labeled with the concatenation of its keywords; a node matches a
query if its label contains one or several query terms as a substring. Similarly
to our scenario, in XML keyword search, the result returned is a subtree rooted
at the least common ancestor of the keyword matches, whenever the result does
not have a proper subtree which is also a keyword match (i.e., is non-redundant).
In our scenario, the result is further trimmed by deleting subtrees that do not
contain matches (i.e., is concise). Finally, our result is augmented with the path
to the root from the least common ancestor (i.e., is informative).
4</p>
    </sec>
    <sec id="sec-4">
      <title>Search and Ranking in Workflow Repositories</title>
      <p>We now focus on search in workflow repositories, where queries will likely return
many matches, motivating ranking. Suppose that a workflow repository is given,
and consists of five access-controlled workflows (per Definition 10) W1, W2, W3,
W4, and W5. Suppose that user u issuing query Q = {“SNP”, “OMIM”} is
among the readers and expanders of all workflows and modules in the repository.
Workflows W1 and W2 match Q, and one option is to present a ranked list of
τ</p>
      <p>W2 SNP OMIM</p>
      <p>M3 M4</p>
      <p>W1
τ</p>
      <p>W2 SNP</p>
      <p>M3</p>
      <p>M4
M1
M2
M8</p>
      <p>I1</p>
      <p>O1
M4
(e) W22
τ</p>
      <p>W5 OMIM</p>
      <p>M7
(c) W13
τ
O2 W5 OMIM</p>
      <p>M7
SNP I1
W1</p>
      <p>M1
M2
M8</p>
      <p>O1
(a) W11</p>
      <p>SNP
OMIM</p>
      <p>W1</p>
      <p>M1
M2
M8</p>
      <p>I1
O1</p>
      <p>(b) W12</p>
      <p>SNP
I2 W2</p>
      <p>M3</p>
      <p>SNP</p>
      <p>OMIM O2</p>
      <p>M4
(d) W21</p>
      <p>SNP
I2 W2 SNP</p>
      <p>M3
query results in Fig. 3. Another presentation option is to return the union of all
query results for each workflow, and to then rank these combined results. Fig. 4
presents combined query results for W1 and W2. We discuss both options below.</p>
      <p>Consider again query results in Fig. 3, and note that there is a correspondence
between some of these results. In particular, Fig. 3(b) and 3(d) refer to the same
actual match that occurs in W2, but in Fig. 3(b) this match is presented in the
context of W1. A similar relationship holds between results in Fig. 3(c) and 3(e).
We argued in the introduction that both types of results are of potential interest
to the user. However, it is certainly reasonable to prioritize one type of a result
over another in a ranked list, using a measure of quality.</p>
      <p>Given our conjunctive semantics of matching, all query results are exact
matches, and are therefore relevant. However, larger results are more difficult
for the user to understand, and so smaller results may be preferred. A natural
way to quantify result quality is to consider the number of modules in the result.
Definition 15. Given an access-controlled workflow S, a query Q, and a result
Si for Q in S, the size of Si, denoted size(Si), is the number of modules in Si.</p>
      <p>Returning to our example in Fig. 3, we have: size(W11) = 3, size(W12) = 5,
size(W13) = 6, size(W21) = 2, and size(W22) = 3. Using size as a measure of
quality gives rise to the following ranking (rank position is recorded next to each
entry): (W21, 1), (W22, 2), (W11, 2), (W12, 4),(W13, 5).</p>
      <p>Another natural way to quantify result quality is by using hierarchical
workflow structure, and we define one possible such measure below.</p>
      <p>Definition 16. Given an access-controlled workflow S, a query Q, and a query
result Si for Q in S, the depth of Si, denoted depth(Si), is the number of τ
expansions in Si.
W1</p>
      <p>M1
M2
M8</p>
      <p>I1
SNP</p>
      <p>τ
OMIM
O1</p>
      <sec id="sec-4-1">
        <title>W2 SNP OMIM</title>
        <p>M3 M4
τ</p>
      </sec>
      <sec id="sec-4-2">
        <title>W5 OMIM</title>
        <p>M7</p>
        <p>In our example in Fig. 3, we have: depth(W11) = 0, depth(W12) = 1, depth(W13) =
2, depth(W21) = 0, and depth(W22) = 1. This gives rise to the following ranking:
(W11, 1), (W21, 1), (W12, 3), (W22, 3), (W13, 5). Size and depth may be combined
into a single quality measure, using, e.g., a (weighted) average of the two scores.</p>
        <p>
          Another presentation option is to return the union of all query results for
each workflow, and to then rank these combined results. Fig. 4 presents combined
query results for W1 and W2. One natural way to discriminate between multiple
combined results is to consider what portion of each result is query-relevant. We
refer to this measure as specificity, and define it below. Our definition is in-line
with the specificity measure proposed by INEX and used in XML retrieval [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
Definition 17. Given an access-controlled workflow S, a query Q, and a
combined query result S′ for Q in S, the specificity of S′, denoted spec(S′), is the
proportion of query-relevant modules in S′ to size(S′).
        </p>
        <p>For the combined results in Fig. 4, spec(W1′) = 56 and spec(W2′) = 1, and so
W2′ would rank higher than W1.</p>
        <p>′</p>
        <p>
          Specificity treats occurrences of different keywords equally. However, there
are cases when it is natural to assign higher importance to certain keywords.
For example, “SNP” may be a very common keyword in the repository, while
“OMIM” may be uncommon. A combined results containing many uncommon
keywords should receive a high score. This semantics may be represented using,
e.g., the popular tf-idf weighting scheme from information retrieval [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>
          Keyword search and ranking over XML has been considered, e.g., [
          <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
          ]. In [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ],
XML documents have a tree structure. Queries consist of keywords appearing
at leaf nodes and of labels appearing at intermediate tree nodes. Subtrees of
the document are returned, ranked on a combination of tf-idf-style weights and
subtree structure, e.g., its size and the ancestor-descendant relationships between
query-relevant nodes. While the semantics of a match are somewhat different
in our case, where keywords may be assigned to tree nodes other than leaves,
the ranking options of [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] may be applicable. In [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], a weight of each keyword is
computed using authority propagation in hyperlinked XML documents. Keyword
proximity is taken into account when computing the rank of a result.
We formalized keyword search in access-controlled repositories of hierarchical
workflows. We defined what it means for a single hierarchical workflow to match
a keyword query, and discussed ranking semantics for workflow repositories.
        </p>
        <p>Much interesting follow-up work remains to be done. The focus of this paper
was on defining the formalism, and on making connections to search and ranking
in XML and in information retrieval. A natural next step is to develop efficient
query processing and ranking algorithms. This is non-trivial, because of the
interesting interactions between query results and access control. For example,
to support tf-idf ranking over a repository, one may pre-compute tf-idf weights
for all (keyword, workflow) pairs. In an access-controlled setting, each user may
potentially have a different view over the repository, requiring that tf-idf weights
be computed per (user, keyword, workflow). This data structure may quickly
become too large to compute and maintain, motivating optimizations.
6</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was supported in part by NSF grants IIS-0803524, IIS-0629846,
IIS0915438, CCF-0635084, IIS-0904314, and by grant 0937060 to the Computing
Research Association for the CIFellows Project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>S.</given-names>
            <surname>Cohen</surname>
          </string-name>
          et al.
          <article-title>XSEarch: A semantic search engine for XML</article-title>
          .
          <string-name>
            <surname>In</surname>
            <given-names>VLDB</given-names>
          </string-name>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>E.</given-names>
            <surname>Damiani</surname>
          </string-name>
          et al.
          <article-title>A fine-grained access control system for XML documents</article-title>
          .
          <source>TISSEC</source>
          ,
          <volume>5</volume>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>W.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. Y.</given-names>
            <surname>Chan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Garofalakis</surname>
          </string-name>
          .
          <article-title>Secure XML querying with security views</article-title>
          .
          <source>In SIGMOD</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>L.</given-names>
            <surname>Guo</surname>
          </string-name>
          et al. XRANK:
          <article-title>Ranked keyword search over XML documents</article-title>
          .
          <source>In SIGMOD</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>G.</given-names>
            <surname>Kazai</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Lalmas</surname>
          </string-name>
          .
          <article-title>Inex 2005 evaluation measures</article-title>
          .
          <source>In INEX</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          et al.
          <article-title>Effective keyword search for valuable LCAs over XML documents</article-title>
          .
          <source>In CIKM</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          .
          <article-title>Schema-free XQuery</article-title>
          .
          <source>In VLDB</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Walker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen. XSeek</surname>
          </string-name>
          :
          <article-title>A semantic XML search engine using keywords</article-title>
          .
          <source>In VLDB</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Shao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          . WISE:
          <article-title>Searching workflow hierarchies</article-title>
          .
          <source>VLDB</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>C. D. Manning</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Raghavan</surname>
          </string-name>
          , and H. Schu¨tze. Introduction to Information Retrieval. Cambridge University Press,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>D. D. Roure</surname>
            ,
            <given-names>C. A.</given-names>
          </string-name>
          <string-name>
            <surname>Goble</surname>
            , and
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Stevens</surname>
          </string-name>
          .
          <article-title>The design and realisation of the myExperiment virtual research environment for social sharing of workflows</article-title>
          .
          <source>FGCS</source>
          ,
          <volume>25</volume>
          (
          <issue>5</issue>
          ),
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>J. Stoyanovich</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Taskar</surname>
            , and
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Davidson</surname>
          </string-name>
          .
          <article-title>Exploring repositories of scientific workflows</article-title>
          .
          <source>In WANDS</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Papakonstantinou</surname>
          </string-name>
          .
          <article-title>Efficient keyword search for smallest LCAs in XML databases</article-title>
          .
          <source>In SIGMOD</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>