Repairing Provenance Policy Violations by Inventing
                  Non-Functional Nodes

                 Saumen Dey1 , Daniel Zinn2 , and Bertram Ludäscher12
                 1
                     Dept. of Computer Science, University of California, Davis
                       2
                          Genome Center, University of California, Davis


       Abstract. In scientific collaborations, provenance is increasingly used to ex-
       plain, debug, reproduce, and determine the validity and quality of data products.
       In such environments, it can be infeasible or undesirable to publish the complete
       provenance of all the final output data products. We have developed P RO P UB,
       a system that allows users to publish a customized version of their data prove-
       nance, based on a set of publication and customization requests, while observing
       certain provenance publication policies, expressed as logic integrity constraints.
       The user’s customization requests may violate one or more integrity constraints.
       In previous work, we removed additional parts of the provenance graph (i.e., not
       directly requested by the user) to repair policy violations. In this paper, we present
       an alternative approach which ensures that all relevant nodes are retained in the
       provenance graph. The key idea is to introduce new (non-functional) nodes that
       are used to represent lineage dependencies, without revealing information that the
       user wants to protect. With this new approach, a user may now explore different
       provenance publication strategies, and choose the most appropriate one before
       publishing sensitive provenance data.


1    Introduction
In the emerging paradigm of collaborative, data-intensive science, sharing data products
even prior to publication is desirable [1,2]. Yet, without a proper scientific publication
associated with openly published data, its validity and accuracy might be questionable.
This is problematic in an open environment, where published data by one scientist is
used by another scientist as input for further data analyses. In such an environment, data
provenance (the lineage and processing history of data) can help to ensure data quality
[3,4,5,6,7]. It is thus desirable to publish data products together with their provenance.
    In many cases, however, provenance data can be sensitive and may contain private
information or intellectual property that should not be revealed [7,8,5]. Consequently,
a balancing act (Figure 1) is necessary between (i) the desire to publish provenance
data so that collaborators can understand and rely on the shared data products, and (ii)
the need to protect sensitive information, e.g., due to privacy concerns or intellectual
property issues.
    We view provenance as a bipartite, directed, acyclic graph, capturing which data
nodes were consumed and produced, respectively, by invocation (i.e., computation)
nodes. Our model thus corresponds to the Open Provenance Model (OPM) which cap-
tures the dependencies between data artifacts and invocations [9,10].
                                     The Balancing Act
                               Privacy &
                               Relevancy           Provenance
                               Concerns             Publishing


Fig. 1. In collaborative settings, scientists publish provenance for an improved understanding of
the result data. With increasing privacy concerns, collaborators have to choose the right balance
between providing sufficient provenance data and protecting sensitive information.


     To sanitize provenance graphs, a scientist can remove sensitive data nodes or in-
vocations nodes from the provenance graph. Alternatively, she can abstract a set of
sensitive nodes by grouping them into a single, abstract node. This update may violate
some of the integrity constraints of the provenance graph [11]. For example, grouping
multiple nodes into one abstraction node may introduce new dependencies which were
absent in the initial provenance graph. Hiding nodes may also make some nodes in the
final graph appear independent of each other even though they are dependent in the ini-
tial graph. Thus, one can no longer trust that the published provenance data is “correct”
(e.g., there are no false dependencies) or “complete” (e.g., there are no false indepen-
dencies). Therefore, we propose a system that allows a publisher to provide a high-level
specification what parts of the provenance graph are to be published and which parts are
to be sanitized, while guaranteeing that at the same time certain provenance publication
constraints are observed.


2    Motivating Example

Figure 2(a) shows the provenance graph (PG) taken from the First Provenance Chal-
lenge [12]. Data nodes are depicted as circles and invocation nodes (representing com-
putations) as boxes; dependencies among them are shown as directed edges. These
edges capture the lineage of data and thus are typically drawn from right (newer nodes)
to left (older nodes). For example, d16 was generated by an invocation s2 , and was in
                                                              gen by               used
turn used by invocation c2 , denoted by, respectively s2 ←− d16 and d16 ←− c2 .
    Let us assume, the user wants to publish data products d18 and d19 along with their
lineage data. Then, she will issue the publication requests as shown in Figure 2(a). A
recursive query is used to retrieve all data and invocation nodes upstream from d18 and
d19 and we get a modified provenance graph (PG! ) as shown in Figure 2(b). Note that
the lineage of d20 up to s3 is not relevant for d18 and d19 and hence not included in PG! .
Further assume that before publishing PG! , the user also requests a set of customiza-
tions as shown in Figure 2(b).
    Figure 3 shows the provenance graph we get after applying all the customization
requests. We see that this provenance graph violates three provenance policies: There
                                       "#$%&'(!%&)*+,*!                                                   "*+,-".,!                /'0)!      "*+,-".,10234!526!
                                                                                !"#                                                           "*+,-".,1+24!526!
                        *$!   !$+#   ,$!      !$/#                                                            *$!     !$+#   ,$!       !$.#   "*+,-".,1&24!526!
!"#                                                                             !$%#               !$(#
                                                                                          '$!
                !$(#                                  %&)*+,*-./01!             !$$#                                                          "#$#%&'()10226!
!$%#                                                                                               !$)#       *&!     !$-#   ,&!       !$"#
                                                      %&)*+,*-./21!                                                                           "#$#%&'()10276!
          '$!           *&!   !$-#   ,&!      !$"#
                                                                                !$&#
!$$#                                                                                                                         /'0)!
                !$)#                                                                                                                          /'0)10286!
                                                                         "#$#%&'()!                                                           /'0)1.26!!
!$&#
                        *(!   !$.#   ,(!      !&%#                                                                                            /'0)1.76!

       (a) Provenance graph (PG) and publish request                                  (b) User requests: anonymize, abstract, hide

           Fig. 2. (a) User requests to publish the provenance of {d18 , d19 }; and (b) customization requests
           to anonymize data nodes {d11 , d12 }, to abstract nodes {m1 , d14 , s1 }, and to hide {c1 , d18 , c2 }

                                                     $%&'(!
                                                                            !!24&! "!2&3!
                                           !)$        !"#$        !"%$
                                                                            !!253! "!26! "!27(!
                                           !"*$
                                                      "#!        &'!     !"($                   !")$
                                           !""$
                                                     )%*(!+,,-,!         ./'0(!123(*(23(2&(!
                                           !"'$


           Fig. 3. Provenance graph after applying all user requests. Provenance policies No-Type Error
           (NTE), No-Cyclic Dependency (NCD) and No-False Independence (NFI) are violated, while
           No-Write Conflict (NWC) and No-False Dependence (NFD) are satisfied.


           is a cycle between d13 and g1 , a type error for the edge from s2 to g1 (the graph should
           be bipartite), and there is no dependency between d19 and d16 , violating, respectively,
           the No-Cyclic Dependency (NCD), No-Type Error (NTE) and No-False Independence
           (NFI) policies. On the other hand, the provenance policies No-Write Conflict (NWC)
           and No-False Dependence (NFD) are not violated by these customization requests.
           Outline and Contributions. In Section 3, we first describe the provenance model, user
           requests, provenance policies, and logical architecture of P RO P UB. This overall frame-
           work was proposed recently in [11]. In Section 4 we present our main contribution, i.e.,
           a new way to repair policy violations, not by removing additional nodes (as in our prior
           work), but by introducing new (non-functional) nodes that represent the original lineage
           dependencies, without revealing information that the user wants to protect. We describe
           in detail how policy violations will be repaired such that all relevant nodes are retained
           in the final provenance graph. Related work is discussed in Section 5 and Section 6
           presents some concluding remarks and suggestions for future work.


           3    Provenance Publisher (P RO P UB)
           In our recent work, we developed the system P RO P UB [11], which uses a declarative
           approach to publish customized policy-aware provenance. P RO P UB accepts the initial
           provenance graph and two types of input specifications. (i) User Requests: the publica-
           tion and customization requests, and (ii) Provenance Policies: the integrity constraints
Relation Name Description
  used(I, D) An edge specifying that the invocation I used the data artifact D.
 gen by(D, I) An edge to indicate that the data artifact D was generated by invocation I.
 actor(I, A) An invocation node I, which was executed by actor A.
  data(D, R) A data artifact node, whose value can be retrieved using the reference R.
   dep(X, Y)  An auxiliary relation and defined as dep = used ∪ gen by and to specify that
              node X depends on node Y , irrespective of the node types.

                       Table 1. Provenance Model for P RO P UB


  User Request Description
 ur:lineage(D) Selects the complete lineage for the data artifact D
ur:anonymize(N) Erases the actor/process identify or the data reference from the node N
   ur:hide(N)     Removes the invocation or data node N
ur:abstract(N, G) Collapses all nodes N to the abstract group G
  ur:retain(N) Keeps the node N in the customized provenance

           Table 2. User requests for lineage publication and customization


to be observed. P RO P UB then applies all user requests on the initial provenance graph
and checks for policy violations. In case there is a violation, it applies repairs and gen-
erates the customized provenance graph.

Provenance Model. The provenance model used in P RO P UB is based on OPM, the
Open Provenance Model [13] and our earlier work [14]: A provenance (or lineage)
graph is an acyclic graph P G = (V, E), where the nodes V = D ∪ I represent either
data items D or actor invocations I. The graph G is bipartite, i.e., the edges E = Euse ∪
Egby are either used edges Euse ⊆ I × D or generated-by edges Egby ⊆ D × I. Here,
a used edge (i, d) ∈ E means that invocation i has read d as part of its input, while a
generated-by edge (d, i) ∈ E means that d was output data, written by invocation i.
We use the schema shown in Table 1.

User Requests. The user requests supported by the P RO P UB framework are summa-
rized in Table 2. The P RO P UB system expects user requests to be asserted as relational
facts that can then be used by a Datalog rule engine. An user request can be a publi-
cation request or a customization request. A customization user request can request to
remove a node or an edge or to keep that in the final graph.

Provenance Policies. The provenance graph supported by P RO P UB is a bipartite di-
rected acyclic graph. Also, an invocation can read many data artifacts, but a data ar-
tifact is written by exactly one invocation. We developed three provenance policies to
                                                                               ∆u
verify if these structural properties are satisfied in the provenance graph PG! , which
we get after applying all the customization requests on PG . P RO P UB has two more
                                                                 !

provenance policies to ensure the correctness and completeness of information. These
provenance policies are briefly defined in Table 3.
    Provenance Policy       Description
 No-Write Conflict (NWC) A data artifact can be written by only one invocation.
No-Cyclic Dependency (NCD) There is no cycle between any two nodes X and Y.
   No-Type Error (NTE)      Two nodes with a direct dependency are of different types.
                                                           ∆u
No-False Dependence (NFD) Two nodes are dependent in PG!       only if they are dependent
                                   !
                            in PG .
                                                             ∆u
No-False Independence (NFI) Two nodes are independent in PG!     only if they are indepen-
                                      !
                            dent in PG .

                                 Table 3. Provenance Policies


Constraint Description
ic:wc(X, Y) Write conflict: invocations X and Y are creating the same data node.
ic:cd(X, Y) Cyclic dependency between nodes X and Y.
ic:te(X, Y) Type error: nodes X and Y are connected via used or gen by edges, but don’t have
            the corresponding node types.
                                                             ∆u
ic:fd(X, Y) False dependency: node Y depends on X in PG! , but not in PG! .
                                                                              ∆u
ic:fi(X, Y) False independence: node Y depends on X in PG! , but not in PG! .

           Table 4. Integrity constraint relations used to detect policy violations


    We use a set of integrity constraints (ICs) to check whether the provenance policies
defined in Table 3 are satisfied. Table 4 lists the “witness relations” that are defined by
rules (not shown) and which are used to detect particular IC violations.3


3.1     Logical Architecture

The logical architecture of the P RO P UB system is shown in Figure 4. The user submits a
set of publication and customization requests U0 . The module Direct-Conflict-Detection
detects direct conflicts among the given user-requests. For example, a hide and a retain
request on the same node is an obvious conflict. The user needs to update her original
requests until all direct conflicts are resolved, resulting in a conflict-free user request
U. The Lineage-Selection module computes the sub graph PG! , which contains all to-
be-published data items (specified using the ‘lineage’ predicate) together with their
complete provenance.
    The Request-Policy-Evaluation module calculates the updates (∆u: inform of insert
and delete) needed to apply all the user requests from U on PG! . It applies ∆u on
                                                     ∆u
PG! and get a customized provenance graph PG! . Then it checks if all the selected
provenance policies (PP) are observed by evaluating respective integrity constraints. In
case some of the policies are violated, this module calculates updates (∆p: inform of
insert and delete) needed to repair the violations. In a final conflict resolution step using
the module Implied-Conflict-Detection-Resolution, the system detects all such implied
 3
     For example., we can detect whether a data node is created by different invocations X and Y
     and record this as ic:wc(X, Y ).
                                                    !"#$%
                                                 &#'(#")"%!*%


                      ()&*'+&,-&).'            :5$#0)%;,.<50)%         +$,-#./.0#%
                        ;"<=$7.'                 :#)#09,.%              1$/23%+1%


                      !"#$%&#'()&*'                                     !"#$%&#'
                                              45.#/6#%7#8#09,.%         /*012'3/4'
                       +&,-&).)'('


                      +$,-#./.0#%%            &#'(#")%=%+,850>%           8-9'3/:'8-9'
                       +,850>%E+F%              ?-/8(/9,.%                   (9'3'


                                               @A285#B%;,.<50)%
                      >?<"*&#'()&*'                                    D,.,$#B%!"#$%
                                                 :#)#09,.%=%
                        +&,-&).)''                                       &#'(#")"%
                                                 &#",8(9,.%


                                                                       ;("),A5C#B%
                        5$"60.&#'                1(/$/.)##B%
                                                                       +$,-#./.0#%
                        3"6$7$&)'                 +,8505#"%
                                                                          1$/23%


                                      Fig. 4. P RO P UB Architecture


conflicts by comparing ∆u and ∆p. In case an implicit conflict is detected, it selects
another subset from the given U and PP following the user preferences. These steps are
                                                                             ∆u
repeated until there is no more policy violations. It then applies ∆p on PG!    to get
the customized provenance graph (CG) ready to be published.


4   Repairing Policy Violations

If we apply the customization user requests on PG! , we get an intermediate provenance
            ∆u                                                        ∆u
graph PG!       as we shown in Figure 2(a), 2(b) and 3. But, PG!          may violate one
                                                 ∆u
or more provenance policies. In case the PG!          violates a structural policy (NWC,
NCD, and NTE), it will no more be a proper provenance graph. Also, in case it violates
                                               ∆u
a non-structural policy (NFI and NFD), PG!         may contain incorrect information or
                                                                       ∆u
may become incomplete. Thus, user will not be able to publish PG! . To resolve this
issue, we apply the customization requests U on PG! in a strategic way such that it
confirms to all the provenance policies.
     Our strategy is primarily based on two ideas (i) inventing non-functional nodes, and
(ii) converting user requests using other forms of user requests.
                                                                ∆u
Inventing Non-functional Nodes. In case PG!            has a structural violation, P RO P UB
resolves the violation by adding a new non-functional node. A non-functional node is
added to maintain the structure of provenance graph. Presence of a non-functional node
in the final customized graph may represent one data or invocation node or a set of data
and invocation nodes. No mapping is maintained between the non-functional node and
the nodes it replaced. Also, it will not carry any URL. Thus, no one will be able to
reach to the value of a data artifact or the source code of an actor from a non-functional
                 (a) Policy Violation                          (b) Fix

Fig. 5. (a) direct dependency between data nodes causing a type error (NTE violation); (b)
P RO P UB resolves this by inventing a non-functional invocation node.


node. P RO P UB invents minimum numbers of non-functional nodes to resolve a policy
violation.
    P RO P UB uses the same strategy to resolve NFD policy violations. The fix to the
violation of this policy is complex and may need more than one non-functional node to
be added. In spite of this complexity, P RO P UB resolves the violation using the minimum
numbers of non-functional nodes.
Converting User Requests. A publisher can use ur:hide requests to hide individual
nodes or the partial structure of the provenance graph. When we apply these user re-
quests all the selected nodes and the associated edges are removed from the provenance
graph PG! and a set of independence may be created which violates the NFI policy. We
can use the inventing new non-functional node strategy as discussed above and replace
the selected node by a non-functional node to resolve this policy violation. But, this
approach keeps the structure of the original provenance graph in the final provenance
graph. Instead, P RO P UB converts these ur:hide user requests into an equivalent set of
ur:abstract user requests so that all the selected nodes are removed and no unintended
dependencies are removed.


4.1   Repairing Structural Policy Violations

No-Type Error. This policy is violated in case there is a direct dependency between
two nodes of same type (i.e. a dependency between two data artifacts or a dependency
between two invocations). P RO P UB invents a non-functional invocation node in case
the policy violation is between two data artifacts as shown in Fig. 5. In the similar way,
P RO P UB invents a non-functional data node in case the policy violation is between two
invocation nodes. We used the rules as shown below to create the non-functional nodes
and fix the violations of this policy.
del_dep(X,Y) :- ic:te(X,Y).
add_data(f(X,Y),T) :- ic:te(X,Y), d_actor(X,_), T=‘ic:te’.
add_actor(f(X,Y),T) :- ic:te(X,Y), d_data(X,_), T=‘ic:te’.
add_dep(X,f(X,Y)) :- ic:te(N1,N2), X is N1.
add_dep(f(X,Y),Y) :- ic:te(N1,N2), Y is N2.


No-Write Conflict. This policy is violated in case there are N (N ≥ 2) gen by edges
for a data node. To resolve this violation P RO P UB removes incorrect gen by edges for
the violated data node and keeps only one gen by edge, which is there in PG. But, this
                    (a) Policy Violation                         (b) Fix

Fig. 6. In (a) there are two gen by edges with the data node causing the No-Write conflict policy
violation and P RO P UB resolves this by inventing a non-functional data node as shown in (b).


               (a) Policy Violation                           (b) Fix

Fig. 7. In (a) there is a cycle between the data node and the invocation node and P RO P UB resolves
this by inventing the non-functional invocation node as shown in (b).


may violate the NFI policy as it removes dependencies for N − 1 invocation nodes.
To get around this side effect P RO P UB invents N − 1 non-functional data nodes and
creates N − 1 gen by edges as shown in Fig. 6. Lastly, it copies all the used edges for
the violated data node over to all N − 1 non-functional data nodes. Following rules are
used to create the non-functional nodes and fix the violations of this policy:
del_data(D) :- ic:wc(D).
del_dep(D,I) :- ic:wc(D), d_gen_by(D,I).
add_data(f(D,I),T) :- ic:wc(D), d_gen_by(D,I), T=‘ic:wc’.
add_dep(f(D,I),I) :- ic:wc(D), d_gen_by(D,I).
add_dep(I,f(D,I1)) :- ic:wc(D), d_gen_by(D,I1), d_used(I,D).


No-Cyclic Dependency. This policy is violated in case a node is reachable from itself.
In Fig. 7, there is a cycle between a invocation node and a data node. To fix this violation
P RO P UB invents a non-functional invocation node and creates a used edge between the
data node and the non-functional invocation node. Then it removes all the gen by edges
from the invocation node (except the one with the data node with which it has the
cycle) and copies them over to the non-functional invocation node. In the similar way,
P RO P UB resolves this violation between two invocation nodes.


4.2   Repairing No-False Independence (NFI) Policy Violations
                                                                                 ∆u
This policy is violated in case two nodes are not dependent in PG!            even though
they are in PG! . This may occur in case the ur:hide user requests are applied on PG! as
shown in Fig. 8. One way to resolve this violation is to insert direct dependencies, which
        (a) User Requests: ur:hide nodes                       (b) Policy Violation


                                            (c) Fix

Fig. 8. PG! and user requests ur:hide are shown in (a). In (b) some dependencies are removed
                      ∆u
between nodes in PG! . P RO P UB then resolves this in two steps (i) transforms these ur:hide
requests into equivalent ur:abstract requests and (ii) applies these ur:abstract requests on PG!
and gets the customized graph is shown in (c).


                                      ∆u                                     ∆u
are there in PG! but missing in PG! , between any two nodes in PG! . But, this pro-
cess may add too many edges and the graph may become unreadable. One optimization
to this process is to develop transitive dependencies to reduce the total number of new
edges needed. This may be computation intensive. P RO P UB uses a different strategy
to fix this violation. Following rules are used to transform the ur:hide requests into an
equivalent set of ur:abstract requests:
hide_connected(X,Y) :- ur:hide(X), ur:hide(Y), dep(X,Y).
hide_connected(X,X) :- ur:hide(X).
hide_connected(X,Y) :- hide_connected(Y,X).
hide_connected(X,Y) :- hide_connected(X,Z), hide_connected(Z,Y).
smaller(X) :- hide_connected(X,Y), X < Y.
minimum(X) :- ur:hide(X), not(smaller(X)).
abstract_hide(X,G) :- hide_connected(X,G), minimum(G).

    The customization user requests ur:abstract removes nodes from PG! , but does
not violate the NFI policy. To avoid the NFI polic violations P RO P UB transforms the
ur:hide user requests into an equivalent set of ur:abstract user requests. These will be
applied to PG! in the same way the User issued ur:abstract requests are applied.


4.3   Repairing No-False Dependence (NFD) Policy Violations
                                                                     ∆u
This policy is violated in case two nodes are dependent in PG! even though they are
not in PG! . This may occur in case the ur:abstract user requests are applied on PG! as
shown in Fig. 9. In Fig. 9(a) we have a partial provenance graph showing the ur:abstract
requests and the nodes with direct dependencies with one or more nodes selected to be
abstracted. This figure shows that in PG! the data artifact ‘1’ depends on data artifact
‘a’ and ‘b’. In the similar way, the data node ‘2’ depends on invocation nodes ‘b’ and
‘c’ and so on. Now, if we apply these ur:abstract requests by collapsing all the selected
                                            ∆u
nodes into a abstracted node then in PG!        the data artifact ‘1’ become depended on
                                                        ∆u
nodes ‘a’, ‘b’, ‘c’, ‘d’, and ‘e’ and thus making PG! incorrect, as shown in Fig. 9.
                           #!$%&#'%"
                   !"                   #"        !"                      #"

                    !"                 %"
                                                  !"                      %"

                    &"                  '"
                                                                          '"
                                                  &"
                                       ("                                 ("
                   $"
                                                  $"
                                        +"                                +"
                    )"                  *"        )"                       *"
                  (a) The ur:abstract user      (b) The dependencies cre-
                  requests                      ated by applying these
                                                ur:abstract user requests

Fig. 9. In (a) we show the boundary of one ur:abstract user requests set and the nodes with a
direct dependency with one or more nodes selected to be abstracted. After these ur:abstract user
requests are applied on PG! we get a new set of dependencies as shown in (b).


    To avoid this policy violations P RO P UB takes a systematic three stages approach to
apply the ur:abstract user requests. Instead of collapsing into one abstracted node, it
invents a number of non-functional data and invocation nodes to maintain the depen-
                                        ∆u
dencies between any two nodes in PG! as they are in PG! . This systematic approach
ensures that the minimum number of non-functional nodes are invented. In the first
stage, P RO P UB develops two sets in and out. The in is a set of data nodes which is
used by some of the invocation nodes selected to be abstracted and invocation nodes
which generated some of the data nodes selected to be abstracted. The out is a set
of data nodes which is generated by one of the invocation node selected to abstracted
and invocation nodes which used some of the data nodes selected to abstracted. It also
calculates the dependencies for each of the node in set out on the nodes of the set in.
    Now, P RO P UB creates non-functional data nodes for each of the invocation nodes
from the sets in and out. One non-functional data node is created for exactly one in-
vocation node from the set in through a gen by edge. One non-functional data node
is created for more then one invocation node from the set out through used edges in
case these invocations have the same dependencies on the set in. At this stage, a non-
functional data node is connected either to a node from the set in or to one or many
nodes from the set out. For example, invocation node ‘4’ ‘5’ and depends on invoca-
tion nodes ‘b’ and ‘c’ and P RO P UB will create only one non-functional data node and
two used edges. This is shown in Fig. 10(a).
    In the second stage, it calculates the list of dependencies of all nodes from the set
out to the nodes from in. P RO P UB creates one non-functional invocation node for each
of these unique dependency lists and it creates gen by edges for nodes from the set
out which has the same dependency list. Then it creates used edges to connect to the
nodes in in set from any of these non-functional invocation nodes. It will connect with
respective non-functional data node created in the last stage in case an edge needs to be
created with an invocation either from in or out. This outcome is shown in Fig. 10(b).
      !"                   #"     !"                      #"     !"                        #"

      !"                   %"     !"                      %"                               %"
                                                                 !"

      &"                   '"                             '"
                                  &"                             &"                        '"
                           ("                             ("                               ("
      $"                          $"                             $"
                           +"                             +"                               +"

      )"                   *"      )"                     *"      )"                       *"


             (a) Stage 1                 (b) Stage 2                   (c) Fix Stage 3

                     Fig. 10. Repairing No-False Dependence Policy Violations


  Algorithm: C ALCULATE C USTOM PG
  I NPUT: provenance graph PG, user requests U and provenance policies PP
  O UTPUT: customized provenance graph CG
     1. Test for Direct Conflicts // as explained in Section 3.1
     2. IF there are Direct Conflicts THEN
     3.   RETURN false // User can resubmit after changing U
     4. ELSE
     5.   Compute PG! // as explained in Section 3.1
     6.   Transform ur:hide user requests into ur:abstract user requests // as explained in Section 4.2
                                                                 ∆u
     7.   Apply all ur:abstract user requests on PG! to get PG! // as explained in Section 4.3
                                             ∆u
     8.   Resolve NCC violations on PG! // as explained in Section 4.1
                                                       ∆u
     9.   Resolve NWC violations on modified PG! // as explained in Section 4.1
                                                      ∆u
     10. Resolve NFT violations on modified PG! // as explained in Section 4.1
                      ∆u
     11. CG = PG! // Final customized provenance graph
     12. RETURN CG

      Fig. 11. Computing CG using the Inventing Non-Functional Nodes approach


    In the final stage, P RO P UB combines nodes if possible. For example, in Fig. 10(b)
the path from node ‘6’ to node ‘e’ has three consecutive non-functional nodes with no
other dependencies. These three nodes can be replaced by only one non-functional data
node. The result is shown in Fig. 10(c). Now, P RO P UB removes all the nodes selected
to be abstracted and associated edges from PG! .

4.4        Algorithm
The algorithm mentioned in Fig. 11 finds the customized provenance graph, if available.
In this approach, we add non-functional nodes to repair policy violations. In Figure 3
                   ∆u
we show that PG!       has a structural policy violation between nodes g1 and s2 . Using
this approach, we introduce a non-functional data node d such that d is dependent on
g1 ; and s2 is dependent on d. Now, to fix the cycle between d13 and g1 we introduce
the non-functional invocation node g2 and create a dependency (gen by) edge from d15
to g2 . Then, we get the final CG as shown in Figure 12. Note that we are now able to
keep all the relevant nodes in CG.
                                                                      -./+0-'+1)"23#!"4#
                                               !#%&'# !#%')#          -./+0-'+1/"3#!"4#
                     !&$                       !#%()# !#%*# !#%+,#    -./+0-'+15"3#!"4#
                     !"'$                                             -%6%7589,1)""4#
                            !"#   !$     /:#    !"#$    !$#    !"&$
                                                                      -%6%7589,1)":4#
                     !""$
                                  !")$   !:#    !"%$
                     !"($                                             ;8),1)"<4#
                                                                      ;8),1'"4#
                                                                      ;8),1':4#


          Fig. 12. Customized Provenance Graph after repairing all policy violations


5   Related Work

In [3,4,5,6,7], it has been observed that provenance can be used, e.g., to interpret results,
diagnose errors, fix bugs, improve reproducibility, and generally to build trust on the
final data products and the underlying processes. In addition, provenance can be used to
enhance exploratory processes [15,16,17], and techniques have been developed to deal
with provenance efficiently [18,19].
    In many cases, provenance carries sensitive information, which can cause privacy
concerns related to a data, actor, or workflow specification. Studying provenance, one
can capture the functionality (being able to guess the output of the actor given a set of
inputs) of an actor (module), or the execution flow of a workflow [8].
    The security view approach [5] limits the available provenance to a user by pro-
viding a partial view of the workflow through a role-based access control mechanism,
and by defining a set of access permissions on actors, channels, and input/output ports
as specified by the workflow owner at design time. The ZOOM∗ UserViews approach
[20] allows to define a partial, zoomed-out view of a workflow, based on a user-defined
distinction between relevant and irrelevant actors. Provenance information is restricted
by the definition of that partial view of the workflow.
    In our recent work [11], we developed P RO P UB, which uses a declarative approach
to publish customized policy-aware provenance. In this paper, we developed a new way
to repair policy violations, not by removing additional nodes (as in [11]), but by in-
troducing new (non-functional) nodes that represent the original lineage dependencies,
without revealing information that the user wants to protect. We described in detail how
policy violations will be repaired such that all relevant nodes are retained in the final
provenance graph.


6   Conclusions

We discussed the need for provenance in scientific collaboration. Provenance data helps
to build trust in the published results and data. However, provenance can also contain
sensitive data and/or too much irrelevant detail. Thus, scientists should be able to “cus-
tomize” provenance data before sharing it.
    Our current P RO P UB system is based on the open provenance model (OPM). We
plan to extend P RO P UB to include model extensions, e.g., to support structured data
structures, in particular nested collections [19]. Furthermore, P RO P UB currently sug-
gests only one specific modified graph based on a given U and PP. In future work, we
plan to investigate how to extend this approach to rank alternative solutions, thus sup-
porting scientists even more in finding the desirable balance between revealing prove-
nance information and preserving privacy when sharing data with collaborators.


References

 1. Nature: Special Issue on Data Sharing. Volume 461. (September 2009)
 2. Missier, P., Ludäscher, B., Bowers, S., Dey, S., Sarkar, A., Shrestha, B., Altintas, I., Anand,
    M., Goble, C.: Linking multiple workflow provenance traces for interoperable collaborative
    science. In: Workflows in Support of Large-Scale Science (WORKS), 2010 5th Workshop
    on, IEEE 1–8
 3. Bose, R., Frew, J.: Lineage retrieval for scientific data processing: a survey. ACM Computing
    Surveys (CSUR) 37(1) (2005) 1–28
 4. Simmhan, Y., Plale, B., Gannon, D.: A survey of data provenance in e-science. ACM
    SIGMOD Record 34(3) (2005) 31–36
 5. Chebotko, A., Chang, S., Lu, S., Fotouhi, F., Yang, P.: Scientific workflow provenance query-
    ing with security views. In: Web-Age Information Management, 2008. WAIM’08. The Ninth
    International Conference on, IEEE (2008) 349–356
 6. Freire, J., Koop, D., Santos, E., Silva, C.T.: Provenance for Computational Tasks: A Survey.
    Computing in Science and Engineering 10(3) (2008) 11–21
 7. Davidson, S., Khanna, S., Roy, S., Boulakia, S.: Privacy issues in scientific workflow prove-
    nance. In: Proceedings of the 1st International Workshop on Workflow Approaches to New
    Data-centric Science, ACM (2010) 1–6
 8. Davidson, S.B., Khanna, S., Panigrahi, D., Roy, S.: Preserving Module Privacy in Workflow
    Provenance. CoRR abs/1005.5543 (2010)
 9. Moreau, L., Clifford, B., Freire, J., Futrelle, J., Gil, Y., Groth, P., Kwasnikowska, N., Miles,
    S., Missier, P., Myers, J., et al.: The open provenance model core specification (v1. 1). Future
    Generation Computer Systems (2010)
10. Anand, M., Bowers, S., Ludascher, B.: Provenance browser: Displaying and querying sci-
    entific workflow provenance graphs. In: Data Engineering (ICDE), 2010 IEEE 26th Interna-
    tional Conference on, IEEE (2010) 1201–1204
11. Dey, S., Zinn, D., Ludäscher, B.: PROPUB: Towards a Declarative Approach for Publishing
    Customized, Policy-Aware Provenance. In: Scientific and Statistical Database Management
    Conference (to appear). (2011)
12. Moreau, L., Ludäscher, B., Altintas, I., Barga, R., Bowers, S., Callahan, S., Chin, J., Clifford,
    B., Cohen, S., Cohen-Boulakia, S., et al.: Special issue: The first provenance challenge.
    Concurrency and Computation: Practice and Experience 20(5) (2008) 409–418
13. Moreau, L., Clifford, B., Freire, J., Gil, Y., Groth, P., Futrelle, J., Kwasnikowska, N., Miles,
    S., Missier, P., Myers, J., Simmhan, Y., Stephan, E., den Bussche, J.V.: OPM: The Open
    Provenance Model Core Specification (v1.1). http://openprovenance.org/ (De-
    cember 2009)
14. Anand, M., Bowers, S., McPhillips, T., Ludäscher, B.: Exploring scientific workflow prove-
    nance using hybrid queries over nested data and lineage graphs. In: Scientific and Statistical
    Database Management, Springer (2009) 237–254
15. Davidson, S., Freire, J.: Provenance and scientific workflows: challenges and opportunities.
    In: SIGMOD Conference, Citeseer (2008) 1345–1350
16. Freire, J., Silva, C., Callahan, S., Santos, E., Scheidegger, C., Vo, H.: Managing rapidly-
    evolving scientific workflows. Provenance and Annotation of Data (2006) 10–18
17. Silva, C., Freire, J., Callahan, S.: Provenance for visualizations: Reproducibility and beyond.
    Computing in Science & Engineering (2007) 82–89
18. Heinis, T., Alonso, G.: Efficient Lineage Tracking For Scientific Workflows. In: Proceedings
    of the 2008 ACM SIGMOD conference. (2008) 1007–1018
19. Anand, M., Bowers, S., Ludäscher, B.: Techniques for efficiently querying scientific work-
    flow provenance graphs. In: Proceedings of the 13th International Conference on Extending
    Database Technology, ACM (2010) 287–298
20. Biton, O., Cohen-Boulakia, S., Davidson, S.: Zoom* userviews: Querying relevant prove-
    nance in workflow systems. In: Proceedings of the 33rd international conference on Very
    large data bases, VLDB Endowment (2007) 1366–1369