<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Comprehensive Sanitization Approach for Workflow Provenance Graphs</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Noha Nagy Mohy Hoda M. O. Mokhtar Mohamed E. El-Sharkawi Information Systems Department, Information Systems Department, Information Systems Department, Faculty of Computers and Information, Faculty of Computers and Information, Faculty of Computers and Information, Cairo University, Egypt Cairo University, Egypt Cairo University</institution>
          ,
          <country country="EG">Egypt</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>As the number of provenance aware organizations increases, particularly in workflow scientific domains, sharing provenance data becomes a necessity. Meanwhile, scientists wish to share their scientific results without sacrificing privacy, neither directly through illegal authorizations nor indirectly through illegal inferences. Nevertheless, current work in workflow provenance sanitizing approaches do not address the disclosure problem of sensitive information through inferences. In this paper, we propose a comprehensive workflow provenance sanitization approach called ProvS that maximize both graph utility and privacy with respect to the influence of various workflow constraints. Experimental results show the effectiveness of ProvS through testing it on a graph-based system implementation.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Graph anonymity</kwd>
        <kwd>Graph privacy</kwd>
        <kwd>Secure provenance graph</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Securing provenance data has been studied in recent years. Hiding,
anonymization, and grouping are well known sanitization
approaches that are used to preserve privacy of sensitive
provenance graph components[nodes/edges] [3], [6], [8], [9], [12]–
[14], [18], [19], [25]. In fact, these approaches vary in terms of
graph utility, privacy and conformance to provenance policies that
guarantee the completeness and correctness of the provenance
graph [3], [14]. Hiding approach removes the required sensitive
graph components which cause dangling nodes and edges [11].
Anonymization approach only hides the identification attributes.
(c) 2016, Copyright is with the authors. Published in the
Workshop Proceedings of the EDBT/ICDT 2016 Joint
Conference (March 15, 2016, Bordeaux, France) on
CEURWS.org (ISSN 1613-0073). Distribution of this paper is permitted
under the terms of the Creative Commons license
This approach gave anonymization approach the privilege of
satisfying provenance graph policies and increasing the graph
utility at the same time [14], [29]. However, they do not guarantee
the privacy of sensitive information as the attacker can re-identify
the anonymized graph components. Grouping approach provides an
abstract graph view that preserves the privacy of sensitive graph
components, their major drawback is decreasing the graph utility
[5], [19]. In addition, they require checking the resulting graph
validity according to provenance policies and handling invalid
graphs either by grouping more nodes and edges or inventing new
dummy nodes to satisfy the provenance graph policies [
        <xref ref-type="bibr" rid="ref28">13</xref>
        ].
On the other hand, in the application domain, workflow systems are
impacted by industrial laws and regulations that control the
workflow execution to ensure the compliance of business rules and
regulations that prevent business failures [31]. There are multiple
types of workflow constraints that specify the control flow between
business processes, define restrictions on the resources or define
exception handling procedures. Actually, these workflow
constraints are good seeds for attacking privacy particularly in a
provenance area as provenance stores a complete history of the
workflow execution.
      </p>
      <p>This paper focuses on the problem of attacking graph privacy by
re-identifying sanitized graph components through using domain
knowledge. The domain knowledge that we address is the
workflow constraints. We study the main factors that affect the
provenance graph sanitization privacy, utility, and provenance
graph policy. The paper introduces a sanitization approach called
ProvS that utilizes anonymization, grouping and workflow
constraints to produce a set of sanitization actions. These
sanitization actions need to be applied to the workflow provenance
graph to preserve its privacy while considering the graph utility and
provenance graph policies. The key contributions of this paper are:</p>
      <p>Proposing a comprehensive sanitization approach
tailored to preserve privacy of workflow provenance
graphs.</p>
      <p>Handling the privacy problem in anonymization
approach
Increasing the graph utility of the grouping approach
Automatically decides which properties (nodes/edges) of
the graph need to be grouped and which properties need
to be anonymized without user intervention.</p>
      <p>The rest of the paper is organized as follows: Section 2 presents a
brief background information about provenance graphs in addition
to the various types of workflow constraints. Related work is
reviewed in Section 3. A motivating example is presented in
Section 4. Section 5 introduces the proposed approach ProvS.
Finally, Section 6 concludes the paper and presents some points for
future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. BACKGROUND</title>
      <p>This section provides a brief introduction to the most common
keywords that will be used along this paper.</p>
      <p>rm
Fo
d
ie
rv
e
D</p>
      <p>A1</p>
      <p>Used</p>
      <p>Agent
P1
A2
P2</p>
      <p>ControlledBy
GeneratedBy
Used</p>
      <p>TriggeredB
y</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Workflow Provenance Graph</title>
      <p>Workflow provenance records the history of workflow executions
[17]. There are several tools for capturing workflow provenance
[2], [23], [29]. These tools capture information about the sequence
of workflow process executions used to produce a data item, as well
as the intermediate data passed between these processes. Most of
these tools use the Open Provenance Model (OPM) that was
proposed in [20] as their standard model for representing
provenance data. OPM in Figure 1 models provenance as a Directed
Acyclic Graph (DAG) which consists of three types of nodes:
artifacts represent the data used, processes represent actions
performed on or caused by artifacts, and resulting in new artifacts,
and agents that represent actors executing the process. The edges
of the OPM graph represent a relationship between two nodes.
Provenance graphs are captured automatically by the workflow
system. Provenance graph can be formally defined as G=(V,E,L,F)
where V is the set of vertices, E V V is a set of edges, L is a set
of labels, and F is a labeling function F:EL that assigns each edge
a label.</p>
      <p>There are some provenance policies that ensure the correctness and
completeness of the provenance graph which discussed in [14].</p>
      <p>Table 1. Provenance graph policy [14]</p>
      <sec id="sec-3-1">
        <title>Provenance Policy</title>
        <sec id="sec-3-1-1">
          <title>No Write Conflict (NWC)</title>
        </sec>
        <sec id="sec-3-1-2">
          <title>No Cyclic Dependency (NCD)</title>
        </sec>
        <sec id="sec-3-1-3">
          <title>No-Type Error</title>
        </sec>
        <sec id="sec-3-1-4">
          <title>No-False Dependency (NFD)</title>
        </sec>
        <sec id="sec-3-1-5">
          <title>No-False Independency (NFI)</title>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Description</title>
        <p>A data artifact can be written by
only one process
There is no cycle between any two
nodes
Two nodes with a direct
dependency are of different types.
Two nodes are dependent in the
resulted graph only if they are
dependent in the original graph
Two nodes are independent in the
resulted graph only if they are
independent in the original graph</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>2.2 Workflow Constraints</title>
      <p>The executions of workflow systems are always governed by a set
of constraints that guarantee the correctness of the execution [24].
These constraints can be defined by the user or can be captured
from the Business Process Model (BPM) that acts as a schema for
the workflow system. Processes and its related constraints are the
key elements of workflow constraints. There are different types of
workflow constraints: some constraints based on data values where
the end result of the process determines the following workflow.
Constraints that control the flow of the processes, constraints based
on time while other constraints based on roles that determine who
is responsible for what. A brief description of workflow constraints
that are considered in this paper are outlined in Table 2.</p>
    </sec>
    <sec id="sec-5">
      <title>2.3 Graph Privacy and Utility</title>
      <p>Workflow provenance graphs contain sensitive information. For
example, in healthcare, activities including patient diagnoses,
treatments, and processes performed by health care professionals are
considered sensitive information. A major goal of preserving
privacy is to assure that sensitive information is properly protected.
Hence, we need to define what is meant by sensitive information.
Sensitive information is the information that needs to be hidden
from unauthorized users. In this paper, sensitive information can be
a provenance graph node (process, data, or actor) or a provenance
graph edge (used, generated by, controlledBy).</p>
      <p>We formalize two privacy goals in workflow provenance graphs.
- Node Privacy: node privacy concerns with hiding the
identity or attribute values of a node. Let G be the original
graph and V be a node belonging to G. Let G' be the
sanitized graph view of G. The privacy of V means if an
adversary is given G' and extra domain knowledge
information then he should not reveal the identity of E in
the original graph G.</p>
      <p>Edge Privacy: edge privacy concerns with hiding the
relationships between two of the graph nodes. Let G be
the original graph and E be an edge belonging to G. Let
G' be the sanitized graph view of G. The privacy of E
means if an adversary is given G' and extra domain
knowledge information then he should not reveal the
existence of E in the original graph G. In this paper, Edge
privacy can be specified in terms of node privacy as it
requires hiding the identity of the two connected nodes.
Note that, we do not consider removing edges in our
proposed approach.</p>
      <p>Inspired by the fact that a process uses input data to produce output
data, the following is a subset of the inference rules presented in
[30] to prevent disclosure of sensitive information that will be
considered in our approach.</p>
      <p>R1: Revealing the input and output would identify the process
R2: Revealing the process and output would reveal the input and
revealing the input and process would reveal the output
Utility is concerned with the amount of information presented in
the sanitized graphs to be useful for sharing. Utility of the graph is
measured by the utility of its nodes and utility of its connections
(edges). There are different utility measures for graph. In this paper,
we use the utility measure that introduced in [3] where the utility of
nodes is measured by the average percentage of nodes in the
original graph that are retained the sanitized graph. The utility of
edges is measured by the average percentage of edges in the
original graph that are retained in the sanitized graph.</p>
    </sec>
    <sec id="sec-6">
      <title>3. RELATED WORK</title>
      <p>
        Provenance sanitization via provenance graph transformation is
discussed by several researchers [32], [33], [18], [
        <xref ref-type="bibr" rid="ref28">13</xref>
        ], [34].
The authors in [35] proposed ZOOM which controls access to
sensitive provenance data by driving provenance views from
workflow views. The main advantage of ZOOM it allows
provenance to be derived at different levels of granularity.
However, it cannot be used for complex scientific workflows. The
authors in [11] proposed an anonymization approach named
Surrogate Parenthood that derives a protected graph G' from the
original graph G that preserves the provenance policies. In
Surrogate Parenthood each path in G' exists in G. The main
advantage of this work is the fact that it preserves the utility of the
original graph. The authors in [12] presented a framework called
ProPub to publish provenance data based on the data log which
stores user privileges, portions of the graph that need to be
abstracted, deleted, or retained, as well as graph policies. The main
advantage of this framework is that it detects conflicts between
sanitization policy and provenance policies.
      </p>
      <p>The authors in [19] proposed a model known as Provenance
Abstraction Model (PAM) and implemented a tool called ProvAbs
which uses a grouping approach with a defined clearance level of
both graph components and users to get a secured graph view. The
main advantage of this model is that the resulting graph preserves
the confidentiality of the provenance graph. On the other hand, it
may require multiple grouping to preserve to provenance policies.
The authors in [6] proposed a graph grammar rewriting rules that
generate a safe provenance graph. Their proposed rewriting rules
involve some graph operations such as vertex contraction, path
contraction, edge contraction, and node relabeling. The main
limitation of this work is its negligence of provenance policies as
well as they do not study the disclosure threat.</p>
      <p>To conclude, previous attempts proposed different approaches to
preserve the privacy of the OPM provenance graph but sacrifice
either the graph privacy or the graph utility. Grouping approaches
preserve the privacy of the OPM graph, but it sacrifices utility in
addition, it requires handling conflicts between the resulted graph
and the provenance policies. On the other hand, anonymization
approaches increase the utility of the sanitized OPM graph (as it
just changes the nodes to less sensitive ones and do not change the
structure of the graph) but they do not guarantee its privacy.
Moreover, anonymization and grouping depend on the user
specification which is very hard specifically for large OPM graphs.
Nevertheless, previous efforts ignored the importance of preserving
graph privacy and provenance policies while preserving an
acceptable level the graph utility. In the remaining of this paper, we
present a novel approach for protecting provenance graphs from
disclosing sensitive information that caused by workflow
constraints without sacrificing neither graph privacy nor its utility.</p>
    </sec>
    <sec id="sec-7">
      <title>4. MOTIVATING EXAMPLE</title>
      <p>In this Section, we show how the knowledge of workflow
constraints may allow adversary users to infer sensitive information
from a sanitized workflow provenance graph. In the following
discussion, Pa denotes an anonymized view of a process node. Da
denotes an anonymized view of a data node. While Pg and Dg
denote a grouped process node and a grouped data node
respectively.</p>
      <p>Figure 2 represents graph G which is a fragment of a workflow
provenance graph. Consider the following set of constraints that
govern the workflow execution where P1, P2, P4, P5 are workflow
processes.</p>
      <sec id="sec-7-1">
        <title>C1: P1 and P2 have the same input data</title>
        <p>C2: P4 must be executed by a manager
C3: P2 and P5 must be executed by the same actor
C4: P2 and P4 are sequential processes
C5: If the output data of P4 is greater than 5 then it must be
processed by P5
D8</p>
        <p>U
D1
D2
In the following discussion, we discuss two scenarios with different
types of sensitive information and the effectiveness of different
sanitization approaches mainly anonymization and grouping to
prevent disclosure of sensitive information.</p>
        <p>Scenario 1: Sensitive information is a process node P2. Figure 3
represents different sanitized views of graph G that are displayed
in Figure 2. Figure 3(a) represents G1, which is produced using
anonymization. P2 is anonymized to Pa and its output D2 is
D8</p>
        <p>U</p>
        <p>P6
D8</p>
        <p>U
D1</p>
        <p>D2</p>
        <p>U
U
U</p>
        <p>P6
U
anonymized to Da to ensure that Pa cannot be detected from its
input and output data values. G1 prevents direct attacks to P2.
However, an attacker can infer that Pa is P2 via constraint C1. The
utility of graph G1 is [node utility=19/19 and edge utility=19/19].
Using the grouping approach, P1, P2 and their output data D3 and
D4 are grouped to process Pg1. This grouping will preserve the
privacy of P2, however, it violates the provenance policies as two
nodes from the same type Pg1 and P4 are connected. Hence, we
need to group more nodes to satisfy all the provenance policies.
Figure 3 (b) represents graph G2 that is produced via grouping P4
and its actors with Pg1 and D1 to Pg2.we had to group D1 to handle
NFD policy (Table 1). G2 preserves the privacy of sensitive process
P2 and preserves the provenance policies. However, it affects the
graph utility (node utility=12/19, edge utility=12/19) as it groups
many of the graph components.</p>
        <p>Scenario 2: Sensitive information is a data node D5. Figure 4(a)
represents G3 that anonymizes D5 to Da and anonymizes P3 to Pa.
the utility of G3 is [the node utility=19/19 and edge utility=19/19].
Figure 4(b) represents G4 that groups P3, D5, and D2 to Pg. Both
G3 and G4 preserve the privacy of D5, but the utility of G3 has
been better than G4 utility [node utility= 16/19 and edge
utility=16/19].</p>
        <p>This example proves that one approach will not fit in all cases. The
choice of the optimal sanitization approach depends on the defined
workflow constraints.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>5. ProvS Architecture</title>
      <p>In the following discussion, we state some general assumptions to
ensure the effectiveness of our proposed approach. Those
assumptions basically define the general workflow structure to
distinguish other structures that can be a result of firing workflow
constraints. First, workflow processes are in sequential order.
Second, the cardinalities of the relationships between the
provenance graph components are unknown. Third, different
processes are executed by different actors (SOD). Fourth, the input
provenance graph is valid according to provenance graph policy
(see Table 1). We used the graph utility measure that proposed in
[18]. Finally, ProvS produces the set of sanitization actions that
need to be applied to a provenance graph to be secured. We are not
concerned with how the nodes will be anonymized as there are
many approaches exits in the literature for anonymization [27] [16].
The architecture of ProvS is portrayed in Figure 5. The
architecture encapsulates two main phases which will be described
in the following discussion.</p>
    </sec>
    <sec id="sec-9">
      <title>5.1 The Design Phase</title>
      <p>The design phase is concerned with collecting and refinement
workflow constraints by the workflow experts to be used for
controlling the workflow execution. This phase is performed only
once in offline. It may be repeated only when workflow constraints
are updated. Actually, the main incentive behind studying
workflow constraints are twofold:</p>
      <p>Sanitization will work under supervision of a Workflow
Management System (WFMS) that enforces these
constraints during workflow runs. Therefore, the
workflow constraints will play an integral part in
securing workflow provenance graphs.</p>
      <p>Determining what nodes need to be anonymized and
what nodes need to be grouped in order to preserve the
graph privacy with respect to the knowledge of workflow
constraints.</p>
      <p>ConNet</p>
      <p>Repository</p>
      <p>Privacy Policy
Provenance Graph Repository
Execution Time
Sanitizer</p>
      <p>Preparing
Sensitive Nodes</p>
      <p>Compliance Rules</p>
      <p>Repository</p>
      <p>Business Rules</p>
      <p>Repository
WF Constraints
Graph Generator</p>
      <p>Refinement &amp;</p>
      <p>Resolve Conflicts
Inference Engine</p>
      <p>Filtering</p>
      <p>Get the
supplement
actions</p>
      <p>WF Expert</p>
      <p>Graph
Owner
Workflow constraints specify either the values of nodes' attributes
or the relationships between processes. For illustration, BOD
constraint defines a link structure as multiple process nodes are
connected to the same agent node through controlledBy
relationship. MMP constraint defines a link structure as multiple
data output nodes are connected to the same process node. SP
constraint defines a link structure as a process output node is
connected to the other process node. PP constraint defines a value
of two process nodes, it does not specify any link structure between
them. Similarly, PEP constraint defines a value of a process node.
From the previous discussion and our motivation example we can
conclude some remarks, First, constraints that affect the identity
disclosure, attribute disclosure or define the general workflow
structure (discussed earlier in this section) can be secured using
anonymization approach. On the other hand, constraints which
specify a link structure different from the general workflow
structure cannot be secured against illegal inferences using an
anonymization approach as the structure of the graph is the key
player and anonymization does not hide this structure. Therefore, it
should be sanitized using a grouping approach to hide the graph
structure. A complete assessment of workflow constraints using
anonymization and grouping against privacy is discussed in [21].
Subsequently, we classify the workflow constraints (Table 2),
according to their influence on graph privacy as follows:</p>
      <p>Configuration constraints are the types of constraints
that define link structure different from the general
workflow structure.</p>
      <p>Identity constraints are the types of constraints that
define a node identity or attribute values such that relate
actor or role to a process or define relationships between
processes and/or data.</p>
      <p>Orthogonal to this classification, Binding of duty, different
workflow and same input constraints are considered configuration
constraints while the other constraints are considered identity
constraints.
mentioned, SP is the general graph structure therefore, it will not
be considered as a configuration constraint.</p>
      <p>Driven by the following facts: workflow constraints may have
different influence on the privacy of OPM graphs. 1) Some
constraints might not affect the privacy of OPM graph, 2) One
constraint may be used to reveal sensitive information; 3) Multiple
constraints can be combined to reveal sensitive information. We
propose a constraint network graph (ConNet) which offers a means
of identifying potential relationships between workflow
components.</p>
      <p>ConNet is an undirected graph that represents the workflow
constraints exist among different workflow components. A ConNet
node can be a process, data, or actor that exists in a workflow
constraint. The edges represent constraints between the connected
nodes. These edges are labelled with the workflow constraint name
that governs the execution of the two connecting nodes. The main
incentive behind this graph is to discover all the paths that an
attacker can go through to infer sensitive information. ConNet is
constructed as follows:</p>
      <p>For each constraint in the workflow constraints
1. Draw a node for each OPM component in the
constraint
2. Label these nodes with the component's name in the
constraint
3. Connect the nodes in each constraint with an edge
and label this edge with the constraint name
(specified in Table 2).</p>
      <p>Continuing with our motivating example discussed in Section 4, the
ConNet is illustrated in Figure 6 ConNet graph is stored in ConNet
repository to be used later in the execution phase.</p>
      <p>P1
Manager</p>
      <p>Actor</p>
      <p>SameInput</p>
      <p>BindingOfDuty
P2</p>
      <p>P4</p>
      <p>Sequential
Role Const.</p>
      <p>Ce
rtain P
rocess</p>
      <p>P5</p>
    </sec>
    <sec id="sec-10">
      <title>5.2 The Execution Phase</title>
      <p>This phase is concerned with generating the sanitization actions
that need to be applied to a workflow provenance graph based on
privacy policy and present these sanitization actions to the graph
owner. The main components of this phase are:
Privacy Policy: the privacy policy stores the privacy preferences.
It contains the sensitive graph components [nodes/edges] that need
to be hidden.</p>
      <p>Provenance Graph Repository: stores the workflow provenance
graphs that need to be sanitized for sharing.</p>
      <p>The Inference Engine Module: The main purpose of this module
is to find the set of provenance graph nodes that are related to the
sensitive information using ConNet graph. The InferenceEngine
procedure in Figure 7 generates two sets, one for the anonymized
nodes AA set which is composed of individual nodes that are
required to be anonymized, while the other is for the grouped nodes
GA set which is a set of nodes set which are required to be grouped
together. The procedure first checks if the sensitive node N exists
in the ConNet (line 2) then, for all the edges that exist in the paths
connected to N, it gets the edge label to determine its classification
type (line 4-8). Then, based on the edge classification the connected
nodes of this edge it is added either to AA or GA (line 9-11).
The Sanitizer Module: the sanitizer has three main functions (i)
preparing the sensitive node set, (ii) Filtering; (iii) get the
supplement actions. The sanitizer algorithm is presented in Figure
9. Preparing sensitive nodes set by acquiring the sensitive nodes
from the privacy policy and for each sensitive edge it get its
connected nodes and add them to the sensitive nodes set. Next, the
sanitizer sends these sensitive set to the InferenceEngine module
(line 5). Further, it filters the AA set and GA set generated from the
InferenceEngine. For multiple sensitive nodes which have different
sanitization actions the sanitizer merge the anonymization sets and
grouping sets. In addition, it uses the following rules in the filtering
step.</p>
      <p>Rule 1: If a process exists in both AA set and GA set then remove
it from AA (as grouping is more restricted than anonymization).
Rule2. If a process exists in multiple groups in GA then merge these
groups to one group (as a node cannot exist in more than one
group).</p>
      <p>Finally, the sanitizer gets the supplement actions (Line 8 and 221)
to handles the limitations of anonymization and grouping. For
grouping set it uses Rule 3 (Line 8-12).
Rule 3: For each grouped process set, group the data outputs nodes
of the grouped processes into a data group node (to prevent NFD
and NWC problem in Table 1).</p>
      <p>For the anonymization set. It computes the related actions using
procedure RelatedActions in Figure 8. Which uses the following
rules to generate related actions that need to be added to AA set.
Rule 4: for each anonymized data node, anonymize its generated
process and used process (to prevent disclosure of this data using
R1 and R2 in Section 2.3).</p>
      <p>Rule 5: for each anonymized process in AA anonymize its output
data and the process that used this output data (to prevent
disclosure of the anonymized process using R1 and R2).
The sanitizer output the final anonymization set and grouping set
to the graph owner.</p>
      <p>Continuing with our motivating example discussed in Section 4, for
the first scenario, the anonymization set AA= {P4, D6, A3, A4}
and the grouping set GA= {{P1, P2, P5}, {D3, D4, D7}}. The
resulted graph utility will be [nodes utility=15/19 and the edge
utility= 15/19].</p>
      <sec id="sec-10-1">
        <title>The Sanitizer Algorithm</title>
        <p>Input: OPM graph G(V,E), Sensitive Set SS, ConNet C(V,E)
Output: Anonymization set AA and Grouping set GA
Method:
1. Define Anonymization Actions Sets AA
2. Define Grouping Actions Set GA
3. Define Output Data Set OD
4. For each S in SS
5. InferenceEngine (S, ConNet, AA,GA)
6. IF (AA !=  ˅ GA != ) Then
7. Begin
8. For each group in GA
9. Begin
10. OD=set of all generated data from this group
11. GA=GA OD
12. End
13. IF (an element exists in different pairs in GA) Then
14. union these pairs to be one group
15. For each element e in AA
16. IF(e h where hGA) Then
17. AA=AA-{e}
18. For each element e in AA
19. RelatedActions(e, G, AA output)
20. End IF
21. Else// S is not in the ConNet
22. Begin
23. AA=AA{S}
24. RelatedActions(S, G, AA output)
25. End
26. End
In the second scenario, the anonymization set AA= {P3, D5, P5}
and the grouping set GA=. The resulted graph utility will be [node
utility= 19/19 and edge utility=19/19].</p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>5.3 Evaluation</title>
      <p>In Table 4, we compare the main features of different existing
sanitization approaches against ProvS. It is clear from the table that
both Surrogates and ProvS do not cause conflicts in the resulting
graph while other approaches detect and solve conflicts. The reason
is that anonymization preserves the graph structure and ProvS uses
homogenous grouping which handles this issue. For the utility issue
surrogates provides a utility measure for paths and nodes to
measure the informative of the surrogates. ProvAbs annotates each
node in the graph with a utility value which defines the nodes to be
retained in the resulted graph. ProvS uses utility for paths and
nodes to measure the informative of the resulted graph. For the
privacy feature, all the approaches that use grouping preserve the
graph privacy.</p>
      <p>We perform experiments to ensure the effectiveness of ProvS.
Experiments were conducted on a HP PC with 2.53GHz Intel Core
i5 CPU, 4GB RAM and 160 disk space running MS Windows 7.
Neo4j 2.2.5 graph database is used to store OPM provenance
graphs and constraints network (ConNet). Java is used as the main
programming language for writing the logic of the code and we
used Cypher queries to retrieve data from the provenance graph and
ConNet.A set of workflow constraints was generated along with a
series of synthetic provenance graphs that satisfy the generated
workflow constraints. The experiments were conducted on 3
different sets of workflow constraints. The first a set of type
configuration constraints which were sanitized via grouping, the
second a set of type identity constraints which were sanitized via
an anonymization approach. The third set is mixed half of them
from identity constraint type and the other half from the
configuration constraint type which was sanitized using a
combination of anonymization and grouping. The results of the
experiments are shown in Figure 10 , which shows clearly that
ProvS increases the graph utility even if the workflow constraints
are configuration constraints. This is due to homogenous grouping
that decreases the number of graph properties which are needed to
be grouped.</p>
      <p>Utility Measure for Different Types of Constraints</p>
      <p>Number of Configration Constraints
Connected Graph</p>
      <p>Unconnected Graph
Additionally, an additional series of tests to determine the effect of
ConNet topology on the provenance graph utility in case of
configuration constraints. Figure 11 shows a comparison of the
ConNet topology versus graph utility while keeping the sensitive
information, provenance graph size and ConNet size constant. This
comparison shows that unconnected ConNet graphs increase the
utility of the graph as the groups in the grouping set will not be
overlapping and hence, will not be joined. On the other hand, highly
connected ConNet graphs decrease the resulted graph utility as
many nodes will be in different groups and hence these groups will
be joint to be one group.</p>
    </sec>
    <sec id="sec-12">
      <title>6. CONCLUSION AND FUTURE WORK</title>
      <p>Traditional sanitization approaches which are tailored to secure
provenance graphs are no longer sufficient as they ignore an
important source of security, namely graph privacy attacks due to
the use of workflow constraints. In this paper, we first illustrated
the main problems of anonymization and grouping approaches.
Then, we highlighted different types of workflow constraints that
could be used in disclosing sensitive information. Further, we
classified these constraints according to their influence on graph
privacy. Consequently, we introduced ProvS that ensures secure
and valid provenance graphs. ProvS combines anonymization and
grouping for sanitizing sensitive provenance graph. Experimental
results show the effectiveness of ProvS through testing it on a
graph-based system implementation.</p>
      <p>For future work, we plan to enhance the performance of ProvS in
the presence of large provenance graphs and large number of
workflow constraints. In addition, applying our approach to
different case studies.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Alsiyami</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <article-title>“A Policy Language Definition for Provence in Pervaisve Computing,”</article-title>
          <string-name>
            <given-names>Ph.D.</given-names>
            <surname>Thesis</surname>
          </string-name>
          . Univ. Sussex.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Altintas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Barney</surname>
          </string-name>
          , and E. Jaeger-frank, “
          <article-title>Provenance Collection Support in the Kepler Scientific Workflow System,”</article-title>
          <source>in Proceedings of the international conference on Provenance and Annotation of Data</source>
          ,
          <year>2006</year>
          , vol.
          <volume>4145</volume>
          , pp.
          <fpage>118</fpage>
          -
          <lpage>132</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Blaustein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chapman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Seligman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Allen</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Rosenthal</surname>
          </string-name>
          , “Surrogate Parenthood: Protected and
          <string-name>
            <given-names>Informative</given-names>
            <surname>Graphs</surname>
          </string-name>
          ,” in PVLDB,
          <year>2011</year>
          , vol.
          <volume>4</volume>
          , no.
          <issue>8</issue>
          , pp.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Cadenhead</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kantarcioglu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Thuraisingham</surname>
          </string-name>
          , “
          <article-title>A framework for policies over provenance</article-title>
          ,
          <source>” Proc. 3rd USENIX Work. Theory Pract. Proven. (TaPP '11)</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Cadenhead</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kantarcioglu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Thuraisingham</surname>
          </string-name>
          , “
          <article-title>An Evaluation of Privacy, Risks and Utility with Provenance,”</article-title>
          <string-name>
            <surname>Secur. Knowl. Manag. Work.</surname>
          </string-name>
          (SKM),
          <year>Novemb</year>
          .,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Thuraisingham</surname>
          </string-name>
          , “
          <article-title>Transforming provenance using redaction</article-title>
          ,
          <source>” Proc. 16th ACM Symp. Access Control Model.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Technol</surname>
          </string-name>
          . -
          <source>SACMAT '11</source>
          , p.
          <fpage>93</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Sci. Eng.</surname>
          </string-name>
          , vol.
          <volume>3</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>94</fpage>
          -
          <lpage>102</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Comput.</surname>
          </string-name>
          , vol.
          <volume>3</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>322</fpage>
          -
          <lpage>337</lpage>
          , Oct.
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Cheney</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Perera</surname>
          </string-name>
          , “
          <article-title>An Analytical Survey of Provenance Sanitization</article-title>
          ,” in IPAW,
          <year>2014</year>
          , pp.
          <fpage>113</fpage>
          -
          <lpage>126</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Cohen-Boulakia</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Biton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cohen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Davidson</surname>
          </string-name>
          , “
          <article-title>Addressing the provenance challenge using ZOOM,”</article-title>
          <string-name>
            <surname>Concurr. Comput. Pract. Exp.</surname>
          </string-name>
          , vol.
          <volume>20</volume>
          , no.
          <issue>5</issue>
          , pp.
          <fpage>497</fpage>
          -
          <lpage>506</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Davidson</surname>
            ,
            <given-names>S. B.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Bao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Roy</surname>
          </string-name>
          , “
          <article-title>Hiding Data and Structure in Workflow Provenance</article-title>
          ,” DNIS, pp.
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Dey</surname>
            ,
            <given-names>S. C.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zinn</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Ludäscher</surname>
          </string-name>
          , “ProPub:
          <article-title>Towards a declarative approach for publishing customized, policyaware provenance</article-title>
          ,
          <source>” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics)</source>
          , vol.
          <volume>6809</volume>
          LNCS, pp.
          <fpage>225</fpage>
          -
          <lpage>243</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Dey</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Ludascher</surname>
          </string-name>
          , “
          <article-title>A declarative approach to customize workflow provenance</article-title>
          ,
          <source>” Proc. Jt. EDBT/ICDT 2013 Work. - EDBT '13</source>
          , p.
          <fpage>9</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Dey</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zinn</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Ludáscher</surname>
          </string-name>
          , “
          <article-title>Repairing provenance policy violations by inventing non-functional nodes,”</article-title>
          <source>in CEUR Workshop Proceedings</source>
          ,
          <year>2011</year>
          , vol.
          <volume>737</volume>
          , pp.
          <fpage>109</fpage>
          -
          <lpage>122</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Jurnawan</surname>
          </string-name>
          , W. and U. Röhm, “
          <article-title>Data provenance support in relational databases for stored procedures</article-title>
          ,
          <source>” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          ,
          <year>2009</year>
          , vol.
          <volume>5667</volume>
          LNCS, pp.
          <fpage>97</fpage>
          -
          <lpage>111</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Kifer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>“L-Diversity</surname>
            <given-names> </given-names>
          </string-name>
          :
          <string-name>
            <surname>Privacy Beyond</surname>
          </string-name>
          k -Anonymity,
          <source>” Proc. 22nd Int. Conf. Data Eng.</source>
          , vol.
          <volume>1</volume>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>36</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Anand</surname>
          </string-name>
          , “Managing Scientific Workflow Provenance,” Unversiy od California,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Danger</surname>
          </string-name>
          , “
          <article-title>ProvAbs: model, policy, and tooling for abstracting PROV graphs,” Procs</article-title>
          .
          <article-title>IPAW 2014 (Provenance Annot</article-title>
          ., pp.
          <fpage>3</fpage>
          -
          <lpage>15</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Danger</surname>
          </string-name>
          , “
          <article-title>Provenance Graph Abstraction by Node Grouping,”</article-title>
          <source>Sch. Comput. Sci. Tech. Rep. Ser., no. August</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Moreau</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Freire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Futrelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Mcgrath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Myers</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Paulson</surname>
          </string-name>
          , “
          <article-title>The Open Provenance Model : An Overview,” IPAW, no</article-title>
          .
          <source>August</source>
          <year>2007</year>
          , pp.
          <fpage>323</fpage>
          -
          <lpage>326</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Nagy</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , “
          <article-title>Assessment of Workflow Constraints and Sanitization Approaches on Provenance Graph Privacy</article-title>
          and Utility,”
          <source>Tech. report, Cairo Univ.</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>Sandhu</surname>
          </string-name>
          , “
          <article-title>A Provenance-based Access Control Model,” 2012 Tenth Annu</article-title>
          .
          <source>Int. Conf. Privacy, Secur. Trust</source>
          , pp.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Wipat</surname>
            , and
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
          </string-name>
          , “
          <article-title>Taverna: A tool for the composition and enactment of bioinformatics workflows</article-title>
          ,” Bioinformatics, vol.
          <volume>20</volume>
          , no.
          <issue>17</issue>
          , pp.
          <fpage>3045</fpage>
          -
          <lpage>3054</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Pesic</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , “
          <article-title>Constraint-Based Workflow Management Systems</article-title>
          : Shifting Controls to Users,”
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>Thuraisingham</surname>
          </string-name>
          , “
          <article-title>REDACT: A Framework for Sanitizing RDF Data,”</article-title>
          <source>in Proceedings of the 22nd international conference on World Wide Web companion</source>
          ,
          <year>2013</year>
          , pp.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Runte</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          and
          <string-name>
            <surname>M. El Kharbili</surname>
          </string-name>
          , “
          <article-title>Constraint Checking for Business Process Management,”</article-title>
          <source>in GI Jahrestagung</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>4093</fpage>
          -
          <lpage>4103</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          13, no.
          <issue>6</issue>
          , pp.
          <fpage>1010</fpage>
          -
          <lpage>1027</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <surname>Baumgrass</surname>
          </string-name>
          , “
          <article-title>Detecting and Resolving Conflicts of Mutual-Exclusion and Binding Constraints in a Business Process Context,” Move to Meaningful Internet Syst</article-title>
          .
          <source>OTM</source>
          <year>2011</year>
          , vol.
          <volume>7044</volume>
          , no.
          <source>October</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>18</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <surname>Silva</surname>
            ,
            <given-names>C. T.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Anderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Santos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Freire</surname>
          </string-name>
          , “
          <article-title>Using VisTrails and provenance for teaching scientific visualization</article-title>
          ,
          <source>” Comput. Graph. Forum</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>Thuraisingham</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , T. Cadenhead,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kantarcioglu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Khadilkar</surname>
          </string-name>
          ,
          <article-title>Secure Data Provenance and Inference Control with Semantic Web</article-title>
          .
          <source>Auerbach Publications</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <surname>Papazoglou</surname>
          </string-name>
          , “
          <article-title>Enforcing Compliance on Business Processes Through the Use of Patterns,”</article-title>
          <source>in 19th European Conference on Information Systems (ECIS</source>
          <year>2011</year>
          ), Finland,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <surname>White</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>a and</article-title>
          <string-name>
            <surname>I. B. M. Corp</surname>
          </string-name>
          , “Process Modeling Notations and Workflow Patterns,” Bus.
          <volume>21</volume>
          , vol.
          <volume>21</volume>
          , pp.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>