=Paper=
{{Paper
|id=Vol-2322/DARLIAP_4
|storemode=property
|title=Privacy-Preserving Data Analysis Workflows for eScience
|pdfUrl=https://ceur-ws.org/Vol-2322/DARLIAP_4.pdf
|volume=Vol-2322
|authors=Khalid Belhajjame,Noura Faci,Zakaria Maamar,Vanilson Burégio,Edvan Soares,Mahmoud Barhamgi
|dblpUrl=https://dblp.org/rec/conf/edbt/BelhajjameFMBSB19
}}
==Privacy-Preserving Data Analysis Workflows for eScience ==
Khalid Belhajjame (PSL, Université Paris-Dauphine, LAMSADE, Paris, France — khalid.belhajjame@dauphine.fr)
Noura Faci (Claude Bernard University, Lyon, France — noura.faci@univ-lyon1.fr)
Zakaria Maamar (Zayed University, Dubai, United Arab Emirates — zakaria.maamar@zu.ac.ae)
Vanilson Burégio (Federal Rural University of Pernambuco, Recife, Brazil — vanilson.buregio@ufrpe.br)
Edvan Soares (Federal Rural University of Pernambuco, Recife, Brazil — edvan.soares@ufrpe.br)
Mahmoud Barhamgi (Claude Bernard University, Lyon, France — mahmoud.barhamgi@univ-lyon1.fr)
ABSTRACT
Computing-intensive experiments in modern sciences have become increasingly data-driven, illustrating perfectly the challenges of the Big-Data era. These experiments are usually specified and enacted in the form of workflows that need to manage (i.e., read, write, store, and retrieve) sensitive data such as persons' past diseases and treatments. While there is an active body of research on how to protect sensitive data by, for instance, anonymizing datasets, there are few approaches that assist scientists in identifying the datasets, generated by the workflows, that need to be anonymized, along with setting the anonymization degree that must be met. We present in this paper a preliminary approach for setting and inferring the anonymization requirements of datasets used and generated by a workflow execution. The approach was implemented and showcased using a concrete example, and its efficiency assessed through validation exercises.

1 INTRODUCTION
Data-driven transformation and analysis (e.g., re-formatting data and computing statistics) are omnipresent in science and have become attractive for verifying scientists' hypotheses. This verification depends on the availability of datasets that third parties (e.g., government bodies and independent organizations) supply for re-formatting, combination, and scrutiny using what the community refers to as complex Data analysis Workflows (DWfs) [9]. A DWf is a process that has an objective (e.g., discover prognostic molecular biomarkers) and a set of operations packaged (at design time) into stages (e.g., pre-process and analyze) and orchestrated (at run-time) according to data and other dependencies that the workflow designer specifies. Despite the availability of free datasets for the scientific community (e.g., Figshare1, Dataverse2, OpenAire3, and DataOne4), data providers in certain disciplines are still reluctant to share their datasets with the community. Indeed, there is a serious concern about inappropriate manipulation or misuse of datasets during experiments that could lead to sensitive-data leaks. Although this could happen inadvertently, the consequences remain the same. As a result, some scientists/DWfs are deprived of valuable and necessary datasets due to restrictions (e.g., access control policies) that the data providers impose. Moreover, data analysis may yield sensitive and private data about individuals (e.g., health conditions) that were not expected during the experiment design.

Various research works (e.g., [4, 18, 26, 29–31]) have examined data outsourcing and/or sharing from a privacy perspective. We note, however, that in the context of data analysis workflows the techniques/tools that assist the designer in the specification and enforcement of data protection policies are limited. In particular, scientists need to identify the parameters in the workflows that carry sensitive datasets during their execution, and determine which anonymization method should be applied to those datasets prior to their publication. This task can be tedious, especially for large workflows.

In this preliminary work, we overcome the above issue by providing scientists with the means to automatically (i) identify the workflow parameters that are bound to sensitive data during the workflow execution, and (ii) infer the anonymity degree that needs to be applied to such datasets before releasing them publicly. We define what we exactly mean by anonymity degree in Section 3.1 when introducing k-anonymity [23].

Our contributions are as follows: (i) an architecture of a privacy-preserving workflow system that preserves the privacy of the datasets used and generated when enacting workflows, (ii) a method for automatically detecting sensitive datasets and setting their anonymity degree, and (iii) a system that implements the proposed method and experiments that showcase its efficiency using real-world scientific workflows.

The paper is organized as follows. Section 2 presents a scientific workflow from the health-care domain that we use as a running example. Section 3 presents an architecture for a privacy-preserving workflow environment, and then discusses certain necessary requirements that this environment should satisfy. Section 4 presents a new method for automatically detecting sensitive workflow parameters, and for inferring the anonymity degree that should be enforced when publishing the datasets used or generated by such parameters as a result of the workflow execution. This method is implemented and validated in Sections 5 and 6, respectively. Section 7 presents a literature review. Conclusions are drawn in Section 8.
1 figshare.com
2 dataverse.org
3 openaire.eu
4 dataone.org

©2019 Copyright held by the author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisbon, Portugal) on CEUR-WS.org.

2 RUNNING SCENARIO
Fig. 1 exemplifies a DWf that consists of five operations (op1, ..., op5) connected through dataflow dependencies. Input/output parameters are omitted for the sake of readability. This workflow's operations are as follows:
Figure 1: Example of data-analysis workflow
• op1 queries a dataset to get nutrition data. Table 1 is an example of this operation's output, listing for each patient her average daily intake of fruits & vegetables, dairy products, meat, and dessert.
• op2 retrieves oncology data about patients in terms of type of cancer and age (Table 2).
• op3 combines Table 1 and Table 2's data. Specifically, it performs a natural join on the nutrition and oncology information; a small sketch of this join is given after this list. The combination's outcome is presented in Table 3. Note that, in the general case, not all nutrition patients will be oncology patients, and vice-versa. We have the same patients in Tables 1 and 2 for the sake of illustration only.
• op4 implements a machine learning model that helps predict the likelihood of a patient to suffer from a particular type of cancer given his/her nutrition habits. Examples of models that can be produced are decision-based trees, neural networks, and Bayesian networks, to mention just a few.
• Finally, op5 generates a final report that the scientist will examine. Such a report contains various information, such as the nutrition attributes that are prevalent in identifying the type of cancer the patients may suffer from, as well as information about the performance of the prediction model, e.g., accuracy, ROC curve, etc. [3].
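To make op3 concrete, the following minimal sketch performs the natural join described above. It assumes the pandas library and hypothetical column names matching Tables 1 and 2; it is an illustration, not the workflow's actual implementation.

import pandas as pd

# Hypothetical extracts of op1's and op2's outputs (Tables 1 and 2).
nutrition = pd.DataFrame({
    "patient": ["John", "Ahmed"], "id": [1, 2],
    "fruits_veg_g": [80, 100], "dairy_cl": [33, 20],
    "meat_g": [150, 200], "dessert_g": [200, 150],
})
oncology = pd.DataFrame({
    "patient": ["John", "Ahmed"], "id": [1, 2],
    "cancer": ["Melanoma", "Lung cancer"], "age": [25, 28],
})

# op3: natural join on the shared columns, yielding the combined
# records of Table 3.
combined = nutrition.merge(oncology, on=["patient", "id"])
print(combined)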
We assume that dietetics & nutrition and oncology departments willing to share their datasets should receive the necessary guarantees that safeguard private data from being leaked, misused, or tampered with, for example. In particular, they should be able to state that their datasets are sensitive and to set the anonymity degree that should be respected when anonymizing them.

3 PRIVACY-PRESERVING WORKFLOW MANAGEMENT SYSTEM
This section presents the architecture of our privacy-preserving WfMS and defines the requirements that would preserve this privacy.

3.1 Overview
In Fig. 2, providers make their datasets available to a (trusted) workflow management system that is able to manipulate such datasets without them being anonymized. The datasets supplied can be sensitive or non-sensitive. Sensitive datasets carry personal details on individuals and should, therefore, be anonymized before making them publicly available.

Initially, the datasets are transferred to a data repository that is private to the workflow system, in preparation for their "cleansing" (Step 1). Once the DWf starts (Step 2), the execution engine loads the "cleansed" datasets from the private data repository (Step 3). The obtained intermediate and final datasets are stored again in this repository (Step 4). If the DWf execution reveals new insights, the scientist may choose, at her discretion, to publish (some of) the datasets used and/or generated by the workflow in a public data repository (Step 6) for the benefit of the community, who could explore, reuse, or even review such datasets. Prior to their release, these datasets are anonymized (Step 5).

Figure 2: Chronology of operations in the WfMS
Table 1: Nutrition information of patients

Patient | ID | Fruits & Veg | Dairy | Meat | Dessert
John | 1 | 80g | 33cl | 150g | 200g
Ahmed | 2 | 100g | 20cl | 200g | 150g
Ian | 3 | 100g | 50cl | 300g | 250g
Suzanne | 4 | 50g | 50cl | 400g | 300g
Yassmine | 5 | 300g | 0cl | 0g | 100g
Xin | 6 | 250g | 0cl | 0g | 100g
Table 2: Oncology information of patients

Patient | ID | Type of Cancer | Age
John | 1 | Melanoma | 25
Ahmed | 2 | Lung cancer | 28
Ian | 3 | Lymphoma | 35
Suzanne | 4 | Breast cancer | 40
Yassmine | 5 | Cervical cancer | 65
Xin | 6 | Ovarian cancer | 70
Table 3: Combined nutrition and oncology information of patients

Patient | ID | Age | Cancer | Fruits & Veg | Dairy | Meat | Dessert
John | 1 | 25 | Melanoma | 80g | 33cl | 150g | 200g
Ahmed | 2 | 28 | Lung cancer | 100g | 20cl | 200g | 150g
Ian | 3 | 35 | Lymphoma | 100g | 50cl | 300g | 250g
Suzanne | 4 | 40 | Breast cancer | 50g | 50cl | 400g | 300g
Yassmine | 5 | 65 | Cervical cancer | 300g | 0cl | 0g | 100g
Xin | 6 | 70 | Ovarian cancer | 250g | 0cl | 0g | 100g
Different techniques can be used for data anonymization, e.g., generalization [27], perturbation [15], suppression [10], encryption, k-anonymization [23], and differential privacy [11]. Differential privacy is perhaps the most sophisticated method, with the strongest privacy guarantees. That said, it is not suitable for our purpose. Indeed, differential privacy is used to protect individual privacy in the context of statistical queries. In our case, we are interested in providing users with the means to explore the data produced by the executions of a workflow, as opposed to generating some statistics, which is what differential privacy is mainly targeted at. Because of this, we use k-anonymity in the context of this paper. k-anonymity has been extensively studied in the database and data mining communities [12, 25]. However, its use in data analysis workflows is still limited. To illustrate k-anonymity, let us consider a dataset (d) of records, each referring to an individual, with attributes, e.g., age, address, and gender, that could be used to reveal his identity. Such attributes are known as quasi-identifiers. (d) is k-anonymized, where (k) is an integer, if each quasi-identifier tuple occurs in at least (k) records in (d). For example, the dataset illustrated in Table 4 is 2-anonymized: each tuple occurs at least twice in the dataset. Therefore, each patient contained in the anonymized version of (d) cannot be distinguished from at least one other individual. In the remainder of the paper, we use the term anonymity degree to refer to (k).
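The anonymity degree of a dataset can be checked mechanically. The sketch below, assuming pandas and hypothetical quasi-identifier columns, computes k as the size of the smallest group of records that share the same quasi-identifier tuple:

import pandas as pd

def anonymity_degree(df, quasi_identifiers):
    # k is the smallest number of records sharing one
    # quasi-identifier tuple.
    return int(df.groupby(quasi_identifiers).size().min())

# Every (age_range, gender) tuple below occurs at least twice,
# so the dataset is 2-anonymized.
d = pd.DataFrame({
    "age_range": ["20-30", "20-30", "30-40", "30-40"],
    "gender":    ["F", "F", "M", "M"],
})
assert anonymity_degree(d, ["age_range", "gender"]) == 2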
3.2 How to achieve a privacy-preserving WfMS?
The datasets that a workflow uses or generates are not independent of each other. In particular, the workflow operations derive new datasets from an initial set of datasets that may be sensitive. Dependencies between the datasets should, therefore, be considered when setting the anonymity degree of the derived datasets based on the anonymity degree of the initial sensitive datasets. With this in mind, we present hereafter the requirements that should be met by a workflow environment to preserve the privacy of the datasets it uses and generates during the execution of workflows.

(1) The scientist should be able to specify the DWf's inputs that are bound to sensitive datasets during the execution of the DWf.
(2) Dataset providers that submit sensitive inputs to a workflow should establish their privacy requirements in terms of degree of anonymization. This degree will then be used to anonymize such datasets prior to their publication by the WfMS.
(3) The dependencies between the parameters of the operations that compose the workflow should be extracted. Such dependencies allow identifying the sensitive datasets that were used to derive a given dataset, with the view to calculating the anonymity degree of the latter based on the anonymity degrees of the former. Indeed, protecting a workflow's input datasets may not be sufficient to protect private information. Intermediate and final datasets that result from a workflow execution can contain sensitive data, too.
(4) A WfMS should assist scientists in identifying the workflow parameters that are bound to sensitive datasets, and in calculating the anonymity degree that needs to be enforced when publishing such datasets.

The next section illustrates how the aforementioned requirements are taken into account in the design of a privacy-preserving data workflow.
4 PRIVACY-PRESERVING DATA ANALYSIS WORKFLOWS
We begin by presenting a formal model for a DWf, and then specify the inputs of the workflow that are sensitive along with their anonymity degrees. Finally, we present a solution that automatically identifies the sensitivity and anonymity degree of the remaining parameters of the DWf.

4.1 Workflow model definition
Workflow model. We formally define a DWf as a tuple ⟨DWf_id, OP, DL⟩, where DWf_id is a unique identifier of the workflow, OP is the set of data manipulation operations (op_i) that constitute the workflow, and DL is the set of data links between these operations.

An operation op_i is defined by ⟨name, in, out⟩, where name is self-descriptive, and in and out represent input and output parameters, respectively. As some output parameters could be other operations' inputs, a parameter has a unique name (p_name). Let IN = ∪_{op ∈ OP} op.in and OUT = ∪_{op ∈ OP} op.out be the sets of all operations' inputs and outputs in a DWf, respectively. The set of data links connecting the workflow operations must then satisfy: DL ⊆ (OP × OUT) × (OP × IN). A data link relating op1's output ⟨o, op1⟩ to op2's input ⟨i, op2⟩ is therefore denoted by the pair ⟨⟨o, op1⟩, ⟨i, op2⟩⟩. We use IN_DWf and OUT_DWf to denote the DWf's inputs and outputs, respectively. In this work, we consider acyclic workflows that are free of loops; it is worth noting that most existing scientific workflow languages do not support loops [17].
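For illustration, the model above can be transcribed directly into code. The following is a minimal sketch under our own naming, not the system's actual data structures:

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Parameter:
    op: str      # name of the owning operation
    name: str    # unique parameter name (p_name)

@dataclass
class Operation:
    name: str
    inputs: list = field(default_factory=list)    # input parameters (in)
    outputs: list = field(default_factory=list)   # output parameters (out)

@dataclass
class DWf:
    dwf_id: str
    ops: list           # the set of operations OP
    data_links: list    # pairs (output parameter, input parameter), i.e., DL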
Sensitive parameters. To specify that a given input or output parameter of a DWf carries sensitive data, we use the following boolean function:

isSensitive(⟨op, p⟩)

which is true if the data bound to ⟨op, p⟩ during the DWf's execution are sensitive, and false otherwise. For example, in the running example (Section 2), the two initial parameters of the workflow are sensitive in that their instances are collections of records about patients along with their nutrition and cancer histories.

Parameter anonymity degree. The execution of a DWf corresponds to a DWf instance, denoted by (insWf). The anonymity degree of a DWf's parameter ⟨p, op⟩ is defined with respect to a given DWf instance (insWf). Indeed, different instances of a DWf may have input datasets with different anonymity degree requirements. For example, the owner of an input dataset used for a given workflow instance (insWf1) may impose a more stringent anonymity degree than the owner of an input dataset used for a different workflow instance (insWf2). As a result, the same workflow parameter may have different anonymity degrees depending on the workflow instance in question. Due to this difference in requirements, we use the following function to specify the anonymity degree of a given parameter ⟨p, op⟩ with respect to a workflow instance insWf:

anonymity(⟨p, op⟩, insWf)

For example, anonymity(⟨p, op1⟩, w1) = 3 specifies that the parameter ⟨p, op1⟩ has an anonymity degree of 3 within the workflow instance w1. Consider that the dataset (d) is bound to the parameter ⟨p, op1⟩ within the workflow instance (w1). Given that anonymity(⟨p, op1⟩, w1) = 3, (d) must be anonymized before its publication. Specifically, each record (individual) in the anonymized (d) must be indistinguishable from at least (2) other individuals [23].
4.2 Detecting sensitive parameters and inferring their anonymity degrees
Manually identifying the parameters of a workflow that are sensitive and setting their anonymity degrees can be tedious. This becomes a serious concern when the workflow includes a large number of operations. To address this issue, we propose in this section an approach that takes as input the sensitivity of the input parameters of the workflow (DWf) together with their anonymity degrees. It then detects the list of (intermediate and final) parameters in (DWf) that may be sensitive, and infers the anonymity degree that should be applied to the datasets bound to those parameters during the execution of the (DWf).

Parameter dependencies. Dependencies between a workflow (DWf)'s parameters are a key element of our approach. A parameter ⟨op, p⟩ depends on a parameter ⟨op′, p′⟩ in a workflow (DWf) if, during the execution of (DWf), the data bound to the parameter ⟨op′, p′⟩ contribute to or influence the data bound to the parameter ⟨op, p⟩.5

Parameter dependencies can be specified by examining the workflow specification (DWf).6 Given a workflow (DWf), the dependencies between its parameters are inferred as follows:

• Given an operation (op) that belongs to (DWf), we can infer that the outputs of (op) depend on its inputs. Consider, for example, that ⟨i, op⟩ and ⟨o, op⟩ are an input and an output of (op). We can infer that ⟨o, op⟩ depends on ⟨i, op⟩, which we write: dependsOn(⟨o, op⟩, ⟨i, op⟩).
• If the workflow (DWf) contains a data link connecting an output ⟨o, op⟩ to an input ⟨i, op′⟩, then we infer that ⟨i, op′⟩ depends on ⟨o, op⟩, i.e., dependsOn(⟨i, op′⟩, ⟨o, op⟩). This is because the data bound to ⟨i, op′⟩ during the workflow execution is a copy of the data bound to ⟨o, op⟩.

We also transitively derive dependencies between the operation parameters of a workflow based on the following rules:

R1: dependsOn*(⟨p, op⟩, ⟨p′, op′⟩) :− dependsOn(⟨p, op⟩, ⟨p′, op′⟩)
R2: dependsOn*(⟨p, op⟩, ⟨p′, op′⟩) :− dependsOn*(⟨p, op⟩, ⟨p″, op″⟩), dependsOn*(⟨p″, op″⟩, ⟨p′, op′⟩)

Applying the above rules to our example workflow, we conclude, for instance, that dependsOn*(⟨o, op3⟩, ⟨i, op2⟩), where i and o are parameter names.
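The two bullets and the rules R1 and R2 amount to collecting the base dependsOn facts from the workflow specification and taking their transitive closure. A minimal sketch, reusing the illustrative model above:

def depends_on(dwf):
    # Base facts: outputs depend on the inputs of the same operation,
    # and a data link's target input depends on its source output.
    dep = set()
    for op in dwf.ops:
        for o in op.outputs:
            for i in op.inputs:
                dep.add((o, i))
    for out_param, in_param in dwf.data_links:
        dep.add((in_param, out_param))
    return dep

def depends_on_star(dep):
    # R1: every base fact is in the closure; R2: iterate to a fixpoint.
    closure = set(dep)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure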
Detecting sensitive parameters. We use parameter dependencies to assist the workflow designer in identifying the intermediate and final parameters that may be sensitive. Specifically, a parameter ⟨p′, op′⟩ that is not an input to the workflow, i.e., ⟨p′, op′⟩ ∉ IN_DWf, may be sensitive if it depends on a workflow input that is known to be sensitive, i.e.,

∃ ⟨i, op⟩ ∈ IN_DWf s.t. isSensitive(⟨i, op⟩) ∧ dependsOn*(⟨p′, op′⟩, ⟨i, op⟩)

Note that we say that ⟨p′, op′⟩ may be sensitive. This is because an operation that consumes sensitive datasets may produce non-sensitive datasets. For example, op5 in Fig. 1 generates non-sensitive information although its inputs depend on the sensitive inputs of the workflow: the output of such an operation is a report that is free from information about individual patients.

5 The notions of contribution and influence are in line with the derivation and influence relationships defined by the W3C PROV recommendation [19].
6 Parameter dependencies correspond to what is referred to in the scientific workflow community as prospective provenance. This is because such dependencies can be inferred from the workflow specification, as opposed to other kinds of information, e.g., execution logs, which can only be obtained retrospectively, once the workflow execution terminates.
Inferring anonymity degrees. In addition to assisting the designer in identifying sensitive intermediate and final output parameters, we also infer the anonymity degree that should be applied to the dataset instances of those sensitive parameters. To illustrate this, consider that ⟨p′, op′⟩ is a sensitive intermediate or final output parameter. The anonymity degree of such a parameter given a workflow execution insWf can be defined as the maximum anonymity degree of the sensitive datasets that are used as input to the workflow and that contribute to the dataset instances of ⟨p′, op′⟩. Taking the maximum anonymity degree of the contributing inputs ensures that the anonymity degrees imposed on such inputs are honored by the dependent parameter in question. That is:

anonymity(⟨p′, op′⟩, insWf) = max({anonymity(⟨i, op⟩, insWf) s.t. isSensitive(⟨i, op⟩) ∧ dependsOn*(⟨p′, op′⟩, ⟨i, op⟩)})
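Combining the detection condition and the formula above, both steps reduce to a scan over the dependency closure. A sketch, again against the illustrative model, where the sensitivity and anonymity degrees of the workflow inputs are supplied by the data owners:

def infer_degrees(closure, input_degrees):
    # input_degrees maps each sensitive workflow input to its k.
    # A non-input parameter p may be sensitive if it depends on a
    # sensitive input; its degree is the max over those inputs.
    degrees = {}
    for (p, q) in closure:
        if q in input_degrees:
            degrees[p] = max(degrees.get(p, 0), input_degrees[q])
    return degrees

# In the running example, mapping op1's and op2's inputs to k = 2
# would yield k = 2 for the dependent (intermediate and final)
# parameters, matching the result reported in Section 5.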
Once the anonymity degrees are computed, the WfMS uses an anonymization algorithm proposed in the literature, such as Mondrian [16], before publishing the datasets used and generated as a result of the workflow execution.

5 IMPLEMENTATION
Fig. 3 depicts the system architecture implementing our privacy-aware workflow approach. Not all the components reported in Fig. 2 have been implemented. Indeed, instead of reinventing the wheel, we make use of existing popular scientific workflow systems [6, 14, 28]. We have, therefore, focused on implementing the Anonymizer component, which consists of the following modules.

Figure 3: System's technical architecture

Workflow Loader. To ensure our system's interoperability with existing workflow systems, we decided to handle workflows specified in the Common Workflow Language (CWL7). CWL has recently gained momentum and is currently supported by major scientific workflow systems. The Workflow Loader module converts a CWL workflow into an equivalent JSON format, which is used internally by our system.

7 https://github.com/common-workflow-language/common-workflow-language
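Since CWL files are YAML documents, a loader along the following lines suffices to obtain the JSON form used internally. This is a minimal sketch assuming the PyYAML package; the actual module may rely on a dedicated CWL parser.

import json
import yaml  # PyYAML; CWL workflows are YAML documents

def load_cwl_as_json(path):
    # Parse the .cwl file and re-serialize it as JSON.
    with open(path) as f:
        doc = yaml.safe_load(f)
    return json.dumps(doc, indent=2)

# Example: print(load_cwl_as_json("workflow.cwl"))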
sensitive parameter that is not a workflow’s initial input. Indeed,
.cwl File the anonymity degree of the initial parameters of the workflow
Workflow as a whole is specified by the user. It takes as input the anonymity
Loader
DWf
degree of each input of the workflow that is known to be sensi-
designer tive, the list of parameter dependencies that are produced by the
.JSON File
Workflow Dependency Extractor, and the list of workflow param-
Workflow Dependency eters that are identified as sensitive by the Sensitive Parameter
Extractor Detector. It then produces the anonymity degree of each sensi-
tive parameter of the workflow (other than the initial workflow
inputs). Let us consider the nutrition and oncology departments
.JSON File that state that their data should be 2-anonymized before pub-
Sensitive Parameter Anonymity Degree
lication. By using the anonymity − degree − calculator, we
k-Anonymizer
Detector Calculator establish that the anonymity degree op1,2,3 ’s outputs should be
Annotated equal to 2.
sensitive I/O
k-Anonymizer. Once the anonymity degrees of the parame-
ters are produced, the k-Anonymizer is enabled to anonymize the
Sensitive dataset instances of these parameters during a workflow execu-
I/O
tion. The anonymization operation is out of the scope of this pa-
per. Instead, existing k-anonymization algorithms (e.g., ARX [20],
Figure 3: System’s technical architecture
an open source data anonymization tool) can be used. For in-
stance, Tables 4, 5, and 6 show the data obtained by anonymizing
Workflow Loader. To ensure our system interoperability the data of Tables 1, 2, and 3, respectively, with the anonymity
with existing workflow systems, we decided on handling the degree k = 2.
workflows specified in the Common Workflow Language (CWL7 ).
CWL has recently gained momentum and is currently supported
by major scientific workflow systems. The Workflow Loader mod-
ule converts a CWL workflow into an equivalent JSON format,
which is used internally by our system.
7 https://github.com/common-workflow-language/common-workflow-language
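As an indication of what the k-Anonymizer delegates to tools such as ARX, the sketch below applies a naive fixed-width generalization (not the Mondrian or ARX algorithms) that suppresses the identifying columns and coarsens Age into ranges like those of Table 5. It assumes pandas and the column names of Table 2.

import pandas as pd

def generalize(df, width=10):
    # Suppress direct identifiers and replace Age by a
    # fixed-width interval.
    out = df.copy()
    out["Patient"] = "*"
    out["ID"] = "*"
    lo = (out["Age"] // width) * width
    out["Age"] = lo.astype(str) + "-" + (lo + width).astype(str)
    return out

oncology = pd.DataFrame({
    "Patient": ["John", "Ahmed"], "ID": [1, 2],
    "Type of Cancer": ["Melanoma", "Lung cancer"], "Age": [25, 28],
})
print(generalize(oncology))  # both rows fall into the 20-30 range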
Table 4: Anonymized nutrition information of patients with k = 2

Patient | ID | Fruits & Veg | Dairy | Meat | Dessert
* | * | 80g ≤ Fruits ≤ 100g | 20cl ≤ Dairy < 40cl | 100g ≤ Meat ≤ 200g | 100g < Dessert ≤ 200g
* | * | 80g ≤ Fruits ≤ 100g | 20cl ≤ Dairy < 40cl | 100g ≤ Meat ≤ 200g | 100g < Dessert ≤ 200g
* | * | 0g ≤ Fruits ≤ 50g | 40cl < Dairy ≤ 50cl | 200g < Meat ≤ 400g | 200g < Dessert ≤ 300g
* | * | 0g ≤ Fruits ≤ 50g | 40cl < Dairy ≤ 50cl | 200g < Meat ≤ 400g | 200g < Dessert ≤ 300g
* | * | 200g ≤ Fruits ≤ 300g | 0cl ≤ Dairy < 20cl | 0g < Meat ≤ 50g | 0g < Dessert ≤ 100g
* | * | 200g ≤ Fruits ≤ 300g | 0cl ≤ Dairy < 20cl | 0g < Meat ≤ 50g | 0g < Dessert ≤ 100g
Table 5: Anonymized oncology data of patients with k = 2

Patient | ID | Type of Cancer | Age
* | * | Melanoma | 20 ≤ Age ≤ 30
* | * | Lung cancer | 20 ≤ Age ≤ 30
* | * | Lymphoma | 30 < Age ≤ 40
* | * | Breast cancer | 30 < Age ≤ 40
* | * | Cervical cancer | 60 ≤ Age ≤ 70
* | * | Ovarian cancer | 60 ≤ Age ≤ 70
Table 6: Combined nutrition and oncology information of patients anonymized with k = 2

Patient | ID | Age | Type of Cancer | Fruits & Veg | Dairy | Meat | Dessert
* | * | 20 ≤ Age ≤ 30 | Melanoma | 80g ≤ Fruits ≤ 100g | 20cl ≤ Dairy < 40cl | 100g ≤ Meat ≤ 200g | 100g < Dessert ≤ 200g
* | * | 20 ≤ Age ≤ 30 | Lung cancer | 80g ≤ Fruits ≤ 100g | 20cl ≤ Dairy < 40cl | 100g ≤ Meat ≤ 200g | 100g < Dessert ≤ 200g
* | * | 30 < Age ≤ 40 | Lymphoma | 0g ≤ Fruits ≤ 50g | 40cl < Dairy ≤ 50cl | 200g < Meat ≤ 400g | 200g < Dessert ≤ 300g
* | * | 30 < Age ≤ 40 | Breast cancer | 0g ≤ Fruits ≤ 50g | 40cl < Dairy ≤ 50cl | 200g < Meat ≤ 400g | 200g < Dessert ≤ 300g
* | * | 60 ≤ Age ≤ 70 | Cervical cancer | 200g ≤ Fruits ≤ 300g | 0cl ≤ Dairy < 20cl | 0g < Meat ≤ 50g | 0g < Dessert ≤ 100g
* | * | 60 ≤ Age ≤ 70 | Ovarian cancer | 200g ≤ Fruits ≤ 300g | 0cl ≤ Dairy < 20cl | 0g < Meat ≤ 50g | 0g < Dessert ≤ 100g
6 VALIDATION
For validation purposes, different experiments were carried out on the system described in Section 5. 20 different CWL workflows8 (500 executions per workflow) were used, so that parameters like loading times, the time to identify parameter dependencies and sensitive parameters, and the time to compute anonymity degrees could be assessed. The number of operations, sensitive inputs, and anonymity degrees highlight the differences between these workflows.

For each workflow, we computed the minimum, maximum, and average overhead due to workflow loading, parameter dependency extraction, sensitive parameter identification, and anonymity degree computation, across the 10K executions. On the one hand, Fig. 4 reports on workflow loading. The minimum time is nearly 0ms in most cases, which can hardly be seen on the chart. The average time is almost the same for all workflows, i.e., approximately 0.1ms. The maximum time varies between 1ms and 3ms, which are small numbers. On the other hand, Fig. 5 reports on parameter dependency extraction. The minimum and average times can hardly be seen on the chart; in fact, the extraction of dependencies is instantaneous in most cases. The maximum time is less than 0.2ms for most workflows. However, 3 outliers were identified, Workflows 2, 13, and 20, which take almost 15ms in the worst case. This can be explained by the fact that dependency extraction is influenced by the number of input and output parameters a workflow has. The examination of Workflows 2, 13, and 20 revealed that they have a larger number of outputs compared with the rest of the workflows.

Regarding the overhead due to sensitive parameter detection and anonymity degree calculation, it is almost instantaneous for all workflows, and therefore there was no need to show charts for them (also due to limited space). In summary, the results of the experiments we ran are encouraging and show that the overhead due to the solution can barely be noticed.

Figure 4: Overhead due to workflow loading
8 view.commonwl.org/workflows
Figure 5: Overhead due to parameter dependency extraction

7 RELATED WORK
Privacy concerns in the context of workflows have been examined by a number of proposals. We present these proposals in this section and conclude by discussing how our work advances the state of the art.

In [13], Gil et al. address the issue of data privacy in the context of DWfs. To this end, they propose an ontology that preserves this privacy along with enforcing access control over data with respect to a given set of access permissions. The ontology specifies eligible privacy-preserving policies (e.g., generalization and anonymization) per DWf input/output parameter. To support privacy policy enforcement in DWfs, a framework was developed to represent policies as a set of elements that include the applicable context, data usage requirements, privacy protection requirements, and corrective actions if the policy is violated.

In [7], Chebbi and Tata propose a workflow reduction-based abstraction approach for workflow advertisement purposes. The approach reduces a workflow's inter-visibility using 13 rules that depend on the dependencies between operations in the workflows along with the operation types (i.e., internal versus external).

In [24], Teepe et al. analyze a business workflow specification to determine the properties that would achieve privacy protection of a company's partners and customers. To this end, they represent workflows as Color-X diagrams and then translate them into Prolog so that privacy-relevant properties over data can be analyzed, e.g., the need-to-know principle. This analysis inspects the messages sent by all employees involved in the business workflow to detect "gossipy" employees, i.e., those who exchange more information than they are asked for.

In [21], Sharif et al. introduce MPHC (Multiterminal Cut for Privacy in Hybrid Clouds), a framework to minimize the cost of executing workflows while satisfying both task/data privacy and deadline/budget constraints. In [22], Sharif et al. extend MPHC with Bell-LaPadula rules so that all data and tasks are deployed over hybrid cloud instances with greater or equal privacy levels.

In [2], Alhaqbani et al. propose a privacy-enforcement approach for business workflows based on 4 requirements: (i) capture the subject's (i.e., data owner's) privacy policy during the workflow specification, on top of the privacy policies defined by the workflow administrator, (ii) define data properties (i.e., hide and generalize) linked to private data so that these properties influence the workflow engine to protect data as per the subject's privacy policy, (iii) allocate work while preserving privacy, i.e., assign a task referring to some manipulation of data to the employee who has the lowest restriction level according to the subject's privacy policy, and (iv) keep the subject informed about any attempt to access his/her data.

In [5], Barth et al. present a privacy-policy violation detection approach based on execution logs of business processes. The aim is to identify a set of employees potentially responsible for a privacy breach. The authors introduce two types of compliance: strong and weak. An action is strongly compliant with a privacy policy given a trace if there exists an extension of the trace that contains the action and satisfies the policy. An action is weakly compliant with a policy given a trace if the trace augmented with the action satisfies the present requirements of the privacy policy.

In [8], Davidson et al. discuss the privacy-preserving management of provenance-aware workflow systems. The authors first formalize the privacy concerns: (i) data privacy, which requires that the outputs of the workflow's modules (aka operations) not be revealed to users without an access privilege, (ii) module privacy, which requires that the functionality of a module not be revealed, and (iii) structural privacy, which refers to hiding the data flow's structure in a given execution.

The aforementioned proposals can be classified into two categories: those that preserve the privacy of the tasks (operations) of workflows, as exemplified by the works of Barth et al. [5] and Davidson et al. [8], and those that preserve the privacy of the data that workflows manipulate at run-time, as exemplified by the works of Gil et al. [13], Teepe et al. [24], and Alhaqbani et al. [2]. The work of Sharif et al. [21], by contrast, addresses the privacy of both tasks and data. Our work is concerned with the privacy of workflow data and is hence in line with the second category of proposals. In that category, however, achieving privacy requires that the workflow designer manually identify the sensitive workflow parameters and set the degree to which the datasets bound to those parameters need to be anonymized. We have taken care of both aspects in our work.

8 CONCLUSION
We presented an approach for preserving privacy in the context of scientific workflows that heavily rely on large datasets. We have shown how data dependencies play a role in (i) identifying sensitive operation parameters in the workflow and (ii) deriving the anonymity degree that needs to be enforced when publishing the dataset instances of these parameters. To the best of our knowledge, this is the first work that looks into items (i) and (ii). We have also implemented a system that showcases our solution and conducted some experiments to assess its efficiency. This work opens up opportunities for more research in the field of anonymization of workflow data. In this respect, our ongoing work includes investigating the applicability of our solution to anonymization techniques other than k-anonymity, e.g., l-diversity and t-closeness [1].

REFERENCES
[1] [n. d.]. A critique of k-anonymity and some of its enhancements.
[2] B. Alhaqbani, M. Adams, C. J. Fidge, and A. H. M. ter Hofstede. 2013. Privacy-Aware Workflow Management. Springer, Dortmund, Germany, 111–128.
[3] E. Alpaydin. 2014. Introduction to Machine Learning (2nd ed.). The MIT Press, Cambridge, Massachusetts, USA.
[4] G. Antoniou, M. Baldoni, P. A. Bonatti, W. Nejdl, and D. Olmedilla. 2007. In Secure Data Management in Decentralized Systems. Springer, 169–216.
[5] A. Barth, J. C. Mitchell, A. Datta, and S. Sundaram. 2007. Privacy and Utility in Business Processes. In Computer Security Foundations Symposium – CSF, 6-8 July. IEEE, Venice, Italy, 279–294.
[6] S. P. Callahan, J. Freire, E. Santos, et al. 2006. Vistrails: Visualization meets
data management. In SIGMOD. ACM Press, Chicago, IL, USA, 745–747.
[7] I. Chebbi and S. Tata. 2007. Workflow Abstraction for Privacy Preservation.
In International Conference on Web Information Systems Engineering – WISE,
December 3. Springer Link, Nancy, France, 166–177.
[8] S. B. Davidson, S. Khanna, V. Tannen, S. Roy, Y. Chen, T. Milo, and J. Stoy-
anovich. 2011. Enabling Privacy in Provenance-Aware Workflow Systems. In
Biennial Conference on Innovative Data Systems Research, January 9-12. CIDR
Conference, Asilomar, CA, USA, 215–218.
[9] E. Deelman, D. Gannon, M. Shields, and I. Taylor. 2009. Workflows and e-
Science: An Overview of Workflow System Features and Capabilities. Future
Generation Computer Systems 25, 5 (2009), 528–540.
[10] R. B. Dolby, G. Harvey, N. P. Jenkins, and R. Raviraj. 2000. Data suppression
and regeneration. (2000). US Patent 6,038,231.
[11] Cynthia Dwork. 2006. Differential Privacy. In Automata, Languages and
Programming, 33rd International Colloquium, ICALP 2006, Venice, Italy, July 10-
14, 2006, Proceedings, Part II. Springer, 1–12. https://doi.org/10.1007/11787006_1
[12] A. Friedman, R. Wolff, and A. Schuster. 2008. Providing k-anonymity in data
mining. The VLDB Journal 17, 4 (2008), 789–804.
[13] Y. Gil, W.K. Cheung, V. Ratnakar, and K-K. Chan. 2007. Privacy Enforcement
in Data Analysis Workflows. In AAAI Workshop on Privacy Enforcement and
Accountability with Semantics (PEAS). AAAI, Busan, Korea, 41–48.
[14] Y. Gil, V. Ratnakar, J. Kim, et al. 2011. Wings: Intelligent Workflow-Based
Design of Computational Experiments. Intelligent Systems 26, 1 (2011), 62–72.
[15] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar. 2003. On the privacy
preserving properties of random data perturbation techniques. In International
Conference on Data Mining – ICDM’03. IEEE, Melbourne, Florida, USA, 99–106.
[16] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. 2006. Mondrian Multidimen-
sional K-Anonymity. In International Conference on Data Engineering, ICDE
2006, 3-8 April. IEEE, Atlanta, GA, USA, 25.
[17] J. Liu, E. Pacitti, P. Valduriez, and M. Mattoso. 2015. A Survey of Data-Intensive
Scientific Workflow Management. J. Grid Comput. 13, 4 (2015), 457–493.
[18] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam. 2007.
l-diversity: Privacy beyond k-anonymity. Transactions on Knowledge Discovery
from Data (TKDD) 1, 1 (2007), 3.
[19] P. Missier, K. Belhajjame, and J. Cheney. 2013. The W3C PROV family of
specifications for modelling provenance metadata. In Joint 2013 EDBT/ICDT
Conferences, EDBT ’13 Proceedings, Genoa, Italy, March 18-22, 2013. ACM press,
773–776.
[20] F. Prasser, F. Kohlmayer, R. Lautenschläger, and K. Kuhn. 2014. ARX - A
Comprehensive Tool for Anonymizing Biomedical Data. In American Medical
Informatics Association Annual Symposium. AMIA.
[21] S. Sharif, J. Taheri, A. Y. Zomaya, and S. Nepal. 2013. MPHC: Preserving
Privacy for Workflow Execution in Hybrid Clouds. In International Conference
on Parallel and Distributed Computing, Applications and Technologies, PDCAT,
December 16-18. Sponsored by IEEE, Taipei, Taiwan, 272–280.
[22] S. Sharif, P. Watson, J. Taheri, S. Nepal, and A. Y. Zomaya. 2017. Privacy-Aware
Scheduling SaaS in High Performance Computing Environments. IEEE Trans.
Parallel Distrib. Syst. 28, 4 (2017), 1176–1188.
[23] L. Sweeney. 2002. k-anonymity: A model for protecting privacy. International
Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10, 05 (2002),
557–570.
[24] W. Teepe, R.P. van de Riet, and M.S. Olivier. 2003. WorkFlow Analyzed for
Security and Privacy in using Databases. Journal of Computer Security 11, 3
(2003), 271–282.
[25] M. Terrovitis, N. Mamoulis, and P. Kalnis. 2008. Privacy-preserving anonymiza-
tion of set-valued data. VLDB Endowment 1, 1 (2008), 115–125.
[26] B. Wang, B. Li, and H. Li. 2012. Oruta: Privacy-Preserving Public Auditing for
Shared Data in the Cloud. In Proceedings of the 2012 IEEE Fifth International
Conference on Cloud Computing. IEEE Computer Society, 295–302.
[27] K. Wang, P. S. Yu, and S. Chakraborty. 2004. Bottom-up generalization: A data
mining solution to privacy protection. In International Conference on Data
Mining – ICDM’04. IEEE, Brighton, UK, 249–256.
[28] K. Wolstencroft, R. Haines, D. Fellows, et al. 2013. The Taverna workflow
suite: designing and executing workflows of Web Services on the desktop,
web or in the cloud. Nucleic Acids Research (2013), W557–W561.
[29] S. Guadie Worku, C. Xu, J. Zhao, and X. He. 2014. Secure and Efficient Privacy-
preserving Public Auditing Scheme for Cloud Storage. Comput. Electr. Eng. 40,
5 (2014), 1703–1713.
[30] X. Xiao and Y. Tao. 2006. Anatomy: Simple and effective privacy preservation.
In Proceedings of the 32nd international conference on Very large data bases.
VLDB Endowment, 139–150.
[31] M. Yiu, G. Ghinita, C. Jensen, and P. Kalnis. 2009. Outsourcing Search Services
on Private Spatial Data. In Proceedings of the 25th International Conference on
Data Engineering, ICDE 2009, March 29 2009 - April 2 2009, Shanghai, China.
IEEE, 1140–1143.