“Dead or Alive, we can deny it”. A Differentially Private Approach to Survival Analysis

Francesco Luigi De Faveri¹,*, Guglielmo Faggioli¹, Nicola Ferro¹ and Riccardo Spizzo²

¹ Department of Information Engineering, University of Padua, Padua, Italy
² National Cancer Center CRO Aviano, Aviano, Italy


                                              Abstract
                                              Survival Analyses (SAs), a key statistical tool used to predict event occurrence over time, often involve
                                              sensitive information, necessitating robust privacy safeguards. This work demonstrates how the Revised
                                              Randomized Response (RRR) can be adapted to ensure Differential Privacy (DP) while performing SAs.
                                              This methodology seeks to safeguard the privacy of individuals’ data without significantly changing the
                                              utility, represented by the statistical properties of the survival rates computed. Our findings show that
                                              integrating DP through RRR into SAs is both practical and effective, providing a significant step forward
                                              in the privacy-preserving analysis of sensitive time-to-event data. This study contributes to the field by
                                              offering a new comparison method to the current state-of-the-art used for SAs in medical research.

                                              Keywords
                                              Differential Privacy, Privacy-Preserving Mechanisms, Survival Analysis, Information Security


                                1. Introduction
                                Large amounts of data have been instrumental in medical research, leading to significant
                                advancements and important scientific discoveries. Thanks to the availability of big data
                                in healthcare [1, 2], researchers have improved their understanding of certain diseases and
                                developed powerful prognostic models. A general method in medical data analysis involves
                                clustering patients based on similar characteristics, such as diseases or treatments, to obtain
                                insights for research purposes. Once the clusters are created, a critical aspect that researchers
                                consider is understanding the survival probability of new patients belonging to a population.
                                Researchers commonly resort to statistical procedures called Survival Analyses (SAs) to achieve
                                this objective. This technique helps study the probability of survival for patients belonging to a
                                specific population based on personal and sensitive data collected during clinical trials [3, 4, 5].
   However, with the increasing reliance on data, particularly data containing personally
identifiable information, significant privacy concerns arise [6, 7, 8]. If sensitive information
leaks from survival research, a malicious employer who learns about an employee's medical
condition may, for instance, decide to terminate the employment before covering the medical
expenses incurred due to that condition. To mitigate such risks, the gold-standard definition
of privacy [9] has been introduced in the medical research domain. Differential

SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23–26, 2024, Villasimius, Sardinia, Italy
* Corresponding author.
Email: francescoluigi.defaveri@phd.unipd.it (F. L. De Faveri); faggioli@dei.unipd.it (G. Faggioli)
ORCID: 0009-0005-8968-9485 (F. L. De Faveri); 0000-0002-5070-2049 (G. Faggioli); 0000-0001-9219-6239 (N. Ferro); 0000-0001-7772-0960 (R. Spizzo)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




Privacy (DP) provides the patient with “Plausible Deniability”, a condition under which an
individual can deny their participation in a specific research study in a manner that an adversary
cannot disprove with certainty. DP has been applied in medical research over the last
decade [10, 11, 12], yet without specifically considering the task of SAs.
   In this study, we show how to use a revised version of the randomized response, i.e., the
coin-toss mechanism, introduced by Warner [13] and modified by Greenberg et al. [14], in order
to protect patients’ sensitive categories when performing SAs for medical research purposes.
We provide a formal proof of the 𝜀-DP property and report a comparison with a similar
privatization mechanism used in the literature for the same goal, showing that, with an
appropriate privacy budget (𝜀 = 3), the results maintain important characteristics of the
original ones. The main contribution of this paper is the application of a DP mechanism to
SAs, exposing the trade-off between utility and privacy when performing SAs in a differentially
private manner using the coin-toss mechanism to protect patients' privacy.
   In Section 2, we describe related work on providing privacy in healthcare research,
focusing on privacy in SAs. Section 3 explains the theoretical background of the work, and
Section 4 illustrates the mechanism used to address the problem of privacy in the SA function
computation. Finally, in Section 5, we report our findings and fully discuss the results obtained.


2. Related Work
Privacy-preserving technology literature has offered different techniques to adopt DP in medical
research studies, such as generative artificial intelligence and statistical models [15, 16].
   Solutions for providing privacy using artificial intelligence methods include the implementa-
tion of Generative Adversarial Networks (GANs) [17, 18, 19], trained using DP, to create new
synthetic data for research purposes rather than operating with the real data. However, GANs
have limitations in terms of scalability and efficiency, which include the need for a significant
amount of training data and higher demands on power consumption as reported in [20, 21].
   On the other hand, the application of statistical models for survival analysis was first explored
by Nguyên and Hui [22] and Yu et al. [23], who investigated the impact of DP when studying
the effect of explanatory variables in discrete-time SAs. In [22], the authors proposed an
extension of the DP Output Perturbation method [24], originally proposed for the empirical risk
minimization problem. However, such an approach is limited to discrete-time SAs and is not
applicable in this study. Moreover, Yu et al. [23] propose to project patients’ data to a
lower-dimensional space such that the projection preserves good characteristics of the original data.
Nevertheless, the method does not rely on any formal definition of privacy.
   To the best of our knowledge, the literature does not offer a model that performs SAs
while privatizing the categories into which patients are grouped. A similar work has been
proposed by Gondara and Wang [25], who introduced the Laplacian Noise Time Event (LNTE)
mechanism to obfuscate the time-to-event data used by researchers when investigating the
survival rates of a patients’ dataset. The LNTE mechanism modifies traditional survival analysis
by adding Laplace noise, calibrated to a specified privacy budget 𝜀, to both the subjects
at risk 𝑟𝑗 and the events 𝑑𝑗 within a dataset 𝐷. This process generates a perturbed matrix 𝑀′,
ensuring DP. The algorithm then iteratively adjusts these counts across time points to maintain
updated risk and event information, culminating in a differentially private estimation of the
survival probability. However, the mechanism has some limitations, especially when the dataset
size is small, leading to truncation and biased estimates of the survival rates.
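   A minimal sketch of the idea behind LNTE, under our simplifying assumptions (unit sensitivity for the counts and simple clipping); it is meant only to illustrate the mechanism described above, not to reproduce the authors' exact algorithm:

```python
import numpy as np

def lnte_km_sketch(n_at_risk, n_events, epsilon, rng=None):
    """Perturb the per-interval at-risk counts r_j and event counts d_j with
    Laplace noise scaled to the privacy budget, then recompute the
    Kaplan-Meier product (a rough sketch of the LNTE idea)."""
    rng = np.random.default_rng() if rng is None else rng
    scale = 1.0 / epsilon                     # assumes unit sensitivity
    r = np.asarray(n_at_risk, float) + rng.laplace(0.0, scale, len(n_at_risk))
    d = np.asarray(n_events, float) + rng.laplace(0.0, scale, len(n_events))
    # Clip so the ratio stays meaningful; with small groups this clipping
    # (truncation) is exactly the source of the bias discussed above.
    r = np.clip(r, 1.0, None)
    d = np.clip(d, 0.0, r)
    return np.cumprod((r - d) / r)            # noisy survival probabilities

# Example: five time points of a small group.
s_noisy = lnte_km_sketch([20, 17, 14, 9, 4], [2, 1, 3, 2, 1], epsilon=1.0)
```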


3. Background and Preliminaries
We split the background into two parts: first, we briefly describe SAs and their main use in
healthcare; then, we present the DP framework, describing the properties used in our methodology.

3.1. Survival Analysis
Survival Analyses (SAs) are statistical techniques commonly used to analyze time-to-event or
time-to-failure data when the event of interest has not yet occurred [26, 27]. Such functions
investigate the time until a dichotomous event occurs and are used in different fields
of research [28, 29]. In SAs, given a certain population of interest, researchers observe the
expression of some event at a defined time 𝑡 and want to compute the probability trend of
that event, e.g., the occurrence of a certain symptom, the need for a certain treatment, or
death, for the rest of the population at time 𝑡′ > 𝑡. Focusing on the medical research we are
presenting, the event of interest is often defined as the death of a patient belonging to a specific
population. An intuitive definition of the survival function is that it provides the probability
that the event of interest, i.e., the death of a patient, has not yet occurred by time 𝑡. On the other
hand, a censored event refers to a situation where the exact time of the event (such as death,
relapse, or recovery) is unknown for an individual within the study. Censoring occurs when
the observation period ends before the event occurs or when the individual is lost to follow-up.
   Specifically for the medical research field, the Kaplan-Meier (KM) estimator [30] is the
most common method used to compute a patient’s probability of survival. The KM non-
parametric method is a statistical model that evaluates the survival trend of patients sharing
a common characteristic, relating the probability of survival to the observation time in the
analyzed population. Equation 1 outlines the computation of the KM estimator. The method
factors in the number of events 𝑑𝑖 and the number of patients 𝑛𝑖 still at risk, i.e., who have
neither died nor been censored, at each time 𝑡𝑖 ≤ 𝑡.
$$\hat{S}(t) = \prod_{i:\, t_i \le t} \left( \frac{n_i - d_i}{n_i} \right) \qquad (1)$$
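   To make Equation 1 concrete, the following is a minimal NumPy sketch of the KM estimator over (time, event) records; it is a didactic illustration we add here, not the code released with the paper:

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier estimate: `times` are the observed durations and
    `events` are 1 for an observed death, 0 for a censored record."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(times[events == 1])     # distinct t_i with d_i > 0
    survival, s = [], 1.0
    for t in event_times:
        n_i = np.sum(times >= t)                    # patients still at risk at t_i
        d_i = np.sum((times == t) & (events == 1))  # deaths observed at t_i
        s *= (n_i - d_i) / n_i                      # product term of Equation 1
        survival.append(s)
    return event_times, np.array(survival)

# Example: six patients, two of them censored (event = 0).
t_i, s_hat = kaplan_meier([3, 5, 5, 8, 10, 12], [1, 1, 0, 1, 0, 1])
```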


3.2. Differential Privacy
The gold-standard definition of privacy widely accepted in the information security research
community is provided by the notion of Differential Privacy, introduced by Dwork et al. [9]. A DP
mechanism is designed to ensure sensitive data privacy while preserving its utility. In essence,
DP adds an appropriately calibrated level of noise during computation, governed by a so-called
privacy budget 𝜀 that determines the balance between data privacy and utility. The DP definition
is built upon the concept of neighboring datasets, i.e., datasets that differ in at most one record.
Formally, the definition of 𝜀-DP [9] states that a randomized mechanism ℳ, i.e., a mechanism
that takes an input and produces a noisy output, is 𝜀-DP if, for any pair of neighboring datasets
𝐷 and 𝐷′ and a privacy budget 𝜀 ∈ ℝ⁺, it holds:

$$\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S] \qquad \forall S \subseteq \mathrm{Im}(\mathcal{M})$$

  If a randomized mechanism satisfies 𝜀-DP, then it ensures that the probability of observing
any output is almost equal for any neighboring datasets. When facing two similar yet distinct
inputs, we expect the output to be the same within a specific probability range, regulated by
the privacy budget 𝜀 provided. Hence, the mechanism protects users’ privacy by ensuring that
there is uncertainty about the original data, even if the output is identical for two different
inputs. As the definition suggests, the lower the value of 𝜀, the higher the resulting privacy
level: if 𝜀 = 0, then Pr [ℳ(𝐷) ∈ 𝑆] = Pr [ℳ(𝐷′ ) ∈ 𝑆] for all 𝑆 ⊆ Im(ℳ), i.e., the
output of the mechanism does not depend on the input.
  An important notion when employing the definition of DP for a randomized mechanism
ℳ is the characterization of its Privacy Loss (PL). Considering two neighboring input
datasets 𝐷, 𝐷′, the PL of the mechanism is defined as the logarithm of the ratio between
the probabilities of observing the same output 𝑂 for each input:

$$\mathcal{L}_{\mathcal{M}(D)\|\mathcal{M}(D')}(O) = \log\left(\frac{\Pr[\mathcal{M}(D) = O]}{\Pr[\mathcal{M}(D') = O]}\right) \qquad (2)$$
  An important and helpful property for our use-case that links the 𝜀-DP property of a mecha-
nism ℳ with its measure of PL is provided by Dwork and Roth [31]. Formally, stating that a
mechanism ℳ adheres to 𝜀-DP is equivalent to asserting that the absolute value of PL of the
mechanism is upper bounded by 𝜀 with probability 1.
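  As a toy illustration of this link (added here for exposition, not taken from the paper), a binary randomized response that reports the true bit with probability p has privacy loss log(p/(1−p)) on single-record neighboring inputs, which matches its budget 𝜀:

```python
import math

def privacy_loss(pr_output_given_D, pr_output_given_Dprime):
    """Privacy loss of Equation 2 for one fixed output O."""
    return math.log(pr_output_given_D / pr_output_given_Dprime)

p = 0.75                                  # probability of reporting the true bit
epsilon = math.log(p / (1 - p))           # budget implied by the bias p

# Neighboring inputs: the single record is 1 in D and 0 in D'; output O = 1.
loss = privacy_loss(p, 1 - p)
assert abs(loss) <= epsilon + 1e-12       # |PL| is upper bounded by epsilon
```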


4. Methodology
This section presents the Revised Randomized Response (RRR) method and proves its 𝜀-DP property.

4.1. Revised Randomized Response (RRR)
The Randomized Response mechanism was introduced by Warner [13] as a masking technique
for protecting the confidentiality of people's survey responses. The Randomized Response
mechanism, Figure 1(a), also known as the Direct Encoding mechanism [32], has been used
to gather data from survey participants without revealing their real answers. Individuals
are asked to respond truthfully or falsely about a property based on a certain condition. The
mechanism involves a series of steps: the individual first flips a fair coin (𝑝 = 0.5). If the
result is “heads”, they respond truthfully. If not, they flip a second fair coin, and if the
result of the second coin flip is “heads”, the answer is “yes”; otherwise, the answer is “no”.
   Figure 1(b) illustrates the revised version of the Randomized Response [14, 32]. Unlike the
original method, the RRR mechanism accepts inputs beyond the binary options of “Yes” or
“No”, extending to multiple categories. The algorithm works as follows: given a total of 𝑛 > 1
categories and a coin with probability 𝑝 ∈ (0, 1) of landing on “heads”, the algorithm takes the
input category 𝐶𝑖 and selects the output category through a coin-toss process. Similar to the
Randomized Response technique, if the coin lands on “heads”, the mechanism outputs the true
category 𝐶𝑖. Conversely, if the coin lands on “tails”, RRR uniformly chooses a category from all
available options, varying the Randomized Response mechanism and decreasing the probability
of leaking the real category.

[Figure 1: (a) Original Randomized Response mechanism; (b) RRR mechanism.]

Figure 1: Schematic comparison between the original Randomized Response with a fair coin (𝑝 = 0.5)
and its Revised version, defined by a coin with bias 𝑝, selecting a category 𝐶𝑖 among 𝑛 available ones.
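   A minimal sketch of the RRR mechanism described above (the function name and interface are our own, for illustration):

```python
import random

def rrr(true_category, categories, p):
    """Revised Randomized Response: with probability p return the true
    category; otherwise return one of the n categories chosen uniformly
    at random."""
    if random.random() < p:               # coin lands on "heads"
        return true_category
    return random.choice(categories)      # coin lands on "tails"

# Example: privatize a patient's risk group among n = 5 categories.
risk_groups = ["Very Low", "Low", "Intermediate", "High", "Very High"]
reported = rrr("High", risk_groups, p=0.8)
```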

4.2. Privacy Properties
Equation 2 measures the privacy loss of ℳ, and the aim is to prove that it is less than or equal
to 𝜀. As a first step, we evaluate the numerator of that equation. The probability of having the
real category as output after receiving category 𝐶𝑖 as initial input is:

$$\Pr[Resp = C_i \mid True = C_i] = p + (1-p)\frac{1}{n} = \frac{np + 1 - p}{n}.$$

On the other hand, the denominator, i.e., the probability that the mechanism does not return
the true category 𝐶𝑖 when 𝐶𝑖 is the input, is:

$$\Pr[Resp \ne C_i \mid True = C_i] = (1-p)\left(1 - \frac{1}{n}\right) = \frac{n - 1 - np + p}{n}.$$
   Therefore, by computing the logarithm of the ratio between the quantities computed above,
we formulate the privacy budget 𝜀 as a function of 𝑝 and 𝑛, obtaining:

$$\varepsilon(n, p) = \log\left(\frac{np + 1 - p}{n - 1 - np + p}\right)$$

   The function 𝜀(𝑛, 𝑝) is subject to specific restrictions due to the existence conditions of
the logarithmic function. Nonetheless, these limitations can be addressed without significant
difficulty. It is crucial to emphasize that the total number of categories must be a positive
integer and that the coin's probability parameter 𝑝 must fall within the real interval (0, 1).
Consequently, by fixing the number of available categories 𝑛, the condition in Equation 3 holds,
which concludes the proof of the DP property of the mechanism.

$$\varepsilon > 0 \iff p \in \left(\frac{n-2}{2n-2},\ 1\right) \qquad (3)$$
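   For instance, the following sketch evaluates 𝜀(𝑛, 𝑝) and inverts it to obtain the coin bias 𝑝 for a target budget; the closed-form inversion is our own rearrangement of the expression above:

```python
import math

def epsilon_of(n, p):
    """Privacy budget epsilon(n, p) of the RRR mechanism."""
    return math.log((n * p + 1 - p) / (n - 1 - n * p + p))

def bias_for(n, epsilon):
    """Coin bias p such that epsilon(n, p) equals the target epsilon."""
    a = math.exp(epsilon)
    return (a * (n - 1) - 1) / ((n - 1) * (1 + a))

n = 5                                     # e.g., five risk-group categories
p = bias_for(n, epsilon=3.0)              # bias needed for epsilon = 3
assert abs(epsilon_of(n, p) - 3.0) < 1e-9
assert p > (n - 2) / (2 * n - 2)          # lower bound required by Equation 3
```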

   In conclusion, we would like to make some important remarks. The condition in Equation 3
states that the privacy budget is regulated by how fair the coin is, i.e., by how strongly the coin
is parameterized to respond truthfully during the obfuscation process. We want to stress that,
for the same privacy guarantees set by the 𝜀 value, the RRR mechanism influences the
computation of the KM estimator less than the LNTE [25] does, even for small groups. For
instance, RRR mixes the observations of the events, so that, across different populations, the
number of people at risk remains roughly the same; thus, the survival curves will not differ too
much from one another. On the other hand, the LNTE provides noisy observations, as stated
in [25]: when the population reaches zero, the LNTE method needs to be truncated; otherwise,
the researcher must rely on a higher privacy budget 𝜀.


5. Results
We discuss the DP approach by simulating a medical research context where we apply the KM
estimator, as detailed in Equation 1, to perform SAs. We vary the privacy budget 𝜀 to evaluate
the privacy-utility trade-off in SAs and conduct a statistical analysis to confirm our conclusions.

5.1. Experimental Setup
The datasets¹ employed in our experiments represent the typical datasets used in SAs in the
medical domain. To ensure a comprehensive evaluation, we used two collections of survival
data, commonly used in the literature [25, 33] for similar studies, to estimate the SA functions in
different privacy settings. We utilized the dataset from Bernard et al. [34], referred to as the
IPSS-R dataset, to delineate the group populations under consideration for conducting different
SAs. Similarly, the dataset from McGilchrist and Aisbett [35], known as the Kidney dataset, was
employed. These datasets contain survival data about the patients who participated in the original
studies [34, 35], categorized into different groups based on their diseases or risks.

5.2. Survival Curves
Figure 2(a) shows the original KM curves for the IPSS-R dataset². Figures 2(b), 2(c), 2(d) and
Figures 2(e), 2(f), 2(g) represent the KM curves obtained using the RRR and LNTE [25] mecha-
nisms, respectively. As done by Gondara and Wang [25], we analyzed the scenarios using three
privacy budgets, i.e., 𝜀 = 1, 2, 3, to compare how the mechanisms behave in the KM estimation.
   Our objective was to show how to process sensitive data for computing SAs in a DP manner.
Upon applying the RRR and the LNTE mechanisms, we observe a gradual convergence of the
survival curves for different values of 𝜀. Specifically, for the lowest value of the privacy budget,
𝜀 = 1, the mechanisms alter the trend of the survival rates across time, making the distinction
between risk groups less pronounced and causing a convergence towards the “Intermediate”
risk population. On the other hand, as 𝜀 increases (𝜀 = 2, 3), the survival curves for the RRR
method approach those of the original IPSS-R data. However, even at higher 𝜀 values,
the survival probabilities do not completely align with the original scenario, indicating a residual
impact of the privacy-preserving mechanism. Conversely, the LNTE, for all 𝜀 values, completely
changes the survival trends, reducing the utility of the analysis.
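   A condensed sketch of the kind of pipeline behind these curves, assuming the lifelines package and toy data (the released code in the GitHub repository is the authoritative implementation):

```python
import numpy as np
from lifelines import KaplanMeierFitter   # assumed tooling, not mandated by the paper

rng = np.random.default_rng(42)

def rrr_column(labels, available, p):
    """Apply RRR independently to every patient's group label."""
    keep = rng.random(len(labels)) < p
    random_pick = rng.choice(available, size=len(labels))
    return np.where(keep, labels, random_pick)

# Toy records standing in for (group, time, event) survival data.
groups = np.array(["High", "Low", "High", "Low", "High", "Low"])
times  = np.array([12.0, 60.0, 20.0, 55.0, 8.0, 70.0])
events = np.array([1, 1, 1, 0, 1, 1])

p = 0.94  # coin bias; in practice derived from the target epsilon via epsilon(n, p)
private_groups = rrr_column(groups, np.unique(groups), p)

# One KM curve per privatized group, as in Figure 2.
for g in np.unique(private_groups):
    mask = private_groups == g
    kmf = KaplanMeierFitter()
    kmf.fit(times[mask], event_observed=events[mask], label=str(g))
    print(g, kmf.median_survival_time_)
```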

1
  The datasets were obtained using the Open Source platform cBioPortal https://www.cbioportal.org/datasets and
  the dataset available on the R survival package https://cran.r-project.org/web/packages/survival/index.html.
2
  The Kidney Survival Curves, which show similar trends, are omitted due to space limitations but are released along
  with the code in the GitHub repository at https://github.com/Kekkodf/DP-SurvAnalysis.
[Figure 2 panels: (a) IPSS-R (original); (b) RRR, 𝜀 = 1; (c) RRR, 𝜀 = 2; (d) RRR, 𝜀 = 3; (e) LNTE, 𝜀 = 1; (f) LNTE, 𝜀 = 2; (g) LNTE, 𝜀 = 3. Each panel plots Survival Probability against Time (months); the legend lists the risk groups Very High, High, Intermediate, Low, and Very Low.]

Figure 2: KM Plots in different privacy parametrization settings using data from the IPSS-R dataset.

5.3. Statistical Results

Table 1
Pairwise Log-Rank test statistics using Original and RRR (𝜀 = 3) results from the Kidney dataset,
concerning the populations based on the kind of disease registered (AN, GN, PKD, Other).

                              Test Statistic          𝑝-value             − log2 (𝑝)
 Disease A   Disease B     Original    𝜀 = 3      Original    𝜀 = 3     Original    𝜀 = 3
 AN          GN                0.01     0.11          0.93     0.75         0.11     0.42
             Other             1.69     0.85          0.19     0.36         2.37     1.48
             PKD               1.09     0.79          0.30     0.37         1.75     1.42
 GN          Other             0.99     0.63          0.32     0.43         1.64     1.23
             PKD               0.60     0.60          0.44     0.44         1.19     1.19
 Other       PKD               0.26     0.39          0.61     0.53         0.71     0.91

   We conducted a Pairwise Log-Rank test on all populations in the Kidney dataset to observe
statistical differences in the results obtained. As Table 1 shows, applying the RRR mechanism
with 𝜀 = 3 yields findings comparable with the original ones.
Table 2
Median Survival times with 95% Confidence Intervals estimated in the privatized simulations. A
“-” indicates that it was not possible to compute the Median Survival time.
Dataset   Mech.   Category           𝜀=1                    𝜀=2                     𝜀=3              No DP

                  V-High      14.73 (9.90, 74.04)     12.07 (9.11, 21.27)    13.02 (10.13,22.88)     10.52
                  High        22.26 (15.68, 42.41)   19.66 (15.25, 34.52)    20.52 (16.64, 35.80)    17.29
          LNTE    Inter.        40.04 (29.10, -)       48.07 (31.59, -)        53.75 (35.80, -)      34.39
                  Low                  -                       -                      -              56.68
                  V-Low                -                       -                      -              86.10
IPSS-R
                  V-High      15.68 (12.82, 18.87)   14.27 (12.29, 16.27)    11.38 (10.13, 13.05)    10.52
                  High        22.32 (19.23, 25.84)   18.38 (16.67, 21.24)    17.65 (15.71, 20.02)    17.29
           RRR    Inter.      37.74 (34.19, 43.23)   33.34 (28.57, 38.20)    34.19 (29.10, 38.53)    34.39
                  Low         52.93 (48.98, 57.50)   54.87 (51.78, 64.47)    57.47 (52.93, 65.98)    56.68
                  V-Low       65.62 (55.53, 82.95)   80.94 (69.76, 100.76)   85.35 (69.76, 108.39)   86.10
                  AN             1.30 (1.13, -)          1.30 (1.00, -)          1.33 (1.00, -)      1.77
                  GN             1.00 (0.50, -)          4.33 (0.73, -)          5.13 (0.87, -)      1.00
          LNTE
                  PKD            18.73 (5.07, -)         5.07 (1.00, -)          5.07 (2.10, -)      2.60
                  Other          3.97 (1.80, -)          5.90 (2.10, -)          8.17 (3.80, -)      4.70
Kidney
                  AN            2.20 (1.27, 6.53)      1.43 (0.90, 3.20)       1.77 (0.90, 3.20)     1.77
                  GN            0.93 (0.40, 4.33)      1.27 (0.50, 5.20)       1.30 (0.50, 5.20)     1.00
           RRR
                  PKD           2.60 (0.50, 9.73)     5.07 (0.87, 17.03)       4.40 (1.00, 5.07)     2.60
                  Other         5.07 (0.80, 14.9)      3.97 (0.80, 9.73)       4.70 (0.93, 8.17)     4.70


   In addition, an important insight is provided by the 𝑝-values obtained: no 𝑝-value is altered
in such a way that the null hypothesis can be rejected, thus replicating the outcome of the
original scenario.
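   Tests like those in Table 1 can be reproduced with standard survival tooling; the following is a hedged sketch using the lifelines pairwise_logrank_test on toy data (our assumed tooling, not necessarily what the authors used):

```python
import pandas as pd
from lifelines.statistics import pairwise_logrank_test  # assumed tooling

# Toy frame standing in for the Kidney data: duration, event flag, disease group.
df = pd.DataFrame({
    "time":    [8, 16, 23, 13, 22, 28, 318, 12, 24, 30],
    "event":   [1, 1, 1, 0, 1, 1, 1, 1, 0, 1],
    "disease": ["AN", "AN", "GN", "GN", "PKD", "PKD",
                "Other", "Other", "AN", "GN"],
})

# Pairwise log-rank statistics and p-values, analogous to Table 1.
results = pairwise_logrank_test(df["time"], df["disease"], df["event"])
print(results.summary)
```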
   Moreover, we summarize the median survival times and their related confidence intervals in
Table 2. The analysis reveals that the Kidney dataset’s low cardinality has an impact on both
methods employed. However, the RRR method returns more accurate median times, which
closely resemble the real ones, even in cases with high privacy guarantees (𝜀 = 1). On the
other hand, the LNTE mechanism disrupts the survival rates of the subjects. As a result,
calculating median times becomes impractical, as the probabilities do not reach the 0.5 threshold.



6. Conclusion
We compared the RRR method with the state-of-the-art for conducting SAs in a DP manner.
Our findings suggest that the RRR method is more effective in balancing privacy and utility,
maintaining the distribution properties of real results, and ensuring that researchers can still
derive important insights from such SAs. Our work contributes to the evolution of privacy-
preserving methods in medical research and provides a new comparison for future investigations.
As future work, we plan to explore different methods for conducting SAs to gain insights into
the privacy-utility trade-offs for this task. Specifically, we intend to consider Linear and Cox
Regression to perform SAs and apply different DP mechanisms to protect patients’ privacy.
References
 [1] S. Bahri, N. Zoghlami, M. Abed, J. M. R. S. Tavares, Big data for healthcare: A survey, IEEE
     Access 7 (2019) 7397–7408. doi:10.1109/ACCESS.2018.2889180.
 [2] K. M. Batko, A. Slezak, The use of big data analytics in healthcare, J. Big Data 9 (2022) 3.
     URL: https://doi.org/10.1186/s40537-021-00553-4. doi:10.1186/S40537-021-00553-4.
 [3] C. P. Lim, A. Vaidya, Y.-W. Chen, V. Jain, L. C. Jain (Eds.), A Survival Analysis Guide in
     Oncology, Springer International Publishing, Cham, 2023. URL: https://doi.org/10.1007/
     978-3-031-11170-9_2. doi:10.1007/978-3-031-11170-9_2.
 [4] S. Kuo, M. Ventin, H. Sato, J. M. Harrison, Y. Okuda, M. Qadan, C. R. Ferrone, K. D. Lillemoe,
     C. F. del Castillo, Common hepatic artery lymph node metastasis in pancreatic ductal
     adenocarcinoma: An analysis of actual survival, Journal of Gastrointestinal Surgery (2024).
     doi:https://doi.org/10.1016/j.gassur.2024.02.018.
 [5] B. Gudjonsson, E. M. Livstone, H. M. Spiro, Cancer of the pancreas. diagnostic accuracy
     and survival statistics, Cancer 42 (1978) 2494–2506. doi:https://doi.org/10.1002/
     1097-0142(197811)42:5<2494::AID-CNCR2820420554>3.0.CO;2-R.
 [6] A. Anjum, S. ur Rehman Malik, K.-K. R. Choo, A. Khan, A. Haroon, S. Khan, S. U. Khan,
     N. Ahmad, B. Raza, An efficient privacy mechanism for electronic health records, Com-
     puters & Security 72 (2018) 196–211. URL: https://www.sciencedirect.com/science/article/
     pii/S0167404817302031. doi:https://doi.org/10.1016/j.cose.2017.09.014.
 [7] K. Abouelmehdi, A. Beni-Hssane, H. Khaloufi, M. Saadi, Big data security and pri-
     vacy in healthcare: A review, Procedia Computer Science 113 (2017) 73–80. URL:
     https://www.sciencedirect.com/science/article/pii/S1877050917317015. doi:https://doi.
     org/10.1016/j.procs.2017.08.292, the 8th International Conference on Emerging
     Ubiquitous Systems and Pervasive Networks (EUSPN 2017) / The 7th International Con-
     ference on Current and Future Trends of Information and Communication Technologies
     in Healthcare (ICTH-2017) / Affiliated Workshops.
 [8] A. Almalawi, A. I. Khan, F. Alsolami, Y. B. Abushark, A. S. Alfakeeh, Managing security of
     healthcare data for a modern healthcare system, Sensors (Basel) 23 (2023) 3612.
 [9] C. Dwork, F. McSherry, K. Nissim, A. D. Smith, Calibrating noise to sensitivity in private
     data analysis, in: S. Halevi, T. Rabin (Eds.), Theory of Cryptography, Third Theory of
     Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006, Proceedings,
     volume 3876 of Lecture Notes in Computer Science, Springer, 2006, pp. 265–284. URL:
     https://doi.org/10.1007/11681878_14. doi:10.1007/11681878\_14.
[10] W. Liu, Y. Zhang, H. Yang, Q. Meng, A survey on differential privacy for medical data
     analysis, Annals of Data Science (2023). URL: https://doi.org/10.1007/s40745-023-00475-3.
     doi:10.1007/s40745-023-00475-3.
[11] K. M. Chong, A. Malip, Bridging unlinkability and data utility: Privacy preserving data pub-
     lication schemes for healthcare informatics, Computer Communications 191 (2022) 194–207.
     URL: https://www.sciencedirect.com/science/article/pii/S014036642200144X. doi:https:
     //doi.org/10.1016/j.comcom.2022.04.032.
[12] J. Ficek, W. Wang, H. Chen, G. Dagne, E. Daley, Differential privacy in health research: A
     scoping review, J Am Med Inform Assoc 28 (2021) 2269–2276.
[13] S. L. Warner, Randomized response: A survey technique for eliminating evasive answer
     bias, Journal of the American Statistical Association 60 (1965) 63–69. URL: http://www.
     jstor.org/stable/2283137.
[14] B. G. Greenberg, A.-L. A. Abul-Ela, W. R. Simmons, D. G. Horvitz, The unrelated question
     randomized response model: Theoretical framework, Journal of the American Statistical
     Association 64 (1969) 520–539. URL: http://www.jstor.org/stable/2283636.
[15] T. Ha, T. K. Dang, T. T. Dang, T. A. Truong, M. T. Nguyen, Differential privacy in deep
     learning: An overview, in: 2019 International Conference on Advanced Computing and
     Applications (ACOMP), 2019, pp. 97–102. doi:10.1109/ACOMP.2019.00022.
[16] J. Wang, S. Liu, Y. Li, A review of differential privacy in individual data release, International
     Journal of Distributed Sensor Networks 11 (2015) 259682. URL: https://doi.org/10.1155/
     2015/259682. doi:10.1155/2015/259682.
[17] X. Zhang, S. Ji, T. Wang, Differentially private releasing via deep generative model, CoRR
     abs/1801.01594 (2018). URL: http://arxiv.org/abs/1801.01594. arXiv:1801.01594.
[18] H. Bae, D. Jung, H. Choi, S. Yoon, Anomigan: Generative adversarial networks for
     anonymizing private medical data, in: Pacific Symposium on Biocomputing 2020, Fairmont
     Orchid, Hawaii, USA, January 3-7, 2020, 2020, pp. 563–574. URL: https://psb.stanford.edu/
     psb-online/proceedings/psb20/Bae.pdf.
[19] B. K. Beaulieu-Jones, Z. S. Wu, C. Williams, R. Lee, S. P. Bhavnani, J. B. Byrd, C. S.
     Greene, Privacy-preserving generative deep neural networks support clinical data shar-
     ing, Circulation: Cardiovascular Quality and Outcomes 12 (2019) e005122. doi:10.1161/
     CIRCOUTCOMES.118.005122.
[20] A. Dash, J. Ye, G. Wang, A review of generative adversarial networks (gans) and its
     applications in a wide variety of disciplines: From medical to remote sensing, IEEE Access
     12 (2024) 18330–18357. doi:10.1109/ACCESS.2023.3346273.
[21] Z. Cai, Z. Xiong, H. Xu, P. Wang, W. Li, Y. Pan, Generative adversarial networks: A
     survey toward private and secure applications, ACM Comput. Surv. 54 (2021). URL:
     https://doi.org/10.1145/3459992. doi:10.1145/3459992.
[22] T. T. Nguyên, S. C. Hui, Differentially private regression for discrete-time survival anal-
     ysis, in: Proceedings of the 2017 ACM on Conference on Information and Knowledge
     Management, CIKM ’17, Association for Computing Machinery, New York, NY, USA,
     2017, p. 1199–1208. URL: https://doi.org/10.1145/3132847.3132928. doi:10.1145/3132847.
     3132928.
[23] S. Yu, G. Fung, R. Rosales, S. Krishnan, R. B. Rao, C. Dehing-Oberije, P. Lambin, Privacy-
     preserving cox regression for survival analysis, in: Proceedings of the 14th ACM SIGKDD
     International Conference on Knowledge Discovery and Data Mining, KDD ’08, Association
     for Computing Machinery, New York, NY, USA, 2008, p. 1034–1042. URL: https://doi.org/
     10.1145/1401890.1402013. doi:10.1145/1401890.1402013.
[24] K. Chaudhuri, C. Monteleoni, A. D. Sarwate, Differentially private empirical risk mini-
     mization, J. Mach. Learn. Res. 12 (2011) 1069–1109. URL: https://dl.acm.org/doi/10.5555/
     1953048.2021036. doi:10.5555/1953048.2021036.
[25] L. Gondara, K. Wang, Differentially private survival function estimation, in: F. Doshi-Velez,
     J. Fackler, K. Jung, D. C. Kale, R. Ranganath, B. C. Wallace, J. Wiens (Eds.), Proceedings of
     the Machine Learning for Healthcare Conference, MLHC 2020, 7-8 August 2020, Virtual
     Event, Durham, NC, USA, volume 126 of Proceedings of Machine Learning Research, PMLR,
     2020, pp. 271–291. URL: http://proceedings.mlr.press/v126/gondara20a.html.
[26] T. G. Clark, M. J. Bradburn, S. B. Love, D. G. Altman, Survival analysis part i: Basic
     concepts and first analyses, British Journal of Cancer 89 (2003) 232–238. URL: https:
     //doi.org/10.1038/sj.bjc.6601118. doi:10.1038/sj.bjc.6601118.
[27] L. L. Johnson, Chapter 26 - an introduction to survival analysis, in: J. I. Gallin, F. P.
     Ognibene, L. L. Johnson (Eds.), Principles and Practice of Clinical Research (Fourth
     Edition), fourth edition ed., Academic Press, Boston, 2018, pp. 373–381. URL: https:
     //www.sciencedirect.com/science/article/pii/B9780128499054000265. doi:https://doi.
     org/10.1016/B978-0-12-849905-4.00026-5.
[28] B. Bieszk-Stolorz, Application of the survival analysis methods in contemporary economics
     on the example of unemployment, in: K. Nermend, M. Łatuszyńska (Eds.), Experimental
     and Quantitative Methods in Contemporary Economics, Springer International Publishing,
     Cham, 2020, pp. 115–131.
[29] N. R. Latimer, Survival analysis for economic evaluations alongside clinical trials—
     extrapolation with patient-level data, Med. Decis. Making 33 (2013) 743–754.
[30] E. L. Kaplan, P. Meier, Nonparametric estimation from incomplete observations, Journal of
     the American Statistical Association 53 (1958) 457–481. URL: https://www.tandfonline.com/
     doi/abs/10.1080/01621459.1958.10501452. doi:10.1080/01621459.1958.10501452.
[31] C. Dwork, A. Roth, The algorithmic foundations of differential privacy, Foundations and
     Trends® in Theoretical Computer Science 9 (2014) 211–407. URL: http://dx.doi.org/10.1561/
     0400000042. doi:10.1561/0400000042.
[32] T. Wang, J. Blocki, N. Li, S. Jha, Locally differentially private protocols for frequency
     estimation, in: E. Kirda, T. Ristenpart (Eds.), 26th USENIX Security Symposium, USENIX
     Security 2017, Vancouver, BC, Canada, August 16-18, 2017, USENIX Association, 2017, pp.
     729–745. URL: https://www.usenix.org/conference/usenixsecurity17/technical-sessions/
     presentation/wang-tianhao.
[33] P.-C. Bürkner, brms: An R Package for Bayesian Multilevel Models Using Stan, Journal of
     Statistical Software 80 (2017) 1–28. URL: https://www.jstatsoft.org/index.php/jss/article/
     view/v080i01. doi:10.18637/jss.v080.i01.
[34] E. Bernard, H. Tuechler, P. L. Greenberg, R. P. Hasserjian, J. E. A. Ossa, Y. Nannya,
     S. M. Devlin, M. Creignou, P. Pinel, L. Monnier, G. Gundem, J. S. Medina-Martinez,
     D. Domenico, M. Jädersten, U. Germing, G. Sanz, A. A. van de Loosdrecht, O. Kos-
     mider, M. Y. Follo, F. Thol, L. Zamora, R. F. Pinheiro, A. Pellagatti, H. K. Elias, D. Haase,
     C. Ganster, L. Ades, M. Tobiasson, L. Palomo, M. G. D. Porta, A. Takaori-Kondo, T. Ishikawa,
     S. Chiba, S. Kasahara, Y. Miyazaki, A. Viale, K. Huberman, P. Fenaux, M. Belickova, M. R.
     Savona, V. M. Klimek, F. P. S. Santos, J. Boultwood, I. Kotsianidis, V. Santini, F. Solé,
     U. Platzbecker, M. Heuser, P. Valent, K. Ohyashiki, C. Finelli, M. T. Voso, L.-Y. Shih,
     M. Fontenay, J. H. Jansen, J. Cervera, N. Gattermann, B. L. Ebert, R. Bejar, L. Malco-
     vati, M. Cazzola, S. Ogawa, E. Hellström-Lindberg, E. Papaemmanuil, Molecular inter-
     national prognostic scoring system for myelodysplastic syndromes, NEJM Evidence 1
     (2022) EVIDoa2200008. URL: https://evidence.nejm.org/doi/abs/10.1056/EVIDoa2200008.
     doi:10.1056/EVIDoa2200008.
[35] C. A. McGilchrist, C. W. Aisbett, Regression with frailty in survival analysis, Biometrics
     47 (1991) 461–466. URL: http://www.jstor.org/stable/2532138.