
1613-0073

Post-Processing Techniques

Alberto Archetti

alberto.archetti@polito.it 0 1

Francesco Stranieri

francesco.stranieri@polito.it 1 2

Matteo Matteucci

matteo.matteucci@polimi.it 0

0 Politecnico di Milano, Via Giuseppe Ponzio, 34, 20133 Milan, Italy
1 Politecnico di Torino, Corso Duca degli Abruzzi, 24, 10138 Turin, Italy
2 Università degli Studi di Milano-Bicocca, Viale Sarca, 336, 20126 Milan, Italy

Survival analysis is a crucial tool in healthcare, allowing us to understand and predict time-to-event occurrences using statistical and machine-learning techniques. As deep learning gains traction in this domain, a specific challenge emerges: neural network-based survival models often produce discrete-time outputs, with far fewer discretization points than unique time points in the dataset, leading to potentially inaccurate survival functions. To this end, our study explores post-processing techniques for survival functions. Specifically, interpolation and smoothing can act as effective regularization, enhancing performance metrics integrated over time, such as the Integrated Brier Score and the Cumulative Area-Under-the-Curve. We employed various regularization techniques on diverse real-world healthcare datasets to validate this claim. Empirical results suggest a significant performance improvement when using these post-processing techniques, underscoring their potential as a robust enhancement for neural network-based survival models. These findings suggest that integrating the strengths of neural networks with the non-discrete nature of survival tasks can yield more accurate and reliable survival predictions in clinical scenarios.

Keywords: survival analysis, neural networks, regularization techniques, healthcare

ORCID: 0000-0003-3826-4645 (A. Archetti); 0000-0002-5366-8499 (F. Stranieri); 0000-0002-8306-6739 (M. Matteucci)

CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org

1. Introduction

Survival analysis [1] is a field of statistics concerned with modeling time-to-event data. Its primary objective is to construct a survival function $S(t)$, depending on time $t$ and tailored to a particular subject, representing the probability of not experiencing a particular event of interest up to $t$, such as disease onset, death, or hospital discharge. Thus, a survival function is formally $S(t) = P(T > t)$. The analysis of time-to-event data is of paramount importance in healthcare, facilitating the identification of patient risk factors over time.

Distinctively, survival analysis differs from conventional machine learning tasks such as classification and regression due to its ability to handle censored data points, that is, instances where the event of interest has not yet occurred for a particular subject. This characteristic is common in clinical data, given the prolonged, complex, and privacy-constrained nature of data collection, which challenges the applicability of data-intensive machine learning models.

Recent advancements in survival applications exploit neural network-based deep learning techniques, emphasizing their ability to model the non-linear relationships between patient features and time-to-event records. Their utility has been demonstrated in various studies [2, 3, 4, 5, 6], which highlight their generalization advantage over traditional statistical approaches and show that they match the expressive power of ensemble methods [7, 8, 9]. However, most common neural network architectures involve a set of discrete outputs, necessitating specific processing to adapt to the continuous nature of survival analysis. To this end, numerous strategies for reconciling discrete-output neural networks with survival analysis have been introduced. Most techniques focus on time discretization [5], enabling neural networks to encapsulate time-event associations for a limited set of time points. By contrast, few methodologies directly tackle time-continuous survival functions; those that do are based on proportional hazard [2] or piece-wise constant hazard [5, 10].

In our research, we conduct a thorough examination of multiple interpolation methods to determine if post-processing interpolation can augment the efficacy of discrete-output neural networks. Specifically, we delve into three interpolation techniques: linear, piece-wise exponential, and spline-based, applying them to state-of-the-art neural survival models. We investigate whether relevant performance gaps exist between interpolated and non-interpolated versions of the same survival model. Our investigation employs time-dependent survival metrics to gauge the efficacy of neural-based models, namely the Integrated Brier Score (IBS) and the Cumulative Area-Under-the-Curve (Cumulative AUC). Our empirical analysis, validated across several real-world healthcare datasets, indicates that interpolation supports the generalization capability of neural-based survival models. This improvement is particularly relevant when the number of discretization bins and, consequently, of neural network outputs is substantially smaller than the dataset's sample count. This scenario commonly arises in practical applications, where the dataset size considerably outweighs the number of output neurons.

In summary, our research offers a comprehensive empirical analysis of interpolation methods tailored for neural-based survival models. We explore the potential advantages of incorporating a post-processing interpolation phase based on simple operations with negligible computational overhead. These insights bear significant implications for the clinical applicability of survival models, suggesting that a simple interpolation step can markedly boost the generalizability of a neural-based survival model.

2. Background

This section provides the necessary background on survival analysis as a machine-learning problem and describes the survival metrics used to assess model performance in the subsequent experimental evaluation.

2.1. Survival Analysis

Survival analysis tackles time-to-event modeling, leveraging both statistical and machine-learning methodologies. It plays a pivotal role in interpreting clinical data, forecasting occurrences such as the onset of a disease, relapses, mortality, and hospital discharges. By harnessing patient information, the aim is to formulate a time-dependent parametric function, $S(t)$, that denotes the probability of a subject not encountering a specified event up to a given time $t$, expressed as

$$S(t) = P(T > t), \tag{1}$$

where $T$ denotes the time at which the event occurs. This non-increasing function starts with a value of 1 at $t = 0$, approaching 0 as $t$ tends to infinity.

Instead of $S(t)$, several survival methods estimate the instantaneous hazard rate for each individual, called hazard function:

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}. \tag{2}$$

From the hazard function, the survival function can be derived as

$$S(t) = \exp(-H(t)), \tag{3}$$

where $H(t) = \int_0^t h(u)\,du$ represents the integral of $h$ over the interval from 0 to $t$.
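As a minimal numerical sketch of the relation between hazard and survival described above, the following Python snippet integrates a hazard function with the trapezoidal rule and exponentiates the negated result. The function name `survival_from_hazard`, the number of integration steps, and the constant hazard of 0.1 are illustrative assumptions and are not taken from the study.

```python
import math


def survival_from_hazard(hazard, t, steps=10_000):
    """Approximate S(t) = exp(-H(t)), where H(t) integrates the hazard on [0, t]."""
    if t <= 0:
        return 1.0  # a survival function starts at 1 for t = 0
    dt = t / steps
    # Trapezoidal rule for the cumulative hazard H(t).
    cumulative = 0.5 * (hazard(0.0) + hazard(t))
    cumulative += sum(hazard(k * dt) for k in range(1, steps))
    cumulative *= dt
    return math.exp(-cumulative)


# Hypothetical constant hazard of 0.1 events per unit of time.
s_at_5 = survival_from_hazard(lambda u: 0.1, 5.0)
print(s_at_5)
```

For a constant hazard the cumulative hazard is linear in time, so the result should agree with the closed form `exp(-0.1 * 5)` up to floating-point error.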

What sets survival analysis apart from conventional machine learning tasks, like classification or regression, is its ability to analyze censored data points. Such data represent subjects who have not encountered the specified event during the data collection period. Hence, survival datasets comprise triplets $(\mathbf{x}_i, \delta_i, t_i)$, where (i) $\mathbf{x}_i$ indicates the feature vector for subject $i$; (ii) $\delta_i$ is a binary flag, which is set to 0 if the sample is censored; and (iii) $t_i$ designates either the event's time or the censoring time, depending on the value of $\delta_i$. This is the most common scenario in survival problems, referred to as right censoring. Throughout this paper, our discussions will refer to the right censoring context.

The most prevalent models used for deriving survival functions include the non-parametric Kaplan-Meier model [11] and the linear Cox model [12]. Machine learning-enhanced non-linear extensions typically employ ensemble strategies [7, 13] and neural networks, which will be analyzed in Section 3.

2.2. Metrics in Survival Analysis

The most common metrics used to evaluate the predictive power of survival models are the Concordance Index (C-Index), the IBS, and the Cumulative AUC. The C-Index [14] measures the agreement between the predicted survival outcomes from a model and the actual observed outcomes for pairs of samples. Specifically, for each time point, the predicted outcome is determined by the model's survival probability or risk score, while the true outcome reflects the event status: 1 for non-censored and 0 for censored samples. Only pairs with times $t_1 < t_2$ and event indicators $\delta_1, \delta_2$ where $\delta_1 = 1$, i.e., where the earlier sample is non-censored, are considered comparable. The C-Index measures the proportion of comparable pairs that are concordant, meaning the sample with the higher predicted survival probability outlives the other. This measure can be interpreted as the probability that, for two randomly chosen individuals, the one with the higher risk score will experience the event first. A C-Index value of 0.5 signifies random predictions, whereas 1 indicates perfect concordance. While easy to interpret, the C-Index does not provide information about model calibration.
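The pair-counting logic behind the C-Index can be sketched as follows. This is a simplified, unweighted variant written for illustration: the function name `concordance_index`, the convention of counting tied risk scores as half-concordant, and the toy data are assumptions, and the censoring-weighted version used in practice is not reproduced here.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of comparable pairs whose predicted risks are correctly ordered."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # the earlier sample of a comparable pair must be non-censored
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # higher risk experienced the event first
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # assumption: ties count as half-concordant
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): four subjects, the second one is censored.
times = [2.0, 4.0, 6.0, 8.0]
events = [1, 0, 1, 1]
risk = [0.9, 0.3, 0.1, 0.5]
c_index = concordance_index(times, events, risk)
print(c_index)
```

In this toy example there are four comparable pairs, three of which are ordered correctly, so the printed value is 0.75.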

Alongside the C-Index, another common metric for assessing survival models is the Brier Score (BS) [15], which quantifies both precision and calibration of predicted survival outcomes. The BS computes the squared difference between the actual event occurrence (1 for the event and 0 otherwise) and the predicted survival probability for a specific time instant. Ideally, a BS value should be close to 0, indicating perfect prediction. The IBS integrates the Brier scores over various times, giving an overall temporal performance evaluation of the model. The IBS summarizes the model's ability to capture accurate event probabilities. However, its evaluation can be affected by the integration range and the time density of available samples.
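To make the squared-difference idea concrete, the sketch below computes a Brier score at a single time instant for fully observed samples. It is deliberately simplified: censoring is ignored, the observed outcome is encoded as the survival status at the evaluation time (1 if the subject is still event-free, 0 otherwise) so that it is directly comparable with a survival probability, and the function name `brier_score` is hypothetical.

```python
def brier_score(event_times, predicted_survival, t):
    """Mean squared gap between survival status at time t and predicted S(t).

    Simplifying assumption: every sample is uncensored, so no censoring
    weights are needed.
    """
    total = 0.0
    for event_time, s_hat in zip(event_times, predicted_survival):
        still_event_free = 1.0 if event_time > t else 0.0
        total += (still_event_free - s_hat) ** 2
    return total / len(event_times)


# Two hypothetical subjects with events at t = 1 and t = 3, evaluated at t = 2.
perfect = brier_score([1.0, 3.0], [0.0, 1.0], 2.0)
uninformative = brier_score([1.0, 3.0], [0.5, 0.5], 2.0)
print(perfect, uninformative)
```

A perfect prediction yields 0, while a constant prediction of 0.5 yields 0.25, which is why values close to 0 are desirable.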

The third most common metric for survival models is the Cumulative AUC. While the AUC is traditionally a classification metric, its application extends to survival studies with time-dependent outcomes [16]. In this context, the AUC examines the predicted survival probabilities against observed event statuses over several time instants. Samples that are censored before or during this period are treated as negative events. The Cumulative AUC integrates these time-dependent AUC values, with 1 indicating perfection in prediction.

To adjust for censoring biases, the Inverse Probability of Censoring Weighting (IPCW) method [14, 17] is employed. Here, each sample is assigned a weight based on its inverse censoring probability at a given time. Observations with high censoring likelihoods get more weight, and vice versa for low-censoring observations. This weighting helps to counteract potential biases due to the event censoring distribution. Also, each of the metrics described focuses on a specific aspect of survival models. Therefore, for a comprehensive evaluation of the overall quality of a survival model, multiple metrics must be taken into account.

3. Related Work

In recent years, deep learning has increased the expressive capability of traditional survival models. The first works were devoted to the extension of one of the most prominent survival models: the Cox model [12]. The Cox model defines a hazard function based on the assumption that the relative risk between subjects remains unchanged over time (proportional hazard assumption):

$$h(t \mid \mathbf{x}_i) = h_0(t)\exp(\boldsymbol{\beta}^\top \mathbf{x}_i), \tag{4}$$

where $h_0(t)$ is the baseline hazard common across all subjects, and $\exp(\boldsymbol{\beta}^\top \mathbf{x}_i)$ is a subject-specific factor that modifies the baseline hazard based on an individual's risk profile. The classic Cox model assumes the existence of a linear relationship between features and subject hazard, with the risk multiplier being the exponential of the dot product of features and weights.

A substantial extension of the Cox model is DeepSurv [2]. Here, the linear relationship between features and risks is replaced with a deep neural network, capturing non-linear interactions between features and the hazard function. It leverages the same differentiable loss function as the original Cox model for training, called partial log-likelihood. This loss function is tailored to train models based on the proportional hazard assumption.
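The partial log-likelihood mentioned above can be sketched in a few lines. The version below is a plain-Python illustration without tie correction; the function name, the averaging over observed events, and the toy inputs are assumptions. In a DeepSurv-style model the `log_risks` would be the network outputs, while in the linear Cox model they would be the dot products of features and weights.

```python
import math


def negative_partial_log_likelihood(times, events, log_risks):
    """Negative Cox partial log-likelihood, averaged over observed events."""
    total = 0.0
    n_events = 0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if e_i != 1:
            continue  # censored samples only contribute through the risk sets
        # Risk set: subjects still under observation at time t_i.
        risk_set_sum = sum(
            math.exp(log_risks[j]) for j, t_j in enumerate(times) if t_j >= t_i
        )
        total += log_risks[i] - math.log(risk_set_sum)
        n_events += 1
    if n_events == 0:
        raise ValueError("at least one observed event is required")
    return -total / n_events


# Hypothetical example: three uncensored subjects with identical log-risks.
loss = negative_partial_log_likelihood([1.0, 2.0, 3.0], [1, 1, 1], [0.0, 0.0, 0.0])
print(loss)
```

With identical log-risks, each event is equally likely among the subjects still at risk, so the loss reduces to `(log 3 + log 2 + log 1) / 3`.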

However, the proportional hazards assumption, though rendering models straightforward and interpretable, can sometimes hamper their generalization. In fact, many real-world datasets do not respect this assumption, rendering such models less effective. A paradigm shift in neural survival models emerged with time discretization techniques [5]. These techniques allowed neural networks to directly approximate discretized hazard and survival functions. Among the models following this approach, DeepHit [3] employs sigmoid activations to estimate discrete probabilities for designated event times. DeepHit is specifically tailored to compute probabilities for multiple competing events, predicting which event occurs first. In fact, its loss function is designed not only to improve the model's accuracy but also to predict event occurrence in the most probable order.

Drawing inspiration from the Multi-Task Logistic Regression (MTLR) approach [18], Neural Multi-Task Logistic Regression (N-MTLR) [19] employs multiple neural-based logistic regression heads to predict event occurrence probability for each time step. These outputs are subsequently normalized using a softmax function to yield event probabilities.

Finally, the Logistic Hazard model [20, 5] frames the survival problem discretely, transforming it into a sequence of binary classification tasks. Each task predicts the risk for an event occurrence at a given time interval. The model captures time-dependent effects through a multi-output neural network employing softmax activations, making it a robust choice for handling time-varying effects in survival analysis.
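As an illustration of how per-interval risks translate into a discrete survival function, the sketch below multiplies the complements of the discrete hazards, which is the standard identity for discrete-time survival models. The function name and the hazard values are hypothetical, and the snippet is independent of how the network produces those hazards.

```python
def survival_from_discrete_hazards(hazards):
    """Survival at the end of each interval: S_j = prod_{k <= j} (1 - h_k)."""
    survival = []
    alive = 1.0  # probability of being event-free before the first interval
    for h in hazards:
        alive *= 1.0 - h  # survive interval k with probability 1 - h_k
        survival.append(alive)
    return survival


# Hypothetical conditional event probabilities for three consecutive intervals.
curve = survival_from_discrete_hazards([0.1, 0.2, 0.5])
print(curve)
```

The resulting values are non-increasing by construction, so they can serve directly as anchor points for a survival function.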

An alternative approach from [5, 10], instead of discretizing the survival function, assumes the hazard function to be piece-wise constant. This method, called PC-Hazard, produces continuous survival functions framed as piece-wise exponentials. Thus, PC-Hazard adapts any regression model to a survival model, trainable with the Poisson regression technique.

4. Interpolation Methods

In survival models based on neural networks, the discrete outputs, or anchor points, define the value of the survival function for a set of specific time instants. This section provides a description of several interpolation techniques designed to bridge the gap between discrete survival functions and continuous metric evaluation. Consider a set of $n$ time instants, each corresponding to the limit of a discretization bin, $\{\tau_1, \tau_2, \dots, \tau_n\}$, such that $0 < \tau_1 < \tau_2 < \dots < \tau_n$. Then, a survival model based on neural networks produces a set of outputs $\{s_1, s_2, \dots, s_n\}$, such that $1 \ge s_1 \ge s_2 \ge \dots \ge s_n \ge 0$. The set of pairs $(\tau_i, s_i)$ corresponds to the anchor points leveraged by the interpolation methods to obtain a continuous survival function. In order to allow the interpolation to attain the properties of survival functions, we consider the pairs $(0, 1)$ and $(\infty, 0)$ to always be part of the set of anchor points. Figure 1 illustrates the considered interpolation techniques evaluated on a set of fixed anchor points.

4.1. Step-wise Interpolation

Most works apply step-wise interpolation to produce continuous outputs from the set of anchor points produced by a survival model. In particular, given a time instant $t \in [\tau_i, \tau_{i+1})$, where $\tau_i$ and $\tau_{i+1}$ are consecutive anchor times with values $s_i$ and $s_{i+1}$, this simple type of interpolation defines the value of a survival function as

$$S(t) = s_i,$$

which corresponds to the value of the closest anchor point with a lower corresponding time. We call this interpolation method Step FWD, indicating that the anchor point is propagated forward in the survival function. In the following analyses, we also employ an alternative approach, called Step BWD, which propagates the next closest anchor point backward in the survival function as

$$S(t) = s_{i+1}.$$

The idea of Step BWD is to focus on future event instances rather than the immediate past. It might be relevant in situations where interventions or treatments are planned, and the anticipation of the next event risk is more clinically significant than the immediate past.

Figure 1: [...] 0.0 for $t$ = 2, 3, 6, 7, and 9. The first two plots from the left refer to step-wise interpolations considering the following or previous anchor points, respectively. The third plot illustrates a linear interpolation. The fourth is a piece-wise exponential, inspired by the PC-Hazard model [5, 10]. The final plot interpolates the anchor points with a monotonic cubic spline.
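The two step-wise rules can be sketched with a binary search over the anchor times. This is an illustrative implementation rather than the code used in the experiments: the function name `step_interpolate`, the `mode` argument, and the anchor values in the example are assumptions, while the boundary anchors at time 0 (value 1) and at infinity (value 0) follow the convention stated at the beginning of this section.

```python
import bisect
import math


def step_interpolate(times, values, t, mode="fwd"):
    """Step-wise survival value at time t from anchor times and values.

    mode="fwd" propagates the previous anchor forward (Step FWD);
    mode="bwd" propagates the next anchor backward (Step BWD).
    """
    if t < 0:
        raise ValueError("t must be non-negative")
    # Boundary anchors: survival is 1 at time 0 and 0 at infinity.
    ts = [0.0] + list(times) + [math.inf]
    ss = [1.0] + list(values) + [0.0]
    i = bisect.bisect_right(ts, t) - 1  # index such that ts[i] <= t < ts[i + 1]
    return ss[i] if mode == "fwd" else ss[i + 1]


# Hypothetical anchors (times and survival values chosen for illustration).
anchor_times = [2.0, 3.0, 6.0, 7.0, 9.0]
anchor_values = [0.9, 0.7, 0.6, 0.2, 0.0]
fwd = step_interpolate(anchor_times, anchor_values, 4.0, mode="fwd")
bwd = step_interpolate(anchor_times, anchor_values, 4.0, mode="bwd")
print(fwd, bwd)
```

At `t = 4.0`, which lies between the anchors at 3 and 6, Step FWD returns the earlier value and Step BWD the later one, so Step BWD never exceeds Step FWD.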

4.2. Linear Interpolation

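The most straightforward extension to the step-wise interpolation techniques is to define the interpolation point on the line connecting the considered anchor points. For $t \in [\tau_i, \tau_{i+1})$, where $\tau_i$ and $\tau_{i+1}$ are consecutive anchor times with values $s_i$ and $s_{i+1}$, a linearly-interpolated survival function is defined as

$$S(t) = s_i + (s_{i+1} - s_i)\,\frac{t - \tau_i}{\tau_{i+1} - \tau_i}.$$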
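A minimal sketch of linear interpolation between anchor points follows. The function name and the example anchors are hypothetical. One detail is an explicit assumption of this sketch: beyond the last finite anchor, the last anchor value is held constant, because a straight line towards an anchor placed at infinity is not well defined.

```python
import bisect


def linear_interpolate(times, values, t):
    """Survival value at time t on the line joining the surrounding anchors."""
    if t < 0:
        raise ValueError("t must be non-negative")
    ts = [0.0] + list(times)   # boundary anchor: survival is 1 at time 0
    ss = [1.0] + list(values)
    if t >= ts[-1]:
        return ss[-1]  # assumption: hold the last anchor value afterwards
    i = bisect.bisect_right(ts, t) - 1  # ts[i] <= t < ts[i + 1]
    weight = (t - ts[i]) / (ts[i + 1] - ts[i])
    return ss[i] + weight * (ss[i + 1] - ss[i])


# Hypothetical anchors (times and survival values chosen for illustration).
anchor_times = [2.0, 3.0, 6.0, 7.0, 9.0]
anchor_values = [0.9, 0.7, 0.6, 0.2, 0.0]
print(linear_interpolate(anchor_times, anchor_values, 4.0))
```

The returned value always lies between the two neighbouring anchor values, so the interpolated function stays non-increasing whenever the anchors are.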

4.3. Piece-wise Exponential Interpolation

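This interpolation method is inspired by the piece-wise constant hazard model [5, 10]. This method, referred to as PWE, assumes the hazard function to be constant within each time interval. Then, according to Eq. (3), the survival function results in a piece-wise exponential function. For $t \in [\tau_i, \tau_{i+1})$, where $\tau_i$ and $\tau_{i+1}$ are consecutive anchor times with values $s_i$ and $s_{i+1}$, the interpolation is computed as

$$S(t) = s_i \exp\big(-h_i\,(t - \tau_i)\big), \quad \text{where} \quad h_i = \frac{\log s_i - \log s_{i+1}}{\tau_{i+1} - \tau_i}$$

is the constant hazard of the interval, chosen so that $S(\tau_{i+1}) = s_{i+1}$.

4.4. Monotonic Cubic Spline Interpolation

The Hermite spline with monotonicity constraints [21] is a spline-based interpolation method to fit a set of anchor points with a non-increasing smooth function maintaining a continuous derivative. The Fritsch–Carlson method enables the construction of survival functions with a smooth transition between anchor points. In the subsequent sections, we refer to this interpolation technique as Spline.

The idea is to constrain the tangents of the Hermite spline in such a way that the resulting piece-wise function is monotonic. To this end, the Fritsch–Carlson method starts from the secant lines between successive anchor points and initializes each tangent as the average of the adjacent secants,

$$m_i = \frac{d_{i-1} + d_i}{2}, \quad \text{where} \quad d_i = \frac{s_{i+1} - s_i}{\tau_{i+1} - \tau_i}.$$

A sufficient condition to ensure monotonicity is to set, for $\alpha_i > 3$ or $\beta_i > 3$, with $\alpha_i = m_i / d_i$ and $\beta_i = m_{i+1} / d_i$,

$$m_i = 3 d_i \quad \text{or} \quad m_{i+1} = 3 d_i, \quad \text{respectively}.$$

At this point, the values of the tangents guarantee that a Hermite spline passing through the anchor points is non-increasing. For $t \in [\tau_i, \tau_{i+1})$, the survival function is computed as

$$S(t) = (2u^3 - 3u^2 + 1)\,s_i + (u^3 - 2u^2 + u)\,\Delta_i m_i + (-2u^3 + 3u^2)\,s_{i+1} + (u^3 - u^2)\,\Delta_i m_{i+1},$$

where $\Delta_i = \tau_{i+1} - \tau_i$ and $u = (t - \tau_i)/\Delta_i$.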
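The piece-wise exponential and monotonic spline interpolations can be sketched together in plain Python. This is an illustrative reading of the two methods, not the implementation used in the experiments. Several details are assumptions of the sketch: the function names, the clipping constant `eps` that avoids taking the logarithm of zero, the one-sided secants used as endpoint tangents, the zero tangents on flat segments, and holding the last anchor value beyond the last finite anchor.

```python
import bisect
import math


def _anchors(times, values):
    # Boundary anchor: survival is 1 at time 0.
    return [0.0] + list(times), [1.0] + list(values)


def pwe_interpolate(times, values, t, eps=1e-12):
    """Piece-wise exponential interpolation (constant hazard per interval)."""
    ts, ss = _anchors(times, values)
    if t >= ts[-1]:
        return ss[-1]  # assumption: hold the last anchor value afterwards
    i = bisect.bisect_right(ts, t) - 1
    w = (t - ts[i]) / (ts[i + 1] - ts[i])
    log_a = math.log(max(ss[i], eps))      # clip to avoid log(0)
    log_b = math.log(max(ss[i + 1], eps))
    return math.exp(log_a + w * (log_b - log_a))  # linear in log-survival


def monotone_tangents(ts, ss):
    """Tangents limited so that each ratio to the secant is at most 3."""
    n = len(ts)
    d = [(ss[k + 1] - ss[k]) / (ts[k + 1] - ts[k]) for k in range(n - 1)]
    m = [d[0]] + [0.5 * (d[k - 1] + d[k]) for k in range(1, n - 1)] + [d[-1]]
    for k in range(n - 1):
        if d[k] == 0.0:
            m[k] = m[k + 1] = 0.0  # flat segment: zero tangents at both ends
            continue
        if m[k] / d[k] > 3.0:
            m[k] = 3.0 * d[k]
        if m[k + 1] / d[k] > 3.0:
            m[k + 1] = 3.0 * d[k]
    return m


def spline_interpolate(times, values, t):
    """Cubic Hermite interpolation with monotonicity-preserving tangents."""
    ts, ss = _anchors(times, values)
    if t >= ts[-1]:
        return ss[-1]
    m = monotone_tangents(ts, ss)
    i = bisect.bisect_right(ts, t) - 1
    h = ts[i + 1] - ts[i]
    u = (t - ts[i]) / h
    return ((2 * u**3 - 3 * u**2 + 1) * ss[i] + (u**3 - 2 * u**2 + u) * h * m[i]
            + (-2 * u**3 + 3 * u**2) * ss[i + 1] + (u**3 - u**2) * h * m[i + 1])


# Hypothetical anchors (times and survival values chosen for illustration).
anchor_times = [2.0, 3.0, 6.0, 7.0, 9.0]
anchor_values = [0.9, 0.7, 0.6, 0.2, 0.0]
print(pwe_interpolate(anchor_times, anchor_values, 4.0))
print(spline_interpolate(anchor_times, anchor_values, 4.0))
```

Both functions pass through the anchor points and remain non-increasing between them; the spline additionally has a continuous first derivative at the anchors.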

5. Experiments

This section collects the experimental methodology and results to validate our claims about interpolation techniques for neural-based survival models. In particular, we describe the datasets involved in the experiments, the training and inference procedure, and the final results obtained. We highlight that each of the datasets involved is publicly accessible and used in several survival studies for benchmarking purposes [22, 8, 9]. To allow for reproducibility, we made the source code of the experiments publicly available at https://github.com/archettialberto/interpolation_for_deep_survival_analysis.

5.1. Datasets

This section describes the datasets processed throughout the experiments:

• Worcester Heart Attack Study (WHAS500) [23]: This dataset focuses on cardiovascular health, specifically patients who have experienced myocardial infarction. Given that heart diseases are one of the leading causes of mortality worldwide, models built on this dataset can help in risk prediction, better understanding of prognostic factors, and overall improved patient management strategies.

• German Breast Cancer Study Group (GBSG2) [24]: Cancer recurrence is a significant concern for patients who have undergone treatment. The GBSG2 dataset provides insights into factors that may affect recurrence, especially in the context of hormone treatments. The dataset's focus on covariates like age, menopausal status, and tumor-specific details makes it a rich source for modeling and predictions, which can directly influence treatment decisions.

• Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) [25, 2]: This dataset offers clinical attributes related to patients experiencing breast cancer. It is part of a larger project offering genomic data, paving the way for personalized treatment plans by taking into account the genetic variations that might influence survival rates.

• The Cancer Genome Atlas Program - Breast Cancer Study (TCGA-BRCA) [26]: The TCGA provides a comprehensive view of the genomic changes across various cancer types. Among the data collection projects revolving around TCGA, BRCA focuses on breast-invasive carcinoma, offering insights into the variations in survival outcomes based on geographic regions and their associated clinical practices. This dataset comes from a dataset suite for medical federated learning, called Flamby [26]. In this study, we do not consider the federated aspect, aggregating the regional clients into a single cluster of individuals.

5.2. Experimental Setup

This section delineates the methodological approach utilized to assess the efficacy of interpolation as a post-processing measure in survival models based on neural networks. The datasets employed for our evaluation, specifically WHAS500, GBSG2, METABRIC, and TCGA-BRCA, are detailed in Section 5.1. Data from these datasets were uniformly sampled to formulate both training and test splits, comprising 80% and 20% of the overall samples, respectively. Subsequently, the training subset underwent an additional 80-20% split to generate a validation subset.
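The nested split described above leaves roughly 64% of the samples for training, 16% for validation, and 20% for testing. A minimal sketch with the standard library follows; the function name, the fixed seed, and the rounding of split sizes are assumptions of the sketch and are not taken from the published code.

```python
import random


def train_val_test_split(n_samples, seed=0):
    """Uniformly sample an 80/20 train-test split, then 80/20 train-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # uniform sampling without replacement
    n_test = round(0.2 * n_samples)
    test, rest = indices[:n_test], indices[n_test:]
    n_val = round(0.2 * len(rest))
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


# Hypothetical dataset with 100 samples.
train_idx, val_idx, test_idx = train_val_test_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

The three index lists are disjoint and together cover every sample exactly once.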

The experiments involved four state-of-the-art neural network-based models from survival analysis: DeepSurv, DeepHit, Logistic Hazard, and N-MTLR, each thoroughly described in Section 3. Notably, DeepSurv is the only model based on the proportional hazard assumption, whereas the others rely on an explicit definition of discrete time bins. Concerning these discretization points, we adopted a uniform splitting approach, increasing the anchor count with every experiment. The tested numbers of anchors are 5, 10, 50, 100, 500, and 1000. These numbers hold for non-proportional models only, as DeepSurv has a fixed number of anchors, corresponding to the points of the baseline function, shared across all subjects.
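One way to read the uniform splitting approach is sketched below: the anchor times are the right limits of equal-width bins, and each observed duration is mapped to the bin that contains it. The choice of the largest duration as the upper end of the grid, the handling of durations beyond it, and both function names are assumptions made for illustration.

```python
import bisect


def uniform_anchor_times(max_time, n_anchors):
    """Right limits of n_anchors equal-width bins covering (0, max_time]."""
    step = max_time / n_anchors
    return [step * (k + 1) for k in range(n_anchors)]


def bin_index(duration, anchor_times):
    """Index of the first anchor time that is not smaller than the duration."""
    k = bisect.bisect_left(anchor_times, duration)
    return min(k, len(anchor_times) - 1)  # assumption: clamp to the last bin


# Hypothetical example: durations up to 10 time units and 5 anchors.
anchors = uniform_anchor_times(10.0, 5)
print(anchors, bin_index(3.0, anchors))
```

With few anchors, many distinct durations fall into the same bin, which is the situation in which the choice of interpolation matters most.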

Each model comprises a two-layer fully connected neural network with a number of inputs equal to the dataset features and a hidden layer size of 32. Each layer is followed by a ReLU activation function and a dropout regularization layer with 0.1 probability. The number of outputs is 1 for DeepSurv and equal to the number of anchor points for all the other models. In the experiments, models are trained using the Adam optimizer with a learning rate of 0.01. Training ran until convergence for a maximum of 300 epochs, adopting an early stopping strategy on the validation set with a 10-epoch patience threshold. The batch size was fixed at 128.

In the subsequent inference phase, survival functions were derived from the anchor points of each model, after an interpolation step leveraging the methods outlined in Section 4: Step BWD, Step FWD, Linear, PWE, and Spline. For each trained model paired with an interpolation strategy, the C-Index, the IBS, and the Cumulative AUC with IPCW weighting were evaluated, as described in Section 2.2. The IBS and the Cumulative AUC were integrated between the 25th and 75th percentiles of the test times, to limit the noise that could be introduced by the lower sample density at the endpoints of the time spectrum. Finally, to limit the effects of randomness, each single experiment was repeated 30 times, averaging the final results.
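The restriction of the integration range to the interquartile window of the test times can be sketched as follows. The percentile convention (the inclusive method of `statistics.quantiles`), the number of grid points, and the function names are assumptions; the exact conventions of the evaluation code may differ.

```python
import statistics


def evaluation_grid(test_times, n_points=100):
    """Evenly spaced times between the 25th and 75th percentiles of test_times."""
    q1, _, q3 = statistics.quantiles(test_times, n=4, method="inclusive")
    step = (q3 - q1) / (n_points - 1)
    return [q1 + k * step for k in range(n_points)]


def time_average(metric_values, grid):
    """Trapezoidal integral of a time-dependent metric, normalized by the range."""
    area = sum(
        0.5 * (metric_values[k] + metric_values[k + 1]) * (grid[k + 1] - grid[k])
        for k in range(len(grid) - 1)
    )
    return area / (grid[-1] - grid[0])


# Hypothetical test times and a constant time-dependent metric of 0.2.
grid = evaluation_grid([1, 2, 3, 4, 5, 6, 7, 8, 9], n_points=5)
print(grid, time_average([0.2] * len(grid), grid))
```

Because the interpolated survival functions are evaluated on a grid that is much denser than the anchor times, the quality of the values between anchors directly affects the integrated metrics.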

5.3. Results

In this section, we present and discuss the empirical results derived from our experiments with various interpolation techniques. For brevity, we enumerate the IBS (Table 2a), Cumulative AUC (Table 2b), and C-Index (Table 2c) values achieved on the METABRIC dataset, which is the largest dataset among the ones analyzed, for 10, 100, and 1000 anchor points. Detailed numerical values on the WHAS500, GBSG2, and TCGA-BRCA datasets are reported in Appendix A. On top of that, the time-dependent metrics for all datasets, namely IBS and Cumulative AUC, are plotted for 5, 50, and 500 anchor counts in Figure 2a and Figure 2b.

Does interpolation serve as an effective post-processing step when evaluated using the IBS metric? As illustrated in Table 2a and Figure 2a, implementing any form of interpolation generally proves beneficial over the Step BWD or Step FWD techniques. Specifically, for a limited number of anchor points, i.e., 5 and 10, neural models leveraging Linear and PWE interpolations demonstrate a better IBS compared to their counterparts. Although Spline interpolation surpasses step-wise methods, it falls behind Linear and PWE. As the number of anchor points increases, the distinction among interpolation methods diminishes. This is expected, as a larger anchor count offers a finer discretization grid, enabling the neural network to precisely adjust the survival function and thereby mitigating the necessity for interpolation. Notably, while minor differences can still be observed at 50 and 100 anchors, increasing to 500 or 1000 effectively equalizes the results of all methods. This convergence can be attributed to the anchor count approaching the dataset size, compelling the model to capture the behavior of individual time instances.

Does interpolation serve as an effective post-processing step when evaluated using the Cumulative AUC metric? The Cumulative AUC metric outcomes, reported in Table 2b and Figure 2b, largely follow the trend of the previous observations. Non-step-based interpolation methods tend to augment the Cumulative AUC for neural models, especially when the number of anchor points is low. An outlier to this trend is observed with DeepHit using 10 anchor points on the METABRIC dataset, where Step FWD emerges as the best technique. However, this remains the only exception with respect to the general trend. Remarkably, while Step FWD often serves as a default choice for state-of-the-art survival models, it is consistently outperformed by Step BWD. Similar to the IBS trend, the performance difference among interpolation techniques diminishes with an increased anchor count.

How do interpolation techniques affect the C-Index metric? As highlighted in Table 2c, step-based interpolation methods marginally outperform other techniques regarding the C-Index on the METABRIC dataset. Hence, for specific applications where concordance is the only metric of interest, step-based interpolation stands as a reliable choice. On the other hand, for any other situation, smoother interpolation techniques present better time-dependent metrics with only a negligible degradation of concordance.

Is there a correlation between the proportional hazard assumption and interpolation's efficacy? The proportional hazard assumption significantly impacts the model's outputs, imposing constant hazard ratios between subjects over time. Consequently, the chosen interpolation method should not affect the C-Index, as confirmed by DeepSurv's performance in Table 2c. Interestingly, for the other metrics, IBS and Cumulative AUC, deviations are not noticeable to the fourth decimal place. Thus, for models based on the proportional hazard assumption, the influence of interpolation on performance is negligible. Instead, as thoroughly analyzed earlier, the opposite holds for non-proportional models based on time discretization.

How does censoring impact results? As previously discussed, interpolation techniques generally enhance survival metrics. This improvement is particularly evident in the METABRIC dataset, which has the largest proportion of censored samples among the datasets we examined. When we compare this to other datasets with fewer censored samples, the positive effect of interpolation, although still present, is less marked. While we cannot establish a direct correlation between the benefit of interpolation and the percentage of censoring, we can affirm that a high rate of censoring does not hinder the benefits of interpolation techniques.

6. Conclusion

In this study, we investigated the influence of interpolation techniques on the performance metrics of neural network-based survival models. Due to their expressive power, these models often achieve a high degree of generalization. However, their inherent discretization limitations can compromise their precision. To address this, we focused on the post-processing of survival functions through interpolation between anchor points, aiming to improve time-dependent metrics such as the IBS and Cumulative AUC. The empirical analyses conducted across various real-world healthcare datasets and model configurations underscored a consistent pattern: even simple interpolation methods, like linear interpolation, offer tangible improvements in these metrics. This trend is especially noticeable when the number of anchor points is orders of magnitude smaller than the dataset cardinality, which corresponds to most real-world use cases. In summary, this study underscores the potential of combining the expressiveness of neural networks with interpolation techniques to improve the accuracy of survival predictions in clinical contexts.

7. Ethical Discussion

While our study focuses on a specific mathematical question concerning the post-processing of existing, well-studied survival models, the delicate nature of risk assessment in the healthcare domain raises several ethical questions. First, at its core, survival analysis studies the probability of events over time. In the medical field, the results of survival models may influence decision-making and treatment priorities. The potential prioritization of patients based solely on statistical outcomes may lead to short-sighted decisions. Therefore, the outcomes of survival models should be used as suggestions for domain experts, who must take actions based on several real-world factors that statistical models may inevitably fail to capture.

Second, the use of patient data must undergo consent and transparency. Especially in the healthcare domain, where data are sensitive and privacy-protected, it is of utmost importance to ensure that the rights of individuals and data owners are respected. In this study, we utilized publicly available survival datasets that are commonly used to benchmark survival techniques.

In conclusion, while our focus specifically addresses a technical aspect of survival models, we recognize the broader impact of survival analysis. Our hope is that by enhancing the reliability of these models, we contribute to a more ethical and fair healthcare landscape where statistical predictions serve as one tool among many, to aid judgments of medical professionals.

Acknowledgments

This project has been supported by AI-SPRINT: AI in Secure Privacy-pReserving computINg conTinuum (European Union H2020 grant agreement No. 101016577) and FAIR: Future Artificial Intelligence Research (NextGenerationEU, PNRR-PE-AI scheme, M4C2, investment 1.3, line on Artificial Intelligence).

Figure 2: (a) IBS results. Panels: DeepSurv, DeepHit, Logistic Hazard, N-MTLR. Horizontal axis: Anchors (5, 50, 500). Legend: Step BWD, Step FWD, Linear, PWE, Spline. For the IBS, the lower the better; for the Cumulative AUC, the higher the better. Columns correspond to survival models, while rows correspond to survival datasets. Results are averaged over 30 runs.

References

[16] [...] Journal of Machine Learning Research 21 (2020) 1–6. URL: http://jmlr.org/papers/v21/20-729.html.

[17] J. M. Robins, A. Rotnitzky, Recovery of information and adjustment for dependent censoring using surrogate markers, in: AIDS epidemiology, Springer, 1992, pp. 297–331.

[18] C.-N. Yu, R. Greiner, H.-C. Lin, V. Baracos, Learning patient-specific cancer survival distributions as a sequence of dependent regressors, Advances in neural information processing systems 24 (2011).

[19] S. Fotso, Deep neural networks for survival analysis based on a multi-task framework, arXiv preprint arXiv:1801.05512 (2018).

[20] M. F. Gensheimer, B. Narasimhan, A scalable discrete-time survival model for neural networks, PeerJ 7 (2019) e6257.

[21] F. N. Fritsch, R. E. Carlson, Monotone piecewise cubic interpolation, SIAM Journal on Numerical Analysis 17 (1980) 238–246. URL: http://www.jstor.org/stable/2156610.

[22] A. Archetti, E. Lomurno, F. Lattari, A. Martin, M. Matteucci, Heterogeneous datasets for federated survival analysis simulation, in: Companion of the 2023 ACM/SPEC International Conference on Performance Engineering, ICPE '23 Companion, Association for Computing Machinery, New York, NY, USA, 2023, pp. 173–180. URL: https://doi.org/10.1145/3578245.3584935. doi:10.1145/3578245.3584935.

[23] D. W. Hosmer, S. Lemeshow, S. May, Applied Survival Analysis: Regression Modeling of Time-to-Event Data, Wiley Series in Probability and Statistics, John Wiley & Sons, Inc., Hoboken, NJ, USA, 2008. URL: http://doi.wiley.com/10.1002/9780470258019. doi:10.1002/9780470258019.

[24] M. Schumacher, G. Bastert, H. Bojar, K. Hübner, M. Olschewski, W. Sauerbrei, C. Schmoor, C. Beyerle, R. Neumann, H. Rauschecker, Randomized 2 x 2 trial evaluating hormonal treatment and the duration of chemotherapy in node-positive breast cancer patients. German Breast Cancer Study Group, Journal of Clinical Oncology 12 (1994) 2086–2093.

[25] B. Pereira, S.-F. Chin, O. M. Rueda, H.-K. M. Vollan, E. Provenzano, H. A. Bardwell, M. Pugh, L. Jones, R. Russell, S.-J. Sammut, D. W. Y. Tsui, B. Liu, S.-J. Dawson, J. Abraham, H. Northen, J. F. Peden, A. Mukherjee, G. Turashvili, A. R. Green, S. McKinney, A. Oloumi, S. Shah, N. Rosenfeld, L. Murphy, D. R. Bentley, I. O. Ellis, A. Purushotham, S. E. Pinder, A.-L. Børresen-Dale, H. M. Earl, P. D. Pharoah, M. T. Ross, S. Aparicio, C. Caldas, The somatic mutation profiles of 2,433 breast cancers refine their genomic and transcriptomic landscapes, Nature Communications 7 (2016) 11479. URL: https://www.nature.com/articles/ncomms11479. doi:10.1038/ncomms11479.

[26] J. Ogier du Terrail, S.-S. Ayed, E. Cyfers, F. Grimberg, C. He, R. Loeb, P. Mangold, T. Marchand, O. Marfoq, E. Mushtaq, B. Muzellec, C. Philippenko, S. Silva, M. Teleńczuk, S. Albarqouni, S. Avestimehr, A. Bellet, A. Dieuleveut, M. Jaggi, S. P. Karimireddy, M. Lorenzi, G. Neglia, M. Tommasi, M. Andreux, Flamby: Datasets and benchmarks for cross-silo federated learning in realistic healthcare settings, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems, volume 35, Curran Associates, Inc., 2022, pp. 5315–5334. URL: https://proceedings.neurips.cc/paper_files/paper/2022/file/232eee8ef411a0a316efa298d7be3c2b-Paper-Datasets_and_Benchmarks.pdf.

Detailed Numerical Results

This section presents all the numerical results obtained throughout our experiments. For each dataset, we list three tables, each corresponding to the IBS, Cumulative AUC, and

C-Index

metrics, respectively. In particular, the table-dataset correspondence is as follows: • WHAS500 dataset: Table 3 (IBS), Table 4 (Cumulative AUC), and Table 5 (C-Index). • •

GBSG2 dataset: Table 6 (IBS), Table 7 (Cumulative AUC), and Table 8 (C-Index).

METABRIC dataset: Table 9 (IBS), Table 10 (Cumulative AUC), and Table 11 (C-Index). • TCGA-BRCA dataset: Table 12 (IBS), Table 13 (Cumulative AUC), and Table 14 (C-Index).

A.1. WHAS500 dataset

IBS results on the WHAS500 dataset. Values are averaged over 30 runs and scaled up by a factor of 100 for better readability.

Anchors

Step BWD

Step FWD

A.2. GBSG2 dataset METABRIC dataset A.4. TCGA-BRCA dataset

PWE

[1]

Wang ,

Li ,

C. K.

Reddy , Machine learning for survival analysis: A survey, ACM Computing Surveys (CSUR) 51 ( 2019 ) 1 - 36 .

[2]

J. L.

Katzman ,

Shaham ,

Cloninger ,

Bates ,

Jiang ,

Kluger , Deepsurv: personalized treatment recommender system using a cox proportional hazards deep neural network , BMC medical research methodology 18 ( 2018 ) 1 - 12 .

[3]

Lee ,

Zame ,

Yoon , M. Van Der Schaar , Deephit: A deep learning approach to survival analysis with competing risks , in: Proceedings of the AAAI conference on artificial intelligence , volume 32 , 2018 .

[4]

Kvamme , Ø. Borgan, I. Scheel , Time-to-event prediction with neural networks and cox regression , arXiv preprint arXiv: 1907 . 00825 ( 2019 ).

[5]

Kvamme , Ø. Borgan, Continuous and discrete-time survival prediction with neural networks , Lifetime Data Analysis 27 ( 2021 ) 710 - 736 .

[6]

Wiegrebe ,

Kopper ,

Sonabend ,

Bender , Deep learning for survival analysis: A review , arXiv preprint arXiv:2305.14961 ( 2023 ).

[7]

Ishwaran , U. B. Kogalur , E. H.

Blackstone , M. S.

Lauer , Random survival forests, The annals of applied statistics 2 ( 2008 ) 841 - 860 .

[8]

Archetti ,

Matteucci , Federated Survival Forests, in: 2023 International Joint Conference on Neural Networks (IJCNN2023) , IEEE (in press) , 2023 .

[9]

Archetti ,

Ieva ,

Matteucci , Scaling survival analysis in healthcare with federated survival forests: A comparative study on heart failure and breast cancer genomics , Future Generation Computer Systems 149 ( 2023 ) 343 - 358 . URL: https://www.sciencedirect.com/ science/article/pii/S0167739X23002935. doi:https://doi.org/10.1016/j.future. 2023 . 07 .036.

[10]

Bender ,

Rügamer ,

Scheipl ,

Bischl , A general machine learning framework for survival analysis , in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases , Springer, 2021 , pp. 158 - 173 .

[11]

E. L.

Kaplan ,

Meier , Nonparametric estimation from incomplete observations , Journal of the American statistical association 53 ( 1958 ) 457 - 481 .

[12]

D. R.

Cox , Regression models and life-tables , Journal of the Royal Statistical Society. Series B (Methodological) 34 ( 1972 ) 187 - 220 . URL: http://www.jstor.org/stable/2985181.

[13]

Hothorn ,

Bühlmann ,

Dudoit ,

Molinaro , M. J. Van Der Laan , Survival ensembles, Biostatistics 7 ( 2005 ) 355 - 373 . URL: https://doi.org/10.1093/biostatistics/kxj011. doi: 10 .1093/biostatistics/ kxj011. arXiv:https://academic.oup.com/biostatistics/articlepdf/7/3/355/690076/kxj011.pdf.

[14]

Uno ,

Cai ,

M. J.

Pencina , R. B. D'Agostino , L.-J. Wei , On the c-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data , Statistics in medicine 30 ( 2011 ) 1105 - 1117 .

[15]

Graf ,

Schmoor ,

Sauerbrei , M. Schumacher, Assessment and comparison of prognostic classification schemes for survival data , Statistics in medicine 18 ( 1999 ) 2529 - 2545 .

[16]

Pölsterl , scikit -survival: A library for time-to-event analysis built on top of scikit-learn,