Novel Test for Survival Data Analysis of Cancer Patients

Dmitriy Klyushin1 and Pavel Yakovlev2

1 Taras Shevchenko National University of Kyiv, Ukraine, Akademika Glushkova Avenue 4D, Kyiv, 03680, Ukraine
2 Feofaniya Clinical Hospital, Akademika Zabolotnogo 21, Kyiv, 03143, Ukraine

Abstract
Modern medical information systems necessarily include functions for assessing the effectiveness of treatment provided to patients. As a rule, this problem is solved by calculating survival functions to estimate the risk of death. Traditionally, three nonparametric tests are used to analyze survival: the Cochran–Mantel–Haenszel log-rank test, the Wilcoxon test for censored data, and the Tarone–Ware test. In these tests, testing the statistical hypothesis that two survival functions are equivalent is, as a rule, reduced to comparing a test statistic with a critical value derived from the standard normal distribution. These tests give reliable results only if the samples are large enough and additional conditions are met. Consequently, the development of effective medical information systems that perform survival analysis requires nonparametric tests that use a minimum of preliminary assumptions and allow the use of small samples. The paper proposes a test of the hypothesis that the survival functions are equivalent, which does not depend on the sample size and requires no additional preconditions except the continuity of the distribution functions of the initial data.

Keywords
Survival analysis, risk of death, Kaplan–Meier curve, log-rank test, Wilcoxon test, Tarone–Ware test

1. Introduction
To assess the effectiveness of the treatment provided to patients and the risk of death during a given period, many cancer healthcare facilities design information systems that analyze data and assess patient survival using the Kaplan–Meier curve [1].
Three nonparametric tests are usually used in survival analysis based on the Kaplan–Meier estimator: the Cochran–Mantel–Haenszel log-rank test [2], the Wilcoxon test [3], and the Tarone–Ware test [4]. To test statistical hypotheses about the identity of survival functions, these tests mainly compare a test statistic with critical values derived from the standard normal distribution. The most popular is the log-rank test, which has maximum power under proportional-hazards alternatives [5]. However, these tests give reliable results only if the samples are large enough and additional conditions are met. For example, the Wilcoxon test is preferable when deaths at early time points are given more weight [6], and the Tarone–Ware test also places heavier weight on hazards at early times [7].

CITRisk’2021: 2nd International Workshop on Computational & Information Technologies for Risk-Informed Systems, September 16–17, 2021, Kherson, Ukraine
EMAIL: dokmed5@gmail.com (D.Klyushin); pavel_3@hotmail.com (P.Yakovlev)
ORCID: 0000-0003-4554-1049 (D.Klyushin); 0000-0002-1767-3231 (P.Yakovlev)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

The nonparametric Kaplan–Meier estimate is built from the survival times of patients, i.e. the time intervals between a certain date (for example, the date of surgery) and the moment of death or censoring. It allows the construction of survival functions from data on the survival times of patients and the estimation of the risk of death during a given time period. Similarly, it can be used to estimate the time to equipment failure or another significant event. Thus, it can be used to assess the risk of a specific event (death, failure, etc.) based on censored and uncensored observations.
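For readers less familiar with the Kaplan–Meier estimate described above, the following is a minimal sketch of the standard product-limit computation. The function name kaplan_meier and the toy data are hypothetical illustrations and are not taken from the study; the sketch assumes right-censored data coded as 1 (event observed) and 0 (censored).

```python
from collections import Counter


def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times  -- observed times (e.g. days from surgery)
    events -- 1 if the event (death) was observed, 0 if censored
    Returns a list of (time, survival probability) pairs, one per event time.
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    leaving = Counter(times)  # every subject leaves the risk set at its own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d > 0:
            # multiply by the conditional probability of surviving past t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Hypothetical toy data (not the patients analysed in this paper)
times = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 4))
```

Censored observations reduce the number of subjects at risk but do not produce a step in the curve, which is why the estimate changes only at observed event times.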
The aim of this paper is to describe an alternative nonparametric test that uses no assumptions except the most general one (continuity of the distribution) and allows the use of small samples (size less than 50). This test uses the p-statistics investigated in [8–11] and is based on Hill's assumption A(n) [12]. The theoretical background of the p-statistics was developed by Matveichuk and Petunin [8, 9] and later by Johnson and Kotz [10] and Klyushin and Petunin [11]. The high sensitivity and specificity of the nonparametric test for homogeneity of two samples based on the p-statistics was demonstrated in [11]. Here we propose a new application of this test: the comparison of two survival curves.

2. Theoretical background
Consider samples $x = (x_1, x_2, \ldots, x_n) \in G_1$ and $y = (y_1, y_2, \ldots, y_n) \in G_2$ from absolutely continuous distributions $F_1$ and $F_2$. Hill's assumption $A(n)$ [12] states that for exchangeable random values $x_1, x_2, \ldots, x_n \in G$ following an absolutely continuous distribution function

$$P\left(x \in \left(x_{(i)}, x_{(j)}\right)\right) = \frac{j-i}{n+1}, \quad i < j, \tag{1}$$

where $x_{(i)}$ and $x_{(j)}$ are the i-th and j-th order statistics and $x$ is a further sample value from $G$. Find the relative frequency $h_{ij}$ of the event $y_m \in \left(x_{(i)}, x_{(j)}\right)$ for the elements of $y$ and estimate the deviation of $h_{ij}$ from the expected probability $\frac{j-i}{n+1}$ using the Wilson confidence interval $I_{ij}^{(n)} = \left(p_{ij}^{(1)}, p_{ij}^{(2)}\right)$, where

$$p_{ij}^{(1)} = \frac{h_{ij} n + g^2/2 - g\sqrt{h_{ij}(1-h_{ij})n + g^2/4}}{n + g^2}, \qquad p_{ij}^{(2)} = \frac{h_{ij} n + g^2/2 + g\sqrt{h_{ij}(1-h_{ij})n + g^2/4}}{n + g^2}. \tag{2}$$

The significance level of this interval is a function of $g$. When $g = 3$, the significance level of $I_{ij}^{(n)}$ does not exceed 0.05 [11]. The p-statistics, which estimates the homogeneity of the samples $x$ and $y$, is

$$h = \frac{\#\left\{p_{ij} = \frac{j-i}{n+1} \in I_{ij}^{(n)}\right\}}{n(n-1)/2}. \tag{3}$$

It is the relative frequency of the event $\left\{p_{ij} = \frac{j-i}{n+1} \in I_{ij}^{(n)}\right\}$.
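The computation defined by formulas (1)–(3) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names wilson_interval and p_statistic and the toy samples are hypothetical, the number of trials in formula (2) is taken as the size of y (which equals n when both samples have the same size, as in the text), and the interval is treated as open.

```python
import math


def wilson_interval(h, n, g=3.0):
    """Bounds (p1, p2) of formula (2) for relative frequency h observed in n trials."""
    center = h * n + g * g / 2.0
    half_width = g * math.sqrt(h * (1.0 - h) * n + g * g / 4.0)
    denom = n + g * g
    return (center - half_width) / denom, (center + half_width) / denom


def p_statistic(x, y, g=3.0):
    """Formula (3): share of pairs i < j with (j - i)/(n + 1) inside the Wilson interval."""
    xs = sorted(x)
    n = len(xs)
    m = len(y)  # assumption: trials are the elements of y (m == n in the paper)
    hits = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            # relative frequency of y-values falling into (x_(i), x_(j))
            h_ij = sum(xs[i] < v < xs[j] for v in y) / m
            p1, p2 = wilson_interval(h_ij, m, g)
            p_ij = (j - i) / (n + 1)  # probability given by formula (1)
            if p1 < p_ij < p2:
                hits += 1
    return hits / (n * (n - 1) / 2)


# Hypothetical toy samples (no ties between x and y)
x = list(range(1, 11))
y_close = [v + 0.5 for v in x]    # interleaved with x
y_shifted = [v + 100 for v in x]  # far to the right of x
h_close = p_statistic(x, y_close)
h_shifted = p_statistic(x, y_shifted)
print(h_close, h_shifted)
```

The double loop visits all n(n−1)/2 pairs of order statistics and scans y for each pair, so this direct version needs on the order of n³ operations; it favours readability over speed.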
Therefore, using (2) and (3) we may construct the Wilson interval $I$ for the p-statistics and formulate the following test: the null hypothesis of identity of the survival functions is accepted if the upper bound of $I$ is greater than 0.95; otherwise it is rejected.

If the null hypothesis is true, the events $\left\{p_{ij} = \frac{j-i}{n+1} \in I_{ij}^{(n)}\right\}$ form a generalized Bernoulli scheme [8, 9]. If the null hypothesis is false, they form a modified Bernoulli scheme. If the null hypothesis may be either true or false, they form the Matveichuk–Petunin scheme [10]. If the null hypothesis is true, $\lim_{n \to \infty} \frac{j-i}{n+1} \in (0, 1)$, and $\lim_{n \to \infty} \frac{i}{n+1} \in (0, 1)$, then the asymptotic significance level $\beta$ of a sequence of confidence intervals $I_{ij}^{(n)}$ is less than 0.05 [11].

3. Experiments and results
To confirm the high sensitivity and specificity of the proposed test, we considered two groups of patients with a nondifferential diagnosis of bladder cancer of stages T2 and T3 who received special surgical care (radical and salvage cystectomy) in 1998–2016 at the urology department of the Kiev City Clinical Oncological Dispensary. The analysis included patients who had a complete history and an accurate (uncensored) survival result. The extent of the malignant process was characterized according to the TNM clinical classification, 7th ed. (2010). The first group (stage T2) consists of 38 patients: 22 patients underwent radical cystectomy (17 died and 5 are alive) and 16 underwent salvage cystectomy (7 died and 9 are alive). The second group (stage T3) consists of 51 patients: 33 patients underwent radical cystectomy (24 died and 9 are alive) and 18 underwent salvage cystectomy (10 died and 8 are alive). The survival curves for the first and second groups are shown in Fig. 1 and Fig. 2.
Here the mark 1 denotes radical cystectomy and 0 denotes salvage cystectomy. Tables 1–4 contain the mean survival times and the results of testing the identity of the survival curves using four tests: log-rank, Wilcoxon, Tarone–Ware, and p-statistics.

Figure 1: Survival curves in the first group of patients (stage T2). Horizontal axis: survival time, days (0 to 4000); vertical axis: 0 to 1; curves: 0 and 1.

As we see, in the first group (stage T2) the survival curve of the patients who underwent radical cystectomy goes above the survival curve of the patients who underwent salvage cystectomy. Therefore, intuitively, the risk of death for the former patients is less than for the latter ones and the salvage cystectomy prolongs life of patients better than the radical cystectomy. However, this hypothesis must be rigorously tested using statistical tests. Traditionally, the log-rank test, the Wilcoxon test, and the Tarone–Ware test are used to estimate the significance of the deviation between two survival curves. Their p-values serve as the decision criterion of these tests.

Figure 2: Survival curves in the second group of patients (stage T3). Horizontal axis: survival time, days (0 to 2500); vertical axis: 0 to 1; curves: 0 and 1.

In the second group (stage T3) the survival curve of the patients who underwent radical cystectomy also goes above the survival curve of the patients who underwent salvage cystectomy. We may again suppose that the risk of death for the former patients is less than for the latter ones. Note that since stage T3 is more advanced than T2, the survival interval became much shorter. The maximum survival time in the first group is about 4000 days (almost 11 years), but in the second group it is about 2500 days (almost 7 years). Thus, the effectiveness of the cystectomy in this group is offset by the stage of the tumors.
To estimate the significance of the deviation between the two survival curves we again used the log-rank test, the Wilcoxon test, and the Tarone–Ware test and their p-values. In both cases we complemented the traditional analysis by computing the p-statistics as an alternative to the three tests above. Descriptive statistics of the data are provided in Tables 1 and 3.

Table 1
Mean survival time in the first group (stage T2)

Cystectomy   Mean survival time (days)   Standard deviation   Lower bound (95%)   Upper bound (95%)
Radical      1015.720                    202.769              618.300             1413.141
Salvage      1647.688                    309.949              1040.198            2255.177

Table 2
Results of survival analysis in the first group of patients (stage T2) at significance level 0.05

Test           Observed value   Critical value   p-value
Log-rank       3.239            3.841            0.072
Wilcoxon       2.533            3.841            0.111
Tarone–Ware    2.893            3.841            0.089
P-statistics   0.997            0.950            0.003

Table 3
Mean survival time in the second group (stage T3)

Cystectomy   Mean survival time (days)   Standard deviation   Lower bound (95%)   Upper bound (95%)
Radical      1015.720                    202.769              618.300             1413.141
Salvage      1647.688                    309.949              1040.198            2255.177

Table 4 contains the observed values, critical values and p-values of the log-rank test, the Wilcoxon test, the Tarone–Ware test, and the p-statistics.

Table 4
Results of survival analysis in the second group of patients (stage T3) at significance level 0.05

Test           Observed value   Critical value   p-value
Log-rank       1.718            3.841            0.190
Wilcoxon       2.083            3.841            0.149
Tarone–Ware    2.046            3.841            0.153
P-statistics   0.981            0.950            0.019

The hypothesis of the identity of the two survival functions (0: salvage cystectomy, 1: radical cystectomy) in the first and second groups (stages T2 and T3, respectively) was tested using four tests at a significance level of 0.05. In all cases, no statistically significant differences between the survival curves were found, since the observed values of the three traditional tests did not exceed the critical value and the upper confidence bound for the p-statistics exceeded 0.95.
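The tabulated p-values of the three traditional tests can be cross-checked with a short computation. The sketch below assumes that the observed values in Tables 2 and 4 are chi-square statistics with one degree of freedom; this is an assumption, suggested by the critical value 3.841 = 1.96², which is the 0.95 quantile of that distribution. The helper name chi2_sf_1df is hypothetical. For the p-statistics row, the tabulated p-value appears to equal one minus the observed value.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of the chi-square distribution with 1 degree of freedom.

    Uses the identity P(Z^2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))


# Observed values copied from Tables 2 (stage T2) and 4 (stage T3)
observed = {
    "T2": {"Log-rank": 3.239, "Wilcoxon": 2.533, "Tarone-Ware": 2.893},
    "T3": {"Log-rank": 1.718, "Wilcoxon": 2.083, "Tarone-Ware": 2.046},
}

p_values = {
    group: {test: chi2_sf_1df(value) for test, value in tests.items()}
    for group, tests in observed.items()
}

for group, tests in p_values.items():
    for test, p in tests.items():
        print(group, test, round(p, 3))

# P-statistics rows: assumed relation p-value = 1 - observed value
p_stat_p_values = {"T2": 1 - 0.997, "T3": 1 - 0.981}
```

Under this assumption the recomputed values agree with the tabulated p-values to three decimal places, and all of them exceed 0.05.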
The log-rank test, the Wilcoxon test and the Tarone–Ware test accept the null hypothesis if the corresponding p-values are greater than 0.05, whereas the test based on the p-statistics, on the contrary, accepts the null hypothesis if its p-value is less than 0.05, i.e. if the observed value exceeds 0.95. It is noteworthy that the observed p-value (the probability of rejecting the null hypothesis, provided that it is true) in the p-statistics test is an order of magnitude smaller than in the three traditional nonparametric tests used in survival analysis. This is evidence of the high sensitivity and specificity of the proposed test.

4. Conclusions
The mathematical basis of modern medical information systems for assessing the effectiveness of treatment and the risk of death during a given time period must be more rigorously justified. Traditional nonparametric tests used in survival analysis (the log-rank test, the Wilcoxon test, and the Tarone–Ware test) assume conditions that are not always met in practice. These tests reduce the verification of statistical hypotheses about the equivalence of survival functions to a comparison with a critical value derived from the standard normal distribution. This is justified only when the samples are large enough and additional conditions are met. Thus, to develop an effective medical information system for survival analysis, we need nonparametric tests with minimal preliminary assumptions and minimal requirements on the sample size. In this paper, we described a test of the hypothesis of the equivalence of the survival functions and of the risk of death during a given time period, which does not depend on the sample size and requires no additional preconditions except that the samples have no ties. We have provided a strong mathematical background and demonstrated the high sensitivity and specificity of testing the homogeneity of two random samples from continuous distributions in comparison with three traditional tests.
We have shown the practical application of this test in the survival analysis of patients with bladder cancer and demonstrated its high performance. This test may be used for the development of effective medical information systems that perform survival analysis of cancer patients. Note that the scheme described in the paper is easily extended to a much wider spectrum of problems connected with assessing the risk of device failure or of another significant event based on censored and uncensored observations. Future work will be directed at reducing the computational complexity of the proposed test and extending it to various problems of risk assessment.

References
[1] M.Morris, S.Landon, I.Reguilon, J.Butler, M.McKee, E.Nolte, Understanding the link between health systems and cancer survival: A novel methodological approach using a system-level conceptual model, Journal of Cancer Policy, 25, 202, 100233. doi: 10.1111/codi.15622
[2] J.M.Bland, D.G.Altman, The logrank test, British Medical Journal, 328, 2004, 1073. doi: 10.1136/bmj.328.7447.1073
[3] M.A.Proschan, L.E.Dodd, Re-randomization tests in clinical trials, Statistics in Medicine, 38, 2019, pp. 2292–2302. doi: 10.1002/sim.8093
[4] R.E.Tarone, J.Ware, On distribution-free tests for equality of survival distributions, Biometrika, 64, 1977, pp. 156–160. doi: 10.1093/biomet/64.1.156
[5] T.G.Karrison, Versatile tests for comparing survival curves based on weighted log-rank statistics, The Stata Journal, 16, 2016, pp. 678–690
[6] A.Hazra, N.Gogtay, Biostatistics Series Module 9: Survival Analysis, Indian Journal of Dermatology, 62, 2017, pp. 251–257. doi: 10.4103/ijd.IJD_201_17
[7] P.G.Karadeniz, I.Ercan, Examining tests for comparing survival curves with right censored data, Statistics in Transition New Series, 18, 2017, pp. 311–328. doi: 10.21307/stattrans-2016-072
[8] S.A.Matveichuk, Yu.I.Petunin, Generalization of Bernoulli schemes that arise in order statistics, I. Ukrainian Mathematical Journal, 42, 1990, pp. 459–466. doi: 10.1007/BF01058940
[9] S.A.Matveichuk, Yu.I.Petunin, Generalization of Bernoulli schemes that arise in order statistics, II. Ukrainian Mathematical Journal, 43, 1991, pp. 728–734. doi: 10.1007/BF01058940
[10] N.Johnson, S.Kotz, Some generalizations of Bernoulli and Polya-Eggenberger contagion models, Statistical Papers, 32, 1991, pp. 1–17. doi: 10.1007/BF02925473
[11] D.A.Klyushin, Yu.I.Petunin, A Nonparametric Test for the Equivalence of Populations Based on a Measure of Proximity of Samples, Ukrainian Mathematical Journal, 55, 2003, pp. 181–198. doi: 10.1023/A:1025495727612
[12] B.M.Hill, Posterior distribution of percentiles: Bayes’ theorem for sampling from a population, Journal of the American Statistical Association, 63, 1968, pp. 677–691