Nonparametric method for estimation of forecasting models using small samples

Dmitriy Klyushin
Taras Shevchenko National University of Kyiv, prospect Glushkova 4D, Kyiv, 03680, Ukraine

Abstract
The coronavirus epidemic has stimulated a surge of research in the field of forecasting the epidemic curve based on various mathematical models. To predict the time series of the number of patients, different models are used, both differential and machine learning models. Differential models for predicting the epidemic curve depend on a number of unpredictable factors, which often results in inaccurate predictions. In contrast, machine learning models that predict time series based on training samples show higher reliability. In both cases, the problem arises of testing the hypothesis about the homogeneity of errors on training samples. The paper describes the application of the Klyushin–Petunin test for the homogeneity of two samples and compares its effectiveness with the widely used Wilcoxon test and Diebold–Mariano test using the example of three methods for predicting the COVID-19 epidemic curve based on data on the number of cases in a certain period in Germany, Japan, South Korea and Ukraine. The efficiency and usefulness of the proposed nonparametric approach is demonstrated.

Keywords
COVID-19, Machine Learning, Time Series, Forecasting, Nonparametric Test, Wilcoxon Signed-Rank Test, Diebold–Mariano Test, Sample Homogeneity, Hill Assumption.

1. Introduction
The rapid and difficult-to-predict spread of the coronavirus requires accurate and reliable forecasting of the epidemic curve. Classical differential models (SIR, SEIR, etc.) depend on many fuzzy, unpredictable factors. This limits the use of differential models. As a rule, they provide short-term and scenario forecasts, allowing one to consider alternative scenarios for the development of events in the near future.
This complicates the planning of measures to combat the coronavirus and requires the use of other approaches. In contrast, machine learning methods have demonstrated good efficiency. They do not rely on preliminary assumptions about the parameters of the epidemic curve, using only training samples. When using different forecasting models of the epidemic curve, we must compare the distributions of forecasting errors to select the best model. This paper describes a nonparametric approach to estimating errors in time series forecasting models using epidemic curve data from Germany, Japan, South Korea, and Ukraine as examples. Section 1 substantiates the relevance of the problem under consideration. Section 2 describes the COVID-19 epidemic curve prediction models. Section 3 describes common accuracy measures of time series forecasting and tests for their evaluation. Section 4 describes the Klyushin–Petunin test for samples without ties. Section 5 describes the Klyushin–Petunin test for samples with ties. Section 6 demonstrates the application of nonparametric methods using data from Germany, Japan, South Korea, and Ukraine. Section 7 contains conclusions and the formulation of open questions. Section 8 contains the references.

MoMLeT+DS 2022: 4th International Workshop on Modern Machine Learning Technologies and Data Science, November 25-26, 2022, Leiden-Lviv, The Netherlands-Ukraine.
EMAIL: dmytroklyushin@knu.ua
ORCID: 0000-0003-4554-1049
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

2. Prognostic models of the epidemic curves
The predictive models used in forecasting the epidemic curve must be accurate and reliable [1], as decisions based on them have far-reaching implications. In epidemiology, the differential compartmental models SIR and SEIR and their modifications are widely used.
The model names reflect the compartments included in them: susceptible (S), infected (I), recovered (R), and exposed (E). They strongly depend on the parameters of transition from one compartment to another. SIR and SEIR models, as well as their modifications, are widely used to predict the COVID-19 epidemic curve [2–9]. They have a rigorous mathematical justification, but they depend on parameters that are difficult to estimate accurately. As a result, they are theoretically accurate but practically very approximate models. Against this backdrop, methods of machine learning based on training samples often produce more precise predictions [10]. Prediction of the epidemic curve of COVID-19 in different countries is traditionally based on the use of SIR and SEIR models. In [2], the SIR model was applied to data from India. Similar work was done in [3] for India based on a logistic curve and SIR models; the authors found that the logistic curve model is more precise than the SIR model. In the paper of Guirao [4], the SIR and SEIR models were simplified to reduce their dependence on weakly defined parameters. Ifguis et al. [5] refined the SIR model for Moroccan epidemic data. They compared the accuracy of the SIR and SEIR models, preferring the first. All the models used in these works rely on the accuracy of the preliminary assumptions; if these assumptions are unreliable, the forecast becomes inaccurate. Some researchers have attempted to refine compartmental models using statistical methods. In the paper of Nesteruk [6], the optimal parameters for the SIR model are determined, the exact solution of the problem is obtained, and a short-term forecast of the epidemic curve in China is given. However, the precision of this model also depends on the reliability of the original data. In [7] and [8], the researchers proposed a modification of the SEIR model by adding environmental indicators to it.
In [8] the authors tested the accuracy of this model on the example of China, Italy, South Korea and Iran. As in all other works using compartmental models, the authors point out that the models often depend on parameters that are difficult to measure accurately. In [9], the authors applied these models to predict infections in China, taking into account population mobility, latent infection rates, and virus infectivity. Due to the unreliability of the assumptions underlying the compartmental models, machine learning methods have been proposed as an alternative. In the article [11], Swapnarekha et al. compared statistical models, as well as machine and deep learning methods, using COVID-19 epidemic curve prediction as an example. The authors showed that the alternative methods they propose provide an accurate prediction of the epidemic. Sujath et al. [12] used linear regression, a multilayer perceptron and vector autoregression to predict the epidemic in India. It turned out that the multilayer perceptron provides more accurate predictions than linear regression and vector autoregression. Appadu et al. [13] applied Euler's iterative method and cubic spline interpolation to forecast the total number of infected people and the number of COVID-19 incidents in Germany, India, South Africa and South Korea. They compared the accuracy measures of the two forecasts using the relative error. In the paper [14], Tuli et al. developed a cloud computing platform for their own mathematical model and tried to accurately predict the epidemic in real time. Distante et al. [15] simulated the COVID-19 outbreak in different regions of Italy using a deep convolutional autoencoder and used the SEIR model to predict peaks in the epidemic curve, using data collected in China for training. It turned out that these data allow for accurate predictions. One of the most popular machine learning methods used to forecast the COVID-19 epidemic curve is the artificial neural network.
Using recurrent neural networks, Kolozsvári et al. [16] predicted the COVID-19 epidemic curve and demonstrated the precision of their approach. To predict the epidemic curve, Ibrahim et al. [17] proposed the Variational LSTM-Autoencoder model, which includes epidemic and demographic data. The resulting model proved to be accurate both in the short and in the long term. Hu et al. [18] used artificial intelligence techniques to model the spread of the coronavirus. According to the authors, the models they propose give more precise predictions than standard algorithms. Guo and He [19] proposed an artificial neural network that accurately predicts infection and mortality rates for COVID-19. Balli [20] explored the application of linear regression, the random forest method, the multilayer perceptron, and the support vector machine. It was shown that the support vector machine model is the most accurate. Kafieh et al. [21] forecast the epidemic curves in China, Germany, Iran, Italy, Japan, South Korea, Spain, Switzerland, and the United States. They used the multilayer perceptron, the random forest method, recurrent neural networks, and the long short-term memory network. The last of these models turned out to be the most accurate. Rustam et al. [22] used linear regression, the LASSO method, the support vector machine, and exponential smoothing to forecast the epidemic curve. The accuracy of all the models used is high and comparable. As we can see, various models are used to forecast the COVID-19 epidemic curve. Given this diversity, it is necessary to develop effective tools to compare their precision. It should be emphasized that a simple comparison of average precision rates is unreliable because it does not take into account their statistical nature. Traditionally, to assess the statistical validity of estimates of the precision of predictive models, the nonparametric Wilcoxon test [23, 24] and the parametric Diebold–Mariano test [25, 26] are used.
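To make the comparison concrete, the following sketch implements the basic Diebold–Mariano statistic in its simplest one-step-ahead form: the mean loss differential divided by its standard error, compared with a standard normal. The function name is illustrative and the sketch omits the HAC (autocorrelation) correction of the full test, so it is a minimal illustration rather than the exact procedure used in the experiments below.

```python
import math

def diebold_mariano(loss_a, loss_b):
    """Naive Diebold-Mariano statistic for equal predictive accuracy.

    loss_a, loss_b: per-period losses (e.g. squared errors) of two models.
    Returns (DM statistic, two-sided p-value from the N(0, 1) approximation).
    Sketch for one-step-ahead forecasts: no autocorrelation (HAC) correction.
    """
    d = [a - b for a, b in zip(loss_a, loss_b)]  # loss differential series
    t = len(d)
    mean = sum(d) / t
    var = sum((x - mean) ** 2 for x in d) / (t - 1)  # sample variance of d
    dm = mean / math.sqrt(var / t)
    p = math.erfc(abs(dm) / math.sqrt(2.0))  # two-sided normal tail probability
    return dm, p
```

Under the null hypothesis of equal accuracy the statistic is approximately standard normal, so |DM| > 1.96 rejects at the 5% level; the sign shows which model has the larger average loss.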
In this work, we propose to use an original nonparametric test [27] to evaluate predictive models and demonstrate its effectiveness using three forecasting methods (the random forest method, the k-nearest neighbor method, and gradient boosting) for four countries: Germany, Japan, South Korea, and Ukraine.

3. Error measures
One of the most difficult forecasting tasks is selecting the most precise method. Model accuracy is usually assessed using the standard error, the mean absolute deviation, and the mean absolute percentage error. However, before comparing mean accuracy scores, it is necessary to test the hypothesis of a statistically significant difference between their distributions. If the accuracy scores have the same distribution, the comparison of their means becomes unreliable, since the difference between the mean scores could be explained by chance. Generally, model errors are assumed to be uniform, stationary, and unbiased. For example, errors are often considered to have a Gaussian distribution. In any case, it is necessary to test whether the accuracy estimates of the models have the same distribution. Traditionally, the nonparametric Wilcoxon test and the parametric Diebold–Mariano test are used to test this hypothesis. The null hypothesis states that the mean value of the difference between two samples of errors equals zero. As shown by numerical experiments, the Wilcoxon and Diebold–Mariano tests recognize a location shift with a high level of confidence, but cannot recognize different variances at the same location with the same accuracy. A solution to this problem is the test proposed in [27]. This test recognizes both a shift in the mean and a change in the variance with a high level of confidence. Since the samples containing information on forecasting errors are often small (in practice there are samples of size 8 and even 6), we use resampling to increase the precision of the statistical estimations.
In particular, we use the bootstrap technique [28]. As is well known, this technique generates resamples that can contain duplicates (ties). That is why, to test the homogeneity of two samples of errors, we need two versions of the Klyushin–Petunin test: without ties and with ties.

4. Nonparametric test for homogeneity of samples without ties
Let samples $u = (u_1, u_2, \ldots, u_n)$ and $v = (v_1, v_2, \ldots, v_m)$ obey absolutely continuous distribution functions $F_1$ and $F_2$, respectively. The null hypothesis on homogeneity is $F_1 = F_2$ and the opposite hypothesis is $F_1 \ne F_2$. According to Hill's assumption $A(n)$ [29], if random values $u_1, u_2, \ldots, u_n$ obey an exchangeable and absolutely continuous distribution function $F$, then

$$P\left(u \in \left(u_{(i)}, u_{(j)}\right)\right) = \frac{j - i}{n + 1}, \quad (1)$$

where $j > i$, $u$ is a sample value obeying $F$, and $u_{(i)}$ and $u_{(j)}$ are the $i$-th and $j$-th order statistics.

Suppose that the null hypothesis is true and construct the variational series $u_{(1)}, u_{(2)}, \ldots, u_{(n)}$. Let $A_{ij}^{(k)} = \left\{v_k \in \left(u_{(i)}, u_{(j)}\right)\right\}$. By Hill's assumption, if $j > i$ then $P\left(v_k \in \left(u_{(i)}, u_{(j)}\right)\right) = p_{ij} = \frac{j - i}{n + 1}$.

Construct the Wilson confidence interval for the probability of $A_{ij}^{(k)}$:

$$p_{ij}^{(1)} = \frac{h_{ij}^{(n,k)} n + \frac{g^2}{2} - g \sqrt{h_{ij}^{(n,k)} \left(1 - h_{ij}^{(n,k)}\right) n + \frac{g^2}{4}}}{n + g^2}, \qquad p_{ij}^{(2)} = \frac{h_{ij}^{(n,k)} n + \frac{g^2}{2} + g \sqrt{h_{ij}^{(n,k)} \left(1 - h_{ij}^{(n,k)}\right) n + \frac{g^2}{4}}}{n + g^2}, \quad (2)$$

where $h_{ij}^{(n,k)} = \frac{\# A_{ij}^{(k)}}{n}$ is the relative frequency of the event $A_{ij}^{(k)}$. Then, form the confidence interval $I_{ij}^{(n)} = \left(p_{ij}^{(1)}, p_{ij}^{(2)}\right)$ whose significance level depends on $g$. If $g$ equals 3, then the significance level of $I_{ij}^{(n)}$ is not greater than 0.05 [27].

Let $B = \left\{p_{ij} = \frac{j - i}{n + 1} \in I_{ij}^{(n)}\right\}$. Put $N = \frac{n(n - 1)}{2}$ and find $L = \# B$. Then, $h^{(n)} = \frac{L}{N} = \frac{2L}{n(n - 1)}$ is a measure of homogeneity of the samples $u$ and $v$ (called the p-statistics). Repeating the procedure of computing the Wilson confidence interval for the probability of $B$, we obtain the interval $I^{(n)} = (p_1, p_2)$.
Therefore, we may formulate the decision rule with a significance level not greater than 0.05: if $I^{(n)}$ covers 0.95, then $F_1 = F_2$; else $F_1 \ne F_2$.

5. Nonparametric test for homogeneity of samples with ties
Let us extend the test described above to samples with ties. Let the sample $u = (u_1, u_2, \ldots, u_n)$ obey a distribution $F$. Denote the multiplicity of a tie $u_k$ in $u$ by $t(u_k)$. A sample $u$ without ties we shall call hypothetical, and a sample $u$ with ties we shall call empirical. If the distribution $F$ is differentiable and Lipschitz-continuous, i.e. $|F(x) - F(y)| \le K |x - y|$, the sample value $u^*$ is independent of $u$, and the order statistic $u_{(k)}$ of the empirical sample $u$ is a tie with multiplicity $t(u_k)$, then

$$p\left(u^* \in \left(u_{(k)}, u_{(k+1)}\right)\right) = \frac{\tau\left(u_{(k)}\right) + 1}{n + 1}. \quad (4)$$

Therefore, if $p\left(u^* \in \left(u_{(i)}, u_{(j)}\right)\right)$, $i < j$, and $1 \le i, j \le m$, then

$$p_{ij} = p\left(A_{ij}\right) = p\left(u^* \in \left(u_{(i)}, u_{(j)}\right)\right) = \frac{j - i + \tau_i + \tau_{i+1} + \ldots + \tau_{j-1}}{n + 1}, \quad (5)$$

where $\tau_l = \tau\left(u_{(l)}\right)$ and $A_{ij} = \left\{u^* \in \left(u_{(i)}, u_{(j)}\right)\right\}$. If $u_{(l)}$, $i \le l \le j - 1$, is not a tie, then $\tau_l = 0$ and (5) is reduced to (1).

Denote by $H$ the null hypothesis on the homogeneity of the absolutely continuous distribution functions $F_1$ and $F_2$. Let $u$ be a sample with ties. If $F_1 = F_2$, then the probability of $A_{ij}^{(k)} = \left\{u_k \in \left(u_{(i)}, u_{(j)}\right)\right\}$ equals (5). Compute the Wilson confidence interval $I_{ij}^{(n,m)} = \left(p_{ij}^{(1)}, p_{ij}^{(2)}\right)$ for $p\left(A_{ij}^{(k)}\right)$. Then

$$h^{(n,m)} = \frac{2 \, \# \left\{p_{ij} \in I_{ij}^{(n,m)}\right\}}{n(n - 1)}$$

is a homogeneity measure between empirical samples (called the empirical p-statistics).

6. Forecasting of the COVID-19 epidemic curves
To demonstrate the usefulness of the proposed method, we consider models for predicting the COVID-19 epidemic curve based on data on the number of cases in a certain period in Germany, Japan, South Korea and Ukraine [10].
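Before turning to the country data, the test of Section 4 can be sketched in code. This is a minimal reading under stated assumptions: $g = 3$, all pairs $i < j$, and the Wilson interval built from the relative frequency of the second sample falling strictly inside $(u_{(i)}, u_{(j)})$; function names are illustrative, not the authors' reference implementation.

```python
import math

def wilson_interval(freq, n, g=3.0):
    # Wilson confidence bounds for an observed relative frequency, eq. (2)
    denom = n + g * g
    center = freq * n + g * g / 2.0
    half = g * math.sqrt(freq * (1.0 - freq) * n + g * g / 4.0)
    return (center - half) / denom, (center + half) / denom

def p_statistic(u, v, g=3.0):
    # p-statistic h^(n): share of pairs i < j for which Hill's probability
    # (j - i)/(n + 1) falls inside the Wilson interval built from the
    # fraction of v-values lying strictly inside (u_(i), u_(j))
    us = sorted(u)
    n, m = len(us), len(v)
    hits, total = 0, 0
    for i in range(n):           # 0-based; order statistic index is i + 1
        for j in range(i + 1, n):
            total += 1
            freq = sum(us[i] < x < us[j] for x in v) / m
            lo, hi = wilson_interval(freq, m, g)
            if lo <= (j - i) / (n + 1.0) <= hi:
                hits += 1
    return hits / total
```

For two samples drawn from the same distribution the statistic stays near 1 (most Hill probabilities are covered), while a strong location or scale difference pushes it down, which is the basis of the decision rule above.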
As the authors indicated in [10], the selection of these countries was motivated by the different character of pandemic dynamics in these countries and the different anti-pandemic decisions made by their governments. This choice should mitigate the bias due to these factors. Note that this is not the only possible selection of countries. For example, Papastefanopoulos et al. [30] compared six time series methods for forecasting the percentage of active cases per population in ten other countries using a different criterion: the greatest number of total confirmed cases. Similar case studies were made, for example, in [31]. However, the choice of countries and forecasting methods is not crucial here. The main point is that we analyze the error distributions of forecasting models applied to the same datasets.

Figure 1: Absolute errors of forecasting models for Germany

Figure 1 demonstrates that the absolute errors of the Gradient Boosting Model vary about zero. The errors of the K-Nearest Neighbor Model are also quite small. The Random Forest Model is the worst model. Therefore, we could just compare these graphs and make a decision. However, to complete the analysis of the precision, we must compare not the errors but the distributions of the errors. To do this we may use two-sample tests, for example, the Wilcoxon signed-rank test, the Diebold–Mariano test and similar tests. But different tests have different features: for example, the Wilcoxon signed-rank test and the Diebold–Mariano test are good for comparing samples with different means, but they fail when samples have different variances and the same location. To solve this problem we can use the Klyushin–Petunin test [27]. The Wilcoxon signed-rank test, the Diebold–Mariano test and many other tests suppose that samples are homogeneous if they have the same average value (the hypothesis about the location shift).
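The location-shift limitation can be seen in a small sketch. The function below computes a normal-approximation Wilcoxon signed-rank z-statistic (a simplified illustration without tie correction, not the exact published test): it barely reacts to paired samples with equal location but very different variances, yet reacts strongly to a pure shift. All sample values are illustrative.

```python
import math

def signed_rank_z(x, y):
    # normal-approximation z-statistic of the Wilcoxon signed-rank test;
    # zero differences are dropped, ties in |d| are ranked in input order
    d = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(d)), key=lambda k: abs(d[k]))
    w = sum(r if d[k] > 0 else -r for r, k in enumerate(order, start=1))
    n = len(d)
    return w / math.sqrt(n * (n + 1) * (2 * n + 1) / 6.0)

# same location, very different scale: the statistic stays near zero
wide = [-8.0, -6.0, -4.0, -2.0, 2.0, 4.0, 6.0, 8.0]
narrow = [-0.8, -0.6, -0.4, -0.2, 0.2, 0.4, 0.6, 0.8]
z_scale = signed_rank_z(wide, narrow)

# pure location shift of the same sample: the statistic exceeds 1.96
shifted = [v - 5.0 for v in wide]
z_shift = signed_rank_z(wide, shifted)
```

Here `z_scale` is far below the 1.96 rejection threshold although the two samples are visibly different, while `z_shift` exceeds it, which is exactly the asymmetry discussed above.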
Figure 2: Absolute errors of forecasting models for Japan

As we see in Figure 2, the absolute errors of the K-Nearest Neighbor Model are the smallest. The errors of the Gradient Boosting Model vary about zero, as in the previous case. The errors of the Random Forest Model are the largest. It is quite interesting that in the case of Japan the Random Forest Model is again the most imprecise, but the Gradient Boosting Model and the K-Nearest Neighbor Model have changed places. Nevertheless, analyzing the graphs in Figure 2, we could suppose that the distributions of the errors of these models are equivalent.

Figure 3: Absolute errors of forecasting models for South Korea

The graphs of the absolute errors of the Random Forest Model, the K-Nearest Neighbor Model, and the Gradient Boosting Model for South Korea are similar to the results for Germany, but the scales of the absolute errors are different. The K-Nearest Neighbor Model is the most accurate for the data from South Korea, and its precision is very different from the precision of the KNN Model for Germany. As we see in Figure 1, the absolute error of the KNN Model for Germany varies by about 50000 cases, but for South Korea the range of the absolute error is bounded by dozens of cases.

Figure 4: Absolute errors of forecasting models for Ukraine

The results of forecasting for Ukraine completely correspond to the results obtained for Germany, Japan, and South Korea. In this case, we see that the absolute errors of the K-Nearest Neighbor Model and the Gradient Boosting Model are close. Therefore, to compare their precision correctly we must estimate their statistical equivalence. To compare the statistical homogeneity of these data, we apply the Klyushin–Petunin test, the Wilcoxon signed-rank test and the Diebold–Mariano test to real data from Germany, Japan, South Korea and Ukraine. The main problem that can create an obstacle is the small size of the samples.
To overcome this obstacle, we must use some resampling technique when applying the Klyushin–Petunin test. In this work we use bootstrapping [28] with 10000 trials. We compare the absolute errors of the Random Forest Model (RFM), the K-Nearest Neighbor Model (KNN), and the Gradient Boosting Model (GBM).

Table 1
P-values of the tests for the RFM, KNN and GBM methods for Germany

             Klyushin–Petunin        Diebold–Mariano         Wilcoxon
Methods   RFM    KNN    GBM      RFM    KNN    GBM      RFM    KNN    GBM
RFM       0.000  0.060  0.060    0.000  0.023  0.016    0.000  0.002  0.002
KNN       0.060  0.000  0.060    0.023  0.000  0.194    0.002  0.000  0.002
GBM       0.060  0.060  0.000    0.016  0.194  0.000    0.002  0.002  0.000

The p-values of the Klyushin–Petunin test and the Wilcoxon signed-rank test cited in Table 1 do not vary across the comparisons of the distributions of the absolute errors because the samples of the absolute errors corresponding to the models do not overlap pairwise. This is explained by the fact that both of these methods use ordering and permutations; for non-overlapping samples they produce a constant p-value. Therefore, according to the Klyushin–Petunin test (the p-value is greater than 0.05) and the Wilcoxon signed-rank test (the p-value is less than 0.05), all the models for Germany provide statistically different results at the significance level α = 0.05. The Diebold–Mariano test recognizes the difference between all the samples with the exception of the K-Nearest Neighbor Model and the Gradient Boosting Model, as its p-value in all other cases is less than 0.05. As we see, the forecasting errors of the Random Forest Model, the K-Nearest Neighbor Model and the Gradient Boosting Model are very suitable for the application of all the classical tests for sample homogeneity because these samples have clearly different locations. However, this is not the case when samples have similar locations but different scales.
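The bootstrap step used to stabilize the tests on these small error samples can be sketched as follows. The error values, trial count, and seed are illustrative, and the resampling is the generic Efron scheme [28] rather than a verbatim transcript of our experimental code; note that sampling with replacement naturally produces ties, which is why the tie-aware test of Section 5 is needed.

```python
import random

def bootstrap_resamples(sample, trials=1000, seed=0):
    # draw `trials` resamples of the same size, with replacement;
    # duplicates (ties) appear in almost every resample
    rng = random.Random(seed)
    return [[rng.choice(sample) for _ in sample] for _ in range(trials)]

# a small sample of absolute forecasting errors (illustrative values)
errors = [120.0, 95.0, 143.0, 88.0, 101.0, 77.0, 130.0, 99.0]
resamples = bootstrap_resamples(errors, trials=200)
```

Each resample, together with a resample of the competing model's errors, is then fed to the homogeneity test, and the decisions are aggregated over the trials; in the experiments reported here 10000 trials are used.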
Table 2
P-values of the tests for the RFM, KNN and GBM methods for Japan

             Klyushin–Petunin        Diebold–Mariano                  Wilcoxon
Methods   RFM    KNN    GBM      RFM       KNN       GBM        RFM    KNN    GBM
RFM       0.000  0.060  0.060    0.000     1.94e-05  1.95e-05   0.000  0.002  0.002
KNN       0.060  0.000  0.060    1.94e-05  0.000     1.24e-05   0.002  0.000  0.002
GBM       0.060  0.060  0.000    1.95e-05  1.24e-05  0.000      0.002  0.002  0.000

As in the previous case, the p-values of the Klyushin–Petunin test and the Wilcoxon signed-rank test cited in Table 2 do not vary across the comparisons of the distributions of the absolute errors because the samples of the absolute errors corresponding to the models do not overlap pairwise. Therefore, according to the Klyushin–Petunin test and the Wilcoxon signed-rank test, all the models for Japan provide statistically different results at the significance level α = 0.05. The Diebold–Mariano test recognizes the statistical difference between all the samples.

Table 3
P-values of the tests for the RFM, KNN and GBM methods for South Korea

             Klyushin–Petunin        Diebold–Mariano                  Wilcoxon
Methods   RFM    KNN    GBM      RFM       KNN       GBM        RFM    KNN    GBM
RFM       0.000  0.060  0.060    0.000     2.37e-04  2.92e-05   0.000  0.002  0.002
KNN       0.060  0.000  0.060    2.37e-04  0.000     1.44e-02   0.002  0.000  0.002
GBM       0.060  0.060  0.000    2.92e-05  1.44e-02  0.000      0.002  0.002  0.000

Here we see again that the p-values of the Klyushin–Petunin test and the Wilcoxon signed-rank test cited in Table 3 do not vary across the comparisons of the distributions of the absolute errors because the samples of the absolute errors corresponding to the models do not overlap pairwise. Therefore, according to the Klyushin–Petunin test and the Wilcoxon signed-rank test, all the models for South Korea provide statistically different results at the significance level α = 0.05. The Diebold–Mariano test also recognizes the statistical difference between all the samples.
Table 4
P-values of the tests for the RFM, KNN and GBM methods for Ukraine

             Klyushin–Petunin        Diebold–Mariano         Wilcoxon
Methods   RFM    KNN    GBM      RFM    KNN    GBM      RFM    KNN    GBM
RFM       0.000  0.060  0.060    0.000  0.039  0.036    0.000  0.002  0.002
KNN       0.060  0.000  0.060    0.039  0.000  0.002    0.002  0.000  0.002
GBM       0.060  0.060  0.000    0.036  0.002  0.000    0.002  0.002  0.000

Again, the p-values of the Klyushin–Petunin test and the Wilcoxon signed-rank test cited in Table 4 do not vary across the comparisons of the distributions of the absolute errors because the samples of the absolute errors corresponding to the models do not overlap pairwise. Therefore, according to the Klyushin–Petunin test, the Diebold–Mariano test and the Wilcoxon signed-rank test, all the models for Ukraine provide statistically different results at the significance level α = 0.05. As a general conclusion, we may state that all the tests recognize the statistical difference between the models. Therefore, we may rank the models by their precision. After the statistical analysis, we may conclude that the Gradient Boosting Model is the best, the K-Nearest Neighbor Model is less accurate but quite good, and the Random Forest Model is the worst for forecasting the numbers of cases in Germany, Japan, South Korea, and Ukraine.

7. Conclusions and scope for future work
A naïve comparison of predicted values using error measures ignores the stochastic nature of these values. To solve this problem, we must compare the accuracy measures statistically and arrange the models by their quality. Usually, nonparametric methods are widely used for this purpose, in particular the Wilcoxon signed-rank test and the Diebold–Mariano test. The experiments made using the data from Germany, Japan, South Korea, and Ukraine have shown that the Klyushin–Petunin test with bootstrapping is as effective for comparing forecasting models as the Wilcoxon signed-rank test and the Diebold–Mariano test.
The future work will be focused on comparing the proposed test with other tests used in this domain and on the theoretical properties of the p-statistics for samples of unequal sizes.

8. References
[1] S. Eker, Validity and usefulness of COVID-19 models, Humanit Soc Sci Commun 54 (2020). doi:10.1057/s41599-020-00553-4.
[2] N. Anand, A. Sabarinath, S. Geetha, S. Somanath, Predicting the Spread of COVID-19 Using SIR Model Augmented to Incorporate Quarantine and Testing, Trans Indian Natl Acad Eng 5 (2020) 141–148. doi:10.1007/s41403-020-00151-5.
[3] M. Babu, M. Marimuthu, M. Joy, M. Nadaraj, E. Asirvatham, L. Jeyaseelan, Forecasting COVID-19 epidemic in India and high incidence states using SIR and logistic growth models, Clin Epidemiol Glob Health 9 (2020) 26–33. doi:10.1016/j.cegh.2020.06.006.
[4] A. Guirao, The Covid-19 outbreak in Spain. A simple dynamics model, some lessons, and a theoretical framework for control response, Infect Dis Model 5 (2020) 652–669. doi:10.1016/j.idm.2020.08.010.
[5] O. Ifguis, M. Ghozlani, F. Ammou, A. Moutcine, A. Abdellah, Simulation of the Final Size of the Evolution Curve of Coronavirus Epidemic in Morocco using the SIR Model, J Environ Pub Health 2020 (2020), article ID 9769267. doi:10.1155/2020/9769267.
[6] I. Nesteruk, Statistics-based predictions of coronavirus epidemic spreading in mainland China, Innov Biosyst Bioeng 4 (2020) 13–18. doi:10.20535/ibb.2020.4.1.195074.
[7] N. Wang, Y. Fu, H. Zhang, H. Shi, An evaluation of mathematical models for the outbreak of COVID-19, Precis Clin Med 3 (2020) 85–93. doi:10.1093/pcmedi/pbaa016.
[8] J. He, G. Chena, Y. Jiang, R. Jin, R. Shortridge, S. Agusti, M. He, J. Wu, D. Duarte, G. Christakos, Comparative infection modeling and control of COVID-19 transmission patterns in China, South Korea, Italy and Iran, Sci Total Environ 74 (2020) 141447. doi:10.1016/j.scitotenv.2020.141447.
[9] R. Li, S. Pei, B. Chen, Y. Song, T. Zhang, W. Yang, B. Wei, L. Xin, X.
Wei, Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV-2), Science 368 (2020) 489–493. doi:10.1126/science.abb3221.
[10] D. Chumachenko, I. Meniailov, L. Bazilevych, T. Chumachenko, S. Yakovlev, Investigation of Statistical Machine Learning Models for COVID-19 Epidemic Process Simulation: Random Forest, K-Nearest Neighbors, Gradient Boosting, Computation 10 (2022) 86. doi:10.3390/computation10060086.
[11] H. Swapnarekha, H. Behera, J. Nayak, B. Naik, Role of Intelligent Computing in COVID-19 Prognosis: A State-of-the-Art Review, Chaos Solitons Fract (2020) 109947. doi:10.1016/j.chaos.2020.109947.
[12] R. Sujath, J. Chatterjee, A. Hassanien, A machine learning forecasting model for COVID-19 pandemic in India, Stoch Env Res Risk A 34 (2020) 959–972. doi:10.1007/s00477-020-01827-8.
[13] A. Appadu, A. Kelil, Y. Tijani, Comparison of some forecasting methods for COVID-19, Alexandria Engineering Journal 60 (2020) 1565–1589. doi:10.1016/j.aej.2020.11.011.
[14] S. Tuli, S. Tuli, R. Tuli, S. Gill, Predicting the Growth and Trend of COVID-19 Pandemic using Machine Learning and Cloud Computing, Internet Things 11 (2020) 100222. doi:10.1016/j.iot.2020.100222.
[15] C. Distante, I. Pereira, L. Gonçalves, P. Piscitelli, A. Miani, Forecasting Covid-19 Outbreak Progression in Italian Regions: A model based on neural network training from Chinese data, MedRxiv, 2020. https://www.medrxiv.org/content/10.1101/2020.04.09.20059055v1. doi:10.1101/2020.04.09.20059055.
[16] L. Kolozsvári, T. Bérczes, A. Hajdu, R. Gesztelyi, A. Tiba, I. Varga, G. Szőllősi, S. Harsányi, S. Garbóczy, J. Zsuga, Predicting the epidemic curve of the coronavirus (SARS-CoV-2) disease (COVID-19) using artificial intelligence, MedRxiv, 2020. https://www.medrxiv.org/content/10.1101/2020.04.17.20069666v2. doi:10.1101/2020.04.17.20069666.
[17] M. Ibrahim, J. Haworth, L. Lipani, A. Aslam, T. Cheng, N.
Christie, Variational-LSTM Autoencoder to forecast the spread of coronavirus across the globe, MedRxiv, 2020. https://www.medrxiv.org/content/10.1101/2020.04.20.20070938v1. doi:10.1101/2020.04.20.20070938.
[18] Z. Hu, Q. Ge, S. Li, E. Boerwinkle, L. Jin, M. Xiong, Forecasting and evaluating multiple interventions of Covid-19 worldwide, Front Artif Intell (2020) 2020.00041. doi:10.3389/frai.2020.00041.
[19] Q. Guo, Z. He, Prediction of the confirmed cases and deaths of global COVID-19 using artificial intelligence, Environ Sci Pollut Res 28 (2021) 11672–11682. doi:10.1007/s11356-020-11930-6.
[20] S. Balli, Data analysis of Covid-19 pandemic and short-term cumulative case forecasting using machine learning time series methods, Chaos Solitons Fractals 142 (2021) 110512. doi:10.1016/j.chaos.2020.110512.
[21] R. Kafieh, R. Arian, N. Saeedizadeh, Z. Amini, N. Serej, S. Minaee, S. Yadav, A. Vaezi, N. Rezaei, S. Javanmar, COVID-19 in Iran: Forecasting Pandemic Using Deep Learning, Computational and Mathematical Methods in Medicine (2021), article ID 6927985. doi:10.1155/2021/6927985.
[22] F. Rustam et al., COVID-19 Future Forecasting Using Supervised Machine Learning Models, IEEE Access 8 (2020) 101489–101499. doi:10.1109/ACCESS.2020.2997311.
[23] B. Flores, The utilization of the Wilcoxon test to compare forecasting methods: A note, Int J Forecast 5 (1989) 529–535. doi:10.1016/0169-2070(89)90008-3.
[24] T. DelSole, M. Tippett, Comparing Forecast Skill, Mon Wea Rev 142 (2014) 4658–4678. doi:10.1175/MWR-D-14-00045.1.
[25] F. Diebold, R. Mariano, Comparing predictive accuracy, J Bus Econ Stat 13 (1995) 253–263. doi:10.1080/07350015.1995.10524599.
[26] F. Diebold, Comparing Predictive Accuracy, Twenty Years Later: A Personal Perspective on the Use and Abuse of Diebold–Mariano Tests, NBER Working Papers 18391, National Bureau of Economic Research, Inc., 2012.
[27] D. Klyushin, Y.
Petunin, A Nonparametric Test for the Equivalence of Populations Based on a Measure of Proximity of Samples, Ukrainian Math J 55 (2003) 181–198. doi:10.1023/A:1025495727612.
[28] B. Efron, Bootstrap methods: another look at the jackknife, Ann Statist 7 (1979) 1–26. doi:10.1214/aos/1176344552.
[29] B. Hill, Posterior distribution of percentiles: Bayes' theorem for sampling from a population, J Am Stat Assoc 63 (1968) 677–691. doi:10.1080/01621459.1968.11009286.
[30] V. Papastefanopoulos, P. Linardatos, S. Kotsiantis, COVID-19: a comparison of time series methods to forecast percentage of active cases per population, Applied Sciences 10 (2020) 3880. doi:10.3390/app10113880.
[31] D. Klyushin, Comparing Predictive Accuracy of COVID-19 Prediction Models: A Case Study, in: S.A. Hassan, A.W. Mohamed, K.A. Alnowibet (Eds.), Decision Sciences for COVID-19, International Series in Operations Research & Management Science, vol. 320, Springer, Cham, 2022. doi:10.1007/978-3-030-87019-5_10.