=Paper=
{{Paper
|id=Vol-3038/paper28
|storemode=property
|title=Visualization of the Epidemics Forecasting Results
|pdfUrl=https://ceur-ws.org/Vol-3038/paper18.pdf
|volume=Vol-3038
|authors=Natalya Shakhovska,Ihor Darmoriz,Yaroslav Vyklyuk,Yurii Kryvenchuk,Pavlo Pukach
|dblpUrl=https://dblp.org/rec/conf/iddm/ShakhovskaDVKP21
}}
==Visualization of the Epidemics Forecasting Results==
<pdf width="1500px">https://ceur-ws.org/Vol-3038/paper18.pdf</pdf>
<pre>
Visualization of the Epidemics Forecasting Results
Nataliya Shakhovska, Ihor Darmoriz, Yarosvav Vyklyuk, Yurii Kryvenchuk and Pavlo Pukach
Lviv Polytechnic National University, Lviv, Ukraine, 79013


               Abstract
               Modeling and forecasting of time series is one of the most importance for various practical
               applications. Many things are more or less time-dependent. Its analysis can forecast the future
               behavior to take some action for better results in the future. Research purpose is to develop a
               software product that has the ability to forecast the spread of the epidemic in relation to its
               specific features. The comparison of linear model, Convolution neural network and Recurrent
               neural network for epidemic forecasting is given. The spread of epidemics occurs over a period
               of time where anybody can see trends of some features during some time. Because the result
               is influenced by a large number of factors, and the training took place only on a short history,
               the results are of high quality because the MAPE error does not exceed 30% with a prediction
               for all characteristics.

               Keywords 1
               machine learning, forecasting, time series, epidemic

1. Introduction
   Epidemics produced by infections and viruses usually come to a first-place amount of the large-scale
disasters and catastrophes that have attended the entire history of humankind, on a par with starvation,
wars, man-made and natural disasters. According to the World Health Organization (WHO), severe
respiratory infections account for 60-70% of the total morbidity of the population, with a tendency to
develop complexities and chronicity of the process. Due to the extreme variability of the pathogen,
acute respiratory infections remain an uncontrolled infection. Another example is coronavirus disease
affected by the new virus SARS-CoV-2 (COVID-19). Nearly 241 million people worldwide have
contracted COVID 19 (https://index.minfin.com.ua/ua/reference/coronavirus/geography/). Of these,
more than 17 million are ill at this moment, and more than 21 million have been cured. More than 4
million people died from the disease. In total, the disease was detected in 203 countries.
   The nature of diseases caused by infections and viruses (even with known treatment prevention
schemes) depends of numerous factors, namely:
         variability of strains,
         method of distribution,
         parameters of the distribution area: climatic conditions, infrastructure and connections
            between towns and inside towns, quality of medical care, the most common life style,
            chronic diseases essential in this area, political situation, etc.
   That is why developing simulation models of the spread and character of morbidity and new cases
of various infections and viruses is a problematic scientific task. The main characteristics of this task
are the following:
         multicriteria: type of spread (epidemic spread, controlled spread in a mild form of the
            disease), initial parameters, the distribution territory,


Informatics & Data-Driven Medicine, 11, 2021, Lviv, Ukraine
EMAIL: nataliya.b.shakhovska@lpnu.ua (NS); ihor.darmoriz.kn.2017@lpnu.ua (ID); yaroslav.vyklyuk@gmail.com (YaV),
yurii.p.kryvenchuk@lpnu.ua (YuK), pavlo.p.pukach@lpnu.ua (PP)
ORCID: 0000-0002-6875-8534 (NS 0000-0003-2549-1873 (ID); 0000-0003-4766-4659 (YaV), 0000-0002-2504-5833 (YuK), 0000-0002-
0488-6828 (PP)
            ©️ 2021 Copyright for this paper by its authors.
            Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
            CEUR Workshop Proceedings (CEUR-WS.org)
         time dependence,
         simulation interval,
         variety of input data.
   Therefore, it is necessary to develop a system that is more sensitive to changes in the spread of the
disease to predict its reach in the future and monitor other changes that may occur during an epidemic
(new cases, recovery, mortality, etc).
   The aim of the work is to develop a model and system based on it that would make it possible to
monitor and predict the spread of the epidemic on the basis of various characteristics.
   The main contributions of this paper are the following:
        The new schema of recurrent neural network for COVID-19 infection forecasting is developed.
        To increase the predictive accuracy, the clustering is used on the preprocessing stage. It allows
   to reduce the influence of data heterogeneity due to the presence of several locations.
   This paper is organized into several sections. In State of the art section, the methods of times series
analysis are given. In the section #3 “Methods and means”, new schema of recurrent neural network is
proposed. The fourth section presents result of proposed methods and gives data interpretation. The last
section concludes the paper.

2. State of the Art
    In paper [1] describes that the spread of the H5N1 influenza virus in birds has heightened concern
about a new human influenza epidemic. Using epidemiological data collected in the early stages of the
outbreak, the authors show how to predict the maximum pervasiveness of a pandemic wave and its
amplitude and duration by adapting the epidemic model of mass action to observational data by standard
regression analysis.
    In [2], authors are used tools of mathematics (particularly wavelet theory) and computer science
(machine learning). They have developed a new method of modeling the evolution of epidemics, which
is not limited to the human population. The most important new feature of the proposed approach is the
following:
          an epidemic can occur in several waves;
          these waves can be global and local;
          in addition, they can occur in different periods and places.
    Based on the latest data from the Johns Hopkins database, authors apply the model to several
countries (the Czech Republic, France, Italy, Germany, and the US states of New York and Florida).
After that, they compare the actual rate of diseases and their predictions to established and other recently
developed methods and techniques of prognostication.
    In [3], it is described that COVID-19 trend forecasting is a significant problem. This work integrates
the latest COVID-19 epidemiological data into a logistics model by June 16, 2020 to meet the epidemic
trend constraint and then introduce the constraint value into the FbProphet model, a machine-based time
series prediction model to obtain an epidemic curve and predict an epidemic trend. Three significant
points are summarized from our modeling results for the world countries, Brazil, Russia, India, Peru,
and Indonesia.
    Paper [4] proposed a new epidemic model (SuEIR) for predicting the spread of COVID-19, based
on the number of confirmed deaths in the United States. In particular, the SuEIR model is a variation
of the SEIR model, taking into account untested/unregistered cases of COVID-19, and is trained in
machine learning algorithms based on historical data messages. In addition to providing baseline
predictions for confirmed cases and fatalities, the proposed SuEIR model can also predict the peak date
of active cases and estimate the baseline reproduction number.Time series consist of the following
components:
         Seasonal changes;
         Trend;
         Cyclical variations;
         Random variations.
    After researching time series predictions, we can conclude that the usage of neural networks is a
new application, as over the past century, a large number of linear algorithms have been developed to
analyze and predict time series, including ARMA [5, 6], ARIMA [7], VAR [8], HWES [9] and others.
Although these types have been quite widespread, they may not always be effective enough. Their
disadvantages include the following points:
        They require complete data. Some missing values can actually affect the model. But there are
    also ways to deal with missing data.
        They rely on linear relationships. In many traditional models, their assumptions are based on a
    linear basis.
        They usually only deal with one-dimensional data. For example, they can analyze a time series
    for a single characteristic (such as virus mortality), although when dealing with epidemics, we
    analyze several types of data.
        They usually do not work well in the long run.
    Convolutional neural network (CNN) is a class of deep artificial neural networks that has been
successfully used in the analysis of visual images [10]. They are mainly used for work or image analysis,
but in some cases they can be used effectively for time series. An important characteristic for the use
of this network is the correlation of several types of data in the analysis of the series.
    A recurrent neural network (RNN) is a class of artificial neural networks where connections between
nodes form a directional graph along a time sequence [11, 12]. This allows him to demonstrate temporal
dynamic behavior. A well-known RNN is long-term memory or LSTM, and it has the ability to solve
time series problems. LSTM networks eliminate the need for a predefined time window due to the
ability to study long-term correlations in different sequences and are able to accurately model complex
multidimensional sequences. The advantages and disadvantages of each of the models are given in
Table 1.

Table 1
Models comparison
      Model                      Advantage                                 Disadvantage
      Linear             Simple, easy to understand             Requires complete data; Uses only
                                                              linear connections; Accuracy is not so
                                                                      high for long-term data
       CNN         Less sensitivity to noise compared to      Cannot work with time dependencies
                    linear models; Can work with both
                           one-dimensional and
                   multidimensional data; The ability to
                      work with multi-step predictions
       RNN         Advantages are the same as for CNN;        In some cases it cannot process large
                      working with time dependencies            time dependencies; Tendency to
                                                                          overfitting

   Therefore, considering the advantages and disadvantages of different methods, we can conclude that
the best choice is RNN, namely LSTM for future data processing.


3. Materials and Methods
   The main characteristics for the analysis of epidemics are geographical identification, as well as
characteristics that are new values of the following criteria:
       observations,
       confirmation,
       death,
       recovery.
    The feature set is also linked to a specific location, such as the city or region that will be predicted.
    A data set was used as input data [13]. This is a time series with different metrics that can be used
for analysis or prediction. The dataset contains information about the COVID-19 virus in relation to the
cities of Ukraine. In this example, a data set for the Lviv region was used.


Figure 1. Dataset example.

    The initial data of the developing system should provide the user with an understanding of the
situation regarding the epidemiological main indicators: new diseases, recovery, death, etc. The data is
calculated as a prediction based on the original data and create a forecast for this set of characteristics.
    As output, the user will receive an apology visualization in the form of a graph for a specific data
set, as well as a map with a prediction for a specific region for easier visual perception
    The proposed in the paper model is built taking into account three main criteria:
        The number of past days for prediction;
        The number of days to anticipate;
        The number of criteria to consider.
    For our case, one day was taken into account to predict the next day using four characteristics by
analyzing the previous seven days.
         In the beginning, neural network decides what information to remember and what to throw out
of the cell state. This action is performed in the "Forget gate" part. X presents input data, H - the result
of the current stage, t is the step number. In this part, the sigmoid function considers the input data from
ht-1 and xt, and then outputs a number between 0 and 1 for each number from cell Ct-1, where 1 will
mean completely save the state, and 0 - completely forget it.
                                             ft=sig(W(f)[ht-1,xt]+b(f)).
        The next step is the “Input gate” section to update the cell status. First, the current state X t and
the previously hidden state ht-1 are passed to a sigmoid function to transform values between 0
(important) and 1 (unimportant). Next, the same information about the hidden and current state will be
transmitted via the tanh function. For network regularization, operator tanh calculates vector Čt in range
from -1 to 1 for the multiplying.
                                      it=sig(W(i)[ht-1,xt]+b(i)).
                                    Čt=gt =tanh(W(g)[ht-1,xt]+b(g)).
         When the network has prepared information about the data it receives from the two previous
layers, the next step is to decide to save information from the new state in the cellular state in the "Cell
state". The previous state of cell Ct-1 is multiplied by the forgetting vector ft.
                                              Ct=ftCt-1+Čtht-1.
        The last step is to determine the values to pass to the next layer. Initially, the values of the
current state and the previous hidden state are passed to the last sigma function. This result is further
multiplied by the new cell state generated from the cell state after transmission through the tanh
function. Based on the final value, the network decides what information the hidden state should carry.
This latent state is used for forecasting. As a result, the new cell state and the new latent state are carried
over to the next time step.
                                            ot=sig(W(o)xt+U(o)ht-1+b(o)),
                                                 ht=ot+tanh(Ct).
   The structure of the model is given in Fig. 2.


Figure 2. RNN schema.

4. Results
    Before starting work, all data should be standardized [14], as data is measured at different scales and
a large difference can slow down or even hinder the effective learning process:


   At the preprocessing stage the data gaps are found. To remove them, grouping is used. The
distribution before and after grouping is given in Fig.3. Data gaps have narrowed, so grouping data for
specific periods is appropriate. It allows to reduce the influence of data heterogeneity due to the
presence of several locations
   Next, a model will be created to predict the regions, so the next step is to group the data with the
selection of a specific area.
                                                    а)


                                                    b)
Figure 3. Number of new cases for different periods of time before (a) and after grouping (b)

   The next step is to train the model. At this stage, the training of the previously created model was
carried out with the condition of release, if the accuracy is greater than 92% or all epochs will not be
passed. Each model with the lowest loss result is also stored for validation data. The model training
process can be seen in Figure 4, and the model training history in Figure 5.


Figure 4. Model training
Figure 5. Loss function

   The model accuracy on the training data for each of the parameters is given in Fig. 5 and resuls of
forecasting is demonstrated in Fig. 6. New cases and new obseravations are modelled separetly.


                                                   а)
                                                     b)
Figure 6. Forecasting on training data: а) – new observations; b) – new cases.

    The graph shows the training data, represented by a blue line, as well as the prediction, represented
by an orange line. From these graphs we can say that the accuracy of prediction relative to the test data
is very high.
    Training process is given in Fig. 7.


Figure 7. Training process.

   Mean absolute percentage error (MAPE) was used to verify the accuracy of the losses.
   The results are shown in Table 2.
Table 2
Error for testing dataset

                                      Measure name                Error
                                                                   (%)
                                 Number of observations           26.3
                                Number of confirmed cases         26.5
                                    Number of death               29.2
                                  Number of recoveries            29.4
   Because the result is influenced by a large number of factors, and the training took place only on a
short history, the results are of high quality because the MAPE error does not exceed 30% with a
prediction for all characteristics.
   To work with new data, a ready-made data model and a ready-made MinMaxScaler for further
correct alignment of variables relative to previous data is required. After completing the data
normalization phase, the model is trained and then the function plt_result () is called to graphically
display the accuracy of learning for each parameter, which are in separate columns.
   To display graphic data on the map, you need to perform some pre-processing of data. To do this,
use the function update_df_to_plt (), where an important parameter is "registration_area". This
parameter is responsible for the area that will be displayed later.
   Then the show_map () function is executed to display the data on the map. The result is shown in
Fig. 8. The virus power is marked in different colors.


Figure 8. New confirmed cases for Lviv region

Conclusions
    Different approaches to time data analysis were analyzed and the use of an RNN neural network,
namely LSTM, was used. This choice was also justified by comparison with other approaches.
Currently, this option is quite promising for predicting new cases of COVID-19, because it is not limited
to the most rigorous type of problem.
    During the development, a software product was built with a description of the system itself, taking
into account the optimal software. A user guide for working with this product in different situations has
also been described.
    So, as a conclusion, this system performs the task well enough given the number of factors that affect
in one way or another the result and this system is quite relevant for use.


5. Acknowledgements
   This work is supported by National Foundation of Fundamental research, Ukraine, Project
#103.01.0025.
6. References
[1]  I. M. Hall, R. Gani, H. E. Hughes, and S. Leach, “Real-time epidemic forecasting for pandemic
     influenza,” Epidemiol. Infect., vol. 135, no. 3, pp. 372–385, Apr. 2007, doi:
     10.1017/S0950268806007084.
[2] T. Tat Dat et al., “Epidemic Dynamics via Wavelet Theory and Machine Learning with
     Applications to Covid-19,” Biology, vol. 9, no. 12, p. 477, Dec. 2020, doi:
     10.3390/biology9120477.
[3] “Prediction of epidemic trends in COVID-19 with logistic model and machine learning technics,”
     Chaos Solitons Fractals, vol. 139, p. 110058, Oct. 2020, doi: 10.1016/j.chaos.2020.110058.
[4] D. Zou, L. Wang, P. Xu, J. Chen, W. Zhang, and Q. Gu, “Epidemic Model Guided Machine
     Learning for COVID-19 Forecasts in the United States,” Epidemiology, preprint, May 2020. doi:
     10.1101/2020.05.24.20111989.
[5] P. Gomes, R. Castro, “Wind speed and wind power forecasting using statistical models:
     autoregressive moving average (ARMA) and artificial neural networks (ANN)”, International
     Journal of Sustainable Energy Development, vol. 1, no. 1/2, pp 13-28, 2012.
[6] Haider, Abbas, “The COVID-19 Impact on Oil Market and Equity Market Link: An Evidence from
     ARMA-GJR GARCH-M Model”, Diss. CAPITAL UNIVERSITY, 2021.
[7] D. Benvenuto, M. Giovanetti, L., Vassallo, S. Angeletti, M. Ciccozzi, “Application of the ARIMA
     model on the COVID-2019 epidemic dataset”, Data in brief, vol. 29, p. 105340, 2020.
[8] F. Milani, “COVID-19 outbreak, social response, and early economic effects: a global VAR
     analysis of cross-country interdependencies”, Journal of population economics, vol. 34, no. 1, pp.
     223-252, 2021.
[9] A. Howell, “Battling Burnout at the Frontlines of Health Care Amid COVID-19”, AACN Advanced
     Critical Care, vol. 32, no. 2, pp. 195-203, 2021.
[10] Acharya, U. R., Oh, S. L., Hagiwara, Y., Tan, J. H., Adam, M., Gertych, A., & San Tan, R. “A deep
     convolutional neural network model to classify heartbeats”, Computers in biology and medicine,
     89, 389-396, 2017.
[11] Zaremba, W., Sutskever, I., & Vinyals, O. “Recurrent neural network regularization”, arXiv
     preprint arXiv:1409.2329, 2014.
[12] Donkers, Tim, Benedikt Loepp, and Jürgen Ziegler. "Sequential user-based recurrent neural
     network recommendations." Proceedings of the eleventh ACM conference on recommender
     systems. 2017.
[13] V.     Piven, VasiaPiven/covid19_ua. 2021. Accessed: Apr. 26, 2021.                        URL::
     https://github.com/VasiaPiven/covid19_ua
[14] Al Shorman, Amaal R., et al. "The Influence of Input Data Standardization Methods on the
     Prediction Accuracy of Genetic Programming Generated Classifiers." IJCCI. 2018.

</pre>