Machine Learning Approach to COVID-19 Epidemic Process Simulation using Polynomial Regression Model

Darina Kapusta, Alireza Mohammadi, Dmytro Chumachenko

National Aerospace University “Kharkiv Aviation Institute”, Chkalow str., 17, Kharkiv, Ukraine

Abstract
The article presents an approach to modeling epidemic processes based on machine learning. A model is built using the polynomial regression method. The simulation results make it possible to calculate the predicted incidence of coronavirus infection in a given area. The model is shown to be accurate enough for use in public health policy-making. The disadvantage of machine learning methods is that they cannot identify the factors affecting the dynamics of the epidemic process. However, due to their high accuracy, such models can be used in an ensemble with agent-based and compartment models.

Keywords
Epidemic model, polynomial regression, COVID-19 simulation, machine learning, artificial intelligence.

1. Introduction

An epidemic of a previously unknown coronavirus that causes pneumonia broke out in January 2020 in the Chinese province of Hubei [1]. Within two weeks, the virus spread to other countries. The first cases of infection with the new coronavirus were recorded at the end of December 2019. Its appearance is associated with the seafood market in the city of Wuhan in China (Hubei province). Until the market closed on January 1, 2020, marine mammals, bats, chickens, rabbits and snakes were sold there [2]. Chinese virologists suggested that one of these animal species could have become the source of infection. During the first month, almost 6,000 people were infected with the coronavirus, and more than 130 died from pneumonia caused by the virus. China restricted communications with a number of metropolitan areas, quarantining 56 million people in 17 cities in Hubei.
Later it turned out that the new coronavirus is almost 90% identical to the SARS-CoV virus, which appeared in China in the early 2000s and claimed about 800 lives. The new coronavirus was first assigned the code 2019-nCoV, and on February 11 it was renamed SARS-CoV-2 (Severe Acute Respiratory Syndrome Coronavirus 2) [3]. As the epidemic was contained, the authorities of individual countries began to gradually ease restrictive measures in order to minimize damage to the economy and prevent social problems [4]. In the fall of 2020, the second and third waves of the epidemic began in many countries. The death toll from coronavirus in the United States as of August 2021 has reached almost 700 thousand people [5]. This made the pandemic the deadliest in the country in the last hundred years, ahead of the Spanish flu pandemic. Recently, an average of 10,000 people per day have been dying from coronavirus [6]. This is the highest figure since the beginning of March this year. Scientists associate this with the emergence of the “delta” and “iota” strains, characterized by increased infectiousness and lethality [7].

International Workshop of IT-professionals on Artificial Intelligence (ProfIT AI 2021), September 20-21, 2021, Kharkiv, Ukraine
EMAIL: kapusta.darina.yurievna@gmail.com (D. Kapusta); alireza.mohammadi9207@gmail.com (A. Mohammadi); dichumachenko@gmail.com (D. Chumachenko).
ORCID: 0000-0002-8168-8411 (D. Kapusta); 0000-0002-4964-4494 (A. Mohammadi); 0000-0003-2623-3294 (D. Chumachenko).
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

The authorities of many countries have introduced compulsory vaccination for certain groups of the population (doctors, teachers, government officials, etc.) [8]. However, the number of vaccinated people is not yet sufficient to slow the spread of the pandemic [9].
Mathematical and simulation modeling is an effective tool for identifying the patterns of pandemic spread. The simulation results make it possible to form a scientifically grounded policy for countering the epidemic and to introduce anti-epidemic measures that reduce the incidence. The aim of the paper is to develop a machine learning model of COVID-19 epidemic process dynamics and investigate its results.

2. Analysis of epidemic process simulation approaches

To date, research teams from around the world have built many models of the spread of COVID-19. Morbidity modeling, however, has a much longer history, which includes the breakthrough SIR model proposed by Kermack and McKendrick [10]. Models useful for studying infectious diseases on a population scale can generally be classified into two types: deterministic and stochastic. In deterministic models, a large population is divided into smaller groups called classes, where each group represents a specific stage of the epidemic. Such models are often formulated as a set of differential equations (in continuous time) or difference equations (in discrete time) that help explain what, on average, happens on a population scale [11]. The solution of a deterministic model is a function of time or space and, as a rule, depends uniquely on the input data. The stochastic model is formulated in terms of a stochastic process, which, in turn, is a set of random variables, X(t, ω) ≡ X(t), defined as:

$$\{X(t,\omega) \mid t \in T,\ \omega \in \Omega\}, \qquad (1)$$

where T and Ω represent the time index set and the sample space, respectively. The solution to the stochastic model is the probability distribution for each of the random variables [12]. Such models capture the inherent demographic and environmental variability and are useful in small populations. More specifically, they make it possible to follow each individual in a population, with events occurring at random.
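The difference between the two model classes can be illustrated with a minimal Python sketch. It compares a discrete-time (difference-equation) SIR model with a chain-binomial stochastic counterpart. The transmission rate `beta`, the recovery rate `gamma`, the population of 1000 and the chain-binomial formulation itself are illustrative assumptions, not the models used in the cited works.

```python
import numpy as np

def sir_deterministic(beta, gamma, s0, i0, r0, days):
    """Difference-equation SIR with a one-day step: one fixed trajectory."""
    n = s0 + i0 + r0
    s, i, r = float(s0), float(i0), float(r0)
    history = [(s, i, r)]
    for _ in range(days):
        new_infections = beta * s * i / n   # average number of new cases
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return np.array(history)

def sir_stochastic(beta, gamma, s0, i0, r0, days, rng):
    """Chain-binomial SIR: daily transitions are random draws."""
    n = s0 + i0 + r0
    s, i, r = s0, i0, r0
    history = [(s, i, r)]
    for _ in range(days):
        p_inf = 1.0 - np.exp(-beta * i / n)  # daily infection probability
        p_rec = 1.0 - np.exp(-gamma)         # daily recovery probability
        new_infections = rng.binomial(s, p_inf)
        new_recoveries = rng.binomial(i, p_rec)
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return np.array(history)

# Hypothetical parameters chosen only for illustration.
rng = np.random.default_rng(0)
det = sir_deterministic(0.3, 0.1, 990, 10, 0, 100)
runs = [sir_stochastic(0.3, 0.1, 990, 10, 0, 100, rng) for _ in range(200)]
final_sizes = [run[-1, 2] for run in runs]  # distribution of outbreak sizes
```

The deterministic function always returns the same curve for the same inputs, whereas the spread of `final_sizes` across repeated runs approximates the probability distribution that a stochastic model yields as its solution.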
Deterministic models are used to address questions such as “What proportion of people would be infected during an epidemic outbreak?”, “What conditions must be satisfied to prevent and control an epidemic?”, etc. Deterministic models are best suited to studying a large population, while stochastic epidemic models are useful for a small population and provide answers to questions such as “How long can a disease last?”, “What is the probability of a major outbreak?”, etc. Unlike deterministic models, stochastic models can be time-consuming to create and require many simulation runs to generate useful predictions. They can become very complex mathematically and lead to misperceptions of the dynamics. Different modeling approaches are suitable for investigating different problems. For example, simple deterministic models can be useful for understanding the underlying dynamics of an infection, but they are of limited use as a forecasting tool because any epidemic is unique and unlikely to follow the “average” pattern. Stochastic models are difficult to construct, but are especially useful for assessing risks and can be used to investigate the likelihood of different outcomes.

As of August 2021, more than 250 thousand sources have been published on issues relating to COVID-19 in various fields, from medicine and biology to computer science and mathematics. Many studies are dedicated to COVID-19 modeling from different perspectives, such as COVID-19 characteristics [13], epidemiology [14], general Artificial Intelligence [15], machine learning [16], etc. They solve different tasks, such as virus detection [17], contact tracing [18], forecasting [19], vaccine development [20], etc. The main limitations of the developed models and approaches are complex data, low quality of data, limited data, heterogeneity of the population, interactions between multi-source data, disclosing unknown attributes, etc.
All these limitations lead to a decrease in the accuracy of the forecast obtained with the help of modeling. The machine learning approach to epidemic process simulation can mitigate that drawback and shows high accuracy in simulating the dynamics.

3. Polynomial regression model

Polynomial regression is a supervised regression learning algorithm. The regression algorithm establishes a regression model between variables and learns the relationship between the independent and dependent variables during training [21]. Regression analysis can be used for predictive or classification models. Common regression algorithms include linear regression, nonlinear regression, logistic regression, polynomial regression, ridge regression, lasso regression and ElasticNet regression. Among them, the most commonly used are linear regression, nonlinear regression, and logistic regression. In many cases, the linear model may not fit the target data curve well, which requires the introduction of a non-linear regression model [22]. There are several strategies for nonlinear regression: the first strategy is to convert nonlinear regression to linear regression, and the second strategy is to convert nonlinear regression to polynomial regression. Polynomial regression adds higher-order terms of a feature (such as a square or cubic term), which is equivalent to increasing the model's degrees of freedom to capture non-linear changes in the data.

The goal of regression analysis is to model the expected value of the dependent variable y in terms of the value of the independent variable (or vector of independent variables) x. In simple linear regression, the model is

$$y = \beta_0 + \beta_1 x + \varepsilon, \qquad (2)$$

where ε is the unobservable random error with mean zero conditional on the scalar variable x. In this model, for each unit of increase in the value of x, the conditional expectation of y increases by β1 units. In many cases, such a linear relationship may not be observed.
For example, if we model the number of deaths from COVID-19 depending on the percentage of the population vaccinated in a certain area, the relationship need not be linear, because the number of deaths is also affected by the decreasing availability of hospital beds. In this case, a quadratic model of the following form can be used:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon. \qquad (3)$$

In general terms, we can model the expected value of y as a polynomial of the nth degree, obtaining a general polynomial regression model

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \dots + \beta_n x^n + \varepsilon. \qquad (4)$$

Conveniently, all these models are linear from the point of view of estimation, since the regression function is linear in the unknown parameters β0, β1, .... Therefore, for least squares analysis, the computational and inferential problems of polynomial regression can be completely solved using the methods of multiple regression. For this, x, x², ... are treated as separate explanatory variables in a multiple regression model. From formula (4), it follows that to obtain a polynomial regression model that fits the target dataset well, the key is to determine the weight (coefficient) of each explanatory term. Linear regression first constructs a convex objective function (for example, the sum of squares of the differences between the observed values and the model's predictions) and then uses least squares [23] or gradient descent [24] to compute the final fit parameters.

Although polynomial regression is technically a special case of multiple linear regression, the interpretation of a fitted polynomial regression model requires a slightly different perspective. It is often difficult to interpret the individual coefficients when fitting a polynomial regression because the underlying monomials can be highly correlated. For example, x and x² have a correlation of more than 0.9 when x is uniformly distributed over the interval (0, 1).
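Both points made above, that polynomial regression reduces to multiple linear regression and that the monomials are strongly correlated, can be checked numerically. The following is a minimal sketch using NumPy on synthetic data; the coefficients in `true_beta`, the noise level and the sample size are assumptions chosen for illustration and are not values from this study.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data from a quadratic model (hypothetical coefficients).
x = rng.uniform(0.0, 1.0, size=500)
true_beta = np.array([1.0, -2.0, 3.0])  # assumed beta_0, beta_1, beta_2
noise = rng.normal(0.0, 0.05, size=x.size)
y = true_beta[0] + true_beta[1] * x + true_beta[2] * x**2 + noise

# Treat 1, x and x^2 as separate explanatory variables:
# the design matrix has one column per monomial.
X = np.vander(x, N=3, increasing=True)

# Ordinary least squares on the design matrix, as in multiple regression.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Sample correlation between the monomials x and x^2.
corr = np.corrcoef(x, x**2)[0, 1]

print("estimated coefficients:", beta_hat)
print("correlation between x and x^2:", corr)
```

Because the columns of `X` are so strongly correlated, the individual entries of `beta_hat` are estimated less precisely than the fitted curve as a whole, which is why the fitted function rather than each coefficient is usually inspected.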
Although the correlation can be reduced by using orthogonal polynomials, it is usually more informative to look at the fitted regression function as a whole. Pointwise or simultaneous confidence intervals can then be used to quantify the uncertainty in the estimate of the regression function.

4. Results

To simulate the COVID-19 epidemic process in Ukraine, we used data on new cases and deaths provided by the Center for Public Health of the Ministry of Health of Ukraine. The polynomial regression model was implemented in the Python programming language.

An important part of developing a software product is designing it. The process approach is the main element of management in organizations that manage public health in Ukraine. At the same time, one of the key aspects of this approach is to ensure the visibility (“transparency”) of the management object (organization or system) through an accurate, sufficient, concise, easy-to-understand and easy-to-analyze description. Obviously, for complex systems, which include all institutions of the public health system in Ukraine at all levels (from regional laboratory centers to the Public Health Center under the Ministry of Health of Ukraine), it is almost impossible to obtain a single description suitable for every case faced by managers. Being multifaceted in the form and content of presentation, an organization (complex system) as a set of interrelated components can be represented by independent, complete “projections”, the number of which is determined by the needs and tasks of management.

To model business processes, the IDEF0 and DFD methodologies were used. The functional model consists of four main elements:
• Process, i.e. a function or sequence of actions that must be taken in order for the data to be processed. This can be creating an order, registering a customer, etc. It is customary to use verbs in the names of processes, i.e.
“Create customer” (not “customer creation”) or “Process order” (not “order processing”). There is no strict system of requirements, unlike, for example, in IDEF0 or BPMN, where notations have a hard-coded syntax, since they can be executable. Still, certain rules should be adhered to so as not to confuse other people reading the DFD.
• External entities. These are any objects that are not included in the system itself, but are a source of information for it or recipients of information from the system after data processing. It can be a person, an external system, any storage media and data storage.
• Data store. Internal data storage for processes in the system. The received data before processing and the result after processing, as well as intermediate values, must be stored somewhere. These are databases, tables, or any other option for organizing and storing data. It will store customer data, customer requests, invoices and any other data that entered the system or is the result of processing.
• Data flow. In the notation, it is displayed in the form of arrows that show what information enters and what leaves a particular block in the diagram.

The functional model of the program complex is shown in Figure 1.

Figure 1: Functional model of system.

The decomposition of the functional diagram is presented in Figure 2.

Figure 2: Decomposition of functional model of system.

Implementing a particular use case requires the participation and interaction of specific instances of actors and classes. The most suitable tools for describing this interaction are sequence and communication diagrams, which essentially represent the same information. For the information system for forecasting the COVID-19 epidemic process, use case diagrams were built; they are presented in Figures 3-5.

Figure 3: Case diagram “Initial data tab”.

Figure 4: Case diagram “System running tab”.

Figure 5: Case diagram “Graph tab”.
The developed information system includes both general data on COVID-19 morbidity in the world and detailed data on COVID-19 morbidity in Ukraine. Worldwide data is automatically loaded from the Johns Hopkins University database, and detailed data on Ukraine is provided by the Center for Public Health of the Ministry of Health of Ukraine (Figure 6). Results of simulating the COVID-19 epidemic process in Ukraine using the polynomial regression model are shown in Figure 7. The results are calculated for 10, 20 and 30 days. Analysis of the experimental study shows that the most accurate result is provided by the 10-day forecast. Still, the other forecasts show sufficient accuracy for the results to be used in public health institutions to plan anti-epidemic measures against the COVID-19 pandemic.

Figure 6: Interface of information system of COVID-19 morbidity forecasting.

Figure 7: Results of COVID-19 simulation with Polynomial regression.

The forecasting errors are presented in Table 1.

Table 1: Forecasting errors
Simulation period (days)    Root Mean Squared Error
10                          65684.09170587435
20                          196632.83601238678
30                          480749.771966806

5. Conclusions

Within the framework of the study, a model for predicting the dynamics of the COVID-19 epidemic process was built on the basis of the polynomial regression method. Based on the model, an information system has been developed that can be implemented in public health institutions to make decisions on the implementation of anti-epidemic measures to reduce the incidence of COVID-19 in Ukraine. The highest accuracy is shown by the forecast built for 10 days. However, the accuracy of the model allows the use of forecasts built for a longer period. Taking into account the incubation period of COVID-19, which averages 14 days, we recommend using a forecast for 20 days, which will also include new contacts with the source of infection.
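The comparison of forecast horizons discussed above can be sketched as a hold-out evaluation. The Python code below is a minimal illustration, not the implementation used in this study: it fits a polynomial to all but the last h days of a series and computes the root mean squared error on the held-out days. The synthetic logistic case curve, the noise level and the polynomial degree of 3 are assumptions, since neither the degree nor the evaluation procedure is specified here.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def holdout_rmse(series, horizon, degree):
    """Fit on all but the last `horizon` points, score on the held-out tail."""
    series = np.asarray(series, dtype=float)
    t = np.arange(series.size) / series.size  # rescale time for stability
    coeffs = P.polyfit(t[:-horizon], series[:-horizon], degree)
    forecast = P.polyval(t[-horizon:], coeffs)
    return rmse(series[-horizon:], forecast)

# Synthetic cumulative-case curve (NOT the Ukrainian data used in the paper).
rng = np.random.default_rng(1)
days = np.arange(200)
cases = 1.0e6 / (1.0 + np.exp(-(days - 100) / 20.0))
cases = cases + rng.normal(0.0, 500.0, size=days.size)

# Assumed polynomial degree; horizons match those reported in Table 1.
errors = {h: holdout_rmse(cases, h, degree=3) for h in (10, 20, 30)}
for h, err in errors.items():
    print(f"{h}-day horizon: RMSE = {err:.1f}")
```

Since RMSE is expressed in the same units as the series (number of cases), its magnitude should be read relative to the scale of the forecast values.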
The advantage of the developed model is its high accuracy in predicting morbidity in comparison with other approaches. The disadvantage of the model is the impossibility of identifying the factors influencing the dynamics of morbidity. Therefore, it is recommended to use the proposed model in an ensemble with other models that allow the informativeness of factors to be analyzed, for example, multi-agent or compartment models. A machine learning model can be used to verify predictions from other approaches, increasing their accuracy.

Acknowledgements

The study was funded by the National Research Foundation of Ukraine in the framework of the research project 2020.02/0404 on the topic “Development of intelligent technologies for assessing the epidemic situation to support decision-making within the population biosafety management” [25].

References

[1] T. Asselah, D. Durantel, E. Pasmant, G. Lau, R.F. Schinazi: COVID-19: Discovery, diagnostics and drug development. Journal of Hepatology 74 (1) (2021) 168-184. doi: 10.1016/j.jhep.2020.09.031.
[2] X. Lu, Y. Xing, G.W. Wong: COVID-19: lessons to date from China. Archives of Disease in Childhood 105 (12) (2020) 1146-1150. doi: 10.1136/archdischild-2020-319261.
[3] Z. Xu, et al.: China shares experience during the COVID-19 outbreak. Burns: journal of the International Society for Burn Injuries 47 (1) (2021) 249-250. doi: 10.1016/j.burns.2020.05.014.
[4] L. Webb: COVID-19 lockdown: A perfect storm for older people's mental health. Journal of Psychiatric and Mental Health Nursing 28 (2) (2021) 300. doi: 10.1111/jpm.12644.
[5] O.O. Woolcott, R.N. Bergman: Mortality Attributed to COVID-19 in High-Altitude Populations. High Altitude Medicine and Biology 21 (4) (2020) 409-416. doi: 10.1089/ham.2020.0098.
[6] M.V. Blagosklonny: From causes of aging to death from COVID-19. Aging (Albany NY) 12 (11) (2020) 10004-10021. doi: 10.18632/aging.103493.
[7] I.
Torjesen: Covid-19: Delta variant is now UK's most dominant strain and spreading through schools. BMJ (Clinical research) 373 (2021) n1445. doi: 10.1136/bmj.n1445.
[8] I. Ali: Impact of COVID-19 on vaccination programs: adverse or positive? Human Vaccines and Immunotherapeutics 16 (11) (2020) 2594-2600. doi: 10.1080/21645515.2020.1787065.
[9] F.M. Russell, B. Greenwood: Who should be prioritised for COVID-19 vaccination? Human Vaccines and Immunotherapeutics 17 (5) (2021) 1317-1321. doi: 10.1080/21645515.2020.1827882.
[10] F. Brauer: The Kermack–McKendrick epidemic model revisited. Mathematical Biosciences 198 (2) (2005) 119-131. doi: 10.1016/j.mbs.2005.07.006.
[11] M.B. Trawicki: Deterministic Seirs Epidemic Model for Modeling Vital Dynamics, Vaccinations, and Temporary Immunity. Mathematics 5 (1) (2017) 7. doi: 10.3390/math5010007.
[12] L.J.S. Allen: A primer on stochastic epidemic models: Formulation, numerical simulation, and analysis. Infectious Disease Modelling 2 (2) (2017) 128-142. doi: 10.1016/j.idm.2017.03.001.
[13] H. Esakandari, M. Nabi-Afjadi, J. Fakkari-Afjadi, N. Farahmandian, S.M. Miresmaeili, E. Bahreini: A comprehensive review of COVID-19 characteristics. Biological Procedures Online 22 (2020) 1–10.
[14] M. Park, A.R. Cook, J.T. Lim, Y. Sun, B.L. Dickens: A systematic review of COVID-19 epidemiology based on current evidence. Journal of Clinical Medicine 9 (4) (2020) 967.
[15] M.N. Islam, T.T. Inan, S. Rafi, S.S. Akter, I.H. Sarker, A.K.M.N. Islam: A Survey on the Use of AI and ML for Fighting the COVID-19 Pandemic. arXiv e-prints (2020), arXiv–2008.
[16] T.T. Nguyen: Artificial intelligence in the battle against coronavirus (COVID-19): a survey and future research directions. (2020) arXiv:2008.07343.
[17] I. Izonin, et al.: Predictive modeling based on small data in clinical medicine: RBF-based additive input-doubling method. Mathematical Biosciences and Engineering 18 (3) (2021) 2599-2613. doi: 10.3934/mbe.2021132.
[18] Y. Mao, S. Jiang, D.
Nametz: Data-driven Analytical Models of COVID-2019 for Epidemic Prediction, Clinical Diagnosis, Policy Effectiveness and Contact Tracing: A Survey. (2020).
[19] O. Byambasuren, et al.: Estimating the Extent of True Asymptomatic COVID-19 and Its Potential for Community Transmission: Systematic Review and Meta-Analysis. Journal of the Association of Medical Microbiology and Infectious Disease Canada 5 (4) (2020) 223–234.
[20] A.K. Arshadi, et al.: Artificial Intelligence for COVID-19 Drug Discovery and Vaccine Development. Frontiers in Artificial Intelligence 3 (2020) 65.
[21] H. Li, S. Yamamoto: Polynomial regression based model-free predictive control for nonlinear systems. 2016 55th Annual Conference of the Society of Instrument and Control Engineers of Japan (SICE) (2016) 578-582. doi: 10.1109/SICE.2016.7749264.
[22] S. Kavitha, S. Varuna, R. Ramya: A comparative analysis on linear regression and support vector regression. 2016 Online International Conference on Green Engineering and Technologies (IC-GET) (2016) 1-5. doi: 10.1109/GET.2016.7916627.
[23] V.P. Mashtalir, et al.: Group structures on quotient sets in classification problems. Cybernetics and Systems Analysis 50 (4) (2014) 507-518. doi: 10.1007/s10559-014-9639-z.
[24] S.N. Gerasin, et al.: Set coverings and tolerance relations. Cybernetics and Systems Analysis 44 (3) (2008) 333-340. doi: 10.1007/s10559-008-9007-y.
[25] S. Yakovlev, et al.: The concept of developing a decision support system for the epidemic morbidity control. CEUR Workshop Proceedings 2753 (2020) 265–274.