Air Pollution Prediction as a Source for Decision Making Framework in Medical Diagnosis
Valerii Lovkin^a, Andrii Oliinyk^a and Yurii Lukashenko^a
^a National University “Zaporizhzhia Polytechnic”, Zhukovsky str., 64, Zaporizhzhia, 69063, Ukraine
Abstract
The paper addresses the problem of air pollution prediction, considered as part of the broader problem of creating a decision making framework in medical diagnosis. For this reason, prediction is performed for a day rather than for an hour. A method of air pollution prediction is developed using a Long Short Term Memory (LSTM) recurrent neural network. The LSTM-based model predicts the concentration of a separate air pollutant during the next day based on its concentration during the previous hours and average traffic data. The proposed method is investigated experimentally by comparing it with an ARIMA model, a multilayer perceptron, a vanilla recurrent neural network and an LSTM model. The proposed method is intended for practical use inside medical diagnosis tools and separate systems for air pollution analysis, where it provides the predicted air pollutant concentration level for the next day.
Keywords
Air pollution, road traffic, medical diagnosis, decision making framework, prediction, machine learning, long short term memory.
1. Introduction
Despite its already wide spread, urbanization continues to grow. This process results in a significant rise in the concentration of population and human activity within relatively small areas. High human activity leads to large-scale economic changes, as well as to large emissions of heat, gases and waste, which pollute the air. The consequence of this process is a harmful impact on human health [1]. The described processes are already typical not only for industrial cities, but also for urban centers where industrial production is less influential.
An analysis of the air quality index (AQI) in different cities of the world shows that the cities with low air quality are not only industrial centers. The list of the most polluted cities also includes cities in Norway (Oslo), Poland (Krakow) and Croatia (Zagreb) [2]. In the context of Ukraine, it should be noted that the level of pollution in Kyiv, the largest city of the country, currently exceeds that of the country's industrial centers during some periods of time. All these factors show that a huge number of people in the world are affected by air pollution, and that the problem of determining the air quality level is widespread and important.
Air pollution [3] is determined by the concentration of particles and gases in the air [4]. Regarding the problem of air pollution, it is important to monitor the current situation, to analyze the accumulated data, and to predict the future level of air pollution. Such prediction is significant in both the short and the long term. Prediction for medical diagnosis [5] is appropriate for a whole day or longer because of the specific character of medical examination and of the decisions made on treatment. The paper is aimed at prediction of air pollution during the next day.
IntelITSIS’2021: 2nd International Workshop on Intelligent Information Technologies and Systems of Information Security, March 24–26, 2021, Khmelnytskyi, Ukraine
EMAIL: vliovkin@gmail.com (V. Lovkin); olejnikaa@gmail.com (A. Oliinyk); lukashenkoyuriii@gmail.com (Y. Lukashenko)
ORCID: 0000-0002-6890-2807 (V. Lovkin); 0000-0002-6740-6078 (A. Oliinyk); 0000-0002-2478-4597 (Y. Lukashenko)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)
At the same time, the whole obtained, calculated and predicted dataset enables a complex solution of the problems of city management and medical diagnosis, because a person and a city, as well as the biosphere in general, are ultimately the main objects affected by air pollution. This paper considers air pollution in terms of creating a decision-making framework for medical diagnosis, where prediction of the level of air pollution is relevant for determining the individual impact of air pollution on the patient.
2. Air pollution prediction problem statement
Medical diagnosis mainly consists in determining the patient's diagnosis. The obtained diagnosis becomes the basis for the decisions made by the doctor concerning further treatment of the patient. To determine the diagnosis, it is necessary to form a heterogeneous set of data which characterizes the observed situation. On the one hand, this data is linked with subjective information about the patient, which is determined during a survey and various types of examination; on the other hand, it is linked with the environment where the patient lives. The level of air pollution is one of the main indicators describing such an environment. Depending on the environmental conditions, it is possible to plan the specifics of the medical examination, whose results are used in decision making in diagnosis, and to determine the specifics of the implementation of decisions made based on the diagnosis results. The whole set of decisions [6] made during a medical diagnosis forms a decision-making framework consisting of the following stages:
• making decisions concerning the specifics of the planned medical examination;
• making decisions concerning the choice of diagnosis methods for the patient;
• making decisions concerning the determination of the patient's condition;
• making decisions concerning the further treatment of the patient.
This group of decisions requires, on the one hand, the accumulation of historical data on air pollution, i.e. indicators of air pollution by certain substances collected at the relevant stations, and, on the other hand, prediction of the future level of air pollution at these stations.
The problem of air pollution prediction is stated as the determination of a functional dependence between the air pollutant concentration level during the next period of time and its concentration level during the previous periods of time together with an additional parameter $e$:
$$p_a^c(t, t+n-1) = f\big(p_a(t-1), p_a(t-2), \ldots, p_a(t-n), e(t-n, t-1)\big), \tag{1}$$
where $p_a^c(t, t+n-1)$ is the average level of concentration of air pollutant $a$ calculated for the time period from $t$ to $t+n-1$; $p_a(t-1), p_a(t-2), \ldots, p_a(t-n)$ are the concentrations of air pollutant $a$ at the corresponding discrete moments $t-1, t-2, \ldots, t-n$; $f$ is the functional dependence which has to be found in the study; $e(t-n, t-1)$ is an additional parameter which represents additional factors influencing $p_a^c$, is calculated for the time period from $t-n$ to $t-1$, and does not depend on the corresponding air pollutant.
Each air pollutant $a$ is an element of the set of air pollutants $A$, which includes gases and particulates. This set should include all air pollutants which are factors of decision making in medical diagnosis.
The form of the functional dependence $f$ should be investigated based on machine learning methods. The main attention should be paid to recurrent neural networks and LSTM recurrent neural networks because of the sequential nature of $p_a(t-1), p_a(t-2), \ldots, p_a(t-n)$ in problem (1).
At the same time, these separate air pollutants form the Air Quality Index (AQI), which is represented by a categorical value used for monitoring and decision making.
AQI is obtained on the basis of the levels of the following pollutants [7]:
• ground-level ozone (O3);
• particle pollution (PM2.5 and PM10);
• carbon monoxide (CO);
• sulfur dioxide (SO2);
• nitrogen dioxide (NO2).
This means that accurate prediction of the concentration of these pollutants is critical for accurate prediction of AQI.
3. Related works
A number of studies on air pollution prediction using machine learning methods have been conducted.
The concentration of PM2.5 is estimated in the study [8] using regression models. The proposed models are intended for countries where it is not possible to use costly sensors to monitor air pollution and to create the dataset necessary for prediction by sequence-based methods. Prediction is performed using real-time traffic monitoring based on Google Maps. Separate models were built for different periods of the day. Trace gas concentrations are considered only as additional data for environments where they can be collected. Regression models cannot take into account complex relations between data sequences and the different types of factors which influence air pollution.
Prediction of the concentration of particulate matter using regression models was extended to PM10 in the study [9]. Regression models were used to predict the concentration level during the next day.
A method for pattern analysis using dynamic time warping was proposed in the study [10]. This method needs data on PM2.5 concentration from multiple stations, and prediction is performed based on similarity between stations. The k-nearest neighbour method is used, in which the distance between stations is calculated by dynamic time warping using their geographical coordinates.
In the paper [11], a support vector regression model was used to predict the concentration of separate air pollutants and to predict the general pollution level based on the AQI.
The following air pollutants were studied: carbon monoxide, sulfur dioxide, nitrogen dioxide, ground-level ozone and PM2.5. Prediction was performed on an hourly basis. Appropriate results were obtained only for O3, CO and SO2, so this approach cannot be recommended for universal usage.
The study [12] is dedicated to the relationship between air pollution and urban transport networks. An artificial neural network model based on a multilayer perceptron and the ARIMAX model are compared experimentally. Prediction is performed for an hour. It is proposed to use an ensemble of both models to process specific situations. Such an ensemble actually models the influence of the transport network on nitrogen dioxide concentration in the city air, so it takes into account neither other factors which influence air quality nor other air pollutants. Besides, such a model does not consider the sequences which exist in the history of air pollutant concentration.
A deep learning model based on LSTM neural networks was investigated in the paper [13], where it is presented in the context of the Internet of Things concept [14, 15]. The proposed model is aimed at AQI prediction, so the obtained results are categorical. During the experimental investigation, separate LSTM models were created for ozone and nitrogen dioxide. The obtained results indicated that the sequence-based approach to air quality prediction is promising and could be used in practice.
An LSTM-based model is used in the paper [16] to predict PM2.5 concentration in the air of South Korea. The prediction was performed for long-term periods. Different time horizons, including 8, 12, 16, 20 and 24 hours, were investigated. This confirmed the possibility of predicting air pollution for intervals longer than 1 hour.
In the paper [17], LSTM neural networks and deep autoencoders were used for PM concentration prediction. PM2.5 and PM10 were investigated using datasets of Seoul.
Prediction was performed for the 10 days following the studied period. During the experimental investigation LSTM models demonstrated better results, so there is no practical need to use deep autoencoders for the air pollution prediction problem.
The study [18] is aimed at road traffic prediction based on air pollution. CO, NO, NO2, NOx and O3 are the observed air pollutants. Prediction was based on an LSTM neural network architecture. However, air pollution is not a cause of road traffic but its consequence. So it should be possible to improve air pollution prediction using road traffic data, because road traffic is one of the main causes of polluted air in big cities.
A further study of the features which could be used by an LSTM model to perform prediction for a day is needed. The problem of feature selection was considered in the authors' study [19] and should be applied to air pollution.
4. Proposed method of LSTM-based air pollution prediction using traffic data
Air pollution prediction is performed separately for each air pollutant, therefore a prediction model should be created for each air pollutant from the set A. Based on the study of the nature of air pollution and on the literature review, a hypothesis on the dependence between air pollutant concentration and traffic data was put forward. To analyze this hypothesis further, a separate investigation of the correlation between city traffic data and the concentration of air pollutants was performed. The investigated correlation is positive. Its visualization is presented in Figure 1.
Figure 1: Graph of correlation between road traffic and concentration of air pollutants (CO, NO2, PM10, SO2)
The problem (1) is solved using an LSTM model [20, 21, 22, 23]. This decision was made because the input data are sequential in nature, so the problem is a time series prediction problem. The proposed structure of the model is presented in Figure 2.
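The correlation check between road traffic and pollutant concentration mentioned in this section can be illustrated with a short sketch. The paper does not state which correlation coefficient was used, so the Pearson coefficient is an assumption here; the function name `pearson_correlation` and the sample numbers are hypothetical and are not taken from the Madrid data.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up daily values for illustration only.
daily_traffic = [1200.0, 1500.0, 900.0, 1800.0, 1600.0]
daily_no2 = [38.0, 45.0, 30.0, 52.0, 47.0]
r = pearson_correlation(daily_traffic, daily_no2)
print(f"Pearson r = {r:.3f}")
```

A value of `r` above zero corresponds to the positive correlation reported above; a rank-based coefficient could be substituted if the relationship is monotonic but not linear.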
Figure 2: Structure of the model which is used for air pollutant prediction in the proposed method
The proposed model consists of two LSTM layers: the first layer of neurons interacts with the input data, and the second layer is a hidden layer. The first layer is built from 8 neurons. Each input neuron receives the arithmetic mean of the air pollutant concentration over 3 hours. The obtained values are then processed by the hidden layer, which consists of 2 LSTM neurons. The amount of traffic affects one of the neurons of the hidden layer, and hence affects the final prediction.
The dataset needed for model training in the proposed method should be prepared in the following way. Each sample represents the values of the parameters during a day and the air pollutant concentration for the following day, registered at a separate station. The parameters include one value of average daily traffic in the region of the station and 8 values of the air pollutant concentration during the day. 8 values are used instead of 24 because fluctuations of an air pollutant within 3 hours are insignificant. The air pollutant concentration during the following day is represented by the median of its 24 hourly concentration values. The obtained dataset is normalized before model training.
In practice, concentration data for the current day should be registered at one station and passed to the trained models, one for each air pollutant. The models produce predictions for the following day. The obtained values of the concentration of each air pollutant should be used for decisions on the specifics of patient treatment.
5. Results
The main dataset [24] used for the experimental investigation contains air pollutant concentration levels registered on an hourly basis during 18 years (from 2001 to 2018) at different stations in Madrid. Approximately 150 thousand measurements were performed for each air pollutant at a station.
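The sample layout described in Section 4 (eight 3-hour means of the current day, one average daily traffic value, and the median of the following day as the target) can be sketched from hourly records such as those in this dataset. This is a minimal illustration under stated assumptions: the function names are hypothetical, the paper only says that the dataset "is normalized", so min-max scaling is an assumption, and the way the traffic value enters the network is not covered here.

```python
from statistics import mean, median

HOURS_PER_DAY = 24
BLOCK_HOURS = 3  # 24 hourly values / 3 = 8 inputs per day


def three_hour_means(hourly):
    """Collapse 24 hourly concentrations into 8 three-hour arithmetic means."""
    if len(hourly) != HOURS_PER_DAY:
        raise ValueError("expected 24 hourly values")
    return [
        mean(hourly[start:start + BLOCK_HOURS])
        for start in range(0, HOURS_PER_DAY, BLOCK_HOURS)
    ]


def build_sample(day_hourly, next_day_hourly, avg_daily_traffic):
    """Return (inputs, traffic, target) for one station and two consecutive days."""
    if len(next_day_hourly) != HOURS_PER_DAY:
        raise ValueError("expected 24 hourly values for the following day")
    # Target: median of the 24 hourly values of the following day.
    return three_hour_means(day_hourly), avg_daily_traffic, median(next_day_hourly)


def min_max_normalize(values):
    """Scale values to [0, 1]; min-max scaling is an assumption, not the paper's stated choice."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Synthetic hourly values for illustration only.
today = [float(hour) for hour in range(24)]
tomorrow = [10.0] * 12 + [20.0] * 12
inputs, traffic, target = build_sample(today, tomorrow, avg_daily_traffic=1500.0)
print(inputs)
print(target)
```

In a real pipeline the normalization bounds would be computed per feature on the training sample and reused for the test sample, so that no information from the test data leaks into training.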
Air pollutants presented in the dataset include SO2, CO, NO, NO2, PM2.5, PM10, NOx, O3, TOL, BEN, EBE, MXY, PXY, OXY, TCH, CH4 and NMHC [24]. Not all stations in this dataset had measurements for the full list of air pollutants included in AQI during the full period from 2001 to 2018: some measurements are missing because of the absence of equipment, its repair or unavailability. An example of missing (white) and obtained (black) values is presented in Figure 3.
Figure 3: Missing values of concentration of air pollutants in the dataset under investigation
For the experimental investigation, only stations with equipment for registration of all air pollutants from AQI were chosen. Missing values were replaced by averaging. Traffic data for the chosen stations were obtained from Madrid's City Council Open Data website [25]. The whole dataset was divided into a training sample (80 %) and a test sample (20 %).
During the experimental investigation the following models and methods were used: an ARIMA model, an artificial neural network based on a multilayer perceptron [22], a vanilla recurrent neural network [21], an LSTM model [20], and the proposed method, which uses the same LSTM model with traffic data as an additional parameter. The software for the investigation was developed in the Python programming language. The Keras library was used to implement the neural network models.
The obtained results were estimated using the root mean square error (RMSE) and the mean absolute error (MAE):
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (E_i - A_i)^2}, \tag{2}$$
$$MAE = \frac{1}{n} \sum_{i=1}^{n} |E_i - A_i|, \tag{3}$$
where $n$ is the number of samples in the test dataset, $A_i$ is the actual value (the i-th sample from the test dataset), and $E_i$ is the i-th predicted value.
The results of the experimental investigation using these metrics are collected in Table 1.
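The metrics (2) and (3) can be computed directly from paired lists of predicted and actual values. The sketch below uses only the Python standard library; the function names and the sample numbers are illustrative and do not reproduce the paper's experiment.

```python
import math


def _check_lengths(predicted, actual):
    """Both sequences must be non-empty and of the same length."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equally long")


def rmse(predicted, actual):
    """Root mean square error, formula (2)."""
    _check_lengths(predicted, actual)
    squared = sum((e - a) ** 2 for e, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def mae(predicted, actual):
    """Mean absolute error, formula (3)."""
    _check_lengths(predicted, actual)
    return sum(abs(e - a) for e, a in zip(predicted, actual)) / len(actual)


# Made-up values: the errors are 2, 0 and -3.
predicted = [3.0, 5.0, 7.0]
actual = [1.0, 5.0, 10.0]
print(f"RMSE = {rmse(predicted, actual):.4f}")
print(f"MAE  = {mae(predicted, actual):.4f}")
```

Because RMSE squares the errors before averaging, it penalizes large individual errors more strongly than MAE, which is why the two metrics are usually reported together.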
Table 1
Comparison of RMSE and MAE values calculated for the existing models and the proposed method for air pollutant O3
Prediction model/method     RMSE     MAE
ARIMA model                 15.33    10.95
Multilayer perceptron       19.10    12.87
Recurrent neural network    14.95    10.63
LSTM                        13.87     9.13
The proposed method         12.71     7.22
The obtained results demonstrate that the LSTM model is characterized by better values of RMSE and MAE than the ARIMA model, which is a classic solution for the time series prediction problem, the traditional multilayer perceptron and the vanilla recurrent neural network. The ARIMA model gave better results than the multilayer perceptron. At the same time, the additional use of traffic data as model input made it possible to perform prediction with an RMSE which is 8.36 % smaller (12.71 against 13.87) and an MAE which is 20.92 % smaller (7.22 against 9.13) than for the LSTM model without the additional parameter.
Another metric was proposed to estimate the accuracy of prediction of the concentration of all air pollutants from AQI. It is calculated as the percentage of samples from the dataset for which the prediction error does not exceed a limit (for example, 0.5 µg/m³ for O3). The obtained results are presented in Table 2.
Table 2
Comparison of prediction accuracy for the test sample
Prediction model/method     Accuracy, %
ARIMA model                 86.35
Multilayer perceptron       83.96
Recurrent neural network    86.75
LSTM                        88.62
The proposed method         90.98
The proposed method shows the best results among the considered models and methods. Its accuracy is 4.63 percentage points higher than the accuracy of the ARIMA model and 2.36 percentage points higher than the accuracy of the LSTM model.
The prediction made for the Arturo Soria station by the LSTM model created without traffic data and by the model created with traffic data according to the proposed method is presented in Figure 4. To emphasize the differences between the character of the values of the previous day and of the next day for which prediction is performed, predictions are visualized with a day interval.
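The accuracy measure reported in Table 2 can be sketched as the share of samples whose absolute prediction error stays within a per-pollutant limit. The function name `tolerance_accuracy` and the sample values are hypothetical; the paper gives a limit only for O3 as an example, so limits for other pollutants would have to be chosen separately.

```python
def tolerance_accuracy(predicted, actual, limit):
    """Percentage of samples with absolute prediction error not larger than `limit`."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equally long")
    if limit < 0:
        raise ValueError("limit must be non-negative")
    hits = sum(1 for e, a in zip(predicted, actual) if abs(e - a) <= limit)
    return 100.0 * hits / len(actual)


# Made-up values: three of the four errors are within the 0.5 limit.
predicted = [10.2, 11.0, 9.4, 12.0]
actual = [10.0, 10.0, 9.0, 12.0]
accuracy = tolerance_accuracy(predicted, actual, limit=0.5)
print(f"accuracy = {accuracy:.2f} %")
```

Unlike RMSE and MAE, this measure does not distinguish between an error slightly above the limit and a very large one, so it complements the error metrics rather than replacing them.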
Figure 4: Predicted values of O3 concentration in Arturo Soria station
6. Conclusion
The air pollution prediction problem is considered from the point of view of decision making in medical diagnosis. The main features of such decisions within the decision making framework are presented. A mathematical formalization of the air pollution prediction problem is given, and a method for its solution is presented. The prediction model in the method is built on an LSTM neural network and consists of 2 LSTM layers. Road traffic data is used as an additional description of the environment, as a factor which affects air pollution. The data preparation procedure is described in the paper.
The proposed method is investigated experimentally using a dataset collected in Madrid during 18 years. An ARIMA model, a multilayer perceptron, a vanilla recurrent neural network and an LSTM model are used as alternatives. The model trained according to the proposed method gave better results, including smaller values of RMSE and MAE and a higher accuracy level. The proposed method is intended for practical use inside medical diagnosis tools and separate systems for air pollution analysis, where it predicts the air pollutant concentration level during the next day.
7. Acknowledgments
The work was performed as part of the research work "Development of methods and tools for analysis and prediction of dynamic behavior of nonlinear objects" (state registration number 0121U107499) of the Software Tools Department of National University “Zaporizhzhia Polytechnic”. We are particularly grateful for the assistance with the data sample given by Diego Vicente, Junior Data Scientist at Decide Soluciones in Madrid, Spain.
8. References
[1] R. F. Phalen, R. N. Phalen, Introduction to Air Pollution Science: A Public Health Perspective, Jones & Bartlett Learning, Burlington, MA, 2011.
[2] Air quality and pollution city ranking, 2021, URL: https://www.iqair.com/world-air-quality-ranking.
[3] D. A. Vallero, Fundamentals of Air Pollution, 5th ed., Academic Press, Waltham, MA, 2014.
[4] Air Pollution: MedlinePlus, 2021, URL: https://medlineplus.gov/airpollution.html.
[5] A. Oliinyk, S. Subbotin, V. Lovkin, S. Leoshchenko, T. Zaiko, Development of the indicator set of the features informativeness estimation for recognition and diagnostic model synthesis, in: Advanced Trends in Radioelectronics, Telecommunications and Computer Engineering: 14th International Conference TCSET'2018, Lviv-Slavske, Ukraine, 2018, pp. 903-908. doi: 10.1109/TCSET.2018.8336342.
[6] T. Kolpakova, A. Oliinyk, V. Lovkin, Improved method of group decision making in expert systems, in: 2017 IEEE First Ukraine Conference on Electrical and Computer Engineering (UKRCON), Kyiv, Ukraine, 2017, pp. 939-943. doi: 10.1109/UKRCON.2017.8100388.
[7] Air Quality Index (AQI) Basics, 2021, URL: https://www.airnow.gov/aqi/aqi-basics/.
[8] Y. Rybarczyk, R. Zalakeviciute, Regression Models to Predict Air Pollution from Affordable Data Collections, in: H. Farhadi (Ed.), Machine Learning - Advanced Techniques and Emerging Applications, InTech, London, 2018, pp. 15-48. doi: 10.5772/intechopen.71848.
[9] M. T. Lei, J. Monjardino, L. Mendes, D. Gonçalves, F. Ferreira, Macao air quality forecast using statistical methods, Air Quality, Atmosphere & Health 3 (2019) 249-258. doi: 10.2495/EI-V2-N3-249-258.
[10] P.-W. Soh, K.-H. Chen, J.-W. Huang, H.-J. Chu, Spatial-temporal pattern analysis and prediction of air quality in Taiwan, in: 2017 10th International Conference on Ubi-media Computing and Workshops (Ubi-Media), Pattaya, Thailand, 2017, pp. 1-6. doi: 10.1109/UMEDIA.2017.8074094.
[11] M. Castelli, F. Martins Clemente, A. Popovič, S. Silva, L. Vanneschi, A Machine Learning Approach to Predict Air Quality in California, Complexity 2020 (2020) 1-23. doi: 10.1155/2020/8049504.
[12] M. Catalano, F. Galatioto, M. Bell, A. Namdeo, A. Bergantino, Improving the prediction of air pollution peak episodes generated by urban transport networks, Environmental Science & Policy 60 (2016) 69-83. doi: 10.1016/j.envsci.2016.03.008.
[13] I. Kok, M. Simsek, S. Ozdemir, A deep learning model for air quality prediction in smart cities, in: 2017 IEEE International Conference on Big Data (Big Data), Boston, MA, USA, 2017, pp. 1983-1990. doi: 10.1109/BigData.2017.8258144.
[14] J. A. Alsayaydeh, V. Shkarupylo, M. S. Hamid, S. Skrupsky, A. Oliinyk, Stratified Model of the Internet of Things Infrastructure, Journal of Engineering and Applied Sciences 13 (2018) 8634-8638. doi: 10.3923/jeasci.2018.8634.8638.
[15] J. A. Alsayaydeh, M. Nj, S. N. Syed, A. W. Yoon, W. A. Indra, V. Shkarupylo, C. Pellipus, Homes appliances control using bluetooth, ARPN Journal of Engineering and Applied Sciences 14 (2019) 3344-3357.
[16] T.-C. Bui, V.-D. Le, S. K. Cha, A Deep Learning Approach for Forecasting Air Pollution in South Korea Using LSTM, 2018, URL: https://arxiv.org/abs/1804.07891.
[17] T. Xayasouk, H. Lee, G. Lee, Air Pollution Prediction Using Long Short-Term Memory (LSTM) and Deep Autoencoder (DAE) Models, Sustainability 12 (2020) 2570-2577. doi: 10.3390/su12062570.
[18] F. Awan, R. Minerva, N. Crespi, Improving Road Traffic Forecasting Using Air Pollution and Atmospheric Data: Experiments Based on LSTM Recurrent Neural Networks, Sensors 20 (2020) 3749-3769. doi: 10.3390/s20133749.
[19] A. Oliinyk, S. Subbotin, V. Lovkin, S. Leoshchenko, T. Zaiko, Feature selection based on parallel stochastic computing, in: 2018 IEEE 13th International Scientific and Technical Conference on Computer Sciences and Information Technologies, CSIT 2018 - Proceedings, Lviv, Ukraine, 2018, pp. 347-351. doi: 10.1109/STC-CSIT.2018.8526729.
[20] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, The MIT Press, Cambridge, Massachusetts, 2017.
[21] J. D. Kelleher, Deep Learning, The MIT Press, Cambridge, Massachusetts, 2019.
[22] C. C. Aggarwal, Neural Networks and Deep Learning: A Textbook, Springer, Yorktown, NY, 2018.
[23] S. Leoshchenko, A. Oliinyk, S. Subbotin, T. Zaiko, Using Modern Architectures of Recurrent Neural Networks for Technical Diagnosis of Complex Systems, in: Proceedings of the 2018 International Scientific-Practical Conference Problems of Infocommunications. Science and Technology (PIC S&T), Kharkiv, Ukraine, 2018, pp. 411-416. doi: 10.1109/INFOCOMMST.2018.8632015.
[24] Air Quality in Madrid (2001-2018), 2018, URL: https://www.kaggle.com/decide-soluciones/air-quality-madrid.
[25] En portada – Portal de datos abiertos del Ayuntamiento de Madrid, 2021, URL: https://datos.madrid.es/portal/site/egob.