Application of machine learning methods to the prediction of NO₂ concentration in the air environment

Iryna Didych1,∗,†, Andrii Mykytyshyn1,∗,†, Andrii Stanko1,†, Mykola Mytnyk1,†

1 Ternopil Ivan Puluj National Technical University, Ruska 56, 46001 Ternopil, Ukraine

Abstract
Air quality significantly impacts public health, and nitrogen dioxide (NO₂) is a key pollutant linked to respiratory and cardiovascular diseases. In this study, we developed a machine learning model to predict hourly NO₂ concentrations in Ternopil, Ukraine, using readily available meteorological and temporal data. The model was trained and tested on 541 measurements from the Ecocity monitoring station, a site known for recording NO₂ levels that exceed legal limits. Using neural networks, the model predicted NO₂ concentrations with errors of 3.9% and 1.4% on the two test samples (a held-out day and a randomly selected sample, respectively). Our findings underscore the potential of machine learning techniques to enhance air quality monitoring and forecasting, particularly in urban areas with limited resources. This approach offers a valuable tool for real-time pollution management and public health protection.

Keywords
Air quality, nitrogen dioxide, prediction, machine learning

1. Introduction
Air quality is a complex, multifactorial set of chemical, physical, and biological characteristics of air, and it is a highly relevant topic because of its connection to human health. Numerous studies have demonstrated the link between cardiovascular and lung diseases and long-term exposure to pollutants, in particular nitrogen dioxide (NO₂) and particulate matter (PM2.5 and PM10). According to the European Environment Agency [1], about 55,000 premature deaths in the EU in 2018 could be attributed to exposure to NO₂.
The results of several clinical and epidemiological studies provide at least moderate evidence that adverse health effects occur even with short-term exposure to pollutants, including exposure below established limits [2]. Increasing concentrations of pollutants in the atmosphere have changed its properties, making it a harmful environment for humans and other living organisms [3]. Pollutants include a variety of gases, droplets, and particles that degrade air quality; human exposure to them is therefore believed to lead to serious health problems, especially in urban areas where pollution levels are high [4].

ITTAP’2024: 4th International Workshop on Information Technologies: Theoretical and Applied Problems, October 23-25, 2024, Ternopil, Ukraine, Opole, Poland
∗ Corresponding author.
† These authors contributed equally.
iryna.didych@tntu.edu.ua (I. Didych); mikitishin@gmail.com (A. Mykytyshyn); stanko.andrjj@gmail.com (A. Stanko); mytnyk@networkacad.net (M. Mytnyk)
0000-0003-2846-6040 (I. Didych); 0000-0002-2999-3232 (A. Mykytyshyn); 0000-0002-5526-2599 (A. Stanko); 0000-0003-3743-6310 (M. Mytnyk)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings, ceur-ws.org, ISSN 1613-0073

Air pollutants are chemical, physical (e.g., particulate matter), or biological agents that alter the natural characteristics of the atmosphere. Particulate matter, one such pollutant, is the main factor that negatively affects human health because of its high toxicity. Air pollution results in the presence of certain gases in the atmosphere in concentrations that exceed the standard and can be seriously harmful to human health. Examples of such pollutants are nitrogen oxides, sulfur oxides, carbon monoxide, photochemical oxidants (e.g.
ozone), lead, various heavy metals, and volatile organic compounds, which are released into the atmosphere by industry and transport and thereby degrade air quality. Air quality is defined as the state of the ambient atmosphere as affected by pollution from these sources. It is considered good when the air meets a certain level of purity and transparency and contains no pollutants such as smoke, dust, smog, or other impurities.

The Air Quality Index (AQI) is a numerical indicator used by government agencies to inform the public about the current state of the air or projected levels of air pollution. When the AQI rises because of an increase in air pollutants (e.g., during peak traffic hours or downwind of wildfires), a growing proportion of the population can quickly experience serious negative health effects [3]. The US Environmental Protection Agency has classified these air pollutants into six main categories [5].

Continuous real-time monitoring of air quality is also a relevant issue, as noted in [6] and [7]. Several studies have been conducted to monitor air quality in the environment. In this work, air quality was assessed in the city of Ternopil, Ukraine, at Ternopil Ivan Puluj National Technical University. The purpose of this study is to assess and analyse the impact of atmospheric pollution on air quality and to establish its dependence on atmospheric factors.
Natural sources of NO₂ produce only very low background concentrations. Emissions associated with anthropogenic activities are the most significant factor affecting human health. The main sources of NO₂ are combustion processes used for heating, electricity generation, and internal combustion engines. In particular, motor vehicles are the main sources of nitrogen oxides [8]. It is also important to investigate how the NO₂ concentration depends on changes in ambient temperature and humidity. Such an analysis requires models that reflect air quality accurately enough to identify periods of increased pollution and to respond quickly to short-term fluctuations, especially in urban areas.

Prediction of pollutants by deterministic methods is accompanied by significant uncertainties because of the complexity of the physical and chemical processes that determine the formation and transport of pollutants in the urban atmosphere [9-11]. For this reason, machine learning methods are becoming increasingly common in air quality modelling, outperforming traditional statistical approaches. The review by Cabaneros et al. [12] identified 139 papers published since 2001 that used artificial neural networks (ANNs) to model pollutants. Of these, 51 studies applied the method to predict nitrogen oxides, while the others focused on modelling pollutants such as particulate matter (PM), carbon dioxide (CO₂), and ozone (O₃). ANNs are capable of detecting complex, nonlinear relationships between meteorological variables and pollutant concentrations, and of generalizing information from training datasets to form functional relationships between variables, even if the nature of these relationships is unknown.
Unlike regression analysis, ANNs work effectively in the presence of significant noise in the data [13]. The first successful applications of ANNs for modelling NO₂ concentrations in urban areas were presented by Gardner and Dorling [14], Kolehmainen et al. [15], and Perez and Trier [16], who demonstrated the advantages of the proposed approaches over regression models. Since then, many studies have obtained significant results with neural networks for modelling nitrogen oxides, simulating both national emissions over long periods [17] and local emissions on an hourly basis [11]. Some studies have included several air pollutants in their models, such as Jiang et al. [18], where a neural network was combined with a heuristic algorithm to develop an early warning system for five different pollutants.

A key aspect of developing machine learning-based air quality models is the selection of appropriate input parameters. The concentration of NO₂ in urban air is influenced by many variables that reflect meteorological conditions and pollution sources. Several studies have identified meteorological variables such as temperature, humidity, and wind speed as important predictors, along with concentrations of other pollutants [19]. An alternative approach is to use previously measured concentrations of the target pollutant as predictors, relying on temporal autocorrelation between successive values of the same variable. This is especially effective for forecasting several hours in advance [20]. In some studies, this method was combined with long short-term memory (LSTM) recurrent networks to predict NO₂ up to eight hours in advance [21]. Dai et al. [22] combined LSTMs with convolutional neural networks (CNNs) to create a model suitable for predicting six different pollutants. While these models often perform well, they are more computationally intensive than simple feedforward networks.
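The autoregressive approach mentioned above, in which previously measured concentrations of the target pollutant serve as predictors, can be illustrated with a short sketch. The function name make_lagged_samples, the number of lags, the forecast horizon, and the toy series are illustrative assumptions; they are not taken from the cited studies or from the present work, which uses temperature, humidity, and time as inputs.

```python
from typing import List, Sequence, Tuple


def make_lagged_samples(
    series: Sequence[float], n_lags: int, horizon: int = 1
) -> Tuple[List[List[float]], List[float]]:
    """Turn an hourly series into (inputs, targets) for autoregressive forecasting.

    Each input row holds n_lags consecutive values; the target is the value
    `horizon` steps after the last value in that row.
    """
    if n_lags < 1 or horizon < 1:
        raise ValueError("n_lags and horizon must be positive")
    inputs: List[List[float]] = []
    targets: List[float] = []
    last_start = len(series) - n_lags - horizon
    for start in range(last_start + 1):
        inputs.append(list(series[start:start + n_lags]))
        targets.append(series[start + n_lags + horizon - 1])
    return inputs, targets


# Toy hourly NO2 values (hypothetical, for illustration only).
no2 = [0.030, 0.032, 0.035, 0.041, 0.046, 0.044]

# Predict the next hour from the three preceding hours.
X, y = make_lagged_samples(no2, n_lags=3, horizon=1)
```

Each row of X could then be passed to any regression model in place of, or alongside, meteorological inputs. A larger horizon value yields samples for forecasting several hours ahead.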
Other studies have used traffic data obtained from vehicle counts or from other models [23]. Since traffic is one of the main contributors to elevated NO₂ concentrations, traffic statistics have a high predictive value.

2. Materials and Methods

2.1. Study Area
In our study, we set out to develop a model that would accurately estimate hourly NO₂ concentrations in Ternopil, Ukraine, using only readily available standard meteorological and temporal data as input variables. As in many cities, air quality in Ternopil is monitored by several separate measurement stations. In previous years, several of them recorded NO₂ levels that exceeded the regulatory limits. For this study, we chose the Ecocity station in the central part of Ternopil, a known pollution hotspot where NO₂ concentrations often exceed the legal thresholds. This location is particularly important because of the high density of residential development in the immediate vicinity of the station.

Ternopil is located in western Ukraine, on the Seret River, on the Ternopil Plateau of the Podillia Upland of the Eastern European Plain. The city lies in the temperate climate zone, within the broadleaf forest zone. Ternopil has a moderately continental climate with warm and humid summers and mild winters [24].

The air quality monitoring station used in this study is located on a street in the central part of the city. The street is a heavily trafficked main transportation artery connecting the highway with the centre of Ternopil, and it is lined with a very high density of residential buildings. For several months, both the pollutant concentrations measured at this station and the number of days with peak pollution levels exceeded the established permissible limits.

2.2. AirFresh air quality monitoring station
The correct choice of input parameters is crucial for predicting pollutant concentrations.
Ivan Puluj National Technical University in Ternopil (Ukraine), in cooperation with the program “Clean Air for Ukraine” of the NGO Arnika (Prague, Czech Republic), the NGO Free Arduino (Ivano-Frankivsk, Ukraine), and the public monitoring network EcoCity, installed an AirFresh air quality measurement station at the university to expand the network and conduct research. The AirFresh air quality monitoring station is a device for real-time monitoring and recording of ambient air conditions. It records temperature, humidity, and the concentrations of dust microparticles (PM2.5 and PM10), carbon monoxide (CO), ammonia (NH₃), ground-level ozone (O₃), and nitrogen dioxide (NO₂); the NO₂ measurements are shown in Figure 1. Each station may be outfitted with radiation background sensors or additional sensors for the 16 pollutants tracked by EcoCity [25-27].

Figure 1: Daily measurements of nitrogen dioxide (NO₂) levels by the AirFresh station in Ternopil

Measurement conditions, such as the height of the measurement site, have a significant impact on the measured concentration of NO₂. At the monitoring station, pollutant concentrations, temperature, and relative humidity are measured at a height of 3 meters. Experimental data on NO₂ concentrations, temperature, and humidity were collected at the station from August 6 to 13, 2024.

So that the model could recognize temporal patterns in NO₂ fluctuations, including the typical hourly and daily variations of this pollutant, the analysis included time variables that reflect different frequency components in the observed data.

3. Results and discussion
The development of a machine learning model consists of several stages, each of which plays an important role in creating an efficient and accurate model.
The main stages of developing a machine learning model are: data collection; data preprocessing and analysis; selection of a model and machine learning algorithm; splitting the data into training and test samples; model training; and evaluation and validation.

Tracking emissions of NO₂, which is the most active pollutant gas, and predicting its concentration are important steps towards pollution control. Therefore, the nitrogen dioxide (NO₂) concentration was predicted using experimental data obtained from the AirFresh monitoring station. For learning, the dataset was divided into two unequal parts: a training sample and a test sample. The study was carried out in two stages. At the first stage, the training set contained the experimental dependencies of NO₂ concentration on temperature, humidity, and measurement time for six days, and a one-day dataset unknown to the system was used to test the quality of forecasting. At the second stage, the test sample was selected at random by the computer from all the experimental data for the different days (Figure 2).

Figure 2: Comparison of the two predicting architectures for training and testing

The dataset contained 541 elements. At the first stage, the training set contained 504 elements characterising temperature, humidity, and measurement time over N days (six days in our study). The NO₂ concentration was the output parameter predicted by the neural networks. A test set of 37 elements, formed from the experimental NO₂ concentrations for one day, was used to evaluate the quality of prediction. The models were found to make predictions for data that were not used in the training sample, so these results are informative for assessing model quality.
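The two ways of forming the test sample described above (holding out one whole day, or drawing elements at random from all days) can be sketched as follows, together with the percentage error of Eq. (1) that the paper uses to score predictions. This is a minimal illustration under stated assumptions: the record layout, the function names split_by_day, split_random, and mape, the toy values, and the mean-value placeholder predictor are all hypothetical and do not reproduce the authors' software or data.

```python
import random
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, float]  # hypothetical keys: day, hour, temperature, humidity, no2


def split_by_day(
    records: Sequence[Record], test_day: int
) -> Tuple[List[Record], List[Record]]:
    """Stage 1: hold out every record of one day, train on the remaining days."""
    train = [r for r in records if r["day"] != test_day]
    test = [r for r in records if r["day"] == test_day]
    return train, test


def split_random(
    records: Sequence[Record], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Record], List[Record]]:
    """Stage 2: draw the test sample at random from the records of all days."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error, as in Eq. (1)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


# Toy records (hypothetical values): 7 days with 4 measurements per day.
records = [
    {"day": day, "hour": hour, "temperature": 15.0 + 0.5 * hour,
     "humidity": 80.0 - hour, "no2": 0.030 + 0.001 * hour + 0.0005 * day}
    for day in range(1, 8)
    for hour in (0, 6, 12, 18)
]

day_train, day_test = split_by_day(records, test_day=7)
rnd_train, rnd_test = split_random(records, test_fraction=0.2, seed=42)

# Placeholder predictor (not a neural network): the mean training NO2 value.
baseline = sum(r["no2"] for r in day_train) / len(day_train)
baseline_error = mape([r["no2"] for r in day_test], [baseline] * len(day_test))
```

The held-out-day split tests whether a model generalizes to a day it has never seen, whereas the random split mixes all days into both samples, which usually makes the test sample easier to predict.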
Figure 3: The predicted (NO2pred) and experimental (NO2true) concentrations during August 2024 in the test sample, obtained by the neural network method: a) August 13 and b) August 6-13. Axes: NO2_true, ppm/mg/m3 and NO2_pred, ppm/mg/m3.

Figure 4: The predicted and experimental dependences of NO₂ concentrations during August 2024 in the test sample: a) August 13, 2024 and b) August 6-13, 2024. Vertical axis: NO2, ppm/mg/m3; legend: Exp, Pred.

Neural networks were used to relate the predicted concentrations (NO2pred) to the experimental ones (NO2true) (Figures 3 and 4). For the single-day test sample (NO₂, August 13, 2024), the NN method gives an error of 3.9%. Importantly, the points in Figure 3 lie close to the bisector of the first quadrant (the line on which predicted and experimental values are equal), which indicates that the predicted and experimental data are consistent.

The NO₂ concentration was also predicted from temperature, humidity, and time of measurement over seven days. The sample contained 541 elements, of which 80% were randomly selected for the training set and 20% were left to evaluate the quality of the prediction. The parameters of the neural network are shown in Table 1.
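To make the network layout reported in Table 1 concrete, the sketch below runs a forward pass through a multilayer perceptron with three inputs, one hidden layer, and one output (MLP 3-23-1). It assumes that "Tangential" denotes the hyperbolic tangent and that the three inputs are scaled temperature, humidity, and time; the function name mlp_forward, the random weights, and the input values are hypothetical. Training (BFGS minimization of a sum-of-squares error) is not shown.

```python
import math
import random
from typing import List, Sequence


def mlp_forward(
    x: Sequence[float],
    hidden_weights: Sequence[Sequence[float]],
    hidden_biases: Sequence[float],
    output_weights: Sequence[float],
    output_bias: float,
) -> float:
    """Forward pass of a one-hidden-layer MLP: tanh hidden units, exponential output."""
    hidden: List[float] = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(hidden_weights, hidden_biases)
    ]
    z = sum(w * h for w, h in zip(output_weights, hidden)) + output_bias
    return math.exp(z)  # the exponential output is always positive


# Hypothetical, untrained weights for a 3-23-1 layout (illustration only).
rng = random.Random(0)
n_inputs, n_hidden = 3, 23
hidden_weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)] for _ in range(n_hidden)]
hidden_biases = [0.0] * n_hidden
output_weights = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
output_bias = math.log(0.04)  # starts the output near 0.04 (arbitrary choice)

# Hypothetical inputs scaled to [0, 1]: temperature, humidity, time of day.
x = [0.55, 0.70, 0.25]
prediction = mlp_forward(x, hidden_weights, hidden_biases, output_weights, output_bias)
```

An exponential output activation keeps every prediction positive, which suits a concentration. Changing n_hidden to 29 gives the 3-29-1 layout listed for the second dataset.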
Table 1
The parameters of neural network

Dependencies              Name of network   Algorithm of learning   Error function   Function of hidden activation   Function of output activation
NO₂ - August 13, 2024     MLP 3-23-1        BFGS                    SOS              Tangential                      Exponential
NO₂ - August 6-13, 2024   MLP 3-29-1        BFGS                    SOS              Tangential                      Exponential

The prediction error was calculated using the Mean Absolute Percent Error (MAPE) formula:

\[
\mathrm{MAPE} = 100\% \cdot \frac{1}{n} \sum_{i=1}^{n} \frac{\left| y_{\mathrm{true},i} - y_{\mathrm{prediction},i} \right|}{\left| y_{\mathrm{true},i} \right|}, \qquad (1)
\]

where n is the number of elements in the test sample, and y_true,i and y_prediction,i are the i-th experimental and predicted values.

For the randomly selected test sample (NO₂, August 6-13, 2024), the predicted results are in good agreement with the experimental ones. The error of the NN method is 1.4%.

Conclusions
In this study, a machine learning model was developed to predict NO₂ concentrations in Ternopil, Ukraine, from meteorological and temporal data. The concentration of nitrogen dioxide (NO₂) was predicted using experimental data obtained from the AirFresh monitoring station during 6-13 August 2024. Regardless of how the test sample was formed (self-selected or randomly selected by a computer), the forecasting results are in good agreement with the experimental data. The error of the NN method is 3.9% and 1.4%, respectively, in the test samples.

The proposed model allows for real-time forecasting of pollutant concentrations, which is an important tool for monitoring air quality in urban areas with limited resources. The use of such models can be an important step toward creating early warning systems for elevated pollution levels and toward rapid response to short-term fluctuations in the concentration of harmful substances. This could have a positive impact on public health, especially in areas with heavy traffic and high building density.

Future research could focus on integrating additional factors, such as traffic and other pollutants, to enable even more accurate predictions of air quality changes in modern urban ecosystems.

References
[1] González Ortiz, A., Guerreiro, C., & Soares, J. (2020). Air Quality in Europe: 2020 Report. In European Environment Agency.
EU Publications: Luxembourg.
[2] Latza, U., Gerdes, S., & Baur, X. (2009). Effects of nitrogen dioxide on human health: Systematic review of experimental and epidemiological studies conducted between 2002 and 2006. In International Journal of Hygiene and Environmental Health, 212, 271–287.
[3] Okudo, C.C., Ekere, N.R., & Okoye, C.O.B. (2022). Evaluation of Particulate Matter (PM2.5 and PM10) Concentrations in the Dry and Wet Seasons As Indices of Air Quality in Enugu Urban, Enugu State, Nigeria. In Journal of CSN, 47(5), 998–1015.
[4] Impacts of air pollution and acid rain on wildlife. In Air Pollution. http://www.air-quality.org.uk
[5] U.S. Environmental Protection Agency (USEPA). Particulate Matter (PM) Basics. In EPA. http://www.epa.gov/pm-pollution/particulate-matter-pm-basics
[6] Stanko, A., Wieczorek, W., Mykytyshyn, A., Holotenko, O., & Lechachenko, T. (2024). Real-time air quality management: Integrating IoT and Fog computing for effective urban monitoring. CITI’2024: 2nd International Workshop on Computer Information Technologies in Industry 4.0, June 12–14, 2024, Ternopil, Ukraine.
[7] Duda, O., Mykytyshyn, A., Mytnyk, M., & Stanko, A. (2020). The network platform cyber-physical systems application for smart buildings air pollution indicators monitoring. Časopis Manažérska Informatika, Univerzita Komenského v Bratislave, Slovakia, vol. 1, no. 1, 2023, ISSN 2729-8310.
[8] Environmental Protection Agency. (2023). Nitrogen dioxide (NO2) pollution: Basic information about NO2. In EPA. www.epa.gov
[9] Baklanov, A., Molina, L.T., & Gauss, M. (2016). Megacities, air quality and climate. In Atmospheric Environment, 126, 235–249.
[10] Canepa, E., & Builtjes, P.J.H. (2017). Thoughts on Earth System Modeling: From global to regional scale. In Earth-Science Reviews, 171, 456–462.
[11] Arhami, M., Kamali, N., & Rajabi, M.M. (2013). Predicting hourly air pollutant levels using artificial neural networks coupled with uncertainty analysis by Monte Carlo simulations.
In Environmental Science and Pollution Research, 20, 4777–4789.
[12] Cabaneros, S.M., Calautit, J.K., & Hughes, B.R. (2019). A review of artificial neural network models for ambient air pollution prediction. In Environmental Modelling & Software, 119, 285–304.
[13] Wu, Y., & Zhang, Y. (2020). Artificial neural network approaches for modeling air pollutants concentrations: A case study in Jinan, China. In Atmospheric Environment, 224, 117333.
[14] Gardner, M., & Dorling, S. (1999). Neural network modelling and prediction of hourly NOx and NO2 concentrations in urban air in London. In Atmospheric Environment, 33, 709–719.
[15] Kolehmainen, M., Martikainen, H., & Ruuskanen, J. (2001). Neural networks and periodic components used in air quality forecasting. In Atmospheric Environment, 35, 815–825.
[16] Perez, P., & Trier, A. (2001). Prediction of NO and NO2 concentrations near a street with heavy traffic in Santiago, Chile. In Atmospheric Environment, 35, 1783–1789.
[17] Stamenković, L.J., Antanasijević, D.Z., Ristić, M., Perić-Grujić, A.A., & Pocajt, V.V. (2017). Prediction of nitrogen oxides emissions at the national level based on optimized artificial neural network model. In Air Quality, Atmosphere & Health, 10, 15–23.
[18] Jiang, P., Li, C., Li, R., & Yang, H. (2019). An innovative hybrid air pollution early-warning system based on pollutants forecasting and Extenics evaluation. In Knowledge-Based Systems, 164, 174–192.
[19] Ding, W., Zhang, J., & Leung, Y. (2016). Prediction of air pollutant concentration based on sparse response back-propagation training feedforward neural networks. In Environmental Science and Pollution Research, 23, 19481–19494.
[20] Liu, H., Wu, H., Lv, X., Ren, Z., Liu, M., Li, Y., & Shi, H. (2019). An intelligent hybrid model for air pollutant concentrations forecasting: Case of Beijing in China. In Sustainable Cities and Society, 47, 101471.
[21] González-Enrique, J., Ruiz-Aguilar, J.J., Moscoso-López, J.A., Urda, D., Deka, L., & Turias, I.J. (2021).
Artificial neural networks, sequence-to-sequence LSTMs, and exogenous variables as analytical tools for NO2 (air pollution) forecasting: A case study in the Bay of Algeciras (Spain). In Sensors, 21, 1770.
[22] Dai, H., Huang, G., Wang, J., Zeng, H., & Zhou, F. (2021). Prediction of Air Pollutant Concentration Based on One-Dimensional Multi-Scale CNN-LSTM Considering Spatial-Temporal Characteristics: A Case Study of Xi’an, China. In Atmosphere, 12, 1626.
[23] Yeganeh, B., Hewson, M.G., Clifford, S., Tavassoli, A., Knibbs, L.D., & Morawska, L. (2018). Estimating the spatiotemporal variation of NO2 concentration using an adaptive neuro-fuzzy inference system. In Environmental Modelling & Software, 100, 222–235.
[24] Makhortykh, M., & Shevchuk, V. (2013). Ternopil Region: Geographical Features and Climate Overview. In Ukrainian Geographical Journal, 4, 55–67.
[25] EcoCity. https://eco-city.org.ua
[26] NGO “Arnika”. https://arnika.org
[27] Clean Air for Ukraine. https://cleanair.org.ua