Predicting of air pollutant concentrations based on spatio-temporal attention convolutional LSTM networks

Peng Jiang1, Igor Bychkov4, Jun Liu2,3, Alexei Hmelnov4

1 Department of Science and Technology Cooperation, Westlake University, No.18, Shilong Mountain Street, Xihu District, Hangzhou, China
2 “the Belt and Road” Institute for Information Technology, Hangzhou Dianzi University, No.115, Wenyi Road, Xihu District, Hangzhou, China
3 School of Automation (Artificial Intelligence), Hangzhou Dianzi University, No.1158, Number Two Street, Jianggan District, Hangzhou, China
4 Matrosov Institute for System Dynamics and Control Theory of Siberian Branch of Russian Academy of Sciences, 134 Lermontov st. Irkutsk, Russia

E-mail: jiangpenghz@163.com

Abstract. Forecasting of air pollutant concentration, which is influenced by air pollution accumulation, traffic flow and industrial emissions, has attracted extensive attention for decades. In this paper, we propose a spatio-temporal attention convolutional long short-term memory neural network (Attention-CNN-LSTM) for air pollutant concentration forecasting. Firstly, we analyze the Granger causalities between different stations and establish a hyperparametric Gaussian vector weight function to determine spatial autocorrelation variables, which are used as part of the input features. Secondly, a convolutional neural network (CNN) is employed to extract the temporal dependence and spatial correlation of the input, while feature maps and channels are weighted by an attention mechanism to improve the effectiveness of the features. Finally, a deep long short-term memory (LSTM) based time series predictor is established to learn the long-term and short-term dependence of pollutant concentration.
To reduce the effect of diverse complex factors on the LSTM, inherent features are extracted from historical air pollutant concentration data, and meteorological data and timestamp information are incorporated into the proposed model. Extensive experiments were performed using the Attention-CNN-LSTM, autoregressive integrated moving average (ARIMA), support vector regression (SVR), traditional LSTM and CNN. The results demonstrate the feasibility and practicability of Attention-CNN-LSTM for estimating CO and NO concentration.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

In recent years, air pollution, especially the large-scale haze caused by ultrafine particles and volatile organic compounds (VOCs) from mobile pollution sources, has attracted worldwide attention [1],[2]. Ultrafine particles and VOCs not only harm human health directly, but are also important precursors of fine particulate matter (PM2.5) and major components of photochemical smog. Therefore, monitoring the emission of ultrafine particles and VOCs from mobile pollution sources is an effective means of reducing smog and improving the quality of the regional urban atmospheric environment. Knowing the source and concentration of these pollutants is essential to reduce the adverse effects of air pollution on health [3]. Thus, the characteristics of pollutant changes in different regions should be considered in the spatial dimension, and the differences in pollutant concentration over time and space, together with their causes, should be analyzed. Doing so improves the efficiency and reliability of pollutant concentration prediction and provides a decision-making basis for government air pollution control, traffic management and daily travel. The approaches for forecasting air pollutant concentrations mainly include deterministic and statistical models.
Deterministic models simulate the atmospheric physics and chemistry involved in the emission, diffusion and transformation of air pollutants, but they cannot explain the non-linearity and heterogeneity of some factors in the formation of pollutants [4],[5],[6],[7]. Statistical models estimate air quality in a data-driven manner rather than through sophisticated theoretical models, and have shown clear advantages [8],[9]. Recently, deep neural networks (DNNs) have been used to learn salient feature mappings automatically from high-dimensional input data, which avoids the complicated manual design and extraction of features. They have been applied to a wide variety of complex prediction problems, including further improving the prediction of air pollutant concentrations. Taking into account the spatio-temporal correlations of historical air pollutant data, meteorological data and timestamp data, Li et al. proposed a novel LSTM model to forecast air pollutant concentration [10]. Qi et al. embedded feature selection and spatial-temporal semi-supervised learning (ST-SSL) in a deep network to infer the PM2.5 concentration for the next few hours at all locations [11]. However, since the input of such a model is a fixed-length sequence, the model’s representation of the context is also a sequence of the same length, which limits the performance of the model and makes it difficult to obtain a suitable vector representation as the output. In general, the air pollution process involves a variety of interacting pollutants, which are affected by local reactions, the spatio-temporal evolution of air pollutant concentration and confounding factors such as wind direction and humidity.
Therefore, research on the prediction of air pollutant concentration still faces the following two challenges: (i) LSTM has difficulty handling time series with long-term dependencies and complex tasks; (ii) the causal pathways of air pollution among different locations are inherently complex, since they may be influenced by geography, atmospheric phenomena and other complex factors. To address both challenges, we propose a deep spatio-temporal hybrid model to estimate air pollutant concentrations. The main contributions of the paper are as follows:
(i) Granger causality is used to model the spatial correlation between different stations in adjacent regions, and the spatial dependence of air pollutant concentrations arising from the propagation of air pollution under different wind directions in each sub-region is captured by constructing a hyperparametric Gaussian vector weight function.
(ii) By constructing the Attention-CNN hybrid model, we can effectively extract the intrinsic features from historical air pollutant concentrations, meteorological data and timestamp data by learning over a long time span, and then use an LSTM layer to extract temporal information from these feature mappings.
(iii) We use air pollutant concentration data and meteorological monitoring data from northern Taiwan in 2015 to evaluate our methods. Extensive experiments show that the model outperforms traditional machine learning methods.

Figure 1. Topographic map and the location of monitoring sites in Northern Taiwan.

The rest of the paper is organized as follows: Section 2 introduces the data, the Attention-CNN-LSTM model, the spatio-temporal correlation analysis using Granger causality, the extraction of spatial and temporal features, the attention mechanism over feature maps and channels, and the prediction of air pollutant concentration at multiple monitoring stations. Section 3 shows the experimental results.
Finally, some concluding remarks and suggestions for future work are given in Section 4.

2. Materials and methods

2.1. Data Preprocessing

The experimental data in this paper is from the Environmental Protection Administration, Executive Yuan, R.O.C. In the current experiments, air pollutant concentration data and meteorological monitoring data were collected every hour at 25 monitoring sites (shown in Figure 1) from Jan/01/2015 to Dec/31/2016. The meteorological data and air quality data used for CO and NO concentration prediction are shown in Table 1. To mitigate the negative impact of missing values on data analysis, we delete the timestamps that contain missing values rather than filling them in, because filling in missing values requires accurate prediction of the spatial and temporal correlation between different time series. If the filled-in values are not representative, the temporal autocorrelation and spatial correlation may be weakened. In all the experiments, the data is divided into a training set (80%) and a testing set (20%).

Table 1. Atmospheric pollutants and meteorological data applied in model studies.

Input parameters    Unit      Input parameters       Unit
NO                  ppb       O3                     ppb
CO                  ppm       Average temperature    °C
PM10                µg/m3     Relative humidity      %
PM2.5               µg/m3     Wind direction         degree
THC                 ppm       Wind speed             m/sec
NMHC                ppm       Rainfall               mm
SO2                 ppb

2.2. Attention-CNN-LSTM

The framework of the proposed spatial-temporal prediction model for multi-scale pollutant concentrations is shown in Figure 2. The main inputs (historical CO/NO concentration data) are enclosed in the brown box, and the auxiliary inputs (meteorological data, related pollutant concentration and the time of day) are enclosed in the light blue box; r represents the number of time steps used, and the numbers in parentheses represent the dimensions of each type of feature.

Figure 2. Attention-CNN-LSTM Architecture.

Granger causality (GC) analysis, as an index to measure the interaction between time series, has been favored in recent decades, and we use it to account for the spatio-temporal correlation between the 25 stations and their historical information. For complex spatial factors, we use GC to analyze the correlation between the air pollutant concentration time series. We define the time series of air pollutants at two monitoring sites as $Y_i$ and $X_i$, respectively. The GC model and the null hypothesis are given as follows:

$$ Y_i(t) = \sum_{j=1}^{n} \Phi_i(j)\, Y_i(t-j) + \sum_{j=1}^{n} \mu_i(j)\, X_i(t-j) + \epsilon_t, \quad \text{if } i \in N_d \tag{1} $$

$$ Y_i(t) = \sum_{j=1}^{n} \Phi_i(j)\, Y_i(t-j) + \epsilon_t, \quad \text{if } i \notin N_d \tag{2} $$

where $N_d$ is the neighborhood set of the spatial clustering (the K-Means algorithm is adopted to form the clusters); $\epsilon_t$ is a white noise Gaussian random vector; $n$ is the number of time stamps; the vector $\Phi_i$ contains the corresponding weights for $Y_i$; and the vector $\mu_i$ represents the spatial weight between spatial locations $Y_i$ and $X_i$.

In the spatial-temporal feature extraction stage, two or more CNN layers are used to extract the intrinsic features from historical air pollutant data by learning over a long time span. One-hot encoding is then applied to the hourly data, and the extracted features are combined with current meteorological data and related pollutant data to improve the predictive performance. In addition, we add batch normalization (BN) after the second and third convolutional layers of the model. Because the scaled exponential linear unit (SELU) function has better convergence performance and can effectively avoid the vanishing gradient problem, it is taken as the activation function in this paper [12].

In the feature map attention stage, we adopt an attention mechanism to weight the hidden features and enhance their effectiveness.
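As an illustration of the Granger-causality test in Eqs. (1) and (2), the following minimal sketch fits both models by ordinary least squares and compares them with an F statistic. It is not the authors' implementation: the function names, the added intercept term, the choice of an F test and the synthetic series are all assumptions made for the example.

```python
import numpy as np


def lagged_design(series, n_lags):
    """Columns hold series[t-1], ..., series[t-n_lags] for t = n_lags .. T-1."""
    T = len(series)
    return np.column_stack(
        [series[n_lags - j : T - j] for j in range(1, n_lags + 1)]
    )


def granger_f_statistic(y, x, n_lags):
    """F statistic of Eq. (1) (lags of y and x) against Eq. (2) (lags of y only)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    target = y[n_lags:]
    # Intercept column: an assumption of this sketch, not part of Eqs. (1)-(2).
    ones = np.ones((len(target), 1))
    restricted = np.hstack([ones, lagged_design(y, n_lags)])            # Eq. (2)
    unrestricted = np.hstack([restricted, lagged_design(x, n_lags)])    # Eq. (1)

    def rss(design):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return float(resid @ resid)

    rss_r, rss_u = rss(restricted), rss(unrestricted)
    dof = len(target) - unrestricted.shape[1]
    return ((rss_r - rss_u) / n_lags) / (rss_u / dof)


# Synthetic demo (made-up coefficients): x drives y with a one-step delay.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.normal()

f_xy = granger_f_statistic(y, x, n_lags=2)  # does x help predict y?
f_yx = granger_f_statistic(x, y, n_lags=2)  # does y help predict x?
print(f"F(x -> y) = {f_xy:.1f}, F(y -> x) = {f_yx:.2f}")
```

A large F statistic for one direction suggests that the lagged series of the other station carries predictive information, which is the sense in which a station would be placed in the neighborhood set used by Eq. (1).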
In the attention mechanism, $F = \{f(1), f(2), \ldots, f(j)\}$ denotes the hidden feature maps output by the convolutional layer, where $j$ is the number of convolution kernels. The weighted feature maps $F'$ are computed by softmax:

$$ F' = W \cdot F \tag{3} $$

$$ f'(i) = \omega_i * f(i) $$

$$ \omega_i = \mathrm{softmax}(F) = \frac{\exp(f(i))}{\sum_{t=1}^{j} \exp(f(t))} \tag{4} $$

Here $W = \{\omega_1, \omega_2, \ldots, \omega_j\}$ is a weight matrix whose size is the same as that of the feature maps. To generate $W$, the attention mechanism consists of 3 convolution layers with stride 1. The first convolution layer has k × s filters with convolution kernel size 5 × 5, the second and third layers have k filters with convolution kernel size 3 × 3, and the number of filters is 100 for each convolutional layer.

By stacking several LSTM layers, the features in spatially correlated pollutant data with long-term dependence can be extracted automatically layer by layer, and the fused features can be used for multiscale time series prediction of air pollutants. For the LSTM layer, one input is the temporal information $X = (x_1, x_2, \ldots, x_t)$, and the other input is the hidden unit $h_{t-1}$ from the previous time step. The forward pass of Attention-CNN-LSTM can be expressed by the following equations:

$$ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{5} $$

$$ i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{6} $$

$$ C_t = f_t * C_{t-1} + i_t * \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \tag{7} $$

$$ o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \tag{8} $$

$$ h_t = o_t * \tanh(C_t) \tag{9} $$

where $i_t$, $o_t$ and $f_t$ denote the activations of the input gate, output gate and forget gate, respectively; $C_t$ and $h_t$ denote the activation vectors for each cell and memory block, respectively; and $W$ and $b$ denote the weight matrices and bias vectors. The output from the last step of the LSTM is then fed to the fully connected layer for spatio-temporal air pollution prediction.

3. Results and discussion

3.1. Performance metric and settings

Each monitoring station collects air quality data once an hour, and the dataset contains more than 400000 instances, each with concentrations of CO and NO. To demonstrate the effectiveness of the proposed Attention-CNN-LSTM, several models are used for comparison, including ARIMA [13], SVR [14], CNN [15], LSTM [16], and CNN-LSTM [17]. To avoid inconsistent data magnitudes and gradient explosion, all input data are scaled to [0, 1] using min-max normalization. The performance evaluation indicators used in our experiments are the root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE) and correlation coefficient (R).

3.2. Performance comparisons

The current state affects different future time intervals differently [18]. Therefore, we build different training sets that pair the input data with the air pollutant concentration at multiple future time intervals, from the next 1st hour to the 24th hour. For each of the next 3 h, we train a separate model. The next 4-24 h are divided into three groups (4-6 h, 7-12 h, 13-24 h), and a model is trained for each time interval. As shown in Tables 2, 3, 4 and 5, the accuracy of all the models decreases as the prediction time extends. However, for CO concentration, the RMSE of Attention-CNN-LSTM, which ranges from 0.4602 to 0.5803 (Table 3), is lower than that of every other model at each horizon, indicating that Attention-CNN-LSTM achieves higher accuracy and stability in long-term prediction. This is due to the combination of 3D-CNN, attention mechanism and LSTM, which extracts advanced spatio-temporal features while maintaining the transmission of state information, rather than using LSTM or CNN alone.
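For reference, the min-max scaling and the four evaluation indicators named in Section 3.1 can be computed as in the following minimal sketch. The paper does not state the formulas, so the standard textbook definitions are assumed here, with MAPE expressed in percent and R taken as the Pearson correlation coefficient; the function names and the sample values are illustrative only.

```python
import numpy as np


def min_max_scale(values, lo=None, hi=None):
    """Scale to [0, 1]; pass the training-set lo/hi to scale test data consistently.

    Assumes hi > lo (a constant series would divide by zero).
    """
    values = np.asarray(values, dtype=float)
    lo = values.min() if lo is None else lo
    hi = values.max() if hi is None else hi
    return (values - lo) / (hi - lo)


def rmse(y_true, y_pred):
    """Root mean square error."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(diff)))


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; assumes no zero observations."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(100.0 * np.mean(np.abs((y_pred - y_true) / y_true)))


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between observations and predictions."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])


# Illustrative values only, not taken from the paper's data.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(rmse(observed, predicted), mae(observed, predicted),
      mape(observed, predicted), pearson_r(observed, predicted))
```

Lower RMSE, MAE and MAPE and higher R indicate better predictions, which is how the entries of Tables 2 to 5 are read.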
Prediction results show that Attention-CNN-LSTM is valid for air pollutant concentration forecasting on the whole data set.

Table 2. Comparisons of MAE using different models for the next 1 to 24 hour prediction

Pollutant  Model               1h      2h      3h      4-6h    7-12h   13-24h
NO         ARIMA               13.42   13.28   13.64   14.49   14.89   15.28
NO         SVR                 11.95   11.94   12.26   12.70   13.02   13.13
NO         CNN                 10.67   11.86   11.64   11.76   12.41   12.68
NO         LSTM                9.906   10.08   10.90   11.26   11.87   12.14
NO         CNN-LSTM            9.193   9.727   10.09   10.52   11.19   11.55
NO         Attention-CNN-LSTM  8.568   8.693   9.270   9.830   10.68   10.77
CO         ARIMA               0.5131  0.5478  0.5910  0.6253  0.6479  0.6876
CO         SVR                 0.4516  0.4632  0.5124  0.5706  0.5880  0.6176
CO         CNN                 0.4365  0.4570  0.4565  0.4745  0.5031  0.5466
CO         LSTM                0.4149  0.4323  0.4240  0.4400  0.4586  0.4919
CO         CNN-LSTM            0.4079  0.4078  0.4115  0.4395  0.4439  0.4458
CO         Attention-CNN-LSTM  0.3481  0.3627  0.3770  0.3954  0.4027  0.4360

4. Conclusions

In this paper, an attention-based CNN-LSTM model has been proposed to forecast air pollutant concentration. Granger causality analysis is utilized to explore the spatial correlation among different monitoring sites, and spatial features are added into the prediction model. The model combines ordinary convolution units and an attention mechanism to extract the spatial and temporal feature maps. Finally, the air pollutant concentration of multiple monitoring stations is predicted by LSTM, which can learn temporal dependencies in time series of pollutant concentrations. Experimental results have demonstrated that the proposed Attention-CNN-LSTM outperforms the other state-of-the-art algorithms in terms of RMSE, MAE, MAPE and R values. To further improve the performance of the proposed method, several aspects remain to be investigated in future work: (1) exploring a spatio-temporal clustering method based on weather patterns, because the air pollution process may be affected by multiple weather patterns; (2) exploring multi-faceted causality analysis and environmental factors.

Table 3. Comparisons of RMSE using different models for the next 1 to 24 hour prediction

Pollutant  Model               1h      2h      3h      4-6h    7-12h   13-24h
NO         ARIMA               17.80   18.27   18.75   19.42   19.62   20.04
NO         SVR                 16.07   16.07   16.61   17.04   17.66   17.71
NO         CNN                 14.05   15.20   15.83   16.05   16.52   16.95
NO         LSTM                13.19   14.49   14.58   15.18   15.59   15.72
NO         CNN-LSTM            13.08   13.82   14.09   14.75   15.04   15.59
NO         Attention-CNN-LSTM  12.27   12.87   13.29   13.69   14.18   14.79
CO         ARIMA               0.6125  0.6759  0.6874  0.7372  0.7775  0.7854
CO         SVR                 0.5536  0.6191  0.6365  0.6490  0.6703  0.6803
CO         CNN                 0.5269  0.5710  0.5865  0.6125  0.6365  0.6423
CO         LSTM                0.5043  0.5597  0.5699  0.5961  0.6027  0.6106
CO         CNN-LSTM            0.4988  0.5784  0.5820  0.6037  0.6121  0.6145
CO         Attention-CNN-LSTM  0.4602  0.4634  0.4969  0.5654  0.5710  0.5803

Table 4. Comparisons of R using different models for the next 1 to 24 hour prediction

Pollutant  Model               1h      2h      3h      4-6h    7-12h   13-24h
NO         ARIMA               0.7910  0.7770  0.7672  0.7443  0.7346  0.7290
NO         SVR                 0.8307  0.8286  0.8158  0.8056  0.7934  0.7883
NO         CNN                 0.8519  0.8464  0.8348  0.8289  0.8183  0.8078
NO         LSTM                0.8867  0.8755  0.8631  0.8534  0.8385  0.8248
NO         CNN-LSTM            0.8980  0.8834  0.8729  0.8616  0.8594  0.8363
NO         Attention-CNN-LSTM  0.9057  0.8933  0.8832  0.8787  0.8648  0.8386
CO         ARIMA               0.8146  0.7892  0.7796  0.7379  0.7163  0.7093
CO         SVR                 0.8374  0.8274  0.8138  0.8044  0.7908  0.7785
CO         CNN                 0.8674  0.8518  0.8434  0.8271  0.8138  0.8090
CO         LSTM                0.8858  0.8742  0.8639  0.8490  0.8319  0.8249
CO         CNN-LSTM            0.8919  0.8805  0.8766  0.8633  0.8304  0.8360
CO         Attention-CNN-LSTM  0.9219  0.9091  0.8970  0.8736  0.8518  0.8508

Acknowledgments

This work was supported in part by the Leading Talents of Science and Technology Innovation in Zhejiang Province 10 Thousands Plan under Grant 2018R52040, in part by the National Key Research and Development Program of China under Grant 2016YFC0201400, in part by the Provincial Key Research and Development Program of Zhejiang Province under Grant 2017C03019, and in part by the International Science and Technology Cooperation Program of Zhejiang Province for Joint Research in High-tech Industry under Grant 2016C54007.
Table 5. Comparisons of MAPE using different models for the next 1 to 24 hour prediction

Pollutant  Model               1h      2h      3h      4-6h    7-12h   13-24h
NO         ARIMA               53.01   52.37   53.18   59.45   61.45   65.23
NO         SVR                 45.05   44.18   46.08   51.42   54.10   56.45
NO         CNN                 38.76   39.15   41.47   44.79   45.79   47.69
NO         LSTM                31.61   36.96   39.45   41.03   43.73   44.80
NO         CNN-LSTM            30.78   34.00   35.14   39.43   40.16   42.09
NO         Attention-CNN-LSTM  28.06   28.43   29.04   29.75   31.27   34.02
CO         ARIMA               43.72   45.29   48.50   53.18   57.94   59.44
CO         SVR                 35.75   37.57   38.54   41.24   45.05   51.79
CO         CNN                 25.96   26.40   28.22   29.55   30.54   32.96
CO         LSTM                22.27   25.23   27.12   28.62   29.35   30.29
CO         CNN-LSTM            22.81   23.98   24.29   26.96   27.84   29.13
CO         Attention-CNN-LSTM  21.26   22.48   23.99   25.23   27.98   28.17

References

[1] Kurt A and Oktay A B 2010 Forecasting air pollutant indicator levels with geographic models 3 days in advance using neural networks Expert Systems with Applications 37(12) 7986-7992
[2] Arhami M, Kamali N and Rajabi M 2013 Predicting hourly air pollutant levels using artificial neural networks coupled with uncertainty analysis by Monte Carlo simulations Environmental Science and Pollution Research 20(7) 4777-4789
[3] Yang W, Deng M, Xu F and Wang H 2018 Prediction of hourly PM2.5 using a space-time support vector regression model Atmospheric Environment 181 12-19
[4] Kim Y, Fu J S and Miller T L 2010 Improving ozone modeling in complex terrain at a fine grid resolution: Part I–examination of analysis nudging and all PBL schemes associated with LSMs in meteorological model Atmospheric Environment 44(4) 523-532
[5] Jeong J I, Park R J, Woo J H, Han Y J and Yi S M 2011 Source contributions to carbonaceous aerosol concentrations in Korea Atmospheric Environment 45(5) 1116-1125
[6] Crippa M, Canonaco F, Slowik J G, El Haddad I, De-Carlo P F, Mohr C, ... and Abidi E 2013 Primary and secondary organic aerosol origin by combined gas-particle phase source apportionment Atmos. Chem. Phys. 13(16) 8411-8426
[7] Zhou G, Xu J, Xie Y, Chang L, Gao W, Gu Y and Zhou J 2017 Numerical air quality forecasting over eastern China: An operational application of WRF-Chem Atmospheric Environment 153 94-108
[8] Wang D, Wei S, Luo H, Yue C and Grunder O 2017 A novel hybrid model for air quality index forecasting based on two-phase decomposition technique and modified extreme learning machine Science of The Total Environment 580 719-733
[9] Prasad K, Gorai A K and Goyal P 2016 Development of ANFIS models for air quality forecasting and input optimization for reducing the computational cost and time Atmospheric Environment 128 246-262
[10] Li X, Peng L, Yao X, Cui S, Hu Y, You C and Chi T 2017 Long short-term memory neural network for air pollutant concentration predictions: Method development and evaluation Environmental Pollution 231 997-1004
[11] Qi Z, Wang T, Song G, Hu W, Li X and Zhang Z M 2018 Deep air learning: Interpolation, prediction, and feature analysis of fine-grained air quality IEEE Transactions on Knowledge and Data Engineering
[12] Klambauer G, Unterthiner T, Mayr A and Hochreiter S 2017 Self-normalizing neural networks Advances in Neural Information Processing Systems 971-980
[13] Dickey D 1989 Time series theory and methods Technometrics 31(1) 121-121
[14] Nieto P J, Combarro E F, Diaz J J and Montanes E 2013 A SVM-based regression model to study the air quality at local scale in Oviedo urban area (Northern Spain): A case study Applied Mathematics and Computation 219(17) 8923-8937
[15] Chen Y N, Han C C, Wang C T, Jeng B S and Fan K C 2009 A CNN-based face detector with a simple feature map and a coarse-to-fine classifier-Withdrawn IEEE Transactions on Pattern Analysis and Machine Intelligence
[16] Ghaderi A, Sanandaji B M and Ghaderi F 2017 Deep forecast: deep learning-based spatio-temporal forecasting arXiv preprint arXiv:1707.08110
[17] Wu Y and Tan H 2016 Short-term traffic flow forecasting with spatial-temporal correlation in a hybrid deep learning framework arXiv preprint arXiv:1612.01022
[18] Zheng Y, Yi X, Li M, Li R, Shan Z, Chang E and Li T 2015 Forecasting fine-grained air quality based on big data In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining ACM 2267-2276