Air Quality Estimation Using LSTM and An Approach for Data Processing Techniques

Minh-Anh Ton-Thien1,2,3,*, Chuong Thi Nguyen1,4,*, Quang M. Le1,2,3, Dat Q. Duong1,2,3
1 AISIA Research Lab, Ho Chi Minh City, Vietnam
2 University of Science, Ho Chi Minh City, Vietnam
3 Vietnam National University, Ho Chi Minh City, Vietnam
4 iLotusLand, Vietnam
* These authors contributed equally.
minhanhtt2000@gmail.com, chuong.nguyen@vietan-enviro.com

ABSTRACT
This paper describes our approach to subtask 1 of the MediaEval 2021 "Cross-Data Analytics for (transboundary) Haze Prediction" task. The objective of this subtask is to predict PM10 values at different locations in multiple countries using data only from each country itself. We applied XGBoost to impute missing PM10 values in the training dataset and Long Short-Term Memory (LSTM) [2] models to predict air pollution.

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval'21, December 13-15 2021, Online

1 INTRODUCTION
Nowadays, air pollution leads to increasing cases of cardiovascular and respiratory diseases; it also affects social and economic activities. By using data from the last several days to predict air pollution for upcoming days, we can plan appropriate activities to protect our health.

As given in the task description [3] of MediaEval 2021, subtask 1 provides time-series datasets collected from different air quality and weather stations in Brunei, Singapore, and Thailand. We therefore decided to use LSTM models to predict the next day's air pollution from the weather features and air quality of the 10 previous days. For Brunei, we built and compared different variants of the LSTM model, i.e., the plain LSTM, Bidirectional LSTM, and Stacked LSTM. For the Singapore and Thailand datasets, because of the lack of time and the many PM10 values that need to be predicted hourly, we used only the Bidirectional LSTM.

2 OUR APPROACH
2.1 Missing Values Imputation
Each data point in the training dataset represents the information of one location on one day in the country. We observed many missing values in the datasets, so we employed two different methods to impute them according to the value type. For data extracted from weather stations (i.e., temperature, rainfall, humidity, and wind speed), we filled the missing values with the mean values of the corresponding station. For data related to air quality from monitoring stations, i.e., PM10, we employed XGBoost [1] to impute the missing values from the weather features.

First, the missing values of the weather features in the training dataset were filled using the first method. Next, we created a new dataset from the original training dataset by dropping the rows where PM10 values are missing. Then, the new dataset was used to build an XGBoost model that predicts the missing PM10 values of the original training data from the weather features.

It is worth noting that all weather features of Thailand collected in 2015 are missing. Therefore, to avoid interference when filling missing values, we dropped all data points from that year.

2.2 Models
Research studies have shown that LSTM is well suited to time-series data [8, 9]; it is good at solving long-term memory problems, especially predicting the n-th sample from many earlier time steps. Thus, we applied LSTM models in our study to predict air pollution.

Figure 1: Architecture of an LSTM cell

2.2.1 LSTM. One disadvantage of the Recurrent Neural Network (RNN) is that it cannot process long sequences; the LSTM architecture was proposed to solve this problem. An LSTM cell, as shown in Figure 1, has a cell state C that allows information to flow through for long-term memory. It also includes three gates: the forget gate decides what information should be kept or discarded by looking at the previous state and the current input.
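The two-step imputation procedure of Section 2.1 can be sketched as follows. This is a minimal NumPy-only illustration, not the paper's exact pipeline: the station-mean fill matches the description, but an ordinary least-squares regressor stands in for the XGBoost model, and all array and function names are hypothetical.

```python
import numpy as np

def fill_station_means(weather):
    """Replace NaNs in each weather-feature column with that column's mean.
    `weather` is an (n_days, n_features) array for a single station."""
    filled = weather.copy()
    col_means = np.nanmean(filled, axis=0)
    nan_rows, nan_cols = np.where(np.isnan(filled))
    filled[nan_rows, nan_cols] = col_means[nan_cols]
    return filled

def impute_pm10(weather, pm10):
    """Fit a regressor on the rows where PM10 is observed, then predict the
    missing PM10 values from the weather features (least squares stands in
    for XGBoost here)."""
    weather = fill_station_means(weather)        # step 1: fill weather features
    observed = ~np.isnan(pm10)                   # step 2: keep rows with PM10
    X = np.column_stack([weather, np.ones(len(weather))])  # add a bias term
    coef, *_ = np.linalg.lstsq(X[observed], pm10[observed], rcond=None)
    imputed = pm10.copy()
    imputed[~observed] = X[~observed] @ coef     # step 3: fill missing PM10
    return imputed
```

For instance, given three days where day 2 is missing both rainfall and PM10, the rainfall slot receives the station mean and the PM10 slot receives the value predicted from the fitted weather-to-PM10 relation.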
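As a concrete reading of Figure 1, a single LSTM time step can be written out in NumPy. This is a generic sketch of the standard cell equations, with randomly initialized weights and illustrative names; it is not the exact parameterization of the models trained in this work.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b stack the parameters of the four
    transforms (forget, input, candidate, output), each of size `hidden`."""
    hidden = h_prev.shape[0]
    z = W @ x + U @ h_prev + b              # all four pre-activations at once
    f = sigmoid(z[0*hidden:1*hidden])       # forget gate: what to drop from c_prev
    i = sigmoid(z[1*hidden:2*hidden])       # input gate: what new info to add
    g = np.tanh(z[2*hidden:3*hidden])       # candidate values for the cell state
    o = sigmoid(z[3*hidden:4*hidden])       # output gate: what to expose as h
    c = f * c_prev + i * g                  # cell state C carries long-term memory
    h = o * np.tanh(c)                      # hidden state / output at this step
    return h, c

# Example: 4 input features, hidden size 3, unrolled over a 10-step window
# (mirroring the 10-day input windows used later in this paper).
rng = np.random.default_rng(0)
W = rng.normal(size=(12, 4))
U = rng.normal(size=(12, 3))
b = np.zeros(12)
h = c = np.zeros(3)
for x_t in rng.normal(size=(10, 4)):
    h, c = lstm_step(x_t, h, c, W, U, b)
```

Because the output gate multiplies tanh(c), every component of h stays strictly inside (-1, 1), while the cell state c itself is unbounded and can accumulate information across the window.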
The input gate decides what information is essential at the current step and how it is added to the cell state; the output gate decides what the output should be.

Figure 2: Architecture of an unfolded Bi-LSTM

2.2.2 Bi-LSTM. The Bi-LSTM model, developed from the Bidirectional Recurrent Network [6], consists of two LSTM layers: one taking the input in the forward direction, and the other in the backward direction. The architecture of an unfolded Bi-LSTM, as depicted in Figure 2, lets the network traverse the input in both directions at the same time so that it can recognize the patterns in our data better.

Figure 3: Architecture of a Stacked LSTM

2.2.3 Stacked LSTM. The Stacked LSTM is a model that includes multiple LSTM layers. By making the model deeper, it has proved effective on sequence data [5, 7]. In our study, we used a 2-layer Stacked LSTM architecture, as described in Figure 3.

2.3 Models description
For each country, we did not develop a separate model per station; instead, we combined the data from all stations into one complete dataset and built models to recognize the patterns in that dataset. The first 80% of the data is used as the training set, and the last 20% as the validation set. The information of the ten previous days, which includes weather features and air quality, is used to predict the PM10 values of the upcoming day. The Adam optimizer [4] is employed for model training; the number of epochs is 100 with a batch size of 512.

For Brunei, we experimented with the LSTM, Bidirectional LSTM (Bi-LSTM), and Stacked LSTM. The organizers provided hourly PM10 values for Singapore and Thailand, and we tested all three LSTM models only on the first three hourly PM10 values (i.e., PM10_1, PM10_2, and PM10_3). The results show that Bi-LSTM is slightly better than LSTM and Stacked LSTM, so we chose to employ the Bi-LSTM model for each hourly PM10 value for the final predictions.

3 RESULTS AND ANALYSIS
We evaluate the proposed models using the RMSE metric, calculated as follows:

RMSE = sqrt( (1/N) * sum_{i=1}^{N} ||y(i) - ŷ(i)||^2 )

where N is the number of data points, y(i) is the i-th measurement, and ŷ(i) is its ground truth.

The experimental results on the validation sets of the Brunei, Singapore, and Thailand datasets are shown in Table 1. For Brunei, Bi-LSTM achieves a slightly better result than LSTM, with a score of 3.625. For Singapore and Thailand, the average scores are 5.821 and 10.624, respectively.

Table 1: Test results of the best run submission on the validation sets for Subtask 1 (RMSE)

Model          Brunei   Singapore   Thailand
Bi-LSTM        3.625    5.821       10.624
LSTM           3.629    -           -
Stacked LSTM   3.921    -           -

Table 2 describes the evaluation results of our submitted run on the test datasets. Compared with the results on the validation sets, our models did not work well for Brunei and Singapore on the test sets, and overfitting occurred.

Table 2: Results provided by the task organizers on the held-out test datasets (RMSE)

Model     Brunei   Singapore   Thailand
Bi-LSTM   10.967   10.248      9.762

In this work, Bi-LSTM initially showed some promising results for Brunei and Singapore on the validation sets. However, it did not perform as expected on the test sets. This might be because our missing-values imputation technique is not good enough and the models cannot fully recognize the patterns of the datasets.

ACKNOWLEDGMENTS
We would like to thank the AISIA Research Lab for supporting our team, and the Organization Board of MediaEval 2021 and the Task Organizers for providing us with the opportunity to participate in the competition.

REFERENCES
[1] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). ACM, New York, NY, USA, 785-794. https://doi.org/10.1145/2939672.2939785
[2] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735-1780.
[3] Asem Kasem, Minh-Son Dao, Effa Nabilla Aziz, Duc-Tien Dang-Nguyen, Cathal Gurrin, Minh-Triet Tran, Thanh-Binh Nguyen, and Wida Suhaili. 2021. Overview of Insight for Wellbeing Task at MediaEval 2021: Cross-Data Analytics for Transboundary Haze Prediction. In Proc. of the MediaEval 2021 Workshop, Online, 13-15 December 2021.
[4] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv:1412.6980. Published as a conference paper at the 3rd International Conference on Learning Representations (ICLR), San Diego, 2015. http://arxiv.org/abs/1412.6980
[5] H. Sak, Andrew Senior, and F. Beaufays. 2014. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH 2014), 338-342.
[6] M. Schuster and K.K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45, 11 (1997), 2673-2681. https://doi.org/10.1109/78.650093
[7] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27 (2014).
[8] Yi-Ting Tsai, Yu-Ren Zeng, and Yue-Shan Chang. 2018. Air Pollution Forecasting Using RNN with LSTM. In 2018 IEEE 16th Intl Conf on Dependable, Autonomic and Secure Computing (DASC/PiCom/DataCom/CyberSciTech), 1074-1079. https://doi.org/10.1109/DASC/PiCom/DataCom/CyberSciTec.2018.00178
[9] Thanongsak Xayasouk, HwaMin Lee, and Giyeol Lee. 2020. Air Pollution Prediction Using Long Short-Term Memory (LSTM) and Deep Autoencoder (DAE) Models. Sustainability 12, 6 (2020), 2570. https://doi.org/10.3390/su12062570