Towards Time Series Forecasting of Cross-Data Analytics for Haze Prediction

Ali Akbar, Muhammad Atif Tahir, Muhammad Rafi
National University of Computer and Emerging Sciences, Karachi Campus, Pakistan
{k201306,atif.tahir,muhammad.rafi}@nu.edu.pk

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval’21, December 13-15 2021, Online

ABSTRACT

Atmospheric pollution in the form of haze, a thin or thick layer of dust and smoke, has become a major issue all over the world. It obscures the visibility of the sky [1]. This paper first explores the dataset with the help of visualizations and ACF and PACF plots, analyzing the trend and seasonality components. Thereafter, time series methodologies are applied to predict PM10 values, given the data of the same country as well as the data from neighbouring countries. Models including ARIMA and SARIMA were applied and tuned using Grid Search and Walk-Forward validation. The paper also employs the Vector Auto-Regression (VAR) methodology to capture the cross-data relationship between one country and another. The implementation produced best SMAPE scores (across the two sub-tasks) of 44.96, 29.07 and 27.74 on the Brunei, Singapore and Thailand datasets, respectively.

1 INTRODUCTION

Environmental pollution is a major concern. The basic idea of the research in haze prediction is presented in [1]. Our main motivation for participating in the challenge is to be part of the research aimed at the unsolved problem of haze. It is perceived as a major problem for countries around the world and is becoming a serious health concern across the globe.

This paper focuses on the first two subtasks, i.e., 1) predicting the PM10 values using the data from the same country, and 2) predicting the PM10 values using the data from the same country and from other countries as well. The dataset was first visualised, structured and arranged in different files. One of the challenges tackled was missing values, since the parameters, along with the PM10 data, were not available for all the weather stations and at all times. Thereafter, several Time Series and Machine Learning models were applied and evaluated based on the results on the train and test datasets.

2 RELATED WORK

The paper in [2] discussed a methodology to forecast haze occurrences in Southeast Asia. It utilized a Convolutional Neural Network (CNN) based framework known as HazeNet, which has a 16-layered architecture and was trained on 18 hydrological and meteorological features and time-sequence maps covering about 35 years. It achieved about 95.2% accuracy on the validation set but neglected the impact and importance of transboundary haze effects.

The authors of [3] identified two major research gaps that persisted in predicting PM10 concentrations in Brunei Darussalam: 1) recent research did not consider CNNs and Recurrent Neural Networks (RNN) for the haze prediction use-case, and 2) the majority of other researchers used data from the 1997 to 1998 period, which was considered a disastrous period in that region and therefore made the outcomes of the research biased and not widely applicable. The authors attempted multiple Time Series and Deep Learning techniques, including Moving Averages (the average of PM10 values over a shifting window), Linear Regression, and an RNN with a 1-D Convolutional layer (CRNN), of which the CRNN proved to be the best performing model throughout. However, that paper also does not cater for transboundary effects in haze prediction, and it explicitly proposes this as possible future work.

The authors of [4] studied the effect of transboundary haze events in Malaysia by using Multiple Linear Regression (MLR) to estimate PM10. The paper used stepwise MLR with a 95% confidence interval, and the dataset was divided into 70% for training and 30% for testing. Along with normalization, the authors used different tests, including the Durbin–Watson (DW) test and the R-squared test for correlation. The authors concluded by showing different test results, standard deviations and dispersion graphs, with the PM10(t+1) model giving an accuracy of 0.668. The paper did incorporate transboundary haze prediction but did not incorporate, analyse or compare the localized version of haze in a region.

The authors of [5] used a Convolutional RNN to determine transboundary haze levels in island cities and introduced a Dynamic C-RNN (D-CRNN) that combines a CNN and an RNN and can model the interactions both spatially and temporally. The paper used Spatial Transformation and techniques such as Inverse Distance Weighting to transform the data, keeping the constraints and assumptions of an island in view. It concluded with results showing that D-CRNN was successful when compared with existing state-of-the-art algorithms. While that paper attempts to cater for both transboundary and local effects, many of the assumptions and methods involved, especially in transforming the data, presume that training and testing are done for island cities. Therefore, the results, methodology and data transformation techniques may not carry over consistently to non-island cities or countries.

3 APPROACH

The first step in the study was to understand and visualize the data in order to gain a basic understanding of the values. In addition, there were some null values in the dataset which had to be taken care of, since the model implementations required a complete dataset without any null values. This was done by trying different methods on the different datasets, including Last Observation Carried Forward and linear imputation. These methods were tried and tested several times and their impact on the accuracies was observed. It was finally concluded that the Last Observation Carried Forward method produced the best SMAPE scores, and hence it was applied to impute the missing values.

Further, as shown in Figure 1, the data and its structure were arranged as required by the different models, for example by combining the date, month and year columns into a single date-time column. Thereafter, the Auto-Correlation Function (ACF) and Partial Auto-Correlation Function (PACF) were studied to gain insight into the dataset (including observing the trend and seasonality components, and thereby using a seasonal differencing order of 1 in the model fitting and selection step) and to find the appropriate lag variables to be used when fitting a Time Series model. It was identified that lags 1 and 2 were of most value in the datasets; this observation was applied in the model fitting stage and was further verified by the grid search method. This step, along with other dataset-specific steps, was carried out three times, since we had data from three different countries. For training and model evaluation purposes, the datasets were divided into two parts: the first 70% of the data for training and the last 30% for testing, which preserves the chronological order of the time series.

Figure 1: Architecture diagram of the proposed approach

Task 1 was to predict the PM10 value at different locations in multiple countries using data only from each country itself. Therefore, several Time Series techniques were applied and evaluated, for example Moving Averages and ARIMA [6]/SARIMA [7] models. Methods such as Grid Search and Walk-Forward validation were used while evaluating and validating the Time Series models, so that the best and most accurate models were selected while accounting for how the PM10 values evolve over time. While these methods gave good accuracies on the Brunei dataset (lower error rate compared to others), the methodology was less successful on the other datasets, mainly because Google Colab (an online Python notebook and execution engine), which was used to train and test the models, crashed on larger amounts of data due to heavy RAM consumption. This heavily impacted the results, since the data then had to be divided into batches before being given for training. However, the best model that could be produced was SARIMA with optimal pdq parameters of (1, 0, 1) and optimal PDQS parameters of (0, 1, 1, 12).

For Task 2, Vector Auto-Regression (VAR) models were implemented to find the relationship of one country's data with the data of other countries, along with the data provided for other stations. The data from the other countries, including other weather variables, were merged into a data-frame and then fed to the VAR model. Additionally, statistical tests including the Cointegration Test were used to find the relationship between multiple features, and the lag variables were identified with the help of the lowest AIC and BIC scores.

4 RESULTS AND ANALYSIS

The models provided reasonably fair results for all three datasets (Brunei, Singapore, and Thailand) across the two sub-tasks. However, there is not much difference between the two sub-tasks. This may be due to under-fitting of the model and ineffective parameter selection, since Time Series models vary a lot depending on the lag variables and the choice of other parameters. The individual scores are reported below:

Task   Brunei          Singapore       Thailand
       MAE    SMAPE    MAE    SMAPE    MAE    SMAPE
1      9.418  44.96    7.437  29.07    7.473  27.85
2      9.419  44.97    7.436  29.07    7.443  27.74

5 DISCUSSION AND OUTLOOK

The proposed model for Task 1 achieved encouraging results for all the regions. For Task 2, however, the model's performance was not very satisfactory and did not improve on Task 1, which may be due to under-fitting and an inefficient choice of parameters. Time series models depend heavily on identifying the key lag dependencies. Therefore, there is a need to reevaluate these values, and a modified model may be suggested for the second sub-task. Moreover, model tuning and hyper-parameter learning can also be used to improve the model. Since Task 2 and Task 3 were more challenging and require a more complex model, spending more time and resources on model selection and fine-tuning could be very useful. Finally, we would like to thank MediaEval and the task organizers for providing us with an opportunity to work in this domain and contribute to society.

REFERENCES

[1] A. Kasem, M.-S. Dao, E. N. Aziz, D.-T. Dang-Nguyen, C. Gurrin, M.-T. Tran, T.-B. Nguyen, and W. Suhaili, “Overview of insight for wellbeing task at mediaeval 2021: Cross-data analytics for transboundary haze prediction,” Proc. of the MediaEval 2021 Workshop, Online, December 2021.
[2] Wang and Chien, “Exploiting deep learning in forecasting the occurrence of severe haze in southeast asia,” arXiv preprint arXiv:2003.05763, 2020.
[3] E. N. Aziz, A. Kasem, W. S. H. Suhaili, and P. Zhao, “Convolution recurrent neural network for daily forecast of pm10 concentrations in brunei darussalam,” Chemical Engineering Transactions, vol. 83, pp. 355–360, 2021.
[4] S. Abdullah, N. N. L. M. Napi, A. N. Ahmed, W. N. W. Mansor, A. A. Mansor, M. Ismail, A. M. Abdullah, and Z. T. A. Ramly, “Development of multiple linear regression for particulate matter (pm10) forecasting during episodic transboundary haze event in malaysia,” Atmosphere, vol. 11, no. 3, p. 289, 2020.
[5] P. Zhao and K. Zettsu, “Convolution recurrent neural networks based dynamic transboundary air pollution predictiona,” pp. 410–413, 2019.
[6] G. P. Zhang, “Time series forecasting using a hybrid arima and neural network model,” Neurocomputing, vol. 50, pp. 159–175, 2003.
[7] A. E. Permanasari, I. Hidayah, and I. A. Bustoni, “Sarima (seasonal arima) implementation on time series to forecast the number of malaria incidence,” in 2013 International Conference on Information Technology and Electrical Engineering (ICITEE), pp. 203–207, IEEE, 2013.
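
As an illustration of the preprocessing and evaluation steps described in Section 3, the following minimal sketch shows Last Observation Carried Forward imputation, the chronological 70%/30% split, and the MAE and SMAPE metrics. It is not the authors' implementation. The function names and the toy PM10 values are hypothetical, and the SMAPE variant used here (mean of |a - f| divided by (|a| + |f|) / 2, in percent) is an assumption, since the paper does not state which definition was used.

```python
from typing import List, Optional, Sequence, Tuple


def locf(values: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Last Observation Carried Forward: fill each gap with the previous value."""
    filled: List[Optional[float]] = []
    last: Optional[float] = None
    for value in values:
        if value is not None:
            last = value
        # A gap stays None only if it occurs before the first observation.
        filled.append(last)
    return filled


def chronological_split(
    series: Sequence[float], train_fraction: float = 0.7
) -> Tuple[List[float], List[float]]:
    """Split without shuffling: the first part trains, the last part tests."""
    cut = int(len(series) * train_fraction)
    return list(series[:cut]), list(series[cut:])


def mae(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def smape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Symmetric MAPE in percent (assumed variant, see the lead-in)."""
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2.0
        # Treat a 0/0 term as zero error to avoid division by zero.
        total += 0.0 if denom == 0 else abs(a - f) / denom
    return 100.0 * total / len(actual)


if __name__ == "__main__":
    # Hypothetical PM10 readings with missing values (None).
    pm10 = [40.0, None, 44.0, None, None, 50.0, 48.0, None, 52.0, 55.0]
    train, test = chronological_split(locf(pm10))
    # Naive baseline: repeat the last training value for every test step.
    baseline = [train[-1]] * len(test)
    print("MAE:", mae(test, baseline), "SMAPE:", smape(test, baseline))
```

Because the split is taken by position rather than at random, no future observation leaks into the training portion, which is the property the 70%/30% division in Section 3 is meant to preserve.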
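
The combination of Grid Search and Walk-Forward validation mentioned in Section 3 can be sketched as follows. To keep the example self-contained, a moving-average forecaster stands in for the ARIMA/SARIMA models, and the grid is searched over the window size rather than over (p, d, q) orders. This is a swapped-in simplification, not the authors' setup. All function names, the toy data, and the SMAPE variant are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def moving_average_forecast(history: Sequence[float], window: int) -> float:
    """One-step-ahead forecast: mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def smape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Symmetric MAPE in percent (assumed variant: |a - f| / ((|a| + |f|) / 2))."""
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2.0
        total += 0.0 if denom == 0 else abs(a - f) / denom
    return 100.0 * total / len(actual)


def walk_forward(
    train: Sequence[float], valid: Sequence[float], window: int
) -> List[float]:
    """Forecast each validation point using only data observed before it."""
    history = list(train)
    forecasts: List[float] = []
    for observed in valid:
        forecasts.append(moving_average_forecast(history, window))
        # The true value is revealed only after the forecast has been made.
        history.append(observed)
    return forecasts


def grid_search_window(
    train: Sequence[float], valid: Sequence[float], windows: Sequence[int]
) -> Tuple[int, float]:
    """Return the window with the lowest walk-forward SMAPE, and that score."""
    scored = [(smape(valid, walk_forward(train, valid, w)), w) for w in windows]
    best_score, best_window = min(scored)
    return best_window, best_score


if __name__ == "__main__":
    # Hypothetical PM10 values, already imputed and split chronologically.
    train = [40.0, 42.0, 44.0, 43.0, 45.0, 47.0, 46.0]
    valid = [48.0, 50.0, 49.0]
    window, score = grid_search_window(train, valid, [1, 2, 3])
    print("best window:", window, "SMAPE:", round(score, 3))
```

The same loop structure applies if the forecaster is replaced by a fitted ARIMA or SARIMA model that is refit or updated at each step, at the cost of considerably more computation per candidate.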
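
For reference, the SARIMA order reported in Section 3, (p, d, q) = (1, 0, 1) with seasonal (P, D, Q, s) = (0, 1, 1, 12), corresponds to the model below when written with the backshift operator B, where B y_t = y_{t-1}. The symbols y_t (the PM10 series), \phi_1, \theta_1, \Theta_1 and \varepsilon_t (white noise) are notation introduced here for illustration, and the plus sign on the moving-average terms is one common convention rather than something the paper specifies.

```latex
\[
(1 - \phi_1 B)\,(1 - B^{12})\, y_t
  \;=\; (1 + \theta_1 B)\,(1 + \Theta_1 B^{12})\, \varepsilon_t
\]
```

The factor (1 - B^{12}) is the single seasonal difference with period 12, matching the seasonal differencing order of 1 chosen from the ACF and PACF analysis. The non-seasonal part has one autoregressive and one moving-average term and no differencing.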