Towards Time Series Forecasting of Cross-Data Analytics for Haze
                          Prediction
                                           Ali Akbar, Muhammad Atif Tahir, Muhammad Rafi
                           National University of Computer and Emerging Sciences, Karachi Campus, Pakistan
                                             {k201306,atif.tahir,muhammad.rafi}@nu.edu.pk

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).
MediaEval’21, December 13-15 2021, Online

ABSTRACT
Atmospheric pollution and haze, a thin or thick layer of dust and smoke, have become
major issues all over the world. Haze obscures the visibility of the sky [1]. The paper
first explored the dataset with the help of visualizations and ACF and PACF plots and
analyzed its trend and seasonality components. Thereafter, time series methodologies
were applied to predict PM10 values, given data from the same country and data from
neighbouring countries as well. Models including ARIMA and SARIMA were applied and
tuned, along with training methodologies including Grid Search and Walk-Forward
validation. The paper also employed the Vector Auto-Regression (VAR) methodology to
capture the cross-data relationship between one country and another. The implementation
produced best SMAPE scores (across the two sub-tasks) of 44.96, 29.07 and 27.74 on the
Brunei, Singapore and Thailand datasets, respectively.

1    INTRODUCTION
Environmental pollution is a major concern. The basic idea of the research in haze
prediction is presented in [1]. Our main motivation for participating in the challenge
is to be part of the research aimed at the still unsolved problem of haze. It is
perceived as a major problem for countries around the world and is becoming a serious
health concern across the globe.
   The paper focused on the first two subtasks, i.e., 1) predicting the PM10 values
using data from the same country, and 2) predicting the PM10 values using data from the
same country and from other countries as well. The dataset was first visualised,
structured and arranged in different files. One of the challenges, missing values, was
also tackled, since the parameters, along with the PM10 data, were not available for all
weather stations at all times. Thereafter, several Time Series and Machine Learning
models were applied and evaluated based on their results on the train and test datasets.

2    RELATED WORK
The paper in [2] discussed a methodology to forecast haze occurrences in Southeast
Asia. It utilized a Convolutional Neural Network (CNN) based framework, known as
HazeNet, which has a 16-layer architecture and was trained on 18 hydrological and
meteorological features and time-sequence maps covering about 35 years. It achieved
about 95.2% accuracy on the validation set; however, it neglected the impact and
importance of transboundary haze effects.
   The authors of [3] identified two major research gaps that persisted in predicting
PM10 concentrations in Brunei Darussalam: 1) recent research did not take CNNs and
Recurrent Neural Networks (RNN) into account for the haze prediction use case, and
2) the majority of other researchers used data from the 1997 to 1998 period, which was
considered a disastrous period in that region and which made the research outcomes
biased and not widely applicable. The authors attempted multiple Time Series and Deep
Learning techniques, including Moving Averages (the average of PM10 values over a
shifting window), Linear Regression, and an RNN with a 1-D convolutional layer (CRNN),
of which the CRNN proved to be the best performing model throughout. However, that
paper also does not address transboundary effects for haze prediction, and it explicitly
proposes this approach as possible future work.
   The authors of [4] studied the effect of transboundary haze events in Malaysia by
using Multiple Linear Regression (MLR) for estimating PM10. The paper used stepwise MLR
with a 95% confidence interval, and the dataset was divided into 70% for training and
30% for testing. Along with normalization, the authors used different tests, including
the Durbin–Watson (DW) test and the R-squared test for correlation. The authors
concluded by showing different test results and standard deviation and dispersion
graphs, with the PM10(t+1) model giving an accuracy of 0.668. The paper did incorporate
transboundary haze prediction but did not incorporate, analyse and compare the localized
version of haze in a region.
   The authors of [5] used a Convolutional RNN to determine transboundary haze levels
in island cities and introduced a Dynamic C-RNN (D-CRNN) that combines a CNN and an RNN
and can model interactions spatially and temporally. The paper used spatial
transformation and techniques such as Inverse Distance Weighting to transform the data,
keeping the constraints and assumptions of the island setting in view. The paper
concluded with results showing that D-CRNN was successful when compared with existing
state-of-the-art algorithms. While this paper makes an attempt to address both
transboundary and local effects, many of the assumptions and methods involved presuppose
that training and testing are done for island cities, especially when transforming the
data. Therefore, the results, methodology and data transformation techniques may or may
not carry over consistently to non-island cities or countries.

3   APPROACH
The first step in the study was to understand and visualize the data to gain a basic
understanding of the values. In addition, there were some null values in the dataset
which had to be taken care of, since the model implementations required a complete
dataset without any null values. This was done by trying different methods on different
datasets, including Last Observation Carried Forward (LOCF) and linear imputation. These
methods were tried and tested several times and their impact on accuracy was observed.
It was finally concluded that the LOCF method produced the best SMAPE scores, and hence
it was applied to impute the missing values.

Figure 1: Architecture diagram of the proposed approach

   Further, as shown in Figure 1, the data and its structure were arranged as required
by the different models, for example by combining the date, month and year columns into
a single date-time column. Thereafter, the Auto-Correlation Function (ACF) and Partial
Auto-Correlation Function (PACF) were studied to gain insight into the dataset
(including observing the trend and seasonality components, which led to using a seasonal
differencing order of 1 in the model fitting and selection step) and to find the
appropriate lag variables to be used when fitting a Time Series model. It was identified
that lags 1 and 2 were of most value in the datasets; hence this observation was applied
in the model fitting stage and was further verified by the grid search method. This
step, along with other dataset-specific steps, was carried out three times, since we had
data from three different countries. For training and model evaluation purposes, the
datasets were divided into two parts: 70% for training and 30% for testing, i.e., the
first 70% of the data was chosen as training data and the last 30% as test data, thereby
preserving the chronological order of the time series.
   Task 1 was to predict the PM10 value at different locations in multiple countries
using data only from each country itself, and therefore several Time Series techniques
were applied and evaluated, for example Moving Averages and ARIMA [6]/SARIMA [7] models.
Methods such as Grid Search and Walk-Forward validation were used while evaluating and
validating the Time Series models, so that the best and most accurate models were
selected while weighing the PM10 values against the amount of time that had passed.
While these methods gave good accuracy on the Brunei dataset (lower error rate compared
to others), the methodology slightly failed on the other datasets, mainly because Google
Colab (an online Python notebook and execution engine), which was used to train and test
the models, crashed with larger amounts of data and heavy RAM consumption. This heavily
impacted the results, since the data then had to be divided into batches before being
given for training. However, the best model that could be produced was SARIMA with
optimal (p, d, q) parameters of (1, 0, 1) and optimal seasonal (P, D, Q, S) parameters
of (0, 1, 1, 12).
   For task 2, Vector Auto-Regression (VAR) models were implemented to find the
relationship of one country's data with other countries' data, along with the data
provided by other stations. The data from the other countries, including other weather
variables, were merged into a data frame and then fed to the VAR model. Additionally,
statistical tests, including the cointegration test, were used to find the relationship
between multiple features, and the lag variables were identified with the help of the
lowest AIC and BIC scores.

4   RESULTS AND ANALYSIS
The models provided reasonably fair results for all three datasets (Brunei, Singapore,
and Thailand) across the two sub-tasks. However, it is noted that there is not much
difference between the two sub-tasks. This may be due to under-fitting of the model and
ineffective parameter selection, since Time Series models vary a lot depending on the
lag variables and the choice of other parameters. The individual scores are reported
below:

     Task        Brunei            Singapore           Thailand
               MAE    SMAPE       MAE    SMAPE       MAE    SMAPE
      1       9.418   44.96      7.437   29.07      7.473   27.85
      2       9.419   44.97      7.436   29.07      7.443   27.74

5   DISCUSSION AND OUTLOOK
The proposed model for task 1 achieved encouraging results for all the regions; however,
for task 2, the model's performance was not very satisfactory and did not achieve better
results, which may be due to under-fitting and an inefficient choice of parameters. Time
series models are very much dependent on identifying the key dependency on lag values.
Therefore, there is a need to re-evaluate these values, and a modified model may be
suggested for the second sub-task. Moreover, model tuning and hyper-parameter learning
can also be used to improve the model, and since Task 2 and Task 3 were more challenging
and require more complex models, spending more time and resources on model selection and
fine-tuning could be very useful. We would finally like to thank MediaEval and the task
organizers for providing us with the opportunity to work in this domain and contribute
to society.
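
As a rough illustration of the imputation step described in Section 3, the two strategies that were compared, Last Observation Carried Forward and linear imputation, might be sketched as follows. This is a minimal sketch on made-up readings, not the authors' implementation; the function names `locf` and `linear_impute` and the sample values are assumptions introduced here.

```python
from typing import List, Optional


def locf(values: List[Optional[float]]) -> List[Optional[float]]:
    """Last observation carried forward: fill each gap with the latest seen value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # stays None until the first real observation
    return filled


def linear_impute(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill interior gaps by interpolating linearly between neighbouring observations."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


# Hypothetical PM10 readings; None marks a missing value.
pm10 = [40.0, None, None, 52.0, None, 48.0]
locf_filled = locf(pm10)
linear_filled = linear_impute(pm10)
print(locf_filled)
print(linear_filled)
```

LOCF only ever uses past values, which may be one reason it fits naturally into a forecasting pipeline, whereas linear imputation also looks at the next available observation.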
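
The ACF inspection mentioned in Section 3 can be illustrated with a small sample autocorrelation routine. The sketch below is only an illustration under stated assumptions: the series is synthetic with a period of 4 (not the period of 12 used in the reported SARIMA model), and `sample_acf` is a hypothetical helper, not a function from the authors' code.

```python
from typing import List


def sample_acf(series: List[float], max_lag: int) -> List[float]:
    """Sample autocorrelation r_k for k = 0..max_lag, normalised by the lag-0 sum."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


# Synthetic series with a period of 4, standing in for a seasonal signal.
series = [10.0, 20.0, 30.0, 20.0] * 6
acf = sample_acf(series, max_lag=8)
for lag, r in enumerate(acf):
    print(f"lag {lag}: {r:+.3f}")
```

Peaks at multiples of the seasonal period, as at lags 4 and 8 here, are the kind of pattern that would motivate a seasonal differencing term.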
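
The evaluation protocol described in Section 3 (a chronological 70/30 split followed by walk-forward validation, scored with MAE and SMAPE) might look roughly like the sketch below. Several things here are assumptions rather than details from the paper: a simple moving-average forecaster stands in for ARIMA/SARIMA, the data are made up, and the SMAPE variant shown (in percent, ranging from 0 to 200) may differ from the formula used by the task organizers.

```python
from typing import Callable, List, Tuple


def chronological_split(series: List[float], train_pct: int = 70) -> Tuple[List[float], List[float]]:
    """First train_pct percent of the observations for training, the rest for testing."""
    cut = len(series) * train_pct // 100
    return series[:cut], series[cut:]


def mae(actual: List[float], predicted: List[float]) -> float:
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def smape(actual: List[float], predicted: List[float]) -> float:
    """One common SMAPE variant, in percent; treated as an assumption here."""
    terms = [
        0.0 if a == p else 2.0 * abs(a - p) / (abs(a) + abs(p))
        for a, p in zip(actual, predicted)
    ]
    return 100.0 * sum(terms) / len(terms)


def moving_average_forecast(history: List[float], window: int = 2) -> float:
    """Stand-in forecaster: mean of the last `window` observations."""
    return sum(history[-window:]) / window


def walk_forward(train: List[float], test: List[float],
                 forecast: Callable[[List[float]], float]) -> List[float]:
    """Predict one step ahead, then reveal the true value and add it to the history."""
    history, predictions = list(train), []
    for observed in test:
        predictions.append(forecast(history))
        history.append(observed)
    return predictions


# Hypothetical PM10 values in chronological order.
series = [50.0, 52.0, 48.0, 51.0, 55.0, 53.0, 49.0, 47.0, 50.0, 54.0]
train, test = chronological_split(series)
preds = walk_forward(train, test, moving_average_forecast)
print(len(train), len(test), preds)
print(round(mae(test, preds), 3), round(smape(test, preds), 3))
```

Because the split is by position rather than random, no future observation leaks into the training portion, and each walk-forward prediction only sees values that precede it.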
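
For readers less familiar with the notation, the best configuration reported in Section 3, SARIMA with (p, d, q) = (1, 0, 1) and (P, D, Q, S) = (0, 1, 1, 12), can be written out under the usual Box-Jenkins convention. The symbols below are introduced here for illustration and do not appear in the original: B is the backshift operator, y_t the PM10 series, ε_t white noise, and φ_1, θ_1, Θ_1 the coefficients to be estimated. Sign conventions for the moving-average terms vary between texts and libraries, and any constant term is omitted.

```latex
(1 - \phi_1 B)\,(1 - B^{12})\, y_t
  = (1 + \theta_1 B)\,(1 + \Theta_1 B^{12})\, \varepsilon_t

% Expanded form:
y_t = y_{t-12} + \phi_1\,(y_{t-1} - y_{t-13})
      + \varepsilon_t + \theta_1\,\varepsilon_{t-1}
      + \Theta_1\,\varepsilon_{t-12} + \theta_1 \Theta_1\,\varepsilon_{t-13}
```

The factor (1 - B^{12}) is the single seasonal difference, and the lag-1 terms correspond to the short lags that the ACF and PACF analysis pointed to.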
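
The VAR model used for task 2 and the AIC/BIC-based lag selection can be stated generically as follows. This is a textbook formulation under assumed notation, not the exact specification fitted by the authors: y_t is a k-dimensional vector stacking PM10 and the other merged variables, c is an intercept vector, A_i are k-by-k coefficient matrices, T is the number of observations used for estimation, and \hat{\Sigma}_p is the estimated residual covariance for lag order p.

```latex
\mathbf{y}_t = \mathbf{c} + \sum_{i=1}^{p} A_i\, \mathbf{y}_{t-i} + \boldsymbol{\varepsilon}_t

\mathrm{AIC}(p) = \ln\bigl|\hat{\Sigma}_p\bigr| + \frac{2\,p\,k^2}{T},
\qquad
\mathrm{BIC}(p) = \ln\bigl|\hat{\Sigma}_p\bigr| + \frac{\ln(T)\,p\,k^2}{T}
```

The lag order is then chosen as the p that minimises the criterion; BIC penalises additional lags more heavily than AIC once ln(T) exceeds 2, so the two criteria can disagree.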


REFERENCES
[1] A. Kasem, M.-S. Dao, E. N. Aziz, D.-T. Dang-Nguyen, C. Gurrin, M.-T.
    Tran, T.-B. Nguyen, and W. Suhaili, “Overview of Insight for Wellbeing
    task at MediaEval 2021: Cross-data analytics for transboundary haze
    prediction,” Proc. of the MediaEval 2021 Workshop, Online, December
    2021.
[2] Wang and Chien, “Exploiting deep learning in forecasting the occurrence
    of severe haze in Southeast Asia,” arXiv preprint arXiv:2003.05763,
    2020.
[3] E. N. Aziz, A. Kasem, W. S. H. Suhaili, and P. Zhao, “Convolution
    recurrent neural network for daily forecast of PM10 concentrations in
    Brunei Darussalam,” Chemical Engineering Transactions, vol. 83, pp. 355–
    360, 2021.
[4] S. Abdullah, N. N. L. M. Napi, A. N. Ahmed, W. N. W. Mansor, A. A.
    Mansor, M. Ismail, A. M. Abdullah, and Z. T. A. Ramly, “Development
    of multiple linear regression for particulate matter (PM10) forecasting
    during episodic transboundary haze event in Malaysia,” Atmosphere,
    vol. 11, no. 3, p. 289, 2020.
[5] P. Zhao and K. Zettsu, “Convolution recurrent neural networks based
    dynamic transboundary air pollution prediction,” pp. 410–413, 2019.
[6] G. P. Zhang, “Time series forecasting using a hybrid ARIMA and neural
    network model,” Neurocomputing, vol. 50, pp. 159–175, 2003.
[7] A. E. Permanasari, I. Hidayah, and I. A. Bustoni, “SARIMA (seasonal
    ARIMA) implementation on time series to forecast the number of malaria
    incidence,” in 2013 International Conference on Information Technology
    and Electrical Engineering (ICITEE), pp. 203–207, IEEE, 2013.