Insights for Wellbeing: Predicting PM10 Values Using Stacking Ensemble Model

Huu-Vinh Nguyen1,2, Thi Thuy Nga Duong2
1 University of Information Technology, Ho Chi Minh City, Vietnam
2 University of Natural Resources and Environment, Ho Chi Minh City, Vietnam

nhvinh@hcmunre.edu.vn, dttnga_cntt@hcmunre.edu.vn

ABSTRACT
In this paper, we present the ISRS-HCMUNRE team’s contribution to the task Insight for Wellbeing: Cross-Data Analytics for (transboundary) Haze Prediction at MediaEval 2021. We extracted several types of useful attributes from the provided data sets: weather data, location features, and air pollution data. We applied a stacking method that combines deep learning models and a machine learning model to predict PM10 values at different locations for subtask 1.

1 INTRODUCTION
In many countries around the world, the prediction of air pollution is receiving increasing attention. In this study, we use deep learning and machine learning approaches, together with insights from the data provided by the organizers, to predict the PM10 value as specified in the task 1 description of MediaEval 2021 [3]. The primary goal of this task is to predict PM10 values three days ahead. In this subtask, we explore the correlation between the PM10 value and the features we extracted from the data set.

2 METHODOLOGY
2.1 Data Pre-processing
The dataset for task 1 includes weather data (temperature, humidity, wind speed, ...) and PM10 values for three countries: Thailand, Brunei and Singapore. Many data points are zero, unreasonably large, or missing. These anomalies or outliers have to be preprocessed before features are extracted. Where a weather attribute was collected hourly, we calculated its daily average. We dealt with missing values by filling them with the mean or with zero [1].
   We used the box plot method to identify outliers. With this method, we also determined the maximum and minimum of the valid data range and moved outliers into this range by setting them equal to the maximum or the minimum value.

2.2 Features Extraction
The number of weather attributes provided differs between countries; these attributes are important for predicting the PM10 values.
   2.2.1 Timestamp features. For the Brunei dataset, the weather attributes were collected daily, except for wind speed, which was collected hourly. We calculated the daily average wind speed and used all weather attributes to train the models.
   For the Singapore and Thailand datasets, all weather attributes were collected daily, while the PM10 values were collected hourly. We calculated the daily average of the PM10 values.
   2.2.2 Location features. In the station location data set, a district contains both a weather station and an air quality station. This is very helpful for predicting PM10 values from the weather data of a district.
   2.2.3 Weather features. Weather factors significantly affect the prediction of PM10 values. It has been shown that rainfall and wind speed affect the PM10 concentration in the air. We decided to use all weather attributes (e.g., temperature, humidity, wind speed, rainfall) for training the model.

3 MODELING
To train an appropriate model for the problem, we split the raw dataset of each country chronologically into training and testing datasets. The training dataset contains 80 percent of the data points and the testing dataset contains the remaining 20 percent.
   To predict PM10 values, we built two models, called the main-model and the sub-model. We use weather data from the three preceding days to predict three days ahead. The main-model is a stacked model [2] consisting of three deep learning models, Long Short-Term Memory (LSTM) [4], Bidirectional LSTM (Bi-LSTM) and Gated Recurrent Unit (GRU) [5], and one machine learning model, Linear Regression. The sub-model is an LSTM model that we used to predict PM10 values when weather data points are missing in the result data file. The sub-model is trained only on PM10 values from the preceding days.

4 STACKING METHOD
Copyright 2021 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).
MediaEval’21, 13-15 December 2021, Online

Figure 1: The illustration of the stacking model
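
The two-level stacking scheme named in Figure 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the three level-0 functions (`persistence`, `window_mean`, `linear_trend`) are hypothetical stand-ins for the LSTM, Bi-LSTM and GRU regressors, the PM10 series is synthetic, only a one-day-ahead target is predicted from PM10 history alone, and the level-1 Linear Regression is fitted by ordinary least squares with NumPy. The chronological 80/20 split mirrors the split described for the modeling step.

```python
import numpy as np

# Hypothetical stand-ins for the level-0 LSTM, Bi-LSTM and GRU regressors.
# Each maps a window of past daily PM10 values to a one-day-ahead forecast.
def persistence(window):
    return window[-1]

def window_mean(window):
    return float(np.mean(window))

def linear_trend(window):
    # Extend the line through the first and last points by one step.
    slope = (window[-1] - window[0]) / (len(window) - 1)
    return window[-1] + slope

LEVEL0 = (persistence, window_mean, linear_trend)

def make_windows(series, n_in=3):
    # Pair each run of n_in past days with the value of the following day.
    X = np.array([series[t - n_in:t] for t in range(n_in, len(series))])
    y = np.array(series[n_in:])
    return X, y

def level0_predictions(X):
    # One column per level-0 model; this matrix is the level-1 input.
    return np.array([[model(w) for model in LEVEL0] for w in X])

def fit_meta(Z, y):
    # Level-1 linear regression: least squares with an intercept column.
    A = np.column_stack([np.ones(len(Z)), Z])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_meta(coef, Z):
    return np.column_stack([np.ones(len(Z)), Z]) @ coef

# Synthetic daily PM10 series, made up for illustration only.
rng = np.random.default_rng(0)
days = np.arange(200)
series = 40 + 10 * np.sin(days / 7.0) + rng.normal(0, 2, size=days.size)

X, y = make_windows(series, n_in=3)
split = int(0.8 * len(X))  # chronological 80/20 split
Z_train = level0_predictions(X[:split])
Z_test = level0_predictions(X[split:])

coef = fit_meta(Z_train, y[:split])
stacked = predict_meta(coef, Z_test)  # final level-1 forecasts
```

In this sketch the level-1 model is fitted on the level-0 predictions for the training portion. A common variant fits it on held-out level-0 predictions to limit overfitting; the paper does not state which scheme was used.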

Table 1: Evaluation of the PM10 prediction on the Brunei training data set

            Single variable LSTM  Bi-LSTM               LSTM                  GRU                   Stacking
Station ID  MAE        RMSE       MAE        RMSE       MAE        RMSE       MAE        RMSE       MAE        RMSE
101B        10.8667    3.3        10.6673    10.812     10.9133    11.0315    10.8654    10.976     0.0425     0.21
201B        18.0297    4.25       20.6543    20.6661    21.1532    21.1606    20.6163    20.6284    0.0341     0.18
302B        31.77      5.64       23.4149    23.5046    23.7142    23.7927    23.3478    23.445     0.0362     0.19
401B        9.6541     4.11       13.9908    14.0457    14.4873    14.5318    14.4799    14.5192    0.0244     0.16

Table 2: Evaluation of the PM10 prediction on the Thailand training data set

            Single variable LSTM  Bi-LSTM               LSTM                  GRU                   Stacking
Station ID  MAE        RMSE       MAE        RMSE       MAE        RMSE       MAE        RMSE       MAE        RMSE
42T         7.8234     10.31      25.7271    25.745     25.3611    25.3749    26.0117    26.038     0.108      0.14
43T         7.43       10.19      8.784      8.8256     9.3066     9.3318     9.2756     9.3076     0.0988     0.13
44T         6.9928     8.98       21.4329    21.4416    21.2444    21.2527    21.436     21.4452    0.1161     0.15
62T         6.343      8.68       20.3824    20.4103    19.9103    19.9332    19.6263    19.6539    0.119      0.15
63T         6.4491     8.32       9.4611     9.5014     9.975      9.9999     9.9127     9.952      0.1095     0.15
80T         5.7        7.4        16.9081    17.0003    17.0171    17.0769    17.042     17.1219    0.154      0.2

Table 3: Evaluation of the PM10 prediction on the Singapore training data set

            Single variable LSTM  Bi-LSTM               LSTM                  GRU                   Stacking
Station ID  MAE        RMSE       MAE        RMSE       MAE        RMSE       MAE        RMSE       MAE        RMSE
1WS         5.1162     6.21       26.7096    26.778     26.8381    26.8885    26.6675    26.7233    0.1537     0.19
2ES         5.9605     7.63       24.1252    24.1544    24.2381    24.2608    24.1295    24.1519    0.2189     0.26
3CS         4.7474     5.81       21.3828    21.4401    21.612     21.6594    21.4659    21.5168    0.17       0.21
4SS         5.1535     6.34       29.3604    29.4337    29.4885    29.5365    29.637     29.7021    0.13       0.16
5NS         5.2575     6.68       30.2583    30.3607    30.6443    30.7297    30.5918    30.6796    0.179      0.22
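
The MAE and RMSE scores reported in the tables above are per-station averages of the forecast error. As a small worked example, the sketch below computes both metrics with the Python standard library. The function names and the four observed/predicted values are made up for illustration and are not taken from the task data.

```python
import math

def mae(observed, predicted):
    # Mean absolute error: average magnitude of the forecast errors.
    n = len(observed)
    return sum(abs(y - p) for y, p in zip(observed, predicted)) / n

def rmse(observed, predicted):
    # Root mean square error: square root of the average squared error.
    n = len(observed)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(observed, predicted)) / n)

# Illustrative daily PM10 values (hypothetical numbers).
observed = [30.0, 42.0, 38.0, 51.0]
predicted = [28.0, 45.0, 38.0, 47.0]

print(mae(observed, predicted))   # 2.25
print(rmse(observed, predicted))  # about 2.69
```

Because RMSE squares the errors before averaging, it is never smaller than MAE on the same predictions, and the gap widens when a few errors are large.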


   To enhance the prediction result, we employed the stacked generalization technique. The stacked model has two levels: level 0 and level 1. The level 0 data are the training dataset inputs, and the level 0 models learn to make predictions from this data. The level 1 input data are the outputs of the level 0 models, and the single level 1 model, or meta-learner, learns to make predictions from this data.
   We utilized three models, LSTM, Bi-LSTM and GRU, as level 0 models. We used a Linear Regression model as the level 1 model for the final prediction.

5 PERFORMANCE METRICS
To evaluate the performance of the proposed methods, we use the root mean square error (RMSE) and the mean absolute error (MAE), defined as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$$

where $\hat{y}_i$ is the $i$-th value predicted by the model, $y_i$ is the $i$-th observed value, and $N$ is the number of observations ($i = 1, \dots, N$).

6 RESULTS AND DISCUSSION
After extracting the necessary information, we evaluated four deep learning models (LSTM, Bi-LSTM, GRU and single variable LSTM) and the stacking model on the testing data set. Tables 1, 2 and 3 show the RMSE and MAE scores for each air quality station.
   Many data points are missing in the Thailand and Singapore data sets, which leads to poor performance on the testing data.

REFERENCES
 [1] Denis Cousineau and Sylvain Chartier. 2010. Outlier detection and treatment: a review. International Journal of Psychological Research (2010), 58–67.
 [2] Dat Q. Duong, Quang M. Le, Tan-Loc Nguyen-Tai, Hien D. Nguyen, Minh-Son Dao, and Binh T. Nguyen. 2021. An Effective AQI Estimation Using Sensor Data and Stacking Mechanism. In New Trends in Intelligent Software Methodologies, Tools and Techniques - Proceedings of the 20th International Conference on New Trends in Intelligent Software Methodologies, Tools and Techniques, SoMeT 202, Cancun, Mexico, 21-23 September, 2021 (Frontiers in Artificial Intelligence and Applications), Hamido Fujita and Héctor Pérez-Meana (Eds.), Vol. 337. IOS Press, 405–418. https://doi.org/10.3233/FAIA210040
 [3] Asem Kasem, Minh-Son Dao, Effa Nabilla Aziz, Duc-Tien Dang-Nguyen, Cathal Gurrin, Minh-Triet Tran, Thanh-Binh Nguyen, and Wida Suhaili. Overview of MediaEval 2021: Insights for Wellbeing Task Cross-Data Analytics for Transboundary Haze Prediction.


 [4] Jung-Hwan Park, Seong-Joon Yoo, Kyung-Joong Kim, Yeong-Hyeon Gu, Keon-Hoon Lee, and U-Hyon Son. 2017. PM10 density forecast model using long short term memory. In 2017 Ninth International Conference on Ubiquitous and Future Networks (ICUFN). 576–581. https://doi.org/10.1109/ICUFN.2017.7993855
 [5] Guang Yang, HwaMin Lee, and Giyeol Lee. 2020. A Hybrid Deep Learning Model to Forecast Particulate Matter Concentration Levels in Seoul, South Korea. Atmosphere 11, 4 (2020). https://doi.org/10.3390/atmos11040348