Insights for Wellbeing:
Predicting Personal Air Quality Index Using Regression Approach
Amel Ksibi (1), Amina Salhi (1), Ala Alluhaidan (1), Sahar A. El_Rahman (1,2)
(1) College of Computer and Information Sciences, Princess Nourah Bint Abdulrahman University, Riyadh, Saudi Arabia
(2) Electrical Engineering Department, Faculty of Engineering-Shoubra, Benha University, Cairo, Egypt
amelksibi@pnu.edu.sa, Aisalhi@pnu.edu.sa, Asalluhaidan@pnu.edu.sa, sahr_ar@yahoo.com


Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval’20, December 14-15 2020, Online

ABSTRACT
Providing air pollution information to individuals enables them to understand the air quality of their living environments. The association between people’s wellbeing and the properties of the surrounding environment is therefore an essential area of investigation. This paper proposes an approach to air quality prediction that harvests public/open data and leverages them to estimate a personal air quality index. Such data are usually incomplete. To cope with missing data, we apply KNN imputation. To predict the personal air quality index, we apply a voting regression approach based on three base regressors: a Gradient Boosting regressor, a Random Forest regressor, and a linear regressor. Evaluated with the RMSE metric, our run obtained an average AQI score of 35.39 for walker data and 51.16 for car data.

1 INTRODUCTION
Air pollution has a severe impact on public health and the environment [1]. Providing air pollution information to individuals enables them to understand the air quality of their living environments. The association between people’s wellbeing and the properties of the surrounding environment is therefore an essential area of investigation [2]. Public atmospheric monitoring stations in urban areas, equipped with expensive high-end air pollution sensors deployed across the globe, provide large quantities of global air quality data (GAQD). These data, which include weather data (temperature, wind) and air pollution data (PM2.5, NO2, O3) collected over the city, have been widely investigated for the general population [3]. At the scale of individual people and their personal wellbeing, however, such investigations remain limited, and assessments of the impact of air pollution on personal health suffer from low accuracy and low spatio-temporal resolution.

With the abundance of sensing devices, developing hypotheses about the associations within the heterogeneous sensor data captured by these devices contributes to building effective models for understanding the impact of the environment on wellbeing at the individual scale. Such models are necessary because not all cities are fully covered by standard air pollution and weather stations. The critical research question is whether data from open sources alone (e.g., weather and air pollution data) can be used to predict personal air pollution data.

However, it is not always possible to gather plentiful amounts of such data. As a result, a key research question remains open: can sparse or incomplete data be used to gain insight into wellbeing? Meanwhile, machine learning techniques have brought more opportunities for accurate prediction of air pollution [4]. It is therefore necessary to find new approaches based on data analytics for the personal air quality prediction challenge.

The objective of this study is to evaluate the ability of regression approaches to predict individual air pollutant values and the air quality index (AQI).

The paper is organized as follows. Section 2 presents the state of the art in air quality prediction methods. Section 3 describes the proposed process for air pollutant prediction. Section 4 analyses the results, while Section 5 covers the discussion and conclusion.

2 RELATED WORK
City-wide air quality prediction has been of interest over the past 40 years [3]. However, these studies focused only on determining air pollutant values at city scale for the general population. At the personal scale, recent investigations focus on crowdsourced computing by harvesting data from wearable sensors [5]. These sensors provide lifelog data, which can be classified into two categories: numerical data (weather data, environmental variables, GPS, time, health measurements, etc.) and multimedia data (e.g., images and annotations). This study focuses on personal air quality prediction using numerical lifelog data. Personal air quality is a significant indicator when evaluating the impact of air pollution on personal health [6]. The main challenge in predicting personal air quality is developing an effective model from a small, sparse, or incomplete training dataset. To deal with this issue, Zhao and Zettsu [7] proposed a prediction model based on a CRNN (convolution recurrent neural network) for short-term PM2.5 pollution prediction that exploits the spatial-temporal features of atmospheric sensing data. The experiments were conducted using the atmospheric sensing dataset from thirty-three coastal cities in China and the Fukuoka environmental monitoring dataset from 2015 to 2017.

Zhao and Zettsu [6] designed a transfer learning model with an encoder-decoder structure, decoder transfer learning (DTL), which uses the Wasserstein distance to match the heterogeneous distribution of the atmospheric monitoring station data (the source domain) to the personal air quality data (the target domain).

The aforementioned methods focus on determining the personal air quality index from various factors such as weather, GPS, and environmental data. In this paper, we aim to select the most important factors that influence the prediction of personal air quality data.

3 METHODOLOGY
Our proposed process contains two steps: data preprocessing, followed by training a voting regressor for Personal Air Quality Prediction with public/open data.

3.1 Data preprocessing
The dataset used in this paper is the Personal Air Quality Dataset (PAQD), which is described in [5]. It contains weather data (e.g., temperature, humidity), atmospheric data (e.g., O3, PM2.5, and NO2), GPS data, and multimedia data (e.g., images, annotations). Since the quality and representativeness of the data play a crucial role in the effectiveness of a prediction algorithm, we preprocess the data to guarantee their quality. This process consists of missing data imputation, feature extraction, and feature selection.

a) Missing data imputation
Based on the hypothesis that heterogeneous data recordings at nearby locations and times are strongly correlated, we consider two recordings to be close if the features that are missing in neither of them are close. We can therefore estimate the value of a missing feature as the mean value of that feature over the k nearest recordings. We used sklearn.impute.KNNImputer to impute the missing values, with k=5.

b) Features extraction
Based on the assumption that the level of pollution may vary from one period to another on the same day, from one day to another in the same month, and from one month to another in the same year, we extracted the following features from the datetime component to enrich the learning model with temporal information: month number [1-12], day [1-31], hour of the day [0-23], and minute [0-59].

c) Features selection
To select the most important features, we evaluated different combinations of features by applying a simple regressor to the training dataset. According to the results, including weather data increases the RMSE. We therefore use only time data and GPS data to predict the values of the pollutant variables O3, PM2.5, and NO2.

3.2 Personal Air Quality Prediction with public/open data
Personal air quality prediction can be represented as a regression problem in which we are required to determine a continuous value, the AQI. Given the selected features, we apply a regression approach to estimate the value of each pollutant variable. To this end, we tested different regressor models on the training dataset and obtained the best results with the voting regressor. The voting regressor is an ensemble meta-estimator that fits several base regressors, each on the whole dataset, and then averages the individual predictions to form a final prediction. In our voting regressor, we use a Gradient Boosting regressor, a Random Forest regressor, and a linear regressor as base regressors. A Gradient Boosting regressor relies on a loss function to be optimized, a weak learner to make predictions, and an additive model that adds weak learners to minimize the loss function; the weak learners are usually decision trees. A Random Forest regressor uses multiple decision trees and bootstrap aggregation to produce a more reliable prediction model. Linear regression, the best-known form of regression analysis, is based on a linear predictor function with unknown model parameters.

4 EXPERIMENTAL RESULTS AND ANALYSIS
In this section, we report and discuss the experimental results achieved after submitting one run for task 1, “Personal Air Quality Prediction with public/open data”. Table 1 presents the official results for our run based on the regression approach. The performance of the predictions was evaluated using the root mean square error (RMSE). As can be seen in Table 1, O3 achieved the best results, with a score of 12.08, on sensor data collected by walkers, while NO2 showed the best results, with a score of 25.02, on sensor data collected by car. Moreover, the AQI results obtained from walker data are better than those obtained from car data. This may indicate that the sensor data collected by walkers are of higher quality than the data collected by car.

Table 1: Official results of the submitted run (RMSE)

              PM2.5    NO2      O3       AQI
AVG walkers   35.34    25.98    12.08    35.39
AVG car       40.93    25.02    35.98    51.16

5 CONCLUSIONS
This paper represents our first attempt to address the task “Personal Air Quality Prediction with public/open data”. The proposed solution is based on data preprocessing and on training a voting regressor built from three base regressors. The results suggest that the sensor data collected by walkers are of higher quality than the data collected by car. As future work, we plan to investigate transfer learning over multimedia lifelog data, such as egocentric photos and videos, to gain insights into individual wellbeing.

ACKNOWLEDGMENTS
The authors extend their appreciation to the Deputyship for Research & Innovation, Ministry of Education in Saudi Arabia for funding this research work through the project number PNU-DRI-RI-20-033.
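
As a rough illustration of the steps described in Section 3 and of the RMSE metric used in Section 4, the following is a minimal pure-Python sketch, not the code used for the submitted run. The function names (temporal_features, knn_impute, voting_predict, rmse) are hypothetical. The imputation function is a simplified stand-in for sklearn.impute.KNNImputer: it measures closeness on the features present in both recordings and averages the k nearest observed values. It is not a reimplementation of that class. The base models passed to voting_predict are assumed to be already fitted and are represented here by arbitrary callables.

```python
import math
from datetime import datetime
from statistics import mean
from typing import Callable, List, Optional, Sequence

Row = Sequence[Optional[float]]


def temporal_features(ts: datetime) -> List[int]:
    """Month [1-12], day [1-31], hour [0-23] and minute [0-59] of a timestamp."""
    return [ts.month, ts.day, ts.hour, ts.minute]


def knn_impute(rows: Sequence[Row], k: int = 5) -> List[List[Optional[float]]]:
    """Fill None entries with the mean of that feature over the k nearest rows.

    Two rows are compared only on the features that neither is missing
    (mean squared difference). Simplified stand-in for KNNImputer.
    """

    def distance(a: Row, b: Row) -> float:
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        return mean(shared) if shared else math.inf

    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors: other rows that observed feature j and share at least
            # one observed feature with the current row.
            donors = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                d = distance(row, other)
                if d != math.inf:
                    donors.append((d, other[j]))
            donors.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:  # left as None when no donor exists
                filled[i][j] = mean(nearest)
    return filled


def voting_predict(models: Sequence[Callable[[Sequence[float]], float]],
                   x: Sequence[float]) -> float:
    """Average the predictions of already-fitted base models for one sample."""
    return mean(model(x) for model in models)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error between observed and predicted values."""
    return math.sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))
```

In this sketch, a feature vector for one recording would combine the output of temporal_features with the GPS coordinates, and one voting ensemble would be used per pollutant variable. Averaging with equal weights mirrors the description of the voting regressor in Section 3.2; weighted averaging would be a different design choice.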

REFERENCES

[1] H. Song, K. J. Lane, H. Kim, H. Kim, G. Byun, M. Le, Y. Choi, C. R. Park, and J. T. Lee, “Association between Urban Greenness and Depressive Symptoms: Evaluation of Greenness Using Various Indicators,” International Journal of Environmental Research and Public Health, vol. 16, no. 2, 173, 2019.
[2] P. Vo, T. Phan, M. Dao, and K. Zettsu, “Association Model between Visual Feature and AQI Rank Using Lifelog Data,” 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 2019, pp. 4197-4200.
[3] Y. Xu, W. Yang, and J. Wang, “Air quality early-warning system for cities in China,” Atmospheric Environment, vol. 148, pp. 239-257, 2017.
[4] S. Ameer, M. A. Shah, A. Khan, H. Song, C. Maple, S. U. Islam, and M. N. Asghar, “Comparative analysis of machine learning techniques for predicting air quality in smart cities,” IEEE Access, vol. 7, pp. 128325-128338, 2019.
[5] M. S. Dao, P. J. Zhao, N. T. Nguyen, T. B. Nguyen, D. T. Dang-Nguyen, and C. Gurrin, “Overview of MediaEval 2020: Insights for wellbeing task - multimodal personal health lifelog data analysis,” in MediaEval Benchmarking Initiative for Multimedia Evaluation, CEUR Workshop Proceedings, Dec. 2020.
[6] P. Zhao and K. Zettsu, “Decoder Transfer Learning for Predicting Personal Exposure to Air Pollution,” 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 2019, pp. 5620-5629.
[7] P. Zhao and K. Zettsu, “Convolution Recurrent Neural Networks for Short-Term Prediction of Atmospheric Sensing Data,” The 4th IEEE International Conference on Smart Data (SmartData 2018), pp. 815-821.

