<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Insights for Wellbeing: Predicting Personal Air Quality Index Using Regression Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Amel Ksibi</string-name>
          <email>amelksibi@pnu.edu.sa</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amina Salhi</string-name>
          <email>Aisalhi@pnu.edu.sa</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ala Alluhaidan</string-name>
          <email>Asalluhaidan@pnu.edu.sa</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sahar A. El_Rahman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Computer and Information Sciences, Princess Nourah Bint Abdulrahman University</institution>
          ,
          <addr-line>Riyadh</addr-line>
          ,
          <country country="SA">Saudi Arabia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Electrical Engineering Department, Faculty of Engineering-Shoubra, Benha University</institution>
          ,
          <addr-line>Cairo</addr-line>
          ,
          <country country="EG">Egypt</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>Providing air pollution information to individuals enables them to understand the air quality of their living environments. Thus, the association between people's wellbeing and the properties of the surrounding environment is an essential area of investigation. This paper proposes a personal air quality prediction approach that harvests public/open data and leverages them to estimate a Personal Air Quality Index. Such data are usually incomplete. To cope with missing data, we apply the KNN imputation method. To predict the Personal Air Quality Index, we apply a voting regression approach built on three base regressors: a Gradient Boosting regressor, a Random Forest regressor, and a linear regressor. Evaluated with the RMSE metric, the approach obtains an average score of 35.39 for Walker and 51.16 for Car.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>
        Air pollution has a severe impact on public health and the
environment [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Providing air pollution information to individuals
enables them to understand the air quality of their living
environments. Thus, the association between people’s wellbeing
and the properties of the surrounding environment is an essential
area of investigation [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Public atmospheric monitoring
stations in urban areas provide large quantities of global air quality
data (GAQD) by deploying expensive high-end air pollution
sensors across the globe. These data, which include weather data
(temperature, wind) and air pollution data (PM2.5, NO2, O3)
collected over the city, have been investigated widely for the general
population [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. However, at the scale of individual people and their
personal wellbeing, such investigations are too limited,
which leads to low accuracy and low spatio-temporal resolution
when assessing the impact of air pollution on personal health.
With the abundance of sensing devices, developing hypotheses about
the associations within the heterogeneous sensor data captured by
these devices contributes to building effective models that
make it possible to understand the impact of the environment on
wellbeing at the individual scale. Such models are necessary since
not all cities are fully covered by standard air pollution and weather
stations. The critical research question here is whether we can use
only data from open sources (e.g., weather and air pollution data) to
predict personal air pollution data.
      </p>
      <p>
        However, it is not always possible to gather plentiful amounts of
such data. As a result, a key research question remains open: Can
sparse or incomplete data be used to gain insight into wellbeing?
Meanwhile, machine learning techniques have brought more
opportunities for accurate prediction of air pollution [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Thus, it is
necessary to find new data analytics approaches for the
personal air quality prediction challenge.
      </p>
      <p>The objective of this study is to evaluate the ability of regression
approaches to predict individual air pollutant values and the air
quality index (AQI).</p>
      <p>Our paper is organized as follows. In Section 2, we present the state of
the art in air quality prediction methods. In Section 3, we describe the
proposed process for air pollutant prediction. Section 4 analyses the
results, while Section 5 covers the discussion and conclusion.</p>
    </sec>
    <sec id="sec-2">
      <title>2 RELATED WORK</title>
      <p>
        City-wide air quality prediction has been of interest over the past
40 years [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. However, these studies focused only on
determining air pollutant values at city scale for the general
population. At the personal scale, recent investigations focus on
crowdsourced computing that harvests data from wearable
sensors [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. These sensors provide lifelog data, which can be
classified into two categories: numerical data (weather data,
environmental variables, GPS, time, health measurements, etc.)
and multimedia data (e.g., images).
This study focuses on personal air quality prediction using
numerical lifelog data. Personal air quality is a significant indicator
when evaluating the impact of air pollution on personal health [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
The main challenge in predicting personal air quality is
developing an effective model from a small, sparse, or
incomplete training dataset. To deal with this issue, Zhao et
al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] proposed a prediction model based on a CRNN (convolution
recurrent neural network) for short-term PM2.5 pollution prediction
that utilizes the spatio-temporal features of atmospheric sensing data.
The experiments were conducted using an atmospheric sensing dataset
from thirty-three coastal cities in China and Fukuoka’s
environmental monitoring dataset from 2015 to 2017.
Zhao et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] designed an
encoder-decoder model with decoder transfer learning (DTL), which relies
on the Wasserstein distance to match the heterogeneous distributions of
the source domain (atmospheric monitoring station data)
and the target domain (personal air quality data).
The aforementioned methods focus on determining the personal air
quality index from various factors such as weather, GPS, and
environmental data. In this paper, we aim to select the most
important factors that influence the prediction of
personal air quality data.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 METHODOLOGY</title>
      <p>Our proposed process contains two steps: data preprocessing, followed by
training a voting regressor to predict personal air quality
from public/open data.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Data preprocessing</title>
      <p>
        The dataset used in this paper is the Personal Air Quality Dataset
(PAQD), which is described in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. It contains weather data (e.g.,
temperature, humidity), atmospheric data (e.g., O3, PM2.5, and
NO2), GPS data, and multimedia data (e.g., images, annotations).
Since data quality and representativeness play a crucial role
in the effectiveness of a prediction algorithm, we preprocess
the data to ensure their quality. This process
consists of missing data imputation, feature extraction, and feature
selection.
      </p>
      <p>a) Missing data imputation
Based on the hypothesis that heterogeneous data recordings at nearby
locations and times are strongly correlated, we
consider two recordings to be close if the features that neither is
missing are close. We can therefore estimate the missing
features of a recording as the mean value over its k nearest recordings.
We used sklearn.impute.KNNImputer to impute the
missing values, with k=5.</p>
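      <p>As a minimal sketch of this imputation step, the snippet below applies sklearn.impute.KNNImputer with n_neighbors=5 to a small array. The toy recordings and their column layout (latitude, longitude, hour, PM2.5) are assumptions made for illustration and are not taken from PAQD.</p>
      <preformat>
```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy recordings (not PAQD): latitude, longitude, hour, PM2.5.
# np.nan marks a missing sensor reading.
recordings = np.array([
    [35.60, 139.70, 8.0, 12.0],
    [35.61, 139.71, 8.0, np.nan],
    [35.60, 139.72, 9.0, 14.0],
    [35.62, 139.70, 9.0, 16.0],
    [35.63, 139.73, 10.0, 18.0],
    [35.64, 139.74, 10.0, 20.0],
])

# k = 5 as stated in the text. The default distance compares only the
# features that are present in both recordings.
imputer = KNNImputer(n_neighbors=5)
completed = imputer.fit_transform(recordings)

print(completed[1])
```
      </preformat>
      <p>Only five other recordings are available in this toy array, so the missing PM2.5 value is filled with the mean of all five observed values. On a larger dataset the five donors would be the recordings closest in the non-missing features.</p>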
      <p>
        b) Feature extraction
Based on the assumption that the level of pollution may vary from
one period to another within the same day, from one day to another
within the same month, and from one month to another within the same year,
we extracted the following features from the datetime component to
enrich the learning model with temporal information: month
number [1–12], day [1–31], hour of the day [0–23], and minute [0–59].
      </p>
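      <p>The temporal features listed above could be derived along the following lines. This is only a sketch: the function name temporal_features and the timestamp format are assumptions, since the text does not specify how the datetime component is stored.</p>
      <preformat>
```python
from datetime import datetime


def temporal_features(timestamp):
    """Split an ISO-8601 timestamp string into month, day, hour, and minute."""
    moment = datetime.fromisoformat(timestamp)
    return {
        "month": moment.month,    # 1-12
        "day": moment.day,        # 1-31
        "hour": moment.hour,      # 0-23
        "minute": moment.minute,  # 0-59
    }


# Hypothetical recording time, chosen only for illustration.
features = temporal_features("2020-03-15 08:42:00")
print(features)
```
      </preformat>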
      <p>c) Feature selection
To select the most important features, we evaluated different
combinations of features by applying a simple regressor to
the training dataset. According to the obtained results, weather data
increase the RMSE. We therefore decided to use only time data and
GPS data to predict the values of the pollutant variables O3, PM2.5,
and NO2.</p>
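      <p>One way such a comparison of feature combinations might be organised is sketched below. The data are synthetic, and the choice of LinearRegression as the simple regressor, the hold-out split, and the group names are all assumptions; the text does not say which regressor or validation scheme was used.</p>
      <preformat>
```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for the three feature groups (not PAQD).
groups = {
    "time": rng.uniform(0, 23, size=(n, 1)),
    "gps": rng.uniform(0, 1, size=(n, 2)),
    "weather": rng.normal(size=(n, 2)),
}
# Synthetic target that depends on time and GPS only.
target = (2.0 * groups["time"][:, 0]
          + 10.0 * groups["gps"][:, 0]
          + rng.normal(scale=0.5, size=n))


def rmse_for(names):
    """Fit a simple regressor on the named groups and return the hold-out RMSE."""
    X = np.hstack([groups[name] for name in names])
    X_train, X_val, y_train, y_val = train_test_split(
        X, target, test_size=0.25, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    return float(np.sqrt(mean_squared_error(y_val, model.predict(X_val))))


# Score every non-empty combination of feature groups.
scores = {}
for size in range(1, len(groups) + 1):
    for names in combinations(groups, size):
        scores[names] = rmse_for(names)

best = min(scores, key=scores.get)
print(best, scores[best])
```
      </preformat>
      <p>The combination with the lowest validation RMSE is kept. In this synthetic setup the target ignores the weather group by construction, so the sketch illustrates the procedure only and says nothing about the real data.</p>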
    </sec>
    <sec id="sec-5">
      <title>3.2 Personal Air Quality Prediction with public/open data</title>
      <p>Personal air quality prediction can be formulated as a
regression problem in which we must determine a continuous
value, the AQI. Given the selected features, we apply a
regression approach to estimate the value of each pollutant variable.
To this end, we tested different regression models on the training
dataset and obtained the best results with the voting regressor.
The voting regressor is an ensemble meta-estimator that fits several
base regressors, each on the whole dataset, and then
averages the individual predictions to form a final prediction. In our
voting regressor, we opt for a Gradient Boosting regressor, a Random
Forest regressor, and a linear regressor as base regressors. Gradient
boosting relies on a loss function to be optimized, a weak
learner to make predictions, and an additive model that adds weak
learners to minimize the loss function; the weak learners are
usually decision
trees. A Random Forest regressor uses multiple
decision trees and bootstrap aggregation to produce a more
reliable prediction model. Linear regression, the best-known
form of regression analysis, is based on a linear predictor function with
unknown model parameters.</p>
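      <p>A minimal sketch of such an ensemble, assuming scikit-learn's VotingRegressor with default (equal) weights, is given below. The synthetic features and target, the hyperparameters, and the random seeds are illustrative assumptions and not the configuration of the submitted run.</p>
      <preformat>
```python
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor,
                              RandomForestRegressor, VotingRegressor)
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the selected time and GPS features (not PAQD).
X = rng.uniform(0, 1, size=(120, 4))
y = (30.0 * X[:, 0] + 10.0 * np.sin(6.0 * X[:, 1])
     + rng.normal(scale=1.0, size=120))

# Each base regressor is fitted on the whole training set.
voter = VotingRegressor(estimators=[
    ("gb", GradientBoostingRegressor(random_state=0)),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("lr", LinearRegression()),
])
voter.fit(X, y)

# The ensemble prediction is the unweighted mean of the base predictions.
ensemble_pred = voter.predict(X[:5])
base_preds = np.column_stack(
    [est.predict(X[:5]) for est in voter.estimators_])
print(ensemble_pred)
print(base_preds.mean(axis=1))
```
      </preformat>
      <p>Under this reading, one such model would be trained per pollutant variable, with the same feature matrix and a different target column each time.</p>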
    </sec>
    <sec id="sec-8">
      <title>4 EXPERIMENTAL RESULTS AND ANALYSIS</title>
      <p>In this section, we report and discuss the experimental results
achieved after submitting one run for task 1, “Personal Air
Quality Prediction with public/open data”. Table 1 presents the
official results for our run based on the regression approach. The
performance of the predictions was evaluated using the root mean
square error (RMSE). As can be seen in Table 1, SO2 achieved the
best result, with a score of 12.08, using sensor data collected by walkers,
while NO2 showed the best result, with a score of 25.02, using sensor
data collected by car. Moreover, the results obtained
for AQI from walker data are better than those obtained from car
data. This may indicate that the quality of sensor data collected by
walkers is higher than that of data collected by car.</p>
      <p>This paper represents our first attempt to address the task “Personal
Air Quality Prediction with public/open data”. The proposed
solution is based on data preprocessing and on training a voting
regressor built on three base regressors. The obtained results
point to the quality of sensor data collected by walkers. As
future work, we plan to investigate transfer learning over
multimedia lifelog data such as egocentric photos and videos to gain
insights into individual wellbeing.</p>
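      <p>For reference, the RMSE metric used to score the runs in this section can be computed as in the sketch below. The helper name rmse and the sample values are assumptions for illustration; they are not results from the task.</p>
      <preformat>
```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical AQI values: squared errors are 9, 16, and 0.
score = rmse([50.0, 60.0, 70.0], [53.0, 56.0, 70.0])
print(score)
```
      </preformat>
      <p>Lower values are better, and because errors are squared before averaging, a few large misses weigh more heavily than many small ones.</p>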
    </sec>
    <sec id="sec-9">
      <title>ACKNOWLEDGMENTS</title>
      <p>The authors extend their appreciation to the Deputyship for
Research &amp; Innovation, Ministry of Education in Saudi
Arabia for funding this research work through the project
number PNU-DRI-RI-20-033.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lane</surname>
            ,
            <given-names>K. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Byun</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Choi</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>C. R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>J. T.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>Association between Urban Greenness and Depressive Symptoms: Evaluation of Greenness Using Various Indicators</article-title>
          ,
          <source>International journal of environmental research and public health</source>
          ,
          <volume>16</volume>
          (
          <issue>2</issue>
          ),
          <fpage>173</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Vo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Phan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dao</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Zettsu</surname>
          </string-name>
          ,
          <article-title>Association Model between Visual Feature and AQI Rank Using Lifelog Data</article-title>
          ,
          <source>2019 IEEE International Conference on Big Data (Big Data)</source>
          , Los Angeles, CA, USA,
          <year>2019</year>
          , pp.
          <fpage>4197</fpage>
          -
          <lpage>4200</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          , “
          <article-title>Air quality early-warning system for cities in China</article-title>
          ,”
          <source>Atmospheric Environment</source>
          , vol.
          <volume>148</volume>
          , pp.
          <fpage>239</fpage>
          -
          <lpage>257</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ameer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Maple</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. U.</given-names>
            <surname>Islam</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Asghar</surname>
          </string-name>
          , “
          <article-title>Comparative analysis of machine learning techniques for predicting air quality in smart cities</article-title>
          ,”
          <source>IEEE Access</source>
          , vol.
          <volume>7</volume>
          , pp.
          <fpage>128325</fpage>
          -
          <lpage>128338</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Dao</surname>
            ,
            <given-names>M. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>P. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>N. T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>T. B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dang-Nguyen</surname>
            ,
            <given-names>D. T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , “
          <article-title>Overview of MediaEval 2020: Insights for Wellbeing task - multimodal personal health lifelog data analysis</article-title>
          ,” in
          <source>MediaEval Benchmarking Initiative for Multimedia Evaluation</source>
          , CEUR Workshop Proceedings, Dec.
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Zettsu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <article-title>Decoder Transfer Learning for Predicting Personal Exposure to Air Pollution</article-title>
          ,
          <source>2019 IEEE International Conference on Big Data (Big Data)</source>
          , Los Angeles, CA, USA,
          <year>2019</year>
          , pp.
          <fpage>5620</fpage>
          -
          <lpage>5629</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Zettsu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <article-title>Convolution Recurrent Neural Networks for Short-Term Prediction of Atmospheric Sensing Data</article-title>
          ,
          <source>The 4th IEEE International Conference on Smart Data</source>
          (SmartData
          <year>2018</year>
          ), pp.
          <fpage>815</fpage>
          -
          <lpage>821</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>