<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A2QI: An Approach for Air Pollution Estimation in MediaEval 2020</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>AISIA Research Lab</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dat Q. Duong</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>The first two authors contributed equally</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Science</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Vietnam National University</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>In this paper, we present the AISIA team's contribution to the task Insight for Wellbeing: Multimodal personal health lifelog data analysis at MediaEval 2020. From the data sets provided, we extracted different types of useful attributes for the problem: timestamp information, geographical data, sensor data, and semantic features from images captured by users. We propose an approach, named A2QI, that applies machine learning models, including Support Vector Machine and Random Forest, to estimate the local AQI score and level. We tuned and evaluated the models on the experimental data sets using Randomized Search and K-Fold cross-validation. Evaluation on the test sets shows that a machine learning approach with appropriate features can significantly improve accuracy.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        In many countries worldwide, predicting air pollution is an
increasingly important problem, because pollution affects
individuals and their wellbeing. In this study, we apply a machine
learning approach to the lifelog data provided by
the organizers to predict personal air pollution data as well as
individual air quality data, as given in the task description [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
of the MediaEval 2020 competition. This task’s primary motivation
is to investigate the association between people’s wellbeing and
the properties of the surrounding environment. The problem consists
of two subtasks. In the first subtask, we explore the correlation
between the air pollution data and the features we extracted from
the sensor data (e.g., timestamp information, the user’s geographical
location). In the second subtask, we use the features mentioned
earlier, together with semantic features extracted from images captured
by users, to predict the six pollutants used to calculate the AQI values.
      </p>
    </sec>
    <sec id="sec-2">
      <title>OUR APPROACH</title>
    </sec>
    <sec id="sec-3">
      <title>Anomaly detection</title>
      <p>Inspecting the PM<sub>2.5</sub>, NO<sub>2</sub>, and O<sub>3</sub> columns of the training
data set for both subtasks, one can see that many data points are
zero, negative, or unreasonably large in magnitude (e.g.,
−3000, −4900). Unreasonable values also occur among the
positive-valued data points. Such points are called anomalies or outliers,
and they have to be handled before extracting features.</p>
      <p>
        Consider an arbitrary column whose data needs
preprocessing. Its outliers fall into two
cases: the first comprises zero and negative values, and the
second comprises positive outliers (defined below). For
the positive outliers, we apply the z-score method [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ][
        <xref ref-type="bibr" rid="ref4">4</xref>
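        ]. Specifically,
for the <inline-formula><tex-math>i</tex-math></inline-formula>-th data point <inline-formula><tex-math>x_i</tex-math></inline-formula> in the
column, the z-score <inline-formula><tex-math>z_i</tex-math></inline-formula> is computed
as <inline-formula><tex-math>z_i = \frac{x_i - \mu}{\sigma}</tex-math></inline-formula>, where <inline-formula><tex-math>\mu</tex-math></inline-formula> and <inline-formula><tex-math>\sigma</tex-math></inline-formula> are the mean and standard deviation
of the column, respectively.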
      </p>
      <p>In this work, a data point whose z-score is larger than 3.0 is called
an outlier. Note that the mean value is computed
from the positive values only, in order to avoid the influence
of negative-valued data points whose absolute values are large.</p>
      <p>After detecting all the anomalies, we replace them with the
average of the positive values, for the reason mentioned above.</p>
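      <p>A minimal sketch of this anomaly handling is given below. It is an illustration under stated assumptions, not the exact implementation used in the experiments: the function name <monospace>clean_column</monospace> is hypothetical, the standard deviation is assumed to be computed over the positive values only (like the mean), and the replacement value is assumed to be the mean of all positive values.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from statistics import mean, pstdev


def clean_column(values, z_threshold=3.0):
    """Replace non-positive values and positive z-score outliers.

    Assumptions (not stated in the text): the standard deviation is the
    population standard deviation of the positive values, and anomalies
    are replaced by the mean of all positive values.
    """
    positives = [v for v in values if v > 0]
    mu = mean(positives)
    sigma = pstdev(positives)

    cleaned = []
    for v in values:
        is_non_positive = v <= 0
        # Guard against a constant column, where sigma is zero.
        is_positive_outlier = sigma > 0 and (v - mu) / sigma > z_threshold
        cleaned.append(mu if (is_non_positive or is_positive_outlier) else v)
    return cleaned


# Toy column: 19 plausible readings, one huge reading, one zero, one negative.
readings = [10.0] * 19 + [500.0, 0.0, -3000.0]
cleaned = clean_column(readings)
```
]]></preformat>
      <p>In the toy column, the reading of 500, the zero, and the negative value are all replaced by the mean of the positive readings, while the remaining readings are left unchanged.</p>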
    </sec>
    <sec id="sec-4">
      <title>Features Extraction</title>
      <p>The problem consists of two subtasks, each of which uses a
different data set. Nevertheless, both data sets include information about
time, location, weather, and concentration values of contaminants
related to AQI (e.g., NO<sub>2</sub> or O<sub>3</sub>). Therefore, our proposed feature
extraction techniques for these data types can be applied to both
data sets. In addition, we computed features from the image
data given in the second subtask.</p>
      <p>2.2.1 Timestamp features. From the given information about
time, we extract timestamp features. Specifically, we examine the
correlation between the time at which the data were collected and
the corresponding AQI values and ranks that need to be predicted.
These features are part of day and is rush hour.</p>
      <p>To begin with, we derive the part of day feature. That is, we split the
24 hours of a day into five groups. The “Early Morning” group
covers the time from 5 AM to before 7 AM, the time from 7 AM to
before noon is the “Morning” group, from noon to
before 4 PM is the “Afternoon” group, from 4 PM to before 8 PM
is the “Evening” group, and the remaining period from 8 PM to before 5 AM
is the “Night” group. From our observation, there is a noticeable
increase in traffic density during the Morning and Evening
groups, which leads to a high level of pollution caused by
vehicle exhaust. Consequently, we expect
fluctuations in the data collected during these periods.</p>
      <p>Table 1 (fragment): SMAPE 0.32, 0.52, 0.32, 0.52.</p>
      <p>Also, we check whether a particular local measurement time is a
rush hour or not, which gives the second feature
in the group of timestamp features, i.e., is rush hour. In
detail, if the given point in time falls into one of the periods
7:00 AM to 9:00 AM or 4:00 PM to 7:00 PM, it is called a rush
hour. This feature is a refinement of the former (i.e., part of day). It
targets the periods when traffic density reaches its highest
peak, which results in a sharp growth of AQI values and ranks.</p>
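      <p>The two timestamp features described above can be sketched as follows. The function names are hypothetical, and the treatment of the interval boundaries is an assumption: part-of-day groups are half-open at the hour level, and the rush-hour periods include both end points.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from datetime import datetime, time


def part_of_day(ts):
    """Map a local timestamp to one of the five part-of-day groups."""
    hour = ts.hour
    if 5 <= hour < 7:
        return "Early Morning"
    if 7 <= hour < 12:
        return "Morning"
    if 12 <= hour < 16:
        return "Afternoon"
    if 16 <= hour < 20:
        return "Evening"
    return "Night"  # 8 PM up to (but not including) 5 AM


def is_rush_hour(ts):
    """True if the local time lies in 7:00-9:00 AM or 4:00-7:00 PM.

    Both end points are treated as inclusive (an assumption).
    """
    t = ts.time()
    morning = time(7, 0) <= t <= time(9, 0)
    evening = time(16, 0) <= t <= time(19, 0)
    return morning or evening


sample = datetime(2020, 10, 1, 8, 15)
features = {"part_of_day": part_of_day(sample),
            "is_rush_hour": is_rush_hour(sample)}
```
]]></preformat>
      <p>Since part of day is categorical, it would typically be one-hot or ordinally encoded before being passed to a model; the encoding is not specified here.</p>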
      <p>2.2.2 Location features. Among the factors affecting
the level of pollution at a location, we consider the distance
between that location and the nearest railway station, which is usually
crowded with people and vehicles. Using the
coordinates of a place, we extract the distance
from that place to the chosen station as a feature. In this study, we use
Shibuya station (35°39′N, 139°42′E).</p>
      <p>
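        To compute this distance, we use the Haversine formula [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
That is, given two points <inline-formula><tex-math>A</tex-math></inline-formula> and <inline-formula><tex-math>B</tex-math></inline-formula> with latitudes <inline-formula><tex-math>\varphi_A, \varphi_B</tex-math></inline-formula> and longitudes <inline-formula><tex-math>\lambda_A, \lambda_B</tex-math></inline-formula>, the distance
between them can be calculated as follows:
<disp-formula><label>(1)</label><tex-math>d(A, B) = 2 r \arcsin \sqrt{\sin^2\left(\frac{\varphi_B - \varphi_A}{2}\right) + \cos(\varphi_A) \cos(\varphi_B) \sin^2\left(\frac{\lambda_B - \lambda_A}{2}\right)}</tex-math></disp-formula>
where <inline-formula><tex-math>r</tex-math></inline-formula> is the radius of the Earth.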
      </p>
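      <p>As an illustration, the Haversine distance to the chosen station could be computed as in the sketch below. The function names, the mean Earth radius of 6371 km, and the conversion of the station coordinates from degrees and minutes to decimal degrees are assumptions made for this example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed value

# Shibuya station, 35 deg 39' N, 139 deg 42' E, in decimal degrees.
SHIBUYA = (35.0 + 39.0 / 60.0, 139.0 + 42.0 / 60.0)


def haversine_km(lat1, lon1, lat2, lon2, radius_km=EARTH_RADIUS_KM):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2.0) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2.0) ** 2)
    return 2.0 * radius_km * math.asin(math.sqrt(a))


def distance_to_station_km(lat, lon, station=SHIBUYA):
    """Location feature: distance from a measurement point to the station."""
    return haversine_km(lat, lon, station[0], station[1])
```
]]></preformat>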
      <p>
        2.2.3 Semantic features. In the second subtask, we are provided with
images captured in different locations, which is the most
challenging data type in our opinion. Our approach is to investigate
whether the number of cars, the number of motorbikes, and the contrast of the images
are related to the level of pollution at the captured location. We used
SSD ResNet50 (RetinaNet50) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], a pre-trained object-detection
model, to count the cars and motorbikes in the images of the
data set.
      </p>
      <p>
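        Also, we extract features related to the contrast of the images,
which can be highly correlated with the intensity of a given place’s
pollution. In detail, given a two-dimensional image <inline-formula><tex-math>I</tex-math></inline-formula> of size <inline-formula><tex-math>M \times N</tex-math></inline-formula>,
we use the RMS contrast formula [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to compute its contrast. The
formula is given in equation (2):
<disp-formula><label>(2)</label><tex-math>C = \sqrt{\frac{1}{M N} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \left(I_{ij} - \bar{I}\right)^2}</tex-math></disp-formula>
where <inline-formula><tex-math>C</tex-math></inline-formula> is the contrast value to be computed, <inline-formula><tex-math>I_{ij}</tex-math></inline-formula> is the
pixel intensity of the image <inline-formula><tex-math>I</tex-math></inline-formula> at point <inline-formula><tex-math>(i, j)</tex-math></inline-formula>, and <inline-formula><tex-math>\bar{I}</tex-math></inline-formula> is the average
intensity of all the pixels in that image.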
      </p>
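      <p>Equation (2) can be sketched in a few lines. The example below assumes a grayscale image given as a list of rows of pixel intensities; the function name <monospace>rms_contrast</monospace> is hypothetical, and whether intensities are first normalized to the range [0, 1] is left open here.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def rms_contrast(image):
    """RMS contrast: standard deviation of the pixel intensities.

    `image` is a non-empty list of rows, each a list of intensities.
    """
    pixels = [p for row in image for p in row]
    mean_intensity = sum(pixels) / len(pixels)
    variance = sum((p - mean_intensity) ** 2 for p in pixels) / len(pixels)
    return math.sqrt(variance)


flat = [[0.4, 0.4], [0.4, 0.4]]     # uniform image, no contrast
checker = [[0.0, 1.0], [1.0, 0.0]]  # maximal black/white alternation
flat_contrast = rms_contrast(flat)
checker_contrast = rms_contrast(checker)
```
]]></preformat>
      <p>A uniform image has zero contrast, while the black-and-white checkerboard with intensities in [0, 1] yields a contrast of 0.5.</p>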
      <p>Finally, it is worth noting that in this study, we did not use the
number of people as an image feature, because the people
appearing in the given images have been blurred for the sake of
privacy.</p>
    </sec>
    <sec id="sec-5">
      <title>RESULTS AND DISCUSSION</title>
      <p>
        After extracting the necessary information, we tuned two
machine learning models using Randomized Search with 5-fold
cross-validation to optimize the model hyper-parameters
and avoid overfitting the training data. The two models we used
were Support Vector Machine (SVM) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and Random Forest (RF) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We
also tested other machine learning
methods, e.g., Linear Regression, XGBoost, and CatBoost, and chose the
two best-performing models on the training data for submission.
Each model was optimized and evaluated separately on the
data set of each subtask. Only timestamp and geographical features
were used for subtask 1, while the semantic features were combined
with the other feature types for subtask 2. The machine learning models
were optimized with respect to the mean absolute error (MAE) metric.
      </p>
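      <p>The tuning procedure can be illustrated with scikit-learn's <monospace>RandomizedSearchCV</monospace>. The sketch below is not the configuration used in the experiments: the data are synthetic stand-ins, the search space and the number of sampled configurations are hypothetical, and only Random Forest is shown (an SVM regressor could be substituted as the estimator with its own search space).</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (NOT the task data): 3 features, AQI-like target.
rng = np.random.default_rng(0)
X = rng.uniform(size=(60, 3))
y = 50.0 + 30.0 * X[:, 0] + rng.normal(scale=2.0, size=60)

# Hypothetical search space; the actual ranges are not given in the text.
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                            # number of sampled configurations
    scoring="neg_mean_absolute_error",   # optimize MAE (negated by sklearn)
    cv=5,                                # 5-fold cross-validation
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_cv_mae = -search.best_score_  # flip the sign back to a plain MAE
```
]]></preformat>
      <p>Because scikit-learn maximizes scores, MAE is passed in negated form and the sign is flipped back when reporting the cross-validated error.</p>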
      <p>The results on the test sets are presented in Table 1. In the first
subtask, Random Forest generally achieves the best
results on the data collected by a walker. For predicting
the AQI value, the MAE, RMSE, and SMAPE in this case
are 12.74, 15.93, and 0.32, respectively. In the second subtask, the best
results are achieved by SVM. For predicting PM<sub>2.5</sub>, the best
MAE, RMSE, and SMAPE are 3.49, 3.76, and 0.15,
respectively.</p>
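      <p>For reference, the three reported metrics can be computed as in the sketch below. The SMAPE definition is an assumption, since several variants exist: here it is the mean of |y − ŷ| divided by (|y| + |ŷ|) / 2, expressed as a fraction rather than a percentage, and a term whose denominator is zero is counted as zero.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, as a fraction.

    Assumed variant: |t - p| / ((|t| + |p|) / 2), averaged over all points.
    """
    total = 0.0
    for t, p in zip(y_true, y_pred):
        denom = (abs(t) + abs(p)) / 2.0
        # Assumed convention: a zero denominator contributes zero error.
        total += abs(t - p) / denom if denom > 0 else 0.0
    return total / len(y_true)


y_true = [100.0, 200.0]
y_pred = [110.0, 180.0]
scores = (mae(y_true, y_pred), rmse(y_true, y_pred), smape(y_true, y_pred))
```
]]></preformat>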
      <p>We also expect that enhancing the quality of the images in
the data set and combining them with public weather data
could improve the results further.</p>
    </sec>
    <sec id="sec-6">
      <title>ACKNOWLEDGEMENT</title>
      <p>We would like to thank AISIA Research Lab for
supporting our team and allowing us to use their computational resources
for this study. We would also like to thank the
Organization Board of the MediaEval 2020 competition and the Task Organizers
for providing us with the data sets to conduct the necessary experiments.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <year>2019</year>
          .
          <article-title>Detecting Outliers in High Dimensional Data Sets using ZScore Methodology</article-title>
          .
          <source>International Journal of Innovative Technology and Exploring Engineering</source>
          <volume>9</volume>
          ,
          <issue>1</issue>
          (Nov.
          <year>2019</year>
          ),
          <fpage>48</fpage>
          -
          <lpage>53</lpage>
          . https://doi.org/10.35940/ijitee.a3910.119119
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Basyir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nasir</surname>
          </string-name>
          , Suryati Suryati, and
          <string-name>
            <given-names>Widdha</given-names>
            <surname>Mellyssa</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Determination of Nearest Emergency Service Office using Haversine Formula Based on Android Platform</article-title>
          .
          <source>EMITTER International Journal of Engineering Technology</source>
          <volume>5</volume>
          ,
          <issue>2</issue>
          (Jan.
          <year>2018</year>
          ),
          <fpage>270</fpage>
          -
          <lpage>278</lpage>
          . https://doi.org/10.24003/emitter.v5i2.220
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Corinna</given-names>
            <surname>Cortes</surname>
          </string-name>
          and
          <string-name>
            <given-names>Vladimir</given-names>
            <surname>Vapnik</surname>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>Support-Vector Networks</article-title>
          .
          <source>In Machine Learning</source>
          .
          <fpage>273</fpage>
          -
          <lpage>297</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Denis</given-names>
            <surname>Cousineau</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sylvain</given-names>
            <surname>Chartier</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Outlier detection and treatment: a review</article-title>
          .
          <source>International Journal of Psychological Research</source>
          <volume>3</volume>
          ,
          <issue>1</issue>
          (
          <year>2010</year>
          ),
          <fpage>58</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Dao</surname>
            ,
            <given-names>M. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>P. J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>N.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Binh</surname>
          </string-name>
          ,
          <string-name>
            <surname>Dang-Nguyen D. T.</surname>
          </string-name>
          ,
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Overview of MediaEval 2020: Insights for Wellbeing Task - Multimodal Personal Health Lifelog Data Analysis</article-title>
          .
          <source>In MediaEval Benchmarking Initiative for Multimedia Evaluation, CEUR Workshop Proceedings.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Heljä</given-names>
            <surname>Kukkonen</surname>
          </string-name>
          , Jyrki Rovamo, Kaisa Tiippana, and
          <string-name>
            <given-names>Risto</given-names>
            <surname>Näsänen</surname>
          </string-name>
          .
          <year>1993</year>
          .
          <article-title>Michelson contrast, RMS contrast and energy of various spatial stimuli at threshold</article-title>
          .
          <source>Vision research</source>
          <volume>33</volume>
          (
          Aug.
          <year>1993</year>
          ),
          <fpage>1431</fpage>
          -
          <lpage>1436</lpage>
          . https://doi.org/10.1016/0042-6989(93)90049-3
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Tsung-Yi</given-names>
            <surname>Lin</surname>
          </string-name>
          , Priya Goyal, Ross Girshick, Kaiming He, and
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Dollar</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Focal Loss for Dense Object Detection</article-title>
          .
          <fpage>2999</fpage>
          -
          <lpage>3007</lpage>
          . https://doi.org/10.1109/ICCV.2017.324
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Tin Kam</given-names>
            <surname>Ho</surname>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>Random decision forests</article-title>
          .
          <source>In Proceedings of 3rd International Conference on Document Analysis and Recognition</source>
          , Vol.
          <volume>1</volume>
          .
          <fpage>278</fpage>
          -
          <lpage>282</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>