<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multimodal Deep Learning for Transboundary Haze Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Phuc-Thinh Nguyen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nazmudeen Mohamed Saleem</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University Teknologi</institution>
          <country country="BN">Brunei</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Information Technology</institution>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>Air pollution has long been a problem in major cities worldwide, and accurate estimation of PM2.5 and PM10 [8][12] fine dust concentrations has been an active research area for many years. This study addresses 3-day transboundary air pollution prediction. We propose merging several deep learning models and selecting appropriate features for the PM10 index prediction problem, drawing on timestamps, geographical features, and public weather data. Using the dataset provided by MediaEval, we examine the performance of several learning models and feature sets. Experimental results show that combining multiple deep learning models gives higher overall performance than the other techniques and features in terms of RMSE, MAE, and SMAPE.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <label>1</label>
      <title>INTRODUCTION</title>
      <p>
        Haze air pollution is the presence in the air of particulate
matter such as smoke, dust, and other vapours arising from
large-scale forest and land fires, factories, and automobiles. When
the concentration of airborne pollutants reaches dangerous levels,
it causes respiratory problems and has significant consequences for
visibility, economic productivity, transportation, and tourism [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Transboundary haze is a recurring problem in many parts of the
world, particularly in Southeast Asia, where the sources of haze pollution
differ from country to country, with varying proportions coming
from local or transboundary sources [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        This article addresses sub-task 2 of the competition: estimating
transboundary PM10 from timestamp information, location data, and weather
data by combining multiple deep learning models [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ][
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <label>2</label>
      <title>METHODOLOGY</title>
      <p>We first give a high-level overview of the problem and then
describe the specific approach we developed for PM10
prediction.</p>
    </sec>
    <sec id="sec-3">
      <label>2.1</label>
      <title>Pre-processing data</title>
      <p>At this stage, the data for each station are extracted from the original
data (train air quality and train weather) using the station’s
ID. We then combine the data by province, since air quality and
weather data are available for the same province.</p>
      <p>
        We fill missing values in the data by interpolation [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Variables that cannot
be interpolated are filled with their mean.
      </p>
      <p>For the test data (2018-2019), we zero-fill missing values
and generate a mask that lets the model handle missing
values in the data used for prediction.</p>
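      <p>The gap-filling steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, missing values are represented as None, linear interpolation is assumed, and "cannot be interpolated" is read here as a gap at the start or end of a series, where no neighbour exists on one side.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from statistics import mean


def fill_gaps(values):
    """Linearly interpolate interior gaps (None); fill other gaps with the mean."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    series_mean = mean(values[i] for i in known)
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None or right is None:
            # Leading or trailing gap: no neighbour on one side, so we fall
            # back to the series mean (an assumed reading of the text).
            filled[i] = series_mean
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def zero_fill_with_mask(values):
    """Replace missing test values with 0.0 and return a 1/0 'observed' mask."""
    mask = [0 if v is None else 1 for v in values]
    filled = [0.0 if v is None else float(v) for v in values]
    return filled, mask


# Made-up readings, for illustration only.
print(fill_gaps([None, 40.0, None, None, 70.0, None]))
print(zero_fill_with_mask([35.0, None, 50.0]))
```
]]></preformat>
      <p>The mask can be passed to the model alongside the zero-filled values so that a zero introduced by filling is not mistaken for a measured concentration of zero.</p>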
    </sec>
    <sec id="sec-4">
      <label>2.2</label>
      <title>Features Extraction</title>
      <p>We formulate PM10 index prediction as a regression
problem: given the features described below, the model estimates the
PM10 index for the near future.</p>
      <p>2.2.1 Timestamp features. Because outdoor air quality varies
greatly with the time of day, timestamp information can
be valuable for PM10 estimation. Each
country’s PM10 index is distinct, as are those of the provinces within the
same country. We therefore compute a PM10 index for each country
by averaging the PM10 values of its provinces. Where PM10 is reported
hourly, we first take the daily average for each province and then
average across provinces.</p>
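      <p>The two-step averaging can be sketched as below. The record layout (province, date string, PM10 value) and the function name are assumptions made for illustration; the dataset's actual column names are not given in the text.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict
from statistics import mean


def country_daily_index(hourly_records):
    """Build a country-level daily PM10 index.

    hourly_records: iterable of (province, 'YYYY-MM-DD', pm10) tuples.
    """
    per_province_day = defaultdict(list)
    for province, day, pm10 in hourly_records:
        per_province_day[(province, day)].append(pm10)

    # Step 1: daily mean for each province.
    per_day = defaultdict(list)
    for (province, day), readings in per_province_day.items():
        per_day[day].append(mean(readings))

    # Step 2: mean across provinces for each day.
    return {day: mean(vals) for day, vals in sorted(per_day.items())}


# Made-up readings for two hypothetical provinces "A" and "B".
records = [
    ("A", "2017-01-01", 10.0),
    ("A", "2017-01-01", 30.0),
    ("B", "2017-01-01", 40.0),
]
print(country_daily_index(records))
```
]]></preformat>
      <p>Averaging per province first keeps a province with many hourly readings from outweighing one that reports less often.</p>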
      <p>2.2.2 Location features. We choose one province as a landmark
and then evaluate the closest distances between the provinces in
its vicinity, as this can provide useful information
for plot analysis and reduce noise in the air pollution data.</p>
      <p>
        We calculated the distance using the Haversine formula [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], an
important equation in navigation that gives the great-circle
distance between two points on the
Earth’s surface from their latitudes and longitudes. For two
positions A and B, the Haversine formula is:
        <disp-formula id="eq1">
          <label>(1)</label>
          <tex-math><![CDATA[d(A,B) = 2r \arcsin\left(\sqrt{\sin^{2}\left(\frac{\varphi_B - \varphi_A}{2}\right) + \cos(\varphi_A)\cos(\varphi_B)\sin^{2}\left(\frac{\lambda_B - \lambda_A}{2}\right)}\right)]]></tex-math>
        </disp-formula>
where r is the Earth’s radius, and φ<sub>A</sub>, φ<sub>B</sub>, λ<sub>A</sub>, λ<sub>B</sub> are the latitudes
and longitudes of the two points A and B, respectively.
      </p>
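      <p>A direct transcription of the Haversine formula into code might look like this. The function name and the mean Earth radius of 6371 km are assumptions, and the coordinates in the example are rough approximations used only to show the call.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat_a, lon_a, lat_b, lon_b, radius=EARTH_RADIUS_KM):
    """Great-circle distance between A and B; inputs in decimal degrees."""
    phi_a, phi_b = math.radians(lat_a), math.radians(lat_b)
    d_phi = phi_b - phi_a
    d_lambda = math.radians(lon_b - lon_a)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi_a) * math.cos(phi_b) * math.sin(d_lambda / 2) ** 2)
    h = min(1.0, h)  # guard against rounding slightly above 1
    return 2 * radius * math.asin(math.sqrt(h))


# Approximate coordinates of two locations in the region (illustrative only).
print(haversine_km(4.9, 114.9, 1.35, 103.8))
```
]]></preformat>
      <p>Distances from the landmark province to every other province can then be sorted to find its nearest neighbours.</p>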
      <p>2.2.3 Weather features. Public weather features comprise
measurements such as "temperature," "precipitation,"
"humidity," "wind direction," and "wind speed," obtained from local stations.
They serve as supplementary data that
can make machine learning models more robust and reliable.</p>
      <p>In order to provide the best forecast results, we analyze the
correlation between these weather features and the PM10 index in
each country and select the features with the strongest correlation.
The wind direction had a strong correlation with PM10 in the
Indonesia dataset; the temperature, rain, and humidity in the Brunei
dataset; the rain, humidity, and wind speed in the Thailand dataset;
and the temperature in the Singapore dataset.</p>
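      <p>Correlation-based selection of weather features could be sketched as follows. The text does not state which correlation coefficient or cut-off was used, so the Pearson coefficient, the 0.5 threshold, and all names and values below are assumptions.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / (sx * sy)


def select_features(weather, pm10, threshold=0.5):
    """Return feature names with |r| >= threshold, strongest first.

    weather: dict mapping feature name -> list of values aligned with pm10.
    threshold: assumed cut-off, not taken from the paper.
    """
    scores = {name: pearson(col, pm10) for name, col in weather.items()}
    chosen = [name for name, r in scores.items() if abs(r) >= threshold]
    return sorted(chosen, key=lambda name: -abs(scores[name]))


# Made-up values for one hypothetical country.
weather = {
    "temperature": [1, 2, 3, 4],
    "rain": [4, 3, 2, 1.5],
    "wind speed": [1, -1, 1, -1],
}
print(select_features(weather, pm10=[10, 20, 30, 40]))
```
]]></preformat>
      <p>Running the selection separately for each country yields a different feature set per country, in line with the per-country results reported above.</p>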
      <p>
        If some provinces lack "Temperature" data, we
use a basic LSTM [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to predict "Temperature" from the
weather data for 2018-2019 (the test data).
      </p>
    </sec>
    <sec id="sec-5">
      <label>2.3</label>
      <title>Training and Testing Setup</title>
      <p>To train a model appropriate for the problem, we split the raw
training dataset (2010-2017) into training and test subsets, using
80% of it for training and the remaining 20% for model testing.
We then predict PM10 on the test dataset (2018-2019). This
study uses a regression model to forecast PM10 for the next three
days.</p>
    </sec>
    <sec id="sec-6">
      <label>2.4</label>
      <title>Multimodal Method</title>
      <p>We process the data for each station as described in Section 2.1 and then build a
dataset for each country by averaging the weather and air quality
features of the stations in that country; for example, the Brunei data
are obtained by averaging the weather and air quality features
of the stations in Brunei.</p>
      <p>For each country, we use data from the previous three days to
forecast the following three days.</p>
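      <p>Turning a daily series into three-days-in, three-days-out training pairs can be sketched with a sliding window. The function name and the stride of one day are assumptions; the text only fixes the window lengths.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def make_windows(series, n_in=3, n_out=3):
    """Slice a daily series into (previous n_in days, next n_out days) pairs.

    series: list of per-day values (or per-day feature rows), oldest first.
    """
    inputs, targets = [], []
    for start in range(len(series) - n_in - n_out + 1):
        inputs.append(series[start:start + n_in])
        targets.append(series[start + n_in:start + n_in + n_out])
    return inputs, targets


# Made-up daily PM10 values.
daily_pm10 = [31.0, 35.0, 40.0, 38.0, 36.0, 42.0, 45.0, 41.0]
X, y = make_windows(daily_pm10)
for window, target in zip(X, y):
    print(window, "->", target)
```
]]></preformat>
      <p>A series shorter than six days yields no pairs, so very short station records contribute nothing under this scheme.</p>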
      <p>
        We merge different models [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ][
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to
improve the performance of the PM10 estimator. The fundamental
idea is to concatenate the outputs of the per-country PM10 prediction
deep learning models and feed them into a final deep
learning model that produces the final PM10 result.
      </p>
      <p>
        Our multi-input network has three branches.
The first two branches are simple BiLSTMs [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] that handle the
Brunei and Thailand data, respectively; the third branch is a simple LSTM
that handles the Indonesian data. These branches are then concatenated
to produce the final multi-input deep learning model. The assignment of
countries to branches is arbitrary: the Indonesian data could equally be
fed to a BiLSTM branch. We swapped the assignments during the experiments,
and the MAE and RMSE results were nearly identical.
      </p>
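      <p>One way to express such a three-branch network is with the Keras functional API, as in the sketch below. It assumes TensorFlow is installed. The layer width, the dense head standing in for the final model, the optimizer, and the per-country feature counts are all illustrative assumptions, not the configuration used in the experiments.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from tensorflow import keras
from tensorflow.keras import layers


def build_multimodal(n_features, n_in=3, n_out=3, units=32):
    """Three-branch multi-input model: BiLSTM, BiLSTM, LSTM, then concatenate.

    n_features: dict with the number of input features per country
                (hypothetical keys "brunei", "thailand", "indonesia").
    """
    brunei_in = keras.Input(shape=(n_in, n_features["brunei"]), name="brunei")
    thailand_in = keras.Input(shape=(n_in, n_features["thailand"]), name="thailand")
    indonesia_in = keras.Input(shape=(n_in, n_features["indonesia"]), name="indonesia")

    # Two BiLSTM branches and one plain LSTM branch, as described in the text.
    brunei_branch = layers.Bidirectional(layers.LSTM(units))(brunei_in)
    thailand_branch = layers.Bidirectional(layers.LSTM(units))(thailand_in)
    indonesia_branch = layers.LSTM(units)(indonesia_in)

    # Concatenate branch outputs and map them to a 3-day PM10 forecast.
    merged = layers.Concatenate()([brunei_branch, thailand_branch, indonesia_branch])
    hidden = layers.Dense(units, activation="relu")(merged)  # assumed head
    output = layers.Dense(n_out, name="pm10")(hidden)

    model = keras.Model([brunei_in, thailand_in, indonesia_in], output)
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model


# Feature counts are assumptions: PM10 plus the selected weather features.
model = build_multimodal({"brunei": 4, "thailand": 4, "indonesia": 2})
model.summary()
```
]]></preformat>
      <p>Swapping which country feeds which branch only requires changing the layer applied to each input, which makes the comparison described above easy to repeat.</p>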
      <p>Because Singapore weather data are only available for 2016-2017,
the years 2010-2015 would have to be filled with the mean. To avoid
overfitting, we therefore do not recommend using Singapore data to
train the multimodal model. Instead, we run a separate BiLSTM model
on the Singapore data.</p>
    </sec>
    <sec id="sec-7">
      <label>2.5</label>
      <title>Performance metrics</title>
      <p>
        We use root mean square error (RMSE), mean absolute
error (MAE) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and symmetric mean absolute percentage error
(SMAPE) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to evaluate the performance of the proposed approaches.
      </p>
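      <p>For reference, the three metrics can be computed as in the sketch below. SMAPE has several variants in the literature; the form used here, in percent with the mean of the absolute true and predicted values in the denominator, is an assumption, since the text does not state which variant was used.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent (assumed variant)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        denom = (abs(t) + abs(p)) / 2
        if denom != 0:  # a 0/0 term contributes zero error
            total += abs(t - p) / denom
    return 100.0 * total / len(y_true)


# Made-up values, for illustration only.
truth, forecast = [100.0, 200.0], [110.0, 180.0]
print(rmse(truth, forecast), mae(truth, forecast), smape(truth, forecast))
```
]]></preformat>
      <p>Because SMAPE is relative to the magnitude of the values, a model can improve on RMSE and MAE while scoring worse on SMAPE when its errors fall mostly on low-PM10 days.</p>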
    </sec>
    <sec id="sec-8">
      <label>3</label>
      <title>EXPERIMENTS</title>
    </sec>
    <sec id="sec-9">
      <label>3.1</label>
      <title>Data sets</title>
      <p>We use the dataset provided by the task organizers. It contains weather and air
quality data, with the training set spanning
2010-2017 and the test set spanning 2018-2019.</p>
    </sec>
    <sec id="sec-10">
      <label>3.2</label>
      <title>Model settings</title>
      <p>To improve the performance of the presented approaches, we use
random search to select hyperparameters based on the
performance metrics. For each model, this method samples
a predefined parameter space and keeps the best-performing
hyperparameters, which are shown in Table 1.</p>
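      <p>A generic random search loop is sketched below. The search space and the scoring function are placeholders: in practice the scoring function would train a model with the sampled hyperparameters and return a validation error such as RMSE. None of the values shown are taken from Table 1.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random


def random_search(evaluate, space, n_trials=20, seed=0):
    """Sample n_trials configurations from space and keep the lowest score.

    evaluate: callable taking a dict of hyperparameters, returning an error.
    space: dict mapping hyperparameter name -> list of candidate values.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.choice(options) for name, options in space.items()}
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space, for illustration only.
space = {
    "units": [16, 32, 64, 128],
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "batch_size": [16, 32, 64],
}


def toy_evaluate(params):
    # Stand-in for "train the model and return its validation RMSE".
    return abs(params["units"] - 64) / 64 + abs(params["learning_rate"] - 1e-3) * 100


best, score = random_search(toy_evaluate, space, n_trials=50, seed=1)
print(best, score)
```
]]></preformat>
      <p>Fixing the seed makes a search reproducible, which helps when comparing models tuned under the same budget.</p>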
    </sec>
    <sec id="sec-11">
      <label>3.3</label>
      <title>Results</title>
      <p>We compare the proposed methods using the
performance measures listed in Section 2.5.</p>
      <p>The MAE and RMSE results of the multimodal model are better than those of the other
models, as shown in Table 2. However, because the multimodal model
uses the average PM10 value of the countries, its forecasts
will be skewed for countries with very low or very
high PM10.</p>
      <p>The MAE and RMSE values in Table 2 are better than those in
Table 3, but the SMAPE results are worse.</p>
    </sec>
    <sec id="sec-12">
      <label>4</label>
      <title>CONCLUSIONS AND FUTURE WORKS</title>
      <p>Over three months of analysis, we demonstrated the benefits of
combining multiple models with generalization and deep learning
methods for the PM10 index estimation problem, using
features such as timestamps, location, and public
weather data. With this variety of features, the test
results show that the PM10 predictions are fairly accurate when
compared with the ground truth. Transboundary air pollution can thus be
predicted by merging multiple models.</p>
      <p>We plan to continue this research by examining additional
forms of data, such as images and video, and new deep learning
models. One promising approach is multivariate transformer
learning, which can learn across long time spans.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A</given-names>
            <surname>Mosavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S</given-names>
            <surname>Ardabili</surname>
          </string-name>
          , and
          <string-name>
            <given-names>AR</given-names>
            <surname>Varkonyi-Koczy</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>List of Deep Learning Models</article-title>
          .
          <source>International Conference on Global Research and Education</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Jason</given-names>
            <surname>Brownlee</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Deep Learning for Time Series Forecasting</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
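          <string-name>
            <given-names>Dat Q.</given-names>
            <surname>Duong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Quang M.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tan-Loc</given-names>
            <surname>Nguyen-Tai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dong</given-names>
            <surname>Bo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dat</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Minh-Son</given-names>
            <surname>Dao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Binh T.</given-names>
            <surname>Nguyen</surname>
          </string-name>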
          .
          <year>2020</year>
          .
          <article-title>Multi-source Machine Learning for AQI Estimation</article-title>
          .
          <volume>5</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
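          <string-name>
            <given-names>Efa Nabilla</given-names>
            <surname>Aziz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Asem</given-names>
            <surname>Kasem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Wida Susanty Haji</given-names>
            <surname>Suhaili</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Peijiang</given-names>
            <surname>Zhao</surname>
          </string-name>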
          .
          <year>2021</year>
          .
          <article-title>Convolution Recurrent Neural Network for Daily Forecast of PM10 Concentrations in Brunei Darussalam</article-title>
          .
          <source>AIDIC</source>
          <volume>13</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>FJ</given-names>
            <surname>Ordóñez</surname>
          </string-name>
          and
          <string-name>
            <given-names>D</given-names>
            <surname>Roggen</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Deep Convolutional and LSTM Recurrent Neural Networks for Multimodal Wearable Activity Recognition</article-title>
          .
          <source>mdpi</source>
          <volume>5</volume>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G. R.</given-names>
            <surname>Liu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y. T.</given-names>
            <surname>Gu</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>A point interpolation method for two-dimensional solids</article-title>
          .
          <source>International Journal for Numerical Methods in Engineering</source>
          (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Asem</given-names>
            <surname>Kasem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Minh-Son</given-names>
            <surname>Dao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Efa Nabilla</given-names>
            <surname>Aziz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Duc-Tien</given-names>
            <surname>Dang-Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Cathal</given-names>
            <surname>Gurrin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Minh-Triet</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Thanh-Binh</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Wida</given-names>
            <surname>Suhaili</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Overview of MediaEval 2021: Insights for Wellbeing Task Cross-Data Analytics for Transboundary Haze Prediction</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
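          <string-name>
            <given-names>Ling</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiaosan</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Pingqing</given-names>
            <surname>Fu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xiangdong</given-names>
            <surname>Li</surname>
          </string-name>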
          .
          <year>2016</year>
          .
          <article-title>Airborne particulate matter pollution in urban China: a chemical mixture perspective from sources to impacts</article-title>
          .
          <source>National Science Review</source>
          <volume>4</volume>
          (
          <year>2016</year>
          ),
          <fpage>593</fpage>
          -
          <lpage>610</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M</given-names>
            <surname>Basyir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M</given-names>
            <surname>Nasir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S</given-names>
            <surname>Suryati</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Widdha</given-names>
            <surname>Mellyssa</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Determination of Nearest Emergency Service Office using Haversine Formula Based on Android Platform</article-title>
          .
          <source>EMITTER</source>
          <volume>5</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>NR</given-names>
            <surname>Draper</surname>
          </string-name>
          and
          <string-name>
            <given-names>H</given-names>
            <surname>Smith</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>Applied regression analysis</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y</given-names>
            <surname>LeCun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y</given-names>
            <surname>Bengio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Deep learning</article-title>
          .
          <source>Nature</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Yu-Fei</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yue-Hua</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Min-Hua</given-names>
            <surname>Shi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yi-Xin</given-names>
            <surname>Lian</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>The impact of PM2.5 on the human respiratory system</article-title>
          .
          <source>Journal of Thoracic Disease</source>
          (Jan
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>