=Paper= {{Paper |id=Vol-2670/MediaEval_19_paper_42 |storemode=property |title=Predicting Missing Data by Using Multimodal Data Analytics |pdfUrl=https://ceur-ws.org/Vol-2670/MediaEval_19_paper_42.pdf |volume=Vol-2670 |authors=Loc Tai Nguyen Tan,Minh-Tam Nguyen,Dang-Hieu Nguyen |dblpUrl=https://dblp.org/rec/conf/mediaeval/TanNN19 }} ==Predicting Missing Data by Using Multimodal Data Analytics== https://ceur-ws.org/Vol-2670/MediaEval_19_paper_42.pdf
Predicting Missing Data by Using Multimodal Data Analytics
Loc Tai Tan Nguyen¹, Minh-Tam Nguyen², Dang-Hieu Nguyen³
¹,²,³ University of Information Technology, Vietnam

locntt.12@grad.uit.edu.vn, tamnm.12@grad.uit.edu.vn, hieund.12@grad.uit.edu.vn
ABSTRACT
In this paper, we introduce a method that uses multimodal data
analytics to predict missing data collected by sensors. Our approach
finds data at near-by locations and times, using a time-filtering
algorithm and an incrementally increasing scan radius, to replace
missing data. The method is evaluated on the MediaEval 2019 Insight
for Wellbeing subtask 1 dataset with its evaluation metric. The
results show that the proposed method works well and predicts missing
data with high accuracy.

Copyright 2019 for this paper by its authors. Use permitted under
Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval’19, 27-29 October 2019, Sophia Antipolis, France

1    INTRODUCTION
Air pollution has been shown to be a significant factor affecting
human beings [2]. Thus, the ability to predict air pollution is the
target of many research activities [3]. Nevertheless, before air
pollution can be predicted, collecting air pollution data from sensors
and data from objects that may impact or be impacted by air pollution
may take priority [2]. Noise, outliers, and missing data commonly
occur during data gathering and severely harm the accuracy of the
prediction stage. Thus, the MediaEval 2019 Insight for Wellbeing task
challenges participants to recover missing data recorded by air
pollution sensors (e.g., PM2.5) [1]. This paper reports our solution
to this challenge.

2    METHODOLOGY
The primary purpose of the proposed method is to define a hypothesis
that can represent the associations among heterogeneous data, and to
build a system that is able to predict missing values in the provided
dataset. The hypothesis states that there is a strong association
among heterogeneous data recorded at near-by locations and times.
Thus, we build a time-filtering algorithm and a radius-based
incremental scan policy to gather near-by data whose values can be
used to predict missing data. The following subsections describe in
detail how to filter data and collect useful position information to
predict missing data.

2.1    Data Processing
      • Circling Time: This function collects all near-by-time data.
        We first cluster all given datasets into groups so that each
        group has the same date (i.e., the same day). Then only data
        recorded between start_time and end_time are selected, where
        start_time and end_time denote the time period in which data
        are missing.
      • Circling Position: To collect all near-by-location data, we
        define a formula that calculates the distance between two
        coordinates. All data recorded within this distance are
        selected. In the formula, d is the distance between the two
        points; r is the radius of the sphere; α1, α2 are the
        latitudes of point 1 and point 2 (in radians); and β1, β2 are
        the longitudes of point 1 and point 2 (in radians). The scan
        radius is set from 1m to 100m.

2.2    Missing Data Prediction
After running circling time and circling position, we obtain the
PM2.5 values at the nearest positions; we then calculate the Maximum,
Minimum, and Average of these values for the position whose value
needs to be predicted. To optimize the results, we increase the radius
step by step from 1m to 20m to scan all positions. In our experience,
20m is the ideal radius, since within the 20m radius the predicted
PM2.5 values reach the highest accuracy.
   If no point is found within the 20m radius, we take the single
nearest point in [21m, 100m]. If there is no point in [0m, 100m], the
PM2.5 value is set to zero. From this procedure we built Algorithm 1.
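The two filters of Section 2.1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the record layout (`timestamp`, `lat`, `lon`, `pm25`), the function names, and the choice of the haversine great-circle formula are all assumptions. Haversine is used here only because its variables match the definitions given for the distance formula (sphere radius r, latitudes and longitudes in radians); the mean Earth radius of 6,371,000 m is likewise assumed.

```python
import math
from datetime import datetime
from typing import NamedTuple, Optional

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius (r), in metres


class Reading(NamedTuple):
    """Hypothetical sensor record; the field names are illustrative."""
    timestamp: datetime
    lat: float             # latitude in degrees
    lon: float             # longitude in degrees
    pm25: Optional[float]  # None marks a missing value


def circling_time(readings, start_time, end_time):
    """Keep readings that carry a PM2.5 value and fall in [start_time, end_time]."""
    return [rec for rec in readings
            if rec.pm25 is not None and start_time <= rec.timestamp <= end_time]


def haversine_m(lat1, lon1, lat2, lon2, r=EARTH_RADIUS_M):
    """Great-circle distance d in metres (haversine form, an assumption)."""
    a1, a2 = math.radians(lat1), math.radians(lat2)  # alpha_1, alpha_2
    b1, b2 = math.radians(lon1), math.radians(lon2)  # beta_1, beta_2
    h = (math.sin((a2 - a1) / 2) ** 2
         + math.cos(a1) * math.cos(a2) * math.sin((b2 - b1) / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(h))


def circling_position(readings, lat, lon, radius_m):
    """Keep readings whose distance to (lat, lon) is below radius_m."""
    return [rec for rec in readings
            if haversine_m(lat, lon, rec.lat, rec.lon) < radius_m]
```

Applying `circling_time` first keeps the distance computation to the readings that fall inside the missing-data period, which mirrors the order in which the two filters are described.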
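The scanning policy of Section 2.2 can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name, the `(distance_m, pm25)` input pairs, and the 1m step size are assumptions, and distances are taken as already computed for readings that passed the time filter. It uses all values found within the 20m radius, otherwise falls back to the single nearest point up to 100m, and otherwise returns zero.

```python
from statistics import mean
from typing import Dict, Iterable, Tuple

IDEAL_RADIUS_M = 20   # radius reported as most accurate in the text
MAX_RADIUS_M = 100    # upper bound of the scan


def predict_pm25(neighbours: Iterable[Tuple[float, float]]) -> Dict[str, float]:
    """Return Maximum, Average and Minimum PM2.5 estimates for one position.

    `neighbours` holds (distance_m, pm25) pairs (hypothetical layout).
    """
    points = sorted(neighbours)  # nearest first
    # Grow the radius in 1 m steps (assumed step size) up to the limit.
    for radius in range(1, MAX_RADIUS_M + 1):
        values = [pm for dist, pm in points if dist < radius]
        if radius >= IDEAL_RADIUS_M and values:
            if radius > IDEAL_RADIUS_M:
                # Nothing within 20 m: use only the single nearest point.
                values = values[:1]
            return {"maximum": max(values),
                    "average": mean(values),
                    "minimum": min(values)}
    # No point within 100 m: the text sets the value to zero.
    return {"maximum": 0.0, "average": 0.0, "minimum": 0.0}
```

Returning all three statistics from one call corresponds to the three submitted runs, each of which uses a different statistic as the predicted value.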

Algorithm 1: Recovering PM2.5 values from near-by location and time data
  DataA: merge all data in a group;
  DataB: from DataA, retrieve all data in the period from start_time
         to end_time of the data with missing PM2.5;
  DataC: the list of coordinates of the data with missing PM2.5;
  for each coordinate c in DataC do
      initialize array(PM2.5) to hold PM2.5 values;
      initialize array(coordinate) to hold the coordinates already used;
      set radius to one;
      while radius is less than or equal to one hundred do
          for each coordinate b in DataB do
              set d to the distance between c and b;
              if d is less than radius and b is not in array(coordinate) then
                  add the PM2.5 value of b to array(PM2.5);
                  add b to array(coordinate);
          if radius is greater than or equal to twenty and array(PM2.5)
             is not empty then
              calculate the output for PM2.5:
                  get the maximum value in array(PM2.5);
                  get the average of all values in array(PM2.5);
                  get the minimum value in array(PM2.5);
              leave the radius loop and go to the next coordinate in DataC;
          if radius equals one hundred and array(PM2.5) is empty then
              set the output value of PM2.5 to zero;
          increase radius to the next step;

3   EXPERIMENTAL RESULTS
The experimental results on the training dataset are shown in Table 1.

              Table 1. The results of the runs

   Table 2 shows the results on the testing dataset.

              Table 2. The evaluation of the runs
        Group_id    Method     Run_id    Score
        SHT_UIT     Maximum    1         0.00483679
        SHT_UIT     Average    2         0.00054178
        SHT_UIT     Minimum    3         0.00046321

   The experimental results are evaluated for the Maximum, Minimum,
and Average predictions. They show that our proposed method, although
simple, is effective. Our best run is the one that uses the Minimum.
Because the Minimum value has very low noise, it is more accurate than
the other two methods (Maximum, Average). Nevertheless, there is not a
big gap among the submitted runs.

4   CONCLUSIONS
We report our work at the MediaEval 2019 Insight for Wellbeing task,
subtask 1. We use a time-filtering algorithm and a radius-based
incremental scan policy to gather near-by location and time data for
predicting missing data. The results show that our solution has high
accuracy.
                                                                    25         else
REFERENCES
[1] Minh-Son Dao, Peijiang Zhao, Tomohiro Sato, Koji Zettsu, Duc-Tien
    Dang-Nguyen, Cathal Gurrin, and Ngoc-Thanh Nguyen. 2019. Overview
    of MediaEval 2019: Insights for Wellbeing Task: Multimodal
    Personal Health Lifelog Data Analysis. In MediaEval2019 Working
    Notes (CEUR Workshop Proceedings). CEUR-WS.org, Sophia Antipolis,
    France.
[2] Tomohiro Sato, Minh-Son Dao, Kota Kuribayashi, and Koji Zettsu.
    2019. SEPHLA: Challenges and Opportunities Within
    Environment-Personal Health Archives. In MMM. 325–337.
[3] Peijiang Zhao and Koji Zettsu. 2018. Convolution Recurrent Neural
    Networks for Short-Term Prediction of Atmospheric Sensing Data. In
    2018 IEEE GreenCom-CPSCom-SmartData. IEEE, 815–821.