HCMUS at Insight for Wellbeing Task 2019:
             Multimodal Personal Health Lifelog Data Analysis with
                Inference from Multiple Sources and Attributes
                                      Hoang-Anh Le, Thang-Long Nguyen Ho , Minh-Triet Tran∗
                              Faculty of Information Technology, University of Science, VNU-HCM, Vietnam
                          1612013@student.hcmus.edu.vn,nhtlong@selab.hcmus.edu.vn,tmtriet@fit.hcmus.edu.vn

ABSTRACT                                                                  relationships. Because we don’t want unexplainable-relationship
When collecting and processing data recorded by sensors for any           between coordinates and PM 2.5 , we need a solution clear and stable
applications, noisy and missing data is an important problem that         as much as we can. We do not have adequate data about location
need to be address. This paper presents two approaches we use             and PM 2.5 , also any pretrain model to mapping from location in-
to predict missing air quality data in MediaEval Insight for Well-        formation to what we need. We assume coordinates value have a
being Task. The first approach based on other data attributes like        relationship with other attributions so coordinates’ meaning can be
temperature and humidity, and the second based on data recorded           implicitly represented through temperature and humidity feature
from other sources. Evaluating the experimental results using the         [Figure 1], so that if we found the right function to mapping from
average L2 distance, we got the score of 0.9013 for the first approach    temperature and humidity to PM 2.5 , we also have the coordinates
and 0.0155 for the second approach.                                       information in the result, also simplify the data. As a result, we push
                                                                          normalized temperature and humidity data through a multi-layer
                                                                          perceptron model to approach the problem.
1    INTRODUCTION
Environmental data can be used to analyse different aspects for
the deveopment of the society, including the quality of personal
health [2] or depressive symptoms [3]. The data can be of various
sources and formats, such as spatialtemporal raster images [1] or a
combination of weather, air polution, lifelog images, etc [2].
   In the MediaEval Life Well Being 2019 task[2], we are given 14
categories of pollution data recorded by people who wear sensors,
use smartphones and walk along pre-defined routes inside a city,
and asked to develop methods that process the data to obtain in-
sights about personal wellbeing. In subtask 1, our goal is developing
a hypothesis about the associations within the heterogeneous data         Figure 1: Top 3 in correlation heatmap on temperature and
and build a system that is able to correctly replace segments of data     humidity
of the PM 2.5 index that have been removed.
   Based on the organization of the data, we found there are 2 main          2.1.2 Using test-set as validation. We do the same as first run at
approaches to predict the PM 2.5 in the queries. In the first approach,   this run, however, the motivation of this run is that we do not try to
we want to explore if it would be possible to find relationship           overfit the testing dataset of organizing, in this approach we try to
between PM 2.5 values and other obtained attributes (Section 2.1). In     generalize the method. We use the dataset development (unrated)
the second approach, because in each question, there are a number         in the contest to train set, the official dataset to the validation set.
of people walking in the same region in roughly the same time             This task is preprocessed data most clean-able and optimize the
interval, we propose to combine values from multiple people, to           loss on validation (official dataset).
infer the missing data segment of PM 2.5 (Section 2.2).

2 METHOD                                                                  2.2    Inference from other people
                                                                          First, we examine and compare the coordinates and trajectories of
2.1 Inference from other attributes                                       people within the same group (same question) through time, and
   2.1.1 Using test-set only. After observing over the chart of a         find that in most cases, people in the same group walked in roughly
dataset, we omitted some features have a mean value close to zero         the same route, and they were at the same location together at every
like NO 2 ,33 , some category features. We found the feature about        moment along the way (the start and end times of each person may
the location of all users at any time are not so different, so we         vary) [Figure 2].
concluded that data about location and PM 2.5 are not had close              Therefore, we can conclude that given a specific time, the PM 2.5
∗ The first two authors contributed equally to the paper                  values recorded by people within the same group are highly related
                                                                          because they recorded the PM 2.5 value of the same location at the
Copyright 2019 for this paper by its authors. Use
permitted under Creative Commons License Attribution
                                                                          same time, and we could guess the missing PM 2.5 values by the
4.0 International (CC BY 4.0).                                            corresponding PM 2.5 values of other people in the same group.
MediaEval’19, 29-31 October 2019, Sophia Antipolis, France
MediaEval’19, 29-31 October 2019, Sophia Antipolis, France                          Hoang-Anh Le, Thang-Long Nguyen-Ho, Minh-Triet Tran


                             Figure 2: Trajectories and coordinates through time of people in Query Q1


   However, comparing PM 2.5 values of all people in the same            values become more inaccurate to estimate the true values. To
group, we find that these values vary considerably. Thus, we imple-      remove these noise, we check at a certain time, if the difference
ment some statistical method to predict missing PM 2.5 values from       between the value of a sensor with the average value larger than
corresponding PM 2.5 values of other people.                             the variance by a threshold factor, than we ignore this value and
                                                                         recalculate the average. We also recalculate the bias value and apply
   2.2.1 Average. We predict the missing PM 2.5 values by taking
                                                                         it for this run.
the average of PM 2.5 values of other people in the group in the
corresponding time. [Figure 3]. However, the PM 2.5 data of these
people are scatter over the time interval and not available for every    3     EXPERIMENTS AND RESULTS
second. There for we use 1D linear interpolation to predict PM 2.5       Table 1: Official evaluation result (provided by organizers)
data for each person at every second before taking the average.
    2.2.2 Average with bias. The average of PM 2.5 values of all peo-        Approach    RunID                 Method                  Score
ple is only a reasonable prediction for the true PM 2.5 value of the                       1            MLP - Testing data             0.8141
                                                                                1
environment at that moment. However, most sensors can not pro-                             2          MLP - Development data           0.9013
duce these true values, each sensor has its own inaccuracy. And                            3                   Average                 0.3384
since we want to predict the PM 2.5 values recorded by a specific sen-          2          4             Average with bias             0.0155
sor, we want to take into account this inaccuracy. Since the random                        5     Average (outlier removed) with bias   0.0157
noise are difficult to evaluate, we only consider the bias problem
- the sensor consistently records values that lower or higher than       The table above shows the results of each method mentioned earlier.
the true values by a certain amount (the bias value).                    In this table, the scores of each run is the means of L2 distance
    To estimate the bias, we calculate the difference between the        between the predicted results and the ground truth. Our experiment
average PM 2.5 values and the PM 2.5 values of that sensor at each       results show that the second approach (predict based on other
moment these values available, and take the average of these dif-        people within the group), achieve fairly good results. The result of
ferences. After that, we add this bias to the predict values of the      the first approach (predict based on temperature and humidity) are
previous run.                                                            not so good as the average L2 distances are still quite large. We think
                                                                         the reason is probably because only temperature and humidity could
   2.2.3 Average (outlier removed) with bias. We observe that there      not give us enough information to predict the PM 2.5 values, and to
are some noise in certain sensor that make some recorded PM 2.5          have really good predictions, we should combine the information
values become very high, having very large differences with the          about variations of PM 2.5 values through time, the temperature
values of other sensor at corresponding time, making the average         and humidity values and the PM 2.5 values of other people in the
                                                                         same group.


                                                                         4     CONCLUSION AND FUTURE WORKS
                                                                         We propose two simple approaches for the Life Well Being Problem.
                                                                         The first approach uses a neural network to predict PM 2.5 values
                                                                         from other factors like temperature and humidity. The second ap-
                                                                         proach using the PM 2.5 values recorded by other people at the same
                                                                         location and at the same time. These methods are simple but can
                                                                         predict the missing values quite effectively.
                                                                            We think these methods could be improved further by combining
                                                                         them together (meaning take into account both the other attributes
                                                                         values and other PM 2.5 values), having a more effective noise re-
    Figure 3: The average of PM 2.5 values of other people               moval method, or building a more complex regression model.
Insight for Wellbeing Task                                                       MediaEval’19, 29-31 October 2019, Sophia Antipolis, France


ACKNOWLEDGMENTS                                                             [2] Tomohiro Sato Koji Zettsu Duc-Tien Dang-Nguyen Cathal Gurrin
Research is supported by Vingroup Innovation Foundation (VINIF)                 Ngoc-Thanh Nguyen Minh-Son Dao, Peijiang Zhao. 2019. Overview
                                                                                of MediaEval 2019: Insights for Wellbeing Task: Multimodal Personal
in project code VINIF.2019.DA19. We would like to thank AIOZ Pte
                                                                                Health Lifelog Data Analysis. In MediaEval2019 Working Notes (CEUR
Ltd for supporting our team with computing infrastructure.                      Workshop Proceedings). CEUR-WS.org <http://ceur-ws.org>, Sophia
                                                                                Antipolis, France.
REFERENCES                                                                  [3] Hyeonjin Song, Kevin James Lane, Honghyok Kim, Hyomi Kim, Garam
[1] Minh-Son Dao and Koji Zettsu. 2018. Complex Event Analysis of               Byun, Minh Le, Yongsoo Choi, Chan Ryul Park, and Jong-Tae Lee.
    Urban Environmental Data based on Deep CNN of Spatiotemporal                2019. Association between urban greenness and depressive symptoms:
    Raster Images. In IEEE International Conference on Big Data, Big Data       Evaluation of greenness using various indicators. International Journal
    2018, Seattle, WA, USA, December 10-13, 2018. 2160–2169. https://doi.       of Environmental Research and Public Health 16, 2 (2 1 2019). https:
    org/10.1109/BigData.2018.8621916                                            //doi.org/10.3390/ijerph16020173