DCU team at The 2019 Insight for Wellbeing Task: Multimodal
personal health lifelog data analysis

Tu-Khiem Le1*, Van-Tu Ninh1*, Liting Zhou1, Duc-Tien Dang-Nguyen2, Cathal Gurrin1
1 Dublin City University, Ireland
2 University of Bergen, Norway

tukhiem.le4@mail.dcu.ie, tu.ninhvan@adaptcentre.ie, zhou.liting2@mail.dcu.ie,
ductien.dangnguyen@uib.no, cathal.gurrin@dcu.ie

* These two authors contributed equally.
Copyright 2019 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval’19, 27-29 October 2019, Sophia Antipolis, France

ABSTRACT
In this paper, we describe our method for analyzing lifelog data in
association with the environment. To tackle the problem of incomplete data, we
propose a replacement method based on linear regression, which achieves a
normalized L2 distance score of 0.0153. For the personal air quality subtask,
we infer the air quality level from the lifeloggers’ PM2.5 data, which yields
an arithmetic mean absolute distance of 1.0 between the predicted and the
actual classes.

1    INTRODUCTION
With the development of engineering and technology, more and more personal
devices such as smartphones, video cameras and wearable sensors have become
available, giving people the ability to easily capture every aspect of their
lives. Lifelogging is defined as the process of passively recording a detailed
trace of one's life [4], which generates a large collection of multimedia
data. The huge amount of lifelog data creates a need to quickly retrieve and
extract particular insights based on the associations between data. The
MediaEval 2019 Insight for Wellbeing Challenge defines a new approach to
lifelog data in relation to the environment. This approach has the potential
to analyze the effect of general pollution on quality of life at the
individual scale. Besides the information recorded by weather and air
pollution stations, lifelog data can reveal the true conditions of particular
regions where no stations are set up.

The organisers generated a novel dataset called SEPHLA [6], which is collected
by multiple lifeloggers who walk along several selected routes in the city and
record data through wearable sensors and smartphones. Lifelog images,
biometrics, weather, urban perception tags, emotional tags and air pollution
data are provided within the dataset. To better understand the data and gain
insights for personal wellbeing, the organisers define two subtasks: Segment
Replacement and Personal Air Quality prediction. In the first subtask,
participants are asked to investigate the associations among the data and
develop a solution to reconstruct the segments of data that were removed by
the organisers. The second subtask aims to estimate people's wellbeing by
predicting the AQI (Air Quality Index) at particular positions at a specified
time. More details about this challenge can be found in [5].

2    RELATED WORK
In recent years, lifelogging has gained increasing attention, and many
research works have been proposed to provide a better understanding of
personal digital collections. To support this research, many international
benchmarking efforts have been made and various challenges on lifelogging data
have been hosted, the most recent of which are the NTCIR-14 Lifelog-3 Task
[3], LSC 2018 [1], and ImageCLEF2019-lifelog [2]. While these challenges
mainly focus on developing solutions to retrieve relevant moments based on a
set of given queries, each challenge has different subtasks to further explore
this multimodal data. In the Lifelog Search Challenge (LSC), participants are
not only required to build an interactive retrieval system, but also compete
with each other on queries shown on screen in real time.

The datasets used in these challenges are collected by many lifeloggers who
wear a passive-capture wearable camera and other tracking sensors. Each
lifelogger normally generates around 1250 to 4500 images per day, together
with biometrics (e.g. heart rate, calories), locations (GPS), physical
movements and music. These datasets share nearly the same structure as the
lifelog data in the MediaEval 2019 Insight for Wellbeing Challenge. However,
this challenge also considers additional information from the environment,
which makes the insight more general and enables us to obtain an overview of
wellbeing among individuals.

3    APPROACH
The dataset provides air quality data gathered by the stations and by the
lifeloggers’ sensors. This information is extremely useful for reconstructing
missing segments of data and predicting the air quality index for specific
areas. In addition, we are given a collection of images recorded by the
lifeloggers, with corresponding visual concepts extracted by a neural network,
along with information on the checkpoints where the lifeloggers are asked to
take pictures. However, the images, which are actively taken, might vary with
the lifeloggers’ preferences. It is therefore hard to capture and generalize
the context across individuals. Based on these observations, we propose
solutions to both subtasks, which are described in the following subsections.

3.1    Segment Replacement
In this subtask, the sequence of missing PM2.5 data is specified in each query
with a starting and ending time. As the lifeloggers walked in groups, the data
from others could help regenerate the missing segments. The data from the
stations, however, is not quite
reliable, since the stations are too far from the routes and might add noise
to the result. Therefore, considering the NO2, O3, temperature, humidity and
heartbeat data from people who share the same route with the targeted
lifelogger, we build a simple linear regression model to predict the removed
PM2.5 data. Specifically, let x be the 5-dimensional L2-normalized feature
vector composed of the five components mentioned above, and y be the target
PM2.5 value to be predicted. We construct a linear regression model
y = w^T x + b and apply gradient descent to find the parameters w and b that
minimize the root-mean-square error, i.e. the gap between the model
predictions and the ground truth of the training data. The trained model is
then used to generate the missing PM2.5 values of the targeted person in that
group. As the NO2 and O3 values are almost zero most of the time, temperature,
humidity and heartbeat are the main factors that contribute to our
predictions.

3.2    Personal Air Quality
To obtain the AQI for each day, we first need to gather the air quality data.
From the checkpoints of each route, we obtain a list of GPS coordinates along
the route. We then extract all air quality data recorded where a lifelogger’s
GPS position is close to the checkpoints. The distance between two GPS
positions is calculated using the Haversine formula.

As we observed from the air quality data of each route, NO2 and O3 values are
mostly zero while PM2.5 values show some fluctuation. Therefore, we choose
PM2.5 to predict the final Air Quality Index (AQI). First, we refine the data
to get the right PM2.5 data for each route by calculating the distance between
the route’s GPS positions and the collectors’ current GPS positions. After
this step, we obtain data for 27 routes on 7 days from different groups of
collectors. For each day of data collected by a user, we compute the average
PM2.5. We therefore obtain several average PM2.5 values from different
collectors for one day. We take the maximum of these average PM2.5 values as
the criterion to evaluate the AQI for that route on that day. Then, we average
the AQI values of the 7 days and re-evaluate to infer the AQI level of the
route.

4    RESULTS AND ANALYSIS

Table 1: Ranked list of the best score of each team in the Segment
Replacement Subtask

    Group ID      Run ID      Score
    healthism       3         0.000427182
    SHT-UIT         3         0.000463205
    DCU             1         0.015310414
    HCMUS           4         0.015514208

It can be seen from Table 1 that our team (DCU) ranks 3rd among the best
submissions in the Segment Replacement subtask, with a score of approximately
0.0153. This means that our approach generates relatively good predictions
with low error. However, other solutions provide more precise results with
significantly lower error.

Table 2: Ranked list of the best score of each team in the Personal Air
Quality Subtask

    Group ID      Run ID      Score
    healthism       19        0.3
    SHT-UIT         1         0.8
    DCU             1         1.0

Meanwhile, in the Personal Air Quality subtask, our approach obtains an
arithmetic mean absolute (L1) distance score of 1.0. This means that our
approach to handling the data for this task is not good and that the operation
we apply to the PM2.5 data to infer the AQI level is not correct. Since the
data recorded by the lifeloggers walking along the route is not entirely
correct (the values are almost all zero) and the collected data is
insufficient (less than 24 hours over seven non-consecutive days), we can
hardly infer the right AQI level for the route.

As we do not exploit all the provided materials, such as the data recorded
from the stations, images and related metadata, we might miss some important
features that could be used to improve our predictions. Moreover, as we rely
on the users’ data recorded along the route they pass through, the recorded
values such as PM2.5, NO2 and O3 are not reliable, since most of their values
are zero. These are the main factors that affect our results in both subtasks.
To improve them in future work, we might need to consider additional data on
the internet, recorded from nearby stations, to provide the missing PM2.5
values during the days and generate a correct estimation of the AQI score.

ACKNOWLEDGMENTS
This publication has emanated from research supported in part by research
grants from the Irish Research Council (IRC) under Grant Number GOIPG/2016/741
and Science Foundation Ireland under grant numbers SFI/12/RC/2289 and
13/RC/2106.

REFERENCES
[1] 2018. LSC ’18: Proceedings of the 2018 ACM Workshop on The Lifelog Search
    Challenge. ACM, New York, NY, USA.
[2] Duc-Tien Dang-Nguyen, Luca Piras, Michael Riegler, Minh-Triet Tran, Liting
    Zhou, Mathias Lux, Tu-Khiem Le, Van-Tu Ninh, and Cathal Gurrin. 2019.
    Overview of ImageCLEFlifelog 2019: Solve my life puzzle and Lifelog Moment
    Retrieval. In CLEF2019 Working Notes (CEUR Workshop Proceedings).
    CEUR-WS.org <http://ceur-ws.org>, Lugano, Switzerland.
[3] Cathal Gurrin, H. Joho, Frank Hopfgartner, L. Zhou, Van-Tu Ninh, Tu-Khiem
    Le, Rami Albatal, D.-T Dang-Nguyen, and Graham Healy. 2019. Overview of
    the NTCIR-14 Lifelog-3 task.
[4] Cathal Gurrin, Alan F. Smeaton, and Aiden R. Doherty. 2014. LifeLogging:
    Personal Big Data. Foundations and Trends® in Information Retrieval 8, 1
    (2014), 1–125. https://doi.org/10.1561/1500000033
[5] Minh-Son Dao, Peijiang Zhao, Tomohiro Sato, Koji Zettsu, Duc-Tien
    Dang-Nguyen, Cathal Gurrin, and Ngoc-Thanh Nguyen. 2019. Overview of
    MediaEval 2019: Insights for Wellbeing Task: Multimodal Personal Health
    Lifelog Data Analysis. In MediaEval2019 Working Notes (CEUR Workshop
    Proceedings). CEUR-WS.org <http://ceur-ws.org>, Sophia Antipolis, France.
[6] Tomohiro Sato, Minh Dao, Kota Kuribayashi, and Koji Zettsu. 2018. SEPHLA:
    Challenges and Opportunities within Environment-Personal Health Archives.
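
As an illustration of the regression step in Section 3.1, the following is a
minimal sketch, not the authors' implementation. It assumes plain batch
gradient descent on the mean squared error, which has the same minimizer as
the root-mean-square error named in the text. The function names, learning
rate, number of epochs and sample readings are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a feature vector to unit L2 norm (zero vectors stay unchanged)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v] if norm > 0 else list(v)


def predict(w, b, x):
    """Linear model y = w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def rmse(w, b, xs, ys):
    """Root-mean-square error of the model on (xs, ys)."""
    sq = [(predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(sq) / len(sq))


def fit(xs, ys, lr=0.1, epochs=2000):
    """Batch gradient descent on the mean squared error (assumed setup)."""
    n, d = len(xs), len(xs[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        errs = [predict(w, b, x) - y for x, y in zip(xs, ys)]
        grad_w = [2.0 / n * sum(e * x[j] for e, x in zip(errs, xs))
                  for j in range(d)]
        grad_b = 2.0 / n * sum(errs)
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Made-up readings from companions on the same route:
# (NO2, O3, temperature, humidity, heart rate) -> PM2.5
raw = [
    ([0.0, 0.0, 24.1, 61.0, 88.0], 12.0),
    ([0.0, 0.0, 25.3, 58.0, 95.0], 15.0),
    ([0.0, 0.0, 26.0, 55.0, 102.0], 18.0),
    ([0.0, 0.0, 24.8, 60.0, 91.0], 13.0),
]
xs = [l2_normalize(features) for features, _ in raw]
ys = [target for _, target in raw]

w, b = fit(xs, ys)
baseline = rmse([0.0] * 5, 0.0, xs, ys)  # error of the untrained model
trained = rmse(w, b, xs, ys)
print(f"RMSE before training: {baseline:.3f}, after: {trained:.3f}")
```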
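
The route-level procedure of Section 3.2 can be sketched in the same hedged
way. The Haversine distance is the standard formula; everything else is an
assumption made for illustration: the matching radius, the data layout, the
function names and the toy coordinates. The text does not state how PM2.5 is
converted to AQI or how AQI is mapped to a level, so both conversions are
passed in as caller-supplied functions, and the ones used here are
placeholders rather than the task's definition.

```python
import math
from statistics import mean

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def daily_pm25(readings, checkpoints, radius_km):
    """Max over users of each user's mean PM2.5 recorded near the route.

    readings: iterable of (user, lat, lon, pm25) for one route on one day.
    Returns None when no reading lies within radius_km of a checkpoint.
    """
    per_user = {}
    for user, lat, lon, pm25 in readings:
        if any(haversine_km(lat, lon, c_lat, c_lon) <= radius_km
               for c_lat, c_lon in checkpoints):
            per_user.setdefault(user, []).append(pm25)
    if not per_user:
        return None
    return max(mean(values) for values in per_user.values())


def route_level(daily_values, pm25_to_aqi, aqi_to_level):
    """Convert each day's PM2.5 to AQI, average over days, map to a level."""
    daily_aqi = [pm25_to_aqi(v) for v in daily_values if v is not None]
    return aqi_to_level(mean(daily_aqi))


# Toy data: two checkpoints and two days of readings (all values made up).
checkpoints = [(0.0, 0.0), (0.0, 0.001)]
day1 = [("u1", 0.0, 0.0, 10.0), ("u1", 0.0, 0.001, 14.0),
        ("u2", 0.0, 0.0002, 20.0), ("u2", 1.0, 1.0, 90.0)]  # last one is far away
day2 = [("u1", 0.0, 0.0, 30.0)]
daily = [daily_pm25(day, checkpoints, radius_km=0.05) for day in (day1, day2)]

# Placeholder conversions, not the task's official AQI definition.
level = route_level(
    daily,
    pm25_to_aqi=lambda pm: pm,                       # identity stand-in
    aqi_to_level=lambda aqi: 0 if aqi <= 22 else 1,  # single hypothetical cut-off
)
print(daily, level)
```

Taking the maximum of the per-user averages means a single collector with
high readings determines the daily value, which is one reason the near-zero
sensor values discussed in Section 4 matter for this subtask.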