DCU team at the 2019 Insight for Wellbeing Task: Multimodal personal health lifelog data analysis

Tu-Khiem Le1*, Van-Tu Ninh1*, Liting Zhou1, Duc-Tien Dang-Nguyen2, Cathal Gurrin1
1 Dublin City University, Ireland
2 University of Bergen, Norway
tukhiem.le4@mail.dcu.ie, tu.ninhvan@adaptcentre.ie, zhou.liting2@mail.dcu.ie, ductien.dangnguyen@uib.no, cathal.gurrin@dcu.ie

* These two authors contributed equally.
Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval’19, 27-29 October 2019, Sophia Antipolis, France

ABSTRACT
In this paper, we describe our methods for analyzing lifelog data in association with the environment. To tackle the problem of incomplete data, we propose a replacement method based on linear regression, which achieves a normalized L2 distance score of 0.0153. For the personal air quality subtask, we infer the air quality class from the lifeloggers’ PM2.5 data, which achieves a score of 1.0 in the arithmetic mean of absolute distance between the predictions and the actual classes.

1 INTRODUCTION
With the development of engineering and technology, more and more personal devices such as smartphones, video cameras and wearable sensors have become available, giving people the ability to easily capture every aspect of their lives. Lifelogging is defined as the process of passively recording a detailed trace of one's life [4], which generates a large collection of multimedia data. The huge amount of lifelog data creates a need to quickly retrieve and extract particular insights based on the associations between data. The MediaEval 2019 Insight for Wellbeing Challenge defines a new approach to lifelog data by relating it to the environment. This has the potential to support analysis of the effect of general pollution on quality of life at the individual scale. Besides the information recorded by weather and air pollution stations, lifelog data can reveal the conditions in regions where no stations are set up.

The organisers generated a novel dataset called SEPHLA [6], which was collected by multiple lifeloggers who walked along several selected routes in the city and recorded data through wearable sensors and smartphones. The dataset provides lifelog images, biometrics, weather, urban perception tags, emotional tags and air pollution data. To better understand the data and gain insights for personal wellbeing, the organisers define two subtasks: Segment Replacement and Personal Air Quality prediction. In the first subtask, the participants are asked to investigate the associations among the data and develop a solution to reconstruct the segments of data that were removed by the organisers. The second subtask aims to estimate people's wellbeing by predicting the AQI (Air Quality Index) at particular positions at a specified time. More details about this challenge can be found in [5].

2 RELATED WORK
In recent years, lifelogging has gained more and more attention, and many research works have been proposed to provide a better understanding of personal digital collections. To support this research, many international benchmarking efforts have been made and various challenges on lifelogging data have been hosted, the most recent of which are the NTCIR-14 Lifelog-3 Task [3], LSC 2018 [1], and ImageCLEF2019-lifelog [2]. While these challenges mainly focus on developing solutions to retrieve relevant moments based on a set of given queries, each challenge has different subtasks to further explore this multimodal data. In the Lifelog Search Challenge (LSC), the participants are required not only to build an interactive retrieval system but also to compete with each other live, answering queries shown on screen in real time.

The datasets used in these challenges were collected by many lifeloggers who wore a passive-capture wearable camera and other tracking sensors. Each lifelogger normally generates around 1250 to 4500 images per day, together with biometrics (e.g. heart rate, calories), locations (GPS), physical movements and music. These datasets share nearly the same structure as the lifelog data in the MediaEval 2019 Insight for Wellbeing Challenge. However, this challenge also considers additional information from the environment, which makes the insights more general and enables us to obtain an overview of wellbeing among individuals.

3 APPROACH
The dataset provides air quality data gathered by the stations and by the lifeloggers’ sensors. This information is extremely useful for reconstructing missing segments of data and predicting the air quality index for specific areas. We are also given a collection of images recorded by the lifeloggers, with corresponding visual concepts extracted by a neural network, along with information on the checkpoints where they were asked to take pictures. However, the images, which are actively taken, may vary according to the lifeloggers’ preferences, so it is hard to capture and generalize the context across individuals. Based on these observations, we propose solutions to both subtasks, which are described in the following subsections.

3.1 Segment Replacement
In this subtask, the sequence of missing PM2.5 data is specified in each query with a starting and ending time. As the lifeloggers walked in groups, the data from other group members can help regenerate the missing segments. The data from the stations, however, is not very reliable, since the stations are too far from the routes and might contribute noise to the result. Therefore, using the NO2, O3, temperature, humidity and heart rate data from people who share the same route with the targeted lifelogger, we build a simple linear regression model to predict the removed PM2.5 data. Specifically, let $x$ be the 5-dimensional L2-normalized feature vector composed of the five components mentioned above, and let $y$ be the targeted PM2.5 value to be predicted. We construct a linear regression model $y = w^{T} x + b$ and apply gradient descent to find the parameters $w$ and $b$ that minimize the root-mean-square error, that is, the gap between the model predictions and the ground truth of the training data. The trained model is then used to generate the missing PM2.5 values of the targeted person in that group. As the NO2 and O3 values are almost zero most of the time, temperature, humidity and heart rate are the factors that contribute most to our predictions.

3.2 Personal Air Quality
To obtain the AQI for each day, we first need to gather the air quality data. From the checkpoints of each route, we obtain a list of GPS coordinates along the route. We then extract all air quality data recorded where the lifeloggers’ GPS positions are close to the checkpoints. The distance between two GPS positions is calculated using the Haversine formula.

As we observed from the air quality data of each route, the NO2 and O3 values are mostly zero, while the PM2.5 values show some fluctuation. Therefore, we choose PM2.5 to predict the final Air Quality Index (AQI). First, we refine the data to obtain the right PM2.5 data for each route by calculating the distance between the route’s GPS coordinates and the collectors’ current GPS positions. After this step, we obtain data for 27 routes on 7 days from different groups of collectors. For the data collected by each user on each day, we compute the average PM2.5, which gives one average PM2.5 value per collector per day. We take the maximum of these average PM2.5 values as the criterion for evaluating the AQI of that route on that day. Then, we average the AQI values of the 7 days and re-evaluate the result to infer the AQI level of the route.

4 RESULTS AND ANALYSIS

Table 1: Ranked list of the best score of each team in the Segment Replacement subtask

Group ID     Run ID    Score
healthism    3         0.000427182
SHT-UIT      3         0.000463205
DCU          1         0.015310414
HCMUS        4         0.015514208

Table 1 shows that our team (DCU) ranks third among the best submissions in the Segment Replacement subtask, with a score of approximately 0.0153. This means that our approach generates relatively good predictions with low error. However, the two higher-ranked solutions provide more precise results, with errors more than an order of magnitude lower.

Table 2: Ranked list of the best score of each team in the Personal Air Quality subtask

Group ID     Run ID    Score
healthism    19        0.3
SHT-UIT      1         0.8
DCU          1         1.0

In the Personal Air Quality subtask, our approach obtained an arithmetic mean absolute (L1) distance score of 1.0 (Table 2). This indicates that our handling of the data for this task was inadequate and that the procedure we applied to infer the AQI level from the PM2.5 data was not correct. Since the data recorded by the lifeloggers walking along the route is not fully reliable (the values are almost all zero) and the collected data is insufficient (less than 24 hours over seven non-consecutive days), we can hardly infer the right AQI level for the route.

As we do not exploit all the provided materials, such as the data recorded by the stations, the images and the related metadata, we may miss some important features that could be used to improve our predictions. Moreover, we rely on the data recorded by the users along the routes they pass through, and the recorded values such as PM2.5, NO2 and O3 are not reliable because most of them are zero. These are the main factors that affect our results in both subtasks. To improve the results in future work, we may need to consider additional data available on the internet, recorded by nearby stations, to provide the missing PM2.5 values during the days and produce a correct estimate of the AQI score.

ACKNOWLEDGMENTS
This publication has emanated from research supported in part by research grants from the Irish Research Council (IRC) under Grant Number GOIPG/2016/741 and Science Foundation Ireland under grant numbers SFI/12/RC/2289 and 13/RC/2106.

REFERENCES
[1] 2018. LSC ’18: Proceedings of the 2018 ACM Workshop on The Lifelog Search Challenge. ACM, New York, NY, USA.
[2] Duc-Tien Dang-Nguyen, Luca Piras, Michael Riegler, Minh-Triet Tran, Liting Zhou, Mathias Lux, Tu-Khiem Le, Van-Tu Ninh, and Cathal Gurrin. 2019. Overview of ImageCLEFlifelog 2019: Solve my life puzzle and Lifelog Moment Retrieval. In CLEF2019 Working Notes (CEUR Workshop Proceedings). CEUR-WS.org, Lugano, Switzerland.
[3] Cathal Gurrin, H. Joho, Frank Hopfgartner, L. Zhou, Van-Tu Ninh, Tu-Khiem Le, Rami Albatal, D.-T. Dang-Nguyen, and Graham Healy. 2019. Overview of the NTCIR-14 Lifelog-3 task.
[4] Cathal Gurrin, Alan F. Smeaton, and Aiden R. Doherty. 2014. LifeLogging: Personal Big Data. Foundations and Trends® in Information Retrieval 8, 1 (2014), 1–125. https://doi.org/10.1561/1500000033
[5] Minh-Son Dao, Peijiang Zhao, Tomohiro Sato, Koji Zettsu, Duc-Tien Dang-Nguyen, Cathal Gurrin, and Ngoc-Thanh Nguyen. 2019. Overview of MediaEval 2019: Insights for Wellbeing Task: Multimodal Personal Health Lifelog Data Analysis. In MediaEval2019 Working Notes (CEUR Workshop Proceedings). CEUR-WS.org, Sophia Antipolis, France.
[6] Tomohiro Sato, Minh Dao, Kota Kuribayashi, and Koji Zettsu. 2018. SEPHLA: Challenges and Opportunities within Environment-Personal Health Archives.
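
The regression step of Section 3.1 could be sketched roughly as follows. This is a minimal illustration and not the code used for the submitted run: the function names, the learning rate and the number of epochs are assumptions, and the feature order (NO2, O3, temperature, humidity, heart rate) is chosen only for the example. The sketch descends on the mean squared error, which has the same minimizer as the root-mean-square error named in the text.

```python
import math

def l2_normalize(row, eps=1e-12):
    """Scale one feature vector to unit L2 norm (eps guards all-zero rows)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / max(norm, eps) for v in row]

def predict(row, w, b):
    """Linear model y = w^T x + b for a single feature vector."""
    return sum(wi * xi for wi, xi in zip(w, row)) + b

def rmse(rows, targets, w, b):
    """Root-mean-square error of the model on (rows, targets)."""
    n = len(rows)
    total = sum((predict(r, w, b) - t) ** 2 for r, t in zip(rows, targets))
    return math.sqrt(total / n)

def fit(rows, targets, lr=0.1, epochs=2000):
    """Batch gradient descent on the mean squared error.

    lr and epochs are illustrative defaults, not values from the paper.
    """
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, t in zip(rows, targets):
            err = predict(row, w, b) - t
            for j in range(d):
                grad_w[j] += 2.0 * err * row[j] / n
            grad_b += 2.0 * err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b
```

In use, the rows would hold the normalized readings of the companions on the same route for the timestamps where the target's PM2.5 is known, and `predict` would then be applied to the timestamps inside the missing segment.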
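
The checkpoint filtering of Section 3.2 relies on the Haversine formula, which can be sketched as below. The Earth radius is the commonly used mean value; the helper `near_checkpoint` and its 50 m default radius are hypothetical, since the paper does not state the distance threshold that was applied.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def near_checkpoint(position, checkpoints, radius_m=50.0):
    """True if position (lat, lon) lies within radius_m of any checkpoint.

    radius_m is an assumed threshold used only for illustration.
    """
    lat, lon = position
    return any(haversine_m(lat, lon, c_lat, c_lon) <= radius_m
               for c_lat, c_lon in checkpoints)
```

A sensor reading would be kept for a route only when `near_checkpoint` returns true for the GPS position recorded with that reading.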
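
One possible reading of the aggregation described in Section 3.2 is sketched below: average PM2.5 per collector per day, take the daily maximum of those averages, average the daily maxima over all days, and map the result to a level. The paper does not state which breakpoints were used, so the table here is an assumption (upper bounds in µg/m³ modelled on the US EPA 24-hour PM2.5 categories in use in 2019), and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Assumed (upper bound in µg/m³, level) pairs; not taken from the paper.
PM25_BREAKPOINTS = [(12.0, 0), (35.4, 1), (55.4, 2), (150.4, 3), (250.4, 4)]

def pm25_to_level(pm25, breakpoints=PM25_BREAKPOINTS):
    """Map a PM2.5 concentration to a level index using the given breakpoints."""
    for upper, level in breakpoints:
        if pm25 <= upper:
            return level
    return len(breakpoints)  # above the last bound: highest level

def route_level(samples):
    """Infer one level for a route from (day, collector_id, pm25) samples."""
    per_collector = defaultdict(list)
    for day, collector, pm25 in samples:
        per_collector[(day, collector)].append(pm25)

    # One average per collector per day, grouped by day.
    daily_means = defaultdict(list)
    for (day, _collector), values in per_collector.items():
        daily_means[day].append(mean(values))

    # Daily criterion: the maximum of the collectors' averages.
    daily_peaks = [max(means) for means in daily_means.values()]

    # Average over days, then evaluate the level once more.
    return pm25_to_level(mean(daily_peaks))
```

Averaging the daily levels instead of the daily PM2.5 maxima would be another valid interpretation of the text; the sketch picks one of them only to make the steps concrete.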