<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of ImageCLEF Lifelog 2020: Lifelog Moment Retrieval and Sport Performance Lifelog</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Van-Tu Ninh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tu-Khiem Le</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liting Zhou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Piras</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Riegler</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pal Halvorsen</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mathias Lux</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Minh-Triet Tran</string-name>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cathal Gurrin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Duc-Tien Dang-Nguyen</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dublin City University</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ITEC, Klagenfurt University</institution>
          ,
          <addr-line>Klagenfurt</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Pluribus One &amp; University of Cagliari</institution>
          ,
          <addr-line>Cagliari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Simula Metropolitan Center for Digital Engineering</institution>
          ,
          <addr-line>Oslo</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University of Bergen</institution>
          ,
          <addr-line>Bergen</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>University of Science</institution>
          ,
          <addr-line>VNU-HCM, Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper describes the fourth edition of the Lifelog task at ImageCLEF 2020. In this edition, the Lifelog task consists of two challenges: Lifelog Moment Retrieval (LMRT) and Sport Performance Lifelog (SPLL). While the Lifelog Moment Retrieval challenge follows the same format as the previous edition, its data is a larger multimodal dataset based on the merger of three previous NTCIR Lifelog datasets, containing approximately 191,439 images with corresponding visual concepts and other related metadata. The Sport Performance Lifelog, a brand-new challenge, is composed of three subtasks that focus on predicting the expected performance of athletes who trained for a sport event. In summary, ImageCLEF Lifelog 2020 received 50 runs from six teams in total, with competitive results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
Due to the widespread use of wearable devices, digital sensors, and smartphones,
which passively capture photos, biometric signals, and location information, a
huge amount of daily-life data is recorded by many people every day. As a
result, there is an ever-increasing research effort into developing methodologies for
exploiting the potential of this data. Such lifelog data has been used for many
retrieval and analytics challenges since the inaugural NTCIR-12 Lifelog task [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
in 2016.
      </p>
      <p>
There have been many research tasks addressed by these challenges, such as
lifelog retrieval, data segmentation, data enhancement/annotation, and
interactive retrieval. Specifically in the ImageCLEF lifelog challenge, we note a number
of different tasks, such as the Solve My Life Puzzle task in 2019 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Activity
of Daily Living Understanding task in 2018 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], or Lifelog Summarization task
in 2017 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Therefore, in the fourth edition of ImageCLEFlifelog tasks hosted
in ImageCLEF 2020 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], the organizers propose a brand-new task which
monitors the wellbeing and predicts the expected performance of athletes
training for a sporting event, while continuing to maintain the core Lifelog
Moment Retrieval task with a dataset enriched in terms of visual concepts,
annotations, and scale.
      </p>
      <p>Details of the two challenges and the data employed are provided in
Section 2. In Section 3, submissions and results are presented and discussed. In the
final Section 4, the paper is concluded with a discussion of final remarks and future
work.</p>
    </sec>
    <sec id="sec-2">
      <title>Overview of the Task</title>
      <sec id="sec-2-1">
        <title>Motivation and Objectives</title>
        <p>Personal lifelog data is continually increasing in volume due to the popularity
of personal wearable/portable devices for health monitoring and life recording,
such as smartphones, smart watches, fitness bands, video cameras, biometric
data devices, and GPS or location devices. As a huge amount of data is created
daily, there is a need for systems that can analyse, index, categorize, and
summarize the data to gain deep insights from it and support a user in some
positive way.</p>
        <p>Although many lifelogging-related workshops have been held successfully for
years, such as three editions of NTCIR, the annual Lifelog Search Challenge (LSC),
and ImageCLEFlifelog 2019, we still aim to bring lifelogging to the attention of
not only research groups but also diverse audiences. Nevertheless, we continue
to maintain the core task to encourage research groups to propose creative
retrieval approaches to lifelog data, as well as nominating a new task to introduce
a new challenge to the research community.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Challenge Description</title>
      </sec>
      <sec id="sec-2-3">
        <title>Lifelog Moment Retrieval Task (LMRT)</title>
        <p>
          In this task, the participants were required to retrieve a number of specific
moments in a lifelogger's life. Moments are defined as "semantic events, or activities
that happened throughout the day" [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. For example, a participant would have
been required to find and return relevant moments for the query "Find the
moment(s) when the lifelogger was having an ice cream on the beach". In this
edition, particular attention was to be paid to the diversification of the selected
moments with respect to the target scenario. The ground truth for this subtask
was created using a manual annotation process and aimed towards complete
relevance judgements. Figure 1 illustrates some examples of the moments when the
lifelogger was shopping in a toy shop. In addition, Listings 1 and 2 show all the
queries used in the challenge.
        </p>
        <sec id="sec-2-3-0">
          <title>T.001 Enjoying Beers in a Bar</title>
          <p>Description: Find the moment(s) in 2015 or 2016 when u1 enjoyed beers in the bar.
Narrative: To be considered relevant, u1 must be clearly in a bar and drinking beers.</p>
        </sec>
        <sec id="sec-2-3-1">
          <title>T.002 Building Personal Computer</title>
          <p>Description: Find the moment(s) when u1 built his personal computer from scratch.
Narrative: To be considered relevant, u1 must be clearly in the office with the PC
parts or uncompleted PCs on the table.</p>
        </sec>
        <sec id="sec-2-3-2">
          <title>T.003 In A Toyshop</title>
          <p>Description: Find the moment(s) when u1 was looking at items in a toyshop.
Narrative: To be considered relevant, u1 must be clearly in a toyshop where various
toys are being examined.</p>
        </sec>
        <sec id="sec-2-3-3">
          <title>T.004 Television Recording</title>
          <p>Description: Find the moment(s) when u1 was being recorded for a television show.
Narrative: To be considered relevant, there must clearly be a television camera in
front of u1. The moments the interviewer/cameramen is interviewing/recording u1
are also considered relevant.</p>
        </sec>
        <sec id="sec-2-3-4">
          <title>T.005 Public Transport In Home Country</title>
          <p>Description: Find the moment(s) in 2015 and 2018 when u1 was using public
transport in his home country (Ireland).</p>
          <p>Narrative: Taking any form of public transport in Ireland is considered relevant,
such as bus, taxi, train and boat.</p>
        </sec>
        <sec id="sec-2-3-5">
          <title>T.006 Seaside Moments</title>
          <p>Description: Find moment(s) in which u1 was walking by the sea taking photos or
eating ice-cream.</p>
          <p>Narrative: To be considered relevant, u1 must be taking a walk by the sea or eating
ice-cream with the sea clearly visible.</p>
        </sec>
        <sec id="sec-2-3-6">
          <title>T.007 Grocery Stores</title>
          <p>Description: Find moment(s) in 2016 and 2018 when u1 was doing grocery shopping
on the weekends.</p>
          <p>Narrative: To be considered relevant, u1 must clearly buy or visibly interact
with products in a grocery store on the weekends.</p>
        </sec>
        <sec id="sec-2-3-7">
          <title>T.008 Photograph of The Bridge</title>
          <p>Description: Find the moment(s) when u1 was taking a photo of a bridge.
Narrative: Moments when u1 was walking on a street without stopping to take a
photo of a bridge are not relevant. Any other moments showing a bridge when a
photo was not being taken are also not considered to be relevant.</p>
        </sec>
        <sec id="sec-2-3-8">
          <title>T.009 Car Repair</title>
          <p>Description: Find the moment(s) when u1 was repairing his car in the garden.
Narrative: Moments when u1 was repairing his car in the garden with gloves on
his hands are relevant. Sometimes he also held a hammer and his phone, and these
moments are also considered relevant.</p>
        </sec>
        <sec id="sec-2-3-9">
          <title>T.010 Monsters</title>
          <p>Description: Find the moment(s) when u1 was looking at an old clock, with flowers
visible, with a small monster watching u1.</p>
          <p>Narrative: Moments when u1 was at home, looking at an old clock, with flowers
visible, with a lamp and two small monsters watching u1 are considered relevant.</p>
          <p>Listing 1: Description of topics for the development set in LMRT.</p>
        </sec>
        <sec id="sec-2-3-10">
          <title>T.001 Praying Rite</title>
          <p>Description: Find the moment when u1 was attending a praying rite with other
people in the church.</p>
          <p>Narrative: To be relevant, the moment must show u1 inside the church,
attending a praying rite with other people.</p>
        </sec>
        <sec id="sec-2-3-11">
          <title>T.002 Lifelog data on touchscreen on the wall</title>
          <p>Description: Find the moment when u1 was looking at lifelog data on a large
touchscreen on the wall.</p>
          <p>Narrative: To be relevant, the moment must show u1 was looking at his lifelog data
on the touchscreen wall (not desktop monitor).</p>
        </sec>
        <sec id="sec-2-3-12">
          <title>T.003 Bus to work - Bus to home</title>
          <p>Description: Find the moment when u1 was getting a bus to his office at Dublin
City University or was going home by bus.</p>
          <p>Narrative: To be relevant, u1 was on the bus (not waiting for the bus) and the
destination was his home or his workplace.</p>
        </sec>
        <sec id="sec-2-3-13">
          <title>T.004 Bus at the airport</title>
          <p>Description: Find the moment when u1 was getting on a bus in the aircraft landing
deck in the airport.</p>
          <p>Narrative: To be relevant, u1 was walking out from the airplane to the bus parking
in the aircraft landing deck with many airplanes visible.</p>
        </sec>
        <sec id="sec-2-3-14">
          <title>T.005 Medicine cabinet</title>
          <p>Description: Find the moment when u1 was looking inside the medicine cabinet in
the bathroom at home.</p>
          <p>Narrative: To be considered relevant, u1 must be at home, looking inside the open
medicine cabinet beside a mirror in the bathroom.</p>
        </sec>
        <sec id="sec-2-3-15">
          <title>T.006 Order Food in the Airport</title>
          <p>Description: Find the moment when u1 was ordering fast food in the airport.
Narrative: To be relevant, u1 must be at the airport and ordering food. The moments
that u1 was queuing to order food are also considered relevant.</p>
        </sec>
        <sec id="sec-2-3-16">
          <title>T.007 Seafood at Restaurant</title>
          <p>Description: Find moments when u1 was eating seafood in a restaurant in the
evening time.</p>
          <p>Narrative: Moments showing u1 eating seafood or seafood parts in any
restaurant in the evening are considered relevant.</p>
        </sec>
        <sec id="sec-2-3-17">
          <title>T.008 Meeting with people</title>
          <p>Description: Find the moments when u1 was at a round-table meeting with many
people and there were pink (not red) name-cards for each person.</p>
          <p>Narrative: Moments showing u1 at a round-table meeting with
many people, with pink name-cards visible, are considered relevant.</p>
        </sec>
        <sec id="sec-2-3-18">
          <title>T.009 Eating Pizza</title>
          <p>Description: Find the moments when u1 was eating a pizza while talking to a man.
Narrative: To be considered relevant, u1 must eat or hold a pizza with a man in the
background.</p>
        </sec>
        <sec id="sec-2-3-19">
          <title>T.010 Socialising</title>
          <p>Description: Find the moments when u1 was talking to a lady in a red top, standing
directly in front of a poster hanging on a wall.</p>
          <p>Narrative: To be relevant, u1 must be talking with a woman in red, who was standing
right in front of a scientific research poster.</p>
          <p>Listing 2: Description of topics for the test set in LMRT.</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>Sport Performance Lifelog (SPLL)</title>
        <p>Given a dataset from 16 people who trained for a 5km run (e.g.,
daily sleeping patterns, daily heart rate, sport activities, and image logs of all
food consumed during the training period), participants are required to predict
the expected performance (e.g., estimated finishing time, average heart rate, and
calorie consumption) of the trained athletes.</p>
      </sec>
      <sec id="sec-2-5">
        <title>Datasets</title>
        <p>
          LMRT Task: The data is a combination of three previously released datasets
of NTCIR-Lifelog Tasks: NTCIR-12 [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], NTCIR-13 [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], and NTCIR-14 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. It
is a large multimodal lifelog dataset covering 114 days of one lifelogger's life, with
dates ranging from 2015 to 2018. It contains five main data types: multimedia
content, biometric data, location and GPS, human activity data, and visual
concepts and annotations of non-text multimedia content. Details of each type
of data are as follows:
- Multimedia Content: Most of this data consists of non-annotated egocentric
photos captured passively by two wearable digital cameras: the OMG Autographer
and the Narrative Clip1. The lifelogger wore a device for 16-18 hours per day to
capture a complete visual trace of daily life, with about 2-3 photos captured
per minute during waking hours. The photo data was manually redacted to
remove identifiable content and faces [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
1 Narrative Clip and Narrative Clip 2 - http://getnarrative.com
- Biometric Data: This data contains heart rate, calories, and movement
speed, recorded using a Fitbit fitness tracker2. The lifelogger wore the Fitbit
device 24 hours every day so as to record continuous biometric data.
- Location and GPS: 166 semantic locations as well as GPS data (with and
without location names) were recorded using both the Moves app and smartphones.
The GPS data plays an important role in inferring the time zone of the lifelogger's
current location, so that the times of different wearable devices can be converted
into one standard timestamp.
- Human Activity Data: This data was recorded by the Moves app, which
also provides some annotated semantic locations. It consists of four types of
activities: walking, running, transport, and airplane.
- Visual Concepts and Annotations: The passively auto-captured images
were passed through two deep neural networks to extract visual concepts
describing scenes and visual objects. For scene identification, we still employ
the PlacesCNN [25] as in the latest edition. For visual object detection, we
employed Mask R-CNN [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] pre-trained on the 80 object categories of the MSCOCO dataset
[
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] to provide the categories of visual objects in the image as
well as their bounding boxes.
        </p>
        <p>Format of the metadata. The metadata was stored in CSV files, together
referred to as the metadata table. The structure and meaning of each field in the
table are described in Table 3. Additionally, visual categories and concept
descriptors are also provided in the visual concepts table, whose format can
be found in Table 4.</p>
        <p>SPLL Task: The data was gathered using three different approaches:
wearable devices (the Fitbit Fitness Tracker (Fitbit Versa)), Google Forms, and PMSYS.
Biometric data of an individual (training athlete) was recorded using the Fitbit
Fitness Tracker, including 13 different fields of information such as daily heart rate,
calories, daily sleeping patterns, sport activities, etc. Google Forms were used to
collect information on meals, drinks, medications, etc. At the same time,
information on subjective wellness, injuries, and training load was recorded by the PMSYS
system. In addition, image logs of food consumed during the training period from
at least 2 participants and self-reported data like mood, stress, fatigue, readiness
2 Fitbit Fitness Tracker (Fitbit Versa) - https://www.fitbit.com</p>
        <sec id="sec-2-5-1">
          <title>Categories</title>
          <p>Table 2. Statistics of the ImageCLEFlifelog 2020 SPLL data
(category | source | logging frequency | number of entries):
Calories | Fitbit | Per minute | 3377529
Steps | Fitbit | Per minute | 1534705
Distance | Fitbit | Per minute | 1534705
Sleep | Fitbit | When it happens (usually daily) | 2064
Lightly active minutes | Fitbit | Per day | 2244
Moderately active minutes | Fitbit | Per day | 2396
Very active minutes | Fitbit | Per day | 2396
Sedentary minutes | Fitbit | Per day | 2396
Heart rate | Fitbit | Per 5 seconds | 20991392
Time in heart rate zones | Fitbit | Per day | 2178
Resting heart rate | Fitbit | Per day | 1803
Exercise | Fitbit | When it happens | 2440
Sleep score | Fitbit | 100 entries per file | 1836
Google Forms reporting | Google Form | When it happens (usually daily) | 1569
Wellness | PMSYS | Per day | 1747
Injury | PMSYS | Per day | 225
SRPE | PMSYS | Per day | 783</p>
          <p>
to train and other measurements also used for professional soccer teams [
            <xref ref-type="bibr" rid="ref23">23</xref>
            ].
The data was approved by the Norwegian Center for Research Data, with proper
copyright and ethical approval for release. Statistics of the ImageCLEFlifelog
2020 SPLL data are shown in Table 2.
          </p>
        </sec>
        <sec id="sec-2-5-2">
          <title>Evaluation Methodology</title>
          <p>
LMRT: Classic metrics are employed to assess the performance of the LMRT
task. These metrics include:
- Cluster Recall at X (CR@X): a metric that assesses how many different
clusters from the ground truth are represented among the top X results;
- Precision at X (P@X): the number of relevant photos among the
top X results;
- F1-measure at X (F1@X): the harmonic mean of the previous two.
          </p>
          <p>Various cut-off points are considered, e.g., X = 5, 10, 20, 30, 40, 50. The official
ranking metric is the F1-measure@10, which gives equal importance to
diversity (via CR@10) and relevance (via P@10). In particular, the final score used to rank
participants' submissions is the average F1-measure@10 over the ten queries, which
reflects the general performance of each system across all 10 queries.</p>
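As an illustration, the metrics above can be sketched as follows; `retrieved` (a run's ranked image ids), `relevant` (the ground-truth relevant set), and `cluster_of` (the image-to-cluster mapping) are hypothetical names, not part of the official evaluation code.

```python
def precision_at(retrieved, relevant, x):
    """P@X: fraction of the top-X retrieved photos that are relevant."""
    return sum(1 for img in retrieved[:x] if img in relevant) / x

def cluster_recall_at(retrieved, cluster_of, n_clusters, x):
    """CR@X: fraction of ground-truth clusters represented in the top X."""
    found = {cluster_of[img] for img in retrieved[:x] if img in cluster_of}
    return len(found) / n_clusters

def f1_at(retrieved, relevant, cluster_of, n_clusters, x):
    """F1@X: harmonic mean of P@X and CR@X (official metric at X=10)."""
    p = precision_at(retrieved, relevant, x)
    cr = cluster_recall_at(retrieved, cluster_of, n_clusters, x)
    return 0.0 if p + cr == 0 else 2 * p * cr / (p + cr)
```

The per-topic F1@10 values would then be averaged over the ten topics to obtain the final ranking score.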
          <p>Participants were allowed to undertake the sub-tasks in an interactive or
automatic manner. For interactive submissions, a maximum of five minutes of
search time was allowed per topic. In particular, methods that allowed
interaction with real users (via Relevance Feedback (RF), for example) were
encouraged; besides raw performance, the mode of interaction (such as the number of
iterations using RF) and the level of innovation of the method (for example, a new
way to interact with real users) were also taken into account.</p>
          <p>SPLL: For this task, we employ two evaluation metrics to rank the submissions
of participants. The primary score checks how accurately the participants
can predict whether there was an improvement or a deterioration after the training
process, by comparing the sign of the actual change value to the predicted one.
The secondary score is the absolute difference between the actual change and
the predicted one. The primary score is ranked in descending order, and if there
is a draw in the primary score, the secondary score is used to re-rank the teams.</p>
        </sec>
      </sec>
      <sec id="sec-2-6">
        <title>Ground Truth Format</title>
        <p>LMRT Task. The development ground truth for the LMRT task was provided
in two individual txt files: one file for the cluster ground truth and one file for
the relevant image ground truth.</p>
        <p>In the cluster ground-truth file, each line corresponded to a cluster, where
the first value was the topic id, followed by the cluster id number. Lines were
separated by an end-of-line character (carriage return). An example is presented
below:
- 1, 1
- 1, 2
- ...
- 2, 1
- 2, 2
- ...</p>
        <p>In the relevant ground-truth file, the first value on each line was the topic id,
followed by a unique photo id (the image name without the extension), and
then by the cluster id number (corresponding to the values in the
cluster ground-truth file), separated by commas. Each line corresponded to the
ground truth of one image, and lines were separated by an end-of-line character
(carriage return). An example is presented below:
- 1, b00001216_21i6bq_20150306_174552e, 1
- 1, b00001217_21i6bq_20150306_174713e, 1
- 1, b00001218_21i6bq_20150306_174751e, 1
- 1, b00002953_21i6bq_20150316_203635e, 2
- 1, b00002954_21i6bq_20150316_203642e, 2
- ...
- 2, b00000183_21i6bq_20150313_072410e, 1
- 2, b00000184_21i6bq_20150313_072443e, 1
- 2, b00000906_21i6bq_20150312_171852e, 2
- 2, b00000908_21i6bq_20150312_172005e, 2
- 2, b00000909_21i6bq_20150312_172040e, 2
- ...</p>
        <p>SPLL Task. The ground truth was provided in one txt file. On each line in
this file, the first value was the id of the sub-task, which is 1, 2 or 3 (since the
SPLL task is split into three sub-tasks), followed by the id of the individual (p01,
p02, ..., p16), followed by the actual change in the status of the individual after
the training period. Although the three sub-tasks have different requirements,
their output format is the same: a number indicating the change before
and after training, with a preceding '+' sign if the change is an increase, or a '-' sign
if the change is a decrease. If there is no change after the training process, a 0
value without a preceding sign is also allowed. Values in each line were separated
by commas. Lines were separated by an end-of-line character (carriage return).
An example is shown below:
- 1, p01, +8
- 1, p10, +86
- ...</p>
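The signed-change format above can be parsed with a one-record-per-line reader such as this sketch (the helper name is hypothetical):

```python
def parse_spll_line(line):
    """Parse one SPLL ground-truth/submission line of the form
    'subtask, person_id, change'. The change keeps its sign:
    '+8' -> 8, '-3' -> -3, '0' -> 0."""
    subtask, person, change = (part.strip() for part in line.split(","))
    return int(subtask), person, int(change)
```

Python's int() accepts an explicit leading '+' or '-', so the signed values can be read directly.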
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Evaluation Results</title>
      <p>This year, we obtained 50 valid submissions across the two
ImageCLEFlifelog tasks from 6 teams, which is not as high as in the previous year. However,
the results of these submissions show a significant improvement in the final scores
compared to ImageCLEFlifelog 2019. In particular, there were 38 submissions
in LMRT with 6 teams participating in the task, while only one non-organizer
team submitted 10 runs in the SPLL task. The submitted runs and their results are
summarised in Tables 5 and 6.</p>
      <sec id="sec-3-1">
        <title>Results</title>
        <p>In this section, we provide a short description of all submitted approaches, followed
by the official results of the task.</p>
        <p>The Organizer team continued to provide a baseline approach for the LMRT
task with a web-based interactive search engine, which is an improved version</p>
        <sec id="sec-3-1-3">
          <title>Submitted Approaches</title>
          <p>
Notes: * submissions from the organizer teams are for reference only. The results in
this paper are the official version of the ImageCLEFlifelog 2020 tasks.
of LIFER 2.0 system [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ] which was used at ImageCLEFlifelog 2019. The
interactive elements of this system comprise three features: free-text querying and
filtering, visual similarity image search, and elastic sequencing to view nearby
moments. The system, which focuses on evaluating the efficiency of the free-text
query features, is an early version of the LifeSeeker 2.0 interactive search engine
[
            <xref ref-type="bibr" rid="ref14">14</xref>
            ]. For the query processing procedure, the authors use natural language
processing to parse the query into meaningful terms and employ a Bag-of-Words model to
retrieve and rank relevant documents. The Bag-of-Words dictionary is split
into three dictionaries for filtering by term matching: time, location, and
visual concepts. The authors extract more detailed visual concepts inferred from
deep neural networks pre-trained on the Visual Genome dataset [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ] which were shown to be extremely useful for the retrieval process. For the SPLL
task, the organizer team provided baseline approaches for all three sub-tasks, using
only the exercise data from the Fitbit tracker, self-reporting, and food images. The
authors propose a naive solution which computes the difference between
consecutive rows of data from exercise activities and self-reporting (including distance,
exercise duration, calories, and weight), then categorises the differences into positive and
negative groups based on the sign of the value ('+' or '-') and calculates the
average of each group. Finally, they sum the two averages to obtain the result.
In addition, they also tried to build a Linear Regression model to predict the
pace change and a Convolutional Neural Network to detect the type of food for
manual calorie inference.
          </p>
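The naive baseline described above (sign-split averages of consecutive differences) might look like the following sketch; the function name and the flat list-of-measurements input are our assumptions.

```python
def naive_change_estimate(series):
    """Baseline sketch: difference consecutive measurements, split the
    deltas by sign, average each group, and sum the two averages."""
    deltas = [b - a for a, b in zip(series, series[1:])]
    pos = [d for d in deltas if d > 0]
    neg = [d for d in deltas if d < 0]
    avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return avg(pos) + avg(neg)
```

The sign of the returned value then serves directly as the predicted improvement/deterioration for the primary SPLL score.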
          <p>
            The REGIM-Lab approaches the LMRT Task with the same strategies as
their work in ImageCLEFlifelog 2019 [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ], which used the ground truth of the
development sets from both LMRT 2019 and LMRT 2020 to automatically
categorise images for deep neural network fine-tuning with
MobileNet v2 and DenseNet, and for visual concept clustering. However, the difference
is that they use Elasticsearch and the Kibana Query Language (KQL) to perform
retrieval on image concepts and metadata instead of Apache Cassandra and the
Cassandra Query Language (CQL). Moreover, they attempt to enrich the visual
concepts using YOLO v3 [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ] trained on the OpenImages dataset. They also process
the textual queries with three word-embedding models built from scratch:
Word2vec, FastText, and GloVe.
          </p>
          <p>
            HCMUS focused on the LMRT task only this year. Their retrieval system
has three components which are query by caption, query by similar image, and
query by place and time from the metadata. For query by caption, they encoded
images using Faster R-CNN [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ] to extract object-level features, then applied
self-attention to learn interactions between them. For the query sentence, they used
the RoBERTa model [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ] to encode sentences. Finally, two feed-forward networks were
deployed to map image and text features into a common space.
Therefore, when a sentence is given, their model ranks all images based on the
cosine distance between the encoded images and the encoded query sentence
to find the images most relevant to the description. For query
by similar image, the same strategy was applied with a ResNet152 image encoder
[
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] instead of Faster R-CNN. For query by place and time from metadata, they
simply find all moments matching the given semantic locations and view the
images immediately before and after a specific moment.
          </p>
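The ranking step common to HCMUS's query-by-caption and query-by-similar-image components can be sketched as follows, assuming precomputed embeddings in a shared space (all names here are illustrative, not from their system):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_images(query_vec, image_vecs):
    """Rank image ids by similarity of their embedding to the query
    embedding, most similar first."""
    return sorted(image_vecs,
                  key=lambda img: cosine(query_vec, image_vecs[img]),
                  reverse=True)
```

In practice the image embeddings would come from the Faster R-CNN (or ResNet152) encoder and the query embedding from the RoBERTa-based text encoder, both mapped into the common space by the feed-forward networks.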
          <p>
            The DCU-DDTeam interactive search engine is an improved version of their
Mysceal system from LSC'20 [24] and follows the same pipeline. The visual
concepts of each image are the combination of the given metadata, the outputs of
DeepLabv3+ [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] and the enriched metadata extracted from the Microsoft Computer
Vision API. These annotations, along with other information such as locations
and time, were then indexed in Elasticsearch. The input query is analyzed
to extract the main information and expanded by their query expansion mechanism.
These terms are then matched against the indexed database to find matching images,
which are ordered by their ranking function. In this version, they
introduced three changes to the previous system: visual similarity, the user
interface, and the summary panel. The visual similarities between images were
measured using the cosine distance between visual features composed of SIFT
[
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] and VGG16 features. For the user interface, the authors remove the triad
of searching bars as in the original version and reorganised the interface to
explore cluster events more e ciently. The summary panel consists of the \Word
List" panel which is the area on the screen showing the results of their query
expansion with adjustable scores allowing the user to emphasize the concepts
that they need to retrieve.
          </p>
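The concept indexing and weighted matching described above can be illustrated with a toy in-memory inverted index. This is a simplification of what Elasticsearch provides out of the box; the names and structure are our assumptions, not the Mysceal implementation:

```python
from collections import defaultdict

def build_index(annotations):
    # annotations: {image_id: set of concept terms extracted for that image}
    index = defaultdict(set)
    for image_id, terms in annotations.items():
        for term in terms:
            index[term].add(image_id)
    return index

def search(index, query_terms, weights=None):
    # Score each image by the summed weights of matched query terms;
    # adjustable weights mimic the user emphasising "Word List" concepts.
    weights = weights or {}
    scores = defaultdict(float)
    for term in query_terms:
        for image_id in index.get(term, ()):
            scores[image_id] += weights.get(term, 1.0)
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

Raising the weight of a term pushes images annotated with that concept up the ranking, which is the effect the adjustable scores in the summary panel are designed to give the user.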
          <p>
            The BIDAL team was the only non-organizer team participating in both the
LMRT and SPLL tasks. For the LMRT task, the authors generated clusters by
employing a scene-recognition model trained with Google's FixMatch method
[
            <xref ref-type="bibr" rid="ref22">22</xref>
            ]. They then used an attention mechanism to match the input query with the
correct samples, which were in turn used to find other relevant moments. For
the SPLL task, they summarised information from various interval attributes,
removed several unnecessary attributes, and generated some new ones. They then
trained several typical time-series neural network structures, including a
Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN), using
the generated attributes, a subset of attributes, or some pre-defined seed
attributes.
          </p>
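Feeding interval attributes into a CNN or RNN typically means slicing the per-interval attribute series into fixed-length windows. A minimal sketch of that preprocessing step (our illustration, assuming a `(time, features)` array; not BIDAL's actual code):

```python
import numpy as np

def sliding_windows(series, window, step=1):
    # series: (T, F) array, one row of F attributes per time interval.
    # Returns (N, window, F): N overlapping windows ready for a CNN/RNN.
    series = np.asarray(series)
    return np.stack([series[i:i + window]
                     for i in range(0, len(series) - window + 1, step)])
```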
          <p>
            The UA.PT Bioinformatics team continued to employ the approaches from
last year's challenge [
            <xref ref-type="bibr" rid="ref21">21</xref>
            ] to test the performance of their automatic lifelog search
engine, attempting to enrich visual concepts and labels by utilising several
different object-detection networks, including Faster R-CNN and YOLOv3 [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ]
pretrained on the COCO dataset [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]. Image retrieval was then performed in
a text-based vector space by computing a similarity score between the text labels
extracted from the images and the visual concepts. Finally, a threshold was
set to choose the results for each topic. As the results showed that this automatic
approach did not work, the authors developed a web-based interactive search
engine with a timestamp-clustering visualisation to select the moments instead
of defining a threshold to choose the results automatically. The algorithms for
finding relevant moments are mostly the same as in the automatic approach, except
for three new features, including narrowing the search by text matching
between the manually analysed query and the indexed database containing the concepts
of each image.
          </p>
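The threshold-based selection over text-label similarity can be sketched as follows. This is illustrative only: the Jaccard measure and all names here are our assumptions, not the team's exact scoring function:

```python
def jaccard(a, b):
    # Overlap between two sets of text labels, in [0, 1].
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def select_by_threshold(images, query_labels, threshold):
    # images: {image_id: label set}. Keep images whose label overlap
    # with the query meets the cut-off, ranked by score.
    scored = [(img, jaccard(labels, query_labels))
              for img, labels in images.items()]
    kept = [(img, s) for img, s in scored if s >= threshold]
    return sorted(kept, key=lambda kv: -kv[1])
```

The difficulty the team ran into is visible here: a single global `threshold` per topic is brittle, which motivated replacing it with interactive selection.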
          <p>The official results are summarised in Tables 5 and 6. Six teams
participated in the LMRT task, with the highest F1@10 score of 0.81 achieved
by HCMUS (Table 5). Most of the teams tried to enrich the visual concepts by
deploying different CNNs for object and place detection, then performing text
analysis on the query followed by text matching. Additional features were also
included in most interactive systems, such as searching for visually similar images,
term weighting for result re-ranking, and context understanding before performing the
search. The highest-scoring approach, by the HCMUS team, compared
visual feature vectors extracted from CNNs to find relevant moments.</p>
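The F1@10 measure balances the precision of the top-10 results against recall over the distinct relevant moments (clusters). A sketch of one common formulation (our illustration; the official evaluation script may differ in details):

```python
def f1_at_k(ranked_clusters, relevant_clusters, k=10):
    # ranked_clusters: cluster id of each retrieved image, best first.
    # Precision@k counts relevant images in the top k (divided by k);
    # cluster recall counts distinct relevant clusters found in the top k.
    top = ranked_clusters[:k]
    precision = sum(1 for c in top if c in relevant_clusters) / k
    hit_clusters = {c for c in top if c in relevant_clusters}
    recall = len(hit_clusters) / len(relevant_clusters)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because recall is computed over clusters, submitting ten near-duplicate images from one moment scores poorly even if all ten are relevant, which rewards the event-clustering strategies used by the teams.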
          <p>In the SPLL task, only one non-organizer team participated, and they
managed to achieve good scores. For the prediction of performance change, their
approach achieved a prediction accuracy of 0.82 and an L1 distance of 128.0
between the predicted and actual change.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Discussions and Conclusions</title>
      <p>In ImageCLEFlifelog 2020, most of the submitted results managed to achieve
high scores in both tasks. Although the set of topics for the LMRT task differs
from the previous edition, participants managed to search for the relevant moments
in a large-scale dataset while still achieving higher scores than in the
previous edition. This shows that the proposed features and query mechanisms
actually enhance the performance of their retrieval systems. Most of the teams
enriched semantic visual concepts using several different CNNs pretrained on
different datasets, such as COCO, OpenImage, and Visual Genome, before indexing
and querying; retrieved relevant images using text-matching and text-retrieval
algorithms; and performed visually similar image search. We also note many interesting
approaches from teams to enhance the affordance and interactivity of the retrieval
systems, including integrating a filter mechanism into free-text search, adding
visual feature vectors to the final encoded vector, and clustering images into
events.</p>
      <p>Regarding the number of teams and submitted runs, only six teams
participated in the LMRT task, including an organizer team, producing 50
submissions in total. Each team was allowed to submit up to 10 runs. Among the
five teams which had participated in ImageCLEFlifelog 2019 (including
the organizer team), four managed to obtain better results, with the
highest F1-score reaching 0.81. The mean (SD) increase in final F1-score across these five
teams is 0.25 (0.18). The new team from Dublin City University also managed
to achieve 4th rank with an F1-score of 0.48. For the SPLL task, as the task
is new, only one team, from The Big Data Analytics Laboratory, submitted 10
runs. Their best submission achieved an accuracy of 0.82 on performance change and an
absolute difference of 128 between the predicted and actual change,
which is a good result.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgement</title>
      <p>This publication has emanated from research supported in part by research
grants from the Irish Research Council (IRC) under Grant Number GOIPG/2016/741
and Science Foundation Ireland under grant numbers SFI/12/RC/2289 and
SFI/13/RC/2106.
</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abdallah</surname>
            ,
            <given-names>F.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feki</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ammar</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amar</surname>
            ,
            <given-names>C.B.</given-names>
          </string-name>
          :
          <article-title>Big data for lifelog moments retrieval improvement</article-title>
          .
          <source>In: CLEF</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>L.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Papandreou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schroff</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adam</surname>
          </string-name>
          , H.:
          <article-title>Encoder-decoder with atrous separable convolution for semantic image segmentation</article-title>
          .
          <source>In: ECCV</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Dang-Nguyen</surname>
            ,
            <given-names>D.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boato</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Overview of imagecleflifelog 2017: Lifelog retrieval and summarization</article-title>
          .
          <source>In: CLEF</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Dang-Nguyen</surname>
            ,
            <given-names>D.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lux</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Overview of imagecleflifelog 2018: Daily living understanding and lifelog moment retrieval</article-title>
          .
          <source>In: CLEF</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Dang-Nguyen</surname>
            ,
            <given-names>D.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lux</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>M.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>T.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ninh</surname>
            ,
            <given-names>V.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Overview of imagecleflifelog 2019: Solve my life puzzle and lifelog moment retrieval</article-title>
          .
          <source>In: CLEF</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joho</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hopfgartner</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Albatal</surname>
          </string-name>
          , R.:
          <article-title>Overview of ntcir-12 lifelog task</article-title>
          .
          <source>In: NTCIR</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joho</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hopfgartner</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Albatal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , Dang-Nguyen, D.T.:
          <article-title>Overview of ntcir-13 lifelog-2 task</article-title>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joho</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hopfgartner</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ninh</surname>
            ,
            <given-names>V.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>T.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Albatal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dang-Nguyen</surname>
            ,
            <given-names>D.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Healy</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Overview of the ntcir-14 lifelog-3 task</article-title>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smeaton</surname>
            ,
            <given-names>A.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doherty</surname>
            ,
            <given-names>A.R.</given-names>
          </string-name>
          : Lifelogging:
          <article-title>Personal big data</article-title>
          .
          <source>Found. Trends Inf. Retr. 8</source>
          ,
          <issue>1</issue>
          -
          <fpage>125</fpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gkioxari</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dollar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girshick</surname>
          </string-name>
          , R.B.:
          <article-title>Mask R-CNN</article-title>
          .
          <source>2017 IEEE International Conference on Computer Vision</source>
          (ICCV) pp.
          <volume>2980</volume>
          -
          <issue>2988</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
          </string-name>
          , J.:
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          pp.
          <volume>770</volume>
          -
          <issue>778</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Ionescu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , Muller, H.,
          <string-name>
            <surname>Peteri</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abacha</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Datla</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>DemnerFushman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kozlovski</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liauchuk</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cid</surname>
            ,
            <given-names>Y.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kovalev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pelka</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedrich</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Herrera</surname>
            ,
            <given-names>A.G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ninh</surname>
            ,
            <given-names>V.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>T.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Halvorsen,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.T.</given-names>
            ,
            <surname>Lux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Gurrin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Dang-Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.T.</given-names>
            ,
            <surname>Chamberlain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Campello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Fichou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Berari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Brie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Dogariu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Stefan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.D.</given-names>
            ,
            <surname>Constantin</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.G.</surname>
          </string-name>
          :
          <article-title>Overview of the ImageCLEF 2020: Multimedia retrieval in lifelogging, medical, nature, and internet applications</article-title>
          .
          <source>In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 11th International Conference of the CLEF Association (CLEF</source>
          <year>2020</year>
          ), vol.
          <volume>12260</volume>
          .
          <source>LNCS Lecture Notes in Computer Science</source>
          , Springer, Thessaloniki, Greece (September 22-25,
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Krishna</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Groth</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Johnson</surname>
          </string-name>
          , J.,
          <string-name>
            <surname>Hata</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kravitz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalantidis</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shamma</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>M.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Visual genome: Connecting language and vision using crowdsourced dense image annotations</article-title>
          .
          <source>International Journal of Computer Vision</source>
          <volume>123</volume>
          , 32-
          <fpage>73</fpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>T.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ninh</surname>
            ,
            <given-names>V.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>M.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>T.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>H.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Healy</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Lifeseeker 2.0: Interactive lifelog search engine at lsc 2020</article-title>
          .
          <source>Proceedings of the Third Annual Workshop on Lifelog Search Challenge</source>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>T.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maire</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Belongie</surname>
            ,
            <given-names>S.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hays</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perona</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramanan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dollar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zitnick</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          :
          <article-title>Microsoft coco: Common objects in context</article-title>
          .
          <source>ArXiv abs/1405.0312</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ott</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoyanov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          .
          <source>ArXiv abs/1907.11692</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Lowe</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Distinctive image features from scale-invariant keypoints</article-title>
          .
          <source>International Journal of Computer Vision</source>
          <volume>60</volume>
          ,
          <fpage>91</fpage>
          &#8211;
          <lpage>110</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Ninh</surname>
            ,
            <given-names>V.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>T.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lux</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>M.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dang-Nguyen</surname>
            ,
            <given-names>D.T.</given-names>
          </string-name>
          :
          <article-title>Lifer 2.0: Discovering personal lifelog insights using an interactive lifelog retrieval system</article-title>
          .
          <source>In: CLEF</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Redmon</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farhadi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Yolov3: An incremental improvement</article-title>
          .
          <source>ArXiv abs/1804.02767</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>R.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Faster R-CNN: Towards real-time object detection with region proposal networks</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>39</volume>
          ,
          <fpage>1137</fpage>
          &#8211;
          <lpage>1149</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Ribeiro</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neves</surname>
            ,
            <given-names>A.J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliveira</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          :
          <article-title>UA.PT Bioinformatics at ImageCLEF 2019: Lifelog moment retrieval based on image annotation and natural language processing</article-title>
          .
          <source>In: CLEF</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Sohn</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berthelot</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carlini</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cubuk</surname>
            ,
            <given-names>E.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kurakin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raffel</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Fixmatch: Simplifying semi-supervised learning with consistency and confidence</article-title>
          .
          <source>ArXiv abs/2001.07685</source>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Thambawita</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hicks</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Borgli</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pettersen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Johansen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Johansen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kupka</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stensland</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jha</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grønli</surname>
            ,
            <given-names>T.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fredriksen</surname>
            ,
            <given-names>P.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eg</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hansen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fagernes</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Biorn-Hansen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dang Nguyen</surname>
            ,
            <given-names>D.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hammer</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halvorsen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Pmdata: A sports logging dataset</article-title>
          (02
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>