
Retrieving Events in Life Logging

Ergina Kavallieratou

kavallieratou@aegean.gr 0

Carlos R. del-Blanco

Carlos Cuevas

Narciso García

narciso@gti.ssr.upm.es 1

0 Department of Information & Communication Systems Engineering, University of the Aegean, Samos 83200, Greece

1 Grupo de Tratamiento de Imágenes, ETSI Telecomunicación, Universidad Politécnica de Madrid, Madrid, Spain


This paper describes our contribution to the Lifelog Moment Retrieval (LMRT) challenge of ImageCLEF Lifelog 2018. Lifelogging has tremendous potential in many applications. However, the wide range of possible moment events, along with the lack of fully annotated databases, makes this task very challenging. This work proposes an interactive and weakly supervised learning approach that can dramatically reduce the time needed to retrieve any kind of event in huge databases. The approach reached the first rank in the referred challenge.

Keywords: Life Logging · Deep Learning · Supervised Learning

1 Introduction

Lifelogging is the process of collecting and tracking personal data during daily life. The potential applications are numerous, from memory retrieval [1] to surveillance [2]. As a result, an increasing number of research works and initiatives have appeared in recent years, such as:
• the LifeLog of DARPA of the U.S. Department of Defense [3], and
• MyLifeBits by Gordon Bell of Microsoft [4-5].

Since then, many lifeloggers have collected huge quantities of data for various purposes [6][7]. Technology makes it easy to collect data automatically using sensors. However, there is still no consolidated framework for analyzing all this data and properly extracting useful information for the target applications. Moreover, the privacy and ethical dimensions of lifelogging are the subject of much discussion [8]. Nevertheless, the huge potential of lifelogging has encouraged new efforts in this line, such as the two competitions that started in the last two years:
• NTCIR Lifelog Task [9]
• ImageCLEFlifelog [10].

This year, ImageCLEF Lifelog 2018 [11-12] has been divided into two subtasks (challenges):
• the LMRT challenge, about Lifelog Moment Retrieval, and
• the ADLT challenge, about Activities of Daily Living.

In this paper, three strategies addressing the LMRT challenge are presented. The participants had to retrieve a number of specific moments in a lifelogger's life. Moments were defined as semantic events or activities that happened throughout the day. The ground truth for this subtask was created using manual annotation. The dataset consisted of 50 days of data from a lifelogger, namely: images (1,500-2,500 per day from wearable cameras); visual concepts (automatically extracted, with varying rates of accuracy); semantic content (semantic locations, semantic activities) based on sensor readings (via the Moves App) on mobile devices; biometrics information (heart rate, galvanic skin response, calorie burn, steps, etc.); and music listening history. The dataset is built on the data available for the NTCIR-13 Lifelog 2 Task, which contained a total of 80,440 images.

The rest of the paper is structured as follows. Section 2 describes the three proposed strategies, Section 3 presents the experimental results for some trials, and Section 4 gives the conclusions.

2 Proposed Strategies

Three different strategies have been conceived for addressing the ImageCLEFlifelog 2018 challenge, with the purpose of accurately retrieving images that correspond to the ten proposed topics (Table 1, Fig.1). In the first strategy, called the Two-class strategy, a deep learning framework considers every topic independently. That is, two classes are considered per topic, one representing the event or action described by the topic, and the other its absence. The second strategy, called the Ten-class strategy, considers all the topics simultaneously. Thus, the developed deep neural network uses ten output classes, one per topic. Finally, the last strategy, called the Eleven-class strategy, is an evolution of the second one that adds an additional output to account for events that do not belong to any of the ten challenge topics.

In this section the proposed strategies, as well as the preprocessing and postprocessing stages, are described in detail.

Table 1. Topic IDs and titles.

Topic ID   Topic Title
LST001     Preparing Salad
LST002     VR Experiments
LST003     My Presentations
LST004     Interviewed by a TV presenter
LST005     Dinner at Home
LST006     Assembling Furniture
LST007     Taking a coach/bus in foreign countries
LST008     Costa Coffee with friends
LST009     Using mobile phone or tablets in a vehicle
LST010     Graveyard

2.1 Preprocessing

In order to limit the large volume of images, considering the given metadata and the topics, we decided to split the images into subdirectories automatically, using the Location and Activity tags of the metadata. Thus, two sets of directories were created, named after the values of the specific tag:
1. The Activity set included just 3 directories: transport, airplane and walking, plus a fourth one called No-activity, containing all the images with no activity information.
2. The Location set included 96 directories, plus a directory called No-place, containing all the images for which no named place was given.
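The directory split described above can be illustrated with a minimal Python sketch. It is not the authors' implementation (the experiments were run in Matlab); the record layout and the field names `image_id`, `activity` and `location` are assumptions made for the example, and only the fallback names No-activity and No-place come from the text.

```python
from collections import defaultdict

def group_by_tag(records, tag, fallback):
    """Map each value of `tag` to the list of image IDs carrying it."""
    groups = defaultdict(list)
    for rec in records:
        value = rec.get(tag)
        # Images whose tag is missing or empty go to the fallback directory.
        key = value.strip() if value and value.strip() else fallback
        groups[key].append(rec["image_id"])
    return dict(groups)

# Hypothetical metadata records; the field names are assumptions.
records = [
    {"image_id": "img_001", "activity": "walking", "location": "Home"},
    {"image_id": "img_002", "activity": "transport", "location": None},
    {"image_id": "img_003", "activity": None, "location": "Home"},
    {"image_id": "img_004", "activity": "", "location": "Work"},
]

activity_dirs = group_by_tag(records, "activity", "No-activity")
location_dirs = group_by_tag(records, "location", "No-place")
```

Each key of the resulting dictionaries plays the role of one directory, so a topic can be restricted to the few keys that match its description.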

This automatic classification allowed us to consider fewer images in the first retrieval used to train our systems. Thus, for each of the presented topics (Table 1), the corresponding directories were chosen according to the description and the restrictions (Fig.1), as presented in Table 2.

2.2 Two-class strategy

This strategy (Fig.2) had to be repeated for each topic separately. For each image the question is: Does it satisfy the topic? Thus, for each topic we have two classes, namely: True, where the correct images are included; and False, all the others. After a first retrieval, applied to the corresponding directory, the system is retrained and tested over all data.

Considering the directory sets from preprocessing, the required steps are:
1. Manual choice of true images: in most cases about 10 images were selected as True, mostly from the same event. Important exceptions were topics 006 and 010, for which few examples existed and, especially for 006, were difficult to find.
2. Training by using a pretrained CNN: the pretrained Convolutional Neural Networks AlexNet [13] or GoogleNet [14] were used.
3. Testing on the corresponding data (Table 1): the appropriate directories were chosen in accordance with the description and details given for the topic (Fig.1). The four co-authors discussed the various topics at length. However, we often had to ask the organizers for explanations due to cultural differences and definitions.
4. Manually splitting the results into the two classes: this is where the maximum search time of five minutes allowed per topic was used. A simple application was created that showed the images classified as True and asked the user for a YES or NO. The procedure was very fast. For most topics, 1-2 minutes were enough. Topic 008 required just a few seconds. The exception was topic 006: the negative results were so many that we kept only the True and False images reached within 5 minutes, so not all the images of the corresponding directory were used for the final training.
5. Training using the same pretrained CNN: the AlexNet or GoogleNet used in step 2 was also used here.
6. Testing on all data: the retrained CNN was applied to all 80,439 images.
7. Postprocessing, in order to adapt the results to the required format.

Three trials have been submitted with this strategy: one using AlexNet (subm#1), one using GoogleNet (subm#2) and one using the average of the two CNNs (subm#3).
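Step 4 above, the time-limited YES/NO review, can be sketched as follows. This is only an illustration in Python under stated assumptions, not the application the authors built: the function name `review`, the `ask` callback and the injectable `clock` are hypothetical, and the 300-second default reflects the five-minute limit mentioned in the text.

```python
import time

def review(candidates, ask, budget_seconds=300.0, clock=time.monotonic):
    """Split candidates into True/False lists until the time budget runs out."""
    true_set, false_set = [], []
    start = clock()
    for image_id in candidates:
        if clock() - start >= budget_seconds:
            break  # images not reached in time are left out of retraining
        if ask(image_id):
            true_set.append(image_id)
        else:
            false_set.append(image_id)
    return true_set, false_set

# Simulated answers stand in for the user's YES/NO input.
answers = {"a": True, "b": False, "c": True}
kept, rejected = review(["a", "b", "c"], ask=lambda image_id: answers[image_id])

# A fake clock shows the cut-off: the third image is reached too late.
ticks = iter([0, 100, 200, 400])
late_kept, late_rejected = review(
    ["a", "b", "c"], ask=lambda image_id: True, clock=lambda: next(ticks)
)
```

In an interactive setting, `ask` could display the image and return whether the typed answer equals YES; the images left unreviewed when the budget expires simply do not enter the retraining set, as happened for topic 006.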

Fig. 2. The Two-class strategy. 1st retrieval: manual choice of true images; training by using pretrained CNN; testing on the corresponding data (Table 1); manually splitting the results to the two classes. Then: training using the same pretrained CNN; testing on all data; postprocessing.

2.3 Ten-class strategy

This strategy (Fig.3) is applied just once for the ten topics. However, it requires the result of the first retrieval of the Two-class strategy (§2.2), which comprises steps 1-4. The True class of each topic is then created by merging the results of the previous strategy for both AlexNet and GoogleNet. These are the ten classes of this strategy. Thus, the strategy includes the steps:
1. Merging the True classes of AlexNet and GoogleNet after the 1st retrieval (Fig.2) for each topic, i.e. 10 classes.
2. Training a pretrained CNN: AlexNet or GoogleNet, using the ten classes.
3. Testing on all data: the retrained CNN was applied to all 80,439 images.
4. Postprocessing to adapt the results to the required format.
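The merging in step 1 can be read as a per-topic set union of the images confirmed as True for each network. The Python sketch below assumes that reading; the function name and the dictionary layout (topic ID mapped to image IDs) are hypothetical, and the text does not say how an image confirmed for more than one topic was handled.

```python
def merge_true_classes(alexnet_true, googlenet_true):
    """Union the reviewed True sets of both networks, topic by topic."""
    topics = sorted(set(alexnet_true) | set(googlenet_true))
    return {
        topic: set(alexnet_true.get(topic, ())) | set(googlenet_true.get(topic, ()))
        for topic in topics
    }

# Hypothetical reviewed results of the 1st retrieval for two topics.
alexnet_true = {"LST001": ["img_1", "img_2"], "LST002": ["img_7"]}
googlenet_true = {"LST001": ["img_2", "img_3"], "LST002": []}

ten_classes = merge_true_classes(alexnet_true, googlenet_true)
```

With all ten topics present, each key of `ten_classes` would hold the training images of one output class.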

Two trials have been submitted with this strategy: one using AlexNet (subm#4) and one using GoogleNet (subm#5). The AlexNet trial proved to be our best submission.

2.4 Eleven-class strategy

This strategy (Fig.4) is very similar to the previous one, with one more class: the class to which an image is assigned if it does not belong to any other. For training, this class was formed by merging all the False classes of the Two-class strategy, excluding the images already included in the classes of the Ten-class strategy. Thus, this strategy includes the steps:
1. Merging the True classes of AlexNet and GoogleNet after the 1st retrieval (Fig.2) for each topic, i.e. 10 classes.
2. Merging the False classes of the Two-class strategy, excluding the images included in the 10 classes.
3. Training a pretrained CNN: AlexNet or GoogleNet, using the eleven classes.
4. Testing on all data: the retrained CNN was applied to all 80,439 images.
5. Postprocessing to adapt the results to the required format.

One trial has been submitted with this strategy, using AlexNet (subm#6). It was not possible to submit the GoogleNet trial (subm#0) in time, since it required too much time to train due to the large number of images in the eleventh class, namely 37,063 images.
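Step 2 amounts to a set difference: everything rejected during the reviews, minus anything that ended up in one of the ten topic classes. A minimal Python sketch under that assumption follows; the function name and the input layout are hypothetical.

```python
def build_eleventh_class(false_classes, ten_classes):
    """Merge all False classes, dropping images already used by a topic class."""
    assigned = set().union(*ten_classes.values())
    rejected = set().union(*false_classes.values())
    return rejected - assigned

# Hypothetical inputs: per-topic False images and per-topic True classes.
false_classes = {"LST001": ["img_4", "img_7"], "LST002": ["img_1", "img_9"]}
ten_classes = {"LST001": {"img_1", "img_2"}, "LST002": {"img_7"}}

eleventh = build_eleventh_class(false_classes, ten_classes)
```

Here `img_7` was rejected for one topic but accepted for another, so it stays out of the eleventh class, which is the exclusion the step describes.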

2.5 Postprocessing

Subtask 2 of ImageCLEFlifelog 2018 requires submissions as a CSV file in the following format:

[topic id, image id, confidence score]    (1)

where:
- topic id: number of the queried topic, e.g., 1 to 10;
- image id: ID of a relevant image;
- confidence score: from 0 to 1.

The CSV file should contain a diversified summarization in 50 images for each query.

The postprocessing procedure creates the CSV file automatically and is the same for the three strategies, using the probabilities of the classification layer of the CNN. Thus, the images are ranked by probability from high to low for each result class (the True class of the Two-class strategy and the ten topic classes of the Ten-class and Eleven-class strategies).

The first 50 images that satisfy the following conditions are chosen as correct:

• They have a corresponding ID in the metadata: the organizers accepted as possibly correct only the images that were labeled in the metadata with an ID number.
• They satisfy all the rules, e.g. in topic 005, since dinner is required, the time is required to be later than 15:00.
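The ranking and filtering just described can be sketched in Python as below. This is a sketch under stated assumptions rather than the authors' Matlab code: the helper names, the three-decimal score formatting and the `capture_time` lookup are hypothetical, while the 50-image limit, the ID check and the 15:00 rule for topic 005 come from the text.

```python
import csv
import io
from datetime import time

def select_top(scored, has_id, rule=lambda image_id: True, limit=50):
    """Rank by probability (high to low) and keep the first `limit` valid images."""
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    return [(img, p) for img, p in ranked if has_id(img) and rule(img)][:limit]

def write_rows(writer, topic_id, selected):
    """Emit one [topic id, image id, confidence score] row per selected image."""
    for image_id, prob in selected:
        writer.writerow([topic_id, image_id, f"{prob:.3f}"])

# Hypothetical class probabilities and metadata for topic 005 (Dinner at Home).
scored = {"img_a": 0.91, "img_b": 0.99, "img_c": 0.75, "img_d": 0.60}
ids_in_metadata = {"img_a", "img_b", "img_d"}
capture_time = {"img_a": time(19, 30), "img_b": time(12, 5),
                "img_c": time(20, 0), "img_d": time(18, 45)}

selected = select_top(
    scored,
    has_id=lambda img: img in ids_in_metadata,
    rule=lambda img: capture_time[img] > time(15, 0),  # dinner rule
)

buffer = io.StringIO()
write_rows(csv.writer(buffer, lineterminator="\n"), 5, selected)
csv_text = buffer.getvalue()
```

In this toy input the highest-scoring image is dropped by the time rule and another by the missing ID, so the filters are applied before the cut to the first 50 images, not after it.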

3 Experimental results

For assessing performance, the organizers proposed the classic retrieval metrics, specifically:
• Cluster Recall at X (CR@X): assesses how many different clusters from the ground truth are represented among the top X results;
• Precision at X (P@X): measures the number of relevant photos among the top X results;
• F1-measure at X (F1@X): the harmonic mean of the previous two.

All the presented results have been obtained using Matlab on a computer with an Intel(R) Core™ i7-7700HQ CPU @ 2.80 GHz x8 and an NVIDIA GeForce GTX 1060 GPU. The exception was the trial that was not submitted, due to its extreme training requirements. It was finally run on a computer with an Intel(R) Core™ i9-7900X CPU @ 3.30 GHz x10 and GPU NVIDIA Corporation Device 1b02 x2.
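The three metrics listed above can be written down in a few lines of Python. The sketch assumes that the ground truth maps every relevant image to a cluster label and that P@X is the fraction of the top X results that are relevant; the official evaluation script may differ in details, and the function names are hypothetical.

```python
def precision_at(ranked, relevant, x):
    """Fraction of the top-x results that are relevant."""
    return sum(1 for img in ranked[:x] if img in relevant) / x

def cluster_recall_at(ranked, cluster_of, x):
    """Fraction of ground-truth clusters represented in the top-x results."""
    total = len(set(cluster_of.values()))
    found = {cluster_of[img] for img in ranked[:x] if img in cluster_of}
    return len(found) / total if total else 0.0

def f1_at(ranked, cluster_of, x):
    """Harmonic mean of P@x and CR@x."""
    p = precision_at(ranked, set(cluster_of), x)
    cr = cluster_recall_at(ranked, cluster_of, x)
    return 2 * p * cr / (p + cr) if p + cr else 0.0

# Hypothetical ranking and ground truth (relevant image -> cluster label).
ranked = ["a", "b", "c", "d", "e"]
cluster_of = {"a": 1, "c": 1, "e": 2}

p5 = precision_at(ranked, set(cluster_of), 5)
cr5 = cluster_recall_at(ranked, cluster_of, 5)
f5 = f1_at(ranked, cluster_of, 5)
f2 = f1_at(ranked, cluster_of, 2)
```

In the toy ranking, three of the five results are relevant and both clusters are covered, so P@5 = 0.6 and CR@5 = 1, and the harmonic mean gives F1@5 = 2(0.6)(1)/(1.6) = 0.75.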

Submission ID   Strategy       CNN
subm#1          Two-class      AlexNet
subm#2          Two-class      GoogleNet
subm#3          Two-class      Average
subm#4          Ten-class      AlexNet
subm#5          Ten-class      GoogleNet
subm#6          Eleven-class   AlexNet
subm#0          Eleven-class   GoogleNet

The official ranking metric this year is F1-measure@10, which gives equal importance to diversity (via CR@10) and relevance (via P@10). In Table 3, indicative results of F1@10 are given for all the mentioned submissions (subm#1-6), plus the non-submitted trial of the third strategy (subm#0).

In Table 4, F1@X is considered for various cut-off points, with X = 5, 10, 20, 30, 40, 50, for all the proposed techniques. Finally, Tables 5-11 give all the detailed results for submissions 1-6, plus the non-submitted trial.

[Tables 3-11: only the following columns of the detailed results survive; the submission they belong to and the headers of the first two columns are not recoverable.]

Topic    (n/a)   (n/a)   CR@30   F1@30   P@40    CR@50   F1@50
LST001   0.66    0.8     1       0.824   0.625   1       0.75
LST002   0.22    0.12    1       0.286   0.175   1       0.276
LST003   1       0.98    0.667   0.713   0.825   0.667   0.751
LST004   0.98    0.96    1       0.966   0.95    1       0.958
LST005   0.72    0.78    0.167   0.274   0.775   0.292   0.422
LST006   0       0       0       0       0       0       0
LST007   0.7     0.96    0.833   0.72    0.65    0.833   0.724
LST008   1       1       0.75    0.857   1       0.75    0.857
LST009   0.36    0.1     0.4     0.416   0.575   0.4     0.46
LST010   0.54    0.34    1       0.462   0.25    1       0.361
Mean     0.618   0.604   0.682   0.552   0.583   0.694   0.556

4 Conclusions

... to handle a huge number of images for retrieving moments for ten specific topics. Three different strategies were proposed in order to respond to the ten topics. All of them used deep learning, specifically AlexNet and GoogleNet.

Apart from the number of images, other issues we had to deal with were cultural differences, e.g. what time dinner is eaten in the specific country, as well as differences in definitions, e.g. for some people a vehicle is what moves on the road, while for others it can be any means of transport. Last but not least, the interpretation of the topics by the participants could also be a problem, e.g. what does Assembling Furniture include?

The detailed results, given by the organizers and presented in section 3, require much more experimentation and further study. For example, topic LST004, Interviewed by a TV presenter, almost always gave a result very close to 1, while LST006, Assembling Furniture, always gave 0. The latter means that no correct image was among the ones we chose as True. Thus, the organizers could consider the possibility of giving 1-2 correct images per topic at the beginning of the competition. In any case, it is a challenge that can open many new research fields and is worth considering.

Acknowledgements

This work has been partially supported by the Ministerio de Economía, Industria y Competitividad (AEI/FEDER) of the Spanish Government under project TEC201675981 (IVME).

References

1. Allen, Anita L.: Dredging up the past: Lifelogging, memory, and surveillance. The University of Chicago Law Review, vol. 75, no. 1, pp. 47-74 (2008).

2. O'Hara, K., Tuffield, M. M., & Shadbolt, N.: Lifelogging: Privacy and empowerment with memories for life. Identity in the Information Society, 1(1), 155-172 (2008).

3. Magazine, G.: LifeLog: DARPA looking to record lives of interested parties. https://www.geek.com/news/lifelog-darpa-looking-to-record-lives-of-interested-parties552879/ (2013), retrieved on 28-5-2018.

4. Gemmell, J., Bell, G., Lueder, R., Drucker, S., & Wong, C.: MyLifeBits: fulfilling the Memex vision. In Proceedings of the tenth ACM international conference on Multimedia (pp. 235-238). ACM (2002).

5. Gemmell, J., Bell, G., & Lueder, R.: MyLifeBits: a personal database for everything. Communications of the ACM, 49(1), 88-95 (2006).

6. Sueda, K., Miyaki, T., & Rekimoto, J.: Social geoscape: visualizing an image of the city for mobile UI using user generated geo-tagged objects. In International Conference on Mobile and Ubiquitous Systems: Computing, Networking, and Services (pp. 1-12). Springer, Berlin, Heidelberg (2011).

7. Heo, S., Kang, K., & Bae, C.: Lifelog collection using a smartphone for medical history form. In IT Convergence and Services (pp. 575-581). Springer, Dordrecht (2011).

8. Jacquemard, T., Novitzky, P., O'Brolcháin, F., Smeaton, A. F., & Gordijn, B.: Challenges and opportunities of lifelog technologies: A literature review and critical analysis. Science and Engineering Ethics, 20(2), 379-409 (2014).

9. Cathal Gurrin, Hideo Joho, Frank Hopfgartner, Liting Zhou, Rami Albatal: Overview of NTCIR-12 Lifelog Task. Proceedings of the 12th NTCIR Conference on Evaluation of Information Access Technologies, Tokyo, Japan (2016).

10. Duc-Tien Dang-Nguyen, Luca Piras, Michael Riegler, Giulia Boato, Liting Zhou, Cathal Gurrin: Overview of ImageCLEFlifelog 2017: Lifelog Retrieval and Summarization. CLEF2017 Working Notes, Dublin, Ireland, vol. 1866 (2017).

11. Duc-Tien Dang-Nguyen, Luca Piras, Michael Riegler, Liting Zhou, Mathias Lux, Cathal Gurrin: Overview of ImageCLEFlifelog 2018: Daily Living Understanding and Lifelog Moment Retrieval. CLEF2018 Working Notes. CEUR Workshop Proceedings (2018).

12. Bogdan Ionescu, Henning Muller, Mauricio Villegas, Alba Garcia Seco de Herrera, Carsten Eickhoff, Vincent Andrearczyk, Yashin Dicente Cid, Vitali Liauchuk, Vassili Kovalev, Sadid A. Hasan, Yuan Ling, Oladimeji Farri, Joey Liu, Matthew Lungren, Duc-Tien Dang-Nguyen, Luca Piras, Michael Riegler, Liting Zhou, Mathias Lux, Cathal Gurrin: Overview of ImageCLEF 2018: Challenges, Datasets and Evaluation. Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018). LNCS Lecture Notes in Computer Science, Springer (2018).

13. Krizhevsky, A., Sutskever, I., & Hinton, G. E.: Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097-1105 (2012).

14. Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich: Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9 (2015).