UA.PT Bioinformatics at ImageCLEF 2019: Lifelog Moment Retrieval based on Image Annotation and Natural Language Processing

Ricardo Ribeiro, António J. R. Neves, and José Luis Oliveira
IEETA/DETI, University of Aveiro, 3810-193 Aveiro, Portugal
{rfribeiro,an,jlo}@ua.pt

Abstract. The increasing number of mobile and wearable devices is dramatically changing the way we can collect data about a person's life. These devices allow recording our daily activities and behavior in the form of images, video, biometric data, location, and other data. This paper describes the participation of the Bioinformatics group of the Institute of Electronics and Informatics Engineering of the University of Aveiro in the ImageCLEF lifelog task, more specifically in the Lifelog Moment Retrieval sub-task. The approach to solve this sub-task is divided into three stages. The first one is the pre-processing of the lifelog dataset, in which the images that contain relevant information are selected, in order to reduce the number of images to be processed, and additional visual information and concepts are extracted from the selected images. In the second stage, the query topics are analyzed using Natural Language Processing tools to extract relevant words for retrieving the desired moment. These words are compared with the visual concept words obtained in the pre-processing stage, using a pre-trained word2vec model, to compute a confidence score for each processed image. An additional stage is used in the last two runs, in order to include the images not processed in the first stage and improve the results of our approach. A total of 6 runs were submitted and the results obtained show an improvement with each submission. Although the results are not yet competitive with other teams, this challenge is a good starting point for our research work. We intend to continue the development of a lifelogging application in the context of a research project, so we expect to participate in next year's ImageCLEFlifelog task.

Keywords: lifelog · moment retrieval · image processing

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

1 Introduction

In the past few years, with the increase of wearable and smart technologies, the term lifelogging has received significant attention from both research and commercial communities. There is no general definition for lifelogging, but an appropriate one is given by Dodge and Kitchin [3] as "a form of pervasive computing, consisting of a unified digital record of the totality of an individual's experiences, captured multi-modally through digital sensors and stored permanently as a personal multimedia archive" [4]. Put simply, lifelogging is the process of tracking and recording personal data created by our activities and behavior. The number of workshops and research tasks has increased over the last few years; among them are the main tasks of the ImageCLEF 2019 lab [6]: lifelogging, medicine, nature, and security. The lifelogging task aims to bring lifelogging to the attention of as wide an audience as possible and to promote research into some of the key challenges of the coming years [2]. Our motivation for this work is the great potential that personal lifelogs have in numerous applications, including memory and moment retrieval, daily living understanding, diet monitoring, or disease diagnosis, among others.
For example, people with Alzheimer's disease have memory problems; with a lifelog application, the person with the disease can be followed by a specialist, or the application can help the person remember certain moments or activities of the last days or months.

This paper is organized as follows: the paper starts with this introductory section. Section 2 provides a brief introduction to the ImageCLEF lifelog task and the Lifelog Moment Retrieval (LMRT) sub-task. The proposed approach used in our best run is described in Section 3. In Section 4, the results of all runs submitted to the LMRT sub-task are presented, and the differences of each run compared to the implementation of our best result are described. Finally, a summary of the work presented in this paper, concluding remarks, and future work are presented in Section 5.

2 Task Description

The ImageCLEFlifelog 2019 task [1] is divided into two different sub-tasks: the Lifelog Moment Retrieval (LMRT) sub-task and the Puzzle sub-task. In this work, we only addressed the LMRT sub-task, as a starting point for a line of research that we intend to develop with the aim of addressing some real-world problems. In the LMRT sub-task, the participants have to retrieve a number of specific predefined activities in a lifelogger's life. For example, they should return the relevant moments for the query "Find the moment(s) when I was shopping". Particular attention should be paid to the diversification of the selected moments with respect to the target scenario. The ground truth for this sub-task was created using manual annotation [1].

The ImageCLEFlifelog dataset is a completely new, rich multimodal dataset which consists of 29 days of data from one lifelogger, namely:
– images (1,500-2,500 per day from wearable cameras);
– visual concepts (automatically extracted visual concepts with varying rates of accuracy);
– semantic content (semantic locations, semantic activities) based on sensor readings (via the Moves App) on mobile devices;
– biometric information (heart rate, galvanic skin response, calorie burn, steps, continual blood glucose, etc.);
– music listening history;
– computer usage (frequency of typed words via the keyboard and information consumed on the computer via ASR of on-screen activity on a per-minute basis) [1].
However, in this work we only use the images, the visual concepts, and the semantic content of the dataset.

3 Proposed Method

In this sub-task, we submitted 6 runs. Although our results have a low F1-measure@10 on the test topics, this sub-task has become a good starting point for our research work in lifelogging. In this section, we present the proposed approach used in the last submission (run 6 of Table 1), which was our best result. More details about the other runs are given in Section 4. The final approach used for the LMRT task is divided into three stages:

– Pre-Processing: The large amount of data (images) is analyzed and some images are excluded based on a low-level image analysis algorithm proposed by the authors, in order to reduce the search time needed to analyze each topic. The images that are considered to be valuable are then processed using several state-of-the-art algorithms to extract information from them.
– Retrieval: The relevant words of the query topic are extracted using Natural Language Processing (NLP) tools. These words are compared with the information obtained from the images in the pre-processing stage, through a state-of-the-art word embedding model. Finally, we assign a score to each analyzed image for the query topic.
– Post-Processing: Images that were skipped in the pre-processing stage are reused depending on a defined distance from the images selected in the retrieval stage, in order to fulfill the goal of the sub-task, where all the images are used and annotated.

3.1 Pre-Processing

We consider the pre-processing of the image dataset a very important stage in a lifelogging application, in order to select the relevant images, reduce the processing time, and reduce annotation errors by extracting only relevant information from the lifelog images.

In this stage, we propose a method for the automatic selection of the lifelog images that contain relevant information, using a blur/focus measure operator called the modified Laplacian. We use this operator to extract low-level features and a machine learning algorithm, namely k-nearest neighbors, to classify these features and decide whether an image is valuable in this context (a minimal sketch of this selection step is given at the end of this subsection). Figure 1 shows a block diagram presenting the steps used in the proposed method. This proposal is described in a manuscript submitted by the authors but not yet published. Images that are not selected in this stage are not processed in the retrieval stage, and can be reused in the post-processing stage to fulfill the sub-task's expected results.

Fig. 1. Block diagram of the main steps implemented in the proposed method.

In a lifelogging application, the most important characteristics that we can extract from images are the objects and the elements that a certain environment contains. Some content of the selected images was extracted using the label detection of the Google Cloud Vision API, YoloV3 [8], and the information provided by the organizers (location, activity, and visual concepts). The data associated with each image is stored in JSON files, one for each day of the lifelogger.
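To make the selection step concrete, a minimal Python sketch is given below, assuming a grayscale image, the modified Laplacian as the only low-level feature, and a k-nearest neighbors classifier trained on a handful of manually labelled examples. The full feature set, window handling, and training data of our actual implementation are described in the manuscript mentioned above; everything in this sketch beyond the focus measure itself is therefore an assumption.

```python
import cv2
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def modified_laplacian(gray):
    """Focus measure: mean of |d2I/dx2| + |d2I/dy2| over the image."""
    kernel_x = np.array([[-1.0, 2.0, -1.0]])   # second derivative along x
    kernel_y = kernel_x.T                       # second derivative along y
    img = gray.astype(np.float64)
    lap_x = cv2.filter2D(img, -1, kernel_x)
    lap_y = cv2.filter2D(img, -1, kernel_y)
    return float(np.mean(np.abs(lap_x) + np.abs(lap_y)))

# Hypothetical training data: focus measures of images labelled as
# valuable (1) or not valuable (0) by a human annotator.
features = [[18.2], [1.3], [25.7], [0.9], [11.5], [2.4]]
labels = [1, 0, 1, 0, 1, 0]
knn = KNeighborsClassifier(n_neighbors=3).fit(features, labels)

def is_valuable(image_path):
    """Decide whether a lifelog image is worth passing to the retrieval stage."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    return bool(knn.predict([[modified_laplacian(gray)]])[0])
```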
3.2 Retrieval

In the retrieval stage, the images are selected according to the query topic entered for the desired moment search. We use the SpaCy library [5] to analyze the topic narrative and extract relevant words. These words are divided into five categories: "activities", "locations", "relevant things", "other things", and "other words". In order to assign words to each category we define some linguistic rules, namely semantic and syntactic rules. Semantics builds the meaning of a sentence from its words and how those words combine syntactically, while syntax refers to the rules that govern the ways in which words combine to form phrases and sentences. For example, if the sentence has an auxiliary verb, the main verb usually corresponds to an activity, and the words that follow the main verb may be things or locations involved in this activity. Figure 2 presents the linguistic annotations generated by the SpaCy library for the narrative of test topic number 10. The words extracted from this topic are "attending", "meeting", and "China", which are assigned to the categories "activities", "relevant things", and "locations", respectively.

Fig. 2. Linguistic annotations generated by the SpaCy library [5].

A comparison is made between the words extracted from the narrative of the query topic and the concept words of each image selected in the pre-processing stage, using a pre-trained word2vec model. We used a model trained on part of the Google News dataset (about 100 billion words). This model contains 300-dimensional vectors for 3 million words and phrases. The gensim library [9] allows us to load the word2vec model and compute the cosine similarity between words. For each category defined previously we obtain a similarity value (from 0 to 1). As the concepts that we have for each image are neither numerous nor accurate enough to decide whether the image corresponds to the query topic, we also use the sequence of images within a certain distance of the image being analyzed, and their similarities, to assign a score to each category. The categories have different weights (the weights of all categories sum to 1), and the confidence score of each image is computed from these weights and the category scores. Finally, we define a general threshold to select the images for the query topics: images with a confidence score above the threshold are selected for the query topic.
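The word-level comparison can be reproduced with gensim roughly as follows. The category weights, the use of the best similarity per category, and the helper names are assumptions for illustration only; the paper does not give the exact weight values or aggregation rule, and the neighbourhood-based smoothing over the image sequence is omitted from this sketch.

```python
from gensim.models import KeyedVectors

# Pre-trained Google News vectors (300 dimensions); the file name is an assumption.
model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Hypothetical category weights; the paper only states that they sum to 1.
CATEGORY_WEIGHTS = {"activities": 0.3, "locations": 0.25, "relevant things": 0.25,
                    "other things": 0.1, "other words": 0.1}

def confidence_score(topic_words, image_concepts):
    """Combine per-category word similarities into a single image score.

    topic_words: dict mapping a category name to the topic words in that category.
    image_concepts: list of concept words attached to one image.
    """
    score = 0.0
    for category, words in topic_words.items():
        sims = [model.similarity(w, c)
                for w in words for c in image_concepts
                if w in model and c in model]
        category_score = max(sims) if sims else 0.0   # best match per category (an assumption)
        score += CATEGORY_WEIGHTS[category] * category_score
    return score

topic = {"activities": ["attending"], "relevant things": ["meeting"],
         "locations": ["China"], "other things": [], "other words": []}
print(confidence_score(topic, ["conference", "projector", "indoor"]))
```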
3.3 Post-Processing

In the last stage, some images that were not analyzed in the pre-processing stage are reused to improve the set of result images for each topic. The images between two images selected in the retrieval stage are re-selected if the sequential distance between these two images does not exceed a certain threshold, obtained experimentally. Figure 3 shows three images selected for query topic number 10 as the final output of the proposed approach. Image (b) was rejected in the pre-processing stage, while images (a) and (c) were selected in the retrieval stage. Then, in the post-processing stage, since image (b) is sequentially between images (a) and (c), it was re-selected.

Fig. 3. Some selected images for query topic number 10: (a) u1_20180523_0022_i05, (b) u1_20180523_0022_i06, (c) u1_20180523_0022_i07.
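The gap-filling rule can be expressed in a few lines of Python. The sketch below assumes that the images of one day are identified by their position in the chronological sequence; the function names and the value of max_gap are placeholders, the actual threshold having been obtained experimentally.

```python
def fill_gaps(selected_ids, all_ids, max_gap=5):
    """Re-select skipped images lying between two retrieved images whose
    sequential distance is at most max_gap.

    all_ids: all image identifiers of one day, in chronological order.
    selected_ids: identifiers chosen in the retrieval stage.
    """
    index = {img: i for i, img in enumerate(all_ids)}
    positions = sorted(index[img] for img in selected_ids)
    result = set(selected_ids)
    for a, b in zip(positions, positions[1:]):
        if b - a <= max_gap:
            result.update(all_ids[a + 1:b])   # reuse the images inside the gap
    return result

day = [f"img_{i:02d}" for i in range(10)]
print(sorted(fill_gaps({"img_02", "img_05"}, day)))  # img_03 and img_04 are re-selected
```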
4 Results

We submitted 6 runs to the LMRT sub-task. In this sub-task, the final score is computed as the arithmetic mean over all queries. The ranking metric was F1-measure@10, the harmonic mean of Cluster Recall at the top 10 results (CR@10), which accounts for diversity, and Precision at the top 10 results (P@10), which accounts for relevance.

We describe the last submission (run 6) in Section 3; the other submissions follow the same pre-processing approach. However, they differ in the retrieval stage, and the post-processing stage was only implemented in the last two submissions (run 5 and run 6).

4.1 Run 1

In the first submission (run 1), the query topic is analyzed using the title and the narrative. From the title, stop words were removed using the scikit-learn tools [7]. From the narrative, the nouns and the main verbs associated with an auxiliary verb were extracted using the SpaCy library [5], without linguistic rules. Then, we used the same word2vec model described for run 6 to obtain the similarity between the topic words and the concept words extracted in the pre-processing stage. In this run, the topic words were not divided into categories. We check whether the topic words have a similarity above a threshold and whether two or more topic words score above that threshold; in that case, the sequence of images within a certain distance is selected. The confidence score is the same for all selected images.

4.2 Run 2

In the second submission (run 2), we used the same approach as run 1, but only the title was used in the query topic analysis.

4.3 Run 3

In this submission we only analyze the narrative of the query topic, and we define linguistic rules to extract relevant words. The words were divided into four categories, that is, the same categories as run 6 but without the "other words" category. We check whether the words in the categories have a similarity above a threshold and whether two or more category scores are above that threshold; in that case, the sequence of images is selected. The confidence score is given by the number of categories whose score is above that threshold.

4.4 Run 4

This submission follows the same approach as run 3, but some thresholds were adjusted and the fifth category described for run 6 is used.

4.5 Run 5

In this submission, we reorganized the implementation of run 4 and defined different weights for each category of words in order to compute the confidence score of each image. We also added the post-processing stage to our implementation.

4.6 UA.PT Bioinformatics Results

The results obtained are shown in Table 1, along with the best result in this task for comparison. The results of all participating teams can be found in [1]. We can observe that our results are still far from the best ones in this task, but we consider that our first participation in ImageCLEF lifelog 2019 was an excellent starting point for our research work. Moreover, the best team had already participated in previous editions.

Table 1. F1-measure@10 of each run submitted by us and the best team run in the LMRT task.

Team    Run Name  F1-measure@10 (%)
Our     Run 1      1.6
        Run 2      2.6
        Run 3      2.7
        Run 4      2.7
        Run 5      3.6
        Run 6      5.7
HCMUS   Run 2     61.04

One of the main problems in our approaches is the limited information and visual concepts extracted from the images of the lifelog data; for example, the word "toyshop" never appears associated with an image. For this reason, half of the analyzed query topics did not obtain any result in the evaluation metrics. Using other state-of-the-art algorithms and APIs to obtain a richer description of the images may increase the performance. Some visual concepts of the images are in the form of bigrams or trigrams, for example "electronic device", "ice cream", "cell phone", or "car interior", among others. As our approach only computes the cosine similarity between two single words, some of the visual concepts are lost and the results of our approaches decrease. To address this problem, the identification of bigrams and trigrams is one of the future improvements planned for our application. Another way to increase the performance of our work is the development of new linguistic rules to analyze the description of the query topic and obtain more information for the retrieval stage, for example, identifying negative sentences and excluding objects or environments that are not relevant.
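As an illustration of the planned bigram/trigram handling (not used in any submitted run), gensim already offers two simple options: n_similarity, which compares the mean vectors of two word sets, and direct lookup of underscore-joined phrases, many of which may already exist in the Google News vocabulary since that model was trained with phrase detection. The file name below is an assumption, as in the earlier sketches.

```python
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Compare a multi-word visual concept with a topic word by averaging word vectors.
print(model.n_similarity(["ice", "cream"], ["dessert"]))
print(model.n_similarity(["cell", "phone"], ["smartphone"]))

# Common phrases may also exist directly in the vocabulary, joined by underscores.
if "ice_cream" in model:
    print(model.similarity("ice_cream", "dessert"))
```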
5 Conclusion and Future Work

The Lifelog Moment Retrieval (LMRT) sub-task of ImageCLEF lifelog 2019 was an excellent starting point for our research work in lifelogging. Although our results have a low score in this sub-task, we observe an improvement in each submitted run. Following these results, our goal is to continue the development and improvement of our implementation. For future work, we intend to further survey state-of-the-art methods for the recognition of visual concepts and for text mining. In order to develop an efficient application, we are going to create ontologies for daily activities and hierarchical relationships for the words that can appear in the visual concepts. Developing a user interface is also one of our priorities, for user interaction and visualization of the search results.

6 Acknowledgments

Supported by the Integrated Programme of SR&TD SOCA (Ref. CENTRO-01-0145-FEDER-000010), co-funded by the Centro 2020 program, Portugal 2020, European Union, through the European Regional Development Fund.

References

1. Dang-Nguyen, D.T., Piras, L., Riegler, M., Tran, M.T., Zhou, L., Lux, M., Le, T.K., Ninh, V.T., Gurrin, C.: Overview of ImageCLEFlifelog 2019: Solve My Life Puzzle and Lifelog Moment Retrieval. In: CLEF2019 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Lugano, Switzerland (September 9-12, 2019)
2. Dang-Nguyen, D.T., Piras, L., Riegler, M., Zhou, L., Lux, M., Gurrin, C.: Overview of ImageCLEFlifelog 2018: Daily living understanding and lifelog moment retrieval. In: CLEF2018 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Avignon, France (2018)
3. Dodge, M., Kitchin, R.: 'Outlines of a world coming into existence': pervasive computing and the ethics of forgetting. Environment and Planning B: Planning and Design 34(3), 431-445 (2007)
4. Gurrin, C., Smeaton, A.F., Doherty, A.R., et al.: Lifelogging: Personal big data. Foundations and Trends in Information Retrieval 8(1), 1-125 (2014)
5. Honnibal, M., Montani, I.: spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear (2017)
6. Ionescu, B., Müller, H., Péteri, R., Cid, Y.D., Liauchuk, V., Kovalev, V., Klimuk, D., Tarasau, A., Abacha, A.B., Hasan, S.A., Datla, V., Liu, J., Demner-Fushman, D., Dang-Nguyen, D.T., Piras, L., Riegler, M., Tran, M.T., Lux, M., Gurrin, C., Pelka, O., Friedrich, C.M., de Herrera, A.G.S., Garcia, N., Kavallieratou, E., del Blanco, C.R., Rodríguez, C.C., Vasillopoulos, N., Karampidis, K., Chamberlain, J., Clark, A., Campello, A.: ImageCLEF 2019: Multimedia retrieval in medicine, lifelogging, security and nature. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 10th International Conference of the CLEF Association (CLEF 2019), LNCS Lecture Notes in Computer Science, Springer, Lugano, Switzerland (September 9-12, 2019)
7. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825-2830 (2011)
8. Redmon, J., Farhadi, A.: YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
9. Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. pp. 45-50. ELRA, Valletta, Malta (May 2010), http://is.muni.cz/publication/884893/en