Multimedia Lab @ ImageCLEF 2018 Lifelog Moment Retrieval Task

Mihai Dogariu and Bogdan Ionescu

Multimedia Lab, CAMPUS, University Politehnica of Bucharest, Romania
mdogariu@imag.pub.ro, bionescu@alpha.imag.pub.ro

Abstract. This paper describes the participation of the Multimedia Lab team at the ImageCLEF 2018 Lifelog Moment Retrieval Task. Our method makes use of visual information, text information and metadata. Our approach consists of the following steps: we reduce the number of images to analyze by eliminating the ones that are blurry or do not meet certain metadata criteria, extract relevant concepts with several Convolutional Neural Networks, perform K-means clustering on Histogram of Oriented Gradients and color histogram features, and rerank the remaining images according to a relevance score computed between each image concept and the queried topic.

Keywords: Lifelog · CNN · Imagenet · Places365 · MSCOCO · Food101

1 Introduction

Recent technological advancements have resulted in the development of numerous wearable devices that can successfully help users track their daily activity. Examples of such devices include wearable cameras, smart watches or fitness bracelets. Each of these provides information regarding its user's activity, and combining the outputs of all such devices can result in a highly detailed description of the person's habits, schedule or actions. However, continuous acquisition of data can lead to cumbersome archives of information which, in turn, can become too difficult to handle, up to the point where it becomes inefficient to try to use them. As part of the ImageCLEF 2018 evaluation campaign [7], the Lifelog tasks [4] aim to address these problems.

This paper presents our participation in the Lifelog Moment Retrieval (LMR) task, in which participants have to retrieve a number of specific moments in a lifelogger's life, given a text query. Moments are defined as semantic events or activities that happened throughout the day. For each query, a total of 50 images are expected to be retrieved, both relevant and diverse, with the official metric being the F1@10 measure.

The rest of the paper is organized as follows. In Section 2 we discuss related work from the literature, in Section 3 we present our proposed system, in Section 4 we discuss the results and in Section 5 we conclude the paper.

2 Related Work

In this section we briefly discuss recent results obtained in similar competitions. The organizing team of the ImageCLEF 2017 Lifelog tasks [3] proposed a pipeline in which they perform a segmentation of the dataset based on time and concept metadata. In parallel, they analyzed each query and extracted the relevant information that can be applied to the given metadata. After extracting only the images that fit the previous criteria, they performed an additional filtering of images and removed those that contain large objects or are blurry. The last step involved a diversification of images through hierarchical clustering.

A similar technique was used by [12] in their submission to the same competition. In addition, they also used the image descriptors obtained by running each image through different Convolutional Neural Networks (CNNs), i.e., they extracted object and place feature vectors to which they added a human detection CNN. Each image was assigned a relevance score obtained by comparing the feature vector to a reference vector on a per-topic basis. Their chosen clustering approach was K-means [11].
The same authors use a very similar system in [9], where they further add a temporal smoothing element. A somewhat different system was adopted in [17], where the authors combined a visual indexing method similar to the ones in [12, 9] with a location indexing method.

In our previous participation [5] we also applied a filtering procedure, first based on the metadata and later on the similarity between the topic queries and the feature vector, which consisted of detected concepts. This filtering was followed by a hierarchical clustering step. We learned that in order for this technique to work there has to be a strong correlation between the queries and the detected concepts. Also, enumerating the items that needed to be present in the image significantly improved the results.

This paper combines the benefits of [12] and [3], to which it adds two more feature vectors. Moreover, we explore the impact that supervised fine-tuning has on the final results and present the outcome of 5 different techniques.

3 Proposed Approach

Our approach follows the pipeline presented in Figure 1. Each of the processing steps is detailed in the following. The output of the system is a list of 50 images for each of the proposed 10 topics, which are both relevant and diverse with respect to the query.

Fig. 1. Processing pipeline.

3.1 Blur filtering

We first apply a blur filtering over the entire dataset. We compute a focus measure for each image as the variance of its response to a Laplacian kernel. If an image has a focus measure below an imposed threshold then it is discarded from further processing. Choosing the threshold requires several trials to see what works best for the dataset at hand. Imposing a low value on the threshold results in a permissive filter, leading to a low number of discarded images, whereas a high threshold could wrongly discard images of acceptable quality. We found that a threshold value of 60 leads to satisfactory results. We decided to allow the filter to be slightly permissive so that we do not reject true positives. In the end, from the total of 80.5k images we discard 16.5k blurry images, leaving us with 64k images to process. Another advantage of this technique is that it also filters out uninformative images that contain large homogeneous areas, such as images where the camera was facing the ceiling or a wall.

3.2 Concepts extraction

In the second step of our algorithm we run each of the remaining 64k images through several classifiers and a detector. We use 3 image-level classifiers and one object detector, to which we also add the concept detector information provided by the organizers. All of these systems are implemented using CNNs, as described below.

Imagenet classifier

A common practice for detecting several concepts in an image is to run it through an image classifier trained on the popular Imagenet dataset [8]. This yields a 1000-D vector with values corresponding to the confidence of associating the entire image with a certain concept. We use a ResNet50 [6] implementation trained on Imagenet. However, there are two important aspects that need to be considered when implementing this technique. The first one is that the classifier is trained to predict a single concept for the entire image, whereas lifelog images contain numerous objects that might be of interest for the retrieval task. The second aspect is that out of the 1000 classes only a small part is relevant, with the vast majority of these concepts unlikely to be met in a person's daily routine. This leads to noisy classification, diminishing the usefulness of this classifier.
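These first two stages can be prototyped with standard libraries. The sketch below is a minimal illustration, not the exact competition code: it assumes OpenCV for the Laplacian-variance focus measure and a Keras ResNet50 pre-trained on Imagenet; the threshold value of 60 follows Section 3.1, while the helper names are ours.

```python
# Sketch: Laplacian-variance blur filter followed by Imagenet concept
# extraction with a pre-trained ResNet50 (illustrative, not the exact code).
import cv2
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image

BLUR_THRESHOLD = 60.0            # focus-measure threshold from Section 3.1
model = ResNet50(weights="imagenet")

def focus_measure(path):
    """Variance of the Laplacian response; low values indicate blur."""
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def imagenet_concepts(path):
    """Returns the 1000-D confidence vector for one image."""
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return model.predict(x)[0]   # shape: (1000,)

def process(path):
    if focus_measure(path) < BLUR_THRESHOLD:
        return None              # image discarded as blurry
    return imagenet_concepts(path)
```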
Places classifier

The second classifier that we implement is meant to predict the place depicted in the image. We use the VGG16 [14] network, trained on the Places365 dataset [18]. The dataset consists of approximately 1.8 million images from 365 scene categories. The network outputs a 365-D vector with one confidence value for each scene category. The places classifier performs well with respect to the lifelogging tasks, being trained to distinguish between most of the backgrounds present in the competition's dataset. This is especially useful as most topics require the lifelogger to be present in a certain place at the time when the image was captured.

Food classifier

As some topics revolve around the lifelogger's eating and drinking habits, we decided to also include a food classifier network. For this we use the InceptionV3 architecture [15] pre-trained on the Imagenet dataset and we fine-tune it on the Food101 dataset [1]. The result is a 101-D feature vector for each image. As the training dataset is composed of images where the labeled food takes up most of the image, when running our images through this classifier we extract 6 crops (upper left, upper right, center, lower left, lower middle, lower right) and their mirrored versions, and pass all of them through the network. Afterwards, we select the maximum activation for each food class from the 12 predictions and build the 101-D vector.

Object detector

In addition to the classifiers we also use an object detector. This has the advantage that it locates more than one instance of the same object and each instance has its own attached confidence. Therefore, there is no competition between detections when computing the final results. For this purpose we use a Faster R-CNN [13] implementation trained on the MSCOCO [10] dataset. Another advantage of this setup is that object detection also enables object counting. Therefore, we build two feature vectors for each image: one that retains the frequency of each detected object inside the image and one which sums up the confidences of all detected instances of each class inside the image. As the dataset also contains the class "person", we use its frequency to perform person counting. Also, many of the classes from the MSCOCO dataset can be found in daily scenarios, making it well suited for the purpose of lifelog image retrieval.

Official concepts

Apart from the previously mentioned systems, there is one more feature extractor that we use, namely the one provided by the organizers. They released a set of results in which each image is described by a variable number of concepts. The total number of possible classes is not known and their scope is uncertain, as they cover a broad range of concepts such as places, foods, actions, objects, adverbs, etc. To cope with this, we add each unique concept from the official feature results to a list that amounts to 633 unique entries. In the end, we create a 633-D feature vector for each image, with non-zero entries only where the official concept detector triggered a detection. At these positions we retain the detector's confidence for the respective concept.
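Returning to the food classifier, the multi-crop aggregation described above can be sketched as follows. The crop positions and the element-wise maximum over the 12 predictions follow the text; the crop size (half of each image dimension) and the generic predict_food helper are illustrative assumptions.

```python
# Sketch of the 6-crop (+ mirror) aggregation for the food classifier.
# Crop size and predict_food() are assumptions; positions follow the text.
import numpy as np
from PIL import Image, ImageOps

def food_vector(path, predict_food):
    """predict_food: callable mapping a PIL image to a 101-D confidence vector."""
    img = Image.open(path)
    w, h = img.size
    cw, ch = w // 2, h // 2      # assumed crop size: half of each dimension
    boxes = [                    # (left, upper, right, lower) for the six crops
        (0, 0, cw, ch),                                            # upper left
        (w - cw, 0, w, ch),                                        # upper right
        ((w - cw) // 2, (h - ch) // 2, (w + cw) // 2, (h + ch) // 2),  # center
        (0, h - ch, cw, h),                                        # lower left
        ((w - cw) // 2, h - ch, (w + cw) // 2, h),                 # lower middle
        (w - cw, h - ch, w, h),                                    # lower right
    ]
    preds = []
    for box in boxes:
        crop = img.crop(box)
        preds.append(predict_food(crop))
        preds.append(predict_food(ImageOps.mirror(crop)))   # mirrored version
    # element-wise maximum over the 12 predictions -> final 101-D descriptor
    return np.max(np.stack(preds), axis=0)
```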
3.3 Metadata processing

Apart from the concept detector, the organizers also released a file containing a large variety of metadata about each minute of the logged data. These metadata encompass a bundle of information such as biometric data, timestamps, locations, activities, geographical coordinates, food logs and even the music that the lifelogger was listening to at certain times. We use only a part of this metadata set. The rest could be used as well, but it did not fit our proposed system, therefore we only extract these data without processing them any further. A summary of all the information that we process for each image can be seen in Table 1.

Table 1. Information used for individual images.

Type             Content               Dimension
Metadata         Activity              1-D
                 Date                  1-D
                 Time (HH:MM:SS)       3-D
                 Location              1-D
Concepts         Imagenet              1000-D
                 Places                365-D
                 Food                  101-D
                 MSCOCO objects        80-D
                 MSCOCO person count   1-D
                 Official concepts     633-D
Feature vectors  HOG descriptor        1536-D
                 Color histogram       512-D

3.4 Refinement filtering

From previous experience we found that a key aspect of obtaining good results is to narrow down the set of images that are to be processed. This can be done by eliminating images that do not meet a certain set of minimum requirements. In this sense we implement two types of filtering: one based on the metadata and one based on the soft values of the concepts mentioned in Table 1, both explained below. We select one of the 10 test topics at random to serve as an example and discuss it throughout the rest of the paper. The topic consists of the following:

Title: My Presentations
Description: Find the moments when I was giving a presentation to a large group of people.
Narrative: To be considered relevant, the moments must show more than 15 people in the audience. Such moments may be giving a public lecture or a lecture in the university.

Metadata filtering

Our general approach is to manually interpret the entire topic text and extract meaningful constraints on the metadata associated with each image. Those entries that do not satisfy the given constraints are eliminated from the processing pipeline. We prefer looser restrictions such that we lower the chance of removing images relevant to the query in question. For the above given topic we impose the following:

– Activity: if the activity is any of {'airplane', 'transport', 'walking'} then remove the image;
– Location: if the location is anything other than {'Work', 'Dublin City University (DCU)'} then remove the image;
– Time: if the hour is not in the interval 9-19 then remove the image;
– Person count: if there are fewer than 10 persons detected then remove the image.

Two remarks are in order here. First, even if the person count is not part of the metadata, we treat it as such because of its 1-D nature and discrete values. Second, the minimum threshold on the person count is lower than the query asks for because the MSCOCO object detector can have difficulties in detecting overlapping persons in an image.

Soft concepts filtering

In a similar manner we tackle the filtering based on the soft outputs of the concept detector/classifiers. If a certain object/concept is detected in an image with a probability higher than a preset threshold, then that image is removed from the processing queue. Again, this process involves manual selection of concepts that should not be present in the images. As it would be tedious to select an exhaustive set of concepts for each classifier, we only select the ones which are most likely to appear in the lifelog dataset and would be in contradiction with the queried text; the selection can therefore greatly differ from one query to another. For the query in the above example we select the following (a sketch of both filtering passes is given at the end of this subsection):

– Places: if the probability to detect any of the places from the set {'car interior', 'living room', 'kitchen'} is greater than the threshold then remove the image;
– MSCOCO objects: if the probability to detect any of the objects from the set {'traffic light', 'cup'} is greater than the threshold then remove the image;
– Official concepts: if the probability to detect any of the concepts from the set {'blurry', 'blur', 'null', 'Null', 'wall', 'ceiling', 'outdoor', 'outdoor object'} is greater than the threshold then remove the image.

We do not use the same technique for the Imagenet descriptor, as it usually outputs low confidences and could thus have a great impact on the number of images that would be removed. Also, the Food descriptor was not used for this topic, as it is not relevant; its purpose is solely to classify food types for topics which implicitly ask for this. We tried several values for the threshold and, by visual inspection of the output, we noticed that 0.3 offers a good trade-off between wrongly rejecting relevant images and failing to reject irrelevant ones. Finding the best value for each concept detector and each topic requires many iterations, making this a costly process.
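The sketch below illustrates how the two filtering passes could look for the "My Presentations" example topic. The rules and the 0.3 threshold follow the text; the per-image record layout (dictionary keys) is an assumption made for illustration.

```python
# Sketch of the refinement filtering for the "My Presentations" example topic.
# The record layout is an assumption; rules and thresholds follow the text.
SOFT_THRESHOLD = 0.3

BANNED_ACTIVITIES = {"airplane", "transport", "walking"}
ALLOWED_LOCATIONS = {"Work", "Dublin City University (DCU)"}
BANNED_PLACES = {"car interior", "living room", "kitchen"}
BANNED_OBJECTS = {"traffic light", "cup"}
BANNED_OFFICIAL = {"blurry", "blur", "null", "Null", "wall", "ceiling",
                   "outdoor", "outdoor object"}

def keep_image(rec):
    """rec: {'activity': str, 'location': str, 'hour': int, 'person_count': int,
             'places': {label: conf}, 'mscoco': {label: conf},
             'official': {label: conf}}  -- assumed layout."""
    # metadata filtering
    if rec["activity"] in BANNED_ACTIVITIES:
        return False
    if rec["location"] not in ALLOWED_LOCATIONS:
        return False
    if not 9 <= rec["hour"] <= 19:
        return False
    if rec["person_count"] < 10:
        return False
    # soft concepts filtering: reject if a banned concept exceeds the threshold
    for confidences, banned in [(rec["places"], BANNED_PLACES),
                                (rec["mscoco"], BANNED_OBJECTS),
                                (rec["official"], BANNED_OFFICIAL)]:
        if any(confidences.get(c, 0.0) > SOFT_THRESHOLD for c in banned):
            return False
    return True
```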
Relevance score

After the blurred and irrelevant images have been filtered out, we proceed to compute a relevance score for each image relative to the queried topic. In the same fashion as [12], we create a reference vector for each of the 5 concept detectors in Table 1, with higher values at the positions corresponding to concepts which are more likely to be found in relevant images and lower values at the other positions. The score associated with a certain concept detector is obtained by computing the dot product between the concept feature vector and its respective reference vector. The result for each type of concept is then weighted and added to the relevance score, as expressed in Eq. (1) below:

score = w_{imagenet} \sum_{i=1}^{1000} concept_{imagenet}(i) \cdot ref_{imagenet}(i)
      + w_{places} \sum_{i=1}^{365} concept_{places}(i) \cdot ref_{places}(i)
      + w_{food} \sum_{i=1}^{101} concept_{food}(i) \cdot ref_{food}(i)                    (1)
      + w_{mscoco} \sum_{i=1}^{80} concept_{mscoco}(i) \cdot ref_{mscoco}(i)
      + w_{official} \sum_{i=1}^{633} concept_{official}(i) \cdot ref_{official}(i),

with concept_{dataset}(i) being the confidence associated with the i-th concept of the respective dataset for the given image, ref_{dataset}(i) being the value of the reference vector at position i for the given dataset, and w_{dataset} being the weight given to the respective dataset. The weights for each dot product have been manually adjusted for each topic by trial and error. The values for the reference vectors have been set either manually or automatically, depending on the submitted run. We discuss this at length in Section 4.

3.5 Diversification

The submitted results are supposed to be both relevant and diverse. The relevance score should emphasize images that match the query description. For the diversity part we apply the K-means algorithm to all the images that are left after the filtering process. Each image is represented by the concatenation of two normalized vectors: a 1536-D Histogram of Oriented Gradients (HOG) [2] feature vector and a 512-D color histogram feature vector. This 2048-D vector should account for both shapes and colors inside images. We run the K-means algorithm with either 5, 10, 25 or 50 clusters. For the final list of proposed images we select from each cluster the image with the highest relevance score, in a round-robin manner.
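A minimal sketch of this diversification step is given below, assuming the 2048-D HOG + color histogram descriptors and the Eq. (1) relevance scores are precomputed; scikit-learn's KMeans stands in for our clustering implementation, and the function name is ours.

```python
# Sketch of the diversification step: K-means on the 2048-D descriptors,
# then round-robin selection of the most relevant image per cluster.
import numpy as np
from sklearn.cluster import KMeans

def diversify(features, relevance, n_clusters=10, n_results=50):
    """features: (N, 2048) array; relevance: (N,) array of Eq. (1) scores."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
    # within each cluster, queue the image indices by decreasing relevance
    per_cluster = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        per_cluster.append(list(idx[np.argsort(-relevance[idx])]))
    # round-robin over clusters, always taking the best remaining image
    selected = []
    while len(selected) < n_results and any(per_cluster):
        for queue in per_cluster:
            if queue:
                selected.append(queue.pop(0))
            if len(selected) == n_results:
                break
    return selected
```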
4 Experimental Results

We submitted one run during the competition and 4 other runs after the competition ended. The official metric of the competition was F1@X, which is computed as the harmonic mean between precision (P@X) and cluster recall (CR@X), with X representing the number of top elements taken into consideration. In Table 2 we present the final F1@X results that we obtained for each run, with the best values in bold. Our last run is omitted when choosing the best results because it implied a highly supervised approach and would lead to an unfair comparison. In Figure 2 we present the F1@X results for individual topics. Next, we provide a detailed description of each run.

Table 2. Official results for the submitted runs.

Run    F1@5   F1@10  F1@20  F1@30  F1@40  F1@50
Run 1  0.235  0.216  0.224  0.218  0.203  0.199
Run 2  0.154  0.169  0.215  0.210  0.207  0.199
Run 3  0.158  0.168  0.217  0.214  0.199  0.206
Run 4  0.129  0.166  0.184  0.184  0.178  0.188
Run 5  0.412  0.443  0.446  0.438  0.419  0.405

Run 1

This was the only run that we submitted during the competition and it follows the pipeline described in Section 3. We manually selected concepts from each training dataset that would be likely to appear in the images described by the queries. We set the reference vector values to 1 at the positions corresponding to the selected concepts and to 0 elsewhere. This makes the dot product equivalent to an accumulation of confidences from a limited set of concepts for each image. The weights w_{imagenet}, w_{places}, w_{food}, w_{mscoco} and w_{official} have been adjusted independently for each topic. The official F1@10 value was 0.216 and this is the value that represents our position in the official standings.

Run 2

In addition to what was proposed for Run 1, we also applied another filtering of the results, this time after the clustering step. While going through the clusters in the round-robin manner, we also checked that the newly added images are not too visually similar to the ones already added to the list. For this purpose, each new proposal was compared one-on-one with the already added proposals. The comparison was done with two metrics: mean squared error (MSE) and structural similarity index (SSIM). If, for a pair of images, MSE < 2000 and SSIM > 0.5, then they are considered too similar, the latter one is discarded and the round robin continues. We expected this technique to allow for more diversity in the proposed list of images and to enhance the cluster recall. Instead, it turned out to eliminate a part of the correct predictions and lower the precision. The official F1@10 value was 0.169.

Run 3

For the 3rd run we proposed a different way of computing the reference vectors, the same technique that we used in [5]. Namely, instead of manually selecting from each dataset the concepts that dictate whether an image is relevant or not, we only selected the nouns that best describe the topic's description, obtaining a short set of keywords, called "words to search". For the topic mentioned in Section 3.4 we have: words to search = {'presentation', 'group', 'people', 'audience', 'public', 'lecture', 'conference', 'university', 'classroom'}. Starting from this set of words we computed the Wu-Palmer similarity measure [16] between each concept and all of the words from the "words to search" set, as described in Eq. (2) below:

ref_{dataset}(i) = \sum_{w \in words\_to\_search} d_{WUP}(concept_{dataset}(i), w),        (2)

where dataset is any of the 5 datasets used in the concept detectors (Imagenet, Places-365, Food-101, MSCOCO, Official) and d_{WUP}(concept_{dataset}(i), w) is the Wu-Palmer similarity between one concept of the dataset and one word w from the set of words to search.
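A minimal sketch of how such a reference vector could be computed is shown below, using the WordNet interface from NLTK for the Wu-Palmer similarity. Taking the first synset of each term is a simplifying assumption of the sketch, not necessarily what our system does.

```python
# Sketch of Eq. (2): reference values as summed Wu-Palmer similarities
# between each detector label and the "words to search" set (NLTK WordNet).
from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

words_to_search = ["presentation", "group", "people", "audience", "public",
                   "lecture", "conference", "university", "classroom"]

def wup(term_a, term_b):
    """Wu-Palmer similarity between the first synsets of two terms (0 if unknown)."""
    syns_a = wn.synsets(term_a.replace(" ", "_"))   # WordNet uses underscores
    syns_b = wn.synsets(term_b.replace(" ", "_"))
    if not syns_a or not syns_b:
        return 0.0
    return syns_a[0].wup_similarity(syns_b[0]) or 0.0

def reference_vector(concept_labels):
    """concept_labels: label strings of one detector, e.g. the 365 Places classes."""
    return [sum(wup(label, w) for w in words_to_search) for label in concept_labels]
```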
This avoided the binary setting of the reference vector that was used in the previous runs, but it led to a decrease in the performance of the entire system. The official F1@10 value was 0.168.

Run 4

The 4th run was similar to Run 3, with the only difference being that all the weights w_{imagenet}, w_{places}, w_{food}, w_{mscoco} and w_{official} were set to 1, rendering them neutral in the relevance score computation. This allows the relevance score to depend solely on the similarity measure between the words from the topic description and the labels of the concept detectors. From Table 2 we can see that this only lowers the results, suggesting that tweaking the weights for each dot product is a better approach. This run was our closest submission to a fully automatic system. The official F1@10 value was 0.166.

Run 5

Our last run followed the same approach as Run 1, this time fine-tuning all system parameters by trial and error for the topics that had poor results in the first run. This approach leads to visibly better results. However, they are obtained only after careful manual tuning, which makes the technique highly supervised and costly, and makes a comparison with the previous runs unfair; this is the reason why it is separated from the rest of the entries in Table 2. The official F1@10 value was 0.443.

Fig. 2. Results for each topic from the test set.

4.1 Discussion

From the results presented in Figure 2 it can be seen that the F1@X metric has high inter-topic variance. This does not come as a surprise, since the topics address different scenes, some of which are better represented in terms of number of images in the dataset or are better described in terms of the associated metadata. While some topics are easy to address (e.g., Topic 8: "Find the moments when I was with friends in Costa coffee." can be retrieved almost solely based on the location metadata), there are still topics for which retrieval is difficult (e.g., Topic 6: "Find the moments when I was assembling a piece of furniture."), mainly because of the difficulty of assigning distinctive concepts to their description.

Except for the last run, it can be seen that all our approaches behave similarly for each individual topic, suggesting that there is no clear advantage in using one approach over the others. This is somewhat expected, since they use the same data and almost the same degree of supervision. The only clear improvement can be seen when strong human input is involved.

The part of the entire system which had the greatest impact on the final outcome was the metadata filtering. We argue that this is because this type of information has been gathered specifically for lifelogging purposes and therefore has the strongest contribution in the end. This was also shown by our 5th run, where we paid more attention to fine-tuning the processing parameters, such as metadata constraints, weights and the set of query words, rather than to introducing a new system.

The way the F1@X metric changes with X is also worth mentioning. We noticed that it is more beneficial to focus on the cluster recall than on the precision. This follows directly from the definition of the F1@X metric, in which CR@X and P@X have equal contributions. As the topics cover an average of 5-6 different clusters (as per the development dataset), it is usually more productive to retrieve images from at least two different clusters rather than retrieve all the images from a single cluster. This happens because the cluster recall can only increase with X, whereas the precision usually drops for the same number of images; however, the gain in cluster recall usually compensates for the loss in precision.
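As an illustration (the numbers are hypothetical and not taken from the evaluation): for a topic with 5 relevant clusters, a list of 10 images that are all relevant but come from a single cluster gives P@10 = 1.0 and CR@10 = 1/5 = 0.2, hence F1@10 = 2 · 1.0 · 0.2 / (1.0 + 0.2) ≈ 0.33, whereas a list with only 6 relevant images spread over 3 clusters gives P@10 = 0.6 and CR@10 = 0.6, hence F1@10 = 0.6. Losing some precision can therefore be more than compensated by the gain in cluster recall.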
We also notice that almost all of our approaches reach their highest F1@X value at X = 20, with a slight decrease for larger X, which was rather inconvenient since the official metric uses X = 10. However, our results for X = 10 and X = 20 are quite similar.

5 Conclusions

In this paper we presented our approach for the LMR competition at the ImageCLEF Lifelog task. We adopted a general framework that processes visual, text and meta information about images. We extracted 5 concept vectors, 2 feature vectors and more than 10 metadata fields for each image. All of the proposed variants rely on metadata filtering and try to link each keyword from the search topics to the concept detector labels. A relevance score which takes this link into consideration is then computed, and the K-means algorithm is used to cluster the results for the final proposals.

The LMR task still poses numerous difficulties, such as processing a great deal of multimodal data, adapting several multimedia retrieval systems to this type of task and integrating all the results. The diversity of the search queries also has to be taken into account: some are quite easy to process (see the results for Topic 8), while others show that there is still work to be done to find a solution that handles this level of generality (see the results for Topic 7). We found that manual fine-tuning of system parameters offers the best results, but this makes the system personalized for the given topics, lowering its scalability to other similar tasks. As opposed to last year, we have implemented a significantly more complex system, and the future challenge for us is to work towards a scalable system, less dependent on human input, to solve the LMR task. We believe that, with the increasing interest in this type of competition, this goal is achievable.

Acknowledgement

This work was supported by the Ministry of Innovation and Research, UEFISCDI, project SPIA-VA, agreement 2SOL/2017, grant PN-III-P2-2.1-SOL-2016-02-0002.

References

1. Bossard, L., Guillaumin, M., Van Gool, L.: Food-101 – mining discriminative components with random forests. In: European Conference on Computer Vision. pp. 446–461 (2014)
2. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 1. pp. 886–893. CVPR '05 (2005)
3. Dang-Nguyen, D.T., Piras, L., Riegler, M., Boato, G., Zhou, L., Gurrin, C.: Overview of ImageCLEFlifelog 2017: Lifelog Retrieval and Summarization. In: CLEF2017 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Dublin, Ireland (September 11-14, 2017)
4. Dang-Nguyen, D.T., Piras, L., Riegler, M., Zhou, L., Lux, M., Gurrin, C.: Overview of ImageCLEFlifelog 2018: Daily Living Understanding and Lifelog Moment Retrieval. In: CLEF2018 Working Notes.
CEUR Workshop Proceedings, vol. 11018. CEUR-WS.org, Avignon, France (September 10-14, 2018)
5. Dogariu, M., Ionescu, B.: A Textual Filtering of HOG-based Hierarchical Clustering of Lifelog Data. In: CLEF2017 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Dublin, Ireland (September 11-14, 2017)
6. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778 (June 2016)
7. Ionescu, B., Müller, H., Villegas, M., de Herrera, A.G.S., Eickhoff, C., Andrearczyk, V., Cid, Y.D., Liauchuk, V., Kovalev, V., Hasan, S.A., Ling, Y., Farri, O., Liu, J., Lungren, M., Dang-Nguyen, D.T., Piras, L., Riegler, M., Zhou, L., Lux, M., Gurrin, C.: Overview of ImageCLEF 2018: Challenges, datasets and evaluation. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018), vol. 11018. LNCS Lecture Notes in Computer Science, Springer, Avignon, France (September 10-14, 2018)
8. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems 25, pp. 1097–1105 (2012)
9. Lin, J., Molino, A., Xu, Q., Fang, F., Subbaraju, V., Lim, J.H.: VCI2R at the NTCIR-13 Lifelog-2 Lifelog Semantic Access Task. In: Proceedings of the 13th NTCIR Conference on Evaluation of Information Access Technologies. Tokyo, Japan (December 5-8, 2017)
10. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision (ECCV). vol. 8693, pp. 740–755. Zürich (2014)
11. Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theor. 28(2), 129–137 (Sep 2006)
12. Molino, A., Mandal, B., Lin, J., Lim, J.H., Subbaraju, V., Chandrasekhar, V.: VC-I2R@ImageCLEF2017: Ensemble of Deep Learned Features for Lifelog Video Summarization. In: CLEF2017 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Dublin, Ireland (September 11-14, 2017)
13. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems 28, pp. 91–99. Curran Associates, Inc. (2015)
14. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (ICLR) (2015)
15. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. pp. 2818–2826 (2016)
16. Wu, Z., Palmer, M.: Verbs semantics and lexical selection. In: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics. pp. 133–138. ACL '94 (1994)
17. Yamamoto, S., Nishimura, T., Akagi, Y., Takimoto, Y., Inoue, T., Toda, H.: PBG at the NTCIR-13 Lifelog-2 LAT, LSAT, and LEST Tasks. In: Proceedings of the 13th NTCIR Conference on Evaluation of Information Access Technologies. Tokyo, Japan (December 5-8, 2017)
18. Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., Torralba, A.: Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(6), 1452–1464 (June 2018)