<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multimedia Lab @ ImageCLEF 2018 Lifelog Moment Retrieval Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mihai Dogariu</string-name>
          <email>mdogariu@imag.pub.ro</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bogdan Ionescu</string-name>
          <email>bionescu@alpha.imag.pub.ro</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Multimedia Lab, CAMPUS, University Politehnica of Bucharest</institution>
          ,
          <country country="RO">Romania</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the participation of the Multimedia Lab team at the ImageCLEF 2018 Lifelog Moment Retrieval Task. Our method makes use of visual information, text information and metadata. Our approach consists of the following steps: we reduce the number of images to analyze by eliminating those that are blurry or do not meet certain metadata criteria, extract relevant concepts with several Convolutional Neural Networks, perform K-means clustering on Histogram of Oriented Gradients and color histogram features, and re-rank the remaining images according to a relevance score computed between each image's concepts and the queried topic.</p>
      </abstract>
      <kwd-group>
        <kwd>Lifelog</kwd>
        <kwd>CNN</kwd>
        <kwd>Imagenet</kwd>
        <kwd>Places365</kwd>
        <kwd>MSCOCO</kwd>
        <kwd>Food101</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Recent technological advancements have resulted in the development of numerous wearable devices that can help one track one's daily activity. Examples of such devices include wearable cameras, smart watches or fitness bracelets. Each of these provides information regarding its user's activity, and combining the outputs of all such devices can result in a highly detailed description of the person's habits, schedule or actions. However, continuous acquisition of data can lead to cumbersome archives of information which, in turn, can become too difficult to handle, up to the point where it becomes inefficient to try to use them. As part of the ImageCLEF 2018 evaluation campaign [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], the Lifelog
Tasks [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] aim to solve these problems.
      </p>
      <p>This paper presents our participation in the Lifelog Moment Retrieval (LMR) task, in which participants have to retrieve a number of specific moments in a lifelogger's life, given a text query. Moments are defined as semantic events or activities that happened throughout the day. For each query, a total of 50 images is expected to be retrieved, both relevant and diverse, with the official metric being the F1@10 measure.</p>
      <p>The rest of the paper is organized as follows. In Section 2 we discuss
related work from the literature, in Section 3 we present our proposed system, in
Section 4 we discuss the results and in Section 5 we conclude the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        In this section we briefly discuss recent results obtained in similar competitions. The organizing team of the ImageCLEF 2017 Lifelog Tasks [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] proposed
a pipeline in which they perform a segmentation of the dataset based on time
and concepts metadata. In parallel, they analyzed each query and extracted the
relevant information that can be applied to the given metadata. After extracting only the images that fit the previous criteria, they performed an additional filtering of images, removing those that contain large objects or are blurry. The last step involves a diversification of images through hierarchical clustering.
      </p>
      <p>
        A similar technique was used by [12] in their submission at the same
competition. In addition, they also used the image descriptors obtained by running
each image through different Convolutional Neural Networks (CNN), i.e. they
extracted object and place feature vectors to which they added a human
detection CNN. Each image was assigned a relevance score obtained by comparing
the feature vector to a reference vector on a per topic basis. Their chosen
clustering approach was K-means [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The same authors use a very similar system
in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], where they further add a temporal smoothing element. A somewhat
different system was adopted in [17] where the authors combined a visual indexing
method similar to the ones in [
        <xref ref-type="bibr" rid="ref9">12, 9</xref>
        ] with a location indexing method.
      </p>
      <p>
        In our previous participation [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] we also applied a filtering procedure, first based on the metadata and later on the similarity between the topic queries and the feature vector, which consisted of detected concepts. This filtering was followed by a hierarchical clustering step. We learned that, in order for this technique to work, there has to be a strong correlation between the queries and the detected concepts. Also, enumerating the items that needed to be present in the image significantly improved the results.
      </p>
      <p>
        This paper combines the benefits of [12] and [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], to which it adds two more feature vectors. Moreover, we explore the impact that supervised fine-tuning has on the final results and present the outcome of 5 different techniques.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Proposed Approach</title>
      <p>Our approach involves the pipeline presented in Figure 1. Each of the processing
steps is detailed in the following. The output of the system is a list of 50 images
for each of the proposed 10 topics, which are both relevant and diverse with
respect to the query.
respect to the query.</p>
      <sec id="sec-3-1">
        <title>Blur filtering</title>
        <p>We first apply blur filtering over the entire dataset. We compute a focus measure for each image by using the variance of the Laplacian kernel. If an image has a focus measure below an imposed threshold, it is discarded from further processing. Choosing the threshold requires several trials to see what works best for the dataset at hand. Imposing a low value for the threshold results in a permissive filter, leading to a low number of discarded images, whereas a high threshold could wrongly discard images of acceptable quality. We found that a threshold value of 60 leads to satisfying results. We decided to allow the filter to be slightly permissive so that we do not reject true positives. In the end, from the total of 80.5k images we discard 16.5k blurry images, leaving us with only 64k images to process. Another advantage of this technique is that it also filters out uninformative images that contain large homogeneous areas, such as images where the camera was facing the ceiling or a wall.</p>
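        <p>As an illustration, a minimal sketch of this step in Python, assuming OpenCV is used for the Laplacian-based focus measure (the paper does not state the implementation; the threshold of 60 is the value reported above and the file handling is hypothetical):</p>
        <preformat>
import cv2

BLUR_THRESHOLD = 60.0  # focus-measure threshold chosen empirically for this dataset

def focus_measure(image_path):
    # Variance of the Laplacian: low values indicate blurry or homogeneous images
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def keep_sharp(image_paths):
    # Discard images whose focus measure falls below the threshold
    return [p for p in image_paths if focus_measure(p) &gt;= BLUR_THRESHOLD]
        </preformat>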
      </sec>
      <sec id="sec-3-2">
        <title>Concepts extraction</title>
        <p>In the second step of our algorithm we run each of the remaining 64k images through several classifiers and a detector. We use 3 image-level classifiers and one object detector, to which we also add the concept detector information provided by the organizers. All of these systems are implemented using CNNs, as described below.</p>
        <p>
          Imagenet classifier. A common practice for detecting several concepts in an image is to run it through an image classifier trained on the popular Imagenet dataset [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. This yields a 1000-D vector whose values correspond to the confidence of associating the entire image with a certain concept. We use a
ResNet50 [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] implementation trained on Imagenet.
        </p>
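        <p>For illustration, a minimal sketch of this classifier, assuming a torchvision ResNet50 pre-trained on Imagenet (the paper does not specify the framework; any equivalent implementation would do):</p>
        <preformat>
import torch
from PIL import Image
from torchvision import models, transforms

# Standard Imagenet preprocessing for ResNet50
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
model = models.resnet50(pretrained=True).eval()

def imagenet_concepts(image_path):
    # 1000-D vector of class confidences (softmax over the Imagenet classes)
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return torch.softmax(model(img), dim=1).squeeze(0).numpy()
        </preformat>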
        <p>However, there are two important aspects that need to be considered when implementing this technique. The first one is that the classifier is trained to predict a single concept for the entire image, whereas lifelog images contain numerous objects that might be of interest for the retrieval task. The second aspect is that, out of the 1000 classes, only a small part is relevant, with the vast majority of these concepts unlikely to be met in a person's daily routine. This leads to noisy classification, diminishing the usefulness of this classifier.
Places classifier. The second classifier that we implement is meant to predict the place presented in the image. We use the VGG16 [14] network, trained on the Places365 dataset [18]. The dataset consists of approximately 1.8 million images from 365 scene categories. The network outputs a 365-D vector with one confidence value for each scene category. The places classifier performs well with respect to the lifelogging tasks, being trained to distinguish between most of the backgrounds present in the competition's dataset. This is especially useful as most topics require the lifelogger to be present in a certain place at the time when the image was captured.</p>
        <p>
          Food classifier. As some topics revolved around the lifelogger's eating and drinking habits, we decided to also include a food classifier network. For this we use the InceptionV3 architecture [15] pre-trained on the Imagenet dataset and fine-tune it on the Food101 dataset [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The result is a 101-D feature vector for each image. As the training dataset is composed of images where the labeled food takes up most of the image, when running our images through this classifier we extract 6 crops (upper left, upper right, center, lower left, lower middle, lower right) and their mirrored versions, which we pass through the network. Afterwards, we select the maximum activation for each food class from the 12 predictions and build the 101-D vector.
        </p>
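        <p>A minimal sketch of the crop-and-mirror aggregation described above; the crop size and the prediction callback are assumptions, since the paper only lists the crop positions and the number of views:</p>
        <preformat>
import numpy as np
from PIL import Image, ImageOps

def crops_and_mirrors(image, crop_frac=0.7):
    # Six fixed crops plus their horizontal mirrors (12 views in total)
    w, h = image.size
    cw, ch = int(w * crop_frac), int(h * crop_frac)
    anchors = [(0, 0), (w - cw, 0), ((w - cw) // 2, (h - ch) // 2),
               (0, h - ch), ((w - cw) // 2, h - ch), (w - cw, h - ch)]
    views = [image.crop((x, y, x + cw, y + ch)) for x, y in anchors]
    return views + [ImageOps.mirror(v) for v in views]

def food_vector(image, predict):
    # Maximum activation per food class over the 12 views;
    # `predict` maps a PIL image to a 101-D vector (the fine-tuned InceptionV3)
    return np.max([predict(v) for v in crops_and_mirrors(image)], axis=0)
        </preformat>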
        <p>
          Object detector. In addition to the classifiers we also use an object detector. This has the advantage that it locates more than one instance of the same object and each instance has its own attached confidence. Therefore, there is no competition between detections when computing the final results. For this purpose we use a Faster R-CNN [13] implementation trained on the MSCOCO [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]
dataset. Another advantage of this setup is that, along with object detection, it also performs object counting. Therefore, we build two feature vectors for each image: one that retains the frequency of each detected object inside the image and one that sums up the confidences of all detected instances of each class inside the image. As the dataset also contains the class "person", we use its frequency to perform person counting. Also, many of the classes from the MSCOCO dataset can be found in daily scenarios, making it well-suited for the purpose of lifelog image retrieval.
        </p>
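        <p>A minimal sketch of how the two detection-based feature vectors and the person count can be built from the detector output; the (class id, confidence) format and the class ordering are assumptions:</p>
        <preformat>
import numpy as np

NUM_COCO_CLASSES = 80
PERSON_CLASS_ID = 0  # index of the "person" class in the assumed label ordering

def detection_features(detections):
    # detections: list of (class_id, confidence) pairs for one image.
    # Returns the per-class frequency vector, the per-class confidence-sum vector
    # and the person count derived from the frequency vector.
    freq = np.zeros(NUM_COCO_CLASSES)
    conf_sum = np.zeros(NUM_COCO_CLASSES)
    for class_id, confidence in detections:
        freq[class_id] += 1
        conf_sum[class_id] += confidence
    return freq, conf_sum, int(freq[PERSON_CLASS_ID])
        </preformat>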
        <p>Official concepts. Apart from the previously mentioned systems, there is one more feature extractor that we use, namely the one provided by the organizers. They released a set of results in which each image is described by a variable number of concepts. The total number of possible classes is not known and their scope is also uncertain, as they cover a broad range of concepts such as places, foods, actions, objects, adverbs etc. To cope with this we add each unique concept from the official feature results to a list that sums up to 633 unique entries. In the end, we create a 633-D feature vector for each image, with non-zero entries only where the official concept detector triggered a detection. At these positions we retain the detector's confidence for the respective concept.</p>
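        <p>A minimal sketch of how the 633-D vector can be assembled from the official per-image concept lists; the input format is an assumption about how the released files are parsed:</p>
        <preformat>
import numpy as np

def build_vocabulary(concepts_per_image):
    # concepts_per_image: {image_id: [(concept_name, confidence), ...]}
    # Collect every unique concept name seen in the official detector output
    names = sorted({name for concepts in concepts_per_image.values()
                    for name, _ in concepts})
    return {name: i for i, name in enumerate(names)}

def official_vector(concepts, vocab):
    # Sparse confidence vector (633-D in our case), non-zero only where a detection fired
    vec = np.zeros(len(vocab))
    for name, confidence in concepts:
        vec[vocab[name]] = confidence
    return vec
        </preformat>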
      </sec>
      <sec id="sec-3-3">
        <title>Metadata processing</title>
        <p>Apart from the concept detector, the organizers also released a file containing a large variety of metadata about each minute of the logged data. These metadata encompass a bundle of information such as biometric data, timestamps, locations, activities, geographical coordinates, food logs and even the music that the lifelogger was listening to at certain times. We use only a part of this set of metadata. The rest of it could be used as well, but it did not fit our proposed system, therefore we only extract these data and do not process them any further. A summary of all the information that we process for each image can be seen in Table 1.</p>
        <p>From previous experience we found that a key aspect of obtaining good results is to narrow down the set of images to be processed. This can be done by eliminating images that do not meet a certain set of minimum requirements. In this sense we implement two types of filtering: one based on the metadata and one based on the soft values of the concepts mentioned in Table 1, both explained below. We select one of the 10 test topics at random to serve as an example and discuss it throughout the rest of the paper. The topic consists of the following:</p>
        <p>Title: My Presentations
Description: Find the moments when I was giving a presentation to a
large group of people.</p>
        <p>Narrative: To be considered relevant, the moments must show more than
15 people in the audience. Such moments may be giving a public lecture
or a lecture in the university.</p>
        <p>Metadata filtering. Our general approach is to manually interpret the entire topic text and extract meaningful constraints on the metadata associated with each image. Entries that do not satisfy the given constraints are eliminated from the processing pipeline. We prefer looser restrictions so that we lower the chance of removing images relevant to the query in question. For the topic given above we impose the following constraints:</p>
        <p>- Activity: if the activity is any of {'airplane', 'transport', 'walking'} then remove the image;
- Location: if the location is anything different from {'Work', 'Dublin City University (DCU)'} then remove the image;
- Time: if the hour is not in the interval 9-19 then remove the image;
- Person count: if fewer than 10 persons are detected then remove the image.</p>
        <p>Two remarks are in order here. First, even if the person count is not part of the metadata, we treat it as such because of its 1-D nature and discrete values. Second, the minimum threshold on the person count is lower than what the query asks for because the MSCOCO object detector can have difficulties in detecting overlapping persons in an image. A sketch of these checks is given below.</p>
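        <p>A minimal sketch of the metadata filter for the example topic; the field names are illustrative, not the exact column names of the released metadata file:</p>
        <preformat>
from datetime import datetime

BAD_ACTIVITIES = {"airplane", "transport", "walking"}
ALLOWED_LOCATIONS = {"Work", "Dublin City University (DCU)"}

def passes_metadata_filter(entry):
    # entry: dict with 'activity', 'location', 'timestamp' and 'person_count' fields
    if entry["activity"] in BAD_ACTIVITIES:
        return False
    if entry["location"] not in ALLOWED_LOCATIONS:
        return False
    hour = datetime.fromisoformat(entry["timestamp"]).hour
    if hour &lt; 9 or hour &gt; 19:
        return False
    return entry["person_count"] &gt;= 10
        </preformat>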
        <p>Soft concepts filtering. In a similar manner we tackle the filtering based on the soft outputs of the concept detector/classifiers. If a certain object/concept is detected in an image with a probability higher than a preset threshold, then that image is removed from the processing queue. Again, this process involves manual selection of concepts that should not be present in the images. As it would be tedious to select an exhaustive set of concepts for each classifier, we only select the ones that are most likely to appear in the lifelog dataset and would contradict the queried text, so the selection can differ greatly from one query to another. For the query in the above example we select the following (a sketch follows the threshold discussion below):</p>
        <p>- Places: if the probability to detect any of the places from the set {'car interior', 'living room', 'kitchen'} is greater than the threshold then remove the image;
- MSCOCO objects: if the probability to detect any of the objects from the set {'traffic light', 'cup'} is greater than the threshold then remove the image;
- Official concepts: if the probability to detect any of the concepts from the set {'blurry', 'blur', 'null', 'Null', 'wall', 'ceiling', 'outdoor', 'outdoor object'} is greater than the threshold then remove the image.</p>
        <p>We do not use the same technique for the Imagenet descriptor, as it usually outputs low confidences and could thus have a great impact on the number of images that would be removed. Also, the Food descriptor was not used for this topic, as it is not relevant here; its purpose is solely to classify food types for topics which implicitly ask for this.</p>
        <p>We tried several values for the threshold and, by visual inspection of the output, we noticed that 0.3 offers a good trade-off between the probability of rejecting true positives and that of rejecting true negatives. Finding the best value for each concept detector and each topic requires many iterations, making this a costly process.</p>
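        <p>A minimal sketch of the soft-concept filter for the example topic, assuming the per-detector confidences have been collected into dictionaries (the detector names are illustrative):</p>
        <preformat>
THRESHOLD = 0.3  # trade-off chosen by visual inspection of the output

BANNED = {
    "places": {"car interior", "living room", "kitchen"},
    "mscoco": {"traffic light", "cup"},
    "official": {"blurry", "blur", "null", "Null", "wall", "ceiling",
                 "outdoor", "outdoor object"},
}

def passes_soft_filter(concept_scores):
    # concept_scores: {detector_name: {concept: confidence}} for one image
    for detector, banned in BANNED.items():
        scores = concept_scores.get(detector, {})
        if any(scores.get(concept, 0.0) &gt; THRESHOLD for concept in banned):
            return False
    return True
        </preformat>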
        <p>Relevance score. After the blurred and irrelevant images have been filtered out, we proceed to compute a relevance score for each image relative to the queried topic. In the same fashion as [12], we create a reference vector for each of the 5 concept detectors in Table 1, with higher values on the positions corresponding to concepts that are more likely to be found in relevant images and lower values on the other positions. The score associated with a certain concept detector is obtained by computing the dot product between the concept feature vector and its respective reference vector. The result is then weighted and added to the relevance score for each type of concept, as expressed in the equation below.</p>
        <p>
          score = w_imagenet * sum_{i=1}^{1000} [concept_imagenet(i) * ref_imagenet(i)]
                + w_places * sum_{i=1}^{365} [concept_places(i) * ref_places(i)]
                + w_food * sum_{i=1}^{101} [concept_food(i) * ref_food(i)]
                + w_mscoco * sum_{i=1}^{80} [concept_mscoco(i) * ref_mscoco(i)]
                + w_official * sum_{i=1}^{633} [concept_official(i) * ref_official(i)],    (1)
with concept_dataset(i) being the confidence associated with the i-th detected concept from a dataset for the respective image, ref_dataset(i) the reference vector at position i for the given dataset, and w_dataset the weight given to the respective dataset. The weights for each dot product have been manually adjusted for each topic by trial and error. The values of the reference vectors have been set either manually or automatically, depending on the submitted run. We discuss this at length in Section 4.</p>
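        <p>A minimal sketch of the score in Eq. (1), assuming the concept and reference vectors are stored per detector (the dictionary layout is an assumption):</p>
        <preformat>
import numpy as np

DETECTORS = ("imagenet", "places", "food", "mscoco", "official")

def relevance_score(concept_vectors, reference_vectors, weights):
    # Weighted sum of dot products between concept and reference vectors, as in Eq. (1);
    # all three arguments are dicts keyed by detector name
    return sum(weights[d] * np.dot(concept_vectors[d], reference_vectors[d])
               for d in DETECTORS)
        </preformat>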
        <p>The submitted results are supposed to be both relevant and diverse. The relevance score should emphasize images that match the query description. For the diversity part we apply the K-means algorithm to all the images that are left after the filtering process. Each image is represented by the concatenation of two normalized vectors: a 1536-D vector representing the Histogram of Oriented Gradients (HOG) [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] feature vector and a 512-D vector representing the color
histogram feature vector. This 2048-D vector should account for both shapes
and colors inside images.
        </p>
        <p>We run the K-means algorithm with either 5, 10, 25 or 50 clusters. For the final list of proposed images we select from each cluster the image with the highest relevance score, in a round-robin manner.</p>
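        <p>A minimal sketch of the diversification step, assuming scikit-learn's K-means over the 2048-D HOG + color-histogram features and a round-robin selection by relevance score:</p>
        <preformat>
import numpy as np
from sklearn.cluster import KMeans

def diversify(features, scores, n_clusters=10, n_results=50):
    # features: N x 2048 matrix, scores: N relevance scores.
    # Cluster the images and pick them round-robin, best score first within each cluster.
    labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(features)
    queues = [sorted(np.where(labels == c)[0], key=lambda i: -scores[i])
              for c in range(n_clusters)]
    ranked = []
    while len(ranked) &lt; n_results and any(len(q) &gt; 0 for q in queues):
        for q in queues:
            if q and len(ranked) &lt; n_results:
                ranked.append(q.pop(0))
    return ranked
        </preformat>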
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Results</title>
      <p>We submitted one run during the competition and 4 other runs after the competition ended. The official metric of the competition was F1@X, computed as the harmonic mean between precision (P@X) and cluster recall (CR@X), with X representing the number of top elements taken into consideration. In Table 2 we present the final F1@X results that we obtained for each run, with the best values in bold. Our last run is omitted when choosing the best results because it implied a highly supervised approach and would lead to an unfair comparison. In Figure 2 we present the F1@X results for individual topics. Next, we provide a detailed description of each run.</p>
      <p>Run 1. This was the only run submitted during the competition and it follows the pipeline described in Section 3. We manually selected, from each training dataset, the concepts likely to appear in the images described by the queries. We set the reference vector values to 1 on the positions corresponding to the selected concepts and to 0 elsewhere. This makes the dot product equivalent to an accumulation of confidences from a limited set of concepts for each image. The weights w_imagenet, w_places, w_food, w_mscoco and w_official were adjusted independently for each topic. The official F1@10 value was 0.216 and this is the value that represents our position in the official standings.</p>
      <p>Run 2. In addition to what was proposed for Run 1, we applied another filtering of the results, this time after the clustering step. While going through the clusters in the round-robin manner we also checked that newly added images are not too visually similar to the ones already added to the list. For this purpose each new proposal is compared one-on-one with the already added proposals. The comparison uses two metrics: the mean squared error (MSE) and the structural similarity index (SSIM). If, for a pair of images, MSE &lt; 2000 and SSIM &gt; 0.5, then they are considered too similar, the latter one is discarded and the round robin continues. We expected this technique to allow for more diversity in the proposed list of images and enhance the cluster recall. Instead, it turned out to eliminate a part of the correct predictions and lower the precision. The official F1@10 value was 0.169.</p>
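      <p>A minimal sketch of the pairwise similarity check used in Run 2, assuming scikit-image for the SSIM computation (the paper does not name the library) and grayscale images of identical size:</p>
      <preformat>
import numpy as np
from skimage.metrics import structural_similarity

MSE_MAX, SSIM_MIN = 2000.0, 0.5  # thresholds used in this run

def too_similar(img_a, img_b):
    # img_a, img_b: grayscale images as float arrays of identical shape
    mse = np.mean((img_a - img_b) ** 2)
    ssim = structural_similarity(img_a, img_b,
                                 data_range=img_a.max() - img_a.min())
    return mse &lt; MSE_MAX and ssim &gt; SSIM_MIN
      </preformat>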
      <p>
        Run 3. For the 3rd run we proposed a different way of computing the reference vectors, using the same technique as in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Namely, instead of manually selecting from each dataset the concepts that dictate whether an image is relevant, we selected only the nouns that best describe the topic's description, obtaining a short set of key words called "words to search". For the topic mentioned in Section 3.3 we have: words to search = {'presentation', 'group', 'people', 'audience', 'public', 'lecture', 'conference', 'university', 'classroom'}. Starting from this set of words we computed the Wu-Palmer similarity measure [16] between each concept and all of the words from the "words to search" set, as described in the equation below.
      </p>
      <p>ref_dataset(i) = sum_{w in words_to_search} d_WUP(concept_dataset(i), w),    (2)</p>
      <p>where dataset is any of the 5 datasets used by the concept detectors (Imagenet, Places-365, Food-101, MSCOCO, Official) and d_WUP(concept_dataset(i), w) is the Wu-Palmer similarity between one concept of the dataset and one word w from the "words to search" set. This avoided the binary setting of the reference vector used in the previous runs, but it led to a decrease in the performance of the entire system. The official F1@10 value was 0.168.</p>
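      <p>A minimal sketch of Eq. (2), assuming WordNet via NLTK for the Wu-Palmer measure; mapping multi-word detector labels onto WordNet entries is simplified here and is an assumption of this sketch:</p>
      <preformat>
from nltk.corpus import wordnet as wn  # requires the WordNet corpus to be available

WORDS_TO_SEARCH = ["presentation", "group", "people", "audience", "public",
                   "lecture", "conference", "university", "classroom"]

def wup(word_a, word_b):
    # Highest Wu-Palmer similarity over the noun synsets of the two words (0 if none found)
    scores = [a.wup_similarity(b)
              for a in wn.synsets(word_a, pos=wn.NOUN)
              for b in wn.synsets(word_b, pos=wn.NOUN)]
    return max((s for s in scores if s is not None), default=0.0)

def reference_vector(concept_labels, words=WORDS_TO_SEARCH):
    # Eq. (2): one entry per concept label of a detector, summed over the query words
    return [sum(wup(label, w) for w in words) for label in concept_labels]
      </preformat>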
      <p>Run 4. The 4th run was similar to Run 3, the only difference being that all the weights w_imagenet, w_places, w_food, w_mscoco and w_official were set to 1, rendering them neutral in the relevance score computation. This lets the relevance score depend solely on the similarity measure between the words from the topic description and the labels of the concept detectors. From Table 2 we can see that this only lowers the results, suggesting that tweaking the weights for each dot product is the better approach. This run was our closest submission to a fully automatic system. The official F1@10 value was 0.166.</p>
      <p>Run 5. Our last run used the same approach as Run 1, this time fine-tuning all system parameters, by trial and error, for the topics that had poor results in the first run. This approach leads to visibly better results. However, it requires careful manual tuning, which makes the technique highly supervised and costly and makes it unfair to compare with the previous runs; this is why it is separated from the rest of the entries in Table 2. The official F1@10 value was 0.443.</p>
      <p>From the results presented in Figure 2 it can be seen that the F1@X metric has high inter-topic variance. This does not come as a surprise, since the topics address different scenes, some of which are better represented in terms of number of images in the dataset or are better described in terms of the associated metadata. While some topics are easy to address (e.g. Topic 8: "Find the moments when I was with friends in Costa coffee." can be retrieved almost solely based on the location metadata), there are still topics for which retrieval is difficult (e.g. Topic 6: "Find the moments when I was assembling a piece of furniture."), mainly because of the difficulty of assigning distinctive concepts to their description. Except for the last run, all our approaches behave similarly on each individual topic, suggesting that there is no clear advantage in using one approach over the others. This is somewhat expected, since they use the same data and almost the same degree of supervision. The only clear improvement can be seen when strong human input is involved.</p>
      <p>The part of the entire system with the greatest impact on the final outcome was the metadata filtering. We argue that this is because this type of information has been specifically designed for lifelogging purposes and therefore has the strongest contribution in the end. This was also confirmed by our 5th run, where we paid more attention to fine-tuning the processing parameters, such as the metadata constraints, the weights and the set of query words, rather than to introducing a new system.</p>
      <p>The way the F1@X metric changes with X is also worth mentioning. We noticed that it is more beneficial to focus on the cluster recall than on the precision. This follows directly from the definition of the F1@X metric, in which CR@X and P@X have equal contributions. As the topics cover an average of 56 different clusters (as per the development dataset), it is usually more productive to retrieve images from at least two different clusters rather than retrieve all the images from a single cluster. This happens because the cluster recall can only increase with X, whereas the precision usually drops for the same number of images. However, the cluster recall usually compensates for the precision.</p>
      <p>We also notice that almost all of our approaches have the highest F1@X value for X = 20, with a slight decrease as X increases further, which was rather inconvenient since the official metric accounts for X = 10. However, we obtained quite similar results for X = 10 and X = 20.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>In this paper we presented our approach for the LMR competition at the ImageCLEF Lifelog task. We adopted a general framework that processes visual, text and meta information about images. We extracted 5 concept vectors, 2 feature vectors and more than 10 metadata fields for each image. All of the proposed variants rely on metadata filtering and try to link each key word from the search topics to the concept detector labels. A relevance score that takes this link into consideration is then computed, and the K-means algorithm is used for clustering the results for the final proposals.</p>
      <p>The LMR task still poses numerous difficulties, such as processing a great deal of multimodal data, adapting several multimedia retrieval systems to this type of task and integrating all the results. The diversity of the search queries also has to be taken into account: some are quite easy to process (see the results for Topic 8), while others show that there is still work to be done to find a solution that handles this kind of generality (see the results for Topic 7). We found that manual fine-tuning of the system parameters offers the best results, but this makes the system personalized for the given topics, lowering its scalability to other similar tasks.</p>
      <p>As opposed to last year, we implemented a significantly more complex system, and the future challenge for us is to work towards a scalable system, less dependent on human input, to solve the LMR task. We believe that with the increasing interest in this type of competition it is possible to achieve this goal.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgement</title>
      <p>This work was supported by the Ministry of Innovation and Research,
UEFISCDI, project SPIA-VA, agreement 2SOL/2017, grant
PN-III-P2-2.1-SOL-201602-0002.
</p>
      <p>12. Molino, A., Mandal, B., Lin, J., Lim, J.H., Subbaraju, V., Chandrasekhar, V.: VCI2R@ImageCLEF2017: Ensemble of Deep Learned Features for Lifelog Video Summarization. In: CLEF2017 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org &lt;http://ceur-ws.org&gt;, Dublin, Ireland (September 11-14, 2017)
13. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems 28, pp. 91-99. Curran Associates, Inc. (2015)
14. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (ICLR) (2015)
15. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, June 27-30, 2016. pp. 2818-2826 (2016)
16. Wu, Z., Palmer, M.: Verbs semantics and lexical selection. In: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics. pp. 133-138. ACL '94 (1994)
17. Yamamoto, S., Nishimura, T., Akagi, Y., Takimoto, Y., Inoue, T., Toda, H.: PBG at the NTCIR-13 Lifelog-2 LAT, LSAT, and LEST Tasks. In: Proceedings of the 13th NTCIR Conference on Evaluation of Information Access Technologies. Tokyo, Japan (December 5-8, 2017)
18. Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., Torralba, A.: Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(6), 1452-1464 (June 2018)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bossard</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guillaumin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Gool</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Food-101 - mining discriminative components with random forests</article-title>
          .
          <source>In: European Conference on Computer Vision</source>
          . pp.
          <volume>446</volume>
          {
          <issue>461</issue>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Dalal</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Triggs</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Histograms of oriented gradients for human detection</article-title>
          .
          <source>In: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 1 -</source>
          Volume
          <volume>01</volume>
          . pp.
          <volume>886</volume>
          {
          <fpage>893</fpage>
          . CVPR '
          <volume>05</volume>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Dang-Nguyen</surname>
            ,
            <given-names>D.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boato</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          : Overview of ImageCLEFlifelog 2017:
          <article-title>Lifelog Retrieval and Summarization</article-title>
          .
          <source>In: CLEF2017 Working Notes. CEUR Workshop Proceedings</source>
          , CEUR-WS.org &lt;http://ceur-ws.
          <source>org&gt;</source>
          , Dublin,
          <source>Ireland (September</source>
          <volume>11</volume>
          -14
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Dang-Nguyen</surname>
            ,
            <given-names>D.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lux</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          : Overview of ImageCLEFlifelog 2018:
          <article-title>Daily Living Understanding and Lifelog Moment Retrieval</article-title>
          .
          <source>In: CLEF2018 Working Notes. CEUR Workshop Proceedings</source>
          , vol.
          <volume>11018</volume>
          . CEUR-WS.org &lt;http://ceur-ws.
          <source>org&gt;</source>
          , Avignon,
          <source>France (September 10-14</source>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Dogariu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ionescu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>A Textual Filtering of HOG-based Hierarchical Clustering of Lifelog Data</article-title>
          .
          <source>In: CLEF2017 Working Notes. CEUR Workshop Proceedings</source>
          , CEUR-WS.org &lt;http://ceur-ws.
          <source>org&gt;</source>
          , Dublin,
          <source>Ireland (September</source>
          <volume>11</volume>
          -14
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
          </string-name>
          , J.:
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          . pp.
          <volume>770</volume>
          {
          <issue>778</issue>
          (
          <year>June 2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ionescu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , Muller, H.,
          <string-name>
            <surname>Villegas</surname>
          </string-name>
          , M.,
          <string-name>
            <surname>de Herrera</surname>
            ,
            <given-names>A.G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eickhoff</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Andrearczyk</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cid</surname>
            ,
            <given-names>Y.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liauchuk</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kovalev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ling</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farri</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lungren</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dang-Nguyen</surname>
            ,
            <given-names>D.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lux</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          : Overview of ImageCLEF 2018:
          <article-title>Challenges, datasets and evaluation. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction</article-title>
          .
          <source>Proceedings of the Ninth International Conference of the CLEF Association (CLEF</source>
          <year>2018</year>
          ), vol.
          <volume>11018</volume>
          .
          <source>LNCS Lecture Notes in Computer Science</source>
          , Springer, Avignon,
          <source>France (September 10-14</source>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
          </string-name>
          , G.E.:
          <article-title>Imagenet classification with deep convolutional neural networks</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          <volume>25</volume>
          , pp.
          <volume>1097</volume>
          {
          <issue>1105</issue>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Molino</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fang</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Subbaraju</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lim</surname>
            ,
            <given-names>J.H.:</given-names>
          </string-name>
          <article-title>VCI2R at the NTCIR-13 Lifelog-2 Lifelog Semantic Access Task</article-title>
          .
          <source>In: Proceedings of the 13th NTCIR Conference on Evaluation of Information Access Technologies</source>
          . Tokyo,
          <source>Japan (December 5-8</source>
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>T.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maire</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Belongie</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hays</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perona</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramanan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dollár</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zitnick</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          :
          <article-title>Microsoft coco: Common objects in context</article-title>
          .
          <source>In: European Conference on Computer Vision (ECCV)</source>
          . vol.
          <volume>8693</volume>
          , pp.
          <volume>740</volume>
          {
          <fpage>755</fpage>
          . Zurich (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Lloyd</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Least squares quantization in PCM</article-title>
          .
          <source>IEEE Trans. Inf. Theor</source>
          .
          <volume>28</volume>
          (
          <issue>2</issue>
          ),
          <volume>129</volume>
          { 137 (Sep
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>