<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Visual Concept Selection with Textual Knowledge for Understanding Activities of Daily Living and Life Moment Retrieval*</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tsun-Hsien Tang</string-name>
          <email>thtang@nlg.csie.ntu.edu.tw</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Min-Huan Fu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hen-Hsen Huang</string-name>
          <email>hhhuang@nlg.csie.ntu.edu.tw</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kuan-Ta Chen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hsin-Hsi Chen</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Information Engineering National Taiwan University</institution>
          ,
          <addr-line>Taipei</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Information Science</institution>
          ,
          <addr-line>Academia Sinica, Taipei</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>MOST Joint Research Center for AI Technology and All Vista Healthcare</institution>
          ,
          <addr-line>Taipei</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper presents our approach to the ImageCLEFlifelog 2018 task. Two subtasks, activities of daily living understanding (ADLT) and lifelog moment retrieval (LMRT), are addressed. We attempt to reduce user involvement during the retrieval stage by using natural language processing technologies. The two subtasks are conducted with dedicated pipelines that share a similar methodology. We first obtain visual concepts from the images with a wide range of computer vision tools and propose a concept selection method that prunes noisy concepts with word embeddings, in which textual knowledge is inherent. For ADLT, the retrieved images of a given topic are sorted by time, and the frequency and duration are then calculated. For LMRT, the retrieval is based on ranking the similarity between image concepts and user queries. In terms of performance, our systems achieve a percentage dissimilarity of 47.87% in ADLT and an F1@10 of 39.5% in LMRT.</p>
      </abstract>
      <kwd-group>
        <kwd>Visual Concept Selection</kwd>
        <kwd>Distributed Word Representation</kwd>
        <kwd>Lifelog</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>Wearable devices for personalized multimodal recording, together with dedicated
lifelogging applications for smartphones, have become increasingly popular. For example,
gadgets like GoPro and Google Lens have already attracted consumers’ attention, and
new kinds of media like the video weblog (vlog) that have emerged on YouTube rely
heavily on these devices. On the other hand, the large amounts of personalized data that
are acquired, recorded, and stored still remain challenging for their owners to access. As
a result, a system that helps people summarize and recap precious life moments is in
high demand.</p>
      <p>
        In ImageCLEFlifelog 2018 [
        <xref ref-type="bibr" rid="ref1 ref2">1,2</xref>
        ], two subtasks are conducted to address the issue of
image-based lifelog retrieval. The first subtask, activities of daily living understanding
(ADLT), is aimed at providing a summarization of certain life events for a lifelogger.
The second subtask, lifelog moment retrieval (LMRT), is aimed at retrieving specific
moments in a lifelog such as shopping in a wine store. A key challenge in both subtasks
is the semantic gap between the textual user queries and the visual lifelog data. Users
tend to express their information needs in higher-level, abstract descriptions such as
shopping, dating, and having a coffee, while the visual concepts that computer vision
(CV) tools extract from images are usually a set of concrete objects such as cup, table,
and television. The approaches proposed by previous work [
        <xref ref-type="bibr" rid="ref3 ref4">3,4</xref>
        ] focus on dealing with
the visual information. In this work, we attempt to reduce the semantic gap by using
both visual information and textual knowledge. We propose a framework that integrates
visual and textual information extracted from advanced CV and natural language
processing (NLP) technologies.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        In this section, we briefly discuss recent works on lifelog retrieval. For a retrieval
model, relevance and diversity are two major criteria to achieve. The retrieval of
relevant lifelog data is usually based on modeling the similarity between the textual
concepts, from the user query, and the visual features, from the lifelog data. Diversity, on
the other hand, can be improved by image clustering. For example, Zhou et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
propose a method based on textual concept matching and hierarchical agglomerative
clustering. Ana et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] propose a method based on both visual and metadata
information, along with clustering on the results. So far, various techniques for image
processing and image retrieval have been applied to lifelog retrieval, but relatively few
NLP techniques have been explored in this area.
      </p>
      <p>
        As deep neural networks have achieved remarkable success in computer vision, it is
also tempting to use deeply learned features in lifelog retrieval. For example, features
generated by CNNs have been adopted for the lifelog retrieval task [
        <xref ref-type="bibr" rid="ref3">3</xref>
]. Models for textual
concept extraction include image classification models [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and object detection models
[
        <xref ref-type="bibr" rid="ref6 ref7">6,7</xref>
]. For textual knowledge modeling, deep neural networks also benefit distributed
word representations, also known as word embeddings, in which every word is
represented as a vector in a dense space. There are different implementations for learning
word embeddings [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref8 ref9">8,9,10,11,12,13</xref>
        ], which can be further used in multi-modal
applications.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Retrieval Framework</title>
      <p>
        Both subtasks rely on the information from the provided image set. We extract the
visual concepts from each image by using a wide range of image recognition tools. Before
that, preprocessing is performed to improve image recognition. The images in the
lifelog data are automatically taken with a wearable camera, so many of them suffer
from poor quality: they may be overexposed, underexposed, out of focus, or
ill-composed. We apply blurriness detection and a pixel-wise color histogram [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to filter out
those uninformative images. For the qualified images, several image recognition tools
are integrated to extract visual concepts from a variety of aspects.
      </p>
<p>Image Filtering. We prune low-quality images with blurriness and color diversity
detection. The blurriness metric is defined based on the variance of the Laplacian. We
perform convolution on each image with the Laplacian filter (3x3 kernel), and calculate
the blurriness score as the variance of the convolved result. The images with a variance
below a threshold are considered blurry and undesirable. Moreover, images with a high
color homogeneity are also considered uninformative, and can be detected with
quantized color histograms.</p>
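      <p>The blurriness check above can be sketched as follows. This is a minimal NumPy-only sketch; the 3x3 Laplacian kernel is a standard choice, and the threshold value of 100 is an illustrative assumption rather than the exact setting of our system.</p>

```python
import numpy as np

# 3x3 Laplacian kernel (a standard discrete approximation)
LAPLACIAN = np.array([[0.0,  1.0, 0.0],
                      [1.0, -4.0, 1.0],
                      [0.0,  1.0, 0.0]])

def blurriness_score(gray):
    """Variance of the Laplacian response over a grayscale image.

    Low variance means few sharp edges, i.e., a likely blurry image.
    """
    h, w = gray.shape
    out = np.zeros((h - 2, w - 2))
    # valid-mode 2D convolution, unrolled over the 3x3 kernel
    for dy in range(3):
        for dx in range(3):
            out += LAPLACIAN[dy, dx] * gray[dy:dy + h - 2, dx:dx + w - 2]
    return float(out.var())

def is_blurry(gray, threshold=100.0):
    """Flag an image as blurry; the threshold is a tunable assumption."""
    return blurriness_score(gray) < threshold
```

      <p>A flat image has zero Laplacian variance everywhere and is flagged as blurry, whereas an image with many edges yields a high score.</p>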
<p>Concept Labeling. In order to retrieve lifelog data according to the query style defined
in the two subtasks, effective textual representations for images are crucial. For a given
photo, we would like to know where it was taken, what objects are in it, and even
what action the lifelogger was taking at that moment. To extract visual concepts from
these different aspects, we employ deep learning models, which have shown breakthrough results in recent years.</p>
      <p>
The general concepts and scene of an image can be captured by two DenseNet
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] classifiers pre-trained on ImageNet1K [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and Place365 [
        <xref ref-type="bibr" rid="ref15">15</xref>
], respectively. We
consider classes with an output probability above a threshold as labels. The threshold
is set moderately to ensure a good recall rate. For the details present in images, the object detection
techniques YOLOv2 [
        <xref ref-type="bibr" rid="ref6">6</xref>
] and Faster R-CNN [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] are used. Both tools are pre-trained on
MS COCO [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and Open Images [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] datasets.
      </p>
<p>The Open Images dataset, which consists of 15,440,132 boxes over 600 categories in a
domain close to human daily life, covers most of the topics in the subtasks. The MS
COCO dataset consists of 91 object types with 2.5 million labels in 328k images. We also
utilize the image analysis function provided by the Google Cloud Vision API1. The online
service provides not only fruitful labels but also supports optical character recognition
(OCR), which helps detect and extract text from images. Note that the image concepts
provided by the organizer are also added. After going through the above tools, an image
is tagged with concepts from various aspects, as shown in Fig. 2.
[Fig. 1: sample images (a) 20160815_150731_000, (b) 20160820_101541_000, (c) 20160827_093525_000]</p>
<p>Image recognizers can hardly label the concepts perfectly. For instance, most of
the deep learning tools cannot capture the place in Fig. 1(a); only OCR
identifies the keyword Cafe. In our framework, the ensemble of the outputs of all image
recognizers is considered as the set of candidate visual concepts.
1 https://cloud.google.com/vision/
</p>
      <p>Concept Filtering. We leverage a number of state-of-the-art tools to depict an image
as a set of candidate visual concepts. However, false positive concepts generated by
those tools result in redundancy and noise. For example, an image of an interior scene
like a “bedroom” or “living room” would be supported by visual concepts such as
“couch”, “bed”, and “table”. By contrast, the concept “church” would have lower
similarity to those terms. Based on this idea, we prune the set of candidate visual concepts
by removing the concepts that are less supported by the other concepts, producing a set
of visual concepts for each image. We compute the semantic similarity between
candidate visual concepts by using pre-trained word embeddings and construct a similarity
matrix that represents the similarity of each concept pair. We discard the concepts
that accumulate low similarity with the other concepts. The procedure is
illustrated in Fig. 3 with a real example.
Our framework fetches images based on a given query. In ADLT, we require users to
specify concepts that are highly related to the given topics and to avoid confusing terms,
forming a query term set for each topic. Moreover, a time span according to the topic is
also considered. To ensure the quality and usability of the retrieval system in practice,
preprocessing of the textual data and the retrieval algorithms are other crucial issues.</p>
      <p>Metadata Preprocessing. To keep the retrieval task general, tags available in the
metadata, such as “Home” and “Work” in the location field and “transport” and “walking”
in the activity field, are extracted as attributes of images, instead of using all the location
information. To proceed further with the locational information, we calculate the average
moving speed according to the GPS coordinates to infer the type of transportation (e.g., car,
airplane). For any pair of points {(lat1, lon1), (lat2, lon2)} on the geographical
coordinate system, the distance d (in km) is given by the great-circle distance formula:
d = cos⁻¹(sin(lat1) sin(lat2) + cos(lat1) cos(lat2) cos(|lon2 − lon1|)) × 6371 (1)
The average speed is then calculated from the distance d and the difference
between the timestamps.</p>
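      <p>Equation (1) and the speed estimate can be sketched as follows. Function names are ours; inputs are assumed to be in degrees, with timestamps in seconds.</p>

```python
import math

EARTH_RADIUS_KM = 6371.0

def great_circle_km(lat1, lon1, lat2, lon2):
    """Distance d of Eq. (1): the spherical law of cosines, inputs in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(abs(lon2 - lon1))
    cos_d = (math.sin(p1) * math.sin(p2)
             + math.cos(p1) * math.cos(p2) * math.cos(dlon))
    # clamp against floating-point drift before acos
    return math.acos(max(-1.0, min(1.0, cos_d))) * EARTH_RADIUS_KM

def average_speed_kmh(p1, t1, p2, t2):
    """Average speed between two (lat, lon) fixes; t1, t2 in seconds."""
    hours = abs(t2 - t1) / 3600.0
    return great_circle_km(*p1, *p2) / hours if hours else 0.0
```

      <p>One degree of longitude along the equator is roughly 111.2 km, which gives a quick sanity check for the formula.</p>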
<p>Retrieval Model. The similarity between the user query and an individual image,
represented as a set of visual concepts, is measured with three schemes as follows.
Exact Matching. Given a list of concepts in the user query combined with the
logical operators AND, OR, and NOT, the visual concepts of the image should meet the
condition. This approach returns accurate results when the topic is explicit, e.g., watching
TV or using a cell phone in the car.</p>
      <p>
BM25 Model. Exact matching suffers from a low recall rate. Here, we perform partial
matching by using a classic retrieval model, BM25 [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. The BM25 scheme measures
the relatedness between two sets of concepts based on term frequency (TF) and inverse
document frequency (IDF). In this way, specific concepts are more likely to be
matched. For example, the concept “grocery”, which provides more specific information
than the general concept “indoor” does, has a higher BM25 score due to its higher
inverse document frequency.
      </p>
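      <p>The BM25 scoring can be sketched as follows. The parameter values k1 = 1.5 and b = 0.75 are common defaults, not necessarily the values used in our system.</p>

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document (a list of concept terms) against the query."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))  # document frequency
    scores = []
    for d in docs:
        tf, dl, s = Counter(d), len(d), 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avgdl))
        scores.append(s)
    return scores
```

      <p>As noted above, a rare concept such as “grocery” receives a higher IDF, and hence a higher score, than a ubiquitous one such as “indoor”.</p>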
      <p>Word Embedding. Word embeddings (distributed word representations) have been
widely used in text similarity measurement. For fuzzy matching, we adopt the word
embeddings to measure the semantic relatedness between the concepts in the query and
the concepts extracted from an image. The semantic relatedness is
helpful when similar but not identical concepts are present on the two sides. We first
obtain distributed representations of concepts with word embeddings that are
pretrained on a large-scale corpus, and aggregate concept-level semantics by taking the
element-wise mean for each query/image. In this way, the relatedness between the
query and an image can be computed by using the cosine similarity. Note that by using
the pre-trained word embeddings, external knowledge is inherent in the retrieval model.
</p>
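      <p>The fuzzy-matching step can be sketched as follows; the toy embedding table is for illustration only, whereas our runs use vectors from pre-trained models.</p>

```python
import numpy as np

def mean_embedding(words, emb):
    """Element-wise mean of the available word vectors; OOV words are skipped."""
    vecs = [emb[w] for w in words if w in emb]
    return np.mean(vecs, axis=0) if vecs else None

def query_image_similarity(query, image_concepts, emb):
    """Cosine similarity between the aggregated query and image vectors."""
    q, m = mean_embedding(query, emb), mean_embedding(image_concepts, emb)
    if q is None or m is None:
        return 0.0
    return float(np.dot(q, m) / (np.linalg.norm(q) * np.linalg.norm(m)))
```

      <p>With this scheme, a query about “coffee” scores an image tagged “cup” higher than one tagged “church”, even though neither tag matches the query exactly.</p>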
      <sec id="sec-3-1">
        <title>Image Stream Segmentation</title>
        <p>
          Deeply Learned Features. Pre-trained convolutional neural networks have been
shown to be beneficial for various computer vision tasks as generic image feature
extractors. In this sense, we apply pre-trained CNN tools [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] to extract dense feature
vectors, and estimate the level of change between consecutive images by measuring the
dissimilarity of their representations using Euclidean distance and cosine similarity.
Another neural network-based approach for this purpose is to compress the image with
autoencoders. These methods are appealing because feature extraction can be done
automatically, and the obtained features can easily be integrated with other features.
        </p>
        <p>In this work, we obtain the dense vector for each image with a pre-trained DenseNet
and with a deep autoencoder trained on the provided images. For each pair of two
consecutive images, a threshold on the dissimilarity is heuristically tuned to determine whether
the two images belong to the same event cluster. In addition, smoothing methods
such as the moving average are adopted to prevent consecutive boundaries
from occurring within a short time period.</p>
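      <p>The boundary detection with moving-average smoothing can be sketched as follows; the dissimilarity measure, window size, and threshold value here are illustrative assumptions.</p>

```python
import numpy as np

def cosine_dissimilarity(a, b):
    """1 - cosine similarity between two feature vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def segment_boundaries(features, threshold=0.5, window=3):
    """Mark an event boundary between consecutive images whose smoothed
    feature dissimilarity exceeds the threshold."""
    diffs = [cosine_dissimilarity(features[i], features[i + 1])
             for i in range(len(features) - 1)]
    # moving-average smoothing to suppress spurious boundaries
    kernel = np.ones(window) / window
    smoothed = np.convolve(diffs, kernel, mode="same")
    return [i for i, d in enumerate(smoothed) if d > threshold]
```

      <p>A boundary index i means the event changes between image i and image i + 1.</p>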
        <p>Event Interval. In addition to the features learned by models, human knowledge is also
involved. Due to the nature of the lifelog data and the given topics, human beings can easily
figure out how long a daily activity takes. Therefore, our other approach is
to group the images by a reasonable interval between shooting times. That is, two consecutively
retrieved images are treated as a single event if the difference between their timestamps is
smaller than a topic-specific threshold. The thresholds for each topic are
intuitively defined by humans. For example, a reasonable event interval would be about
60 minutes for having lunch.</p>
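      <p>Grouping by event interval can be sketched as follows; timestamps are assumed to be in seconds, and the per-topic gap threshold is the human-chosen parameter discussed above.</p>

```python
def group_by_interval(timestamps, max_gap_s):
    """Group sorted timestamps into events: a new event starts whenever the
    gap to the previous image exceeds the topic-specific threshold."""
    events = []
    for t in sorted(timestamps):
        if events and t - events[-1][-1] <= max_gap_s:
            events[-1].append(t)  # same event: small gap
        else:
            events.append([t])    # large gap: start a new event
    return events
```
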
      </sec>
      <sec id="sec-3-2">
        <title>Daily Activities Summarization</title>
        <p>For ADLT, the system output should be two real values indicating the frequency and
total duration of the given topic, respectively. The frequency could be calculated by
summing up the number of segmented retrieved events. The duration is obtained by
summing up the time differences between the first frame and the last frame of each event.</p>
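      <p>Given the segmented events, the two ADLT outputs follow directly. A minimal sketch, assuming each event is a list of frame timestamps in seconds:</p>

```python
def summarize_activity(events):
    """Frequency = number of events; duration = sum of per-event time spans."""
    frequency = len(events)
    duration_s = sum(ev[-1] - ev[0] for ev in events)
    return frequency, duration_s
```
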
      </sec>
      <sec id="sec-3-3">
        <title>Life Moment Retrieval</title>
<p>For LMRT, the model is designed to retrieve a number of images indicating specific
life moments with respect to the target query. There are 10 topics in total, each consisting of
a title, a short description, and a longer, detailed narrative. For query processing, we
extract key terms from the title and the description only, since the intention in the
narrative is often too complicated to extract without human assistance. This can be done
by simply removing function words in each query. Note that further processing such as
stemming is unnecessary due to the use of word embeddings. The resulting concepts
can be further improved by a human, according to the narrative or general
knowledge. Finally, the query is transformed into a vector representation and compared
with the pre-computed vector representation of each image.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
<p>For the two subtasks, automatic runs are submitted. For each subtask, we first
describe the parameter settings and then show the experimental results. We report only the
execution time of our system for testing, since we exploit pre-trained tools.</p>
      <sec id="sec-4-1">
        <title>Activities of Daily Living Understanding Task</title>
<p>In the first trial of ADLT, we use queries consisting of the concepts automatically
parsed from the given topics. We extract nouns and gerunds as concepts and list them as the
query. For the time condition, the information is directly provided in the &lt;span&gt; tag.
However, some of the required concepts, like socializing or having a party, seldom appear
in the visual concept sets, since most CV tools are trained with shallow semantic
image descriptions. To deal with this issue, the topics consisting of abstract ideas are
refined by a human.</p>
        <p>
As mentioned in previous sections, the visual concepts of each image labeled by CV
tools are far from perfect. For this reason, visual concept filtering is applied to discard
the concepts with low relevance to the other concepts in the same image. After we obtain
a list of visual concepts sorted by the row sum of the similarity matrix for each image, the
top half of the concepts is retained. In our observation, this threshold is flexible.
In addition, two kinds of pre-trained word embeddings are used here:
GloVe [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] trained on Common Crawl with 840B tokens and ConceptNet Numberbatch
[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. The comparison in percentage dissimilarity [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] is shown in Table 1, where (G) and
(N) denote GloVe and ConceptNet Numberbatch word vectors, respectively.
        </p>
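      <p>The filtering step (row sums of the pairwise similarity matrix, keeping the top half) can be sketched as follows; the toy embedding table is for illustration, whereas our runs use GloVe or ConceptNet Numberbatch vectors.</p>

```python
import numpy as np

def filter_concepts(concepts, emb, keep_ratio=0.5):
    """Keep the top `keep_ratio` of concepts ranked by the row sum of the
    pairwise cosine-similarity matrix."""
    # unit-normalize so that the dot product is the cosine similarity
    vecs = np.array([emb[c] / np.linalg.norm(emb[c]) for c in concepts])
    sim = vecs @ vecs.T                # similarity matrix
    support = sim.sum(axis=1)          # row sums: support from other concepts
    k = max(1, int(len(concepts) * keep_ratio))
    keep = np.argsort(-support)[:k]    # indices of the best-supported concepts
    return [concepts[i] for i in sorted(keep)]
```

      <p>An outlier concept such as “church” among interior-scene concepts accumulates a low row sum and is discarded.</p>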
<p>For summarization, two types of image stream segmentation are tried on the
development set. We find that segmentation based on deeply learned features results in
unstable boundaries and limited performance; basically, it is hard to
determine the threshold on the dissimilarity between two images. On the other hand, using a
human-defined reasonable interval for each event type, an intuitive parameter for
daily activities, provides more sensible boundaries. As a result, we apply the latter
method to the test set.</p>
<p>In general, our framework achieves a percentage dissimilarity of 0.3850 with
human-refined queries and visual concept filtering, as reported by the online evaluation platform.
The inference time for each topic is 5.59 s on average. Error analysis shows that our
system fetches many unnecessary images for topic 1 because the CV tools cannot
identify “coffee cup” in the lifelog data. For this reason, we refine the query by using
only “cup”, instead of “coffee cup”, to ensure the recall rate. It turns out that images
with bottles or mugs are also retrieved (under the office scenario).</p>
<p>To enhance the precision for topic 1, an ad-hoc method is further introduced to filter
out surplus retrieved items. First, we observe that the coffees were bought by the
lifelogger from the same shop and served in the same red cup. We therefore specify upper and
lower bounds on the RGB values to capture red objects in a given shot. For
spatial verification, we preserve images containing a red area larger than a
threshold. The above operations are shown in Fig. 4. By distinguishing the
red coffee cup from other types of cups, the results for topic 1 are greatly enhanced, and
the overall performance is also increased, as shown in Table 1.</p>
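      <p>The ad-hoc red-cup filter can be sketched as follows; the RGB bounds and area threshold here are illustrative assumptions, as ours were tuned by hand on the development set.</p>

```python
import numpy as np

# illustrative RGB bounds for "red"; the actual values were tuned by hand
RED_LO = np.array([150, 0, 0])
RED_HI = np.array([255, 90, 90])

def red_area_ratio(img):
    """Fraction of pixels whose RGB values fall inside the red bounds.

    `img` is an H x W x 3 uint8 array.
    """
    mask = np.all((img >= RED_LO) & (img <= RED_HI), axis=-1)
    return float(mask.mean())

def has_red_object(img, min_ratio=0.01):
    """Spatial verification: keep images whose red area exceeds a threshold."""
    return red_area_ratio(img) >= min_ratio
```
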
        <p>
The best result among all our submissions, i.e., a percentage dissimilarity of 0.4787,
is achieved by the combination of the fine-tuned queries, the concept selection mechanism with
the ConceptNet Numberbatch word embeddings, and the coffee-capturing trick. The ad-hoc
image filtering for topic 1 produces a significant improvement. For the construction of the
similarity matrix used in visual concept filtering, the ConceptNet Numberbatch
embeddings, built using an ensemble that combines data from ConceptNet [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], word2vec [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ],
GloVe [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], and OpenSubtitles2016 [
          <xref ref-type="bibr" rid="ref19">19</xref>
], provide better results compared to GloVe.
The effectiveness of concept filtering is shown in Fig. 5.</p>
        <p>For LMRT, we submit a total of four successful runs, which are either fully automatic
or semi-automatic with human fine-tuned queries. In this subtask, we use a subset of the
CV tools based on a preliminary evaluation on the development set. The relatedness
between images and queries is measured with the word embedding method
described in Section 3.2. We employ 300-dimensional word embeddings pre-trained
on 16B tokens with fastText [
          <xref ref-type="bibr" rid="ref11">11</xref>
]. Out-of-vocabulary words are ignored. The first run
uses automatically extracted query terms to match image concepts. We consider this
run the baseline method.
        </p>
<p>In the remaining runs, we add time and location constraints to each query and re-rank
the retrieved images using the provided metadata. This information is either inferred
automatically by NLP APIs2 in Run 3, or given by a human in Run 4 and Run 5. We use
a heuristic re-ranking method that increases the score by a weight wl if the location
constraint is satisfied, and decreases the score by a weight wt if the time constraint
is not satisfied. The weight wl is set to 0.8, and wt to 0.1, in the experiments.
2 https://cloud.google.com/natural-language/</p>
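      <p>The heuristic re-ranking can be sketched as follows, with wl = 0.8 and wt = 0.1 as in our experiments; the (score, loc_ok, time_ok) triple layout is an assumption for illustration.</p>

```python
def rerank(scored_images, w_l=0.8, w_t=0.1):
    """Re-rank images given (score, loc_ok, time_ok) triples.

    Add w_l when the location constraint is satisfied; subtract w_t when
    the time constraint is violated. Returns indices in descending order.
    """
    adjusted = [score + (w_l if loc_ok else 0.0) - (0.0 if time_ok else w_t)
                for score, loc_ok, time_ok in scored_images]
    return sorted(range(len(adjusted)), key=lambda i: -adjusted[i])
```
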
<p>In Run 4, we fine-tune the queries to get better results. Each query is rewritten into
a form consisting of the following fields: positively related concepts, negatively
related concepts, a location constraint, and a time constraint. For the negatively related concepts,
we assign a weight of -1 to their word representations before vector aggregation. Table
2 shows an example in which the automatic query is extracted from the title and the
description, and the fine-tuned query is obtained by removing relatively useless query terms and
manually adding additional concepts, according to the narrative and the retrieved images. Once
the query is modified, the system retrieves new results, and the query can be further
improved based on the new results. In the last run, we perform clustering on the retrieved
images based on the event intervals described in Section 3.3.</p>
        <p>Title: Interviewed by a TV presenter
Description: Find all the moments when I was interviewed by TV presenter.
Narrative: The moment must show the cameras or cameramen in front of the
lifelogger. The interviews can occur at the lifelogger's home or in the office
environment.</p>
<p>Automatic:
positive = {interviewed, TV, presenter}; negative = {};
location = {}; time = {}
Fine-tuned:
positive = {camera, person}; negative = {};
location = {home, work}; time = {}</p>
        <p>
Table 3 and Fig. 6 show the performance of our method for LMRT. The inference
time (without clustering) for each topic is 2.46 s on average. The best result on the test
set is achieved by the combination of fine-tuned queries with time/location
constraints, showing that it is crucial to have a human give more precise query expressions.
We also notice that there is no improvement from clustering. A possible reason is that
some queries in the test set ask for rare events, such as assembling furniture. Under the
scoring metric, cluster recall at 10 (CR@10) [
          <xref ref-type="bibr" rid="ref1">1</xref>
], used in this subtask, there is no benefit
to clustering events that occur fewer than ten times. Instead, inaccurate results may be
introduced, which decreases the score.
        </p>
        <p>
For LMRT, we employ pre-trained fastText embeddings for the official runs. As word embeddings
play a crucial role in this task, we also try a number of off-the-shelf word embeddings
for the similarity computation. Details of these embeddings can be found in [
          <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref8 ref9">8,9,10,11,
12,13</xref>
]. These models are designed in quite different ways, but all produce semantic
representations for individual words in a latent space. Fig. 7 compares the different word
embeddings. We report F1@10 scores per query on the development set with a fully
automated approach. The results suggest that it is beneficial to use word
embeddings that are associated with additional contextual information, such as syntactic
dependency or lexical ontology.
        </p>
<p>Another advantage of adopting word embeddings is that the query words themselves are sometimes
absent while related or similar concepts are present in the desired images. With word
embeddings, we have an opportunity to capture this kind of relation. For example, it
is possible to match an image with the concepts {bowl, food, knife, meal, person, salad, …}.</p>
        <p>This paper presents our approaches to daily activity summarization and moment
retrieval for lifelogging. In both subtasks, we introduce external textual knowledge to
reduce the semantic gap between the user query and the visual concepts extracted by
the latest CV tools. Experimental results show that the ensemble distributed word model,
ConceptNet Numberbatch, provides effective word embeddings in both subtasks.</p>
<p>Experimental results also suggest that better performance can be achieved by using
fine-tuned queries. That means there is still room for improvement in bridging
the gap between abstract human intentions and concrete visual concepts.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Dang-Nguyen</surname>
            ,
            <given-names>D.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lux</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          : Overview of ImageCLEFlifelog 2018:
          <article-title>Daily Living Understanding and Lifelog Moment Retrieval</article-title>
          .
          <source>In: CLEF2018 Working Notes (CEUR Workshop Proceedings).</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ionescu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müller</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Herrera</surname>
            ,
            <given-names>A.G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eickhoff</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Andrearczyk</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cid</surname>
            ,
            <given-names>Y. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liauchuk</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kovalev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ling</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farri</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lungren</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dang-Nguyen</surname>
            ,
            <given-names>D.-T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lux</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          : Overview of ImageCLEF 2018:
          <article-title>Challenges, datasets and evaluation</article-title>
          .
          <source>In proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018), Avignon, France, September 10-14. LNCS, Springer</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boato</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dang-Nguyen</surname>
            ,
            <given-names>D.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          : Organizer Team at ImageCLEFlifelog 2017:
          <article-title>Baseline Approaches for Lifelog Retrieval and Summarization</article-title>
          .
          <source>In proceedings of CLEF</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Garcia del Molino</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mandal</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lim</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Subbaraju</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chandrasekhar</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>VC-I2R@ImageCLEF2017: Ensemble of Deep Learned Features for Lifelog Video Summarization</article-title>
          .
          <source>In proceedings of CLEF</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maaten</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weinberger</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Densely Connected Convolutional Networks</article-title>
          .
          <source>In proceedings of CVPR</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Redmon</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farhadi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>YOLO9000: Better, faster, stronger</article-title>
          .
          <source>arXiv:1612.08242</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rathod</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Korattikara</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fathi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fischer</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wojna</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guadarrama</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murphy</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Speed/accuracy trade-offs for modern convolutional object detectors</article-title>
          .
          <source>In proceedings of CVPR</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Speer</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Havasi</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>ConceptNet 5.5: An Open Multilingual Graph of General Knowledge</article-title>
          .
          <source>In proceedings of AAAI</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>GloVe: Global Vectors for Word Representation</article-title>
          .
          <source>In proceedings of EMNLP</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In proceedings of NIPS</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Bag of Tricks for Efficient Text Classification</article-title>
          .
          <source>arXiv:1607.01759</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Enriching Word Vectors with Subword Information</article-title>
          .
          <source>arXiv:1607.04606</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Dependency-Based Word Embeddings</article-title>
          .
          <source>In proceedings of ACL</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.-J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>ImageNet: A Large-Scale Hierarchical Image Database</article-title>
          .
          <source>In proceedings of CVPR</source>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lapedriza</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khosla</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliva</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Torralba</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Places: A 10 Million Image Database for Scene Recognition</article-title>
          .
          <source>In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>T.-Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maire</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Belongie</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hays</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perona</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramanan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dollár</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Zitnick</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          : Microsoft COCO:
          <article-title>Common objects in context</article-title>
          .
          <source>In proceedings of ECCV</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Krasin</surname>
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duerig</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alldrin</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferrari</surname>
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abu-El-Haija</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsova</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rom</surname>
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uijlings</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Popov</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kamali</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malloci</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pont-Tuset</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Veit</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Belongie</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomes</surname>
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chechik</surname>
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cai</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feng</surname>
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Narayanan</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murphy</surname>
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>OpenImages: A public dataset for large-scale multi-label and multi-class image classification</article-title>
          . (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaragoza</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>The Probabilistic Relevance Framework: BM25 and Beyond</article-title>
          .
          <source>In Foundations and Trends in Information Retrieval, Vol. 3, Issue 4. ACM</source>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Lison</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tiedemann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles</article-title>
          .
          <source>In proceedings of LREC</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>