<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Image annotation and two paths to text illustration</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Herve Le Borgne</string-name>
          <email>herve.le-borgne@cea.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Etienne Gadeski</string-name>
          <email>etienne.gadeski@cea.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ines Chami</string-name>
          <email>ines.chami@cea.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thi Quynh Nhi Tran</string-name>
          <email>thiquynhnhi.tran@cea.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Youssef Tamaazousti</string-name>
          <email>youssef.tamaazousti@cea.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexandru Lucian Ginsca</string-name>
          <email>alexandru.ginsca@cea.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adrian Popescu</string-name>
          <email>adrian.popescu@cea.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CEA, LIST, Laboratory of Vision and Content Engineering</institution>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper describes our participation in the ImageCLEF 2016 scalable concept image annotation main task and Text Illustration teaser. Regarding image annotation, we focused on better localizing the detected features. For this, we identified salient regions of the image to collect a list of potentially interesting places within it. We also added a specific human-attribute detector that boosted the results of the best-performing team in 2015. For text illustration, we proposed two complementary approaches. The first one relies on semantic signatures that give a textual description of an image; this description is then matched to the textual query. The second approach learns a common latent space in which visual and textual features are directly comparable. We propose a robust description, as well as the use of an auxiliary dataset to improve retrieval. While the first approach only uses external data, the second one was mainly learned from the provided training dataset.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        This paper describes our participation in the ImageCLEF 2016 [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] scalable
concept image annotation main task (IAL: image annotation and localization)
and Text Illustration teaser that are described in detail in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        Regarding image annotation, we improved our 2015 system [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and focused
on better localizing the detected features. In 2015, we proposed a concept
localization pipeline which uses the spatial information that CNNs offer. To improve
on this, we identified salient regions of the image to collect a list of potentially
interesting places, then detected the concepts found in these boxes. We
also added a specific human-attribute detector that boosted the results of the
best-performing team in 2015.
      </p>
      <p>For text illustration, we proposed two complementary approaches. The first
one relies on semantic signatures that give a textual description of an image. This
description is further matched to the textual query. The second approach relies
on learning a common latent space, in which visual and textual features
are directly comparable, using a robust description and an auxiliary dataset to
improve retrieval. While the first approach only uses external data, the second
one was mainly learned from the provided training dataset.</p>
      <p>This manuscript is organized as follows. Section 2 deals with our participation
in the image annotation and localization subtask, while Section 3 is dedicated
to the text illustration teaser. In each case, we discuss some limits of the task
itself that are important to better understand the results. Then we present the
method(s) and finally comment on the results of the campaign.</p>
    </sec>
    <sec id="sec-2">
      <title>Image annotation task</title>
      <sec id="sec-2-1">
        <title>Dataset limitation</title>
        <p>
          As highlighted last year by the team that obtained the best results [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], the
development set of the image annotation subtask (and it is probably the case
for the test set as well) suffers from severe limitations due to the crowd-sourced
ground-truth annotation. They explain that the annotations are inconsistent,
incomplete, sometimes incorrect, and that there are even some cases that are "impossible"
according to the assumptions of the task (e.g. the fact that there are at most 100
concepts per image).
        </p>
        <p>It seems these issues have not been addressed in the 2016 development set,
thus the results are still subject to some limitations. On the other hand,
the task is thus consistent with last year's, and we can directly compare
the improvement from one year to the next.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Method</title>
        <p>
          In this section, we detail the training and testing frameworks that we used. Our
method is based upon deep CNNs, which have lately shown outstanding
performance in diverse computer vision tasks such as object classification, localization
and action recognition [
          <xref ref-type="bibr" rid="ref20 ref7">20, 7</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Training</title>
        <p>
          Data. We collected a set of roughly 251,000 images (1,000 images per concept)
from the Bing Images search engine. For each concept we used its name and its
synonyms (if present) to query the search engine. This dataset is of course noisy,
but prior work showed this is not a big issue when training a deep CNN [
          <xref ref-type="bibr" rid="ref29 ref6">6, 29</xref>
          ]. We used
this additional data to train a 16-layer CNN [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] and the 50-layer ResNet [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
We used 90% of the dataset for training and 10% for validation.
Network Settings. The networks were initialized with ImageNet weights. The
initial learning rate is set to 0.001 and the batch size is set to 256. The last layer
(the classifier) is trained from scratch, i.e. it is initialized with random weights
sampled from a Gaussian distribution (σ = 0.01, μ = 0) and its learning
rate is 10 times larger than for the other layers. During training, the dataset is
augmented with random transformations: RGB jittering, scale jittering, contrast
adjustment, JPEG compression and flips. It is known that data augmentation
leads to better models [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] and reduces overfitting. Finally, the networks take
a 224×224 RGB image as input and produce 251 outputs, i.e. the number
of concepts. The models were trained on a single Nvidia Titan Black with our
modified version of the Caffe framework [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
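<p>The random training-time transformations described above can be sketched as follows. This is a minimal numpy sketch under illustrative assumptions: the actual pipeline was implemented in a modified Caffe, and the parameter ranges and transformation subset here (RGB jitter, contrast, flip) are not those of the paper.</p>

```python
import numpy as np

def augment(img, rng):
    """Apply random training-time transformations to an HxWx3 RGB image in [0, 255].

    Parameter ranges are illustrative assumptions, not the paper's settings.
    """
    img = img.astype(np.float32)
    # RGB jittering: add a small random offset per channel.
    img += rng.uniform(-10, 10, size=(1, 1, 3))
    # Contrast adjustment: scale deviations around the mean intensity.
    mean = img.mean()
    img = mean + rng.uniform(0.8, 1.2) * (img - mean)
    # Horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        img = img[:, ::-1, :]
    return np.clip(img, 0, 255)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(224, 224, 3))
out = augment(img, rng)
```

<p>Scale jittering and JPEG compression, which need resampling and codec support, are omitted from this sketch.</p>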
        <p>Localizing concepts. We provide two approaches to detect the concepts and
localize them.</p>
        <p>
          The first method, named FCN, is the same as described in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. It is a simple
and efficient framework where concept detection and localization are done
simultaneously with a single forward pass of the image to process. More
in-depth information about this framework can be found in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>
          The second method is based upon the generic object detector EdgeBoxes [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ],
which takes an image as input and produces R regions where visual objects are
likely to appear (objectness detection). In our experiments, we extracted a
maximum of 100 regions per image, then fed each one to the CNN models. We
finally kept the concept that had the highest probability among the 251 concepts.
Therefore this framework outputs R predictions per image.
        </p>
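<p>The per-region decision step can be sketched as follows. This is only the "one prediction per region" logic, assuming raw CNN scores over the 251 concepts are already available for each EdgeBoxes region (the region proposal and CNN forward pass themselves are not reproduced here).</p>

```python
import numpy as np

def classify_regions(region_scores):
    """Given raw CNN scores for R regions over 251 concepts (shape R x 251),
    return, for each region, its most likely concept and that concept's probability."""
    # Softmax over concepts, computed independently for each region.
    e = np.exp(region_scores - region_scores.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    best = probs.argmax(axis=1)                    # one concept per region
    conf = probs[np.arange(len(best)), best]       # its probability
    return best, conf

rng = np.random.default_rng(0)
scores = rng.standard_normal((100, 251))           # up to 100 EdgeBoxes regions
concepts, confidences = classify_regions(scores)
```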
        <p>
          In addition to these methods, we also used a face detector [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] to categorize
more precisely the faces, eyes, noses and mouths. We extracted those features
on all images and aggregated the results with the boxes detected by the CNN
frameworks. It was our belief that it would slightly boost our accuracy, since
this kind of "object" is quite hard to capture even with a good generic
object detector.
        </p>
        <p>Combination of runs. We also combined some runs by simply concatenating
the detected boxes. When the number of boxes was above the allowed limit (100),
we randomly removed some of them (this case was very rare).</p>
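<p>The run combination is straightforward and can be sketched as follows (a minimal sketch; box representation and the random-number seed are arbitrary choices for illustration):</p>

```python
import random

def combine_runs(runs, limit=100, seed=0):
    """Concatenate the detected boxes of several runs for one image;
    if the total exceeds the allowed limit, randomly drop the excess."""
    boxes = [b for run in runs for b in run]
    if len(boxes) > limit:
        boxes = random.Random(seed).sample(boxes, limit)
    return boxes

run_a = [("dog", (10, 10, 50, 50))] * 60
run_b = [("cat", (5, 5, 40, 40))] * 70
combined = combine_runs([run_a, run_b])   # 130 boxes trimmed to 100
```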
      </sec>
      <sec id="sec-2-4">
        <title>Results</title>
        <p>
          We submitted ten runs to the campaign, with settings that allow us to measure the
benefit of different choices. We studied the influence of three parameters:
- our last year's method used to localize the concepts [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] compared to the use
of EdgeBoxes [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ];
- the CNN architecture, by comparing VGG [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] and ResNet [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], which obtained
good results at the ILSVRC campaigns of 2014 and 2015;
        </p>
        <p>
          - the use of a face part detector [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ].
Results of individual runs are reported in Table 1.
        </p>
        <p>On the ILSVRC challenge, VGG had a 7.3% classification error and ResNet
obtained 5.7%. On the 2016 ImageCLEF dataset, we obtain similar results with
VGG and ResNet when we use FCN to localize the concepts, and VGG is better
with EdgeBoxes. Regarding the VGG-based scores, we noticed that our results
are about 8 points better than last year, showing the benefit of the new learning
process.</p>
        <p>
          The use of a face part detector does not significantly improve our results, and
they are even lower when we use the EdgeBoxes-based localization. This is quite
surprising, since a similar process boosted the results of last year's best
performing team from 30.39 to 65.95 [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. However, regarding these results, we should
note that this "boost" was observed on the test dataset only (on the development
set, the performances were more or less the same with and without the body
part detection). It is also hard to explain how the results can increase by 35
points while the body-part detector deals with fewer than 10 classes among 250.
Following a discussion with the organizers of the campaign, it seems that there
was a bug in the evaluation script (fixed since then, and probably reported in
this year's overview [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]) and that detecting body parts is finally not very
interesting, making our results in line with other participants' findings. However,
there are still some unexpected results with regard to the concepts that are
directly concerned with face part detection. In Table 2, we report the results for
four of these concepts and two different settings (results of ResNet+EdgeBoxes
are similar to VGG+EdgeBoxes). With FCN, the behavior is in line with
expectation. On the contrary, with EdgeBoxes, the concepts mouth and eye are
perfectly detected, which is quite unlikely. Although there are obvious issues with
EdgeBoxes as explained below, this strange result may be due to a remaining
bug in the evaluation script.
        </p>
        <p>The most disappointing result is that the EdgeBoxes-based
localization gives globally lower results than the FCN one. A possible reason is that
EdgeBoxes generates many more boxes than FCN and that a significant part of them
leads to wrong concept estimations, hence penalizing the global score.</p>
        <p>Last, we evaluated the combination of runs, as reported in Table 3. Once
again, the results are quite disappointing, since the more we combine, the lower
the results are. Since the combination of runs is a simple concatenation of the
boxes found by each run, it is not clear to us how the results can decrease (the
mAP should be at least as good as that of the weakest run). It probably results from
the way the results are evaluated, but unfortunately the exact method used is
not available.</p>
        <p>The Text Illustration task consists of matching textual queries to images without using the
textual features derived from the web pages the images belong to, although the latter
are available since they are part of the 500k noisy dataset. In practice, the
queries were mainly obtained by removing the HTML tags from these web pages
and retaining all the remaining text. This raises an issue with regard to the
realism of the task.</p>
        <p>Indeed, when one wants to illustrate a text in practice, one would submit the
interesting part of the text as a query to the system. It does not make sense to
add noisy data to the query, such as that coming from the generic task bar,
as in the query --/--diUdSrlGyv7zF4 that starts with:</p>
        <p>
          Taakbalk Navigation Subnavigation Content home Who is who
organisational chart contact intranet nederlands zoekterm: Navigatie About
K.U.Leuven Education Research Admissions Living in Leuven Alumni
Libraries Faculties, Departments &amp; Schools International cooperation
Virology - home Current Labmembers Former Labmembers Research
Projects Publications Contact Us Where To Find Us Courses (...)
Of course, it is hard for the organizers to extract this "relevant text" at
a large scale, since there are 180,000 queries. However, this could be part of
the task: if the query was the actual HTML page, the system could include an
automatic search of the relevant text by using the DOM structure of the query.
Specific object detectors have been developed for a long time, to be able to
recognize e.g. faces [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], pedestrians [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] or buildings [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. More recently, it has been
proposed to use a set of object or concept detectors as image descriptors [
          <xref ref-type="bibr" rid="ref17 ref24">24, 17</xref>
          ],
introducing the "semantic features". With this approach, images are described
in a fixed-size vector space, as is the case with bags of visual words, Fisher
kernels [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] or even when one uses the last fully connected layer of a CNN as
a feature [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. However, contrary to these approaches, each dimension of a
semantic feature is associated with a precise concept that makes sense for a human
(Fig. 1). The "semantic signature approach" to text illustration thus consists in:
- (i) extracting relevant concepts from the reference images;
- (ii) expressing the corresponding concepts with words and indexing them;
- (iii) matching the query to the index textually.
During the campaign, we tested several alternatives for each of these steps.
        </p>
        <p>
          Regarding step (i), our system is based on recently published work [
          <xref ref-type="bibr" rid="ref22 ref23">23, 22</xref>
          ],
which is itself an extension of the Semfeat descriptor [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Relying on powerful
mid-level features such as [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], this semantic feature is a large set of linear classifiers
that are built automatically. The authors showed that keeping a small part of
the K most active concepts and setting the others to zero (sparsification) led
to a more efficient descriptor for an image retrieval task. However, there are
two limitations to this approach: first, the value of K has to be fixed in advance;
secondly, sparsification is not efficient in a classification context, in the sense that
the performance obtained is below that of the mid-level feature it is built on. For
these reasons, [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] proposed to compute K for each image independently, with
regard to the actual content of the image. The principle is to keep only the "dominant
concepts", i.e. those whose detection we are confident in.
        </p>
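<p>The adaptive sparsification can be sketched as follows. The retention rule below (keep dimensions whose activation reaches a fraction of the maximum) is an illustrative assumption; the exact per-image criterion of the cited work may differ.</p>

```python
import numpy as np

def dominant_concepts(scores, ratio=0.5):
    """Sparsify a semantic signature by keeping only its 'dominant' concepts.

    Illustrative rule: retain dimensions whose activation is at least
    `ratio` times the maximum one; all others are set to zero, so the
    number of kept concepts K varies with the image content.
    """
    return np.where(scores >= ratio * scores.max(), scores, 0.0)

sig = np.array([0.9, 0.05, 0.6, 0.1, 0.02])   # toy 5-concept signature
sparse_sig = dominant_concepts(sig)            # keeps dimensions 0 and 2
```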
        <p>
          For step (ii), we considered two sets of concepts. The first one, based on
WordNet, contains 17,467 concepts, each being described by its main synset
(one word). The second set is the one collected automatically in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. It contains
around 30,000 concepts derived from Flickr groups, each described by three
words.
        </p>
        <p>For the textual matching step (iii), we considered classical inverted indexing,
computing the query-document similarity from the "weight/score" associated with
each indexed document.</p>
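<p>The matching step can be sketched with a minimal inverted index, assuming each reference image has already been reduced to a weighted bag of concept words by steps (i) and (ii); the additive scoring below is one simple instance of the "weight/score" similarity.</p>

```python
from collections import defaultdict

def build_index(docs):
    """Inverted index: word -> {doc_id: weight}, from per-image concept bags."""
    index = defaultdict(dict)
    for doc_id, concepts in docs.items():
        for word, weight in concepts.items():
            index[word][doc_id] = weight
    return index

def search(index, query_words):
    """Score each document by summing the weights of the query words it contains."""
    scores = defaultdict(float)
    for word in query_words:
        for doc_id, weight in index.get(word, {}).items():
            scores[doc_id] += weight
    return sorted(scores.items(), key=lambda kv: -kv[1])

docs = {
    "img1": {"dog": 0.9, "grass": 0.4},
    "img2": {"cat": 0.8, "sofa": 0.5},
}
index = build_index(docs)
ranking = search(index, ["dog", "grass"])   # only img1 matches
```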
      </sec>
      <sec id="sec-2-5">
        <title>Text-image common space approach</title>
        <p>
          The design of common latent spaces has been proposed for a while [
          <xref ref-type="bibr" rid="ref16 ref19">16, 19</xref>
          ], in
particular in the case of textual and visual modalities [
          <xref ref-type="bibr" rid="ref11 ref12 ref2 ref32 ref8">12, 32, 11, 2, 8</xref>
          ]. Given two
modalities, say a visual and a textual modality described by their respective
features, the general idea is to learn a latent common sub-space of both feature
spaces, such that visual points are directly comparable to textual ones. One of
the recent popular approaches is Canonical Correlation Analysis (CCA), in
particular in its kernelized version (KCCA) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>Let us consider N data samples {(x_T^i, x_I^i)}_{i=1..N} ⊂ R^{d_T} × R^{d_I}, simultaneously
represented in two different vector spaces. The purpose of CCA is to find
maximally correlated linear subspaces of these two vector spaces. More precisely, if
one notes X_T ∈ R^{d_T} and X_I ∈ R^{d_I} two random variables, CCA simultaneously
seeks directions w_T ∈ R^{d_T} and w_I ∈ R^{d_I} that maximize the correlation between
the projections of x_T onto w_T and of x_I onto w_I:

(w_T, w_I) = argmax_{w_T, w_I} (w_T' C_{TI} w_I) / sqrt((w_T' C_{TT} w_T)(w_I' C_{II} w_I))   (1)

where C_{TT}, C_{II} denote the autocovariance matrices of X_T and X_I respectively,
while C_{TI} is the cross-covariance matrix. The solutions w_T and w_I are found by
solving an eigenvalue problem. The d eigenvectors associated with the d largest
eigenvalues define maximally correlated d-dimensional subspaces of R^{d_T} and
R^{d_I} respectively. Even though these are linear subspaces of two different spaces,
they are often referred to as a "common" representation space.</p>
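<p>The linear CCA of Eq. (1) can be sketched as follows. This is a minimal numpy sketch solving the problem in its equivalent whitened-SVD form (the singular values are the canonical correlations); the small ridge term and the toy data are illustrative assumptions, and the kernelized variant is not reproduced.</p>

```python
import numpy as np

def cca(X_T, X_I, d=1, reg=1e-6):
    """Linear CCA: find d maximally correlated directions in both spaces."""
    X_T = X_T - X_T.mean(axis=0)
    X_I = X_I - X_I.mean(axis=0)
    n = X_T.shape[0]
    C_tt = X_T.T @ X_T / n + reg * np.eye(X_T.shape[1])
    C_ii = X_I.T @ X_I / n + reg * np.eye(X_I.shape[1])
    C_ti = X_T.T @ X_I / n

    def inv_sqrt(C):
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    # Whitened cross-covariance; its SVD solves the CCA eigenvalue problem.
    M = inv_sqrt(C_tt) @ C_ti @ inv_sqrt(C_ii)
    U, s, Vt = np.linalg.svd(M)
    W_T = inv_sqrt(C_tt) @ U[:, :d]       # projection for the textual space
    W_I = inv_sqrt(C_ii) @ Vt[:d].T       # projection for the visual space
    return W_T, W_I, s[:d]                # s holds the canonical correlations

rng = np.random.default_rng(0)
z = rng.standard_normal((500, 1))                      # shared latent factor
X_T = np.hstack([z, rng.standard_normal((500, 2))])    # toy "textual" features
X_I = np.hstack([z + 0.01 * rng.standard_normal((500, 1)),
                 rng.standard_normal((500, 2))])       # toy "visual" features
W_T, W_I, corr = cca(X_T, X_I)                         # corr[0] close to 1
```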
        <p>KCCA removes the linearity constraint by using the "kernel trick":
the data from each initial space are first mapped to the reproducing kernel Hilbert
space (RKHS) associated with a selected kernel, and correlated subspaces are then
sought in these RKHS.</p>
        <p>
          In this space, textual and visual documents are directly comparable, thus it
is possible to perform cross-modal retrieval [
          <xref ref-type="bibr" rid="ref11 ref12 ref2 ref8">12, 11, 2, 8</xref>
          ]. However, it has been
recently found that the learned common space may not adequately represent all
the data [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. A more robust representation of the data within the
common space has thus been proposed, consisting in coding the original visual and textual points with
respect to a codebook (Figure 2a). This method, named Multimedia Aggregated
Correlated Components (MACC), is detailed in [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. Another contribution that
"compensates" for the defects of the representation space is to project a bi-modal
auxiliary dataset into the common space and use the known text-image
connections of this dataset as a "pivot" to link e.g. a textual query to an appropriate
image of the reference database (Figure 2b).
        </p>
        <p>Fig. 2: (a) robust description; (b) pivot principle.</p>
        <p>MACC considers the use of two datasets. A first training dataset T is used
to learn the common space with a KCCA. Due to computational issues, the
number of documents that can be used to learn this space is limited to a few tens of
thousands. Hence, at a fairly large scale such as that of the text illustration
subtask, it is important to use a second auxiliary dataset A to "compensate" for
the possible limitations of the initial learning.</p>
        <p>
          Once the settings of the common latent space are chosen, the principle of the
text illustration is quite straightforward, as illustrated in Figure 3. For
the images of the reference database, we extract the same feature as that used
during learning, namely the FC7 fully connected layer of VGG [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. This vector
is projected onto the latent space and the MACC signature is computed, then
stored in the reference database.
        </p>
        <p>
          Regarding the textual query, we process the raw text in order to fix the
defects identified in Section 3.1. We first remove the stopwords using the
NLTK package [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], which contains lists of stopwords in several languages. As
months, days and numbers are not included in the stopword lists from NLTK,
we also filtered them out, as they are hard to illustrate and might add noise
to our model. Additionally, we removed words containing special
characters, which are often found in noisy words. Furthermore, we combined
the stopword list with the part-of-speech tagger provided by the NLTK library.
The NLTK POS tagger categorizes our set of words and labels each word according to its
grammatical properties. In order to keep the more descriptive words, we chose
to keep only nouns (proper and common) and adjectives.
        </p>
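<p>The filtering steps above can be sketched as follows. This is a self-contained sketch with tiny illustrative word lists standing in for the full NLTK resources; the POS-tagging step (keeping only nouns and adjectives) is omitted, since it requires the NLTK models.</p>

```python
import re

# Illustrative stopword / calendar lists; the actual system uses the full
# NLTK resources, and these sets are truncated stand-ins.
STOPWORDS = {"the", "a", "an", "of", "to", "home", "contact", "about"}
CALENDAR = {"january", "february", "monday", "tuesday"}

def clean_query(text):
    """Filter a noisy textual query: drop stopwords, months/days,
    tokens containing digits, and tokens with special characters."""
    kept = []
    for token in text.lower().split():
        if token in STOPWORDS or token in CALENDAR:
            continue
        if any(ch.isdigit() for ch in token):
            continue
        if not re.fullmatch(r"[a-z]+", token):   # special characters
            continue
        kept.append(token)
    return kept

words = clean_query("Taakbalk Navigation home contact virology 2016 lab-members research")
```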
        <p>For each word i of the resulting vocabulary, we extract a word2vec vectorial
representation t_i. Then, we compute a weight w_i equal to its tf-idf value. For
a document d, we select the k terms in the textual description that have the
largest weights. We then compute a unique vector v_d, representing d, from the k
selected words, weighting each t_i with its corresponding weight w_i, resulting
in the weighted arithmetic mean (WAM) v_d = (Σ_i w_i t_i) / (Σ_i w_i), where the
sums run over the k selected words.</p>
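<p>The WAM computation can be sketched as follows, assuming the word2vec vectors t_i and tf-idf weights w_i are already available (the toy vectors below are illustrative):</p>

```python
import numpy as np

def wam(vectors, weights, k=5):
    """Weighted arithmetic mean of the k highest-weighted word vectors.

    `vectors`: (n, dim) word2vec representations t_i;
    `weights`: (n,) tf-idf weights w_i.
    """
    vectors = np.asarray(vectors, dtype=float)
    weights = np.asarray(weights, dtype=float)
    top = np.argsort(weights)[::-1][:k]              # k largest tf-idf weights
    w = weights[top]
    return (w[:, None] * vectors[top]).sum(axis=0) / w.sum()

t = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy word vectors
w = np.array([3.0, 1.0, 0.5])                        # toy tf-idf weights
v_d = wam(t, w, k=2)                                 # (3*[1,0] + 1*[0,1]) / 4
```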
        <p>As said above, the classical KCCA algorithm only supports a few tens of
thousands of documents to learn the latent common space. To get around this
limitation, we first proceed to a selection of the training data, in order to build a
corpus with a diversified vocabulary. To do so, we divide our training set of 300k
documents into 10 groups of 30k documents each. We then cluster every group
of textual features with K-Means and compute 100 clusters per group. To build
a 20k-document corpus, for instance, we select from each group 20 random
documents per cluster (20k = n_groups × n_clusters × n_doc = 10 × 100 × 20). Similarly, we build
diversified sets for the pivot basis by selecting a certain number of random
documents per cluster.</p>
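<p>The per-cluster selection within one group can be sketched as follows (the K-Means clustering itself is done with standard tooling and is not reproduced; cluster labels are assumed given, and the toy sizes below are illustrative):</p>

```python
import random

def diversified_sample(cluster_labels, n_per_cluster, seed=0):
    """Select `n_per_cluster` random documents from each cluster of one group,
    to build a corpus with a diversified vocabulary."""
    rng = random.Random(seed)
    by_cluster = {}
    for doc_id, label in enumerate(cluster_labels):
        by_cluster.setdefault(label, []).append(doc_id)
    selected = []
    for label, docs in sorted(by_cluster.items()):
        selected.extend(rng.sample(docs, min(n_per_cluster, len(docs))))
    return selected

# Toy group: 3000 documents spread over 100 clusters, 20 kept per cluster.
labels = [i % 100 for i in range(3000)]
subset = diversified_sample(labels, n_per_cluster=20)
```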
      </sec>
      <sec id="sec-2-6">
        <title>Submission and Results</title>
        <p>We submitted four individual runs and three runs that merge them differently.
Some synthetic results are presented in Table 4. Globally, the recall at 100 is
low, which is explained by the difficulty of the task as well as the noise in the
queries.</p>
        <p>
          We ran the method described in Section 3.2 with a semantic signature
computed with CBS and both the WordNet and FlickrGroups vocabularies. We
obtained better results with the smaller vocabulary of WordNet. The basic
classifiers of the WordNet-based semantic features are learned with "cleaner"
annotated images than those based on FlickrGroups. However, since the original
Semfeat paper [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] showed better or similar results with both types of classifiers,
we suggest that in the current task the (small) difference in performance may
be due to a better coverage of the vocabulary with respect to the queries.
        </p>
        <p>The approach based on the common space and the MACC representation
leads to significantly better results. A first run, Wam5, used 22k images for T, 64k
for A, and the 5 best words were retained to build the training and testing textual
features. In the second run, Wam7, we used the same training dataset to learn the
common space, but A was extended to 164k documents while we retained up to
30k words to build the textual training features. For the textual testing features,
we kept only 10 words to build the WAM, because of the noisy aspect of the
query data.</p>
        <p>The experiments we ran on a development dataset (not reported), extracted
from the 300k development images, suggest that a large part of the improvement
between Wam5 and Wam7 is due to the growth of the auxiliary dataset A.</p>
        <p>Merging several runs improves the results only marginally. The run mergeA
concatenates the 10 best results of CBS+WordNet and Wam5 with the 80 best of Wam7. We
also consider a run Wam8, similar to Wam7, except that its testing textual queries
were built with the five best words (instead of 10 for Wam7). The run mergeB merges
the 10 best results of CBS+WordNet, Wam5 and Wam8 with the 70 best of Wam7.
Finally, mergeC concatenates the 5 best results of CBS+FlickrGroups, the 10
best results of CBS+WordNet and Wam5, the 15 best of Wam8 and the 60 best of Wam7.
While the settings are quite different between the three merging methods, the
results are similar, showing that the result is mainly due to the first answers of
Wam7.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>We presented our results on the Image Annotation and Localization subtask and
the Text Illustration teaser. The results on IAL are good in comparison to other
participants, but the contribution proposed in 2016 did not lead to significantly
better results than our 2015 system. This is partially due to the fairer evaluation of
the task, but since the exact evaluation method is not released, it is hard to
fully interpret the results, in particular why the combination of runs decreases
the mAP. Regarding the Text Illustration teaser, we proposed two methods based
on recently published work. The results are globally low, due to the difficulty of
the task in general1 and the very noisy queries in particular.
1 The other participant in the task obtained outstanding results, around 80%. If they
actually used the same data as us, we will of course revise our assessment of the interest of our methods!</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loper</surname>
          </string-name>
          , E.:
          <article-title>Natural language processing with Python</article-title>
          .
          <source>O'Reilly Media, Inc.</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Costa</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Coviello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Doyle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Rasiwasia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Lanckriet</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          , Levy,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Vasconcelos</surname>
          </string-name>
          , N.:
          <article-title>On the role of correlation and abstraction in cross-modal multimedia retrieval</article-title>
          .
          <source>TPAMI</source>
          <volume>36</volume>
          (
          <issue>3</issue>
          ),
          <volume>521</volume>
          {
          <fpage>535</fpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Dalal</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Triggs</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Histograms of oriented gradients for human detection</article-title>
          . In: CVPR, pp.
          <volume>886</volume>
          {
          <issue>893</issue>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Gadeski</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le Borgne</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Popescu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>CEA LIST's participation to the scalable concept image annotation task of ImageCLEF 2015</article-title>
          .
          <source>In: CLEF2015 Working Notes. CEUR Workshop Proceedings</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Gilbert</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramisa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dellandrea</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaizauskas</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolajczyk</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Overview of the ImageCLEF 2016 Scalable Concept Image Annotation Task</article-title>
          .
          <source>In: CLEF2016 Working Notes. CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Evora, Portugal (September 5-8,
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ginsca</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Popescu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le Borgne</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ballas</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vo</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanellos</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Large-scale image mining with Flickr groups</article-title>
          .
          <source>In: 21th International Conference on Multimedia Modelling (MMM 15)</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Donahue</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Darrell</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malik</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Rich feature hierarchies for accurate object detection and semantic segmentation</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Gong</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ke</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isard</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lazebnik</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>A multi-view embedding space for modeling internet images, tags, and their semantics</article-title>
          .
          <source>IJCV</source>
          <volume>106</volume>
          (
          <issue>2</issue>
          ),
          <fpage>210</fpage>
          -
          <lpage>233</lpage>
          (Jan
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Hardoon</surname>
            ,
            <given-names>D.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szedmak</surname>
            ,
            <given-names>S.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shawe-Taylor</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          :
          <article-title>Canonical correlation analysis: An overview with application to learning methods</article-title>
          .
          <source>Neural Comput</source>
          .
          <volume>16</volume>
          (
          <issue>12</issue>
          ),
          <fpage>2639</fpage>
          -
          <lpage>2664</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification</article-title>
          . In: ICCV (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Hodosh</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Young</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hockenmaier</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Framing image description as a ranking task: Data, models and evaluation metrics</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          pp.
          <fpage>853</fpage>
          -
          <lpage>899</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Hwang</surname>
            ,
            <given-names>S.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grauman</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Learning the relative importance of objects from tagged images for retrieval and cross-modal search</article-title>
          .
          <source>IJCV</source>
          <volume>100</volume>
          (
          <issue>2</issue>
          ),
          <fpage>134</fpage>
          -
          <lpage>153</lpage>
          (Nov
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Jegou</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perronnin</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Douze</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanchez</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perez</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Aggregating local image descriptors into compact codes</article-title>
          .
          <source>Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>34</volume>
          (
          <issue>9</issue>
          ),
          <fpage>1704</fpage>
          -
          <lpage>1716</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shelhamer</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Donahue</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karayev</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Long</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guadarrama</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Darrell</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Caffe: Convolutional architecture for fast feature embedding</article-title>
          .
          <source>arXiv preprint arXiv:1408.5093</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Kakar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chia</surname>
            ,
            <given-names>A.Y.S.</given-names>
          </string-name>
          :
          <article-title>Automatic image annotation using weakly labelled web data</article-title>
          .
          <source>In: CLEF2015 Working Notes</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Multimedia content processing through cross-modal association</article-title>
          .
          <source>In: Proc. ACM international conference on Multimedia</source>
          . pp.
          <fpage>604</fpage>
          -
          <lpage>611</lpage>
          . ACM Press (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xing</surname>
            ,
            <given-names>E.P.</given-names>
          </string-name>
          :
          <article-title>Object bank: A high-level image representation for scene classification &amp; semantic feature sparsification</article-title>
          .
          <source>In: NIPS</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Malobabic</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le Borgne</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murphy</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Connor</surname>
            ,
            <given-names>N.E.</given-names>
          </string-name>
          :
          <article-title>Detecting the presence of large buildings in natural images</article-title>
          .
          <source>In: Content-Based Multimedia Indexing (CBMI), 2005 International Workshop on</source>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Ngiam</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khosla</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nam</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          :
          <article-title>Multimodal deep learning</article-title>
          .
          <source>In: ICML</source>
          . pp.
          <fpage>689</fpage>
          -
          <lpage>696</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Sharif Razavian</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Azizpour</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sullivan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carlsson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>CNN features off-the-shelf: An astounding baseline for recognition</article-title>
          .
          <source>In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops</source>
          (
          <year>June 2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>CoRR abs/1409.1556</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Tamaazousti</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le Borgne</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hudelot</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Diverse concept-level features for multi-object classification</article-title>
          .
          <source>In: International Conference on Multimedia retrieval (ICMR'16)</source>
          . New York, USA (
          <year>June 2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Tamaazousti</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le Borgne</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Popescu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Constrained local enhancement of semantic features by content-based sparsity</article-title>
          .
          <source>In: International Conference on Multimedia retrieval (ICMR'16)</source>
          . New York, USA (
          <year>June 2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Torresani</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szummer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fitzgibbon</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Efficient object category recognition using classemes</article-title>
          .
          <source>In: European Conference on Computer Vision</source>
          . ECCV (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>T.Q.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le Borgne</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Crucianu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Aggregating image and text quantized correlated components</article-title>
          .
          <source>In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          . IEEE, Las Vegas, USA (
          <year>June 2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Uricar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franc</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thomas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sugimoto</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hlavac</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Real-time Multi-view Facial Landmark Detector Learned by the Structured Output SVM</article-title>
          .
          <source>In: BWILD '15: Biometrics in the Wild 2015 (IEEE FG 2015 Workshop)</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Muller, H., Garc a Seco de Herrera,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Schaer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Bromuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Gilbert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Piras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>Ramisa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Dellandrea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Gaizauskas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Mikolajczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Puigcerver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Toselli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.H.</given-names>
            ,
            <surname>Snchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.A.</given-names>
            ,
            <surname>Vidal</surname>
          </string-name>
          , E.:
          <article-title>General Overview of ImageCLEF at the CLEF 2016 Labs</article-title>
          . Lecture Notes in Computer Science, Springer International Publishing (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Viola</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Robust real-time object detection</article-title>
          .
          <source>In: IJCV</source>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Vo</surname>
            ,
            <given-names>P.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ginsca</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le Borgne</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Popescu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Effective training of convolutional networks using noisy web images</article-title>
          .
          <source>In: CBMI</source>
          (
          <year>2015</year>
          ), Prague, Czech Republic
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Deep image: Scaling up image recognition</article-title>
          .
          <source>CoRR abs/1501.02876</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Zitnick</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dollar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Edge boxes: Locating object proposals from edges</article-title>
          .
          <source>In: ECCV. European Conference on Computer Vision</source>
          (
          <year>September 2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Znaidia</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shabou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le Borgne</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hudelot</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paragios</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Bag-of-multimedia-words for image classification</article-title>
          .
          <source>In: ICPR</source>
          . pp.
          <fpage>1509</fpage>
          -
          <lpage>1512</lpage>
          . IEEE
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>