<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic Image Annotation using Weakly Labelled Web Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pravin Kakar</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiangyu Wang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alex Yong-Sang Chia</string-name>
          <email>yschiag@i2r.a-star.edu.sg</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>21-01</institution>
          ,
          <addr-line>1 Fusionopolis Way</addr-line>
          ,
          <country country="SG">Singapore</country>
          <addr-line>138632</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Social Media and Internet Vision Analytics Lab, Institute for Infocomm Research</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this work, we propose and describe a method for localizing and annotating objects in images for the Scalable Concept Image Annotation challenge at ImageCLEF 2015. The unique feature of our proposed method is its almost exclusive reliance on a single modality (visual data) for annotating images. Additionally, we do not utilize any of the provided training data, but instead create our own similarly-sized training set. By exploiting the latest research in deep learning and computer vision, we are able to test the applicability of these techniques to a problem of extremely noisy learning. We obtain state-of-the-art results on an inherently multi-modal problem, thereby demonstrating that computer vision can also serve as a primary classification modality, instead of relying primarily on text to determine context prior to image annotation.</p>
      </abstract>
      <kwd-group>
        <kwd>visual recognition</kwd>
        <kwd>scalable annotation</kwd>
        <kwd>learning from noisy data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The Scalable Concept Image Annotation challenge (SCIA) at
ImageCLEF 2015 [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] is designed to evaluate methods that automatically
annotate, localize and/or describe concepts and objects in images. In
contrast to previous years, there have been several notable changes
to the challenge. Some of them are highlighted below:
- Localization of objects within images has been introduced. As a
result, the focus on more "object"-like concepts has increased
this year.
- Use of hand-labelled data has been allowed. Although this is done
to technically allow the use of deep learning models trained on the
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], it opens up the possibility of a potential clean-up of the
training data. Note that we have not done this in this work, but it
appears to be legal within the regulatory framework of the
challenge.
- The training and test sets are identical. Therefore, a method that
is able to exploit the noisy training data (e.g. via data cleaning)
could, in theory, benefit from potentially overfitting the training
data.
      </p>
      <p>
        From a computer vision perspective, SCIA is more challenging
than the current benchmark challenge [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] in at least two senses:
1) the training data provided is fairly noisy, which makes learning a
difficult problem, and 2) the test set is 5× the size of [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. While this
does not indicate a clear increase in the level of difficulty (for example,
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] has 4× the number of concepts of SCIA), certain aspects are
definitely more demanding.
      </p>
      <p>In the rest of these notes, we discuss our proposed method,
including data collection, classifier training and post-processing tweaks.
We also discuss the challenges posed by the fact that the test data
is annotated via crowd-sourcing, which adds another source of label
noise to the "ground-truth" data. Finally, we present our results on SCIA
along with proposals for future research to improve the automatic
annotation capabilities of techniques in this field.</p>
    </sec>
    <sec id="sec-2">
      <title>Algorithm Design</title>
      <p>As mentioned earlier, our algorithm is designed to rely mostly on
visual data. We do not employ extensive ontologies to augment
training data, nor do we use them during the training process, thus
helping understand the importance of having a strong visual recognition
pipeline. The various stages of our annotation pipeline are discussed
below.</p>
      <sec id="sec-2-1">
        <title>Data Collection</title>
        <p>We do not use the provided training set for two main reasons: 1) as
the training and test sets are identical, there is no penalty for
overfitting on the training data, which could provide an artificial boost
to performance results, and 2) there is little direct relationship
between the target concepts and the image keywords in the training
data, making it difficult to decouple the significance of a good
ontology from that of a good visual learning mechanism. Therefore, we
create our own training data of approximately the same size as the
SCIA dataset.</p>
        <p>[Figure 1: data collection pipeline: concept → augmented search terms → image crawling → filtering → training set.]</p>
        <p>The data collection pipeline is shown in Figure 1. We first
consider the target concept names as keywords for which appropriate
images need to be found. There is an issue of non-specificity with some
of the concept names. For example, the concept "dish" can refer
to both the vessel and the food content, although only the
former is the target concept. Additionally, it is difficult to achieve
both specificity and diversity using a single keyword when doing a
web search for images. As an example, searching for "candy" yields
generic images of candy, which, while containing diverse instances of
candy, do not closely match single, specific instances of candy.</p>
        <p>
          Both of the above issues are conventionally dealt with by using
ontologies to determine the coverage for each concept. We do not build
our own challenge-specific ontology here, but instead simply rely on
WordNet [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] to augment the individual keywords. In particular, this
is done by also considering hyponyms (sub-categories) and lemmas
(similar words) of the target concept. The hyponyms help target
specific instances of the target concept, while the lemmas help increase
the coverage of the target concept.
        </p>
        <p>
          This augmented set of keywords per concept is then passed into
an image search engine. We use Bing Image Search [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] in this pipeline.
Note that we search for the hyponyms and lemmas of the target
concept with the target concept appended, in order to ensure that the
correct sense of images is being searched for. For example, searching
for "truffle" rather than "truffle candy" results in a very different
set of images that include fungi, which fall outside the scope of the
target concept.
        </p>
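        <p>As a rough illustration of this keyword augmentation, the sketch below uses NLTK's WordNet interface as a stand-in for our actual tooling; the synset selection, underscore handling and query formatting are illustrative assumptions.</p>
        <preformat>
from nltk.corpus import wordnet as wn  # requires the NLTK wordnet corpus

def augmented_queries(concept):
    """Build web-search queries from a concept's lemmas and hyponyms."""
    queries = {concept}
    for synset in wn.synsets(concept, pos=wn.NOUN):
        # lemmas (similar words) widen the coverage of the concept
        for lemma in synset.lemmas():
            queries.add(lemma.name().replace("_", " "))
        # hyponyms (sub-categories) target specific instances; the concept
        # itself is appended to pin down the intended word sense,
        # e.g. "truffle candy" rather than just "truffle"
        for hyponym in synset.hyponyms():
            for lemma in hyponym.lemmas():
                queries.add(lemma.name().replace("_", " ") + " " + concept)
    return sorted(queries)

print(augmented_queries("candy")[:10])
</preformat>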
        <p>We gather up to 4000 images per target concept from our
crawling engine. These images are passed through a filtering step where
images that are corrupted, too small or almost completely uniform
in appearance are discarded. The remaining images then form our
training dataset: an automatically created, noisily labelled dataset.</p>
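        <p>A minimal sketch of this filtering step, using Pillow and NumPy; the size and uniformity thresholds are illustrative placeholders, not the values we used.</p>
        <preformat>
import numpy as np
from PIL import Image

def keep_image(path, min_side=64, min_std=10.0):
    """Return True if a crawled image is usable as a training sample."""
    try:
        img = Image.open(path)
        img.load()  # force a full decode to catch corrupted files
    except OSError:
        return False
    big_enough = min(img.size) >= min_side  # reject tiny thumbnails
    gray = np.asarray(img.convert("L"), dtype=np.float32)
    varied_enough = gray.std() >= min_std   # reject near-uniform images
    return big_enough and varied_enough
</preformat>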
      </sec>
      <sec id="sec-2-2">
        <title>Feature Extraction</title>
        <p>
          For the images collected by the above process, we extract features
that will be useful for image classification. We choose to use
features from the winner of the latest ILSVRC classification challenge,
GoogLeNet [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], a deep learning model trained on the ILSVRC
dataset and consisting of a highly-interconnected network-in-network
architecture. Despite its depth, the model is small enough to fit
within our available computing resources of a single GeForce GTX
480 GPU with 1.5 GB of memory.
        </p>
        <p>
          For each training image, we scale it down to 256×256 pixels and
use the center crop of 224×224 pixels. The intuition behind this is
that, as the images are retrieved using specific keywords, it is likely
that the object of interest is the focus of the image and should be
dominant. This also reduces the computational complexity of the
feature extraction process considerably. We extract features from
the pooled 5B layer of the GoogLeNet model (see [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] for details),
yielding a 1024-dimensional vector per training image. Each feature
vector is then normalized to unit length.
        </p>
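        <p>The sketch below reproduces this preprocessing using a pretrained GoogLeNet from torchvision as a stand-in for our Caffe model; hooking the global average-pooling layer yields the 1024-dimensional vector. The layer name and the input normalization constants are torchvision's, not ours.</p>
        <preformat>
import torch
from torchvision import models, transforms
from PIL import Image

model = models.googlenet(weights="IMAGENET1K_V1").eval()

preprocess = transforms.Compose([
    transforms.Resize((256, 256)),  # scale down to 256x256 pixels
    transforms.CenterCrop(224),     # keep the (likely dominant) central object
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

feats = {}
model.avgpool.register_forward_hook(
    lambda mod, inp, out: feats.update(pool=torch.flatten(out, 1)))

x = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    model(x)
vec = feats["pool"][0]
vec = vec / vec.norm()  # normalize each feature vector to unit length
</preformat>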
        <p>
          We then train linear SVM classifiers [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] in a one-versus-all
fashion. This is not strictly correct, as some concepts will almost certainly
overlap (e.g. "face" will contain "eye", "nose", "mouth", etc.).
However, making such an independence assumption greatly simplifies the
learning process. Moreover, it allows us to avoid using a relationship
ontology to determine the appropriate weight of every concept for
each classifier. This is also in line with the goal of the challenge to
design a scalable system, as the addition of a new target concept does
not necessitate a recomputation of the weights against every existing
concept. In order to manage space constraints, we uniformly sample
the negative training samples for each concept, only selecting 60,000
of them.
        </p>
        <p>[Figure 2: annotation pipeline: object proposals → proposal classification → non-maximal suppression (with morphological merging for "tree" boxes) → fusion with face and gender detection, yielding per-concept confidences and bounding boxes.]</p>
        <p>Thus, we train a single 1024-dimensional linear classifier per
target concept to use for annotating test images.</p>
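        <p>A sketch of this one-versus-all training, using the LIBLINEAR backend of scikit-learn; the arrays "features" and "labels" and the default regularization are placeholders.</p>
        <preformat>
import numpy as np
from sklearn.svm import LinearSVC

def train_concept_classifiers(features, labels, concepts,
                              n_neg=60000, seed=0):
    """features: (N, 1024) unit-normalized vectors; labels: (N,) array of
    concept names, one per training image. Returns one SVM per concept."""
    rng = np.random.default_rng(seed)
    classifiers = {}
    for concept in concepts:
        pos = np.flatnonzero(labels == concept)
        neg = np.flatnonzero(labels != concept)
        # uniformly subsample negatives to manage space constraints
        neg = rng.choice(neg, size=min(n_neg, neg.size), replace=False)
        idx = np.concatenate([pos, neg])
        y = (labels[idx] == concept).astype(int)
        clf = LinearSVC()          # LIBLINEAR-backed linear SVM [3]
        clf.fit(features[idx], y)
        classifiers[concept] = clf
    return classifiers
</preformat>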
      </sec>
      <sec id="sec-2-3">
        <title>Annotation Pipeline</title>
        <p>Figure 2 shows our processing pipeline for a single test image.</p>
        <p>
          We first create object proposals using the technique of [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], which
uses selective search to find likely "object" regions. Experimentally,
we observe that many of these proposals are near-duplicates. In
order to increase diversity, we limit the returned proposals to those
that overlap others by at most 90%. This is found to be an
acceptable value for controlling the tradeoff between increasing diversity
and losing significant objects. We restrict the number of proposals
returned to the first 150, in decreasing order of size.
        </p>
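        <p>The following sketch shows this proposal pruning; the 90% overlap limit and the 150-proposal cap come from the text above, while the particular overlap measure (intersection over the smaller box's area) is our assumption.</p>
        <preformat>
def box_area(b):
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])

def overlap(a, b):
    """Intersection area over the smaller box's area (an assumption)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    return inter / max(min(box_area(a), box_area(b)), 1)

def filter_proposals(boxes, max_overlap=0.9, max_keep=150):
    """Keep diverse proposals, examined in decreasing order of size."""
    kept = []
    for b in sorted(boxes, key=box_area, reverse=True):
        worst = max((overlap(b, k) for k in kept), default=0.0)
        if worst > max_overlap:
            continue  # near-duplicate of an already-kept proposal
        kept.append(b)
        if len(kept) == max_keep:
            break
    return kept
</preformat>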
        <p>Each proposal is then passed through the same feature extraction
pipeline as in Section 2.2, and the classifiers trained therein are run
to yield the most likely concepts per region. In general, non-maximal
suppression is done per concept across all regions to limit the effect
of overlapping proposals reporting the same concept for the same
object.</p>
        <p>There are two branches from this primary pipeline that we
employ based on our observations on the SCIA development set. Firstly,
we observe that many object proposals are labeled as "tree" if they
contain heavy foliage. While not incorrect for the individual region,
this may be incorrect for the overall image, where it is often difficult
to localize a single tree. In order to mitigate this effect, we perform
morphological merging for all tree boxes, taking the convex hull of
each merged region as the bounding box of the tree and assigning it
the highest confidence of all the merged boxes, as sketched below. We
observe this to improve the localization performance for the "tree"
concept on the development set. We also believe that this idea can be
extended to other non-uniformly shaped, difficult-to-localize concepts
such as "rock", "leaf", "brick", etc., but we do not have sufficient
annotations in the development data to verify this.</p>
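        <p>A sketch of the merging step for "tree" boxes: groups of mutually overlapping boxes are merged, and each group reports its enclosing box with the highest member confidence. The flood-fill grouping rule is our assumption here; for axis-aligned boxes, the convex hull of a group reduces to its enclosing box.</p>
        <preformat>
def boxes_touch(a, b):
    """Axis-aligned overlap test on (x1, y1, x2, y2) boxes."""
    return a[2] > b[0] and b[2] > a[0] and a[3] > b[1] and b[3] > a[1]

def merge_tree_boxes(boxes, scores):
    """Merge connected groups of overlapping boxes into single detections."""
    unseen = set(range(len(boxes)))
    merged = []
    while unseen:
        group = [unseen.pop()]
        frontier = list(group)
        while frontier:  # flood-fill one connected component
            i = frontier.pop()
            linked = [j for j in unseen if boxes_touch(boxes[i], boxes[j])]
            unseen.difference_update(linked)
            group.extend(linked)
            frontier.extend(linked)
        hull = (min(boxes[i][0] for i in group),
                min(boxes[i][1] for i in group),
                max(boxes[i][2] for i in group),
                max(boxes[i][3] for i in group))
        merged.append((hull, max(scores[i] for i in group)))
    return merged
</preformat>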
        <p>
          Secondly, we observe that for any generic image dataset, humans
are an important object. This is true for SCIA as well as for [
          <xref ref-type="bibr" rid="ref12 ref2 ref8">12,8,2</xref>
          ].
Note that this is in contrast to domain-specific datasets such as
[
          <xref ref-type="bibr" rid="ref11 ref9">9,11</xref>
          ]. To this end, we use face and gender detection from [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] to
detect persons with frontal faces in images. We supplement this with a
simple regression to an upper-body annotation using the data from
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Finally, we use the information from [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] to determine the
locations of various other person attributes.
        </p>
        <p>A fusion step merges the results from the primary and two
secondary pipelines. Specifically, person results from multiple pipelines
are suppressed or modified based on overlaps between returned
localizations for the same concepts. Additionally, localizations that have
too high or too low aspect ratios are suppressed, along with localizations
that fall below a preset score threshold. Finally, if all localizations
have been suppressed, we report a single localization comprising
the entire image, corresponding to the global scene classification.
This is based on the premise that all the development set images
contain at least one concept, and we extend that assumption to all
the test images.</p>
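        <p>A sketch of the suppression and fallback logic in this fusion step; the aspect-ratio bounds and score threshold are illustrative placeholders, and the whole-image fallback follows the premise stated above.</p>
        <preformat>
def fuse(localizations, image_size, global_concept, global_score,
         min_ar=0.1, max_ar=10.0, min_score=0.5):
    """localizations: list of (concept, score, (x1, y1, x2, y2))."""
    kept = []
    for concept, score, (x1, y1, x2, y2) in localizations:
        ar = (x2 - x1) / max(y2 - y1, 1)  # aspect ratio of the box
        if max_ar >= ar >= min_ar and score >= min_score:
            kept.append((concept, score, (x1, y1, x2, y2)))
    if not kept:
        # every image contains at least one concept: fall back to a single
        # whole-image localization from the global scene classification
        w, h = image_size
        kept = [(global_concept, global_score, (0, 0, w, h))]
    return kept
</preformat>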
        <p>Optionally, the fusion step can also include multiple textual
refinement steps. One option is to search URL filenames for concept
names and, if found, assign them to the entire image. Another
approach uses correlation between concepts from an ontology. This is
done to test the impact of simple context addition to the annotation
pipeline. Details of this latter approach are provided in the following
subsection.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Ontology and correlation</title>
        <p>With the feature extraction and annotation pipeline, a set of
bounding boxes $\{B_i\}$ is obtained for each test image. We denote the
prediction scores for the target concepts in $B_i$ as $S_i = [s_{i1}, \dots, s_{im}]$, where
$m$ is the total number of target concepts. By combining the
prediction scores for all the bounding boxes $\{B_i\}$, the prediction score for
the image is calculated as $S = [s_1, \dots, s_m]$, where $s_i = \max_j s_{ji}$.</p>
        <p>
          Because concepts do not occur in isolation (e.g.
bathroom and bathtub, mountain and cliff), semantic context can be used
to improve annotation accuracy. Following an approach similar to [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], we
adopt semantic diffusion to refine the concept annotation scores. We
denote $C = \{c_1, \dots, c_m\}$ as the set of target concepts. Let $W$ be the
concept affinity matrix, where $W_{ij}$ indicates the affinity between
concepts $c_i$ and $c_j$, and let $D$ denote the diagonal node degree matrix,
where $D_{ii} = d_i = \sum_j W_{ij}$. Then the graph Laplacian is $\Delta = D - W$ and the
normalized graph Laplacian is $L = I - D^{-1/2} W D^{-1/2}$. In this
problem, we measure the concept affinity on the Wikipedia dataset.
Let $M$ denote the total number of pages in Wikipedia. For concept
$c_i$, we set $y_{ik} = 1$ if the concept keyword $c_i$ appears in page $k$, and
$y_{ik} = 0$ otherwise. The affinity $W_{ij}$ between concepts $c_i$ and $c_j$ can
then be computed using the Pearson product-moment correlation as
$$W_{ij} = \frac{\sum_{k=1}^{M} (y_{ik} - \mu_i)(y_{jk} - \mu_j)}{(M - 1)\,\sigma_i \sigma_j} \qquad (1)$$
where $\mu_i$ and $\sigma_i$ are the sample mean and standard deviation for
$c_i$, respectively. Based on our study, the original predictions are already
quite precise. We therefore employ only positive correlations, to boost
concepts and improve recall.
        </p>
        <p>Let $g \in \mathbb{R}^{m \times 1}$ denote the refined score vector; the values $g_i$ and
$g_j$ should be consistent with $W_{ij}$ (the affinity between concepts $c_i$
and $c_j$). Motivated by this semantic consistency, we formulate the
score refinement problem as minimizing the loss function
$$\varepsilon = \frac{1}{2}\,\mathrm{tr}(g^T L g) \qquad (2)$$
The loss function can be optimized using a gradient descent algorithm as
$$g \leftarrow g - \eta\,\nabla_g \varepsilon \qquad (3)$$
where
$$\nabla_g \varepsilon = L g \qquad (4)$$
and $\eta$ is the learning rate.</p>
        <p>Initially, $g = S$. By iteratively optimizing the loss function, we
obtain the refined, smooth score vector $g$ for the image. A threshold
$\theta$ is chosen, so that we consider concept $c_i$ to appear if $g_i > \theta$;
otherwise we consider the concept absent from the image (and consequently
from all of the bounding boxes $B_i$). That is, for each bounding box
in an image, we report the concept with the maximum confidence
among the concepts that appear in the image.</p>
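        <p>A sketch of the refinement loop of equations (2)-(4); the learning rate, iteration count and threshold value are illustrative assumptions.</p>
        <preformat>
import numpy as np

def refine_scores(S, W, lr=0.1, n_iter=50, theta=0.5):
    """Diffuse image-level scores S (length m) over the concept affinity W."""
    d = np.maximum(W.sum(axis=1), 1e-12)
    D_inv_sqrt = np.diag(d ** -0.5)
    L = np.eye(len(d)) - D_inv_sqrt @ W @ D_inv_sqrt  # normalized Laplacian
    g = S.astype(float).copy()                        # initially, g = S
    for _ in range(n_iter):
        g -= lr * (L @ g)  # gradient step: the gradient of (2) is L g, eq. (4)
    return g > theta       # concepts considered present in the image
</preformat>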
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Dataset Limitations</title>
      <p>We tune the various parameters of our algorithm by validating its
performance on the development set. Unfortunately, the
development set (and, by extension, the SCIA test set) has multiple
problems that make it very difficult to correctly gauge the effect of tuning.
Most of these problems arise from the limitations of crowd-sourced
ground-truth annotation, and need to be addressed to make SCIA a
more consistent evaluation. We summarize the major issues involved
below as the four I's.</p>
      <p>Inconsistency Many images in the development set exhibit
inconsistent annotations. An example of this is shown in Figure 3. Despite
there clearly being 4 legs in the picture, only 1 is annotated. This
is inconsistent, as none of the unannotated instances are any less of
a "leg" than the one annotated. Other examples of inconsistencies
seen in the development set include unclear demarcations of when
multiple instances of the same concept are to be grouped into a
single instance or vice versa, and annotating partially-visible
instances in some images while not annotating completely visible
instances in other images.</p>
      <p>Incompleteness Several images in the development set are
incompletely annotated. This is most prevalent in the case of humans,
where various body parts are skipped altogether in the annotations.
Apart from this, there appears to be a certain level of arbitrariness
in choosing which concepts to annotate. For instance, in Figure 3,
"shirt" is annotated but "short pants" is not, when clearly both have
about the same level of "interestingness". Additionally, concepts like
"chair", "sock", "shoe", "stadium", etc., which are also present in the
image, are not annotated. This makes it extremely difficult to judge
the performance of a proposed technique on the development set.
Moreover, it seems to run counter to the challenge assertion that the
proportion of missing or incomplete annotations is insignificant.</p>
      <p>Incorrectness Although not as prevalent as the previous two
problems, many annotations are incorrect. In Figure 3, the two balls are
labelled as balloons, which is clearly wrong. There are other cases of
wrong annotations in items of clothing (shirt/jacket/suit) as well as
the gender and age of persons ("man" labelled as "male child", "woman"
labelled as "man", etc.).</p>
      <p>Impossibility This issue is the least prevalent of the four discussed
in this section. The image shown in Figure 4 was flagged by our
image annotation pipeline as having a very large number of object
instances. It can be seen that the image contains more than 200
faces. This implies that there are more than 100 instances of at least
one of "man" or "woman" (a quick inspection of the image shows no
children, eliminating the possibility of instances of "male child" and
"female child"). Within the rules of the challenge, each concept is
limited to 100 instances, making it impossible to annotate all
instances correctly. Grouping multiple instances into a single instance,
if one were inclined to do so, is not straightforward, as there is no
clear group of men and women as in some other images.</p>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>In this section, we evaluate the performance of different settings of
the algorithm on the development and test datasets. It is to be noted
that there appear to be significant differences in the quality of the
annotations between the two sets, so results on one are not
indicative of results on the other. Moreover, as no particulars were
provided about the measures being used for evaluation beyond
"performance" on both concepts and images, we used the F-score as a
measure on the development set, which formed the basis of 2 out of
3 measures in the previous iteration of the challenge. As it turns out,
the evaluation measure used on the test set was the mAP, and so
results between the development and test sets are again not directly
comparable. These statistics are shown in Table 1.
We run two base versions of our pipeline, one aimed at garnering
better precision (BP) and one aimed at getting better recall (BR).
These are shown in the first two rows of the table. Following this, we
notice that in some images no concepts were predicted, as their
confidence scores fell below their respective thresholds. In these cases,
we forced at least 1 prediction to be made (≥1 pred.), giving rise to
two more variants, BP1 and BR1.</p>
      <p>URL-search corresponds to the URL filename-concept name matching
discussed earlier. Agg. NMS refers to aggressive non-maximal
suppression that employs an NMS threshold of 0, causing all
overlapping bounding boxes for the same concept to be reduced to a
single one. From human attributes, we either report hair + mouth,
which showed no deleterious effects on the development set in the
face of incomplete annotations, or face parts, which also adds eyes
and noses, or body parts, which further adds faces, heads, necks
and arms. In the case of ontologies, while we obtain slightly better
results on the development set, output errors in the submission cause
the performance to be quite low, which is an outlier.</p>
      <p>From the results, it can be seen that all human attributes
significantly help boost performance. Moreover, URL-search causes a drop
in performance, while aggressive NMS again boosts performance.
Hence, a possible configuration that yields even better performance could
be BR1 + Agg. NMS + body parts.</p>
      <p>It is also to be noted that the runner-up in the challenge attains
a performance about 15% lower than ours. As the details of their
technique are not available, it is difficult to pinpoint the cause of the
large difference, but we believe that the use of an external training
set, combined with human part extraction, played an important role.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>In this work, we have presented our proposed method for the
ImageCLEF Scalable Concept Image Annotation challenge. Our method
places heavy emphasis on the visual aspect of annotating images and
demonstrates the performance that can be achieved by building an
appropriate pipeline of state-of-the-art visual recognition techniques.
The interconnections between the techniques are modified and
enhanced to improve overall annotation performance by branching off
secondary recognition pipelines for certain highly common concepts.</p>
      <p>We also highlight the limitations of the current challenge dataset
with respect to the ground-truth annotations, categorizing the
major shortcomings. Despite these and our technique's general lack of
reliance on textual data, we are able to outperform competing
methods by a margin of at least 15%. In the future, we plan to refine our
annotation pipeline based on an analysis of the results. As most of
the target concepts in this iteration of the challenge were localizable
in a well-defined manner, it will be interesting to examine
localization for other, more abstract concepts. We also hope to combine
advances in natural language processing and semantic ontologies to
appropriately weigh training instances when learning classifiers, as well
as to look at the problem from a multi-modal point of view.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Open biometrics, http://openbiometrics.org/</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Everingham</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Gool</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>C.K.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Winn</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <source>The PASCAL Visual Object Classes Challenge</source>
          <year>2011</year>
          (
          <article-title>VOC2011) Results</article-title>
          . http://www.pascal-network.org/challenges/VOC/voc2011/workshop/index.html
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>R.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>K.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsieh</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.J.:</given-names>
          </string-name>
          <article-title>LIBLINEAR: A library for large linear classification</article-title>
          .
          <source>The Journal of Machine Learning Research 9</source>
          ,
          <year>1871</year>
          –
          <year>1874</year>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ferrari</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marin-Jimenez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Progressive search space reduction for human pose estimation</article-title>
          .
          <source>In: Computer Vision and Pattern Recognition</source>
          ,
          <year>2008</year>
          .
          <article-title>CVPR 2008</article-title>
          . IEEE Conference on. pp.
          <volume>1</volume>
          –
          <issue>8</issue>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>Y.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>S.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ngo</surname>
            ,
            <given-names>C.W.</given-names>
          </string-name>
          :
          <article-title>Domain adaptive semantic diffusion for large scale context-based video annotation</article-title>
          .
          <source>In: Computer Vision</source>
          ,
          <year>2009</year>
          IEEE 12th International Conference on. pp.
          <volume>1420</volume>
          –
          <fpage>1427</fpage>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Jusko</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          :
          <article-title>Human figure drawing proportions</article-title>
          , http://www.realcolorwheel.com/human.htm
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Microsoft:
          <article-title>Bing image search</article-title>
          (
          <year>2015</year>
          ), http://www.bing.com/images
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Opelt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fussenegger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Generic object recognition with boosting</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>28</volume>
          (
          <issue>3</issue>
          ),
          <volume>416</volume>
          –
          <fpage>431</fpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Parkhi</surname>
            ,
            <given-names>O.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vedaldi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jawahar</surname>
            ,
            <given-names>C.V.</given-names>
          </string-name>
          :
          <article-title>Cats and dogs</article-title>
          .
          <source>In: IEEE Conference on Computer Vision and Pattern Recognition</source>
          . pp.
          <volume>3498</volume>
          –
          <issue>3505</issue>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. Princeton University: About wordnet (
          <year>2010</year>
          ), https://wordnet.princeton.edu/wordnet/
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Quattoni</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torralba</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Recognizing indoor scenes</article-title>
          .
          <source>Computer Vision and Pattern Recognition</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Russakovsky</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krause</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Satheesh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karpathy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khosla</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berg</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>ImageNet Large Scale Visual Recognition Challenge</article-title>
          .
          <source>International Journal of Computer Vision</source>
          (IJCV) (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Szegedy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sermanet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reed</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anguelov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erhan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanhoucke</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rabinovich</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Going deeper with convolutions</article-title>
          .
          <source>CoRR abs/1409.4842</source>
          (
          <year>2014</year>
          ), http://arxiv.org/abs/1409.4842
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Uijlings</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van de Sande</surname>
            ,
            <given-names>K.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gevers</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smeulders</surname>
            ,
            <given-names>A.W.</given-names>
          </string-name>
          :
          <article-title>Selective search for object recognition</article-title>
          .
          <source>International journal of computer vision 104(2)</source>
          ,
          <volume>154</volume>
          –
          <fpage>171</fpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Muller, H.,
          <string-name>
            <surname>Gilbert</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolajczyk</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Herrera</surname>
            ,
            <given-names>A.G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bromuri</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amin</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohammed</surname>
            ,
            <given-names>M.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Acar</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uskudarli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marvasti</surname>
            ,
            <given-names>N.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aldana</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>del Mar Roldán García</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>General Overview of ImageCLEF at the CLEF 2015 Labs</article-title>
          . Lecture Notes in Computer Science, Springer International Publishing (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>