<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Frame the Crowd: Global Visual Features Labeling boosted with Crowdsourcing Information</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael Riegler</string-name>
          <email>miriegle@edu.uniklu.ac.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mathias Lux</string-name>
          <email>mlux@itec.uni-klu.ac.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christoph Kofler</string-name>
          <email>c.kofler@tudelft.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Delft University of Technology</institution>
          ,
          <addr-line>Delft</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Klagenfurt University</institution>
          ,
          <addr-line>Klagenfurt</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>In this paper we present our approach to the Crowdsourcing Task of the MediaEval 2013 Benchmark [2] using transfer learning and visual features. For the visual features, we adopt an existing approach for search-based classification that uses content-based image retrieval on global features, with feature selection and feature combination to boost performance. Our approach provides a baseline evaluation indicating the usefulness of global visual features, hashing, and search-based classification.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The benchmarking task at hand [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] has been investigated
by two different means as well as a combination of them.
First, we use only crowdsourcing data for the labeling. We
compute a reliability measure for workers and use this value
along with the workers’ self-reported familiarity as features
for a classifier. Our second approach is based on the
assumption that images taken with similar intentions, e.g.,
displaying a fashion style, are framed in a similar way.
      </p>
      <p>
        We define the framing of an image as the sum of the
visible reflections of the specific decisions that the photographer
makes when the image is captured. The photographer has
many different choices when taking a photo of a certain
object, event, person or scene. During the capture process
the photographer does not click the shutter randomly, but
rather makes use, either consciously or unconsciously, of a
set of conventions that can be thought of as a recipe for a
certain kind of image. The recipe leads to a distinguishable
framing that is used by the viewer in interpreting the
image. For example, a picture of a person framed in one way
is most easily interpreted as a fashion image and framed
in another way most easily interpreted as a holiday
memory. Choices photographers make to achieve certain types
of framing include color distribution, lighting, and the positions
of objects and people. They also include the choice of the
exact moment during ongoing action at which the image is
shot. In this way, the photographer also influences exactly
what is depicted in the image, e.g., the facial expressions of the
people appearing in it. The framing theory is especially
applicable to fashion use cases. Because framing shapes the
image as a whole, we employ global visual features, extracted
with a modified version of the LIRE framework [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and boost classification results
with feature combination and feature selection.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. APPROACH</title>
      <p>
        Using LIRE we extracted the global features CEDD, FCTH,
JCD, PHOG, EH, CL, Gabor, Tamura, LL, OH, JPEGCoeff
and SC (which are described and referenced in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]). These
features capture characteristics that distinguish one framing
from another; Color Layout, for example, captures the color distribution.
      </p>
      <p>
        The task includes one required condition, which allows
only the workers’ annotations to be used. However, those
annotations are known to be error prone. Therefore,
we integrated a reliability measure for workers based on the
work of Ipeirotis et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
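      <p>We compare a rating <inline-formula><tex-math>r_{ij} \in R</tex-math></inline-formula> from worker <inline-formula><tex-math>w_i \in W</tex-math></inline-formula> for
image <inline-formula><tex-math>x_j \in I</tex-math></inline-formula>, where <inline-formula><tex-math>R</tex-math></inline-formula> is the set of ratings, <inline-formula><tex-math>W</tex-math></inline-formula> is
the set of workers and <inline-formula><tex-math>I</tex-math></inline-formula> is the set of images, to the
majority vote <inline-formula><tex-math>V(x_j)</tex-math></inline-formula> of all workers for that image. The fraction of a
worker's ratings that agree with the majority vote gives
a measure of reliability <inline-formula><tex-math>Q(w_i)</tex-math></inline-formula> for the specific worker <inline-formula><tex-math>w_i</tex-math></inline-formula>. The
computed weight <inline-formula><tex-math>Q(w_i)</tex-math></inline-formula> is then multiplied by the vote <inline-formula><tex-math>r_{ij}</tex-math></inline-formula>
of the worker <inline-formula><tex-math>w_i</tex-math></inline-formula> for the image <inline-formula><tex-math>x_j</tex-math></inline-formula>.</p>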
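      <disp-formula>
        <tex-math>V(x_j) = \arg\max_{v \in \{0,1\}} \left| \{ r_{ij} : w_i \in W \wedge r_{ij} = v \} \right|</tex-math>
      </disp-formula>
      <disp-formula>
        <tex-math>Q(w_i) = \frac{\left| \{ r_{ij} : x_j \in I \wedge r_{ij} = V(x_j) \} \right|}{\left| \{ r_{ij} : x_j \in I \} \right|}</tex-math>
      </disp-formula>
      <p>Additionally, the familiarity of the worker with the fashion
topic is added as a feature. So the feature vector for
an image <inline-formula><tex-math>x_k</tex-math></inline-formula> with ratings of three workers <inline-formula><tex-math>w_1, w_2, w_3</tex-math></inline-formula> is
<inline-formula><tex-math>(r_{1k} \cdot Q(w_1),\; r_{2k} \cdot Q(w_2),\; r_{3k} \cdot Q(w_3),\; f_{w_1},\; f_{w_2},\; f_{w_3})</tex-math></inline-formula>,
where <inline-formula><tex-math>f_{w_i}</tex-math></inline-formula> denotes the familiarity of worker <inline-formula><tex-math>w_i</tex-math></inline-formula>.</p>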
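      <p>The majority vote, the worker reliability and the resulting feature vector can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the code used for the submitted runs: the function names, the toy ratings, the familiarity values and the rule that ties resolve to label 1 are all hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter, defaultdict


def majority_votes(ratings):
    """Map each image to its majority label V(x_j).

    ratings: dict mapping (worker, image) -> vote in {0, 1}.
    """
    per_image = defaultdict(list)
    for (_worker, image), vote in ratings.items():
        per_image[image].append(vote)
    # Assumption: a tie is resolved in favour of label 1.
    return {img: int(2 * sum(v) >= len(v)) for img, v in per_image.items()}


def worker_reliability(ratings):
    """Q(w_i): share of a worker's votes that equal the majority vote."""
    majority = majority_votes(ratings)
    agree, total = Counter(), Counter()
    for (worker, image), vote in ratings.items():
        total[worker] += 1
        agree[worker] += int(vote == majority[image])
    return {w: agree[w] / total[w] for w in total}


def feature_vector(image, workers, ratings, familiarity, reliability):
    """(r_1k * Q(w_1), ..., r_nk * Q(w_n), f_w1, ..., f_wn) for one image."""
    weighted = [ratings[(w, image)] * reliability[w] for w in workers]
    return weighted + [familiarity[w] for w in workers]


# Toy data (hypothetical): three workers rate two images.
ratings = {
    ("w1", "x1"): 1, ("w2", "x1"): 1, ("w3", "x1"): 0,
    ("w1", "x2"): 0, ("w2", "x2"): 1, ("w3", "x2"): 0,
}
familiarity = {"w1": 3, "w2": 5, "w3": 1}  # hypothetical familiarity scale

Q = worker_reliability(ratings)
vec = feature_vector("x1", ["w1", "w2", "w3"], ratings, familiarity, Q)
print(Q)
print(vec)
```
]]></preformat>
      <p>In this toy example worker w1 always agrees with the majority and keeps full weight, whereas the votes of w2 and w3 are halved before they enter the feature vector.</p>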
    </sec>
    <sec id="sec-3">
      <title>3. EXPERIMENTS</title>
      <p>
        We submitted five different evaluation runs. For the
crowdsourcing task, a two-part data set was available. The first
part (MMSys data set) is described in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and is referred to
in this paper, for convenience, as DM. The second
part of the data is the Fashion 10000 data set (DF).
To transfer the experts’ knowledge from DM to DF, we use
transfer learning for all our runs: a model built from a data
set that contains expert knowledge (in this case DM) is used to
generate a new accurate model for the data set without expert
knowledge, in this case by
labeling the images from DF with the DM model.
      </p>
      <p>The first evaluation run, the required run #1,
used the feature vector of worker annotations described
above and the Weka Random Forest classifier, which
yielded good results in cross-validation on DM. Using a
model built from the DM data set, we labeled the images
from DF and retrained our model using the newly labeled
images.</p>
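      <p>The label-then-retrain procedure can be sketched in Python as follows. This is a hedged illustration only: a toy nearest-centroid classifier stands in for the Weka Random Forest, and the function names and the two-dimensional feature vectors are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def train_nearest_centroid(samples, labels):
    """Toy stand-in for the Random Forest classifier (an assumption)."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

    def predict(x):
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(x, centroid))
        return min(centroids, key=lambda y: sq_dist(centroids[y]))

    return predict


def transfer_labels(train, source_x, source_y, target_x):
    """Label the target set with a source model, then retrain on it."""
    source_model = train(source_x, source_y)        # model built from DM
    target_y = [source_model(x) for x in target_x]  # label DF with it
    target_model = train(target_x, target_y)        # retrain on labeled DF
    return target_y, target_model


# Hypothetical feature vectors; dm_* plays the role of DM, df_x of DF.
dm_x = [[0.9, 0.8], [1.0, 0.7], [0.1, 0.2], [0.0, 0.3]]
dm_y = [1, 1, 0, 0]
df_x = [[0.8, 0.9], [0.2, 0.1], [0.7, 0.6]]

df_y, df_model = transfer_labels(train_nearest_centroid, dm_x, dm_y, df_x)
print(df_y)
```
]]></preformat>
      <p>Passing the training routine as a parameter keeps the sketch independent of the classifier, so any learner with the same interface could be substituted.</p>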
      <p>
        For the visual content classification (run #2) we used DM
to build a model for classification. The classifier is search
based, which means that the image being classified is
treated as a query and the label is derived from the result list. A
similar approach was used in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Each result in the list votes
for a label, weighted by its inverse rank. The global features
used for classification are selected by the information
gain of each global feature with respect to the class labels
in the training set DM. For the combination, only features
with an above-average information gain are used. The
global features are combined with late fusion:
each global feature has its own classifier and
returns a ranked list for the given query image. The label weights
(inverse ranks) are then added up, resulting in a combination
by rank.
      </p>
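      <p>A minimal Python sketch of the feature selection, the inverse-rank voting and the late fusion is given below. It assumes that the retrieval step has already produced, for each feature, the labels of the top-ranked training images; the function names, the information gain values and the result lists are hypothetical and serve only as an illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


def select_features(info_gain):
    """Keep only features whose information gain is above the average."""
    mean_gain = sum(info_gain.values()) / len(info_gain)
    return [name for name, gain in info_gain.items() if gain > mean_gain]


def rank_votes(ranked_labels):
    """Each result votes for its label with weight 1 / rank (rank from 1)."""
    votes = defaultdict(float)
    for rank, label in enumerate(ranked_labels, start=1):
        votes[label] += 1.0 / rank
    return votes


def late_fusion(result_lists):
    """Add up the label weights of all per-feature result lists."""
    combined = defaultdict(float)
    for ranked_labels in result_lists.values():
        for label, weight in rank_votes(ranked_labels).items():
            combined[label] += weight
    return max(combined, key=combined.get), dict(combined)


# Hypothetical information gain per global feature.
gains = {"CEDD": 0.30, "PHOG": 0.20, "Tamura": 0.04}
selected = select_features(gains)

# Hypothetical top-3 result labels for one query image, per feature.
results = {
    "CEDD": ["fashion", "other", "fashion"],
    "PHOG": ["other", "fashion", "fashion"],
    "Tamura": ["other", "other", "other"],
}
label, totals = late_fusion({name: results[name] for name in selected})
print(selected, label, totals)
```
]]></preformat>
      <p>Because the fusion operates on ranked label lists only, each feature can use its own distance function without any score normalization.</p>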
      <p>Classification performance in terms of time and scaling
is promising. In the worst case with 12 features combined,
classification per image takes about 240 ms. In the best case
– if only one feature is used – classification time is down to
16 ms per image.</p>
      <p>Run #3 uses the same techniques as run
#2, but uses the worker annotations of DF for training the
model. Run #4 uses the images labeled in run #1 for
training, and run #5 combines run #1 and run #4 such
that classification based on visual features is used when the
random forest classifier returns an uncertain result.</p>
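      <p>The combination in run #5 could be sketched as below. How an uncertain result is detected is not specified here, so the sketch assumes that the random forest reports a class probability and treats values within a margin of 0.5 as uncertain; the function name and the margin value are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def combine_runs(rf_probability, visual_label, margin=0.1):
    """Return the random forest label unless it is uncertain.

    rf_probability: estimated probability of label 1 from the
        crowdsourcing-based classifier.
    visual_label: label (0 or 1) from the visual classifier.
    margin: assumed half-width of the uncertain band around 0.5.
    """
    if abs(rf_probability - 0.5) < margin:
        return visual_label  # uncertain: fall back to visual features
    return int(rf_probability >= 0.5)


print(combine_runs(0.90, visual_label=0))  # confident, keeps forest label
print(combine_runs(0.55, visual_label=0))  # uncertain, uses visual label
```
]]></preformat>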
    </sec>
    <sec id="sec-4">
      <title>4. DISCUSSION AND CONCLUSIONS</title>
      <p>To estimate the performance of each run, we used the test
data set and the DM experts’ votes as ground truth. We split
the data set into 80% for training and 20% for testing. The results of
these tests are shown in Table 1 for both labels (L1, L2).
L1 indicates whether an image is fashion related,
and L2 indicates whether the content of the image matches
the category of the fashion item depicted in the
image. For the evaluation we used the weighted F1 score (WF1),
because the positive and negative classes are not
comparable in size. The results of the benchmark are shown in
Table 2. The test results show that the crowdsourcing
classifier has the best performance, and the official results
support this finding. The outcome for the runs based on worker
information in the final results, compared to our test results,
indicates that transfer learning worked well.</p>
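      <p>For reference, a weighted F1 score can be computed as in the following Python sketch. It assumes that WF1 denotes the average of the per-class F1 scores weighted by the number of true instances of each class; the function names and the toy labels are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def f1_for_class(y_true, y_pred, positive):
    """F1 score when `positive` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to the class support."""
    support = Counter(y_true)
    total = sum(f1_for_class(y_true, y_pred, c) * n for c, n in support.items())
    return total / len(y_true)


# Toy labels (hypothetical): four positive and two negative images.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
print(f1_for_class(y_true, y_pred, 1), weighted_f1(y_true, y_pred))
```
]]></preformat>
      <p>In this toy example the F1 score of the positive class alone is 0.75, while the weighted score is about 0.67 because the weaker result on the smaller negative class is also taken into account.</p>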
      <p>Classification based on visual features performs much
better in our tests than in the final results (cf. runs
#2 to #4). It is common that metadata, even when generated by
crowdsourcing, leads to better results, but the
performance drop between the preliminary and official results is
still pronounced. However, WF1 scores are more suitable for a stable
judgment, as shown in Table 1 (e.g., run #3, F1 vs. WF1
scores in the preliminary runs).</p>
      <p>Nevertheless, taking all constraints into account, the
visual features perform quite well. Their benefit is that,
unlike crowdsourcing, which costs money, the effort to extract
them and obtain a small amount of training data is minimal.
Moreover, metadata quality depends on the actual
workers and the quality control mechanism of the crowdsourcing
platform. This is also indicated by the lower WF1 measure
of run #3 compared to run #2: in run #2 expert
votes were used to train the model, while run #3 also takes
crowdsourcing workers into account for training.</p>
      <p>Another interesting effect is that a combination of
crowdsourcing metadata and visual content can improve
performance if it is used to build the model of the visual
classifier. In the other direction, it seems to lower performance.
Models based on visual information already work well
with a small amount of training data, and a small amount of
crowdsourcing can help to boost the performance of visual
information retrieval systems.</p>
      <p>We further assume that our theory of framing is supported
by the results, especially our test results, because Label 1
is detected very well by our global features classifier. On
the other hand, Label 2 detection was not good. This is
plausible, because local features are better suited to the
task of object detection.</p>
      <p>For future work, it will be interesting to take a closer look
at the relationship between crowdsourcing and visual features, and how
each could be used to improve the performance of the other.
Another interesting direction would be to use
crowdsourcing to create a specific data set for framing. This
would help to establish a clearer definition and show the
usefulness of the framing theory.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Ipeirotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Provost</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>Quality management on Amazon Mechanical Turk</article-title>
          .
          <source>In Proceedings of the ACM SIGKDD workshop on human computation</source>
          , pages
          <fpage>64</fpage>
          -
          <lpage>67</lpage>
          . ACM,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Loni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bozzon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Larson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Gottlieb</surname>
          </string-name>
          .
          <article-title>Crowdsourcing for Social Multimedia at MediaEval 2013: Challenges, data set, and evaluation</article-title>
          .
          <source>In MediaEval 2013 Workshop</source>
          , Barcelona, Spain, October 18-19,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Loni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Menendez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Georgescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Galli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Massari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. S.</given-names>
            <surname>Altingovde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Martinenghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Melenhorst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vliegendhart</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Larson</surname>
          </string-name>
          .
          <article-title>Fashion-focused creative commons social dataset</article-title>
          .
          <source>In Proceedings of the 4th ACM Multimedia Systems Conference, MMSys '13</source>
          , pages
          <fpage>72</fpage>
          -
          <lpage>77</lpage>
          , New York, NY, USA,
          <year>2013</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lux</surname>
          </string-name>
          .
          <article-title>LIRE: Open source image retrieval in Java</article-title>
          .
          <source>In Proceedings of the 21st ACM International Conference on Multimedia, MM '13</source>
          , page to appear, New York, NY, USA,
          <year>2013</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanjalic</surname>
          </string-name>
          .
          <article-title>Supervised reranking for web image search</article-title>
          .
          <source>In Proceedings of the international conference on Multimedia</source>
          , pages
          <fpage>183</fpage>
          -
          <lpage>192</lpage>
          . ACM,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>