Frame the Crowd: Global Visual Features Labeling boosted with Crowdsourcing Information

Michael Riegler (Klagenfurt University, Klagenfurt, Austria, miriegle@edu.uni-klu.ac.at)
Mathias Lux (Klagenfurt University, Klagenfurt, Austria, mlux@itec.uni-klu.ac.at)
Christoph Kofler (Delft University of Technology, Delft, The Netherlands, c.kofler@tudelft.nl)

ABSTRACT
In this paper we present our approach to the Crowdsourcing Task of the MediaEval 2013 Benchmark [2] using transfer learning and visual features. For the visual features we adopt an existing approach for search-based classification using content-based image retrieval on global features, with feature selection and feature combination to boost performance. Our approach gives a baseline evaluation indicating the usefulness of global visual features, hashing, and search-based classification.

1. INTRODUCTION
The benchmarking task at hand [2] has been investigated by two different means as well as a combination of them. First, we use only crowdsourcing data for the labeling: we compute a reliability measure for workers and use this value, along with the workers' self-reported familiarity, as features for a classifier. Our second approach is based on the assumption that images taken with similar intentions, e.g., displaying a fashion style, are framed in a similar way.

We define the framing of an image as the sum of the visible reflexes of the specific decisions the photographer makes when the image is captured. The photographer has many different choices when taking a photo of a certain object, event, person, or scene. During the capture process the photographer does not click the shutter randomly, but rather makes use, either consciously or unconsciously, of a set of conventions that can be thought of as a recipe for a certain kind of image. The recipe leads to a distinguishable framing that the viewer uses in interpreting the image. For example, a picture of a person framed in one way is most easily interpreted as a fashion image, while framed in another way it is most easily interpreted as a holiday memory. Choices photographers make to achieve certain types of framing include color distribution, lighting, positions of objects and people, etc. They also include the choice of the exact moment during ongoing action at which the image is shot. In this way, the photographer also influences exactly what is depicted in the image, e.g., the facial expressions of the people appearing in it. The framing theory is especially applicable to fashion use cases. Due to the nature of framing, we employ global visual features, using and modifying the LIRE framework [4], and boost classification results with feature combination and feature selection.

2. APPROACH
Using LIRE we extracted the global features CEDD, FCTH, JCD, PHOG, EH, CL, Gabor, Tamura, LL, OH, JPEGCoeff, and SC (described and referenced in [4]). These features are able to detect and distinguish characteristics of framing, e.g., the color distribution with Color Layout.

The task includes one required condition, which only allows the use of the workers' annotations. However, those annotations are known to be error prone. Therefore, we integrated a reliability measure for workers based on the work of Ipeirotis et al. [1].

We compare a rating r_ij ∈ R from worker w_i ∈ W for image x_j ∈ I, with R being the set of ratings, W the set of workers, and I the set of images, to the majority vote V(x_j) of all workers for an image. This gives a measure of reliability Q(w_i) of a specific worker w_i. The computed weight Q(w_i) is then multiplied with the vote r_ij of worker w_i for image x_j:

    V(x_j) = \arg\max_{v \in \{0,1\}} \left| \{ r_{ij} : w_i \in W \wedge r_{ij} = v \} \right|

    Q(w_i) = \frac{\left| \{ r_{ij} : x_j \in I \wedge r_{ij} = V(x_j) \} \right|}{\left| \{ r_{ij} : x_j \in I \} \right|}

Additionally, the familiarity of the worker with the fashion topic is added as a feature. The feature vector for an image i_k with ratings from three workers w_1, w_2, w_3 is therefore (r_1k · Q(w_1), r_2k · Q(w_2), r_3k · Q(w_3), f_w1, f_w2, f_w3).
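To make the weighting scheme concrete, the following is a minimal Python sketch, not the authors' published code; the data layout (triples of worker, image, rating) and all names are illustrative assumptions, and ties in the majority vote are broken arbitrarily.

    from collections import Counter, defaultdict

    def majority_votes(ratings):
        """V(x_j): per-image majority vote over all binary worker ratings.
        ratings is a list of (worker_id, image_id, rating), rating in {0, 1}."""
        per_image = defaultdict(list)
        for _, image, rating in ratings:
            per_image[image].append(rating)
        return {image: Counter(votes).most_common(1)[0][0]
                for image, votes in per_image.items()}

    def worker_reliability(ratings, majority):
        """Q(w_i): fraction of a worker's ratings that agree with the majority."""
        agree, total = Counter(), Counter()
        for worker, image, rating in ratings:
            total[worker] += 1
            if rating == majority[image]:
                agree[worker] += 1
        return {worker: agree[worker] / total[worker] for worker in total}

    def feature_vector(worker_ratings, reliability, familiarity):
        """(r_1k * Q(w_1), r_2k * Q(w_2), r_3k * Q(w_3), f_w1, f_w2, f_w3)
        for one image rated by three workers."""
        weighted = [r * reliability[w] for w, r in worker_ratings]
        return weighted + [familiarity[w] for w, _ in worker_ratings]

    # Toy example with three workers and two images.
    ratings = [("w1", "x1", 1), ("w2", "x1", 1), ("w3", "x1", 0),
               ("w1", "x2", 0), ("w2", "x2", 1), ("w3", "x2", 0)]
    V = majority_votes(ratings)
    Q = worker_reliability(ratings, V)
    familiarity = {"w1": 0.8, "w2": 0.5, "w3": 0.9}  # hypothetical self-reports
    print(feature_vector([("w1", 1), ("w2", 1), ("w3", 0)], Q, familiarity))
    # -> [1.0, 0.5, 0.0, 0.8, 0.5, 0.9]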
3. EXPERIMENTS
We submitted five different evaluation runs. For the crowdsourcing task, a two-part data set was available. The first part (the MMSys data set) is described in [3] and is referenced, for convenience, in this paper as D_M. The second part is called the Fashion 10000 data set (D_F). To transfer the experts' knowledge from D_M to D_F, we use transfer learning for all our runs: a model built from a data set containing expert knowledge (in this case D_M) is used to generate a new, accurate model for the data set without expert knowledge, i.e., by labeling the images from D_F with the D_M model.

The first evaluation run, the required one (run #1), made use of the feature vector of worker annotations described above and the Weka Random Forest classifier, which yielded good results in cross-validation on D_M. Using a model built from the D_M data set we labeled the images from D_F and retrained our model on the newly labeled images.

For the visual content classification (run #2) we used D_M to build a model for classification. The classifier is search based, which means that the image being classified is treated as a query and its label is derived from the result list; a similar approach was used in [5]. Each result in the list votes for a label, weighted by its inverse rank. The selection of global features for classification is based on the information gain of each global feature with respect to the class labels in the training set D_M. For the combination, only features with above-average information gain are used. The combination of the global features is done with late fusion: each global feature has its own classifier and returns a ranked list for the given query image, and the label weights (inverse ranks) are then added up, resulting in a combination by rank.
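The voting and fusion steps of run #2 can be sketched as follows, assuming the per-feature ranked result lists have already been retrieved (e.g., with LIRE's image searchers); the function names and toy data are illustrative, not the actual implementation.

    from collections import defaultdict

    def vote_by_inverse_rank(ranked_labels):
        """One feature's search-based classifier: each retrieved image votes
        for its label, weighted by the inverse of its 1-based rank."""
        scores = defaultdict(float)
        for rank, label in enumerate(ranked_labels, start=1):
            scores[label] += 1.0 / rank
        return scores

    def late_fusion(per_feature_result_lists):
        """Late fusion: sum the per-feature inverse-rank label weights
        and return the label with the highest combined weight."""
        fused = defaultdict(float)
        for ranked_labels in per_feature_result_lists:
            for label, score in vote_by_inverse_rank(ranked_labels).items():
                fused[label] += score
        return max(fused, key=fused.get)

    # Labels of the top-3 retrieved images for one query, from three features
    # that passed the above-average information gain selection (made-up data).
    result_lists = [
        ["fashion", "fashion", "non-fashion"],
        ["fashion", "non-fashion", "fashion"],
        ["non-fashion", "fashion", "fashion"],
    ]
    print(late_fusion(result_lists))  # -> fashion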
Classification performance in terms of time and scaling is promising. In the worst case, with all 12 features combined, classification takes about 240 ms per image. In the best case, if only one feature is used, classification time is down to 16 ms per image.

Run #3 uses the same techniques as described for run #2, but uses the worker annotations of D_F for training the model. Run #4 uses the images labeled in run #1 for training, and run #5 combines run #1 and run #4 in such a way that classification based on visual features is used whenever the random forest classifier returns an uncertain result.
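Read literally, run #5 is a confidence fallback. Below is a minimal sketch under the assumption that the random forest exposes a per-class confidence (Weka classifiers do via their class distribution output); the threshold and all names are illustrative choices, as the paper does not state how "uncertain" is defined.

    def classify_with_fallback(image, forest_label, forest_confidence,
                               visual_classify, threshold=0.6):
        """Run #5 style combination: keep the random forest label when the
        classifier is confident, otherwise fall back to the search-based
        visual classifier (e.g., late_fusion over LIRE result lists)."""
        if forest_confidence >= threshold:
            return forest_label
        return visual_classify(image)

    # Hypothetical usage: a low-confidence forest prediction triggers the fallback.
    label = classify_with_fallback("query.jpg", "fashion", 0.55,
                                   visual_classify=lambda img: "non-fashion")
    print(label)  # -> non-fashion, since 0.55 < 0.6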
4. DISCUSSION AND CONCLUSIONS
To estimate the performance of each run we used the test data set and the D_M expert votes as ground truth, splitting the data set into 80% for training and 20% for testing. The results of these tests can be seen in Table 1 for both labels (L1, L2). L1 stands for whether an image is fashion related or not, and L2 stands for whether the content of the image matches the category of the fashion item depicted in it. For the evaluation we used the weighted F1 score (WF1), because the positive and negative classes are not comparable in size.

Table 1: Preliminary Test Results
    Run   F1 L1    F1 L2    WF1 L1   WF1 L2
    1     0.882    0.882    0.872    0.915
    2     0.7669   0.2599   0.7368   0.6047
    3     0.7483   0.0493   0.623    0.5215
    4     0.7608   0.1204   0.6894   0.5489
    5     0.885    0.892    0.883    0.932

The results of the benchmark can be seen in Table 2.

Table 2: Official MediaEval Results
    Run   F1 L1    F1 L2
    1     0.7124   0.7071
    2     0.5201   0.2908
    3     0.4986   0.4269
    4     0.5403   0.3938
    5     0.7123   0.6999

The test results show that the crowdsourcing classifier has the best performance, and the official results support this finding. The outcome for the worker-information-based runs in the final results, compared to our test results, indicates that transfer learning worked well.

Classification based on visual features performs much better in our tests than in the final results (cf. runs #2 to #4). It is common that metadata, even when generated by crowdsourcing, leads to better results, but the performance drop between the preliminary and official results is still striking. However, the WF1 scores are more suitable for a steady judgment, as shown in Table 1 (e.g., run #3, F1 vs. WF1 scores in the preliminary runs).

Nevertheless, taking all constraints into account, the visual features perform quite well. Their benefit is that, unlike crowdsourcing, which costs money, the effort to extract them and obtain a small amount of training data is minimal. Moreover, metadata quality depends on the actual workers and the quality control mechanism of the crowdsourcing platform. This is also indicated by the lower WF1 measure of run #3 compared to run #2: in run #2 expert votes were used to train the model, while run #3 also takes crowdsourcing workers into account for training.

Another interesting effect is that a combination of crowdsourcing metadata and visual content can improve performance if the metadata is used to build the model of the visual classifier; in the other direction, it seems to lower performance. Visual-information-based models have already worked well with a small amount of training data, and a small amount of crowdsourcing can help to boost the performance of visual information retrieval systems.

We further assume that our theory of framing is supported by the results, especially our test results, because Label 1 is detected very well by our global features classifier. On the other hand, the detection of Label 2 was not good. This is to be expected, because local features are better suited for object detection tasks.

For future work it will be interesting to take a closer look at the relationship between crowdsourcing and visual features, i.e., how crowdsourcing could be used to improve the performance of visual features and vice versa. Another interesting direction would be to use crowdsourcing to create a data set specifically for framing. This would help to draw a clearer definition of framing and show the usefulness of the framing theory.

5. REFERENCES
[1] P. G. Ipeirotis, F. Provost, and J. Wang. Quality management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation, pages 64-67. ACM, 2010.
[2] B. Loni, A. Bozzon, M. Larson, and L. Gottlieb. Crowdsourcing for social multimedia at MediaEval 2013: Challenges, data set, and evaluation. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.
[3] B. Loni, M. Menendez, M. Georgescu, L. Galli, C. Massari, I. S. Altingovde, D. Martinenghi, M. Melenhorst, R. Vliegendhart, and M. Larson. Fashion-focused Creative Commons social dataset. In Proceedings of the 4th ACM Multimedia Systems Conference, MMSys '13, pages 72-77, New York, NY, USA, 2013. ACM.
[4] M. Lux. LIRE: Open source image retrieval in Java. In Proceedings of the 21st ACM International Conference on Multimedia, MM '13, New York, NY, USA, 2013. ACM. To appear.
[5] L. Yang and A. Hanjalic. Supervised reranking for web image search. In Proceedings of the International Conference on Multimedia, pages 183-192. ACM, 2010.