Imcube @ MediaEval 2015 Retrieving Diverse Social Images Task: Multimodal Filtering and Re-ranking

Sebastian Schmiedeke, Pascal Kelm, and Lutz Goldmann
imcube labs GmbH, Berlin, Germany
{schmiedeke, kelm, goldmann}@imcube.de

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

ABSTRACT
This paper summarizes the participation of Imcube in the Retrieving Diverse Social Images Task of MediaEval 2015. The task addresses the problem of result diversification in the context of social photo retrieval, where the results of a query should contain relevant but diverse items. We therefore propose a multi-modal approach for filtering and re-ranking in order to improve the relevancy and diversity of the returned list of ranked images.

1. INTRODUCTION
The Retrieving Diverse Social Images Task of MediaEval 2015 [5] requires participants to develop a system that automatically refines a list of images returned by a Flickr query in such a way that the most relevant and diverse images are returned in a ranked list of up to 50 images.

A photo is considered relevant if it is a common representation of the overall query concept, is of good visual quality (sharpness, contrast, colours) and does not show people as main subjects, except for queries dealing with people as part of the topic. The results are considered diverse if they depict different visual aspects (time, location, view, style, etc.) of the target concept with a certain degree of complementarity.

The refinement and diversification process can be based on the social metadata associated with the collected photos in the data set and/or on the visual characteristics of the images. Furthermore, the task provides information about user annotation credibility as an automatic estimate of the quality of a particular user's tags.

2. SYSTEM DESCRIPTION
In this section, we present our approach, which combines textual, visual and credibility information to filter and re-rank the initial results. It consists of two steps, relevancy improvement and diversification, as depicted in Figure 1.

[Figure 1: Proposed approach – a relevancy improvement stage (fusion of text, visual and credibility modules) followed by a diversification stage (fusion of text and visual modules).]

The goal of the first step is to improve the relevancy of the ranked image list by re-ranking the images based on more reliable textual and visual criteria and by filtering out images which are irrelevant for the given application scenario. The goal of the second step is to improve the diversity of the ranked image list through textual filtering as well as visual clustering and re-ranking. The individual modules are described in the following sections.

2.1 Textual relevancy improvement
This step exploits additional information extracted from the corresponding Wikipedia article that is provided together with the query. To improve the ranking of the images, the query is expanded with the most frequent words from the Wikipedia article and the images are re-ranked using a bag-of-words representation. The relevancy is further improved by removing images that do not match the original query and the location information from the Wikipedia article. The location information is extracted by analysing the original query or the Wikipedia title (e.g. "Great Sphinx of Giza"), taking into account typical prepositions for locations (e.g. "in", "at", "on", "of", "de"). If no location information can be extracted (e.g. "Niagara Falls"), toponyms are not considered for relevancy filtering.
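As an illustration of the location heuristic above, the following Python sketch extracts a toponym from a query or Wikipedia title by searching for the listed prepositions and applies a rough textual match. This is a minimal sketch under our own assumptions; the paper does not publish its implementation, and the exact way query and toponym matches are combined is not specified, so all names and the matching logic are hypothetical.

```python
# Sketch of the toponym-extraction heuristic of Section 2.1.
# Function and variable names are illustrative, not the original code.

LOCATION_PREPOSITIONS = {"in", "at", "on", "of", "de"}

def extract_toponym(title):
    """Return the presumed location part of a query or Wikipedia title.

    "Great Sphinx of Giza"        -> "Giza"
    "chinese new year in Beijing" -> "Beijing"
    "Niagara Falls"               -> None (no preposition, filtering skipped)
    """
    words = title.split()
    toponym = None
    for i, word in enumerate(words[:-1]):
        if word.lower() in LOCATION_PREPOSITIONS:
            # Keep the text after the last matching preposition.
            toponym = " ".join(words[i + 1:])
    return toponym

def matches_query_or_toponym(image_text, query, toponym):
    """Rough relevancy test on an image's concatenated tags/description."""
    text = image_text.lower()
    if query.lower() in text:
        return True
    return toponym is not None and toponym.lower() in text
```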
2.2 Visual relevancy improvement
Visual information is also used to improve the relevancy by re-ranking the images according to different criteria. For each visual feature, a ranked image list is derived based on the computed relevancy scores.

Since images with persons as main subjects are considered irrelevant, we employ a face detector [7] trained for frontal and profile faces to determine the size of facial regions. The inverse relative size of the detected faces determines the relevancy: the smaller the area covered by faces, the more relevant the image.

Additionally, photos taken from the target location but not displaying it are considered irrelevant. We model this aspect of relevancy by computing the visual similarity between the retrieved images and the images available from the associated Wikipedia article. We use histogram of oriented gradients (HOG) features and a clusterless BoW approach [6] based on speeded up robust features (SURF) to generate histograms for each of the images. The similarity between the retrieved images and the Wikipedia images is computed through histogram intersection, and the retrieved images are re-ranked according to the maximum score across the set of Wikipedia images.

We further incorporate aesthetic aspects to emphasize more visually appealing images, since less blurry and more salient images are usually considered more relevant. The sharpness is calculated as the ratio of gradient magnitudes between differently blurred versions of the original image; the larger that ratio, the more relevant the image. Saliency is measured using a spectral residual approach [4]. Considering the different criteria described above, we obtain five ranked image lists (Face, HOG, BoW, Sharpness, Saliency), which are fused using weighted rank fusion.
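The paper does not give the exact fusion formula, so the following sketch shows one common form of weighted rank fusion that is consistent with the description: each of the five criterion-specific rankings contributes a weighted reciprocal-rank score and the images are re-ordered by the combined score. The weights and names are illustrative assumptions.

```python
# Sketch of weighted rank fusion over the per-criterion rankings of
# Section 2.2 (Face, HOG, BoW, Sharpness, Saliency). The actual weights
# and fusion formula are not published; this reciprocal-rank variant is
# only one plausible instantiation.

def weighted_rank_fusion(ranked_lists, weights):
    """ranked_lists: dict criterion -> list of image ids (best first).
    weights: dict criterion -> float. Returns the fused ranking."""
    scores = {}
    for criterion, ranking in ranked_lists.items():
        w = weights.get(criterion, 1.0)
        for rank, image_id in enumerate(ranking, start=1):
            # Higher-ranked images (small rank) contribute more.
            scores[image_id] = scores.get(image_id, 0.0) + w / rank
    return sorted(scores, key=scores.get, reverse=True)

# Example usage with equal weights for the five criteria:
# fused = weighted_rank_fusion(
#     {"face": face_list, "hog": hog_list, "bow": bow_list,
#      "sharpness": sharp_list, "saliency": sal_list},
#     weights={c: 1.0 for c in ("face", "hog", "bow", "sharpness", "saliency")})
```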
2.3 Credibility-based relevance improvement
This step is intended as the baseline approach for improving the relevance. It re-ranks the image list according to the credibility of the owner of an image. The re-ranking is based on three scores which describe the user credibility [5]: the use of correct tags (visualScore), the use of specific tags (tagSpecificity) and the user's preference for photographing faces (faceProportion). Following the application scenario, the combined credibility score for an image is high if the user has a high visualScore, a high tagSpecificity and a low faceProportion.

2.4 Textual diversification
The final image list should not only contain relevant images but also diverse ones, i.e. images depicting different aspects of the topic. Based on the assumption that images with an identical textual description often depict very similar content, the images are clustered according to their textual similarity. The ranked image list is then obtained by ordering the clusters in descending order of their relevancy and iteratively selecting the most relevant image from each cluster.

2.5 Visual diversification
The visual diversification considers multiple visual characteristics, including colour (ColorMoment), structure (HOG and the clusterless BoW approach [6]) and texture (local binary patterns, LBP). For each feature, the normalized distances between the retrieved images are combined using a weighted sum and then projected into a lower-dimensional space with the FastMap algorithm [2]. On the resulting 5-dimensional feature space, kMeans++ clustering [1] is applied.

The number of clusters is estimated with Hartigan's leader clustering algorithm [3], but restricted to lie between 5 and 21. Clusters with a low mean relevancy or clusters containing only a few images are discarded, since such small clusters are very likely to contain outliers. The remaining clusters are sorted in descending order of their maximum relevancy, and the ranked image list is obtained by iteratively selecting the best image from each of them.
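The cluster-based selection used in Sections 2.4 and 2.5 can be sketched as follows: clusters are ranked by the relevancy of their best image and the final list is filled in a round-robin fashion, repeatedly taking the most relevant remaining image of each cluster. This is a minimal sketch under our own data-structure assumptions, not the original implementation.

```python
# Sketch of the diversification step: rank clusters by their best image
# and interleave their members, most relevant first. Cluster contents and
# relevancy scores are assumed to come from the preceding modules.

def diversify(clusters, relevancy, max_images=50):
    """clusters: iterable of lists of image ids; relevancy: dict id -> score."""
    # Order the images inside each cluster by decreasing relevancy; drop empties.
    ordered = [sorted(c, key=relevancy.get, reverse=True) for c in clusters if c]
    # Rank the clusters by the relevancy of their best image.
    ordered.sort(key=lambda c: relevancy[c[0]], reverse=True)

    result = []
    depth = 0  # 0 = best image of each cluster, 1 = second best, ...
    while len(result) < max_images:
        picked_any = False
        for cluster in ordered:
            if depth < len(cluster):
                result.append(cluster[depth])
                picked_any = True
                if len(result) == max_images:
                    break
        if not picked_any:  # all clusters exhausted
            break
        depth += 1
    return result
```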
3. EXPERIMENTS & RESULTS
The following experiments were performed with the system and the individual modules described above, following the guidelines of the task. Run1 uses visual information only (Sec. 2.2 and 2.5). Run2 is based on purely textual information (Sec. 2.1 and 2.4). Run3 combines textual and visual information (Sec. 2.1, 2.2 and 2.5). Run5 uses credibility-based relevancy and visual diversity (Sec. 2.3 and 2.5).

The experiments are performed on the provided test set, which contains 69 one-concept location queries and 70 multi-concept queries related to events. Table 1 shows the results on the test set for all runs defined above. Since we want to evaluate our filters under different conditions, scores for the one-concept and multi-concept queries are also reported separately.

Table 1: Evaluation of the different runs.

               Average P@20                          Average CR@20                         Average F1@20
       one-concept  multi-concept  overall   one-concept  multi-concept  overall   one-concept  multi-concept  overall
run1      0.7014       0.6743      0.6878       0.3963       0.4209      0.4087       0.4885       0.5027      0.4957
run2      0.7819       0.7143      0.7478       0.4380       0.3986      0.4182       0.5478       0.4917      0.5195
run3      0.7674       0.6993      0.7331       0.4285       0.4064      0.4174       0.5340       0.4926      0.5132
run5      0.5928       0.6671      0.6302       0.3410       0.3508      0.3460       0.4251       0.4448      0.4350

In general, the textual run (run2) achieves the best results. It reaches a higher precision (P@20 = 0.748) and also a slightly better recall (CR@20 = 0.418) than the visual run (run1). The advantage is more pronounced for one-concept queries than for multi-concept queries. The textual run fails for queries whose main topic is not correlated to a location, e.g. "chinese new year in Beijing" (its main topic being the fireworks), "paragliding in the mountains" or "tropical rain". For these cases, the visual run reaches considerably higher F1 scores. Overall, the purely visual run achieves a better recall (CR@20 = 0.4209) and thus a slightly better F1 score (F1@20 = 0.5027) for multi-concept queries.

Since the combination of visual and textual features (run3) generally achieves lower scores than the individual modalities, we analyse the cases where improvements were made. For example, the previously mentioned query "chinese new year in Beijing" benefits from the visual information with a considerable increase of the F1 measure (∆F1@20 = 0.18). Compared to run2, run3 achieves a lower precision (∆P@20 = −0.015) and a similar recall (∆CR@20 = −0.001), leading to a slightly lower F1 score (∆F1@20 = −0.006). However, it is interesting to note that the results differ for one-concept and multi-concept queries: the recall of run3 is higher than that of run2 for multi-concept queries (∆CR@20 = 0.008), while it is lower for one-concept queries (∆CR@20 = −0.009).

4. CONCLUSION
The results of the different runs show that, overall, the best results are achieved with textual information only and that the fusion of visual and textual information leads to slightly worse results. A more detailed analysis shows that visual information provides better results for multi-concept queries and for queries whose main topic is not correlated to a location, while textual information performs better for one-concept queries. This suggests that a more advanced fusion approach for combining textual and visual information may improve the results further.

5. REFERENCES
[1] D. Arthur and S. Vassilvitskii. K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07), pages 1027–1035, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics.
[2] C. Faloutsos and K. Lin. FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pages 163–174, New York, NY, USA, 1995. ACM.
[3] J. A. Hartigan. Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA, 1975.
[4] X. Hou and L. Zhang. Saliency detection: A spectral residual approach. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR '07), pages 1–8, June 2007.
[5] B. Ionescu, A. L. Gînscă, B. Boteanu, A. Popescu, M. Lupu, and H. Müller. Retrieving Diverse Social Images at MediaEval 2015: Challenge, dataset and evaluation. In MediaEval 2015 Workshop, Wurzen, Germany, 2015.
[6] S. Schmiedeke, P. Kelm, and T. Sikora. DCT-based features for categorisation of social media in compressed domain. In IEEE 15th International Workshop on Multimedia Signal Processing (MMSP), pages 295–300, 2013.
[7] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), volume 1, pages I-511–I-518, 2001.