MIS @ Retrieving Diverse Social Images Task 2015

Maia Zaharieva(1,2) and Lukas Diem(2)
(1) Interactive Media Systems Group, Vienna University of Technology, Austria
(2) Multimedia Information Systems Group, University of Vienna, Austria
maia.zaharieva@[tuwien|univie].ac.at, l.diem@univie.ac.at

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

ABSTRACT
In this paper, we describe our approach for the MediaEval 2015 Retrieving Diverse Social Images Task. The proposed approach exploits available user-generated textual descriptions and the visual content of the images in combination with common, unsupervised clustering techniques in order to increase the diversification of retrieval results. Preliminary experiments indicate that the approach generalizes well across datasets and achieves comparable results for single- and multi-topic queries.

1. INTRODUCTION
Manual assessment of the relevance of publicly available images to a particular query is not feasible due to the immense amount of data captured and shared daily on social media platforms. As a result, the automated optimization of image retrieval results is constantly gaining in importance. Next to relevance, the diversification of retrieval results plays a crucial role in reducing the redundancy in the retrieved images and, thus, increasing the efficiency of overviewing the underlying data. The MediaEval 2015 Retrieving Diverse Social Images Task [4] addresses these challenges in the form of a tourist-oriented retrieval task, where the topics of interest represent sightseeing spots around the world. The aim of the task is to refine the set of images retrieved from Flickr while taking into account both the relevance and the diversity of the selected images.

Previous work in this context shows a broad range of possible approaches. The original Flickr ranking is commonly improved by a direct comparison with the corresponding Wikipedia images [5][8]. Other methods employ training with support vector machines (SVM) [6] or regression models [3]. The diversification of retrieval results is usually approached by means of conventional clustering algorithms, such as k-means [3][6], hierarchical clustering [1][2], and random forests [8], or by an ensemble of clustering approaches [5].

In this paper, we address relevance re-ranking by means of a similarity score to a reference set of images. This reference set is given by Wikipedia images (if available) or by the top-ranked images provided by Flickr. To increase diversification, we employ a hierarchical clustering algorithm and compare the performance of recently introduced, powerful visual features with text-based approaches, which are well established in the context of web mining and retrieval.

2. APPROACH
We employ a multi-stage workflow for the retrieval of diverse social images, which comprises the following steps: 1) data preprocessing, 2) relevance reranking, and 3) image clustering and final image selection.

In the first step, data preprocessing, we filter potentially irrelevant images, i.e., images with humans as the main subjects and images that are captured far away from the topic of interest. We employ the OpenCV (http://opencv.org) face detector and remove images with faces whose area exceeds 5% of the total image area. Additionally, if GPS data is available, we measure the distance between the topic of interest and the corresponding images and remove those with a Haversine distance [7] greater than 100 km. The reason for this strict threshold is the underlying tourist application scenario, where the precision of a location's specification varies strongly, from a particular spot (e.g., the Tower Bridge in London) to large-scale locations such as national parks or entire cities.

The aim of the second stage, relevance reranking, is to improve the original Flickr rating. Since the provided Wikipedia images are by definition representative [4], we measure the visual similarity between the images of a set and the associated Wikipedia images by means of the Euclidean distance between the corresponding adapted convolutional neural network (CNN) based descriptors. If no Wikipedia images are provided for a given query, we consider the top 10 images from the original Flickr ranking as reference images. Subsequently, all images are reranked according to the achieved similarity score.

In the third step, image clustering, we aim at finding groups of similar images which can be used to diversify the final image results. For the visual-based runs, preliminary experiments with the provided visual descriptors [4] and different clustering algorithms (k-means, k-medoids, X-means, and agglomerative hierarchical clustering (AHC)) showed that the best-performing method for the development data uses CNN as a visual feature and the AHC clustering method. The final selection of images from the clusters follows a round-robin approach. We start by selecting the image with the best relevance score from each cluster. These images, sorted in ascending order, constitute the m highest ranked results, where m is the number of detected clusters. The selected images are removed from their corresponding clusters and the selection process is repeated until the required number of retrieved results is achieved. We employ Ward's aggregation method and limit the number of final clusters to 50 based on preliminary experiments.

For the text-based runs we consider two approaches. First, we perform topic modeling on the textual descriptions of each image (title and tags) using Latent Dirichlet Allocation (LDA) and the MALLET Toolbox (http://mallet.cs.umass.edu) and extract T topics for the employed dataset. For each image, we estimate the likelihoods l1 and l2 of the first- and second-best matching topics. If the ratio of the likelihoods is below a threshold τ (l2/l1 < τ), the most likely topic (l1) is assigned to the photo; otherwise, no topic is assigned. We set T = 50 and τ = 0.8 for all experiments.

The second text-based approach considers the well-established term frequency-inverse document frequency (TF-IDF). We compute the TF-IDF vector for each image using the complete textual description (title, tags, and descriptions). The textual descriptions are first preprocessed to increase their expressiveness, i.e., we remove potential occurrences of the corresponding user name, web links, and stopwords, and we additionally stem all remaining terms. Furthermore, we account for images with missing textual descriptions. In such a case, we search for the temporally closest image with a description, which is either captured within a predefined radius (10 meters in our experiments) or by the same user within a predefined short time span (e.g., 5 minutes). Subsequently, we cluster the resulting TF-IDF vectors, again using the AHC method, where the similarity between the TF-IDF vectors is measured by the cosine similarity. The selection of the final image set follows the round-robin approach as described for the visual-based approach.

3. EXPERIMENTAL RESULTS
Table 1 presents a selection of our preliminary experiments on the development dataset. The results show that the preprocessing step (face and GPS filter) only marginally improves the performance for the top 20 retrieved images in comparison to the Flickr baseline results. Nevertheless, 95% of the rejected images are irrelevant with respect to the underlying search query. Experiments with the text-based runs show only minor differences in the performance of the TF-IDF and the LDA-based methods. While the achieved precision (P@20) is comparable to that of the Flickr baseline, the cluster recall (CR@20) improves notably, e.g., from 0.34 to 0.46 using the TF-IDF approach. For the visual-based runs, the relevance reranking step using the CNN features yields a significant increase in relevance (P@20 score of 0.94). However, the drop in the cluster recall indicates an increase of redundancy in the retrieved images as a side effect. Overall, the best-performing text-based and visual-based runs are comparable in terms of F1@20, with the computational costs for the text-based runs being significantly lower. The multimodal runs additionally improve both the cluster recall and the F1-scores slightly, by approximately 1%. Surprisingly, the reranking step in combination with the text-based image clustering and selection cannot compensate for the drop in the cluster recall.

Table 1: Experimental results on the development dataset in terms of precision (P@20), cluster recall (CR@20), and F1-score (F1@20). Employed runs consider visual (V) and/or textual (T) information.

Data  Preprocessing    Relevance reranking  Image clustering  P@20   CR@20  F1@20
–     Flickr baseline  –                    –                 0.812  0.343  0.471
T     GPS filter       –                    –                 0.820  0.350  0.478
V     Face filter      –                    –                 0.816  0.349  0.478
V,T   Face+GPS filter  –                    –                 0.825  0.355  0.485
T     –                –                    TF-IDF            0.784  0.455  0.569
T     GPS filter       –                    TF-IDF            0.799  0.462  0.577
T     –                –                    LDA               0.798  0.420  0.542
T     GPS filter       –                    LDA               0.815  0.429  0.553
V     –                CNN                  –                 0.936  0.282  0.420
V     –                –                    CNN               0.783  0.437  0.553
V     –                CNN                  CNN               0.831  0.454  0.578
V     Face filter      CNN                  CNN               0.835  0.461  0.584
V,T   Face+GPS filter  –                    TF-IDF            0.819  0.464  0.584
V,T   Face+GPS filter  CNN                  TF-IDF            0.925  0.318  0.460
V,T   Face+GPS filter  –                    LDA               0.830  0.437  0.564
V,T   Face+GPS filter  CNN                  LDA               0.933  0.318  0.459
V,T   Face+GPS filter  CNN                  CNN               0.849  0.468  0.593

Following our preliminary experiments, we submitted four runs corresponding to the best configuration for the respective modality (see Table 2). Table 3 summarizes the results of the official runs on the test dataset. In contrast to the development data, which contains the retrieval results of single-topic queries only, the test data differentiates between single-topic (e.g., Niagara Falls) and multi-topic queries (e.g., Academy Awards in Hollywood). Overall, there is no significant difference in the performance for the two subsets. While the (predominantly) visual-driven runs (runs 1 and 3) show a slight decrease in the cluster recall for the multi-topic queries, the text-driven runs (runs 2 and 5) indicate the opposite trend. Furthermore, in contrast to the results on the development data, the test runs show a notable difference between the performance of the text- and the visual-based runs. This reveals the better generalization ability of the visual-based runs to different datasets. Overall, the best performance in terms of F1-score (0.55) is achieved by the visual-based run which additionally considers the face and GPS filters to reject irrelevant images (run 3).

Table 2: Official runs configurations.

Run      Data preprocessing  Relevance reranking  Image clustering
1 (V)    Face filter         CNN                  CNN
2 (T)    GPS filter          –                    TF-IDF
3 (V,T)  Face+GPS filter     CNN                  CNN
5 (V,T)  Face+GPS filter     –                    TF-IDF

Table 3: MediaEval 2015 Benchmark results. Bold values indicate best values in terms of F1-score for the different types of test data.

       single-topic           multi-topic            overall
Run    P@20   CR@20  F1@20    P@20   CR@20  F1@20    P@20   CR@20  F1@20
1      0.779  0.450  0.552    0.763  0.434  0.534    0.771  0.442  0.543
2      0.690  0.407  0.501    0.697  0.434  0.519    0.694  0.421  0.511
3      0.794  0.455  0.560    0.764  0.435  0.535    0.778  0.445  0.547
5      0.688  0.409  0.501    0.699  0.429  0.517    0.694  0.419  0.509

4. CONCLUSION
In this paper we investigated both text- and visual-driven approaches for the diversification of Flickr image retrieval results. The achieved performances indicate that the visual-based approach copes well with different data and varying query types. Overall, the relevance reranking shows promising results in terms of precision. However, the diversification, measured by the cluster recall, increases only slowly. Our future work will exploit the potential of combining features of different modalities in the clustering process, e.g., by means of a late fusion approach.

Acknowledgment
This work has been partly funded by the Vienna Science and Technology Fund (WWTF) through project ICT12-010.

5. REFERENCES
[1] A. Castellanos, A. Garcia-Serrano, and J. Cigarran. UNED @ retrieving diverse social images task. In MediaEval Benchmark Workshop, 2014.
[2] D.-T. Dang-Nguyen, L. Piras, G. Giacinto, G. Boato, and F. D. Natale. Retrieval of diverse images by pre-filtering and hierarchical clustering. In MediaEval Benchmark Workshop, 2014.
[3] A. L. Ginsca, A. Popescu, and N. Rekabsaz. CEA LIST's participation at the MediaEval 2014 retrieving diverse social images task. In MediaEval Benchmark Workshop, 2014.
[4] B. Ionescu, A. L. Gînscă, B. Boteanu, A. Popescu, M. Lupu, and H. Müller. Retrieving diverse social images at MediaEval 2015: Challenge, dataset and evaluation. In MediaEval Benchmark Workshop, 2015.
[5] J. R. M. Palotti, N. Rekabsaz, M. Lupu, and A. Hanbury. TUW @ retrieving diverse social images task 2014. In MediaEval Benchmark Workshop, 2014.
[6] M. I. Sarac and P. Duygulu. Bilkent-RETINA at retrieving diverse social images task of MediaEval 2014. In MediaEval Benchmark Workshop, 2014.
[7] R. W. Sinnott. Virtues of the haversine. Sky and Telescope, 68(2):159, 1984.
[8] C. Spampinato and S. Palazzo. PeRCeiVe@UNICT at MediaEval 2014 diverse images: Random forests for diversity-based clustering. In MediaEval Benchmark Workshop, 2014.
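The GPS-based filter of the data preprocessing step (Section 2) rejects images whose Haversine distance [7] to the topic location exceeds 100 km. A minimal sketch of this filter; the Haversine formula is standard, while the image record fields (`lat`, `lon`) and the `gps_filter` helper name are illustrative assumptions, not part of the original system:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def gps_filter(images, topic_lat, topic_lon, max_km=100.0):
    """Keep images without GPS data or within max_km of the topic location."""
    kept = []
    for img in images:
        if img.get("lat") is None or img.get("lon") is None:
            kept.append(img)  # no GPS metadata: the distance filter cannot apply
        elif haversine_km(img["lat"], img["lon"], topic_lat, topic_lon) <= max_km:
            kept.append(img)
    return kept
```

The generous 100 km threshold mirrors the paper's tourist scenario, where a topic may denote anything from a single spot to an entire national park.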
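The relevance reranking stage orders images by their Euclidean distance to a reference set (Wikipedia images, or the top-10 Flickr images). A sketch under one stated assumption: the paper does not specify how distances to multiple reference descriptors are aggregated, so taking the minimum over the reference set is our choice here, not the authors':

```python
def rerank_by_reference(descriptors, references):
    """Return image indices ordered by increasing minimum Euclidean
    distance to the reference descriptors (most relevant first).

    descriptors, references: lists of equal-length feature vectors
    (e.g., the CNN-based descriptors used in the paper).
    """
    def min_dist(d):
        # Smallest Euclidean distance from d to any reference vector.
        return min(sum((a - b) ** 2 for a, b in zip(d, r)) ** 0.5
                   for r in references)
    return sorted(range(len(descriptors)), key=lambda i: min_dist(descriptors[i]))
```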
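The round-robin selection over the AHC clusters (Section 2) repeatedly takes the most relevant remaining image from each cluster until the required number of results is reached. A sketch assuming a distance-based relevance score where lower is better (consistent with the paper's ascending sort); the data layout is illustrative:

```python
def round_robin_select(clusters, relevance, k):
    """Select k image ids from clusters in round-robin fashion.

    clusters:  list of lists of image ids (one list per cluster)
    relevance: dict mapping image id -> relevance score (lower = better)
    """
    # Sort each cluster once so the best-scored image is always at the front.
    pools = [sorted(c, key=lambda i: relevance[i]) for c in clusters]
    selected = []
    while len(selected) < k and any(pools):
        # One pass over the clusters = one round of the round robin.
        for pool in pools:
            if pool and len(selected) < k:
                selected.append(pool.pop(0))
    return selected
```

The first pass yields m images (one per non-empty cluster), matching the paper's description that the m cluster representatives head the final ranking.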
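The LDA topic-assignment rule of the text-based approach (assign the most likely topic only if l2/l1 < τ, with τ = 0.8) can be written directly. A minimal sketch; the input is assumed to be a plain list of per-topic likelihoods for one image:

```python
def assign_topic(topic_likelihoods, tau=0.8):
    """Assign the best-matching topic index only if the second-best topic
    is clearly less likely (l2 / l1 < tau); otherwise return None."""
    ranked = sorted(range(len(topic_likelihoods)),
                    key=lambda t: topic_likelihoods[t], reverse=True)
    l1 = topic_likelihoods[ranked[0]]  # best-matching topic likelihood
    l2 = topic_likelihoods[ranked[1]]  # second-best topic likelihood
    return ranked[0] if l2 / l1 < tau else None
```

Leaving ambiguous images unassigned keeps topics "clean": an image whose two best topics are nearly equally likely contributes to neither.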