UPC-UB-STP @ MediaEval 2015 Diversity Task:
Iterative Reranking of Relevant Images

Aniol Lidon, Xavier Giró-i-Nieto
Universitat Politècnica de Catalunya
Barcelona, Catalonia/Spain
xavier.giro@upc.edu

Marc Bolaños, Petia Radeva
Universitat de Barcelona
Barcelona, Catalonia/Spain
marc.bolanos@ub.edu

Markus Seidl, Matthias Zeppelzauer
St. Pölten University of Applied Sciences
St. Pölten, Austria
m.zeppelzauer@fhstp.ac.at

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.

ABSTRACT

This paper presents the results of the UPC-UB-STP team in the 2015 MediaEval Retrieving Diverse Images Task. The goal of the challenge is to provide a ranked list of Flickr photos for a predefined set of queries. Our approach first generates a ranking of images based on a query-independent estimation of their relevance. Only the top results are kept and iteratively re-ranked based on their intra-similarity to introduce diversity.

1.   INTRODUCTION

The diversification of search results is an important factor in improving the usability of visual retrieval engines. This motivates the 2015 MediaEval Retrieving Diverse Images Task [8], which defines the scientific benchmark targeted in this paper. The proposed methodology addresses the trade-off between relevance and diversity by first filtering results based on a learned relevance classifier, and then building a diverse reranked list following an iterative scheme.

The first challenge in our system is filtering irrelevant images, as suggested in [2]. Relevance is a very abstract concept that involves a high degree of subjectivity. Similar problems have been addressed in the visual domain, such as memorability [10] or interestingness [16]. In both cases, a crowdsourced task was organised to collect a large amount of human annotations, which were used to train a classifier based on visual features.

The second challenge to address is the diversity of the ranked list. A seminal work from 1998 [1] introduced diversity in addition to relevance for text retrieval, a concept that was later ported to image [17, 4, 19] and video retrieval [7, 6]. Different features have been used for this purpose: textual (e.g. tags [20]), visual (e.g. convolutional neural networks [18]), and multimodal fusions of both [5].

2.   METHODOLOGY

A generic and easily extensible methodology of four steps has been applied in all our submitted runs. While steps 2 and 4 apply to all runs, steps 1 and 3 contain particularities for visual and textual processing.

1) Ranking by relevance: A relevance score for each image is estimated by using either visual or textual information (see details in Sections 2.1 and 2.2, respectively).

2) Filtering of irrelevant images: Only a percentage of the top images ranked by relevance are considered in later steps. In the multimodal runs, the relevance scores for the visual and textual modalities are linearly normalized and fused by averaging.

3) Feature and distance computation: Visual and/or textual features are extracted for each image, and the similarity between each pair of images is computed.

4) Reranking by diversity: An iterative algorithm selects the most different image with respect to all previously selected ones. The similarity is always assessed by averaging over the considered visual and textual features. Iterations start by adding the most relevant image as the first element of the reranked list. A sketch of this greedy selection is given below.
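As an illustration of Steps 2 and 4, the following Python fragment shows how the relevance scores of two modalities can be min-max normalized and averaged, and how the greedy diversity reranking proceeds over a precomputed pairwise similarity matrix. It is a minimal sketch: the function names and the use of the mean similarity to the already selected images are assumptions of the example, not literal code from our system.

import numpy as np

def fuse_scores(visual_scores, textual_scores):
    """Step 2 (multimodal runs): min-max normalize each modality and average."""
    def minmax(x):
        x = np.asarray(x, dtype=float)
        return (x - x.min()) / (x.max() - x.min() + 1e-12)
    return (minmax(visual_scores) + minmax(textual_scores)) / 2.0

def rerank_by_diversity(relevance, similarity):
    """Step 4: greedy diversity reranking of the images kept after Step 2.

    relevance:  1-D array, one (fused) relevance score per retained image.
    similarity: square matrix of pairwise similarities, obtained in Step 3 by
                averaging the considered visual and/or textual similarities.
    Returns the image indices in their new, diversity-aware order.
    """
    remaining = list(range(len(relevance)))
    selected = [int(np.argmax(relevance))]   # most relevant image seeds the list
    remaining.remove(selected[0])
    while remaining:
        # Pick the candidate least similar (on average) to the images already
        # selected; averaging over the selected set is an assumption of this sketch.
        avg_sim = [similarity[i, selected].mean() for i in remaining]
        nxt = remaining[int(np.argmin(avg_sim))]
        selected.append(nxt)
        remaining.remove(nxt)
    return selected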
2.1   Visual data

The visual information was analyzed with Convolutional Neural Networks (CNNs) [13, 12], following two different approaches:

1) Ranking by relevance: A Relevance CNN was created based on HybridNet [22], a CNN trained with objects from the ImageNet dataset [3] and locations from the Places dataset [22]. HybridNet was fine-tuned on two classes, relevant and irrelevant, as labeled by human annotators.

3) Feature and distance computation: The fully connected layer fc7 from a CNN trained on ImageNet [11] and the fully connected layer fc8 from HybridNet [22] were used as feature vectors [14].
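For illustration, a minimal feature-extraction sketch with Caffe's Python interface [11] is given below. The deploy/weights file names and the simplified preprocessing (no mean subtraction) are assumptions made for the example, not artifacts released with this paper.

import numpy as np
import caffe  # Caffe's Python interface [11]

caffe.set_mode_cpu()

# Hypothetical file names; the actual network definition and weights are not
# distributed with this paper.
net = caffe.Net('hybridnet_deploy.prototxt', 'hybridnet.caffemodel', caffe.TEST)

# Standard Caffe preprocessing: HxWxC float image in [0,1] -> CxHxW BGR in [0,255].
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))
transformer.set_channel_swap('data', (2, 1, 0))
transformer.set_raw_scale('data', 255)

def extract_features(image_path, layer='fc7'):
    """Run one image through the network and return the activations of `layer`."""
    image = caffe.io.load_image(image_path)
    net.blobs['data'].data[...] = transformer.preprocess('data', image)
    net.forward()
    return net.blobs[layer].data[0].copy()

def cosine_similarity(a, b):
    """Similarity between two fc7/fc8 feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))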
2.2   Textual data

1) Ranking by relevance: For each query, we generate a textual term model in an unsupervised manner from all images returned for that query. We first remove stopwords, words with numeric and special characters, and words of length ≤ 4. Next, we select the most representative terms by retaining only those terms whose term frequency (TF_q) is higher than their document frequency (DF_q) for the query q. For each term in the model we store TF_q as a weight. Once this model has been established, we map the textual descriptions of the images to the model of the query. For each image, only terms that also appear in the query model are retained. For each remaining term we retrieve the TF_i for the corresponding i-th image and build a feature vector. To compute a relevance score s_i for an image, we compute the cosine similarity sim_i between the query model and the given image feature vector. Additionally, we add the inverse of the original Flickr rank r_i of the image to the score, yielding a final textual relevance score of s_i = sim_i + (1/r_i) for image i. This computation is inspired by that of [21], with the difference that we use TF instead of TF-IDF in the scoring function, which proved to be more expressive in our experiments.
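A minimal sketch of this term-model construction and textual relevance scoring is given below, assuming each image is represented by a list of raw text tokens; the stopword list and helper names are placeholders rather than part of our actual pipeline.

from collections import Counter

STOPWORDS = {'the', 'and', 'with', 'this', 'that'}  # placeholder stopword list

def build_query_model(image_token_lists):
    """Query term model from the tokens of all images returned for a query.

    Keeps terms whose term frequency (total occurrences, TF_q) exceeds their
    document frequency (number of images containing them, DF_q).
    """
    tf, df = Counter(), Counter()
    for tokens in image_token_lists:
        tokens = [t.lower() for t in tokens
                  if t.isalpha() and len(t) > 4 and t.lower() not in STOPWORDS]
        tf.update(tokens)
        df.update(set(tokens))
    return {term: tf[term] for term in tf if tf[term] > df[term]}

def textual_relevance(model, tokens, flickr_rank):
    """Relevance score s_i = cosine(query model, image TF vector) + 1/r_i."""
    image_tf = Counter(t.lower() for t in tokens if t.lower() in model)
    dot = sum(model[t] * image_tf[t] for t in image_tf)
    norm_q = sum(w * w for w in model.values()) ** 0.5
    norm_i = sum(w * w for w in image_tf.values()) ** 0.5
    sim = dot / (norm_q * norm_i) if norm_q and norm_i else 0.0
    return sim + 1.0 / flickr_rank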
3) Feature and distance computation: Diversity re-ranking requires the similarity comparison of all relevant images for a query. For a given image, we first align its terms to the query model. Next, we compute their TF-IDF weights (TF_i/DF_i) [15, 23]. Terms from the query model that do not occur in the image's descriptions get a weight of zero. The resulting feature vectors are compared with the cosine metric in diversity re-ranking.
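The sketch below illustrates how such TF-IDF vectors over the query-model vocabulary and the resulting pairwise cosine similarity matrix could be computed; variable names and data structures are illustrative assumptions.

import numpy as np

def tfidf_vector(model_terms, image_tokens, document_frequency):
    """TF-IDF-like feature vector over the query-model vocabulary.

    model_terms:         ordered list of query-model terms (one dimension each).
    document_frequency:  dict term -> DF of the term within the query's images.
    Terms absent from the image description get weight zero, present terms TF_i/DF_i.
    """
    counts = {}
    for token in image_tokens:
        token = token.lower()
        if token in document_frequency and token in set(model_terms):
            counts[token] = counts.get(token, 0) + 1
    return np.array([counts.get(t, 0) / float(document_frequency.get(t, 1))
                     for t in model_terms])

def pairwise_cosine(vectors):
    """Similarity matrix used as input to the diversity re-ranking step."""
    X = np.vstack(vectors).astype(float)
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    return Xn.dot(Xn.T)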
3.   EXPERIMENTAL SETUP

The experimental setup is mostly defined by the 2015 MediaEval Retrieving Diverse Images Task, which provides a dataset partitioned into development (devset) and test (testset), two types of queries (single- and multi-topic), and standardized and complementary evaluation metrics: Precision at 20 (P@20), Cluster Recall at 20 (CR@20) and F1-score at 20 (F1@20). The reader is referred to the task overview paper [8] to learn the details of the problem.
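For reference, the sketch below illustrates the standard definitions of these metrics for a single query, given per-image relevance labels and ground-truth cluster assignments; the official task evaluation tool remains the authoritative implementation, and the handling of edge cases here is an assumption.

def precision_at(ranked_relevant, n=20):
    """Fraction of the top-n results that are relevant (0/1 labels, ranked order)."""
    top = ranked_relevant[:n]
    return sum(top) / float(len(top))

def cluster_recall_at(ranked_clusters, ranked_relevant, total_clusters, n=20):
    """Fraction of the query's ground-truth clusters covered by the relevant
    images among the top-n results."""
    covered = {c for c, rel in zip(ranked_clusters[:n], ranked_relevant[:n]) if rel}
    return len(covered) / float(total_clusters)

def f1_at(ranked_clusters, ranked_relevant, total_clusters, n=20):
    """Harmonic mean of P@n and CR@n."""
    p = precision_at(ranked_relevant, n)
    cr = cluster_recall_at(ranked_clusters, ranked_relevant, total_clusters, n)
    return 2 * p * cr / (p + cr) if (p + cr) else 0.0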
The Relevance CNN described in Section 2.1 was trained with a 2-fold cross-validation, each split containing one half of the devset queries. For both splits we stopped after 2,000 iterations, when the validation accuracy was highest (76% and 75%, respectively). When applying the best methods' parameters to the testset, we used all the dev data and fine-tuned the network for 4,500 iterations, stopping when the training loss was at its minimum.

The proportion of images to be filtered in Step 2 was learned by measuring the evolution of the final F1-score for different percentages. For Runs 1 to 3 the best results were obtained by keeping the top 20% of images, while for Run 5 the best value was 15%.

4.   RESULTS

Table 1 presents the results obtained in four different configurations: using visual information only (Run 1), using textual data only (Run 2), and using the best combination of textual and visual data (Run 3). An additional Run 5 considers multimodal information only for relevance filtering (Step 2) and purely visual information for diversity reranking (Step 4). The first block of the table presents results on the devset, the next two blocks report results on the testset for single-topic and multi-topic queries, and the last block reports the overall testset results.

  Modality              Visual     Text       Multi      Multi
                        Run 1      Run 2      Run 3      Run 5
  devset
    P@20                0.756      0.802      0.836      0.847
    CR@20               0.416      0.419      0.452      0.447
    F1@20               0.530      0.543      0.578      0.577
  testset (single)
    P@20                0.705      0.6819     0.749      0.733
    CR@20               0.423      0.383      0.431      0.412
    F1@20               0.519      0.478      0.533      0.513
  testset (multi)
    P@20                0.593      0.724      0.627      0.621
    CR@20               0.403      0.372      0.414      0.397
    F1@20               0.463      0.47       0.482      0.464
  testset (overall)
    P@20                0.649      0.703      0.688      0.677
    CR@20               0.413      0.378      0.422      0.405
    F1@20               0.491      0.474      0.508      0.489

Table 1: Precision, Cluster Recall and F1-score obtained on each run with N = 20 on the devset and on the testset (single-topic, multi-topic and overall).

Figure 1 plots the Precision, Cluster Recall and F1-score curves depending on the number N of top ranked images considered in the evaluation, averaged over all testset queries for our best run (Run 3).

Figure 1: Overall Precision, Recall and F1-score curves for different cutoffs N of top ranked images on all testset queries.

5.   CONCLUSIONS

The trade-off between relevance and diversity has been targeted in this work with relevance-based filtering and a posterior iterative process to introduce diversity. The final results, presented in Table 1, are comparable to the state of the art on the devset [9], and achieve an F1@20 of up to 0.508 on the testset.

Multi-topic queries seem to be more difficult to diversify than single-topic queries. A reason may be that multi-topic queries are more general and contain more heterogeneous content. Considering that our method was trained on single-topic queries only, the results for the multi-topic queries are nevertheless still promising.

It is remarkable that increasing the number N of retrieved images increases both recall and precision (and not only recall, as one would expect in a typical retrieval scenario), as shown in Figure 1. This indicates that the relevance ranking obtained by our method is accurate (at least for N ≤ 50).

There is no clear winner between textual and visual information (Runs 1 and 2). The multimodal combination, however, clearly improves performance (Runs 3 and 5). Additionally, the results indicate that using multimodal processing at all stages (Run 3) is better than using multimodal processing only during the relevance ranking (Run 5).
6.   REFERENCES

[1] J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 335-336. ACM, 1998.

[2] D.-T. Dang-Nguyen, L. Piras, G. Giacinto, G. Boato, and F. G. De Natale. A hybrid approach for retrieving diverse social images of landmarks. In Multimedia and Expo (ICME), 2015 IEEE International Conference on, pages 1-6. IEEE, 2015.

[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), 2009 IEEE Conference on, pages 248-255. IEEE, 2009.

[4] T. Deselaers, T. Gass, P. Dreuw, and H. Ney. Jointly optimising relevance and diversity in image retrieval. In Proceedings of the ACM International Conference on Image and Video Retrieval, page 39. ACM, 2009.

[5] Y. Gao, M. Wang, Z.-J. Zha, J. Shen, X. Li, and X. Wu. Visual-textual joint relevance learning for tag-based social image search. Image Processing, IEEE Transactions on, 22(1):363-376, 2013.

[6] X. Giró-i-Nieto, M. Alfaro, and F. Marqués. Diversity ranking for video retrieval from a broadcaster archive. In Proceedings of the 1st ACM International Conference on Multimedia Retrieval, page 56. ACM, 2011.

[7] M. Halvey, P. Punitha, D. Hannah, R. Villa, F. Hopfgartner, A. Goyal, and J. M. Jose. Diversity, assortment, dissimilarity, variety: A study of diversity measures using low level features for video retrieval. In Advances in Information Retrieval, pages 126-137. Springer, 2009.

[8] B. Ionescu, A. L. Gînscă, B. Boteanu, A. Popescu, M. Lupu, and H. Müller. Retrieving diverse social images at MediaEval 2015: Challenge, dataset and evaluation. In MediaEval 2015 Workshop, Wurzen, Germany, 2015.

[9] B. Ionescu, A. Popescu, M. Lupu, A. L. Gînscă, B. Boteanu, and H. Müller. Div150Cred: A social image retrieval result diversification with user tagging credibility dataset. In ACM Multimedia Systems (MMSys), Portland, Oregon, USA, 2015.

[10] P. Isola, J. Xiao, D. Parikh, A. Torralba, and A. Oliva. What makes a photograph memorable? Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(7):1469-1482, 2014.

[11] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675-678. ACM, 2014.

[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.

[13] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

[14] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on, pages 512-519. IEEE, 2014.

[15] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513-523, 1988.

[16] M. Soleymani. The quest for visual interest. In Proceedings of the ACM International Conference on Multimedia. ACM, 2015.

[17] K. Song, Y. Tian, W. Gao, and T. Huang. Diversifying the image retrieval results. In Proceedings of the 14th Annual ACM International Conference on Multimedia, pages 707-710. ACM, 2006.

[18] E. Spyromitros-Xioufis, S. Papadopoulos, A. L. Ginsca, A. Popescu, Y. Kompatsiaris, and I. Vlahavas. Improving diversity in image search via supervised relevance scoring. In Proceedings of the 5th ACM International Conference on Multimedia Retrieval, pages 323-330. ACM, 2015.

[19] R. H. van Leuken, L. Garcia, X. Olivares, and R. van Zwol. Visual diversification of image search results. In Proceedings of the 18th International Conference on World Wide Web, pages 341-350. ACM, 2009.

[20] R. van Zwol, V. Murdock, L. Garcia Pueyo, and G. Ramirez. Diversifying image search with user generated content. In Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, pages 67-74. ACM, 2008.

[21] B. Vandersmissen, A. Tomar, F. Godin, W. De Neve, and R. Van de Walle. Ghent University-iMinds at MediaEval 2014 Diverse Images: Adaptive clustering with deep features. In MediaEval 2014 Workshop, 2014.

[22] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using Places database. In Advances in Neural Information Processing Systems, pages 487-495, 2014.

[23] J. Zobel and A. Moffat. Exploring the similarity space. ACM SIGIR Forum, 32(1):18-34, 1998.