DCLab at MediaEval 2015 Retrieving Diverse Social Images Task

Zsombor Paróczi, Máté Kis-Király, Bálint Fodor
Budapest University of Technology and Economics
paroczi@tmit.bme.hu, kis.kiraly.mate@gmail.com, balint.fodor@gmail.com

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

ABSTRACT
In this paper we present our contribution to the MediaEval 2015 Retrieving Diverse Social Images Task, which asked participants to provide methods for refining Flickr image retrieval results, thus increasing both their relevance and their diversity. Our approach is based on re-ranking the original result list using a precomputed distance matrix and a spectral clustering scheme. We use color-related visual features, text descriptors and credibility descriptors to define the similarity between images.

1. INTRODUCTION
When a potential tourist makes an image search for a place, she expects to get a diverse and relevant visual result as a summary of the different views of the location.

In the official challenge (Retrieving Diverse Social Images at MediaEval 2015) [2] a ranked list of location photos retrieved from Flickr is given, and the task is to refine the result by providing a set of images that are both relevant and form a diversified summary. An extended explanation of the task objectives, the provided dataset and the evaluation descriptors can be found in the task description paper [2]. Diversity here means that the images can illustrate different views of the location at different times of the day or year, under different weather conditions, creative views, etc. The utility of the refinement process can be measured using the precision and diversity metrics [8].

Our team participated in the previous challenges [7, 6], and each year we experimented with a different approach. In 2013 we diversified the initial results using clustering, but our solution focused on diversification only [7]. In 2014, as a new idea, we tried to treat relevance and diversity with equal importance [6].

In our previous approaches to the task we treated the feature vectors (values calculated from the metrics) as an N-dimensional continuous space with Euclidean coordinates. In this year's approach we define a set of hand-crafted distance matrices with non-Euclidean coordinates, which can be used during the clustering.

2. RUNS
In this section we introduce the approaches used to generate the runs for each task.

2.1 Run1: Visual based re-ranking
In the first run participants could use only the provided visual descriptors, or descriptors of their own computed from the images alone.

For the first run we use the following approach: step 1 - calculating the FACE descriptor for each image, step 2 - filtering the images using the FACE and CN[0] descriptors, step 3 - creating a distance matrix from color similarity, step 4 - performing spectral clustering using the distance matrix, step 5 - creating the new result list using the cluster information.

Our main approach was using color-based distances [1, 5] and filtering out photos with faces [7, 6]. We used two of the descriptors provided by the organizers [2]: Global Color Moments on HSV Color Space (CM), which represents the first three central moments of the image color distribution (mean, standard deviation and skewness), and Global Color Naming Histogram (CN), which maps colors to 11 universal color names: "black", "blue", "brown", "grey", "green", "orange", "pink", "purple", "red", "white", and "yellow".

First we calculated a new descriptor for each image: the FACE descriptor is the ratio between the area occupied by possible face regions in the image and the whole image area [7]. Then we used the CN descriptor to filter out predominantly black images, since mostly dark images tend to have fewer colors, and their colors are mainly shifted into the gray range rather than being bright.

In the reordering step we started from the original result list. We did our initial filtering by moving images with FACE > 0 or CN[0] > 0.8 to the end of the result list (the first value of CN corresponds to the color black).

After the preprocessing step we built the distance matrix F, where the distance between images A and B was calculated using the following equation:

    F_{A,B} = Σ_{i=0}^{10} |CN_A[i] − CN_B[i]| + Σ_{i=0}^{8} |s_i · (CM_A[i] − CM_B[i])|

    s_i = 5,   where 0 ≤ i < 3
    s_i = 1.5, where 3 ≤ i < 5
    s_i = 0.5, where 5 ≤ i < 9

After the distance matrix was created, we used unsupervised spectral clustering [3, 4] to create clusters from the first 150 images; the target cluster count was 10. The final result was generated by picking the lowest-ranked item from each cluster and appending these to the result list, then repeating this until all items are used. The same clustering and sorting method was used for run2 and run3.

2.2 Run2: Text based re-ranking
The second run was the text based re-ranking, which is accomplished using the title, tags and description fields of each image.

For the second run we use the following approach: step 1 - filtering stop words and special characters, step 2 - creating a distance matrix from text similarity, step 3 - performing spectral clustering using the distance matrix, step 4 - creating the new result list using the cluster information.

As a preprocessing step we executed stop word filtering. We also removed some special characters (namely: .,-:;0123456789()@) and HTML-specific character sequences (&amp;, &quot; and everything between < and >), then we used the remaining text as the input of a simple TF-IDF calculation [9].

We calculated the distance between images (e.g. between their description fields) A and B in the following manner. We initialize the distance G_{A,B} to zero and compare A and B at the term level: every term t occurring in document A is compared with the terms of document B, and vice versa. If a term t is contained in both documents, G_{A,B} is not increased. If t is contained in only one of the documents, we take its document frequency (DF_t) into consideration: if DF_t < 5, it is a rare term and G_{A,B} is increased by 2; if DF_t > DN/4, it is a common term and G_{A,B} is increased by 0.1 (where DN is the total number of documents). If the term is neither common nor rare, we add DF_t/DN to the distance.

Using the three text descriptors we created a weighted sum of the per-field distances, where the empirically determined weights are as follows: title = 1, tags = 2, description = 0.5. From these G_{A,B} values we created the G distance matrix.

2.3 Run3: Multimodal re-ranking
In the third run both visual and textual descriptors could be used to create the results.

For the third run we use the following approach: step 1 - creating the distance matrix F (see Section 2.1), step 2 - creating the distance matrix G (see Section 2.2), step 3 - creating a new distance matrix by combining F and G, step 4 - performing spectral clustering using the distance matrix, step 5 - creating the new result list using the cluster information.

We used our visual distance matrix F and text distance matrix G to create a new aggregate matrix H, which is simply the element-wise sum of F and G. We tried different weighting methods, but the plain sum of the two matrices gave the best results on the development set.

2.4 Run4: Credibility based re-ranking
In the fourth run participants were provided with credibility descriptors [2]. Starting from the original result list, we filtered the images of users whose faceProportion descriptor was greater than 1.3, to create the same effect as we did with the FACE descriptor.

3. RESULTS
The 2015 dataset contained 153 location queries (45,375 Flickr photos) as the development set; we used this set to develop our approach, and all methods and thresholds were tuned on the whole development set. The test set contained 139 queries: 69 one-concept location queries (20,700 Flickr photos) and 70 multi-concept queries (20,694 Flickr photos). Single-topic queries are basic formulations such as the name of a location; multi-concept queries are more complex, relating to events and states associated with locations (like "sunset in the city").

Our results can be seen in Table 1, and the F1 metrics can be seen in Figure 1; we list the single- and multi-concept based results separately.

    run name     P@20   CR@20  F1@20
    Run1 (all)   .7094  .3780  .4782
    Run2 (all)   .6730  .3655  .4565
    Run3 (all)   .6863  .3624  .4603
    Run4 (all)   .7083  .3543  .4564
    Run1 single  .7022  .3702  .4751
    Run2 single  .6435  .3494  .4379
    Run3 single  .6732  .3563  .4554
    Run4 single  .7014  .3589  .4651
    Run1 multi   .7164  .3857  .4813
    Run2 multi   .7021  .3813  .4748
    Run3 multi   .6993  .3683  .4651
    Run4 multi   .7150  .3498  .4479

Table 1: Official results on the test data.

[Figure 1: line chart omitted; it plots F1-scores (roughly between 0.2 and 0.7) at the cutoffs F1@5, F1@10, F1@20, F1@30, F1@40 and F1@50 for the visual, text, vistext and credibility runs, with single- and multi-concept queries shown separately.]
Figure 1: Official runs F1-score metric for various cutoff points (results on the test data).

4. CONCLUSION AND FUTURE WORK
As one can see, the visual information based results are the best among all the runs. On the development set we found that the textual information for many images is missing or does not describe the content very well. It is not uncommon that an author gives the same textual information to all of their images on a topic.
With the purpose of increasing diversity we also used the locationSimilarity credibility descriptor: if its value exceeded the threshold of 3.0, we excluded the image. Despite this simple approach we achieved great results on the development set. Overall, the credibility based descriptors proved to be much more useful than we initially thought; in the future we should focus on them to improve the textual and visual descriptor based results.

5. REFERENCES
[1] R. Datta, D. Joshi, J. Li, and J. Z. Wang. Image retrieval: Ideas, influences, and trends of the new age. ACM Comput. Surv., 40(2):5:1–5:60, May 2008.
[2] B. Ionescu, A. L. Ginsca, B. Boteanu, A. Popescu, M. Lupu, and H. Müller. Retrieving diverse social images at MediaEval 2015: Challenge, dataset and evaluation. In Working Notes Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, September 14-15, CEUR-WS.org, 2015.
[3] X. Ma, W. Wan, and L. Jiao. Spectral clustering ensemble for image segmentation. In Proceedings of the First ACM/SIGEVO Summit on Genetic and Evolutionary Computation, GEC '09, pages 415–420, New York, NY, USA, 2009. ACM.
[4] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849–856. MIT Press, 2001.
[5] M. L. Paramita, M. Sanderson, and P. Clough. Diversity in photo retrieval: Overview of the ImageCLEFPhoto task 2009. In Proceedings of the 10th International Conference on Cross-language Evaluation Forum: Multimedia Experiments, CLEF '09, pages 45–59, Berlin, Heidelberg, 2010. Springer-Verlag.
[6] Z. Paróczi, B. Fodor, and G. Szűcs. DCLab at MediaEval2014 Search and Hyperlinking Task. In Working Notes Proceedings of the MediaEval 2014 Workshop, Barcelona, Spain, October 16-17, CEUR-WS.org, ISSN 1613-0073, 2014.
[7] G. Szűcs, Z. Paróczi, and D. Vincz. BMEMTM at MediaEval 2013 Retrieving Diverse Social Images Task: Analysis of text and visual information. In Working Notes Proceedings of the MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, CEUR-WS.org, ISSN 1613-0073, 2013.
[8] B. Taneva, M. Kacimi, and G. Weikum. Gathering and ranking photos of named entities with high precision, high recall, and diversity. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM '10, pages 431–440, New York, NY, USA, 2010. ACM.
[9] J.-B. Yeh and C.-H. Wu. Video news retrieval incorporating relevant terms based on distribution of document frequency. In Proceedings of the 9th Pacific Rim Conference on Multimedia: Advances in Multimedia Information Processing, PCM '08, pages 583–592, Berlin, Heidelberg, 2008. Springer-Verlag.
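As an illustration, the run1 pipeline of Section 2.1 (distance matrix F over the top 150 images, spectral clustering into 10 clusters, round-robin selection from the clusters) can be sketched in Python. This is a minimal reconstruction under our own assumptions: the paper publishes no code, the function names are ours, and the Gaussian conversion from distances to the affinity matrix expected by scikit-learn's SpectralClustering is our choice, not the authors' documented implementation.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Weights s_i for the 9 HSV color-moment (CM) differences, from Section 2.1.
S = np.array([5.0] * 3 + [1.5] * 2 + [0.5] * 4)

def visual_distance(cn_a, cn_b, cm_a, cm_b):
    """F(A,B): L1 distance of the 11-bin color-name histograms (CN)
    plus the s_i-weighted L1 distance of the color moments (CM)."""
    return np.abs(cn_a - cn_b).sum() + np.abs(S * (cm_a - cm_b)).sum()

def rerank(ranked_ids, cn, cm, top=150, k=10):
    """Cluster the top images and pick items round-robin from the
    clusters, always taking the best-ranked remaining item."""
    ids = list(ranked_ids[:top])
    n = len(ids)
    F = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            F[i, j] = F[j, i] = visual_distance(
                cn[ids[i]], cn[ids[j]], cm[ids[i]], cm[ids[j]])
    # SpectralClustering expects similarities, so we turn distances into
    # affinities with a Gaussian kernel (our assumption, not the paper's).
    sigma = F.std() or 1.0
    affinity = np.exp(-F ** 2 / (2 * sigma ** 2))
    labels = SpectralClustering(n_clusters=k, affinity='precomputed',
                                random_state=0).fit_predict(affinity)
    clusters = [[ids[i] for i in range(n) if labels[i] == c]
                for c in range(k)]
    result = []
    while any(clusters):                  # repeat until all items are used
        for c in clusters:
            if c:
                result.append(c.pop(0))   # best-ranked remaining item
    return result + list(ranked_ids[top:])
```

The same clustering and round-robin step would be reused for run2 and run3 with the G and H matrices in place of F.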
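The term-level text distance of Section 2.2 can likewise be sketched. Again a minimal sketch under stated assumptions: the helper names and the pre-tokenized field representation are ours, the document-frequency table `df` is assumed to come from the corpus, and the stop-word filtering and TF-IDF preprocessing steps are not shown.

```python
# Field weights from Section 2.2 (empirically determined by the authors).
FIELD_WEIGHTS = {'title': 1.0, 'tags': 2.0, 'description': 0.5}

def field_distance(terms_a, terms_b, df, dn):
    """Distance contribution of one text field. Terms present in both
    documents add nothing; a term present in only one of them adds
    2 if rare (DF_t < 5), 0.1 if common (DF_t > DN/4), DF_t/DN otherwise."""
    g = 0.0
    for t in set(terms_a) ^ set(terms_b):   # terms in exactly one document
        df_t = df.get(t, 0)
        if df_t < 5:
            g += 2.0
        elif df_t > dn / 4:
            g += 0.1
        else:
            g += df_t / dn
    return g

def text_distance(doc_a, doc_b, df, dn):
    """G(A,B): weighted sum of the title, tags and description distances."""
    return sum(w * field_distance(doc_a.get(f, []), doc_b.get(f, []), df, dn)
               for f, w in FIELD_WEIGHTS.items())
```

Computing this for every image pair yields the G matrix that is fed to the same spectral clustering step as in run1.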