Multimodal-based Diversified Summarization in Social Image Retrieval

Duc-Tien Dang-Nguyen1, Giulia Boato1, Francesco G. B. De Natale1, Luca Piras2, Giorgio Giacinto2, Franco Tuveri3, Manuela Angioni3
1 DISI - University of Trento, Italy
2 DIEE - University of Cagliari, Italy
3 CRS4 - Center for Advanced Studies, Research and Development in Sardinia, Italy
dangnguyen@disi.unitn.it, boato@disi.unitn.it, denatale@ing.unitn.it, luca.piras@diee.unica.it, giacinto@diee.unica.it, tuveri@crs4.it, angioni@crs4.it

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.

ABSTRACT
In this paper, we describe our approach and its results for the MediaEval 2015 Retrieving Diverse Social Images task. The main strength of the proposed approach is its flexibility, which permits filtering out irrelevant images and obtaining a reliable set of diverse and relevant images. This is done by first clustering similar images according to their textual descriptions and their visual content, and then extracting images from different clusters according to a measure of user credibility. Experimental results show that the approach is stable and exhibits little fluctuation on both single-concept and multi-concept queries.

1. INTRODUCTION
In the MediaEval 2015 Retrieving Diverse Social Images task [4], participants are provided with sets of images retrieved from Flickr, where each set is related to a location. These sets are normally noisy and redundant; the goal of the task is therefore to refine the initial results by choosing a subset of images that are relevant to the queried location under different views, times, and other conditions.

We propose here an improved method based on our previous approaches [1, 2]. The basic idea is to filter out non-relevant images at the beginning of the process, according to the rules of the task, and then to exploit textual and visual features, as well as user credibility information, within a multi-modal retrieval framework, in order to obtain a diversified summary of the queried images.

2. METHODOLOGY
The proposed method comprises three steps (see Fig. 1).

[Figure 1: Schema of the proposed method]

Filtering: The goal of this step is to remove outliers, i.e., images considered non-relevant. We consider an image non-relevant according to the following rules: (i) it contains people as the main subject; (ii) it was shot far away from the queried location; (iii) it received very few views on Flickr; or (iv) it is out-of-focus or blurred. Rule (i) is checked through the proportion of the detected human face size with respect to the size of the image; in our method, the Luxand FaceSDK (luxand.com) is used as the face detector. Rules (ii) and (iii) are evaluated by exploiting the provided user credibility information. To detect blurred images (rule iv), we estimate the focus by computing the sum of wavelet coefficients and decide whether the image is out-of-focus following the method in [3]. After this step, all remaining images are considered relevant and are passed to the next step.

Clustering: We cluster similar images by constructing a clustering feature tree (CF tree) built on a combination of textual and visual information. To this end, we exploit the fact that the BIRCH algorithm [6] performs clustering in two main phases, namely the Global Clustering phase and the Refining phase. While these two phases are normally intended to produce a high-quality clustering using the same set of features, we use textual features in the first phase and refine the clusters using visual features instead (a sketch of this two-phase scheme is given at the end of this section). We compute a refined set of textual features by analyzing the provided textual data, so as to reduce the noise introduced by irrelevant words. After this step, all images that are visually similar and share the same context (i.e., the same textual information) are grouped into the same branch of the tree.

Summarization: Starting from the CF tree, the clusters are obtained by applying agglomerative hierarchical clustering to the CF leaves. To choose the best images for summarizing the landmark, the clusters are first sorted by the number of images they contain, i.e., larger clusters are ranked higher. Then, images are extracted from each cluster until the maximum number of required images is reached (e.g., 20 images). In each cluster, the image uploaded by the user with the highest visual score is selected as the first image; if there is more than one image from that user, the image closest to the centroid is selected. If more than one image has to be extracted from a cluster to reach the exact number of images required for the visual summary, the second image is selected as the one with the largest distance from the first image, the third as the one with the largest distance to both of the first two, and so on (see the selection sketch at the end of this section).
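For illustration, the two-phase clustering can be sketched as follows. This is a minimal Python sketch that uses scikit-learn's Birch and AgglomerativeClustering as stand-ins: the feature matrices, distance thresholds, and the per-coarse-cluster refinement strategy are illustrative assumptions, not the exact implementation used in our experiments.

    # Minimal sketch (not the authors' code): global clustering on textual
    # features, followed by a per-cluster refinement on visual features.
    import numpy as np
    from sklearn.cluster import Birch, AgglomerativeClustering

    def two_phase_clustering(text_feats, visual_feats,
                             birch_threshold=0.5, refine_threshold=1.0):
        # Global phase: build the CF tree on textual features only;
        # n_clusters=None keeps the raw CF-tree subclusters.
        coarse = Birch(threshold=birch_threshold,
                       n_clusters=None).fit_predict(text_feats)

        labels = np.full(len(text_feats), -1)
        next_label = 0
        for c in np.unique(coarse):
            idx = np.flatnonzero(coarse == c)
            if len(idx) < 2:                 # singleton coarse cluster
                labels[idx] = next_label
                next_label += 1
                continue
            # Refining phase: regroup the images of each coarse (textual)
            # cluster according to their visual features.
            sub = AgglomerativeClustering(
                n_clusters=None, distance_threshold=refine_threshold
            ).fit_predict(visual_feats[idx])
            labels[idx] = next_label + sub
            next_label += sub.max() + 1
        return labels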
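The diversification rule of the Summarization step (a seed image per cluster, then repeated farthest-point picks) can likewise be sketched as below. The seeding by user credibility is abstracted into the first element of each cluster list, and the round-robin visit order (one image per cluster per pass, in rank order) is one plausible reading of the extraction order described above; both are assumptions made for illustration.

    import numpy as np

    def farthest_first(candidates, chosen, feats):
        # Max-min criterion: pick the candidate whose smallest distance
        # to the images already chosen from this cluster is largest.
        d = np.linalg.norm(feats[candidates][:, None] - feats[chosen][None],
                           axis=-1)
        return candidates[int(np.argmax(d.min(axis=1)))]

    def build_summary(clusters, feats, n_images=20):
        # clusters: lists of image indices; each list is assumed to start
        # with the credibility-based seed (or the centroid image, cf. Run 1).
        clusters = sorted(clusters, key=len, reverse=True)  # larger first
        picked = [[] for _ in clusters]
        summary = []
        while len(summary) < n_images:
            added = False
            for ci, cluster in enumerate(clusters):
                if len(summary) >= n_images:
                    break
                remaining = [i for i in cluster if i not in picked[ci]]
                if not remaining:
                    continue
                nxt = cluster[0] if not picked[ci] else \
                      farthest_first(remaining, picked[ci], feats)
                picked[ci].append(nxt)
                summary.append(nxt)
                added = True
            if not added:          # all clusters exhausted
                break
        return summary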
3. RUN DESCRIPTION
We ran our model on the development set (devset, containing 153 location queries with 45,375 Flickr photos). According to these results, we chose the best features and tuned parameters for each run, and applied them to the test set (containing 69 single-concept queries and 70 multi-concept queries with 41,394 Flickr images) as follows:

Run 1: Color naming (CNM), color descriptor (GCD), histogram of oriented gradients (HOG), and local binary pattern (GLBP) features are used. In the Summarization step, since user credibility information is not available in this run, the centroid of each cluster is selected as the first image.

Run 2: In this run, we refined the text features by normalizing the text terms and removing stop-words, HTML tags, and special characters from the given TF-IDF representation. Cosine similarity was used as the distance metric. The parameters are chosen as in Run 1.

Run 3: The proposed method is applied to the combined features from Run 1 and Run 2, where TF-IDF is used first and the visual features with Euclidean distance are applied afterwards.

Run 4: In this run, we clustered the images by user. The clusters are ranked first by visual score (i.e., the cluster belonging to the user with the highest visual score is selected first), then by face proportion, and so on through all the user credibility descriptors. Within each cluster, images are selected based on the number of views, i.e., the image with the highest number of views is selected as the first image.

Run 5: In the first four runs, we applied the same method to both single-concept and multi-concept queries; in this run, we used two different methods for the two cases. In the Filtering step for single-concept queries, outliers are detected as follows: rule (i), the face size is larger than 10% of the image size; rule (ii), the image was shot farther than 15 km from the queried location; rule (iii), the image has fewer than 25 views; and rule (iv), the image has a focus measure (f-score) smaller than 20 (a sketch of these rules is given below). For multi-concept queries, only rules (iii) and (iv) were applied, since many queries require images belonging to multiple locations. We also removed images whose title and description do not contain any word from the query. In the Clustering step, the same clustering as in Run 3 is applied to both types of query, with two extra visual features, Dense SIFT and HOG2x2, extracted as in [5].
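For concreteness, the Run 5 outlier rules for single-concept queries can be sketched as follows, using the thresholds reported above (10% face area, 15 km, 25 views, focus score 20). The Image record and its scores are placeholders: the paper obtains faces from the Luxand FaceSDK and the focus measure from the wavelet-based method of [3], both replaced here by opaque inputs, and the haversine distance is our own assumption for the geo check.

    from dataclasses import dataclass
    from math import radians, sin, cos, asin, sqrt

    @dataclass
    class Image:
        face_ratio: float   # largest face area / image area (face detector)
        lat: float
        lon: float
        views: int
        focus: float        # wavelet-domain focus measure, as in [3]

    def haversine_km(lat1, lon1, lat2, lon2):
        dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
        a = (sin(dlat / 2) ** 2
             + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
        return 2 * 6371.0 * asin(sqrt(a))

    def is_outlier(img, query_lat, query_lon, multi_concept=False):
        if img.views < 25:           # rule (iii): too few Flickr views
            return True
        if img.focus < 20:           # rule (iv): out-of-focus / blurred
            return True
        if multi_concept:            # rules (i)-(ii) skipped for multi-concept
            return False
        if img.face_ratio > 0.10:    # rule (i): people as the main subject
            return True
        if haversine_km(img.lat, img.lon, query_lat, query_lon) > 15:
            return True              # rule (ii): shot too far from the location
        return False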
Text features were refined as described in Run 2. Finally, in the Summarization step, the same method described in Section 2 was applied.

Table 1: Run performances on the MediaEval 2015 Retrieving Diverse Social Images task (test set).

                 Single Concept            Multi Concept             Overall
    Run      P@20    CR@20   F1@20     P@20    CR@20   F1@20     P@20    CR@20   F1@20
    Run 1    0.6601  0.4060  0.4902    0.6643  0.4226  0.5017    0.6622  0.4143  0.4959
    Run 2    0.5993  0.3700  0.4461    0.6636  0.4275  0.5028    0.6317  0.3990  0.4747
    Run 3    0.6181  0.3725  0.4538    0.6700  0.4315  0.5088    0.6442  0.4022  0.4815
    Run 4    0.6768  0.4131  0.5009    0.6921  0.4198  0.5052    0.6845  0.4165  0.5031
    Run 5    0.7362  0.4288  0.5290    0.7607  0.4753  0.5670    0.7486  0.4522  0.5481

Table 2: Run performances on the development set.

    Run      P@20    CR@20   F1@20
    Run 1    0.7268  0.4125  0.5188
    Run 2    0.7229  0.4245  0.5127
    Run 3    0.8000  0.4013  0.5266
    Run 4    0.7012  0.4198  0.5015
    Run 5    0.8517  0.4829  0.6102

4. RESULTS AND CONCLUSION
With the selected features and parameters described above, we obtained the highest F1@20, the official metric of the task, with Run 5 on both the development and test sets, with values of 0.61 and 0.55, respectively. These results confirm that removing outliers and combining textual, visual, and user credibility information, as in Run 5, significantly improves performance with respect to the other runs (see Table 1 and Table 2 for the results on the test set and development set, respectively).

According to the results on the test set, we can state that the performance is stable, with little fluctuation between single-concept (F1@20 = 0.529) and multi-concept (F1@20 = 0.567) queries.

5. REFERENCES
[1] D.-T. Dang-Nguyen, L. Piras, G. Giacinto, G. Boato, and F. G. B. De Natale. Retrieval of diverse images by pre-filtering and hierarchical clustering. In MediaEval, 2014.
[2] D.-T. Dang-Nguyen, L. Piras, G. Giacinto, G. Boato, and F. G. B. De Natale. A hybrid approach for retrieving diverse social images of landmarks. In IEEE International Conference on Multimedia and Expo, 2015.
[3] J.-T. Huang, C.-H. Shen, S.-M. Phoong, and H. Chen. Robust measure of image focus in the wavelet domain. In Intelligent Signal Processing and Communication Systems, pages 157-160, Dec. 2005.
[4] B. Ionescu, A. L. Gînscă, B. Boteanu, A. Popescu, M. Lupu, and H. Müller. Retrieving Diverse Social Images at MediaEval 2015: Challenge, dataset and evaluation. In Working Notes Proceedings of the MediaEval 2015 Workshop, September 2015.
[5] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3485-3492. IEEE, 2010.
[6] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 103-114, 1996.