Multimodal-based Diversified Summarization in Social Image Retrieval

Duc-Tien Dang-Nguyen1, Giulia Boato1, Francesco G. B. De Natale1, Luca Piras2, Giorgio Giacinto2, Franco Tuveri3, Manuela Angioni3
1 DISI - University of Trento, Italy
2 DIEE - University of Cagliari, Italy
3 CRS4 - Center for Advanced Studies, Research and Development in Sardinia, Italy
dangnguyen@disi.unitn.it, boato@disi.unitn.it, denatale@ing.unitn.it, luca.piras@diee.unica.it, giacinto@diee.unica.it, tuveri@crs4.it, angioni@crs4.it

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.

ABSTRACT
In this paper, we describe our approach and its results for the MediaEval 2015 Retrieving Diverse Social Images task. The main strength of the proposed approach is its flexibility, which permits filtering out irrelevant images and obtaining a reliable set of diverse and relevant images. This is done by first clustering similar images according to their textual descriptions and their visual content, and then extracting images from different clusters according to a measure of user credibility. Experimental results show that the approach is stable and exhibits little fluctuation on both single-concept and multi-concept queries.

1. INTRODUCTION
In the MediaEval 2015 Retrieving Diverse Social Images task [4], participants are provided with sets of images retrieved from Flickr, where each set is related to a location. These sets are normally noisy and redundant; the goal of the task is therefore to refine the initial results by choosing a subset of images that are relevant to the queried location under different views, times, and other conditions.

We propose here an improved method based on our previous approaches [1, 2]. The basic idea is to filter out non-relevant images at the beginning of the process, according to the rules of the task, and then to exploit textual and visual features, as well as user credibility information, within a multi-modal retrieval framework, in order to obtain a diversified summary of the queried images.

2. METHODOLOGY
The proposed method comprises three steps (see Fig. 1).

[Figure 1: Schema of the proposed method]

Filtering: The goal of this step is to remove outliers, i.e., images considered non-relevant. We consider an image non-relevant according to the following rules: (i) it contains people as the main subject; (ii) it was shot far away from the queried location; (iii) it received very few views on Flickr; or (iv) it is out-of-focus or blurred. Rule (i) is checked through the proportion of the detected human face size with respect to the size of the image; in our method, the Luxand FaceSDK (luxand.com) is used as the face detector. Rules (ii) and (iii) are evaluated by exploiting the provided user credibility information. To detect blurred images (rule iv), we estimate the focus by computing the sum of wavelet coefficients and decide whether the image is out-of-focus following the method in [3]. After this step, all remaining images are considered relevant and are passed to the next step.

Clustering: We cluster similar images by constructing a clustering feature tree (CF tree) built on a combination of textual and visual information. To this end, we exploit the fact that the BIRCH algorithm [6] performs clustering in two main phases, namely the Global Clustering phase and the Refining phase. While these two phases are normally intended to produce a high-quality clustering using the same set of features, we use textual features in the first phase and refine the clusters using visual features instead (a sketch of this two-phase scheme is given at the end of this section). We compute a refined set of textual features by analyzing the provided textual data, so as to reduce the noise introduced by irrelevant words. After this step, all images that are visually similar and share the same context (i.e., the same textual information) are grouped into the same branch of the tree.

Summarization: Starting from the CF tree, the clusters are obtained by applying agglomerative hierarchical clustering to the CF leaves. To choose the best images for summarizing the landmark, the clusters are first sorted by the number of images they contain, i.e., larger clusters are ranked higher. Then, images are extracted from each cluster until the maximum number of required images is reached (e.g., 20 images). In each cluster, the image uploaded by the user with the highest visual score is selected as the first image; if there is more than one image from that user, the image closest to the centroid is selected. If more than one image has to be extracted from a cluster to reach the exact number of images required for the visual summary, the second image is selected as the one with the largest distance from the first image, the third as the one with the largest distance to both of the first two, and so on (see the selection sketch at the end of this section).
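For illustration, the two-phase clustering can be sketched as follows. This is a minimal Python sketch that uses scikit-learn's Birch and AgglomerativeClustering as stand-ins: the feature matrices, distance thresholds, and the per-coarse-cluster refinement strategy are illustrative assumptions, not the exact implementation used in our experiments.

    # Minimal sketch (not the authors' code): global clustering on textual
    # features, followed by a per-cluster refinement on visual features.
    import numpy as np
    from sklearn.cluster import Birch, AgglomerativeClustering

    def two_phase_clustering(text_feats, visual_feats,
                             birch_threshold=0.5, refine_threshold=1.0):
        # Global phase: build the CF tree on textual features only;
        # n_clusters=None keeps the raw CF-tree subclusters.
        coarse = Birch(threshold=birch_threshold,
                       n_clusters=None).fit_predict(text_feats)

        labels = np.full(len(text_feats), -1)
        next_label = 0
        for c in np.unique(coarse):
            idx = np.flatnonzero(coarse == c)
            if len(idx) < 2:                 # singleton coarse cluster
                labels[idx] = next_label
                next_label += 1
                continue
            # Refining phase: regroup the images of each coarse (textual)
            # cluster according to their visual features.
            sub = AgglomerativeClustering(
                n_clusters=None, distance_threshold=refine_threshold
            ).fit_predict(visual_feats[idx])
            labels[idx] = next_label + sub
            next_label += sub.max() + 1
        return labels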
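The diversification rule of the Summarization step (a seed image per cluster, then repeated farthest-point picks) can likewise be sketched as below. The seeding by user credibility is abstracted into the first element of each cluster list, and the round-robin visit order (one image per cluster per pass, in rank order) is one plausible reading of the extraction order described above; both are assumptions made for illustration.

    import numpy as np

    def farthest_first(candidates, chosen, feats):
        # Max-min criterion: pick the candidate whose smallest distance
        # to the images already chosen from this cluster is largest.
        d = np.linalg.norm(feats[candidates][:, None] - feats[chosen][None],
                           axis=-1)
        return candidates[int(np.argmax(d.min(axis=1)))]

    def build_summary(clusters, feats, n_images=20):
        # clusters: lists of image indices; each list is assumed to start
        # with the credibility-based seed (or the centroid image, cf. Run 1).
        clusters = sorted(clusters, key=len, reverse=True)  # larger first
        picked = [[] for _ in clusters]
        summary = []
        while len(summary) < n_images:
            added = False
            for ci, cluster in enumerate(clusters):
                if len(summary) >= n_images:
                    break
                remaining = [i for i in cluster if i not in picked[ci]]
                if not remaining:
                    continue
                nxt = cluster[0] if not picked[ci] else \
                      farthest_first(remaining, picked[ci], feats)
                picked[ci].append(nxt)
                summary.append(nxt)
                added = True
            if not added:          # all clusters exhausted
                break
        return summary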
3. RUN DESCRIPTION
We ran our model on the development set (devset, containing 153 location queries with 45,375 Flickr photos). According to these results, we chose the best features and tuned parameters for each run, and applied them to the test set (containing 69 single-concept queries and 70 multi-concept queries with 41,394 Flickr images) as follows:

Run 1: Color naming (CNM), color descriptor (GCD), histogram of oriented gradients (HOG), and local binary pattern (GLBP) features are used. In the Summarization step, since user credibility information is not available in this run, the centroid of each cluster is selected as the first image.

Run 2: In this run, we refined the text features by normalizing the text terms and removing stop-words, HTML tags, and special characters from the given TF-IDF representation. Cosine similarity was used as the distance metric. The parameters are chosen as in Run 1.

Run 3: The proposed method is applied to the combined features from Run 1 and Run 2, where TF-IDF is used first and the visual features with Euclidean distance are applied afterwards.

Run 4: In this run, we clustered the images by user. The clusters are ranked first by visual score (i.e., the cluster belonging to the user with the highest visual score is selected first), then by face proportion, and so on through all the user credibility descriptors. Within each cluster, images are selected based on the number of views, i.e., the image with the highest number of views is selected as the first image.

Run 5: In the first four runs, we applied the same method to both single-concept and multi-concept queries; in this run, we used two different methods for the two cases. In the Filtering step for single-concept queries, outliers are detected as follows: rule (i), the face size is larger than 10% of the image size; rule (ii), the image was shot farther than 15 km from the queried location; rule (iii), the image has fewer than 25 views; and rule (iv), the image has a focus measure (f-score) smaller than 20 (a sketch of these rules is given below). For multi-concept queries, only rules (iii) and (iv) were applied, since many queries require images belonging to multiple locations. We also removed images whose title and description do not contain any word from the query. In the Clustering step, the same clustering as in Run 3 is applied to both types of query, with two extra visual features, Dense SIFT and HOG2x2, extracted as in [5].
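For concreteness, the Run 5 outlier rules for single-concept queries can be sketched as follows, using the thresholds reported above (10% face area, 15 km, 25 views, focus score 20). The Image record and its scores are placeholders: the paper obtains faces from the Luxand FaceSDK and the focus measure from the wavelet-based method of [3], both replaced here by opaque inputs, and the haversine distance is our own assumption for the geo check.

    from dataclasses import dataclass
    from math import radians, sin, cos, asin, sqrt

    @dataclass
    class Image:
        face_ratio: float   # largest face area / image area (face detector)
        lat: float
        lon: float
        views: int
        focus: float        # wavelet-domain focus measure, as in [3]

    def haversine_km(lat1, lon1, lat2, lon2):
        dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
        a = (sin(dlat / 2) ** 2
             + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
        return 2 * 6371.0 * asin(sqrt(a))

    def is_outlier(img, query_lat, query_lon, multi_concept=False):
        if img.views < 25:           # rule (iii): too few Flickr views
            return True
        if img.focus < 20:           # rule (iv): out-of-focus / blurred
            return True
        if multi_concept:            # rules (i)-(ii) skipped for multi-concept
            return False
        if img.face_ratio > 0.10:    # rule (i): people as the main subject
            return True
        if haversine_km(img.lat, img.lon, query_lat, query_lon) > 15:
            return True              # rule (ii): shot too far from the location
        return False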
Text features were refined as described in Run 2. Finally, in the Summarization step, the same method described in Section 2 was applied.

Table 1: Run performances on the MediaEval 2015 Retrieving Diverse Social Images task (test set).

                 Single Concept            Multi Concept             Overall
    Run      P@20    CR@20   F1@20     P@20    CR@20   F1@20     P@20    CR@20   F1@20
    Run 1    0.6601  0.4060  0.4902    0.6643  0.4226  0.5017    0.6622  0.4143  0.4959
    Run 2    0.5993  0.3700  0.4461    0.6636  0.4275  0.5028    0.6317  0.3990  0.4747
    Run 3    0.6181  0.3725  0.4538    0.6700  0.4315  0.5088    0.6442  0.4022  0.4815
    Run 4    0.6768  0.4131  0.5009    0.6921  0.4198  0.5052    0.6845  0.4165  0.5031
    Run 5    0.7362  0.4288  0.5290    0.7607  0.4753  0.5670    0.7486  0.4522  0.5481

Table 2: Run performances on the development set.

    Run      P@20    CR@20   F1@20
    Run 1    0.7268  0.4125  0.5188
    Run 2    0.7229  0.4245  0.5127
    Run 3    0.8000  0.4013  0.5266
    Run 4    0.7012  0.4198  0.5015
    Run 5    0.8517  0.4829  0.6102

4. RESULTS AND CONCLUSION
With the selected features and parameters described above, we obtained the highest F1@20, the official metric of the task, with Run 5 on both the development and test sets, with values of 0.61 and 0.55, respectively. These results confirm that removing outliers and combining textual, visual, and user credibility information, as in Run 5, significantly improves performance with respect to the other runs (see Table 1 and Table 2 for the results on the test set and development set, respectively).

According to the results on the test set, we can state that the performance is stable, with little fluctuation between single-concept (F1@20 = 0.529) and multi-concept (F1@20 = 0.567) queries.

5. REFERENCES
[1] D.-T. Dang-Nguyen, L. Piras, G. Giacinto, G. Boato, and F. G. B. De Natale. Retrieval of diverse images by pre-filtering and hierarchical clustering. In MediaEval, 2014.
[2] D.-T. Dang-Nguyen, L. Piras, G. Giacinto, G. Boato, and F. G. B. De Natale. A hybrid approach for retrieving diverse social images of landmarks. In IEEE International Conference on Multimedia and Expo, 2015.
[3] J.-T. Huang, C.-H. Shen, S.-M. Phoong, and H. Chen. Robust measure of image focus in the wavelet domain. In Intelligent Signal Processing and Communication Systems, pages 157-160, Dec. 2005.
[4] B. Ionescu, A. L. Gînscă, B. Boteanu, A. Popescu, M. Lupu, and H. Müller. Retrieving Diverse Social Images at MediaEval 2015: Challenge, dataset and evaluation. In Working Notes Proceedings of the MediaEval 2015 Workshop, September 2015.
[5] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3485-3492. IEEE, 2010.
[6] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 103-114, 1996.