DCLab at MediaEval 2015 Retrieving Diverse Social Images Task

Zsombor Paróczi, Máté Kis-Király, Bálint Fodor
Budapest University of Technology and Economics
paroczi@tmit.bme.hu, kis.kiraly.mate@gmail.com, balint.fodor@gmail.com

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

ABSTRACT
In this paper we present our contribution to the MediaEval 2015 Retrieving Diverse Social Images Task, which asked participants to provide methods for refining Flickr image retrieval results, thus increasing both their relevance and their diversity. Our approach is based on re-ranking the original result list using a precomputed distance matrix and a spectral clustering scheme. We use color-related visual features, text descriptors and credibility descriptors to define the similarity between images.

1. INTRODUCTION
When a potential tourist makes an image search for a place, she expects to get a diverse and relevant visual result as a summary of the different views of the location.

In the official challenge (Retrieving Diverse Social Images at MediaEval 2015) [2] a ranked list of location photos retrieved from Flickr is given, and the task is to refine the result by providing a set of images that are both relevant and form a diversified summary. An extended explanation of the task objectives, the provided dataset and the evaluation descriptors can be found in the task description paper [2]. Diversity here means that the images can illustrate different views of the location at different times of the day or year, under different weather conditions, creative views, etc. The utility of the refinement process can be measured using the precision and diversity metrics [8].

Our team participated in the previous challenges [7, 6], and each year we experimented with a different approach. In 2013 we diversified the initial results using clustering, but our solution focused on diversification only [7]. In 2014, as a new idea, we tried to treat relevance and diversity with equal importance [6].

In our previous approaches to the task we treated the feature vectors (values calculated from the metrics) as an N-dimensional continuous space with Euclidean coordinates. In this year's approach we define a set of hand-crafted distance matrices with non-Euclidean coordinates, which can be used during the clustering.

2. RUNS
In this section we introduce the approaches used to generate the runs for each task.

2.1 Run1: Visual based re-ranking
In the first run participants could use only the provided visual descriptors, or descriptors of their own computed from the images alone.

For the first run we use the following approach: step 1 - calculating the FACE descriptor for each image, step 2 - filtering the images using the FACE and CN[0] descriptors, step 3 - creating a distance matrix from color similarity, step 4 - performing spectral clustering using the distance matrix, step 5 - creating the new result list using the cluster information.

Our main approach was using color-based distances [1, 5] and filtering out photos with faces [7, 6]. We used two of the descriptors provided by the organizers [2]: Global Color Moments on HSV Color Space (CM), which represents the first three central moments of the image color distribution (mean, standard deviation and skewness), and Global Color Naming Histogram (CN), which maps colors to 11 universal color names: "black", "blue", "brown", "grey", "green", "orange", "pink", "purple", "red", "white", and "yellow".

First we calculated a new descriptor for each image: the FACE descriptor is the ratio between the area occupied by possible face regions in the image and the whole image area [7]. Then we used the CN descriptor to filter out predominantly black images, since mostly dark images tend to have fewer colors, and their colors are mainly shifted into the gray range rather than being bright.

In the reordering step we started from the original result list. We did our initial filtering by moving images with FACE > 0 or CN[0] > 0.8 to the end of the result list (the first value of CN corresponds to the color black).

After the preprocessing step we built the distance matrix F, where the distance between images A and B was calculated using the following equation:

    F_{A,B} = Σ_{i=0}^{10} |CN_A[i] − CN_B[i]| + Σ_{i=0}^{8} |s_i · (CM_A[i] − CM_B[i])|

    s_i = 5,   where 0 ≤ i < 3
    s_i = 1.5, where 3 ≤ i < 5
    s_i = 0.5, where 5 ≤ i < 9

After the distance matrix was created, we used unsupervised spectral clustering [3, 4] to create clusters from the first 150 images; the target cluster count was 10. The final result was generated by picking the lowest-ranked item from each cluster and appending these to the result list, then repeating this until all items are used. The same clustering and sorting method was used for run2 and run3.

2.2 Run2: Text based re-ranking
The second run was the text based re-ranking, which is accomplished using the title, tags and description fields of each image.

For the second run we use the following approach: step 1 - filtering stop words and special characters, step 2 - creating a distance matrix from text similarity, step 3 - performing spectral clustering using the distance matrix, step 4 - creating the new result list using the cluster information.

As a preprocessing step we executed stop word filtering. We also removed some special characters (namely: .,-:;0123456789()@) and HTML-specific character sequences (&amp;, &quot; and everything between < and >), then we used the remaining text as the input of a simple TF-IDF calculation [9].

We calculated the distance between images (e.g. between their description fields) A and B in the following manner. We initialize the distance G_{A,B} to zero and compare A and B at the term level: every term t occurring in document A is compared with the terms of document B, and vice versa. If a term t is contained in both documents, G_{A,B} is not increased. If t is contained in only one of the documents, we take its document frequency (DF_t) into consideration: if DF_t < 5, it is a rare term and G_{A,B} is increased by 2; if DF_t > DN/4, it is a common term and G_{A,B} is increased by 0.1 (where DN is the total number of documents). If the term is neither common nor rare, we add DF_t/DN to the distance.

Using the three text descriptors we created a weighted sum of the per-field distances, where the empirically determined weights are as follows: title = 1, tags = 2, description = 0.5. From these G_{A,B} values we created the G distance matrix.

2.3 Run3: Multimodal re-ranking
In the third run both visual and textual descriptors could be used to create the results.

For the third run we use the following approach: step 1 - creating the distance matrix F (see Section 2.1), step 2 - creating the distance matrix G (see Section 2.2), step 3 - creating a new distance matrix by combining F and G, step 4 - performing spectral clustering using the distance matrix, step 5 - creating the new result list using the cluster information.

We used our visual distance matrix F and text distance matrix G to create a new aggregate matrix H, which is simply the element-wise sum of F and G. We tried different weighting methods, but the plain sum of the two matrices gave the best results on the development set.

2.4 Run4: Credibility based re-ranking
In the fourth run participants were provided with credibility descriptors [2]. Starting from the original result list, we filtered the images of users whose faceProportion descriptor was greater than 1.3, to create the same effect as we did with the FACE descriptor.

3. RESULTS
The 2015 dataset contained 153 location queries (45,375 Flickr photos) as the development set; we used this set to develop our approach, and all methods and thresholds were tuned on the whole development set. The test set contained 139 queries: 69 one-concept location queries (20,700 Flickr photos) and 70 multi-concept queries (20,694 Flickr photos). Single-topic queries are basic formulations such as the name of a location; multi-concept queries are more complex, relating to events and states associated with locations (like "sunset in the city").

Our results can be seen in Table 1, and the F1 metrics can be seen in Figure 1; we list the single- and multi-concept based results separately.

    run name     P@20   CR@20  F1@20
    Run1 (all)   .7094  .3780  .4782
    Run2 (all)   .6730  .3655  .4565
    Run3 (all)   .6863  .3624  .4603
    Run4 (all)   .7083  .3543  .4564
    Run1 single  .7022  .3702  .4751
    Run2 single  .6435  .3494  .4379
    Run3 single  .6732  .3563  .4554
    Run4 single  .7014  .3589  .4651
    Run1 multi   .7164  .3857  .4813
    Run2 multi   .7021  .3813  .4748
    Run3 multi   .6993  .3683  .4651
    Run4 multi   .7150  .3498  .4479

Table 1: Official results on the test data.

[Figure 1: line chart omitted; it plots F1-scores (roughly between 0.2 and 0.7) at the cutoffs F1@5, F1@10, F1@20, F1@30, F1@40 and F1@50 for the visual, text, vistext and credibility runs, with single- and multi-concept queries shown separately.]
Figure 1: Official runs F1-score metric for various cutoff points (results on the test data).

4. CONCLUSION AND FUTURE WORK
As one can see, the visual information based results are the best among all the runs. On the development set we found that the textual information for many images is missing or does not describe the content very well. It is not uncommon that an author gives the same textual information to all of their images on a topic.
With the purpose of increasing diversity we also used the locationSimilarity credibility descriptor: if its value exceeded the threshold of 3.0, we excluded the image. Despite this simple approach we achieved great results on the development set. Overall, the credibility based descriptors proved to be much more useful than we initially thought; in the future we should focus on them to improve the textual and visual descriptor based results.

5. REFERENCES
[1] R. Datta, D. Joshi, J. Li, and J. Z. Wang. Image retrieval: Ideas, influences, and trends of the new age. ACM Comput. Surv., 40(2):5:1–5:60, May 2008.
[2] B. Ionescu, A. L. Ginsca, B. Boteanu, A. Popescu, M. Lupu, and H. Müller. Retrieving diverse social images at MediaEval 2015: Challenge, dataset and evaluation. In Working Notes Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, September 14-15, CEUR-WS.org, 2015.
[3] X. Ma, W. Wan, and L. Jiao. Spectral clustering ensemble for image segmentation. In Proceedings of the First ACM/SIGEVO Summit on Genetic and Evolutionary Computation, GEC '09, pages 415–420, New York, NY, USA, 2009. ACM.
[4] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849–856. MIT Press, 2001.
[5] M. L. Paramita, M. Sanderson, and P. Clough. Diversity in photo retrieval: Overview of the ImageCLEFPhoto task 2009. In Proceedings of the 10th International Conference on Cross-language Evaluation Forum: Multimedia Experiments, CLEF '09, pages 45–59, Berlin, Heidelberg, 2010. Springer-Verlag.
[6] Z. Paróczi, B. Fodor, and G. Szűcs. DCLab at MediaEval2014 Search and Hyperlinking Task. In Working Notes Proceedings of the MediaEval 2014 Workshop, Barcelona, Spain, October 16-17, CEUR-WS.org, ISSN 1613-0073, 2014.
[7] G. Szűcs, Z. Paróczi, and D. Vincz. BMEMTM at MediaEval 2013 Retrieving Diverse Social Images Task: Analysis of text and visual information. In Working Notes Proceedings of the MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, CEUR-WS.org, ISSN 1613-0073, 2013.
[8] B. Taneva, M. Kacimi, and G. Weikum. Gathering and ranking photos of named entities with high precision, high recall, and diversity. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM '10, pages 431–440, New York, NY, USA, 2010. ACM.
[9] J.-B. Yeh and C.-H. Wu. Video news retrieval incorporating relevant terms based on distribution of document frequency. In Proceedings of the 9th Pacific Rim Conference on Multimedia: Advances in Multimedia Information Processing, PCM '08, pages 583–592, Berlin, Heidelberg, 2008. Springer-Verlag.
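As an illustration, the run1 pipeline of Section 2.1 (distance matrix F over the top 150 images, spectral clustering into 10 clusters, round-robin selection from the clusters) can be sketched in Python. This is a minimal reconstruction under our own assumptions: the paper publishes no code, the function names are ours, and the Gaussian conversion from distances to the affinity matrix expected by scikit-learn's SpectralClustering is our choice, not the authors' documented implementation.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Weights s_i for the 9 HSV color-moment (CM) differences, from Section 2.1.
S = np.array([5.0] * 3 + [1.5] * 2 + [0.5] * 4)

def visual_distance(cn_a, cn_b, cm_a, cm_b):
    """F(A,B): L1 distance of the 11-bin color-name histograms (CN)
    plus the s_i-weighted L1 distance of the color moments (CM)."""
    return np.abs(cn_a - cn_b).sum() + np.abs(S * (cm_a - cm_b)).sum()

def rerank(ranked_ids, cn, cm, top=150, k=10):
    """Cluster the top images and pick items round-robin from the
    clusters, always taking the best-ranked remaining item."""
    ids = list(ranked_ids[:top])
    n = len(ids)
    F = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            F[i, j] = F[j, i] = visual_distance(
                cn[ids[i]], cn[ids[j]], cm[ids[i]], cm[ids[j]])
    # SpectralClustering expects similarities, so we turn distances into
    # affinities with a Gaussian kernel (our assumption, not the paper's).
    sigma = F.std() or 1.0
    affinity = np.exp(-F ** 2 / (2 * sigma ** 2))
    labels = SpectralClustering(n_clusters=k, affinity='precomputed',
                                random_state=0).fit_predict(affinity)
    clusters = [[ids[i] for i in range(n) if labels[i] == c]
                for c in range(k)]
    result = []
    while any(clusters):                  # repeat until all items are used
        for c in clusters:
            if c:
                result.append(c.pop(0))   # best-ranked remaining item
    return result + list(ranked_ids[top:])
```

The same clustering and round-robin step would be reused for run2 and run3 with the G and H matrices in place of F.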
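The term-level text distance of Section 2.2 can likewise be sketched. Again a minimal sketch under stated assumptions: the helper names and the pre-tokenized field representation are ours, the document-frequency table `df` is assumed to come from the corpus, and the stop-word filtering and TF-IDF preprocessing steps are not shown.

```python
# Field weights from Section 2.2 (empirically determined by the authors).
FIELD_WEIGHTS = {'title': 1.0, 'tags': 2.0, 'description': 0.5}

def field_distance(terms_a, terms_b, df, dn):
    """Distance contribution of one text field. Terms present in both
    documents add nothing; a term present in only one of them adds
    2 if rare (DF_t < 5), 0.1 if common (DF_t > DN/4), DF_t/DN otherwise."""
    g = 0.0
    for t in set(terms_a) ^ set(terms_b):   # terms in exactly one document
        df_t = df.get(t, 0)
        if df_t < 5:
            g += 2.0
        elif df_t > dn / 4:
            g += 0.1
        else:
            g += df_t / dn
    return g

def text_distance(doc_a, doc_b, df, dn):
    """G(A,B): weighted sum of the title, tags and description distances."""
    return sum(w * field_distance(doc_a.get(f, []), doc_b.get(f, []), df, dn)
               for f, w in FIELD_WEIGHTS.items())
```

Computing this for every image pair yields the G matrix that is fed to the same spectral clustering step as in run1.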