
BMEMTM at MediaEval 2013 Retrieving Diverse Social Images Task: Analysis of Text and Visual Information

Gábor Szűcs

szucs@tmit.bme.hu 1

Zsombor Paróczi

paroczi@tmit.bme.hu 0

Dániel Máté Vincz

dani.vincz@gmail.com 0

0 Dept. of Telecommunications and Media Informatics, BME, Budapest, Hungary

1 Inter-University Centre for Telecommunications and Informatics, H-4028 Kassai út 26., Debrecen, Hungary

2013

18 19

In this paper, we investigate the possibilities of using visual and textual information to improve the ranking of Flickr photos of famous places. We have elaborated improved textual features based on the standard ones, as well as visual features, e.g. a face feature that measures the relative face area in the images. These heuristic features were used in our solution for the MediaEval 2013 Retrieving Diverse Social Images Task to rerank social photos with respect to two evaluation metrics, precision and diversity.

hiercn<N>: Using the CN (Global Color Naming Histogram) descriptor, this algorithm creates N clusters by simple hierarchical clustering with the Euclidean distance function in the 11 dimensions of the descriptor, in order to get better diversity.

The algorithm takes [...]

$$\dots \sum_{k=7}^{9} CM_k(i,j) \dots \alpha \dots \qquad (1)$$

where the first 3 CM values are the means, the next 3 are the standard deviations, and the last 3 are the second moments; $\alpha$ is a tuning parameter (in our experiments 30 was the best value on the development set); furthermore

$$CM_k(i,j) = CM_k(i) - CM_k(j) \qquad (2)$$

clustermodcm<N>: This is a modified version of the clustercm<N> algorithm, which moves certain images back (as a penalty) by only 3 places in the queue; as a consequence, similar images can end up too close to each other.
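The clustering step of hiercn<N> can be illustrated with a minimal sketch. It assumes average linkage, because the text only says "simple hierarchical clustering"; the function names and the toy 11-bin histograms are hypothetical and are not taken from the task data.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hierarchical_clusters(descriptors, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Assumption: average linkage (the linkage rule is not stated in the paper).
    Returns a list of clusters, each a sorted list of image indices.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(descriptors))]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of the two clusters.
        total = sum(euclidean(descriptors[i], descriptors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Toy 11-bin color naming histograms (hypothetical values).
red = [1.0] + [0.0] * 10
reddish = [0.9, 0.1] + [0.0] * 9
blue = [0.0] * 10 + [1.0]
bluish = [0.0] * 9 + [0.1, 0.9]
print(hierarchical_clusters([red, blue, reddish, bluish], 2))
```

In this toy input the two red-dominated images end up in one cluster and the two blue-dominated images in the other.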

We tested these algorithms on the development set; the results can be seen in Figure 1. The baseline is the original Flickr result, and the facehiercn20 algorithm was the best on the F1@10 metric, which is why we chose it for the visual-only run (run1).

2.2 Textual models

Firstly, we separated the textual task (for run2) into four subtasks:

1: Improving the provided textual models (probabilistic, TF-IDF, Social TF-IDF).
2: Assigning score values to each image for each provided textual model and for each improved model (so an image will possess 6 score values).
3: Calculating the rank score of each image based on the weights of the textual models.
4: Calculating the new order of images for each location.

A more detailed explanation of the subtasks is given below.

1. Rewarding the keywords that appear more often in relation to one location may lead to a better result, but a keyword is sometimes nested in the tag of the image, e.g. basilica can be found in basilicadisantamariadellasalute and thebasilicaofstmaryofhealth. In order to handle this problem, an algorithm we developed splits the tags without spaces into lists of keywords (using the estimated positions of the spaces, obtained from an inference algorithm), and then assigns new values to the keywords according to their number of appearances.
2. Our method calculates an average value for every image based on the number of keywords belonging to the image and the values assigned to those keywords according to all six textual descriptors, i.e. the probabilistic, TF-IDF and social TF-IDF models and their improved versions. Then the method calculates a score value for every image (according to each textual model), which is the sum of the maximum value over all the keywords related to the image and the logarithm of the previously calculated average value.
3. We assign weights to the 6 textual models, and our method calculates the weighted average score (final score) for each image.
4. A higher final score means a better final rank position, thus the new ranks (improved order) can be produced for the images.
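The tag splitting in step 1 can be sketched as follows. The inference algorithm used for estimating the space positions is not described in the text, so this example swaps in a plain vocabulary-based dynamic programming segmentation; the vocabulary and all function names are hypothetical.

```python
from collections import Counter


def split_tag(tag, vocabulary):
    """Split a space-free tag into vocabulary words, or return None if impossible.

    Dynamic programming over prefixes: best[i] holds a segmentation of tag[:i]
    that uses the fewest words.
    """
    best = [None] * (len(tag) + 1)
    best[0] = []
    for end in range(1, len(tag) + 1):
        for start in range(end):
            word = tag[start:end]
            if best[start] is not None and word in vocabulary:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(tag)]


def keyword_counts(tags, vocabulary):
    """Count keyword appearances after splitting; unsplittable tags count as-is."""
    counts = Counter()
    for tag in tags:
        words = split_tag(tag, vocabulary)
        counts.update(words if words is not None else [tag])
    return counts


# Hypothetical vocabulary, chosen only to cover the two example tags.
vocabulary = {"basilica", "di", "santa", "maria", "della", "salute",
              "the", "of", "st", "mary", "health"}
tags = ["basilicadisantamariadellasalute", "thebasilicaofstmaryofhealth"]
print(split_tag(tags[0], vocabulary))
print(keyword_counts(tags, vocabulary)["basilica"])
```

With this vocabulary both tags contribute one appearance of basilica, so the keyword is counted twice instead of not at all.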
We executed many test cases with various weights assigned to both the original and the improved textual models, and we found that the improved TF-IDF weighting model gave the best result in P@10, whereas for CR@10 and F1@10 using only the improved probabilistic model led to the best results.
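Subtasks 2 to 4 can be illustrated with a short sketch. It assumes that the average is a plain arithmetic mean, that the logarithm is the natural logarithm, and that all keyword values are positive; the model names, weights and keyword values below are hypothetical.

```python
import math


def model_score(keyword_values):
    """Score of one image under one textual model: max value + log(mean value).

    Assumes positive keyword values so that the logarithm is defined.
    """
    average = sum(keyword_values) / len(keyword_values)
    return max(keyword_values) + math.log(average)


def final_score(values_per_model, weights):
    """Weighted average of the per-model scores of one image."""
    total_weight = sum(weights[m] for m in values_per_model)
    weighted = sum(weights[m] * model_score(v) for m, v in values_per_model.items())
    return weighted / total_weight


def rerank(images, weights):
    """Return the image ids sorted by decreasing final score."""
    return sorted(images, key=lambda img: final_score(images[img], weights), reverse=True)


# Two images, two textual models, equal weights (all values hypothetical).
images = {
    "a": {"tfidf": [1.0, 1.0], "prob": [2.0, 2.0]},
    "b": {"tfidf": [4.0, 1.0], "prob": [1.0, 1.0]},
}
weights = {"tfidf": 1.0, "prob": 1.0}
print(rerank(images, weights))
```

Setting a weight to zero removes the corresponding model from the final score, which is how a single-model configuration such as "only the improved probabilistic model" could be expressed in this sketch.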

2.3 Combination of visual and textual models

Our text-based approach ignores the original ordering of the images, whereas our visual-based solution only modifies a predefined order, so it seemed natural to combine them. In the combination, the text algorithm was the first phase, and the visual algorithm was then applied to its ordered result as the second phase. Our results on the development set indicated that this combination is better (at least on the CR@10 metric) than either of the original two solutions.

2.4 Human-based approach

We have implemented a tool to help human annotators, with which the user can sort the images into clusters and store a binary decision about the relevance of each image. After the human's work, an algorithm we developed determined the order of the images in the following way: in each cycle, the most relevant image in each non-empty cluster is selected (and removed from the cluster), and the selected images are ordered by Flickr rank. This cycle is repeated, and the process terminates after the last image.
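The ordering cycle can be sketched as below. Because relevance is a binary decision, the sketch assumes that "most relevant" means relevant images come first within a cluster, with ties broken by Flickr rank; this tie-break, the tuple layout and the function name are assumptions.

```python
def human_based_order(clusters):
    """Order images by repeatedly taking one image from each non-empty cluster.

    clusters: list of clusters, each a list of (flickr_rank, relevant) tuples.
    Returns the Flickr ranks of the images in their final order.
    """
    # Inside each cluster: relevant images first, then by Flickr rank (assumed tie-break).
    queues = [sorted(c, key=lambda img: (not img[1], img[0])) for c in clusters]
    order = []
    while any(queues):
        # One cycle: pick the front image of every non-empty cluster.
        picked = [q.pop(0) for q in queues if q]
        # Images picked in the same cycle are ordered by Flickr rank.
        picked.sort(key=lambda img: img[0])
        order.extend(rank for rank, _ in picked)
    return order


clusters = [
    [(1, True), (4, True)],
    [(2, False), (3, True)],
    [(5, True)],
]
print(human_based_order(clusters))  # [1, 3, 5, 2, 4]
```

The first cycle contributes one image per cluster (ranks 1, 3, 5), so the top of the list covers all three clusters before any cluster is visited a second time.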

We did not have enough time to survey the Internet, so the human-based run (run 4) and the general run (run 5, where everything is allowed, including data from external sources) were the same in our contribution, and therefore their results were the same.

3. RESULTS

The evaluation metrics include the precision at the top 10 results (P@10), the cluster recall (CR@10), which measures how many of the existing clusters are represented in the final refinement and thus captures diversity, and their harmonic mean, the F1-measure (F1@10).

[Results table; the row and column layout is not recoverable. Values in extraction order: 0.6754 0.461 0.6469 0.7814 0.6399 0.6981 0.6711 0.6098 0.8936 0.2963 0.4115 0.8163 0.6519 0.5753 0.4922 0.6798 0.6278 0.5734]
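The three metrics can be computed as in the following sketch. It assumes that the ground truth assigns a cluster label only to relevant images and that precision is divided by the cutoff N; the function names and the toy data are hypothetical.

```python
def precision_at(ranked, relevant, n=10):
    """Fraction of the top-n results that are relevant."""
    return sum(1 for img in ranked[:n] if img in relevant) / n


def cluster_recall_at(ranked, cluster_of, n=10):
    """Fraction of ground-truth clusters represented in the top-n results.

    cluster_of maps each relevant image to its ground-truth cluster.
    """
    total = len(set(cluster_of.values()))
    found = {cluster_of[img] for img in ranked[:n] if img in cluster_of}
    return len(found) / total


def f1_at(ranked, cluster_of, n=10):
    """Harmonic mean of precision and cluster recall at cutoff n."""
    p = precision_at(ranked, set(cluster_of), n)
    cr = cluster_recall_at(ranked, cluster_of, n)
    return 0.0 if p + cr == 0 else 2 * p * cr / (p + cr)


# Toy example with cutoff 4: "c" is irrelevant, "e" is relevant but not retrieved.
ranked = ["a", "b", "c", "d"]
cluster_of = {"a": 0, "b": 0, "d": 1, "e": 2}
print(precision_at(ranked, set(cluster_of), n=4))  # 0.75
print(cluster_recall_at(ranked, cluster_of, n=4))
print(f1_at(ranked, cluster_of, n=4))
```

In the toy example three of the four returned images are relevant, but they cover only two of the three clusters, so the cluster recall is lower than the precision.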

[1] Ionescu, B., Menéndez, M., Müller, H., and Popescu, A. 2013. Retrieving Diverse Social Images at MediaEval 2013: Objectives, Dataset and Evaluation. MediaEval 2013 Workshop, ISSN: 1613-0073, 18-19 October 2013, Barcelona, Spain.

[2] Viola, P., and Jones, M. 2001. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition. CVPR 2001. Proceedings of the IEEE Computer Society Conference on. Vol. 1, pp. I-511-I-518.