BMEMTM at MediaEval 2013 Retrieving Diverse Social Images Task: Analysis of Text and Visual Information

Gábor Szűcs, Inter-University Centre for Telecommunications and Informatics, H-4028 Kassai út 26., Debrecen, Hungary, szucs@tmit.bme.hu
Zsombor Paróczi, Dept. of Telecommunications and Media Informatics, BME, Budapest, Hungary, paroczi@tmit.bme.hu
Dániel Máté Vincz, Dept. of Telecommunications and Media Informatics, BME, Budapest, Hungary, dani.vincz@gmail.com

ABSTRACT
In this paper, the possibilities of using visual and textual information are investigated to improve the ranking of photos from Flickr about famous places. We have elaborated improved textual features based on the standard ones, as well as visual features, e.g. the FACE feature, which measures the relative face area of an image. These heuristic features have been used in our solution for the MediaEval 2013 Retrieving Diverse Social Images Task to rerank social photos based on two evaluation metrics: precision and diversity.

1. INTRODUCTION
In the MediaEval 2013 Retrieving Diverse Social Images Task [1], retrieved social photos have to be reranked. In the use case of the task, a potential tourist tries to find more information about a famous place, knowing only its name and having just a vague idea about the location. In this task, many photos retrieved from Flickr, together with their rank information, were available. These results were noisy and redundant, so the aim was to refine them by providing a ranked list of up to 50 photos that are both relevant and diverse representations of the query.

2. OUR CONTRIBUTION
At first we thought that supervised machine learning would be useful, but after investigating the locations it became clear that the albums are very different. Because of the large differences (statues, buildings, squares, a famous ship, etc.) there was little chance to learn; thus, instead of machine learning, we used statistics and heuristics to solve the task.

2.1 Visual models
The contest organizers have made many visual descriptors available, but we have introduced an additional descriptor, the FACE feature, based on OpenCV's implementation of Haar-like feature detection [2]. The FACE feature is the ratio between the area occupied by the possible face regions of an image and the whole image area.
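As an illustration, the following is a minimal sketch of how such a face-area ratio can be computed with OpenCV's Haar cascade face detector. The function name and detector parameters are our own choices, not taken from the paper, and overlapping detections are simply summed.

```python
# Sketch of a FACE-style descriptor: the fraction of the image area
# covered by detected face regions (overlapping boxes are counted twice).
import cv2

def face_area_ratio(image_path):
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    face_area = sum(w * h for (_, _, w, h) in faces)
    height, width = gray.shape
    return face_area / float(width * height)
```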
Based on the descriptors we have developed several algorithms for the task. Each algorithm starts from the original order of the images (the Flickr order) and reorders them.

noface: Using the FACE descriptor, this algorithm filters out the images containing faces and moves them back to the end of the queue.

hiercn: Using the CN (Global Color Naming Histogram) descriptor, this algorithm creates N clusters by simple hierarchical clustering, using the Euclidean distance in the 11 dimensions of the descriptor, in order to get better diversity. The algorithm pushes back every image until the first N images are in different clusters. So if the original order was A1, A2, B1, C4, C5, D2 (where letters represent clusters and numbers represent the index within the cluster), then the reordered list will be A1, B1, C4, D2, A2, C5 (a code sketch of this reordering is given after this list).

facehiercn: This algorithm takes advantage of both noface and hiercn by executing them one after the other (first noface, then hiercn).

clustercm: Using the CM (Global Color Moments on HSV Color Space) descriptors, this algorithm creates N clusters by hierarchical clustering based on a special distance function, and the ordering is the same as in the hiercn algorithm. The distance function between images i and j is

    d(i,j) = \sum_{k=1}^{3} \Delta CM_k(i,j) + \frac{1}{\alpha} \sum_{k=4}^{6} \Delta CM_k(i,j) + \frac{1}{\alpha^2} \sum_{k=7}^{9} \Delta CM_k(i,j)    (1)

where the first 3 CM values are the means, the next 3 are the standard deviations, and the last 3 are the second moments; \alpha is a tuning parameter (in our experiments 30 was the best value on the development set), and

    \Delta CM_k(i,j) = |CM_k(i) - CM_k(j)|    (2)

clustermodcm: This is a modified version of the clustercm algorithm, which pushes back certain images (as a punishment) by only 3 places in the queue; as a consequence, similar images can still end up too close to each other.
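To make the reordering concrete, here is a minimal sketch of the cluster-based push-back described for hiercn, together with the clustercm distance of Eq. (1). It assumes the cluster labels are already available (e.g. from hierarchical clustering of the descriptors); all names are our own.

```python
# Sketch of the hiercn-style reordering: keep the Flickr order, but push
# back any image whose cluster is already represented among the first N.
def rerank_by_cluster(images, cluster_of, n):
    head, tail, seen = [], [], set()
    for img in images:                 # images in original Flickr order
        c = cluster_of[img]
        if len(head) < n and c not in seen:
            head.append(img)
            seen.add(c)
        else:
            tail.append(img)           # pushed back behind the diverse head
    return head + tail

# The example from the text: A1 A2 B1 C4 C5 D2 -> A1 B1 C4 D2 A2 C5 (n = 4).

# Sketch of the clustercm distance of Eq. (1): means count fully, standard
# deviations are damped by alpha, second moments by alpha squared.
def cm_distance(cm_i, cm_j, alpha=30.0):
    diff = [abs(a - b) for a, b in zip(cm_i, cm_j)]   # Eq. (2), 9 values
    return sum(diff[0:3]) + sum(diff[3:6]) / alpha + sum(diff[6:9]) / alpha**2
```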
We have tested these algorithms on the development set, and the results can be seen in Figure 1. The baseline is the original Flickr result, and the facehiercn20 algorithm was the best on the F1@10 metric; that is why we have chosen it for the visual-only run (run1).

Figure 1. Comparing visual algorithms on the development set.

2.2 Textual models
Firstly, we separated the textual task (for run2) into four subtasks: 1: improving the provided textual models (probabilistic, TF-IDF, Social TF-IDF); 2: assigning score values to each image for each provided textual model and for each improved model (so an image will possess 6 score values); 3: calculating the rank score of each image based on the weights of the textual models; 4: calculating the new order of the images for each location. A more detailed explanation of the subtasks is given below, followed by two code sketches.

1. Rewarding the keywords that appear more often in relation to one location may lead to a better result, but a keyword is sometimes nested inside a tag of the image; e.g. basilica can be found in 'basilicadisantamariadellasalute' and 'thebasilicaofstmaryofhealth'. To handle this problem, our algorithm splits the tags written without spaces into lists of keywords (using the estimated positions of the spaces, as the result of an inference algorithm), and then assigns new values to the keywords according to the number of their occurrences.

2. Our method calculates an average value for every image, based on the number of keywords belonging to the image and the values assigned to those keywords, according to all six textual descriptors, i.e. the probabilistic, TF-IDF and social TF-IDF models and their improved versions. Then the method calculates a score value for every image (according to each textual model), which is the sum of the maximum value over all the keywords related to the image and the logarithm of the previously calculated average value.

3. We assign weights to the 6 textual models, and our method calculates the weighted average score (the final score) for each image.

4. A higher final score means a better final rank position, thus the new ranks (the improved order) can be produced for the images.
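The paper does not spell out the space-inference algorithm of subtask 1, so the following is only one plausible reading: a dictionary-driven dynamic-programming segmenter. The vocabulary and all names here are our own illustration.

```python
# Hypothetical sketch of subtask 1: segment a tag written without spaces
# into known keywords. The actual inference algorithm is not detailed in
# the paper; the vocabulary below is purely illustrative.
VOCAB = {"the", "basilica", "of", "st", "mary", "health",
         "di", "santa", "maria", "della", "salute"}

def split_tag(tag, vocab=VOCAB):
    """Return one segmentation of tag into vocabulary words, or [tag]."""
    best = [None] * (len(tag) + 1)   # best[i]: word list covering tag[:i]
    best[0] = []
    for i in range(1, len(tag) + 1):
        for j in range(i):
            if best[j] is not None and tag[j:i] in vocab:
                best[i] = best[j] + [tag[j:i]]
                break
    return best[len(tag)] or [tag]

# split_tag("thebasilicaofstmaryofhealth")
# -> ['the', 'basilica', 'of', 'st', 'mary', 'of', 'health']
```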
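Subtasks 2-4 can then be summarized in a few lines. This sketch assumes positive keyword values (so the logarithm is defined) and uses our own names for the data structures.

```python
import math

def image_score(keyword_values):
    """Subtask 2: score of one image under one textual model, i.e. the
    maximum keyword value plus the log of the average keyword value
    (keyword values are assumed to be positive)."""
    avg = sum(keyword_values) / len(keyword_values)
    return max(keyword_values) + math.log(avg)

def final_ranking(scores_per_model, weights):
    """Subtasks 3-4: weighted average of the six per-model scores,
    then images sorted by decreasing final score.
    scores_per_model: {model: {image_id: score}}; weights: {model: weight}."""
    total_w = sum(weights.values())
    images = next(iter(scores_per_model.values()))
    final = {
        img: sum(weights[m] * scores[img]
                 for m, scores in scores_per_model.items()) / total_w
        for img in images
    }
    return sorted(final, key=final.get, reverse=True)
```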
We executed many test cases with various weights assigned to both the original and the improved textual models, and we found that on P@10 the best result comes from the improved TF-IDF weighting model, whereas in the case of CR@10 and F1@10 using only the improved probabilistic model led to the best results.

2.3 Combination of visual and textual models
Our text-based approach ignores the original ordering of the images, and our visual-based solution only modifies a predefined order, so it seemed natural to combine them. In the combination, the text algorithm was the first phase; then, using the resulting order, the visual algorithm was the second phase. Our results on the development set indicated that this combination is better (at least on the CR@10 metric) than the two original solutions.

2.4 Human-based approach
We have implemented a helper tool for humans, by which the user is able to sort the images into clusters and to store a binary decision about the relevance of each image. After the human's work, our algorithm determines the order of the images in the following way: in a cycle, the most relevant image of each non-empty cluster is selected (and removed from the cluster), and these selected images are ordered by Flickr rank. The cycle is repeated, and the process terminates after the last image. We did not have enough time to survey the Internet, thus the human-based run (run4) and the general run (run5, where everything was allowed, including the use of data from external sources) were the same in our contribution, so their results are also the same.

3. RESULTS
The evaluation metrics include the precision at the top 10 results (P@10), the cluster recall at 10 (CR@10), which measures how many of the existing clusters are represented in the final refinement (i.e. the diversity), and their harmonic mean, the F1-measure (F1@10).

Table 1 and Table 2 present the results achieved using the expert ground truth and the crowd-sourcing ground truth, respectively. The expert evaluation was conducted on the entire test set of 396 locations, while the crowd-sourcing evaluation is based on a selection of 50 locations. The crowd-sourcing relevance ground truth was determined by a majority voting scheme, and the crowd-sourcing diversity ground truth was provided with 3 different annotations (the column marked "a" in Table 2).

Table 1. Results achieved using the expert ground truth.

            run1     run2     run3     run4,5
  P@10      0.7389   0.8056   0.6754   0.8936
  CR@10     0.4076   0.3859   0.3709   0.2963
  F1@10     0.5066   0.4979   0.4610   0.4115

Table 2. Results achieved using the crowd-sourcing ground truth (a = annotation).

            a    run1     run2     run3     run4,5
  P@10           0.6857   0.7653   0.6469   0.8163
  CR@10     1    0.8067   0.7731   0.8217   0.6519
            2    0.7371   0.7130   0.7814   0.5753
            3    0.6469   0.6219   0.6399   0.4922
  F1@10     1    0.7075   0.7429   0.6981   0.6798
            2    0.6825   0.7070   0.6711   0.6278
            3    0.6266   0.6472   0.6098   0.5734

4. CONCLUSIONS
The visual-only (run1) results correspond to our expectations, because the value of F1@10 was 0.559 on the development set. The text + visual (run3) result is a big surprise for the team (BMEMTM), because our development set results indicated that the combined algorithm would be better at least on the CR@10 metric, but both the visual-only and the text-only methods reached better results.

In future work we will try to use the remaining visual descriptors as a training dataset for an SVM classifier; such a learning-based algorithm (possibly combined with the facehiercn algorithm) may achieve better results on the CR metric. The text + visual algorithm should also be re-evaluated to find the mistake.

5. ACKNOWLEDGMENTS
The publication was supported by the TÁMOP-4.2.2.C-11/1/KONV-2012-0001 project. The project has been supported by the European Union and co-financed by the European Social Fund.

6. REFERENCES
[1] Ionescu, B., Menéndez, M., Müller, H. and Popescu, A. 2013. Retrieving Diverse Social Images at MediaEval 2013: Objectives, Dataset and Evaluation. MediaEval 2013 Workshop, ISSN 1613-0073, 18-19 October 2013, Barcelona, Spain.
[2] Viola, P. and Jones, M. 2001. Rapid object detection using a boosted cascade of simple features. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Vol. 1, pp. I-511-I-518.