BMEMTM at MediaEval 2013 Retrieving Diverse Social Images Task: Analysis of Text and Visual Information

Gábor Szűcs, Inter-University Centre for Telecommunications and Informatics, H-4028 Kassai út 26., Debrecen, Hungary, szucs@tmit.bme.hu
Zsombor Paróczi, Dept. of Telecommunications and Media Informatics, BME, Budapest, Hungary, paroczi@tmit.bme.hu
Dániel Máté Vincz, Dept. of Telecommunications and Media Informatics, BME, Budapest, Hungary, dani.vincz@gmail.com

ABSTRACT
In this paper, the possibilities of using visual and textual information are investigated to improve the ranking of photos from Flickr about famous places. We have elaborated improved textual features based on the standard ones, as well as visual features, e.g. the FACE feature, which measures the relative face area of an image. These heuristic features have been used in our solution for the MediaEval 2013 Retrieving Diverse Social Images Task to rerank social photos based on two evaluation metrics: precision and diversity.

1. INTRODUCTION
In the MediaEval 2013 Retrieving Diverse Social Images Task [1], retrieved social photos have to be reranked. In the use case of the task, a potential tourist tries to find more information about a famous place, knowing only its name and having just a vague idea about the location. In this task, many photos retrieved from Flickr, together with their rank information, were available. These results were noisy and redundant, so the aim was to refine them by providing a ranked list of up to 50 photos that are both relevant and diverse representations of the query.

2. OUR CONTRIBUTION
At first we thought that supervised machine learning would be useful, but after investigating the locations it became clear that the albums are very different. Because of the large differences (statues, buildings, squares, a famous ship, etc.) there was little chance to learn; thus, instead of machine learning, we used statistics and heuristics to solve the task.

2.1 Visual models
The contest organizers have made many visual descriptors available, but we have introduced an additional descriptor, the FACE feature, based on OpenCV's implementation of Haar-like feature detection [2]. The FACE feature is the ratio between the area occupied by the possible face regions of an image and the whole image area.
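As an illustration, the following is a minimal sketch of how such a face-area ratio can be computed with OpenCV's Haar cascade face detector. The function name and detector parameters are our own choices, not taken from the paper, and overlapping detections are simply summed.

```python
# Sketch of a FACE-style descriptor: the fraction of the image area
# covered by detected face regions (overlapping boxes are counted twice).
import cv2

def face_area_ratio(image_path):
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    face_area = sum(w * h for (_, _, w, h) in faces)
    height, width = gray.shape
    return face_area / float(width * height)
```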
Based on the descriptors we have developed several algorithms for the task. Each algorithm starts from the original order of the images (the Flickr order) and reorders them.

noface: Using the FACE descriptor, this algorithm filters out the images containing faces and moves them back to the end of the queue.

hiercn: Using the CN (Global Color Naming Histogram) descriptor, this algorithm creates N clusters by simple hierarchical clustering, using the Euclidean distance in the 11 dimensions of the descriptor, in order to get better diversity. The algorithm pushes back every image until the first N images are in different clusters. So if the original order was A1, A2, B1, C4, C5, D2 (where letters represent clusters and numbers represent the index within the cluster), then the reordered list will be A1, B1, C4, D2, A2, C5 (a code sketch of this reordering is given after this list).

facehiercn: This algorithm takes advantage of both noface and hiercn by executing them one after the other (first noface, then hiercn).

clustercm: Using the CM (Global Color Moments on HSV Color Space) descriptors, this algorithm creates N clusters by hierarchical clustering based on a special distance function, and the ordering is the same as in the hiercn algorithm. The distance function between images i and j is

    d(i,j) = \sum_{k=1}^{3} \Delta CM_k(i,j) + \frac{1}{\alpha} \sum_{k=4}^{6} \Delta CM_k(i,j) + \frac{1}{\alpha^2} \sum_{k=7}^{9} \Delta CM_k(i,j)    (1)

where the first 3 CM values are the means, the next 3 are the standard deviations, and the last 3 are the second moments; \alpha is a tuning parameter (in our experiments 30 was the best value on the development set), and

    \Delta CM_k(i,j) = |CM_k(i) - CM_k(j)|    (2)

clustermodcm: This is a modified version of the clustercm algorithm, which pushes back certain images (as a punishment) by only 3 places in the queue; as a consequence, similar images can still end up too close to each other.
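To make the reordering concrete, here is a minimal sketch of the cluster-based push-back described for hiercn, together with the clustercm distance of Eq. (1). It assumes the cluster labels are already available (e.g. from hierarchical clustering of the descriptors); all names are our own.

```python
# Sketch of the hiercn-style reordering: keep the Flickr order, but push
# back any image whose cluster is already represented among the first N.
def rerank_by_cluster(images, cluster_of, n):
    head, tail, seen = [], [], set()
    for img in images:                 # images in original Flickr order
        c = cluster_of[img]
        if len(head) < n and c not in seen:
            head.append(img)
            seen.add(c)
        else:
            tail.append(img)           # pushed back behind the diverse head
    return head + tail

# The example from the text: A1 A2 B1 C4 C5 D2 -> A1 B1 C4 D2 A2 C5 (n = 4).

# Sketch of the clustercm distance of Eq. (1): means count fully, standard
# deviations are damped by alpha, second moments by alpha squared.
def cm_distance(cm_i, cm_j, alpha=30.0):
    diff = [abs(a - b) for a, b in zip(cm_i, cm_j)]   # Eq. (2), 9 values
    return sum(diff[0:3]) + sum(diff[3:6]) / alpha + sum(diff[6:9]) / alpha**2
```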
We have tested these algorithms on the development set, and the results can be seen in Figure 1. The baseline is the original Flickr result, and the facehiercn20 algorithm was the best on the F1@10 metric; that is why we have chosen it for the visual-only run (run1).

Figure 1. Comparing visual algorithms on the development set.

2.2 Textual models
Firstly, we separated the textual task (for run2) into four subtasks: 1: improving the provided textual models (probabilistic, TF-IDF, Social TF-IDF); 2: assigning score values to each image for each provided textual model and for each improved model (so an image will possess 6 score values); 3: calculating the rank score of each image based on the weights of the textual models; 4: calculating the new order of the images for each location. A more detailed explanation of the subtasks is given below, followed by two code sketches.

1. Rewarding the keywords that appear more often in relation to one location may lead to a better result, but a keyword is sometimes nested inside a tag of the image; e.g. basilica can be found in 'basilicadisantamariadellasalute' and 'thebasilicaofstmaryofhealth'. To handle this problem, our algorithm splits the tags written without spaces into lists of keywords (using the estimated positions of the spaces, as the result of an inference algorithm), and then assigns new values to the keywords according to the number of their occurrences.

2. Our method calculates an average value for every image, based on the number of keywords belonging to the image and the values assigned to those keywords, according to all six textual descriptors, i.e. the probabilistic, TF-IDF and social TF-IDF models and their improved versions. Then the method calculates a score value for every image (according to each textual model), which is the sum of the maximum value over all the keywords related to the image and the logarithm of the previously calculated average value.

3. We assign weights to the 6 textual models, and our method calculates the weighted average score (the final score) for each image.

4. A higher final score means a better final rank position, thus the new ranks (the improved order) can be produced for the images.
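The paper does not spell out the space-inference algorithm of subtask 1, so the following is only one plausible reading: a dictionary-driven dynamic-programming segmenter. The vocabulary and all names here are our own illustration.

```python
# Hypothetical sketch of subtask 1: segment a tag written without spaces
# into known keywords. The actual inference algorithm is not detailed in
# the paper; the vocabulary below is purely illustrative.
VOCAB = {"the", "basilica", "of", "st", "mary", "health",
         "di", "santa", "maria", "della", "salute"}

def split_tag(tag, vocab=VOCAB):
    """Return one segmentation of tag into vocabulary words, or [tag]."""
    best = [None] * (len(tag) + 1)   # best[i]: word list covering tag[:i]
    best[0] = []
    for i in range(1, len(tag) + 1):
        for j in range(i):
            if best[j] is not None and tag[j:i] in vocab:
                best[i] = best[j] + [tag[j:i]]
                break
    return best[len(tag)] or [tag]

# split_tag("thebasilicaofstmaryofhealth")
# -> ['the', 'basilica', 'of', 'st', 'mary', 'of', 'health']
```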
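Subtasks 2-4 can then be summarized in a few lines. This sketch assumes positive keyword values (so the logarithm is defined) and uses our own names for the data structures.

```python
import math

def image_score(keyword_values):
    """Subtask 2: score of one image under one textual model, i.e. the
    maximum keyword value plus the log of the average keyword value
    (keyword values are assumed to be positive)."""
    avg = sum(keyword_values) / len(keyword_values)
    return max(keyword_values) + math.log(avg)

def final_ranking(scores_per_model, weights):
    """Subtasks 3-4: weighted average of the six per-model scores,
    then images sorted by decreasing final score.
    scores_per_model: {model: {image_id: score}}; weights: {model: weight}."""
    total_w = sum(weights.values())
    images = next(iter(scores_per_model.values()))
    final = {
        img: sum(weights[m] * scores[img]
                 for m, scores in scores_per_model.items()) / total_w
        for img in images
    }
    return sorted(final, key=final.get, reverse=True)
```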
We executed many test cases with various weights assigned to both the original and the improved textual models, and we found that on P@10 the best result comes from the improved TF-IDF weighting model, whereas in the case of CR@10 and F1@10 using only the improved probabilistic model led to the best results.

2.3 Combination of visual and textual models
Our text-based approach ignores the original ordering of the images, and our visual-based solution only modifies a predefined order, so it seemed natural to combine them. In the combination, the text algorithm was the first phase; then, using the resulting order, the visual algorithm was the second phase. Our results on the development set indicated that this combination is better (at least on the CR@10 metric) than the two original solutions.

2.4 Human-based approach
We have implemented a helper tool for humans, by which the user is able to sort the images into clusters and to store a binary decision about the relevance of each image. After the human's work, our algorithm determines the order of the images in the following way: in a cycle, the most relevant image of each non-empty cluster is selected (and removed from the cluster), and these selected images are ordered by Flickr rank. The cycle is repeated, and the process terminates after the last image. We did not have enough time to survey the Internet, thus the human-based run (run4) and the general run (run5, where everything was allowed, including the use of data from external sources) were the same in our contribution, so their results are also the same.

3. RESULTS
The evaluation metrics include the precision at the top 10 results (P@10), the cluster recall at 10 (CR@10), which measures how many of the existing clusters are represented in the final refinement (i.e. the diversity), and their harmonic mean, the F1-measure (F1@10).

Table 1 and Table 2 present the results achieved using the expert ground truth and the crowd-sourcing ground truth, respectively. The expert evaluation was conducted on the entire test set of 396 locations, while the crowd-sourcing evaluation is based on a selection of 50 locations. The crowd-sourcing relevance ground truth was determined by a majority voting scheme, and the crowd-sourcing diversity ground truth was provided with 3 different annotations (the column marked "a" in Table 2).

Table 1. Results achieved using the expert ground truth.

            run1     run2     run3     run4,5
  P@10      0.7389   0.8056   0.6754   0.8936
  CR@10     0.4076   0.3859   0.3709   0.2963
  F1@10     0.5066   0.4979   0.4610   0.4115

Table 2. Results achieved using the crowd-sourcing ground truth (a = annotation).

            a    run1     run2     run3     run4,5
  P@10           0.6857   0.7653   0.6469   0.8163
  CR@10     1    0.8067   0.7731   0.8217   0.6519
            2    0.7371   0.7130   0.7814   0.5753
            3    0.6469   0.6219   0.6399   0.4922
  F1@10     1    0.7075   0.7429   0.6981   0.6798
            2    0.6825   0.7070   0.6711   0.6278
            3    0.6266   0.6472   0.6098   0.5734

4. CONCLUSIONS
The visual-only (run1) results correspond to our expectations, because the value of F1@10 was 0.559 on the development set. The text + visual (run3) result is a big surprise for the team (BMEMTM), because our development set results indicated that the combined algorithm would be better at least on the CR@10 metric, but both the visual-only and the text-only methods reached better results.

In future work we will try to use the remaining visual descriptors as a training dataset for an SVM classifier; such a learning-based algorithm (possibly combined with the facehiercn algorithm) may achieve better results on the CR metric. The text + visual algorithm should also be re-evaluated to find the mistake.

5. ACKNOWLEDGMENTS
The publication was supported by the TÁMOP-4.2.2.C-11/1/KONV-2012-0001 project. The project has been supported by the European Union and co-financed by the European Social Fund.

6. REFERENCES
[1] Ionescu, B., Menéndez, M., Müller, H. and Popescu, A. 2013. Retrieving Diverse Social Images at MediaEval 2013: Objectives, Dataset and Evaluation. MediaEval 2013 Workshop, ISSN 1613-0073, 18-19 October 2013, Barcelona, Spain.
[2] Viola, P. and Jones, M. 2001. Rapid object detection using a boosted cascade of simple features. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Vol. 1, pp. I-511-I-518.