=Paper= {{Paper |id=Vol-1263/paper18 |storemode=property |title=PeRCeiVe Lab@UNICT at MediaEval 2014 Diverse Images: Random Forests for Diversity-based Clustering |pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_18.pdf |volume=Vol-1263 |dblpUrl=https://dblp.org/rec/conf/mediaeval/SpampinatoP14 }} ==PeRCeiVe Lab@UNICT at MediaEval 2014 Diverse Images: Random Forests for Diversity-based Clustering== https://ceur-ws.org/Vol-1263/mediaeval2014_submission_18.pdf
 PeRCeiVe Lab@UNICT at MediaEval 2014 Diverse Images:
     Random Forests for Diversity-based Clustering

                                        Concetto Spampinato, Simone Palazzo
                             Department of Electrical, Electronics and Computer Engineering
                                                  University of Catania
                                     Viale Andrea Doria, 6 - 95125, Catania, Italy
                                     {cspampin, simone.palazzo}@dieei.unict.it

ABSTRACT                                                              j land in the same terminal node, their similarity s-
In this paper we describe the work done by the Pattern                core is increased by one. The similarities are finally
Recognition and Computer Vision Laboratory (PeRCeiVe                  normalized by the number of trees.
Lab) of the University of Catania (Italy) for the MediaEval           The similarities between samples build up a matrix,
2014 Retrieving Diverse Social Images Task. The main chal-            SIM , which is symmetric, positive definite and with
lenge consists of retrieving, for a given topic, a set of pho-        values in the unit interval [0, 1]. The
                                                                                                           p dissimilarity ma-
tos which are relevant to the topic but also showing diverse          trix is defined as DISSIMij =         1 − SIMij , which
views of it. We submitted four runs exploiting a common               is employed as input for partitioning around medoids
feature-independent clustering strategy based on random-              clustering [2]. Each obtained cluster is likely to show
forests for diversifying Flickr result images while preserving        a specific view of the considered location;
relevance.
                                                                    • Cluster filtering by relevance. Starting from the
                                                                      diversity clusters we then perform a cluster filtering
1.    INTRODUCTION                                                    by relevance. In particular, let SW be the average
   The goal of the MediaEval 2014 Retrieving Diverse Social           of maximum similarities between all the topic samples
Images Task [1] is to refine a list of location photos, retrieved     and the content available on Wikipedia computed as
from Flickr through textual queries. Although photos re-              follows:
trieved in Flickr are often relevant, e.g., depict partially or
                                                                                               N
entirely the target location, a significant number is either                               1 X
                                                                                   SW =           max SIM (i, j)              (1)
noisy or redundant. The objective is, therefore, to filter out                             N i=1 j∈W iki
such photos in order to obtain a exhaustive, and compact
summary for the considered location.                                  with N being the number of topic samples, and W iki
   Conversely to most of the existing methods, our approach           the number of samples describing Wikipedia location
first looks for diversity of the Flickr photos by performing          content (e.g., in case of visual features, W iki is the
a diversity-based clustering and then removes the irrelevant          number of images on Wikipedia for that location, while
clusters by computing the similarities between the clustered          when employing text features W iki = 1 since only
photos and information available on Wikipedia. Final re-              sample describing the entire Wikipedia page text is
ranking is carried out by re-clustering the relevant images           considered).
and computing a diversity score for each location photo.              For each cluster C we carry out an unsupervised hier-
                                                                      archical tree clustering on samples’ features, thus ob-
2.    METHOD                                                          taining C 0 clusters. After that, we scale the similarities
   The strategy employed for dealing with the Retrieving              between the samples of cluster C and the Wikipedia
Diverse Social Images Task of MediaEval 2014 relies on a              content by C 0 , i.e., SIMC0 = SIM (i, j)i∈C,j∈W iki · C 0 .
common, feature-independent framework consisting of the               This gives more weight to the samples which resemble
following steps:                                                      mostly the Wikipedia content and that, at the same
                                                                      time, are diverse from other samples. The relevance s-
     • Diversity-based clustering. The goal of this clus-             core RSC of cluster C is, eventually, computed as mean
       tering step is to assess dissimilarity between samples         of SIMC0 and the cluster is removed if RSC ≥ k · SW ;
       (i.e., location photos described with either visual or
       text descriptors) of a given location and to group them      • Final ranking according to a diversity score.
       according to such dissimilarity. To do that, we use            The samples obtained at the previous step are again
       Random forest predictors which allow us to define a            clustered using unsupervised hierarchical clustering and
       dissimilarity measure between observations [3]. For            re-ranked according to a diversity score, computed for
       each tree of the forest, if samples/observations i and         a sample j, by integrating: 1) number of samples in n-
                                                                      ear clusters: if sample j is in cluster Cj we count how
                                                                      main samples are in Cj , Cj−1 and Cj+1 and its score
                                                                      is s1j = NC +NC 1 +NC         . This favors again diver-
Copyright is held by the author/owner(s).                                          j     j−1       j+1

MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain        sity as samples with many items in near clusters are
        Table 1: Training and official test results obtained by our method for each of the submitted runs.
                                              Development set           Test set (official)
                       Run                      P@20      CR@20     F1@20      P@20      CR@20      F1@20
                       run1 (visual)            86.67%     43.10%   56.87%     74.80%    38.74%     50.34%
                       run2 (text)              78.17%     44.02%   55.59%     75.53%    39.02%     50.63%
                       run3 (visual-text)       85.33%     43.61%   56.93%     72.40%    37.88%     49.03%
                       run5 (any resources)     84.14%     42.97%   56.14%     72.93%    37.31%     48.49%


        strongly penalized; 2) photo id distance s2j from sam-             for run 1 and run 2, i.e., by multiplying the ranking
        ple j to all other samples in the list: photos with close          scores of the two ranked lists for images appearing in
        IDs are likely to refer to a similar view of a location;           both lists and completing the final list with images of
        and 3) a random factor s3j to enable exploration of                the list of Run 1;
        new solutions. The final ranking of sample j is given
        by multiplying the three above scores.                           • Run 5 (any resources): same algorithm as in Run 1,
                                                                           without applying the face and GPS-based filtering.
3.      EXPERIMENTS                                                 It is clear that our attempt of making the training phase as
                                                                    independent as possible from the development set succeeded
3.1      Setup and features                                         only partially, since the results on the test set are sensibly
   The algorithm described in Sect. 2 depends on the number         lower than those on the training set. Also, textual features
of trees used for the diversity based clustering and on the k       performed surprisingly better than visual ones, which had
threshold for cluster filtering by relevance, which were set,       obtained the highest accuracy by far on the development
respectively, to 50 and 3.5 for visual features and 1.0 for text    set. The significant performance difference, in terms of pre-
features.                                                           cision, of the visual runs (about 12%) on the test set and
   For runs using only visual features, we used the visual          the development set can be due to a different relevance dis-
descriptors provided by the task’s organizers for each photo        tribution of visual features, while it seems that text features
(including the Wikipedia ones) normalized between 0 and 1           keep the same distribution on the two sets.
and concatenated into a single 945-dimensional vector.
   Textual descriptors were computed as TF-IDF vectors              4.    CONCLUSIONS
from a vocabulary made up of all words from titles, descrip-           In this paper we describe our random-forest based ap-
tions and tags for photos in the development and test sets,         proach for tackling the MediaEval 2014 Retrieving Diverse
plus words extracted from Wikipedia page (available as part         Social Images Task. Our method, applied to text and visual
of the data provided by the task’s organizers). In order to         features indifferently, leverages on a diversity-based cluster-
reduce the original vocabulary size, being too large (more          ing using Random Forests and on noisy cluster filtering to
than 90,000 words), we removed: 1) words shorter than four          increase relevance. Final ranking is made in order to favor
characters; 2) words starting with digits; 3) words with low        diversity with respect to relevance.
maximum TF-IDF values, thus resulting in a vocabulary size             As future improvement, we mean to better exploit the
of 51,136 terms.                                                    amount of information provided for development: in par-
   For both the visual-based and the text-based classifiers,        ticular, the diversity ground truth will be used to improve
the random forest was trained on 10 locations (the remaining        intra-cluster split in the diversity-based clustering, and user
ones were used for testing), randomly selected from the ones        credibility information will be integrated into the relevance
available in the training set. Increasing the number locations      estimation model.
led to higher training times without an actual improvement
in accuracy on the training set.
                                                                    5.    REFERENCES
3.2      Results and discussion                                     [1] B. Ionescu, A. Popescu, M. Lupu, A. L. Gı̂nsca, and
     We submitted four runs whose results are given in Table 1:         H. Müller. Retrieving Diverse Social Images at
                                                                        MediaEval 2014: Challenge, Dataset and Evaluation. In
      • Run 1 (visual information only): we employed the al-            MediaEval 2014 Workshop, October 16-17, 2014,
        gorithm described in Sect. 2 on the visual features,            Barcelona, Spain, 2014.
        filtering out images: 1) having people, detected by         [2] H.-S. Park and C.-H. Jun. A simple and fast algorithm
        the face detector in [4], as subjects, and 2) whose Eu-         for k-medoids clustering. Expert Syst. Appl.,
        clidean distance between the location’s GPS coordi-             36(2):3336–3341, Mar. 2009.
        nates (provided as part of each topic’s description) and    [3] T. Shi and S. Horvath. Unsupervised Learning With
        their GPS coordinates (when provided) was over 10;              Random Forest Predictors. Journal of Computational
      • Run 2 (text information only): the same algorithm was           and Graphical Statistics, 15(1):118 – 138, 2006.
        employed on the reduced-vocabulary TF-IDF descrip-          [4] P. Viola and M. Jones. Robust real-time object
        tors;                                                           detection. International Journal of Computer Vision,
                                                                        57:137–154, 2001.
      • Run 3 (visual-text fusion): the results presented in
        this run were obtained by combining those computed