Retrieving Social Images using Relevance Filtering and
                         Diverse Selection

Taruna Agrawal¹, Rahul Gupta², Shrikanth Narayanan²
¹ Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, USA
² Signal Analysis and Interpretation Lab (SAIL), University of Southern California, Los Angeles, USA
tagrawal@usc.edu, guptarah@usc.edu, shri@sipi.usc.edu


Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

ABSTRACT
Retrieving relevant and diverse images from a large set of images is a problem of interest in social media. Given a set of images pertaining to a location or a concept, a subset of diverse images can summarize the attributes of the corresponding location/concept. In this work, we present a two-step image retrieval model involving relevance filtering followed by diverse selection. Based on the visual features, textual descriptions and Flickr rank, relevance filtering initially determines a subset of images that correspond to a topic of interest. Subsequently, diverse selection determines a smaller subset of images to provide a diverse perspective of the concept. We obtain an F1 score of .509 on a test set containing 139 concepts, when computed over the top 20 images output by our system. We analyze the outcomes of our system and investigate the utility of image metadata (reviews, Flickr content) when combined with visual descriptors.

1.     INTRODUCTION
“Deluge of information” is a term prevalent in present day social media [1–4], often attributed to advances in technology and social connectivity. Compact representation of relevant information is a major challenge posed by the growth of social media. The Retrieving Diverse Social Images task at the MediaEval 2015 challenge [5] addresses this problem in the domain of images on social media such as Flickr. The goal is to design a query-based social image retrieval engine that obtains relevant images while covering diverse aspects of the query, for instance, various sub-topics of the query. Potential information sources include image attributes as well as image metadata such as image description, view count and image rank on social media.

Previous works [6, 7] have focused on knowledge-based methods for relevant image selection and/or clustering-based methods for diversification. The relevance selection is usually based on image attributes such as presence of people [8], image quality [7] and similarity to a standard source of images like Wikipedia [9]. In this work, we adopt a combination of supervised and unsupervised schemes for relevance filtering; after filtering out irrelevant images, we use clustering for diverse selection. Through our methods, we show the promise of using supervised learning methods in addition to existing knowledge-based methods in such retrieval tasks. In the next section, we describe our methodology in detail followed by the results.

2.    METHODOLOGY DESCRIPTION
Our system for retrieving diverse social images consists of two steps: (i) relevance filtering, and (ii) diverse selection. Relevance filtering removes images that have little or no relation to the concept of interest, and diverse selection provides a subset of images that differ from each other. We describe the two steps in detail below.

2.1     Relevance filtering
We perform relevance filtering to filter out images unrelated to a concept. The 2015 MediaEval challenge data provides a set of visual and textual descriptors over 153 concepts for model development and 139 concepts for evaluation. Given the visual descriptors, textual information and Flickr metadata, we train several supervised and knowledge-based filtering schemes. We describe these models below.

2.1.1    Supervised methods
K-nearest neighbor classifier on visual descriptors: The 2015 MediaEval challenge data set provides a set of general purpose visual descriptors such as color, texture and feature information along with a binary label indicating if an image is relevant/irrelevant to the concept under consideration [5]. We train a K-nearest neighbor (KNN) classifier on these visual descriptors using these labels. The features are z-normalized before training and K is tuned on the development set using 3-fold cross-validation.

Maximum entropy model on textual descriptors: The textual descriptors are extracted from sources such as the photo title, the description provided by the author and the photo tags on Flickr. We extract features from these sources using the following steps:

1. Feature standardization: This step is performed to train a universal model for all the concepts instead of concept-specific models. We replace any word related to a concept by a keyword. For instance, if the query is “The great wall of china”, words such as “great wall”, “wall of china” and “great wall china” occurring anywhere in textual descriptions are replaced by a single keyword “Place of interest”. The list of words to be replaced is created based on the query title and contains various combinations of words in the query.
2. Feature selection: Given the set of standardized features, we retain the words within the top 10% of word frequencies. This step is performed to reduce the feature dimensionality while training the model.
3. Model training: Given the set of selected features, we train a maximum entropy model to predict the binary labels (relevant/irrelevant).

2.1.2    Unsupervised methods
Removal of images with people in focus: Relevant images do not have a person as the subject of focus. We incorporate this fact by using the facedetect software [10] to filter out images containing people as the main subjects.

Relevance filtering based on Flickr rank: As a final relevance filtering scheme, we remove images whose Flickr rank exceeds a threshold (rank > 200). The motivation behind this scheme is that images ranked low by Flickr are less likely to be associated with the concept in question.

2.2     Diverse selection
After obtaining the set of images based on relevance filtering, we use image clustering for diverse selection. Given a query size of K̂ images, we perform K̂-means clustering on the visual descriptors. We hypothesize that similar images fall into a single cluster and retain only one image per cluster. We select the image closest to the cluster centroid as the cluster representative.

In order to compute the selection score for each image, we use the output of the KNN classifier, the maxent model and the distance of the image from its cluster centroid. The score is given by an unweighted sum of the ratio of relevant images amongst the closest K images, the maxent output probability for the image being relevant and the inverse of the Euclidean distance of the image from the cluster centroid. The last term is added based on the assumption that images closer to centroids are more representative of the cluster. In the next section we present our results and discussion.

3.    RESULTS
In run 1, we only use the relevance filtering model developed on visual descriptors (K-nearest neighbors classifier) and face detection. In run 3, we append filtering using the maximum entropy model on textual descriptors. Finally, run 5 uses all the relevance filtering schemes (visual, face detection, text and Flickr rank based). Note that in all three runs diverse selection is based on visual descriptors only. The evaluation metrics are cluster recall (CR) and precision (P) for the top X ranked images as predicted by the system. We show CR@X and P@X along with the corresponding F-score F1@X for X = 5, 10, 20, 30, 40, 50 in Figure 1. All these outcomes are based on clustering with K̂ set to 50. Also, in the 2015 challenge, separate metrics were reported for concepts which share images with other concepts (multi-concept) along with single-concept images. We report the official scores of CR, P and F1 @X = 20 for the multi- and single-concept images in Table 1.

Figure 1: Results for the proposed system at different numbers of retrieved images (X).

Table 1: Results (F1 score/Precision/Cluster recall) for the proposed system @X = 20.

Run #   Relevance filter    Single concept (F1/P/CR)   Multi concept (F1/P/CR)   All (F1/P/CR)
1       Visual desc.        .492/.664/.408             .514/.700/.426            .504/.682/.417
3       + Textual desc.     .497/.677/.410             .517/.708/.426            .507/.692/.418
5       + Flickr rank       .512/.702/.421             .507/.708/.411            .509/.705/.416

From the results, we observe that for low values of X the combined system (visual + face detection + text + Flickr rank) marginally (though not significantly) outperforms the system using only the visual cues. However, the performance degrades significantly at higher values of X. Note that this decrease in performance is not due to the additional filtering schemes performing poorly. Instead, it arises because additional filtering reduces the number of data points available for diverse selection. Therefore we had to reduce the number of clusters after relevance filtering, sometimes to the extent that our model returned fewer than 50 images. However, better performance at lower X (e.g. X = 20 in Table 1) shows the promise of using additional modalities. In Table 1, we observe minor improvements in F1@20 after adding subsequent relevance filtering schemes. One interesting observation is that when Flickr ranks are used, F1 decreases for multi-concept images whereas it increases for single-concept images. This suggests that Flickr ranks are more reliable in the case of single-concept images than multi-concept images. This factor can be taken into account in future system designs.

4.    CONCLUSION
In this work, we present a two-stage system for social image retrieval. In the first stage, we perform relevance filtering to remove irrelevant images and in the second stage we perform diverse selection using clustering in the visual descriptor space. Our relevance filtering system involves a combination of supervised and unsupervised methods. In the future, we can extend the work presented by exploring other methods (filtering, clustering) under a similar system development paradigm. We can also reformulate the problem as one of diverse system development, drawing on several of the existing works [11, 12]. Finally, we would also like to use additional metadata like Flickr user credibility [5, 13] and other image properties (CNN features) to further improve our system.
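
A minimal sketch of the selection step described in Section 2.2 is given below, assuming that cluster assignments and centroids have already been computed (for example by K̂-means). The function and argument names are illustrative, the small constant eps that guards against division by zero is an assumption not specified in the text, and ranking the retained images by their score is only one possible reading of how the score is used.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_and_rank(features, labels, centroids, knn_ratio, maxent_prob, eps=1e-9):
    # Keep, for every cluster, the image closest to its centroid.
    best = {}  # cluster id -> (distance to centroid, image index)
    for i, (x, c) in enumerate(zip(features, labels)):
        d = euclidean(x, centroids[c])
        if c not in best or d < best[c][0]:
            best[c] = (d, i)
    # Unweighted sum of the three terms; eps is an assumed guard against d == 0.
    scored = [(knn_ratio[i] + maxent_prob[i] + 1.0 / (d + eps), i)
              for d, i in best.values()]
    scored.sort(reverse=True)
    return [i for _, i in scored]

# Toy example with hypothetical values: four images, two clusters.
features = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (10.0, 12.0)]
labels = [0, 0, 1, 1]
centroids = [(0.2, 0.0), (10.0, 10.5)]
knn_ratio = [0.6, 0.9, 0.8, 0.4]      # ratio of relevant images among the K neighbors
maxent_prob = [0.5, 0.9, 0.9, 0.3]    # maxent probability of being relevant
ranking = select_and_rank(features, labels, centroids, knn_ratio, maxent_prob)
print(ranking)
```

In this toy example, images 0 and 2 are retained as cluster representatives, and image 0 is ranked first because it lies closer to its centroid, so the inverse-distance term dominates its score.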
5.   REFERENCES
 [1] Holly M Bik and Miriam C Goldstein. An
     introduction to social media for scientists. 2013.
 [2] Sophia B Liu. Trends in distributed curatorial
     technology to manage data deluge in a networked
     world. The European Journal for the Informatics
     Professional, 11(4):18–24, 2010.
 [3] C Szongott, Benjamin Henne, G von Voigt, et al. Big
     data privacy issues in public social media. In Digital
     Ecosystems Technologies (DEST), 2012 6th IEEE
     International Conference on, pages 1–6. IEEE, 2012.
 [4] Duc-Tien Dang-Nguyen, Luca Piras, Giorgio Giacinto,
     Giulia Boato, and F De Natale. Retrieval of diverse
     images by pre-filtering and hierarchical clustering.
     MediaEval Benchmarking Initiative for Multimedia
     Evaluation, 2014.
 [5] Bogdan Ionescu, Alexandru L Gînsca, Bogdan
     Boteanu, Adrian Popescu, Mihai Lupu, and Henning
     Müller. Retrieving diverse social images at MediaEval
     2015: Challenge, dataset and evaluation. In MediaEval
     2015 Workshop, Wurzen, Germany, 2015.
 [6] Bogdan Ionescu, Adrian Popescu, Mihai Lupu,
     Alexandru L Gînsca, and Henning Müller. Retrieving
     diverse social images at MediaEval 2014: Challenge,
     dataset and evaluation. In MediaEval 2014 Workshop,
     Barcelona, Spain, 2014.
 [7] Tomas Brodsky. Relevant image detection in a
     camera, recorder, or video streaming device, April 4
     2006. US Patent App. 11/397,780.
 [8] Alexandru Lucian Gînsca, Adrian Popescu, and Navid
     Rekabsaz. CEA LIST's participation at the MediaEval
     2014 retrieving diverse social images task. In
     Proceedings of the MediaEval Multimedia Benchmark
     Workshop, CEUR-WS.org, volume 1263, ISSN
     1613-0073, 2014.
 [9] Maia Zaharieva and Patrick Schwab. A unified
     framework for retrieving diverse social images. 2014.
[10] Robert Frischholz. The face detection homepage.
     https://facedetection.com/.
[11] Rahul Gupta, Kartik Audhkhasi, and Shrikanth
     Narayanan. A mixture of experts approach towards
     intelligibility classification of pathological speech. In
     Acoustics, Speech and Signal Processing (ICASSP),
     2015 IEEE International Conference on, pages
     1986–1990. IEEE, 2015.
[12] Rahul Gupta, Kartik Audhkhasi, and Shrikanth
     Narayanan. Training ensemble of diverse classifiers on
     feature subsets. In Acoustics, Speech and Signal
     Processing (ICASSP), 2014 IEEE International
     Conference on, pages 2927–2931. IEEE, 2014.
[13] Bogdan Ionescu, Adrian Popescu, Mihai Lupu,
     Alexandru Lucian Gînsca, Bogdan Boteanu, and
     Henning Müller. Div150cred: A social image retrieval
     result diversification with user tagging credibility
     dataset. ACM Multimedia Systems-MMSys, Portland,
     Oregon, USA, 2015.