<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Retrieving Social Images using Relevance Filtering and Diverse Selection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Taruna Agrawal</string-name>
          <email>tagrawal@usc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rahul Gupta</string-name>
          <email>guptarah@usc.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shrikanth Narayanan</string-name>
          <email>shri@sipi.usc.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ming Hsieh Department of Electrical Engineering, University of Southern California</institution>
          ,
          <addr-line>Los Angeles</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Signal Analysis and Interpretation Lab (SAIL), University of Southern California</institution>
          ,
          <addr-line>Los Angeles</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
<p>Retrieving relevant and diverse images from a large set of images is a problem of interest in social media. Given a set of images pertaining to a location or a concept, a subset of diverse images can summarize the attributes of the corresponding location/concept. In this work, we present a two-step image retrieval model involving relevance filtering followed by diverse selection. Based on the visual features, textual descriptions and Flickr rank, relevance filtering initially determines a subset of images that correspond to a topic of interest. Subsequently, diverse selection determines a smaller subset of images to provide a diverse perspective of the concept. We obtain an F1 score of .509 on a test set containing 139 concepts, computed over the top 20 images output by our system. We analyze the outcomes of our system and investigate the utility of image metadata (reviews, Flickr content) when combined with visual descriptors.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        "Deluge of information" is a term prevalent in present-day
social media [1-4], often attributed to advances in
technology and social connectivity. Compact representation of
relevant information is a major challenge posed by the growth
of social media. The Retrieving Diverse Social Images task at
the MediaEval 2015 challenge [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] addresses this problem in the
domain of images on social media such as Flickr. The goal is to
design a query-based social image retrieval engine, focusing
on obtaining relevant images while covering diverse aspects
of the query, for instance, various sub-topics of the query.
Potential information sources include image attributes as
well as image metadata such as image description, view
count and image rank on social media.
      </p>
      <p>
        Various previous works [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ] have focused on
knowledge-based methods for relevant image selection and/or
clustering-based methods for diversification. The relevance
selection is usually based on image attributes such as the
presence of people [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], image quality [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and similarity to a
standard source of images like Wikipedia [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In this work, we
adopt a combination of supervised and unsupervised schemes
for relevance filtering, followed by clustering for diverse
selection. Through our methods, we show the promise of using
supervised learning methods in addition to existing
knowledge-based methods in such retrieval tasks. In the next
section, we describe our methodology in detail, followed by
the results.
      </p>
    </sec>
    <sec id="sec-2">
      <title>METHODOLOGY DESCRIPTION</title>
      <p>Our system for retrieving diverse social images consists of
two steps: (i) relevance filtering, and (ii) diverse selection.
Relevance filtering removes images that have little or no
relation to the concept of interest, and diverse selection
provides a subset of images that differ from each other. We
provide a detailed description of the two steps below.</p>
    </sec>
    <sec id="sec-3">
      <title>Relevance filtering</title>
      <p>We perform relevance filtering to remove images
unrelated to a concept. The 2015 MediaEval challenge data
provides a set of visual and textual descriptors over 153
concepts for model development and 139 concepts for
evaluation. Given the visual descriptors, textual information and
Flickr metadata, we train several supervised and knowledge-based
filtering schemes. We describe these models below.</p>
      <sec id="sec-3-1">
        <title>Supervised methods</title>
        <p>
          K-nearest neighbor classi er on visual descriptors:
The 2015 MediaEval challenge data set provides a set of
general purpose visual descriptors such as color, texture and
feature information along with a binary label indicating if
an image is relevant/irrelevant to the concept under
consideration [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. We train a K-nearest neighbor (KNN) classifier
on these visual descriptors using these labels. The features
are z-normalized before training, and K is tuned on the
development set using 3-fold cross-validation.
        </p>
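        <p>The KNN step above can be sketched as follows. This is an illustrative reconstruction, not the authors' exact implementation: the choice of k and the majority-vote threshold are assumptions.</p>

```python
import numpy as np

def knn_relevance(train_X, train_y, test_X, k=5):
    """Classify each test image as relevant (1) or irrelevant (0) by
    majority vote among its k nearest z-normalized training descriptors."""
    # z-normalize using the training-set statistics
    mu = train_X.mean(axis=0)
    sigma = train_X.std(axis=0) + 1e-9            # guard against zero variance
    train_Z = (train_X - mu) / sigma
    test_Z = (test_X - mu) / sigma
    preds = []
    for z in test_Z:
        dists = np.linalg.norm(train_Z - z, axis=1)   # Euclidean distances
        nearest = np.argsort(dists)[:k]               # indices of k closest
        frac = train_y[nearest].mean()                # fraction of relevant neighbours
        preds.append(int(frac + 0.5))                 # majority vote (ties count as relevant)
    return np.array(preds)
```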
        <p>Maximum entropy model on textual descriptors: The
textual descriptors are extracted from sources such as the photo
title, the description provided by the author, and the photo tags
on Flickr. We extract features from these sources using the
following steps:
1. Feature standardization: This step is performed to train
a universal model for all the concepts instead of concept-specific
models. We replace any word related to a concept by
a keyword. For instance, if the query is "The great wall of
china", words such as "great wall", "wall of china" and "great
wall china" occurring anywhere in the textual descriptions are
replaced by a single keyword, "Place of interest". The list of
words to be replaced is created based on the query title and
contains various combinations of the words in the query.
2. Feature selection: Given the set of standardized features,
we retain the words within the top 10% of word frequencies.
This step is performed to reduce the feature dimensionality
while training the model.
3. Model training: Given the set of selected features, we
train a maximum entropy model to predict the binary
labels (relevant/irrelevant).</p>
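        <p>The three steps above can be sketched as follows. This is a simplified, hypothetical reconstruction: the phrase list, the keyword string, and the plain gradient-descent trainer for the maximum entropy (logistic regression) model are our assumptions, not the exact pipeline.</p>

```python
import re
from collections import Counter
import numpy as np

def standardize(text, query_phrases, keyword="placeofinterest"):
    """Step 1: replace any query-related phrase by a single shared keyword."""
    for phrase in sorted(query_phrases, key=len, reverse=True):
        text = re.sub(re.escape(phrase), keyword, text, flags=re.IGNORECASE)
    return text

def select_vocab(docs, keep_frac=0.10):
    """Step 2: keep only the top fraction of words by corpus frequency."""
    counts = Counter(word for doc in docs for word in doc.split())
    n_keep = max(1, int(len(counts) * keep_frac))
    return [word for word, _ in counts.most_common(n_keep)]

def train_maxent(X, y, lr=0.5, steps=500):
    """Step 3: a binary maximum entropy model is logistic regression;
    here it is fitted with plain batch gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X.dot(w)))   # predicted relevance probability
        w += lr * X.T.dot(y - p) / len(y)     # gradient ascent on the log-likelihood
    return w
```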
      </sec>
      <sec id="sec-3-2">
        <title>Unsupervised methods</title>
        <p>
          Removal of images with people in focus: Relevant
images typically do not have a person as the subject of focus. We
incorporate this observation by using the facedetect software [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] to
filter out images containing people as the main subjects.
Relevance filtering based on Flickr rank: As a final
relevance filtering scheme, we remove images above a certain
threshold (&gt;200) on Flickr rank. The motivation behind this
scheme is that poorly ranked images are less likely to be
associated with the concept in question.
        </p>
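        <p>A minimal sketch of the two unsupervised filters, assuming each image carries a face-detection flag and a Flickr rank in its metadata (the dictionary key names here are our invention):</p>

```python
import operator

def unsupervised_filter(images, rank_threshold=200):
    """Keep images that neither have a person as the main subject
    nor fall above the Flickr rank threshold."""
    kept = []
    for img in images:
        # flag assumed to come from the facedetect output
        person_in_focus = img.get("face_is_main_subject", False)
        # keep only images at or below the rank threshold
        rank_ok = operator.le(img["flickr_rank"], rank_threshold)
        if rank_ok and not person_in_focus:
            kept.append(img)
    return kept
```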
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Diverse selection</title>
      <p>After obtaining the set of images based on relevance
filtering, we use image clustering for diverse selection. Given
a query size of K^ images, we perform K^-means clustering on
the visual descriptors. We hypothesize that similar images
fall into a single cluster and retain only one image per cluster.
We select the image closest to the cluster centroid as the
cluster representative.</p>
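        <p>The clustering step can be sketched with a small K-means loop (a simplified stand-in for whatever clustering implementation was actually used): partition the descriptors into K^ groups and keep, per cluster, the image closest to the centroid.</p>

```python
import numpy as np

def diverse_select(X, k_hat, iters=20, seed=0):
    """Return indices of one representative image per K-means cluster."""
    rng = np.random.default_rng(seed)
    # initialize centroids from randomly chosen descriptors
    centers = X[rng.choice(len(X), size=k_hat, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k_hat):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    chosen = []
    for c in range(k_hat):
        members = np.where(labels == c)[0]
        if len(members):
            # representative: the image closest to the cluster centroid
            d = np.linalg.norm(X[members] - centers[c], axis=1)
            chosen.append(int(members[d.argmin()]))
    return sorted(chosen)
```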
      <p>In order to compute the selection score for each image, we
use the output of the KNN classifier, the maxent model, and the
distance of the image from its cluster centroid. The score is given by
an unweighted sum of the ratio of relevant images amongst the
closest K images, the maxent output probability of the image
being relevant, and the inverse of the Euclidean distance of the image
from its cluster centroid. The last term is added based on the
assumption that images closer to centroids are more
representative of the cluster. In the next section we present our
results and discussion.</p>
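        <p>The unweighted selection score can be written directly from the description above; the small epsilon guarding against zero distance is our addition:</p>

```python
import numpy as np

def selection_score(knn_relevant_fraction, maxent_prob, descriptor, centroid):
    """Unweighted sum of: fraction of relevant images among the closest K,
    the maxent relevance probability, and the inverse Euclidean distance
    of the image descriptor from its cluster centroid."""
    dist = np.linalg.norm(np.asarray(descriptor) - np.asarray(centroid))
    inv_dist = 1.0 / (dist + 1e-9)   # epsilon avoids division by zero
    return knn_relevant_fraction + maxent_prob + inv_dist
```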
    </sec>
    <sec id="sec-5">
      <title>RESULTS</title>
      <p>In run 1, we only use the relevance filtering model
developed on the visual descriptors (K-nearest neighbor classifier)
and face detection. In run 2, we add filtering using the
maximum entropy model on the textual descriptors. Finally, run 5
uses all the relevance filtering schemes (visual, face
detection, text and Flickr rank based). Note that in all three
runs diverse selection is based on visual descriptors only.
The evaluation metrics are cluster recall (CR) and precision
(P) for the top X ranked images as predicted by the system.
We show CR@X and P@X along with the corresponding
F-score F1@X for X = 5, 10, 20, 30, 40, 50 in Figure 1. All
these outcomes are based on clustering with K^ set to 50. Also,
in the 2015 challenge, separate metrics were reported for
concepts which share images with other concepts
(multi-concept) along with single-concept images. We report the
official CR, P and F1 scores @X=20 for the multi- and
single-concept images in Table 1.</p>
      <p>From the results, we observe that for low values of X the
combined system (visual + face detection + text + Flickr
rank) marginally (although insignificantly) outperforms the
system using only the visual cues. However, the performance
degrades significantly at higher values of X. Note that this
decrease in performance is not due to the additional filtering
schemes performing poorly. Instead, it is due to the fact
that additional filtering reduces the number of data points
available for diverse selection. Therefore we had to reduce
the number of clusters in diverse selection, sometimes to the
extent that our model returned fewer than 50 images. However,
the better performance at lower X (e.g. X = 20 in Table 1)
shows the promise of using additional modalities. In Table 1,
we observe minor improvements in F1@20 after adding each
subsequent relevance filtering scheme. One interesting
observation is that when using Flickr ranks, F1 decreases for
multi-concept images, whereas it increases for single-concept
images. This indicates that Flickr ranks are more reliable in
the case of single-concept images than multi-concept images.
This factor can be considered in future system designs.</p>
    </sec>
    <sec id="sec-6">
      <title>CONCLUSION</title>
      <p>
        In this work, we present a two-stage system for social
image retrieval. In the first stage, we perform relevance
filtering to remove irrelevant images, and in the second stage
we perform diverse selection using clustering in the visual
descriptor space. Our relevance filtering system involves a
combination of supervised and unsupervised methods. In
the future, we can extend the work presented here by
exploring other methods (filtering, clustering) under a similar
system development paradigm. We can also reformulate the
problem as diverse system development, drawing inspiration
from several existing works [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
          ]. Finally,
we would also like to use additional metadata such as Flickr user
credibility [
        <xref ref-type="bibr" rid="ref13 ref5">5, 13</xref>
        ] and other image properties (CNN features) to
further improve our system.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Holly M</given-names>
            <surname>Bik</surname>
          </string-name>
          and
          <string-name>
            <given-names>Miriam C</given-names>
            <surname>Goldstein</surname>
          </string-name>
          .
          <article-title>An introduction to social media for scientists</article-title>
          .
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Sophia B</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>Trends in distributed curatorial technology to manage data deluge in a networked world</article-title>
          .
          <source>The European Journal for the Informatics Professional</source>
          ,
          <volume>11</volume>
          (
          <issue>4</issue>
          ):
          <fpage>18</fpage>
          -
          <lpage>24</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C</given-names>
            <surname>Szongott</surname>
          </string-name>
          ,
          <string-name>
            <surname>Benjamin Henne</surname>
          </string-name>
          , G von Voigt, et al.
          <article-title>Big data privacy issues in public social media</article-title>
          .
          <source>In Digital Ecosystems Technologies (DEST)</source>
          ,
          <year>2012</year>
          6th IEEE International Conference on, pages
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . IEEE,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Duc-Tien</given-names>
            <surname>Dang-Nguyen</surname>
          </string-name>
          , Luca Piras, Giorgio Giacinto, Giulia Boato, and
          <string-name>
            <given-names>F De</given-names>
            <surname>Natale</surname>
          </string-name>
          .
          <article-title>Retrieval of diverse images by pre-filtering and hierarchical clustering</article-title>
          .
          <source>MediaEval Benchmarking Initiative for Multimedia Evaluation</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Bogdan</given-names>
            <surname>Ionescu</surname>
          </string-name>
          , Alexandru Lucian Gînscă, Bogdan Boteanu, Adrian Popescu, Mihai Lupu, and Henning Müller.
          <article-title>Retrieving diverse social images at MediaEval 2015: Challenge, dataset and evaluation</article-title>
          . In MediaEval 2015 Workshop, Wurzen, Germany,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Bogdan</given-names>
            <surname>Ionescu</surname>
          </string-name>
          , Adrian Popescu, Mihai Lupu, Alexandru Lucian Gînscă, and Henning Müller.
          <article-title>Retrieving diverse social images at MediaEval 2014: Challenge, dataset and evaluation</article-title>
          . In MediaEval 2014 Workshop, Barcelona, Spain,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Brodsky</surname>
          </string-name>
          .
          <article-title>Relevant image detection in a camera, recorder, or video streaming device</article-title>
          , April 4, 2006. US Patent App. 11/397,780.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Alexandru Lucian</given-names>
            <surname>Gînscă</surname>
          </string-name>
          , Adrian Popescu, and
          <string-name>
            <given-names>Navid</given-names>
            <surname>Rekabsaz</surname>
          </string-name>
          .
          <article-title>CEA LIST's participation at the MediaEval 2014 Retrieving Diverse Social Images task</article-title>
          .
          <source>In Proceedings of the MediaEval Multimedia Benchmark Workshop</source>
          , CEUR-WS.org, volume
          <volume>1263</volume>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Maia</given-names>
            <surname>Zaharieva</surname>
          </string-name>
          and
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Schwab</surname>
          </string-name>
          .
          <article-title>A unified framework for retrieving diverse social images</article-title>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Robert</given-names>
            <surname>Frischholz</surname>
          </string-name>
          .
          <article-title>The face detection homepage</article-title>
          . https://facedetection.com/.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Rahul</given-names>
            <surname>Gupta</surname>
          </string-name>
          , Kartik Audhkhasi, and
          <string-name>
            <given-names>Shrikanth</given-names>
            <surname>Narayanan</surname>
          </string-name>
          .
          <article-title>A mixture of experts approach towards intelligibility classification of pathological speech</article-title>
          .
          <source>In Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <year>2015</year>
          IEEE International Conference on, pages
          <fpage>1986</fpage>
          -
          <lpage>1990</lpage>
          . IEEE,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Rahul</given-names>
            <surname>Gupta</surname>
          </string-name>
          , Kartik Audhkhasi, and
          <string-name>
            <given-names>Shrikanth</given-names>
            <surname>Narayanan</surname>
          </string-name>
          .
          <article-title>Training ensemble of diverse classifiers on feature subsets</article-title>
          .
          <source>In Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <year>2014</year>
          IEEE International Conference on, pages
          <fpage>2927</fpage>
          -
          <lpage>2931</lpage>
          . IEEE,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Bogdan</given-names>
            <surname>Ionescu</surname>
          </string-name>
          , Adrian Popescu, Mihai Lupu, Alexandru Lucian Gînscă, Bogdan Boteanu, and Henning Müller.
          <article-title>Div150Cred: A social image retrieval result diversification with user tagging credibility dataset</article-title>
          .
          <source>ACM Multimedia Systems-MMSys</source>
          , Portland, Oregon, USA,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>