<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zsombor Paróczi</string-name>
          <email>paroczi@tmit.bme.hu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Máté Kis-Király</string-name>
          <email>kis.kiraly.mate@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bálint Fodor</string-name>
          <email>balint.fodor@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Budapest University of Technology and Economics</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>In this paper we present our contribution to the MediaEval 2015 Retrieving Diverse Social Images Task, which asked participants to provide methods for refining Flickr image retrieval results in order to increase their relevance and diversification. Our approach is based on re-ranking the original result using a precomputed distance matrix and a spectral clustering scheme. We use color-related visual features, text and credibility descriptors to define similarity between images.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>When a potential tourist makes an image search for a
place, she expects to get a diverse and relevant visual result
as a summary of the different views of the location.</p>
      <p>
        In the official challenge (Retrieving Diverse Social Images
at MediaEval 2015) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] a ranked list of location photos
retrieved from Flickr is given, and the task is to refine the
result by providing a set of images that are both relevant and
form a diversified summary. An extended explanation
of the task objectives, the provided dataset and the evaluation
descriptors can be found in the task description paper [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
Diversity means that the images can illustrate different
views of the location at different times of the day/year and
under different weather conditions, creative views, etc. The
utility of the refinement process can be measured using
the precision and diversity metrics [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Our team participated in previous challenges [
        <xref ref-type="bibr" rid="ref6 ref7">7, 6</xref>
        ]; each
year we experimented with a different approach. In 2013
we diversified the initial results using clustering, but
our solution focused on diversification only [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In 2014,
as a new idea, we tried to treat relevance and diversity with the same
importance [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>In our previous approaches to the task we treated the
feature vectors (values calculated from the metrics) as an
N-dimensional continuous space with Euclidean coordinates.
In this year's approach we define a set of hand-crafted
distance matrices with non-Euclidean distances, which
can be used during the clustering.</p>
    </sec>
    <sec id="sec-2">
      <title>2. RUNS</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Run1: Visual based re-ranking</title>
      <p>In the first run participants could use only visual based
descriptors or their own descriptors calculated using only the
images.</p>
      <p>For the first run we use the following approach: step 1 -
calculating the FACE descriptor for each image; step 2 - filtering
the images using the FACE and CN[0] descriptors; step 3 -
creating a distance matrix from color similarity; step 4 -
performing spectral clustering using the distance matrix; step 5 -
creating the new result list using the cluster information.</p>
      <p>
        Our main approach was using color based distances [
        <xref ref-type="bibr" rid="ref1 ref5">1,
5</xref>
        ] and filtering photos with faces [
        <xref ref-type="bibr" rid="ref6 ref7">7, 6</xref>
        ]. We used two of
the descriptors provided by the organizers [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]: Global Color
Moments on HSV Color Space (CM), which represents the first
three central moments of an image's color distribution (mean,
standard deviation and skewness), and Global Color Naming
Histogram (CN), which maps colors to 11 universal color names:
"black", "blue", "brown", "grey", "green", "orange", "pink",
"purple", "red", "white", and "yellow".
      </p>
      <p>
        First we calculated a new descriptor for each image: the
FACE descriptor is the ratio between the area occupied
by the possible face regions of an image and the whole image
area [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Then we used the CN descriptor to filter out predominantly black
images, since dark images tend to have
fewer colors, and their colors are mostly shifted into the gray range
rather than being bright.
      </p>
      <p>In the reordering step we started from the original result.
We did our initial filtering by moving images to the end of
the result list where FACE &gt; 0 or CN[0] &gt; 0.8; the first
value in CN corresponds to the color black.</p>
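      <p>As an illustration only (a sketch, not the authors' exact implementation), the following Python fragment performs this demotion step. It assumes each image record already carries a precomputed FACE ratio and its CN histogram; the dict-based record layout and field names are hypothetical.</p>
      <preformat><![CDATA[
# Hedged sketch of the Run1 pre-filtering step (field names are hypothetical).
# Each item is assumed to be a dict carrying the precomputed descriptors:
#   "face": FACE ratio (face area / image area), "cn": 11-bin color naming histogram.
def demote_faces_and_dark_images(ranked_items, face_thr=0.0, black_thr=0.8):
    """Keep the original order, but move images with detected faces or a
    dominant 'black' bin (CN[0]) to the end of the result list."""
    kept, demoted = [], []
    for item in ranked_items:
        if item["face"] > face_thr or item["cn"][0] > black_thr:
            demoted.append(item)
        else:
            kept.append(item)
    return kept + demoted
]]></preformat>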
      <p>After the preprocessing step we built the distance matrix
F; between any two images A and B the distance was calculated
using the following equation:</p>
      <disp-formula>
        <tex-math><![CDATA[
F_{A,B} = \sum_{i=0}^{10} \bigl| CN_A[i] - CN_B[i] \bigr|
        + \sum_{i=0}^{8} \bigl| s_i \left( CM_A[i] - CM_B[i] \right) \bigr|,
\qquad
s_i =
\begin{cases}
1.5, & \text{where } 0 \le i < 3 \\
5,   & \text{where } 3 \le i < 5 \\
0.5, & \text{where } 5 \le i < 9
\end{cases}
]]></tex-math>
      </disp-formula>
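      <p>Assuming the provided CN descriptors form an N x 11 array and the CM descriptors an N x 9 array, the matrix F could be computed for example as in the following NumPy sketch (the vectorized form is our own illustration, not the authors' code):</p>
      <preformat><![CDATA[
import numpy as np

# s_i weights for the 9 color-moment components, as defined above.
S = np.array([1.5] * 3 + [5.0] * 2 + [0.5] * 4)

def visual_distance_matrix(CN, CM):
    """F[a, b] = sum_i |CN_a[i] - CN_b[i]| + sum_i |s_i * (CM_a[i] - CM_b[i])|."""
    cn_term = np.abs(CN[:, None, :] - CN[None, :, :]).sum(axis=2)
    cm_term = np.abs(S * (CM[:, None, :] - CM[None, :, :])).sum(axis=2)
    return cn_term + cm_term
]]></preformat>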
      <p>
        After the distance matrix was created, we used
unsupervised spectral clustering [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ] to create clusters from
the first 150 images; the target cluster count was 10.
      </p>
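      <p>The paper does not state how the distance matrix is converted into the affinity matrix required by spectral clustering; a common choice is a Gaussian kernel. The following scikit-learn sketch makes that assumption explicit:</p>
      <preformat><![CDATA[
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_top_images(distance_matrix, n_images=150, n_clusters=10):
    """Spectral clustering on a precomputed affinity derived from the distance
    matrix; the Gaussian conversion is an assumption on our part."""
    D = distance_matrix[:n_images, :n_images]
    sigma = D.std() or 1.0                       # guard against a zero spread
    affinity = np.exp(-(D ** 2) / (2 * sigma ** 2))
    model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                               random_state=0)
    return model.fit_predict(affinity)           # one cluster label per image
]]></preformat>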
      <p>The final result was generated by picking the lowest
ranking item from each cluster, appending those to the result
list, then repeating this until all the items are used. The
same clustering and sorting method was used during run2
and run3.</p>
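      <p>A minimal sketch of this round-robin selection, assuming items holds the filtered images in their current rank order and labels the cluster assignment returned by the clustering step:</p>
      <preformat><![CDATA[
from collections import OrderedDict, deque

def diversify_by_clusters(items, labels):
    """Repeatedly take the best-ranked remaining item from each cluster and
    append it to the new result list until every item has been used."""
    buckets = OrderedDict()
    for item, label in zip(items, labels):       # items are in rank order
        buckets.setdefault(label, deque()).append(item)
    result = []
    while buckets:
        for label in list(buckets):
            result.append(buckets[label].popleft())
            if not buckets[label]:               # cluster exhausted
                del buckets[label]
    return result
]]></preformat>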
    </sec>
    <sec id="sec-4">
      <title>2.2 Run2: Text based re-ranking</title>
      <p>The second run was the text based re-ranking, which is
accomplished using the title, tags and description fields of
each image.</p>
      <p>For the second run we use the following approach: step
1 - filtering stop words and special characters; step 2 - creating a
distance matrix from text similarity; step 3 - performing spectral
clustering using the distance matrix; step 4 - using the
cluster information to create the new result list.</p>
      <p>
        As a preprocessing step we executed stop word
filtering. We also removed some special characters (namely:
.,-:;0123456789() @) and HTML specific character sequences
(&amp;amp;, &amp;quot; and everything between &lt; and &gt;), then we
used the remaining text as the input for a simple TF-IDF
calculation [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
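      <p>A rough sketch of this cleaning step and of the document-frequency statistics used below is given here; the stop-word list and the exact regular expressions are placeholders rather than the authors' actual choices:</p>
      <preformat><![CDATA[
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "of", "in", "at", "on"}   # placeholder list

def preprocess(text):
    """Strip HTML entities/tags, punctuation and digits, then drop stop words."""
    text = re.sub(r"&\w+;|<[^>]*>", " ", text)        # &amp;, &quot;, <...> fragments
    text = re.sub(r"[.,\-:;()@0-9]", " ", text)       # special characters and digits
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def document_frequencies(documents):
    """DF_t: the number of documents that contain term t at least once."""
    df = Counter()
    for doc in documents:
        df.update(set(preprocess(doc)))
    return df
]]></preformat>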
      <p>We calculated the distance between images (i.e. their text
fields) A and B in the following manner. We
initialize the distance GA,B to zero and compare A and B
at the term level: every term t occurring in document A
is compared with the terms of document B and vice versa.
If term t is contained in both documents, then GA,B is
not increased. If t is contained in only one document, we
take the document frequency (DFt) into consideration: if
DFt &lt; 5, then it is a rare term and GA,B is increased
by 2; if DFt &gt; DN/4, then it is a common term and GA,B
is increased by 0.1 (where DN is the total number
of documents). If the term is neither common nor rare, then we
add DFt/DN to the distance.</p>
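      <p>Expressed as code, this rule set could look as follows (a sketch that reuses the preprocess and document_frequencies helpers above; the thresholds are the ones stated in the text):</p>
      <preformat><![CDATA[
def text_distance(terms_a, terms_b, df, dn):
    """G_{A,B} for one text field: terms_a/terms_b are preprocessed term lists,
    df the document-frequency counter, dn the total number of documents."""
    a, b = set(terms_a), set(terms_b)
    distance = 0.0
    for t in a.symmetric_difference(b):   # terms contained in only one document
        if df[t] < 5:                     # rare term
            distance += 2.0
        elif df[t] > dn / 4:              # common term
            distance += 0.1
        else:                             # neither rare nor common
            distance += df[t] / dn
    return distance
]]></preformat>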
      <p>Using the three text descriptors we created a weighted
sum of the field distances, where the empirically determined
weights are as follows: title = 1, tags = 2, description = 0.5.
From these GA,B values we created the G distance matrix.</p>
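      <p>The per-field distances can then be combined with the weights above, e.g. as in this sketch built on the text_distance helper (the dict-based image representation, with each field holding a preprocessed term list, is assumed):</p>
      <preformat><![CDATA[
FIELD_WEIGHTS = {"title": 1.0, "tags": 2.0, "description": 0.5}

def combined_text_distance(image_a, image_b, df, dn):
    """Weighted sum of the distances of the three text fields of two images;
    image_a[field] / image_b[field] are assumed to be preprocessed term lists."""
    return sum(weight * text_distance(image_a[field], image_b[field], df, dn)
               for field, weight in FIELD_WEIGHTS.items())
]]></preformat>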
    </sec>
    <sec id="sec-5">
      <title>2.3 Run3: Multimodal re-ranking</title>
      <p>In the third run both visual and textual descriptors could
be used to create the results.</p>
      <p>For the third run we use the following approach: step 1
- creating the distance matrix F (see Section 2.1); step 2
- creating the distance matrix G (see Section 2.2); step 3 -
creating a new distance matrix by combining F and G;
step 4 - performing spectral clustering using the distance matrix;
step 5 - using the cluster information to create the new result
list.</p>
      <p>We used our visual distance matrix F and text distance
matrix G and created a new aggregate matrix H. This
matrix is simply the sum of the corresponding values from
the F and G matrices. We tried different kinds of weighting
methods, but the plain, unweighted matrices supplied the best results on
the development set.</p>
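      <p>In code the aggregation is a plain element-wise sum, for example as below (a sketch; the clustering and re-ranking steps are then applied to H exactly as in run1):</p>
      <preformat><![CDATA[
import numpy as np

def multimodal_distance_matrix(F, G):
    """Run3 aggregate distance: element-wise sum of the visual (F) and
    text (G) distance matrices, without any additional weighting."""
    return np.asarray(F) + np.asarray(G)
]]></preformat>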
    </sec>
    <sec id="sec-6">
      <title>2.4 Run4: Credibility based re-ranking</title>
      <p>
        In the fourth run participants were provided with
credibility descriptors [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Using the original result we filtered out the images of users
whose faceProportion value was more than 1.3 to create the same
effect as we did with the FACE descriptor.</p>
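      <p>A minimal sketch of this user-level filter is shown below (the descriptor access path is hypothetical; the locationSimilarity rule described next follows the same pattern with its own threshold):</p>
      <preformat><![CDATA[
def filter_by_face_proportion(items, threshold=1.3):
    """Drop images whose uploader's faceProportion credibility descriptor
    exceeds the threshold; the remaining order is kept unchanged."""
    return [item for item in items
            if item["user_credibility"]["faceProportion"] <= threshold]
]]></preformat>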
      <p>With the purpose of increasing the diversity we used
the locationSimilarity descriptor: if this value exceeded the
threshold of 3.0, we excluded the image. Despite our simple
approach we had great results on the development set.
</p>
    </sec>
    <sec id="sec-results">
      <title>3. RESULTS</title>
      <p>[Figure 1: F1 scores of the runs (Visual, Text, Vistext, Cred), shown separately for single-concept and multi-concept queries.]</p>
      <p>The 2015 dataset contained 153 location queries (45,375
Flickr photos) as the development set; we used this set to
develop our approach, and all methods and thresholds were
tuned using the whole development set.</p>
      <p>The test set contained 139 queries: 69 one-concept
location queries (20,700 Flickr photos) and 70 multi-concept
queries related to events and states associated with locations
(20,694 Flickr photos). Single-topic queries are basic
formulations such as the name of a location; multi-concept
queries are more complex, being related to events and
states associated with locations (like 'sunset in the city').</p>
      <p>Our results can be seen in Table 3 and the F1 metrics can
be seen in Figure 1; we list the single and multi-concept
based results separately.</p>
    </sec>
    <sec id="sec-7">
      <title>4. CONCLUSION AND FUTURE WORK</title>
      <p>As one can see, the visual information based results are
the best among all the runs. On the development set we
found that the textual information for many images
is missing or does not describe the content very well. It
is not uncommon for an author to give the same textual
information to all of the images in a topic.</p>
      <p>The credibility based descriptors proved to be much
more useful than we initially thought; in the future we
should focus on them to improve the textual and visual
descriptor based results.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Datta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and J. Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>Image retrieval: Ideas, influences, and trends of the new age</article-title>
          .
          <source>ACM Comput. Surv.</source>
          ,
          <volume>40</volume>
          (
          <issue>2</issue>
          ):
          <fpage>5:1</fpage>
          -
          <lpage>5:60</lpage>
          , May
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Ginsca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Boteanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lupu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          .
          <article-title>Retrieving diverse social images at mediaeval 2015: Challenge, dataset and evaluation</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2015 Workshop</source>
          , Wurzen, Germany, September 14-15, CEUR-WS.org,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Jiao</surname>
          </string-name>
          .
          <article-title>Spectral clustering ensemble for image segmentation</article-title>
          .
          <source>In Proceedings of the First ACM/SIGEVO Summit on Genetic and Evolutionary Computation</source>
          ,
          <source>GEC '09</source>
          , pages
          <fpage>415</fpage>
          -
          <lpage>420</lpage>
          , New York, NY, USA,
          <year>2009</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Weiss</surname>
          </string-name>
          .
          <article-title>On spectral clustering: Analysis and an algorithm</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>849</fpage>
          -
          <lpage>856</lpage>
          . MIT Press,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Paramita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanderson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Clough</surname>
          </string-name>
          .
          <article-title>Diversity in photo retrieval: Overview of the imageclefphoto task 2009</article-title>
          .
          <source>In Proceedings of the 10th International Conference on Cross-language Evaluation Forum: Multimedia Experiments, CLEF'09</source>
          , pages
          <fpage>45</fpage>
          -
          <lpage>59</lpage>
          , Berlin, Heidelberg,
          <year>2010</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Paroczi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Fodor</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Szucs</surname>
          </string-name>
          .
          <article-title>Dclab at mediaeval2014 search and hyperlinking task</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2014 Workshop</source>
          , Barcelona, Spain, October 16-17, CEUR-WS.org,
          <source>ISSN 1613-0073</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Szűcs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Paroczi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Vincz</surname>
          </string-name>
          .
          <article-title>Bmemtm at mediaeval 2013 retrieving diverse social images task: Analysis of text and visual information</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2013 Workshop</source>
          , Barcelona, Spain, October 18-19, CEUR-WS.org,
          <source>ISSN 1613-0073</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Taneva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kacimi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          .
          <article-title>Gathering and ranking photos of named entities with high precision, high recall, and diversity</article-title>
          .
          <source>In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM '10</source>
          , pages
          <fpage>431</fpage>
          -
          <lpage>440</lpage>
          , New York, NY, USA,
          <year>2010</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.-B.</given-names>
            <surname>Yeh</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <article-title>Video news retrieval incorporating relevant terms based on distribution of document frequency</article-title>
          .
          <source>In Proceedings of the 9th Pacific Rim Conference on Multimedia: Advances in Multimedia Information Processing, PCM '08</source>
          , pages
          <fpage>583</fpage>
          -
          <lpage>592</lpage>
          , Berlin, Heidelberg,
          <year>2008</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>