<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BMEMTM at MediaEval 2013 Retrieving Diverse Social Images Task: Analysis of Text and Visual Information</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gábor Szűcs</string-name>
          <email>szucs@tmit.bme.hu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zsombor Paróczi</string-name>
          <email>paroczi@tmit.bme.hu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dániel Máté Vincz</string-name>
          <email>dani.vincz@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Telecommunications and Media Informatics, BME</institution>
          ,
          <addr-line>Budapest</addr-line>
          ,
          <country country="HU">Hungary</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Inter-University Centre for Telecommunications and Informatics</institution>
          ,
          <addr-line>H-4028 Kassai út 26., Debrecen</addr-line>
          ,
          <country country="HU">Hungary</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>In this paper, we investigate how visual and textual information can be used to improve the ranking of Flickr photos of famous places. We have developed improved textual features based on standard ones, as well as visual features, e.g. a face feature that measures the relative face area in the images. These heuristic features were used in our solution for the MediaEval 2013 Retrieving Diverse Social Images Task to rerank social photos with respect to two evaluation criteria, precision and diversity.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>2.1 Visual models</title>
      <p>hiercn&lt;N&gt;: Using the CN (Global Color Naming Histogram) descriptor, this algorithm creates N clusters by simple hierarchical clustering with the Euclidean distance function in the 11 dimensions of the descriptor, in order to improve diversity. The algorithm takes […]</p>
      <p>
        <disp-formula id="eq1"><label>(1)</label><tex-math>\dots + \frac{\sum_{k=7}^{9} CM_k(i,j)}{\alpha^{2}}</tex-math></disp-formula>
        where the first 3 CM values are the means, the next 3 are the standard deviations, and the last 3 are the second moments; α is a tuning parameter (in our experiments 30 was the best value on the development set); furthermore
        <disp-formula id="eq2"><label>(2)</label><tex-math>CM_k(i,j) = CM_k(i) - CM_k(j)</tex-math></disp-formula>
      </p>
      <p>clustermodcm&lt;N&gt;: This is a modified version of the clustercm&lt;N&gt; algorithm, which moves certain images back in the queue (as a penalty) by only 3 places, so similar images may still end up close to each other.</p>
      <p>We tested these algorithms on the development set; the
results can be seen in Figure 1. The baseline is the original Flickr
result, and the facehiercn20 algorithm performed best on the F1@10
metric, which is why we chose it for the visual-only run (run1).</p>
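      <p>The clustering step of hiercn&lt;N&gt; can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, average linkage is an assumption (the text only says "simple hierarchic clustering" with Euclidean distance), and the way the clusters are then turned into a ranking is not shown.</p>
      <preformat>
```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hierarchical_clusters(vectors, n_clusters):
    """Naive agglomerative clustering; returns lists of indices into vectors.

    Average linkage is an assumption; the linkage rule is not given in the text.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        total = sum(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > max(n_clusters, 1):
        # Find the two closest clusters and merge the second into the first.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters
```
      </preformat>
      <p>For example, hierarchical_clusters(cn_vectors, 20) would group a list of 11-dimensional CN vectors (here a hypothetical cn_vectors) into 20 clusters.</p>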
    </sec>
    <sec id="sec-2">
      <title>2.2 Textual models</title>
      <p>First, we separated the textual task (for run2) into four subtasks: (1) improving the provided textual models (probabilistic, TF-IDF, Social TF-IDF); (2) assigning a score value to each image for each provided textual model and for each improved model (so an image has 6 score values); (3) calculating the rank score of each image based on the weights of the textual models; (4) calculating the new order of images for each location. The subtasks are described in more detail below.</p>
      <p>1. Rewarding keywords that appear more often in relation to one location may lead to a better result, but a keyword is sometimes embedded in an image tag, e.g. basilica can be found in basilicadisantamariadellasalute and thebasilicaofstmaryofhealth. To handle this problem, our algorithm splits tags written without spaces into a list of keywords (using the positions of the spaces estimated by an inference algorithm), and then assigns new values to the keywords according to their number of occurrences.</p>
      <p>2. Our method calculates an average value for every image based on the number of keywords belonging to the image and the values assigned to those keywords under each of the six textual descriptors, i.e. the probabilistic, TF-IDF and Social TF-IDF models and their improved versions. The method then calculates a score value for every image (under each textual model) as the sum of the maximum value over all keywords related to the image and the logarithm of the previously calculated average value.</p>
      <p>3. We assign weights to the 6 textual models, and our method calculates the weighted average score (final score) for each image.</p>
      <p>4. A higher final score means a better final rank position, so the new ranks (improved order) can be produced for the images.</p>
      <p>We ran many test cases with various weights assigned to both the original and the improved textual models. We found that the improved TF-IDF weighting model gave the best P@10, whereas using only the improved probabilistic model gave the best CR@10 and F1@10.</p>
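      <p>The textual pipeline can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: all function names are hypothetical, the tag splitter uses a plain dictionary-based dynamic program in place of the unspecified inference algorithm, the logarithm is taken as the natural logarithm, and keyword values are assumed to be positive.</p>
      <preformat>
```python
import math


def split_tag(tag, vocabulary):
    """Split a space-less tag into known keywords, or return None if impossible.

    Dynamic programming over prefixes; prefers the split with the fewest words.
    This stands in for the unspecified inference algorithm (an assumption).
    """
    best = [None] * (len(tag) + 1)  # best[i] = keyword list covering tag[:i]
    best[0] = []
    for end in range(1, len(tag) + 1):
        for start in range(end):
            word = tag[start:end]
            if best[start] is not None and word in vocabulary:
                candidate = best[start] + [word]
                if best[end] is None or len(best[end]) > len(candidate):
                    best[end] = candidate
    return best[len(tag)]


def model_score(keyword_values):
    """Score of one image under one model: max value plus log of the average."""
    average = sum(keyword_values) / len(keyword_values)
    return max(keyword_values) + math.log(average)


def final_scores(values_per_model, weights):
    """Weighted average of the per-model scores of each image.

    values_per_model: {image_id: {model_name: [keyword values]}}
    weights: {model_name: weight}
    """
    total = sum(weights.values())
    return {
        image: sum(weights[m] * model_score(v) for m, v in models.items()) / total
        for image, models in values_per_model.items()
    }


def rerank(values_per_model, weights):
    """Image ids ordered by decreasing final score."""
    scores = final_scores(values_per_model, weights)
    return sorted(scores, key=scores.get, reverse=True)
```
      </preformat>
      <p>Setting the weight of a single model to 1 and all others to 0 reproduces the single-model configurations mentioned above.</p>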
    </sec>
    <sec id="sec-3">
      <title>2.3 Combination of visual and textual models</title>
      <p>Our text-based approach ignores the original ordering of the
images, while our visual solution only modifies a predefined
order, so it seemed natural to combine them. In the combination,
the text algorithm is the first phase, and the visual algorithm,
applied to the text-ordered result, is the second phase. Our results on
the development set indicated that this combination is better
(at least on the CR@10 metric) than either of the two original solutions.</p>
    </sec>
    <sec id="sec-4">
      <title>2.4 Human-based approach</title>
      <p>We have implemented a tool that lets a human user sort the
images into clusters and record a binary relevance decision for
each image. After the human's work, an algorithm determines the
order of the images as follows: in each cycle, the most relevant image
in each non-empty cluster is selected (and removed from the
cluster), and the selected images are ordered by Flickr rank. This
cycle is repeated, and the process terminates after the last image.</p>
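      <p>A minimal sketch of this ordering procedure is given below. The function and parameter names are hypothetical, and one detail is an assumption: because relevance is binary, ties for "most relevant" inside a cluster are broken here by Flickr rank.</p>
      <preformat>
```python
def round_robin_order(clusters, relevant, flickr_rank):
    """Order images by repeatedly taking the best image of every non-empty cluster.

    clusters: list of lists of image ids (the human-made clusters)
    relevant: set of image ids judged relevant
    flickr_rank: {image_id: position in the original Flickr result, 1 = first}
    """
    def key(image):
        # Relevant images first; ties broken by Flickr rank (an assumption).
        return (image not in relevant, flickr_rank[image])

    remaining = [sorted(cluster, key=key) for cluster in clusters]
    order = []
    while any(remaining):
        # One cycle: pick and remove the best image of each non-empty cluster.
        picked = [cluster.pop(0) for cluster in remaining if cluster]
        # Images selected in the same cycle are ordered by Flickr rank.
        order.extend(sorted(picked, key=flickr_rank.get))
    return order
```
      </preformat>
      <p>Each cycle contributes at most one image per cluster, so the first positions of the result cover as many different clusters as possible.</p>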
      <p>We did not have enough time to survey the Internet, so the
human-based run (run 4) and the general run (run 5, where everything
is allowed, including data from external sources) were identical
in our contribution, and so were their results.</p>
    </sec>
    <sec id="sec-5">
      <title>3. RESULTS</title>
      <p>Evaluation metrics include precision at the top 10 results (P@10),
cluster recall (CR@10), which measures how many of the existing
clusters are represented in the final refinement and thus captures
diversity, and their harmonic mean, the F1-measure (F1@10).</p>
      <p>0.6754 0.461 0.6469 0.7814 0.6399 0.6981 0.6711 0.6098 0.8936 0.2963 0.4115 0.8163 0.6519 0.5753 0.4922 0.6798 0.6278 0.5734</p>
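      <p>The three metrics can be computed as in the following sketch, which assumes the usual definitions for a single location. The function names and the data layout (a ranked list of image ids, a set of relevant ids and a mapping from relevant ids to ground-truth clusters) are hypothetical.</p>
      <preformat>
```python
def precision_at(ranking, relevant, n=10):
    """Fraction of the top-n positions occupied by relevant images."""
    return sum(1 for image in ranking[:n] if image in relevant) / n


def cluster_recall_at(ranking, cluster_of, n=10):
    """Fraction of ground-truth clusters represented among the top-n images.

    cluster_of maps each relevant image id to its ground-truth cluster label.
    """
    all_clusters = set(cluster_of.values())
    found = {cluster_of[image] for image in ranking[:n] if image in cluster_of}
    return len(found) / len(all_clusters)


def f1_at(precision, recall):
    """Harmonic mean of precision and cluster recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```
      </preformat>
      <p>Reported values would then be averages of these per-location scores over all locations.</p>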
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Ionescu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Menéndez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müller</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Popescu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>Retrieving Diverse Social Images at MediaEval 2013: Objectives, Dataset and Evaluation</article-title>
          .
          <source>MediaEval 2013 Workshop</source>
          , ISSN: 1613-0073, 18-19 October 2013, Barcelona, Spain.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Viola</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2001</year>
          .
          <article-title>Rapid object detection using a boosted cascade of simple features</article-title>
          . In
          <source>Computer Vision and Pattern Recognition. CVPR 2001. Proceedings of the IEEE Computer Society Conference on</source>
          , Vol.
          <volume>1</volume>
          , pp.
          <fpage>I-511</fpage>
          -
          <lpage>I-518</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>