<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UMONS @ MediaEval 2017: Diverse Social Images Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Omar Seddati</string-name>
          <email>omar.seddati@umons.ac.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nada Ben Lhachemi</string-name>
          <email>nada.ben-lhachemi@umons.ac.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stéphane Dupont</string-name>
          <email>stephane.dupont@umons.ac.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Saïd Mahmoudi</string-name>
          <email>said.mahmoudi@umons.ac.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Mons University</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper presents the results of our participation in the MediaEval 2017 Retrieving Diverse Social Images Task. The proposed unsupervised multimodal approach exploits visual and textual information in a way that prioritizes both relevance and diversification. As features, we used a modified version of the RMAC (Regional Maximum Activation of Convolutions) descriptor for visual information and a word2vec-based weighted average for textual information. In order to provide an adaptive unsupervised solution, we combine these features with the DBSCAN (density-based spatial clustering of applications with noise) clustering algorithm. Our system achieved promising results and reached an F1@20 of 0.6554.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Over the past decades, available image collections have grown
consistently thanks to the easily accessible devices that we now
use on a daily basis. These huge multimedia collections have motivated
researchers to look for efficient approaches to image retrieval.
However, most approaches in this field primarily aim to improve
the relevance of the results and commonly
neglect the diversity aspect. The goal of the Retrieving Diverse Social
Images Task [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] is to encourage researchers to propose new
solutions that offer a good relevance-diversity balance. Participants
are provided with several queries and, for each query, up to 300 results
retrieved using the Flickr search engine. Each
participating system is expected to provide a list of up to 50
ranked images per query that are both relevant and diversified. In
addition to the images and the Flickr ranking, several metadata fields
are provided, such as username, credibility, etc. Both visual
information and metadata have been exploited in several ways by the
participants of previous editions of the task [
        <xref ref-type="bibr" rid="ref11 ref13 ref2">2, 11, 13</xref>
        ]. The most
used text-based features are Term Frequency-Inverse Document
Frequency (TF-IDF) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], Latent Dirichlet Allocation (LDA) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and
word embeddings like word2vec [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. For visual information, the
most used features are based on Convolutional Neural Networks (CNNs).
Several clustering algorithms have been explored, such as
k-means [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], X-means [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], agglomerative hierarchical clustering
(AHC) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], etc. In our work, we use a word2vec-based weighted
average as the text-based feature, an improvement of the RMAC
descriptor [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] based on CNN features for visual information, and
DBSCAN [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] as the clustering algorithm.
      </p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH</title>
      <p>
        In this work, we combine visual and/or textual descriptors with the
DBSCAN algorithm at two different stages. In the first stage, we
re-rank the provided list of results in order to remove some irrelevant
images, while in the second stage, we aim to improve diversity.
In our approach, the visual features are based on the work of Tolias et
al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Tolias et al. discarded the fully connected layers of a
pretrained CNN (VGG16) and used the resulting fully convolutional
CNN for feature extraction. Let us assume we have an input image I of
size (W<sub>I</sub> × H<sub>I</sub>); the output feature maps (FMs) then form a 3D tensor
of size C × W × H (where C is the number of channels and (W, H)
the width and height of the FMs). If we represent this 3D tensor as a
set of 2D feature maps X = {X<sub>c</sub>}, c = 1...C, we can compute the
MAC (Maximum Activations of Convolutions) using the following
equation:
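<disp-formula><label>(1)</label><tex-math><![CDATA[f = [f_1 \dots f_c \dots f_C], \quad \text{with } f_c = \max_{x \in X_c} x]]></tex-math></disp-formula>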
In order to compute the RMAC descriptor, Tolias et al. proposed
a simple approach to sample R = {R<sub>i</sub>}, a set of square regions
within X, and compute the MAC for each region. Sum-aggregating
the resulting vectors after l2-normalization yields
the RMAC descriptor (for more details, please refer to the original
paper [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]). In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Gordo et al. proposed two simple modifications
that bring significant improvements to the RMAC representation: 1)
using ResNet101 instead of VGG16; 2) feeding three resolutions of the input
image to the network. The RMAC descriptors are
computed separately for each resolution and l2-normalized. Then, the three vectors are
summed and l2-normalized. In this work, we use ResNet50 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
and the publicly available Torch toolbox [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to extract the RMAC
descriptor with multi-resolution. However, instead of computing
the RMAC descriptor separately for each resolution, we rescale the
output feature maps of the three resolutions to the same
resolution (the highest one) and sum them. We then compute
the RMAC descriptor by sum-aggregation followed by
l2-normalization (more information on the approach can be found
in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]). The RMAC descriptor has the advantages of preserving the
aspect ratio of the inputs and efficiently encoding spatial
information, while the size of the descriptor is independent of the
resolution of the input (it depends instead on the number of channels of
the layer selected for feature extraction, which can be used as a
parameter of the method).
      </p>
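      <p>As an illustration of Equation (1) and of the sum-aggregation step, the following minimal Python sketch computes a MAC vector and an RMAC-style descriptor on a toy tensor. It is not the Torch implementation used in this work: the function names and the explicit list of square regions are assumptions made for the example, and the region sampling scheme of Tolias et al. and the multi-resolution rescaling are not reproduced.</p>
      <preformat><![CDATA[
```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def mac(feature_maps, region=None):
    """Equation (1): per-channel maximum over a stack of 2D feature maps.

    `region` is a hypothetical (top, left, size) square window;
    None means the whole feature map is used.
    """
    out = []
    for fm in feature_maps:
        if region is None:
            rows = fm
        else:
            top, left, size = region
            rows = [row[left:left + size] for row in fm[top:top + size]]
        out.append(max(max(row) for row in rows))
    return out

def rmac(feature_maps, regions):
    """Sum the l2-normalized regional MAC vectors, then l2-normalize the sum."""
    total = [0.0] * len(feature_maps)
    for region in regions:
        for c, v in enumerate(l2_normalize(mac(feature_maps, region))):
            total[c] += v
    return l2_normalize(total)

# Toy tensor with C = 2 channels, each a 2 x 2 feature map.
fms = [
    [[1.0, 3.0], [0.0, 2.0]],
    [[4.0, 0.0], [0.0, 0.0]],
]
global_mac = mac(fms)
descriptor = rmac(fms, [(0, 0, 2), (0, 0, 1)])
```
]]></preformat>
      <p>In this sketch the descriptor length equals the number of channels, whatever the spatial size of the feature maps, which is the property discussed above.</p>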
    </sec>
    <sec id="sec-3">
      <title>EXPERIMENTAL RESULTS</title>
      <p>In this section, we detail the five runs submitted by our team. Then,
we briefly present the results obtained with the proposed approach
on the development and test sets.</p>
      <p>Run 1: In the first run, only visual features may be
used. Since the query is textual, we used the initial Flickr
ranking and assumed that the first three results (top
3) are relevant and can be used to generate a visual representation
of the query. In order to re-rank the initial list, we extract the RMAC
features from each image using three different input scales (where
S is the largest side of the input and S ∈ {550, 800, 1050}). Then, we
perform a first clustering using the DBSCAN algorithm and proceed
as follows: 1. For each cluster, we find the feature vector closest
to the cluster’s center (V<sub>cl</sub>); 2. We select the n clusters that contain
the top 3 images (n ≤ 3); 3. We compute the distance between the centers
of the n clusters and those of the remaining r clusters; for each
of these r clusters, we keep as representative distance the minimal
distance to one of the n clusters and use it to re-rank the list
of results; 4. We remove the clusters at the bottom of the list,
while making sure that we keep enough clusters to retain at least
150 images. This first stage enables us to remove some irrelevant
images. In the second stage, we perform another clustering (DBSCAN)
and sort the clusters using the initial Flickr rank of
the centroid. Then, we select one image per cluster until we obtain
the required number of result images; if the last cluster is reached,
we start again from the beginning. Finally, we group the images that
belong to the same cluster and present the results in cluster
order (based on the rank of the centroids).</p>
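      <p>The second-stage selection of Run 1 can be sketched as follows. This is a minimal illustration rather than the code used for the submitted runs: the function name, the dictionary layout, and the assumption that each cluster's images are already ordered by their initial rank are choices made for the example.</p>
      <preformat><![CDATA[
```python
def diversify(clusters, centroid_rank, k):
    """Pick k images round-robin over clusters, then group the picks by cluster.

    clusters: dict cluster_id -> list of image ids, assumed to be ordered
              by initial rank inside each cluster (assumption of this sketch).
    centroid_rank: dict cluster_id -> initial rank of the cluster's centroid image.
    """
    # Clusters are visited in the order of their centroid's initial rank.
    order = sorted(clusters, key=lambda cid: centroid_rank[cid])
    picked = {cid: [] for cid in order}
    remaining = {cid: list(clusters[cid]) for cid in order}
    count = 0
    # One image per cluster per pass; restart from the first cluster after the last.
    while count < k and any(remaining.values()):
        for cid in order:
            if count == k:
                break
            if remaining[cid]:
                picked[cid].append(remaining[cid].pop(0))
                count += 1
    # Final presentation: picks of the same cluster together, in cluster order.
    return [img for cid in order for img in picked[cid]]

clusters = {"a": [5, 9], "b": [1, 2, 7], "c": [3]}
centroid_rank = {"a": 5, "b": 1, "c": 3}
result = diversify(clusters, centroid_rank, 4)
```
]]></preformat>
      <p>With these toy inputs, four images are selected (two from cluster b, one each from c and a) and returned grouped by cluster.</p>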
      <p>Note: In order to use the DBSCAN algorithm correctly, we must
carefully define the maximum radius ϵ. In our case, for each query,
we compute a vector with n elements, where n is the number of
available results and each element e<sub>i</sub>, i ∈ {1, ..., n}, is the minimal distance
between image i and any other image. Finally, we use the median
of this vector as ϵ, one as the minimum number of points, and the
Manhattan distance as the metric.</p>
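      <p>The choice of ϵ described in the note above can be sketched in a few lines of Python. The helper names are hypothetical and this is not the code used for the runs. The clustering function relies on the fact that, with a minimum of one point, every point is a core point, so DBSCAN reduces to finding the connected components of the graph that links points at distance at most ϵ.</p>
      <preformat><![CDATA[
```python
from statistics import median

def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def adaptive_eps(features):
    """Median, over all images, of the distance to the nearest other image.

    Requires at least two feature vectors.
    """
    nearest = []
    for i, fi in enumerate(features):
        dists = [manhattan(fi, fj) for j, fj in enumerate(features) if j != i]
        nearest.append(min(dists))
    return median(nearest)

def dbscan_min_pts_one(features, eps):
    """Cluster labels of DBSCAN when the minimum number of points is one.

    Every point is then a core point, so clusters are the connected
    components of the eps-neighbourhood graph and no point is noise.
    """
    labels = [-1] * len(features)
    cluster = 0
    for start in range(len(features)):
        if labels[start] != -1:
            continue
        labels[start] = cluster
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(len(features)):
                if labels[j] == -1 and manhattan(features[i], features[j]) <= eps:
                    labels[j] = cluster
                    stack.append(j)
        cluster += 1
    return labels

features = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
eps = adaptive_eps(features)
labels = dbscan_min_pts_one(features, eps)
```
]]></preformat>
      <p>On this toy input the isolated point forms its own cluster, which illustrates why no image is discarded as noise under this parameter setting.</p>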
      <p>
        Run 2: The second run uses the provided word2vec
(dimensionality = 300) semantic vectors for English terms (trained on
Wikipedia). Unlike TF-IDF or LDA, word2vec vectors are not built from
document-level word co-occurrence counts, but they have the advantage of
capturing various kinds of similarity between words (syntactic
and semantic). In order to select the textual information to use, we
examined the devset queries and noticed that tags are more
significant syntactically and semantically than other textual fields (e.g.,
titles and descriptions). For each image, we compute the weighted
average vector representation (as described in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]) based on its tags.
Then, we cluster the images using the DBSCAN algorithm and sort
the clusters by the distance between the query representation
and the representation of the centroid of each cluster. Finally, we
re-rank the images following the same approach as in the last step of
Run 1.
      </p>
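      <p>A weighted average of tag vectors can be sketched as below. This is only an illustration under stated assumptions: the weight a / (a + p(w)) follows the smooth inverse frequency idea of [8], while the value of a, the word probabilities, the normalization by the sum of weights, and the toy two-dimensional vectors are all hypothetical. The common-component removal step described in [8] is not included.</p>
      <preformat><![CDATA[
```python
def weighted_average(tags, vectors, word_prob, a=1e-3):
    """Weighted mean of the vectors of the known tags.

    tags: list of tag strings for one image.
    vectors: dict word -> embedding (list of floats), e.g. word2vec vectors.
    word_prob: dict word -> estimated unigram probability p(w) (hypothetical values).
    a: smoothing constant of the weight a / (a + p(w)) (hypothetical value).
    """
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tag in tags:
        if tag not in vectors:
            continue  # tags without an embedding are skipped
        w = a / (a + word_prob.get(tag, 0.0))
        weight_sum += w
        for d, v in enumerate(vectors[tag]):
            total[d] += w * v
    if weight_sum == 0.0:
        return total  # no known tag: zero vector
    return [t / weight_sum for t in total]

vectors = {"zoo": [1.0, 0.0], "animal": [0.0, 1.0]}
word_prob = {"zoo": 0.001, "animal": 0.001}
image_vec = weighted_average(["animal", "zoo", "unknowntag"], vectors, word_prob)
```
]]></preformat>
      <p>Frequent words receive smaller weights than rare ones, so generic tags contribute less to the image representation.</p>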
      <p>Run 3 &amp; 4: In the third run, we concatenate the RMAC feature
vector of Run 1 with the textual feature vector of Run 2 and follow
the steps of Run 1. In addition, just after the second
clustering, we group the images uploaded by the same user and
make sure that, when picking images for the final ranking, we
choose images from the different user groups of a given cluster. In
Run 4, we follow the same steps as in Run 3, but we use only
the RMAC descriptor as the feature vector, together with the username grouping
technique.</p>
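      <p>One possible reading of the username grouping is sketched below: inside a cluster, images are grouped by uploader and then picked by alternating between user groups. The function name, the data layout, and the exact interleaving order are assumptions made for the example and may differ from the submitted runs.</p>
      <preformat><![CDATA[
```python
def interleave_by_user(cluster_images, uploader):
    """Order the images of one cluster so that consecutive picks alternate users.

    cluster_images: list of image ids, in their current rank order.
    uploader: dict image id -> username.
    """
    # Group images by uploader, keeping first-seen order of users and images.
    groups = {}
    for img in cluster_images:
        groups.setdefault(uploader[img], []).append(img)
    ordered = []
    # Take one image from each user group per pass until all groups are empty.
    while any(groups.values()):
        for imgs in groups.values():
            if imgs:
                ordered.append(imgs.pop(0))
    return ordered

cluster = [10, 11, 12, 13]
uploader = {10: "alice", 11: "alice", 12: "bob", 13: "alice"}
picks = interleave_by_user(cluster, uploader)
```
]]></preformat>
      <p>When the final ranking takes the first images of this ordering, different uploaders are represented before any user is repeated.</p>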
      <p>Run 5: In the fifth run, we first remove stop words from the
queries. Then, we use each query to retrieve 10 images with the Google
image search engine. We extract the RMAC features from these images
and use them as a visual representation of the query, as in Run 1.
Next, we follow the same steps to re-rank the Flickr list. Since
Google image results match the queries better, we can expect better
visual representations, which allows us to use the
RMAC descriptor more efficiently. In addition, as in Run 3 &amp; 4, we use the
grouping-by-username approach to further improve diversity.</p>
      <p>Table 1 (flattened in extraction; values listed in their extracted order, row and column labels not recoverable): 0.5599, 0.5834, 0.5780, 0.5649, 0.5856, 0.5789, 0.5521, 0.5886, 0.5809, 0.6554.</p>
      <p>Note: in order to retrieve enough results from Google image search and
enhance diversity, we built the search string as follows. Let us assume
that we have a query with five words w1, w2, w3, w4, w5; we use the
following for image crawling:
w1 + w2 + w3 + w4 + w5 + w1_w2_w3_w4_w5
For example, if the query is "animal at zoo", the string used for Google
image search is animal + zoo + animal_zoo.</p>
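      <p>The construction of the crawling string can be sketched as follows. The stop-word list and the function name are hypothetical; only the output pattern (the individual words followed by their underscore-joined form) comes from the note above.</p>
      <preformat><![CDATA[
```python
# Hypothetical, deliberately tiny stop-word list used only for this example.
STOP_WORDS = {"a", "an", "at", "in", "of", "on", "the"}

def build_crawl_query(query):
    """Return 'w1 + ... + wn + w1_..._wn' for the query without stop words."""
    words = [w for w in query.lower().split() if w not in STOP_WORDS]
    return " + ".join(words + ["_".join(words)])

crawl_query = build_crawl_query("animal at zoo")
```
]]></preformat>
      <p>For the query "animal at zoo", this produces the string given in the example above.</p>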
      <p>All results are reported in Table 1. As we can see, the approach
based on visual features (Run 1) gives better results than those
obtained when textual features are used (Run 2). This supports
the assumption made about the visual representation of the query
(using the RMAC descriptors of the top 3 images).
The comparison of the results of Runs 3 and 4 shows that using the
tags (with the proposed approach) did not bring significant
improvements over the simple combination of visual
features with username grouping. Finally, using images retrieved
by the Google engine (Run 5) significantly outperforms
Run 3 (visual + textual). This result leads us to reflect on the
effectiveness of the proposed approach based on textual features.
In order to achieve results close to those of Run 5, we should find
a better solution for text analysis, since there is no image query.
Our future developments will mainly focus on exploring different
approaches to improve image retrieval based on metadata.</p>
    </sec>
    <sec id="sec-4">
      <title>CONCLUSION</title>
      <p>In this paper we presented a detailed description of the approach
proposed to address the task of retrieving diverse social images.
The proposed approach achieves promising results and shows the
potential of automatic techniques in improving both precision and
diversity. The comparison of the different runs shows that, contrary
to what we expected, textual information is outperformed by visual
information. This observation raises some questions regarding the
proposed approach and the quality of the provided metadata. We
plan to investigate these questions in more detail and bring new
solutions in our future work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>David M</given-names>
            <surname>Blei</surname>
          </string-name>
          , Andrew Y Ng, and
          <string-name>
            <given-names>Michael I</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Latent Dirichlet allocation</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>3</volume>
          , Jan
          (
          <year>2003</year>
          ),
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Bogdan</given-names>
            <surname>Boteanu</surname>
          </string-name>
          , Ionut Mironica, and
          <string-name>
            <given-names>Bogdan</given-names>
            <surname>Ionescu</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>LAPI @ 2015 Retrieving Diverse Social Images Task: A Pseudo-Relevance Feedback Diversification Perspective</article-title>
          . In MediaEval.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Ronan</given-names>
            <surname>Collobert</surname>
          </string-name>
          , Koray Kavukcuoglu, and
          <string-name>
            <given-names>Clément</given-names>
            <surname>Farabet</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Torch7: A Matlab-like environment for machine learning</article-title>
          .
          <source>In BigLearn, NIPS Workshop.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Ester</surname>
          </string-name>
          , Hans-Peter Kriegel, Jörg Sander, and
          <string-name>
            <given-names>Xiaowei</given-names>
            <surname>Xu</surname>
          </string-name>
          .
          <year>1996</year>
          .
          <article-title>A density-based algorithm for discovering clusters in large spatial databases with noise</article-title>
          .
          <source>In KDD</source>
          , Vol.
          <volume>96</volume>
          .
          <fpage>226</fpage>
          -
          <lpage>231</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Albert</given-names>
            <surname>Gordo</surname>
          </string-name>
          , Jon Almazán, Jerome Revaud, and
          <string-name>
            <given-names>Diane</given-names>
            <surname>Larlus</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>End-to-end learning of deep visual representations for image retrieval</article-title>
          .
          <source>International Journal of Computer Vision</source>
          (
          <year>2016</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          , Xiangyu Zhang, Shaoqing Ren, and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          .
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Omar</given-names>
            <surname>Seddati</surname>
          </string-name>
          , Stéphane Dupont, Saïd Mahmoudi, and
          <string-name>
            <given-names>Mahnaaz</given-names>
            <surname>Pariyaan</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Towards Good Practices for Image Retrieval Based on CNN Features</article-title>
          . In International Conference on Computer Vision Workshop (ICCVW).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Sanjeev</given-names>
            <surname>Arora</surname>
          </string-name>
          , Yingyu Liang, and Tengyu Ma.
          <year>2017</year>
          .
          <article-title>A simple but tough-to-beat baseline for sentence embeddings</article-title>
          .
          <source>In Proceedings of ICLR</source>
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Karen</given-names>
            <surname>Sparck Jones</surname>
          </string-name>
          .
          <year>1972</year>
          .
          <article-title>A statistical interpretation of term specificity and its application in retrieval</article-title>
          .
          <source>Journal of documentation 28</source>
          ,
          <issue>1</issue>
          (
          <year>1972</year>
          ),
          <fpage>11</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Giorgos</given-names>
            <surname>Tolias</surname>
          </string-name>
          , Ronan Sicre, and
          <string-name>
            <given-names>Hervé</given-names>
            <surname>Jégou</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Particular object retrieval with integral max-pooling of CNN activations</article-title>
          .
          <source>arXiv preprint arXiv:1511.05879</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Sabrina</given-names>
            <surname>Tollari</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>UPMC at MediaEval 2016 Retrieving Diverse Social Images Task</article-title>
          . In MediaEval 2016 Workshop.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Kai Chen, Greg Corrado, and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          .
          <source>arXiv preprint arXiv:1301.3781</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Maia</given-names>
            <surname>Zaharieva</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>An Adaptive Clustering Approach for the Diversification of Image Retrieval Results</article-title>
          .
          <source>In MediaEval.</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Maia</given-names>
            <surname>Zaharieva</surname>
          </string-name>
          , Bogdan Ionescu, Alexandru Lucian Gînscă, Rodrygo L.T. Santos, and
          <string-name>
            <given-names>Henning</given-names>
            <surname>Müller</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Retrieving Diverse Social Images at MediaEval 2017: Challenges, Dataset and Evaluation</article-title>
          .
          <source>In Proc. of the MediaEval 2017 Workshop</source>
          , Dublin, Ireland, Sept.
          <fpage>13</fpage>
          -
          <lpage>15</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>