<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>OHSU @ MediaEval 2015: Adapting Textual Techniques to Multimedia Search</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shiran Dudy</string-name>
          <email>dudy@ohsu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Steven Bedrick</string-name>
          <email>bedricks@ohsu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Spoken Language Understanding, OHSU</institution>
          ,
          <addr-line>Portland, Oregon</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>In this paper, we present the motivation, methods, results, and analysis of our participation in the 2015 MediaEval Retrieving Diverse Social Images Task. This year, we adapted a recently-published technique for result diversification ("Relational Learning-to-Rank" [13]), borrowed from the world of standard document retrieval. As compared to the original work, our version makes certain changes to the ranking and comparison algorithm, and explores a variety of feature combinations specific to an image retrieval context. The key idea behind our technique is a greedy, iterative approach to ranking search results, which attempts to balance relevance with redundancy by comparing candidate results to those already selected by the algorithm. Our approach worked tolerably well on many queries, but there is clearly room for improvement.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Imagine you are in Munich, and it's the time of year when
everybody is talking about Oktoberfest. Being unfamiliar
with this festival, you perform an image search, to try and
find out whether you'd like the event, and to discover what
to expect should you attend. Unfortunately, your results
consist of two hundred very similar images, all of the inside
of beer tents. While certainly relevant, these results only
show a small slice of what Oktoberfest is about: where are
the parades, the concerts, the fairgrounds? A more diverse
set of search results would have been much more useful in
this situation.</p>
      <p>
        The Retrieving Diverse Social Images task at the 2015
MediaEval workshop required participants to provide the most
diverse and relevant images given a search query like
\Oktoberfest." The organizers provided a detailed task
description along with data set for development and evaluation,
described fully in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Our team chose to adapt a recent
technique for search result diversification [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] that adapts
\traditional" learning-to-rank methods [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to incorporate
diversity into its loss function.
      </p>
      <p>
        Search result diversification is a very active area of research
in information retrieval. The general problem of identifying
an optimal ranking that balances both relevance and diversity
has been shown to be NP-complete [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which
means that most techniques rely on approximations of one
kind or another. One common family of approximations
descends from the greedy, iterative Maximal Marginal
Relevance (MMR) approach [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], in which each successive document
is chosen based on its similarity to the user's query and its
dissimilarity to the set of already-chosen documents.
      </p>
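      <p>To make the greedy MMR-style selection concrete, here is a
minimal Python sketch of the approach described in [3]. The
relevance and similarity scoring functions, the trade-off weight
lam, and the cutoff k are hypothetical stand-ins, not part of the
original work.</p>
      <preformat>
# Minimal sketch of greedy MMR-style selection: each pick balances
# relevance against redundancy with the already-chosen set.
def mmr_select(candidates, relevance, similarity, lam=0.7, k=20):
    selected, remaining = [], list(candidates)
    for _ in range(min(k, len(remaining))):
        def score(d):
            # redundancy = similarity to the closest already-chosen doc
            redundancy = max((similarity(d, s) for s in selected), default=0.0)
            return lam * relevance(d) - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
</preformat>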
      <p>
        Another family of approaches directly models attributes of
the query and of the documents, and then identifies subsets
of results that are representative of different combinations
of attributes. For example, Agrawal et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] use a
taxonomy to model documents and queries, and identify a set of
results that thoroughly covers the taxonomy entries represented
by the retrieved documents.
      </p>
      <p>
        The approach our group used in this year's MediaEval task
fuses elements of both families of techniques. It is based
on a paper by Zhu et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] that describes an extension of
Learning-to-Rank (LtR). Traditional LtR consists of
learning a ranking function that attempts to assign a rank to a
particular document given a particular query. Zhu et al.'s
extension, "Relational Learning-to-Rank" (R-LtR), models
result ranking as a sequential selection process, and their
formulation incorporates knowledge about not only the
document in question and the query, but also the set of
documents that have already been selected.
      </p>
    </sec>
    <sec id="sec-2">
      <title>3 Methodology</title>
      <p>
        For a complete description of R-LtR, we refer the reader to
the original paper [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. In brief, R-LtR is an iterative
scoring method that takes into account both the relevance of a
textual document and information about how similar it is
to documents that have already been chosen. The
algorithm represents documents as arbitrary feature vectors.
Each successive document is scored against the documents
that have already been chosen according to the following
scoring function (equation 2 in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]):
      </p>
      <p>f_s(x_i, R_i) = w_r^T x_i + w_d^T h_S(R_i),   ∀ x_i ∈ X \ S   (1)</p>
      <p>This scoring function combines information on relevance
and diversity given the candidate document x_i (represented
as a k-dimensional feature vector) and its "diversity matrix"
R_i. This matrix is actually a "slice" of a three-way tensor
mapping documents to documents along features; each value
R_{i,j,k} represents the relationship between documents i and j
in terms of feature k. For example, if we were to use the
Jaccard similarity metric as our first feature, R_{i,j,1} would
consist of the Jaccard similarity of documents x_i and x_j.
This formulation allows us to combine entirely arbitrary
features and relational functions.</p>
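      <p>As an illustration of the preceding paragraph, the following
sketch builds one slice R_i from a list of relational functions,
with Jaccard similarity over tag sets as the first feature. The
helper names and inputs are ours, not from [13].</p>
      <preformat>
import numpy as np

def jaccard(a, b):
    # Jaccard similarity between two tag sets
    a, b = set(a), set(b)
    return len(a.intersection(b)) / len(a.union(b)) if (a or b) else 0.0

def build_R_i(candidate_tags, selected_tags_list, relational_fns):
    # R_i[j, k] relates the candidate to already-selected document j
    # under relational function k (Jaccard here is feature k = 0).
    return np.array([[fn(candidate_tags, tags) for fn in relational_fns]
                     for tags in selected_tags_list])
</preformat>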
      <p>Note further that R_i in equation 1 is defined as including
all documents x_j ∈ S, where S is the set of documents that
have already been chosen out of the set of all possible
documents, X. In other words, R_i contains information relating
document x_i to the already-selected documents, while X \ S is
the remaining set of not-yet-selected documents. h_S(R_i) is a
relational function comparing document x_i to the entire set of
documents in S; Zhu et al. propose several different methods of
combining the data stored in R_i: taking the minimal distance
(i.e., for each feature k, taking the minimum over x_j ∈ S of
R_{i,j,k}), averaging, or taking the maximum distance.</p>
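      <p>Under these definitions, equation 1 reduces to a few lines of
code. The sketch below assumes the minimum-distance aggregation
for h_S; averaging or taking the maximum are drop-in replacements.</p>
      <preformat>
import numpy as np

def f_s(x_i, R_i, w_r, w_d):
    # x_i: k-dim relevance feature vector for the candidate
    # R_i: |S| x m matrix relating the candidate to each chosen doc
    h_S = R_i.min(axis=0)          # collapse relations to one m-vector
    return w_r @ x_i + w_d @ h_S   # relevance term + diversity term
</preformat>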
      <p>
        Finally, w_r and w_d are weight vectors corresponding to
the relative weights of relevance and diversity, respectively.
Equation 1 is used for prediction (i.e., scoring); Zhu et al.
outline a training process that uses stochastic gradient
descent to learn values for w_r and w_d. For reasons of
space, we will not discuss training in this paper, and refer
the reader to the full description in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Note that, unlike
Zhu et al., in this year's task we were given results
already sorted in terms of "relevance" (according to Flickr's
search engine). As such, we were able to simplify the
algorithm described by Zhu et al., using this existing
relevance information instead of computing our own from
scratch.
      </p>
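      <p>Putting these pieces together, prediction becomes a greedy
loop. The sketch below (reusing the f_s sketch above) reflects our
simplification: candidates arrive pre-sorted by Flickr's relevance,
so the top result seeds S and each subsequent step re-scores the
remainder. The feature_vec and relation_slice helpers, which
produce x_i and R_i for a candidate image, are hypothetical.</p>
      <preformat>
def rerank(images, w_r, w_d, feature_vec, relation_slice, k=50):
    selected = [images[0]]          # seed with the most relevant result
    remaining = list(images[1:])
    while remaining and len(selected) != k:
        best = max(remaining,
                   key=lambda img: f_s(feature_vec(img),
                                       relation_slice(img, selected),
                                       w_r, w_d))
        selected.append(best)
        remaining.remove(best)
    return selected
</preformat>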
      <p>
        In order to adapt this algorithm to the image search
domain, we identified combinations of features and
appropriate distance metrics based on the shared task data. We
represented "textual" information by transforming each
image's "tags" and "description" fields into a tf-idf-weighted
bag-of-words representation, which we then processed using
Latent Semantic Analysis (LSA) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to reduce its
dimensionality. We also performed Latent Dirichlet Allocation
(LDA) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] on the tag/description data, in order to attempt
to represent topic groups within the results. We computed
similarity for these features using L2 (Euclidean) distance;
both feature sets were computed using the Gensim package
(http://radimrehurek.com/gensim/).
      </p>
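      <p>A condensed sketch of this textual pipeline, using standard
Gensim APIs; the loader and the topic counts shown are
illustrative assumptions, not the values we tuned.</p>
      <preformat>
from gensim import corpora, models

raw_docs = load_text_per_image()   # hypothetical loader: tags + description per image
texts = [doc.lower().split() for doc in raw_docs]
dictionary = corpora.Dictionary(texts)
bows = [dictionary.doc2bow(t) for t in texts]

# tf-idf weighting, then LSA to reduce dimensionality
tfidf = models.TfidfModel(bows)
lsa = models.LsiModel(tfidf[bows], id2word=dictionary, num_topics=100)
lsa_vecs = [lsa[tfidf[b]] for b in bows]

# LDA over the raw bag-of-words counts, to capture topic groups
lda = models.LdaModel(bows, id2word=dictionary, num_topics=20)
lda_vecs = [lda[b] for b in bows]
</preformat>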
      <p>
        In addition to the textual features, we utilized several of
the visual features provided by the shared task. Along with
their distance metrics, we used \csd" (L2) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], \hog"
(Bhattacharyya distance) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], \cn" (L2) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], \cm" (Canberra
distance) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], \lbp" ( 2) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and \glr" (L1 Manhattan) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
All features were normalized such that larger values for the
distance functions represented higher degrees of diversity
(for the values in R). For our run including user credibility
data, we included "visualScore", "faceProportion",
"tagSpecificity", "uniqueTags", and "locationSimilarity".
      </p>
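      <p>For reference, minimal implementations of the distance
functions named above, as we understood them; descriptors are
assumed to be nonnegative numpy arrays and, where appropriate,
normalized to sum to one.</p>
      <preformat>
import numpy as np

def l2(a, b):        # Euclidean (csd, cn, LSA/LDA features)
    return np.linalg.norm(a - b)

def l1(a, b):        # Manhattan (glr)
    return np.abs(a - b).sum()

def canberra(a, b):  # cm
    return (np.abs(a - b) / (np.abs(a) + np.abs(b) + 1e-12)).sum()

def chi_square(a, b):  # lbp
    return 0.5 * (((a - b) ** 2) / (a + b + 1e-12)).sum()

def bhattacharyya(a, b):  # hog; a, b treated as discrete distributions
    return -np.log(np.sqrt(a * b).sum() + 1e-12)
</preformat>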
    </sec>
    <sec id="sec-3">
      <title>4 Submitted Runs</title>
      <p>We trained four different models. The first, run 1, used only
image (visual) features. Run 2 used the textual features
described above (LSA and LDA on descriptions and tags). Runs 4
and 5 combined both image and textual features with
user credibility information. The textual features remained
the same across runs; runs 4 and 5 experimented with using
global image features (i.e., calculated on the entire image)
versus features computed locally on image quadrants.</p>
    </sec>
    <sec id="sec-4">
      <title>5 Results &amp; Discussion</title>
      <p>Our results are summarized in Table 1. Our visual-feature-only
run (run 1) outperformed our text-feature-only run (run 2) in
terms of Cluster Recall @ 20 but, interestingly, not in terms of
Precision @ 20. Incorporating textual and user information
(run 4) did not seem to substantially alter our results.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Cluster Recall @ 20 ("all" queries) for our submitted runs.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Run</th><th>CR@20</th></tr>
          </thead>
          <tbody>
            <tr><td>run 1</td><td>0.46</td></tr>
            <tr><td>run 2</td><td>0.42</td></tr>
            <tr><td>run 4</td><td>0.46</td></tr>
            <tr><td>run 5</td><td>0.41</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Our adaptation of R-LtR to an image retrieval task shows
that this approach to result diversification can work with a
wide variety of features and distance metrics. Our results
are promising, though clearly much work remains to be done
in terms of feature engineering and parameter tuning. We
also hope to extend the algorithm to include more adaptable
feature weight vectors, enabling the system to give different
weight to certain feature subsets (e.g., textual or visual)
depending on query or image characteristics. R-LtR is a
flexible and powerful platform from which to begin such
experiments.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gollapudi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Halverson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ieong</surname>
          </string-name>
          .
          <article-title>Diversifying search results</article-title>
          .
          <source>In WSDM '09: Proceedings of the Second ACM International Conference on Web Search and Data Mining</source>
          , pages
          <fpage>5</fpage>
          –
          <lpage>14</lpage>
          , New York, New York, USA, Feb.
          <year>2009</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          ,
          <volume>3</volume>
          :
          <fpage>993</fpage>
          –
          <lpage>1022</lpage>
          , Mar.
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Carbonell</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Goldstein</surname>
          </string-name>
          .
          <article-title>The use of MMR, diversity-based reranking for reordering documents and producing summaries</article-title>
          .
          <source>In SIGIR '98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>335</fpage>
          –
          <lpage>336</lpage>
          , New York, New York, USA, Aug.
          <year>1998</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Deerwester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Dumais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. K.</given-names>
            <surname>Landauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. W.</given-names>
            <surname>Furnas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Harshman</surname>
          </string-name>
          .
          <article-title>Indexing by latent semantic analysis</article-title>
          .
          <source>JASIS</source>
          ,
          <volume>41</volume>
          (
          <issue>6</issue>
          ):
          <fpage>391</fpage>
          –
          <lpage>407</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.-C.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. W.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Yeung</surname>
          </string-name>
          .
          <article-title>Content-based image retrieval using color moment and Gabor texture feature</article-title>
          .
          <source>In 2010 International Conference on Machine Learning and Cybernetics (ICMLC)</source>
          , volume
          <volume>2</volume>
          , pages
          <fpage>719</fpage>
          –
          <lpage>724</lpage>
          . IEEE,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Gînscă</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Boteanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lupu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          .
          <article-title>Retrieving diverse social images at MediaEval 2015: Challenge, dataset and evaluation</article-title>
          .
          <source>In MediaEval 2015 Workshop</source>
          , Wurzen, Germany,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H. Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. K.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y. H.</given-names>
            <surname>Ha</surname>
          </string-name>
          .
          <article-title>Spatial color descriptor for image retrieval and video segmentation</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ):
          <fpage>358</fpage>
          –
          <lpage>367</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>Learning to rank for information retrieval</article-title>
          .
          <source>Springer</source>
          , New York, 1st edition,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Selvarajah</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Kodituwakku</surname>
          </string-name>
          .
          <article-title>Analysis and comparison of texture features for content based image retrieval</article-title>
          .
          <source>International Journal of Latest Trends in Computing</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ),
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sikora</surname>
          </string-name>
          .
          <article-title>The MPEG-7 visual standard for content description - an overview</article-title>
          .
          <source>IEEE Transactions on Circuits and Systems for Video Technology</source>
          ,
          <volume>11</volume>
          (
          <issue>6</issue>
          ):
          <fpage>696</fpage>
          –
          <lpage>702</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yin</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Collins</surname>
          </string-name>
          .
          <article-title>Object tracking and detection after occlusion via numerical hybrid local and global mode-seeking</article-title>
          .
          <source>In 2008 IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pages
          <fpage>1</fpage>
          –
          <lpage>8</lpage>
          , June
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <article-title>Boosting local binary pattern (LBP)-based face recognition</article-title>
          .
          <source>In Advances in biometric person authentication</source>
          , pages
          <fpage>179</fpage>
          –
          <lpage>186</lpage>
          . Springer,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Cheng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Niu</surname>
          </string-name>
          .
          <article-title>Learning for search result diversification</article-title>
          .
          <source>In Proceedings of the 37th international ACM SIGIR conference on Research &amp; development in information retrieval</source>
          , pages
          <fpage>293</fpage>
          –
          <lpage>302</lpage>
          . ACM,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>