<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MIL at ImageCLEF 2013: Personal Photo Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Masaru Mizuochi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Takayuki Higuchi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chie Kamada</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tatsuya Harada</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Machine Intelligence Laboratory, The University of Tokyo</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <abstract>
        <p>In this paper, we describe our methods for ImageCLEF 2013 Personal Photo Retrieval Task. We devote our attention to making our system efficient in retrieving documents which have the similar topic with few query data. We train a ranking function using rankSVM. We extract Fisher Vectors (FVs) from several local descriptors as visual features, and use Bag-of-Words (BoW) as metadata features. The final similarity score is the weighted average of the visual similarities and the metadata similarities, where the weights are determined by relevance feedback. Results have shown that our system achieves the best performance in the Personal Photo Retrieval Task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>This paper describes our methods for the ImageCLEF 2013 Personal Photo
Retrieval Task. Our objective is to investigate efficient methods for retrieving
documents that have the similar topic with few query data. The dataset is obtained
such that it imitates actual browsing data. It contains query by example(QBE)
and browsing data, though some of them are unavailable. We compare three
methods to measure a visual similarity between a document and a topic: SVM,
rankSVM, and a similarity between a query and the nearest sample. We use the
distance metric like radial basis function (RBF) kernel for a metadata
similarity. We extract Fisher Vectors (FVs) as visual features, and use Bag-of-Words
(BoW) as metadata features. The final similarity score is the weighted average of
the visual similarities from several local descriptors and the metadata similarity,
where the weights are determined by relevance feedback.
Fisher Vectors. As a visual feature, we use Fisher Vectors (FVs) computed
from four kinds of local descriptors: SIFT, C-SIFT, LBP, and GIST. Actually,
GIST is usually used to describe a whole image, but we use it as a local
descriptor. All local descriptors are reduced to 64 dimensions using Principal
Component Analysis (PCA). Local descriptors are densely extracted from five scales of
patches on a regular grid every six pixels and learn a Gaussian Mixture Model
(GMM) with 256 components, which have a diagonal matrix as its covariance
matrix. To use spatial information, we divide images into 1 1, 2 2, and 3 1
cells. Then FVs are calculated over each region as follows.</p>
      <p>Let X = fx1; x2; ; xN g be a set of N local descriptors extracted from an
image, and wi, i, i be the mixture weight, mean vector, covariance matrix of
the i-th Gaussian, respectively. Then we difine,
ui =
vi =</p>
      <p>1 ∑N
N pwi n=1</p>
      <p>1
N p
2wi n=1</p>
      <p>N
∑ n(i) [</p>
      <p>1
n(i) i 2 (xn</p>
      <p>i) ;
i 1diag((xn
i)(xn
i)T )
1] ;
where 1 is a column vector whose components are all 1 and diag(X) for matrix
X is a column vector which is composed of diagonal components of X. n(i) is
the soft assignment of xn to i-th Gaussian as
n(i) =
wiui(xn)
K
j=1wj uj (xn)</p>
      <p>;</p>
      <p>G = [u1T v1T : : : uTK vKT ]T ;
where ui is the i-th Gaussian, and it is also known as the posterior probability.
The FV representation is therefore given as
where K is the number of GMM components.</p>
      <p>
        Following [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], we apply power normalization and L2 normalization to each of
the extracted FVs. Power normalization is done by applying the function,
g(z) = sign(z)jzja;
to each component of FVs, where a is a parameter and is set to 1/2 in this work.
After normalization, we concatenate them into a single vector. The dimension
of our FVs is 262144. In the following sections, we represent visual features as
xv.
2.2
      </p>
      <sec id="sec-1-1">
        <title>Metadata Features</title>
        <p>As metadata features, we use Bag-of-Words (BoW) representation, which is
based on the idea that each word in a text appears independently. BoW is
obtained by counting the occurrence frequency of words in a text. In our method,
each EXIF datum is represented as a component of the histogram separately.
We use 10 Exif data in xml files as metadata features. Each feature and its
dimension are shown in Table 1. We represent metadata features as xm.
This section describes retrieving methods that we applied in the task. The
dataset I for the task is composed of 5,555 images with metadata. Therefore, each
image i 2 I has visual information and meta information. For a topic t, we
represent the query by example(QBE) as qt, and browsing data as Bt = fb1; b2; :::g 2 I
in which bi is browsed later than bj (i &gt; j). QBE is an image which is admitted
as the similar image for the topic. Browsing data is obtained while seeking QBE.
A n-dimensional feature vector of image i is represented as xiv 2 Rn.</p>
        <p>In our approach, we use rankSVM to ascertain the similarity between
visual features, and use the distance metric as the similarity between metadata
features. Finally, we obtain a similarity score of the document by summing the
weighted similarity scores (rankSVM score for visual features + similarity score
for metadata features). We use relevance feedback to obtain the weights for each
score. Relevance is based on the variance between QBE and browsing data. For
a comparison, we also adopt Nearest Neighbor and Support Vector Machine for
visual features.
where d(xv; xiv) is a distance between xv and xiv, e.g. a Euclidean distance.
3.2</p>
      </sec>
      <sec id="sec-1-2">
        <title>Support Vector Machine</title>
        <p>Support Vector Machine (SVM) is a popular learning method that applies a
maximum margin manner for binary classification. stv;i is a similarity score of an
image i for topic t, and calculated as:
For topic t, to train a linear model (wt; ht), SVM needs training data Dt =
f(xiv; yiv;t)gjiI=j1. We choose a label of image i for topic t as:
stv;i = wt xvi</p>
        <p>ht:
yiv;t =
{1</p>
        <p>(i 2 fqtg [ Bt)
1 (otherwise):
Here, SVM is formulated as following optimization problem:</p>
        <p>1
wt;f igjiD=1tj 2 jjwtjj2 + C∑
min
i=1</p>
        <p>i;
jDtj
s.t. yiv;t(wt xiv
ht)
i
1
0:
i (i = 1;
; jDtj);
3.3</p>
      </sec>
      <sec id="sec-1-3">
        <title>RankSVM</title>
        <p>
          To learn a model for each topic from the QBE, browsing data, and other images,
we applied rankSVM proposed by T. Joachims[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] for search engine optimization.
stv;i is a similarity score of an image i for topic t, and calculate as:
stv;i = wt xiv:
RankSVM is a pairwise learning method that applies a large margin principle
used in SVM for ranking learning. The QBE has the strongest presence for the
topic in the given image set. The browsing data have stronger presence than
images except the QBE and other browsing data. Later browsed images are
regarded as higher ranked than earlier ones. Pt is a pair set of topic t, and
defined as:
        </p>
        <p>t t t
Pt = O1 [ O2 [ O3;
O1t = f(i; j)ji = qt; j 2 I
O2t = f(i; j)ji 2 Bt; j 2 I
O3t = f(bi; bj )jbi; bj 2 Bt; i &gt; jg:
fqtgg;
Bt</p>
        <p>fqtgg;
Here, rankSVM is formulated as following optimization problem:
1
wt;f igjiP=t1j 2 jjwtjj2 + C∑
min
i=1</p>
        <p>i;
jPtj
s.t. wt (xjv
xvk)
i
1
0:
j (8(j; k) 2 Pt);
3.4</p>
      </sec>
      <sec id="sec-1-4">
        <title>Distance Metric between Metadata Features</title>
        <p>Similarity between metadata features is calculated by the distance metric of
BoW feature vectors as:
d(xim; xjm) = 1
e jjxim xjmjj2 ;</p>
        <p>1
ci;j = 1 + d(xim; xjm) ;
stm;i =</p>
        <p>∑
k2Bt[qt
ck;i;
where xim and xjm are feature vectors i; j 2 I. d(xim; xjm) is the distance between
two metadata features. In this method, we use 1 as . Here, ci;j is a similarity
between xim and xjm. stm;i is a similarity score of metadata of image i for topic
t, and is obtained by summing the scores with all of the query browsed images.
Then we normalize the scores to zero mean and unit variance.
3.5</p>
      </sec>
      <sec id="sec-1-5">
        <title>Relevance Feedback</title>
        <p>Relevance feedback is a technique of information retrieval that improves a user’s
query and facilitates retrieval of information that a user needs. In this task,
which feature is important for retrieval varies by topics. Therefore, the system
must recognize the criteria for retrieval automatically from the query such as
QBE document and browsing data. We calculate the weights of each feature.
The final similarity score is the weighted average of the visual similarities and
the metadata similarity.</p>
        <p>In this method, the weights are calculated utilizing the process of the
creation of the query data. Each query image is browsed sequentially. The weights
of features are updated by chosen images. For example, the feature of which
variance in query images is small should be important. We expand the weight
formula as:
!ln;tew =
!l;t =</p>
        <p>I
Bl t ;
l
!ln;tew + (1
)
!lo;ltd;
where is the updating weight. !l;t is the weight used in summing all of the
similarity scores. lI is the variance of images about feature l. lBt is the variance
of images in Bt. An initial value of !l;t is 1, and as the element count of Qt
increases by one, !l;t is recalculated.</p>
        <p>Finally, the score of image i is calculated by summing visual feature scores
and metadata feature scores which are weighted with relevance feedback. The
final score stf;iin is calculated as:
where slv;t;i and slm;t;i are the visual and metadata scores of the image i for topic t.
Lv and Lm are all of the visual and metadata features respectively. Nlv and N m
l
represent the number of combinations of the visual features and the metadata
features respectively. In this paper, we set N m = 10. Nlv is 1, 2, 3, and 4. lv
l
and lm are the parameters to change ratio of the weight between visual and
metadata features.
4</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Results</title>
      <p>This section describes details of the comparison of learning methods and the
result of visual feature’s combinations.
4.1</p>
      <sec id="sec-2-1">
        <title>Retrieving method comparison</title>
        <p>
          In our experiment, four kinds of local descriptors were extracted from each
image: SIFT, C-SIFT, LBP, and GIST. Each descriptor was sampled densely on
a regular grids (every six pixels). The dimensionalities of SIFT, C-SIFT, LBP,
and GIST were 128, 384, 1024, and 960 respectively. They were coded into two
state-of-the-art global feature representations (4 2 = 8 visual features in total).
One is FV, as explained in the previous section. First, the mixture model of 256
Gaussians was trained using a standard EM-algorithm. To use spatial
information, FVs were calculated respectively over 1 1, 2 2, and 3 1 cells. Thereby,
we obtained FVs of which the dimensionality was 64 256 8 2 = 262; 144.
The other is Locality-constrained Linear Coding (LLC) [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], which describes each
local descriptor as a linear weighted sum of a few nearest codewords. In our
experiment, 1,024 codewords were generated with the k-means algorithm, and then
each local descriptor was approximated using 5-NN of the descriptor. The
images were divided into 1 1, 1 3 and 3 1 spatial grids differently from FV.
Therefore, the dimensionality was 1024 7 = 7168.
        </p>
        <p>In this experiment, we examined which learning method is effective to train
a ranking function for the task using only visual features. We investigated which
visual feature should be combined to achieve the best performance through the
next experiment. We use NN, SVM, and rankSVM to train the ranking
function. Therefore, we examine 8 3 = 24 experiments in total. We use trec eval
system published for evaluation. We use a ndcg cut 100 value in the evaluation
of average users. The ndcg cut 100 value indicates the rate of compliance with
the top 100 documents of correct data.
This experiment addresses a combination of visual features and metadata
features. First, we evaluated only metadata features for retrieval, then we got a
score of 0:6228. With only visual features, the best score is 0:3861 using FV,
LBP, and rankSVM.</p>
        <p>The combination formula is given as:
where lv and lm are the parameters to change ratio of the weight between visual
and metadata features. We use lv = 3 and lm = 1.</p>
        <p>Finally, we present the top six combinations of four visual features in two
circumstances where 10 metadata features are used and not used respectively. We
obtained the scores shown in Table 3 and Table 4. Without metadata features,
the combination of three descriptors (C-SIFT, LBP, and GIST) got the best
score among them. Additionally when metadata features were combined, the
combination of all of the features (SIFT, C-SIFT, LBP, GIST, and Metadata)
got the best score. We could not use this evaluation system before the submission.
Therefore, we submitted the score of the experiment which uses four features
(C-SIFT, LBP, GIST, and Metadata) we considered the best.
5</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusions</title>
      <p>This paper explained our approach for ImageCLEF 2013 Personal Photo
Retrieval Task. Results of comparative experiments show FVs as useful for visual
representation and BoWs as metadata features. Retrieval is performed using
rankSVM with visual features and the distance metric between metadata
features. Using combinations of FVs from various local descriptors (C-SIFT, LBP,
and GIST) and metadata features, we achieved the top score among all teams.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Joachims</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Optimizing search engines using clickthrough data</article-title>
          .
          <source>In: ACM SIGKDD</source>
          . pp.
          <fpage>133</fpage>
          -
          <lpage>142</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Perronnin</surname>
          </string-name>
          , F., S´anchez, J.,
          <string-name>
            <surname>Mensink</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Improving the fisher kernel for large-scale image classification</article-title>
          .
          <source>In: ECCV</source>
          , pp.
          <fpage>143</fpage>
          -
          <lpage>156</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lv</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gong</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Locality-constrained linear coding for image classification</article-title>
          .
          <source>In: CVPR</source>
          . pp.
          <fpage>3360</fpage>
          -
          <lpage>3367</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>