      UNIMORE at ImageCLEF 2013: Scalable
          Concept Image Annotation

    Costantino Grana1 , Giuseppe Serra1 , Marco Manfredi1 , Rita Cucchiara1 ,
                Riccardo Martoglia2 , and Federica Mandreoli2
           1
               University of Modena and Reggio Emilia - DIEF Department
           2
               University of Modena and Reggio Emilia - FIM Department
                              @unimore.it



       Abstract. In this paper we propose a large-scale image annotation system for the Scalable Concept Image Annotation task. For each concept to be detected, a separate classifier is built using the provided textual annotation. Images are represented as a multivariate Gaussian distribution of a set of local features extracted over a dense regular grid. Textual analysis of the web pages containing the training images is performed to retrieve a relevant set of samples for learning each concept classifier. An online SVM solver based on Stochastic Gradient Descent is used to manage the large amount of training data. Experimental results show that the combination of different kinds of local features encoded with our strategy achieves very competitive performance both in terms of mAP and mean F-measure.

       Keywords: image retrieval, image classification, multi-class, multi-label,
       stochastic gradient descent


1    Introduction
The University of Modena and Reggio Emilia group (UNIMORE) participated in ImageCLEF 2013 [1], in the Scalable Concept Image Annotation Task [2], and identified two possible strategies to attack the problem: finding images similar to the query and extracting the image concepts from them, leveraging the provided textual annotation; or directly using the textual annotation to roughly annotate the training set and then building, for every concept, a classifier applicable to the query.
     The first approach is the organizers' baseline, while for the second we annotated every training image with a concept if the concept word was found in the scofeats file, and then used the provided BoW CSIFT features to build a classifier for each concept. This second strategy largely outperformed the baseline, so we expanded it further. In our experiments we aimed at improving the features and the initial textual annotation. Instead of relying on the BoW model, we propose to describe the local features as a Multivariate Gaussian Distribution, which employs a full-rank covariance matrix, thus leading to a large feature vector (for a 128-dimensional SIFT descriptor, an 8,384-dimensional vector for each spatial pyramid region). We partitioned the image into 1×1, 2×2, and 1×3 regions, for a total of 8 spatial regions and a 67,072-dimensional vector, which becomes 201,216-dimensional for color-based SIFT descriptors.
    For the textual part, stopword removal and stemming are performed on the scofeats file, then the titles of the original web pages are extracted and parsed. Moreover, we built a context from the WordNet definition, in order to detect whether different words fall in the same concept context. Finally, a negative context is similarly built from the other senses of the same word.
    We used a linear SVM classifier for each concept, built using an online Stochastic Gradient Descent technique, which allowed us to provide one example at a time to the algorithm and thus to train within our memory limit (6 biprocessor Xeon machines with 32 GB of RAM each). Parallelization was achieved by separately training the classifiers on every machine on chunks of data read from disk. A late fusion averaging approach is used in our best run (UNIMORE 5 test) with the HSVSIFT, OPPONENTSIFT, RGBSIFT, and SIFT features. We further improved the training set by querying about 100k images from Google
Images with the concept name. We managed to be the best group in terms of
mAP-samples, the second in terms of MF-concepts and the third in terms of
MF-samples.
    In Section 2 we describe the feature summarization approach, while in Section 3 the textual annotation processing method is presented. In Section 4 the Stochastic Gradient Descent is briefly summarized and our modifications are highlighted. Finally, Section 5 describes the submitted runs in detail and reports the
performance obtained on both the development and the test sets.


2   Visual Information Processing
For an image W , we first extract features by dense sampling on a regular grid. Let F = {f1 . . . fN } be the set of local features (e.g. SIFT descriptors, where d = 128) in W (or in a sub-region of W , when Spatial Pyramid Matching is used); we describe them with a multivariate Gaussian distribution, supposing that they are normally distributed. The multivariate Gaussian distribution of a set of d-dimensional vectors F is given by
                                      1            1      T   −1
                   N (f ; m, C) =           1   e− 2 (f −m) C      (f −m)
                                                                            ,    (1)
                                    |2πC|   2


where $|\cdot|$ is the determinant, $\mathbf{m}$ is the mean vector and $\mathbf{C}$ is the covariance matrix ($\mathbf{f}, \mathbf{m} \in \mathbb{R}^d$ and $\mathbf{C} \in \mathcal{S}_{++}^{d\times d}$, with $\mathcal{S}_{++}^{d\times d}$ the space of real symmetric positive definite matrices) defined as follows:
$$\mathbf{m} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{f}_i, \qquad (2)$$

$$\mathbf{C} = \frac{1}{N-1}\sum_{i=1}^{N}(\mathbf{f}_i-\mathbf{m})(\mathbf{f}_i-\mathbf{m})^{T}. \qquad (3)$$
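As a concrete illustration of Eqs. 2 and 3, a minimal NumPy sketch (our own, assuming the dense local descriptors of a region are already collected in an N × d array) is:

```python
import numpy as np

def gaussian_parameters(F):
    """Estimate the multivariate Gaussian of a set of local descriptors.

    F: (N, d) array of local descriptors (e.g. d = 128 for SIFT).
    Returns the mean vector m (Eq. 2) and the unbiased covariance C (Eq. 3).
    """
    m = F.mean(axis=0)                      # Eq. 2
    X = F - m
    C = X.T @ X / (F.shape[0] - 1)          # Eq. 3
    return m, C
```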
[Figure: pipeline diagram — input image → dense descriptor extraction → multivariate Gaussian mapping (mean, size d; covariance matrix, size d×d) → projection of the covariance matrix onto a Euclidean space → concatenation → linear classifier.]

                       Fig. 1. Overview of visual information processing.


     The covariance matrix encodes information about the variance of the features and their correlation. Although it is very informative, it does not lie in a vector space, since the covariance space is not closed under multiplication by a negative scalar: in fact, it lies on a Riemannian manifold. Most common machine learning algorithms assume that the data points form a vector space, therefore a suitable transformation is required prior to their use. Since the covariance matrix is symmetric positive definite, we adopt the Log-Euclidean metric. The basic idea of the Log-Euclidean metric is to construct an equivalence between the Riemannian manifold and the vector space of symmetric matrices.
     An approach to map from Riemannian manifolds to Euclidean spaces is introduced in [3] and used in [4]. The first step is the projection of the covariance matrices onto a Euclidean space tangent to the Riemannian manifold at a specific tangency matrix P. The second step is the extraction of the orthonormal coordinates of the projected vector. In the following, matrices (points on the Riemannian manifold) will be denoted by bold uppercase letters, while vectors (points in the Euclidean space) by bold lowercase ones.
     More formally, the projected vector of a covariance matrix C is given by:
$$\mathbf{t}_{\mathbf{C}} = \log_{\mathbf{P}}(\mathbf{C}) = \mathbf{P}^{\frac{1}{2}}\, \log\!\left(\mathbf{P}^{-\frac{1}{2}}\,\mathbf{C}\,\mathbf{P}^{-\frac{1}{2}}\right)\mathbf{P}^{\frac{1}{2}} \qquad (4)$$
where $\log$ is the matrix logarithm operator and $\log_{\mathbf{P}}$ is the manifold-specific logarithm operator, dependent on the point $\mathbf{P}$ to which the projection hyperplane is tangent. The matrix logarithm of a matrix $\mathbf{C}$ can be computed by eigenvalue decomposition ($\mathbf{C} = \mathbf{U}\mathbf{D}\mathbf{U}^{T}$); it is given by:
$$\log(\mathbf{C}) = \sum_{k=1}^{\infty}\frac{(-1)^{k-1}}{k}\,(\mathbf{C}-\mathbf{I})^{k} = \mathbf{U}\,\log(\mathbf{D})\,\mathbf{U}^{T}. \qquad (5)$$
   The orthonormal coordinates of the projected vector tC in the tangent space
at point P are then given by the vector operator:
$$\mathrm{vec}_{\mathbf{P}}(\mathbf{t}_{\mathbf{C}}) = \mathrm{vec}_{\mathbf{I}}\!\left(\mathbf{P}^{-\frac{1}{2}}\,\mathbf{t}_{\mathbf{C}}\,\mathbf{P}^{-\frac{1}{2}}\right) \qquad (6)$$
where $\mathbf{I}$ is the identity matrix, while the vector operator on the tangent space at the identity of a symmetric matrix $\mathbf{Y}$ is defined as:
$$\mathrm{vec}_{\mathbf{I}}(\mathbf{Y}) = \left[\, y_{1,1}\ \ \sqrt{2}\,y_{1,2}\ \ \sqrt{2}\,y_{1,3}\ \ \ldots\ \ y_{2,2}\ \ \sqrt{2}\,y_{2,3}\ \ \ldots\ \ y_{d,d} \,\right]. \qquad (7)$$

   Substituting tC from Eq. 4 in Eq. 6, the projection of C on the hyperplane
tangent to P becomes
$$\mathbf{c} = \mathrm{vec}_{\mathbf{I}}\!\left(\log\!\left(\mathbf{P}^{-\frac{1}{2}}\,\mathbf{C}\,\mathbf{P}^{-\frac{1}{2}}\right)\right). \qquad (8)$$

Thus, after selecting an appropriate projection origin, every covariance matrix is projected to a Euclidean space. Since the projected matrix is symmetric of size $d \times d$, a $(d^2+d)/2$-dimensional feature vector c is obtained.
    As observed in [5], the projection point P is arbitrary and, even if it could
influence the performance (distortion) of the projection, from a computational
point of view, the best choice is the identity matrix, which simply translates the
mapping into a standard matrix logarithm.
    In short, our method (see Fig. 1) is to extract local descriptors from an image
and then collect them in a spatial pyramid; each sub-region is described by a
multivariate Gaussian distribution (MGD). The covariance matrix is projected
on a Euclidean space and concatenated to the mean vector to obtain the final
descriptor (in the case of SIFT descriptors, the dimensionality becomes 8384 per
sub-region). We empirically observe that most of the values in the concatenated descriptor are low, while only a few are high. In order to distribute the values more evenly, we adopt the power normalization method proposed by Perronnin et al. [6], i.e. we apply to each dimension the function

$$f(x) = \mathrm{sign}(x)\,|x|^{\alpha} \quad \text{with } \alpha = 0.5. \qquad (9)$$

Finally, the concatenated descriptors are fed to a linear classifier. For a more detailed analysis of the proposed multivariate Gaussian descriptor see [7].
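To make the whole pipeline concrete, the following Python sketch (our own illustration, not the code actually used for the runs; the small regularizer eps added to keep the covariance positive definite is an assumption) summarizes one sub-region:

```python
import numpy as np

def vec_I(Y):
    """Vector operator of Eq. 7: upper triangle of a symmetric matrix,
    diagonal entries as-is, off-diagonal entries scaled by sqrt(2)."""
    rows, cols = np.triu_indices(Y.shape[0])
    v = Y[rows, cols].copy()
    v[rows != cols] *= np.sqrt(2.0)
    return v

def mgd_descriptor(F, eps=1e-6, alpha=0.5):
    """Multivariate Gaussian descriptor of the local features F (N, d) of one region.

    With the identity as projection point, the mapping of Eq. 8 reduces to a
    plain matrix logarithm, computed here by eigendecomposition (Eq. 5).
    """
    m = F.mean(axis=0)
    X = F - m
    C = X.T @ X / max(F.shape[0] - 1, 1)
    C += eps * np.eye(C.shape[0])                 # keep C positive definite
    w, U = np.linalg.eigh(C)
    logC = (U * np.log(w)) @ U.T                  # U log(D) U^T
    desc = np.concatenate([m, vec_I(logC)])       # 8,384 dims for d = 128
    return np.sign(desc) * np.abs(desc) ** alpha  # power normalization (Eq. 9)
```

The descriptors of the 8 spatial pyramid regions are then concatenated to form the final image representation.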


3     Textual Information Processing

The goal of textual information processing is, given a list Lconc = {c1 , . . . , cn }
of concepts of interest, to retrieve a relevant set of images from the ImageCLEF
training set exploiting only the textual content of the web pages that referenced
the images. The concepts c are expressed as WordNet3 synsets, for instance
airplane.n.1 is the first sense of the term airplane as a noun; the list can
include more than one synset, as they could be equally relevant, as in Lconc =
{book.n.1, book.n.2}. The retrieved image set I will then be used to train ad-hoc image classifiers to identify the specific concepts in the test image set. In particular, in order for the training to be effective, the text processing techniques should be designed to retrieve a set of images:

(a) sufficiently large to perform training (a minimum number of images threshold thmin should be exceeded);
(b) as relevant as possible to the concepts.
3
    http://wordnet.princeton.edu/
    One of the most naive approaches to text processing, which also constitutes a typical baseline, could be accomplished in a few simple steps. For instance, for a given c ∈ Lconc (e.g. airplane.n.1):
 1. extract the main term t associated with c (e.g. airplane);
 2. look for t in the “scofeats” data file, which contains, for each of the training images, the processed text of the referencing web pages, and retrieve in I the images referenced from the web pages where t appears.
    Following the above baseline, however, leads to very unsatisfactory results due to a large number of both (a) false negatives and (b) false positives in I, thus failing to meet the above mentioned desiderata. The following are just some real examples for airplane.n.1:
 – many relevant and useful images are missing since the original pages de-
   scribed the concepts using different terms (e.g. aeroplane, jumbojet, etc);
 – many unrelated images are retrieved, for instance the closeup of a hat (from
   a web page about “airplane pilot hats”), the album covers and group shots
   of the “Jefferson Airplane” music band, etc.

3.1   Enhanced Text Processing Approach
In order to overcome the above mentioned issues, we exploit an enhanced text processing approach, whose steps are outlined in Figure 2. First of all, a textual information pre-processing step is executed on the “scofeats” data and on the original web pages data (top part of the figure):
 – stopword removal and stemming are performed on the “scofeats” file, thus producing a “stemmed scofeats” file (a minimal sketch of this step is given below);
 – the titles of the original web pages are extracted and parsed (title extraction and analysis step), thus producing a “parsed page titles” file.
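The sketch below illustrates the pre-processing step in Python, assuming NLTK's English stopword list and Porter stemmer as stand-ins for the actual tools:

```python
# requires: nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

_stop = set(stopwords.words("english"))
_stemmer = PorterStemmer()

def preprocess_terms(terms):
    """Stopword removal and stemming of the term list of one scofeats record."""
    return [_stemmer.stem(t) for t in terms if t.lower() not in _stop]

# e.g. "airplanes" and "airplane" are reduced to the same stem
```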
    Note that, in our approach, we choose to exploit first of all the textual features
already extracted in the “scofeats” file (including the term scores), complement-
ing them with specific information extracted from the original web pages which
would be otherwise unavailable. The output of pre-processing, then, enables the
actual textual information search process (bottom part of the figure) that, given
an input list of concepts Lconc , produces the associated result image set I. With
respect to the baseline described in the previous section, for each concept c ∈ Lconc the processing is enhanced in two main directions, corresponding to the original desiderata:

 – the number of retrieved images associated with c is significantly higher, thanks to the candidate image search step, which produces an expanded candidate image set CI. Each image in CI is associated with its scofeats information for further refinements;
[Figure: top — textual information pre-processing: web pages processed text (scofeats data) → stopword removal and stemming → stemmed scofeats; original web pages → title extraction and analysis → parsed page titles. Bottom — textual information search: concepts of interest Lconc = {c} → synonyms and hyponyms analysis (sync, hypc) → candidate image search → CI → result filtering and refinement (using contc, ncontc from context generation) → retrieved images I.]

Fig. 2. Overview of textual information pre-processing (top) and search process (bottom).



 – irrelevant images in CI are discarded and a refined set I is produced in a
   result filtering and refinement step.

    The techniques behind these two crucial steps, along with the preliminary ones supporting them, are briefly discussed in the following.
Candidate image search. The first key idea here is to search for term t (e.g. airplane) associated with c in the web page processed text of the “stemmed scofeats” file, instead of the plain “scofeats” file. In this way, stemming [8] significantly improves the recall of the image retrieval process, also retrieving pages containing different inflections of the term (e.g., the plural airplanes).
    Moreover, the search is performed not only for term t but also for the terms available in its synonym and hyponym (i.e. more specific terms) lists, sync and hypc , respectively. Such lists are extracted from WordNet in the synonyms and hyponyms analysis step. For instance, for airplane sync will include aeroplane and hypc biplane, jumbojet, etc., whereas for rodent hypc will contain the more common mouse and rat terms. In this way, a set of images significantly larger than the baseline is retrieved, for instance 40% more for airplane and even 300% more for rodent.
    Please note that including in the search all the synonyms and hyponyms of a term, without any discrimination, while enlarging the retrieved image set, could also harm its precision, due to the ambiguity of language. For instance, a hyponym of newspaper is daily, and a hyponym of horse is bay; these are, however, terms which are more typically used in completely different contexts and which would, thus, produce noise in the results. For this reason, we chose the “safest” approach of selecting only those synonyms/hyponyms having a single sense in WordNet; alternatively, word sense disambiguation approaches (such as [9, 10]) could be applied to the web pages before the search (their application will be investigated in the future).
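A sketch of the synonyms and hyponyms analysis with the single-sense filter, using NLTK's WordNet interface (note that NLTK names senses with two digits, e.g. airplane.n.01; the exact filtering rules shown are our reading of the text):

```python
# requires: nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def expand_concept(synset_name):
    """Search terms for a concept: its main term, its synonyms and the lemmas
    of its hyponyms, keeping only expansion terms with a single WordNet sense."""
    syn = wn.synset(synset_name)                   # e.g. "airplane.n.01"
    main_term = syn.lemma_names()[0]
    candidates = set(syn.lemma_names()) | {
        lemma for hypo in syn.hyponyms() for lemma in hypo.lemma_names()
    }
    candidates.discard(main_term)
    # drop ambiguous expansions such as "bay" (horse) or "daily" (newspaper)
    safe = {t for t in candidates if len(wn.synsets(t)) == 1}
    safe.add(main_term)
    return {t.lower().replace("_", " ") for t in safe}
```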
Result filtering and refinement. In this step, various refinement techniques are exploited to reduce the number of false positives in the CI image set:

 – score threshold: the “stemmed scofeats” file contains, for each term of the referencing web page, a score which captures the concepts of term frequency [8] and term distance to the image: the more frequent a term is and the closer it is to the image, the more representative of the image content it should be. Therefore, we define a thscore threshold which the score of term t should exceed in order for the image to be put in the final results. For instance, the scofeats entry for a helicopter image from a web page about various means of transport could contain the term airplane but with a very low score, and would thus not be considered;
 – context threshold: a term in a web page can have very different senses. For instance, when looking for elder.n.1, i.e. “a person who is older than you are”, one should be very careful to exclude web pages (and their images) which are about elders as “shrubs or small trees”. While word sense disambiguation [9, 10] could be applied to the web pages, in order to be able to directly exploit the “stemmed scofeats” information we chose to implicitly derive the sense of a term from its co-occurring terms. In particular, for a given concept c, the context generation step first derives a preliminary context pcontc from the nouns of its WordNet definition, then expands it to a context contc containing the terms which most frequently co-occur with the ones of pcontc . For instance, for c =elder.n.1, pcontc ={elder, person}, while contc will include additional terms such as old, family, people, house, woman and man. Then, we define a thcont threshold, which is the minimum number of terms of contc that an image scofeats entry should contain in order for the image to be put in the final results. In this way, when looking for images about elder.n.1, images about bushes are typically excluded;
 – negative context check: in order to enforce context filtering, we also de-
   rive a negative context ncontc from the definitions of the other senses of
   c that the image scofeat should not include: for instance when looking for
   c =castle.n.2 (“a large building”), ncontc will include misleading terms
   such as chess;
 – page title check: in this further check we exploit the “parsed page titles” information to exclude from the results the images whose referencing pages have a title not meeting the desired criteria. In particular, term t should not be used as a determiner in a noun phrase, or be present in capitalized form: for instance, images from pages about “airplane pilot hats” or “Jefferson Airplane songs” are excluded as deemed not representative for airplane.

    Please note that the above refinement techniques are applied only: (a) if they are relevant to the case (for instance, context and negative context are not needed in the case of single-sense terms) and (b) if the number of images in the candidate set CI does not fall below the minimum threshold thmin (in particular, thresholds thcont and thscore are adjusted so as to satisfy the thmin = 500 limit). A simplified sketch of this filtering logic is given below.
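The following sketch illustrates the filtering logic (field names, data structures and threshold values are hypothetical; the page title check is reduced to the capitalization test):

```python
def keep_image(terms_with_scores, search_terms, context, neg_context,
               title_tokens, th_score=0.5, th_cont=2):
    """Decide whether a candidate image should stay in the final set I.

    terms_with_scores: dict term -> scofeats score for the referencing page.
    search_terms: terms for the concept (from the synonym/hyponym expansion).
    context / neg_context: positive and negative context term sets (contc, ncontc).
    title_tokens: tokens of the referencing page title, with original casing.
    """
    # score threshold: at least one search term must exceed th_score
    if not any(terms_with_scores.get(t, 0.0) >= th_score for t in search_terms):
        return False
    # context threshold: at least th_cont context terms must co-occur
    if context and sum(t in terms_with_scores for t in context) < th_cont:
        return False
    # negative context check: none of the misleading terms may appear
    if any(t in terms_with_scores for t in neg_context):
        return False
    # page title check (simplified): reject capitalized usages, e.g. "Jefferson Airplane"
    if any(tok.istitle() and tok.lower() in search_terms for tok in title_tokens):
        return False
    return True
```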


4   Online learning for SVM training
Although there exist many off-the-shelf SVM solvers, such as SVMlight, SVMperf or LibSVM/LIBLINEAR, they are not feasible for training on large volumes of data. This is because most of them are batch methods, which require going through all the data to compute the gradient at each iteration and often need many iterations to reach a reasonable solution. Even worse, most off-the-shelf batch-type SVM solvers require pre-loading the training data into memory, which is impossible when the size of the training data explodes. Indeed, LIBLINEAR released an extended version that explicitly considers the memory issue, but a recent test [11] showed that the performance dropped considerably and even on 80 GB of training data it could not provide useful results. Therefore, a better solution may be provided by the stochastic gradient descent (SGD) algorithm recently introduced for SVM classifier training.
    We have training data that consists of T feature-label pairs, denoted as $\{\mathbf{x}_t, y_t\}_{t=1}^{T}$, where $\mathbf{x}_t$ is an $s \times 1$ feature vector representing an image and $y_t \in \{-1, +1\}$ is the label of the image. Then, the cost function for binary SVM classification can be written as
$$L = \sum_{t=1}^{T}\left[\frac{\lambda}{2}\,\|\mathbf{w}\|^{2} + \max\!\left(0,\ 1 - y_t(\mathbf{w}^{T}\mathbf{x}_t + b)\right)\right], \qquad (10)$$

where $\mathbf{w}$ is the $s \times 1$ SVM weight vector, $\lambda$ (a nonnegative scalar) is a regularization parameter, and $b$ (a scalar) is a bias term. In the SGD algorithm, training data are fed to the system one by one, and the update rules for $\mathbf{w}$ and $b$ respectively are
$$\mathbf{w}_t = (1-\lambda\eta)\,\mathbf{w}_{t-1} + \eta\, y_t\, \mathbf{x}_t, \qquad b_t = b_{t-1} + \eta\, y_t \qquad (11)$$
if the margin $\Delta_t = y_t(\mathbf{w}^{T}\mathbf{x}_t + b)$ is less than 1; otherwise, $\mathbf{w}_t = (1-\lambda\eta)\,\mathbf{w}_{t-1}$ and $b_t = b_{t-1}$. The parameter $\eta$ is the step size; we set $\eta = (1+\lambda t)^{-1}$, following the vl_pegasos implementation [12]. To parallelize SVM training, we randomize the data on disk. We load the data in chunks that fit in memory, then train the different classifiers in parallel threads on further randomizations of the chunks, so that different epochs see the chunk data in different orderings.
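This update rule is straightforward to implement; the following is a minimal Python sketch of one pass over a chunk (our own illustration, not the vl_pegasos code actually used):

```python
import numpy as np

def sgd_svm_chunk(X, y, w, b, lam, t0=0):
    """One SGD pass over a chunk of training data, following Eq. 11.

    X: (T, s) feature matrix, y: labels in {-1, +1},
    w: (s,) weight vector, b: bias, lam: regularization parameter lambda.
    Returns the updated w, b and the global example counter.
    """
    t = t0
    for i in np.random.permutation(len(y)):       # re-randomize within the chunk
        t += 1
        eta = 1.0 / (1.0 + lam * t)               # step size, as in the text
        margin = y[i] * (w @ X[i] + b)
        w = (1.0 - lam * eta) * w                 # shrinkage (regularization term)
        if margin < 1:                            # hinge loss is active
            w = w + eta * y[i] * X[i]
            b = b + eta * y[i]
    return w, b, t
```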
    In our experimental setting each classifier is trained considering all the 250,000 images given as training data; as a result the data are highly unbalanced, namely the negative samples are predominant, and in addition the number of samples per concept is unevenly distributed. We observed that this leads the classifier to incorrectly estimate the hyperplane, which is shifted towards the positive samples while maintaining a proper orientation. To reduce this effect, a simultaneous optimization of the SVM bias for all the classifiers is conducted by maximizing the F-measure on the whole training set.
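The bias adjustment can be sketched as follows (a simplified per-concept version; the paper performs the optimization jointly for all classifiers, and scanning the unique scores as candidate shifts is an assumption):

```python
import numpy as np

def calibrate_bias(scores, labels):
    """Return the bias shift that maximizes the F-measure on the training set.

    scores: raw SVM outputs w^T x + b for all training images of one concept.
    labels: ground-truth labels in {-1, +1}.
    """
    best_shift, best_f = 0.0, -1.0
    for shift in np.unique(scores):               # candidate decision thresholds
        pred = scores >= shift
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == -1))
        fn = np.sum(~pred & (labels == 1))
        f = 2.0 * tp / (2.0 * tp + fp + fn) if tp > 0 else 0.0
        if f > best_f:
            best_f, best_shift = f, shift
    return best_shift                             # move the decision boundary here
```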
5   Experimental results

UNIMORE participated in the Scalable Concept Image Annotation task, submitting six runs. All the different kinds of SIFT descriptors are extracted at four scales, defined by setting the width of the local feature spatial bins to 4, 6, 8, and 10 pixels respectively, over a dense regular grid with a spacing of 3 pixels. We use the function vl_phow provided by the vl_feat library [12] and, apart from the spacing step, the default options are used. Images are hierarchically partitioned into 1×1, 2×2 and 1×3 blocks on 3 levels respectively. In the case of the SIFT descriptor we obtain a 67,072-dimensional vector by concatenating the MGD features of the 8 spatial windows, while for color SIFT descriptors (RGB, OPPONENT and HSV) we describe a region by concatenating the MGD computed for each color channel separately (instead of using the full covariance matrix of 384 dimensions), obtaining a 201,216-dimensional feature vector. Stochastic gradient descent is used to learn a classifier for each concept, for each feature descriptor and each training set; detection scores are thresholded at zero to obtain the decisions. Finally, a late fusion averaging approach is used.
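The late fusion step itself is simple to express; a minimal sketch, assuming one score matrix per feature type and training set (rows: test images, columns: concepts):

```python
import numpy as np

def late_fusion(score_matrices):
    """Average the per-classifier scores and threshold at zero for the decisions.

    score_matrices: list of (num_images, num_concepts) arrays, one per feature
    type / training set combination (e.g. SIFT, RGBSIFT, OPPONENTSIFT, HSVSIFT).
    """
    fused = np.mean(score_matrices, axis=0)
    decisions = fused >= 0.0
    return fused, decisions
```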
    In some runs we added another training set of about 100k images queried
from Google Image Search directly using the concepts list. Each image is au-
tomatically labeled with the concept word used in the query; synonyms and
   hyponyms analysis is also performed in order to identify label relationships (e.g.
images labeled as “car” are also tagged with the “vehicle” label). These images
are described with RGBSIFT and summarized with a Multivariate Gaussian De-
scriptor. All experiments are performed on six biprocessor Xeon machines with
32 GB of RAM each. Runs are described and discussed in the following:

 – UNIMORE 1: Training images are associated with a concept using the first
   step of the textual information processing, called “candidate image search”,
   on the scofeats file. For every image multiscale HSVSIFT and RGBSIFT
   features are extracted and summarized with a Multivariate Gaussian De-
   scriptor.
 – UNIMORE 2: Based on the scofeats file, two different sets of training im-
   ages are associated with a concept: 1) images linked to a concept when the
   concept word is present in the stemmed scofeats; 2) images obtained by the
   candidate image search technique. For every image, multiscale HSVSIFT,
   OPPONENTSIFT, RGBSIFT and SIFT features are extracted and summa-
   rized with a MGD. Google Images dataset is also used for training.
 – UNIMORE 3: Training images are associated with a concept only if the concept word is present in the stemmed scofeats file. For every image multiscale
   HSVSIFT, OPPONENTSIFT, RGBSIFT and SIFT features are extracted
   and summarized with a MGD. Google Images dataset is also used for train-
   ing.
 – UNIMORE 4: Training images are associated with a concept if the concept
   word is present in the stemmed scofeats file. For every image multiscale
   HSVSIFT, OPPONENTSIFT, RGBSIFT features are extracted and sum-
   marized with a MGD.
 – UNIMORE 5: Three sets of training images, based on the scofeats file, are associated with a concept: 1) an image is linked to a concept when the concept word is present in the stemmed scofeats file; 2) images obtained by
   the candidate image search technique; 3) images obtained applying the com-
   plete textual information processing pipeline. For every image multiscale
   HSVSIFT, OPPONENTSIFT, RGBSIFT and SIFT features are extracted
   and summarized with a MGD. In addition, Google Images dataset is used for
   training. Two different combination strategies are used to compute decisions
   and score values: decisions are computed through a late fusion averaging
   approach of classifiers trained with images derived by the candidate image
   search technique and described with HSVSIFT, OPPONENTSIFT and RG-
   BSIFT descriptors, while the score values are obtained combining all the
   classifiers learned in this run.
 – UNIMORE 6: Two sets of training images, based on the scofeats file, are associated with a concept: 1) an image is linked to a concept when the concept word is present in the stemmed scofeats file; 2) images obtained by the candi-
   date image search technique. For every image multiscale HSVSIFT, OPPO-
   NENTSIFT, RGBSIFT and SIFT features are extracted and summarized
   with a MGD. Google Images dataset is also used for training. Two different
   combination strategies are used to compute decisions and score values: de-
   cisions are computed through a late fusion averaging approach of classifiers
   trained with images derived by the candidate image search technique and
   described with HSVSIFT and RGBSIFT features, while the score values are
   obtained combining all the classifiers learned in this run.


    Tab. 1 and 2 present the results obtained for each run in terms of mAP (mean average precision) and MF (mean F-measure). The MF is computed analyzing both the samples (MF-samples) and the concepts (MF-concepts), whereas the mAP is computed analyzing the samples. It can be noted that the performance figures for all three metrics on the development and test sets are strongly related, with slightly lower results on the latter. This is probably due to the higher number and variability of the concepts given in the test setting. The late fusion averaging approach proves to be a good solution for combining different features and training sets and for easily learning classifiers in parallel. In particular, it greatly improves the mAP value, which increases with each new feature or training set added to the system (for example, see runs UNIMORE 1 and UNIMORE 2). Adding our Google Images dataset, automatically downloaded using the concepts list, increases the performance in terms of mAP despite the high level of label noise (see runs UNIMORE 3 and UNIMORE 4). Textual information processing is also essential for the proposed method, mainly because it increases the performance in terms of MF and defines new training sets to be used in the late fusion approach. It can be noted that both textual processing steps contribute to the improvement of the performance: see, for example, the gap in terms of MF between runs UNIMORE 4 and UNIMORE 1, mainly due to the candidate image search strategy, and the difference in mAP values between runs UNIMORE 6 and UNIMORE 5, where the latter applies the complete textual information processing pipeline.
                          Table 1. Development set results

                          MF-samples MF-concepts mAP-Samples
            baseline rand     6.2        4.8        10.9
            baseline sift    17.8       11.0        24.0
            UNIMORE 1        33.0       34.1        39.2
            UNIMORE 2        27.3       34.2        46.0
            UNIMORE 3        23.1       32.4        43.7
            UNIMORE 4        26.8       31.7        39.7
            UNIMORE 5        33.3       33.7        47.9
            UNIMORE 6        33.0       34.1        46.0

                               Table 2. Test set results

                          MF-samples MF-concepts mAP-Samples
            baseline rand     4.6        3.6         8.7
            baseline sift    15.9       11.0        21.0
            UNIMORE 1        31.1       32.0        36.7
            UNIMORE 2        27.5       33.1        44.1
            UNIMORE 3        23.1       31.5        41.9
            UNIMORE 4        24.1       29.5        36.2
            UNIMORE 5        31.5       31.9        45.6
            UNIMORE 6        31.1       32.0        44.1





6   Conclusions

In this paper we presented the approach developed to participate in ImageCLEF 2013, for the Scalable Concept Image Annotation task. Our proposal focuses on
the definition of a new image descriptor that encodes local features, densely
extracted from a region, as a Multivariate Gaussian Distribution. A new textual
information processing strategy is also presented to cope with the high level of
noise of the training data. To deal with the large-scale nature of this task, we
use an online linear SVM classifier based on the Stochastic Gradient Descent
algorithm. Experimental results show that both visual and textual information
processing are necessary to build a competitive system.


References

 1. Caputo, B., Muller, H., Thomee, B., Villegas, M., Paredes, R., Zellhofer, D., Goeau,
    H., Joly, A., Bonnet, P., Martinez Gomez, J., Garcia Varea, I., Cazorla, M.: Im-
    ageclef 2013: the vision, the data and the open challenges. In: Proc CLEF. (2013)
 2. Villegas, M., Paredes, R., Thomee, B.: Overview of the imageclef 2013 scalable
    concept image annotation subtask. In: CLEF 2013 working notes, Valencia, Spain.
    (2013)
 3. Tuzel, O., Porikli, F., Meer, P.: Pedestrian Detection via Classification on Rieman-
    nian Manifolds. IEEE T. Pattern Anal. 30(10) (2008) 1713–1727
 4. Borghesani, D., Grana, C., Cucchiara, R.: Miniature illustrations retrieval and
    innovative interaction for digital illuminated manuscripts. Multimedia Systems
    (2013)
 5. Martelli, S., Tosato, D., Farenzena, M., Cristani, M., Murino, V.: An FPGA-
    based Classification Architecture on Riemannian Manifolds. In: DEXA Workshops.
    (2010)
 6. Perronnin, F., Sánchez, J., Mensink, T.: Improving the fisher kernel for large-scale
    image classification. In: ECCV. (2010)
 7. Grana, C., Serra, G., Manfredi, M., Cucchiara, R.: Image classification with mul-
    tivariate gaussian descriptors. In: ICIAP. (2013)
 8. Baeza-Yates, R.A., Ribeiro-Neto, B.: Modern Information Retrieval. Addison-
    Wesley Longman Publishing Co., Inc., Boston, MA, USA (1999)
 9. Mandreoli, F., Martoglia, R.: Knowledge-Based Sense Disambiguation (Almost)
    For All Structures. Information Systems (Information) 36(2) (2011) 406–430
10. Mandreoli, F., Martoglia, R., Ronchetti, E.: Versatile Structural Disambiguation
    for Semantic-aware Applications. In: ACM International Conference on Informa-
    tion Knowledge and Management. (2005)
11. Lin, Y., Lv, F., Zhu, S., Yang, M., Cour, T., Yu, K.: Large-scale image classification:
    Fast feature extraction and svm training. In: CVPR. (2011)
12. Vedaldi, A., Fulkerson, B.: VLFeat: An open and portable library of computer
    vision algorithms. http://www.vlfeat.org/ (2008)