=Paper=
{{Paper
|id=Vol-1178/CLEF2012wn-ImageCLEF-BenaventEt2012
|storemode=property
|title=Visual Concept Features and Textual Expansion in a Multimodal System for Concept Annotation and Retrieval with Flickr Photos at ImageCLEF2012
|pdfUrl=https://ceur-ws.org/Vol-1178/CLEF2012wn-ImageCLEF-BenaventEt2012.pdf
|volume=Vol-1178
}}
==Visual Concept Features and Textual Expansion in a Multimodal System for Concept Annotation and Retrieval with Flickr Photos at ImageCLEF2012==
J. Benavent², A. Castellanos¹, X. Benavent², E. de Ves², Ana García-Serrano¹

¹ Universidad Nacional de Educación a Distancia, UNED
² Universitat de València

xaro.benavent@uv.es, {acastellanos, agarcia}@lsi.uned.es

Abstract. This paper presents the experiments we submitted to the Concept Annotation and Concept Retrieval tasks using Flickr photos at ImageCLEF 2012. In this edition we applied new strategies in both the textual and the visual subsystems of our multimodal retrieval system. The visual subsystem focuses on extending the low-level feature vector with concept features, which are computed by means of logistic regression models. The textual subsystem focuses on expanding the query information using external resources. Our best concept retrieval run, a multimodal one, is ranked ninth with a MnAP of 0.0295, making us the second best group of the contest in the multimodal modality. This is also our best run in the global ranked list (where eleven textual runs are ranked above it). We adapted our multimodal retrieval process to the annotation task, obtaining modest results in this first participation, with a MiAP of 0.1020.

Keywords: Multimedia Retrieval, Flickr Expansion, Concept Features, Low-level Features, Logistic Regression Relevance Feedback.

1 Introduction

UNED-UV is a research group with researchers from two Spanish universities, the Universidad Nacional de Educación a Distancia (UNED) and the Universitat de València (UV). The group has been working together since the ImageCLEF 2008 edition. This is our first participation in the Photo Annotation and Retrieval Task using Flickr photos; our previous participations were in the Wikipedia retrieval [6] and Medical [4] tasks.

The visual concept detection, annotation, and retrieval task is a multi-label classification challenge. Participants are asked to annotate the presence of one or more concepts in the annotation subtask, using visual and/or textual features, and to use this information in the retrieval process [2]. We have participated in the annotation and in the retrieval subtasks using visual and textual information.

Our multimedia retrieval system, very similar to the ones already used in previous ImageCLEF editions [4,5], is composed of three subsystems (Fig. 1): the Text-Based Information Retrieval (TBIR) subsystem, the Content-Based Information Retrieval (CBIR) subsystem, and the Fusion subsystem. The three main steps are the following: the TBIR subsystem acts first as a pre-filter, then the CBIR subsystem works over this pre-filtered collection by re-ranking it, and the final ranked list is the fusion of the two mono-modal lists. This retrieval process is based on the idea that the textual retrieval subsystem better captures the meaning of the query, so it is expected to eliminate images that are similar from a visual point of view but completely different from a semantic point of view.

In this edition, the TBIR subsystem has been improved by expanding the textual information of the query. Most of the groups participating in the previous retrieval task tried to take advantage of the Flickr tag annotations of the images during the retrieval process. In this regard, Ksibi et al. [8] use Flickr tags to extract contextual relationships between them.
Izawa et al. [7] also use Flickr tags, combining a TF-IDF model over the tags with a visual-word co-occurrence approximation. Another approach, investigated by Spyromitros-Xious et al. [11], uses the concepts instead of the tags in order to improve the text-based retrieval. Unlike the works presented above, we decided to go further in the use of the textual information about the images (including tags): we have carried out an expansion of the original collection using the information about the images that exists on Flickr.

The CBIR subsystem uses low-level features for image retrieval. Although this low-level information gives quite good results depending on the visual information of the query, it is not able to reduce the "semantic gap" for semantically complex queries. Our proposal [3] is to generate concept features from the low-level features, i.e. the probability of the presence of each trained concept. We call this new vector the expanded low-level concept vector; it is calculated for each image of the collection and also for the example images of the query in the retrieval task. A model for each concept is trained using logistic regression [9]. We use these regression models as multi-label classifiers in the annotation subtask and as a feature vector in the retrieval subtask.

Our proposals, both for the textual and for the visual systems, are more oriented to a retrieval process than to an annotation subtask; nevertheless, we have adapted them to the multi-label annotation subtask. Section 2 describes the visual, textual and multimodal approaches for the concept annotation subtask with Flickr photos. Section 3 explains the multimodal retrieval system used for the concept retrieval subtask. Section 4 presents the submitted runs and the results obtained for annotation and retrieval. Finally, Section 5 draws conclusions and outlines possible future research lines.

2 Concept annotation subtask with Flickr photos

2.1 Annotation approach using visual information

For the annotation subtask we train a logistic regression model [9] for each of the concepts defined by the concept annotation subtask [2]. Each trained model predicts the probability that a given image belongs to a certain concept. The concept annotation subtask provides the participants with a training set Is for each concept: IsP is the set of training images annotated with the concept, which we refer to as relevant or positive images, and IsN is the set of images not annotated with the concept, referred to as non-relevant or negative images.

The logistic regression analysis calculates the probability that a given image belongs to a certain concept. Each image of the training set Is is represented by a k-dimensional low-level feature vector {x1, ..., xi, ..., xk}. The relevance probability of a certain concept ci for a given image Ij is denoted Pci(Ij). A logistic regression model can estimate these probabilities. For a binary response Y and k explanatory variables x = (x1, ..., xk), the model for π(x) = P(Y = 1 | x) is

  logit[π(x)] = α + β1·x1 + ... + βk·xk,  where logit(π(x)) = ln(π(x) / (1 − π(x))).

The model parameters are obtained by maximum likelihood estimation (MLE) of the parameter vector β using an iterative method.
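As a rough illustration of this per-concept training, the sketch below fits one logistic regression per concept over the low-level feature vectors and returns the per-concept probabilities Pci(Ij). The use of scikit-learn, the function names and the data layout are assumptions made for illustration; the authors fit their models with their own iterative maximum likelihood procedure over hand-crafted features.

```python
# Minimal sketch (not the authors' implementation): one logistic regression
# model per concept, trained on low-level feature vectors and used to
# estimate P(concept | image). Feature extraction is assumed to exist elsewhere.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_concept_models(features, concept_labels):
    """features: (n_images, k) low-level descriptors.
    concept_labels: dict concept_name -> binary label vector of length n_images."""
    models = {}
    for concept, y in concept_labels.items():
        clf = LogisticRegression(max_iter=1000)  # iterative MLE fit
        clf.fit(features, y)
        models[concept] = clf
    return models

def concept_probabilities(models, image_vector):
    """Return Pc(I) for every trained concept for a single image."""
    x = np.asarray(image_vector).reshape(1, -1)
    return {c: float(m.predict_proba(x)[0, 1]) for c, m in models.items()}
```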
A major difficulty arises when adjusting an overall regression model that takes the whole set of variables into account, because the number of selected images (positive plus negative), n, is typically smaller than the number of features (n < k). In this case the adjusted regression model would have as many parameters as data points, and many relevant variables could be left out. To solve this problem, our proposal is to adjust several smaller regression models, each one considering only a subset of variables consisting of semantically related image characteristics. Consequently, each sub-model associates a different relevance probability to a given image x, and these probabilities have to be combined in order to rank the database according to the image probability or image score (Si).

The explanatory variables x = (x1, ..., xk) used to train the models are visual low-level features based on color and texture information, calculated by our group. The low-level feature vector has 293 components, divided into five visual information families:

• Color information: color information is extracted by calculating both local and global histograms of the images. Global histograms are computed using 10x3 bins on the HS color space. Local histograms are computed by dividing the image into four patches of equal size; a bi-dimensional HS histogram with 12x4 bins is computed for each patch. In total, a feature vector of 222 components represents the color information of the image.

• Texture information: this information is embodied in the granulometric distribution function. A granulometry is defined from the morphological opening of the texture using a convex and compact subset containing the origin as structuring element [1]. In our case we use a horizontal and a vertical segment as structuring elements, giving 60 components in total. We also use the Spatial Size Distribution, another morphological descriptor defined in [1], with a horizontal segment as structuring element, giving 10 further components.

Once the 99 trained models are available, we calculate for each image the probability of belonging to each concept, Pci(Ij). This probability is a floating-point value between 0 and 1 and is used as the confidence score of the annotation run. For the binary score, if the concept probability is greater than 0.5 (Pci(Ij) > 0.5) the concept is assumed to be present in the image and is marked as 1; otherwise it is marked as 0, meaning the absence of the concept. A schematic example of the sub-model combination and thresholding is given below.
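The following sketch illustrates the family-wise sub-models and the binarization at 0.5. The index ranges of the feature families and the combination rule (a geometric mean of the sub-model probabilities) are assumptions for illustration only; the paper does not specify how the sub-model probabilities are combined into the image score Si.

```python
# Illustrative sketch only: split the full descriptor into feature families,
# fit one small logistic regression per family for a concept, and combine the
# per-family probabilities into one image score. Ranges are approximate.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical index ranges of the feature families inside the full vector.
FAMILY_SLICES = {
    "color_global": slice(0, 30),
    "color_local": slice(30, 222),
    "granulometry": slice(222, 282),
    "spatial_size_dist": slice(282, 293),
}

def fit_family_submodels(features, y):
    """Fit one small logistic model per feature family for one concept."""
    return {name: LogisticRegression(max_iter=1000).fit(features[:, sl], y)
            for name, sl in FAMILY_SLICES.items()}

def image_score(submodels, image_vector):
    """Combine per-family probabilities into one score Si (geometric mean assumed)."""
    x = np.asarray(image_vector)
    probs = [submodels[name].predict_proba(x[sl].reshape(1, -1))[0, 1]
             for name, sl in FAMILY_SLICES.items()]
    return float(np.exp(np.mean(np.log(np.clip(probs, 1e-9, 1.0)))))

def binary_annotation(score, threshold=0.5):
    """Binary decision used for the annotation run."""
    return 1 if score > threshold else 0
```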
2.2 Annotation approach using visual and textual information

Based on the visual annotation presented above, we propose a multimodal annotation using an IR-based approach. Our proposal follows a two-step process. In the first step, the visual annotation approach generates a visual-based results list. In the second step, the textual system refines this visually annotated list as follows (a minimal sketch of this filter is given at the end of this subsection):

• The textual system only checks the concepts annotated as present in an image by the visual system (set to 1 in the binary annotation).
• The textual system retrieves the concepts that are most likely present in the image, ranked by score, using the textual information of the image as a query against the information associated with the concepts.
• If the textual system identifies the concept as present, the concept is kept as present and the confidence score is calculated as the product of the textual and visual confidence scores.
• If the textual system does not identify the concept as present, the concept is marked as not present, regardless of the decision of the visual annotation.

This proposal faced a problem: there was not enough textual information associated with either the images or the concepts. Due to this lack of information, we decided to expand the textual information of the collection with external sources. The expansion was applied both to the image information and to the concept information.

In order to expand the information associated with the images, Flickr was used to provide an adequate textual description for each image. Two different expansion processes were applied:

• Expansion using the Flickr description of the image: for every image in the collection, we retrieved its Flickr description and aggregated it to the image description.
• Expansion using the Flickr descriptions of similar images: we complemented the user descriptions with the descriptions that other users gave to similar images. To find images similar to each image of the collection, we used the tag annotations of the images: for each image, the Flickr API was queried to retrieve images sharing the same tags (all of them or a subset), and the descriptions of the first 50 retrieved images were aggregated to the image description.

We propose three methods for the expansion of the concepts, based on two external sources (Flickr and ImageNET, http://www.image-net.org/):

• Expansion using user descriptions of the concept on Flickr: the name of the concept is used as a query to the Flickr API to obtain a set of relevant images; the descriptions of these images are then aggregated to the concept description.
• Expansion using user descriptions on Flickr of images annotated with the same concept: the idea is similar to the previous expansion, but instead of querying for images relevant to each concept, we use the images annotated with the given concept. The method is as follows: 1) for each concept, the images annotated with it are identified; 2) the descriptions of these images are collected; and 3) these descriptions are aggregated to the concept description.
• Expansion using structured information (ImageNET): for this approach, each concept was manually extended by searching for it on ImageNET and adding the definition provided by ImageNET to the concept definition.
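The decision logic of the textual filter described at the beginning of this subsection can be summarized as follows. This is a schematic rendering, not the authors' code; the dictionaries of per-concept confidence scores are assumed data structures used only for illustration.

```python
# Schematic rendering of the textual filter over the visual annotation
# (Section 2.2). visual: concept -> (binary decision, confidence score)
# from the visual classifier; textual: concept -> confidence score from
# the IR-based textual system (missing key = concept not retrieved).
def multimodal_annotation(visual, textual):
    fused = {}
    for concept, (present, v_score) in visual.items():
        if not present:
            # Only concepts marked present by the visual system are checked.
            fused[concept] = (0, 0.0)
        elif concept in textual:
            # Confirmed by the textual system: keep it, combine confidences.
            fused[concept] = (1, v_score * textual[concept])
        else:
            # Not confirmed textually: discard, whatever the visual score.
            fused[concept] = (0, 0.0)
    return fused
```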
3 Concept retrieval subtask with Flickr photos

The system is composed of three subsystems: the Text-Based Information Retrieval (TBIR) subsystem, the Content-Based Information Retrieval (CBIR) subsystem, and the Fusion subsystem (Fig. 1).

Fig. 1. Retrieval system overview: the TBIR subsystem (expansion, indexing and textual search producing the score St), the CBIR subsystem (feature extraction, expanded concept vector and logistic regression relevance feedback producing the score Si), and the Fusion subsystem combining both scores (St*Si).

The TBIR subsystem is responsible for the preprocessing, the expansion, the indexing and, finally, the retrieval process using textual information. For a given query, the TBIR subsystem recovers only the relevant images and assigns to each image a score (St) based on the textual similarity between the associated text and the query. The relevant images returned by the TBIR subsystem are submitted as candidates to the CBIR subsystem as a list sorted by this score; the TBIR subsystem thus acts as a filter over all the images of the collection. The CBIR subsystem then assigns another score, Si, to each image based on its visual features. In the last step, the image list is re-ranked by fusing the scores given by the TBIR and CBIR subsystems through the product of both scores, St*Si. Each subsystem is described in detail in the following sections.

3.1 Text-Based Information Retrieval subsystem

This subsystem carries out all the work related to the textual information of the collection: preprocessing, query reformulation, collection expansion, indexing and, finally, retrieval. These stages operate as follows (a sketch of the preprocessing step is given after the list):

• Preprocessing: since indexing and retrieval are based on term frequencies, it is important to normalize the text and remove noisy terms beforehand. The preprocessing includes: 1) elimination of special characters with no statistical meaning; 2) deletion of semantically empty English words (i.e. stop-words); 3) stemming, i.e. reduction of words to their base form using the Porter algorithm; and 4) conversion of all words to lower case.

• Query processing: the query is processed in two ways. First, meaningless terms or expressions are deleted; more concretely, expressions like "The user is looking for photos showing..." are removed, as they do not add any semantic information to the query content. Second, for each query, the concept (or concepts) expected in the results is identified; this identification is done manually. An example of query processing is:
─ Original query: "The user is looking for photos showing only one or more elderly men, so no other people should be additionally visible".
─ Query without meaningless terms: "one or more elderly men, so no other people should be additionally visible".
─ Concept(s) identified: Elder Male.

• Collection expansion: the textual information associated with the images available in the collection is scarce (both for images and concepts). Since our approach requires a significant amount of textual information to work, it became necessary to perform an expansion process, using the information available for each image to query external sources. The expansion information, created according to the process explained in Section 2.2, is aggregated to the collection.

• Indexing: the collection has been indexed with Apache Solr (http://lucene.apache.org/solr/), an open-source search platform from the Apache Lucene (http://lucene.apache.org/core/) project. Solr indexes the textual information originally present in the collection as well as the descriptions generated by the expansion.

• Retrieval: the search process is carried out by Solr on top of Lucene. The scoring function used to calculate the similarity between a given query and the documents is BM25. The results are produced in TREC format.
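As an illustration of the preprocessing stage, the snippet below applies the four steps (special-character removal, English stop-word removal, Porter stemming and lowercasing). The use of NLTK and of its stop-word list is an assumption for illustration; the paper only names the Porter algorithm.

```python
# Illustrative preprocessing pipeline (assumed NLTK implementation).
# Requires: nltk.download("stopwords") once before use.
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text):
    # 1) drop special characters, 4) lowercase
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text).lower()
    tokens = text.split()
    # 2) remove English stop-words, 3) Porter stemming
    return [STEMMER.stem(t) for t in tokens if t not in STOPWORDS]

# Example (output depends on the stop-word list):
# preprocess("The user is looking for photos of elderly men")
# -> ['user', 'look', 'photo', 'elderli', 'men']
```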
3.2 Content-Based Information Retrieval subsystem

The work of the CBIR subsystem is based on three main stages: extraction of the low-level features, calculation of the concept features used to expand the feature vector, and calculation of the similarity (Si) of each image to the example images given in the query.

1. Extraction of low-level features: the first step of the CBIR subsystem is to extract the visual low-level features and the concept features for all the images of the database, as well as for the example images given in each query. The low-level features, calculated by our group, capture color and texture information about the images; they are the same features used for the concept annotation subtask (see Section 2.1 for more details).

2. Calculation of the concept feature vector: the regression models trained for each concept give, for each image in the database and for the example images of the query, the probability of the presence of each concept, Pci(Ij). With this probability information, we extend the low-level feature vector with m additional components, m being the number of trained concepts. Each image Ij in the database is thus described by the extended vector F(Ij) = (x1, ..., xk, c1, ..., cm) ∈ R^(k+m).

3. Similarity module: instead of using a classical distance measure to compute the similarity of each database image to the example images of a given topic, the similarity module uses our own logistic regression relevance feedback algorithm to estimate the probability that an image belongs to the query set. The sub-model regressions are restricted to five features within each feature family, five being the number of example images given for each topic (see Section 2.1 for details of the regression method). The relevant images are the example images, and the non-relevant images are randomly taken from outside the textually pre-filtered list.

3.3 Fusion subsystem

The fusion subsystem is in charge of merging the two scored result lists coming from the TBIR and CBIR subsystems. In the present work we use the product fusion algorithm (St*Si): the two result lists are fused by combining the relevance scores of the textually and visually retrieved images (St and Si). Both subsystems have the same importance in the resulting list, since the final relevance of each image is the product of its two scores. A sketch of the extended vector and of this fusion step is given below.
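A minimal sketch of the expanded concept vector and of the product fusion follows, assuming the per-concept regression models expose a predict_proba-style interface; the helper names and data layout are illustrative, and the real system obtains Si through its logistic regression relevance feedback algorithm.

```python
# Illustrative sketch of the expanded concept vector (Section 3.2) and the
# product fusion St*Si (Section 3.3). Not the authors' implementation.
import numpy as np

def expand_with_concepts(low_level, concept_models):
    """F(I) = (x1..xk, c1..cm): append the per-concept probabilities
    predicted by the trained regression models to the low-level vector."""
    x = np.asarray(low_level).reshape(1, -1)
    concept_probs = [m.predict_proba(x)[0, 1] for m in concept_models]
    return np.concatenate([np.asarray(low_level), np.asarray(concept_probs)])

def fuse(textual_scores, visual_scores):
    """Product fusion St*Si over the textually pre-filtered candidates."""
    return {img: textual_scores[img] * visual_scores.get(img, 0.0)
            for img in textual_scores}

# Usage: ranked = sorted(fuse(st, si).items(), key=lambda kv: kv[1], reverse=True)
```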
4 Experiments and results

4.1 Concept annotation

In this first participation in the concept annotation subtask, we submitted three visual and two multimodal runs (see Table 1 for detailed information on the submitted runs).

Our main objective for the visual runs is to test the behavior of our logistic regression model as a classifier for the annotation task and to adjust the parameters of the regression model. As explained in Section 2.1, one of the important parameters is the set of relevant images. The number of training images per concept ranges significantly, from 30 to 200 images [2]. We manually selected the relevant images for runs UNED_UV_02 and UNED_UV_03, while for run UNED_UV_01 all given images are taken, up to a maximum of 100. The number of positive plus negative images, n, has to be greater than the number of regression parameters to be estimated (see Section 2.1). We fixed the same number of regression models for all submitted runs: one regression per low-level feature sub-family, giving eight regression models with 9, 30 or 48 low-level components each. This means we need between 30 and 50 relevant images; the number of relevant images for runs UNED_UV_02 and UNED_UV_03 was fixed to 30.

The other input that the regression model needs is the set of non-relevant images. The number of non-relevant images should be twice the number of relevant images. The remaining question is how to choose the non-relevant images for each concept. In this edition of the Flickr photo annotation subtask, the concepts have been categorized into family groups [2], and we use this information to select the non-relevant images. In runs UNED_UV_01 and UNED_UV_02 the non-relevant images are selected from a subset of images outside the family of the concept, whereas in run UNED_UV_03 they are selected from a subset of images of the same family that do not belong to the training concept.

Table 1. Detailed information of the submitted experiments for the concept annotation task.

Run | Modality | Relevant images: number | Relevant images: selection method | Non-relevant images: selection method | Visual baseline | Textual algorithm
UNED_UV_01 | Visual | All, up to 100 | If >100, the nearest to the centroid | Outside the family of the concept | – | –
UNED_UV_02 | Visual | 30 | Manually selected | Outside the family of the concept | – | –
UNED_UV_03 | Visual | 30 | Manually selected | Inside the family of the concept | – | –
UNED_UV_04 | Multimodal | – | – | – | Run 2 | Textual filter
UNED_UV_05 | Multimodal | – | – | – | Run 3 | Textual filter

The two multimodal runs, UNED_UV_04 and UNED_UV_05, use different visual baselines (runs UNED_UV_02 and UNED_UV_03, respectively), and the textual algorithm described in Section 2.2 then acts as a filter on the visual run. The two expansion approaches used are the expansion with the Flickr descriptions of similar images for the image descriptions, and the expansion with the user descriptions on Flickr of images annotated with the same concept for the concept descriptions. These two expansion approaches are the ones that provide the most information.

Table 2 shows the results of our submitted runs measured by the interpolated Mean Average Precision (MiAP), the Geometric interpolated Mean Average Precision (GMiAP) and the photo-based micro-F1 measure (F-ex). Our best result by MiAP, run UNED_UV_01, is at position 55 of the global result list (80 runs).

Among the configurations tested for the visual runs, our results ordered by MiAP from best to worst are UNED_UV_01, UNED_UV_02 and UNED_UV_03. This suggests that the more relevant images we have, the better the regression model performs. Both UNED_UV_01 and UNED_UV_02 outperform UNED_UV_03, meaning that it is better to select the non-relevant images from outside the categorized group.

Concerning the multimodal results, it is clear that the proposed combination of visual and textual annotation does not reach the expected performance: neither multimodal run (UNED_UV_04 and UNED_UV_05) outperforms its visual baseline on any of the evaluation measures (MiAP, GMiAP and F-ex). We think these modest results are due to the inaccuracy of the information associated with the concepts, obtained in the expansion process. It also remains to be studied whether the filtering effect of the textual information over the visual one is too restrictive. It should be pointed out that the ordering of the results by F-ex is the opposite of the ordering by MiAP; this fact will have to be analyzed in detail, query by query.
Table 2. Results of the submitted concept annotation experiments.

Run | Modality | MiAP | GMiAP | F-ex
UNED_UV_01_CLASS_IMG_NOTADJUST | Visual | 0.1020 | 0.0512 | 0.1081
UNED_UV_02_CLASS_IMG_RELEVANTSEL_NONREL_OUTSIDE | Visual | 0.0932 | 0.0475 | 0.1227
UNED_UV_03_CLASS_IMG_RELEVANTSEL_NONREL_INSIDE | Visual | 0.0873 | 0.0441 | 0.1360
UNED_UV_04_CLASS_Img_base2_TextualFilter | Multimodal | 0.0756 | 0.0376 | 0.0849
UNED_UV_05_CLASS_Img_base3_TextualFilter | Multimodal | 0.0758 | 0.0383 | 0.0864

4.2 Concept retrieval using Flickr photos

We submitted two textual and eight multimodal runs; Table 3 shows their detailed configuration. For the textual baseline, run UNED_UV_01, the content of the topic/query is first preprocessed and then used to query the image descriptions from Flickr as well as the descriptions obtained by the expansion using user descriptions on Flickr of images annotated with the same concept (see Section 2.2). For run UNED_UV_02 the query process is similar, but in addition to using the content of the topic to query the descriptions, the concept expected in the results of the query is also used. Since the expected concept is not provided with the queries, we identified it manually for each query. No expansion information has been used for the concepts in this subtask.

Table 3. Detailed information of the submitted concept retrieval experiments.

Run | Modality | TBIR baseline | CBIR regression model | Concept feature vector
UNED_UV_01_TXT_EN | Textual | – | – | –
UNED_UV_02_TXT_EN | Textual | – | – | –
UNED_UV_03_TXTIMG | Multimodal | UNED_UV_01 | Base2 | [LF]*[CF]
UNED_UV_04_TXTIMG | Multimodal | UNED_UV_01 | Base2 | [LF…CF]
UNED_UV_05_TXTIMG | Multimodal | UNED_UV_02 | Base2 | [LF]*[CF]
UNED_UV_06_TXTIMG | Multimodal | UNED_UV_02 | Base2 | [LF…CF]
UNED_UV_07_TXTIMG | Multimodal | UNED_UV_01 | Base3 | [LF]*[CF]
UNED_UV_08_TXTIMG | Multimodal | UNED_UV_01 | Base3 | [LF…CF]
UNED_UV_09_TXTIMG | Multimodal | UNED_UV_02 | Base3 | [LF]*[CF]
UNED_UV_10_TXTIMG | Multimodal | UNED_UV_02 | Base3 | [LF…CF]

The multimodal runs (runs 3 to 10) have been designed to test the behavior of the expanded feature vector, obtained as explained in Section 2.1 from the regression models trained for the annotation subtask. We used two of the regression model configurations of the concept annotation subtask, namely experiments two and three from Table 2, denoted base2 and base3 in Table 3. The extended vector F(Ij) = (x1, ..., xk, c1, ..., cm) ∈ R^(k+m) can either be used as a single vector containing the low-level and the concept features (denoted [LF…CF] in Table 3), or as two separate vectors (denoted [LF]*[CF]). In the latter scheme, two probabilities are obtained, one from the low-level features, Sx(Ij), and one from the concept features, Sc(Ij), and both are combined by their product, S(Ij) = Sx(Ij) * Sc(Ij). A sketch of both schemes is given below. All multimodal runs use the textual pre-filter algorithm, so the visual system only works over the pre-filtered sub-collection; four multimodal runs use the textual baseline (UNED_UV_01) and the other four the concept-extended textual run (UNED_UV_02). The multimodal runs merge the image and textual scores by their product (St*Si).
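The two concept-feature schemes can be sketched as follows; score, score_lf and score_cf stand in for the logistic regression relevance feedback models of Section 3.2 and are abstract placeholders assumed for illustration, not part of the authors' implementation.

```python
# Illustrative comparison of the two concept-feature schemes of Table 3.
import numpy as np

def score_single_vector(score, low_level, concept_feats):
    """[LF...CF]: one model over the concatenated (k + m) vector."""
    return score(np.concatenate([low_level, concept_feats]))

def score_two_vectors(score_lf, score_cf, low_level, concept_feats):
    """[LF]*[CF]: separate scores Sx and Sc combined by their product."""
    return score_lf(low_level) * score_cf(concept_feats)
```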
The evaluation is done according to the following measures: the overall non-interpolated MAP (MnAP), i.e. the average of the non-interpolated precisions for each concept, and the Average Precision at different cut-off values, AP@10, AP@20 and AP@100. Table 4 shows the results of our submitted runs.

Our best result, the multimodal run UNED_UV_10 (MnAP of 0.0295), is at the 20th position of the overall result list and at the ninth position among the multimodal runs, which makes our group, UNED_UV, the third group for the multimodal runs and the fourth best group in the overall results of the concept retrieval subtask.

Looking at the textual runs, our best result is obtained with UNED_UV_02_TXT_AUTO_EN (MnAP of 0.0250), which uses the information about the expected concept for query expansion. This run improves on the baseline run without concept information (UNED_UV_01_TXT_AUTO_EN, MnAP of 0.0208).

Our two best multimodal runs, UNED_UV_06 (MnAP of 0.0286) and UNED_UV_10 (MnAP of 0.0295), outperform their corresponding textual pre-filter baseline (run UNED_UV_02). The use of the expanded concept feature vector as a single vector or as two separate vectors does not make an important difference: run UNED_UV_06 uses a single vector and run UNED_UV_09 uses two separate vectors, and both obtain very similar MnAP values. A similar behavior is observed for the annotation regression model used to obtain the expanded concept features: UNED_UV_06 uses base2 and UNED_UV_09 uses base3, and the MnAP values are similar for both runs. This is consistent with the concept annotation results, in which models 2 and 3 obtain similar MiAP values (see Table 2).

Table 4. Results of the submitted concept retrieval experiments.

Run | Modality | MnAP | AP@10 | AP@20 | AP@100
UNED_UV_01_TXT_AUTO_EN | Textual | 0.0208 | 0.0032 | 0.0021 | 0.0653
UNED_UV_02_TXT_AUTO_EN | Textual | 0.0250 | 0.0004 | 0.0019 | 0.0250
Best textual (IMU) | Textual | 0.0933 | 0.0187 | 0.0338 | 0.1715
UNED_UV_03_TXTIMG | Multimodal | 0.0271 | 0.0125 | 0.0203 | 0.0813
UNED_UV_04_TXTIMG | Multimodal | 0.0271 | 0.0131 | 0.0199 | 0.0837
UNED_UV_05_TXTIMG | Multimodal | 0.0260 | 0.0121 | 0.0224 | 0.0807
UNED_UV_06_TXTIMG | Multimodal | 0.0286 | 0.0116 | 0.0223 | 0.0819
UNED_UV_07_TXTIMG | Multimodal | 0.0275 | 0.0112 | 0.0203 | 0.0859
UNED_UV_08_TXTIMG | Multimodal | 0.0275 | 0.0122 | 0.0198 | 0.0854
UNED_UV_09_TXTIMG | Multimodal | 0.0270 | 0.0104 | 0.0217 | 0.0822
UNED_UV_10_TXTIMG | Multimodal | 0.0295 | 0.0125 | 0.0206 | 0.0848
Best multimodal (MLKD) | Multimodal | 0.0702 | 0.0214 | 0.0342 | 0.1495

5 Concluding Remarks and Future Work

Our best result is obtained in the concept retrieval subtask, in the multimodal modality. This multimodal run, UNED_UV_10 (MnAP of 0.0295), is at the 20th position of the overall result list and at the ninth position of the multimodal runs list.

Regarding the textual approaches presented, we can conclude that the expansion with information about the concepts outperforms the standard retrieval process, even with a simple expansion approach such as the one presented here. We will continue exploring this research line, with some remarks. First, a better definition of each concept is desirable in order to achieve a better representation of the concept and thus a better retrieval process; in this work we use a simple TF-IDF-based representation, but more sophisticated approaches, such as divergence-based techniques, could be applied. Second, to address the lack of detailed descriptions of the concepts, we presented an expansion based on external sources. Although the results show that this technique improves the baseline results, the expansion also introduces a significant amount of noise, which lowers the precision of the first results.
For the multimodal approaches presented for the concept retrieval subtask, our combination of the textually pre-filtered list as input to the visual system outperforms the textual baseline, as had already been observed in other ImageCLEF collections, Wikipedia [6] and Medical [4]. Focusing on the visual system, the expanded concept vector outperforms the plain low-level feature vector both in the Flickr photo collection and in the Medical collection [5]. We will therefore continue working on adjusting the best configuration for the regression models, since no definitive conclusions about the best configuration can be drawn from the present work.

The results obtained in the concept annotation subtask have not been as good as those obtained in the retrieval subtask: in this first participation our best result is at the 56th position out of 80 runs. This is due to the fact that both our textual and our visual approaches are retrieval approaches adapted to a classification task. Nevertheless, the regression model system proposed as a multi-label classifier for the concept annotation subtask will be studied in depth. The multimodal approaches do not outperform the visual baseline, so they will also be redefined: we think the textual filter has been too strict, and a more relaxed combination of both confidence scores, textual and visual, should yield better multimodal annotation results.

Acknowledgments. This work has been partially supported by the Regional Government of Madrid under the Research Network MA2VIRMR (S2009/TIC-1542) and by the Spanish Government through the projects BUSCAMEDIA (CEN-20091026), HOLOPEDIA (TIN 2010-21128-C02) and MCYT TEC2009-12980.

6 References

1. Ayala, G., Domingo, J.: Spatial Size Distributions. Applications to Shape and Texture Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, No. 12, pp. 1430-1442 (2001).
2. Thomee, B., Popescu, A.: Overview of the ImageCLEF 2012 Flickr Photo Annotation and Retrieval Task. In: CLEF 2012 Working Notes, Rome, Italy (2012).
3. Benavent, J., Benavent, X., de Ves, E.: Recuperación de información visual utilizando descriptores conceptuales. In: Proceedings of the Conferencia Española de Recuperación de Información (CERI 2012), Valencia (2012).
4. Castellanos, A., Benavent, X., Benavent, J., García-Serrano, A.: UNED-UV at the Medical Retrieval Task of ImageCLEF 2011. In: CLEF 2011 Working Notes (2011).
5. Castellanos, A., Benavent, J., Benavent, X., García-Serrano, A., de Ves, E.: Using Visual Concept Features in a Multimodal Retrieval System for the Medical Collection at ImageCLEF 2012. In: CLEF 2012 Working Notes, Rome, Italy (2012).
6. Granados, R., Benavent, J., Benavent, X., de Ves, E., García-Serrano, A.: Multimodal Information Approaches for the Wikipedia Collection at ImageCLEF 2011. In: CLEF 2011 Working Notes (2011).
7. Izawa, R., Motohashi, N., Takagi, T.: Annotation and Retrieval System Using Confabulation Model for ImageCLEF 2011 Photo Annotation. In: CLEF 2011 Working Notes (2011).
8. Ksibi, A., Ammar, A.B., Amar, C.B.: REGIMvid at ImageCLEF 2011: Integrating Contextual Information to Enhance Photo Annotation and Concept-based Retrieval. In: CLEF 2011 Working Notes (2011).
9. Leon, T., Zuccarello, P., Ayala, G., de Ves, E., Domingo, J.: Applying Logistic Regression to Relevance Feedback in Image Retrieval Systems. Pattern Recognition, Vol. 40, p. 2621 (2007).
10. Nowak, S., Nagel, K., Liebetrau, J.: The CLEF 2011 Photo Annotation and Concept-based Retrieval Tasks. In: CLEF 2011 Working Notes (2011).
11. Spyromitros-Xious, E., Sechidis, K., Tsoumakas, G., Vlahavas, I.: MLKD's Participation at the CLEF 2011 Photo Annotation and Concept-Based Retrieval Tasks. In: CLEF 2011 Working Notes (2011).