=Paper=
{{Paper
|id=Vol-1177/CLEF2011wn-ImageCLEF-Spyromitros-XioufisEt2011
|storemode=property
|title=MLKD's Participation at the CLEF 2011 Photo Annotation and Concept-Based Retrieval Tasks
|pdfUrl=https://ceur-ws.org/Vol-1177/CLEF2011wn-ImageCLEF-Spyromitros-XioufisEt2011.pdf
|volume=Vol-1177
}}
==MLKD's Participation at the CLEF 2011 Photo Annotation and Concept-Based Retrieval Tasks==
Eleftherios Spyromitros-Xioufis, Konstantinos Sechidis, Grigorios Tsoumakas, and Ioannis Vlahavas

Dept. of Informatics, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece
{espyromi,sechidis,greg,vlahavas}@csd.auth.gr

Abstract. We participated in both the photo annotation and the concept-based retrieval tasks of CLEF 2011. For the annotation task we developed visual, textual and multi-modal approaches using multi-label learning algorithms from the Mulan open source library. For the visual models we employed the ColorDescriptor software to extract visual features from the images using 7 descriptors and 2 detectors. For each combination of descriptor and detector a multi-label model is built using the Binary Relevance approach coupled with Random Forests as the base classifier. For the textual models we used the boolean bag-of-words representation, and applied stemming, stop word removal, and feature selection using the chi-squared-max method. The multi-label learning algorithm that yielded the best results in this case was Ensemble of Classifier Chains using Random Forests as base classifier. Our multi-modal approach was based on a hierarchical late-fusion scheme. For the concept-based retrieval task we developed two different approaches. The first one is based on the concept relevance scores produced by the system we developed for the annotation task. It is a manual approach, because for each topic we manually selected the relevant concepts and manually set the strength of their contribution to the final ranking, which is produced by a general formula that combines the concept relevance scores. The second approach is based solely on the sample images provided for each query and is therefore fully automated. In this approach only the textual information was used in a query-by-example framework.

1 Introduction

ImageCLEF is the cross-language image retrieval track run annually since 2003 as part of the Cross Language Evaluation Forum (CLEF, http://www.clef-campaign.org/). This paper documents the participation of the Machine Learning and Knowledge Discovery (MLKD) group of the Department of Informatics of the Aristotle University of Thessaloniki in the photo annotation task (also called visual concept detection and annotation task) of ImageCLEF 2011.

This year, the photo annotation task consisted of two subtasks: an annotation task, similar to that of ImageCLEF 2010, and a new concept-based retrieval task. Data for both tasks come from the MIRFLICKR-1M image dataset [1], which apart from the image files contains Flickr user tags and Exchangeable Image File Format (Exif) information. More information about the exact setup of the data can be found in [4].

In the annotation task, participants are asked to annotate a test set of 10,000 images with 99 visual concepts. An annotated training set of 8,000 images is provided. This multi-label learning task [8] can be solved in three different ways according to the type of information used for learning: 1) visual (the image files), 2) textual (Flickr user tags), 3) multi-modal (visual and textual information). We developed visual, textual and multi-modal approaches for this task using multi-label learning algorithms from the Mulan open source library [9].
In this task, the relative performance of our textual models was quite good, but that of our visual models was poor (our group does not have expertise in computer vision), leading to an average multi-modal (and overall) performance.

In the concept-based retrieval task, participants were given 40 topics consisting of logical connections between the 99 concepts of the photo annotation task, such as "find all images that depict a small group of persons in a landscape scenery showing trees and a river on a sunny day", along with 2 to 5 example images of each topic from the training set of the annotation task. Participants were asked to submit (up to) the 1,000 most relevant photos for each topic in ranked order from a set of 200,000 unannotated images. This task can be solved by manual construction of the query out of the narrative of the topics, followed by automatic retrieval of images, or by a fully automated process. We developed a manual approach that exploits the multi-label models trained in the annotation task and a fully automated query-by-example approach based on the tags of the example images. In this task, both our manual and automated approaches ranked 1st in all evaluation measures by a large margin.

The rest of this paper is organized as follows. Sections 2 and 3 describe our approaches to the annotation task and the concept-based retrieval task respectively. Section 4 presents the results of our runs for both tasks. Section 5 concludes our work and poses future research directions.

2 Annotation Task

This section presents the visual, textual and multi-modal approaches that we developed for the automatic photo annotation task. There were two (eventually three) evaluation measures to consider for this task: a) mean interpolated average precision (MIAP), b) example-based F-measure (F-ex), c) semantic R-precision (SR-Precision). In order to optimize a learning approach for each of the initial two evaluation measures and each type of information, six models should be built. However, only five runs were allowed for this task. We therefore decided to perform model selection based on the widely-used mean average precision (MAP) measure for all types of information. In particular, MAP was estimated using an internal 3-fold cross-validation on the 8,000 training images. Our multi-modal approach was submitted in three different variations to reach the total number of five submissions.

2.1 Automatic Annotation with Visual Information

We here describe the approach that we followed in order to learn multi-label models using the visual information of the images. The flowchart of this approach is shown in Fig. 1.

Fig. 1. Automatic annotation using visual information.

As our group does not have expertise in computer vision, we largely followed the color-descriptor extraction approach described in [6,7] and used the accompanying software tool (available from http://www.colordescriptors.com) for extracting visual features from the images. Harris-Laplace and Dense Sampling were used as point detection strategies. Furthermore, seven different descriptors were used: SIFT, HSV-SIFT, HueSIFT, OpponentSIFT, C-SIFT, rgSIFT and RGB-SIFT. For each one of the 14 combinations of point detection strategy and descriptor, a different codebook was created in order to obtain a fixed-length representation for all images. This is also known as the bag-of-words approach. The k-means clustering algorithm was applied to 250,000 randomly sampled points from the training set, with the codebook size (k) fixed to 4,096 words. Finally, we employed hard assignment of points to clusters; a sketch of this bag-of-words step follows.
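As a rough illustration of this feature extraction step, the following Python sketch builds a codebook with k-means and hard-assigns the descriptors of an image to their nearest visual words. It is not the pipeline we actually ran (descriptor extraction was done with the ColorDescriptor tool); the use of scikit-learn, the random descriptor matrices, and the reduced values of k and of the number of sampled points are assumptions made only so that the example runs quickly.

```python
# Illustrative sketch (not the authors' code): bag-of-visual-words with hard
# assignment, as described in Sect. 2.1. The paper used k = 4096 visual words
# and 250,000 sampled descriptor points; smaller values are used here so the
# example runs in seconds.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(sampled_descriptors, k):
    """Cluster randomly sampled descriptor points into k visual words."""
    return KMeans(n_clusters=k, n_init=1, random_state=0).fit(sampled_descriptors)

def bow_vector(image_descriptors, codebook):
    """Hard-assign each descriptor to its nearest codeword and count occurrences."""
    words = codebook.predict(image_descriptors)
    return np.bincount(words, minlength=codebook.n_clusters).astype(float)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sampled = rng.normal(size=(2000, 128))    # stands in for 250,000 sampled points
    codebook = build_codebook(sampled, k=16)  # paper: k = 4096
    image = rng.normal(size=(300, 128))       # descriptors of a single image
    print(bow_vector(image, codebook).shape)  # one fixed-length vector per image
```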
Using these 4,096-dimensional vector representations along with the ground-truth annotations given for the training images, we built 14 multi-label training datasets. After experimenting with various multi-label learning algorithms, we found that the simple Binary Relevance (BR) approach coupled with Random Forests as the base classifier (number of trees = 150, number of features = 40) yielded the best results. In order to deal with the imbalance in the number of positive and negative examples of each label we used instance weighting: the weight of the examples of the minority class was set to (min + maj)/min and the weight of the examples of the majority class was set to (min + maj)/maj, where min is the number of examples of the minority class and maj the number of examples of the majority class. We also experimented with sub-sampling, but the results were worse than with instance weighting. Our approach concludes with a late fusion scheme that averages the output of the 14 different multi-label models that we built. A sketch of the per-label training scheme follows.
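The following sketch outlines the per-label training scheme described above. Our actual runs used the Mulan library with Weka's Random Forests; the scikit-learn classes, the function names, and the guard for degenerate labels below are illustrative assumptions rather than the original implementation.

```python
# Illustrative sketch (the paper's implementation used Mulan/Weka, not
# scikit-learn): Binary Relevance with Random Forests and the class-imbalance
# instance weighting described above. Assumes every concept has at least one
# positive and one negative training example.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def imbalance_weights(y_label):
    """Weight minority examples by (min+maj)/min and majority ones by (min+maj)/maj."""
    pos = int(np.sum(y_label == 1))
    neg = int(np.sum(y_label == 0))
    minority, majority = max(min(pos, neg), 1), max(pos, neg)  # guard against empty class
    minority_value = 1 if pos <= neg else 0
    return np.where(y_label == minority_value,
                    (minority + majority) / minority,
                    (minority + majority) / majority)

def train_binary_relevance(X, Y, n_trees=150, n_features=40):
    """One weighted Random Forest per concept (Binary Relevance)."""
    models = []
    for j in range(Y.shape[1]):
        clf = RandomForestClassifier(n_estimators=n_trees, max_features=n_features)
        clf.fit(X, Y[:, j], sample_weight=imbalance_weights(Y[:, j]))
        models.append(clf)
    return models

def predict_scores(models, X):
    """Per-concept relevance scores p(c_j | x_i), one column per concept."""
    return np.column_stack([m.predict_proba(X)[:, 1] for m in models])
```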
2.2 Automatic Annotation with Flickr User Tags

We here describe the approach that we followed in order to learn multi-label models using the tags assigned to images by Flickr users. The flowchart of this approach is shown in Fig. 2.

Fig. 2. Automatic annotation using Flickr user tags (preprocessing reduces the initial 27,323 features to 4,000).

An initial vocabulary was constructed by taking the union of the tag sets of all images in the training set. We then applied stemming to this vocabulary and removed stop words. This led to a vocabulary of approximately 27,000 stems. The use of stemming improved the results, despite the fact that some of the tags were not in English and that we used an English stemmer. We further applied feature selection in order to remove irrelevant or redundant features and improve efficiency. In particular, we used the χ²max criterion [3] to score the stems and selected the top 4,000 stems, after experimenting with a variety of sizes (500, 1000, 2000, 3000, 4000, 5000, 6000 and 7000).

The multi-label learning algorithm that was found to yield the best results in this case was Ensemble of Classifier Chains (ECC) [5] using Random Forests as base classifier. ECC was run with 15 classifier chains and Random Forests with 10 decision trees, while all other parameters were left at their default values. The approach that we followed to deal with class imbalance in the case of visual information (see the previous subsection) was followed in this case too. A sketch of the feature selection step follows.
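A minimal sketch of the χ²max scoring step, assuming a boolean stem matrix X and a 0/1 concept matrix Y; the feature selection in our system was not run through scikit-learn, so the functions below are only meant to make the criterion concrete.

```python
# Illustrative sketch (not the authors' code): chi-squared-max feature scoring
# for the boolean bag-of-words representation of Sect. 2.2. Each stem is scored
# against every concept and keeps its maximum chi-squared score; the
# top-scoring stems are then selected (the paper kept 4,000 out of ~27,000).
import numpy as np
from sklearn.feature_selection import chi2

def chi2_max_scores(X, Y):
    """X: (n_images, n_stems) boolean matrix, Y: (n_images, n_concepts) 0/1 matrix."""
    scores = np.zeros(X.shape[1])
    for j in range(Y.shape[1]):
        chi2_j, _ = chi2(X, Y[:, j])                # chi-squared of every stem vs. concept j
        scores = np.maximum(scores, np.nan_to_num(chi2_j))
    return scores

def select_top_stems(X, Y, k=4000):
    """Indices of the k stems with the highest chi-squared-max score."""
    return np.argsort(chi2_max_scores(X, Y))[::-1][:k]
```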
2.3 Automatic Annotation with a Multi-Modal Approach

Our multi-modal approach is based on a late fusion scheme that combines the output of the 14 visual models and the single textual model. The combination is not a plain average of these 15 models, because in that case the visual models would dominate the final scores. Instead, we follow a hierarchical combination scheme: we separately average the 7 visual models of each point detection strategy and then combine the output of the textual model, the Harris-Laplace average and the Dense Sampling average, as depicted in Fig. 3. The motivation for this scheme was the three different views of the images that existed in the data (Harris-Laplace, Dense Sampling, user tags), as explained in the following two paragraphs.

Fig. 3. Automatic annotation with a multi-modal approach.

We can discern two main categories of concepts in photo annotation: objects and scenes. For objects, Harris-Laplace performs better because it ignores the homogeneous areas, while for scenes, Dense Sampling performs better [6]. For example, two of the concepts where Dense Sampling achieves much higher Average Precision (AP) than Harris-Laplace are Night and Macro, which are abstract, while the inverse holds for the concepts Fish and Ship, which correspond to things (organisms, objects) of particular shape.

Furthermore, we observe that the visual approach performs better on concepts, such as Sky, which for some reason (e.g. lack of user interest in retrieval by this concept) do not get tagged. On the other hand, the textual approach performs much better when it has to predict concepts, such as Horse, Insect, Dog and Baby, that typically get tagged by users. Table 1 shows the average precision for 10 concepts, half of which suit the textual models much better and half the visual models.

Table 1. Average precision for 10 concepts, half of which suit the textual models much better (top half) and half the visual models (bottom half).

Concept      Textual  Visual
Airplane     0.6942   0.0946
Horse        0.5477   0.0541
Bird         0.5260   0.1275
Insect       0.5087   0.1241
Dog          0.6190   0.2406
Trees        0.3004   0.5501
Clouds       0.4744   0.6949
Sky          0.6021   0.7945
Overexposed  0.0183   0.1937
Big Group    0.1510   0.3245

Two variations of this scheme were developed, differing in how the output of the three different views is combined. The first one, named Multi-Modal-Avg, used an averaging operator, similar to the one used at the lower levels of the hierarchy. The second one, named Multi-Modal-MaxAP, used an arbitrator function to select the best of the three outputs for each concept, according to internal evaluation results in terms of average precision. Our third multi-modal submission, named Multi-Modal-MaxAP-RGBSIFT, was a preliminary version of Multi-Modal-MaxAP, where only the RGB-SIFT descriptor was used. A sketch of the two fusion variants is given below.
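The sketch below illustrates the two fusion variants, assuming each model's output is an (images × 99) score matrix; the arbitrator uses per-concept average precision values from internal validation, and the array name view_ap is ours, introduced purely for illustration.

```python
# Illustrative sketch (not the authors' code) of the hierarchical late fusion
# of Sect. 2.3. All score matrices have shape (n_images, 99).
import numpy as np

def fuse_average(score_matrices):
    """Plain averaging, as used at the lower levels and in Multi-Modal-Avg."""
    return np.mean(score_matrices, axis=0)

def multi_modal(harris_models, dense_models, textual, variant="avg", view_ap=None):
    """Average the 7 models of each detector, then either average the three
    views (Multi-Modal-Avg) or, per concept, keep the view with the best
    internal average precision (Multi-Modal-MaxAP)."""
    views = [fuse_average(harris_models), fuse_average(dense_models), textual]
    if variant == "avg":
        return fuse_average(views)
    # view_ap: (3, 99) internal average precision of each view for each concept
    stacked = np.stack(views)                  # (3, n_images, n_concepts)
    best = np.argmax(view_ap, axis=0)          # winning view index per concept
    fused = np.empty_like(textual)
    for c in range(stacked.shape[2]):
        fused[:, c] = stacked[best[c], :, c]
    return fused
```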
2.4 Thresholding

The multi-label learners used in this work provide us with a confidence score for each concept. This is fine for an evaluation with MIAP and SR-Precision, but does not suffice for an evaluation with the example-based F-measure, which requires a bipartition of the concepts into relevant and irrelevant ones. This is a typical issue in multi-label learning, which is dealt with via a thresholding process [2]. We used the thresholding method described in [5], which applies a common threshold across all concepts, chosen so that the label cardinality (LC) of the predictions made on the test set closely approximates the LC of the training set. The threshold is calculated using the following formula:

t = \arg\min_{t \in \{0.00, 0.05, \ldots, 1.00\}} \left| LC(D_{train}) - LC(H_t(D_{test})) \right|   (1)

where D_{train} is the training set and H_t is a classifier that has made predictions on a test set D_{test} under threshold t. A small sketch of this selection procedure follows.
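As a small illustration of Eq. (1), the following sketch (pure NumPy, not the Mulan code we actually used) assumes a 0/1 matrix of training annotations and a matrix of test-set confidence scores.

```python
# Illustrative sketch of the threshold selection rule in Eq. (1): pick the
# common threshold in {0.00, 0.05, ..., 1.00} that brings the label cardinality
# of the test-set predictions closest to that of the training set.
import numpy as np

def label_cardinality(Y):
    """Average number of relevant concepts per example."""
    return np.mean(np.sum(Y, axis=1))

def select_threshold(train_labels, test_scores):
    """train_labels: 0/1 matrix of training annotations;
    test_scores: confidence scores p(c_j | x_i) on the test set."""
    target_lc = label_cardinality(train_labels)
    candidates = np.arange(0.0, 1.0001, 0.05)
    gaps = [abs(target_lc - label_cardinality(test_scores >= t)) for t in candidates]
    return candidates[int(np.argmin(gaps))]
```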
3 Concept-Based Retrieval Task

We developed two different approaches for the concept-based retrieval task. The first one is based on the concept relevance scores produced by the system we developed for the annotation task. It is a manual approach, because for each topic we manually selected the relevant concepts and manually set the strength of their contribution to the final ranking, which is produced by a general formula that combines the concept relevance scores. The second one is based solely on the sample images provided for each query and is therefore fully automated.

3.1 Manual Approach

Let I = {1, ..., 200,000} be the collection of images, Q = {1, ..., 40} the set of topics and C = {1, ..., 99} the set of concepts. We first apply our automated image annotation system to each image i ∈ I and obtain a corresponding 99-dimensional vector S_i = [s_i^1, s_i^2, ..., s_i^99] with the relevance scores of this image for each one of the 99 concepts. For efficiency reasons, we used simplified versions of our visual approach, taking into account only models produced with the RGB-SIFT descriptor, which has been found in the past to provide better results compared to other single color descriptors [7].

Then, based on the description of each of the 40 queries, we manually select a number of concepts that we consider related to the query, either positively or negatively. Formally, for topic q ∈ Q let P_q ⊆ C denote the set of concepts that are positively related to q and N_q ⊆ C the set of concepts that are negatively related to q, with P_q ∩ N_q = ∅. For each concept c in P_q ∪ N_q, we further define a real-valued parameter m_q^c ≥ 1 denoting the strength of relevance of concept c to q. The larger this value, the stronger the influence of concept c on the final relevance score. For each topic q and image i, the scores of the relevant concepts are combined using (2):

S_{q,i} = \prod_{c \in P_q} (s_i^c)^{m_q^c} \prod_{c \in N_q} (1 - s_i^c)^{m_q^c}   (2)

Finally, for each topic, we arrange the images in descending order of the overall relevance score and retrieve a fixed number of images (in our submissions we retrieved 250 and 1,000 images). Note that for each topic, the selection of related concepts and the setting of values for the m_q^c parameters was done using a trial-and-error approach involving careful visual examination of the top 10 retrieved images, as well as a more relaxed visual examination of the top 100 retrieved images. Two examples of topics and the corresponding combination of scores follow.

Topic 5: rider on horse. Here we like to find photos of riders on a horse. So no sculptures or paintings are relevant. The rider and horse can be also only in parts on the photo. It is important that the person is riding a horse and not standing next to it. Based on the description of this topic and experimentation, we concluded that concepts 75 (Horse) and 8 (Sports) are positively related (rider on horse), while concept 63 (Visual Arts) is negatively related (no sculptures or paintings). We therefore set P_5 = {8, 75}, N_5 = {63}. All concepts were set to equal strength for this topic: m_{8,5} = m_{63,5} = m_{75,5} = 1.

Topic 24: funny baby. We like to find photos of babies looking funny. The baby should be in the main focus of the photo and be the reason why the photo looks funny. Photos presenting funny things that are not related to the baby are not relevant. Based on the description of this topic and experimentation, we concluded that concepts 86 (Baby), 92 (Funny) and 32 (Portrait) are positively related. We therefore set P_24 = {32, 86, 92}, N_24 = ∅. Based on experimentation, the concept Funny was given twice the strength of the other concepts: we set m_{32,24} = m_{86,24} = 1 and m_{92,24} = 2.

For some topics, instead of explicitly using the scores of a group of interrelated concepts, we considered introducing a virtual concept with score equal to the maximum of this group of concepts. This slight adaptation of the general rule of (2) enhances its representation capabilities. The following example clarifies this adaptation.

Topic 32: underexposed photos of animals. We like to find photos of animals that are underexposed. Photos with normal illumination are not relevant. The animal(s) should be more or less in the main focus of the image. Based on the description of this topic and experimentation, we concluded that concepts 44 (Animals), 34 (Underexposed), 72 (Dog), 73 (Cat), 74 (Bird), 75 (Horse), 76 (Fish) and 77 (Insect) are positively related, while concept 35 (Neutral Illumination) is negatively related. The six specific animal concepts were grouped into a virtual concept, say concept 1001, with score equal to the maximum of the scores of these six concepts. We then set P_32 = {34, 44, 1001}, N_32 = {35} and m_{34,32} = m_{44,32} = m_{1001,32} = m_{35,32} = 1.

Figure 4 shows the top 10 retrieved images for topics 5, 24 and 32, along with the Precision@10 for these topics. A sketch of the scoring rule of (2), including the virtual-concept adaptation, follows.
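The following sketch makes the scoring rule of (2) concrete, using the concept sets and strengths of topic 24 as an example; the random score matrix and the 0-based column indices are illustrative assumptions (concept IDs in the text are 1-based).

```python
# Illustrative sketch (not the authors' code) of the manual ranking rule of
# Eq. (2), including the optional "virtual concept" whose score is the maximum
# over a group of concepts (as used for topic 32).
import numpy as np

def topic_scores(S, positive, negative, strength):
    """S: (n_images, n_concepts) relevance scores in [0, 1];
    positive/negative: lists of concept column indices; strength: index -> m."""
    scores = np.ones(S.shape[0])
    for c in positive:
        scores *= S[:, c] ** strength.get(c, 1.0)
    for c in negative:
        scores *= (1.0 - S[:, c]) ** strength.get(c, 1.0)
    return scores

def add_virtual_concept(S, group):
    """Append a column holding the maximum score over a group of concepts."""
    return np.hstack([S, S[:, group].max(axis=1, keepdims=True)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S = rng.random((1000, 99))                     # scores from the annotation system
    # topic 24 (funny baby): P = {32, 86, 92}, N = {}, m_92 = 2, others 1
    s24 = topic_scores(S, positive=[31, 85, 91], negative=[], strength={91: 2.0})
    print(np.argsort(s24)[::-1][:10])              # 10 highest-ranked images
```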
3.2 Automated Approach

Apart from the narrative description, each topic of the concept-based retrieval task was accompanied by a set of 2 to 5 images from the training set which could be considered relevant for the topic. Using these example images as queries, we developed a query-by-example approach to find the most relevant images in the retrieval set. The representation followed the bag-of-words model and was based on the Flickr user tags assigned to each image. To generate the feature vectors, we applied the same method as the one used for the annotation task. Thus, each image was represented as a 4,000-dimensional feature vector where each feature corresponds to a tag from the training set that was selected by the feature selection method. A value of 1/0 denotes the presence/absence of the tag in the tags accompanying an image.

To measure the similarity between the vectors representing two images we used the Jaccard similarity coefficient, which is defined as the number of attributes where two vectors A and B both have a value of 1, divided by the number of attributes where either A or B has a value of 1. Since more than one image was given as an example for each topic, we added their feature vectors in order to form a single query vector. This approach was found to work well in comparison to other approaches, such as taking only one of the example images as the query, or measuring the similarity between a retrieval image and each example image separately and then returning the images from the retrieval set with the largest similarity score to any of the queries. We attribute this to the fact that by adding the feature vectors, a better representation of the topic of interest was created, which would not be possible if only one image (with possibly noisy or very few tags) was considered.

As in the manual approach, we submitted two runs, one returning the 250 and one the 1,000 most similar images from the retrieval set (in descending similarity order). Figure 5 shows the top 10 retrieved images, along with the Precision@10, for the following topics:

– Topic 10: single person playing a musical instrument. We like to find pictures (no paintings) of a person playing a musical instrument. The person can be on stage, off stage, inside or outside, sitting or standing, but should be alone on the photo. It is enough if not the whole person or instrument is shown, as long as the person and the instrument are clearly recognizable.
– Topic 12: snowy winter landscape. We like to find pictures (photos or drawings) of white winter landscapes with trees. The landscape should not contain human-made objects, e.g. houses, cars and persons. Only snow on the top of a mountain is not relevant; the landscape has to be fully covered in (at least light) snow.
– Topic 30: cute toys arranged to a still-life. We like to find photos of toys arranged to a still-life. These toys should look cute in the arrangement. Simple photos of a collection of toys, e.g. in a shop, are not relevant.

We see that the 10 retrieved images for topic 30 are better than those of topics 12 and 10. This can be explained by noticing that topic 12 is a difficult one, while the tags of the example images for topic 10 are not very descriptive/informative. A sketch of the query-by-example scheme follows.
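The sketch below illustrates this query-by-example scheme; treating the summed query vector as the union of the example tag sets (a clipped sum) is our reading of the combination step, and the function names are introduced only for illustration.

```python
# Illustrative sketch (not the authors' code) of the automated retrieval in
# Sect. 3.2: the boolean tag vectors of the 2-5 example images are combined
# into one query vector and retrieval-set images are ranked by Jaccard
# similarity to that query.
import numpy as np

def jaccard(a, b):
    """|A AND B| / |A OR B| for boolean vectors (0 when both are empty)."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def rank_by_example(example_vectors, retrieval_matrix, top_k=1000):
    """example_vectors: (n_examples, 4000) 0/1; retrieval_matrix: (n_images, 4000) 0/1."""
    query = np.clip(np.sum(example_vectors, axis=0), 0, 1)   # union of example tags
    sims = np.array([jaccard(query, img) for img in retrieval_matrix])
    return np.argsort(sims)[::-1][:top_k]                    # most similar first
```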
4 Results

We here briefly present our results, as well as our relative performance compared to other groups and submissions. Results for all groups, as well as more details on the data setup and evaluation measures, can be found in [4].

4.1 Annotation Task

The official results of our runs are shown in Table 2. We notice that in terms of MIAP, the textual model is slightly better than the visual one, while for the other two measures the visual model is much better than the textual one. Among the multi-modal variations, we notice that averaging works better than arbitrating and, as expected, using all descriptors is better than using just RGB-SIFT. In addition, we notice that the multi-modal approach significantly improves over the MIAP of the visual and textual approaches, while it slightly decreases/increases the performance of the visual model in the two example-based measures. This may partly be due to the fact that we performed model selection based on MAP.

Table 2. Official results of the MLKD team in the annotation task.

Run Name                   MIAP    F-measure  SR-Precision
Textual                    0.3256  0.5061     0.6527
Visual                     0.3114  0.5595     0.6981
Multi-Modal-Avg            0.4016  0.5588     0.6982
Multi-Modal-MaxAP-RGBSIFT  0.3489  0.5094     0.6687
Multi-Modal-MaxAP          0.3589  0.5165     0.6709

Table 3 shows the rank of our best result compared to the best results of other groups and compared to all submissions. We did quite well in terms of textual information, but quite badly in terms of visual information, leading to an overall average performance. Lack of computer vision expertise in our group may be a reason for not being able to get better results out of the visual information. Among the three evaluation measures, we notice that overall we did better in terms of MIAP, slightly worse in terms of F-measure, and even worse in terms of SR-Precision. The fact that model selection was performed based on MAP definitely played a role in this result.

Table 3. Rank of our best result compared to the best results of other teams (Team Rank) and compared to all submissions (Submission Rank) in the annotation task.

Approach     | Team Rank: MIAP  F-Measure  SR-Prec | Submission Rank: MIAP  F-Measure  SR-Prec
Visual       | 9th/15   5th/15   9th/15            | 25th/46  12th/46  17th/46
Textual      | 3rd/7    2nd/7    3rd/7             | 3rd/8    2nd/8    4th/8
Multi-modal  | 5th/10   5th/10   7th/10            | 9th/25   7th/25   15th/25
All          | 5th/18   7th/18   10th/18           | 9th/79   19th/79  31st/79

4.2 Concept-Based Retrieval Task

In this task, participating systems were evaluated using the following measures: Mean Average Precision (MAP), Precision@10, Precision@20, Precision@100 and R-Precision. The official results of our runs are shown in Table 4. We first notice that the first 5 runs, which retrieved 1,000 images, lead to better results in terms of MAP and R-Precision compared to the last 5 runs, which retrieved 250 images. Obviously, in terms of Precision@10, Precision@20 and Precision@100 the results are equal. Among the manual runs, we notice that the visual models perform quite badly. We hypothesize that many concepts that favor textual rather than visual models, as discussed in Sect. 2, appear in most of the topics. The textual and multi-modal models perform best, with the Multi-Modal-Avg model having the best result in 3 out of the 5 measures. The automated approach performs slightly better than the visual model of the manual approach, but still much worse than the textual and multi-modal manual approaches. As expected, the knowledge that is provided by a human can clearly lead to better results compared to a fully automated process. However, this is not true across all topics, as can be seen in Table 5, which compares the results of the best automated and the best manual approach for each individual topic. We can see there that the automated approach performs better on 9 topics, while the manual one performs better on 31.

Table 4. Official results of the MLKD team in the concept-based retrieval task.

Run Name                               MAP     P@10    P@20    P@100   R-Prec
Manual-Visual-RGBSIFT-1000             0.0361  0.1525  0.1375  0.1080  0.0883
Automated-Textual-1000                 0.0849  0.3000  0.2800  0.2188  0.1530
Manual-Textual-1000                    0.1546  0.4100  0.3838  0.3102  0.2366
Manual-Multi-Modal-Avg-RGBSIFT-1000    0.1640  0.3900  0.3700  0.3180  0.2467
Manual-Multi-Modal-MaxAP-RGBSIFT-1000  0.1533  0.4175  0.3725  0.2980  0.2332
Manual-Visual-RGBSIFT-250              0.0295  0.1525  0.1375  0.1080  0.0863
Automated-Textual-250                  0.0708  0.3000  0.2800  0.2188  0.1486
Manual-Textual-250                     0.1328  0.4100  0.3838  0.3102  0.2298
Manual-Multi-Modal-Avg-RGBSIFT-250     0.1346  0.3900  0.3700  0.3180  0.2397
Manual-Multi-Modal-MaxAP-RGBSIFT-250   0.1312  0.4175  0.3725  0.2980  0.2260

Table 5. Comparison of AP for each topic between the automated and the manual approach.

Topic  Automated  Manual  |  Topic  Automated  Manual
1      0.2350     0.2201  |  21     0.0799     0.0312
2      0.0294     0.1518  |  22     0.0000     0.1018
3      0.0893     0.0613  |  23     0.0405     0.0617
4      0.2570     0.3701  |  24     0.0231     0.1226
5      0.0011     0.5478  |  25     0.0090     0.1691
6      0.1200     0.3574  |  26     0.0027     0.0056
7      0.0142     0.2164  |  27     0.0477     0.1311
8      0.0864     0.0879  |  28     0.0123     0.1315
9      0.0001     0.1143  |  29     0.0232     0.1180
10     0.1618     0.2528  |  30     0.1350     0.0378
11     0.1393     0.3133  |  31     0.0794     0.1535
12     0.0519     0.0734  |  32     0.0221     0.1135
13     0.0275     0.1516  |  33     0.0343     0.4340
14     0.0087     0.0968  |  34     0.4464     0.4341
15     0.0455     0.3327  |  35     0.3065     0.3685
16     0.0711     0.0715  |  36     0.0001     0.1426
17     0.0349     0.0401  |  37     0.2232     0.0207
18     0.0011     0.0044  |  38     0.2431     0.1477
19     0.0379     0.0691  |  39     0.0226     0.0153
20     0.1837     0.1168  |  40     0.0508     0.1703
MAP    0.0849     0.1640

Table 6 shows the rank of our best result compared to the best results of other groups and compared to all submissions. Both our manual and our automated approach ranked 1st in all evaluation measures.

Table 6. Rank of our best result compared to the best results of other teams and compared to all submissions in the concept-based retrieval task.

Configuration  | Team Rank (MAP, P@10, P@20, P@100, R-Prec) | Submission Rank (MAP, P@10, P@20, P@100, R-Prec)
Automated      | 1st/4 in all measures                      | 1st/16 in all measures
Manual         | 1st/3 in all measures                      | 1st/15 in all measures
All            | 1st/4 in all measures                      | 1st/31 in all measures
5 Conclusions and Future Work

Our participation in the very interesting photo annotation and concept-based retrieval tasks of CLEF 2011 led to a couple of interesting conclusions. First of all, we found out that we need the collaboration of a computer vision/image processing group to achieve better results. In terms of multi-label learning algorithms, we noticed that binary approaches worked quite well, especially when coupled with the strong Random Forests algorithm and when class imbalance issues are taken into account. We also reached the conclusion that we should have performed model selection separately for each evaluation measure. We therefore suggest that in future versions of the annotation task, the allowed number of submissions should be equal to the number of evaluation measures multiplied by the number of information types, so that there is space in the official results for models optimized for each measure with each kind of information.

There is a lot of room for improvement in the future, both in the annotation task and in the very interesting concept-based retrieval task. In terms of textual information, we intend to investigate the translation of non-English tags. We would also like to investigate other hierarchical late fusion schemes, such as an additional averaging step for the two different visual modalities (Harris-Laplace, Dense Sampling) and more advanced arbitration techniques. Exploring other thresholding approaches for obtaining bipartitions is another interesting direction for future study.

Acknowledgments. We would like to acknowledge the student travel support from EU FP7 under grant agreement no 216444 (PetaMedia Network of Excellence).

References

1. Huiskes, M.J., Thomee, B., Lew, M.S.: New trends and ideas in visual concept detection: The MIR Flickr retrieval evaluation initiative. In: MIR '10: Proceedings of the 2010 ACM International Conference on Multimedia Information Retrieval, pp. 527–536. ACM, New York, NY, USA (2010)
2. Ioannou, M., Sakkas, G., Tsoumakas, G., Vlahavas, I.: Obtaining bipartitions from score vectors for multi-label classification. In: IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2010), vol. 1, pp. 409–416 (2010)
3. Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research 5, 361–397 (2004)
4. Nowak, S., Nagel, K., Liebetrau, J.: The CLEF 2011 photo annotation and concept-based retrieval tasks. In: Working Notes of CLEF 2011 (2011)
5. Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label classification. In: Proc. 20th European Conference on Machine Learning (ECML 2009), pp. 254–269 (2009)
6. van de Sande, K.E.A., Gevers, T.: University of Amsterdam at the visual concept detection and annotation tasks. In: ImageCLEF, The Information Retrieval Series, vol. 32, chap. 18, pp. 343–358. Springer (2010)
7. van de Sande, K.E.A., Gevers, T., Snoek, C.G.M.: Evaluating color descriptors for object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(9), 1582–1596 (2010)
8. Tsoumakas, G., Katakis, I., Vlahavas, I.: Mining multi-label data. In: Maimon, O., Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook, 2nd edn., chap. 34, pp. 667–685. Springer (2010)
9. Tsoumakas, G., Spyromitros-Xioufis, E., Vilcek, J., Vlahavas, I.: Mulan: A Java library for multi-label learning. Journal of Machine Learning Research 12, 2411–2414 (2011)

Fig. 4. Top 10 retrieved images for topics 5, 24 and 32 using manual retrieval (P@10 = 0.8, 0.2 and 0.5 respectively). Images come from the MIRFLICKR-1M image dataset [1].

Fig. 5. Top 10 retrieved images for topics 10, 12 and 30 using automated retrieval (P@10 = 0.3, 0.2 and 1.0 respectively). Images come from the MIRFLICKR-1M image dataset [1].