<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MLKD's Participation at the CLEF 2011 Photo Annotation and Concept-Based Retrieval Tasks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eleftherios Spyromitros-Xioufis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Konstantinos Sechidis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Grigorios Tsoumakas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ioannis Vlahavas</string-name>
          <email>vlahavasg@csd.auth.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki 54124</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We participated in both the photo annotation and concept-based retrieval tasks of CLEF 2011. For the annotation task we developed visual, textual and multi-modal approaches using multi-label learning algorithms from the Mulan open source library. For the visual models we employed the ColorDescriptor software to extract visual features from the images using 7 descriptors and 2 detectors. For each combination of descriptor and detector a multi-label model is built using the Binary Relevance approach coupled with Random Forests as the base classifier. For the textual models we used the boolean bag-of-words representation, and applied stemming, stop-word removal, and feature selection using the chi-squared-max method. The multi-label learning algorithm that yielded the best results in this case was Ensemble of Classifier Chains using Random Forests as the base classifier. Our multi-modal approach was based on a hierarchical late-fusion scheme. For the concept-based retrieval task we developed two different approaches. The first one is based on the concept relevance scores produced by the system we developed for the annotation task. It is a manual approach, because for each topic we manually selected the relevant concepts and manually set the strength of their contribution to the final ranking, which is produced by a general formula that combines concept relevance scores. The second approach is based solely on the sample images provided for each query and is therefore fully automated. In this approach only the textual information was used in a query-by-example framework.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>ImageCLEF is the cross-language image retrieval track run annually since 2003
as part of the Cross Language Evaluation Forum (CLEF, http://www.clef-campaign.org/). This paper
documents the participation of the Machine Learning and Knowledge Discovery
(MLKD) group of the Department of Informatics of the Aristotle University of
Thessaloniki in the photo annotation task (also called the visual concept detection
and annotation task) of ImageCLEF 2011.</p>
    </sec>
    <sec id="sec-2">
      <p>
        This year, the photo annotation task consisted of two subtasks: an
annotation task similar to that of ImageCLEF 2010, and a new concept-based retrieval
task. Data for both tasks come from the MIRFLICKR-1M image dataset [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
which, apart from the image files, contains Flickr user tags and Exchangeable
Image File Format (Exif) information. More information about the exact setup
of the data can be found in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        In the annotation task, participants are asked to annotate a test set of 10,000
images with 99 visual concepts. An annotated training set of 8,000 images is
provided. This multi-label learning task [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] can be solved in three different ways
according to the type of information used for learning: 1) visual (the image files),
2) textual (Flickr user tags), 3) multi-modal (visual and textual information).
We developed visual, textual and multi-modal approaches for this task using
multi-label learning algorithms from the Mulan open source library [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In this
task, the relative performance of our textual models was quite good, but that
of our visual models was poor (our group does not have expertise in computer
vision), leading to an average multi-modal (and overall) performance.
      </p>
      <p>In the concept-based retrieval task, participants were given 40 topics
consisting of logical connections between the 99 concepts of the photo annotation
task, such as "find all images that depict a small group of persons in a landscape
scenery showing trees and a river on a sunny day", along with 2 to 5 example
images of each topic from the training set of the annotation task. Participants
were asked to submit (up to) the 1,000 most relevant photos for each topic in
ranked order from a set of 200,000 unannotated images. This task can be solved
by manual construction of the query out of the narrative of the topics, followed
by automatic retrieval of images, or by a fully automated process. We developed
a manual approach that exploits the multi-label models trained in the
annotation task and a fully automated query-by-example approach based on the tags
of the example images. In this task, both our manual and automated approaches
ranked 1st in all evaluation measures by a large margin.</p>
      <p>The rest of this paper is organized as follows. Sections 2 and 3 describe our
approaches to the annotation task and the concept-based retrieval task respectively.
Section 4 presents the results of our runs for both tasks. Section 5 concludes our
work and poses future research directions.</p>
      <sec id="sec-2-1">
        <title>Annotation Task</title>
        <p>This section presents the visual, textual and multi-modal approaches that we
developed for the automatic photo annotation task. There were two (eventually
three) evaluation measures to consider for this task: a) mean interpolated average
precision (MIAP), b) example-based F-measure (F-ex), c) semantic R-precision
(SR-Precision). In order to optimize a learning approach for each of the
initial two evaluation measures and each type of information, six models would have
to be built. However, only five runs were allowed for this task. We therefore decided
to perform model selection based on the widely-used mean average precision
(MAP) measure for all types of information. In particular, MAP was estimated
using an internal 3-fold cross-validation on the 8,000 training images. Our
multi-modal approach was submitted in three different variations to reach the total
number of five submissions.</p>
        <p>Automatic Annotation with Visual Information.
We here describe the approach that we followed in order to learn multi-label
models using the visual information of the images. The flowchart of this approach
is shown in Fig. 1.</p>
        <p>
          As our group does not have expertise in computer vision, we largely
followed the color descriptor extraction approach described in [
          <xref ref-type="bibr" rid="ref6 ref7">6,7</xref>
          ] and used the
accompanying software tool (available from http://www.colordescriptors.com) for extracting visual features from the images.
        </p>
        <p>Harris-Laplace and Dense Sampling were used as point detection strategies.
Furthermore, seven different descriptors were used: SIFT, HSV-SIFT, HueSIFT,
OpponentSIFT, C-SIFT, rgSIFT and RGB-SIFT. For each one of the 14
combinations of point detection strategy and descriptor, a different codebook was
created in order to obtain a fixed-length representation for all images. This is
also known as the bag-of-words approach. The k-means clustering algorithm
was applied to 250,000 randomly sampled points from the training set, with the
codebook size (k) fixed to 4,096 words. Finally, we employed hard assignment of
points to clusters.</p>
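        <p>The mapping from local descriptors to a fixed-length bag-of-words vector can be sketched as follows (a minimal NumPy sketch with a hypothetical toy codebook; the actual pipeline uses k = 4,096 words learned by k-means):</p>

```python
import numpy as np

def bovw_histogram(descriptors, codebook):
    """Hard-assign each local descriptor to its nearest codebook word and
    return a fixed-length bag-of-words count vector of size k."""
    # squared Euclidean distance from every descriptor to every codeword
    d = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d.argmin(axis=1)  # hard assignment: index of the nearest codeword
    return np.bincount(words, minlength=len(codebook))

# toy example: 4 local descriptors, codebook of 3 "visual words"
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
descriptors = np.array([[0.1, 0.0], [0.9, 1.1], [5.2, 4.9], [0.0, 0.2]])
print(bovw_histogram(descriptors, codebook))  # → [2 1 1]
```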
        <p>Using these 4,096-dimensional vector representations along with the ground
truth annotations given for the training images, we built 14 multi-label training
datasets. After experimenting with various multi-label learning algorithms, we
found that the simple Binary Relevance (BR) approach coupled with Random
Forests as the base classifier (number of trees = 150, number of features = 40)
yielded the best results.</p>
        <p>In order to deal with the imbalance in the number of positive and negative
examples of each label, we used instance weighting. The weight of the examples
of the minority class was set to (min + maj)/min and the weight of the examples
of the majority class was set to (min + maj)/maj, where min is the number of
examples of the minority class and maj the number of examples of the majority
class. We also experimented with sub-sampling, but the results were worse than
with instance weighting.</p>
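        <p>The weighting rule above can be sketched in a few lines (a pure-Python sketch, assuming both classes are present; `labels` holds the 0/1 values of one concept):</p>

```python
def imbalance_weights(labels):
    """Weight (min+maj)/min for minority-class examples and (min+maj)/maj
    for majority-class examples, where min/maj are the two class sizes."""
    pos = sum(labels)
    neg = len(labels) - pos
    n = pos + neg                     # n = min + maj
    minority_label = 1 if pos <= neg else 0
    minority, majority = min(pos, neg), max(pos, neg)
    return [n / minority if y == minority_label else n / majority
            for y in labels]

labels = [1, 0, 0, 0, 1, 0]       # 2 positives, 4 negatives
print(imbalance_weights(labels))  # → [3.0, 1.5, 1.5, 1.5, 3.0, 1.5]
```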
      </sec>
    </sec>
    <sec id="sec-3">
      <p>Our approach concludes with a late fusion scheme that averages the output
of the 14 different multi-label models that we built.</p>
      <p>Automatic Annotation with Flickr User Tags.
We here describe the approach that we followed in order to learn multi-label
models using the tags assigned to images by Flickr users. The flowchart of this
approach is shown in Fig. 2.</p>
      <p>[Fig. 2 depicts the textual pipeline: the tags of image i pass through preprocessing (stemming and stop-word removal; 27,323 features) and feature selection (4,000 features) to produce the vector x_i fed to the multi-label learning algorithm.]</p>
      <p>
        An initial vocabulary was constructed by taking the union of the tag sets of
all images in the training set. We then applied stemming to this vocabulary and
removed stop words. This led to a vocabulary of approximately 27,000 stems.
The use of stemming improved the results, despite the facts that some of the tags were
not in the English language and that we used an English stemmer. We further
applied feature selection in order to remove irrelevant or redundant features and
improve efficiency. In particular, we used the chi-squared-max criterion [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to score the stems
and selected the top 4,000 stems, after experimenting with a variety of sizes (500,
1,000, 2,000, 3,000, 4,000, 5,000, 6,000 and 7,000).
      </p>
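      <p>The chi-squared-max scoring can be sketched as follows (a NumPy sketch of the criterion as we understand it: each stem is scored by its maximum chi-squared statistic over all concepts, and the top k stems are kept; the toy data are hypothetical):</p>

```python
import numpy as np

def chi2_stat(f, y):
    """Chi-squared statistic of one boolean feature against one boolean label."""
    f, y = np.asarray(f, bool), np.asarray(y, bool)
    obs = np.array([[( f &  y).sum(), ( f & ~y).sum()],
                    [(~f &  y).sum(), (~f & ~y).sum()]], float)
    exp = obs.sum(1, keepdims=True) * obs.sum(0, keepdims=True) / len(f)
    exp[exp == 0] = 1e-12  # empty rows/columns contribute nothing
    return ((obs - exp) ** 2 / exp).sum()

def chi2max_select(X, Y, k):
    """Score each feature by its max chi-squared over all labels; keep top k."""
    scores = [max(chi2_stat(X[:, j], Y[:, c]) for c in range(Y.shape[1]))
              for j in range(X.shape[1])]
    return sorted(int(j) for j in np.argsort(scores)[-k:])

# feature 0 predicts both labels perfectly, feature 1 is uninformative
X = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])
Y = np.array([[1, 1], [1, 1], [0, 0], [0, 0]])
print(chi2max_select(X, Y, 1))  # → [0]
```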
      <p>
        The multi-label learning algorithm that was found to yield the best results
in this case was Ensemble of Classifier Chains (ECC) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] using Random Forests
as the base classifier. ECC was run with 15 classifier chains and Random Forests
with 10 decision trees, while all other parameters were left at their default values.
The approach that we followed to deal with class imbalance in the case of visual
information (see the previous subsection) was followed in this case too.
Our multi-modal approach is based on a late fusion scheme that combines the
output of the 14 visual models and the single textual model. The combination is
not an average of these 15 models, because in that case the visual models would
dominate the final scores. Instead, we follow a hierarchical combination scheme.
We separately average the 7 visual models of each point detection strategy and then
combine the output of the textual model, the Harris-Laplace average and the Dense
Sampling average, as depicted in Fig. 3. The motivation for this scheme was
the three different views of the images that existed in the data (Harris-Laplace,
Dense Sampling, user tags), as explained in the following two paragraphs.
      </p>
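      <p>The hierarchical combination, in its averaging variant, can be sketched as follows (a NumPy sketch; rows are per-model concept-score vectors, shrunk here to 2 hypothetical models per detector and 2 concepts for illustration):</p>

```python
import numpy as np

def hierarchical_fusion(harris_models, dense_models, textual):
    """Average the Harris-Laplace models and the Dense Sampling models
    separately, then average the two visual ensembles with the single
    textual model, so each of the three views carries equal weight."""
    harris_avg = np.mean(harris_models, axis=0)
    dense_avg = np.mean(dense_models, axis=0)
    return (harris_avg + dense_avg + np.asarray(textual)) / 3.0

harris = [[0.2, 0.4], [0.4, 0.6]]   # per-model concept scores
dense = [[0.6, 0.2], [0.0, 0.4]]
textual = [0.9, 0.1]
print(hierarchical_fusion(harris, dense, textual))
```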
      <p>[Fig. 3 depicts the hierarchical fusion scheme: the 7 Harris-Laplace models (one per descriptor codebook) are averaged into a Harris-Laplace ensemble, the 7 Dense Sampling models into a Dense Sampling ensemble, and their outputs are combined with the Flickr user tags model through averaging or an arbitrator to produce the final scores p(c_j | x_i).]</p>
      <p>
        We can discern two main categories of concepts in photo annotation: objects
and scenes. For objects, Harris-Laplace performs better because it ignores the
homogeneous areas, while for scenes, Dense Sampling performs better [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. For
example, two of the concepts where Dense Sampling achieves much higher Average
Precision (AP) than Harris-Laplace are Night and Macro, which are abstract,
while the inverse holds for the concepts Fish and Ship, which correspond to
things (organisms, objects) of particular shape.
      </p>
      <p>Furthermore, we observe that the visual approach performs better on
concepts, such as Sky, which for some reason (e.g. lack of user interest in retrieval
by this concept) do not get tagged. On the other hand, the textual approach
performs much better when it has to predict concepts, such as Horse, Insect,
Dog and Baby, that typically get tagged by users. Table 1 shows the average
precision for 10 concepts, half of which suit the textual models much better and
half the visual models.</p>
      <p>
        Two variations of this scheme were developed, differing in how the output
of the three different views is combined. The first one, named Multi-Modal-Avg,
used an averaging operator, similar to the one used at the lower levels of the
hierarchy. The second one, named Multi-Modal-MaxAP, used an arbitrator
function to select the best one out of the three outputs for each concept, according to
internal evaluation results in terms of average precision. Our third multi-modal
submission, named Multi-Modal-MaxAP-RGBSIFT, was a preliminary version
of Multi-Modal-MaxAP, where only the RGB-SIFT descriptor was used.
The multi-label learners used in this work provide us with a confidence score for
each concept. This is fine for an evaluation with MIAP and SR-Precision, but
does not suffice for an evaluation with the example-based F-measure, which requires
a bipartition of the concepts into relevant and irrelevant ones. This is a typical
issue in multi-label learning, which is dealt with through a thresholding process [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        We used the thresholding method described in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which applies a common
threshold across all concepts and provides a close approximation of the label
cardinality (LC) of the training set by the predictions made on the test set. The
threshold is calculated using the following formula:
t = argmin_{t ∈ {0.00, 0.05, ..., 1.00}} |LC(D_train) − LC(H_t(D_test))|   (1)
where D_train is the training set and H_t is a classifier which has made predictions
on a test set D_test under threshold t.
      </p>
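      <p>This threshold search can be sketched as follows (a pure-Python sketch; `test_scores` holds each test image's hypothetical per-concept confidences, and the target label cardinality comes from the training set):</p>

```python
def select_threshold(train_lc, test_scores):
    """Pick the common threshold t in {0.00, 0.05, ..., 1.00} whose induced
    bipartition best matches the training label cardinality (Eq. 1)."""
    best_t, best_gap = 0.0, float("inf")
    for step in range(21):
        t = step / 20.0
        # label cardinality of the thresholded test-set predictions
        lc = sum(sum(s >= t for s in img) for img in test_scores) / len(test_scores)
        if abs(train_lc - lc) < best_gap:
            best_t, best_gap = t, abs(train_lc - lc)
    return best_t

scores = [[0.9, 0.6, 0.1], [0.8, 0.3, 0.2]]  # confidences for 2 test images
print(select_threshold(1.5, scores))  # → 0.35
```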
      <sec id="sec-3-1">
        <title>Concept-Based Retrieval Task</title>
        <p>We developed two different approaches for the concept-based retrieval task. The
first one is based on the concept relevance scores produced by the system we
developed for the annotation task. It is a manual approach, because for each
topic we manually selected the relevant concepts and manually set the strength
of their contribution to the final ranking, which is produced by a general formula that
combines concept relevance scores. The second one is based solely on the sample
images provided for each query and is therefore fully automated.</p>
        <p>
          Manual Approach.
Let I = {1, ..., 200,000} be the collection of images, Q = {1, ..., 40} the set of
topics and C = {1, ..., 99} the set of concepts. We first apply our automated
image annotation system to each image i ∈ I and obtain a corresponding
99-dimensional vector S_i = [s_i1, s_i2, ..., s_i99] with the relevance scores of this image for
each one of the 99 concepts. For efficiency reasons, we used simplified versions of
our visual approach, taking into account only models produced with the
RGB-SIFT descriptor, which has been found in the past to provide better results
compared to other single color descriptors [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>Then, based on the description of each of the 40 queries, we manually select
a number of concepts that we consider related to the query, either positively or
negatively. Formally, for topic q ∈ Q, let P_q ⊆ C denote the set of concepts that
are positively related to q and N_q ⊆ C the set of concepts that are negatively
related to q, with P_q ∩ N_q = ∅. For each concept c in P_q ∪ N_q, we further define a
real-valued parameter m_cq ≥ 1 denoting the strength of relevance of concept c
to q. The larger this value, the stronger the influence of concept c on the final
relevance score. For each topic q and image i, the scores of the relevant concepts
are combined using (2).</p>
        <p>S_{q,i} = ∏_{c ∈ P_q} (s_ic)^{m_cq} · ∏_{c ∈ N_q} (1 − s_ic)^{m_cq}   (2)</p>
        <p>Finally, for each topic, we arrange the images in descending order according
to the overall relevance score and we retrieve a fixed number of images (in our
submissions we retrieved 250 and 1,000 images).</p>
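        <p>Equation (2) translates directly into code (a pure-Python sketch; the concept scores below are hypothetical, and concept groups can be pre-merged via max into a virtual concept before calling it):</p>

```python
def topic_score(s, positive, negative, m):
    """Relevance of one image to a topic (Eq. 2): the product of the scores
    of positively related concepts and of the complemented scores of
    negatively related concepts, each raised to its strength m_cq."""
    score = 1.0
    for c in positive:
        score *= s[c] ** m.get(c, 1.0)        # strength defaults to 1
    for c in negative:
        score *= (1.0 - s[c]) ** m.get(c, 1.0)
    return score

# hypothetical scores for one image on concepts 8 (Sports), 63 (Visual
# Arts) and 75 (Horse); topic: Sports and Horse positive, Visual Arts negative
s = {8: 0.7, 63: 0.2, 75: 0.9}
print(topic_score(s, positive=[8, 75], negative=[63], m={}))
# 0.7 * 0.9 * (1 - 0.2) ≈ 0.504
```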
        <p>Note that for each topic, the selection of related concepts and the setting of
values for the m_cq parameters was done using a trial-and-error approach involving
careful visual examination of the top 10 retrieved images, as well as more relaxed
visual examination of the top 100 retrieved images. Two examples of topics and
the corresponding combinations of scores follow.</p>
        <p>Topic 5: rider on horse. "Here we like to find photos of riders on a horse.
So no sculptures or paintings are relevant. The rider and horse can also be only
partly on the photo. It is important that the person is riding a horse and not
standing next to it." Based on the description of this topic and experimentation,
we concluded that concepts 75 (Horse) and 8 (Sports) are positively related (rider
on horse), while concept 63 (Visual Arts) is negatively related (no sculptures or
paintings). We therefore set P_5 = {8, 75}, N_5 = {63}. All concepts were set to
equal strength for this topic: m_8,5 = m_63,5 = m_75,5 = 1.</p>
        <p>Topic 24: funny baby. "We like to find photos of babies looking funny. The
baby should be in the main focus of the photo and be the reason why the photo
looks funny. Photos presenting funny things that are not related to the baby are
not relevant." Based on the description of this topic and experimentation, we
concluded that concepts 86 (Baby), 92 (Funny) and 32 (Portrait) are positively
related. We therefore set P_24 = {32, 86, 92}, N_24 = ∅. Based on experimentation,
the concept Funny was given twice the strength of the other concepts: we set
m_32,24 = m_86,24 = 1 and m_92,24 = 2.</p>
        <p>For some topics, instead of explicitly using the scores of a group of interrelated
concepts, we considered introducing a virtual concept with score equal to the
maximum over this group of concepts. This slight adaptation of the general rule of
(2) enhances its representation capabilities. The following example clarifies this
adaptation.</p>
        <p>Topic 32: underexposed photos of animals. "We like to find photos of
animals that are underexposed. Photos with normal illumination are not
relevant. The animal(s) should be more or less in the main focus of the image."
Based on the description of this topic and experimentation, we concluded that
concepts 44 (Animals), 34 (Underexposed), 72 (Dog), 73 (Cat), 74 (Bird), 75
(Horse), 76 (Fish) and 77 (Insect) are positively related, while concept 35
(Neutral Illumination) is negatively related. The six last specific animal concepts were
grouped into a virtual concept, say concept 1001, with score the maximum of
the scores of these six concepts. We then set P_32 = {34, 44, 1001}, N_32 = {35}
and m_34,32 = m_44,32 = m_1001,32 = m_35,32 = 1.</p>
        <p>Figure 4 shows the top 10 retrieved images for topics 5, 24 and 32, along
with the Precision@10 for these topics.</p>
        <p>Automated Approach.
Apart from the narrative description, each topic of the concept-based retrieval
task was accompanied by a set of 2 to 5 images from the training set which could
be considered relevant to the topic. Using these example images as queries, we
developed a query-by-example approach to find the most relevant images in the
retrieval set. The representation followed the bag-of-words model and was based
on the Flickr user tags assigned to each image.</p>
        <p>To generate the feature vectors, we applied the same method as the one used
for the annotation task. Thus, each image was represented as a 4,000-dimensional
feature vector where each feature corresponds to a tag from the training set
that was selected by the feature selection method. A value of 1/0 denotes the
presence/absence of the tag in the tags accompanying an image.</p>
        <p>To measure the similarity between the vectors representing two images, we
used the Jaccard similarity coefficient, which is defined as the total number of
attributes where two vectors A and B both have a value of 1, divided by the
total number of attributes where either A or B has a value of 1.</p>
        <p>Since more than one image was given as an example for each topic, we added
their feature vectors in order to form a single query vector. This approach was
found to work well in comparison to other approaches, such as taking only one
of the example images as the query, or measuring the similarity between a retrieval
image and each example image separately and then returning the images from the
retrieval set with the largest similarity score to any of the queries. We attribute
this to the fact that by adding the feature vectors, a better representation of the
topic of interest was created, which would not be possible if only one image (with
possibly noisy or very few tags) was considered.</p>
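        <p>The query-by-example step can be sketched as follows (a pure-Python sketch with hypothetical toy vectors; since the Jaccard coefficient is defined on boolean vectors, we interpret the summed query vector's nonzero entries as tag presence):</p>

```python
def jaccard(a, b):
    """Jaccard coefficient of two boolean vectors: |A and B| / |A or B|."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

def query_by_example(example_vecs, collection, top_k):
    """Sum the example images' boolean tag vectors into one query vector,
    then rank the collection by descending Jaccard similarity to it."""
    query = [1 if sum(col) else 0 for col in zip(*example_vecs)]
    ranked = sorted(range(len(collection)),
                    key=lambda i: jaccard(query, collection[i]), reverse=True)
    return ranked[:top_k]

examples = [[1, 1, 0, 0], [0, 1, 1, 0]]           # tag vectors of 2 example images
collection = [[1, 1, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0]]
print(query_by_example(examples, collection, 2))  # → [0, 2]
```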
        <p>As in the manual approach, we submitted two runs, one returning the 250 and
one the 1000 most similar images from the retrieval set (in descending similarity
order).</p>
        <p>Figure 5 shows the top 10 retrieved images, along with the Precision@10, for
the following topics:
- Topic 10: single person playing a musical instrument. We like to find
pictures (no paintings) of a person playing a musical instrument. The person
can be on stage, off stage, inside or outside, sitting or standing, but should
be alone on the photo. It is enough if not the whole person or instrument is
shown, as long as the person and the instrument are clearly recognizable.
- Topic 12: snowy winter landscape. We like to find pictures (photos or
drawings) of white winter landscapes with trees. The landscape should not
contain human-made objects, e.g. houses, cars and persons. Only snow on
the top of a mountain is not relevant; the landscape has to be fully covered
in (at least light) snow.
- Topic 30: cute toys arranged to a still-life. We like to find photos of
toys arranged to a still-life. These toys should look cute in the arrangement.
Simple photos of a collection of toys, e.g. in a shop, are not relevant.</p>
        <p>We see that the 10 retrieved images for topic 30 are better than those of topics
12 and 10. This can be explained by noticing that topic 12 is a difficult one, while
the tags of the example images for topic 10 are not very descriptive/informative.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Results</title>
        <p>
          We here briefly present our results, as well as our performance relative
to other groups and submissions. Results for all groups, as well as more details
on the data setup and evaluation measures, can be found in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>Annotation Task.
The official results of our runs are illustrated in Table 2. We notice that in
terms of MIAP, the textual model is slightly better than the visual one, while for the
other two measures, the visual model is much better than the textual. Among
the multi-modal variations, we notice that averaging works better than
arbitrating, and, as expected, using all descriptors is better than using just the
RGB-SIFT one. In addition, we notice that the multi-modal approach significantly
improves over the MIAP of the visual and textual approaches, while it slightly
decreases/increases the performance of the visual model in the two
example-based measures. This may partly be due to the fact that we performed model
selection based on MAP.</p>
        <p>Table 2 (MIAP): Textual 0.3256, Visual 0.3114, Multi-Modal-Avg 0.4016,
Multi-Modal-MaxAP-RGBSIFT 0.3489, Multi-Modal-MaxAP 0.3589.</p>
        <p>Table 3 shows the rank of our best result compared to the best results of
other groups and compared to all submissions. We did quite well in terms of
textual information, but quite poorly in terms of visual information, leading to
an overall average performance. Lack of computer vision expertise in our group
may be a reason for not being able to get more out of the visual information.
Among the three evaluation measures, we notice that overall we did better in
terms of MIAP, slightly worse in terms of F-measure, and even worse in terms
of SR-Precision. The fact that model selection was performed based on MAP
definitely played a role in this result.</p>
        <p>Concept-Based Retrieval Task.
In this task, participating systems were evaluated using the following measures:
Mean Average Precision (MAP), Precision@10, Precision@20, Precision@100
and R-Precision.</p>
        <p>The official results of our runs are illustrated in Table 4. We first notice
that the first 5 runs, which retrieved 1,000 images, lead to better results in
terms of MAP and R-Precision compared to the last 5 runs, which retrieved 250
images. Obviously, in terms of Precision@10, Precision@20 and Precision@100,
the results are equal. Among the manual runs, we notice that the visual models
perform quite poorly. We hypothesize that many concepts that favor textual
rather than visual models, as discussed in Sect. 2, appear in most of the topics.
The textual and multi-modal models perform best, with the Multi-Modal-Avg
model having the best result in 3 out of the 5 measures.</p>
        <p>The automated approach performs slightly better than the visual model of
the manual approach, but still much worse than the textual and multi-modal
manual approaches. As expected, the knowledge that is provided by a human can
clearly lead to better results compared to a fully automated process. However,
this is not true across all topics, as can be seen in Table 5, which compares the
results of the best automated and manual approaches for each individual topic. We
can see there that the automated approach performs better on 9 topics, while
the manual one does on 31.</p>
        <p>Table 4 (MAP, P@10, P@20, P@100, R-Precision):
Manual-Visual-RGBSIFT-1000: 0.0361, 0.1525, 0.1375, 0.1080, 0.0883;
Automated-Textual-1000: 0.0849, 0.3000, 0.2800, 0.2188, 0.1530;
Manual-Textual-1000: 0.1546, 0.4100, 0.3838, 0.3102, 0.2366;
Manual-Multi-Modal-Avg-RGBSIFT-1000: 0.1640, 0.3900, 0.3700, 0.3180, 0.2467;
Manual-Multi-Modal-MaxAP-RGBSIFT-1000: 0.1533, 0.4175, 0.3725, 0.2980, 0.2332;
Manual-Visual-RGBSIFT-250: 0.0295, 0.1525, 0.1375, 0.1080, 0.0863;
Automated-Textual-250: 0.0708, 0.3000, 0.2800, 0.2188, 0.1486;
Manual-Textual-250: 0.1328, 0.4100, 0.3838, 0.3102, 0.2298;
Manual-Multi-Modal-Avg-RGBSIFT-250: 0.1346, 0.3900, 0.3700, 0.3180, 0.2397;
Manual-Multi-Modal-MaxAP-RGBSIFT-250: 0.1312, 0.4175, 0.3725, 0.2980, 0.2260.</p>
        <p>Table 6 shows the rank of our best result compared to the best results of other
groups and compared to all submissions. Both our manual and our automated
approaches ranked 1st in all evaluation measures.</p>
        <p>Our participation in the very interesting photo annotation and concept-based
retrieval tasks of CLEF 2011 led to a couple of interesting conclusions. First
of all, we found out that we need the collaboration of a computer vision/image
processing group to achieve better results. In terms of multi-label learning
algorithms, we noticed that binary approaches worked quite well, especially when
coupled with the strong Random Forests algorithm and when class imbalance issues
are taken into account. We also reached the conclusion that we should have
performed model selection separately for each evaluation measure. We therefore
suggest that in future versions of the annotation task, the allowed number of
submissions should be equal to the number of evaluation measures multiplied by
the number of information types, so that there is space in the official results for
models built with all kinds of information.</p>
        <p>There is a lot of room for improvement in the future, both in the
annotation task and in the very interesting concept-based retrieval task. In terms of textual
information, we intend to investigate the translation of non-English tags. We
would also like to investigate other hierarchical late fusion schemes, such as an
additional averaging step for the two different visual modalities (Harris-Laplace,
Dense Sampling), and more advanced arbitration techniques. Other thresholding
approaches for obtaining bipartitions are another interesting direction for future
study.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Acknowledgments</title>
        <p>We would like to acknowledge the student travel support from EU FP7 under
grant agreement no 216444 (PetaMedia Network of Excellence).</p>
        <p>
          Fig. 4. Retrieved images for topics 5, 24 and 32 using manual retrieval. Images come
from the MIRFLICKR-1M image dataset [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Huiskes</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thomee</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lew</surname>
            ,
            <given-names>M.S.</given-names>
          </string-name>
          :
          <article-title>New trends and ideas in visual concept detection: The MIR Flickr retrieval evaluation initiative</article-title>
          .
          <source>In: MIR '10: Proceedings of the 2010 ACM International Conference on Multimedia Information Retrieval</source>
          . pp.
          <fpage>527</fpage>
          –
          <lpage>536</lpage>
          . ACM, New York, NY, USA (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ioannou</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sakkas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsoumakas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vlahavas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Obtaining bipartitions from score vectors for multi-label classification</article-title>
          .
          <source>Tools with Artificial Intelligence</source>
          ,
          <source>IEEE International Conference on, vol. 1, pp. 409–416</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>D.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rose</surname>
            ,
            <given-names>T.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>RCV1: A new benchmark collection for text categorization research</article-title>
          .
          <source>J. Mach. Learn. Res</source>
          .
          <volume>5</volume>
          ,
          <fpage>361</fpage>
          –
          <lpage>397</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Nowak</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nagel</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liebetrau</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>The CLEF 2011 photo annotation and concept-based retrieval tasks</article-title>
          .
          <source>In: Working Notes of CLEF</source>
          <year>2011</year>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Read</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pfahringer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holmes</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Classifier chains for multi-label classification</article-title>
          .
          <source>In: Proc. 20th European Conference on Machine Learning (ECML</source>
          <year>2009</year>
          ). pp.
          <fpage>254</fpage>
          –
          <lpage>269</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>van de Sande</surname>
            ,
            <given-names>K.E.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gevers</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>University of Amsterdam at the Visual Concept Detection and Annotation Tasks</article-title>
          ,
          <source>The Information Retrieval Series</source>
          , vol.
          <volume>32</volume>
          : ImageCLEF, chap. 18, pp.
          <fpage>343</fpage>
          –
          <lpage>358</lpage>
          . Springer (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>van de Sande</surname>
            ,
            <given-names>K.E.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gevers</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snoek</surname>
            ,
            <given-names>C.G.M.</given-names>
          </string-name>
          :
          <article-title>Evaluating color descriptors for object and scene recognition</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>32</volume>
          (
          <issue>9</issue>
          ),
          <fpage>1582</fpage>
          –
          <lpage>1596</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Tsoumakas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Katakis</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vlahavas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Mining multi-label data</article-title>
          . In:
          <string-name>
            <surname>Maimon</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rokach</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (eds.)
          <article-title>Data Mining and Knowledge Discovery Handbook, chap</article-title>
          . 34, pp.
          <fpage>667</fpage>
          –
          <lpage>685</lpage>
          . Springer, 2nd edn. (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Tsoumakas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spyromitros-Xioufis</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vilcek</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vlahavas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Mulan: A java library for multi-label learning</article-title>
          .
          <source>Journal of Machine Learning Research (JMLR)</source>
          12, 2411–2414 (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>