=Paper=
{{Paper
|id=Vol-1177/CLEF2011wn-ImageCLEF-BorosEt2011
|storemode=property
|title=UAIC's Participation at Wikipedia Retrieval @ ImageCLEF 2011
|pdfUrl=https://ceur-ws.org/Vol-1177/CLEF2011wn-ImageCLEF-BorosEt2011.pdf
|volume=Vol-1177
|dblpUrl=https://dblp.org/rec/conf/clef/BorosGI11
}}
==UAIC's Participation at Wikipedia Retrieval @ ImageCLEF 2011==
Emanuela Boroș, Alexandru-Lucian Gînscă, Adrian Iftene

UAIC: Faculty of Computer Science, “Alexandru Ioan Cuza” University, Romania
{emanuela.boros, lucian.ginsca, adiftene}@info.uaic.ro

Abstract. This paper describes the participation of the UAIC team in the Wikipedia Retrieval task of the ImageCLEF 2011 competition. The aim of the task was to investigate retrieval approaches in the context of a large and heterogeneous collection of images with noisy text annotations. We submitted a total of six runs, focusing our effort on textual retrieval and query expansion for English, combined with visual feature extraction (Color and Edge Directivity Descriptor, CEDD). Our intention was to build a content-based image retrieval (CBIR) system that relies on fast indexing and retrieval based not only on the multilingual textual metadata, but also on image features. The results were satisfying for the multilingual mixed search (text and images) and for the query expansion approach.

Keywords: Support Vector Machine, Query Expansion, Similarity Identification, Content-based Image Retrieval

1 Introduction

It is a truism that images are currently used in all types of applications, and methods are needed to summarize, describe and classify collections of such data. The influence of the visual perspective in today’s society is clear for all to see. The difficulty of locating a desired image in a large data collection has increased the need for content-based image retrieval (CBIR) with more effective techniques.

This paper outlines our approach for the Wikipedia Retrieval task1 at ImageCLEF 20112. The task investigates retrieval approaches in the context of a large and heterogeneous collection of images with noisy text annotations (similar to those encountered on the Web) that are searched by users with diverse information needs. We received a collection of 237,434 images with associated user-supplied annotations, built to cover similar topics in English, German and French, together with an XML file containing the topics and their metadata. This study was performed in order to build a system capable of classifying, indexing and retrieving images from a large-scale collection.

The paper is organized as follows: in Section 2 we describe our system, while its advantages and the experimental results are reported in Sections 3 and 4. The last section draws conclusions regarding our participation in the Wikipedia Retrieval task at ImageCLEF 2011.

1 Wikipedia Retrieval task: http://imageclef.org/2011/wikipedia
2 ImageCLEF 2011: http://imageclef.org/2011

2 System Description

One of the main problems in building such a system is the difficulty of locating a desired image in a large data collection. While it is perfectly feasible to identify a desired image in a small collection simply by browsing, more effective techniques are needed for collections containing thousands of items. Access to a desired image in a large collection might thus involve a search for images depicting specific types of features, evoking a particular mood, or simply containing a specific texture or pattern.

Query by example is a query technique in which the user provides the system with an example image on which the search is then based. The underlying search algorithms may vary depending on the application, but the result images should all share common elements with the provided example.
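As a concrete illustration of the query-by-example idea, a minimal sketch (not our actual implementation; all names are illustrative, and the feature vectors could be any fixed-length descriptor, such as the CEDD histograms discussed below):

```java
// Minimal query-by-example sketch: rank a collection by the Euclidean
// distance between each image's feature vector and the example's vector.
// Vectors are assumed to have equal length.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class QueryByExample {

    // Euclidean distance between two feature vectors.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns image ids ordered from most to least similar to the example.
    static List<String> rank(double[] example, Map<String, double[]> collection) {
        List<String> ids = new ArrayList<>(collection.keySet());
        ids.sort(Comparator.comparingDouble(id -> distance(example, collection.get(id))));
        return ids;
    }
}
```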
Our system can be divided into two main components, one for text processing and one for image processing. All components were written in Java, using the Lucene3 library and LIBSVM4, a library for support vector machines [1]. The choice of classifier is a key ingredient of an effective machine-learning-based image recognition system. We chose Support Vector Machines (SVMs) [2] for their state-of-the-art performance in several visual recognition domains, and Lucene for its utility in the implementation of Internet search engines and local, single-site search. For image analysis, we used the CEDD features [3] that were donated by the DUTH team (the Information Retrieval Unit, Department of Electrical and Computer Engineering, Democritus University of Thrace, Greece) for this year's competition, and for text analysis we used the full-featured text search engine library Lucene. A main interest was to create a gold collection of images so that a more reliable classification could be done.

The architecture of the system relies on a model-view-controller pattern; the controller is multithreaded and can serve several requests at a time. It manages the feature extraction module and the final result computation module. One of the maturity levels the system has therefore reached is the ability to retrieve images in real time. The main focuses of this project are data representation, feature extraction and indexing, image query matching and user interfacing. A training stage uses the sample image collection. The main flow of the application is described in Figure 1.

3 Lucene: http://lucene.apache.org/java/docs/index.html
4 LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Figure 1: Basic system used for image retrieval with query-by-example.

2.1 Image processing

In this type of system, CBIR [4], the model construction step aims at providing a representation of the input data that minimizes the within-class variability while at the same time maximizing the between-class variability. This representation is usually more compact than the raw input data and therefore reduces the computational load imposed by the classification process. Even when the user can provide example images that contain instances of the visual properties, content or configurations they would like to search for, it is very hard for the system to ascertain which aspects make a given image relevant and how similarity should be assessed. Many systems therefore rely on extracting different image features to guide the search towards desirable images, although this approach is still being studied before it can be relied on in real-world retrieval scenarios.

Features can be derived from the whole image (global features) or computed locally, based on its salient parts (local features). We chose a single global descriptor (the Color and Edge Directivity Descriptor, CEDD) and no local one, considering that the most influential factor in the query results would be the textual metadata. The rest of this section briefly describes this approach. The SIFT (Scale Invariant Feature Transform) components [5, 6] were used in a manner similar to the approach in [7]. First of all, the CEDD features for the large data collection were already extracted, thanks to the Information Retrieval Unit of the Democritus University of Thrace, and we also extract the CEDD features from the example images. Every image is thus represented by a file of approximately 54 bytes, that is, a vector of 144 numbers in the default representation, which we convert to the specific input format required by LIBSVM.
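That conversion can be sketched as follows; a minimal illustration under the assumption that the 144-bin CEDD histogram has already been loaded into a double array (the class and method names are ours, only svm_node belongs to LIBSVM):

```java
// Convert a dense 144-bin CEDD histogram into LIBSVM's sparse
// svm_node[] representation (LIBSVM feature indices are 1-based).
import libsvm.svm_node;

public class CeddVectors {

    static svm_node[] toSvmNodes(double[] cedd) {
        svm_node[] nodes = new svm_node[cedd.length];
        for (int i = 0; i < cedd.length; i++) {
            svm_node node = new svm_node();
            node.index = i + 1;
            node.value = cedd[i];
            nodes[i] = node;
        }
        return nodes;
    }
}
```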
Search results are sorted based on their distance to the query images. It is virtually impossible to compare images using traditional methods such as a direct comparison of gray values, so we chose a more practical method of comparison, a simple similarity function. An image distance measure compares the similarity of two images along various dimensions such as color, texture and shape. So, having the CEDD representations of the collection images and of the example images, the most common method for comparing two images (an example image and an image from the large collection) is the Euclidean distance similarity function.

For the second representation, the SVM trainer learns from the example images and then predicts the category of every image in the large data collection. The given topic file was an XML document containing fifty topics with their metadata and example images. From the example images, a gold image collection is constructed and used both for the basic comparison with CEDD features and as training data for the SVM classifier, as can be seen in Figure 1. This is the training step, so for the SVM classifier we obtained fifty classes.

2.2 Text processing

In a first step, we extracted the textual information from the metadata files and combined the data from the description, comment and caption tags. This was done independently for each of the three languages. We then created a Lucene index for each language. A separate index was created for English, in which the text was put through a stemming process, using the Porter Stemmer5 provided by the Lucene framework6, and a stop word elimination process.

The topics are used as Lucene search queries after they go through a processing step. For English, we generated a separate query that consists of the stemmed initial text. We also give special attention to capitalized words: we consider a series of capitalized words to be a named entity and give it an increased boost in the Lucene query. Due to the nature of the short, simple topics, we did not find it necessary to use a gazetteer-based named entity recognition approach. For some runs, we used a simple form of query expansion, which consisted of introducing synonyms with the help of WordNet7 and using them as a disjunction of terms consistent with the Lucene query syntax.

5 Porter Stemmer: http://tartarus.org/~martin/PorterStemmer/
6 Lucene: http://lucene.apache.org/java/docs/index.html
7 WordNet: http://wordnet.princeton.edu/

2.3 Aggregation of results

There are two places in our system where intermediary results are combined. The first addresses the multilingual runs and gives a method for selecting the top answers for the queries posed in each of the three languages, while the second bridges the results obtained from the textual queries with those given by the image matching side of the system.

Using queries in all three languages, English, French and German, posed the problem of retaining the best results for each language. This was done by first obtaining a maximum of 1000 documents per language, sorted in descending order of their Lucene scores. In the next step, all of the documents were pooled and reordered by their associated scores; after duplicate documents were eliminated, at most the first 1000 documents were returned. This process is sketched in Figure 2.

Figure 2: Textual query results aggregation
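A minimal sketch of this merge step, assuming each per-language run is a map from document id to Lucene score (the paper does not state which score a duplicate keeps, so we keep the highest):

```java
// Pool the per-language result lists, drop duplicates (keeping the best
// score per document), sort by score in descending order and keep at
// most 1000 documents.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ResultAggregation {

    static List<Map.Entry<String, Double>> merge(List<Map<String, Double>> runs) {
        Map<String, Double> best = new HashMap<>();
        for (Map<String, Double> run : runs) {
            for (Map.Entry<String, Double> e : run.entrySet()) {
                best.merge(e.getKey(), e.getValue(), Math::max);
            }
        }
        List<Map.Entry<String, Double>> merged = new ArrayList<>(best.entrySet());
        merged.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        return merged.subList(0, Math.min(1000, merged.size()));
    }
}
```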
Aggregation of partial results is required a second time after the final documents from the textual queries are obtained and the similarity scores for the topic images are calculated. To obtain a uniform set of final scores, the similarity distance scores are first brought into the [0, 1] interval. After multiple tests, we settled on a weighted mean of the Lucene score and the normalized image similarity score, with the former contributing 60% of the final score, i.e., final = 0.6 · score_text + 0.4 · score_image.

3 Experiments

This section describes the setup used for the experiments reported in this paper. We conducted six experiments to evaluate the effectiveness of our approach. In all of them, we compared the individual techniques as well as the batch algorithm. We employed the Lucene and LIBSVM libraries, and we chose a multiclass SVM. The CEDD descriptors discussed above, stored in files with the specific LIBSVM representation, are used as input to the SVM via an RBF kernel, with training data covering the fifty classes extracted in the model construction phase. A first set of results is obtained from textual indexing, a second from the similarity computation between the example images and the large data collection, and a third from the SVM prediction. In all the experiments, we benchmarked against a system not using any prior knowledge; the prior-knowledge model was built from the example images.

The six experiments are distinguished by the languages analyzed, the use of the SVM and the image processing. Three runs were submitted for English-only queries and the other three for all languages. The multilingual scores of images were formed by adding up the scores for the individual languages. As with the multimodal runs, the overall performance was higher than for the English-only runs. Every experiment produced results from different sources (textual or visual), a score between 0 and 1 being attributed to each image; the overall classification rate was then computed as an average to which the results of each method contributed equally.
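For illustration, the SVM side of these experiments can be sketched with LIBSVM's Java API as follows (the parameter values shown are plausible defaults, not necessarily the settings of our submitted runs; the class and method names are ours):

```java
// Train a multiclass C-SVC with an RBF kernel on CEDD vectors
// (one label per topic, fifty classes in total) and predict a topic
// for a new image, using LIBSVM's Java API.
import libsvm.*;

public class SvmTopics {

    static svm_model train(svm_node[][] vectors, double[] topicLabels) {
        svm_problem problem = new svm_problem();
        problem.l = vectors.length;   // number of training images
        problem.x = vectors;          // CEDD vectors in svm_node form
        problem.y = topicLabels;      // topic id of each training image

        svm_parameter param = new svm_parameter();
        param.svm_type = svm_parameter.C_SVC;  // multiclass via one-vs-one
        param.kernel_type = svm_parameter.RBF;
        param.gamma = 1.0 / 144;               // illustrative: 1 / CEDD bins
        param.C = 1.0;                         // illustrative default
        param.cache_size = 100;
        param.eps = 1e-3;

        return svm.svm_train(problem, param);
    }

    static double predictTopic(svm_model model, svm_node[] vector) {
        return svm.svm_predict(model, vector);
    }
}
```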
4 Results

All the obtained results are summarized in Table 1. All submitted runs were automatic and combined, meaning an initial text search is followed by a possibly combined visual search. We could have obtained better results if we had combined other SVM parameters, the local descriptor (SIFT) and query expansion on all languages. The presented results provide clear evidence of the capability of the query expansion method and of textual retrieval on all languages to produce a satisfying search result.

Table 1: UAIC runs in the Wikipedia Retrieval 2011 task8

In this section, we analyze the strengths and weaknesses of the system. For that, we compare the results obtained by two of our better runs: uaic3lucene&alllang&cedd and uaic6lucene&alllang&qe&cedd. The first experiment contributed the best result by using metadata in all languages indexed by Lucene. In the second experiment we only added query expansion for English. The experiment in which we used the SVM prediction did not have satisfying results. So the success rate of the runs with textual retrieval on all languages is better than in the other cases, where the SVM or only English-language metadata was used. Document expansion can improve the MAP from 0.1099 to 0.1665, but query expansion in combination with our methods does not show much improvement.

8 ImageCLEF 2011 Wikipedia image retrieval results: http://www.imageclef.org/2011/wikimm-results

5 Conclusions

In this paper we described our first participation in the Wikipedia Retrieval task at the ImageCLEF 2011 competition. We experimented with different combinations of methods. The obtained results show that it is necessary to continue investigating the expansion methodology, to better apply content-based image retrieval techniques that help us extract image features, and to improve our SVM learner. Thus, our next goal will be to improve the expansion by applying further techniques. For example, it will be interesting to apply query expansion on all languages and to repair our computation function for combining all methods. Our runs obtained poor results due to a computation error, which will be fixed in future research. In addition, further investigation of textual processing could achieve better results.

Acknowledgements. The research presented in this paper was funded by the Sectoral Operational Program for Human Resources Development through the project “Development of the innovation capacity and increasing of the research impact through post-doctoral programs” POSDRU/89/1.5/S/49944.

References

1. Chang, C. C., Lin, C. J.: LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27. (2011)
2. Hsu, C. W., Chang, C. C., Lin, C. J.: A Practical Guide to Support Vector Classification. Department of Computer Science, National Taiwan University, Taipei 106, Taiwan, April 15. (2010)
3. Chatzichristofis, S., Boutalis, Y. S.: CEDD: Color and Edge Directivity Descriptor – A Compact Descriptor for Image Indexing and Retrieval. In: 6th International Conference on Computer Vision Systems (ICVS 2008), May 12-15, 2008, Santorini, Greece. (2008)
4. Eakins, J., Graham, M.: Content-based Image Retrieval. Report number 39, University of Northumbria at Newcastle, October. (1999)
5. Lowe, D. G.: Object Recognition from Local Scale-Invariant Features. In: Proceedings of the International Conference on Computer Vision, Corfu, Greece, pp. 1150-1157. (1999)
6. Lowe, D. G.: Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2), pp. 91-110. (2004)
7. Boroș, E., Roșca, G., Iftene, A.: Using SIFT Method for Global Topological Localization for Indoor Environments. In: C. Peters et al. (Eds.): CLEF 2009, LNCS 6242, Part II (Multilingual Information Access Evaluation Vol. II, Multimedia Experiments), pp. 277-282. ISBN: 978-3-642-15750-9. Springer, Heidelberg. (2009)