Multimodal Photo Retrieval through Finding Similar Documents Enhanced with Visual Clues – a Baseline Method

Bartosz Broda, Mariusz Paradowski, Halina Kwaśnicka
Institute of Informatics, Wroclaw University of Technology
Wybrzeże Wyspiańskiego 27, 50-370 Wrocław, Poland
{bartosz.broda, mariusz.paradowski, halina.kwasnicka}@pwr.wroc.pl

Abstract

Image retrieval remains one of the most challenging problems in computer science. Even though a large amount of research is carried out around the world, an efficient, user-friendly image retrieval system still seems to be an unachievable goal. The ImageCLEF Photo Retrieval Track allows various approaches to this challenging problem to be compared. In this paper we present the starting point of our research, connected to a joint Polish–Singaporean research project titled Framework for Visual Information Retrieval and Building Content-based Visual Search Engines. Various techniques published in the literature are gathered and orchestrated together. A reference image retrieval system is built, supporting image queries, text queries and joint text–image queries. In our work we have tried to capture the state of the art in text and image retrieval.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 [Database Management]: Languages—Query Languages

General Terms

Measurement, Performance, Experimentation

Keywords

Information retrieval, Image retrieval, Integrated Region Matching, Mallows distance, Document similarity, Vector Space Model

1 Introduction

In this paper we present the work performed during our participation in the ImageCLEF Photo Retrieval Task. The presented approach is a baseline, the starting point of our research. The work is done as a part of the joint Polish–Singaporean research project titled Framework for Visual Information Retrieval and Building Content-based Visual Search Engines. The aim of our work is to build image retrieval methods incorporating image matching techniques, 'soft' image distance measures and knowledge-based methods. As this is a starting point, none of the presented techniques is novel; all of them have been present in the research literature for at least several years. The key difficulty lies in combining these known approaches into a viable and efficient image retrieval system. We participate in the ImageCLEF contest for the first time, and our goal is to check where we stand compared to other, often much larger and more experienced, research teams. We have achieved this goal by preparing a simple image retrieval system that uses state-of-the-art techniques described in the literature. Since the contest took place, new ideas have appeared and are currently under heavy research. We hope that these ideas will be verified in the next contest, ImageCLEF 2010.

2 Task Description

As the name suggests, the ImageCLEF 2009 Photo Retrieval Task (http://imageclef.org/2009/photo), henceforth ICPRT2009, is one of the tasks of information retrieval. The goal of information retrieval, and more specifically of image retrieval, is to satisfy users' information needs: to find images that are relevant to user queries. In ICPRT2009 those needs are specified by a detailed description of user queries in the form of topics. The topics were divided into two groups containing 25 queries each. In the first group every topic is described by a title and a number of clusters.
A cluster has a title, a description and an image example. In the second group, topics have only a title and several image examples. Figure 1 shows one of the topics developed for the ICPRT2009 competition.

(a) Cluster title: kim gevaert. Description: Relevant images will show photographs of Kim Gevaert. Images showing her and other people are relevant if she is shown in the foreground. Other images where Kim is in the foreground are irrelevant.
(b) Cluster title: agfa gevaert. Description: Images are relevant if they contain photographs of the Agfa-Gevaert company. Relevant images include those showing the logos, buildings or any aspects of the company.
(c) Cluster title: hellebaut gevaert. Description: Relevant images show photographs of Hellebaut and Kim Gevaert. Images showing only one of them are not relevant.

Figure 1: Example of a user query in the form of a topic. The topic has number 7, is titled gevaert and contains only three clusters.

One of the main goals of the ICPRT2009 competition is to encourage participants to focus on diversity in the retrieved collection of images. User queries often do not convey the true intentions of the user in a machine-understandable (or even human-understandable) form; they are ambiguous by definition. Taking this into account, promoting diversity is one of the approaches to diminishing the problem of query ambiguity. In classical information retrieval, when a user searches for the word "bank", it is not possible to determine whether the user wants to find information about river banks or about financial institutions (at least not without asking the user for clarification or introducing additional techniques such as a user profile). In image retrieval this corresponds to searching for a Formula 1 car by presenting an image of such a car during a race (probably accompanied by a short description). Does the user want to find any images of Formula 1 cars, or should the pictures be taken during a race? Perhaps the user is interested in pictures of only one team, or only from the same event? One of the approaches to this problem is to focus on finding all potentially relevant images, covering as broad a range of subtopics as possible.

One of the most interesting aspects of ICPRT2009 is that the work is performed on a real image database from the Belga News Agency, with topics created on the basis of an analysis of Belga query logs. The image database consists of 498,920 images. Every image is accompanied by a caption, or annotation, i.e., a few English sentences describing the image content. The format of the captions is not formally standardized, so basically a caption can contain anything. Fortunately, one can observe a pattern that is usually followed: the beginning, written in upper case, contains an image identifier, date and place, followed by a description of the image content in normal case. The caption usually ends with an attribution of authorship, again written in upper case. Figure 2 shows examples of images with captions.
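This loosely followed pattern (upper-case header, mixed-case body, upper-case attribution) can be exploited with very simple heuristics. The following Python fragment is only an illustrative sketch of such a heuristic split, not part of the submitted system; the function name and its output fields are our own.

```python
import re

def split_caption(caption):
    """Heuristically split a Belga-style caption into (header, body,
    attribution), assuming the loose pattern described above:
    upper-case header, mixed-case body, upper-case attribution."""
    words = caption.split()

    def upper_or_neutral(word):
        # words without letters (dates, dashes, punctuation) do not break a run
        letters = re.sub(r"[^A-Za-z]", "", word)
        return not letters or letters.isupper()

    start = 0                                   # leading upper-case run -> header
    while start < len(words) and upper_or_neutral(words[start]):
        start += 1
    end = len(words)                            # trailing upper-case run -> attribution
    while end > start and upper_or_neutral(words[end - 1]):
        end -= 1

    return (" ".join(words[:start]),            # identifier, date, place
            " ".join(words[start:end]),         # image content description
            " ".join(words[end:]))              # authorship attribution
```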
The size of the database introduces interesting efficiency problems, especially with regard to image processing. First, image feature vectors need to be computed for the whole image database. Afterwards, image distance calculations need to be made. As presented later in the paper, a single distance calculation requires solving an optimization problem; such a process, repeated for hundreds of thousands of images, is very time consuming. On the other hand, processing almost 500,000 documents (captions) might seem difficult. On the contrary, due to the short length of the captions, it is a rather easy and straightforward process. It is worth noting that the file with captions in raw text format takes only 162 MB, which is little compared to other information retrieval tasks that have to deal with hundreds of gigabytes of data, e.g., [2, 16]. Creating all the necessary data structures in our case takes from 7 to 15 minutes on a commodity PC, depending on the experiment setting.

3 Development of the Baseline Method

It is very difficult to define visual similarity between images in a formal way, because one would have to know the semantics of a given image. Note that there can be a huge difference between images that are visually similar and images that are semantically similar. Two images of silver cars can be visually very similar, even if one of them shows an Audi and the other one shows only a toy car, a model of a Mercedes. A user searching for pictures of an Audi would not be satisfied with images of toy cars. On the other hand, if the user were interested in any picture of a silver car taken from the side, he might be perfectly satisfied with that result. We think that extraction of semantic clues from text is significantly easier than from images, so for ICPRT2009 we assumed a model that uses textual features as the primary knowledge source. After the initial processing of textual data, visual data is used to refine the results presented to the user.

There is a plethora of methods for textual information retrieval [10]. Combining them with techniques using visual data in image retrieval [3] gives a very large number of possible ways to approach ICPRT2009. As mentioned earlier in the paper, the presented research is a part of the joint Polish–Singaporean research project titled Framework for Visual Information Retrieval and Building Content-based Visual Search Engines. One of the aims of our project is to develop methods for finding images visually similar to a given one using only visual information. We divided the work into two stages: creating a ranked list of similar images for every image in a topic, and combining those lists considering all the images in the topic. We separated textual processing from visual processing because we wanted to focus on robust and reliable techniques from both the visual and the textual point of view separately. After that we developed a method for combining both knowledge sources. Our main aim in this research is to create a prototype image retrieval method for which precision of retrieval is the most important factor.

(a) BRU199 - 20031012 - METTET, BELGIUM : Illustration picture shows a pilot making smoke during a burn-out wheelie at the Superbiker Grand Prix of Mettet, Sunday 12 October 2003, in Mettet. BELGA PHOTO JOHN THYS
(b) The early morning rising sun over the Bavarian town of Marktoberdorf gives a golden glow to the cloud cover Tuesday 2nd December 2003. This area just short of the Alpine region continues to enjoy extraordinary mild temperatures with 16 degrees C recorded yesterday. EPA/Karl-Josef Hildenbrand COLOR
(c) Japan's maglev train setting a world speed record on an experimental track in Yamanashi Province, Tuesday 02 December 2003. The three-car magnetically levitated train reached a maximum speed of 581 kilometers per hour with technicians on board, according to Central Japan Railway Co. (JR Tokai) and the government-affiliated Railway Technical Research Institute, which operates the experimental train. EPA/EVERETT KENNEDY BROWN
(d) LHT22 - 20010223 - LAHTI, FINLAND : From L to R Germany's Martin Schmitt, silver medal, Poland's Adam Malysz, gold medal, and Austria's Martin Hoellwarth, bronze medal, jubilate on the podium after the K 90 ski jump final at the Nordic World Ski Championships in Lahti on Friday, 23 February 2001. EPA PHOTO EPA-ANJA NIEDRINGHAUS

Figure 2: Examples of images with captions (annotations) from the Belga News Agency collection.
As mentioned earlier, we treat this work as a test ground for developing a baseline method upon which we are going to build more complete solutions later on.

3.1 Using Visual Clues

Similarity search in the image domain has always been, and still is, a great challenge [3]. Although hundreds of different approaches have been proposed in the literature, there is no general one working for a large domain of images. To build an image retrieval system, one has to decide on three key components: the image distance function, the image segmentation method and the feature extraction method. Proper selection of all these components, so that they fit together, is a difficult problem.

3.1.1 Distance function

In our work we have examined several methods widely discussed in the literature. Local image distances are our major topic of interest. A local image distance is a distance calculated between individual segments of images and later transformed into an image distance. Such a distance operates on sets of feature vectors (which are not ordered) instead of on single feature vectors. This means that a single image I has to be defined in terms of its segments (and feature vectors) i_k, k = 1, ..., n, as follows:

I = \{ i_1, i_2, \ldots, i_n \}.   (1)

As a result of our research, the decision was made to use a variant of the Mallows distance, called Integrated Region Matching [8, 17]. The method was proposed by Wang in 2001 as a part of the SIMPLIcity image retrieval system. Our experiments have shown that it is still one of the most effective means of image retrieval. The distance function is defined as an optimization problem:

D(I, J) = \min_{S = [s_{ij}]} \sum_{i \in I} \sum_{j \in J} s_{ij} \, d(i, j),   (2)

constrained by:

s_{ij} \geq 0, \quad 1 \leq i \leq |I|, \; 1 \leq j \leq |J|, \qquad \sum_{i=1}^{|I|} \sum_{j=1}^{|J|} s_{ij} = \sum_{i=1}^{|I|} p_i = \sum_{j=1}^{|J|} q_j = 1,   (3)

\sum_{j=1}^{|J|} s_{ij} = p_i, \quad 1 \leq i \leq |I|, \qquad \sum_{i=1}^{|I|} s_{ij} = q_j, \quad 1 \leq j \leq |J|,   (4)

where:
I, J – feature vector sets,
i, j – single vectors belonging to I and J, respectively,
S = [s_{ij}] – the significance matrix, the search space of the optimization method,
p_i – probability of a region i ∈ I (usually equal to the segment's relative size),
q_j – probability of a region j ∈ J (usually equal to the segment's relative size),
d(i, j) – a vector distance measure, usually the Euclidean distance.

Of course the optimization problem itself is very challenging and finding a global solution is simply not feasible. The Integrated Region Matching approach is in fact an iterative greedy algorithm of pairing image segments. Calculating the distance between two images does not take much time; however, calculating it for the whole database requires much computational power.
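The greedy matching can be summarized in a few lines of Python. The sketch below is our reading of the greedy scheme from [8, 17], not the authors' exact SIMPLIcity implementation; the function name is ours and the region distances and significances are assumed to be given.

```python
import numpy as np

def irm_distance(dist, p, q):
    """Greedy Integrated Region Matching distance.

    dist -- |I| x |J| matrix of region-to-region distances d(i, j)
    p, q -- region significances of images I and J (each sums to 1)

    Repeatedly matches the closest pair of regions that still have
    unassigned significance, accumulating s_ij * d(i, j).
    """
    dist = np.asarray(dist, dtype=float)
    p = np.asarray(p, dtype=float).copy()
    q = np.asarray(q, dtype=float).copy()
    # region pairs visited in order of increasing distance
    rows, cols = np.unravel_index(np.argsort(dist, axis=None), dist.shape)
    total = 0.0
    for i, j in zip(rows, cols):
        if p[i] <= 0.0 or q[j] <= 0.0:
            continue                      # region already fully matched
        s_ij = min(p[i], q[j])            # significance assigned to this pair
        total += s_ij * dist[i, j]
        p[i] -= s_ij
        q[j] -= s_ij
    return total

# With the 5 x 5 grid of Section 3.1.2 every region gets p_i = q_j = 1/25.
```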
3.1.2 Image segmentation

The second key component is image segmentation. The processed image database is a general-purpose database, containing various kinds of images and encompassing many visual domains. For such databases a supervised image segmentation approach would have to consider all these domains and is not a good choice. This means that an unsupervised image segmentation method, with all its advantages and disadvantages, has to be selected. Unsupervised methods may be further divided into block-based approaches (fixed image cuts) and region-based approaches (segmentation in the classical sense). Block-based methods have been adopted in automatic image annotation as an effective means of image segmentation [4]. Our earlier research in automatic image annotation [6, 11] also confirmed this observation: the generated feature vectors are much less prone to changes caused by slight changes of image content. The chosen image segmentation approach is a regular grid method. The grid has 5 × 5 dimensions for every image in the database, which means that each image is split into 25 rectangular blocks of identical size.

3.1.3 Feature extraction

The last key component is the set of image features. In this research we have focused on three types of image segment features: location, color and texture. Such an approach is popular in image retrieval and automatic image annotation research, e.g. [4]. Let us now describe the features we use; a simplified extraction sketch is given after the list.

Location-related features are:
• normalized region size (which is constant due to grid segmentation),
• region average x and y coordinates.

Color features are rather straightforward. Only two basic color models are used:
• region intensity means for red, green and blue (RGB color model),
• region intensity standard deviations for red, green and blue,
• region intensity means for hue, brightness and saturation (HSV color model),
• region intensity standard deviations for hue, brightness and saturation.

Texture features are much more complex. They include the results of image processing by a Sobel edge detector, a Hessian-based edge detector (method taken from the Bio-medical Imaging Java library, see http://bij.isi.uu.nl/) and co-occurrence matrices:
• region intensity means of the Sobel edge detector for the red, green and blue channels,
• region intensity standard deviations of the Sobel edge detector for the red, green and blue channels,
• region intensity means of the Hessian-based edge detector for the red, green and blue channels,
• region intensity standard deviations of the Hessian-based edge detector for the red, green and blue channels,
• region co-occurrence matrix values: correlation, entropy, homogeneity, contrast, dissimilarity, energy and the sum of all values, for the red, green and blue channels.
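A minimal sketch of the block-based extraction is given below, assuming images are loaded as H × W × 3 RGB arrays; only the grid split, the location features and the RGB mean and standard deviation features are shown, the HSV, edge and co-occurrence statistics would be appended to the same vectors in an analogous way. The function name is illustrative.

```python
import numpy as np

def grid_features(image, grid=5):
    """Split an RGB image (H x W x 3, values 0..255) into a regular
    grid x grid block grid and compute, per block: normalized size,
    normalized centre coordinates, and RGB channel means and standard
    deviations.  Returns grid*grid feature vectors (25 for grid=5)."""
    h, w, _ = image.shape
    vectors = []
    for r in range(grid):
        for c in range(grid):
            y0, y1 = r * h // grid, (r + 1) * h // grid
            x0, x1 = c * w // grid, (c + 1) * w // grid
            block = image[y0:y1, x0:x1].reshape(-1, 3).astype(float)
            feats = [
                1.0 / (grid * grid),        # normalized region size
                (x0 + x1) / (2.0 * w),      # region average x coordinate
                (y0 + y1) / (2.0 * h),      # region average y coordinate
            ]
            feats += block.mean(axis=0).tolist()   # R, G, B means
            feats += block.std(axis=0).tolist()    # R, G, B standard deviations
            vectors.append(feats)
    return np.array(vectors)
```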
3.2 Using Textual Data

The idea for textual retrieval is similar to that described in the previous section for visual features. First we extract textual features for each image in the collection, and then we use a similarity measure to compare pairs of images. As this is a baseline method for our group, we assumed only very shallow processing without the use of elaborate language processing tools. We also used a classic representation of textual data, namely the Vector Space Model (VSM) [10, 13]. After an initial inspection of both the topics and the image captions we decided to use only the image captions. We assumed that, for the VSM, captions describe an image better than cluster descriptions. For humans a cluster description is more informative, but it would require a deeper level of processing and the use of more elaborate language processing tools: e.g., we would need to resolve negations when searching for terms that are irrelevant (creating heuristics that would work for the 50 topics provided by the ICPRT2009 organizers would be feasible, but we assumed that descriptions are not constrained in any way), prepare a special stop-list, etc.

We treat each caption as a document. After preprocessing, the documents are represented in the VSM, and we use the cosine as a measure of similarity. Preprocessing involves two steps. We use a stop-list, i.e., a list of a few hundred words that do not contribute much to the semantics of a document, such as prepositions and articles. In some of our experiments we also use the classic Porter stemming algorithm [12]. This allows us both to reduce the index size and to treat different morphological variants of a word (e.g., plural forms) as one object called a stem.

As the VSM is a commonly known technique in the natural language processing community, we will outline only the main points for readers with no experience in this area. The most important concept in the VSM is that a document \vec{D}_i is represented as a vector in an n-dimensional feature space, where n is the number of different terms found during indexing:

\vec{D}_i = \langle tf_{i,1}, tf_{i,2}, \ldots, tf_{i,n} \rangle,   (5)

where tf_{i,j} is the number of occurrences of term j in document i. As some occurrence frequencies can be accidental, it is best to use a weighting scheme. For this work we again focused on a classic and robust method, i.e., the tf.idf weighting scheme [13]. Instead of using raw term frequencies we use weights, calculated for a term t in a document d in the following way:

tf.idf_{t,d} = tf_{t,d} \cdot \log \frac{N}{df_t},   (6)

where N is the number of documents (captions) and df_t is the number of documents containing the term t. Representing captions as vectors enables the use of many similarity (or distance) measures known from the literature. We used the cosine as the similarity measure for our baseline method, as it has been shown many times that it copes well with high-dimensional data spaces, including documents represented in the VSM [1, 9, 10].

3.3 Combining Text with Images

As precision rather than diversity is our top priority, we did not devise very elaborate techniques for merging visual and textual features into the final model. Both the visual and the textual part of our system produce, for every image that appears in the topics, a ranked list of images accompanied by a score. For the generation of the textual list we used the cosine measure, which is a similarity measure giving values from 0 to 1 (in the case of non-negative vector values). On the other hand, Integrated Region Matching (IRM) is an unbounded distance function. To convert the distance into a similarity we subtract the normalized value of the IRM function from one. After the conversion we simply multiply the values of both functions if the cosine is lower than a threshold t = 0.8. We introduced the threshold in order to preserve almost perfect matches from the textual phase. The multiplication of both similarity functions results in a single ranked list of images for every image in the topics.

As every topic contains a few images, only one question remains: how to combine the different similarity lists into one list for the topic. In this step we also use very simple methods. The first method we considered is a naive joining of the sorted similarity lists into one big list sorted by similarities; duplicated images are removed from this list and the top 20 images are presented to the user (or used for evaluation). The second method is more balanced: we calculate k as a simple division of 20 by the number of clusters in the topic, and then from every list we select the k best images and combine them all into the resulting list. In case of duplicates we draw more images from a randomly selected list among the lists containing duplicates. The resulting list is also sorted by similarity values.
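The fusion and the balanced merging can be sketched as follows. The scores are assumed to be precomputed (cosine over the tf.idf vectors of Section 3.2, IRM normalized to the 0..1 range over the candidate set), the function names are illustrative, and the duplicate handling is simplified with respect to the randomized per-list refill described above.

```python
import random

def combine_scores(cosine, irm_norm, t=0.8):
    """Fuse a textual cosine similarity (0..1) with an IRM distance already
    normalized to 0..1.  Near-perfect textual matches (cosine >= t) are kept
    as they are; otherwise the two similarities are multiplied."""
    visual_similarity = 1.0 - irm_norm
    return cosine if cosine >= t else cosine * visual_similarity

def balanced_merge(ranked_lists, limit=20):
    """'Balanced' merging of per-cluster ranked lists of (image_id, score)
    pairs: roughly limit / #lists images are taken from every list, skipping
    duplicates; the remaining slots are refilled from shuffled leftovers."""
    k = max(1, limit // len(ranked_lists))
    merged, seen = [], set()
    for ranked in ranked_lists:
        taken = 0
        for image_id, score in ranked:
            if image_id in seen:
                continue                     # duplicate of an already selected image
            merged.append((image_id, score))
            seen.add(image_id)
            taken += 1
            if taken == k:
                break
    # fill up to the limit from the remaining candidates
    leftovers = [(i, s) for ranked in ranked_lists for i, s in ranked if i not in seen]
    random.shuffle(leftovers)
    for image_id, score in leftovers:
        if len(merged) >= limit:
            break
        if image_id not in seen:
            merged.append((image_id, score))
            seen.add(image_id)
    return sorted(merged, key=lambda pair: pair[1], reverse=True)[:limit]
```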
4 Experimental Results

For evaluation we submitted five runs: four using both textual and visual clues and one using textual data only. The runs differ only in the method of combining similarity lists for individual images into the final list for a topic, and in the usage of the stemmer. Table 1 summarizes the results achieved by our system. The measures used for evaluation are precision, cluster recall and the F1-measure at different cutoff levels. For brevity, we show only two cutoff levels: at 10 and at 20 documents. The former is used by the organizers to rank all participating systems in ICPRT2009 and the latter is the number of results we submitted for each topic. The results for the runs presented in Table 1 differ very little. Not surprisingly, with the higher cutoff value the precision is lower and the cluster recall is higher. Usage of the stemmer has little impact on this task. The balanced scheme of similarity list joining seems to be better, especially in terms of cluster recall; nevertheless, the difference is small.

Stemmer  Type      Modality  CR@10   CR@20   P@10  P@20  F@10    F@20
no       balanced  TXTIMG    0.5991  0.7300  0.80  0.71  0.6837  0.7356
yes      balanced  TXTIMG    0.5946  0.7529  0.80  0.72  0.6815  0.7351
no       naive     TXTIMG    0.5856  0.6717  0.79  0.72  0.6741  0.7291
no       naive     TXT       0.5856  0.6717  0.79  0.72  0.6741  0.7291
yes      naive     TXTIMG    0.5811  0.6743  0.80  0.72  0.6718  0.7286

Table 1: Results of the experiments, sorted by the F1-measure with a cutoff at 10 positions.

We approached the development of our baseline system with a proper method for evaluating the results in mind. That is why we decided to create, independently from the organizers, a system for the evaluation of precision. We did not consider evaluation of cluster recall because, as mentioned earlier, we focus on the quality of the retrieved similarity lists, especially with regard to visual similarity. Figure 3 shows an example screen from the application developed to support manual annotation of the results. All the results are stored in a database and automatically retrieved when needed, so a pair of images has to be annotated only once, even if it repeatedly occurs in different experiments. To shorten the time needed for evaluation we added functionality for selecting a statistically significant random sample from the whole result set [5].

Figure 3: Example screen from the evaluation application. On the left is a window showing the ids of the 20 images most similar to the cluster image shown on the right, as returned by one of our methods. When the user selects an image id for evaluation, the image is shown in the frame. All the resulting images shown are already evaluated.

Surprisingly, the results of the evaluation obtained by our team were significantly lower than those prepared by the ICPRT2009 organizers (see Table 1). Our best method did not achieve a precision better than 0.56 (at a cutoff of 20 documents). We think that the main cause for this is that we evaluated n times more images for every topic than in the submitted runs, where n is the number of clusters in the topic. Another reason for the lower precision is that some images may have a very vague caption, or a caption that does not correspond exactly to the cluster description. For example, topic number 21 with the title "princess maxima" contains clusters with Princess Maxima appearing in different years. The cluster describing the year 2002 has an image of Princess Maxima skiing. Naturally, the textual part of the system is misled into finding images with similar captions that show other people skiing.
Visually, too, other people skiing are more similar than pictures of Princess Maxima taken at a different time of the same year.

Figure 4 shows an example of the five best images retrieved for the second cluster of topic 19, titled "justine henin kim clijsters". The cluster description says that images must show both Justine Henin and Kim Clijsters in the foreground to be relevant. This example shows an interesting case, because the retrieved images are both semantically and visually similar to the user query.

Figure 4: Example of our method: (a) query image, followed by the five most similar images, (b) 1st result, (c) 2nd result, (d) 3rd result, (e) 4th result, (f) 5th result.

Figure 5: Example of visual similarity for one of the images in the topics: (a) query image, (b) the image most similar to the query using IRM.

On the other hand, we want to show the problems caused by using visual similarity based on high-level statistical features derived from grid-segmented images for semantically oriented user queries. Figure 5 shows a query image containing the Fortis Bank logo and the most similar image from the whole Belga repository found using only visual features. Without considering the semantics of the images, those two images are, in our opinion, highly similar. The problem shown in Fig. 5 is not an isolated case. This is the reason why we did not consider using only visual clues for the runs submitted to ICPRT2009.

5 Conclusions and Further Work

We presented a system implementing a baseline method for image retrieval that took part in the ImageCLEF 2009 Photo Retrieval task. The method consists of three fairly independent components: visual, textual and merging. Splitting our system into these parts has many benefits: we can improve each individual component independently, without affecting the others, we are able to measure how much the textual and visual modules contribute to the final results, etc.

In the visual image retrieval part we have orchestrated three important components: an image distance function, an image segmentation method and a feature extraction approach. All these components play an important role in the system and are responsible for the quality of the visual processing. Textual image retrieval by caption is based on the well-known Vector Space Model using the tf.idf weighting scheme and the cosine as a measure of similarity between image captions [13]. We also used a very naive technique for merging the results of the visual and textual components. Surprisingly, we obtained rather high results, with an F1-measure lower by only 0.13 than the best system participating in this edition of the Photo Retrieval Task. More surprisingly, when we consider only precision, which we were aiming at, the difference between our best approach and the most precise system in the competition is even smaller: 0.05.

Obviously, as this is our baseline system, there are many areas for improvement. From the textual point of view one can try the plethora of methods that have been developed recently; the use of deeper syntactic and semantic analysis can improve performance. The merging step in our method also needs dramatic improvement. We consider using clustering algorithms for better partitioning of both the data and the subset of the data containing only retrieved images. Developing a measure of the certainty of retrieval for both the visual and textual parts will lead to more intelligent ways of joining results obtained with different modalities. As mentioned throughout the paper, we are intensely developing various approaches in image retrieval and automatic image annotation.
It is worth mentioning the improved version of the MAGMA [15] image annotation system, which also took part in the ImageCLEF 2009 contest. We have also proposed a theoretical model of optimal automatic image annotation and its practical realization, called the Greedy Resulted Words Count Optimizer [7, 11]. Another concept currently under research (but not yet published) is an extension of Integrated Region Matching. All the presented image retrieval methods are based purely on spatial, color and texture features. We are also working on integrating those features with local features, such as SIFT or pattern-based approximations of patches using the Hough transform [14]. We hope to use at least some of the mentioned approaches in the coming ImageCLEF 2010.

Acknowledgment. This work is financed from the resources of the Ministry of Science and Higher Education of the Republic of Poland in the years 2008–2010 as Poland–Singapore joint research project 65/N-SINGAPORE/2007/0. It is supported by the DCS-Lab, which is operated by the Department of Distributed Computer Systems (DDCS) at the Institute of Informatics, Wroclaw University of Technology, Wroclaw, Poland.

References

[1] Bartosz Broda and Maciej Piasecki. Experiments in documents clustering for the automatic acquisition of lexical semantic networks for Polish. In Mieczysław A. Kłopotek, Adam Przepiórkowski, Sławomir T. Wierzchoń, and Krzysztof Trojanowski, editors, Proceedings of the Sixteenth International Conference on Intelligent Information Systems, Advances in Soft Computing, pages 203–212, Warsaw, 2008. Academic Publishing House EXIT.
[2] C. Clarke, N. Craswell, and I. Soboroff. Overview of the TREC 2004 terabyte track. In Proceedings of the 13th Text REtrieval Conference, Gaithersburg, USA, 2004.
[3] Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z. Wang. Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys (CSUR), 40(2), 2008.
[4] S. L. Feng, R. Manmatha, and V. Lavrenko. Multiple Bernoulli relevance models for image and video annotation. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages II-1002–II-1009, 2004.
[5] Glenn D. Israel. Determining sample size. Technical report, University of Florida, 1992.
[6] Halina Kwasnicka and Mariusz Paradowski. Fast image auto-annotation with discretized feature distance measures. Machine Graphics and Vision International Journal, 15(2):123–140, 2006.
[7] Halina Kwasnicka and Mariusz Paradowski. Resulted word counts optimization – a new approach for better automatic image annotation. Pattern Recognition, 41(12):3562–3571, 2008.
[8] Jia Li, James Z. Wang, and Gio Wiederhold. IRM: Integrated region matching for image retrieval. pages 147–156, 2000.
[9] C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, 2001.
[10] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.
[11] Mariusz Paradowski. Automatic Image Annotation as an Effective Method for Image Captioning (in Polish). PhD thesis, Wroclaw University of Technology, Poland, 2008.
[12] M. F. Porter. An algorithm for suffix stripping. pages 313–316, 1997.
[13] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Commun. ACM, 18(11):613–620, 1975.
[14] Andrzej Sluzek. Building local features from pattern-based approximations of patches: Discussion on moments and Hough transform. EURASIP Journal on Image and Video Processing, 2009.
[15] Michal Stanek, Bartosz Broda, Halina Kwasnicka, and Mariusz Paradowski. MAGMA – efficient method for image annotation in low dimensional feature space based on multivariate Gaussian models. In Proc. of IMCSIT 2009 (accepted), 2009.
[16] Ellen M. Voorhees and Lori P. Buckland, editors. Proceedings of The Seventeenth Text REtrieval Conference, TREC 2008, Gaithersburg, Maryland, USA, November 18–21, 2008, volume Special Publication 500-277. National Institute of Standards and Technology (NIST), 2008.
[17] James Z. Wang, Jia Li, and Gio Wiederhold. SIMPLIcity: Semantics-sensitive integrated matching for picture libraries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(9):947–963, 2001.