<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Toward Cross-Language and Cross-Media Image Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Carmen Alvarez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ahmed Id Oumohmed</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Max Mignotte</string-name>
          <email>mignotte@iro.umontreal.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jian-Yun Nie</string-name>
          <email>nie@iro.umontreal.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CP. 6128, succursale Centre-ville</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>DIRO, University of Montreal</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Montreal</institution>
          ,
          <addr-line>Quebec</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2001</year>
      </pub-date>
      <fpage>137</fpage>
      <lpage>150</lpage>
      <abstract>
        <p>This report describes the approach used in our participation in ImageCLEF. Our focus is on image retrieval using text, i.e., Cross-Media IR. To do this, we first determine the strong relationships between keywords and types of visual features. Then the subset of images retrieved by text retrieval is used as a set of examples to match other images according to the most important types of features of the query words.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The RALI group at University of Montreal has participated in several CLEF experiments on
Cross-Language IR (CLIR). The goal of our participation in this year’s ImageCLEF experiments is to see
how our approach can be extended to Cross-Media IR. Two research groups are involved in this
task: one on image processing and the other on text retrieval. Our CLIR approach is similar to
that used in our previous participation in CLEF, i.e. we use statistical translation models trained
on parallel web pages for French to English translations. For the translation from other languages,
we use bilingual dictionaries. Our focus is on image retrieval from text queries.</p>
      <p>
        Different approaches have been used for image retrieval. 1) A user can submit a text query,
and the system can search for images using image captions. 2) A user can submit an image query
(using an example image - either selected from a database or drawn by the user). In this case,
the system tries to determine the most similar images to the example image by comparing various
visual features such as shape, texture, or color. 3) A third group of approaches
tries to assign semantic meaning to images. This approach is often used to annotate images
by concepts or by keywords [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Once images have been associated with different keywords, they
can be retrieved for a textual query.
      </p>
      <p>The above three approaches have their own advantages and weaknesses.</p>
      <p>The first approach is indeed text retrieval. There is no particular image processing. The
coverage of the retrieval is limited to images with captions.</p>
      <p>The second approach does not require the images to be associated with captions. However,
the user is required to provide an example image and a visual feature or a combination of some
features to be used for image comparison. This is often difficult for a non-expert user.</p>
      <p>
        The third approach, if successful, would allow us to automatically recognize the semantics of
images, thus allowing users to query images by keywords. However, the development up to now only
allows us to annotate images according to some typical components or features. For example,
according to a texture analysis, one can recognize a region of an image as corresponding to a tiger
because of the particular texture of tigers [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It is still impossible to recognize all the semantic
meanings of images.
      </p>
      <p>
        Some recent studies [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] have tried to automatically create associations between visual features
and keywords. The basic idea is to use a set of annotated images as a set of learning examples, and
to extract strong associations between annotation keywords and the visual features of the images.
In our study, we initially tried to use a similar approach in ImageCLEF. That is, we wanted to
extract strong relationships between the keywords in the captions and the visual features of the
images. If such relationships could be created, then it would be possible to use them to retrieve
non-annotated images by a textual query. In this case, the relationships play the role of a translation
between media. However, we discovered that this approach is extremely difficult in the context of
ImageCLEF for several reasons:
1. The annotations (captions) of the images in the ImageCLEF corpus often contain keywords
that are not strongly associated with particular visual features. They correspond to
abstract concepts. Examples of such keywords are “Scotland”, “north”, and “tournament”.
      </p>
      <p>Therefore, if we use the approach systematically, there will be many noisy relationships.
2. Even if there are some relationships between keywords and visual features, these relationships
may be difficult to extract because the number of possible visual features is huge.
In fact, visual features are continuous. Even if we use some discretization techniques, their
number is still too high to be associated with some keywords. For example, for a set of
images associated with the keyword “water”, one would expect to extract strong relationships
between the keyword and the color and texture features. However, “water” in images may
only take up a small region of the image. There may be various other objects in the same
images, making it difficult to automatically isolate the typical features for “water”.</p>
      <p>Due to these reasons, we take a more flexible approach. We also use the images with captions
as a set of training examples, but we do not try to create relationships between keywords and
particular visual features (such as a particular shade of blue for the word “water”). We only try
to determine which type(s) of feature is (are) the most important for a keyword. For example,
“water” may be associated with “texture” and “color”. Only strong relationships are retained.
During the retrieval process, a text query is first matched with a set of images using image captions.
This is a text retrieval step. Then the retrieved images are used as examples to retrieve other
images, which are similar according to the determined types of features associated with the query
keywords. The whole process of our system is illustrated in Figure 1.</p>
      <p>In the following sections, we will first describe the image processing developed in this project. In
particular, we will describe the way that relationships between keywords and visual features are
extracted, as well as image retrieval with example images. In section 3, we will describe the CLIR
approach used. In section 4, both approaches are combined to perform image retrieval. Section 5
will describe the experimental results and some conclusions.</p>
      <p>
        Our approach is much less ambitious than that of [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], but it is more feasible in practice. In
fact, in many cases, image captions contain abstract keywords that cannot be strongly associated
with visual features, and even if they can, it is impossible to associate a single vector to a keyword.
Our approach does not require determining such a single feature vector for a given keyword. It
abandons the third approach mentioned earlier, but combines the first two families of approaches.
The advantage of extracting keyword-feature associations is to avoid the burden of requiring the
user to indicate the appropriate types of features to be used in image comparison.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Image processing-based learning procedure</title>
      <p>The objective of the automatic image processing-based learning procedure that we propose in this
section is twofold:
• First, this procedure aims at estimating the most discriminant type(s) of high-level visual
features for each annotated keyword. In our application, we have considered the three
fundamental visual characteristics, namely texture (including color information), edge and
shape. For example, the keyword “animal” could belong to the shape class since the measure
using shape information will be the most discriminant to identify images with animals (but
the more specific keywords “zebra” and “tiger” will more probably belong to the edge and
texture classes respectively, due to the characteristic coats of these animals).</p>
      <p>
          A discriminant measure belonging to each of these classes of visual features has then been
defined. We have considered:
1. The mean and the standard deviation of the energy distribution in each of the sub-bands
of a Haar wavelet [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] decomposition of the image as the discriminant measure of the edge
class.
2. The coarseness measure proposed by Tamura et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] as the discriminant measure of the
texture class.
3. The histogram of the edge orientations of the different shapes extracted from the input
image (after a region-based segmentation) as the discriminant measure of the shape class.
      </p>
      <p>• The second objective is to identify a set of candidate images that are the most representative
for each annotated keyword, in the sense of a similarity distance combining one or several
pre-estimated visual feature classes.</p>
      <p>The type of high-level visual feature (along with its discriminant measure) and the set of
candidate images along with their associated normalized similarity distances will be used with the
cross-language information to refine the retrieval process.</p>
      <sec id="sec-2-1">
        <title>Edge class and its measure</title>
        <p>
          Wavelet-based measures have often been used in content-based image retrieval systems because of
their appealing ability to describe the local texture and the distribution of the edges of a given image
at multiple scales. In our application we use the Haar wavelet transform [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] for the luminance
(i.e., grey-level) component of the image. There are several other wavelet transforms, but the Haar
wavelet transform has better localization properties and requires less computation than
other wavelets (e.g., Daubechies’ wavelets). The procedure of image decomposition into wavelets
involves recursive numeric filtering. It is applied to the set of pixels of the digital image which is
decomposed with a family of orthogonal basis functions obtained through translation and dilatation
of a special function called mother wavelet. At each scale (or step) in the recursion, we obtain four
sub-bands (or sets of wavelet coefficients), which we refer to as LL, LH, HL and HH according to
their frequency characteristics (L : Low and H : High, see Figure 2). The LL sub-band is then
decomposed into four sub-bands at the next scale decomposition. For each scale decomposition
(three considered in our application), we compute the mean and the standard deviation of the
energy distribution (i.e., of the squared wavelet coefficients) in each of
the sub-bands. This leads to a vector of 20 (i.e., (2 × 3 × 3) + 2) components or attributes which
can be viewed as the descriptor (or the signature) of the edge information/characteristics of the
image. For example, an image containing a zebra thus has high energy in the HL sub-band and
low energy in the LH sub-band, due to the vertical stripes of this animal’s coat.
        </p>
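        <p>As an illustration, the 20-component edge descriptor can be computed as in the following
sketch. This is not the code used in our system; it is a minimal example assuming the NumPy and
PyWavelets packages and a 2-D grey-level image array.</p>
        <preformat>
# Minimal sketch of the 20-component edge descriptor: mean and standard
# deviation of the wavelet-energy distribution in each sub-band of a
# 3-level Haar decomposition (assumes NumPy and PyWavelets).
import numpy as np
import pywt

def edge_descriptor(gray_image, levels=3):
    coeffs = pywt.wavedec2(gray_image.astype(float), 'haar', level=levels)
    features = []
    # Detail sub-bands (LH, HL, HH) at each of the three scales: 2 x 3 x 3 = 18 values.
    for detail in coeffs[1:]:
        for band in detail:                  # (horizontal, vertical, diagonal)
            energy = band ** 2               # energy distribution of the sub-band
            features.extend([energy.mean(), energy.std()])
    # Remaining low-frequency (LL) sub-band: 2 more values, 20 in total.
    ll_energy = coeffs[0] ** 2
    features.extend([ll_energy.mean(), ll_energy.std()])
    return np.asarray(features)
        </preformat>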
      </sec>
      <sec id="sec-2-1b">
        <title>Texture class and its measure</title>
        <p>
          Tamura et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] have proposed to characterize image texture along the dimensions of contrast,
directionality, coarseness, line-likeness, regularity and roughness. These properties correspond to
human texture perception.
        </p>
        <p>− Contrast is a scalar value related to the amount of local intensity variations present in an
image and involves the standard deviation of the grey-level probability distribution.</p>
        <p>
          − Directionality is a global texture property which is computed from the oriented edge
histogram, obtained by an edge detector like the Sobel detector [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The directionality measures the
sharpness of the peaks in this edge histogram.
        </p>
        <p>− In this class of visual features, we have utilized only the coarseness property, which yields a
histogram of six bins, for the following reasons:</p>
        <p>Coarseness refers to the size of the texton; i.e., the smallest unit of a texture. This measure
thus depends on the resolution of the texture. With this measure, we can compute a histogram
with 6 bins (i.e., a 6-component attribute vector) which will be used as the descriptor of the
texture characteristics of a given image. The procedure for computing the coarseness histogram is
outlined below:
1. At each pixel with coordinates (x, y) in the image, and for each value of k (k
taking its value in {1, 2, . . . , 6}), we compute the average over its
neighborhood of size 2^k × 2^k, i.e.,</p>
        <p>A_k(x, y) = ( Σ_{i = x − 2^{k−1}}^{x + 2^{k−1} − 1}  Σ_{j = y − 2^{k−1}}^{y + 2^{k−1} − 1}  I(i, j) ) / 2^{2k}</p>
        <p>where I(i, j) is the intensity of the image at pixel (i, j).
2. At each pixel, and for the horizontal and vertical directions, we compute
the differences between pairs of averages corresponding to pairs of
non-overlapping neighborhoods on opposite sides of the pixel. The horizontal
and vertical differences are expressed as:</p>
        <p>E_{k,horizontal}(x, y) = |A_k(x + 2^{k−1}, y) − A_k(x − 2^{k−1}, y)|</p>
        <p>E_{k,vertical}(x, y) = |A_k(x, y + 2^{k−1}) − A_k(x, y − 2^{k−1})|</p>
        <p>3. At each pixel, the value of k that maximizes E_k(x, y), in either direction
(horizontal or vertical), is used to set the best size S_best(x, y) = 2^k. At this
stage we can consider, as descriptor, the scalar measure of coarseness which is
the average of S_best over the entire image, or consider, as in our application,
the histogram (i.e., the empirical probability distribution) of S_best, which is
more precise for discrimination.</p>
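        <p>The following is a rough sketch of this coarseness histogram computation. It is illustrative
only (not our exact implementation); it assumes NumPy and SciPy, uses a uniform filter for the
neighborhood averages A_k, and ignores border effects.</p>
        <preformat>
# Rough sketch of the 6-bin Tamura coarseness histogram (assumes NumPy/SciPy).
import numpy as np
from scipy.ndimage import uniform_filter

def coarseness_histogram(image, k_max=6):
    image = image.astype(float)
    h, w = image.shape
    # E[k-1] holds, at every pixel, the larger of the horizontal and vertical
    # differences between non-overlapping 2^k x 2^k neighborhood averages.
    E = np.zeros((k_max, h, w))
    for k in range(1, k_max + 1):
        half = 2 ** (k - 1)
        A = uniform_filter(image, size=2 ** k, mode='nearest')      # A_k(x, y)
        horiz = np.abs(np.roll(A, -half, axis=1) - np.roll(A, half, axis=1))
        vert = np.abs(np.roll(A, -half, axis=0) - np.roll(A, half, axis=0))
        E[k - 1] = np.maximum(horiz, vert)
    # Best neighborhood size at each pixel: the k that maximizes E_k.
    best_k = np.argmax(E, axis=0) + 1
    # Empirical distribution of S_best = 2^k, i.e. a histogram over k = 1..6.
    hist, _ = np.histogram(best_k, bins=np.arange(1, k_max + 2))
    return hist / hist.sum()
        </preformat>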
      </sec>
      <sec id="sec-2-2">
        <title>Shape class and its measure</title>
        <p>Description and interpretation of shapes contained in an input image remains a difficult task.
Several methods use a contour detection of the images (such as the Canny or Sobel edge detectors)
as a preliminary step in the shape extraction. But these methods remain dependent on certain
parameters, such as thresholds on the magnitude of the image gradient.</p>
        <p>
          In image compression, some approaches [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] use a vector quantization method on the set of
vectors of dimension K² of grey-level values corresponding to the K × K blocks extracted from the image.
By using a clustering procedure into a given number of classes, we can obtain an image with separate regions (a set
of connected pixels belonging to the same class) from which we extract the contours of the different
regions. These edges are connected and obtained without any parameter adjustment and the noise
is taken into consideration in this procedure. Figure 3 shows an example of edge detection using
three regions, i.e., three clusters in the vector quantization.
        </p>
        <p>
          In our application, we use this strategy of edge detection and we use, as clustering procedure,
the Generalized Lloyd algorithm [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] (generally used in this context). In our implementation, we use the
code provided by the QccPack library.
        </p>
        <p>For each edge pixel, we define a direction (horizontal, vertical, first or second diagonal)
depending on the disposition of its neighboring edge pixels. For each direction we count the number
of edge pixels associated with it, which yields a 4-bin histogram.</p>
      </sec>
      <sec id="sec-2-3">
        <title>The learning procedure</title>
        <p>Given a training database, we first pre-compute and store off-line, for each image, its three
descriptors (related to each of the three visual features). This set of three vectors simplifies the
representation of each image, giving maximal information about its content (according to each
considered feature). We now define a similarity measure between two images given a visual feature
class. This measure is simply the Euclidean distance between the two corresponding descriptor vectors.</p>
        <p>
          The learning procedure which allows us to determine the type of high-level visual feature (and
its measure) that is the most representative for each annotated keyword is outlined below:
1. Let I_w be the set of all images (each described by its three vectors or
descriptors [D_I^{texture}, D_I^{edge}, D_I^{shape}]) in the training database that are
annotated with the keyword w, and |I_w| the number of images in I_w.
2. For each class in { Texture, Edge, Shape }:
(a) We use a K-means clustering procedure [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] (with a Euclidean distance
for the similarity measure) on the set of samples D_Iw^{class}.
(b) This clustering allows us to approximate the distribution of the set
of samples D_Iw^{class} by K spherical distributions (with identical radius)
and to give K prototype vectors [D_1^{class,w}, ..., D_K^{class,w}] corresponding to the
centers of these distributions. Several values of K are used to find the
best clustering representation of D_Iw^{class}.
i. For each prototype vector in { D_1^{class,w}, ..., D_K^{class,w} }:
• We search the whole training database for the closest
descriptors (or images) to D_k^{class,w}, according to the Euclidean distance.
Let I_k^{class,w} be this set of images.
• We compute the number of the first top-level T samples of I_k^{class,w}
also belonging to I_w (best results are obtained with T = 10).
Let N_k^{class,w} be this number.
3. We retain the class(es) and the sets I_k^{class,w} for which N_k^{class,w} is above a given
threshold ξ.
4. We normalize in [0, 1] all the similarity distances of each sample of each
selected set I_k^{class,w}.
5. We combine the similarity distance measures of the selected sets I_k^{class,w}, with
an identical weighting factor, in order to find a final set of images i
associated with each annotated keyword w. The similarity measures of these
final images are then normalized, and the normalized similarity measure of
an image i for the given word w is denoted R_cluster(i, w) for retrieval,
as described in section 4.
        </p>
        <p>The first 24 images of the set of images associated with the word garden are shown in Figure 4.
We can see that, even if most images are not annotated by the word garden (the word does not
exist in any field of the text associated with the image), we can visually count about 9 images
related to gardens among the 14 non-annotated images.</p>
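        <p>The learning step above can be summarized by the following condensed sketch. It is an
approximation of the procedure (a single K and a single threshold are used, and a standard
K-means implementation stands in for the clustering we use); the data-structure names are
illustrative assumptions.</p>
        <preformat>
# Condensed sketch of the keyword / feature-class learning procedure.
# descriptors[c]: (n_images x d_c) array of pre-computed descriptors for
# feature class c over the whole training database;
# annotated[w]: indices of the images annotated with keyword w.
import numpy as np
from sklearn.cluster import KMeans

def learn_keyword_classes(descriptors, annotated, w, K=5, T=10, xi=3):
    selected = {}                                 # retained classes and image sets
    for c, D in descriptors.items():              # c in {"texture", "edge", "shape"}
        Dw = D[annotated[w]]                      # descriptors of images annotated with w
        if K > len(Dw):
            continue
        prototypes = KMeans(n_clusters=K, n_init=10).fit(Dw).cluster_centers_
        for proto in prototypes:
            # Closest images of the whole database, by Euclidean distance.
            dist = np.linalg.norm(D - proto, axis=1)
            order = np.argsort(dist)
            hits = np.intersect1d(order[:T], annotated[w]).size   # N_k^{class,w}
            if hits >= xi:
                # Normalize the distances of the retained set into [0, 1].
                norm = (dist[order] - dist[order].min()) / (np.ptp(dist[order]) + 1e-12)
                selected.setdefault(c, []).append((order, 1.0 - norm))
    return selected
        </preformat>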
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Cross-language text retrieval</title>
      <sec id="sec-3-1">
        <title>Translation models</title>
        <p>Two approaches are used for query translation, depending on the ressources available for the
different languages. For Spanish, Italian, German, Dutch, Swedish, and Finnish, FreeLang bilingual
dictionaries 2 are used in a word-for-word translation approach. The foreign language words in
the dictionaries are stemmed using Snowball stemmers3, and the English words are left in their
original form. The queries are also stemmed, and stop words are removed with a stoplist in the
foreign language. The translated query consists of the set of all possible English word translations
for each query term, each translated word having equal weight.</p>
        <p>For French, a translation model trained on a web-aligned corpus is used [10]. The model
associates a list of English words and their corresponding probabilities with a French word. As
with the bilingual dictionaries, the French words are stemmed, and the English words are not.
Word-for-word translation is done. For a given French root, all possible English translations are
added to the translated query. The translation probabilities determine the weight of the word in
the translated query. The term weights are represented implicitly by repeating a given translated
word a number of times according to its translation probability. For French as well as for the other
languages, the words in the translated query are stemmed using the Porter stemming algorithm.</p>
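        <p>The word-for-word translation with implicit probability weights can be sketched as follows.
The resource format (a dictionary mapping each French stem to a list of English words with
probabilities) and the repetition factor are illustrative assumptions.</p>
        <preformat>
# Sketch of probability-weighted word-for-word query translation.
def translate_query(french_stems, translation_model, repetitions=10):
    """translation_model: dict mapping a French stem to a list of
    (english_word, probability) pairs."""
    translated = []
    for stem in french_stems:
        for english_word, prob in translation_model.get(stem, []):
            # Implicit weighting: repeat a translation according to its probability.
            translated.extend([english_word] * max(1, round(prob * repetitions)))
    return translated
        </preformat>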
        <p>This query translation approach was found to be optimal, using training data described in the
following section. The parameters evaluated were:</p>
        <p>• Whether to use a bilingual dictionary, or the translation model, for French.
• For a given query term, whether to pick just one translation from the dictionary or
translation model, all translations, or in the case of the translation model, the first n probable
translations.</p>
        <p>• When to stem the English words: the English words could be stemmed in the dictionary,
rather than after translation. This affects the number of times a particular word appears, and
therefore its implicit weight, in the final translated query. Without stemming English words
in the dictionary, multiple forms of a word may appear as possible translations for a foreign
language stem. After the translated query is stemmed, the English root appears several
times.</p>
      </sec>
      <sec id="sec-3-2">
        <title>CLIR process</title>
        <p>
          For retrieval, the Okapi retrieval algorithm [
          <xref ref-type="bibr" rid="ref10">11</xref>
          ] is used, implemented by the Lemur Toolkit for
Language Modeling and Information Retrieval. In particular, the BM25 weighting function
is used. The following parameters contribute to the relevance score of a document (an image
annotation) for a query:
        </p>
        <p>• BM25 k1
• BM25 b
• BM25 k3
• FeedbackDocCount: the number of documents (image captions) to use for relevance feedback
• FeedbackTermCount: the number of terms to add to the expanded query
• qtf: the weight of the query terms added during relevance feedback</p>
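        <p>For reference, a generic form of the Okapi BM25 score with the k1, b, and k3 parameters is
sketched below; the exact weighting implemented in the Lemur toolkit may differ in its details.</p>
        <preformat>
# Generic Okapi BM25 scoring sketch (not the exact Lemur implementation).
import math

def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_doc_len,
               k1=1.2, b=0.75, k3=7.0):
    score = 0.0
    dl = len(doc_terms)                       # length of the caption
    for term in set(query_terms):
        tf = doc_terms.count(term)            # term frequency in the caption
        qtf = query_terms.count(term)         # term frequency in the query
        df = doc_freq.get(term, 0)            # document frequency in the collection
        if tf == 0 or df == 0:
            continue
        idf = math.log((n_docs - df + 0.5) / (df + 0.5))
        score += idf \
            * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl / avg_doc_len)) \
            * (qtf * (k3 + 1)) / (k3 + qtf)
    return score
        </preformat>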
        <p>The training data used to optimize each of these parameters, as well as the translation
approaches described in section 3.1, was the TREC-6 AP89 document collection and 53 queries in
English, French, Spanish, German, Italian, and Dutch. Since no training data was available for
Finnish and Swedish, the average of the optimal values found for the other languages is used.</p>
        <p>While the training collection, consisting of news articles about 200-400 words in length, is
quite different from the test collection of image captions, the volume of the training data (163000
documents, 25 or 53 queries, depending on the language, and 9403 relevance assessments) is
much greater than the training data provided from the image collection (5 queries, 167 relevance
assessments).</p>
        <p>Once the parameters for relevance feedback and the BM25 weighting function are optimized
with the training data, retrieval is performed on the test data, producing a list of images and their
relevance scores for each query. We denote this image relevance score for a query, based on
textual retrieval, as R_text(i, q).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Combining text and images in image retrieval</title>
      <sec id="sec-4-1">
        <title>The image relevance score based on clustering</title>
        <p>The image analysis based on clustering, described in section 2.4, provides a list of relevant images
i for a given word w, with a relevance score for each image, R_cluster(i, w). The relevance score of
an image for a query, based on clustering, is then a weighted sum of the relevance scores for that
image for each query term:</p>
        <p>R_cluster(i, q) = Σ_{w∈q} λ_w R_cluster(i, w)    (1)</p>
        <p>In our approach, each word has the same weight, and the relevance score for the query is
normalized, with λ_w = 1/|q|, where |q| is the number of words in the query.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Combining the five image relevance scores</title>
        <p>We now have 5 lists of images for each query, with the following relevance scores:
• R_text(i, q)
• R_cluster(i, q)
• R_edge(i, q): the similarity between the query image q and a collection image i, according to
the wavelet measure described in section 2.1.
• R_texture(i, q): the similarity according to the texture class measure from section 2.2.
• R_shape(i, q): the similarity according to the shape class measure from section 2.3.
Each of these relevance scores contributes to a final relevance score as follows:
R(i, q) = λ_text R_text(i, q) + λ_cluster R_cluster(i, q) + λ_edge R_edge(i, q) + λ_texture R_texture(i, q) + λ_shape R_shape(i, q)    (2)
The coefficients we chose for the contribution of each approach are as follows:
• λ_text = 0.8
• λ_cluster = 0.1
• λ_edge = λ_texture = λ_shape = 0.033
These values have been determined empirically using the training data provided in ImageCLEF.</p>
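        <p>A small sketch of the combination in equation (2), with the coefficient values listed above,
is given below; the score-dictionary format is an illustrative assumption.</p>
        <preformat>
# Sketch of the final score combination of equation (2).
LAMBDAS = {"text": 0.8, "cluster": 0.1, "edge": 0.033, "texture": 0.033, "shape": 0.033}

def combined_score(scores, weights=LAMBDAS):
    """scores: dict mapping an approach name ('text', 'cluster', 'edge',
    'texture', 'shape') to its relevance score R_x(i, q) for one image."""
    return sum(weights[name] * scores.get(name, 0.0) for name in weights)
        </preformat>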
      </sec>
      <sec id="sec-4-3">
        <title>Filtering the list of images based on location, photographer, and date</title>
        <p>A final filtering is applied to the list of images for a given query. A “dictionary” of locations is
extracted from the location field in the entire collection’s annotations. Similarly, a “dictionary” of
photographers is extracted. If a query contains a term in the location dictionary, then the location
of a potential image, if it is known, must match this location. Otherwise, the image is removed
from the list. The same approach is applied to the photographer. Similarly, if a date is specified
in the query, then the date of the image, if it is known, must satisfy this date constraint.</p>
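        <p>The filtering step can be sketched as follows. The field names and the exact-match date test
are illustrative assumptions; only constraints that are present in the query and known for the image
are enforced.</p>
        <preformat>
# Illustrative sketch of the location / photographer / date filter.
def filter_images(ranked_images, query_terms, query_date, locations, photographers):
    """ranked_images: dicts with optional 'location', 'photographer' and 'date'
    fields; locations / photographers: dictionaries extracted from the
    collection's annotations."""
    query_locations = {t for t in query_terms if t in locations}
    query_photographers = {t for t in query_terms if t in photographers}
    kept = []
    for image in ranked_images:
        loc = image.get('location')
        if query_locations and loc is not None and loc not in query_locations:
            continue                       # known location contradicts the query
        photographer = image.get('photographer')
        if query_photographers and photographer is not None \
                and photographer not in query_photographers:
            continue                       # known photographer contradicts the query
        date = image.get('date')
        if query_date is not None and date is not None and date != query_date:
            continue                       # known date violates the (simplified) constraint
        kept.append(image)
    return kept
        </preformat>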
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experimental results and conclusion</title>
      <p>A preliminary analysis shows that our image retrieval works well. In particular, using the French
queries, our system produced the best results among the participants. This may be related to
two factors: - The method of query translation used for these queries is reasonable. For French
queries, we used a statistical translation model trained on parallel web pages. This translation
model has produced good results in our previous CLIR experiments.</p>
      <p>- The method based on keyword-feature type association we used in these experiments may be
effective. However, further analysis has to be carried out to confirm this.</p>
      <p>For the experiments with other languages, our results are relatively good - they are often among
the top results. However, the absolute MAP is lower than for the French queries.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dumais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Czerwinski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Field</surname>
          </string-name>
          .
          <article-title>Semi-automatic image annotation</article-title>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Chad</given-names>
            <surname>Carson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Megan</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Serge</given-names>
            <surname>Belongie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Joseph M.</given-names>
            <surname>Hellerstein</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jitendra</given-names>
            <surname>Malik</surname>
          </string-name>
          .
          <article-title>Blobworld: A system for region-based image indexing and retrieval</article-title>
          .
          <source>In Third International Conference on Visual Information Systems</source>
          . Springer,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jeon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lavrenko</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Manmatha</surname>
          </string-name>
          .
          <article-title>Automatic image annotation and retrieval using cross-media relevance models</article-title>
          .
          <source>In ACM SIGIR</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.G.</given-names>
            <surname>Mallat</surname>
          </string-name>
          .
          <article-title>A theory for multiresolution signal decomposition: the wavelet representation</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          ,
          <volume>11</volume>
          :
          <fpage>674</fpage>
          -
          <lpage>693</lpage>
          ,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Tamura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mori</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Yamawaki</surname>
          </string-name>
          .
          <article-title>Texture features corresponding to visual perception</article-title>
          .
          <source>IEEE Transactions on Systems, Man, and Cybernetics</source>
          ,
          <volume>8</volume>
          :
          <fpage>460</fpage>
          -
          <lpage>473</lpage>
          ,
          <year>1978</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banks</surname>
          </string-name>
          .
          <article-title>Signal Processing, Image Processing and Pattern Recognition</article-title>
          .
          <source>Prentice Hall</source>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Boucher</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Shlien</surname>
          </string-name>
          .
          <article-title>Image compression using adaptive vector quantization</article-title>
          .
          <source>IEEE Transactions on Communications</source>
          ,
          <volume>34</volume>
          :
          <fpage>180</fpage>
          -
          <lpage>187</lpage>
          ,
          <year>1986</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.P.</given-names>
            <surname>Lloyd</surname>
          </string-name>
          .
          <article-title>Least squares quantization in PCM</article-title>
          .
          <source>Bell Telephone Laboratories Paper</source>
          ,
          <year>1957</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Linde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Buzo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.M.</given-names>
            <surname>Gray</surname>
          </string-name>
          .
          <article-title>An algorithm for vector quantizer design</article-title>
          .
          <source>IEEE Transactions on Communications</source>
          , COM-
          <volume>28</volume>
          :
          <fpage>84</fpage>
          -
          <lpage>95</lpage>
          ,
          <year>1980</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.M.</given-names>
            <surname>Hancock-Beaulieu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Gatford</surname>
          </string-name>
          .
          <article-title>Okapi at TREC-3</article-title>
          .
          <source>In Proc. of the Third Text REtrieval Conference (TREC-3)</source>
          ,
          <source>NIST Special Publication 500-225</source>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>