<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Categorizing and Annotating Medical Images by Retrieving Terms Relevant to Visual Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Desislava Petkova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lisa Ballesteros</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Mount Holyoke College</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Images are difficult to classify and annotate but the availability of digital image databases creates a constant demand for tools that automatically analyze image content and describe it with either a category or a set of words. We develop two clusterbased cross-media relevance models that effectively categorize and annotate images by adapting a cross-lingual retrieval technique to choose the terms most likely associated with the visual features of an image. We also identify several important distinctions between assigning categories and assigning words.</p>
      </abstract>
      <kwd-group>
        <kwd>Cross-language Retrieval</kwd>
        <kwd>Image Annotation</kwd>
        <kwd>Categorization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The exponential growth of multi-media information has created a compelling need for innovative
tools for managing, retrieving, presenting, and analyzing image collections. Medical databases,
for example, continue to grow as hospitals and research institutes produce thousands of medical
images daily. The design and development of image retrieval systems will support a variety of tasks,
including image retrieval, auto-illustration of text documents, medical diagnostics, organizing
image collections such as digital photo albums, and browsing</p>
      <p>Image retrieval techniques can be classified into two types, content based image retrieval
(CBIR) and text-based image retrieval (TBIR). CBIR attempts to find images based on visual
similarities such as shape or texture. TBIR techniques retrieve images based on semantic relationships
rather than visual features and require that descriptive words or annotations have been previously
assigned to each image. For collections of realistic size, it is impractical to rely exclusively on
manual annotation because the process is both time-consuming and subjective. The task is even
more challenging for special collections such as medical databases since they require expensively
trained professionals to do the annotation. As a practical alternative, automatic annotation can
either complement or substitute manual annotation.</p>
      <p>The goal of automatic image annotation is to assign semantically descriptive words to
unannotated images. As with most tasks involving natural language processing, we assume that a
training collection of already annotated images is available, which the system can use to learn
∗This work was funded in part by the Howard Hughes Medical Institute Cascade Mentoring Program (HHMI
#52005134) and the Clare Boothe Luce Program of the Henry Luce Foundation.
what correlations exist between words and visual components or visterms. We specify the task
further by considering annotation to be a cross-lingual retrieval problem: Two languages - textual
and visual - are both used to describe images, and we want to infer the textual representation
of an image given its visual representation. Therefore, we can think of words being the target
language and visterms being the source language. Of course, the language of visterms is entirely
synthetic but a CLIR system does not require specialized linguistic theory and knowledge.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>
        Other researchers have proposed methods for modeling the relationships between words and visual
components. Mori et al divide images into regions and then use the co-occurrence of words
and regions to make nonsmoothed maximum likelihood estimates [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Duygulu et al apply a
segmentation algorithm to generate image blobs and then use a Machine Translation model to
assign words as a form of multi-modal translation from blobs to words [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. More recently, Jeon et
al apply a Maximum Entropy model that treats annotation as a discrete stochastic process whose
unknown parameters are word probabilities [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Our approach is a modification of the Cross-media Relevance Model (CMRM) developed by
Jeon et al [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In this case, the visterms of an image to be annotated constitute a query and all
candidate words are ranked in terms of their relevance to the visual representation. An annotation
of any length can be created by selecting the n highest ranked words. More precisely, using a
collection T of training images J , the joint probability of observing a word w and the set of
visterms derived from an unannotated image I = {v1, ..., vm} is computed as:
m
P (w, v1, ..., vm) = ! P (J )P (w|J ) " P (vi|J )
      </p>
      <p>J∈T i=1
where P (w|J ) and P (v|J ) are maximum-likelihood estimates smoothed with collection
frequencies.1</p>
      <p>P (w|J ) = (1 − α)
P (v|J ) = (1 − β)
#(w, J )</p>
      <p>|J |
#(v, J )
|J |
+ α
+ β
#(w, T )</p>
      <p>|T |
#(v, T )
|T |</p>
      <p>Therefore, CMRM uses word-visterm co-occurrences across training images to estimate the
probability of associating words and visterms together. But since this method computes the word
and visterm distributions P (·|J ) of each image separately, it does not take into account global
similarity patterns, i.e. how individual images and their representations are related to each other.
This shortcoming can be compensated by extracting and incorporating information from groups
of similar images - clusters - created by examining the overall corpus structure.</p>
      <p>
        Document clustering within the framework of full text retrieval has been investigated by Liu et
al [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. They define two cluster-based models: Cluster Query Likelihood (CQL) and Cluster-based
Document Model (CBDM). Both explore across-document word co-occurrence patterns in addition
to within-document occurrence patterns to improve the ranking of documents in response to user
queries. CQL directly ranks clusters based on P (Q|C), the probability of a cluster C generating
the query Q, while CBDM ranks documents but smooths their language models with the models
of respective clusters. Liu et al show that clustering improves retrieval performance indicating
that clusters provide more representative statistics of word distributions because they combine
multiple related documents.
      </p>
      <p>We adapt these techniques to annotate and categorize images by extending the Cross-media
Relevance Model to take advantage of cluster statistics in addition to image statistics.
1Throughout the paper I denotes an image to be annotated, and J - an already annotated training image.
P (w, v1, ..., vm) =</p>
      <p>P (w|C) =
P (v|C)
=</p>
      <p>m
X P (C)P (w|C) Y P (vi|C)
C∈T i=1
(1 − γ) #(|wC,|C) + γ #(|wT,|T )
(1 − δ) #(|vC,|C) + δ #(|vT,|T )</p>
      <p>P (w, v1, ..., vm) =</p>
      <p>P (w|J) =
P (v|J) =</p>
      <p>m
X P (J)P (w|J) Y P (vi|J)
J∈T i=1
(1 − α) #(|wJ,|J) + α #(|wC,JC| J )
(1 − β) #(|vJ,|J) + β #(|vC,JC|J )
Mathematical definitions of CQL (left) and CBDM (right). Note that clusters C
play two different roles - ranking in CQL and smoothing in CBDM.</p>
      <p>The motivation is that by analyzing collection-wide co-occurrence patterns, a cluster-based
approach to annotation can achieve a better estimation of word-visterm relationships. Clusters,
viewed as large pseudo-images, have more words and visterms and therefore their language models
P (·|C) can be approximated better than those of single images. In short, even if no prior knowledge
about the collection is available, we can learn from its similarity structure by inferring
wordvisterm co-occurrences from similar images. For example, indirect relationships between words
and visterms that do not occur together can be identified when there exist intermediate visterms
with which they co-occur independently.
3
3.1</p>
    </sec>
    <sec id="sec-3">
      <title>Methodology of categorization and annotation</title>
      <sec id="sec-3-1">
        <title>From categories to concepts</title>
        <p>Textual representations provided for ImageCLEFmed 2005 are clearly categories rather than
annotations. Training images are divided into disjoint groups by being put into one of 57 folders,
and each folder is given a short description (multi-axial code). Since we are interested in both
categorizing and annotating images, we first need to create more realistic annotations. We achieve
this by breaking up categorical records into sets of individual concepts.</p>
        <p>We define a “concept” to be a comma-separated string, creating a restricted vocabulary of 46
distinct concepts. Some of these are literal dictionary words (e.g. ‘x-ray” and ‘spine’), others are
sequences of words (e.g. ‘plain radiography’ and ‘radio carpal joint’), and they all identify a single
distinctive image property. For example, the third concept in a categorical description indicates
body orientation - the choices are ‘coronal’, ‘sagittal’, ‘axial’ and ‘other orientation’. Clearly, it
does not make sense to have ‘other’ as a concept on its own.
cxupho-lpaarrpnaoineydnrarleaxdtiroegmraiptyhyarm crxupoa-lparrdpaoiineynorarlecaxadtriropegamrlaiptjyohiyanrtm
tmeumsculoskeletal sys- tmeumsculoskeletal
syscxupho-lpaarrpnaoineydnrarfleaoxdrtieroeagmrramiptyhyarm cexupol-lpbarrpaoioneynwrarleaxdtiroegmraiptyhyarm cxupuo-lpparrppaoineynerarrleaaxdrtirmoegmraiptyhyarm scxupoh-lparropaoiuneynrlardleaexdrtiroegmraiptyhyarm
musculoskeletal sys- musculoskeletal sys- musculoskeletal sys- musculoskeletal
system tem tem tem</p>
        <p>Thus we get two kinds of textual representations per image - a category and an annotation.
We also recognize the first important difference between the two. Concepts do not point directly
to objects in the images (there is one object per image anyway) but describe very high-level,
specialized attributes which are not reflected directly by any visual feature. As a result, images
that are apparently different can have very similar annotations, i.e. share many concepts (Figure
1). In contrast, all images classified in the same category are visually similar. In the rest of the
paper, we refer to concepts and categories jointly as terms.</p>
        <p>We also observe that concepts have an unusual distribution where the six most frequent ones
account for more than 75% of the total number of occurrences (Figure 3). In fact, one concept
- ‘x-ray’ - appears in every single image. Both CQL and CBDM would likely be biased in favor
of these very frequent concepts, tending to select them rather than rare ones. Since we set the
models to generate fixed-length annotations of six concepts (this is the maximum length of training
annotations), we would expect the same set of concepts to be assigned over and over.
Given the fact that we make a distinction between categories and concepts, we need to redefine the
task. We are interested both in selecting the category to which an image belongs and in choosing
concepts to describe it. (And of course the appropriate concepts and category are themselves
related.) Therefore, we are going to evaluate each model on two tasks simultaneously - annotation
and categorization.</p>
        <p>Recall that both CQL and CBDM compute a set of probabilities P (wi|I), i = 1...|V |, based
on the visterms of an image I. These probabilities are used to rank terms w according to their
suitability to describe the content of I. The only restriction on the vocabulary V is that it is a
finite set of discrete elements. Both categories and individual concepts satisfy this requirement
therefore we can use the same implementation to assign either categories or concepts by only
changing the input to the system.
3.2.1</p>
        <sec id="sec-3-1-1">
          <title>Assigning categories</title>
          <p>We consider each category to be an annotation of length 1. By learning relationships between
categories and visterms we can categorize new images directly by assigning the term with the
highest probability.
3.2.2</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>Assigning concepts</title>
          <p>We divide categories into concepts and work with annotations of various lengths. By learning
relationships between concepts and visterms we can annotate new images directly by assigning
several of the highest probability concepts. Alternatively, we can categorize new images indirectly
by representing categories as combinations of concepts:
P (category) = P (concept1, ..., conceptk) = ! P (concepti)
i=1
4</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Data processing and experimental setup</title>
      <p>Preliminary image processing involves extracting visual features and obtaining an image
vocabulary of visterms. Briefly, our representations are generated in the following way. Each image
is grid partitioned into regions and the complete set of image regions is partitioned into disjoint
groups based on corresponding feature vectors. All regions in a group are given the same unique
identifier or visterm. Once image processing is complete, our approach relies on a model of the
correspondences between terms and visterms, inferred from a set of training images that have been
previously annotated.
4.1</p>
      <sec id="sec-4-1">
        <title>Feature extraction and visterm generation</title>
        <p>The dataset consists of 10000 images, divided into a training set of 9000 and a test set of 1000.
First, each of these images is scaled to 256×256 pixels (regardless of the original aspect ratio) and
divided into a 5×5 square grid. This produces 250,000 regions to be discretized into visterms.
Regions are clustered on the basis of visual similarities and each cluster is assigned a unique
identifier. Since the ImageCLEFmed collection consists entirely of black-and-white images, we
only consider visual features that analyze texture. More specifically, we apply two texture analysis
features - Gabor and Tamura.
4.1.1</p>
        <sec id="sec-4-1-1">
          <title>Gabor energy</title>
          <p>
            The Gabor Energy method measures the similarity between image neighborhoods and specially
defined masks to detect spatially local patterns such as oriented lines, edges and blobs [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]. We
use a MATLAB implementation courtesy of Shaolei Feng at the Center for Intelligent Information
Retrieval, University of Massachusetts at Amherst. This feature computes a 12-bin histogram per
image region.
4.1.2
          </p>
        </sec>
        <sec id="sec-4-1-2">
          <title>Tamura texture</title>
          <p>
            The Tamura features - Coarseness, Directionality and Contrast - are intended to reproduce human
visual perception. They attempt to quantify intuitive information such as roughness, presence of
orientation, and picture quality in terms of factors like sharpness of edges and period of repeating
patterns. We use the FIRE Flexible Image Retrieval Engine to extract Tamura features [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ]. Given
an input image, FIRE creates three output partial images, one for each of the three features, which
we convert into vectors. Each feature produces a 36-dimensional vector from every 6×6 partial
image.
4.2
          </p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>Combining visual features</title>
        <p>
          Visual features describe distinctive image properties. Even if two features both analyze texture,
they do so using different calculations and therefore might recognize different characteristics of
the texture. On the other hand, we do not want to waste time and resources to extract
correlated features, which are equivalent rather than complimentary sources of information. However,
Deselaers et al show that Gabor filters and the individual Tamura features are not correlated [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
Therefore, we try to combine the four of them for a more comprehensive texture analysis. We
investigate two alternatives.
First we join feature vectors produced by each feature in one compound vector, and then we
cluster to quantize the vectors into visterms. For example, the length of Gabor energy is 12
(representing 12 histogram bins) and the length of Coarseness is 36 (representing 36 pixels of a
6x6 partial image). The result is a 250000×48 matrix of feature vectors, which is partitioned into
500 visterms. These theoretically reflect similarity of regions based both on Gabor energy and
Coarseness.
Rather than combining feature vectors prior to visterm assignment, we cluster the feature vectors
produced by each feature separately. For example, we partition the regions into 500 visterms
based on Gabor energy and then repartition them based on Coarseness. Use different cluster
identifiers each time, e.g. integers from 1 to 500 for Gabor energy and integers from 501 to 1000
for Coarseness, and assign both types of visterms to individual images. So if an image is originally
divided into 25 regions, it will end up with twice as many visterms. In this case, images can be
similar in one respect, e.g. have some Gabor visterms in common, and dissimilar in another, e.g.
share no Coarseness visterms. Also, their visual representations are longer, therefore probability
estimates could be closer to the true underlying distribution.
        </p>
        <p>The two approaches have different resource requirements. Combining at generation needs more
memory (to fit a bigger matrix) while combining at representation needs more time (to group the
regions separately for each feature). This fact should be taken into consideration, especially when
working with large collections.</p>
        <p>Our experiments show that combining features at generation is not very effective while two
features combined at representation work better than either feature alone. Figure 4 graphs the
performance of CQL according to error rate, as the number of clusters increases. Figure 5 graphs
the same results for CBDM. It is likely that the combining features at generation fails because the
weaker feature Coarseness is three times as long as the better feature Gabor energy. On the other
hand, when combining at representation each feature accounts for 25 out of the 50 visterms per
image, so in this respect the features are given equal weight.</p>
        <p>0.6
0.55
0.5
te0.45
a
rrr
o
re 0.4
0.35
0.3
0.6
0.55
0.5
The theoretical framework of cluster-based relevance modeling does not depend on the specific
implementation of clustering. We investigate three different clustering techniques that partition
an image collection into groups of similar images: groups based on manually assigned categories,
K-means clustering, or K-nearest neighbors (KNN) clustering.
Since ImageCLEFmed images are assigned to one particular category, we can assume that
categories play the role of cluster labels. It becomes straightforward to partition the collection by
putting all images of the same category in a separate cluster. The result is a set of 57
nonoverlapping clusters of various lengths, depending on how many training examples from each
category are provided.</p>
        <p>In general annotations are longer than a single word and therefore clustering could not be that
simple. Moreover, visterms tend to identify lower-level visual properties and terms - higher-level
semantic ones, making it advantageous to consider both the visual and textual representations
of images for cluster computations. By combining terms and visterms a clustering technique can
generate clusters with both visual and semantic coherence.
4.3.2</p>
        <sec id="sec-4-2-1">
          <title>K-means</title>
          <p>This clustering algorithm takes K, the desired number of clusters, as input and returns a list of
indices indicating to which cluster each point in the partitioned dataset belongs. The procedure
starts by randomly selecting K elements from the set as initial cluster centroids. Each remaining
element is added to the cluster to which it is most similar, then the centroids are reevaluated. The
algorithm refines the partitioning iteratively by repeatedly reevaluating and reassigning until no
element changes assignment and the clustering converges.</p>
          <p>K-means is a hard clustering algorithm which produces mutually exclusive clusters.
Performance depends on the starting condition - both the predetermined value of K and the initial choice
of centroids. The appropriate number of clusters is determined by the dataset configuration which
is usually unknown. And even if the value of K is close to the natural number of groupings, given
the starting centroid positions K-means can still get trapped in a local maximum and fail to find a
good solution. The method is also sensitive to extreme points which lie notably far away from most
points or outliers. Because K-means computes centroids as within-cluster averages, an outlier can
pull away a centroid away from its true position. We select the value for K experimentally. We
test values that range from 50 to 250 in 25-step increments.
4.3.3</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>K-nearest neighbors</title>
          <p>
            Kurland et al propose a clustering method that takes the K-1 nearest neighbors of each training
image to form a cluster of size K [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]. In contrast to K-means, KNN is a soft clustering technique
that can assign an element to more than one cluster. If an image is a top ranked neighbor to several
others, then it belongs to each of the corresponding clusters. KNN generates as many clusters as
there are training images, and all of them have exactly the same size since each includes an image
and its K-1 nearest neighbors.
          </p>
          <p>To find the nearest neighbors of a training image Jk, all images Jm, m = 1...|T |, m #= k, are
first ranked according to their similarity to Jk. In our work, language models are generated by
smoothing image frequencies with collection frequencies. Then the similarity between Jk and Jm
is estimated as sim(Jk, Jm) = #|iJ=k1| P (ti|Jm), where ti are the terms and visterms of Jk. The
ranking process is repeated |T | times - once for each one of the training images in the collection
T .
5
5.1</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experimental results</title>
      <sec id="sec-5-1">
        <title>Parameter setting</title>
        <p>The cluster-based models rely on several smoothing and clustering parameters. These include:
α for smoothing terms in image models, β for visterms in image models, γ for terms in cluster
models, δ for visterms in cluster models, K for the number of clusters with K-means, and K for
the number of nearest neighbors with KNN clustering.</p>
        <p>We apply 10-fold cross validation to set each parameter. We divide the 9000 training images
into 10 subsets of equal size and optimize performance by minimizing the error rate. For each
possible parameter value, we train the model 10 times using one of the folds for testing and the rest
for training, and we average the accuracy of the 10 trails. This evaluation method is more reliable
than the simpler holdout method because it uses every training image for validation exactly once.</p>
        <p>We determine that CQL works best with γ = 0.1 and δ = 0.2 while CBDM works best with
α = 0.5, β = 0.8, γ = 0.5 and δ = 0.3. We use these values throughout the rest of the experiments.
On the other hand, cluster quality is closely linked to the visual feature, more precisely to its
effectiveness to produce visterms with discriminative power. Since the value of K is
featuredependent, we cross-validate it individually for each visual feature.
5.1.1</p>
        <sec id="sec-5-1-1">
          <title>Feature effectiveness</title>
          <p>To get a sense of the relative effectiveness of the extracted features, we compare Coarseness and
Gabor energy. The former has highest performance at 100 clusters, the latter at 225, and Gabor
energy is the more useful feature (Tables 1 and 2). However, we cannot improve accuracy by simply
setting K to an ever higher value. Larger K does not automatically imply better performance
it is cluster quality that matters, not the number of clusters.</p>
          <p>Since images represented with Coarseness visterms are clustered into fewer groups, it is likely
that dissimilar images will occasionally be contained in the same cluster. Perhaps Coarseness
captures less information about content, yielding poorer discrimination between the visual
representations of images. This would be true if the images are naturally structured into more groups,
but the clustering algorithm fails to distinguish between some groups based on the Coarseness
representations. However, even though Coarseness extracts less information than Gabor energy
(or rather, less useful information), its texture analysis does not overlap with that of Gabor
energy. Since they identify different image properties, combining the two features proves to be an
advantage (Section 4.2).
5.2</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>Evaluation measures</title>
        <p>Possible evaluation measures do not necessarily suggest the same feature as most effective.
Therefore, we need to decide which measure is most appropriate for either task. We compare four
measures with respect to categorization using the CQL model: error rate, precision at 0.0
recall, average F-measure, and mean average precision. As discussed in Section 5.1, we
set the smoothing parameters γ and δ to 0.1 and 0.2, respectively. The clustering parameter K is
feature-dependent - we use 225 for Gabor energy, 100 for Coarseness, and 200 for Gabor energy
combined with the three Tamura features (Coarseness, Directionality and Contrast). Results are
reported in Tables 1 and 2.</p>
        <p>I.</p>
        <p>II.</p>
        <p>The experiments show that Gabor energy is the best feature for assigning annotations. On the
other hand, Gabor energy and Tamura combined is the optimal feature for assigning categories
according to all but mean average precision, in which Gabor energy is best. This leads to the
question of which evaluation measure should be used to optimize parameters.</p>
        <p>Perhaps the most important difference between categories and annotations is that a category
consists of a single term and an annotation is constructed from multiple terms (not necessarily
but in most cases). We select the evaluation measure based on this important distinction.</p>
        <p>When assigning categories, only the highest ranked category is selected, so we need not be
concerned about the tradeoff between recall and precision. On the other hand, when we assign
annotations we select several concepts. In this case, we are interested in both recall and precision.
These properties of categorization and annotation help us choose the appropriate evaluation
measure. For categorization, an evaluation measure that reflects the precision of assigning categories
should be selected - either error rate or precision at 0% recall. For the annotation task an
evaluation measure that combines recall and precision should be selected - either F-measure or mean
average precision. In the remaining experiments, effectiveness of categorization and annotation
are measured via error rate and F-measure, respectively.
5.3</p>
        <p>Weighting concept probabilities to assign categories
The method of splitting annotations into concepts and then multiplying concept probabilities to
rank categories is, not surprisingly, very ineffective (Table 3).</p>
        <p>CQL
Gabor energy</p>
        <p>CQL
Coarseness</p>
        <p>CBDM
Gabor energy</p>
        <p>CBDM
Coarseness</p>
        <p>Categories
.372222
.507778
.374444
.468889</p>
        <p>Concepts
not scaled
.588889
.627778
.596667
.624444</p>
        <p>Concepts
scaled
.389999
.528889
.377778
.483333</p>
        <p>Concepts by themselves have a very good annotation performance, so in theory they should
contain enough information for categorization. By directly multiplying concept probabilities to
estimate the probability of assigning a category, we treat all concepts as ‘equal’. In general this is
the right approach since we do not know in advance what the user would be interested in, so there
are no ‘significant’ and ‘insignificant’ words. Therefore, we would want to assign both frequent
(background) words and rare (foreground) words correctly. But in the very specific case of using
concepts to categorize medical images, we can in fact make a distinction between concepts based
on the number of times a concept appears in categorical definitions.</p>
        <p>As Figure 6 shows, some concepts are more important than others for defining a category - its
a rare concept that defines a category best by distinguishing it from the rest. To take advantage
of this information, we scale concepts using a TF×IDF weighting scheme. Concept probabilities
are computed as:</p>
        <p>P (ci|J ) =</p>
        <p>1
log(#(ci, S))
where S is the set of categorical definitions.</p>
        <p>Thus we emphasize concepts that appear in only a few categories and penalize concepts that
appear in many categories since they are not very effective for differentiating between categories.
5.4</p>
      </sec>
      <sec id="sec-5-3">
        <title>Clustering Effectiveness</title>
        <p>Clustering is an important ingredient of our method and the choice of clustering technique can
significantly affect performance. We look at three alternatives as described in Section 4.3.
5.4.1</p>
        <sec id="sec-5-3-1">
          <title>Categories</title>
          <p>Our baseline method is clustering according to manually assigned category (Section 4.3.1). In this
case there are no parameters to optimize because all cluster information is explicitly specified in
the data.</p>
          <p>Even though category-clusters have satisfactory performance (Table 4), we should acknowledge
that most collections are not processed, so this kind of ‘naive’ clustering would not always be an
option. And even if there are categorized images available (for instance, if the user provides some
manual examples to a browsing system which automatically organizes an image collection
according to user preferences), these categories would be user-dependent and therefore not necessarily
well-defined. They could be either too broad or too specific, i.e. generate clusters that are either
too large and loose, or too small and with no real advantage over individual images.</p>
          <p>Categories
K-means</p>
          <p>KNN
error
rate
.3010
.2650
(.0014)
.2440
(.0166)</p>
          <p>CQL
nonzero
categories
37
36
40
error
rate
.2570
.2630
(.4709)
.2310
(.0006)</p>
          <p>CBDM
nonzero
categories
40
39
46</p>
          <p>Therefore, we would like to find an unsupervised clustering method which performs as well as
or even better than manually labeled clusters. Then we could rely on the system to automatically
find appropriate values for clustering parameters, so that generated clusters approximate the
collection’s natural configuration.
5.4.2</p>
        </sec>
        <sec id="sec-5-3-2">
          <title>K-means</title>
          <p>K-means is a general clustering technique described in Section 4.3.2. It has one input parameter,
the number of clusters K, which is optimized separately for each visual feature. K-means gives
CQL a statistically significant improvement but slightly hurts CBDM (Table 4). The results
indicate that the medical categories are relatively broad. For example, there might be a category
which contains two visually different types of images, and the accuracy of CQL increases as a result
of separating them into two different clusters. (We know that K-means breaks up some of the
category clusters because the value of K is larger than 57 (Table 5). In this way, the system deals
with the issue of some clusters not being compact enough. On the other hand, cluster compactness
has less influence on their usefulness as background models for smoothing and this could explain
why the performance of CBDM does not improve. (With CBDM emphasis is on generalization
and therefore recall, and with CQL - on correctness and therefore precision.)</p>
          <p>For other collections manual categories can be too narrowly defined. In such situations we
would expect K-means to generate fewer clusters than categories, thus increasing recall, which
would have a positive effect both on CQL and CBDM.</p>
          <p>However, it is not necessary to use CQL and CBDM with the same set of clusters. In fact,
CBDM shows a consistent tendency to perform best with fewer but larger clusters as compared
to CQL:</p>
          <p>Gabor energy</p>
          <p>Coarseness
Dimensionality</p>
          <p>Contrast
Gabor energy
and Tamura</p>
          <p>CQL and CBDM apply clusters in two conceptually different roles - on one hand, as training
examples which are somewhat more general than images, and on the other hand, as background
collections which are somewhat more specific than the entire collection. Implicitly, bigger clusters
are more useful for generalizing patterns observed in individual images - if the clusters are too
small, they would fail to capture some aspects of member images and their content. Therefore,
with CBDM we are less concerned about the compactness of the clusters, and can allow some
relatively dissimilar elements to join the same cluster.
5.4.3</p>
        </sec>
        <sec id="sec-5-3-3">
          <title>K-nearest neighbors</title>
          <p>KNN is a soft clustering technique described in Section 4.3.3. We optimize K separately for the
two cluster-based models and establish empirically that K=25 for CQL and K=50 for CBDM.</p>
          <p>First, this corroborates our previous conclusion that CQL works well with very compact clusters
and CBDM works well with more general clusters. We also observe that categorization performance
improves with a statistically significant difference as compared to K-means clustering (Table 4).
KNN clusters have more local coherence because they are defined with respect to particular image
(i.e. locally). Since by generation a KNN cluster is specific to an image, it is better at describing
its context. In addition, the KNN method does not reduce the number of training examples. It
generates as many clusters as there are images. On the other hand, K-means creates considerably
fewer clusters, which implies that there are fewer observations on which to base the model’s
probability estimations.
In this work, we analyzed a cluster-based cross-lingual retrieval approach to image annotation
and categorization. We described two methods for incorporating cluster statistics into the general
framework of cross-media relevance modeling and showed that both build effective probabilistic
models of term-visterm relationships. We also discussed how different clustering techniques affect
the quality and discriminative power of automatically generated clusters. Finally, we demonstrated
an efficient method for combining visterms produced by several visual features.</p>
          <p>We regard clustering as a kind of unsupervised classification that offers greater flexibility
than manual classification. If the actual categories are too broad, then the system can break
them into smaller clusters. If the actual categories are too specific, then it can redefine them by
generating bigger clusters. If manually assigned categories are unavailable, the system can create
them automatically. The only disadvantage is that automatic clusters do not have explicit textual
descriptions, but the word distribution in clusters could be analyzed to build statistical language
models.</p>
          <p>In the future, we plan to investigate grouping by concept (similar to the method of grouping by
category described here but based on annotation words) as an alternative version of soft clustering.
We are also interested in analyzing the categorization performance of CQL and CBDM on a
collection of true-color images to examine how visual properties influence accuracy.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Deselaers</surname>
          </string-name>
          , Daniel Keysers, and Hermann Ney.
          <article-title>Features for Image Retrieval: A Quantitative Comparison</article-title>
          .
          <source>In Proceedings of the 26th DAGM Pattern Recognition Symposium</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Deselaers</surname>
          </string-name>
          , Daniel Keysers, and Hermann Ney. FIRE - Flexible Image Retrieval Engine:
          <article-title>ImageCLEF 2004 Evaluation</article-title>
          .
          <source>In Proceedings of the CLEF 2004 Workshop</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Pinar</given-names>
            <surname>Duygulu</surname>
          </string-name>
          , Kobus Barnard, Nando de Freitas, and David Forsyth.
          <article-title>Object Recognition as Machine Translation: Leaning a Lexicon for a Fixed Image Vocabulary</article-title>
          .
          <source>In Proceedings of 7th European Conference on Computer Vision</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I</given-names>
            <surname>Fogel</surname>
          </string-name>
          and
          <string-name>
            <given-names>Dov</given-names>
            <surname>Sagi</surname>
          </string-name>
          .
          <article-title>Gabor Filters as Texture Discriminator</article-title>
          .
          <source>Journal of Biological Cybernetics</source>
          ,
          <volume>61</volume>
          (
          <fpage>102</fpage>
          -
          <lpage>113</lpage>
          ),
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Jiwoon</given-names>
            <surname>Jeon</surname>
          </string-name>
          , Victor Lavrenko, and
          <string-name>
            <given-names>R.</given-names>
            <surname>Manmatha</surname>
          </string-name>
          .
          <article-title>Automatic Image Annotation and Retrieval using Cross-Media Relevance Models</article-title>
          .
          <source>In Proceedings of the 26th International ACM SIGIR Conference</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Jiwoon</given-names>
            <surname>Jeon</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Manmatha</surname>
          </string-name>
          .
          <article-title>Using Maximum Entropy for Automatic Image Annotation</article-title>
          .
          <source>In Proceedings of the 3rd International Conference on Image and Video Retrieval</source>
          , pages
          <fpage>24</fpage>
          -
          <lpage>32</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Oren</given-names>
            <surname>Kurland</surname>
          </string-name>
          and
          <string-name>
            <given-names>Lillian</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>Corpus Structure, Language Models, and Ad Hoc Information Retrieval</article-title>
          .
          <source>In Proceedings of the 27th Internation ACM SIGIR Conference</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Xiaoyong</given-names>
            <surname>Liu</surname>
          </string-name>
          and
          <string-name>
            <given-names>W. Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Cluster-Based Retrieval using Language Models</article-title>
          .
          <source>In Proceedings of the 27th International ACM SIGIR Conference</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Yasuhide</given-names>
            <surname>Mori</surname>
          </string-name>
          , Hironobu Takanashi, and
          <string-name>
            <given-names>Ryuichi</given-names>
            <surname>Oka</surname>
          </string-name>
          .
          <article-title>Image-to-word Transformation based on Dividing and Vector Quantizing Images with Words</article-title>
          .
          <source>In Proceedings of the 1st MISRM International Workshop</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>