Computer analysis of visual image similarity

Ksenia Zhagorina, Alexey Buslavyev

Ural Federal University, Institute of Mathematics and Computer Sciences, Yekaterinburg, Russia
ksuhka-zhagorina@yandex.ru, buslavyewal@mail.ru

Abstract. This paper describes the image-analysis and machine-learning algorithms we used for multiclass image classification during our participation in the ImageCLEF 2012 competition. Our goal was to develop an application that could determine the location of a mobile robot using the visual information provided by a camera placed on the robot. The resulting application combines machine-learning methods that improve on self-organizing Kohonen maps with classification algorithms based on probabilistic models. The result of our work was an application that correctly classified 86 percent of the input images presented in the ImageCLEF RobotVision task.

Keywords: visual similarity analysis, image analysis, image search, classification problem, multiclass classification, computer vision, maximum likelihood method, probabilistic model, Kohonen network.

1 Introduction

In this paper we present the research performed in the context of our participation in the ImageCLEF 2012 competition (RobotVision task)1. The task is a multiclass image classification challenge: the goal is to assign each of the input images to a particular class.

The official definition of the task is as follows: "The fourth edition of the RobotVision challenge will focus on the problem of multi-modal place classification. Participants will be asked to classify functional areas on the basis of image sequences, captured by a perspective camera and a Kinect mounted on a mobile robot within an office environment. The test sequence will be acquired within the same building and floor but there can be variations in the lighting conditions (sunny, cloudy, night) or the acquisition procedure (clockwise and counter clockwise).
These are all the rooms/categories that appear in the database: Corridor, Elevator Area, Printer Room, Lounge Area, Professor Office, Student Office, Visio Conference, Technical Room, Toilet."

1 ImageCLEF – Image retrieval in CLEF (http://imageclef.org/)

For the test runs we only used images produced by the perspective camera. Each test sequence contains about 2500 images. We evaluated the quality of a classification algorithm as the percentage of images it classified correctly.

2 Use of Kohonen networks for image classification

2.1 Self-organizing Kohonen maps

Self-organizing Kohonen networks2, or Kohonen maps, are a type of artificial neural network that is trained using unsupervised learning. The training material for the network is a set of vectors in the space ℝ^n. A self-organizing map consists of components called nodes or neurons. A reference vector of the same dimension as the input vectors is associated with each node. The typical arrangement of nodes is a regular spacing in a rectangular grid.

Kohonen networks are competitive (i.e. based on the "winner takes all" principle): the neuron whose reference vector is most similar to the input is considered the best matching unit (BMU) and is used both during the training process and to produce the output signal of the network.

The map is initialized with random vectors for each neuron. For each iteration during the training phase, a BMU is determined, and the reference vectors of the best matching neuron and its neighbors are modified according to the formula

    w ← w + θ · (x − w),

where w is the reference vector of a neuron and x is the input vector from the training sample. The coefficient θ depends inversely on the training step and on the distance between the BMU and the neuron being modified. The output signal of the network is the Euclidean distance between the input vector and its BMU.
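The training loop just described can be sketched as follows. This is a minimal sketch, not the authors' implementation; the Gaussian neighborhood function and the particular learning-rate schedule are assumed choices that the paper does not specify.

```python
import numpy as np

def som_train_step(weights, x, lr, sigma, grid):
    """One training step of a classical self-organizing map.

    weights: (n_nodes, dim) reference vectors, updated in place
    x:       (dim,) input vector from the training sample
    lr:      learning coefficient (decreases over training in practice)
    sigma:   neighborhood radius on the grid (assumed Gaussian neighborhood)
    grid:    (n_nodes, 2) node coordinates in the rectangular grid
    """
    # Best matching unit: node whose reference vector is closest to x.
    bmu = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
    # Neighborhood function: nodes near the BMU on the grid are pulled harder,
    # so theta depends on the distance between the BMU and the modified neuron.
    grid_dist = np.linalg.norm(grid - grid[bmu], axis=1)
    theta = lr * np.exp(-grid_dist**2 / (2 * sigma**2))
    # Update rule: w <- w + theta * (x - w)
    weights += theta[:, None] * (x - weights)
    return bmu

def som_output(weights, x):
    # Output signal of the network: Euclidean distance from x to its BMU.
    return float(np.min(np.linalg.norm(weights - x, axis=1)))
```

Since every update pulls reference vectors toward the input, a training step can only decrease the output signal for that input.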
In order to use Kohonen networks for image classification, images need to be represented as sets of vectors. For each image we generated a sample vector set based on the image's domains (overlapping areas of a certain size). Various domain parameters were used as the vectors' coordinates, for example: the expected value of the pixel color, the dispersion of the pixel color, the standard deviation of the pixel colors from the color of the domain's center, etc.

After training on a sample set of vectors built from one or more images, a Kohonen map becomes representative of these images and can be used to determine the degree of similarity between a new input image and the images used in the training process. The degree of similarity (or distance between an image and a map) is defined as the arithmetic mean of the output signals obtained for each vector in the vector set built from this image.

2 From Wikipedia, the free encyclopedia (http://en.wikipedia.org/wiki/Kohonen_map)

Our test application used the following classification algorithm:

1) Kohonen maps are trained on the provided set of already classified images, one map for each class.
2) A Kohonen map trained on all images of some class is considered representative of this class.
3) A new input image is assigned to the class with the least distance between the class's representative map and the image.

2.2 Increasing Kohonen networks

It was determined that classical self-organizing Kohonen maps are not well suited to the task of image classification due to the nature of the data, primarily the large amount of training data. Standard fixed-size Kohonen maps use a monotonically decreasing learning coefficient: the maps train actively at the beginning of the training set, each image has less effect on the state of the network than the previous one, and therefore the network contains much less information about the final images of the training set.
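Before addressing these problems, here is a minimal sketch of the pipeline of Sect. 2.1: domain feature vectors plus the nearest-map rule. The domain size, the step between domains, and the particular statistics are assumed choices (the paper does not fix them), and a trained map is represented here simply as an array of reference vectors.

```python
import numpy as np

def domain_vectors(image, size=16, step=8):
    """Sample vector set from an image's overlapping domains (grayscale sketch).

    Each domain yields a small feature vector: mean pixel value,
    variance, and deviation of pixel values from the domain's center pixel.
    """
    h, w = image.shape
    vecs = []
    for y in range(0, h - size + 1, step):
        for x in range(0, w - size + 1, step):
            d = image[y:y + size, x:x + size].astype(float)
            center = d[size // 2, size // 2]
            vecs.append([d.mean(),                              # expected value of pixel color
                         d.var(),                               # dispersion of pixel color
                         np.sqrt(((d - center) ** 2).mean())])  # deviation from center color
    return np.array(vecs)

def image_to_map_distance(vectors, ref_vectors):
    """Distance between an image and a trained map: the arithmetic mean
    of the BMU distances over the image's vector set."""
    d = np.linalg.norm(vectors[:, None, :] - ref_vectors[None, :, :], axis=2)
    return d.min(axis=1).mean()

def classify(image, class_maps):
    """Assign the image to the class whose representative map is nearest."""
    v = domain_vectors(image)
    return min(class_maps, key=lambda c: image_to_map_distance(v, class_maps[c]))
```

Here `class_maps` is a hypothetical dict mapping class names to reference-vector arrays; in the full system those arrays would come from a trained Kohonen map.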
An alternative would be to not decrease the learning coefficient, but this would cause information belonging to the first images to be overwritten.

Additionally, because we did not need to visualize the data, the arrangement of nodes in a rectangular grid and the ability of neurons to affect each other were found not to be particularly useful.

In order to solve the problems identified above, we abandoned the traditional arrangement of nodes in favor of a linear one and added the ability to increase the number of neurons during training. This way a Kohonen map degenerates into a linear structure of reference vectors that grows during the training process. The basic principle remains the same, "winner takes all": during training, each vector of the sample affects only the best matching unit (BMU). This modification allowed us to process large amounts of training data while avoiding the loss of critical information.

The detailed training process for a growing Kohonen map is as follows:

1. At some stage of training, a new vector x from the training sample is presented to the neural network.
2. The vector w is the best matching unit in the network for the input vector x; the Euclidean distance is used as the distance between vectors.
3. If the distance between x and w is greater than a certain threshold value, x is added to the network as a new reference vector. The threshold value affects the accuracy of the Kohonen network.
4. Otherwise, the vector w is modified according to the formula

       w ← w + θ · (x − w).

So after training, a growing Kohonen network contains characteristic elements of the vector sample constructed from the images of a certain class. Information from the training sample is not lost at any stage of training.

2.3 Hierarchical Kohonen networks

Poor running-time performance was another problem experienced when using standard Kohonen networks for image classification.
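Before turning to that problem, the growing-map training procedure (steps 1 to 4 above) can be sketched as follows. The threshold and learning coefficient are assumed values; the paper leaves them unspecified.

```python
import numpy as np

class GrowingKohonen:
    """Linear 'growing' Kohonen network sketch (not the authors' code)."""

    def __init__(self, threshold=1.0, lr=0.1):
        self.threshold = threshold  # growth threshold (assumed value)
        self.lr = lr                # learning coefficient theta (assumed value)
        self.nodes = []             # linear structure of reference vectors

    def train(self, x):
        x = np.asarray(x, dtype=float)
        if not self.nodes:
            self.nodes.append(x.copy())
            return
        # Step 2: find the best matching unit under Euclidean distance.
        dists = [np.linalg.norm(w - x) for w in self.nodes]
        bmu = int(np.argmin(dists))
        if dists[bmu] > self.threshold:
            # Step 3: input is far from all reference vectors, so grow.
            self.nodes.append(x.copy())
        else:
            # Step 4: winner takes all, only the BMU is pulled toward x.
            self.nodes[bmu] += self.lr * (x - self.nodes[bmu])

    def distance(self, x):
        # Output signal: distance from x to its best matching unit.
        return min(np.linalg.norm(w - np.asarray(x, dtype=float))
                   for w in self.nodes)
```

Because inputs far from every existing node are stored verbatim, information from late portions of the training sample is never overwritten, which is the point of the modification.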
Growing Kohonen networks solved the problem of training on large amounts of data, but did not improve the poor performance experienced during both the training and classification stages.

A hierarchical Kohonen network is a structure that combines several growing Kohonen networks. Each node of the first-level network is associated with another Kohonen network (called a second-level network). For each iteration during the training phase, the first-level network is trained on a vector that is characteristic of the whole image. After that, the second-level Kohonen network associated with the best matching node of the first-level network is trained on the vector sample set built from the same image. It is also possible to modify the hierarchical Kohonen network so that both the first- and second-level networks are trained on the same data, but in this case the first-level network has a lower accuracy.

Using hierarchical Kohonen networks significantly improves the application's performance, but slightly reduces the efficiency (the percentage of correctly classified images).

2.4 Summary of Kohonen networks

Kohonen networks were the main tool used for image classification and implemented in our submission for ImageCLEF 2012. As input vectors we used various characteristics of the images' domains. The best result of approximately 73% correctly classified images was obtained using growing Kohonen networks and the following characteristics as vector coordinates for the images' domains (in the XYZ color space):

─ Expected value of pixel color,
─ Variance of pixel color,
─ Standard deviation of pixel color from the color of the domain's center,
─ Standard distance between the colors of neighboring pixels,
─ Maximum horizontal color difference,
─ Maximum vertical color difference.

3 Maximum-likelihood method

3.1 Principles

In statistics, maximum-likelihood estimation3 (MLE) is a method of estimating an unknown parameter by maximizing the likelihood function.
3 From Wikipedia, the free encyclopedia (http://en.wikipedia.org/wiki/Maximum_likelihood)

Suppose we have a sample X = (X_1, …, X_n) from a distribution P_θ, where θ ∈ Θ is an unknown parameter. Let L(X | θ): Θ → ℝ be the likelihood function. The point estimate

    θ̂ = θ̂(X) = argmax_{θ ∈ Θ} L(X | θ)

is called the maximum likelihood estimate of the parameter θ. Thus the maximum likelihood estimate is the estimate that maximizes the likelihood function for a fixed realization of the sample.

When applied to the task of image classification, the sample X is the set of all images in the sequence, the set Θ is the set of possible classes, and the likelihood function L is determined by the choice of probabilistic model of the data. In the simplest case, when using only Kohonen networks, the function L is defined as

    L(X | c) = −∑_i ρ(X_i, c),

where ρ(X_i, c) is the distance between image X_i and the representative network of class c. Consequently, when defined in this way, the likelihood function reaches its maximum at

    c = argmin_c ∑_i ρ(X_i, c),

i.e. the image belongs to the class whose representative network is nearest to it.

3.2 Joining probabilistic model

One of the important features of the ImageCLEF RobotVision task is that the input images to be classified are produced in sequence by a moving robot. This means that each image depends on the preceding ones; most importantly, it means that the sample cannot contain a single isolated image from a particular class. Every image has to belong to a sequence of multiple images of the same class that precede or follow it.

This allows us to construct the following model. We define a function P: ℝ_+^m → [0, 1]^m, where m is the number of classes. The function P converts the distances from an image to the representative networks of the classes into the probabilities that this image belongs to each class:

    p_ij = (max_k ρ_ik − ρ_ij + ε) / ∑_k (max_k ρ_ik − ρ_ik + ε),

where ρ_ij is the distance from the i-th image to the representative network of the j-th class and ε > 0 is a negligible quantity that is necessary so that no probability is equal to 0. Here p_ij is the probability that the i-th image belongs to the j-th class regardless of which classes the neighboring images belong to.
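A sketch of this conversion from distances to probabilities, together with a simplified neighbor-aware variant, is given below. The full joining model combines the probabilities of preceding and following images separately, as described next; this sketch uses a single symmetric window of assumed size k, so it is a simplification rather than the paper's exact model.

```python
import numpy as np

def distances_to_probs(rho, eps=1e-6):
    """Convert one image's distances to the m representative networks into
    per-class probabilities: smaller distance -> larger probability.
    eps (an assumed small constant) keeps every probability positive."""
    rho = np.asarray(rho, dtype=float)
    score = rho.max() - rho + eps
    return score / score.sum()

def joint_probs(p, k=2):
    """Probability that image i and its neighbors within a window of k
    images on each side all belong to class j (product over the window)."""
    n, m = p.shape
    out = np.empty_like(p)
    for i in range(n):
        lo, hi = max(0, i - k), min(n, i + k + 1)
        out[i] = p[lo:hi].prod(axis=0)
    return out

def classify_sequence(rho, k=2):
    """Classify a sequence of images from their distance matrix rho (n x m)."""
    p = np.array([distances_to_probs(r) for r in rho])
    return joint_probs(p, k).argmax(axis=1)
```

Even this simplified variant shows the intended effect: an isolated image whose nearest network belongs to the wrong class is outvoted by its neighbors.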
Further, let

    l_ij = ∏_{t=i−k}^{i} p_tj

be the probability that the i-th image and the k images preceding it all belong to the j-th class, and let

    r_ij = ∏_{t=i}^{i+k} p_tj

be the probability that the i-th image and the k images following it all belong to the j-th class. Then

    P_ij = l_ij · r_ij / p_ij

is the probability that the i-th image belongs to the j-th class, taking into account the probability that the neighboring images also belong to the j-th class (p_ij is divided out so that it is not counted twice). Thus the class containing the i-th image can be determined using the formula

    class(i) = argmax_j P_ij.

This model allows us to avoid «wavelets», small groups of incorrectly classified images.

3.3 Separating probabilistic model

This model incorporates additional information about how the images can be divided into groups and which classes can follow each other. In other words, this model takes into account information about the adjacency of the rooms shown in the images. Which rooms are adjacent can easily be determined from the training set, again using the continuity of the sequence of images.

Let us consider the more general case: suppose we have a matrix {q_jk}, where q_jk is the prior probability that classes j and k are adjacent (in our case j and k are the numbers corresponding to rooms).

The fact that adjacent images a and b belong to different classes j and k is equivalent to the following: the images that precede a belong to class j; the images that follow b belong to class k; and classes j and k are adjacent. Therefore the probability that images a and b belong to different classes j and k can be expressed as4

    T(a, b, j, k) = l_aj · r_bk · q_jk.

4 The notation was described in section 3.2, Joining probabilistic model.

Let us assume that the transitions between classes occur at the points of a finite set: after image a_1 a transition from class c_0 to class c_1, after image a_2 from c_1 to c_2, …, after image a_s from c_{s−1} to c_s. The probability of this is the probability that at each transition point the two adjacent images belong to the corresponding different classes, and that all images between neighboring transition points belong to the same class:

    P = (∏_{t=1}^{s} T(a_t, a_t + 1, c_{t−1}, c_t)) · (∏_i p_{i, c(i)}),

where the second product runs over the images i between transition points and c(i) is the class of the segment containing image i. Now we only need to maximize the function P over all possible sets of transition points. This task can be simplified considerably by taking the logarithm of the equation.
The maximum of the function P is achieved at the same points as the maximum of the function log(P):

    log P = ∑_{t=1}^{s} (log l_{a_t, c_{t−1}} + log r_{a_t+1, c_t} + log q_{c_{t−1} c_t}) + ∑_i log p_{i, c(i)}.

This problem can be reduced to the problem of finding the maximum-weight path in a directed graph without cycles. The graph is constructed as follows: each image a is represented as m nodes a_1, …, a_m, where m is the number of classes. Two special nodes are also added: node s is used as the start node in the graph search and node t is the final node. The set of arcs contains the arcs

    {(a_j, b_k) : a and b are consecutive images, 1 ≤ j, k ≤ m},

and also arcs from node s to all nodes formed from the first image in the sequence, and arcs from all nodes formed from the last image in the sequence to node t.

Fig. 1. The graph constructed from a sequence of images.

We introduce a weight function on the arcs: an arc (a_j, b_j) between nodes of the same class has weight log p_{bj}; an arc (a_j, b_k) between nodes of different classes j ≠ k has weight log(l_aj · r_bk · q_jk); arcs incident to the special nodes s and t have weight zero.

In a graph without cycles, the maximum-weight path from node s to node t can be determined using, for example, the Bellman-Ford algorithm. It should be noted that only the nodes located to the left of a_j (i.e. formed from images that precede the image a) can affect the maximum-weight path to the node a_j. The maximum-weight path in the graph determines the optimal classification of the images.

3.4 Summary of probabilistic models

The correct choice of a probabilistic model can significantly influence the result in many classification tasks. The effect of applying probabilistic methods to a particular task also depends strongly on the precision of the function used as a probability estimate: the more precise the estimate, the more effective the method in general.

In the process of developing our test application, we had the opportunity to test both probabilistic models. The joining probabilistic model yielded a rate of 81.3% correctly classified images, while the separating probabilistic model allowed us to improve our result to 86% correctly classified images.
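The graph search of Sect. 3.3 can be sketched as a dynamic-programming sweep over the layered acyclic graph; since the graph is layered, one left-to-right pass suffices instead of full Bellman-Ford. This is a simplified sketch, not the authors' implementation: the per-image log-probabilities are folded into the node entered, and the whole transition term log(l · r · q) is folded into a single assumed `trans_weight` matrix (use float('-inf') to forbid non-adjacent rooms).

```python
def best_path(node_weight, trans_weight):
    """Maximum-weight path through the layered image/class graph.

    node_weight[i][j]:  log-probability that image i belongs to class j
                        (weight of the arc entering node (i, j)).
    trans_weight[j][k]: extra log-weight for switching from class j to
                        class k between consecutive images (assumed to
                        stand in for the log(l * r * q) transition term).
    Returns the optimal class index for every image.
    """
    n, m = len(node_weight), len(node_weight[0])
    score = list(node_weight[0])     # weights of arcs out of the start node s
    back = []                        # back-pointers for path reconstruction
    for i in range(1, n):
        prev, new = [], []
        for k in range(m):
            # Best predecessor class j for class k at image i.
            j = max(range(m),
                    key=lambda j: score[j] + (0 if j == k else trans_weight[j][k]))
            prev.append(j)
            new.append(score[j] + (0 if j == k else trans_weight[j][k])
                       + node_weight[i][k])
        back.append(prev)
        score = new
    # The final node t collects all last-layer nodes; pick the best and trace back.
    k = max(range(m), key=lambda j: score[j])
    path = [k]
    for prev in reversed(back):
        k = prev[k]
        path.append(k)
    return path[::-1]
```

The sweep runs in O(n · m^2) time, and the reconstructed path is exactly the optimal classification of the sequence.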
4 Conclusion

In the course of our research we investigated in detail, improved, and applied in practice such data-processing methods as self-organizing Kohonen maps (and their modifications) and classification methods based on the maximum-likelihood method. We were able to create a test application that analyzed images and, in most cases, used the obtained visual information about the surrounding space (without building a map of the premises) to correctly determine the robot's location. The best result of 86% correctly classified images was achieved using growing Kohonen networks and the separating probabilistic model.

Our test application took part in the ImageCLEF 2012 contest (RobotVision task) and placed fourth. We are mostly satisfied with our result and believe that the approach presented in this paper is new and can be generalized and applied to various other image classification tasks.