                         University of Glasgow at ImageCLEF 2009
                                     Robot Vision Task
                              Yue Feng, Martin Halvey and Joemon M. Jose
                                 Multimedia Information Retrieval Group
                       University of Glasgow, Glasgow, G12 8RZ, United Kingdom
                                   {yuefeng, halvey, jj}@dcs.gla.ac.uk



Abstract
The submission from the Multimedia Information Retrieval Group at the University of Glasgow for the
ImageCLEF 2009 Robot Vision Task combines point matching methodologies with rule based decision
techniques. Instead of the whole image, we use a large set of points of interest extracted from the image content
to represent each image. The points of interest are extracted using an edge corner detector. The RANSAC method
[3] was then applied to estimate the similarity between the test and training images based on the number of
matched pairs of points. The location of the robot is then annotated based on the training image that shares the
highest number of matched point pairs with the test image. A set of decision rules with respect to the
trajectory behaviour of the robot's motion is defined to refine the final results. An illumination filter was also
applied for two of the runs in order to reduce illumination effects. Three runs were submitted using
different combinations of the above approaches.

Categories and Subject Descriptors
I.4 [Image Processing and Computer vision]: I.4.8 Scene Analysis; I.4.9 Applications; I.5 [Pattern Recognition]:
I.5.4 Applications

General Terms
Algorithms, Measurement, Performance, Experimentation.

Keywords
Robot Vision, Computer Vision, Illumination Filter, Decision Rules.

1. Introduction
In this paper, we describe the system and examine the approaches and results for the three independent runs
submitted by the Multimedia Information Retrieval group at the University of Glasgow for the ImageCLEF 2009
Robot Vision task.

The goal of the ImageCLEF 2009 Robot Vision Task is to address the problem of topological localisation of a
mobile robot using visual information. Specifically, participants were asked to determine the topological location
of a robot based on images acquired with a perspective camera mounted on a robot platform [2]. Given a test
sequence, a system/algorithm must be able to provide information about the location of the robot, where the
relevant location information is available in a training sequence. Our strategy is to analyse the visual content of
the test sequence and compare it with the training sequence to decide the location. Given a set of training
sequences annotated with location information and an unannotated video frame captured by the robot, the best
solution for identifying the location of this unannotated frame is to find the closest frame in the training set. Two
possible techniques could be applied to solve this problem: classification and image matching. The classification
approach [6], which assigns an image to one of a set of trained groups according to its feature distance to each
group, requires an initial training step to obtain the class properties. However, its performance on images
belonging to an unknown class is poor, since no training samples for such a class are available at the application
stage. The image matching approach [7] is another potential solution; it can automatically and efficiently detect
that two images are similar, or find similar images to a test image within a database of images. This approach
estimates the visual distance between the unannotated frame and each frame in the training sequence and returns
a ranked list, where the unannotated frame can be given the same location label as the training image with the
highest score in the ranked list. For frames captured in an unknown room, the computed ranked list alone is not
reliable, since every training image on the list carries a known annotation; a similarity threshold can therefore be
introduced to filter out false matches and label such frames as unknown rooms. Thus, our intuition is that an
image matching approach is better suited than a classification approach for the ImageCLEF 2009 Robot Vision
Task.

It is well known that when an object/robot moves through its environment it does not move randomly. Instead, it
usually follows specific trajectories or motion patterns corresponding to its intentions [1]. Knowledge about such
patterns can be used to help a system robustly keep track of the robot in its environment and identify the
location of the robot.

Motivated by the above work [1, 6, 7], our adopted approach combines image matching with a rule based
decision approach. In addition, an illumination filter is integrated into two of the runs in order to minimise
lighting effects, with the goal of improving the predictive accuracy.

The rest of the paper is organised as follows. The next section introduces our proposed methodology for the
robot vision task. Section 3 describes our submitted runs, and Section 4 presents the results and their evaluation.
Finally, in Section 5 we present our conclusions.

2. Methodology
For our approach, image matching is first applied to find the most closely matching frames from the training
sequences in order to annotate the test frame. A rule based decision maker is then applied in order to refine the
results based on the movement behaviour of the robot. In addition, an illumination filter is applied to pre-process
the sequences in order to reduce lighting effects. Each of these steps is described in more detail in the
following subsections.

2.1. Points Based Image Matching
Considering that both the training and test sequences are captured using the same camera in the same
environment, we assume that frames taken in the same location will contain similar content and geometric
information.

Motivated by the above assumption, the proposed image matching algorithm is designed in the following
successive stages: (1) a corner detection method is first used to create an initial group of points of interest; (2)
the RANSAC algorithm [3] is then applied to establish point correspondences between two frames and to calculate
the fundamental matrix [4]; this matrix encodes an epipolar constraint that applies to general motion and rigid
structure, and it is used to compute the geometric information for refining the matched point pairs; (3) the
number of refined matched points is regarded as the similarity between the two frames.

2.1.1. Points of Interest
In order to reduce the computation cost, we use points of interest (POI) rather than all pixels in the frame.
Considering that corners are local image features characterised by locations where the variations of intensity
in both the X and Y directions are high, it is easier to detect and compare POI within and around corner
structures such as edges or textures. To this end, for this work we employ the Harris corner detector [5] to
initialise the POI, since it has strong invariance to rotation, scale, illumination variation and image noise. The
Harris corner detector uses the local autocorrelation function to measure the local changes of the signal in order
to detect the corner positions in each frame. More details can be found in [5]. Results of the corner detection are
shown in Figure 1, where the detected corners are represented by ‘*’ and ‘+’.




Figure 1: One frame from the training sequence and one frame from the test sequence (top), and the detected
corners of those frames (bottom).
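
To make this POI extraction step concrete, a minimal sketch using OpenCV is given below. It assumes the Harris response exposed through cv2.goodFeaturesToTrack; the maximum number of corners, quality level and neighbourhood sizes are illustrative assumptions, not values reported in this paper.

import cv2
import numpy as np

def detect_points_of_interest(frame_bgr, max_points=500, quality=0.01):
    # Convert to a single-channel image, as the corner detector works on intensity.
    gray = np.float32(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY))
    # Harris corner response via goodFeaturesToTrack (useHarrisDetector=True).
    corners = cv2.goodFeaturesToTrack(
        gray,
        maxCorners=max_points,
        qualityLevel=quality,
        minDistance=5,
        blockSize=3,
        useHarrisDetector=True,
        k=0.04,
    )
    # Return an (N, 2) array of (x, y) corner coordinates.
    if corners is None:
        return np.empty((0, 2), dtype=np.float32)
    return corners.reshape(-1, 2)

In this sketch, the returned coordinates would play the role of the POI passed to the matching stage described in the next subsection.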

2.1.2. Point Matching
The next step of our approach is to use a point matching technique to establish point correspondences between
two frames. The point matching method generates putative matches between two frames by looking for points
that are maximally correlated with each other inside a window surrounding each point. Only points that correlate
strongly with each other in both directions are returned. Given the initial set of P points of interest, a model
parameter X, used to check whether a point fits, is first estimated using N points chosen at random from P. It is
then determined how many points in P fit the model with values of X within a tolerance T given by the user. If
the number is satisfactory, the model is regarded as a fit and the operation terminates successfully. These
operations are repeated in a loop over all the POI. In the present work, T is set at 95%. Setting such a high
threshold reduces the number of points of interest, so that only points with a ‘good match’ are kept. Examples of
matched points can be seen in Figure 2a, where a pair of matched points is represented by ‘*’ and ‘+’.
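
The correlation-based putative matching described above might be sketched as follows, under the assumption that fixed-size intensity patches around each corner are compared with normalised cross-correlation and that only mutual best matches above a threshold are kept; the patch size and threshold are illustrative, not the paper's actual settings.

import numpy as np

def _patch(gray, x, y, half=5):
    # Extract a zero-mean, unit-norm patch around (x, y); None near the border.
    x, y = int(round(x)), int(round(y))
    if (y - half < 0 or x - half < 0 or
            y + half + 1 > gray.shape[0] or x + half + 1 > gray.shape[1]):
        return None
    p = gray[y - half:y + half + 1, x - half:x + half + 1].astype(np.float64)
    p = p - p.mean()
    norm = np.linalg.norm(p)
    return p / norm if norm > 0 else None

def putative_matches(gray1, pts1, gray2, pts2, min_ncc=0.9):
    # Return index pairs (i, j) that are mutual best matches under NCC.
    if len(pts1) == 0 or len(pts2) == 0:
        return []
    patches1 = [_patch(gray1, x, y) for x, y in pts1]
    patches2 = [_patch(gray2, x, y) for x, y in pts2]
    scores = np.full((len(pts1), len(pts2)), -np.inf)
    for i, p1 in enumerate(patches1):
        if p1 is None:
            continue
        for j, p2 in enumerate(patches2):
            if p2 is not None:
                scores[i, j] = float((p1 * p2).sum())   # normalised correlation
    matches = []
    for i in range(len(pts1)):
        j = int(np.argmax(scores[i]))
        # Keep only strong correlations that are best in both directions.
        if scores[i, j] >= min_ncc and int(np.argmax(scores[:, j])) == i:
            matches.append((i, j))
    return matches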

The initial matching pairs may contain many mismatches, so a post-processing step to refine the results is
needed. Given the assumption in Section 2.1 that frames taken in the same location will contain similar
geometric information, we applied the fundamental matrix [4]. The fundamental matrix was initially designed
for finding corresponding points in stereo images, where it helps to find the matching pairs using the computed
geometric information.

Given the initial matching points, the fundamental matrix F can be estimated from a minimum of seven point
correspondences. Its seven parameters represent the only geometric information about the cameras that can be
obtained through point correspondences alone.
Given the computed F, we applied it to all the matching pairs to eliminate incorrect ones, where each matched
point pair should satisfy the following formula,

x′ᵀFx = 0

where x and x′ are the corresponding points in the two images, and Fx describes a line (the epipolar line) on
which the corresponding point x′ in the other image must lie. The results of the matching point pairs after the
refining step are illustrated in Figure 2b.




Figure 2: (a, left) the initial matching map and (b, right) the matching map after filtering; the points linked
with lines are correctly matched.



Thus, after applying the matrix F to all the paired points to filter out the mismatched points, the number of
matched point pairs remaining is regarded as the similarity between the two images.

In order to localise the robot, we assume that the position can be retrieved by finding the frame in the training
sequence that is most similar to the test frame; the most similar frame is the one that contains the most matched
points with the test frame.
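
The refinement and similarity computation could be sketched as below, assuming OpenCV's RANSAC-based fundamental matrix estimation; the number of inlier pairs then serves as the similarity score, and the most similar training frame provides the location label. The RANSAC threshold and the data layout are assumptions for illustration, and putative_matches refers to the sketch in Section 2.1.2.

import cv2
import numpy as np

def frame_similarity(pts1, pts2, matches, ransac_thresh=1.0):
    # Similarity = number of putative matches consistent with a fundamental
    # matrix estimated by RANSAC (cv2.findFundamentalMat).
    if len(matches) < 8:   # need at least 8 pairs for a stable estimate
        return 0
    src = np.float32([pts1[i] for i, _ in matches])
    dst = np.float32([pts2[j] for _, j in matches])
    F, mask = cv2.findFundamentalMat(src, dst, cv2.FM_RANSAC, ransac_thresh, 0.99)
    if F is None or mask is None:
        return 0
    # Inliers are the pairs that (approximately) satisfy x'^T F x = 0.
    return int(mask.ravel().sum())

def localise(test_gray, test_pts, training_frames):
    # training_frames: iterable of (label, gray_image, corner_points) tuples.
    best_label, best_score = None, -1
    for label, train_gray, train_pts in training_frames:
        m = putative_matches(test_gray, test_pts, train_gray, train_pts)
        score = frame_similarity(test_pts, train_pts, m)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score

The returned score can later be compared against the unknown-room threshold introduced as Rule 3 in Section 2.2.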

2.2. Decision Rules
Given the results of point matching, each test frame can be annotated as being from one of the possible rooms
and the trajectory of the robot can be generated. The trajectory can be represented using the extracted
annotation information frame by frame. An example (Example 1) of an extracted trajectory can be represented
as:

                ppppppppppppppppppppppppp                                                  period 1
                cccc                                                                       period 2
                pp                                                                         period 3
                cccccccccccccccccccc ccccccccccccccccccccc                                 period 4
                eeeee eeeee eeeee eeeee eeeee eeeee eeeee eeeee                            period 5
                kkkkkkkkkkkkk kkkkkkkkkkkkk                                                period 6
                cccccccccc ccccccccccccc                                                   period 7

Here p, c, e and k represent ‘printing area’, ‘corridor’, ‘two-person office’ and ‘kitchen’ respectively. In this
example, the robot travels continuously from ‘printing area’ to ‘corridor’, then back to ‘printing area’, ‘corridor’,
‘two-person office’, ‘kitchen’ and ‘corridor’. In order to study the movement behaviour of the robot, each
continuous stretch in one location is considered as one period; this sequence contains seven periods in different rooms.
By studying the training sequence released as part of the ImageCLEF 2009 Robot Vision training and test sets,
we find that (i) the robot does not move ‘randomly’; (ii) the period of time that the robot stays in one room is
always more than 0.5 seconds, which corresponds to more than 12 continuous frames; (iii) the robot always
enters a room from outside the room, e.g. from the ‘printing area’ or ‘corridor’, instead of from another room,
and then exits that room to the place it came from instead of to a different place.

Based on the above observations, we defined a set of rules to help determine the location of the robot at any
given time.

Rule 1: Time length. The robot will not stay in one place for a period of less than 20 frames. For instance, in the 2nd
and the 3rd periods of the above example, the extracted sequence shows that the robot stayed continuously in
these two locations for just four and two frames respectively, which violates this rule. Thus, there must
be a false detection in the 2nd and the 3rd periods.

Rule 2: Jumping room. If the location of the robot changes from room A to room B without passing through the
‘corridor’ or ‘printing area’, there must be a false detection. For instance, in the 5th and 6th periods of the above
example, the extracted sequence shows that the robot moved from ‘two-person office’ to ‘kitchen’, a typical
‘jumping room’ case in which the robot did not pass through a connecting area when changing rooms. Thus,
there must be a false detection between the 5th and the 6th periods.

Rule 3: Unknown room. Since the test sequence contains additional rooms that were not imaged previously, no
corresponding frames in the training set can be used to annotate these rooms. In addition, given the nature of the
image matching algorithm, a test frame will be annotated with the most similar training frame even if only one
matched point pair is found, so false annotations are unavoidable. In the experiments, we found that the number
of matched point pairs detected between a frame from an unknown room and the training frames is very limited.
Thus, we define a rule that any frame with fewer than 15 matched point pairs to its best training frame is
annotated as an unknown room.

To refine the false detections, if the initial detection does not obey Rule 1, we do as follows:

•   If the location before the false detection period is the same as the location after it, the location of the
    false period is revised and annotated the same as the previous period. For example, the 2nd period
    will be corrected to match the 1st period.

If the initial detection does not obey Rule 2, we do as follows:

•   A window with a size of N frames is applied at the location boundary and the similarity is recalculated.
•   The similarity between the test image and the top 10 matched training images is summed to give the
    recalculated similarity. The training frame with the highest recalculated similarity is used to annotate
    the current frame with a location.
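
A simplified sketch of how Rules 1 and 3 could be applied to a per-frame label sequence is given below; the short-period smoothing and the unknown-room threshold follow the description above, while the window-based re-scoring used for Rule 2 is omitted for brevity. The function names, the 'unknown' label and the data layout are illustrative assumptions.

def apply_rule_3(labels, scores, min_pairs=15):
    # Rule 3: frames whose best match has fewer than min_pairs matched point
    # pairs are re-labelled as belonging to an unknown room.
    return [lab if s >= min_pairs else "unknown" for lab, s in zip(labels, scores)]

def apply_rule_1(labels, min_period=20):
    # Rule 1: a continuous stay shorter than min_period frames is treated as a
    # false detection; if the periods before and after it agree, the short
    # period inherits their label.
    periods = []
    for lab in labels:                      # group frames into (label, length) periods
        if periods and periods[-1][0] == lab:
            periods[-1][1] += 1
        else:
            periods.append([lab, 1])
    original = [lab for lab, _ in periods]  # labels before any revision
    for k, (lab, length) in enumerate(periods):
        if length < min_period and 0 < k < len(periods) - 1:
            if original[k - 1] == original[k + 1]:
                periods[k][0] = original[k - 1]
    out = []
    for lab, length in periods:             # flatten back to per-frame labels
        out.extend([lab] * length)
    return out

Applied to Example 1, the four-frame ‘corridor’ period (period 2) would be relabelled as ‘printing area’, since the periods on either side of it agree.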

2.3 Illumination Filter
The sequences used in the task are taken in different rooms under different illumination settings, e.g. some of the
images are from bright offices whereas some are taken in a dark corridor. The performance of the image
matching method described above is highly dependent on two factors, the number of points of interest and the
point matching technique: the more points of interest that are found, the easier the point matching step becomes.
Thus, poor visual quality in a sequence can affect the performance of the corner detector and thereby reduce the
number of points of interest available for point matching. With the goal of improving the results, a method to
reduce illumination effects was used for two of the submitted runs. In order to reduce the illumination effect, an
illumination filter called Retinex [8] was applied. Retinex can improve the visual rendering of a frame in which
lighting conditions are poor. Once the illumination condition is improved, the number of points of interest
detected by the corner detector can increase, as the improved image quality should result in more corners being
detected.
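
As an illustration, a single-scale Retinex filter can be sketched as follows; the paper does not state which Retinex variant or parameters were used, so the Gaussian scale and the per-channel normalisation below are assumptions.

import cv2
import numpy as np

def single_scale_retinex(frame_bgr, sigma=80.0):
    # Single-scale Retinex: log(image) - log(estimated illumination), where the
    # illumination is approximated by a heavily Gaussian-smoothed copy.
    img = frame_bgr.astype(np.float64) + 1.0          # avoid log(0)
    illumination = cv2.GaussianBlur(img, (0, 0), sigma)
    retinex = np.log(img) - np.log(illumination)
    # Stretch each channel back into the displayable 0-255 range.
    out = np.zeros_like(retinex)
    for c in range(retinex.shape[2]):
        ch = retinex[:, :, c]
        lo, hi = ch.min(), ch.max()
        out[:, :, c] = 255.0 * (ch - lo) / (hi - lo + 1e-12)
    return out.astype(np.uint8)

In this sketch, the filtered frame would simply replace the original frame before corner detection.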
3. Submitted Runs
We submitted three runs for the ImageCLEF 2009 Robot Vision task using different combinations of the image
matching approach, the decision rules and the illumination filter. Detailed descriptions of each run are outlined
below:

    (i) The first run is regarded as the baseline; it uses the first frame of every five continuous frames of both
              the training and test sequences for image matching, followed by the application of the rule based
              model to refine the results.
    (ii) The second run uses all of the frames in the training and test sets for image matching, followed by the
              application of the rule based model. The illumination filter is applied to pre-process the frames.
    (iii) The third run uses the first frame of every five continuous frames of both the training and test
              sequences for image matching, followed by the application of the rule based model to refine the
              results. The illumination filter is applied to pre-process the frames.

Considering that every second of video contains 25 frames and the visual difference between two continuous
frames is very limited, a large amount of redundancy is contained in the sequence, which increases the computing
cost. In order to improve efficiency, we split each sequence into shots, where every shot contains five continuous
frames and the first frame of the five is regarded as the key-frame representing the shot. Instead of training and
annotating using every frame, we only take the key-frames into account. Once a key-frame has been annotated,
all the remaining frames in its shot are given the same annotation. Since the computation cost of our approach is
linear, the key-frame representation reduces the processing time by 80%. However, this may result in false
detections while the robot changes location. For example, in Example 1, the frames from the 21st to the 25th are
annotated as ‘printing area’. However, if the robot actually moves from the printing area to the corridor at the
23rd frame, then since the 21st frame is annotated as ‘p’, the other four frames are annotated in the same way.
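
The key-frame sampling and label propagation could be sketched as follows; the shot length of five frames comes from the description above, while the function names are illustrative.

def annotate_with_keyframes(frames, annotate_frame, shot_length=5):
    # Annotate only the first frame of every shot of shot_length frames,
    # then propagate its label to the remaining frames of the shot.
    labels = []
    for start in range(0, len(frames), shot_length):
        key_label = annotate_frame(frames[start])   # annotate the key-frame only
        shot = frames[start:start + shot_length]
        labels.extend([key_label] * len(shot))      # propagate to the whole shot
    return labels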

4. Results and Evaluation
We generated multiple runs of annotation results based on the approaches presented above. All three runs
were submitted to ImageCLEF for official evaluation. The ‘score’ is used as the measure of the overall
performance of the systems. The following rules are used when calculating the score for a single test sequence:

•   +1.0 points for each correctly annotated frame.
•   Correct detection of an unknown room is treated the same way as a correct annotation.
•   -0.5 points for each misannotated frame.
•   0.0 points for each image that was not annotated (the algorithm refrained from a decision).

In addition, we also use ‘accuracy’ as another measure of performance, which shows the percentage of frames
that are annotated correctly.
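
A small sketch of the score and accuracy computation described above is given below; the use of None for frames the algorithm refrained from annotating and the 'unknown' label are assumptions about the data representation.

def evaluate(predicted, ground_truth):
    # predicted    : list of labels, with None meaning 'no decision'.
    # ground_truth : list of true room labels ('unknown' for unseen rooms).
    score, correct = 0.0, 0
    for pred, true in zip(predicted, ground_truth):
        if pred is None:
            continue                 # 0.0 points: the algorithm refrained
        if pred == true:             # correct annotation, including 'unknown'
            score += 1.0
            correct += 1
        else:
            score -= 0.5             # misannotated frame
    accuracy = correct / len(ground_truth) if ground_truth else 0.0
    return score, accuracy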

Table 1 shows the results of the three runs. It can be seen clearly that the second run achieved the highest
accuracy and score overall. Compared with the first run, the second run improves the accuracy from 59% to
68.5%, which shows that using all of the frames for training and testing can improve the image matching results.
In addition, comparing the first and third runs shows that the illumination filter does not improve the system
performance.

In more detail, the baseline run offers a reasonable starting performance for the combination of image matching
and rule based decision approaches. After using all frames for training and testing and applying the illumination
filter, the second run obtains a 9.5 percentage point and 240 point gain in terms of accuracy and score,
respectively. In the last run, where the illumination filter is used for pre-processing and the same number of
frames as the baseline is used for training and testing, the accuracy and score are greatly reduced, by 33.1
percentage points and 838.5 points, respectively.
                                Accuracy        Score
    Run 1 (baseline)              59%           650.5
    Run 2                        68.5%          890.5
    Run 3                        25.9%         -188
Table 1: Results of the three submitted runs (1689 test frames in total).

5. Conclusions
The Multimedia Information Retrieval group at the University of Glasgow participated in the ImageCLEF 2009
Robot Vision task. In this paper we have presented the methodology, experiments and preliminary results for the
task. The model combining image matching, decision rules and an illumination filter contributes to the overall
effectiveness of the system. The results reflect the difficulty of the robot vision problem; we believe we have
gained much insight into the practical problems, and future evaluations have the potential to produce much
better results.

References
    1.   M. Bennewitz, W. Burgard, G. Cielniak and S. Thrun, ‘Learning motion patterns of people for compliant
         robot motion’, The International Journal of Robotics Research, Vol. 24, No. 1, pp. 31-48, 2005.
    2.   B. Caputo, A. Pronobis and P. Jensfelt, ‘Overview of the CLEF 2009 robot vision track’, CLEF working
         notes 2009, Corfu, Greece, 2009.
    3.   L. Lu, X. Dai and G. Hager, ‘Efficient particle filtering using RANSAC with application to 3D face
         tracking’, Image and Vision Computing, Vol. 24, No. 6, pp. 581-592, 2006.
    4.   H. X. Zhong, Y. J. Pang and Y. P. Feng, ‘A new approach to estimating fundamental matrix’, Image and
         Vision Computing, Vol. 24, No. 1, pp. 56-60, 2006.
    5.   C. Harris and M. Stephens, ‘A combined corner and edge detector’, Proc. Alvey Vision Conference,
         pp. 147-151, 1988.
    6.   Y. Feng and J. Jose, ‘A hybrid approach for classification based multimedia retrieval’, SAMT 2008.
    7.   Y. Feng, J. Jiang and S. S. Ipson, ‘A shape-match based algorithm for pseudo-3D conversion of 2D
         videos’, ICIP (3), 2005, pp. 808-811.
    8.   Z. Rahman and D. J. Jobson, ‘Retinex processing for automatic image enhancement’, Journal of
         Electronic Imaging, Vol. 13, No. 1, pp. 100-110, 2004.