   Determining Spatial Classification Models for Automated Landmark
                              Identification
                                                      Mark Hughes

                                         Center for Digital Video Processing
                                               Dublin City University
                        ABSTRACT

The use of interest point detection and keypoint descriptors has proved successful for object recognition and for automated image matching. We aim to use keypoint descriptors fused with spatial data to automatically identify landmarks and buildings within images using a large-scale training database. However, one big problem with large-scale image databases is that an average image yields around 1,000 keypoints. Comparing each feature from one image to every feature extracted from all the images within a large-scale database is extremely computationally expensive and takes a long time to execute. A technique is therefore required that provides reliable image matching using these extracted keypoint values within an acceptable timeframe. In this proposal we describe the use of spatial filtering fused with classification models based on interest points. We aim to cluster related images, train support vector machine classification models on the interest point values of each image cluster, and then assign spatial locations to each of these models.
                    1. INTRODUCTION

The main aim of this proposal is to enhance automated image tag and caption creation using interest points and classification models. Commonly used interest point detection methods generate on average up to 1,000 keypoints per image. This presents a considerable challenge, in terms of computational overhead, when matching two images using their interest points. Matching each keypoint from one image against each keypoint of every image in a large-scale database is extremely computationally expensive. To put this into perspective, matching a single query keypoint against all images in a 1,000-image database using the SIFT algorithm would require 128 million value comparisons (1,000 images * 1,000 keypoints * 128 values per keypoint vector), and matching all of a query image's roughly 1,000 keypoints against a database of 100,000 images would require over 12 trillion comparisons; this number grows considerably as the database grows. Clearly this type of point-to-point matching cannot be done in real time. A new technique is therefore needed, one that either filters the set of keypoints to be compared or avoids matching keypoint against keypoint individually, so that the matching can be performed in real time. In this proposal an approach that uses SVM classification models with assigned spatial data is described.
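As a rough illustration of this scaling, the following minimal Python sketch reproduces the back-of-the-envelope figures above; the constants (1,000 keypoints per image, 128-dimensional SIFT descriptors) are the averages assumed in this proposal, not measured values.

    # Back-of-the-envelope cost of brute-force SIFT matching (illustrative only).
    KEYPOINTS_PER_IMAGE = 1_000  # assumed average number of keypoints per image
    DESCRIPTOR_DIM = 128         # length of one SIFT descriptor vector

    def value_comparisons(database_images: int, query_keypoints: int) -> int:
        """Scalar comparisons needed to match the given number of query keypoints
        against every keypoint of every image in the database."""
        return query_keypoints * database_images * KEYPOINTS_PER_IMAGE * DESCRIPTOR_DIM

    # One query keypoint against a 1,000-image database: 128,000,000 comparisons.
    print(value_comparisons(1_000, query_keypoints=1))
    # A full query image (~1,000 keypoints) against 100,000 images: 1.28e13,
    # i.e. over 12 trillion comparisons.
    print(value_comparisons(100_000, query_keypoints=KEYPOINTS_PER_IMAGE))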
                      2. BACKGROUND

Almost all widely used image search engines today use user-defined or semi-automated image tags to retrieve and rank images deemed relevant to a user's query. The accuracy of the results returned depends largely on the accuracy and richness of the tags associated with each image. The main disadvantage of this approach is that most casual users will not spend the time required to create accurate tags for their images. Another disadvantage is that user-created tags can be heterogeneous and might not match other people's ideas of the objects or places depicted in an image. A lot of research work is currently taking place on analysing image content to create semantic tags automatically. Using low-level image features alone has been shown to work well only for very low-level semantics, such as distinguishing between outdoor and indoor images [1] or between cityscapes and landscapes [2]; low-level image features do not seem to distinguish accurately between high-level semantics. We discuss a new approach that will utilise image and object matching techniques using interest point detection and keypoint descriptors fused with SVMs and spatial data.

The approach that we propose is to enable efficient image matching using image classification models fused with spatial data. Multiple images of a landmark taken from similar viewpoints will be clustered to create a single view model. With this model we plan to classify other images as belonging to the same cluster and hence create an image tag automatically for each new image. A newly classified image can also be added to the cluster, and a new, more robust and accurate model can be trained to replace the old one. Therefore, the more images that are uploaded to a system based on this approach, the more accurately the system should function.
There are two main advantages to be gained from clustering multiple image views into single models:

     •    Computational overhead: the amount of time taken to compare and classify images in a large-scale database is drastically reduced. With efficient filtering methods this classification could theoretically be done in real time on large-scale databases.

     •    Robustness: increased robustness is obtained by combining features obtained under multiple imaging conditions into a single model view.

                   3. PROPOSED APPROACH

A large-scale training database is to be created. This database contains images which have been manually tagged and which carry spatial data recording the location where each image was taken. All of these images will first be clustered based on spatial data, e.g. all images taken within a 500-meter radius of a location will be clustered together. These spatial clusters will then be split into sub-clusters based on image content. Low-level image features will be fused with local image features based on the SIFT [3] and SURF [4] algorithms to create these sub-clusters. Each sub-cluster is intended to represent a view of a building or landmark from a specific viewpoint. Using these image features, an SVM classification model will be trained for each cluster. Each classification model will be assigned spatial coordinates based on the mean of the spatial data of all the images located within its cluster.

Figure 1. Brief outline of the proposed approach on a small scale. Image features extracted from the image in the bottom right are used as inputs to the classification models trained using image features from the clusters of images in the other three corners.
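The training side of the pipeline just described (and outlined in Figure 1) could be realised roughly as in the sketch below. The concrete choices are assumptions made for illustration only: DBSCAN with a haversine metric for the 500-meter spatial grouping, k-means on mean keypoint descriptors for the content-based sub-clustering, and a scikit-learn SVM per view cluster trained with the site's other images as negatives. The function name build_view_models and the dictionary layout of each model are likewise hypothetical.

    import numpy as np
    from sklearn.cluster import DBSCAN, KMeans
    from sklearn.svm import SVC

    EARTH_RADIUS_M = 6_371_000

    def build_view_models(descriptors, latlons, image_tags, n_views_per_site=5):
        """descriptors: per-image arrays of shape (n_keypoints, 128);
        latlons: (n_images, 2) array of (lat, lon) in degrees;
        image_tags: per-image lists of human-defined tags."""
        # 1. Spatial clustering: group images lying within roughly 500 m of each
        #    other (eps is expressed in radians for the haversine metric).
        site_ids = DBSCAN(eps=500 / EARTH_RADIUS_M, metric="haversine",
                          min_samples=2).fit_predict(np.radians(latlons))
        models = []
        for site in set(site_ids) - {-1}:                  # -1 marks DBSCAN noise
            idx = np.flatnonzero(site_ids == site)
            # 2. Sub-cluster by image content; each image is summarised here by
            #    the mean of its keypoint descriptors (one option discussed later).
            summaries = np.vstack([descriptors[i].mean(axis=0) for i in idx])
            k = min(n_views_per_site, len(idx))
            view_ids = KMeans(n_clusters=k, n_init=10).fit_predict(summaries)
            for view in range(k):
                members = idx[view_ids == view]
                # 3. One SVM per view cluster: keypoints of member images are
                #    positives, keypoints of the site's other images are negatives.
                X = np.vstack([descriptors[i] for i in idx])
                y = np.concatenate([np.full(len(descriptors[i]), int(i in members))
                                    for i in idx])
                svm = SVC(kernel="rbf", probability=True).fit(X, y)
                models.append({
                    "svm": svm,
                    # 4. The model's location is the mean of its members' coordinates.
                    "latlon": latlons[members].mean(axis=0),
                    "tags": sorted({t for i in members for t in image_tags[i]}),
                })
        return models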
To automatically identify a landmark within an image, local image features will be extracted and input into all the classification models whose assigned spatial coordinates lie near the image's own spatial location. If one of these classifiers outputs a confidence value above a certain threshold, the image is tagged with the captions or tags associated with that classification model, e.g. 'View of Front of Christchurch Cathedral'.
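A minimal query-side sketch, continuing the assumed structures from the previous listing, is shown below. The 1,000-meter search radius, the 0.8 threshold, and the use of the mean per-keypoint class probability as the confidence value are illustrative assumptions, not values fixed by this proposal.

    import numpy as np

    EARTH_RADIUS_M = 6_371_000

    def haversine_m(a, b):
        """Great-circle distance in meters between two (lat, lon) points in degrees."""
        (lat1, lon1), (lat2, lon2) = np.radians(a), np.radians(b)
        h = (np.sin((lat2 - lat1) / 2) ** 2
             + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
        return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(h))

    def identify_landmark(query_descriptors, query_latlon, models,
                          radius_m=1_000, threshold=0.8):
        """Return the tags of the best-scoring spatially nearby view model whose
        SVM is sufficiently confident about the query image's keypoint descriptors."""
        best_tags, best_score = None, threshold
        for model in models:
            # Spatial filter: only evaluate models whose assigned coordinates
            # lie near the query image's own location.
            if haversine_m(query_latlon, model["latlon"]) > radius_m:
                continue
            # Confidence: mean probability, over the query keypoints, that a
            # keypoint belongs to this view cluster (one simple choice).
            score = model["svm"].predict_proba(query_descriptors)[:, 1].mean()
            if score >= best_score:
                best_tags, best_score = model["tags"], score
        return best_tags  # None if no nearby model is confident enough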

A big research challenge here is how to create these image views automatically while ensuring that classification accuracy remains high. If interest points from occluding objects, or from objects that have been incorrectly added to a cluster, are used as positive examples in training, they will add a lot of noise to the classifier and significantly affect classification performance. It is therefore important that the clustering process is as accurate as possible: ideally, only images of the same landmark taken from the same viewpoint should be included in each cluster.

Another challenge is how to train the SVMs. Using all the interest points from each image will also create a noisy training set, since certain interest points may come from occluding objects and background clutter. Other approaches could be to combine local features with low-level image features as inputs to the SVMs, or to use the means of the interest point values as inputs.
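For the last of these options, one simple way to build a fixed-length SVM input is sketched below; the optional concatenation with low-level features (e.g. a colour histogram) is an assumption illustrating the combined-features alternative.

    import numpy as np

    def image_feature_vector(keypoint_descriptors, low_level_features=None):
        """Summarise an image as a single fixed-length SVM input: the mean of its
        keypoint descriptors, optionally concatenated with low-level features."""
        summary = np.asarray(keypoint_descriptors).mean(axis=0)  # (128,) for SIFT
        if low_level_features is not None:
            summary = np.concatenate([summary, np.asarray(low_level_features)])
        return summary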

                     4. FUTURE WORK

We aim to implement a large-scale system to test our hypothesis. We have already collected a large number of training images (130,000+) which have human-defined tags and spatial coordinates. An efficient method of clustering multiple images into 'image views' is required, as the accuracy of the approach depends on how accurately the training images can be clustered. Different choices for the number and combination of image features to use will be tried and tested to ascertain which works best for this problem.

                     5. REFERENCES

[1] Martin Szummer and Rosalind W. Picard, "Indoor-Outdoor Image Classification", IEEE International Workshop on Content-based Access of Image and Video Databases, in conjunction with ICCV'98, 1998.

[2] M. Rautiainen, T. Seppänen, J. Penttilä and J. Peltola, "Detecting Semantic Concepts from Video Using Temporal Gradients and Audio Classification", Proc. International Conference on Image and Video Retrieval, Urbana, IL, pp. 260-270, 2003.

[3] David G. Lowe, "Object Recognition from Local Scale-Invariant Features", International Conference on Computer Vision, Corfu, Greece, September 1999, pp. 1150-1157.

[4] Herbert Bay, Tinne Tuytelaars and Luc Van Gool, "SURF: Speeded Up Robust Features", Proceedings of the Ninth European Conference on Computer Vision, May 2006.