=Paper=
{{Paper
|id=Vol-379/paper-10
|storemode=property
|title=Determining Spatial Classification Models for Automated Landmark Identification
|pdfUrl=https://ceur-ws.org/Vol-379/paper13.pdf
|volume=Vol-379
}}
==Determining Spatial Classification Models for Automated Landmark Identification==
Mark Hughes
Center for Digital Video Processing
Dublin City University
ABSTRACT
Interest point detection and keypoint descriptors have been used successfully for object recognition and for automated image matching. We aim to use keypoint descriptors fused with spatial data to automatically identify landmarks and buildings within images using a large-scale training database. However, one big problem in large-scale image databases is that an average image can yield around 1000 keypoints. Comparing each feature from one image to every feature extracted from all the images in a large-scale database is extremely computationally expensive and requires a lot of time to execute. A technique is therefore required that provides reliable image matching using these extracted keypoint values in an acceptable timeframe. In this proposal we describe the use of spatial filtering fused with classification models based on interest points. We aim to cluster related images, train support vector machine (SVM) classification models on the interest point values of each image cluster, and then assign spatial locations to each of these models.

1. INTRODUCTION
The main aim of this proposal is to enhance automated image tag and caption creation using interest points and classification models. Each commonly used interest point detection method generates on average up to 1000 keypoints per image. This presents a considerable challenge, in terms of computational overhead, when matching two images by their interest points. Matching each keypoint from one image to each keypoint in every image of a large-scale database is extremely computationally expensive. To put it into perspective, comparing one image against all images in a 1000-image database using the SIFT algorithm would require 128 million comparisons (1000 images * 1000 keypoints * 128 values per keypoint vector). Comparing one image against a database of 100,000 images would require over 12 trillion comparisons, and this number grows considerably as the size of the database grows. Clearly this type of point-to-point matching could not be done in real time. A new technique is thus desired which either filters the number of keypoints that need to be compared, or avoids matching keypoint by keypoint altogether, so that matching can be done in real time. In this proposal an approach that uses SVM classification models with assigned spatial data is described.

2. BACKGROUND
Almost all widely used image search engines today use user-defined or semi-automated image tags to retrieve and rank images deemed relevant to a user's query. The accuracy of the results returned depends largely on the accuracy and richness of the tags associated with each image. The main disadvantage of this approach is that most casual users will not spend the time required to create accurate tags for images. Another disadvantage is that many user-created tags are heterogeneous and may not match other people's ideas of the objects or places located within an image. A lot of research is currently taking place on analysing image content to automatically create semantic tags. Using low-level image features alone has been shown to work well only for very low-level semantics, such as distinguishing between outdoor and indoor images [1] or between cityscapes and landscapes [2]. Low-level image features do not seem to distinguish accurately between high-level semantics. We discuss a new approach that utilises image and object matching techniques based on interest point detection and keypoint descriptors, fused with SVMs and spatial data.

The approach we propose enables efficient image matching using image classification models fused with spatial data. Multiple images of a landmark taken from similar viewpoints will be clustered to create a single view model. With this model we plan to classify other images as belonging to the same cluster, and hence create an image tag automatically for each new image. The new image could also be added to the cluster, and a new, more robust and accurate model could be created to replace the old one. Therefore, the more images uploaded to a system based on this approach, the more accurately the system would function.

There are two main advantages to be gained from clustering multiple image views into single models:
• Computational overhead: the time taken to compare and classify images in a large-scale database is drastically reduced. With efficient filtering methods this classification could theoretically be done in real time on large-scale databases.
• Robustness: increased robustness is obtained by combining features obtained under multiple imaging conditions into a single model view.

3. PROPOSED APPROACH
A large-scale training database is to be created. This database contains images which have been manually tagged and which carry spatial data recording the location where each image was taken. All of these images will first be clustered based on spatial data, e.g. all images taken within a 500-meter radius of a location will be clustered together. These spatial clusters will then be split into sub-clusters based on image content. Low-level image features will be fused with local image features based on the SIFT [3] and SURF [4] algorithms to create these sub-clusters. Each sub-cluster is intended to represent a view of a building or landmark from a specific viewpoint. Using these image features, an SVM classification model will be trained for each cluster. Each classification model will be assigned spatial coordinates based on the mean of the spatial data of all images located within the cluster.

To automatically identify a landmark within an image, local image features will be extracted and input into all the classification models whose spatial coordinates are located near the image's spatial location. If one of these classifiers outputs a confidence value above a certain threshold, the image is tagged with the captions or tags associated with that classification model, e.g. 'View of Front of Christchurch Cathedral'. A big research challenge here is how to automatically create these image views while ensuring that classification accuracy remains high. If interest points from occluding objects, or from objects which have been incorrectly added to a cluster, are used as positive examples in training, they will add a lot of noise to the classifier and significantly affect classification performance. It is therefore important that the clustering process is as accurate as possible: ideally, only images of the same landmark taken from the same viewpoint should be included in each cluster.

Another challenge is how to train the SVMs. Using all the interest points from each image will also create a noisy training set, as certain interest points could be obtained from occluding objects and background clutter. Other approaches could be to combine local features with low-level image features as inputs to the SVMs, or to use the means of interest point values as inputs.

Figure 1. Brief outline of the proposed approach on a small scale. Image features extracted from the image in the bottom right are used as inputs to the classification models trained using image features from the clusters of images in the other three corners.

4. FUTURE WORK
We aim to implement a large-scale system to test our hypothesis. We have already collected a large number of training images (130,000+) which have human-defined tags and spatial coordinates. An efficient method to cluster multiple images into 'image views' is required, as the accuracy of the approach depends on how accurately the training images can be clustered. Different numbers and combinations of image features will be tried and tested to ascertain which works best for this problem.

5. REFERENCES
[1] Martin Szummer and Rosalind W. Picard, "Indoor-Outdoor Image Classification", IEEE International Workshop on Content-based Access of Image and Video Databases, in conjunction with ICCV'98, 1998.

[2] M. Rautiainen, T. Seppänen, J. Penttilä and J. Peltola, "Detecting Semantic Concepts from Video Using Temporal Gradients and Audio Classification", Proc. International Conference on Image and Video Retrieval, Urbana, IL, pp. 260-270, 2003.

[3] David G. Lowe, "Object Recognition from Local Scale-Invariant Features", International Conference on Computer Vision, Corfu, Greece, September 1999, pp. 1150-1157.

[4] Herbert Bay, Tinne Tuytelaars and Luc Van Gool, "SURF: Speeded Up Robust Features", Proceedings of the Ninth European Conference on Computer Vision, May 2006.
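
The identification step proposed in Section 3 (evaluate only those classification models whose assigned coordinates lie near the photo's own location, then emit the tag of any model whose confidence clears a threshold) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the haversine distance, the 500 m radius, the 0.8 threshold, the function names, and the stub classifiers are all invented for this sketch; real models would be SVMs trained on SIFT/SURF descriptors of each image cluster.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def identify(query_features, query_pos, models, radius_m=500.0, threshold=0.8):
    """Return the tags of all nearby models whose confidence clears the threshold."""
    tags = []
    for m in models:
        # Spatial filter: skip models whose mean training location is far away.
        if haversine_m(query_pos[0], query_pos[1], m["lat"], m["lon"]) > radius_m:
            continue
        confidence = m["classify"](query_features)  # stand-in for an SVM's score
        if confidence >= threshold:
            tags.append(m["tag"])
    return tags

# Toy stand-ins for trained cluster models; coordinates, scores, and the
# second landmark are made up purely for illustration.
models = [
    {"lat": 53.3435, "lon": -6.2710,
     "tag": "View of Front of Christchurch Cathedral",
     "classify": lambda f: 0.9},
    {"lat": 48.8584, "lon": 2.2945,
     "tag": "Eiffel Tower",
     "classify": lambda f: 0.95},  # far from the query, removed by the filter
]
print(identify([0.1] * 128, (53.3430, -6.2705), models))
# → ['View of Front of Christchurch Cathedral']
```

The spatial filter does the heavy lifting here: of all trained models, only the handful near the photo's GPS position are ever evaluated, which is what makes classification in a large-scale database tractable.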