Image Annotation Through Gaming
Lasantha Seneviratne and Ebroul Izquierdo
Multimedia and Vision Lab, Queen Mary, University of London, Mile End Road, E1 4NS, London, UK.
{lasantha.s, ebroul.izquierdo}@elec.qmul.ac.uk

ABSTRACT
We introduce an interactive framework for image understanding: a game that is enjoyable to play and yields valuable image annotations. When people play the game, they provide useful information about the contents of an image. In practice, the most accurate way to describe the content of an image is manual labelling. Our approach is to motivate people to label images while entertaining themselves. Therefore, if this game becomes popular, it could annotate most images on the web within a couple of months. To safeguard the accuracy of image labelling, we use a combination of computer vision techniques. We believe that the system will make a significant contribution towards addressing the semantic gap in computer vision.

1 INTRODUCTION
Recognising objects and semantic concepts in images is a main research topic in computer vision. There are billions of images on the web, and retrieving them using high-level semantic concepts is not yet accurate enough. Low-level feature extraction techniques can capture differences in, and the distribution of, colours, textures and similar properties, but the gap between low-level features and high-level concepts remains an open issue. Over the last decade, problems related to this semantic gap have driven research in several directions. The web-based ESP game [1] is an “out of the box” approach that provides an appealing way to annotate images. The idea behind the game is to label images on the web according to their visual content. As mentioned, the most precise way to describe image content is manual annotation. Considering that there are billions of images on the web, manual annotation on its own is costly and impractical.

The main objective of this paper is to present an interactive approach to annotating images using manual labelling. To reduce the cost of manual annotation we introduce a highly enjoyable framework. To keep the labels valid we combine several techniques, both psychological and computer vision based. On the psychological side, we use some simple techniques to assess the attitude of the user. These allow us to determine whether a player is a cheater and to treat cheaters differently. At the same time we use computer vision techniques to increase the accuracy of the labels. The goal is to classify an image according to the key-word query entered by the user and to annotate it accordingly. In real-world applications an image may represent a scene that contains a number of objects. We therefore require annotation at the object level rather than of the whole scene. We use a very simple and effective approach called ‘elementary building elements of images’, or simply the image blocks technique [3]. The advantage of this approach is that it distinguishes the background from an object. Image blocks relate closely to objects, so an object is represented by a number of image blocks. We therefore use object-based image representation techniques in our framework.

This paper is organised as follows: section 2 describes the general view of the system, section 3 describes the performance measure, and the paper ends with conclusions and future work in section 4.

2 GENERAL VIEW AND ARCHITECTURE
We call our system ‘Tag4Fun’. It is an interactive game built with the 3D graphics library OpenGL. Each game is played by a single player, and the system is meant to be played by a large number of players. The goal of the system is to annotate images according to their contents.

The image that requires annotation is displayed by the Tag4Fun game, and the game player comments on its contents. The basic game structure is similar to the well-known game ‘Tetris’. The major difference is that Tag4Fun uses characters instead of building blocks of different shapes. To speed up the annotation process, Tag4Fun uses three columns of moving characters. The 3D characters move from the top to the bottom of the screen, and the player collects them using the keyboard. To make the game more interactive, Tag4Fun randomly generates magic characters, which can be changed into any character. The game player has to construct a key-word related to the contents of the image by collecting individual characters. The collected characters are used to select the pre-trained classifier for image classification, which improves label accuracy.

The Tag4Fun visual game entertains and motivates the player and provides valuable key-words about what is contained within the image. At the same time it helps to determine the attitude of the user by feeding images from three different databases: non-annotated, partially annotated and fully annotated. The game player is fed images randomly from all three databases. If a game player tries to annotate a fully annotated image using unrelated key-words, the system identifies that player as a cheater and from then on serves them only partially annotated images. The key-words generated by such players are not used for any labelling.
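The cheater-handling rule described above can be illustrated with a minimal Python sketch. This is not the Tag4Fun implementation: the class and function names (`Image`, `Player`, `next_image`, `submit_keyword`) are hypothetical, "unrelated" is assumed to mean "not among the known labels of the image", and a single unrelated key-word on a fully annotated image is assumed to be enough to flag a player, since the text does not state a threshold.

```python
import random
from dataclasses import dataclass, field

# Hypothetical pool names mirroring the three databases named in the text.
POOLS = ("non-annotated", "partially-annotated", "fully-annotated")


@dataclass
class Image:
    name: str
    pool: str
    labels: set = field(default_factory=set)  # known key-words, if any


@dataclass
class Player:
    name: str
    is_cheater: bool = False


def next_image(player, images, rng=random):
    """Pick the next image: a random pool for ordinary players,
    only the partially annotated pool for flagged cheaters."""
    pool = "partially-annotated" if player.is_cheater else rng.choice(POOLS)
    candidates = [img for img in images if img.pool == pool]
    return rng.choice(candidates) if candidates else None


def submit_keyword(player, image, keyword):
    """Return True if the key-word may be used for labelling.

    Assumption: a key-word counts as unrelated when it is not one of the
    known labels of a fully annotated image (exact match after lowercasing).
    """
    word = keyword.strip().lower()
    if image.pool == "fully-annotated" and word not in image.labels:
        player.is_cheater = True
    # Key-words from flagged players are never used for labelling.
    return not player.is_cheater
```

In this sketch the fully annotated images act as a hidden check: an ordinary player cannot tell which pool an image came from, so consistently unrelated key-words are likely to be caught sooner or later. A real system would probably need a softer notion of "related" (synonyms, spelling variants) than exact matching.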
Figure 2: Tag4Fun game framework

2.1 When is an image annotated?

When the classifier agrees with the key-word entered for an image, the image is temporarily annotated and the player receives a certain number of points to encourage them to continue playing. As an image passes through Tag4Fun it accumulates a number of possible labels. If an image is described using the same label 5 times, that key-word becomes a taboo word for the image, and players are no longer allowed to use it for further labelling of that image. Once an image has 8 taboo words it is considered fully annotated and is removed from the database. All other information captured is saved for future reference. To preserve integrity and to account for language changes over time, fully annotated images are loaded back a few months later for updating. (For example, George W. Bush is the president of the United States at the time of writing and will be a former president in the future.)

2.2 Low-level feature extraction

Most image retrieval systems fail to produce satisfactory results, especially when a user is interested in a particular object rather than the whole scene. We therefore used a simple but effective technique called ‘elementary building elements of images’, often called image blocks. This technique divides the whole image into blocks. Individual blocks represent different types of objects in the image, or a combination of blocks represents a single object. We extracted low-level features from each block and classified the blocks manually to create a vocabulary of training sets. The choice of trained model is directly determined by the input query (the key-word). Using the pre-trained models we were able to classify image blocks, and because of the image block concept we can find the blocks related to a particular object. In the future we will use this to make the game more interactive by giving players a chance to collect bonus points when they point to the location of the object. This will allow us to test the attitude of the user a second time and will provide more valuable information about the location of a particular object.

The low-level feature extraction used three low-level descriptors: the dominant colour descriptor (DCD), the colour layout descriptor (CLD) and the edge histogram descriptor (EHD). These descriptors are defined by MPEG-7 (Moving Picture Experts Group). Using these descriptors we trained a support vector machine (SVM) classifier [4]; this helps us to increase the accuracy of annotation and, in turn, to minimise cheating.

3 PERFORMANCE MEASURE
Games of this type depend on the psychological behaviour of the game player. It is therefore extraordinarily difficult to measure the performance of Tag4Fun unless it is played by a large number of users. As an ongoing project, Tag4Fun is not yet ready for commercialisation. Its performance was analysed in two different ways. First we analysed the performance of the classification process for three different concepts, and secondly we analysed the performance of the complete framework. For testing purposes the classifier was trained for three concepts using 10 images, equating to 320 image blocks. The concepts used are butterfly, tree and cougar. The performance obtained for the three concepts is shown in table 1.

Descriptor   Butterfly   Cougar   Tree
CLD          45%         12%      65%
DCD          30%         5%       40%
EHD          45%         12%      40%
Table 1: Precision of the SVM classifier for each descriptor and concept

According to table 1, the classifier is not as accurate as expected, and we will keep working on it until we achieve a satisfactory result. However, even with this classifier precision we obtained 71% accuracy for the complete framework. (This performance was measured with eight regular game players.)

4 CONCLUSIONS AND FUTURE WORK
We introduced a computer game which is able to encourage and motivate game players to annotate images manually. The proposed framework was tested with eight regular game players and its performance was acceptable. As this is an ongoing project, we will develop and improve the whole system to achieve high label accuracy. We will also improve the quality of our system according to the psychological aspects of regular game players.
Future work will mainly focus on techniques for improving the accuracy of the annotation process and on combining low-level features to improve the accuracy of the classification process.

5 REFERENCES
[1] Luis von Ahn and Laura Dabbish, “Labeling Images with a Computer Game”, Pittsburgh, PA, USA, 2004.
[2] Luis von Ahn, Shiry Ginosar, Mihir Kedia, Ruoran Liu and Manuel Blum, “Improving Accessibility of the Web with a Computer Game”, Pittsburgh, 2006.
[3] Qianni Zhang and E. Izquierdo, “Optimizing Metrics Combining Low-Level Visual Descriptors for Image Annotation and Retrieval”, 2006.
[4] Chih-Wei Hsu, Chih-Chung Chang and Chih-Jen Lin, “A Practical Guide to Support Vector Classification”, 2008.