Towards the Recognition of Sketches by Learning Common Qualitative Representations

Wael Zakaria¹, Ahmed M. H. Abdelfattah¹, Nashwa Abdelghaffar¹, Nermeen Elsaadany¹, Nohayr Abdelmoneim¹, Haythem Ismail², Kai-Uwe Kühnberger³

¹ Faculty of Science, Ain Shams University, Egypt
² Faculty of Engineering, Cairo University, Egypt
³ Institute of Cognitive Science, University of Osnabrück, Germany

Abstract. The aim of this paper is to take advantage of the major perspectives for processing hand-drawn sketches to build a hybrid technique, in which aspects of machine learning, computer vision, and qualitative representations are mixed to produce a classifier for sketches of four specific types of objects.

Keywords: Feature learning - HOG - spatial relations - sketch recognition

1 Introduction and Problem Definition

Sketch recognition is a well-established field in AI and spatial cognition, where techniques that recognize hand-drawn sketches usually follow one of two major perspectives for processing common sketches: raster and vector. The former considers a sketch as an image, from which a feature vector is built from histograms of oriented gradients (HOG) of pixels. Then, it uses a machine learning technique, such as support vector machines, to build a classifier model. Vector processing, on the other hand, deals with a sketch as a set of points, from which qualitative representations are extracted to build a knowledge base of features, on which analogical processing models can be applied for the recognition process (cf. §4 for some references to existing work in sketch recognition). This paper presents a hybrid technique, QuRSeR-Tech, based on qualitative representations, for recognizing hand-drawn sketches.
The presented technique reaches down into low-level representations of sketches, where a sketch consists of geometric shapes as its constituents, which have simple QR features such as the relative sizes of the geometric shapes, the positional relations of successive pairs of geometric shapes, and similarity groups of successive geometric shapes that have the same relative sizes. Our technique has two SVM-based classifiers that employ histogram of oriented gradients (HOG) characteristics from computer vision, and a restricted set of qualitative representations (QR) [Scholkopf and Smola, 2001, Dalal et al., 2006, Cohn and Renz, 2008].

1.1 Abstract Building Blocks

In the following, we formulate the basic definitions of the terms we use in this paper, and explain some conventions that help in understanding the rest of the presentation and the obtained results. A stroke is the basic building block of a sketch. It can be represented as a sequence of points in the sense of a polyline, but we formally define a stroke as a sequence s := ⟨p_i⟩_{i=1}^{n(s)} of n(s) ∈ ℕ ordered tuples that correspond to sequentially recorded points between a pen-down and a pen-up event. Each representation of a point p_i corresponds to a tuple containing both the x_i- and y_i-coordinates of the pixels composing s, as well as a temporal parameter, t_i, that reflects the time of drawing each pixel p_i, for 1 ≤ i ≤ n(s).

Copyright © by the paper's authors. Copying permitted for private and academic purposes. In: A.M. Olteteanu, Z. Falomir (eds.): Proceedings of ProSocrates 2017: Symposium on Problem-solving, Creativity and Spatial Reasoning in Cognitive Systems, at Delmenhorst, Germany, July 2017, published at http://ceur-ws.org
For simplicity, we may assume that a stroke s is a sequence of n(s) ordered pairs, s := ⟨(x_i, y_i)⟩_{i=1}^{n(s)}, with p_1 = (x_1, y_1) being the pair corresponding to the coordinates of the "start point" of drawing s (pen down), and p_{n(s)} = (x_{n(s)}, y_{n(s)}) corresponding to s's last point (pen up). Even so, p_1 and p_{n(s)}, in particular, must also reflect (at least) the temporal parameters, t_1 and t_{n(s)}, that record the start and end times, respectively, of s. We also define a sketch as a sequence of strokes, S = ⟨s_1, s_2, ..., s_{m(S)}⟩, where i is the stroke number for each 1 ≤ i ≤ m(S). Two strokes s_i and s_{i+1} are called successive. Figure 1a gives an example of a sketch, S_{w1} = ⟨s_1, s_2, s_3, s_4, s_5, s_6, s_7⟩, that contains seven strokes (i.e., m(S_{w1}) = 7) corresponding to two successive "ellipses" (circles) followed by five consecutive "line segments".

Fig. 1: Two hand-drawn sketches of two different objects: (a) a wheel object (which is referred to in the text as S_{w1}, with some illustrative labels and a dashed line), and (b) a bus object (referred to as S_{b1}).

To assist the learner presented later in §2, we base the building on a feature space that contains guiding information about all sketches of the set of objects to be learnt (cf. Figure 2). From all the sketched objects, we gather and record a specific number of qualitative representational (QR) criteria (which are limited in this paper to the following three kinds): relative size, positional relation, and similarity group. Relative sizes measure the drawn strokes' metric lengths compared to that of the whole sketch they compose. The size of the (whole) sketch is measured by calculating the perimeter of the canvas that contains it. Based on this size, values ranging from zero to the perimeter are discretized into intervals, each of which is assigned one of the relative sizes: tiny, small, medium, semi-medium, large, very large, and huge.
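As an illustrative sketch of the data structures defined above (in Python rather than the paper's Matlab; the names `Point`, `Stroke`, `Sketch`, and the helper functions are ours, not the paper's), a stroke can be held as a sequence of (x, y, t) tuples and a sketch as a sequence of strokes:

```python
from typing import List, Tuple

# A stroke is the sequence of points recorded between pen-down and
# pen-up; a sketch is the sequence of strokes in drawing order.
Point = Tuple[float, float, float]   # (x_i, y_i, t_i)
Stroke = List[Point]
Sketch = List[Stroke]

def start_point(s: Stroke) -> Point:
    """p_1: the pen-down point of stroke s."""
    return s[0]

def end_point(s: Stroke) -> Point:
    """p_{n(s)}: the pen-up point of stroke s."""
    return s[-1]

def successive_pairs(sketch: Sketch):
    """Return the successive stroke pairs (s_i, s_{i+1})."""
    return list(zip(sketch, sketch[1:]))
```

The successive pairs are exactly the ones over which positional relations are later computed.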
The relative size of a stroke is assigned according to the interval it belongs to. A positional relation is a binary relation between (the centroids of) the two geometric shapes corresponding to successive strokes in a sketch. For a sketch S = ⟨s_1, s_2, ..., s_{m(S)}⟩, positional relations R(s_{i+1}, s_i) on S × S determine the location of a stroke s_{i+1} relative to that of the stroke immediately sketched before it, s_i, for 1 ≤ i < m(S), where the positional relation R can be one of the following: equal (≡), up (↑), right (→), left (←), up-left (↖), up-right (↗), down (↓), down-left (↙), and down-right (↘). A similarity group is a collection of two or more geometric shapes of the same relative sizes. In other words, successive geometric shapes of the same type (e.g., ellipses) can be grouped together into one group if they have the same relative size.

For each sketch S = ⟨s_1, s_2, ..., s_{m(S)}⟩, we record a (constant) number M of selected features that reflect the qualitative representation of a sketch, QR(S). A QR feature vector, f_S, is used for recording numerical values ⟨f_i⟩_{i=1}^{M} as representatives of the stored feature values. The features we use here have the fixed ordering f_1, f_2, ..., f_M (cf. §2). If no stroke satisfies a given feature i, then f_i := 0. The feature vectors of all the N (learning sample) sketches are collected in the feature space F. If f_i ≠ 0, then the selected feature number i is satisfied by exactly f_i strokes (of each of the N sketches). The value f_i ≠ 0 means that either (i) f_i represents the number of geometric shapes (for the corresponding relative size features), (ii) f_i represents the number of pairs of geometric shapes (for positional relation features), or (iii) f_i represents the number of groups of n geometric shapes (for similarity group features, n ≥ 2).
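The nine positional relations above can be computed by comparing the two strokes' centroids: "equal" when the centroids (nearly) coincide, and otherwise one of eight 45-degree compass sectors. The following Python sketch is our own illustration (the paper does not give its exact decision rule, and the tolerance `eps` is an assumption), using screen coordinates in which y grows downward:

```python
import math

# Eight compass sectors, counter-clockwise from "right" (0 degrees).
RELATIONS = ["right", "up-right", "up", "up-left",
             "left", "down-left", "down", "down-right"]

def centroid(stroke):
    """Mean position of a stroke's recorded points (x, y, ...)."""
    xs = [p[0] for p in stroke]
    ys = [p[1] for p in stroke]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def positional_relation(s_next, s_prev, eps=1e-6):
    """R(s_{i+1}, s_i): location of s_next relative to s_prev."""
    (xn, yn), (xp, yp) = centroid(s_next), centroid(s_prev)
    dx, dy = xn - xp, yn - yp
    if math.hypot(dx, dy) < eps:
        return "equal"
    # Screen coordinates: y grows downward, so negate dy for "up".
    angle = math.degrees(math.atan2(-dy, dx)) % 360.0
    return RELATIONS[int(((angle + 22.5) % 360.0) // 45.0)]
```

For example, a stroke whose centroid lies below and to the right of its predecessor's is classified as "down-right".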
Thus, the QR feature space F is a matrix of N × (M + 1) integers, in which the N rows represent the N sketches, while its first M columns represent the M QR feature values of the sketches, and the (M + 1)st column is a class value (e.g., the sketched object's label; cf. the "class value" column in the right part of Figure 2).

1.2 Example: The S_{w1} wheel

The wheel in Figure 1a has been sketched in the following order: outer circle/ellipse, inner ellipse, then four lines starting at the down-right and moving clockwise.

Fig. 2: Constructing the feature space F from the N sketches.

For this object, our technique extracts several features: the relative size of the outer ellipse is medium, the smaller ellipse is tiny, and the four lines are tiny. The positional relation between the outer circle and the inner circle is that the inner is equal to the outer (w.r.t. centroid positions), but the first line is down-right of the inner ellipse, the second line is down-left of the first one, etc. The four lines are grouped together and create one group (cf. §1.1). Let S_{w1} = ⟨s_i⟩_{i=1}^{7} represent the wheel sketch in Figure 1a. Let s_1 and s_2 be the outer and inner "circles", respectively, and assume that their centroids have the same location. Let s_3 and s_4 be the right and left strokes, respectively, below the dashed line in the same figure (that is, s_4 is the line stroke immediately below the inner circle s_2, and s_3 is the stroke to its right in the figure¹). Instead of the prefix notation, ≡(s_2, s_1), of the positional relation "equal", we use the infix notation s_2 ≡ s_1 to indicate that s_2 is located at the same location as s_1. Since s_3 is a successor to s_2, and spatially located to the latter's "down-right", we write s_2 ↘ s_3. Similarly, s_4 ↙ s_3 indicates that the semi-vertical line segment, s_4, comes "down-left" of (and successor to) the leaning line segment, s_3.
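Assembling the N × (M + 1) feature space F described above can be sketched as follows (illustrative Python, not the paper's Matlab code; `build_feature_space` and the label-to-index encoding are our own conventions):

```python
# Stack N QR feature vectors into F: each row holds the M integer
# feature values followed by an integer class value in column M + 1.
def build_feature_space(feature_vectors, labels, class_names):
    """feature_vectors: N lists of M ints; labels: N class names."""
    assert len(feature_vectors) == len(labels)
    F = []
    for f, label in zip(feature_vectors, labels):
        F.append(list(f) + [class_names.index(label)])
    return F
```

Each row of F is then one training instance for the second learner.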
The five consecutive line strokes ⟨s_i⟩_{i=3}^{7} of the wheel S_{w1} of Figure 1a are successive and have the same relative size (which is "tiny" in this case). Therefore, these strokes are combined, forming a group labelled "group of three or more lines". The feature vector for S_{w1} contains values that reflect the following features of the wheel in Figure 1a: five tiny lines, one tiny circle, and one medium circle. This means that the value of the feature "tiny-line" is 5, of "tiny-circle" is 1, and of "medium-circle" is 1. The similarity group labelled "group of two circles" has the value 1.

2 Methodology

We present a Qualitative Representation used for Sketch Recognition Technique, or QuRSeR-Tech for short, which builds its classification model by learning a restricted set of common qualitative representations. QuRSeR-Tech is used for recognizing newly drawn sketches based on a previously given set of learning sketches. QuRSeR-Tech consists of two learning systems and, consequently, two kinds of datasets used for learning purposes. The first classifier utilizes HOG-based extracted features for recognizing drawn strokes as geometric shapes. The geometric shapes are considered the low-level representation constituents of a sketch. Here, these geometric shapes are "line", "arc", "ellipse", "poly-line", "triangle", "rectangle", or "polygon". For each geometric shape, we collect a set of hand-drawn sketches describing it. These shapes are used as a training dataset in order to learn and build a classifier model that, in turn, is able to classify a given stroke as one of the geometric shapes. To that aim, we extract the features that discriminate each kind of image. The first dataset contains hand-drawn strokes (the learning samples), each of which is classified into one of the geometric shapes. The second classifier uses QR-based extracted features for mapping a hand-drawn sketch to an object name.
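The grouping rule illustrated by the wheel's five tiny lines can be sketched as follows (our own Python illustration; the paper's implementation is in Matlab, and the group labels follow §1.1's naming):

```python
from itertools import groupby

# Merge runs of successive strokes that share the same geometric shape
# and the same relative size; runs of length >= 3 are labelled
# "three or more", runs of length 2 are labelled "two".
def similarity_groups(strokes):
    """strokes: list of (shape, relative_size) pairs, in drawing order."""
    groups = []
    for key, run in groupby(strokes):
        n = len(list(run))
        if n >= 2:
            shape, size = key
            label = "three or more" if n >= 3 else "two"
            groups.append((f"group of {label} {shape}s", size))
    return groups
```

For the wheel, the five successive tiny lines collapse into a single "group of three or more lines" feature.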
The second dataset contains hand-drawn sketches, each of which is classified into one of the object names: "airplane", "bicycle", "bus", or "house". QuRSeR-Tech's first learning system is called the StrokeToGeometric classifier (StoG), which builds its classifier model to recognize a hand-drawn stroke (as one of the aforementioned geometric shapes). After that, all recognized geometric shapes and their qualitative representations are stored together to build a new dataset that serves as input for the second learning system, which we call the SketchToObject classifier (StOb). The latter is the backbone of QuRSeR-Tech, and is used for classifying a hand-drawn sketch.

¹ Note that the labeling of the strokes is intended to reflect the successiveness of the strokes; e.g., s_4 is sketched after s_3. This is vital, especially because the presented positional relations are not all commutative.

2.1 StoG classifier: The 1st Learner

The histogram of oriented gradients (HOG) is a method based on evaluating well-normalized local histograms of image gradient orientations in a dense grid. HOG's basic idea is that local object appearance and shape can often be characterized rather well by the distribution of local intensity gradients or edge directions, even without precise knowledge of the corresponding gradient or edge positions [Dalal et al., 2006]. In order to extract the HOG features of an image, one starts with describing the structure of the image from which the HOG features are extracted. Each gray image is represented by a matrix IMG_{w×h} of integers varying between 0 and 255, with IMG(i, j) representing the intensity of the image at pixel (i, j), for 1 ≤ i ≤ w and 1 ≤ j ≤ h. The image IMG is resized to IMG′_{128×128}. The function extractHOGFeatures² is applied to extract the HOG features of the resized image IMG′; it allows setting the values of its parameters depending on one's needs.
The output of this function is the extracted HOG features of the gray image IMG′_{128×128}, returned as a 1-by-250 feature vector. These features encode local shape information from regions within an image. Therefore, for each image, we extract 250 features stored as an instance in the dataset. As a result, if we have k images, then we extract a feature space as a matrix called HogF_{k×250}, with one extra column attached that represents the class value (geometric shape) of each image; so the output is HogF_{k×251}. A support vector machine (SVM) [Scholkopf and Smola, 2001] is a supervised learning algorithm capable of building a classifier model by learning from training samples to analyze data used for classification. In our case, the SVM employs the extracted feature space HogF (of all images, as training samples for the learning process) to produce a classifier model, called the StrokeToGeometric classifier (StoG), which is able to classify a drawn stroke as one of the geometric shapes mentioned earlier. In order to use StoG for testing an unclassified image UIM_{w×h}, we go through the following steps: (i) resize the image into UIM′_{128×128}; (ii) extract its HOG features UIMHOG, which will be a 1 × 250 vector; (iii) apply StoG to UIMHOG, where the output will be the class value (geometric shape name) of this image. Figure 3 gives an example of a real case of a hand-drawn sketch (bus), in which 15 strokes have been drawn. By applying StoG to each one of the drawn strokes, we are able to classify them into: polygon, ellipse, ellipse, ellipse, line, line, line, ellipse, line, line, line, rectangle, rectangle, rectangle, and rectangle.

Fig. 3: QR feature vector of the sketched bus, f_{S_{b1}}, of Figure 1b.

² This is one of the predefined functions within the Matlab Computer Vision Toolbox.
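To convey the core idea behind extractHOGFeatures, the following Python fragment (our own rough illustration, not the Matlab toolbox implementation) computes a magnitude-weighted histogram of gradient orientations over a whole grayscale image; the full HOG descriptor additionally divides the image into cells and normalizes over overlapping blocks, which this sketch omits:

```python
import math

# Magnitude-weighted histogram of gradient orientations for a grayscale
# image given as a list of rows of intensities (0..255).
def orientation_histogram(img, n_bins=8):
    h, w = len(img), len(img[0])
    hist = [0.0] * n_bins
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            gx = img[i][j + 1] - img[i][j - 1]   # central differences
            gy = img[i + 1][j] - img[i - 1][j]
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue
            # Unsigned orientation in [0, 180), as with the unsigned
            # (signed orientation = false) setting used later in §4.
            ang = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(ang // (180.0 / n_bins)) % n_bins] += mag
    return hist
```

A vertical step edge, for example, puts all of its gradient energy into the horizontal-gradient bin, which is exactly the kind of edge-direction evidence the StoG classifier learns from.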
2.2 Qualitative Representation (QR) Identification

Our main goal in this paper is to study the effect of some qualitative representations on building an accurate classifier model that is able to recognize hand-drawn sketches. We present three qualitative representations, along with an argument for why they may be considered suitable candidates. Recall that each sketch S consists of a set of m(S) strokes, each classified as one of the mentioned geometric shapes.

Relative sizes of strokes: The absolute size of a stroke s := ⟨p_i⟩_{i=1}^{n(s)}, called size(s), is the accumulated distance between every two successive points p_i and p_{i+1} belonging to s (for i < n(s)). This can be formulated as size(s) = Σ_{i=1}^{n(s)−1} √((x_i − x_{i+1})² + (y_i − y_{i+1})²). The relative size is then calculated for each stroke w.r.t. the sketch it belongs to. In our case, we use only relative sizes based on the discretization of the absolute size into one of the following seven discrete values: tiny, small, medium, semi-medium, large, very large, huge. We conducted experiments with human subjects, in which the participants drew wheels of busses in different sizes, but almost always approximated a relative size (say "medium") w.r.t. the whole bus sketch. Furthermore, also based on our experiments, rectangles appear (as windows) in the sketches of a house object in large or very large relative sizes, whereas rectangles (as windows) in the sketches of bus objects appear in medium relative sizes (w.r.t. the whole bus sketch). Figure 3 shows that the relative sizes of the drawn strokes (polygon, ellipse, ellipse, ellipse, line, line, line, ellipse, line, line, line, rectangle, rectangle, rectangle, and rectangle) are very large, small, small, tiny, tiny, tiny, tiny, tiny, tiny, tiny, tiny, small, small, small, and small, respectively. We extracted 49 relative size features; for each one of the seven geometric shapes, we use seven relative size features (tiny, small, etc.).
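The size formula above translates directly into code. The following Python sketch is illustrative (the paper works in Matlab, and it does not publish its interval cut points, so the equal-width binning in `relative_size` is an assumption of ours):

```python
import math

# size(s): accumulated distance between successive points of a stroke,
# with points given as (x, y) pairs.
def stroke_size(points):
    return sum(math.hypot(points[i][0] - points[i + 1][0],
                          points[i][1] - points[i + 1][1])
               for i in range(len(points) - 1))

SIZE_NAMES = ["tiny", "small", "medium", "semi-medium",
              "large", "very large", "huge"]

def relative_size(stroke_points, canvas_perimeter):
    """Discretize size(s) / perimeter into the seven named intervals
    (equal-width bins; the paper's actual cut points are not given)."""
    ratio = min(stroke_size(stroke_points) / canvas_perimeter, 1.0)
    return SIZE_NAMES[min(int(ratio * 7), 6)]
```

A closed unit square drawn on a canvas of perimeter 40 thus has absolute size 4 and relative size "tiny".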
Therefore, the relative size features are: tiny lines, ..., huge lines; tiny arcs, ..., huge arcs; ...; tiny polygons, ..., huge polygons. In Figure 3, the relative size feature "very large polygon" is 1, which means that there is only one very large drawn polygon, and the relative size feature "tiny lines" is 6, which means that there are 6 drawn tiny lines.

Positional relations: A positional relation is calculated based on geometric properties of every two successive strokes, s_i and s_{i+1}, where the relation is calculated for s_{i+1} with respect to s_i (cf. §1.1). Thus, we make use of 441 features of the form s_i R s_{i+1}, where R ∈ {≡, ↑, →, ←, ↖, ↗, ↓, ↙, ↘} (i.e., 7 × 9 × 7 = 441). An example is shown in Figure 3, where the positional relation feature "rectangle R rectangle" (for one particular R) is set to 3, because there are three pairs of successive strokes satisfying that same positional relation feature.

Similarity groups: A similarity group is based on the human vision system, and aims to create groups of similar successive strokes. For instance, a participant drew many lines, of similar relative sizes, inside the wheel. These lines are grouped together to form one group, which may contain two geometric shapes or more than two geometric shapes. Hence, we extracted 14 features; each one of the 7 geometric shapes has the corresponding 2 kinds of similarity groups (a group of two geometric shapes, or a group of three or more geometric shapes). The similarity groups are: group of two lines, group of three or more lines, ..., group of two polygons, group of three or more polygons.

2.3 StOb classifier: The 2nd Learner

The three kinds of extracted features for each sketch are stored together to form a matrix called the QR feature space F (as mentioned in §1.1), which comprises 504 features (i.e., 49 relative sizes, plus 441 positional relations, and 14 similarity groups).
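The 441-column positional-relation block of F has a natural index layout: one column per (shape of s_i, relation, shape of s_{i+1}) triple. The mapping below is our own illustration of such an indexing scheme (the paper does not specify its column order):

```python
# One column per (shape_prev, relation, shape_next) triple:
# 7 shapes x 9 relations x 7 shapes = 441 columns.
SHAPES = ["line", "arc", "ellipse", "poly-line",
          "triangle", "rectangle", "polygon"]
RELS = ["equal", "up", "right", "left", "up-left",
        "up-right", "down", "down-left", "down-right"]

def relation_feature_index(shape_prev, rel, shape_next):
    """Column index (0..440) of the feature 's_i R s_{i+1}'."""
    return (SHAPES.index(shape_prev) * len(RELS) * len(SHAPES)
            + RELS.index(rel) * len(SHAPES)
            + SHAPES.index(shape_next))
```

Counting a sketch's successive-stroke pairs into these columns yields exactly the integer values described in §1.1, e.g. the value 3 for three successive rectangle pairs in the same relation.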
3 Experimental Results and Discussion

We aim at building a classifier model able to recognize a hand-drawn sketch as an object name. To achieve this goal, two datasets have been collected through two experiments, in order to provide our classification processes with training data from which our learning system can learn and build its model. To evaluate our classifiers, we use a k-fold cross-validation technique with k = 10.

Fig. 4: (a) Stroke to Geometric shape classifier's confusion matrix. (b) Sketch to Object classifier's confusion matrix.

1. Hand-drawn Geometric Shapes Dataset: In the first dataset, we collect many hand-drawn strokes that represent the constituents of any hand-drawn sketch and extract their HOG features to build a HOG feature space. These strokes are classified into one of the following geometric shapes: line, arc, ellipse/circle, poly-line, triangle, rectangle, or polygon. For this purpose, a Matlab GUI interface has been implemented for sketching and collecting hand-drawn geometric shapes. Using this interface, the experiment was conducted on 6 subjects, where each subject sketched all 5 basic geometric shapes with different scales, orientations, and positions. We collected 250 hand-drawn geometric shapes and converted them into a HOG feature space. The core point is to apply a machine learning algorithm, such as a support vector machine (SVM), to the HOG feature space, and obtain a classifier, named the StoG classifier, that is used for recognizing any sketched stroke and classifying it as one of the following geometric shapes: line, arc, ellipse, poly-line, or polygon. The StoG classifier records an accuracy of 88.4% for recognizing hand-drawn geometric shapes. Further mathematical analysis is made to identify a polygon as one of the following: triangle, rectangle, or polygon.
Figure 4a shows the StrokeToGeometric classifier's confusion matrix, in which arcs and ellipses can be recognized with accuracy greater than 90%, while the other shapes can be recognized with accuracy greater than 83%. We note that 10% of the poly-lines are recognized as polygons, so we use further processing to decide whether the shape is open or closed.³

2. Hand-drawn Sketches Dataset: In this dataset, we collect many hand-drawn sketches for four specific objects: airplane, bicycle, bus, and house. Using a Matlab GUI interface, a second experiment has been conducted on 12 subjects, where each subject drew 20 sketches, five for each one of the four objects. The qualitative representations QR of all those sketches are extracted and stored in the QR feature space, F. By applying an SVM to the feature space F, the SketchToObject classifier classifies a given hand-drawn object as one of the mentioned 4 objects with 80% accuracy. Figure 4b shows the SketchToObject classifier's confusion matrix, in which all objects are classified correctly with percentage greater than 80%. In order to show the effects of each QR feature (relative size, similarity group, and positional relation), we build and measure the accuracy of the following 7 models: relative size; similarity group; positional relation; relative size and similarity group; relative size and positional relation; similarity group and positional relation; and all QR. Figure 5 shows that for each one of the 7 cases, 4 classifier models are tested (linear SVM without using PCA for feature selection, linear SVM using PCA, quadratic SVM without PCA, and quadratic SVM with PCA). The results show that the three QR features combined play an important role in achieving the best accuracy. One interesting note is that, without using the positional relation, we still obtain an accurate classifier; this implies that this feature needs further refinement.
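The 10-fold evaluation protocol used for both classifiers can be sketched as follows (an illustrative Python fragment of the index splitting only; the paper uses Matlab's Classification Learner, and the sequential fold assignment here is our own simplification, omitting shuffling and stratification):

```python
# Split n sample indices into k folds; each fold serves once as the
# test set, with the remaining indices used for training.
def k_fold_indices(n, k=10):
    folds = []
    base, extra = divmod(n, k)
    start = 0
    for f in range(k):
        size = base + (1 if f < extra else 0)
        test = list(range(start, start + size))
        train = [i for i in range(n) if i not in set(test)]
        folds.append((train, test))
        start += size
    return folds
```

The reported accuracies are then the averages of the per-fold test accuracies over the k folds.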
All experiments have been implemented and tested on Windows 10, Matlab 2016a, the Computer Vision Toolbox, and the Classification Learner application. All sketches have been collected via a WACOM pen and WACOM touch screen, and some sketches from the dataset in [Eitz et al., 2012] are used as guiding images for the participants.

4 Conclusion

Understanding what hand-drawn sketches represent, in a way similar to what humans do, is a challenging problem for computation. The representation of sketches using qualitative features plays a key role in this understanding. Qualitative models that can be used to represent sketches are mainly based on two aspects. The first aspect is the way of segmenting an object's contour into meaningful edges. The second aspect is the level of detail for describing an object. The way of handling these aspects can distinguish one model from another.

³ E.g., the distance between the start- and end-points of the drawn stroke is compared to the length of the drawn stroke.

Fig. 5: SketchToObject classifiers' accuracy.

Here, a technique called QuRSeR-Tech has been proposed and implemented for recognizing hand-drawn sketches. Our technique consists of two main stages. The first is the StoG classifier, which uses the function extractHOGFeatures for extracting the HOG features of each stroke, and a quadratic support vector machine (SVM) that is based on the HOG features for recognizing a drawn stroke as a geometric shape. To get an accurate classifier, we set the parameters cell size, block size, block overlap, number of bins, and signed orientation to [4 4], [2 2], [50% 50%], 8, and false, respectively. The StoG classifier recorded 88.4% accuracy for recognizing strokes as geometric shapes. Second, the StOb classifier builds a new representation for any hand-drawn sketch by recording three kinds of qualitative representations: relative size, positional relation, and similarity group.
Based on this new representation and a quadratic support vector machine, the StOb classifier is able to recognize hand-drawn sketches with an accuracy of 80%. Some comparative studies have been conducted to see the effect of each one of the three kinds of QR, in which we showed that the StOb classifier's accuracy decreases if any of the mentioned representations is missing. In both classifiers, we apply PCA to reduce the features to only those that are highly correlated with the class value.

The ideas of this paper are connected to others in the literature (such as [Wang et al., 2013], [McLure et al., 2011], [Paulson and Hammond, 2008], and [Lovett et al., 2006], to mention a few). [Wang et al., 2013] give a method based on fuzzy hybrid features to classify strokes into geometric primitives. Their human-computer interactive system is developed for detecting ambiguous results and then revising the misrecognitions, where their system is developed to classify eight primitive shapes. PaleoSketch is also a primitive shape recognizer that classifies single strokes into certain primitive geometric shapes (cf. [Paulson and Hammond, 2008]). [McLure et al., 2011] describe a higher-level qualitative representation based on edge-cycles, which are sequences of edges connected end-to-end, whose last edge connects back to the first edge. The results of the edge-cycle representation outperform those of the edge-level representation, produced by CogSketch in [Lovett et al., 2006], in learning to classify hand-drawn sketches of everyday objects. Despite the remarkable performance of the edge-cycles representation, it has its own set of problems that can lead to misclassification of sketched concepts (cf. [McLure et al., 2011]).

References

[Cohn and Renz, 2008] Cohn, A. G. and Renz, J. (2008). Qualitative spatial representation and reasoning. Foundations of Artificial Intelligence, 3:551–596.
[Dalal et al., 2006] Dalal, N., Triggs, B., and Schmid, C. (2006). Human detection using oriented histograms of flow and appearance. In European Conference on Computer Vision, pages 428–441. Springer.
[Eitz et al., 2012] Eitz, M., Hays, J., and Alexa, M. (2012). How do humans sketch objects? ACM Trans. Graph., 31(4):1–10.
[Lovett et al., 2006] Lovett, A., Dehghani, M., and Forbus, K. D. (2006). Efficient learning of qualitative descriptions for sketch recognition. In Proceedings of the 20th International Workshop on Qualitative Reasoning.
[McLure et al., 2011] McLure, M. D., Friedman, S. E., Lovett, A., and Forbus, K. D. (2011). Edge-cycles: A qualitative sketch representation to support recognition. In Proceedings of the 25th International Workshop on Qualitative Reasoning.
[Paulson and Hammond, 2008] Paulson, B. and Hammond, T. (2008). PaleoSketch: Accurate primitive sketch recognition and beautification. In Proceedings of the 13th International Conference on Intelligent User Interfaces, IUI '08, pages 1–10, New York, NY, USA. ACM.
[Scholkopf and Smola, 2001] Scholkopf, B. and Smola, A. J. (2001). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
[Wang et al., 2013] Wang, S., Wang, G., Gao, M., and Yu, S. (2013). Using fuzzy hybrid features to classify strokes in interactive sketches. Advances in Mechanical Engineering, 5:259152.