Keypoints selection using Evolutionary Algorithms

                                                         David Adamczyk, Jan Hůla

                                       Institute for Research and Applications of Fuzzy Modeling
                                                        david.adamczyk@osu.cz
                                                            jan.hula@osu.cz
                                   University of Ostrava, 30. dubna 22, 701 03 Ostrava, Czech Republic

Abstract: This contribution presents the use of neural net-               2   The role of Keypoints in Computer Vision
works trained by an evolutionary algorithm for a selection
of visual keypoints. Visual keypoints play an important
role in many computer vision tasks but many algorithms
for keypoint detection produce many keypoints which are
not useful for the target task. We aim to filter them in a
data-driven way. Our model uses a neural network that
ranks each keypoint by a relevancy score that we use to
choose top-K keypoints with the highest rank. These key-
points are then used for the target task, which is image
classification in our case. Because we use discrete opera-
tions in our model, we can not easily obtain gradients for
weight updates. We, therefore, optimize the weights of the
network by CMA-ES algorithm, which enables efficient
optimization of continuous parameters of black-box func-
tions. In this article, we present our initial experiments
with this method.


1    Introduction
In many problems of machine learning, we deal with sets
of features that describe objects we want to process. In
some cases, these sets are ordered, like, for example, in
tabular datasets where each type of feature corresponds
to one column. In other cases, they are unordered, like,
for example, with visual keypoint descriptors that are ex-
tracted by keypoint detection algorithm. Feature selec-                   Figure 1: An illustration of the keypoint selection process.
tion is a well known preprocessing step for the case where                Our algorithm learns to select keypoints which are relevant
these features are ordered. In the case of tabular datasets,              for image classification.
we may, for example, filter out specific columns of the
table which contain redundant or unrelated information.                      Keypoints, sometimes also called points of interest or
Feature selection is much harder for unordered sets of fea-               visual keypoints, are points in an image represented by co-
tures because we need to figure out which features to throw               ordinates of pixels. They are usually used to extract local
out for each example separately based on the content of                   feature descriptors that provide a representation of an im-
these features. Here we present an approach for feature se-               age that is invariant with respect to transformations such as
lection of unordered sets where each feature is ranked by                 translation, rotation, scale, or other affine transformations.
a neural network whose weights are learned by a variant                      There are many well-known algorithms for keypoint de-
of evolutionary algorithm called Covariance Matrix Adap-                  tection. Many of them are based on edge detection, corner
tation Evolution Strategy (CMA-ES). We showcase it for                    detection, or blob detection. The best-known algorithms
the problem of visual keypoint selection, which could be                  are probably SIFT (Scale-invariant feature transform) [1]
viewed as a preprocessing step for many computer vision                   or SURF (Speeded up robust features) [2]. In the SIFT al-
algorithms.                                                               gorithm, the keypoint locations are defined as maxima and
                                                                          minima of the result of the difference of Gaussians applied
                                                                          in scale space to a series of smoothed and resampled im-
      Copyright c 2020 for this paper by its authors. Use permitted un-
der Creative Commons License Attribution 4.0 International (CC BY         ages. SURF algorithm is based on square-shaped filters as
4.0).                                                                     an approximation of Gaussian smoothing.
   Keypoints are especially useful for tasks such as image      4     CMA-ES
registration, where we want to transform multiple images
into one coordinate system or 3D reconstruction, where          Evolutionary Strategies represent a subclass of optimiza-
we want to find 3D coordinates of each pixel. In both           tion algorithms inspired by natural selection. They are
of these tasks, we need to match keypoints from two or          stochastic, derivative-free methods aimed for numerical
more images by a matching algorithm. We assume that             optimization of non-linear or non-convex continuous op-
the keypoints we are trying to match come from images           timization problems. They search through the space of
capturing the same scenery but from different viewpoints        continuous parameters by using a mutation and selection
or lighting conditions. Therefore all images we are trying      operators, which are interleaved in an iterative process.
to match should contain similar keypoint descriptors. In        One iteration (mutation and selection) of this process cor-
other tasks, such as image classification, this assumption      responds to one generation. In Evolution Strategies, the
may no longer hold. In image classification, we may try to      mutation operator produces each individual by sampling
classify each image by matching its keypoint descriptors        from a multivariate gaussian distribution with mean µ and
to keypoint descriptors of prototype images of different        covariance matrix Σ. The Selection operator selects n best
classes. We would assign the image to a class of the proto-     individuals from the current generation using the function
type image with the best match. The problem arises when          f we are trying to optimize. We treat this function as a
images of objects in the same class contain, for example,       black box. Therefore we can evaluate it for an individ-
different backgrounds, which will be the case most of the       ual x to get its fitness f (x), but otherwise, we have no as-
time. Keypoints detected outside of the object may not be       sumptions about it. The next generation is sampled from a
correlated with the class, and this may decrease the accu-      multivariate gaussian distribution with parameters depend-
racy of the classification. For this reason, keypoints are      ing on selected individuals from the previous generation.
not very often used for tasks such as image classification.     Here are the steps of the algorithm for generic distribution
This could be changed if we can filter out keypoints that       parametrized by θ :
are not relevant for the task at hand, which is a problem
we address in this contribution.                                    1. Generate population of individuals Λ = {x1 , . . . , xn }
                                                                       where xi ∼ pθ (x).

                                                                    2. Compute a fitness score f (x) for all individuals in Λ.
3   Optimization of non-differentiable                              3. Select subset of individuals with the best score.
    objective functions with continuous                                Based on this subset, update the value of parameter
    parameters                                                         θ.

                                                                    4. Repeat.

Deep Neural Networks, which are currently the most pop-           In all types of Evolution Strategies, we adapt the mean
ular approach in Machine Learning and Computer Vision,          µ for the gaussian distribution. In CMA-ES we also adapt
are most of the time trained by variants of gradient de-        the covariance matrix Σ. This allows the algorithm to
scent algorithm that require differentiable objective func-     adapt the step size for each dimension of the parameter
tion. The objective function itself is a proxy for the metric   vector x separately. Adaptation of the covariance matrix
we really care about. For example, in classification, we        can be seen as an estimation of a second-order model of
mostly care about classification accuracy, but we use an        the underlying function f . It is similar to the approxi-
objective function such as cross-entropy for training be-       mation of the inverse Hessian matrix in the quasi-Newton
cause accuracy is not differentiable, but cross-entropy is.     method. For the concrete form of the update equations for
For optimization problems in which the model makes dis-         µ and Σ see the tutorial in [3].
crete choices during processing, the objective function will
not be differentiable, and therefore such problems are of-
ten approached with Reinforcement Learning where gra-           5     Model and task description
dients are estimated by various gradient estimation tech-
niques or by Evolutionary Algorithms which do not use           Our task was inspired by [4] where a Reinforcement
gradients but perform a kind of random search. In our           Learning agent is trained to play a game from a screen
case, we need to optimize the parameters of a neural net-       of pixels. The agent uses an attention module which se-
work, which will rank each keypoint by a relevancy score.       lects positions in the screen where the agent should pay
We choose top-K keypoints with the highest relevancy            its attention and filters out the rest of the screen. Authors
score, and this discrete choice makes our objective func-       of this method showed that the agent learned to pay atten-
tion non-differentiable. Our choice of the optimization al-     tion to objects/positions, which are essential for the game.
gorithm is CMA-ES, which is described in the next sec-          We adapt attention module described in [4] together with
tion.                                                           the learning algorithm (CMA-ES) and apply it to the task
of keypoint selection, which we believe could have a high               as weights of two 1-layer linear neural networks. The
practical impact.                                                       parameter vector φ has a dimension D · Ξ + D · Ξ and the
   We assume that we are given a set of N keypoints in an               function INIT only splits this vector in half and reshapes
image extracted from some keypoint detection algorithm                  the two parts to a matrix. On lines 3 and 4 we create 2
and we want to choose a subset of K keypoints (with K                  low-dimensional representations of each keypoint descrip-
N) that will contain a useful information for the task at               tor and store them to matrices Q and S. On line 5 we com-
hand.                                                                   pute inner products between these two low-dimensional
   The task we consider here is image classification. For               representations for each pair of keypoints. For a better
each class, we assume R prototype images1 which are rep-                intuition, the inner product between Q-representation of
resentative examples of the class. Each image is repre-                 keypoint k1 and W -representation of keypoint k2 can be
sented by a set of keypoints. For a new image, we can                   seen as a vote that k1 is giving to k2 . On line 6 we normal-
match its keypoints to keypoints of prototype image p and               ize these inner products for each column of matrix P by a
obtain a matching score. To match the keypoints, we solve               softmax function so that the whole column sums to 1. This
a linear assignment problem where the cost for every pair               can be seen as a restriction for the keypoint k1 to distribute
of keypoints is computed by a distance between their key-               its one vote to all other keypoints. On line 7 we compute
point descriptors. The matching score is computed by                    the score of each keypoint ki by summing the votes from
summing the distances between matched keypoints. Af-                    every other keypoint k j . On line 8 we sort the indices of
ter averaging the matching score over R prototype images                keypoints by their score. On lines 9 and 10 we choose
in each class, we can assign the new image to the class                 top-K indices and select keypoint descriptors that belong
with the highest average score.                                         to these indices from a matrix A. These are then returned
   Given the type of distance function used for each match-             from the function.
ing pair of keypoints2 , the average matching score depends                We optimize the parameters φ with respect to the fol-
only on the sets of keypoints used for matching. Therefore,             lowing criterion c:
if we have a training dataset X = ((x1 , y1 ), . . . , (xM , yM )) of
                                                                                               M
images xi ∈ RW ×H with labels yi ∈ 1, . . . , P, where W , H, P
                                                                                      c(φ ) = ∑ 1(class( fφ (xi )), yi ),
are width and height of an image and a number of classes                                      i=1
respectively, we can optimize a function fφ which selects
                                                                        where 1 is an indicator function and class: RK×D → N is
a subset of K keypoints from the set of J keypoints, based
                                                                        the function which assigns the class to an image based on
on a criterion c which measures how many images were
                                                                        the selected keypoints as described above. Therefore we
assigned to the correct class. We describe the optimization
                                                                        have an optimization problem in the form of:
criterion c later in this section.
   The function fφ : RN×D → RK×D gets as an input a set
                                                                                             φ = arg max c(φ ).
of N real-valued vectors (keypoint descriptors) of dimen-                                               φ
sion D and returns a subset of K elements from it. It is                    As in [4], we approach it with CMA-ES algorithm.
the attention module which we adapted from [4]. Here
we provide a pseudo-code implementation of this function
with an informal description below.                                     6     Experiments
  1: function GET- TOP - K (A, φ )                                      We conduct two types of experiments that test the viability
  2:     F1 , F2 ← INIT(φ )                                             of our method. The first experiment was conducted on a
  3:     Q ← A · F1                                                     synthetically generated dataset, where the hardness of the
  4:     S ← A · F2                                                     problem could have been controlled manually. The second
  5:     P ← ST · Q                                                     set of experiments was conducted with a realistic dataset
  6:     V ← COLUMN - WISE - SOFTMAX(P)                                 called Willow-Objects [5].
  7:     z1 ← ∑ j Vi j
  8:     z2 ← ARGSORT(z1 )
                                                                        6.1   Synthetic dataset with gaussian feature
  9:     ixs ← z2 [0: K]
                                                                              descriptors
 10:     S ← A[ixs]
 11:     return S                                                       This experiment enabled us to start from an easy case
 12: end function                                                       where we expected the algorithm to work and then in-
   The input to the function is a matrix A ∈ RN×D with one              crementally make it harder by either making feature de-
keypoint descriptor per row and a vector of learnable pa-               scriptors noisier or adding distracting keypoints. Here we
rameters φ . On line 2 we initialize matrices F1 , F2 ∈ RD×Ξ            describe the final form of the dataset for which we show
where Ξ is a hyperparameter3 . These matrices can be seen               results in the Table 1.
                                                                           The dataset contains 10 classes with 1500 examples
    1 In our case R ∈ {5, 10}.                                          per class, where each example is represented by 20 fea-
    2 In our experiments, we use cosine and euclidean distance.         ture descriptors. Each feature descriptor is in turn a 200-
    3 In our case it is 10.                                             dimensional vector. Therefore the dataset has a form X =
{(x1 , y1 ), . . . , (xM , yM )} where M = 15000, xi ∈ R20×200   gorithm with more confidence, without worrying whether
and yi ∈ {0, . . . , 9}.                                         problems arise due to the algorithm or due to the dataset.
    We wanted to model the fact that some classes could          We split the whole dataset to 10000 training examples and
share the same types of feature descriptors. Each type           4900 testing examples. The remaining 100 examples are
of feature descriptor is represented by a mean µj and co-        used as prototypes to which we match the selected feature
variance matrix Σ j . Particular feature descriptors are then    descriptors (10 prototypes per class). After we achieved
sampled from Gaussians with these parameters. The type           85% accuracy on the test set, we moved to experiments
of feature descriptor could be viewed as an abstract object,     with realistic examples.
such as "eye," and the particular sample can be viewed as a
particular image of an eye. We want some types of feature
descriptors to be unique to a class and others to be shared
by more classes. For example, an eye could be a useful           6.2   Experiments on Willow-Object dataset
type of feature descriptor, which is shared by classes such
as "DOG" or "CAT," but it is not present in classes such as
"CAR" or "APPLE." To model this, we represent classes            Our next set of experiments was conducted with the
as leaves in a binary tree where classes that share more         dataset called Willow-Object [5]. We choose this dataset
common ancestors will share more common types of fea-            because it contains annotated keypoints, which we could
ture descriptors. Every class/leaf is assigned six unique        use for the debugging of the algorithm. The dataset con-
feature descriptors, and every common ancestor adds one          tains 5 classes, each with 40 example images. To obtain
shared type of feature descriptor. To get ten classes, we        feature descriptors for an image, we use a keypoint de-
sample ten random paths in a binary tree, which is five          tector in the SIFT algorithm to obtain 400 keypionts per
levels deep. This will produce ten types of feature de-          image. For each keypoint in the image, we extract a vector
scriptors (or ten indices j indexing the parameters µj and       of activation values from two layers of a pre-trained neu-
Σ j ) per class. These types of feature descriptors contain      ral network in the spatial position corresponding to that
information that is relevant for classification. Every exam-     keypoint. Concretely, we use VGG-11 pre-trained on Im-
ple xj from the same class c will contain samples from 10        ageNet and layers named relu4_2 and relu5_1 . For each
Gaussians, which correspond to the class c.                      keypoint we obtain a 1024-dimensional vector. Therefore,
    We also want to model distracting feature descriptors        each example xi ∈ R400×1024 is represented by 400 1024-
that would correspond to background clutter. For these,          dimensional vectors from which we wanted to select 10
we reserve 60 unique types of feature descriptors repre-         vectors used for matching.
sented by 60 new values for the index j. For each exam-
ple xi , we first sample 10 indices from these 60 reserved          In this experiment, we found that cosine distance pro-
and then sample random vectors from distributions cor-           duces much better matches than the euclidean one. We
responding to these indices. Together, each example in           figured this out, with the help of manually annotated key-
the class c is represented by 20 200-dimensional vectors.        points in this dataset. Each image contains 10 annotated
Ten vectors are always sampled from the same distribu-           keypoints which correspond to parts of the object. For
tions corresponding to class c, and ten vectors are sampled      example, each image in the class "FACE" will contain
from ten randomly chosen distributions from the pool of          keypoints for eyes, nose, mouth, etc. Together, there
60, for each new example separately.                             are 5 × 10 = 50 types of different keypoints. We can,
    The parameters of each gaussian corresponding to one         therefore, measure an average distance between each pair
type of feature descriptor are sampled randomly, but in          of keypoint types, by computing distances of all possi-
such a way that the gaussian ellipsoids corresponding to         ble pairs from two types of keypoints, e.g., "EYE" and
different types of feature descriptors are well separable in     "WHEEL", and averaging them. Naturally, we would like
the 200-dimensional space. To achieve this, we set the           the average distance to be smallest between keypoints of
means of these Gaussians to coordinates of corners of the        the same type so that the matcher is encouraged to match
unit 200-simplex. Concretely, the mean µj will have 1 on         keypoints of the same type together. We found that this
position j and zeros everywhere else. The covariance ma-         property holds for cosine distance, but not for euclidean
trix is sampled from ranges that guarantee that the sam-         distance, as shown in Figure 3. We speculate that this may
ples are well separable. In order to make the task more          be due to the fact that the keypoint descriptors extracted
difficult, we process each sample with a randomly initial-       from a pre-trained network may be rather sparse because
ized 2-layer neural network, which preserves the dimen-          they are activation values of the ReLU function and for
sion. This will nonlinearly deform the space, and also, the      such descriptors, cosine distance may work better.
resulting clusters will not be axis-aligned.                       We split the dataset in such a way that for each class,
    Our model is trained to select 10 out of 20 feature de-      we have 30 training examples, five prototype examples,
scriptors that are relevant for classification. Because we       and five test examples. As shown in Table 1, we achieved
knew exactly which feature descriptors contain the rele-         perfect accuracy on the train and the test set. The selected
vant information, this dataset allowed us to develop the al-     keypoints are shown in Figure 2.
Figure 2: Examples of selected keypoints. Top row: keypoints from SIFT keypoint detector, Bottom row: 10 keypoints
selected by our algorithm.


                                                                                            distance    Train acc.    Test acc.
                                                                  Synthetic dataset        euclidean       0.86         0.85
                                                                Willow-Object-SIFT          cosine           -          0.74
                                                                Willow-Object-SIFT         euclidean         -          0.34
                                                                Willow-Object-filtered      cosine          1.0          1.0

                                                               Table 1: Training and testing accuracy for our experi-
                                                               ments. Willow-Object-SIFT uses all keypoints extracted
                                                               by SIFT. Willow-Object-filtered uses keypoints selected
                                                               by our algorithm.
Figure 3: A visualization of a distance matrix for all
pairs of keypoint types. An element on row i and col-
umn j corresponds to an average distance between all key-      saccadic eye movements. In most cases, the attention is
points of type i and all keypoints of type j. Cosine dis-      differentiable and modeled by a softmax distribution over
tance (RIGHT) works much better than euclidean distance        possible positions [8]. This kind of attention is called soft
(LEFT) for matching keypoints of the same type together.       attention as opposed to hard attention [9–11], where dis-
                                                               crete choices are made, usually with a maximum operator.
                                                               Whereas soft attention modules can be trained with gra-
7   Related work                                               dient descent, hard attention modules were usually trained
                                                               with Reinforcement Learning. As far as we know, [4] were
The approach described in this work is based on tech-          the first one to use CMA-ES to train a parallel hard atten-
niques used in [4], which uses parallel selective attention    tion module. We tried to adapt their method to keypoint
to identify essential parts of the environment. This kind of   filtering, which we believe can have a more immediate
parallel attention mechanism was popularized by [6]. Pre-      practical impact. Our contribution is also related to other
viously approaches modeling attention [7] were sequential      articles about keypoint selection and discovery. In [12] he
in nature, in Computer Vision, for example, mimicking          authors described a method named Iterative Keypoint Se-
lection, where the main idea is to select representative key-          [4] Y. Tang, D. Nguyen, and D. Ha, “Neuroevolution of self-
points in the first step and then filter them using a distance-            interpretable agents,” arXiv preprint arXiv:2003.08165,
based rule. Keypoints for which the distance is higher than                2020.
a predefined threshold are removed in an iterative fashion.            [5] M. Cho, K. Alahari, and J. Ponce, “Learning graphs to
   The Transporter [13] is a neural network architecture for               match,” in Proceedings of the IEEE Interational Confer-
unsupervised keypoint discovery from video frames. The                     ence on Computer Vision, 2013.
discovered method enables two notable results in control               [6] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones,
domains. Using the keypoint co-ordinates and correspond-                   A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all
ing image features as input enables highly sample-efficient                you need,” in Advances in neural information processing
                                                                           systems, pp. 5998–6008, 2017.
reinforcement learning, and learning to explore by con-
trolling keypoint locations drastically reduces the search             [7] X. Liu and M. Milanova, “Visual attention in deep learning:
                                                                           a review,” Int. Robot. Automat. J, vol. 4, pp. 154–155, 2018.
space. Another architecture named KeypointNet [14] is
used for the detection and discovery of 3D keypoints from              [8] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudi-
                                                                           nov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neu-
2D images. Their model discovers geometrically and se-
                                                                           ral image caption generation with visual attention,” in Inter-
mantically consistent keypoints across viewing angles and                  national conference on machine learning, pp. 2048–2057,
instances of an object category.                                           2015.
                                                                       [9] V. Mnih, N. Heess, A. Graves, et al., “Recurrent models
                                                                           of visual attention,” in Advances in neural information pro-
8   Discussion
                                                                           cessing systems, pp. 2204–2212, 2014.
                                                                      [10] J. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple ob-
In our experiments, we worked with datasets that con-
                                                                           ject recognition with visual attention,” arXiv preprint
tained annotated keypoints in order to test our hypothesis                 arXiv:1412.7755, 2014.
that the algorithm will be able to recover the relevant key-
                                                                      [11] G. Elsayed, S. Kornblith, and Q. V. Le, “Saccader: improv-
points that we knew were present. These datasets enabled                   ing accuracy of hard attention models for vision,” in Ad-
easier debugging and development of the algorithm. In                      vances in Neural Information Processing Systems, pp. 702–
the future, we will continue our work with more realistic                  714, 2019.
datasets, which contain more training and testing exam-               [12] W.-C. Lin, C.-F. Tsai, Z.-Y. Chen, and S.-W. Ke, “Key-
ples and also more classes of objects. Also, we would like                 point selection for efficient bag-of-words feature generation
to test our approach with manually specified keypoints for                 and effective image classification,” Information Sciences,
the prototype images where the user would be able to spec-                 vol. 329, pp. 33–51, 2016.
ify which parts of the object are useful for classification.          [13] T. D. Kulkarni, A. Gupta, C. Ionescu, S. Borgeaud,
Lastly, we will try to incorporate spatial constraints be-                 M. Reynolds, A. Zisserman, and V. Mnih, “Unsupervised
tween keypoints for the matching algorithm, which should                   learning of object keypoints for perception and control,”
make the classification more robust.                                       in Advances in neural information processing systems,
                                                                           pp. 10724–10734, 2019.
                                                                      [14] S. Suwajanakorn, N. Snavely, J. J. Tompson, and
9   Conclusion                                                             M. Norouzi, “Discovery of latent 3d keypoints via end-to-
                                                                           end geometric reasoning,” in Advances in neural informa-
In this article, we described and evaluated a method for                   tion processing systems, pp. 2059–2070, 2018.
keypoint selection, which is based on the attention module
from [4]. We evaluated the method in proof-of-concept
experiments and showed that it could select a small sub-
set of relevant keypoints from a large set of generic key-
points. We also showed the importance of the right dis-
tance function when matching individual keypoints. We
hope that such developments will enable the use of key-
points in tasks where they are not standardly used.


References
 [1] G. Lowe, “Sift-the scale invariant feature transform,” Int. J,
     vol. 2, pp. 91–110, 2004.
 [2] H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded
     up robust features,” in European conference on computer
     vision, pp. 404–417, Springer, 2006.
 [3] N. Hansen, “The cma evolution strategy: A tutorial,” arXiv
     preprint arXiv:1604.00772, 2016.