Keypoints selection using Evolutionary Algorithms David Adamczyk, Jan Hůla Institute for Research and Applications of Fuzzy Modeling david.adamczyk@osu.cz jan.hula@osu.cz University of Ostrava, 30. dubna 22, 701 03 Ostrava, Czech Republic Abstract: This contribution presents the use of neural net- 2 The role of Keypoints in Computer Vision works trained by an evolutionary algorithm for a selection of visual keypoints. Visual keypoints play an important role in many computer vision tasks but many algorithms for keypoint detection produce many keypoints which are not useful for the target task. We aim to filter them in a data-driven way. Our model uses a neural network that ranks each keypoint by a relevancy score that we use to choose top-K keypoints with the highest rank. These key- points are then used for the target task, which is image classification in our case. Because we use discrete opera- tions in our model, we can not easily obtain gradients for weight updates. We, therefore, optimize the weights of the network by CMA-ES algorithm, which enables efficient optimization of continuous parameters of black-box func- tions. In this article, we present our initial experiments with this method. 1 Introduction In many problems of machine learning, we deal with sets of features that describe objects we want to process. In some cases, these sets are ordered, like, for example, in tabular datasets where each type of feature corresponds to one column. In other cases, they are unordered, like, for example, with visual keypoint descriptors that are ex- tracted by keypoint detection algorithm. Feature selec- Figure 1: An illustration of the keypoint selection process. tion is a well known preprocessing step for the case where Our algorithm learns to select keypoints which are relevant these features are ordered. In the case of tabular datasets, for image classification. we may, for example, filter out specific columns of the table which contain redundant or unrelated information. Keypoints, sometimes also called points of interest or Feature selection is much harder for unordered sets of fea- visual keypoints, are points in an image represented by co- tures because we need to figure out which features to throw ordinates of pixels. They are usually used to extract local out for each example separately based on the content of feature descriptors that provide a representation of an im- these features. Here we present an approach for feature se- age that is invariant with respect to transformations such as lection of unordered sets where each feature is ranked by translation, rotation, scale, or other affine transformations. a neural network whose weights are learned by a variant There are many well-known algorithms for keypoint de- of evolutionary algorithm called Covariance Matrix Adap- tection. Many of them are based on edge detection, corner tation Evolution Strategy (CMA-ES). We showcase it for detection, or blob detection. The best-known algorithms the problem of visual keypoint selection, which could be are probably SIFT (Scale-invariant feature transform) [1] viewed as a preprocessing step for many computer vision or SURF (Speeded up robust features) [2]. In the SIFT al- algorithms. gorithm, the keypoint locations are defined as maxima and minima of the result of the difference of Gaussians applied in scale space to a series of smoothed and resampled im- Copyright c 2020 for this paper by its authors. Use permitted un- der Creative Commons License Attribution 4.0 International (CC BY ages. SURF algorithm is based on square-shaped filters as 4.0). an approximation of Gaussian smoothing. Keypoints are especially useful for tasks such as image 4 CMA-ES registration, where we want to transform multiple images into one coordinate system or 3D reconstruction, where Evolutionary Strategies represent a subclass of optimiza- we want to find 3D coordinates of each pixel. In both tion algorithms inspired by natural selection. They are of these tasks, we need to match keypoints from two or stochastic, derivative-free methods aimed for numerical more images by a matching algorithm. We assume that optimization of non-linear or non-convex continuous op- the keypoints we are trying to match come from images timization problems. They search through the space of capturing the same scenery but from different viewpoints continuous parameters by using a mutation and selection or lighting conditions. Therefore all images we are trying operators, which are interleaved in an iterative process. to match should contain similar keypoint descriptors. In One iteration (mutation and selection) of this process cor- other tasks, such as image classification, this assumption responds to one generation. In Evolution Strategies, the may no longer hold. In image classification, we may try to mutation operator produces each individual by sampling classify each image by matching its keypoint descriptors from a multivariate gaussian distribution with mean µ and to keypoint descriptors of prototype images of different covariance matrix Σ. The Selection operator selects n best classes. We would assign the image to a class of the proto- individuals from the current generation using the function type image with the best match. The problem arises when f we are trying to optimize. We treat this function as a images of objects in the same class contain, for example, black box. Therefore we can evaluate it for an individ- different backgrounds, which will be the case most of the ual x to get its fitness f (x), but otherwise, we have no as- time. Keypoints detected outside of the object may not be sumptions about it. The next generation is sampled from a correlated with the class, and this may decrease the accu- multivariate gaussian distribution with parameters depend- racy of the classification. For this reason, keypoints are ing on selected individuals from the previous generation. not very often used for tasks such as image classification. Here are the steps of the algorithm for generic distribution This could be changed if we can filter out keypoints that parametrized by θ : are not relevant for the task at hand, which is a problem we address in this contribution. 1. Generate population of individuals Λ = {x1 , . . . , xn } where xi ∼ pθ (x). 2. Compute a fitness score f (x) for all individuals in Λ. 3 Optimization of non-differentiable 3. Select subset of individuals with the best score. objective functions with continuous Based on this subset, update the value of parameter parameters θ. 4. Repeat. Deep Neural Networks, which are currently the most pop- In all types of Evolution Strategies, we adapt the mean ular approach in Machine Learning and Computer Vision, µ for the gaussian distribution. In CMA-ES we also adapt are most of the time trained by variants of gradient de- the covariance matrix Σ. This allows the algorithm to scent algorithm that require differentiable objective func- adapt the step size for each dimension of the parameter tion. The objective function itself is a proxy for the metric vector x separately. Adaptation of the covariance matrix we really care about. For example, in classification, we can be seen as an estimation of a second-order model of mostly care about classification accuracy, but we use an the underlying function f . It is similar to the approxi- objective function such as cross-entropy for training be- mation of the inverse Hessian matrix in the quasi-Newton cause accuracy is not differentiable, but cross-entropy is. method. For the concrete form of the update equations for For optimization problems in which the model makes dis- µ and Σ see the tutorial in [3]. crete choices during processing, the objective function will not be differentiable, and therefore such problems are of- ten approached with Reinforcement Learning where gra- 5 Model and task description dients are estimated by various gradient estimation tech- niques or by Evolutionary Algorithms which do not use Our task was inspired by [4] where a Reinforcement gradients but perform a kind of random search. In our Learning agent is trained to play a game from a screen case, we need to optimize the parameters of a neural net- of pixels. The agent uses an attention module which se- work, which will rank each keypoint by a relevancy score. lects positions in the screen where the agent should pay We choose top-K keypoints with the highest relevancy its attention and filters out the rest of the screen. Authors score, and this discrete choice makes our objective func- of this method showed that the agent learned to pay atten- tion non-differentiable. Our choice of the optimization al- tion to objects/positions, which are essential for the game. gorithm is CMA-ES, which is described in the next sec- We adapt attention module described in [4] together with tion. the learning algorithm (CMA-ES) and apply it to the task of keypoint selection, which we believe could have a high as weights of two 1-layer linear neural networks. The practical impact. parameter vector φ has a dimension D · Ξ + D · Ξ and the We assume that we are given a set of N keypoints in an function INIT only splits this vector in half and reshapes image extracted from some keypoint detection algorithm the two parts to a matrix. On lines 3 and 4 we create 2 and we want to choose a subset of K keypoints (with K  low-dimensional representations of each keypoint descrip- N) that will contain a useful information for the task at tor and store them to matrices Q and S. On line 5 we com- hand. pute inner products between these two low-dimensional The task we consider here is image classification. For representations for each pair of keypoints. For a better each class, we assume R prototype images1 which are rep- intuition, the inner product between Q-representation of resentative examples of the class. Each image is repre- keypoint k1 and W -representation of keypoint k2 can be sented by a set of keypoints. For a new image, we can seen as a vote that k1 is giving to k2 . On line 6 we normal- match its keypoints to keypoints of prototype image p and ize these inner products for each column of matrix P by a obtain a matching score. To match the keypoints, we solve softmax function so that the whole column sums to 1. This a linear assignment problem where the cost for every pair can be seen as a restriction for the keypoint k1 to distribute of keypoints is computed by a distance between their key- its one vote to all other keypoints. On line 7 we compute point descriptors. The matching score is computed by the score of each keypoint ki by summing the votes from summing the distances between matched keypoints. Af- every other keypoint k j . On line 8 we sort the indices of ter averaging the matching score over R prototype images keypoints by their score. On lines 9 and 10 we choose in each class, we can assign the new image to the class top-K indices and select keypoint descriptors that belong with the highest average score. to these indices from a matrix A. These are then returned Given the type of distance function used for each match- from the function. ing pair of keypoints2 , the average matching score depends We optimize the parameters φ with respect to the fol- only on the sets of keypoints used for matching. Therefore, lowing criterion c: if we have a training dataset X = ((x1 , y1 ), . . . , (xM , yM )) of M images xi ∈ RW ×H with labels yi ∈ 1, . . . , P, where W , H, P c(φ ) = ∑ 1(class( fφ (xi )), yi ), are width and height of an image and a number of classes i=1 respectively, we can optimize a function fφ which selects where 1 is an indicator function and class: RK×D → N is a subset of K keypoints from the set of J keypoints, based the function which assigns the class to an image based on on a criterion c which measures how many images were the selected keypoints as described above. Therefore we assigned to the correct class. We describe the optimization have an optimization problem in the form of: criterion c later in this section. The function fφ : RN×D → RK×D gets as an input a set φ = arg max c(φ ). of N real-valued vectors (keypoint descriptors) of dimen- φ sion D and returns a subset of K elements from it. It is As in [4], we approach it with CMA-ES algorithm. the attention module which we adapted from [4]. Here we provide a pseudo-code implementation of this function with an informal description below. 6 Experiments 1: function GET- TOP - K (A, φ ) We conduct two types of experiments that test the viability 2: F1 , F2 ← INIT(φ ) of our method. The first experiment was conducted on a 3: Q ← A · F1 synthetically generated dataset, where the hardness of the 4: S ← A · F2 problem could have been controlled manually. The second 5: P ← ST · Q set of experiments was conducted with a realistic dataset 6: V ← COLUMN - WISE - SOFTMAX(P) called Willow-Objects [5]. 7: z1 ← ∑ j Vi j 8: z2 ← ARGSORT(z1 ) 6.1 Synthetic dataset with gaussian feature 9: ixs ← z2 [0: K] descriptors 10: S ← A[ixs] 11: return S This experiment enabled us to start from an easy case 12: end function where we expected the algorithm to work and then in- The input to the function is a matrix A ∈ RN×D with one crementally make it harder by either making feature de- keypoint descriptor per row and a vector of learnable pa- scriptors noisier or adding distracting keypoints. Here we rameters φ . On line 2 we initialize matrices F1 , F2 ∈ RD×Ξ describe the final form of the dataset for which we show where Ξ is a hyperparameter3 . These matrices can be seen results in the Table 1. The dataset contains 10 classes with 1500 examples 1 In our case R ∈ {5, 10}. per class, where each example is represented by 20 fea- 2 In our experiments, we use cosine and euclidean distance. ture descriptors. Each feature descriptor is in turn a 200- 3 In our case it is 10. dimensional vector. Therefore the dataset has a form X = {(x1 , y1 ), . . . , (xM , yM )} where M = 15000, xi ∈ R20×200 gorithm with more confidence, without worrying whether and yi ∈ {0, . . . , 9}. problems arise due to the algorithm or due to the dataset. We wanted to model the fact that some classes could We split the whole dataset to 10000 training examples and share the same types of feature descriptors. Each type 4900 testing examples. The remaining 100 examples are of feature descriptor is represented by a mean µj and co- used as prototypes to which we match the selected feature variance matrix Σ j . Particular feature descriptors are then descriptors (10 prototypes per class). After we achieved sampled from Gaussians with these parameters. The type 85% accuracy on the test set, we moved to experiments of feature descriptor could be viewed as an abstract object, with realistic examples. such as "eye," and the particular sample can be viewed as a particular image of an eye. We want some types of feature descriptors to be unique to a class and others to be shared by more classes. For example, an eye could be a useful 6.2 Experiments on Willow-Object dataset type of feature descriptor, which is shared by classes such as "DOG" or "CAT," but it is not present in classes such as "CAR" or "APPLE." To model this, we represent classes Our next set of experiments was conducted with the as leaves in a binary tree where classes that share more dataset called Willow-Object [5]. We choose this dataset common ancestors will share more common types of fea- because it contains annotated keypoints, which we could ture descriptors. Every class/leaf is assigned six unique use for the debugging of the algorithm. The dataset con- feature descriptors, and every common ancestor adds one tains 5 classes, each with 40 example images. To obtain shared type of feature descriptor. To get ten classes, we feature descriptors for an image, we use a keypoint de- sample ten random paths in a binary tree, which is five tector in the SIFT algorithm to obtain 400 keypionts per levels deep. This will produce ten types of feature de- image. For each keypoint in the image, we extract a vector scriptors (or ten indices j indexing the parameters µj and of activation values from two layers of a pre-trained neu- Σ j ) per class. These types of feature descriptors contain ral network in the spatial position corresponding to that information that is relevant for classification. Every exam- keypoint. Concretely, we use VGG-11 pre-trained on Im- ple xj from the same class c will contain samples from 10 ageNet and layers named relu4_2 and relu5_1 . For each Gaussians, which correspond to the class c. keypoint we obtain a 1024-dimensional vector. Therefore, We also want to model distracting feature descriptors each example xi ∈ R400×1024 is represented by 400 1024- that would correspond to background clutter. For these, dimensional vectors from which we wanted to select 10 we reserve 60 unique types of feature descriptors repre- vectors used for matching. sented by 60 new values for the index j. For each exam- ple xi , we first sample 10 indices from these 60 reserved In this experiment, we found that cosine distance pro- and then sample random vectors from distributions cor- duces much better matches than the euclidean one. We responding to these indices. Together, each example in figured this out, with the help of manually annotated key- the class c is represented by 20 200-dimensional vectors. points in this dataset. Each image contains 10 annotated Ten vectors are always sampled from the same distribu- keypoints which correspond to parts of the object. For tions corresponding to class c, and ten vectors are sampled example, each image in the class "FACE" will contain from ten randomly chosen distributions from the pool of keypoints for eyes, nose, mouth, etc. Together, there 60, for each new example separately. are 5 × 10 = 50 types of different keypoints. We can, The parameters of each gaussian corresponding to one therefore, measure an average distance between each pair type of feature descriptor are sampled randomly, but in of keypoint types, by computing distances of all possi- such a way that the gaussian ellipsoids corresponding to ble pairs from two types of keypoints, e.g., "EYE" and different types of feature descriptors are well separable in "WHEEL", and averaging them. Naturally, we would like the 200-dimensional space. To achieve this, we set the the average distance to be smallest between keypoints of means of these Gaussians to coordinates of corners of the the same type so that the matcher is encouraged to match unit 200-simplex. Concretely, the mean µj will have 1 on keypoints of the same type together. We found that this position j and zeros everywhere else. The covariance ma- property holds for cosine distance, but not for euclidean trix is sampled from ranges that guarantee that the sam- distance, as shown in Figure 3. We speculate that this may ples are well separable. In order to make the task more be due to the fact that the keypoint descriptors extracted difficult, we process each sample with a randomly initial- from a pre-trained network may be rather sparse because ized 2-layer neural network, which preserves the dimen- they are activation values of the ReLU function and for sion. This will nonlinearly deform the space, and also, the such descriptors, cosine distance may work better. resulting clusters will not be axis-aligned. We split the dataset in such a way that for each class, Our model is trained to select 10 out of 20 feature de- we have 30 training examples, five prototype examples, scriptors that are relevant for classification. Because we and five test examples. As shown in Table 1, we achieved knew exactly which feature descriptors contain the rele- perfect accuracy on the train and the test set. The selected vant information, this dataset allowed us to develop the al- keypoints are shown in Figure 2. Figure 2: Examples of selected keypoints. Top row: keypoints from SIFT keypoint detector, Bottom row: 10 keypoints selected by our algorithm. distance Train acc. Test acc. Synthetic dataset euclidean 0.86 0.85 Willow-Object-SIFT cosine - 0.74 Willow-Object-SIFT euclidean - 0.34 Willow-Object-filtered cosine 1.0 1.0 Table 1: Training and testing accuracy for our experi- ments. Willow-Object-SIFT uses all keypoints extracted by SIFT. Willow-Object-filtered uses keypoints selected by our algorithm. Figure 3: A visualization of a distance matrix for all pairs of keypoint types. An element on row i and col- umn j corresponds to an average distance between all key- saccadic eye movements. In most cases, the attention is points of type i and all keypoints of type j. Cosine dis- differentiable and modeled by a softmax distribution over tance (RIGHT) works much better than euclidean distance possible positions [8]. This kind of attention is called soft (LEFT) for matching keypoints of the same type together. attention as opposed to hard attention [9–11], where dis- crete choices are made, usually with a maximum operator. Whereas soft attention modules can be trained with gra- 7 Related work dient descent, hard attention modules were usually trained with Reinforcement Learning. As far as we know, [4] were The approach described in this work is based on tech- the first one to use CMA-ES to train a parallel hard atten- niques used in [4], which uses parallel selective attention tion module. We tried to adapt their method to keypoint to identify essential parts of the environment. This kind of filtering, which we believe can have a more immediate parallel attention mechanism was popularized by [6]. Pre- practical impact. Our contribution is also related to other viously approaches modeling attention [7] were sequential articles about keypoint selection and discovery. In [12] he in nature, in Computer Vision, for example, mimicking authors described a method named Iterative Keypoint Se- lection, where the main idea is to select representative key- [4] Y. Tang, D. Nguyen, and D. Ha, “Neuroevolution of self- points in the first step and then filter them using a distance- interpretable agents,” arXiv preprint arXiv:2003.08165, based rule. Keypoints for which the distance is higher than 2020. a predefined threshold are removed in an iterative fashion. [5] M. Cho, K. Alahari, and J. Ponce, “Learning graphs to The Transporter [13] is a neural network architecture for match,” in Proceedings of the IEEE Interational Confer- unsupervised keypoint discovery from video frames. The ence on Computer Vision, 2013. discovered method enables two notable results in control [6] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, domains. Using the keypoint co-ordinates and correspond- A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all ing image features as input enables highly sample-efficient you need,” in Advances in neural information processing systems, pp. 5998–6008, 2017. reinforcement learning, and learning to explore by con- trolling keypoint locations drastically reduces the search [7] X. Liu and M. Milanova, “Visual attention in deep learning: a review,” Int. Robot. Automat. J, vol. 4, pp. 154–155, 2018. space. Another architecture named KeypointNet [14] is used for the detection and discovery of 3D keypoints from [8] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudi- nov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neu- 2D images. Their model discovers geometrically and se- ral image caption generation with visual attention,” in Inter- mantically consistent keypoints across viewing angles and national conference on machine learning, pp. 2048–2057, instances of an object category. 2015. [9] V. Mnih, N. Heess, A. Graves, et al., “Recurrent models of visual attention,” in Advances in neural information pro- 8 Discussion cessing systems, pp. 2204–2212, 2014. [10] J. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple ob- In our experiments, we worked with datasets that con- ject recognition with visual attention,” arXiv preprint tained annotated keypoints in order to test our hypothesis arXiv:1412.7755, 2014. that the algorithm will be able to recover the relevant key- [11] G. Elsayed, S. Kornblith, and Q. V. Le, “Saccader: improv- points that we knew were present. These datasets enabled ing accuracy of hard attention models for vision,” in Ad- easier debugging and development of the algorithm. In vances in Neural Information Processing Systems, pp. 702– the future, we will continue our work with more realistic 714, 2019. datasets, which contain more training and testing exam- [12] W.-C. Lin, C.-F. Tsai, Z.-Y. Chen, and S.-W. Ke, “Key- ples and also more classes of objects. Also, we would like point selection for efficient bag-of-words feature generation to test our approach with manually specified keypoints for and effective image classification,” Information Sciences, the prototype images where the user would be able to spec- vol. 329, pp. 33–51, 2016. ify which parts of the object are useful for classification. [13] T. D. Kulkarni, A. Gupta, C. Ionescu, S. Borgeaud, Lastly, we will try to incorporate spatial constraints be- M. Reynolds, A. Zisserman, and V. Mnih, “Unsupervised tween keypoints for the matching algorithm, which should learning of object keypoints for perception and control,” make the classification more robust. in Advances in neural information processing systems, pp. 10724–10734, 2019. [14] S. Suwajanakorn, N. Snavely, J. J. Tompson, and 9 Conclusion M. Norouzi, “Discovery of latent 3d keypoints via end-to- end geometric reasoning,” in Advances in neural informa- In this article, we described and evaluated a method for tion processing systems, pp. 2059–2070, 2018. keypoint selection, which is based on the attention module from [4]. We evaluated the method in proof-of-concept experiments and showed that it could select a small sub- set of relevant keypoints from a large set of generic key- points. We also showed the importance of the right dis- tance function when matching individual keypoints. We hope that such developments will enable the use of key- points in tasks where they are not standardly used. References [1] G. Lowe, “Sift-the scale invariant feature transform,” Int. J, vol. 2, pp. 91–110, 2004. [2] H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust features,” in European conference on computer vision, pp. 404–417, Springer, 2006. [3] N. Hansen, “The cma evolution strategy: A tutorial,” arXiv preprint arXiv:1604.00772, 2016.