<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards One-Shot Learning via Attention</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrej Lucny</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Mathematics, Physics and Informatics, Comenius University</institution>
          ,
          <addr-line>Mlynska Dolina, Bratislava 84248</addr-line>
          ,
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Though deep neural networks have enabled us to create systems that would have been incredible ten years ago, most of them still learn gradually and offline. We introduce an approach to overcoming this limitation. We have implemented it by the well-known attention mechanism that transforms one latent space into another using a list of key-value pairs defining the correspondence between points in the two spaces. We can express any point in the first space as a mixture of keys and map it to a point in the second space that is an analogical mixture of values. While we train the encoders and decoders of these spaces only gradually, we can collect the keys and values of the transformation online, so that we constantly improve the quality of the mapping and immediately achieve a perfect mapping of the current situation. We demonstrate our approach to one-shot learning on a simplified imitation game in human-robot interaction, where we map the representation of the robot's body to that of the examiner's body as seen by the robot.</p>
      </abstract>
      <kwd-group>
        <kwd>one-shot learning</kwd>
        <kwd>attention</kwd>
        <kwd>imitation game</kwd>
        <kwd>deep learning models</kwd>
        <kwd>self-supervision</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>The behavior of these creatures can change, e.g., by Pavlo</title>
        <p>vian conditioning. The third kind (Popperian) can create
Artificial intelligence is a rapidly growing domain mainly mental models and, thanks to that, can suddenly adapt
due to deep learning technology. Nowadays, the ambi- behavior, needing few or one shot. Finally, the fourth
tion is to achieve general artificial intelligence, so we aim kind (Gregorian) manages language communication to
to develop a model that simultaneously processes image, transfer the content of the mental models from one
inditext, and voice, incorporating multimodal knowledge. vidual to others. In this way, these creatures adapt even
However, how we create the model is still close to devel- without a single shot.
oping any model tailored to a particular task; we need The typical approach of deep learning is to train a
just much bigger datasets, data storage, and much more Skinnerian system and interpret some observed behavior
powerful hardware for training. Nevertheless, it is pos- as Popperian or Gregorian capabilities. Such systems
sible to train a model that can answer what the longest are Skinnerian at the structural level, but higher faculties
river in Africa is. But the model learns this fact gradually, emerge as side efects. A diferent approach could look for
processing it many times until it can answer correctly. structural changes that correspond to, e.g., the acquisition
On the other hand, we learn such facts in one shot. It is of associations. Of course, some parts of an intelligent
enough to tell us that the longest river in Africa is the system can grow only gradually. But, after achieving a
Nile, and we can remember or forget it, but we avoid certain complexity level, one-shot learning should appear.
evolving answers from "blah blah blah" through "Egypt" We suppose that responsibility for this faculty is laying
to "the Nile." In the worst case, we produce errors like on processes diferent from those for gradual growth.
"the Mississippi." In this paper, we outline how it could work. At first, we</p>
        <p>
          Philosophers analyzing natural intelligence enlight need to develop structures that map some structured data
that what today we call general intelligence is very far (like the seen image or joint setup) into a (so-called latent)
from the human one. For example, Daniel Dennet pro- space in which any point codes a reasonable instance of
vided a famous analysis that recognized four kinds of the data. We can provide that by technologies such as
minds [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The first kind (Darwinian) – although its be- autoencoders or contrastive learning that do not need
havior can be very efective – cannot adapt. Most of our annotations. We need a vast set of samples and a lot of
artificial solutions correspond to this type that, in na- time, but after some time, the mapping converges, and
ture, we can observe mainly on insects. The second kind we can fix it. It is essential to mention that though the
(Skinnerian) can adapt gradually, needing many shots to used examples correspond only to isolated points, any
achieve a reasonable probability of intelligent behavior. point in the latent space corresponds to an instance of
the data. At that moment, we can start a diferent process
ITAT’22: Information technologies – Applications and Theory, Septem- that maps one latent space to another, e. g. perception to
ber 23–27, 2022, Zuberec, Slovakia action. This process creates a list of associations between
$ l0u0c0n0y-0@00fm1-p6h04.u2n-7ib4a3.4sk(A( A..LLuucncny)y) points in one latent space and the other. For the two
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License mapped points, the new association works correctly and
CPWrEooUrckReshdoinpgs IhStpN:/c1e6u1r3-w-0s.o7r3g ACttEribUutRion W4.0oInrtekrnsahtioonpal (PCCroBYce4.0e).dings (CEUR-WS.org) immediately. However, the more associations we collect,
the more accurate the mapping of the whole spaces we generator (decoder). Feature vectors of all instances of
get. Anytime we can express any point as a mixture the raw data constitute a latent space with a dimension
of the associated points from one space and map it to equal to the size of the feature vector.
an analogical mix from their pairs in the other space. The distribution of the feature vectors in the latent
In the deep neural network, we can provide it by the space is crucial [5]. We prefer to have similar data
Attention module. As a result, the system can learn the mapped to similar vectors. We can achieve a good
perforpresented example in one shot and use the knowledge mance mainly by the variational autoencoders [6]. They
for approximation for situations never seen. split the feature vector into two parts: one corresponds
        </p>
        <p>
          We test this one-shot learning process on a modern to average and the other to deviation. In this way, we
reimplementation of an imitation game with a robot, in- push features to have the Gaussian distribution.
troduced in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. We aim to learn the ability to imitate
arm movements, for which we need associations between 2.2. Metric learning
the robot’s body and the body seen by the robot’s camera.
        </p>
        <p>In the first phase, the robot invites us to imitate it. It
generates various arm poses, and the human imitates them
with its body in front of the robot’s camera. In the
second phase, the robot mimics the human by associations
learned during the first phase.</p>
        <p>We present details of our approach in the third chapter
after discussing the related works in Chapter 2. Then
we deal with its demonstration in Chapter 4. Finally, we
discuss quality and the pros and cons.</p>
        <p>We can also create feature extractors without the necessity to develop both encoders and decoders. However, training only the encoder part, we lack the exact output we would like to get for a given input. We do not know what feature vector we expect; we know only, e.g., the input category. Thus, we cannot calculate the gradient from the difference between the actual and expected outputs.</p>
        <p>So, instead, we specify a metric that the mapping of all instances from our dataset should hold. Then we try our network on all inputs and identify the worst pairs of feature vectors: those that are close but have different categories, or that are far apart but have the same category. We want to move them further apart or closer together in the latent space, and that direction provides us with the gradient for training the network. After many training cycles, the mapping holds the metric and becomes suitable [<xref ref-type="bibr" rid="ref7">7</xref>].</p>
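        <p>For illustration, a metric of this kind can be written as a triplet margin loss. The following TensorFlow/Keras sketch assumes a hypothetical embedding network and margin value; it is a minimal example of the idea, not the exact setup of [<xref ref-type="bibr" rid="ref7">7</xref>].</p>
        <preformat>
import tensorflow as tf

# Hypothetical embedding network: raw data -> L2-normalized feature vectors.
def make_embedder(dim=64):
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(dim),
        # keep features on the unit sphere, so distances are comparable
        tf.keras.layers.Lambda(lambda x: tf.math.l2_normalize(x, axis=-1)),
    ])

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Pull same-category pairs closer and push different-category pairs
    # apart until the margin holds; this difference is the gradient
    # source described above.
    d_pos = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    d_neg = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    return tf.reduce_mean(tf.maximum(d_pos - d_neg + margin, 0.0))
        </preformat>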
        <p>Advanced methods, called contrastive learning, employ metrics based on the similarity or diversity of the outputs of two copies of the same network. We feed them with two augmentations of the same instance from the dataset and require similar outcomes, or with different samples, expecting different results. We can train both networks [<xref ref-type="bibr" rid="ref8">8</xref>] or only one of them [<xref ref-type="bibr" rid="ref9">9</xref>]. In the second case, one copy is the teacher and the other the student. During the training, we adjust the student network weights and occasionally copy them to the teacher.</p>
        <p>In this way, we get feature extractors of better quality. Then we can train the corresponding decoders and use them as generators, even when we feed them with inputs different from the feature vectors of instances from the dataset.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Attention</title>
        <p>The invention of the attention mechanism in natural language processing enabled us to take into regard the global content during the processing of sequences of words, e.g., for distinguishing the meaning of synonyms upon their context. Then it helped to avoid sequential processing and started the transformer revolution in the whole domain of deep learning [<xref ref-type="bibr" rid="ref10">10</xref>]. But, for our purposes, it is enough to consider it from a mathematical point of view.</p>
        <p>The attention mechanism works with a set of key-value pairs. Having a query q on input, it mixes the query from the keys k_i and outputs an analogical mixture of the corresponding values v_i, where K = (k_1, k_2, …, k_n)^T and V = (v_1, v_2, …, v_n)^T. All queries and keys are vectors of dimension d, so K is an n × d matrix. Values and outputs are vectors of dimension m, so V is an n × m matrix. At first, we look for coefficients s_i ∈ ⟨0, 1⟩ with ∑ s_i = 1, i = 1, 2, …, n, that express the query as a mixture of the keys, q ≈ ∑ s_i k_i. Since we like to mix the query more from keys similar to it and less from different keys, we can express the similarity roughly as the dot product s_i^(0) = q·k_i = ‖q‖ ‖k_i‖ cos θ, where θ is the angle between q and k_i. Since cos θ is 1 for identical, 0 for perpendicular, and −1 for opposite directions, we can turn the products into high, middle, and small values from ⟨0, 1⟩ by softmax(x)_i = exp(x_i) / ∑_j exp(x_j). Thus s = softmax(s^(0)/τ) = softmax(qK^T/τ), where τ is a constant that enables us to scale how much we mix from similar keys and how much from different ones. Since the length of the vectors grows with √d and the dot product with d, it is popular to define τ = √d. However, that would mean we also include different and opposite keys, even if one key equals the query. For our purposes, we use a much smaller scale factor τ = √d/5, since we prefer to mix almost from a single key if that key is equal to the query. Having the coefficients of the mixture s, we can mix the values V into the output o = sV. So, the complete response of the attention module to a single query q is A(q, K, V) = softmax(qK^T/τ)V.</p>
        <p>Yet we mention that the typical use of the attention mechanism is the so-called self-attention, for which the queries, keys, and values come from the same input, and we aim to get the compatibility of the query with each key. However, this is not our case. Instead, we use the mechanism for mapping two latent spaces: queries and keys are from one space, and values and outputs are from another.</p>
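        <p>For illustration, the whole module can be stated in a few lines of NumPy. This is a sketch of the formula above with our τ = √d/5; the function names are ours.</p>
        <preformat>
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the maximum for numerical stability
    return e / e.sum()

def attention(q, K, V, tau=None):
    # q: query (d,), K: keys (n, d), V: values (n, m); returns o = sV, shape (m,)
    d = q.shape[0]
    if tau is None:
        tau = np.sqrt(d) / 5.0  # much sharper than the usual sqrt(d)
    s = softmax(K @ q / tau)    # mixture coefficients from ⟨0, 1⟩, summing to 1
    return s @ V                # the analogical mixture of the values
        </preformat>
        <p>With this scale factor, when the query equals one of the stored keys, the coefficients s are nearly one-hot, so the output is essentially the associated value, as the one-shot behavior requires.</p>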
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>Let us assume we have a system with one feature extractor E and one generator G. The extractor represents perception and the generator action. The extractor implements a function f_E : X → F_X, where X is the set of system inputs and F_X is their latent space, i.e., it turns the input data into feature vectors. The generator implements a function f_G : F_Y → Y, where Y is the set of system outputs and F_Y is their latent space, i.e., it turns feature vectors into the output data.</p>
      <p>We map the spaces F_X and F_Y by the attention module A. We gradually build the matrices K and V containing keys and values. Each key-value pair corresponds to an association between two feature vectors representing a stimulus and a response at the system level. The system employs the generator to act with y = f_G(v) upon a random value v. As a result, it receives a response x and extracts the key k = f_E(x). Then, concerning their quality, the system can include k, v into K, V. Thus, the mappings K and V become richer.</p>
      <p>On the other hand, when the system receives an input x independent of the system's activities, it turns it into a query q = f_E(x). If the query is equal to one of the keys, i.e., q = k_i, we aim to act with f_G(v_i). However, it is much more probable that we cannot translate the query so directly. Therefore, we operate with f_G(o), where o = A(q, K, V).</p>
      <p>So, we can summarize the system operation into two processes (procedures Acquire and Use):</p>
      <sec id="sec-1-2">
        <title>Yet we mention that the typical use of the attention</title>
        <p>mechanism is the so-called self-attention, for which the
queries, keys, and values are coming from the same input,
and we aim to get the query compatibility to each key.
However, this is not the case. So instead, we will use the
mechanism for mapping two latent spaces; queries and
keys are from one space, and values and outputs are from
another.</p>
      <preformat>
Algorithm: Attention-based one-shot learning
  E is the extractor, G the generator,
  A the attention module, K the keys, V the values
procedure Acquire(E, G, K, V)
  loop
    v ← random()
    y ← f_G(v)
    act(y)
    x ← perceive()
    k ← f_E(x)
    K ← K ∪ {k}
    V ← V ∪ {v}
procedure Use(E, G, K, V)
  loop
    x ← perceive()
    q ← f_E(x)
    o ← A(q, K, V)
    y ← f_G(o)
    act(y)
      </preformat>
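      <p>In Python, the two procedures translate almost directly. Here random_feature, act, and perceive are placeholders for the robot- and camera-facing code, and attention is the sketch from Chapter 2.3.</p>
      <preformat>
import numpy as np

def acquire(f_E, f_G, K, V, random_feature, act, perceive):
    # Collect one association per demonstration.
    v = random_feature()   # random point of the action latent space
    act(f_G(v))            # the robot acts; the human imitates it
    k = f_E(perceive())    # key = features of the observed imitation
    K.append(k)
    V.append(v)

def use(f_E, f_G, K, V, act, perceive):
    # Imitate: translate the observed pose through the associations.
    q = f_E(perceive())
    o = attention(q, np.array(K), np.array(V))
    act(f_G(o))
      </preformat>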
        <sec id="sec-1-2-1">
          <title>4.1. Extractor</title>
        </sec>
      </sec>
      <sec id="sec-1-3">
        <title>We turn images into features by the pre-trained model</title>
        <p>DINO [9]. In detail, we use dino_deits8.onnx – the
middlesized version of the latter vision-transformer backbone,
distributed in the ONNX format. Though it turns color
images with resolution 224x224 into feature vectors of
mere 384 numbers, its quality is incredible, demonstrated
by several successful applications, including pose
detection. Thus we are sure that the vector also contains
information representing the person’s pose on the image.
But, of course, the pose is in a raw form: we use the
backbone only, while the applications mentioned above
add further processing layers. The model is relatively
large, but its middle-sized version can fit into the 4GB
GPU. Moreover, its inference takes 0.05s on an ordinary
gaming notebook; thus, it is very suitable for building
real-time applications.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Demonstration and Evaluation</title>
      <p>To demonstrate the method, we deal with the imitation game between a human and a humanoid robot. Though it is possible to implement this task with classic computer vision [<xref ref-type="bibr" rid="ref3">3</xref>] and deep learning [<xref ref-type="bibr" rid="ref11">11</xref>], for our purposes, it is meaningful to re-implement it in a novel way. In this game, a humanoid robot with a body similar to a human's invites a human to mimic its movements. If the human accepts the invitation and imitates the robot, the robot learns how to mimic the human. As a result, the robot can mimic the movements of humans (Figure 1).</p>
      <p>For the implementation, we need a humanoid robot; we employ iCubSim, the simulator of the iCub robot [<xref ref-type="bibr" rid="ref12">12</xref>], equipped with an external camera. We control it from Python via the pyicubsim, ONNX-runtime, and OpenCV [<xref ref-type="bibr" rid="ref13">13</xref>] libraries. Further, we need a feature extractor that turns the images seen by the robot into feature vectors. For our purpose, it is not necessary to train it; we can get it from a pre-trained model for computer vision (obtained by a self-supervised method described in Chapter 2.2). However, we need to invest more effort into generating robot movements since neither pre-trained models nor datasets are available to us for the chosen robot. First, we create a dataset as a set of the robot joint positions recorded while moving its arm to random points in the robot's vicinity. We avoid abnormal setups of the robot joints by calculating them by inverse kinematics. Then we train (using Keras [<xref ref-type="bibr" rid="ref14">14</xref>]) a variational autoencoder (see Chapter 2.1) and get the generator model as its part. Having the extractor and generator, we can define the overall model controlling the robot as their integration by the attention module (see Chapter 2.3). Since the system operates in real-time and calls models, the integration employs a blackboard architecture [<xref ref-type="bibr" rid="ref15">15</xref>] that helps us to combine slower and faster processes. Finally, we test the system.</p>
      <sec id="sec-4-1">
        <title>4.1. Extractor</title>
        <p>We turn images into features by the pre-trained model DINO [<xref ref-type="bibr" rid="ref9">9</xref>]. In detail, we use dino_deits8.onnx – the middle-sized version of the latter vision-transformer backbone, distributed in the ONNX format. Though it turns color images with a resolution of 224x224 into feature vectors of a mere 384 numbers, its quality is incredible, as demonstrated by several successful applications, including pose detection. Thus, we are sure that the vector also contains information representing the pose of the person in the image. But, of course, the pose is in a raw form: we use the backbone only, while the applications mentioned above add further processing layers. The model is relatively large, but its middle-sized version can fit into a 4GB GPU. Moreover, its inference takes 0.05 s on an ordinary gaming notebook; thus, it is very suitable for building real-time applications.</p>
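        <p>As an illustration, such an extractor can be wrapped as follows using onnxruntime and OpenCV. The normalization constants (ImageNet statistics) and the function name extract are our assumptions; the model file, the 224x224 input, and the 384-dimensional output are as described above.</p>
        <preformat>
import cv2
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("dino_deits8.onnx")
input_name = session.get_inputs()[0].name

def extract(image_bgr):
    # 224x224 RGB blob; normalization to ImageNet statistics is assumed here.
    blob = cv2.dnn.blobFromImage(
        image_bgr, scalefactor=1.0 / 255, size=(224, 224),
        mean=(0.485 * 255, 0.456 * 255, 0.406 * 255), swapRB=True)
    blob /= np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(1, 3, 1, 1)
    features = session.run(None, {input_name: blob})[0]
    return features[0]  # the 384-dimensional feature vector
        </preformat>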
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Generator</title>
        <p>The iCub robot arm contains five significant degrees of freedom, two in the shoulder and three in the elbow joints. Altogether, the pose of the left and right arms is coded by ten angles.</p>
        <p>Aiming to create a decoder generating accurate poses, we first need to get a dataset of them. We cannot get it by random generation of joint values because it would also contain unnatural poses. So we need to define what natural means here. To do that, we have decided to regard as natural the poses generated by inverse kinematics. We have asked the robot to move its arm to all possible coordinates in its vicinity, and if it succeeded in reaching the point, we have added the current joint setup into the dataset. It was not an easy job because inverse kinematics for iCub was not available to us. Therefore, we have started from the known Denavit-Hartenberg parameters of the robot, and – using off-the-shelf direct kinematics [<xref ref-type="bibr" rid="ref16">16</xref>] – we have implemented the FABRIK algorithm [<xref ref-type="bibr" rid="ref17">17</xref>], adjusted for the Denavit-Hartenberg notation and extended by constraints [<xref ref-type="bibr" rid="ref18">18</xref>]. It is a slow but fully operational solution that not only defines the natural poses of the robot but also speeds up the creation of the dataset. We speed up the process because we can reliably calculate all data on the kinematics model, and we do not need to try them on the robot. Also, as we will see later, it is profitable that our dataset can contain the Euler coordinates corresponding to the recorded joint setups. We do not use them for model creation, but they are helpful for model visualization. In this way, we have collected all possible poses (Figure 2), 23470 for each arm. Then we randomly selected 60000 examples concerning the equal probability that the robot uses the left arm, the right arm, both arms symmetrically, and both arms in different poses.</p>
        <p>In the second phase, we used Keras to train the variational autoencoder of the selected joint setups. Since the space of iCub's arm actions is not ample, we have used just ten input, six intermediate, two feature, six intermediate, and ten output neurons. Of course, we double the internal structures because the features are the sum of the average and a random multiple of the standard deviation (Figure 3). We have used ReLU and tanh activations since we turned the joint angles from −180° to 180° into codes from −1 to 1. Before training, we shuffled the dataset and split it into 50000 training and 10000 testing examples. The training required ten epochs with batch size 32 and took a mere 92 s (Figure 4). Finally, we have distilled the decoder part of the trained architecture and saved it as our generator. Yet we have converted the generator model from the .h5 format to the .pb format that we can open in the OpenCV library.</p>
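        <p>A condensed Keras sketch of the described autoencoder follows. The layer sizes (ten input, six intermediate, two feature, six intermediate, ten output, with doubled internal structures for the average and deviation) and the activations follow the text; the sampling layer and the weighting of the KL term are our assumptions, not the exact training script.</p>
        <preformat>
import tensorflow as tf
from tensorflow.keras import layers

latent = 2  # two-dimensional feature vector, as described above

# Encoder: 10 joint angles (scaled to ⟨-1, 1⟩) -> average and deviation.
inp = layers.Input(shape=(10,))
h = layers.Dense(6, activation="relu")(inp)
z_mean = layers.Dense(latent)(h)
z_log_var = layers.Dense(latent)(h)

# Feature = average plus a random multiple of the standard deviation.
def sample(args):
    m, lv = args
    eps = tf.random.normal(tf.shape(m))
    return m + tf.exp(0.5 * lv) * eps

z = layers.Lambda(sample)([z_mean, z_log_var])

# Decoder: the part we later distill and save as the generator.
dh = layers.Dense(6, activation="relu")(z)
out = layers.Dense(10, activation="tanh")(dh)

vae = tf.keras.Model(inp, out)
kl = -0.5 * tf.reduce_mean(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
vae.add_loss(0.001 * kl)  # assumed weighting of the KL term
vae.compile(optimizer="adam", loss="mse")
# vae.fit(poses_train, poses_train, epochs=10, batch_size=32)
        </preformat>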
        <p>This model can turn any pair of numbers from −1 to 1 into a proper joint setup on the robot (Figure 5). But, further, we need to check that the space of generated actions is well-organized. In other words, we need to check that a smooth change of the feature vector causes only a smooth shift in the joint setup. Since the feature vector has only two numbers, we can easily visualize its quality in six pictures depicting the x, y, and z coordinates of the right and left arms. We put a point into the picture for each example from the testing set. Its color corresponds to the value of the coordinate. A smooth color gradient then means that the space is well-organized (Figure 6).</p>
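        <p>Such a check can be scripted in a few lines; the names features, coords, and titles are ours, holding the testing-set feature vectors, the corresponding coordinates, and the six panel titles.</p>
        <preformat>
import matplotlib.pyplot as plt

# features: (N, 2) feature vectors of the testing examples (from the encoder);
# coords:   (N, 6) x, y, z coordinates of both arms (from the kinematics model).
def plot_latent(features, coords, titles):
    fig, axes = plt.subplots(2, 3, figsize=(12, 8))
    for ax, c, title in zip(axes.flat, coords.T, titles):
        sc = ax.scatter(features[:, 0], features[:, 1], c=c, s=4)
        ax.set_title(title)  # a smooth color gradient = well-organized space
        fig.colorbar(sc, ax=ax)
    plt.show()
        </preformat>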
      <sec id="sec-2-1">
        <title>4.3. Integration</title>
        <p>Now we can integrate the extractor and generator models into one system. Since such a system needs to combine fast data sources like a camera with slower models and languid robot movement, the integration employs a blackboard architecture. Concretely, we use our solution named Agent-Space architecture [<xref ref-type="bibr" rid="ref15">15</xref>] to split the system into a set of agents communicating via the blackboard and let the overall control emerge from the individual behaviors of the agents.</p>
        <p>Our system contains the following agents (Figure 7); a schematic sketch of the underlying pattern follows the list.
• The camera agent grabs images from the camera and writes them onto the blackboard, where other agents can read image samples according to their processing capacity. (This way, we avoid the delays and overloading that appear if we put grabbing and processing images into the same loop.)
• The perception agent reads the grabbed image from the blackboard, turns it into a blob, feeds the extractor model, and writes the provided feature vector to the blackboard.
• The control agent operates in two modes: ACQUIRE and USE. In the first mode, it collects the lists of keys and values corresponding to the feature vectors of the extractor and generator in the following way. First, it randomly generates a feature vector for the generator, writes it to the blackboard, waits, and reads the feature vector provided by the extractor. Then it adds them to the lists. In the second mode, the agent reads the feature vector provided by the extractor and writes the feature vector for the generator, calculated by the attention module from the lists of keys and values.
• The action agent reads the feature vector for the generator and controls the iCub robot by the commands of the YARP protocol encapsulated by the pyicubsim library.
• For simplification, at the current stage of development, we let the examiner specify the exact time for waiting in the ACQUIRE mode. Since his hands are busy imitating the robot's pose, we manage this signaling by whistling. We implemented this input by the pitch agent. It processes sound by the Fourier transform and looks for high frequencies.</p>
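        <p>The blackboard itself can be as simple as a lock-protected dictionary, as the following schematic sketch of the agent pattern shows. It is an illustration under our assumptions, not the actual Agent-Space implementation [<xref ref-type="bibr" rid="ref15">15</xref>].</p>
        <preformat>
import threading, time

class Blackboard:
    def __init__(self):
        self._data, self._lock = {}, threading.Lock()
    def write(self, key, value):
        with self._lock:
            self._data[key] = value
    def read(self, key):
        with self._lock:
            return self._data.get(key)

def perception_agent(bb, extract):
    # Reads the latest grabbed image at its own pace and publishes features,
    # decoupled from the camera agent that overwrites "image" much faster.
    while True:
        image = bb.read("image")
        if image is not None:
            bb.write("features", extract(image))
        time.sleep(0.05)  # roughly the inference time of the extractor

# Each agent runs in its own thread, e.g.:
# threading.Thread(target=perception_agent, args=(bb, extract), daemon=True).start()
        </preformat>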
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Testing</title>
        <p>We have developed the real-time system incrementally, working with an off-line version whose quality we can investigate more easily. In this phase, we have selected a bunch of ten examiner's poses and created a few images – under varying conditions – for each pose. Then we taught the system, and after each sample, we tested the system's capability to imitate all the poses. The number of operational poses indicated whether the system could forget a learned pose. We found that it did not forget any of them. The system even learned one pose implicitly, compounding the correct response from two other already presented poses (Figure 8).</p>
        <p>So far, we have not evaluated the real-time version in another way than by the examiner's opinion. In the future, we plan to employ pose detectors for this purpose.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusion</title>
      <sec id="sec-3-1">
        <title>In this paper, we introduced a kind of one-shot learning.</title>
        <p>Its key component is the attention module. We have used
this existing component of deep neural networks for a
new task: mapping two latent spaces. First, however, we
had to adjust one of its parameters: the scale factor.</p>
      <p>We demonstrated our approach to one-shot learning on imitation between a human and a humanoid robot.</p>
      <p>We built our demo from modules developed in a self-supervised way. Thus, we avoided using datasets containing particular poses of the person in images or of the robot's body. Instead, our robot has learned them by interacting with the examiner in a one-shot learning way.</p>
      <p>For imitation, the robot needs to grasp the model of the seen body as analogical to the model of its own body. In humans, it is not clear where this ability originates. But our approach indicates that the imitation game not only solicits this ability but can also help it to emerge. Here, imitation is an ability of a society [<xref ref-type="bibr" rid="ref19">19</xref>], and one of its members learns it from another (a child from its parent or a robot from its user). Remarkably, this transfer could rely on the attention module, an essential building block of natural language processing. We could look at language as a kind of imitation related to the movement of the vocal cords, whose nature is similar to hand movement. On the other hand, the presented one-shot learning mechanism could play a role in the early evolution of signal-based language.</p>
      <p>Our approach also has weaknesses. The major one is that if we use an encoder that stems from general data, the mapping could be relevant only for specific conditions. For example, the associations learned in the presented imitation game could be fooled by more persons in front of the camera. On the other hand, the quality of today's self-supervised models does not allow us to cheat the system by, e.g., a different color of the wall or a different dress of the seen person. We could decrease this problem by training the encoder from more specific data under specific conditions. However, it is not easy to imagine that we could manage that in a self-supervised way.</p>
      <p>Finally, our method is more general than the imitation game. For example, processing vision, the robot cannot see itself; therefore, it needs the help of an examiner. However, it could apply the same method to seeing itself in the mirror. Or it could similarly process voice: having a speech generator corresponding to the physical capabilities of the vocal cords, lips, and tongue, and a voice listener analogical to the ear, it could start to produce random voices and learn the mapping between the listener's perception and the generator's action. As a result, it could reproduce the listened speech when another source makes it.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Online Resources</title>
      <p>We share the code of this project at GitHub.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Dennett</surname>
          </string-name>
          ,
          <article-title>Kinds of minds: towards an understanding of consciousness</article-title>
          ,
          <source>Weidenfeld &amp; Nicolson</source>
          , London,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Bandera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Molina-Tanco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bandera</surname>
          </string-name>
          ,
          <article-title>A survey of vision-based architectures for robot learning by imitation</article-title>
          ,
          <source>International Journal of Humanoid Robotics</source>
          <volume>9</volume>
          (
          <year>2012</year>
          ). doi:10.1142/S0219843612500065. World Scientific Publishing Company.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Boucenna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Anzalone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tilmont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chetouani</surname>
          </string-name>
          ,
          <article-title>Learning of social signatures through imitation game between a robot and a human partner</article-title>
          , IEEE Transactions on Autonomous Mental Development 6 (2014) 213–225. doi:10.1109/TAMD.2014.2319861.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] G. E. Hinton, R. R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (2006). doi:10.1126/science.1127647.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] J. Brownlee, Deep Learning for Computer Vision, 1.4 ed., machinelearningmastery.com, 2019.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] D. P. Kingma, M. Welling, An introduction to variational autoencoders, Foundations and Trends in Machine Learning 12 (2019) 307–392.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] D. King, High quality face recognition with deep metric learning, 2017. URL: http://blog.dlib.net/2017/02/high-quality-face-recognition-with-deep.html.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] T. Chen, S. Kornblith, M. Norouzi, G. Hinton, A simple framework for contrastive learning of visual representations, in: Proceedings of the 37th International Conference on Machine Learning, number 149 in ICML, 2020, pp. 1597–1607.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, A. Joulin, Emerging properties in self-supervised vision transformers, in: Proceedings of the International Conference on Computer Vision, ICCV, 2021.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: 31st International Conference on Neural Information Processing Systems, ACM, Long Beach, 2017.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] M. Petrovich, M. J. Black, G. Varol, Action-conditioned 3D human motion synthesis with transformer VAE, in: International Conference on Computer Vision, ICCV, 2021.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] D. Vernon, G. Metta, G. Sandini, The iCub cognitive architecture: Interactive development in a humanoid robot, in: 2007 IEEE 6th International Conference on Development and Learning, 2007, pp. 122–127. doi:10.1109/DEVLRN.2007.4354038.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] G. Bradski, The OpenCV library, Dr. Dobb's Journal of Software Tools (2000).</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] F. Chollet, Deep Learning with Python, Manning Publications Co., Greenwich, CT, USA, 2017.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] A. Lucny, Building complex systems with agent-space architecture, Computers and Informatics 23 (2004) 1–36.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] L. Natale, C. Bartolozzi, F. Nori, G. Sandini, G. Metta, Humanoid Robotics, Springer, Dordrecht, 2017. doi:10.1007/978-94-007-6046-2.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] A. Aristidou, J. Lasenby, FABRIK: A fast, iterative solver for the inverse kinematics problem, Graphical Models 73 (2011) 243–260.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] R. A. Tenneti, A. Sarkar, Implementation of modified FABRIK for robot manipulators, in: Proceedings of the Advances in Robotics 2019, 2019, pp. 1–6. doi:10.1145/3352593.3352605.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] A. Sadeghipour, S. Kopp, Embodied gesture processing: Motor-based integration of perception and action in social artificial agents, Cognitive Computation 3 (2011) 419–435. doi:10.1007/s12559-010-9082-z.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>