Towards one-shot Learning via Attention

Andrej Lucny
Faculty of Mathematics, Physics and Informatics, Comenius University, Mlynska Dolina, Bratislava 84248, Slovakia
lucny@fmph.uniba.sk, ORCID 0000-0001-6042-7434
ITAT'22: Information Technologies – Applications and Theory, September 23–27, 2022, Zuberec, Slovakia

Abstract
Though deep neural networks have enabled us to create systems that would have been incredible ten years ago, most of them still learn gradually and offline. We introduce an approach to overcoming this limitation. We implement it with the well-known attention mechanism, which transforms one latent space into another using a list of key-value pairs that define the correspondence between points in the two spaces. We can express any point in the first space as a mixture of keys and map it to the point in the second space that is the analogical mixture of values. While we train the encoders and decoders of these spaces only gradually, we can collect the keys and values of the transformation online, constantly improving the mapping quality and capturing the current situation immediately. We demonstrate our approach to one-shot learning on a simplified imitation game in human-robot interaction, where we map a representation of the robot's body to a representation of the examinator's body as seen by the robot.

Keywords
one-shot learning, attention, imitation game, deep learning models, self-supervision

1. Introduction

Artificial intelligence is a rapidly growing domain, mainly due to deep learning technology. Nowadays, the ambition is to achieve general artificial intelligence, so we aim to develop a model that simultaneously processes image, text, and voice, incorporating multimodal knowledge. However, the way we create such a model is still close to developing any model tailored to a particular task; we just need much bigger datasets, more data storage, and much more powerful hardware for training. It is certainly possible to train a model that can answer what the longest river in Africa is. But the model learns this fact gradually, processing it many times until it can answer correctly. On the other hand, we learn such facts in one shot. It is enough to tell us that the longest river in Africa is the Nile, and we can remember or forget it, but we do not evolve our answer from "blah blah blah" through "Egypt" to "the Nile." In the worst case, we produce errors like "the Mississippi."

Philosophers analyzing natural intelligence point out that what we today call general intelligence is very far from the human one. For example, Daniel Dennett provided a famous analysis that recognized four kinds of minds [1]. The first kind (Darwinian) – although its behavior can be very effective – cannot adapt. Most of our artificial solutions correspond to this type, which in nature we observe mainly in insects. The second kind (Skinnerian) can adapt gradually, needing many shots to achieve a reasonable probability of intelligent behavior. The behavior of these creatures can change, e.g., by Pavlovian conditioning. The third kind (Popperian) can create mental models and, thanks to that, can adapt behavior suddenly, needing few shots or just one. Finally, the fourth kind (Gregorian) manages language communication to transfer the content of the mental models from one individual to others. In this way, these creatures adapt even without a single shot.

The typical approach of deep learning is to train a Skinnerian system and interpret some observed behavior as Popperian or Gregorian capabilities. Such systems are Skinnerian at the structural level, and the higher faculties emerge as side effects. A different approach could look for structural changes that correspond to, e.g., the acquisition of associations. Of course, some parts of an intelligent system can grow only gradually. But, after achieving a certain complexity level, one-shot learning should appear. We suppose that responsibility for this faculty lies with processes different from those responsible for gradual growth. In this paper, we outline how it could work.
At first, we need to develop structures that map some structured data (like a seen image or a joint setup) into a so-called latent space in which any point codes a reasonable instance of the data. We can provide that by technologies such as autoencoders or contrastive learning, which do not need annotations. We need a vast set of samples and a lot of time, but eventually the mapping converges, and we can fix it. It is essential to mention that though the used examples correspond only to isolated points, any point in the latent space corresponds to an instance of the data. At that moment, we can start a different process that maps one latent space to another, e.g., perception to action. This process creates a list of associations between points in one latent space and the other. For the two mapped points, the new association works correctly and immediately. Moreover, the more associations we collect, the more accurate a mapping of the whole spaces we get. At any time, we can express any point as a mixture of the associated points from one space and map it to the analogical mixture of their pairs in the other space. In a deep neural network, we can provide this by the attention module. As a result, the system can learn a presented example in one shot and use that knowledge to approximate in situations it has never seen.

We test this one-shot learning process on a modern reimplementation of an imitation game with a robot, introduced in [2, 3]. We aim to learn the ability to imitate arm movements, for which we need associations between the robot's body and the body seen by the robot's camera. In the first phase, the robot invites us to imitate it. It generates various arm poses, and the human imitates them with their body in front of the robot's camera. In the second phase, the robot mimics the human using the associations learned during the first phase.

We present the details of our approach in Chapter 3 after discussing the related works in Chapter 2. Then we deal with its demonstration in Chapter 4. Finally, we discuss its quality and the pros and cons.
2. Related Works

Looking at how to initiate one-shot learning, we prefer self-supervised gradual methods, i.e., gradient-based methods working with unlabelled data. Only such methods could correspond to how living creatures or robots in changing environments learn. We know several such methods and pay attention here to autoencoders and metric learning. Their task is to provide us (by a gradual process) with extractors that map raw data into a latent space and generators that can turn any value from the latent space back into raw data. Finally, the attention mechanism enables us to map one latent space to the other.

2.1. Autoencoders

Autoencoders [4] are predecessors of the early convolutional networks. They contain blocks of convolutional layers interleaved by dimension reduction in the first half and dimension expansion in the second half. Thus data like images, with a typical dimension of hundreds of thousands, are sequentially reduced to a feature vector with a size of hundreds or thousands and then expanded to the original extent. We train them on an unlabelled dataset to respond with an output equal to the input. If the training is successful, each part of the processing sequence contains the same information. As a result, the feature vectors carry the same information as the images from the dataset. Then we can cut off the part of the neural network after the feature vector and get the extractor (encoder), or remove the part before the feature vector and get the generator (decoder). The feature vectors of all instances of the raw data constitute a latent space with a dimension equal to the size of the feature vector.

The distribution of the feature vectors in the latent space is crucial [5]. We prefer to have similar data mapped to similar vectors. We can achieve a good performance mainly with variational autoencoders [6]. They split the feature vector into two parts: one corresponds to the average and the other to the deviation. In this way, we push the features to have a Gaussian distribution.

2.2. Metric learning

We can also create feature extractors without the necessity to develop both encoders and decoders. However, training only the encoder part, we lack the exact output we would like to get for a given input. We do not know what feature vector we expect; we know only, e.g., the input category. Thus we cannot calculate the gradient from the difference between the actual and expected outputs. Instead, we specify a metric that the mapping of all instances from our dataset should hold. Then we run our network on all inputs and identify the worst pairs of feature vectors: those that are close but have different categories, or are far apart but have the same category. We want to move them further apart or closer together in the latent space, and that direction provides us with the gradient for training the network. After many training cycles, the mapping holds the metric and becomes suitable [7].

Advanced methods, called contrastive learning, employ metrics based on the similarity or diversity of the outputs of two copies of the same network. We feed them with two augmentations of the same instance from the dataset and require similar outcomes, or with different samples, expecting different results. We can train both networks [8] or only one of them [9]. In the second case, one copy is the teacher and the other the student. During the training, we adjust the student network weights and occasionally copy them to the teacher. In this way, we get feature extractors of better quality. Then we can train the corresponding decoders and use them as generators when we feed them with inputs different from the feature vectors of instances from the dataset.
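To make the metric-learning idea concrete, the following is a minimal sketch, not taken from the paper: a small Keras encoder trained with a triplet margin loss that pulls an anchor towards a same-category sample and pushes it away from a different-category sample. The network shape and the margin are illustrative assumptions.

    # Hedged sketch of the metric-learning idea from Sec. 2.2.
    import tensorflow as tf

    encoder = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(32)                       # the feature vector (embedding)
    ])

    def triplet_loss(anchor, positive, negative, margin=0.2):
        d_pos = tf.norm(anchor - positive, axis=-1)     # distance to a same-category sample
        d_neg = tf.norm(anchor - negative, axis=-1)     # distance to a different-category sample
        return tf.reduce_mean(tf.maximum(d_pos - d_neg + margin, 0.0))

    optimizer = tf.keras.optimizers.Adam()

    def train_step(x_anchor, x_positive, x_negative):
        with tf.GradientTape() as tape:
            loss = triplet_loss(encoder(x_anchor), encoder(x_positive), encoder(x_negative))
        grads = tape.gradient(loss, encoder.trainable_variables)
        optimizer.apply_gradients(zip(grads, encoder.trainable_variables))
        return loss

After many such steps over the worst offending pairs or triplets, the embedding starts to hold the metric, which is the behavior described above.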
2.3. Attention

The invention of the attention mechanism in natural language processing enabled us to take the global context into account while processing sequences of words, e.g., for distinguishing the meaning of synonyms from their context. Later it helped to avoid sequential processing and started the transformer revolution in the whole domain of deep learning [10]. For our purposes, however, it is enough to consider it from a mathematical point of view. The attention mechanism works with a set of key-value pairs. Given a query q on the input, it expresses the query as a mixture of the keys K and outputs the analogical mixture of the corresponding values V, where

$$K = \begin{pmatrix} k_1 \\ k_2 \\ \vdots \\ k_l \end{pmatrix} \qquad V = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_l \end{pmatrix}$$

All queries and keys are vectors of dimension n, so K is an l × n matrix. Values and outputs are vectors of dimension m, so V is an l × m matrix. At first, we find coefficients c_i ∈ ⟨0, 1⟩, i = 1, 2, ..., l, such that Σ_i c_i k_i ≈ q and Σ_i c_i = 1. Since we want to mix the query more from keys similar to it and less from different keys, we can define the mixture roughly by the dot product c_i^(0) = q·k_i = ‖q‖‖k_i‖ cos φ_i, where φ_i is the angle between q and k_i. Since cos φ_i is 1 for identical, 0 for orthogonal, and −1 for opposite vectors, we can turn these products into high, middle, and small values from ⟨0, 1⟩ by the softmax function

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_k e^{x_k}}$$

Thus c = softmax(c^(0)/d) = softmax(qK^T/d), where d is a constant that enables us to scale how much we mix from similar keys and how much from different ones. Since the length of the vectors grows with √n and the dot product with n, it is popular to define d = √n. That would mean we also include different and opposite keys, even if one key equals the query. For our purposes, we use a much smaller scale factor, d = ⁵√n (the fifth root of n), since we prefer to mix almost from a single key whenever that key is equal to the query. Having the coefficients of the mixture c, we can mix the values V into the output o = cV. So the complete response of the attention module to a single query q is:

$$A(q, K, V) = \mathrm{softmax}\!\left(\frac{qK^T}{d}\right) V$$

We mention that the typical use of the attention mechanism is so-called self-attention, for which the queries, keys, and values all come from the same input, and we aim to get the compatibility of the query with each key. That is not our case. Instead, we will use the mechanism for mapping two latent spaces: queries and keys are from one space, and values and outputs are from another.
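A minimal numpy sketch of this mapping use of attention, assuming row-stacked key and value matrices as defined above; the scale factor follows the fifth-root choice of this section and should be treated as an adjustable parameter.

    # Attention as a mapping between two latent spaces (Sec. 2.3).
    # K is l x n (keys), V is l x m (values), q is a single query of dimension n.
    import numpy as np

    def softmax(x):
        e = np.exp(x - np.max(x))            # subtract the max for numerical stability
        return e / e.sum()

    def attention(q, K, V, d=None):
        if d is None:
            d = K.shape[1] ** 0.2            # a much smaller scale than sqrt(n)
        c = softmax(q @ K.T / d)             # mixture coefficients over the keys
        return c @ V                         # the analogical mixture of the values

    # toy usage: two associations between a 4-D and a 3-D latent space
    K = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
    V = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
    print(attention(np.array([1.0, 0.0, 0.0, 0.0]), K, V))   # close to the first value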
3. Method

Let us assume we have a system with one feature extractor F and one generator G. The extractor represents perception and the generator action. The extractor implements a function F : I → L_F, where I is the set of system inputs and L_F is their latent space, i.e., it turns the input data into feature vectors. The generator implements a function G : L_G → O, where O is the set of system outputs and L_G is their latent space, i.e., it turns feature vectors into the output data.

We map the spaces L_F and L_G by the attention module A. We gradually build the matrices K and V containing keys and values. Each key-value pair corresponds to an association between two feature vectors representing stimulus and response at the system level. The system employs the generator to act with a = G(v) upon a random value v. As a result, it receives a response r and extracts the key k = F(r). Then, depending on their quality, the system can include k, v into K, V. Thus the mapping between L_F and L_G becomes richer.

On the other hand, when the system receives an input p independent of its own activities, it turns it into a query q = F(p). If the query is equal to one of the keys, i.e., q = k_i, we aim to act with G(v_i). However, it is much more probable that we cannot translate the query so directly. Therefore we act with G(v), where v = A(q, K, V).

So we can summarize the system operation into two processes (procedures ACQUIRE and USE):

Algorithm: Method of attention-based one-shot learning
(F is the extractor, G the generator, A the attention module, K the keys, V the values)

    procedure Acquire(F, G, K, V)
        loop
            v ← random()
            o ← G(v)
            output(o)
            r ← input()
            k ← F(r)
            K ← K ∪ {k}
            V ← V ∪ {v}

    procedure Use(F, G, K, V)
        loop
            p ← input()
            q ← F(p)
            v ← A(q, K, V)
            o ← G(v)
            output(o)

Of course, this schema is not generally applicable. However, where we can apply it, it grants one-shot learning. The system capability still grows gradually, but in steps, without transient states. Each demonstration invokes an immediate faculty to act accordingly in situations close to the seen example. The system also operates somehow under unseen conditions, and the quality of these actions grows with the number of key-value pairs.
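A compact Python rendering of the two procedures may clarify the data flow. Here extract(), generate(), observe(), act(), and random_latent() are hypothetical stand-ins for F, G, the system's input/output, and the random sampling above; attention() is the numpy function sketched in Section 2.3.

    # Hedged sketch of the ACQUIRE / USE loops from the algorithm above.
    import numpy as np

    def acquire(steps, K, V):
        for _ in range(steps):
            v = random_latent()          # a random point of the action latent space
            act(generate(v))             # o = G(v); the robot demonstrates a pose
            r = observe()                # the partner's imitation of that pose
            k = extract(r)               # k = F(r)
            K.append(k)                  # store the new association (one shot)
            V.append(v)

    def use(K, V):
        K_m, V_m = np.array(K), np.array(V)
        while True:
            p = observe()                # an input independent of our own actions
            q = extract(p)               # q = F(p)
            v = attention(q, K_m, V_m)   # v = A(q, K, V)
            act(generate(v))             # o = G(v)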
pose, it is not necessary to train it; we can get it from It was not an easy job because inverse kinematics for a pre-trained model for computer vision (obtained by a iCub was not available to us. Therefore we have started self-supervised method described in Chapter 2.2). How- from the known Denavit-Hartenberg parameters of the ever, we need to invest more effort in generating robot robot, and – using on-the-shelf direct kinematics [16] movements since neither pre-trained models nor datasets – we have implemented the FABRIK algorithm [17] ad- are available to us for the chosen robot. First, we create justed for Denavit-Hartenberg notation and extended by the dataset as a set of the robot joint positions while mov- constraints [18]. It is a slow but fully operational solution ing its arm to random points in the robot’s vicinity. We that not only defines the natural poses of the robot but avoid abnormal setups of robot joints by their calculation speeds up the creation of the dataset. We speed up the by inverse kinematics. Then we train (using Keras [14]) process because we can reliably calculate all data on the the variational autoencoder (see Chapter 2.1) and get the kinematics model, and we do not need to try them on the generator model as its part. Having the extractor and robot. Also, as we see later, it is profitable that our dataset generator, we can define the overall model controlling can contain the Euler coordinates corresponding to the the robot as their integration by the attention module recorded joint setups. We do not use them for model (see Chapter 2.3). Since the system operates in real-time creation, but they are helpful for model visualization. In and calls models, the integration employs a blackboard this way, we have collected all possible poses (Figure 2), architecture [15] that helps us to combine slower and 23470 for each arm. Then we randomly selected 60000 faster processes. Finally, we test the system. examples concerning the equal probability that the robot uses the left arm, the right arm, both arms symmetrically, and both arms in different poses. In the second phase, we used Keras to train the varia- Figure 2: The iCub’s kinematics (on the left: coordinates reachable by the elbow, on the right: coordinates reachable by the wrist). Figure 4: Training the variational autoencoder of iCub’s arms movement. of the right and left arms. We put a point to the picture for each example from the testing set. Its color corresponds to the value of the coordinate. Then fluent color gradient means that the space is well-organized (Figure 6). 4.3. Integration Figure 3: The architecture of the iCub’s actions autoencoder Now we can integrate the extractor and generator mod- (on the left: encoder, on the right: decoder). els into one system. Since such a system needs to com- bine fast data sources like a camera with slower models and languid robot movement, the integration employs a blackboard architecture. Concretely, we use our solution tional autoencoder of the selected joint setups. Since the named Agent-Space architecture [15] to split the system space of iCub’s arm action is not ample, we have used into a set of agents communicating via blackboard and let just ten input, six intermediate, two feature, six interme- the overall control emerge from the individual behaviors diate, and ten output neurons. Of course, we double the of the agents. 
4.2. Generator

The iCub robot arm contains five significant degrees of freedom: two in the shoulder and three in the elbow joints. Altogether, the pose of the left and right arms is coded by ten angles.

Aiming to create a decoder generating accurate poses, we first need a dataset of them. We cannot get it by random generation of joint values because it would also contain unnatural poses, so we need to define what natural means here. To do that, we decided to consider natural the poses generated by inverse kinematics. We asked the robot to move its arm to all reachable coordinates in its vicinity, and whenever it succeeded in reaching the point, we added the current joint setup to the dataset. It was not an easy job because inverse kinematics for iCub was not available to us. Therefore we started from the known Denavit-Hartenberg parameters of the robot and – using off-the-shelf direct kinematics [16] – we implemented the FABRIK algorithm [17] adjusted to the Denavit-Hartenberg notation and extended by constraints [18]. It is a slow but fully operational solution that not only defines the natural poses of the robot but also speeds up the creation of the dataset. We speed up the process because we can reliably calculate all data on the kinematic model and do not need to try them on the robot. Also, as we will see later, it is profitable that our dataset can contain the Euler coordinates corresponding to the recorded joint setups. We do not use them for model creation, but they are helpful for model visualization. In this way, we collected all possible poses (Figure 2), 23470 for each arm. Then we randomly selected 60000 examples, ensuring equal probability that the robot uses the left arm, the right arm, both arms symmetrically, and both arms in different poses.

Figure 2: The iCub's kinematics (on the left: coordinates reachable by the elbow, on the right: coordinates reachable by the wrist).

In the second phase, we used Keras to train the variational autoencoder of the selected joint setups. Since the space of iCub's arm actions is not ample, we used just ten input, six intermediate, two feature, six intermediate, and ten output neurons. Of course, we double the internal structures because the features are the sum of the average and a random multiple of the standard deviation (Figure 3). We used ReLU and tanh activations since we turned the joint angles from −180° to 180° into a code from −1 to 1. Before training, we shuffled the dataset and split it into 50000 training and 10000 testing examples. The training required ten epochs with batch size 32 and took a mere 92s (Figure 4). Finally, we distilled the decoder part of the trained architecture and saved it as our generator. We also converted the generator model from the .h5 format to the .pb format so that we can open it in the OpenCV library.

Figure 3: The architecture of the iCub's actions autoencoder (on the left: encoder, on the right: decoder).

Figure 4: Training the variational autoencoder of iCub's arms movement.

This model can turn any pair of numbers from −1 to 1 into a proper joint setup on the robot (Figure 5). But we further need to check that the space of generated actions is well-organized. In other words, we need to check that a fluent change of the feature vector causes only a fluent shift in the joint setup. Since the feature vector has only two numbers, we can easily visualize its quality in six pictures depicting the x, y, and z coordinates of the right and left arms. We put a point into the picture for each example from the testing set; its color corresponds to the value of the coordinate. A fluent color gradient then means that the space is well-organized (Figure 6).

Figure 5: Examples of iCub's actions generated from the feature vectors.

Figure 6: The x, y, and z coordinates of the right and the left iCub's arm for the testing set. Each point represents one sample, and its color is the value of the coordinate.
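A sketch of such a small variational autoencoder in Keras, following the 10-6-2-6-10 layout, the ReLU/tanh activations, and the [−1, 1] joint coding described in this section; the reparametrization layer and the KL weight are conventional VAE choices assumed here rather than taken from the paper.

    # Hedged Keras sketch of the arms-pose VAE (Sec. 4.2).
    import tensorflow as tf
    from tensorflow.keras import layers, Model

    inp = layers.Input(shape=(10,))                       # ten normalized joint angles
    h = layers.Dense(6, activation="relu")(inp)
    z_mean = layers.Dense(2)(h)                           # "average" part of the feature
    z_logvar = layers.Dense(2)(h)                         # "deviation" part of the feature

    def sample(args):
        m, lv = args
        eps = tf.random.normal(tf.shape(m))
        return m + tf.exp(0.5 * lv) * eps                 # mean + random multiple of std

    z = layers.Lambda(sample)([z_mean, z_logvar])

    dec_in = layers.Input(shape=(2,))
    dec_h = layers.Dense(6, activation="relu")(dec_in)
    dec_out = layers.Dense(10, activation="tanh")(dec_h)  # back to the [-1, 1] joint code
    decoder = Model(dec_in, dec_out, name="generator")    # this part becomes G

    vae = Model(inp, decoder(z))
    kl = -0.5 * tf.reduce_mean(1 + z_logvar - tf.square(z_mean) - tf.exp(z_logvar))
    vae.add_loss(1e-3 * kl)                               # assumed KL weight
    vae.compile(optimizer="adam", loss="mse")
    # vae.fit(poses, poses, epochs=10, batch_size=32)     # poses: a 60000 x 10 array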
4.3. Integration

Now we can integrate the extractor and generator models into one system. Since such a system needs to combine fast data sources like a camera with slower models and languid robot movement, the integration employs a blackboard architecture. Concretely, we use our solution named Agent-Space architecture [15] to split the system into a set of agents communicating via a blackboard and let the overall control emerge from the individual behaviors of the agents.

Our system contains the following agents (Figure 7):

• The camera agent grabs images from the camera and writes them onto the blackboard, where other agents can read image samples according to their processing capacity. (This way, we avoid the delays and overloading that appear if we put grabbing and processing images into the same loop.)

• The perception agent reads the grabbed image from the blackboard, turns it into a blob, feeds the extractor model, and writes the provided feature vector to the blackboard.

• The control agent operates in two modes: ACQUIRE and USE. In the first mode, it collects lists of keys and values corresponding to the feature vectors of the extractor and generator in the following way. First, it randomly generates a feature vector for the generator, writes it to the blackboard, waits, and reads the feature vector provided by the extractor. Then it adds them to the lists. In the second mode, the agent reads the feature vector provided by the extractor and writes the feature vector for the generator calculated by the attention module from the lists of keys and values.

• The action agent reads the feature vector for the generator and controls the iCub robot by the commands of the YARP protocol encapsulated by the pyicubsim library.

• For simplification, at the current stage of development, the examinator specifies the exact time to wait in the ACQUIRE mode. Since his hands are busy imitating the robot's pose, we manage this signaling by whistling. We implemented this input by the pitch agent. It processes the sound by the Fourier transform and looks for high frequencies (see the sketch after this list).

Figure 7: Schema of the integrated system for the imitation game. Circles represent agents, triangles the blackboard, cylinders models, and the letter A the attention module.
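A minimal sketch of such whistle detection on a block of audio samples; the sampling rate, block size, and frequency threshold are illustrative assumptions, not values from the paper.

    # Hedged sketch of the pitch agent's whistle detection (Sec. 4.3): an FFT over a
    # short audio block and a check for one dominant peak at a high frequency.
    import numpy as np

    RATE = 16000          # samples per second (assumed)
    BLOCK = 2048          # samples per analyzed block (assumed)

    def whistle_detected(samples, min_hz=1000.0):
        windowed = samples * np.hanning(len(samples))
        spectrum = np.abs(np.fft.rfft(windowed))
        freqs = np.fft.rfftfreq(len(samples), d=1.0 / RATE)
        peak = int(np.argmax(spectrum))
        # a whistle appears as one dominant spectral peak at a high frequency
        return freqs[peak] >= min_hz and spectrum[peak] > 8.0 * spectrum.mean()

    # usage: feed consecutive blocks of BLOCK samples grabbed from the microphone
    t = np.arange(BLOCK) / RATE
    print(whistle_detected(np.sin(2 * np.pi * 2000.0 * t)))   # 2 kHz tone -> True
    print(whistle_detected(np.random.randn(BLOCK)))           # broadband noise -> typically False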
4.4. Testing

We developed the real-time system incrementally, working first with an off-line version whose quality we can investigate more easily. In this phase, we selected a bunch of ten examinator's poses and created a few images – under varying conditions – for each pose. Then we taught the system, and after each sample, we tested the system's capability to imitate all the poses. The number of operational poses indicated whether the system could forget a learned pose. We found that it forgot none of them. The system even learned one pose implicitly, compounding the correct response from two other already presented poses (Figure 8).

Figure 8: One-shot learning of selected arms poses.

So far, we have not evaluated the real-time version in another way than by the examinator's opinion. In the future, we plan to employ pose detectors for this purpose.

5. Conclusion

In this paper, we introduced a kind of one-shot learning. Its key component is the attention module. We have used this existing component of deep neural networks for a new task: mapping two latent spaces. First, however, we had to adjust one of its parameters: the scale factor.

We demonstrated our approach to one-shot learning on imitation between a human and a humanoid robot. We built our demo from modules developed in a self-supervised way. Thus we avoided using datasets containing particular poses of the person in images or of the robot's body. Instead, our robot has learned them by interacting with the examinator in a one-shot learning way.

For imitation, the robot needs to treat the model of the seen body as analogical to the model of its own body. In humans, it is not clear where this ability originates. But our approach suggests that the imitation game not only solicits this ability but can also help it to emerge. Here, imitation is an ability of a society [19], and one of its members learns it from another (a child from its parent, or a robot from its user). Remarkably, this transfer could rely on the attention module, an essential building block for natural language processing. We could look at language as a kind of imitation related to the movement of the vocal cords; its nature is similar to hand movement. On the other hand, the presented one-shot learning mechanism could play a role in the early evolution of signal-based language.

Our approach also has weaknesses. The major one is that if we use an encoder that stems from general data, the mapping could be relevant only under specific conditions. For example, associations learned in the presented imitation game could be fooled by more persons in front of the camera. On the other hand, the quality of today's self-supervised models does not allow us to cheat the system by, e.g., a different color of the wall or a different dress of the seen person. We could decrease this problem by training the encoder on more specific data under specific conditions. However, it is not easy to imagine that we could manage that in a self-supervised way.

Finally, our method is more general than the imitation game. For example, processing vision, the robot cannot see itself; therefore, it needs the help of an examinator. However, it could apply the same method to seeing itself in the mirror. Or it could similarly process voice. Having a speech generator corresponding to the physical capabilities of vocal cords, lips, and tongue, and a voice listener analogical to the ear, it could start to produce random voices and learn the mapping between the listener's perception and the generator's action. As a result, it could reproduce listened speech when another source makes it.

6. Online Resources

We share the code of this project at GitHub.

References

[1] D. C. Dennett, Kinds of minds: towards an understanding of consciousness, Weidenfeld & Nicolson, London, 1996.
[2] J. P. Bandera, J. A. Rodriguez, L. Molina-Tanco, A. Bandera, A survey of vision-based architectures for robot learning by imitation, International Journal of Humanoid Robotics 9 (2012). doi:10.1142/S0219843612500065, World Scientific Publishing Company.
[3] S. Boucenna, S. Anzalone, E. Tilmont, D. Cohen, M. Chetouani, Learning of social signatures through imitation game between a robot and a human partner, IEEE Transactions on Autonomous Mental Development 6 (2014) 213–225. doi:10.1109/TAMD.2014.2319861.
[4] G. E. Hinton, R. R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (2006). doi:10.1126/science.1127647.
[5] J. Brownlee, Deep Learning for Computer Vision, 1.4 ed., machinelearningmastery.com, 2019.
[6] D. P. Kingma, M. Welling, An introduction to variational autoencoders, Foundations and Trends in Machine Learning 12 (2019) 307–392.
[7] D. King, High quality face recognition with deep metric learning, 2017. URL: http://blog.dlib.net/2017/02/high-quality-face-recognition-with-deep.html.
[8] T. Chen, S. Kornblith, M. Norouzi, G. Hinton, A simple framework for contrastive learning of visual representations, in: Proceedings of the 37th International Conference on Machine Learning, number 149 in ICML, 2020, pp. 1597–1607.
[9] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, A. Joulin, Emerging properties in self-supervised vision transformers, in: Proceedings of the International Conference on Computer Vision, ICCV, 2021.
[10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: 31st International Conference on Neural Information Processing Systems, ACM, Long Beach, 2017.
[11] M. Petrovich, M. J. Black, G. Varol, Action-conditioned 3D human motion synthesis with transformer VAE, in: International Conference on Computer Vision, ICCV, 2021.
[12] D. Vernon, G. Metta, G. Sandini, The iCub cognitive architecture: Interactive development in a humanoid robot, in: 2007 IEEE 6th International Conference on Development and Learning, 2007, pp. 122–127. doi:10.1109/DEVLRN.2007.4354038.
[13] G. Bradski, The OpenCV library, Dr. Dobb's Journal of Software Tools (2000).
[14] F. Chollet, Deep Learning with Python, Manning Publications Co., Greenwich, CT, USA, 2017.
[15] A. Lucny, Building complex systems with agent-space architecture, Computers and Informatics 23 (2004) 1–36.
[16] L. Natale, C. Bartolozzi, F. Nori, G. Sandini, G. Metta, Humanoid Robotics, Springer, Dordrecht, 2017. doi:10.1007/978-94-007-6046-2.
[17] A. Aristidou, J. Lasenby, FABRIK: A fast, iterative solver for the inverse kinematics problem, Graphical Models 73 (2011) 243–260.
[18] R. A. Tenneti, A. Sarkar, Implementation of modified FABRIK for robot manipulators, in: Proceedings of the Advances in Robotics 2019, 2019, pp. 1–6. doi:10.1145/3352593.3352605.
[19] A. Aristidou, J. Lasenby, Embodied gesture processing: Motor-based integration of perception and action in social artificial agents, Cognitive Computing 3 (2011) 419–435. doi:10.1007/s12559-010-9082-z.