Towards One-shot Learning via Attention

Andrej Lúčny¹

¹ Faculty of Mathematics, Physics and Informatics, Comenius University, Mlynská dolina, 842 48 Bratislava, Slovakia

ITAT’22: Information Technologies – Applications and Theory, September 23–27, 2022, Zuberec, Slovakia
lucny@fmph.uniba.sk (A. Lúčny), ORCID 0000-0001-6042-7434
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)


Abstract

Though deep neural networks have enabled us to create systems that would have been incredible ten years ago, most of them still learn gradually and offline. We introduce an approach to overcoming this limitation. We implement it with the well-known attention mechanism, which transforms one latent space into another using a list of key-value pairs that define the correspondence between points in the two spaces. We can express any point in the first space as a mixture of keys and map it to the point in the second space given by the analogous mixture of values. While we train the encoders and decoders of these spaces only gradually, we can collect the keys and values of the transformation online, constantly improving the mapping quality and achieving a perfect mapping of the current situation immediately. We demonstrate our approach to one-shot learning on a simplified imitation game in human-robot interaction, where we map the representation of the robot’s own body to the representation of the examinator’s body seen by the robot.

Keywords

one-shot learning, attention, imitation game, deep learning models, self-supervision



1. Introduction

Artificial intelligence is a rapidly growing domain, mainly due to deep learning technology. Nowadays, the ambition is to achieve general artificial intelligence, so we aim to develop a model that simultaneously processes image, text, and voice, incorporating multimodal knowledge. However, the way we create such a model is still close to developing any model tailored to a particular task; we just need much bigger datasets, more data storage, and much more powerful hardware for training. Nevertheless, it is possible to train a model that can answer what the longest river in Africa is. But the model learns this fact gradually, processing it many times until it can answer correctly. On the other hand, we learn such facts in one shot. It is enough to tell us that the longest river in Africa is the Nile, and we can remember or forget it, but we do not evolve our answer from "blah blah blah" through "Egypt" to "the Nile." In the worst case, we produce errors like "the Mississippi."

Philosophers analyzing natural intelligence point out that what we today call general intelligence is very far from the human one. For example, Daniel Dennett provided a famous analysis that recognized four kinds of minds [1]. The first kind (Darwinian) – although its behavior can be very effective – cannot adapt. Most of our artificial solutions correspond to this type, which in nature we can observe mainly in insects. The second kind (Skinnerian) can adapt gradually, needing many shots to achieve a reasonable probability of intelligent behavior. The behavior of these creatures can change, e.g., by Pavlovian conditioning. The third kind (Popperian) can create mental models and, thanks to that, can adapt behavior suddenly, needing few shots or just one. Finally, the fourth kind (Gregorian) manages language communication to transfer the content of the mental models from one individual to others. In this way, these creatures adapt even without a single shot.

The typical approach of deep learning is to train a Skinnerian system and interpret some observed behavior as Popperian or Gregorian capabilities. Such systems are Skinnerian at the structural level, but higher faculties emerge as side effects. A different approach could look for structural changes that correspond to, e.g., the acquisition of associations. Of course, some parts of an intelligent system can grow only gradually. But, after achieving a certain complexity level, one-shot learning should appear. We suppose that responsibility for this faculty lies with processes different from those behind gradual growth.

In this paper, we outline how it could work. At first, we need to develop structures that map some structured data (like the seen image or a joint setup) into a so-called latent space in which any point codes a reasonable instance of the data. We can provide that by technologies such as autoencoders or contrastive learning that do not need annotations. We need a vast set of samples and a lot of time, but after some time, the mapping converges, and we can fix it. It is essential to mention that though the used examples correspond only to isolated points, any point in the latent space corresponds to an instance of the data. At that moment, we can start a different process that maps one latent space to another, e.g., perception to action. This process creates a list of associations between points in one latent space and the other. For the two mapped points, the new association works correctly and immediately.
However, the more associations we collect, the more accurate the mapping of the whole spaces we get. At any time, we can express any point as a mixture of the associated points from one space and map it to the analogous mixture of their pairs in the other space. In a deep neural network, we can provide this by the attention module. As a result, the system can learn the presented example in one shot and use the knowledge to approximate in situations never seen before.

We test this one-shot learning process on a modern reimplementation of an imitation game with a robot, introduced in [2, 3]. We aim to learn the ability to imitate arm movements, for which we need associations between the robot’s body and the body seen by the robot’s camera. In the first phase, the robot invites us to imitate it. It generates various arm poses, and the human imitates them with their body in front of the robot’s camera. In the second phase, the robot mimics the human by the associations learned during the first phase.

We present details of our approach in the third chapter, after discussing the related works in Chapter 2. Then we deal with its demonstration in Chapter 4. Finally, we discuss its quality and the pros and cons.

2. Related Works

Looking at how to initiate one-shot learning, we prefer self-supervised gradual methods, i.e., gradient-based methods working with unlabelled data. Only these methods could correspond to how living creatures or robots in changing environments learn. We know several such ways and pay attention to autoencoders and metric learning. Their task is to provide us (by a gradual process) with extractors that map raw data into a latent space and generators that can turn any value from the latent space into raw data. Finally, the attention mechanism enables us to map one latent space to the other.

2.1. Autoencoders

Autoencoders [4] are predecessors of the early convolutional networks. They contain blocks of convolutional layers interleaved with dimension reduction in the first half and dimension expansion in the second half. Thus data like images, with a typical dimension of hundreds of thousands, are sequentially reduced to a feature vector with a size of hundreds or thousands and then expanded to the original extent. We train them on an unlabelled dataset to respond with an output equal to the input. If the training is successful, each part of the processing sequence contains the same information. As a result, the feature vectors carry the same information as the images from the dataset. Then we cut away the part of the neural network after the feature vector and get the extractor (encoder). Or we remove the part before the feature vector and get the generator (decoder). Feature vectors of all instances of the raw data constitute a latent space with a dimension equal to the size of the feature vector.

The distribution of the feature vectors in the latent space is crucial [5]. We prefer to have similar data mapped to similar vectors. We can achieve good performance mainly with variational autoencoders [6]. They split the feature vector into two parts: one corresponds to the average and the other to the deviation. In this way, we push the features to have a Gaussian distribution.

2.2. Metric learning

We can also create feature extractors without the necessity to develop both encoders and decoders. However, when training only the encoder part, we lack the exact output we would like to get for a given input. We do not know what feature vector we expect; we know only, e.g., the input category. Thus we cannot calculate the gradient from the difference between the actual and expected outputs. So, instead, we specify a metric that the mapping of all instances from our dataset should hold. Then we try our network on all inputs and identify the worst pairs of feature vectors – those that are close but have different categories, or are far apart but have the same category. We want to move them further apart or closer together in the latent space, and that direction provides us with the gradient for training the network. After many training cycles, the mapping holds the metric and becomes suitable [7].

Advanced methods, called contrastive learning, employ metrics based on the similarity or diversity of outputs of two copies of the same network. We feed them either with two augmentations of the same instance from the dataset, requiring similar outcomes, or with different samples, expecting different results. We can train both networks [8] or only one of them [9]. In the second case, one copy is the teacher and the other the student. During the training, we adjust the student network weights and occasionally copy them to the teacher.

In this way, we get feature extractors of better quality. Then we can train the corresponding decoders and use them as generators when we feed them with inputs different from the feature vectors of instances from the dataset.
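As an illustration of the metric-learning idea, the following is a minimal sketch of a triplet-style objective in Keras/TensorFlow; the encoder architecture, margin, and data pipeline are illustrative assumptions, not the setup used later in this paper.

```python
# A minimal, illustrative sketch of metric learning with a triplet-style loss.
# The encoder architecture, margin, and data pipeline are assumptions.
import tensorflow as tf

def make_encoder(input_dim=784, feat_dim=32):
    # Hypothetical small encoder mapping raw data to feature vectors.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(input_dim,)),
        tf.keras.layers.Dense(feat_dim),
    ])

def triplet_loss(anchor, positive, negative, margin=0.5):
    # Pull same-category pairs together, push different-category pairs apart.
    pos = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    neg = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    return tf.reduce_mean(tf.maximum(pos - neg + margin, 0.0))

encoder = make_encoder()
optimizer = tf.keras.optimizers.Adam(1e-3)

@tf.function
def train_step(a, p, n):
    # One gradient step on a batch of (anchor, positive, negative) samples.
    with tf.GradientTape() as tape:
        loss = triplet_loss(encoder(a), encoder(p), encoder(n))
    grads = tape.gradient(loss, encoder.trainable_variables)
    optimizer.apply_gradients(zip(grads, encoder.trainable_variables))
    return loss
```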
2.3. Attention

The invention of the attention mechanism in natural language processing enabled us to take the global context into account while processing sequences of words, e.g., for distinguishing the meaning of synonyms upon their context. Then it helped to avoid sequential processing and started the transformer revolution in the whole domain of deep learning [10]. But, for our purposes, it is enough to consider it from a mathematical point of view. The attention mechanism works with a set of key-value pairs. Having a query 𝑞 on input, we express the query as a mixture of the keys 𝐾 and output the analogous mixture of the corresponding values 𝑉, where:

$$K = \begin{pmatrix} k_1 \\ k_2 \\ \vdots \\ k_l \end{pmatrix}, \qquad V = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_l \end{pmatrix}$$

All queries and keys are vectors of dimension 𝑛, so 𝐾 is an 𝑙 × 𝑛 matrix. Values and outputs are vectors of dimension 𝑚, so 𝑉 is an 𝑙 × 𝑚 matrix. At first, we find coefficients $c_i \in \langle 0, 1 \rangle$ such that $\sum_i c_i k_i = q$, $\sum_i c_i = 1$, for $i = 1, 2, \ldots, l$. Since we like to mix the query more from keys similar to it and less from different keys, we can define the mixture roughly by the dot product $c_i^{(0)} = q \cdot k_i = \|q\|\|k_i\| \cos \varphi_i$, where $\varphi_i$ is the angle between 𝑞 and $k_i$. Since $\cos \varphi_i$ is 1 for identical, 0 for orthogonal, and −1 for opposite vectors, we can turn these products into high, middle, and small values from ⟨0, 1⟩ by $\mathrm{softmax}(x)_i = e^{x_i} / \sum_k e^{x_k}$. Thus $c = \mathrm{softmax}(c^{(0)}/d) = \mathrm{softmax}(qK^T/d)$, where 𝑑 is a constant that enables us to scale how much we mix from similar keys and how much from different ones. Since the length of the vectors grows with $\sqrt{n}$ and the dot product with 𝑛, it is popular to define $d = \sqrt{n}$. That would mean we include different and opposite keys even if one key equals the query. For our purposes, we use a much smaller scale factor $d = \sqrt{n}/5$, since we prefer to mix almost from a single key if that key is equal to the query. Having the coefficients of the mixture 𝑐, we can mix the values 𝑉 into the output 𝑜 = 𝑐𝑉. So, the complete response of the attention module to a single query 𝑞 is:

$$A(q, K, V) = \mathrm{softmax}\!\left(\frac{qK^T}{d}\right) V$$

Yet we mention that the typical use of the attention mechanism is the so-called self-attention, for which the queries, keys, and values come from the same input, and we aim to get the compatibility of the query with each key. However, this is not our case. Instead, we will use the mechanism for mapping two latent spaces: queries and keys are from one space, and values and outputs are from another.

3. Method

Let us assume we have a system with one feature extractor 𝐹 and one generator 𝐺. The extractor represents perception and the generator action. The extractor implements a function 𝐹 : 𝐼 → 𝐿𝐹, where 𝐼 is the set of system inputs and 𝐿𝐹 is their latent space, i.e., it turns the input data into feature vectors. The generator implements a function 𝐺 : 𝐿𝐺 → 𝑂, where 𝑂 is the set of system outputs and 𝐿𝐺 is their latent space, i.e., it turns feature vectors into the output data.

We map the spaces 𝐿𝐹 and 𝐿𝐺 by the attention module 𝐴. We gradually build the matrices 𝐾 and 𝑉 containing keys and values. Each key-value pair corresponds to an association between two feature vectors representing a stimulus and a response at the system level. The system employs the generator to act with 𝑎 = 𝐺(𝑣) upon a random value 𝑣. As a result, it receives a response 𝑟 and extracts the key 𝑘 = 𝐹(𝑟). Then, depending on their quality, the system can include 𝑘, 𝑣 into 𝐾, 𝑉. Thus, the mapping between 𝐿𝐹 and 𝐿𝐺 becomes richer.

On the other hand, when the system receives an input 𝑝 independent of the system’s activities, it turns it into a query 𝑞 = 𝐹(𝑝). If the query is equal to one of the keys, i.e., 𝑞 = 𝑘𝑖, we aim to act with 𝐺(𝑣𝑖). However, it is much more probable that we cannot translate the query so directly. Therefore we act with 𝐺(𝑣), where 𝑣 = 𝐴(𝑞, 𝐾, 𝑉).

So, we can summarize the system operation into two processes (procedures ACQUIRE and USE):

Algorithm: Attention-based one-shot learning
  𝐹 is the extractor, 𝐺 the generator
  𝐴 is the attention module, 𝐾 the keys, 𝑉 the values

  procedure Acquire(𝐹, 𝐺, 𝐾, 𝑉)
      loop
          𝑣 ← random()
          𝑜 ← 𝐺(𝑣)
          output(𝑜)
          𝑟 ← input()
          𝑘 ← 𝐹(𝑟)
          𝐾 ← 𝐾 ∪ {𝑘}
          𝑉 ← 𝑉 ∪ {𝑣}

  procedure Use(𝐹, 𝐺, 𝐾, 𝑉)
      loop
          𝑝 ← input()
          𝑞 ← 𝐹(𝑝)
          𝑣 ← 𝐴(𝑞, 𝐾, 𝑉)
          𝑜 ← 𝐺(𝑣)
          output(𝑜)

Of course, this schema is not generally applicable. However, if we can apply it, it grants one-shot learning. The system capability still grows gradually, but in steps, without transient states. Each demonstration invokes an immediate faculty to act accordingly in situations close to the seen example. The system also operates somehow upon unseen conditions, and the quality of these actions grows with the number of key-value pairs.
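The following minimal NumPy sketch illustrates the attention mapping 𝐴(𝑞, 𝐾, 𝑉) with the scale factor $\sqrt{n}/5$ and the ACQUIRE/USE procedures; the extractor, generator, and I/O callables are hypothetical placeholders for the models of a particular system, not part of our implementation.

```python
# A minimal NumPy sketch of the attention mapping A(q, K, V) and of the
# ACQUIRE/USE procedures above; extractor, generator, read_input, and act
# are hypothetical placeholders.
import numpy as np

def attention(q, K, V, d=None):
    # K: (l, n) keys, V: (l, m) values, q: (n,) query.
    if d is None:
        d = np.sqrt(K.shape[1]) / 5          # the sharper scale factor used in this paper
    logits = (q @ K.T) / d
    c = np.exp(logits - logits.max())        # numerically stable softmax
    c /= c.sum()
    return c @ V                             # the analogous mixture of values

def acquire(extractor, generator, K, V, read_input, act, steps=10):
    # K: (l, n), V: (l, m); they may start as empty arrays of shape (0, n) and (0, m).
    for _ in range(steps):
        v = np.random.uniform(-1.0, 1.0, size=V.shape[1])
        act(generator(v))                    # act upon a random latent value
        k = extractor(read_input())          # key extracted from the observed response
        K = np.vstack([K, k])
        V = np.vstack([V, v])
    return K, V

def use(extractor, generator, K, V, read_input, act):
    # Turn an independent input into a query and act through the associations.
    q = extractor(read_input())
    act(generator(attention(q, K, V)))
```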
4. Demonstration and Evaluation

To demonstrate the method, we deal with the imitation game between a human and a humanoid robot. Though it is possible to implement this task with classic computer vision [3] and deep learning [11], for our purposes it is meaningful to re-implement it in a novel way. In this game, a humanoid robot with a body similar to a human’s invites a human to mimic its movements. If the human accepts the invitation and imitates the robot, the robot learns how to mimic the human. As a result, the robot can mimic the movements of humans (Figure 1).

Figure 1: The imitation game.

For implementation, we need a humanoid robot; we employ iCubSim, the simulator of the iCub robot [12], equipped with an external camera. We control it from Python via the pyicubsim, ONNX-runtime, and OpenCV [13] libraries. Further, we need a feature extractor that turns images seen by the robot into feature vectors. For our purpose, it is not necessary to train it; we can get it from a pre-trained model for computer vision (obtained by a self-supervised method described in Chapter 2.2). However, we need to invest more effort in generating robot movements since neither pre-trained models nor datasets are available to us for the chosen robot. First, we create the dataset as a set of the robot joint positions recorded while moving its arm to random points in the robot’s vicinity. We avoid abnormal setups of the robot joints by calculating them with inverse kinematics. Then we train (using Keras [14]) the variational autoencoder (see Chapter 2.1) and obtain the generator model as its part. Having the extractor and the generator, we can define the overall model controlling the robot as their integration by the attention module (see Chapter 2.3). Since the system operates in real-time and calls models, the integration employs a blackboard architecture [15] that helps us to combine slower and faster processes. Finally, we test the system.

4.1. Extractor

We turn images into features by the pre-trained model DINO [9]. In detail, we use dino_deits8.onnx – the middle-sized version of this vision-transformer backbone, distributed in the ONNX format. Though it turns color images with resolution 224x224 into feature vectors of a mere 384 numbers, its quality is remarkable, demonstrated by several successful applications, including pose detection. Thus we are sure that the vector also contains information representing the pose of the person in the image. But, of course, the pose is in a raw form: we use the backbone only, while the applications mentioned above add further processing layers. The model is relatively large, but its middle-sized version can fit into a 4GB GPU. Moreover, its inference takes 0.05s on an ordinary gaming notebook; thus, it is very suitable for building real-time applications.
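For illustration, the following is a minimal sketch of how such an extractor can be wrapped in Python; the tensor names, normalization constants, and exact output shape are assumptions about the exported ONNX model rather than details taken from our implementation.

```python
# A minimal sketch of feature extraction with the DINO ViT-S/8 ONNX backbone.
# The input/output tensor names and exact preprocessing are assumptions that
# depend on how the model was exported.
import cv2
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("dino_deits8.onnx")
input_name = session.get_inputs()[0].name

def extract(image_bgr):
    # Resize to 224x224, convert BGR->RGB, scale to [0,1], normalize (ImageNet stats).
    img = cv2.resize(image_bgr, (224, 224))
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    img = (img - [0.485, 0.456, 0.406]) / [0.229, 0.224, 0.225]
    blob = img.transpose(2, 0, 1)[None].astype(np.float32)   # NCHW batch of one
    features = session.run(None, {input_name: blob})[0]      # e.g. shape (1, 384)
    return features[0]
```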
4.2. Generator

The iCub robot arm has five significant degrees of freedom: two in the shoulder and three in the elbow joints. Altogether, the pose of the left and right arms is coded by ten angles.

Aiming to create a decoder generating accurate poses, we first need to get a dataset of them. We cannot get it by random generation of joints because it would also contain unnatural poses. So we need to define what natural means here. To do that, we have decided to consider natural the poses generated by inverse kinematics. We have asked the robot to move its arm to all possible coordinates in its vicinity, and if it succeeded in reaching the point, we have added the current joint setup to the dataset. It was not an easy job because inverse kinematics for iCub was not available to us. Therefore we have started from the known Denavit-Hartenberg parameters of the robot, and – using off-the-shelf direct kinematics [16] – we have implemented the FABRIK algorithm [17], adjusted for the Denavit-Hartenberg notation and extended by constraints [18]. It is a slow but fully operational solution that not only defines the natural poses of the robot but also speeds up the creation of the dataset. We speed up the process because we can reliably calculate all data on the kinematic model, and we do not need to try them on the robot. Also, as we will see later, it is profitable that our dataset can contain the Cartesian coordinates corresponding to the recorded joint setups. We do not use them for model creation, but they are helpful for model visualization. In this way, we have collected all possible poses (Figure 2), 23470 for each arm. Then we randomly selected 60000 examples, ensuring equal probability that the robot uses the left arm, the right arm, both arms symmetrically, and both arms in different poses.

Figure 2: The iCub’s kinematics (on the left: coordinates reachable by the elbow, on the right: coordinates reachable by the wrist).

In the second phase, we used Keras to train the variational autoencoder of the selected joint setups. Since the space of iCub’s arm actions is not ample, we have used just ten input, six intermediate, two feature, six intermediate, and ten output neurons. Of course, we double the internal structures because the features are the sum of the average and a random multiple of the standard deviation (Figure 3). We have used ReLU and tanh activations since we turned the joint angles from the range −180° to 180° into a code from −1 to 1. Before training, we shuffled the dataset and split it into 50000 training and 10000 testing examples. The training required ten epochs with batch size 32 and took a mere 92s (Figure 4). Finally, we have distilled the decoder part of the trained architecture and saved it as our generator. Yet we have converted the generator model from the .h5 format to the .pb format that we can open in the OpenCV library.

Figure 3: The architecture of the iCub’s actions autoencoder (on the left: encoder, on the right: decoder).

Figure 4: Training the variational autoencoder of iCub’s arms movement.

This model can turn any pair of numbers from −1 to 1 into a proper joint setup on the robot (Figure 5). But, further, we need to check that the space of generated actions is well-organized. In other words, we need to check that a fluent change of the feature vector causes only a fluent shift in the joint setup. Since the feature vector has only two numbers, we can easily visualize its quality in six pictures depicting the x, y, and z coordinates of the right and left arms. We put a point into the picture for each example from the testing set. Its color corresponds to the value of the coordinate. Then a fluent color gradient means that the space is well-organized (Figure 6).

Figure 5: Examples of iCub’s actions generated from the feature vectors.

Figure 6: The x, y, and z coordinates of the right and the left iCub’s arm for the testing set. Each point represents one sample, and its color is the value of the coordinate.
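For illustration, a minimal Keras sketch of such a small variational autoencoder (10–6–2–6–10 neurons) follows; the loss weighting, optimizer, and training details are assumptions rather than our exact configuration, and the decoder part corresponds to the generator we keep.

```python
# A minimal sketch of the small variational autoencoder described above
# (10 -> 6 -> 2 -> 6 -> 10 neurons); the KL weight and optimizer are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

class Sampling(layers.Layer):
    def call(self, inputs):
        mean, log_var = inputs
        eps = tf.random.normal(tf.shape(mean))
        return mean + tf.exp(0.5 * log_var) * eps   # average + random multiple of deviation

# Encoder: 10 joint angles (scaled to [-1, 1]) -> 2-dimensional feature vector.
enc_in = layers.Input(shape=(10,))
h = layers.Dense(6, activation="relu")(enc_in)
z_mean = layers.Dense(2)(h)
z_log_var = layers.Dense(2)(h)
z = Sampling()([z_mean, z_log_var])
encoder = Model(enc_in, [z_mean, z_log_var, z])

# Decoder (the generator we keep): 2 -> 6 -> 10 with tanh output in [-1, 1].
dec_in = layers.Input(shape=(2,))
g = layers.Dense(6, activation="relu")(dec_in)
dec_out = layers.Dense(10, activation="tanh")(g)
decoder = Model(dec_in, dec_out)

class VAE(Model):
    def train_step(self, data):
        with tf.GradientTape() as tape:
            mean, log_var, latent = encoder(data)
            recon = decoder(latent)
            rec_loss = tf.reduce_mean(tf.reduce_sum(tf.square(data - recon), axis=1))
            kl = -0.5 * tf.reduce_mean(
                tf.reduce_sum(1 + log_var - tf.square(mean) - tf.exp(log_var), axis=1))
            loss = rec_loss + 1e-3 * kl   # assumed KL weight
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        return {"loss": loss}

vae = VAE(enc_in, decoder(encoder(enc_in)[2]))
vae.compile(optimizer="adam")
# vae.fit(poses, epochs=10, batch_size=32)  # poses: (N, 10) array of scaled joint angles
```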
4.3. Integration

Now we can integrate the extractor and generator models into one system. Since such a system needs to combine fast data sources like a camera with slower models and languid robot movement, the integration employs a blackboard architecture. Concretely, we use our solution named Agent-Space architecture [15] to split the system into a set of agents communicating via a blackboard and let the overall control emerge from the individual behaviors of the agents.

Our system contains the following agents (Figure 7); a minimal code sketch of this blackboard-style integration is given after the list:

• The camera agent grabs images from the camera and writes them onto the blackboard, where other agents can read image samples according to their processing capacity. (This way, we avoid the delays and overloading that appear if we put grabbing and processing images into the same loop.)

• The perception agent reads the grabbed image from the blackboard, turns it into a blob, feeds the extractor model, and writes the provided feature vector to the blackboard.

• The control agent operates in two modes: ACQUIRE and USE. In the first mode, it collects lists of keys and values corresponding to the feature vectors of the extractor and generator in the following way. First, it randomly generates a feature vector for the generator, writes it to the blackboard, waits, and reads the feature vector provided by the extractor. Then it adds them to the lists. In the second mode, the agent reads the feature vector provided by the extractor and writes the feature vector for the generator calculated by the attention module from the lists of keys and values.

• The action agent reads the feature vector for the generator and controls the iCub robot by the commands of the YARP protocol encapsulated by the pyicubsim library.

• For simplification, at the current stage of development, the examinator specifies the exact time to wait in the ACQUIRE mode. Since his hands are busy imitating the robot’s pose, we manage this signaling by whistling. We implemented this input by the pitch agent. It processes sound by the Fourier transform and looks for high frequencies.
Figure 7: Schema of the integrated system for the imitation game. Circles represent agents, triangles the blackboard, cylinders models, and the letter A the attention module.

4.4. Testing

We have developed the real-time system incrementally, working with an off-line version whose quality we can investigate more easily. In this phase, we selected a bunch of ten examinator’s poses and created a few images – under varying conditions – for each pose. Then we taught the system, and after each sample, we tested the system’s capability to imitate all the poses. The number of operational poses indicated whether the system could forget a learned pose. We found that it forgot none of them. The system even learned one pose implicitly, compounding the correct response from two other already presented poses (Figure 8).

So far, we have not evaluated the real-time version in any other way than by the examinator’s opinion. In the future, we plan to employ pose detectors for this purpose.

Figure 8: One-shot learning of selected arm poses.

5. Conclusion

In this paper, we introduced a kind of one-shot learning. Its key component is the attention module. We have used this existing component of deep neural networks for a new task: mapping two latent spaces. First, however, we had to adjust one of its parameters: the scale factor.

We demonstrated our approach to one-shot learning on imitation between a human and a humanoid robot. We built our demo from modules developed in a self-supervised way. Thus we avoided using datasets containing particular poses of the person in images and of the robot’s body. Instead, our robot has learned them by interacting with the examinator in a one-shot learning way.

For imitation, the robot needs to get a model of the seen body that is analogical to the model of its own body. In humans, it is not clear where this ability originates. But our approach suggests that the imitation game not only solicits this ability but can also help it to emerge. Here imitation is an ability of a society [19], and one of its members learns it from another (a child from its parent or a robot from its user). Remarkably, this transfer could rely on the attention module, an essential building block for natural language processing. We could look at language as a kind of imitation related to the movement of the vocal cords; its nature is similar to hand movement. On the other hand, the presented one-shot learning mechanism could play a role in the early evolution of a signal-based language.

Our approach also has weaknesses. The major one is that if we use an encoder that stems from general data, the mapping could be relevant only under specific conditions. For example, associations learned in the presented imitation game could be fooled by more persons in front of the camera. On the other hand, the quality of today’s self-supervised models does not allow us to cheat the system by, e.g., a different color of the wall or a different dress of the seen person. We could decrease this problem by training the encoder on more specific data under specific conditions. However, it is not easy to imagine that we could manage that in a self-supervised way.

Finally, our method is more general than an imitation game. For example, processing vision, the robot cannot see itself; therefore, it needs the help of an examinator. However, it could apply the same method to seeing itself in the mirror. Or it could similarly process voice. Having a speech generator corresponding to the physical capabilities of vocal cords, lips, and tongue, and a voice listener analogical to the ear, it could start to produce random voices and learn the mapping between the listener’s perception and the generator’s action. As a result, it can reproduce the listened speech when another source produces it.

References

[1] D. C. Dennett, Kinds of minds: towards an understanding of consciousness, Weidenfeld & Nicolson, London, 1996.
[2] J. P. Bandera, J. A. Rodriguez, L. Molina-Tanco, A. Bandera, A survey of vision-based architectures for robot learning by imitation, International Journal of Humanoid Robotics 9 (2012). doi:10.1142/S0219843612500065, World Scientific Publishing Company.
[3] S. Boucenna, S. Anzalone, E. Tilmont, D. Cohen, M. Chetouani, Learning of social signatures through imitation game between a robot and a human partner, IEEE Transactions on Autonomous Mental Development 6 (2014) 213–225. doi:10.1109/TAMD.2014.2319861.
[4] G. E. Hinton, R. R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (2006). doi:10.1126/science.1127647.
[5] J. Brownlee, Deep Learning for Computer Vision, 1.4 ed., machinelearningmastery.com, 2019.
[6] D. P. Kingma, M. Welling, An introduction to variational autoencoders, Foundations and Trends in Machine Learning 12 (2019) 307–392.
[7] D. King, High quality face recognition with deep metric learning, 2017. URL: http://blog.dlib.net/2017/02/high-quality-face-recognition-with-deep.html.
[8] T. Chen, S. Kornblith, M. Norouzi, G. Hinton, A simple framework for contrastive learning of visual representations, in: Proceedings of the 37th International Conference on Machine Learning, number 149 in ICML, 2020, pp. 1597–1607.
[9] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, A. Joulin, Emerging properties in self-supervised vision transformers, in: Proceedings of the International Conference on Computer Vision, ICCV, 2021.
[10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: 31st International Conference on Neural Information Processing Systems, ACM, Long Beach, 2017.
[11] M. Petrovich, M. J. Black, G. Varol, Action-conditioned 3D human motion synthesis with transformer VAE, in: International Conference on Computer Vision, ICCV, 2021.
[12] D. Vernon, G. Metta, G. Sandini, The iCub cognitive architecture: Interactive development in a humanoid robot, in: 2007 IEEE 6th International Conference on Development and Learning, 2007, pp. 122–127. doi:10.1109/DEVLRN.2007.4354038.
[13] G. Bradski, The OpenCV library, Dr. Dobb's Journal of Software Tools (2000).
[14] F. Chollet, Deep Learning with Python, Manning Publications Co., Greenwich, CT, USA, 2017.
[15] A. Lúčny, Building complex systems with agent-space architecture, Computers and Informatics 23 (2004) 1–36.
[16] L. Natale, C. Bartolozzi, F. Nori, G. Sandini, G. Metta, Humanoid Robotics, Springer, Dordrecht, 2017. doi:10.1007/978-94-007-6046-2.
[17] A. Aristidou, J. Lasenby, FABRIK: A fast, iterative solver for the inverse kinematics problem, Graphical Models 73 (2011) 243–260.
[18] R. A. Tenneti, A. Sarkar, Implementation of modified FABRIK for robot manipulators, in: Proceedings of the Advances in Robotics 2019, 2019, pp. 1–6. doi:10.1145/3352593.3352605.
[19] A. Sadeghipour, S. Kopp, Embodied gesture processing: Motor-based integration of perception and action in social artificial agents, Cognitive Computation 3 (2011) 419–435. doi:10.1007/s12559-010-9082-z.

6. Online Resources

We share the code of this project on GitHub.