<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards one-shot Learning via Attention</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrej Lucny</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Mathematics</institution>
          ,
          <addr-line>Physics and Informatics</addr-line>
          ,
          <institution>Comenius University</institution>
          ,
          <addr-line>Mlynska Dolina Bratislava 84248</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Though the deep neural networks enabled us to create systems that would be incredible ten years ago, still most of them learn gradually and ofline. We introduce an approach to how to overcome this limitation. We have implemented it by the well-known attention mechanism that transforms one latent space into another using a list of key-value pairs defining the correspondence between points in the two spaces. We can express any point in the first space as a mixture of keys and map it to a point in the second space that is an analogical mixture of values. While encoders and decoders of these spaces we train only gradually, the keys and values of the transformation we can collect online so that we constantly improve the mapping quality, achieving the perfect mapping of the current situation immediately. We demonstrate our approach to one-shot learning on the simplified imitation game in the human-robot interaction, where we map the representations of the robot's body and the examinator's body seen by the robot.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;one-shot learning</kwd>
        <kwd>attention</kwd>
        <kwd>imitation game</kwd>
        <kwd>deep learning models</kwd>
        <kwd>self-supervision</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>The behavior of these creatures can change, e.g., by Pavlo</title>
        <p>vian conditioning. The third kind (Popperian) can create
Artificial intelligence is a rapidly growing domain mainly mental models and, thanks to that, can suddenly adapt
due to deep learning technology. Nowadays, the ambi- behavior, needing few or one shot. Finally, the fourth
tion is to achieve general artificial intelligence, so we aim kind (Gregorian) manages language communication to
to develop a model that simultaneously processes image, transfer the content of the mental models from one
inditext, and voice, incorporating multimodal knowledge. vidual to others. In this way, these creatures adapt even
However, how we create the model is still close to devel- without a single shot.
oping any model tailored to a particular task; we need The typical approach of deep learning is to train a
just much bigger datasets, data storage, and much more Skinnerian system and interpret some observed behavior
powerful hardware for training. Nevertheless, it is pos- as Popperian or Gregorian capabilities. Such systems
sible to train a model that can answer what the longest are Skinnerian at the structural level, but higher faculties
river in Africa is. But the model learns this fact gradually, emerge as side efects. A diferent approach could look for
processing it many times until it can answer correctly. structural changes that correspond to, e.g., the acquisition
On the other hand, we learn such facts in one shot. It is of associations. Of course, some parts of an intelligent
enough to tell us that the longest river in Africa is the system can grow only gradually. But, after achieving a
Nile, and we can remember or forget it, but we avoid certain complexity level, one-shot learning should appear.
evolving answers from "blah blah blah" through "Egypt" We suppose that responsibility for this faculty is laying
to "the Nile." In the worst case, we produce errors like on processes diferent from those for gradual growth.
"the Mississippi." In this paper, we outline how it could work. At first, we</p>
        <p>
          Philosophers analyzing natural intelligence enlight need to develop structures that map some structured data
that what today we call general intelligence is very far (like the seen image or joint setup) into a (so-called latent)
from the human one. For example, Daniel Dennet pro- space in which any point codes a reasonable instance of
vided a famous analysis that recognized four kinds of the data. We can provide that by technologies such as
minds [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The first kind (Darwinian) – although its be- autoencoders or contrastive learning that do not need
havior can be very efective – cannot adapt. Most of our annotations. We need a vast set of samples and a lot of
artificial solutions correspond to this type that, in na- time, but after some time, the mapping converges, and
ture, we can observe mainly on insects. The second kind we can fix it. It is essential to mention that though the
(Skinnerian) can adapt gradually, needing many shots to used examples correspond only to isolated points, any
achieve a reasonable probability of intelligent behavior. point in the latent space corresponds to an instance of
the data. At that moment, we can start a diferent process
ITAT’22: Information technologies – Applications and Theory, Septem- that maps one latent space to another, e. g. perception to
ber 23–27, 2022, Zuberec, Slovakia action. This process creates a list of associations between
$ l0u0c0n0y-0@00fm1-p6h04.u2n-7ib4a3.4sk(A( A..LLuucncny)y) points in one latent space and the other. For the two
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License mapped points, the new association works correctly and
CPWrEooUrckReshdoinpgs IhStpN:/c1e6u1r3-w-0s.o7r3g ACttEribUutRion W4.0oInrtekrnsahtioonpal (PCCroBYce4.0e).dings (CEUR-WS.org) immediately. However, the more associations we collect,
the more accurate the mapping of the whole spaces we generator (decoder). Feature vectors of all instances of
get. Anytime we can express any point as a mixture the raw data constitute a latent space with a dimension
of the associated points from one space and map it to equal to the size of the feature vector.
an analogical mix from their pairs in the other space. The distribution of the feature vectors in the latent
In the deep neural network, we can provide it by the space is crucial [5]. We prefer to have similar data
Attention module. As a result, the system can learn the mapped to similar vectors. We can achieve a good
perforpresented example in one shot and use the knowledge mance mainly by the variational autoencoders [6]. They
for approximation for situations never seen. split the feature vector into two parts: one corresponds
        </p>
        <p>
          We test this one-shot learning process on a modern to average and the other to deviation. In this way, we
reimplementation of an imitation game with a robot, in- push features to have the Gaussian distribution.
troduced in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. We aim to learn the ability to imitate
arm movements, for which we need associations between 2.2. Metric learning
the robot’s body and the body seen by the robot’s camera.
        </p>
        <p>In the first phase, the robot invites us to imitate it. It
generates various arm poses, and the human imitates them
with its body in front of the robot’s camera. In the
second phase, the robot mimics the human by associations
learned during the first phase.</p>
        <p>We present details of our approach in the third chapter
after discussing the related works in Chapter 2. Then
we deal with its demonstration in Chapter 4. Finally, we
discuss quality and the pros and cons.</p>
        <p>We can create feature extractors also without the
necessity to develop both encoders and decoders. However,
training only the encoder part, we lack the exact output
we like to get for a given input. We do not know what
feature vector we expect. We know only, e. g., the input
category. Thus we cannot calculate the gradient from
the diference between the actual and expected outputs.</p>
        <p>So, instead, we specify a metric that the mapping of all
instances from our dataset should hold. Then we try our
network with all inputs and identify the worst pairs of
feature vectors that are close, have diferent categories, or
2. Related Works are far, and have the same type. We want to move them
in the latent space further or closer, and that direction
Looking at how to initiate one-shot learning, we pre- provides us with the gradient for training the network.
fer self-supervised gradual methods, i.e., gradient-based After many training cycles, the mapping holds the metric
methods working with unlabelled data. Only these meth- and becomes suitable [7].
ods could correspond to how living creatures or robots Advanced methods, called contrastive learning,
emin changing environments learn. We know several such ploy metrics based on the similarity or diversity of
outways and pay attention to autoencoders and metric learn- puts of two copies of the same network. We feed them
ing. Their task is to provide us (by a gradual process) with two augmentations of the same instance from the
with extractors that map raw data into a latent space and dataset and require similar outcomes or with diferent
generators that can turn any value from the latent space samples expecting diferent results. We can train both
into raw data. Finally, the attention mechanism enables networks [8] or only one of them [9]. In the second case,
us to map one latent space to the other. one copy is the teacher and the other student. During
the training, we adjust the student network weights and
2.1. Autoencoders occasionally copy them to the teacher.
In this way, we get feature extractors of better
qualAutoencoders [4] are predecessors of the early convolu- ity. Then we can train the corresponding decoders and
tional networks. They contain blocks of convolutional use them as generators when we feed them with inputs
layers interleaved by dimension reduction in the first half diferent from the feature vectors of instances from the
and dimension expansion in the second half. Thus data dataset.
like images with a typical dimension of hundreds of
thousands are sequentially reduced to a feature vector with a
size of hundreds or thousands and then expanded to the 2.3. Attention
original extent. We train them from an unlabelled dataset
to respond with an output equal to the input. If the
training is successful, each part of the processing sequence
contains the same information. As a result, the feature
vectors have the same information as images from the
dataset. Then we cut the part of the neural network after
the feature vector and get the extractor (encoder). Or
we remove the part before the feature vector and get the
The invention of the attention mechanism in natural
language processing enabled us to take to regard the global
content during processing sequences of words, e. g., for
distinguishing the meaning of synonyms upon their
content. Then it helped to avoid sequential processing and
start the transformer revolution in the whole domain of
deep learning [10]. But, for our purposes, it is enough
to concern it from a mathematical point of view. The
(0) =  = ‖‖‖‖ cos , where  is
as dot product 
the angle between  and . Since cos  is 1 for same, 0
for diferent, and − 1 for opposite vectors, we can turn the
products to high, middle, and small values from ⟨0, 1⟩ by
 () = ∑︀  . Thus  =  ( (0) ) =</p>
        <p>attention mechanism works with a set of key-value pairs. and  is their latent space, i.e., turns feature vectors
Having a query  on input, we mix the query from keys into the output data.
 and outputs an analogical mixture from the corre- We map the spaces  and  by the attention
modsponding values  , where: ule . We are gradually building matrices  and 
containing keys and values. Each pair key-value
corre⎛ 1 ⎞ ⎛ 1 ⎞ sponds to an association between two feature vectors
 = ⎜⎜⎝ ..2. ⎟⎠⎟  = ⎜⎜⎝ ..2. ⎟⎠⎟ The system employs the generator to act with  = ()
representing stimulus and response at the system level.
  upon a random value . As a result, it receives a response
 and extracts key  =  (). Then, concerning their
All queries and keys are vectors of dimension , so  is a quality, the system can include ,  into ,  . Thus, the
matrix  × . Values and outputs are vectors of dimension mapping  and  become richer.
, so  is a matrix × . At first, we find such  ∈ ⟨0, 1⟩ On the other hand, when the system receives an input
that ∑︀  = , ∑︀  = 1, and  = 1, 2, .... Since we  independent of the system activities, it turns it into
like to mix the query more from keys similar to it and less query  =  (). If the query is equal to one of the
from diferent keys, we can define the mixture roughly keys, i. e.  = , we aim to act with (). However,
it is much more probable that we cannot translate the
query so directly. Therefore we operate with () where
 = (, ,  ).</p>
        <p>So, we can summarize the system operation into two
processes (procedures   and  ):
3. Method Of course, this schema is not generally applicable.
However, if we can apply it, it grants one-shot
learnLet us assume we have a system with one feature extrac- ing. The system capability is still growing gradually but
tor  and one generator . The extractor represents in steps, without transient states. Each demonstration
perception and generator action. The extractor imple- invokes immediate faculty to act accordingly under
sitments function  :  →  , where  is a set of system uations close to the seen example. The system operates
inputs and  , is their latent space, i.e., turns the input somehow also upon unseen conditions, and the quality of
data into feature vectors. The generator implements func- these actions grows with the number of key-value pairs.
tion  :  → , where  is a set of system outputs
 (  ), where  is a constant that enables us
to scale how much we mix from similar keys and how
much from diferent ones. Since the length of vectors is
growing with √ and dot product with , it is popular to
define  = √. It would mean we include diferent and
opposite keys, even if one key equals the query. For our
purposes, we use a much smaller scale factor  = √5,
since we prefer to mix almost from a single key if the key
is equal to the query. Having coeficients of the mixture ,
we can mix values  to output  =  . So, the complete
response of the attention module to a single query  is:
(, ,  ) =  
︂(  )︂</p>
        <p />
      </sec>
      <sec id="sec-1-2">
        <title>Yet we mention that the typical use of the attention</title>
        <p>mechanism is the so-called self-attention, for which the
queries, keys, and values are coming from the same input,
and we aim to get the query compatibility to each key.
However, this is not the case. So instead, we will use the
mechanism for mapping two latent spaces; queries and
keys are from one space, and values and outputs are from
another.</p>
        <p>Algorithm Method of attention-based one-shot learning
 is extractor,  generator
 is attention,  keys,  values
procedure Acquire( ,,,)
loop
 ← ()
 ← ()
()
 ← ()
 ←  ()
 ←  ∪ {}
 ←  ∪ {}
procedure Use( ,,,)
loop
 ← ()
 ←  ()
 ← (, ,  )
 ← ()
()</p>
        <sec id="sec-1-2-1">
          <title>4.1. Extractor</title>
        </sec>
      </sec>
      <sec id="sec-1-3">
        <title>We turn images into features by the pre-trained model</title>
        <p>DINO [9]. In detail, we use dino_deits8.onnx – the
middlesized version of the latter vision-transformer backbone,
distributed in the ONNX format. Though it turns color
images with resolution 224x224 into feature vectors of
mere 384 numbers, its quality is incredible, demonstrated
by several successful applications, including pose
detection. Thus we are sure that the vector also contains
information representing the person’s pose on the image.
But, of course, the pose is in a raw form: we use the
backbone only, while the applications mentioned above
add further processing layers. The model is relatively
large, but its middle-sized version can fit into the 4GB
GPU. Moreover, its inference takes 0.05s on an ordinary
gaming notebook; thus, it is very suitable for building
real-time applications.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Demonstration and Evaluation</title>
      <p>
        To demonstrate the method, we deal with the imitation
game between a human and a humanoid robot. Though
it is possible to implement this task with classic computer 4.2. Generator
vision [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and deep learning [11], for our purposes, it has The iCub robot arm contains five significant degrees of
the meaning to re-implement it in a novel way. In this freedom, two in the shoulder and three in the elbow joints.
game, a humanoid robot with a body similar to humans All together pose of the left and right arms is coded by
invites a human to mimic his movements. If the human ten angles.
accepts the invitation and imitates the robot, the robot Aiming to create a decoder generating accurate poses,
learns how to mimic the human. As a result, the robot we first need to get their dataset. We cannot get it by
can mimic the movements of humans (Figure 1). random generation of joints because it would also
con
      </p>
      <p>For implementation, we need a humanoid robot; we tain unnatural poses. So we need to define what it means
employ iCubSim, the simulator of the iCub robot [12], natural here. To do that, we have decided to concern
natequipped with an external camera. We control it from ural poses generated by the inverse kinematics. We have
Python via pyicubsim, ONNX-runtime and OpenCV [13] asked the robot to move its arm to all possible coordinates
libraries. Further, we need a feature extractor that turns in its vicinity, and if it succeeded in reaching the point,
images seen by the robot to feature vectors. For our pur- we have added the current joint setup into the dataset.
pose, it is not necessary to train it; we can get it from It was not an easy job because inverse kinematics for
a pre-trained model for computer vision (obtained by a iCub was not available to us. Therefore we have started
self-supervised method described in Chapter 2.2). How- from the known Denavit-Hartenberg parameters of the
ever, we need to invest more efort in generating robot robot, and – using on-the-shelf direct kinematics [16]
movements since neither pre-trained models nor datasets – we have implemented the FABRIK algorithm [17]
adare available to us for the chosen robot. First, we create justed for Denavit-Hartenberg notation and extended by
the dataset as a set of the robot joint positions while mov- constraints [18]. It is a slow but fully operational solution
ing its arm to random points in the robot’s vicinity. We that not only defines the natural poses of the robot but
avoid abnormal setups of robot joints by their calculation speeds up the creation of the dataset. We speed up the
by inverse kinematics. Then we train (using Keras [14]) process because we can reliably calculate all data on the
the variational autoencoder (see Chapter 2.1) and get the kinematics model, and we do not need to try them on the
generator model as its part. Having the extractor and robot. Also, as we see later, it is profitable that our dataset
generator, we can define the overall model controlling can contain the Euler coordinates corresponding to the
the robot as their integration by the attention module recorded joint setups. We do not use them for model
(see Chapter 2.3). Since the system operates in real-time creation, but they are helpful for model visualization. In
and calls models, the integration employs a blackboard this way, we have collected all possible poses (Figure 2),
architecture [15] that helps us to combine slower and 23470 for each arm. Then we randomly selected 60000
faster processes. Finally, we test the system. examples concerning the equal probability that the robot
uses the left arm, the right arm, both arms symmetrically,
and both arms in diferent poses.</p>
      <p>In the second phase, we used Keras to train the
variational autoencoder of the selected joint setups. Since the
space of iCub’s arm action is not ample, we have used
just ten input, six intermediate, two feature, six
intermediate, and ten output neurons. Of course, we double the
internal structures because the features are the sum of
the average and random multiple of standard deviation
(Figure 3). We have used ReLU and tanh activations since
we turned joint angles from − 180∘ to 180∘ into code
from − 1 to 1. Before training, we shufled the dataset
and split it into 50000 training and 10000 testing
examples. The training required ten epochs with batch size 32
and took mere 92s (Figure 4). Finally, we have distilled
the decoder part of the trained architecture and saved it
as our generator. Yet we have converted the generator
model from the .h5 format to the .pb format that we can
open in the OpenCV library.</p>
      <p>This model can turn any pair of numbers from − 1 to
1 into a proper joint setup on the robot (Figure 5). But,
further, we need to check that the space of generated
actions is well-organized. In other words, we need to
check that a fluent change of the feature vector causes
only a fluent shift in the joint setup. Since the feature
vector has only two numbers, we can easily visualize its
quality in six pictures depicting the x,y, and z coordinates
of the right and left arms. We put a point to the picture for
each example from the testing set. Its color corresponds
to the value of the coordinate. Then fluent color gradient
means that the space is well-organized (Figure 6).</p>
      <sec id="sec-2-1">
        <title>4.3. Integration</title>
        <p>Now we can integrate the extractor and generator
models into one system. Since such a system needs to
combine fast data sources like a camera with slower models
and languid robot movement, the integration employs a
blackboard architecture. Concretely, we use our solution
named Agent-Space architecture [15] to split the system
into a set of agents communicating via blackboard and let
the overall control emerge from the individual behaviors
of the agents.</p>
        <p>Our system contains the following agents (Figure 7):
• The camera agent grabs images from the camera
and writes them onto the backboard, where other
agents can read image samples according to their
processing capacity. (This way, we avoid delays
and overloadings appearing if we put grabbing
and processing images into the same loop.)
• The perception agent reads the grabbed image
from the blackboard, turns it into a blob, feeds the
extractor model, and writes the provided feature
vector to the blackboard.
• The control agent operates in two modes:
AC</p>
        <p>QUIRE and USE. In the first mode, it collects lists
of keys and values corresponding to the feature
vectors of the extractor and generator in the
following way. First, it randomly generates a feature
vector for the generator, writes it to the
blackboard, waits, and reads the feature vector
provided by the extractor. Then it adds them to the
lists. In the second mode, the agent reads the
feature vector provided by the extractor and writes
the feature vector for the generator calculated by 4.4. Testing
the attention module from the lists of keys and
values. We have developed the real-time system incrementally,
• The action agent reads the feature vector for the working with the of-line version, which quality we can
generator and controls the iCub robot by the com- investigate more easily. In this phase, we have selected a
mands of the YARP protocol encapsulated by the bunch of ten examinator’s poses and created a few
impyicubsim library. ages – under varying conditions – for each pose. Then
we taught the system, and after each sample, we tested
• For simplification, at the current stage of devel- the system’s capability to imitate all poses. The number
opment, we specify the accurate time for waiting of operational poses indicated whether the system could
in the ACQUIRE mode by the examinator. Since forget a learned pose. We found that it did forget neither
his hands are busy imitating the robot’s pose, we one. Even the system learned one pose implicitly,
commanage this signaling by whistling. We imple- pounding the correct response from two other already
mented this input by the pitch agent. It processes presented poses (Figure 8).
sound by the Fourier transform and looks for the So far, we have not evaluated the real-time version in
high frequencies. another way than by the examinator’s opinion. In the
future, we plan to employ pose detectors for this purpose.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusion</title>
      <sec id="sec-3-1">
        <title>In this paper, we introduced a kind of one-shot learning.</title>
        <p>Its key component is the attention module. We have used
this existing component of deep neural networks for a
new task: mapping two latent spaces. First, however, we
had to adjust one of its parameters: the scale factor.</p>
        <p>We demonstrated our approach to one-shot
learning on imitation between human and humanoid robots.</p>
        <p>We built our demo from modules developed in a
selfsupervision way. Thus we avoided using datasets
containing particular poses of the person on images and the
robot’s body. Instead, our robot has learned them by
interacting with the examinator in a one-shot learning
way.</p>
        <p>For imitation, the robot needs to get the model of the
body seen as analogical to the model of its own body. In
humans, it is not clear where this ability originates. But
our approach enlights that the imitation game not only
solicits this ability but can also help it to emerge. Here
imitation is an ability of a society [19], and one of its
members learns it from other (a child from its parent or a
robot from its user). Remarkably, this transfer could rely
on the attention module, an essential building block for
natural language processing. We could look at language
as a kind of imitation related to the movement of vocal
cords. Its nature is similar to hand movement. On the
other hand, the presented one-shot learning mechanism
could play a role in the early evolution of signal-based
language.</p>
        <p>Our approach also has weaknesses. The major one is
that if we use an encoder that stems from general data,
the mapping could be relevant only for specific
conditions. For example, associations learned in the presented
imitation game could be fooled by more persons in front
of the camera. On the other hand, the quality of today’s
self-supervised models does not allow us to cheat the
system by e. g. the diferent colors of the wall or the
diferent dress of the seen person. We could decrease
this problem by training the encoder from more specific
data under specific conditions. However, it isn’t easy to
imagine that we could manage it in a self-supervision
way.</p>
        <p>Finally, our method is more general than an imitation
game. For example, processing vision, the robot cannot
see itself; therefore, it needs the help of an examinator.</p>
        <p>However, it could apply the same method for seeing
itself in the mirror. Or it could similarly process voice.</p>
        <p>Having a speech generator corresponding to the physical
capabilities of vocal cords, lips, and tongue and a voice
listener analogical to the ear, it could start to produce
random voices and learn the mapping between the
listener perception and the generator action. As a result, it
can reproduce the listened speech when another source
makes it.
tonomous Mental Development 6 (2014) 213–225. [19] A. Aristidou, J. Lasenby, Embodied gesture
prodoi:10.1109/TAMD.2014.2319861. cessing: Motor-based integration of perception
[4] G. E. Hinton, R. R. Salakhutdinov, Reducing the di- and action in social artificial agents,
Cognimensionality of data with neural networks, Science tive Computing 3 (2011) 419–435. doi:10.1007/
313 (2006). doi:10.1126/science.1127647. s12559-010-9082-z.
[5] Brownlee, Deep Learning for Computer Vision, 1.4</p>
        <p>ed., machinelearningmastery.com, 2019.
[6] D. P. Kingma, M. Welling, An introduction to vari- 6. Online Resources
ational autoencoders, Foundations and Trends in</p>
        <p>Machine Learning 12 (2019) 307–392. We share codes of this project at GitHub
[7] D. King, High quality face recognition with deep
metric learning, 2017. URL: http://blog.dlib.net/
2017/02/high-quality-face-recognition-with-deep.</p>
        <p>html.
[8] T. Chen, S. Kornblith, M. Norouzi, G. Hinton, A
simple framework for contrastive learning of visual
representations, in: Proceedings of the 37th
International Conference on Machine Learning, number
149 in ICML, 2020, pp. 1597–1607.
[9] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal,</p>
        <p>P. Bojanowski, A. Joulin, Emerging properties in
self-supervised vision transformers, in:
Proceedings of the International Conference on Computer</p>
        <p>Vision, ICCV, 2021.
[10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,</p>
        <p>L. Jones, A. N. Gomez, Łukasz Kaiser, I. Polosukhin,
Attention is all you need, in: 31st International
Conference on Neural Information Processing Systems,</p>
        <p>ACM, Long Beach, 2017.
[11] M. Petrovich, M. J. Black, G. Varol,
Actionconditioned 3D human motion synthesis with
transformer VAE, in: International Conference on
Computer Vision, ICCV, 2021.
[12] D. Vernon, G. Metta, G. Sandini, The icub
cognitive architecture: Interactive development in a
humanoid robot, in: 2007 IEEE 6th International
Conference on Development and Learning, 2007, pp.</p>
        <p>122–127. doi:10.1109/DEVLRN.2007.4354038.
[13] G. Bradski, The opencv library, Dr. Dobb’s Journal</p>
        <p>of Software Tools (2000).
[14] F. Chollet, Deep Learning with Python, Manning</p>
        <p>Publications Co., Greenwich, CT, USA, 2017.
[15] A. Lucny, Building complex systems with
agentspace architecture, Computers and Informatics 23
(2004) 1–36.
[16] L. Natale, C. Bartolozzi, F. Nori, G. Sandini, G. Metta,</p>
        <p>Humanoid Robotics, Springer, Dordrecht, 2017.</p>
        <p>doi:10.1007/978-94-007-6046-2.
[17] A. Aristidou, J. Lasenby, Fabrik: A fast, iterative
solver for the inverse kinematics problem,
Graphical Models 73 (2011) 243–260.
[18] R. A. Tenneti, A. Sarkar, Implementation of
modiifed fabrik for robot manipulators, in: Proceedings
of the Advances in Robotics 2019, 2019, pp. 1–6.
doi:10.1145/3352593.3352605.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Dennett</surname>
          </string-name>
          ,
          <article-title>Kinds of minds: towards an understanding of consciousness</article-title>
          ,
          <source>Weidenfeld &amp; Nicolson</source>
          , London,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Bandera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Molina-Tanco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bandera</surname>
          </string-name>
          ,
          <article-title>A survey of vision-based architectures for robot learning by imitation</article-title>
          ,
          <source>International Journal of Humanoid Robotics</source>
          <volume>9</volume>
          (
          <year>2012</year>
          ). doi:
          <volume>10</volume>
          .1142/ S0219843612500065, world Scientific Publishing Company.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Boucenna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Anzalone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tilmont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chetouani</surname>
          </string-name>
          ,
          <article-title>Learning of social signatures through imitation game between a robot and a human partner</article-title>
          , IEEE Transactions on Au-
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>