<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Systems - Volume</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Grounding Words in Visual Perceptions: Experiments in Spoken Language Acquisition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fabio De Ponte</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sarah Rauchas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computing</institution>
          ,
          <addr-line>Goldsmiths</addr-line>
          ,
          <institution>University of London</institution>
          ,
          <addr-line>Lewisham Way, New Cross, London SE14 6NW</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>2</volume>
      <issue>2014</issue>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>In recent years, Natural Language Processing models have shown compelling progress in generating and translating text. Yet, the symbols that are manipulated by these models are not produced within the models themselves. On the contrary, they are externally given in the form of tokens. The models only measure the probability that a specific token comes after another (or a group of others) and allow the generation of a list of tokens, each of which has a certain probability of following the previous one. There is no connection to sensory perceptions, and the semantic interpretation of the outputs - as well as of the inputs - of these models is completely invisible to the system that produces them. Therefore, language cannot be used by the system to manipulate information about the perceived world. This is commonly referred to as the Symbol Grounding Problem and, as of today, there is not a generally accepted procedure to solve it. This paper explores a possible solution: a sequence-to-sequence model trained over videos characterised by visual elements which reliably predict the presence of acoustic co-occurring elements. A dataset was created ad-hoc, with videos that include 5 types of objects and 5 actions. Two research questions were considered: whether such a model could map video features onto audio features, in fact producing a categorization without labels, where the categories would emerge from the parallel, simultaneous generalization of both input and target; and whether the model would be able to combine learned information about objects and movements to correctly describe a new combination, shown in a video it was not exposed to during training, a process that is referred to as compositional semantics. The experiment showed that the model was able to generalize simultaneously over videos and over the utterances that were paired with them. Further, it produced sentences that were in some cases more accurate than the original ones, precisely because of the process of generalization. However, the results suggest also that the model did not develop the ability to combine information taken from diferent samples. In other words, while symbol grounding seems to have been achieved, compositional semantics does not. The experiment shows that sensory perceptions can be mapped onto one another with a sequence-to-sequence model trained over a dataset where elements coming from diferent sensory domains are paired. However, it is not suficient to develop compositional semantics.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Symbol grounding</kwd>
        <kwd>Compositional semantics</kwd>
        <kwd>NLP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Recently, Natural Language Processing models like GPT3 and BERT have shown compelling
progress in interpreting and predicting text. Yet, while they help us understand the way
languages work — ofering at the same time valuable tools for diferent applications — they do
not address the question of the relationship between words and sensory perceptions and of how
language emerged in the first place. These questions date back centuries but have lately become
known within the field of artificial intelligence as “the symbol grounding problem,” since the
issue was clearly defined by Harnad [ 13, Abstract], who stated it in these terms: «How can the
semantic interpretation of a formal symbol system be made intrinsic to the system, rather than
just parasitic on the meanings in our heads?»</p>
      <p>Nowadays, even a model as simple as an n-gram conditional frequency model, trained over a
suficiently large corpus of text, can produce apparently meaningful text. However, the symbols
that are manipulated by the system are not inherent to the system itself. Applying the model,
the machine only calculates the probability that a certain token comes after another (or a group
of others) and then produces a list of tokens, each of which has a certain probability of following
the previous one. In principle, this could be done by a human being on an unknown language.
As John R. Searle [34] famously argued in his paper “Minds, brains, and programs,” a person
could apply a similar method to a Chinese corpus and produce a Chinese text, without knowing
the meaning of a single Chinese word.</p>
      <p>Yet, recent advancements led a senior engineer at Google, Blake Lemoine, to claim that a
language model, LaMDA, short for Language Model for Dialogue Applications, a system for
building chatbots, was «sentient» [46]. The company quickly dismissed the claim and, in a
statement, a Google spokesperson said: «There was no evidence that LaMDA was sentient
(and lots of evidence against it).»[46]. In order to support his claim, Lemoine published a few
astonishing fragments of chats between himself and the system. However, as surprising as they
are, they do not seem to suggest the existence of a sentient mind. In fact, a simpler explanation
is possible. Language models capture language structures and, with them, a fraction of the
logic behind language itself. Mikolov, Yih and Zweig [20] were able to design a model such
that some seemingly logical operations between word vectors were possible, like “King - Man +
Woman”, resulting in a vector very close to “Queen”, corresponding to the analogy “a man is
to a king as a woman is to a queen.” Operations like this between embedding vectors do not
always work, even in more recent models. However, they show that sometimes it is possible to
convert logical operations into geometrical operations. More generally, they suggest that human
reasoning reflects in language and that a suficiently complex language model can capture part
of that logic. This could explain the surprising performances of the most advanced language
models but leaves us with the unsolved problem of how a system can develop a meaningful
semantic interpretation of the symbols that logic is applied to. Therefore, the challenge is to
design a technique that would let the system define its own meaningful symbols. According to
Harnad, «there is really only one viable route from sense to symbols: from the ground up.» [13,
Conclusions]. In other words, symbols have to be grounded to perceptions. There have been
many attempts to do that, we will see some of them below.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background Literature</title>
      <p>Many attempts have been made at developing a method for grounding symbols to perceptions.
One of the earliest was proposed by Roy [29, section 1], who designed «a computational model
which learns from untranscribed multisensory input» where «acquired words are represented in
terms associations between acoustic and visual sensory experience.» The model was designed to
learn the same way it is claimed children do, by discovering «words by searching for segments
of speech which reliably predict the presence of visually co-occurring shapes.» The author
recorded a number of sessions of adults speaking to babies in a room. The adults were playing
with the babies with toys, one toy at a time. The system included a speech processor that
«converted spoken utterances into sequences of phoneme probabilities»; and a visual processor
that «extracted statistical representations of shapes and color from images of objects.» Phoneme
probabilities and statistical representations of co-occurring images were stored in a short-term
memory so that the model was able to predict the most probable next word, given an image.
While the experiment was a major step forward, it had an important drawback: not all utterances
contained the name of the toy. Sometimes, adults said only, for example, «Here it comes!»
referring to the toy car they were playing with. Therefore, the results were inconclusive.</p>
      <p>Another interesting experiment was proposed by Tuci et al. [48]. The authors set up a virtual
environment where subsequent generations of evolving robots were trained «to access linguistic
instructions and to execute them by indicating, touching or moving specific target objects.»
[48, Abstract]. In the course of the process, they were able to learn that the commands were
composed of diferent parts. For example, “touch the pen” is made of two parts, “touch” and “the
pen”, while “move the spoon” is composed of “move” and “the spoon.” Once they had learned
this, they were able to generalize to new compositions, e.g. “touch the spoon,” and execute them,
even though they had never seen that specific command before. This, according to the authors,
demonstrated «how the emergence of compositional semantics is afected by the presence of
behavioural regularities in the execution of diferent actions.» A drawback was that, while
this experiment efectively demonstrated the principle, it did not provide a method to enable a
system to acquire language, other than within the very limited scope of the experiment itself.</p>
      <p>A similar approach was proposed by Sugita and Tani [43]. They focused on the creation
of a geometrical n-dimensional space, where the geometric arrangements represented «the
underlying combinatoriality» among symbols. They defined 26 actions and recorded 120
corresponding algorithmically generated sensor-motor time series for each. They were then
able to create a model where «the composition of symbols is realized by summing up their
corresponding vectors,» a compositional semantics potentially much more powerful than the
one of Tuci et al. [48].</p>
      <p>More recently, other approaches have been proposed. One came from Gonzalez-Billandon
et al., who converted «the audio signal to an embedding space where embeddings for the same
words are closer than diferent words, regardless of speaker.» [ 11, Section II]. In order to achieve
this, they used a Vector Quantized-Variational Autoencoder network. Then, the embeddings
were associated with images. The project is still ongoing and the results have not been released
yet.</p>
      <p>Another method came from Wang et al. [51], who proposed MAXSAT, a SATNet layer that
can be integrated into neural network architectures that could successfully be used to learn
logical structures. Their technique was subsequently used by Topan, Rolnick and Si [47], to
map visual inputs to symbolic variables without explicit supervision, through a self-supervised
pre-training pipeline.</p>
      <p>A diferent approach was tried by Tan and Bansal [ 45]. They proposed LXMERT (Learning
Cross-Modality Encoder Representations from Transformers), a large-scale transformer model.
In essence, it is a framework designed to learn direct vision-to-language connections and it
represents a step forward in the task of describing the content of images in words. It outperforms
previous models on visual question-answering datasets. However, it does not include sound or
any perception other than vision. Therefore, it does not ofer a direct contribution to tackle the
symbol grounding problem and it does not address the question of compositional semantics.</p>
      <p>A proposal in this direction was instead put forward by Liu, Li and Cheng [17] who replaced
text with audio. They proposed a new general-purpose neural sound synthesis network, based
on generative adversarial networks, that was able to generate sound directly from visual inputs.
In their work, the task was formulated as a regression problem to map a sequence of video
frames to a sequence of raw audio waveform. This model paves the way for the design of a
general method to associate visual to acoustic features and vice versa. It represents a significant
step in the direction of the solution of the symbol grounding problem. However, it was designed
to reproduce any kind of sound, and the authors had in mind mostly noise produced by specific
objects (e.g. cars, motorcycles, scrolling water, etc.). It was not designed for spoken language
and, again, it does not address the question of compositional semantics.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Objectives of the research</title>
        <p>Our work focused on the following research objectives:
1. Verify whether it is possible for a system to learn words from an unlabeled dataset of
spoken utterances and visual representations, through a sequence-to-sequence neural
network.
2. Verify if, once such words are learned, compositional semantic would emerge, that is
whether the system would be able to combine them to compose sentences that were not
present in the training dataset.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Dataset</title>
        <p>The dataset is composed of 1,000 videos, each showing an object – either staying still or moving
– while a voice reads a sentence, for example “this is a pen,” that refers to what we see. There
are twenty voices in total: ten voices are natural – i.e. they have been recorded by real people –
while the other ten are artificial (produced through the services of the website www.murf.ai).
Each video is exactly 3 seconds long. It is 180x180 pixel sized and the audio is sampled to 16
KHZ. The dataset is available online 1.</p>
        <p>Each voice recorded 50 sentences: the first 25 (Table 1) describe still objects, the next 25
(Table 2) refer to the moving of objects. A video was recorded for each sentence. Audio and
video files were then paired and saved into a single MP4 file. It is a balanced dataset: each object
is represented 5 times not moving and 5 times moving for each group of 50 videos.</p>
        <p>1It can be downloaded from https://www.kaggle.com/datasets/fabiodeponte/symbolgrounding and it is released
under licence Creative Commons Attribution-ShareAlike 4.0 Generic (CC BY-SA 4.0).</p>
        <p>The videos were augmented by five operations: flipping image, increasing brightness,
decreasing brightness, increasing saturation and zooming in. Audios were augmented by five
operations as well: increasing and decreasing speed, increasing and decreasing pitch, and
increasing volume. Lastly, data was split into training, validation and test sets.</p>
        <p>After completing the data augmentation, each of the five augmented audio sets of 1,000
samples, plus the original one, was paired with each of the five augmented video sets of 1,000
samples, plus the original one. In total, 36,000 samples were generated.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Model architecture</title>
        <p>
          Once the dataset was ready, a system was designed to map visual elements to spoken utterances.
Video files were processed by two pre-trained neural networks, namely Wav2vec (a model
introduced by Baevski et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] at Facebook for self-supervised learning of representations
from raw audio, trained on LibriSpeech corpus) and CLIP (developed by Radford et al. [26] at
OpenAI, trained on a dataset of 400 million image-text pairs collected form a variety of publicly
available sources on the Internet), that extracted respectively acoustic and visual features. Then
a sequence-to-sequence neural network mapped the extracted features of the visual part onto
the extracted features of the acoustic part 2.
        </p>
        <p>Wav2vec returns features vectors of 150 integer values, which in turn can be in turn decoded
to a written text by the same library. As we had a dataset composed of 36,000 videos, we got an
audio features matrix sized 36,000 x 150. CLIP returns vectors of 512 float values. However, it
does not provide a tool to convert the features back to a video or to any other directly readable
form. For this reason, while Wav2vec makes it relatively simple to evaluate the predictions of
a video-to-audio neural network, CLIP does not do the same for an audio-to-video network.
Therefore, we focused on the transformation of video features into audio features and not the
reverse, developing a sequence-to-sequence video-to-audio network.</p>
        <p>As a preliminary check, audio features were decoded to written text, through one of the
libraries made available by Facebook with Wav2vec. From this verification, it emerged that
not all features produced intelligible text. In particular, the sentences recorded by real people
contained many errors. The reason is probably that, as they are non-native speakers, their
pronunciation was not good enough for the library. Moreover, the artificial voices that had their
pitch multiplied by a factor of ten for data augmentation purposes resulted in unintelligible
2The code of audio and video extractor and of sequence-to-sequence model is available at https://github.com/
fabiodeponte/symbol_grounding.</p>
        <p>26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
“Move the pen to the left”
“Move the pen to the right”
“Move the pen up”
“Move the pen down”
“Rotate the pen”
“Move the phone to the left”
“Move the phone to the right”
“Move the phone up”
“Move the phone down”
“Rotate the phone”
“Move the spoon to the left”
“Move the spoon to the right”
“Move the spoon up”
“Move the spoon down”
“Rotate the spoon”
“Move the knife to the left”
“Move the knife to the right”
“Move the knife up”
“Move the knife down”
“Rotate the knife”
“Move the fork to the left”
“Move the fork to the right”
“Move the fork up”
“Move the fork down”
“Rotate the fork”
decoded text as well. Therefore, all real voices and some of the artificial ones were discarded,
along with the corresponding videos. The dataset was thus reduced in size and as a result was
composed of only 14,500 samples. Each object was represented 2,900 times: 1,450 times laying
still and 1,450 moving.</p>
        <p>It should be noted that this does not necessarily mean that the discarded samples are useless.
They could be used in a model that does not make use of Wav2vec. In fact, whether the features
are intelligible or not for a speech-to-text converter is not relevant for symbol grounding, as
long as the same utterances are pronounced consistently in similar ways.</p>
        <p>The sequence-to-sequence models are composed of two parts: an encoder and a decoder.
Each of them includes one or more Long Short-Term Memory (LSTM) layers. During training,
the encoder basically compresses the sequence into a vector, while the decoder learns a series
of conditional frequency distributions: for each value and each vector coming from the encoder,
it calculates the most likely value to come next. In terms of the translation process, this means
that the decoder calculates the most probable next word given the last predicted word and the
vector coming from the encoder, which summarizes information about the whole sentence to
be translated.</p>
        <p>In our case, we can consider each word of the input sentence one value of the video features
vector and each word of the target translated sentence a value of the audio features target vector.
Then, mapping video features onto audio features may be interpreted as a translation task: on
the input side, we have a sequence that represents the contents of a video; on the output side,
we expect a sequence that represents the same contents expressed in audio format.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. The compositional semantics experiment</title>
        <p>As we mentioned, while the first aim was to design a model that would be able to map videos onto
audio recordings, the second was to verify whether such a model could develop compositional
semantics. In order to do that, an experiment was set up. The videos containing a specific
moving object were removed from the dataset, while videos showing the same object laying still
were kept, along with all other objects, both moving and still. The model was trained on the
reduced dataset, and it was tested against the removed videos. The audio features predicted by
the model were converted into text and so it was possible to compare them with the intended
ones.</p>
        <p>For example, at one point all videos of moving pens were removed, so that the model was
trained only on videos showing pens laying still (paired with the utterance “this is a pen” read by
diferent voices) and other objects (spoons, forks, knives and phones) both staying still (paired
with utterances like “this is a spoon”) and moving (paired with utterances like “move the spoon
to the right”). No videos showing a pen moving to the right and the utterance “move the pen to
the right” were included in the training set. The experiment consisted of verifying whether the
model was able to compose the utterance “move the the pen to the right,” once such a video
was given to the network. In order to do that, the model should have been able to combine the
information coming from the shape of the pen and the information coming from the movement
of other objects. We repeated the experiment five times, with one object at a time.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and discussion</title>
      <sec id="sec-4-1">
        <title>4.1. Symbol grounding</title>
        <p>Twenty diferent configurations of the sequence-to-sequence model were tried 3, modifying the
size of the layers, the learning rate, the optmizer and the directionality. The best performing
one had an LSTM layer of size 1,024 for the encoder and an LSTM layer of the same size for the
decoder. It was not bidirectional, as bidirectionality made the training slower without ofering
any gain in terms of performance. Categorical cross-entropy loss function and Adam optimizer
were adopted, with learning rate 0.001, and a batch size of 64. The video features values were
normalized to the range 0-1, multiplied by 100 and rounded to integer. Therefore, each value
was one-hot encoded in a vector sized 101. This model showed a test loss of 0.0336 and a test
accuracy of 0.9890. Along with loss and accuracy, another performance metric was adopted:
the cosine distance between the predicted features and the expected ones. This measure was
calculated comparing predicted and target vectors on 100 samples extracted from the train set
and 100 from the test set. The showed a test cosine distance of 50.12.</p>
        <p>Another important method of evaluation was the visual inspection. Due to space constraints,
we show here only the results of the best performing model4. We can see in Table 3 a sample of
11 randomly picked sentences, as predicted by the model. The left column shows the sentence
originally paired with the video (as converted by Wav2vec from the original audio into text),
while the right column shows the predicted sentence. As we can see, the model was able to
produce sentences that in some cases were even more understandable than the original. In the
batch of 11 sentences that we randomly extracted from the test dataset:
• The model correctly predicted “move the fork to the right” (sample number 0), “this is a
spoon” (n. 2), “move the fork to the right” (n. 3), “move the spoon down” (n. 4), “rotate
the spoon” (n. 7), “this is a spoon” (n. 9) and “move the phone to the right” (n.10).
3The detailed description of this process can be found on https://github.com/fabiodeponte/symbol_grounding.
4The results of the other models are described on the Github repository.
• It predicted “wrotate the knife” (n. 1), a substantially correct output with the minor error
of adding a “W” to the word “rotate”.
• It predicted “This is a spern” (n. 5) for the video of a spoon staying still that was originally
paired with a scarcely comprehensible utterance, namely “his is a spurn”, instead of
“this is a spoon”. In fact, generalizing over similar videos and over the utterances that
were paired with them, the model showed a result that turned out to be better than the
original. This was again the case with the utterance (n. 6). The model returned features
corresponding to “move the fork to the left”, whereas the original video had been paired
with “moved the fork to the left”, with a “d” at the end of “move” that made the utterance
slightly incorrect. And again (n. 8) the model returned “moves the phone up”, which
was a strong improvement over the utterance originally paired with the video, that was
interpreted by the Wav2vec library as “lusa phonap”.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Compositional semantics test</title>
        <p>Once video features were mapped onto audio features, the experiment described in 3.4 was
performed. In order to do that, five subsequent tests were performed, one for each object: pen,
phone, spoon, knife and fork. For each of them, a reduced dataset was prepared, removing the
videos that showed the object moving to the left, to the right, up, down and rotating. The model
was trained on the reduced dataset, resulting in 13,050 samples, and then tested against the
1,450 videos that had been removed. The results are shown in Table 4.</p>
        <p>As we can see, loss increased dramatically when the model was tested over videos of moving
objects that it had not been exposed to during training. We can clearly see that it is not a
generalization problem, because the Test loss column shows that, when tested against videos
previously unseen but belonging to known groups, the model showed a loss ranging between
0.030 and 0.043. Yet, when exposed to new kinds of videos, its loss increased dramatically,
ranging between 0.73 and 0.97. The model was able to generalize when similar videos were
present during training. However, it was not able to combine information from the videos that
showed the objects staying still and information about the movement applied to other objects,
to form a sentence composed by “move” and the name of the object.</p>
        <p>In Table 5, ten predictions (out of 1,450) are shown. As we can see, the model did not predict
a sentence containing either “move the pen” or “rotate the pen” when exposed to a moving pen,
nor was it able (Table 6) to predict “move the phone” or “rotate the phone” when exposed to
the moving phone videos. Similarly, it could not predict any of the correct sentences for the
other three objects. In the dataset of 14,500 videos, each object was represented 2,900 times:
1,450 times laying still and 1,450 moving. The model was trained five times, each time leaving
aside the videos of a particular object depicted while moving. Each time it was tested against
those videos. In total, it returned 7,250 audio features, converted then to text. Again, it cannot
be shown here due to space constraints, but a through inspection showed that the combination
never occurred5.</p>
        <p>The model mostly favoured the shape of the object, predicting a sentence in the form “this
is...,” thus ignoring the movement. In a minority of cases, it recognised the movement, but
it failed to combine the information with the shape of the object and simply predicted the
utterance “move” accompanied by a diferent object, often the most similar. For example, “move
the pen” was frequently mistaken for “move the knife.”</p>
        <p>5See COMPOSITIONAL SEMANTICS RESULTS - FOR EACH OBJECT COMPARE ORIGINAL AUDIO FEATURES
AND PREDICTED AUDIO FEATURES.ipynb on Github repository.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <sec id="sec-5-1">
        <title>5.1. Symbol grounding and compositional semantics</title>
        <p>This project had two aims. The first was to design a system, composed of three neural networks,
that had to be able to directly map, without labels, visual elements onto spoken utterances, to
achieve symbol grounding. The second aim was to verify whether, once the mapping was done,
such a system would develop the ability to compose sentences that were not present in the
training dataset, combining information gathered from diferent samples, in order to achieve
compositional semantics.</p>
        <p>The results shown above suggest that the ability to directly produce a correct sentence when
exposed to a video has been achieved. The model was able to generalize on both sides of the
dataset and map videos onto audio recordings without labels. However, the results also suggest
that the model did not develop the ability to combine information taken from diferent samples.</p>
        <p>With this experiment, we tried to capture the power of categorization carried by words into
an artificial system. In fact, the co-occurring audio features associated with each video, once
generalized, play the role of categories. However, they are categories that the model itself
extracts from the sensory perceptions through a process of generalization, rather than from
externally given labels. The fact that words are intimately linked to categories is not surprising,
as with language comes indeed «the ability to generalize,» as Oliver Sacks [33, p. 42] pointed
out in “Seeing voices”, a book devoted to the relationships between sensory perceptions and
language. As Lupyan [19] argues, «merely perceiving an object does not require categorizing
it. In contrast, naming an object (whether to communicate to another individual or for your
own benefit) does require placing it into a category.» However, classification into categories
is not enough. Once things have names, the necessity of a grammar arises, in order to allow
the acquired categories to be manipulated. And, as Corballis [7, p. 37] illustrates, the origin of
grammar could very well have been compositional semantics: "The simplest events consist of
objects and actions, such as baby screams, snake approaches, or apple falls. Suppose, for example,
that an animal’s experience includes five meaningful objects and five meaningful actions. If
each object is associated with a single action, so that only babies scream or only apples fall,
then there are only five events to be signaled, and five event symbols will do the trick; the
objects do not need to be distinguished from the actions associated with them. But if all possible
combinations of objects and actions can occur, then it would be more economical to learn five
symbols for the objects and five for the actions, making ten in all, than to learn twenty-five
symbols to cover all their possible combinations. This might be the source of protolanguage,
leading eventually to grammar."</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Further work</title>
        <p>
          This experiment shows that it is possible to achieve symbol grounding through a
sequence-tosequence model trained over a dataset with audio and video co-occurring elements. The work
could be expanded in several directions:
1. Apply an attention layer as introduced by Bahdanau et al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] to the sequence-to-sequence
model, something that might allow the model to achieve compositional semantics.
2. Give up pre-trained features extraction networks. For real symbol grounding to be
fully achieved, the system should not include networks that were trained over labels.
They could be replaced by autoencoders, a solution that could ofer the possibility of
reconstructing the original audio and video files.
3. Expand the dataset. The dataset developed for this project was relatively small, with one
thousand original videos and only five objects represented. It could be expanded, with
the addition of more objects, either staying still or moving, and new voices.
4. Apply the model to the “something something” video database, that ofers images of
thousands of actions on objects, paired with a linguistic label, composed by a description
of the action in the form “do &lt;something&gt; on &lt;something&gt;” (hence the name). That
dataset seems perfectly suitable for purposes of compositional semantics. However, it
lacks audio. In order to overcome the problem, captions – at least a fraction of them –
could be converted to spoken utterances through artificial voices.
[7] Corballis, M.C.: From hand to mouth. The origins of language. Princeton University Press,
        </p>
        <p>Princeton and Oxford (2002).
[8] Fanello S.R., Ciliberto C., Natale L., Metta G.: Weakly supervised strategies for natural
object recognition in robotics in IEEE International Conference on Robotics and Automation
(ICRA), Karlsruhe, Germany, May 6-10 (2013). Available at: https://ieeexplore.ieee.org/
document/6631174. Last Accessed 23 May 2022.
[9] Gehring, J., Auli, M., Grangier, D., Yarats, D., Dauphin, Y.N.: Convolutional Sequence
to Sequence Learning. In: Proceedings of the 34th International Conference on Machine
Learning - Volume 70 - 1243–1252 (2017). Available at: https://arxiv.org/abs/1705.03122.</p>
        <p>Last Accessed 6 September 2022.
[10] Gentner, D.: Some interesting diferences between verbs and nouns. In: Cognition and
Brain Theory, 1981, vol. 4, 161-178 (1981). Available at: https://groups.psych.northwestern.
edu/gentner/papers/Gentner81c.pdf. Last Accessed 22 May 2022.
[11] Gonzalez-Billandon, J., Grasse, L., Sciutti, A., Tata, M., Rea, F.: Cognitive
Architecture for Joint Attentional Learning of word-object mapping with a Humanoid Robot.
In: IEEE/RSJ International Conference on Intelligent Robots and Systems (2019).
Available at: https://www.researchgate.net/publication/344432053_Cognitive_Architecture_for_
Joint_Attentional_Learning_of_word-object_mapping_with_a_Humanoid_Robot. Last
Accessed 12 June 2022.
[12] Hannagan, T., Agrawal, A., Cohen, L., Dehaene S.: Emergence of a compositional neural
code for written words: Recycling of a convolutional neural network for reading. In: Pnas,
Vol. 118 - No. 46 (2021). Available at: https://www.pnas.org/doi/10.1073/pnas.2104779118.</p>
        <p>Last Accessed 6 September 2022.
[13] Harnad, S.: The Symbol Grounding Problem. Physica D 42: 335-346 (1999). Available at:
https://arxiv.org/abs/cs/9906002v1. Last Accessed 7 May 2022.
[14] Harnad, S.: Symbol grounding and the origin of language, University of Southampton
Institutional Research Repository (2002). Available at: https://eprints.soton.ac.uk/256471.</p>
        <p>Last Accessed 16 September 2022.
[15] Huang, H., Wong, R.K.: Attention-based Seq2seq Regularisation for Relation Extraction.</p>
        <p>In: International Joint Conference on Neural Networks (IJCNN) (2021). Available at: https:
//ieeexplore.ieee.org/abstract/document/9533807. Last Accessed 6 September 2022.
[16] CLIP, Video Features Documentation, https://iashin.ai/video_features/models/clip. Last</p>
        <p>Accessed 5 September 2022.
[17] Liu, S., Li, S., Cheng, H.: Towards an End-to-End Visual-to-Raw-Audio Generation with
GAN. In: IEEE Transaction on Circuits and Systems for Video Technology, Vol. 32,
Issue 3 (2021). Available at: https://ieeexplore.ieee.org/document/9430540. Last Accessed 6
September 2022.
[18] Luong, M.T., Pham, H., Manning, C.D.: Efective Approaches to Attention-based Neural
Machine Translation. In: Proceedings of the 2015 Conference on Empirical Methods in
Natural Language Processing (2015). Available at: https://arxiv.org/abs/1508.04025. Last
Accessed 6 September 2022.
[19] Lupyan, G.: Carving nature at its joints and carving joints into nature: how labels augment
category representations, Progress in Neural Processing, Vol. 16, 2005, 87-96 (2005). Available
at: https://doi.org/10.1142/9789812701886_0008. Last Accessed 8 May 2022.
[20] Mikolov, T., Yih, W., Zweig, G.: Linguistic Regularities in Continuous Space Word
Representations. In: Proceedings of the 2013 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies. Association for
Computational Linguistics, 746–751 (2013). Available at: https://aclanthology.org/N13-1090.</p>
        <p>Last Accessed 16 September 2022.
[21] Nolfi S.: “Emergence of communication and language in evolving robots” in Lefebvre, C.,
Comrie, B. and Cohen H., New Perspectives on the Origins of Language, 533–554 (2013).</p>
        <p>Available at: https://doi.org/10.1075/slcs.144.20nol. Last Accessed 8 May 2022.
[22] Pasquale, G., Ciliberto, C., Odone, F., Rosasco, L., Natale, L.: Are we done with object
recognition? The iCub robot’s perspective. In: Robotics and Autonomous Systems, Volume
112, February 2019, 260-281 (2019). Available at: https://www.sciencedirect.com/science/
article/pii/S0921889018300332. Last Accessed 12 June 2022.
[23] Pinker, S.: Language Learnability and Language Development (1984/1996). Cambridge,</p>
        <p>MA: Harvard University Press.
[24] Pinker, S.: How could a child use verb syntax to learn verb semantics?. In: Lingua Vol. 92,
April 1994, 377-410 (1994). Available at: https://www.sciencedirect.com/science/article/pii/
0024384194903476. Last Accessed 22 Maj 2022.
[25] Prickett, B., Traylor, A., Pater, J.: Seq2Seq Models with Dropout can Learn Generalizable
Reduplication. In: Proceedings of the 15th SIGMORPHON Workshop on Computational
Research in Phonetics, Phonology, and Morphology, 93–100, Brussels, Belgium, October
31 (2018). Available at: https://aclanthology.org/W18-5810.pdf. Last Accessed 5 September
2022.
[26] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, A., Sastry, G., Askell,
A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning Transferable Visual Models
From Natural Language Supervision. In: Proceedings of the 38 th International Conference
on Machine Learning, PMLR 139 (2021). Available at: http://proceedings.mlr.press/v139/
radford21a/radford21a.pdf. Last Accessed 14 September 2022.
[27] Ramanathan, V., Tang, K., Mori, G., Fei-Fei, L.: Learning Temporal Embeddings for Complex
Video Analysis in Proceedings of International Conference on Computer Vision (ICCV)
(2015). Available at: https://arxiv.org/abs/1505.00315. Last Accessed 26 June 2022.
[28] Roger, E., Banjac, S., Thiebaut de Schotten, M., Baciua, M.: Missing links: The
functional unification of language and memory in Neuroscience &amp; Biobehavioral Reviews,
Volume 133, February (2022). Available at: https://www.sciencedirect.com/science/article/
pii/S0149763421005601. Last Accessed 26 June 2022.
[29] Roy, D.: Grounded speech communication. In: Proceedings of the 6th International
Conference on Spoken Language (ICSLP), vol. 4, 69-72 (2000). Available at: https://www.isca-speech.
org/archive_v0/archive_papers/icslp_2000/i00_4069.pdf. Last Accessed 22 May 2022.
[30] Roy, D.: Learning Visually Grounded Words and Syntax of Natural Spoken Language,
Computer Speech &amp; Language, Volume 16, Issues 3–4, July–October 2002, 353-385 (2002).
Available at: https://www.media.mit.edu/cogmac/publications/evol_comm_2002.pdf. Last
Accessed 12 May 2022.
[31] Roy, D.: Grounded Spoken Language acquisition: Experiments in Word Learning. In: IEEE
Transactions on Multimedia, Vol. 5, No. 2 (2003). Available at: https://www.media.mit.edu/
cogmac/publications/ieee_multimedia_2003.pdf. Last Accessed 12 May 2022.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Anderson</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teney</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bruce</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Johnson,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Sünderhauf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Reid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            ,
            <surname>Gould</surname>
          </string-name>
          , S., van den Hengel, A.:
          <article-title>Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          <year>2018</year>
          , pp.
          <fpage>3674</fpage>
          -
          <lpage>3683</lpage>
          . (
          <year>2018</year>
          ). Available at: https://openaccess.thecvf.com/content_cvpr_2018/html/Anderson_
          <article-title>Vision-and-</article-title>
          <string-name>
            <surname>Language</surname>
          </string-name>
          _
          <article-title>Navigation_Interpreting_CVPR_2018_paper</article-title>
          .html.
          <source>Last Accessed 16 September</source>
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Baevski</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohamed</surname>
            ,
            <given-names>A.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auli</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>Wav2vec 2.0: A Framework for SelfSupervised Learning of Speech Representations</article-title>
          .
          <source>In: 34th Conference on Neural Information Processing Systems (NeurIPS</source>
          <year>2020</year>
          ), Vancouver, Canada (
          <year>2020</year>
          ). Available at: https:// proceedings.neurips.cc/paper/2020/file/92d1e1eb1cd6f9fba3227870bb6d7f07-Paper.pdf.
          <source>Last Accessed 14 September</source>
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Bahdanau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.:</given-names>
          </string-name>
          <article-title>Neural Machine Translation by Jointly Learning to Align and Translate</article-title>
          .
          <source>In: 3rd International Conference on Learning Representations</source>
          (
          <year>2015</year>
          ). Available at: https://arxiv.org/pdf/1409.0473.pdf.
          <source>Last Accessed 6 September</source>
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Cangelosi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Greco</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harnad</surname>
            .,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>From robotic toil to symbolic theft: Grounding transfer from entry-level to higher-level categories</article-title>
          .
          <source>Connection Science</source>
          , Vol.
          <volume>12</volume>
          , No.
          <volume>2</volume>
          ,
          <fpage>143</fpage>
          -
          <lpage>162</lpage>
          (
          <year>2000</year>
          ). Available at: https://doi.org/10.1080/09540090050129763. Last Accessed 8 May
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Merrienboer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gulcehre</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bahdanau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bougares</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwenk</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation</article-title>
          .
          <source>In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <fpage>1724</fpage>
          -
          <lpage>1734</lpage>
          (
          <year>2014</year>
          ). Available at: https://aclanthology.org/ D14-1179. Last Accessed 19
          <year>September 2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Chollet</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Character-level recurrent sequence-to-sequence model</article-title>
          ,
          <source>Keras</source>
          , 29
          <string-name>
            <surname>September</surname>
          </string-name>
          (
          <year>2017</year>
          ). Available at: https://keras.io/examples/nlp/lstm_seq2seq.
          <source>Last Accessed 5 September</source>
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>