Grounding end-to-end Architectures for Semantic
Role Labeling in Human Robot Interaction
Claudiu Daniel Hromei¹,², Danilo Croce¹ and Roberto Basili¹
¹ University of Roma Tor Vergata, Rome, Italy
² Università Campus Bio-Medico di Roma, Rome, Italy

Natural language interactions between humans and robots are intended to be situated, in the sense that
both user and robot can access and refer to the shared environment. Contextual knowledge plays a key
role in resolving the ambiguities inherent in interpretation tasks. In addition, we expect the interpretation
produced to be well-founded, e.g., that all mentions of entities in the environment (as perceived by
the robot) are correctly grounded. In this paper, we propose the application of a transformer-based
architecture that combines the input utterance with a linguistic description of the environment to
produce interpretations and references to the environment in an end-to-end fashion. Experimental
results demonstrate the robustness of the proposed methodology, which outperforms previous approaches in
which linguistic interpretation and grounding are composed in possibly complex processing chains.

Grounded Semantic Role Labeling, Human Robot Interaction, End to End Sequence to Sequence
Architectures, Robotics and Perception

1. Introduction
In a world that is moving towards a widespread use of virtual assistants (at home, for work,
or as a hobby) and robotic platforms that perform difficult or risky tasks for humans, making
sure these technologies understand human language becomes increasingly important. Virtual
assistants are designed to satisfy a user need that is often informational or merely entertaining,
like in the daily requests made to search for translation of individual words or the requests
for popular songs. Understanding a command or the title of a song turns out to be crucial
to satisfying these needs in a natural manner. The quality of such interpretation processes is
important, especially in critical scenarios, e.g. involving robotic platforms that perform sensitive,
medical or critical tasks usually carried out under speech-based control.
   The use of natural language to control these platforms or teach them the movements or
actions needed to perform a task could be a key factor in the not-too-distant future. Today, there
are countless domestic robots whose job is to perform domotic tasks, such as automatic cleaning
or even cooking. As suggested in [2], domestic robots have to contend with complex problems
such as (1) self-localization and navigation in complex environments, (2) precise recognition of
objects and people, (3) manipulation of physical objects, and (4) meaningful interaction with
humans to satisfy their needs, be they physical or abstract (Human Robot Interaction).
hromei@ing.uniroma2.it (C. D. Hromei); croce@info.uniroma2.it (D. Croce); basili@info.uniroma2.it (R. Basili)
  It is necessary to make these home automation assistants aware of their surroundings and
the elements in them. In order to correctly interpret a sentence such as

                              “Take the volume on the table near the window”                     (1)

the system needs an entity association capability that retrieves the objects mentioned in
the command (such as volume, table and window) and possibly disambiguate between entities
of the same type. For example, in the case where several tables exist in the environment, it is
necessary to explicitly guide the interpretation towards the one that is next to the window.
    Several works, such as [3], proposed specific methods for a grounded language interpretation
of robotic commands. We investigate here the approach recently proposed in [4], namely
Grounded language Understanding via Transformers (GrUT): this method suggests adopting a
Transformer-based architecture (e.g., BART presented in [5]) that can produce the linguistic
interpretation of an utterance by taking as input i) the transcription of the input command, ii) a
linguistic description of the entities from the map involved in the command and iii) a linguistic
description of the robot’s capabilities.
    GrUT is appealing as it drastically reduces the need for task-specific engineering of the model
(such as [3]): it only requires a way to linguistically describe a map, so that the Transformer
generates interpretations consistent both with the utterance and the map. In other words, the
same command can generate different linguistic interpretations when coupled with different
map descriptions. Let us consider the utterance in example 1. Whenever the volume is actually
next to the table, the input utterance is extended with the additional synthetic text “v1 is a
volume, t1 is a table and v1 is near t1” so that GrUT produces the following interpretation:
Taking(Theme("the volume on the table near the window")). On the contrary, GrUT generates
Bringing(Theme("the volume"), Goal("on the table near the window")) whenever these objects
are far from each other, which can be expressed by “v1 is a volume, t1 is a table and v1 is far from
t1”. The final interpretation is consistent with Frame Semantics [6]: in the first case, the robot is
expected to move towards the table (i.e., t1) and take the volume (i.e., v1 ), while in the second
case it is expected to take the volume that is far away from the table and bring it over there.
However, while the actual state of the environment affects the interpretation process, GrUT
produces only an approximate linguistic effect, expressed by text fragments that fill the
arguments of the output predicate structure. A further step is still required to ground each
fragment to the intended entity triggered by the predicate.
    In this paper, we propose an end-to-end grounded interpretation process¹, so that GrUT is
expected to produce indexed representations in the robot KB, such as Taking(Theme(v1)) or
Bringing(Theme(v1), Goal(t1)). It is worth noting that the connection between words in a
spoken command and the entities (here referred to as linguistic grounding) is not a trivial task.
In general, as in [3], it is assumed that each entity in a map is denoted by one or more labels
to enable such a grounding: for example, one or more linguistic references, such as volume or
book, correspond to the object v1 . These are used in [3] and [4] to retrieve all entities involved
by a command from the map. The simplest approach here is to select all and only those entities
whose denotation coincides with one of the command words, as applied in [4]. Unfortunately,
this assumption is quite unrealistic, given the role of synonyms or paraphrases used to refer to
¹ We released an extended version of GrUT at https://github.com/crux82/grut
objects in user commands: “take the handbook . . . ” or “take the tome . . . ” can be equivalently
used in natural language interaction. To overcome this limitation, a more complex retrieval
function is explored in this paper. It makes use of more expressive associations between several
linguistic labels including words that are highly similar according to a neural semantic similarity
function. To the best of our knowledge, this expands recent research like [7] and [8] as it is the
first end-to-end technique for Fully Grounded Linguistic Interpretation, made dependent on an
explicit (logical) description of the environment.
   Results indicate that the adoption of expressive functions for linguistic grounding enables an
end-to-end process that is even more robust than an a-posteriori application of the linguistic
grounding process to the interpretations produced by GrUT.
   In the rest of this paper, section 2 summarizes the related work, section 3 presents the
proposed extension of GrUT, section 4 reports the experimental evaluation, while section 5
derives some conclusions.

2. Related Work
The semantic interpretation of texts or spoken utterances is generally modeled as a Semantic
Role Labeling (SRL) task, which consists of identifying all the expressed linguistic predicates
(such as Bringing vs. Taking evoked by the verb "to take") and their corresponding semantic
arguments (such as "the volume" or "on the table near the window") in order to perform a deep
semantic interpretation of a human-generated utterance [9]. Data-driven approaches for SRL
have received considerable attention since the pioneering work of Gildea and Jurafsky [10], leading
to multiple benchmarking initiatives [11, 12, 13]. As in [14, 15, 16], the majority of methods
divide the processing tasks into at least two steps: first, the target predicates are identified and
clarified; second, for each predicate, the relevant arguments are located and organized according
to their roles in the corresponding predicate. The latter often focuses solely on semantic role
labeling while ignoring previous predicate identification and disambiguation techniques. This
decomposition generally holds even as transformer-based models have been increasingly
used in SRL since the seminal work of [17]: in [18, 19] and [20], pre-trained architectures such
as BERT [21], RoBERTa [22], BART [5] or T5 [23] are successfully applied, but always according
to the above task decomposition.
   The authors of [19] demonstrate how BERT can be applied to semantic role categorization
without relying on syntactic features and yet produce cutting-edge results. Instead, [18] uses a
graph neural network stacked on BERT encodings to demonstrate the value of incorporating
dependency structures within the transformers: first, the output of the transformers is fed
into the graph neural network, and then semantic structures are imposed within the attention
layers. Both methods produce outcomes that are on par with the current state-of-the-art. The
importance of predicate disambiguation for the overall process is demonstrated in [20] by
modeling the two tasks of argument labeling and predicate disambiguation using RoBERTa
and the PropBank corpus [24], showing how predicate disambiguation is helpful for the overall
process. The initial proposals for an end-to-end architecture, which accepts plain text as input
and uses T5 and BART to simultaneously identify predicates and arguments, are found in [7]
and [8], respectively. In essence, T5 and BART take a simple sentence as input and create
an artificial text that allows all predicates and roles to be derived. In the same way that the
argument is recognized by specifying its position within the phrase, BART is used to identify the
predicate by signaling to the GSRL model the token that inspires it. The model was evaluated
on CoNLL2012 [13].
   All of the aforementioned methods, however, consider only linguistic evidence. According
to the concept of Grounded Semantic Role Labeling (G-SRL) in [25], the proper interpretation of
an utterance in a given context depends on the language’s grounding concerning the environ-
ment itself, such as the real objects the speaker refers to. The interpretation is to be dependent
on data derived from the analysis of photographs depicting the environment, and a probabilistic
model is proposed in the same paper. This concept is further emphasized in [3], which explains
how a domestic robot’s perception of commands depends on proofs of properties the robot can
carry out over a logical map of the environment. This latter is used to describe the surrounding
area, the objects located in specific positions and other relevant relationships.
   It’s interesting to note that texts are annotated using the Frame Semantics theory [6], which
[3] suggests can be exploited by the robot to directly derive the necessary plan and action
primitives. However, [3] adopts the traditional SRL processing chain. Additionally, their output
still only exists at the linguistic level: labeling only defines roles associated with words, not
with actual objects, whereas interpretation depends on correlations between words and objects
in the environment. As suggested in [3, 25], contemporary approaches to the interpretation of
robotic spoken commands must be harmonized across several semantic dimensions, at the very
least: (1) spatial awareness, or knowledge about the physical environment in which the robot
acts; (2) self-awareness, or knowledge of its own capabilities and limitations as a robotic
platform; and (3) the linguistic competence needed to understand the user’s utterances and produce
meaningful statements in response to stimuli or needs.
   Recently, in [4], Grounded language Understanding via Transformers (GrUT) was proposed as a
sequence-to-sequence (seq2seq) approach for GSRL, sensitive to the map information in the form
of linguistic descriptions and capable of directly performing Grounding during interpretation,
effectively linking entities in the map with the Arguments predicted. In this paper, we extend
GrUT to make it end-to-end, thus allowing richer interpretations that are also grounded in the
environment. Moreover, different and more expressive policies to retrieve entities from the map
are investigated.

3. End-to-end Grounded Semantic Role Labeling
As discussed in the previous section, the semantic interpretation of spoken commands (and in
general texts) strongly benefits from the application of Transformer-based architectures such as
BART [5]. In a nutshell, a Transformer is applied to interpretation processes by taking as input
a text expressed in natural language, and “translating” it into an artificial text reflecting the
underlying linguistic predicate. In order to extend the application of Transformers to Grounded
SRL tasks, the idea behind GrUT [4] is to use a natural language description of the map and add
it to the input sentence. If we want to make the interpretation sensitive to the entities, their
properties, position and relational information (such as proximity or distance), we first need a
way to refer to them. In our context, an entity is known through its (English) noun (possibly
its most commonly used lexical reference, e.g. the word volume) as well as its conceptual type.
The association with the environment (i.e. the grounding) is realized through its identifier
(Existence Constraint, 𝐸𝐶) that is linked to the position of the corresponding physical object in
the environment. For example, the map to be paired with the command in (1) can give rise to
the following description:
EC: "𝑏1 , also known as volume or book, is an instance of the class Book, 𝑡1 , also known as table, is
an instance of the class Table and 𝑤1 , also known as window, is an instance of the class Window.".

Moreover, if book 𝑏1 and the table 𝑡1 are close to each other in the environment, a further
declaration of a Proximity Constraint (𝑃 𝐶) acting over them will be added:
PC: "𝑏1 is near 𝑡1 and 𝑡1 is near 𝑤1 "
Finally, for each selected entity, a description of whether the property containing other objects
is true (Containability Constraint, 𝐶𝐶) will be added. For illustrative purposes, imagine the
existence of a hypothetical cup 𝑐1 :
CC: "𝑐1 can contain other objects"
The entire description is a micro-story² useful for the SRL model to disambiguate between
the different situations. Notice that only when the spatial constraint 𝑃𝐶 is true does the correct
interpretation for the ambiguous situation (1) correspond to the role-labeled logical form
                  Taking(Theme(“the volume on the table near the window”)).                 (2)
Since the book 𝑏1, referenced through the noun volume, is close to 𝑡1, it is thus interpreted
as the Theme of the Taking predicate. In this work, only these 3 constraints are defined and
used, but in the future a wider list of properties will be explored. It is worth noticing that
the linguistic description of the map enables the use of highly accurate transformers (such as
BART [5] and T5 [23]). These are pre-trained on large natural language corpora and may take
advantage of linguistic features, relationships and cross-dependencies to properly carry out
SRL on the overall textual examples made by the informative pairs in GrUT. The extraction
algorithm acting over a map, given a command 𝑐 is reported in Algorithm 1.
   In this paper, we extend the above method in two directions. First, we make GrUT an end-to-
end architecture. To this end, the transformer is trained to generate a predicate that expresses
both the semantic information and the object involved. In other words, while the input is kept
consistent with that used by GrUT, the output is:

                                      Taking(Theme(𝑏1))                                      (3)

and it is expected that the transformer autonomously learns this transformation. This schema
produces a different input in the case that book 𝑏1 is far from table 𝑡1. Part of the map description
will be PC: "𝑏1 is far from 𝑡1 and 𝑡1 is near 𝑤1", while the output is expected to be significantly
different, i.e.,

                                 Bringing(Theme(𝑏1), Goal(𝑡1))                                     (4)

² All the constraints are appended to the input, separated from one another by a "#" delimiter character.
Algorithm 1 GrUT compilation Algorithm
procedure construct_input(Sentence 𝑠 = (𝑤1, ..., 𝑤|𝑠|), LexSim. 𝑙𝑠, Threshold 𝜏𝑙𝑠)
    𝐸𝑛𝑡𝑖𝑡𝑖𝑒𝑠 ← ∅
    ◁ All entities that can be potentially referred to by a word in the command are collected
    ◁ to be considered in the map description. The implementation of get_candidate_entities
    ◁ is in Algorithm 2
    for 𝑖 = 1, ..., |𝑠| do
        if 𝑃𝑜𝑠𝑇𝑎𝑔(𝑤𝑖) == Noun then
            𝐸𝑛𝑡𝑖𝑡𝑖𝑒𝑠 ← 𝐸𝑛𝑡𝑖𝑡𝑖𝑒𝑠 ∪ get_candidate_entities(𝑤𝑖, 𝑙𝑠, 𝜏𝑙𝑠)
    𝑒𝑐 ← ""                                                    ◁ Existence Constraints
    for 𝑒 ∈ 𝐸𝑛𝑡𝑖𝑡𝑖𝑒𝑠 do
        𝑒𝑐 ← 𝑒𝑐 + " # " + 𝑔𝑒𝑡_𝑟𝑒𝑓(𝑒) + " also known as " + 𝑔𝑒𝑡_𝑙𝑒𝑥𝑖𝑐𝑎𝑙_𝑟𝑒𝑓(𝑒) +
                 " is an instance of class " + 𝑔𝑒𝑡_𝑐𝑙𝑎𝑠𝑠(𝑒)
    𝑐𝑐 ← ""                                                    ◁ Containability Constraints
    for 𝑒 ∈ 𝐸𝑛𝑡𝑖𝑡𝑖𝑒𝑠 do
        if 𝑐𝑜𝑛𝑡𝑎𝑖𝑛𝑎𝑏𝑖𝑙𝑖𝑡𝑦(𝑒) then
            𝑐𝑐 ← 𝑐𝑐 + " # " + 𝑔𝑒𝑡_𝑟𝑒𝑓(𝑒) + " can contain other objects"
    𝑝𝑐 ← ""                                                    ◁ Proximity Constraints
    for 𝑒1 ∈ 𝐸𝑛𝑡𝑖𝑡𝑖𝑒𝑠 do
        for 𝑒2 ∈ 𝐸𝑛𝑡𝑖𝑡𝑖𝑒𝑠 do
            if 𝑒1 ≠ 𝑒2 ∧ 𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒(𝑒1, 𝑒2) < 𝜏 then
                𝑝𝑐 ← 𝑝𝑐 + " # " + 𝑔𝑒𝑡_𝑟𝑒𝑓(𝑒1) + " is near " + 𝑔𝑒𝑡_𝑟𝑒𝑓(𝑒2)
        𝐸𝑛𝑡𝑖𝑡𝑖𝑒𝑠 ← 𝐸𝑛𝑡𝑖𝑡𝑖𝑒𝑠 − {𝑒1}
    return 𝑠 + 𝑒𝑐 + 𝑝𝑐 + 𝑐𝑐
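The compilation step of Algorithm 1 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the released GrUT code: the `Entity` record, the `(x, y)` map positions, the `tau_dist` distance threshold and the caller-supplied list of nouns (standing in for a POS tagger) are all hypothetical. Only "is near" declarations are emitted, as in Algorithm 1.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Entity:
    ref: str                   # KB identifier, e.g. "b1"
    lexical_refs: tuple        # lexical references, e.g. ("volume", "book")
    cls: str                   # conceptual type, e.g. "Book"
    position: tuple            # hypothetical (x, y) coordinates on the map
    can_contain: bool = False  # whether the Containability Constraint holds


def exact_match(w1, w2):
    """Exact Match lexical similarity: 1.0 for identical strings, else 0.0."""
    return 1.0 if w1 == w2 else 0.0


def construct_input(sentence, nouns, kb, ls, tau_ls, tau_dist):
    """Append the '#'-separated map description to the command."""
    # Collect every entity with a lexical reference similar enough to a noun.
    entities = []
    for noun in nouns:
        for e in kb:
            if e not in entities and any(ls(noun, r) > tau_ls for r in e.lexical_refs):
                entities.append(e)

    # Existence Constraints.
    ec = "".join(
        f" # {e.ref} also known as {' or '.join(e.lexical_refs)}"
        f" is an instance of class {e.cls}"
        for e in entities
    )
    # Containability Constraints.
    cc = "".join(f" # {e.ref} can contain other objects" for e in entities if e.can_contain)
    # Proximity Constraints: each unordered pair is visited once.
    pc = "".join(
        f" # {e1.ref} is near {e2.ref}"
        for e1, e2 in combinations(entities, 2)
        if math.dist(e1.position, e2.position) < tau_dist
    )
    return sentence + ec + pc + cc


# Toy map for the command in example (1); positions are invented.
kb = [
    Entity("b1", ("volume", "book"), "Book", (0.0, 0.0)),
    Entity("t1", ("table",), "Table", (1.0, 0.0)),
    Entity("w1", ("window",), "Window", (2.0, 0.0)),
]
command = "take the volume on the table near the window"
text = construct_input(command, ["volume", "table", "window"], kb,
                       exact_match, tau_ls=0.5, tau_dist=1.5)
print(text)
```

Visiting each unordered pair once mirrors the removal of 𝑒1 from the entity set in Algorithm 1, which keeps a symmetric "near" relation from being declared twice.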

   In addition, this paper extends the grounding process to improve the robustness and ap-
plicability of GrUT. In fact, GrUT assumes that each entity 𝑒 in the environment is enriched
with a set of lexical references 𝐿𝑅(𝑒) = {𝑤^𝑒_1, . . . , 𝑤^𝑒_𝑙}, used to link words (𝑤1, . . . , 𝑤|𝑠|) in
the sentence 𝑠. As an example, let us consider the volume 𝑣1 and its corresponding lexical
references 𝐿𝑅(𝑣1) = {volume, book}. A robust linguistic grounding function is essential for
GrUT (and our extended counterpart) to build the map description. In this sense, Algorithm 2
expresses the policy adopted to retrieve the entities involved in the utterance. In a nutshell,
we propose to retrieve all entities that have significant lexical similarity with each noun in 𝑠.
We will experimentally evaluate three LexicalSimilarity functions, characterized by incremental
levels of expressiveness:

      1. Exact Match: the simplest function corresponds to the naive exact match between two
         input strings. It produces a Boolean result, i.e., 1 if the two strings are equal, and 0
         otherwise. It allows retrieving all entities that have at least one lexical reference perfectly
         matching one word in the command; it is the function used in GrUT and, while it is very
         precise, it fails when the user refers to entities using synonyms, e.g., referring to 𝑣1 with
         handbook, tome or manual. Despite its simplicity, it assumes significant effort in map
         construction: in many practical scenarios, we cannot assume that all possible lexical
         references are defined for all entities.
      2. Levenshtein similarity: a “soft” string matching that sometimes also captures
         morphological relatedness between input word pairs. We defined a Levenshtein Similarity
         (𝐿𝑒𝑣𝑆𝑖𝑚) in the form

                  𝐿𝑒𝑣𝑆𝑖𝑚(𝑤𝑖, 𝑤𝑗) = 1 − 𝐿𝑒𝑣𝐷𝑖𝑠𝑡(𝑤𝑖, 𝑤𝑗) / |𝑙𝑜𝑛𝑔𝑒𝑠𝑡(𝑤𝑖, 𝑤𝑗)|                      (5)

         which is based on the well-known Levenshtein distance (𝐿𝑒𝑣𝐷𝑖𝑠𝑡) between input strings
         𝑤𝑖 and 𝑤𝑗. While Levenshtein distance produces values ranging from 0 to the length of the
         longest word (between the two), 𝐿𝑒𝑣𝑆𝑖𝑚 again ranges between 0 (totally different strings)
         and 1 (the same string). This similarity function is more robust in linking slightly different
         input strings (e.g., book vs handbook) and it may capture some sort of morphological
         analogy between words, but it fails when the user refers to entities using synonyms.
      3. Neural Semantic similarity: it corresponds to the cosine similarity³ between
         embeddings representing both words [26], providing a more robust connection between words
         involved in paradigmatic relationships, such as quasi-synonymy. In this work, we adopted
         a neural representation based on Word Embeddings, using the well-known Word2vec
         formulation [27] derived from the analysis of the English version of Wikipedia.

Algorithm 2 Entity Retrieval Algorithm
procedure get_candidate_entities(Word 𝑤, LexicalSimilarity 𝑙𝑠, Threshold 𝜏𝑙𝑠)
    𝐶𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒_𝐸𝑛𝑡𝑖𝑡𝑖𝑒𝑠 ← ∅
    for 𝑒 ∈ 𝐾𝐵_𝐸𝑛𝑡𝑖𝑡𝑖𝑒𝑠 do
        for 𝑙𝑒𝑥_𝑟𝑒𝑓 ∈ 𝐿𝑅(𝑒) do                         ◁ For each lexical reference
            if 𝑙𝑠(𝑤, 𝑙𝑒𝑥_𝑟𝑒𝑓) > 𝜏𝑙𝑠 then
                𝐶𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒_𝐸𝑛𝑡𝑖𝑡𝑖𝑒𝑠 ← 𝐶𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒_𝐸𝑛𝑡𝑖𝑡𝑖𝑒𝑠 ∪ {𝑒}
    return 𝐶𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒_𝐸𝑛𝑡𝑖𝑡𝑖𝑒𝑠
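The three LexicalSimilarity functions listed above can be sketched as plain Python functions. This is a minimal illustration, not the authors' implementation: the function names are ours, the Levenshtein distance is the textbook dynamic-programming version, and the neural similarity takes two already-computed embedding vectors (how they are obtained from Word2vec is left out). The clipping to 0 follows the footnote on cosine similarity.

```python
import math


def exact_match(w1, w2):
    """Boolean similarity: 1.0 if the strings are equal, 0.0 otherwise."""
    return 1.0 if w1 == w2 else 0.0


def lev_dist(a, b):
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def lev_sim(w1, w2):
    """LevSim as in equation (5): 1 - LevDist / length of the longest word."""
    longest = max(len(w1), len(w2))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - lev_dist(w1, w2) / longest


def neural_sim(v1, v2):
    """Cosine similarity between two embedding vectors, clipped to [0, 1]."""
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(y * y for y in v2))
    if norm == 0.0:
        return 0.0
    return max(0.0, dot / norm)


print(lev_sim("book", "handbook"))  # four insertions over a length of eight
print(lev_sim("book", "books"))     # one insertion over a length of five
```

Under this definition, book and handbook are four edits apart over a length of eight, so their similarity is 0.5, while book and books score 0.8.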

   To prevent smoothed measures (such as the one based on word embeddings) from causing an
excessive number of entities to be retrieved (e.g., from a map containing dozens of objects), we
applied a threshold 𝜏 : as a result, only entities with significant similarity to one of the command
words are retrieved. Our method should be robust even in cases where multiple entities are
retrieved for the same word. Although the input text may contain redundant entities in the
map description, the arguments in the output should contain only relevant entities. As a result,
the transformer is expected to select only valid entities (using the attention mechanism of the
encoder) to produce the correct output.
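The effect of the threshold on retrieval can be illustrated with a small sketch of the entity retrieval policy. The knowledge base, the word pairs and their similarity scores below are invented for illustration (they are not HuRIC data or real Word2vec scores), and `toy_similarity` is a hypothetical stand-in for a smoothed similarity function.

```python
def get_candidate_entities(word, kb, ls, tau_ls):
    """Return the entities having at least one lexical reference whose
    similarity with `word` exceeds the threshold."""
    candidates = []
    for ref, lexical_refs in kb.items():
        if any(ls(word, lex_ref) > tau_ls for lex_ref in lexical_refs):
            candidates.append(ref)
    return candidates


def exact_match(w1, w2):
    return 1.0 if w1 == w2 else 0.0


# Hypothetical similarity scores for a few word pairs (assumed values).
TOY_SCORES = {
    ("tome", "volume"): 0.70,
    ("tome", "book"): 0.62,
    ("tome", "magazine"): 0.58,
    ("tome", "table"): 0.10,
}


def toy_similarity(w1, w2):
    """Stand-in for a smoothed similarity such as the embedding-based one."""
    return 1.0 if w1 == w2 else TOY_SCORES.get((w1, w2), 0.0)


# Toy KB: entity identifier -> lexical references.
kb = {"b1": ("volume", "book"), "m1": ("magazine",), "t1": ("table",)}

strict = get_candidate_entities("tome", kb, exact_match, 0.50)
smooth = get_candidate_entities("tome", kb, toy_similarity, 0.55)
tighter = get_candidate_entities("tome", kb, toy_similarity, 0.65)
print(strict, smooth, tighter)
```

With the exact match nothing is retrieved for the synonym tome, while the smoothed function retrieves the intended book together with a redundant magazine; raising the threshold removes the redundant entity. In the proposed approach, the remaining redundancy is left for the transformer to filter out.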

4. Experimental Evaluation
The goal of the following evaluation is to demonstrate the effectiveness of the proposed extension
of GrUT and the different impacts of the lexical similarity functions we considered.

³ The cosine similarity function ranges in [-1, 1]. To maintain consistency with the other policies, we
replaced it with 𝑚𝑎𝑥(0, 𝑐𝑜𝑠𝑖𝑚(𝑤𝑖, 𝑤𝑗)).
4.1. Experimental Setup
The proposed approach is evaluated here in a home automation scenario, where a robot is
supposed to receive spoken commands and is required to interpret commands in order to
perform the expected actions. Examples are picking up a book on a table, taking out the rubbish,
or looking for the keys. The evaluation is applied to the HuRIC⁴ dataset, which consists of 656
voice commands in English coupled with interpretations, in terms of predicates and arguments,
to which the Grounding processes of the previous section were applied, i.e. linking them
with identifiers of entities in the surrounding environment. In HuRIC, predicates are defined
according to a subset of the semantic frames of FrameNet [6] and the corresponding arguments
are selected. On average, each entity is represented by 1.37 lexical references. Inspired by the
previous evaluation in [4], we adopted a 10-fold cross-validation scheme with an 80/10/10 data split
between training/validation/test, and the following aspects of the extended version of GrUT are evaluated:

    • Frame Prediction (FP), i.e. the ability of the models to correctly generate the names of
      frames evoked by the voice command; it is measured as the F1-measure, where Precision
      and Recall reflect the capability of GrUT to recover the correct frame(s) expressed in the
      spoken command.
    • Argument Identification and Classification (AIC) as an Exact Match (AIC-ExM) evaluation,
      which assesses the ability of the systems to correctly generate the names of the Arguments
      evoked by the command and to associate each of them with the entities that evoke that
      Argument; AIC-ExM is measured as the F1-measure of produced arguments
      that perfectly correspond to the gold standard in their complete form, including frame,
      type of argument and corresponding grounded entities.
    • Argument Identification and Classification (AIC) as a more relaxed evaluation, in which the
      models are required to be able to associate with each Argument at least the correct Entity
      Head (AIC-Enty); it is measured as the F1-measure, where Precision and Recall reflect the
      capability of GrUT to recover the correct arguments, i.e., having the same argument type
      and grounded entity of the ones in the spoken command.
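The FP and AIC-ExM measures above can be sketched as an F1 computed over sets. The sketch assumes a micro-averaged count of matches over all commands and a representation of each argument as a (frame, role, entities) tuple; the paper does not specify the averaging, so both choices are assumptions made for illustration.

```python
def f1_score(gold, predicted):
    """Micro-averaged F1 between two parallel lists of sets (one per command)."""
    true_pos = sum(len(g & p) for g, p in zip(gold, predicted))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# One toy command: the gold interpretation is Bringing(Theme(b1), Goal(t1)),
# while the model grounds the Goal to the wrong entity.
gold_frames = [{"Bringing"}]
pred_frames = [{"Bringing"}]
gold_args = [{("Bringing", "Theme", ("b1",)), ("Bringing", "Goal", ("t1",))}]
pred_args = [{("Bringing", "Theme", ("b1",)), ("Bringing", "Goal", ("w1",))}]

fp = f1_score(gold_frames, pred_frames)     # Frame Prediction
aic_exm = f1_score(gold_args, pred_args)    # exact match on complete arguments
print(fp, aic_exm)
```

In this toy case the frame is correct but only one of the two arguments matches the gold standard in its complete form, so FP is 1.0 and AIC-ExM is 0.5.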

   We compared the impact of the three proposed Lexical Similarity functions involved in the
entity retrieval step. For each function, a specific threshold 𝜏 was estimated on the validation set
by maximizing the F1-measure of the entity retrieval step. In this subtask, Precision is measured
as the average percentage of entities that are correctly retrieved when constructing the map
description, while Recall is measured as the average percentage of entities that were expected
to be retrieved as mentioned in the command. In particular, under the Exact Match policy a
threshold of 𝜏EM = 0.50 was applied, while under the policies based on Levenshtein Similarity
and Word Embeddings, 𝜏Lev = 0.80 and 𝜏WE = 0.55 were used, respectively. In Table 1, the
values of the main parameters used to train the Transformer are reported as the configuration
maximizing the AIC-ExM on the development set, on average.
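The threshold estimation described above can be sketched as a search over candidate values, scoring each by the F1 of the entity retrieval step, with Precision and Recall averaged over commands as in the definition given in the text. The validation examples, the similarity scores and the candidate grid below are invented for illustration; the actual search procedure is not described in the paper.

```python
def retrieval_f1(tau, validation):
    """F1 of entity retrieval at threshold `tau`.

    `validation` is a list of (scores, gold) pairs, where `scores` maps each
    entity to its best similarity with a command noun and `gold` is the set
    of entities expected in the map description.
    """
    precisions, recalls = [], []
    for scores, gold in validation:
        retrieved = {e for e, s in scores.items() if s > tau}
        correct = len(retrieved & gold)
        precisions.append(correct / len(retrieved) if retrieved else 0.0)
        recalls.append(correct / len(gold) if gold else 1.0)
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    return 2 * p * r / (p + r) if p + r else 0.0


def select_threshold(candidates, validation):
    """Pick the candidate threshold with the highest retrieval F1."""
    return max(candidates, key=lambda tau: retrieval_f1(tau, validation))


# Two toy validation commands with assumed similarity scores.
validation = [
    ({"b1": 0.90, "m1": 0.60, "t1": 0.20}, {"b1"}),
    ({"t1": 1.00, "w1": 0.75}, {"t1", "w1"}),
]
best_tau = select_threshold([0.30, 0.55, 0.70, 0.80], validation)
print(best_tau, retrieval_f1(best_tau, validation))
```

On this toy data, low thresholds admit the redundant entity m1 and the highest one drops the expected entity w1, so the search settles on the intermediate value.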

Table 1
Summary of the parameters of the BART-based Transformer
                       Param Name                             Value
                         Optimizer                           AdamW
                       Learning Rate                           5e-5
                    Early_stopping_delta                       1e-4
                   Early_stopping_metric                     eval_loss
                         Batch_size                             16
                Gradient_accumulation_steps                      2
                  Early_stopping_patience                        3
                         Scheduler                linear_schedule_with_warmup
                       Warmup Ratio                             0.1
                         Max_length                            256
                           Epochs                            50(max)
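To make the interplay of the values in Table 1 concrete, the sketch below collects them in a plain dictionary and derives the effective batch size and an upper bound on the number of warmup steps. The dictionary keys are ours, the training-set size assumes that 80% of the 656 HuRIC commands are used in each fold, and the scheduler is assumed to be planned over the maximum number of epochs; none of these details is stated in the paper, and early stopping would shorten the actual run.

```python
import math

# Values from Table 1; the key names are illustrative only.
config = {
    "optimizer": "AdamW",
    "learning_rate": 5e-5,
    "batch_size": 16,
    "gradient_accumulation_steps": 2,
    "early_stopping_metric": "eval_loss",
    "early_stopping_delta": 1e-4,
    "early_stopping_patience": 3,
    "scheduler": "linear_schedule_with_warmup",
    "warmup_ratio": 0.1,
    "max_length": 256,
    "max_epochs": 50,
}

# Assumption: 80% of the 656 commands form the training split of a fold.
n_train = round(656 * 0.8)

# Examples consumed per optimizer update.
effective_batch = config["batch_size"] * config["gradient_accumulation_steps"]

# Optimizer updates per epoch and over the maximum number of epochs.
steps_per_epoch = math.ceil(n_train / effective_batch)
max_steps = steps_per_epoch * config["max_epochs"]

# Warmup length if the schedule is planned over all 50 epochs (assumption).
warmup_steps = round(config["warmup_ratio"] * max_steps)

print(effective_batch, steps_per_epoch, max_steps, warmup_steps)
```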

4.2. Results and Discussion
The experimental results are reported in Table 2. In the first rows, GrUT + ExPostGrounding
represents our strong baseline; it is based on GrUT which, in [4], was shown to be competitive
with state-of-the-art models for Grounded Semantic Role Labeling, such as the one proposed
in [3]. GrUT generates logic forms expressing the interpretation of commands at a linguistic
level and in [4] it was reported to achieve an F1 of 92.28% in the FP task, 88.41% in the Argument
Identification and Classification Exact Match (AIC-ExM) and 93.29% in recovering
the Semantic Head of the individual arguments.
   To implement our baseline, we re-used the predictions of GrUT and applied the linguistic
grounding a-posteriori: for each argument in the produced logic form, such as Goal(“on the
table near the window”), the first noun is selected (here the table) and the entity maximizing the
lexical similarity is selected to replace the argument. If no entity is retrieved (or the threshold
is not exceeded) the related argument is removed. We considered the three Lexical Similarity
functions, i.e. based on Exact Match (EM), Levenshtein Similarity (LS) and semantic similarity
estimated over Word Embeddings (WE).
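The a-posteriori grounding used for the baseline can be sketched as follows. The sketch assumes that an interpretation is available as a frame name plus a role-to-text mapping, and it replaces the POS tagger with a hypothetical `is_noun` predicate over a small hand-written vocabulary; the knowledge base is a toy one.

```python
def exact_match(w1, w2):
    return 1.0 if w1 == w2 else 0.0


def ground_argument(span, kb, ls, tau_ls, is_noun):
    """Ground a text span to the entity most similar to its first noun.

    Returns None when the span has no noun or no entity exceeds the threshold.
    """
    head = next((w for w in span.lower().split() if is_noun(w)), None)
    if head is None:
        return None
    best_ref, best_score = None, tau_ls
    for ref, lexical_refs in kb.items():
        score = max(ls(head, lex_ref) for lex_ref in lexical_refs)
        if score > best_score:
            best_ref, best_score = ref, score
    return best_ref


def ground_interpretation(frame, arguments, kb, ls, tau_ls, is_noun):
    """Replace each argument span with an entity; drop ungrounded arguments."""
    grounded = {}
    for role, span in arguments.items():
        ref = ground_argument(span, kb, ls, tau_ls, is_noun)
        if ref is not None:
            grounded[role] = ref
    return frame, grounded


# Toy KB (entity -> lexical references) and a stand-in for POS tagging.
kb = {"b1": ("volume", "book"), "t1": ("table",), "w1": ("window",)}
TOY_NOUNS = {"volume", "tome", "table", "window"}
is_noun = TOY_NOUNS.__contains__

taking = ground_interpretation(
    "Taking", {"Theme": "the volume on the table near the window"},
    kb, exact_match, 0.5, is_noun)
bringing = ground_interpretation(
    "Bringing", {"Theme": "the tome", "Goal": "on the table near the window"},
    kb, exact_match, 0.5, is_noun)
print(taking, bringing)
```

In the second call the Theme is lost because the exact match cannot link tome to any lexical reference of b1, which is the kind of failure a smoother similarity function is meant to reduce.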

Table 2
Comparative evaluation on the Frame Prediction (FP) and Argument Identification and Classification
(AIC) tasks of the different G-SRL models: Exact Match (AIC-ExM) and Head Match (AIC-Enty) are the
different metrics for AIC. The FP score of GrUT + ExPostGrounding does not depend on the retrieval policy.
          Model                     Retrieval Policy     FP        AIC-ExM    AIC-Enty
          GrUT + ExPostGrounding    Exact Match          92.18%    78.62%     80.00%
          GrUT + ExPostGrounding    Levenshtein Sim.     92.18%    79.80%     81.37%
          GrUT + ExPostGrounding    Word Embeddings      92.18%    80.91%     82.22%
          GrUT End-to-End           Exact Match          90.38%    83.16%     84.79%
          GrUT End-to-End           Levenshtein Sim.     92.40%    84.24%     85.66%
          GrUT End-to-End           Word Embeddings      91.90%    90.03%     91.46%

  The application of the grounding function causes a significant performance drop: as an
example, AIC-ExM drops from 88.41% to 80.91%. This is mainly due to the 10% of entities in
HuRIC whose lexical reference does not match any of the words used in a candidate argument.
   The values rise from 78.62% and 80.00% for the AIC tasks at the argument level (which are indeed
simpler tasks) with the Exact Match retrieval policy to 80.91% and 82.22% with the Word Embeddings
policy, showing that the word embeddings and the vector similarity function are
useful for recovering some connections. The small difference between a grounding function
based on the Exact Match and the one based on WEs suggests that lexical references in HuRIC
are generally the same words used in the commands. The different entity retrieval policies do
not affect the Frame Prediction subtask.
   When applying our proposed model, namely GrUT-End-to-End, the Transformer-based model
outperforms the baselines, with an error reduction of roughly 50% on the AIC tasks: for example,
AIC-ExM rises from 80.91% for ExPostGrounding with Word Embeddings to 90.03% for End-to-End
with Word Embeddings. The lexical grounding based on neural representations
seems indeed robust in retrieving entities, while the attention mechanism within the transformer
effectively grounds the correct interpretation. The transformer effectively learns how to map
entities in the descriptions of the map associated with the input command to the correct entities
in the produced interpretations. It also seems to improve the results of GrUT⁵. The differences
between the EM, LS and WE are here more evident. When the Exact Match fails, no entity is
retrieved and the argument cannot be included in the final command. On the contrary, smoother
measures like the one based on WEs may introduce a super-set of the correct entities, thus
introducing some noise; however, the transformer seems effective in pruning those entities not
involved in the command. The different retrieval policies slightly affect the FP tasks: the Exact
Match generally ignores several entities not retrieved in the input, and it negatively affects
the capability of frame disambiguation. The effect of LS allows improving the quality of frame
disambiguation also against the WE: we speculate that if “extra” entities not mentioned in the
command do help the overall grounding process from one point of view, from the other these
do not help the frame prediction substep.
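
   The error reduction quoted above can be checked with a short calculation. The sketch below
assumes that the reported scores are accuracies in percent, so that the error is 100 minus the
score; the function name is ours and is used only for illustration.

```python
def relative_error_reduction(baseline_score: float, new_score: float) -> float:
    """Fraction of the baseline error removed by the new system.

    Assumption: both scores are accuracies in percent, so error = 100 - score.
    """
    baseline_error = 100.0 - baseline_score
    new_error = 100.0 - new_score
    if baseline_error <= 0:
        raise ValueError("the baseline has no error left to reduce")
    return (baseline_error - new_error) / baseline_error


# AIC scores reported in the text: ExPostGrounding_W2V vs. End-to-End_W2V.
reduction = relative_error_reduction(80.91, 90.03)
print(f"relative error reduction: {reduction:.1%}")
```

Under this assumption the error falls from 19.09 to 9.97 points, i.e., slightly less than half
of the baseline error is removed.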

Table 3
Error analysis of the GrUT-End-to-End model (with different retrieval policies) applied to the command
"take the book to the bed room".
        Retrieval Policy          Output                                              Correct
        Exact Match               Bringing(Theme('the book'), Goal('the bedroom'))    NO
        Levenshtein Similarity    Bringing(Theme('the book'), Goal(𝑠6))               NO
        Word Embeddings           Bringing(Theme(𝑤6), Goal(𝑠6))                       YES

   The error analysis summarized by the example in Table 3 confirms that most of the
misinterpretations are due to errors in the linguistic grounding phase. This example refers to
a simple map with two entities: 𝑤6, an instance of the Book class whose only lexical reference
is volume, and 𝑠6, an instance of the Room class whose only lexical reference is guest room.
Given the command "take the book to the bed room", the GrUT-End-to-End model adopting the Exact
Match is not able to retrieve any entity and, even though it is able to infer the correct
predicate and arguments, it simply rewrites the input text. When using Levenshtein Similarity,
the "guest room" is correctly linked, so that the second argument is correctly grounded. The
adoption of the cosine similarity in the neural word embedding space allows generating the input
"take the book to the bed room # 𝑤6 also known as volume is an instance of class BOOK & 𝑠6 also
known as guest room is an instance of class BEDROOM & 𝑤6 is far from 𝑠6", which leads to the
correct interpretation. Most of the errors involve entities that have exactly one lexical
reference, which cannot be retrieved by the linguistic grounding function because it is an
uncommon noun, e.g., the lexical reference volume for the Book type entity.

^5 The evaluation in this paper follows a slightly different policy than [4]: here the "simpler"
measure just requires that an interpretation include the correct frame, arguments and mentioned
entities; in [4] an interpretation is correct only when all words in the text are also correctly
mapped to their corresponding frames and arguments.
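
   The behaviour of the two string-based policies on this example can be reproduced with a
minimal sketch. It is not the retrieval code used in the experiments: the n-gram comparison, the
length-normalized similarity and the 0.5 threshold are our assumptions, and the identifiers w6
and s6 stand for the entities 𝑤6 and 𝑠6. The Word Embeddings policy would replace the string
similarity with a cosine similarity over pretrained vectors, which are not reproduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; the normalization is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def ngrams(tokens, n):
    """All contiguous word n-grams of the command, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def retrieve(command, lexicon, policy, threshold=0.5):
    """Return the entities whose lexical reference matches a span of the command.

    policy is "exact" or "levenshtein"; threshold is a hypothetical value.
    """
    tokens = command.lower().split()
    found = set()
    for entity, reference in lexicon.items():
        spans = ngrams(tokens, len(reference.split()))
        if policy == "exact":
            hit = reference in spans
        else:
            hit = any(similarity(reference, s) >= threshold for s in spans)
        if hit:
            found.add(entity)
    return found


# Map of the example: each entity has a single lexical reference.
LEXICON = {"w6": "volume", "s6": "guest room"}
COMMAND = "take the book to the bed room"

print(retrieve(COMMAND, LEXICON, "exact"))
print(retrieve(COMMAND, LEXICON, "levenshtein"))
```

With these assumptions the exact policy retrieves nothing, while the Levenshtein policy links
"bed room" to "guest room" (edit distance 4 over 10 characters) but still misses volume, which
matches the first two rows of Table 3.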

5. Conclusions
This paper presents an End-to-End sequence-to-sequence process for Grounded Semantic Role
Labeling. In the proposed approach, the input of a Transformer-based architecture is the user's
utterance enriched by a description of the surrounding environment expressed in natural
language. Correspondingly, the desired output is a logical form in which the entities of the
environment are correctly associated with the command and grounded. Several policies have been
applied as strategies for retrieving the entities of the map that are involved in the input
utterance.
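
   As an illustration of this input format, the following sketch assembles the enriched input
from a command and the retrieved entities. The separators "#" and "&" and the phrasing of the
entity descriptions follow the example given in the error analysis; the function names and the
tuple layout are hypothetical.

```python
def describe_entity(name: str, lexical_reference: str, entity_class: str) -> str:
    """Verbalize one map entity, following the example string in the paper."""
    return (f"{name} also known as {lexical_reference} "
            f"is an instance of class {entity_class.upper()}")


def build_input(command: str, entities, relations=()) -> str:
    """Concatenate the command with the description of the retrieved entities.

    entities: iterable of (name, lexical_reference, entity_class) tuples.
    relations: iterable of already verbalized spatial relations.
    When nothing is retrieved, the input is the bare command.
    """
    parts = [describe_entity(*entity) for entity in entities]
    parts.extend(relations)
    if not parts:
        return command
    return f"{command} # " + " & ".join(parts)


example = build_input(
    "take the book to the bed room",
    [("w6", "volume", "Book"), ("s6", "guest room", "Bedroom")],
    ["w6 is far from s6"],
)
print(example)
```

The resulting string is what the sequence-to-sequence model would receive; when the retrieval
policy returns no entity, the model only sees the command and cannot produce grounded arguments.
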
   The experimental results confirm the robustness of the presented methods, especially when
compared with traditional architectures that chain the linguistic interpretation of the
utterance with the linguistic grounding of the involved entities. This result is widely
applicable as it does not require a costly adaptation to a particular scenario or domain.
   Future work will extend the proposed methodology to consider additional properties of the
environment (e.g., object properties crucial to disambiguate multiple instances of the same class,
such as multiple books in a map), the user profile, or information extracted during the previous
interactions between the user and the robot (exploiting the dialogue history).

We would like to thank the "Istituto di Analisi dei Sistemi ed Informatica - Antonio Ruberti"
(IASI) for supporting the experiments through access to dedicated computing resources.
Claudiu Daniel Hromei is a Ph.D. student enrolled in the National Ph.D. in Artificial Intelligence,
XXXVII cycle, course on Health and life sciences, organized by the Università Campus Bio-Medico
di Roma.

