Collecting information for action understanding. The enrichment of the IMAGACT Ontology of Action

Andrea Amelio RAVELLI, Lorenzo GREGORI and Alessandro PANUNZI
LABLITA - Università degli Studi di Firenze, Italy

Abstract. This paper presents the status of our work aimed at enriching the IMAGACT Ontology of Action by linking it to other resources. To achieve this goal we performed a visual mapping, exploiting the IMAGACT visual component (video scenes that represent physical actions) as the linkage point among resources. By using visual objects, which are free from linguistic constraints and can be interpreted and described from different perspectives, we connected resources responding to different scopes and theoretical frameworks, for which a concept-to-concept mapping appeared difficult to obtain. We provide a brief description of two linkings obtained with this technique: an automatic linking between IMAGACT and BabelNet, a multilingual semantic network, and a manual linking between IMAGACT and Praxicon, a conceptual knowledge base of action.

Keywords. ontology linking, IMAGACT, BabelNet, Praxicon

1. Introduction

Action verbs carry the basic information that must be understood in order to make sense of a sentence, and that must be processed in instructions given to artificial systems. The difficulty of understanding action verbs stems from the evidence that no one-to-one correspondence can be established between action predicates and action concepts. The same action can be predicated by multiple verbs (e.g. "John takes/brings/leads Mary to the restaurant") and, conversely, one verb can extend to multiple, different actions (e.g. "John takes the cup from the table", "John takes/brings the cup to Mary"). Most of these verbs belong to the class of general verbs, which are characterized by high ambiguity and high frequency of use [1].
In these circumstances, senses are often vague and overlapping, their discrimination is not clear-cut, and this is a critical issue for their semantic representation. The representation becomes even more difficult from a multilingual perspective, given that different languages segment the action space differently. It has been observed [2] that, even with a fine-grained sense distinction, it is often not possible to find an exact match between action concepts lexicalized by verbs in different languages. Moreover, one language may completely lack a lexical representation for a specific concept, i.e. a lexical gap [3]. These problems deeply affect NLP tasks dealing with actions and their correct interpretation [4].

This paper reports two linking experiments performed on the IMAGACT Visual Ontology of Action, aimed at gathering information about actions from several perspectives and at different levels: semantic, motoric and visual. The linkings were carried out by exploiting the visual information of IMAGACT: instead of a classic concept-to-concept mapping we performed a visual mapping, that is, a concept-to-video linking. This strategy allowed us to connect linguistic resources that embody different conceptualizations of events.

This work, far from being definitive, could be useful for the future construction of integrated resources on action understanding, to be effectively exploited for both theoretical analysis and computational applications.

2. The IMAGACT Visual Ontology of Action

Verbs are the lexical class that is normally responsible for event categorization. Among events, actions (defined as goal-oriented events performed by an intentional agent) play an important role from a linguistic perspective: action verbs are very frequent in spoken language and they are also very ambiguous.
Moreover, the semantic classification of action verbs is more complex and less linear than that of nouns, so that it is frequently not possible to discriminate a coherent list of word senses.

The IMAGACT Visual Ontology of Action1 [5] is a multimodal and multilingual resource that offers a novel integration of visual and linguistic information as complementary elements. The resource contains 1010 distinct action concepts, obtained by bootstrapping information from Italian and English spoken corpora. Metaphorical and phraseological usages were excluded from the annotation process, in order to collect exclusively occurrences of verbs referring to physical actions.

Verbs in IMAGACT are divided into action types, according to their semantic variation; each type is linked to one or more video scenes (either 3D animations or filmed video clips), in which a prototypical action is performed. Verbs referring to the same concept are linked to the same scenes, creating an interlinguistic semantic network. The ontology is in continuous development and, at present, contains 9 fully-mapped languages and 13 underway, with an average of 730 action verbs per language. This resource gives a broad picture of the variety of actions and activities that are prominent in everyday life, and specifies the lexicon used to express each one in ordinary communication, in all the included languages.

3. Linking resources, sharing knowledge

In order to collect more information, we planned an extensive enrichment campaign, based on the comparison and mutual exchange with other resources. For this task we applied visual mapping, a methodology that points the concepts of different resources to a shared visual representation. In fact, a video depicting an event is not subject to any linguistic constraint, and the associated semantic information can be described in various manners.
Starting from this observation, we used the videos to link concepts of different resources, each expressing an independent conceptualization of events according to its own theoretical framework. It follows that the multimodal nature of IMAGACT is a key point for its enrichment and implementation. Herein we present some current results obtained by linking IMAGACT with BabelNet and Praxicon. An example of the resulting output can be observed in Figure 1, which shows a beating event with its parallel representation in the three resources.

1 http://www.imagact.it/

Figure 1. An example of the resulting linking between BabelNet, IMAGACT and Praxicon.

3.1. IMAGACT and BabelNet

BabelNet2 [6] is a multilingual semantic network obtained through the automatic mapping of the WordNet thesaurus and the Wikipedia encyclopedia. At present, BabelNet 3.7 covers 284 languages and is the widest multilingual resource available for semantic disambiguation. Concepts and entities are represented by BabelSynsets (BSs), unitary concepts identified by several kinds of information (semantic features, glosses, usage examples, images, etc.) and related to lemmas (in any language) that have a sense matching those concepts. BSs are not isolated, but connected into a huge network by means of the semantic relations inherited from WordNet.

BabelNet concepts (the BSs) are interlinguistic: they gather all the word senses, in different languages, that are semantically equivalent (or almost equivalent). Conversely, IMAGACT action types encode small semantic differences, so they are more granular and language-dependent. Given these differences, an exact match between their concepts is very rare; it is also hard to establish less strict semantic relations (e.g. narrow-to-broad), because the boundaries of BSs are often fuzzy and the gloss does not always allow a clear discrimination between them.
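One simple signal for relating a scene to a BS, despite fuzzy concept boundaries, is the overlap between the verb lemmas attached to each. The following minimal sketch uses invented identifiers and lemma sets for illustration only; the actual linking relies on the ML algorithm of [7], not on this raw Jaccard score, and does not reflect the resources' real data or APIs.

```python
# Hypothetical verb lemmas attached to an IMAGACT scene and to two
# candidate BabelSynsets (all identifiers and sets are invented).
scene_verbs = {"take", "grab", "prendere", "afferrare"}

candidate_synsets = {
    "bs:take_v":  {"take", "get", "prendere"},
    "bs:bring_v": {"bring", "carry", "portare"},
}

def overlap_score(scene_lemmas, synset_lemmas):
    """Jaccard overlap between the two lemma sets."""
    shared = scene_lemmas & synset_lemmas
    return len(shared) / len(scene_lemmas | synset_lemmas)

# Rank candidate BSs by how many verb lemmas they share with the scene.
ranked = sorted(
    candidate_synsets,
    key=lambda bs: overlap_score(scene_verbs, candidate_synsets[bs]),
    reverse=True,
)
print(ranked[0])  # -> bs:take_v
```

Multilinguality helps here: the more languages contribute lemmas to both sides, the more discriminative such an overlap becomes.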
In this case visual mapping solved the problem: even for BSs whose description is not precise, it is easy to judge whether a video is a good action prototype for them or not3. Given the multilingual nature of the two resources, we could exploit rich lexical information, i.e. all the verbs, in many languages, related both to IMAGACT scenes and to BabelNet BSs. The connections between BSs and scenes have been automatically established on the basis of the number of shared verbal lemmas, through an ML algorithm [7]. As a result of this linking, on the one hand IMAGACT gained translation information for languages not yet implemented in the Visual Ontology and, on the other, BSs referring to action verbs obtained a video representation. Table 1 shows the detailed numbers of scenes and BSs connected through this linking.

2 http://www.babelnet.org/

Table 1. IMAGACT-BabelNet linking results.

IM Scenes linked to BSs: 773
BSs linked to Scenes: 517
IM English Verbs related to Scenes: 544
BabelNet English Verbs related to BSs: 1,100

3.2. IMAGACT and Praxicon

Praxicon4 is an ontology for the representation of action concepts, based on the Minimalist Grammar of Action [8]. In Praxicon, an action is expressed through motor concepts, specified in terms of 3 basic components: GOAL, TOOL and OBJECT. A wide part of this ontology is also linked with WordNet synsets and ImageNet images [9].

Praxicon distinguishes between Actions, Movements and Events5. Actions are sets of structured motoric executions, intentionally performed by an agent to achieve a goal. The goal is a necessary component, so any non-voluntary motoric activation is classified as a Movement, not as an Action. Finally, actions that are too complex to be described as a set of motoric concepts are considered Events and are out of the scope of the Praxicon resource.
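To make the shape of such a linking record concrete, here is a minimal sketch pairing an IMAGACT scene with a Praxicon-style motor concept built from the GOAL, TOOL and OBJECT components. All class names, field names and values are hypothetical and do not reproduce the actual schema of either resource.

```python
from dataclasses import dataclass, field

@dataclass
class MotorConcept:
    """A Praxicon-style action concept: GOAL, TOOL, OBJECT (names invented)."""
    goal: str
    tool: str
    obj: str

@dataclass
class SceneLink:
    scene_id: str                                    # IMAGACT scene identifier
    verbs: list = field(default_factory=list)        # English verbs the scene instantiates
    concepts: list = field(default_factory=list)     # one or more Praxicon Action concepts

# A one-to-one case: the scene maps onto a single Action concept.
link = SceneLink(
    scene_id="scene-0042",                           # invented id
    verbs=["beat", "hit"],
    concepts=[MotorConcept(goal="deform", tool="hand", obj="surface")],
)
print(len(link.concepts))  # -> 1
```

Scenes mapping onto more than one Action concept would simply carry a longer `concepts` list, while Movements and Events would carry none.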
Similarly to the linking with BabelNet, the IMAGACT scenes are used to connect the information of the two resources, given that their definitions of concepts are too different to attempt a proper and extensive sense matching. In fact, the IMAGACT scenes can act as visual representations of Praxicon action concepts and, at the same time, the Praxicon syntax can be used to analytically describe, from a physical-motoric point of view, all the low-level actions involved in the execution of more complex ones.

Unlike the previous linking, in this case the work is entirely manual, consisting of the analysis of each scene and the determination of the physical action performed.

3 The measured inter-rater agreement for this task is a Fleiss k of 0.74 with 3 annotators. The annotated dataset is available at http://bit.ly/2jt2cD4
4 https://github.com/CSRI/PraxiconDB
5 These categories have their own definition in the Praxicon framework. We use capital letters when referring to this specific meaning.

The scene annotation has been accomplished on 281 IMAGACT scenes (∼28% of the total), with the following results:

• 154 scenes (∼55%) have a one-to-one relation with Praxicon Action concepts;
• 64 scenes (∼23%) map onto more than one Action concept;
• 19 scenes (∼7%) are Movements but not Actions (in the Praxicon framework);
• 30 scenes (∼11%) are Events but not Actions (in the Praxicon framework);
• 14 scenes (∼5%) are unclear.

IMAGACT scenes are specifically created to provide a prototypical representation of a lexicalized action concept: every scene is the referent of at least one English action verb. This allowed us to derive from these numbers some considerations about the relation between the motoric and the lexical level. In Praxicon Events, motoric properties do not play a role in the verb meaning, which encodes an abstract result that is independent of the physical execution of the action.
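The percentages reported in the list of annotation results can be re-derived directly from the raw counts over the 281 annotated scenes:

```python
# Raw counts from the manual scene annotation (281 scenes in total).
counts = {
    "one-to-one Action": 154,
    "multiple Actions": 64,
    "Movement, not Action": 19,
    "Event, not Action": 30,
    "unclear": 14,
}
total = sum(counts.values())  # 281
for label, n in counts.items():
    print(f"{label}: {n / total:.0%}")
```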
Examples are verbs like to drive, to clean or to rob, which encode a complex set of motoric actions by predicating their final result: ∼11% of the actions commonly referred to in language (in English) belong to this class. Conversely, ∼55% of the scenes have a one-to-one mapping with a Praxicon concept, meaning that there is a short distance between the motoric and the lexical level: we can consider these as the cases where the physical execution of an action most deeply affects the verb semantics. Example verbs of this class are to push, to gallop or to brush. Then, ∼23% of the retrieved actions are at an intermediate level of abstraction: they can be expressed in terms of physical action concepts, but more than one Praxicon concept is involved in a single lexicalized action. Some example verbs are to break, to open or to glue. Finally, we found that ∼7% of the events that in English are referred to through action verbs are Movements, which do not correspond to voluntary actions, like to fall or to drop.

This work is still in progress, but we believe that the integration of linguistic and motoric knowledge on action is very relevant both for theoretical analysis and for robotic applications. On the one hand, an integrated resource is desirable to carry out deep investigations of the relation between language and action, a long-debated subject in linguistics and neuroscience [10,11]. On the other, Praxicon is also exploited for robotic applications [12,13], and its integration with a linguistic-oriented resource like IMAGACT can be useful to enhance human-robot interaction through natural language.

4. Conclusions & Future Work

In this paper we presented the very first steps in the construction of a comprehensive resource for the understanding of actions and their representation in language systems, built on top of the ontological structure of IMAGACT. We introduced the visual mapping methodology, which allows resource linking through visual representations.
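Schematically, visual mapping replaces direct concept-to-concept relations with concept-to-video links: each resource's concepts point to shared scenes, and cross-resource links are read off the shared scene identifiers. A toy sketch with invented identifiers (not the resources' real data):

```python
# Each resource maps its own concept ids to the video scenes it points to.
imagact_types = {"IT:take-type1": {"scene-01", "scene-02"}}
babelnet_bss = {"bs:take_v": {"scene-02", "scene-07"}}

def linked_pairs(res_a, res_b):
    """Yield concept pairs from the two resources that share a scene."""
    for a, scenes_a in res_a.items():
        for b, scenes_b in res_b.items():
            if scenes_a & scenes_b:  # at least one scene in common
                yield (a, b)

print(list(linked_pairs(imagact_types, babelnet_bss)))
# -> [('IT:take-type1', 'bs:take_v')]
```

No semantic relation between the two concepts is asserted directly; they are connected only through the scene they both accept as a prototype.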
This approach is particularly useful when it is hard to find relations between concepts, because it does not force any kind of convergence between senses. For this reason we feel confident that this methodology could also be successfully applied to other linking tasks involving multimodal resources. Two case studies have been described: the linking of IMAGACT with BabelNet and with Praxicon. In the first case we were dealing with lexical-semantic resources with huge differences in sense discrimination, which made it hard to find inter-resource semantic relations. In the case of Praxicon, we applied visual mapping to link IMAGACT with a resource of a different type, in which the concepts are motoric rather than linguistic.

Finally, to extend the information connected to action concepts, we aim to enrich our ontology with the annotation of noun senses and with predicate argument structures [14], in order to implement semantic selection restrictions for the verbs in each action type.

References

[1] M. Moneglia & A. Panunzi (2010). I verbi generali nei corpora di parlato. Un progetto di annotazione semantica cross-linguistica. In E. Cresti & I. Korzen (eds), Language, Cognition and Identity. Extension of the Endocentric/Esocentric Typology. Firenze: Firenze University Press, 27-46.
[2] M. Moneglia & A. Panunzi (2007). Action Predicates and the Ontology of Action across Spoken Language Corpora. The Basic Issue of the SEMACT Project. In M. Alcántara Plá & T. Declerck (eds), Proceedings of the International Workshop on the Semantic Representation of Spoken Language. Salamanca: Universidad de Salamanca, 51-58.
[3] L. Gregori & A. Panunzi (2017). Measuring the Italian-English lexical gap for action verbs and its impact on translation. In Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications. Valencia: Association for Computational Linguistics, 102-109.
[4] M. Moneglia (2014).
Natural Language Ontology of Action: A Gap with Huge Consequences for Natural Language Understanding and Machine Translation. In Z. Vetulani & J. Mariani (eds), Human Language Technology Challenges for Computer Science and Linguistics, volume 8387 of Lecture Notes in Computer Science. Springer International Publishing, 379-395.
[5] M. Moneglia, S. Brown, F. Frontini, G. Gagliardi, F. Khan, M. Monachini & A. Panunzi (2014). The IMAGACT Visual Ontology. An Extendable Multilingual Infrastructure for the Representation of Lexical Encoding of Action. In N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk & S. Piperidis (eds), Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland. European Language Resources Association (ELRA), 3425-3432.
[6] R. Navigli & S. Ponzetto (2012). BabelNet: The Automatic Construction, Evaluation and Application of a Wide-Coverage Multilingual Semantic Network. Artificial Intelligence 193, 217-250.
[7] L. Gregori, A. Panunzi & A.A. Ravelli (2016). Linking IMAGACT ontology to BabelNet through action videos. In A. Corazza, S. Montemagni & G. Semeraro (eds), Proceedings of the Third Italian Conference on Computational Linguistics CLiC-it 2016, 5-6 December 2016, Napoli. Accademia University Press, 162-167.
[8] K. Pastra & Y. Aloimonos (2011). The minimalist grammar of action. Philosophical Transactions of the Royal Society of London B: Biological Sciences 1585, 103-117.
[9] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li & L. Fei-Fei (2009). ImageNet: A Large-Scale Hierarchical Image Database. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[10] J. Pustejovsky (1991). The syntax of event structure. Cognition 41:1-3, 47-81.
[11] F. Pulvermüller (2005). Brain mechanisms linking language and action. Nature Reviews Neuroscience 6(7), 576-582.
[12] N. Vitucci, A.M. Franchi & G. Gini (2016).
Programming a humanoid robot in natural language: an experiment with description logics. Workshop on Simulation in Robot Programming, SIMPAR 2016.
[13] N.G. Tsagarakis, G. Metta, G. Sandini, D. Vernon, R. Beira, F. Becchi, L. Righetti, J. Santos-Victor, A.J. Ijspeert, M.C. Carrozza & D.G. Caldwell (2007). iCub: the design and realization of an open humanoid platform for cognitive and neuroscience research. Advanced Robotics 21:10.
[14] E. Jezek, B. Magnini, A. Feltracco, A. Bianchini & O. Popescu (2014). A resource of Typed Predicate Argument Structures for linguistic analysis and semantic processing. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland.