Linking IMAGACT Ontology to BabelNet through Action Videos. CEUR-WS Vol-1749, paper 28: https://ceur-ws.org/Vol-1749/paper28.pdf (DBLP: https://dblp.org/rec/conf/clic-it/GregoriPR16)
   Linking IMAGACT ontology to BabelNet through action videos

  Lorenzo Gregori                  Alessandro Panunzi                Andrea Amelio Ravelli
 University of Florence            University of Florence            University of Florence
lorenzo.gregori@unifi.it       alessandro.panunzi@unifi.it             aramelior@gmail.com




Abstract

English. Herein we present a study dealing with the linking of two multilingual and multimedia resources, BabelNet and IMAGACT, which seeks to connect the videos contained in the IMAGACT Ontology of Action with related action concepts in BabelNet. The linking is based on a machine learning algorithm that exploits the lexical information of the two resources. The algorithm was first trained and tested on a manually annotated dataset and then run on all the data, making it possible to connect 773 IMAGACT action videos with 517 BabelNet synsets. This linkage aims to enrich BabelNet verbal entries with visual representations and to connect the IMAGACT ontology to the huge BabelNet semantic network.

Italiano. This paper presents a study on the linking of two multilingual and multimedia language resources, BabelNet and IMAGACT. The experiment aims to connect the videos of the IMAGACT action ontology with the action concepts contained in BabelNet. The link is established through a machine learning algorithm that exploits the lexical information of the two resources. The algorithm was trained and evaluated on a manually annotated dataset and then run on the full dataset, making it possible to connect 773 IMAGACT videos with 517 BabelNet synsets. This linking aims to enrich the BabelNet verbal entries with a visual representation and to connect IMAGACT to the BabelNet semantic network.

1 Introduction[1]

Ontologies are widely used to represent language resources on the web, allowing them to be easily accessed and exploited by machines. For this reason, data interconnection between different semantic resources is a crucial task for enhancing disambiguation and information retrieval capabilities in Artificial Intelligence, as evidenced by the increasing research into mapping and linking techniques among ontologies (Otero-Cerdeira et al., 2015). Nevertheless, ontology mapping has to face the problem of concept representation mismatch between resources, due to different building criteria and purposes (Siemoneit et al., 2015). Instance matching techniques play an important role in this context, since they make it possible to connect entities from heterogeneous data resources which refer to the same real-world object (Castano et al., 2008; Nath et al., 2014).

Beyond the general interest in knowledge base interconnection from a web-based perspective, there is also a growing interest in multimodal resources, which combine textual and visual data. These resources can be exploited by intelligent algorithms that integrate vision and natural language processing techniques.[2] This integrated approach has been successfully applied to some challenging tasks involving verbs and their reference to actions shown in videos. Regneri et al. (2013) developed machine learning models for the automatic identification of similarity among actions, using a corpus of natural language descriptions derived from the videos of the MPII Cooking Composite Activities dataset, which represents actions involved in basic cooking tasks. Instead, the algorithm developed by Mathe

[1] Lorenzo Gregori developed the linking algorithm and wrote sections 3, 4, and 5; Andrea Amelio Ravelli performed the data annotation and wrote sections 1 and 2; Alessandro Panunzi supervised the research work and revised the paper.
[2] Several works in this field have been developed within the European Network on Integrating Vision and Language (iV&L Net), http://ivl-net.eu/
et al. (2008) extracts higher-level semantic features shared among a sample set of verbs, using a fine-grained analysis of the represented action concepts, intended as a stable set of abstract features of the objects involved in the videos. Within this interdisciplinary perspective, a knowledge base which relates verbal lemmas in different languages to video prototypes can help in several applications, and can be exploited by both humans and machines.

2 Resources

This paper presents a linking between BabelNet (Navigli and Ponzetto, 2012a) and IMAGACT (Moneglia et al., 2014a), two multilingual and multimedia resources suitable for automatic translation and disambiguation tasks (Russo et al., 2013; Moneglia, 2014; Moro and Navigli, 2015).

2.1 BabelNet

BabelNet[3] is a multilingual semantic network created by mapping together the WordNet thesaurus and the Wikipedia encyclopedia. At present, BabelNet 3.7 covers 271 languages and is the widest multilingual resource available for semantic disambiguation. Concepts and entities are represented by BabelSynsets (BSs), extensions of WordNet synsets: a BS is a unitary concept identified by several kinds of information (semantic features, glosses, usage examples, etc.) and related to the lemmas (in any language) which have a sense matching that concept. BSs are not isolated, but connected together by semantic relations. Moreover, BabelNet has received large contributions from its mapping with other resources such as ImageNet, GeoNames, and OmegaWiki (along with many others), which extended its information beyond the lexicon and produced a wide-ranging, multimedia knowledge base.

2.2 IMAGACT

IMAGACT[4] is a visual ontology of action that provides a video-based translation and disambiguation framework for general verbs. The database evolves continuously (Moneglia et al., 2014b) and at present contains 9 fully-mapped languages, with 13 more underway. The resource is built on an ontology containing a fine-grained categorization of action concepts, each represented by one or more video prototypes as recorded scenes and 3D animations. IMAGACT currently contains 1,010 scenes which encompass the action concepts most commonly referred to in everyday language usage.[5] The links between verbs and video scenes are based on the co-referentiality of different verbs with respect to the action expressed by a scene (i.e. different verbs can describe the same action, visualised in the scene). The visual representations convey the action information in a cross-linguistic environment, and IMAGACT may thus be exploited for reference disambiguation in automatic and assisted translation tasks (Panunzi et al., 2014).

3 Related works

Other attempts have previously been made to link IMAGACT with other resources. Two experiments by De Felice et al. (2014) and by Bartolini et al. (2014) were conducted in an intra-linguistic perspective: their aim was to evaluate the results of a mapping between the action concepts defined in ItalWordNet and the ones categorized by IMAGACT (in terms of perfect matches or hypernym/hyponym relations).

By contrast, the objective of our work is to obtain a light link between the resources by enriching the action concepts in BabelNet with a visual representation; in this way, we bypass the problem of finding a match between the generic semantic concepts in BabelNet and the specific pragmatic concepts in IMAGACT. This methodology is also reinforced by the multilingual frame in which the experiment is conducted. As a matter of fact, the relation between words and concepts can differ deeply across languages, while the prototypical scenes provide a language-independent modality which is able to hold together the different lexicalizations of the action space.

This work is a further step beyond a previous IMAGACT-BabelNet linking experiment (Gregori et al., 2015). Even though it was just a feasibility test to check the consistency of the linking, we reported good results in the automatic assignment of IMAGACT prototypical scenes to BabelNet synsets. For this reason, we built a bigger dataset and moved from a metric-based to a Machine Learning algorithm to be run on the whole set of IMAGACT

[3] http://babelnet.org
[4] http://www.imagact.it
[5] The data is derived from the annotation of verb occurrences in spontaneous spoken corpora (Moneglia et al., 2012).
scenes.

4 Linking experiment

The aim of this experiment is to link the IMAGACT video scenes to the BabelNet interlinguistic concepts (BabelSynsets). The BabelNet objects are in fact already enriched with visual objects, but this information consists of static images, which are inadequate for representing action concepts. Adding video scenes to the verbs is therefore very desirable and suggests itself as a natural extension of BabelNet.

4.1 Training and test set

A manually annotated dataset of 50 scenes and 57 BabelSynsets (2,850 judgments) was created in order to test the algorithm and evaluate the results.

The sampling was carried out in two steps. First of all, a purely actional semantic area was selected by taking the BSs and scenes linked to 7 English action verbs which are general and very frequent in language use: put, move, take, insert, press, give and strike. The wide variation of these verbs allowed us to obtain a big set of concepts, with high variation in terms of frequency and generality. On this set, a second sampling was performed, preserving the variability in terms of number of connected verbs, which is a measurable parameter in both resources.

Each ⟨BS, Scene⟩ pair was evaluated to check whether the scene is appropriate for representing the BS. Three annotators compiled the binary judgment table and we retained the values shared by at least 2 of the 3. The measured Fleiss' kappa inter-rater agreement for this task was 0.74.[6]

Finally, the dataset was split into a training set and a test set, with proportions of 80% and 20% respectively (10 randomly chosen scenes for the test set and the remaining 40 scenes for the training set).

4.2 Algorithm

For this task we developed a new algorithm based on Machine Learning techniques which exploits the training set. As in the previous experiment, the features are extracted from the lexical items belonging to both the candidate BabelSynset and its neighbours.[7] Besides the algorithm, a baseline is determined by calculating the ratio nsb / (nb + ns) for each pair and setting a threshold of 0.04, which maximizes the F-measure on our dataset.

Table 1 reports on the 17 languages common to both BabelNet and IMAGACT, detailing the relative number of verbs in each; this constitutes the quantitative data which the matching algorithms can exploit.

  Language          BN Verbs    IM Verbs
  English (EN)        29,738       1,299
  Polish (PL)          9,660       1,193
  Chinese (ZH)         9,507       1,171
  Italian (IT)         7,184       1,100
  Spanish (ES)         6,159         736
  Russian (RU)         4,975          34
  Portuguese (PT)      4,624         776
  Arabic (AR)          3,738         804
  German (DE)          3,754         992
  Norwegian (NO)       1,729         115
  Danish (DA)          1,685         646
  Hebrew (HE)          1,647         160
  Serbian (SR)           858       1,124
  Hindi (HI)             831         466
  Urdu (UR)              233          78
  Sanskrit (SA)           33         276
  Oriya (OR)               6         160
  Total               86,361      11,130

Table 1: The 17 shared languages of BabelNet (BN) and IMAGACT (IM) with verbal lemma counts.

The basic features that we used for this experiment are:

  • ns: the number of verbs connected to the Scene;

  • nb: the number of verbs connected to the BS;

  • nsb: the number of verbs shared between the Scene and the BS.

These 3 features were calculated for each candidate BS and for the ones which are semantically related to it. We took the 8 BabelNet semantic relations available for verbs (see Table 2) and for each BS we extracted 8 groups of related synsets, each one containing the set of BSs connected to the main one by the same relation. Then, ns, nb and nsb are calculated for each group by summing the values of the BSs belonging to it.

[6] The manually annotated training set is published at http://bit.ly/29J0ypx
[7] This test is based on BabelNet 3.6; the data was extracted using the Java API (Navigli and Ponzetto, 2012b).
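The feature extraction and the baseline described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the verb sets and relation-group contents are toy assumptions, whereas in the experiment they are extracted from IMAGACT and from BabelNet 3.6 via its Java API.

```python
# Sketch of Section 4.2's lexical features and baseline (toy data, not the
# authors' implementation; real verb inventories come from IMAGACT/BabelNet).

RELATIONS = ["hyponym", "hypernym", "also_see", "verb_group",
             "gloss_related", "entailment", "antonym", "cause"]

def lexical_features(scene_verbs, bs_verbs):
    """ns, nb, nsb counts for one <BS, Scene> candidate pair."""
    ns = len(scene_verbs)              # verbs connected to the Scene
    nb = len(bs_verbs)                 # verbs connected to the BS
    nsb = len(scene_verbs & bs_verbs)  # verbs shared by Scene and BS
    return (ns, nb, nsb)

def group_features(scene_verbs, group):
    """Sum ns, nb, nsb over all the BSs reached through one relation."""
    triples = [lexical_features(scene_verbs, vs) for vs in group]
    return tuple(map(sum, zip(*triples))) if triples else (0, 0, 0)

def feature_vector(scene_verbs, bs_verbs, related):
    """27-dim vector: 3 counts for the main BS + 3 for each of 8 relations."""
    vec = list(lexical_features(scene_verbs, bs_verbs))
    for rel in RELATIONS:
        vec.extend(group_features(scene_verbs, related.get(rel, [])))
    return vec

def baseline_link(scene_verbs, bs_verbs, threshold=0.04):
    """Baseline: link the pair when nsb / (nb + ns) exceeds the threshold."""
    ns, nb, nsb = lexical_features(scene_verbs, bs_verbs)
    return nsb / (nb + ns) > threshold if (nb + ns) else False

# Toy example: a 'put' scene against a related and an unrelated synset.
scene = {"put", "place", "lay", "position"}
bs_put = {"put", "place", "set", "pose", "position", "lay"}
bs_run = {"run", "sprint", "jog"}

print(baseline_link(scene, bs_put))   # shared verbs -> linked
print(baseline_link(scene, bs_run))   # no shared verbs -> not linked
print(len(feature_vector(scene, bs_put, {"hypernym": [{"move", "displace"}]})))
```

In the experiment, the resulting 27-dimensional vectors feed the SVM classifier with an RBF kernel described below; scikit-learn's `sklearn.svm.SVC(kernel='rbf')` is one plausible tool for that step, named here only as an assumption about tooling.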
The feature set comprises 27 features: 3 features for the main BS and 3 features for each BabelNet relation. The set of candidates consists of all the possible BSs for each verb connected to the scene. A machine learning algorithm was trained on the annotated dataset: we used a Support Vector Machine (SVM) classifier with an RBF kernel.

Table 2 shows the list of relations between the verbal BSs ranked by their relevance for this task; relevance is measured with Information Gain on the annotated dataset.

  BabelNet relations    IG value
  Hyponym                  0.057
  Hypernym                 0.026
  Also See                 0.019
  Verb Group               0.019
  Gloss Related            0.009
  Entailment               0.003
  Antonym                  0.000
  Cause                    0.000

Table 2: Relations between verbal BSs.

4.3 Results

The algorithm was trained on the training set and evaluated on the test set; the results are reported in Table 3.

        Baseline       ML Algorithm
        th = 0.04      27 features
  Pr      0.580           0.833
  Re      0.529           0.441
  Fm      0.553           0.577

Table 3: Precision, Recall and F-measure of the BS-to-scene linking task, calculated on the test set for the algorithm and the baseline.

The results in terms of F-measure are not entirely satisfying, and the value obtained with the algorithm is only barely better than the baseline. Despite this, it is important to consider the difference with the baseline in terms of precision and recall, since precision matters more for this task: for this reason, the algorithm provides a much more reliable result than the baseline. We must also point out that the low recall is mainly caused by the multiple possibilities in interpreting a scene from different points of view. For example, the scene linked to the English verb to throw, described by the sentence John throws the ball to Mark, can represent not only a sense of throw, but also senses of other verbs, like to play or to catch, which refer to different semantic concepts; in these cases, the scene in IMAGACT is not linked to the alternative verbs, but it can be described with them (i.e. John and Mark play with the ball, Mark catches the ball). For this reason, the manual annotation provides more BS-to-scene relations than an algorithm can foresee on the basis of a lexical match, causing a low recall value.

Table 4 reports some statistics about the linking process; the complete results are browsable at http://bit.ly/2a4FefT.

  IM Scenes linked to BSs                     773
  BSs linked to Scenes                        517
  IM English Verbs related to Scenes          544
  BabelNet English Verbs related to BSs     1,100

Table 4: IMAGACT-BabelNet linking numbers.

Switching to Machine Learning had a strong impact on this linking task. The main advantage over the previous linking experiment (Gregori et al., 2015) is that the number of BSs that can be assigned to each scene is now variable, depending on the different reference possibilities that the BSs have. This is coherent with the BabelNet structure, where we find both very general concepts, which can be represented by several action prototypes, and specific ones, for which one prototype is enough to provide a clear representation.

For example, the BS "bn:00090224v" (Put into a certain place or abstract location) expresses a general concept and is linked to 72 scenes, comprising actions involving one or more objects or a body part and relating to different ways of putting (like inserting, throwing, attaching, ...) or to different states of the Theme (e.g. solid or liquid). Conversely, the BS "bn:00084326v" (Fasten with buttons) is much more specific and is linked to only one scene (c17d7346), which represents a man fastening his jacket.

5 Conclusions

The experiment described in this paper shows that it is possible to obtain an extensive linking between IMAGACT and BabelNet through visual entities (see Figure 1 for a visual representation of a linking example); this can be advantageous for both resources. BabelNet can add a clear video representation of the verbal synsets that refer to
                      Figure 1: IMAGACT scene to BabelNet synset linking example


actions; IMAGACT can import verb translation candidates from many languages by exploiting the BabelNet semantic network; and their integration can be exploited as a unified multimedia resource to accomplish complex tasks that combine natural language processing and computer vision.

Finally, we feel it is important to note that this procedure is scalable and the statistical model can be retrained whenever the resources change. This is a fundamental feature, especially considering the continuous updating of the IMAGACT inventory of languages and lemmas.

Acknowledgments

This research has been supported by the MODELACT Project, funded by the Futuro in Ricerca 2012 programme (Project Code RBFR12C608); http://modelact.lablita.it.

References

[Bartolini et al.2014] Roberto Bartolini, Valeria Quochi, Irene De Felice, Irene Russo, and Monica Monachini. 2014. From synsets to videos: Enriching ItalWordNet multimodally. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, May. European Language Resources Association (ELRA).

[Castano et al.2008] S. Castano, A. Ferrara, D. Lorusso, and S. Montanelli. 2008. On the ontology instance matching problem. In Database and Expert Systems Application, 2008. DEXA '08. 19th International Workshop on, pages 180–184, Sept.

[De Felice et al.2014] Irene De Felice, Roberto Bartolini, Irene Russo, Valeria Quochi, and Monica Monachini. 2014. Evaluating ImagAct-WordNet mapping for English and Italian through videos. In Roberto Basili, Alessandro Lenci, and Bernardo Magnini, editors, Proceedings of the First Italian Conference on Computational Linguistics CLiC-it 2014 & the Fourth International Workshop EVALITA 2014, volume I, pages 128–131. Pisa University Press.

[Gregori et al.2015] Lorenzo Gregori, Andrea Amelio Ravelli, and Alessandro Panunzi. 2015. Linking dei contenuti multimediali tra ontologie multilingui: i verbi di azione tra IMAGACT e BabelNet. In C. Bosco, F.M. Zanzotto, and S. Tonelli, editors, Proceedings of the Second Italian Conference on Computational Linguistics CLiC-it 2015, pages 150–154. Accademia University Press.

[Mathe et al.2008] S. Mathe, A. Fazly, S. Dickinson, and S. Stevenson. 2008. Learning the abstract motion semantics of verbs from captioned videos. In Computer Vision and Pattern Recognition Workshops, 2008. CVPRW '08. IEEE Computer Society Conference on, pages 1–8, June.

[Moneglia et al.2012] Massimo Moneglia, Francesca Frontini, Gloria Gagliardi, Irene Russo, Alessandro Panunzi, and Monica Monachini. 2012. Imagact: deriving an action ontology from spoken corpora. Proceedings of the Eighth Joint ACL-ISO Workshop on Interoperable Semantic Annotation (isa-8), pages 42–47.

[Moneglia et al.2014a] Massimo Moneglia, Susan Brown, Francesca Frontini, Gloria Gagliardi, Fahad Khan, Monica Monachini, and Alessandro Panunzi. 2014a. The IMAGACT Visual Ontology. An Extendable Multilingual Infrastructure for the Representation of Lexical Encoding of Action. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, May. European Language Resources Association (ELRA).

[Moneglia et al.2014b] Massimo Moneglia, Susan Brown, Aniruddha Kar, Anand Kumar, Atul Kumar Ojha, Heliana Mello, Niharika, Girish Nath Jha, Bhaskar Ray, and Annu Sharma. 2014b. Mapping Indian Languages onto the IMAGACT Visual Ontology of Action. In Girish Nath Jha, Kalika Bali, Sobha L, and Esha Banerjee, editors, Proceedings of WILDRE2 - 2nd Workshop on Indian Language Data: Resources and Evaluation at LREC'14, Reykjavik, Iceland, May. European Language Resources Association (ELRA).

[Moneglia2014] Massimo Moneglia. 2014. Natural Language Ontology of Action: A Gap with Huge Consequences for Natural Language Understanding and Machine Translation. In Zygmunt Vetulani and Joseph Mariani, editors, Human Language Technology Challenges for Computer Science and Linguistics, volume 8387 of Lecture Notes in Computer Science, pages 379–395. Springer International Publishing.

[Moro and Navigli2015] Andrea Moro and Roberto Navigli. 2015. SemEval-2015 Task 13: Multilingual All-Words Sense Disambiguation and Entity Linking. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 288–297, Denver, Colorado, June. Association for Computational Linguistics.

[Nath et al.2014] Rudra Nath, Hanif Seddiqui, and Masaki Aono. 2014. An efficient and scalable approach for ontology instance matching. Journal of Computers, 9(8).

[Navigli and Ponzetto2012a] Roberto Navigli and Simone Paolo Ponzetto. 2012a. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.

[Navigli and Ponzetto2012b] Roberto Navigli and Simone Paolo Ponzetto. 2012b. Multilingual WSD with just a few lines of code: the BabelNet API. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), Jeju, Korea.

[Otero-Cerdeira et al.2015] Lorena Otero-Cerdeira, Francisco J. Rodríguez-Martínez, and Alma Gómez-Rodríguez. 2015. Ontology matching: A literature review. Expert Systems with Applications, 42(2):949–971.

[Panunzi et al.2014] Alessandro Panunzi, Irene De Felice, Lorenzo Gregori, Stefano Jacoviello, Monica Monachini, Massimo Moneglia, Valeria Quochi, and Irene Russo. 2014. Translating Action Verbs using a Dictionary of Images: the IMAGACT Ontology. In XVI EURALEX International Congress: The User in Focus, pages 1163–1170, Bolzano / Bozen, 7/2014. EURALEX 2014.

[Regneri et al.2013] Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal. 2013. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics, 1:25–36.

[Russo et al.2013] Irene Russo, Francesca Frontini, Irene De Felice, Fahad Khan, and Monica Monachini. 2013. Disambiguation of Basic Action Types through Nouns' Telic Qualia. In Roser Saurí, Nicoletta Calzolari, Chu-Ren Huang, Alessandro Lenci, Monica Monachini, and James Pustejovsky, editors, Proceedings of the 6th International Conference on Generative Approaches to the Lexicon. Generative Lexicon and Distributional Semantics, pages 70–75.

[Siemoneit et al.2015] Benjamin Siemoneit, John Philip McCrae, and Philipp Cimiano. 2015. Linking four heterogeneous language resources as linked data. In Proceedings of the 4th Workshop on Linked Data in Linguistics: Resources and Applications, pages 59–63, Beijing, China, July. Association for Computational Linguistics.