=Paper=
{{Paper
|id=Vol-1749/paper28
|storemode=property
|title=Linking IMAGACT Ontology to BabelNet through Action Videos
|pdfUrl=https://ceur-ws.org/Vol-1749/paper28.pdf
|volume=Vol-1749
|authors=Lorenzo Gregori,Alessandro Panunzi,Andrea Amelio Ravelli
|dblpUrl=https://dblp.org/rec/conf/clic-it/GregoriPR16
}}
==Linking IMAGACT Ontology to BabelNet through Action Videos==
Lorenzo Gregori, Alessandro Panunzi, Andrea Amelio Ravelli
University of Florence
lorenzo.gregori@unifi.it, alessandro.panunzi@unifi.it, aramelior@gmail.com
Abstract

English. We present a study on the linking of two multilingual and multimedia resources, BabelNet and IMAGACT, which seeks to connect the videos contained in the IMAGACT Ontology of Action with the related action concepts in BabelNet. The linking is based on a machine learning algorithm that exploits the lexical information of the two resources. The algorithm was first trained and tested on a manually annotated dataset and then run on the full data, connecting 773 IMAGACT action videos with 517 BabelNet synsets. This linkage aims to enrich BabelNet verbal entries with visual representations and to connect the IMAGACT ontology to the huge BabelNet semantic network.

Italiano (translated into English). In this paper we present a study on the linking of two multilingual and multimedia language resources, BabelNet and IMAGACT. The experiment aims to connect the videos of the IMAGACT action ontology with the action concepts contained in BabelNet. The link is realized through a machine learning algorithm that exploits the lexical information of the two resources. The algorithm was trained and evaluated on a manually annotated dataset and then run on the whole dataset, linking 773 IMAGACT videos with 517 BabelNet synsets. This linking aims to enrich the verbal entries of BabelNet with a visual representation and to connect IMAGACT to the BabelNet semantic network.

1 Introduction[1]

Ontologies are widely used to represent language resources on the web, allowing them to be easily accessed and exploited by machines. For this reason, data interconnection between different semantic resources is a crucial task for enhancing disambiguation and information retrieval capabilities in Artificial Intelligence, as evidenced by the increasing research into mapping and linking techniques among ontologies (Otero-Cerdeira et al., 2015). Nevertheless, ontology mapping has to face the problem of concept representation mismatch between resources, due to different building criteria and purposes (Siemoneit et al., 2015). Instance matching techniques play an important role in this context, making it possible to connect entities from heterogeneous data resources which refer to the same real-world object (Castano et al., 2008; Nath et al., 2014).

Aside from the general interest in knowledge base interconnection from a web-based perspective, there is also a growing interest in multimodal resources, which combine textual and visual data. These resources can be exploited by intelligent algorithms integrating vision and natural language processing techniques[2]. This integrated approach has been successfully applied to some challenging tasks involving verbs and their action reference as a video. Regneri et al. (2013) developed machine learning models for the automatic identification of similarity among actions, using a corpus of natural language descriptions derived from the videos of the MPII Cooking Composite Activities dataset, which represents actions involved in basic cooking tasks.

[1] Lorenzo Gregori developed the linking algorithm and wrote sections 3, 4, and 5; Andrea Amelio Ravelli performed the data annotation and wrote sections 1 and 2; Alessandro Panunzi supervised the research work and revised the paper.
[2] Several works in this field have been developed within the European Network on Integrating Vision and Language (iV&L Net), http://ivl-net.eu/
Instead, the algorithm developed by Mathe et al. (2008) extracts higher-level semantic features shared among a sample set of verbs, using a fine-grained analysis of the represented action concepts, intended as a subsequently stable set of abstract features of the objects involved in the videos. Within this interdisciplinary perspective, a knowledge base which relates verbal lemmas in different languages to video prototypes can help in several applications, and can be exploited by both humans and machines.

2 Resources

This paper presents a linking between BabelNet (Navigli and Ponzetto, 2012a) and IMAGACT (Moneglia et al., 2014a), two multilingual and multimedia resources suitable for automatic translation and disambiguation tasks (Russo et al., 2013; Moneglia, 2014; Moro and Navigli, 2015).

2.1 BabelNet

BabelNet[3] is a multilingual semantic network created by mapping together the WordNet thesaurus and the Wikipedia encyclopedia. At present, BabelNet 3.7 contains 271 languages and is the widest multilingual resource available for semantic disambiguation. Concepts and entities are represented by BabelSynsets (BSs), extensions of WordNet synsets: a BS is a unitary concept identified by several kinds of information (semantic features, glosses, usage examples, etc.) and related to lemmas (in any language) which have a sense matching that concept. BSs are not isolated, but connected together by semantic relations. Moreover, BabelNet received large contributions from its mapping with other resources such as ImageNet, GeoNames and OmegaWiki (along with many others), which extended its information beyond the lexicon and produced a wide-ranging, multimedia knowledge base.

[3] http://babelnet.org

2.2 IMAGACT

IMAGACT[4] is a visual ontology of action that provides a video-based translation and disambiguation framework for general verbs. The database evolves continuously (Moneglia et al., 2014b) and at present contains 9 fully-mapped languages, with 13 more underway. The resource is built on an ontology containing a fine-grained categorization of action concepts, each represented by one or more video prototypes as recorded scenes and 3D animations. IMAGACT currently contains 1,010 scenes which encompass the action concepts most commonly referred to in everyday language use[5]. The links between verbs and video scenes are based on the co-referentiality of different verbs with respect to the action expressed by a scene (i.e. different verbs can describe the same action, visualised in the scene). The visual representations convey the action information in a cross-linguistic environment, and IMAGACT may thus be exploited for reference disambiguation in automatic and assisted translation tasks (Panunzi et al., 2014).

[4] http://www.imagact.it
[5] The data is derived from the annotation of verb occurrences in spontaneous spoken corpora (Moneglia et al., 2012).

3 Related works

Other attempts have previously been made to link IMAGACT with other resources. Two experiments, by De Felice et al. (2014) and by Bartolini et al. (2014), were conducted from an intra-linguistic perspective: their aim was to evaluate the results of a mapping between the action concepts defined in ItalWordNet and the ones categorized by IMAGACT (in terms of perfect matches or hypernym/hyponym relations).

In contrast, the objective of our work is to obtain a light link between the resources by enriching the action concepts in BabelNet with a visual representation; in this way, we bypass the problem of finding a match between the generic semantic concepts in BabelNet and the specific pragmatic concepts in IMAGACT. This methodology is also reinforced by the multilingual frame in which the experiment is conducted. As a matter of fact, the relation between words and concepts can differ deeply across languages, while the prototypical scenes ensure a language-independent modality which is able to keep together the different lexicalizations of the action space.

This work is a further step from a previous IMAGACT-BabelNet linking experiment (Gregori et al., 2015). Even though it was just a feasibility test to check the consistency of the linking, we reported good results in the automatic assignment of IMAGACT prototypical scenes to BabelNet synsets. For this reason, we built a bigger dataset and moved from a metric-based to a Machine Learning algorithm to be run on the whole set of IMAGACT scenes.
4 Linking experiment

The aim of this experiment is to link the IMAGACT video scenes to the BabelNet interlinguistic concepts (BabelSynsets). The BabelNet objects are already enriched with visual material, but this information consists of static images, which are inadequate for representing action concepts. Adding video scenes to the verbs is therefore very desirable and would suggest itself as a natural extension of BabelNet.

4.1 Training and test set

A manually annotated dataset of 50 scenes and 57 BabelSynsets (2,850 judgments) was created in order to test the algorithm and evaluate the results. The sampling was carried out in two steps. First of all, a purely actional semantic area was selected by taking the BSs and scenes linked to 7 English action verbs which are general and very frequent in language use: put, move, take, insert, press, give and strike. The wide variation of these verbs allowed us to obtain a large set of concepts, with high variation in terms of frequency and generality. On this set, a second sampling was performed by preserving the variability in terms of the number of connected verbs, which is a measurable parameter in both resources.

Each <BS, Scene> pair was evaluated to check whether the scene is appropriate for representing the BS. Three annotators compiled the binary judgment table and we kept the values shared by at least 2 of the 3. The measured Fleiss' kappa inter-rater agreement for this task was 0.74[6].

[6] The manually annotated training set is published at http://bit.ly/29J0ypx

Finally, the dataset was split into a training set and a test set, with proportions of 80% and 20% respectively (10 randomly chosen scenes for the test set and the remaining 40 scenes for the training set).

4.2 Algorithm

For this task, we developed a new algorithm which uses Machine Learning techniques, exploiting the training set. As in the previous experiment, the features are extracted from the lexical items belonging to both the candidate BabelSynset and its neighbours[7]. Besides the algorithm, a baseline is determined by calculating the ratio nsb / (nb + ns) for each pair and setting a threshold of 0.04, which maximizes the F-measure on our dataset.

[7] This test is based on BabelNet 3.6; the data was extracted using the Java API (Navigli and Ponzetto, 2012b).

Table 1 reports the 17 languages common to both BabelNet and IMAGACT, detailing the number of verbs in each; this constitutes the quantitative data which the matching algorithms can exploit.

Language          BN Verbs   IM Verbs
English (EN)        29,738      1,299
Polish (PL)          9,660      1,193
Chinese (ZH)         9,507      1,171
Italian (IT)         7,184      1,100
Spanish (ES)         6,159        736
Russian (RU)         4,975         34
Portuguese (PT)      4,624        776
Arabic (AR)          3,738        804
German (DE)          3,754        992
Norwegian (NO)       1,729        115
Danish (DA)          1,685        646
Hebrew (HE)          1,647        160
Serbian (SR)           858      1,124
Hindi (HI)             831        466
Urdu (UR)              233         78
Sanskrit (SA)           33        276
Oriya (OR)               6        160
Total               86,361     11,130

Table 1: The 17 shared languages of BabelNet (BN) and IMAGACT (IM) with verbal lemma counts.

The basic features that we used for this experiment are:

• ns: the number of verbs connected to the Scene;
• nb: the number of verbs connected to the BS;
• nsb: the number of verbs that are shared between the Scene and the BS.

These 3 features have been calculated for each candidate BS and for the ones which are semantically related to it. We took the 8 BabelNet semantic relations available for verbs (see Table 2) and for each BS we extracted 8 groups of related synsets, each one containing the set of BSs connected to the main one by the same relation. Then, ns, nb and nsb are calculated for each group by summing the values of the BSs belonging to it.
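As a concrete illustration, the overlap features and the baseline decision described above can be sketched as follows. This is a minimal sketch, not the authors' code: verb inventories are modeled here as plain sets of hypothetical (language, lemma) pairs.

```python
# Illustrative sketch (not the authors' implementation) of the lexical
# overlap features ns, nb, nsb and the baseline nsb / (nb + ns) > 0.04
# decision for one <BS, Scene> pair. Verb inventories are plain sets.

def overlap_features(scene_verbs: set, bs_verbs: set) -> dict:
    """ns, nb, nsb as defined in the paper."""
    shared = scene_verbs & bs_verbs
    return {
        "ns": len(scene_verbs),   # verbs connected to the Scene
        "nb": len(bs_verbs),      # verbs connected to the BabelSynset
        "nsb": len(shared),       # verbs shared by Scene and BS
    }

def baseline_match(scene_verbs: set, bs_verbs: set,
                   threshold: float = 0.04) -> bool:
    """Baseline: link the pair when nsb / (nb + ns) exceeds the threshold."""
    f = overlap_features(scene_verbs, bs_verbs)
    denom = f["nb"] + f["ns"]
    return denom > 0 and f["nsb"] / denom > threshold

# Toy example with hypothetical verb sets:
scene = {("EN", "put"), ("EN", "place"), ("IT", "mettere")}
bs = {("EN", "put"), ("EN", "place"), ("EN", "position"), ("IT", "mettere")}
print(overlap_features(scene, bs))  # {'ns': 3, 'nb': 4, 'nsb': 3}
print(baseline_match(scene, bs))    # True: 3 / (4 + 3) ≈ 0.43 > 0.04
```

In the actual experiment the same three counts would also be computed over the 8 relation groups of each candidate BS, yielding the 27-dimensional feature vector used by the classifier.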
The feature set comprises 27 features: 3 features for the main BS and 3 features for each of the 8 BabelNet relations. The set of candidates consists of all the possible BSs for each verb connected to the scene. A machine learning algorithm was trained on the annotated dataset: we used a Support Vector Machine (SVM) classifier with an RBF kernel.

Table 2 shows the list of relations between the verbal BSs ranked by their relevance for this task; relevance is measured with Information Gain on the annotated dataset.

BabelNet relation   IG value
Hyponym                0.057
Hypernym               0.026
Also See               0.019
Verb Group             0.019
Gloss Related          0.009
Entailment             0.003
Antonym                0.000
Cause                  0.000

Table 2: Relations between verbal BSs ranked by Information Gain.

4.3 Results

The algorithm was trained on the training set and evaluated on the test set; the results are reported in Table 3.

        Baseline       ML Algorithm
        th = 0.04      27 features
Pr      0.580          0.833
Re      0.529          0.441
Fm      0.553          0.577

Table 3: Precision, Recall and F-measure of the BS-to-scene linking task, calculated on the test set for the algorithm and the baseline.

The results in terms of F-measure are not fully satisfying, and the value obtained with the algorithm is only barely better than the baseline. Despite this, it is important to consider the difference from the baseline in terms of precision and recall, since precision matters more for this task: for this reason, the algorithm provides a much more reliable result than the baseline. We also have to point out that the low recall is mainly caused by the multiple possibilities in interpreting a scene from different points of view. For example, the scene linked to the English verb to throw, described by the sentence John throws the ball to Mark, can represent not only a sense of throw but also senses of other verbs, like to play or to catch, that refer to different semantic concepts; in these cases, the scene in IMAGACT is not linked to the alternative verbs, but it can be described with them (i.e. John and Mark play with the ball, Mark catches the ball). For this reason, the manual annotation provides more BS-to-scene relations than an algorithm can foresee on the basis of a lexical match, causing a low recall value.

Table 4 reports some statistics about the linking process; the full results are browsable at http://bit.ly/2a4FefT.

IM Scenes linked to BSs                    773
BSs linked to Scenes                       517
IM English Verbs related to Scenes         544
BabelNet English Verbs related to BSs    1,100

Table 4: IMAGACT-BabelNet linking numbers.

Switching to Machine Learning had a strong impact on this linking task. The main advantage over the previous linking experiment (Gregori et al., 2015) is that the number of BSs that can be assigned to each scene is now variable, depending on the different reference possibilities that the BSs have. This is coherent with the BabelNet structure, where we find very general concepts, which can be represented by several action prototypes, and specific ones, for which one prototype is enough to provide a clear representation.

For example, the BS "bn:00090224v" (Put into a certain place or abstract location) expresses a general concept and is linked to 72 scenes, comprising actions involving one or more objects or a body part and relating to different ways of putting (like inserting, throwing, attaching, ...) or to different states of the Theme (e.g. solid or liquid). Conversely, the BS "bn:00084326v" (Fasten with buttons) is much more specific and is linked to only one scene (c17d7346), which represents a man fastening his jacket.
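The training and evaluation pipeline of sections 4.2 and 4.3 can be sketched as below. This is a hedged sketch, assuming scikit-learn; the feature matrices are random placeholders standing in for the 27 lexical-overlap features, not the paper's data, so the printed scores are illustrative only.

```python
# Hedged sketch of the classification step: an SVM with an RBF kernel
# trained on 27-dimensional feature vectors (ns, nb, nsb for the main BS
# plus the 8 relation groups), evaluated with precision/recall/F-measure.
# The data below is random placeholder data, not the annotated dataset.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score, f1_score

rng = np.random.default_rng(0)
X_train = rng.integers(0, 20, size=(200, 27))  # 27 overlap features
y_train = rng.integers(0, 2, size=200)         # binary <BS, Scene> judgments
X_test = rng.integers(0, 20, size=(50, 27))
y_test = rng.integers(0, 2, size=50)

clf = SVC(kernel="rbf")  # SVM classifier with RBF kernel, as in the paper
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

print("Pr", precision_score(y_test, pred, zero_division=0))
print("Re", recall_score(y_test, pred, zero_division=0))
print("Fm", f1_score(y_test, pred, zero_division=0))
```

On the real data, a pipeline of this shape would reproduce the Table 3 evaluation: one prediction per candidate <BS, Scene> pair on the 10 held-out test scenes.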
Figure 1: IMAGACT scene to BabelNet synset linking example
5 Conclusions

The experiment described in this paper shows that it is possible to obtain an extensive linking between IMAGACT and BabelNet through visual entities (see Figure 1 for a visual representation of a linking example); this can be advantageous for both resources. BabelNet can add a clear video representation of the verbal synsets that refer to actions; IMAGACT can import verb translation candidates from many languages by exploiting the BabelNet semantic network; their integration can be exploited as a unified multimedia resource to accomplish complex tasks that combine natural language processing and computer vision.

Finally, it is important to note that this procedure is scalable and that the statistical model can be retrained when the resources change. This is a fundamental feature, especially considering the continuous updates to the IMAGACT language and lemma inventory.

Acknowledgments

This research has been supported by the MODELACT Project, funded by the Futuro in Ricerca 2012 programme (Project Code RBFR12C608); http://modelact.lablita.it.

References

[Bartolini et al.2014] Roberto Bartolini, Valeria Quochi, Irene De Felice, Irene Russo, and Monica Monachini. 2014. From synsets to videos: Enriching ItalWordNet multimodally. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, May. European Language Resources Association (ELRA).

[Castano et al.2008] S. Castano, A. Ferrara, D. Lorusso, and S. Montanelli. 2008. On the ontology instance matching problem. In Database and Expert Systems Application, 2008. DEXA '08. 19th International Workshop on, pages 180–184, September.

[De Felice et al.2014] Irene De Felice, Roberto Bartolini, Irene Russo, Valeria Quochi, and Monica Monachini. 2014. Evaluating ImagAct-WordNet mapping for English and Italian through videos. In Roberto Basili, Alessandro Lenci, and Bernardo Magnini, editors, Proceedings of the First Italian Conference on Computational Linguistics CLiC-it 2014 & the Fourth International Workshop EVALITA 2014, volume I, pages 128–131. Pisa University Press.

[Gregori et al.2015] Lorenzo Gregori, Andrea Amelio Ravelli, and Alessandro Panunzi. 2015. Linking dei contenuti multimediali tra ontologie multilingui: i verbi di azione tra IMAGACT e BabelNet. In C. Bosco, F.M. Zanzotto, and S. Tonelli, editors, Proceedings of the Second Italian Conference on Computational Linguistics CLiC-it 2015, pages 150–154. Accademia University Press.

[Mathe et al.2008] S. Mathe, A. Fazly, S. Dickinson, and S. Stevenson. 2008. Learning the abstract motion semantics of verbs from captioned videos. In Computer Vision and Pattern Recognition Workshops, 2008. CVPRW '08. IEEE Computer Society Conference on, pages 1–8, June.

[Moneglia et al.2012] Massimo Moneglia, Francesca Frontini, Gloria Gagliardi, Irene Russo, Alessandro Panunzi, and Monica Monachini. 2012. IMAGACT: Deriving an action ontology from spoken corpora. In Proceedings of the Eighth Joint ACL-ISO Workshop on Interoperable Semantic Annotation (ISA-8), pages 42–47.

[Moneglia et al.2014a] Massimo Moneglia, Susan Brown, Francesca Frontini, Gloria Gagliardi, Fahad Khan, Monica Monachini, and Alessandro Panunzi. 2014a. The IMAGACT Visual Ontology. An Extendable Multilingual Infrastructure for the Representation of Lexical Encoding of Action. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, May. European Language Resources Association (ELRA).

[Moneglia et al.2014b] Massimo Moneglia, Susan Brown, Aniruddha Kar, Anand Kumar, Atul Kumar Ojha, Heliana Mello, Niharika, Girish Nath Jha, Bhaskar Ray, and Annu Sharma. 2014b. Mapping Indian Languages onto the IMAGACT Visual Ontology of Action. In Girish Nath Jha, Kalika Bali, Sobha L, and Esha Banerjee, editors, Proceedings of WILDRE2 - 2nd Workshop on Indian Language Data: Resources and Evaluation at LREC'14, Reykjavik, Iceland, May. European Language Resources Association (ELRA).

[Moneglia2014] Massimo Moneglia. 2014. Natural Language Ontology of Action: A Gap with Huge Consequences for Natural Language Understanding and Machine Translation. In Zygmunt Vetulani and Joseph Mariani, editors, Human Language Technology Challenges for Computer Science and Linguistics, volume 8387 of Lecture Notes in Computer Science, pages 379–395. Springer International Publishing.

[Moro and Navigli2015] Andrea Moro and Roberto Navigli. 2015. SemEval-2015 Task 13: Multilingual All-Words Sense Disambiguation and Entity Linking. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 288–297, Denver, Colorado, June. Association for Computational Linguistics.

[Nath et al.2014] Rudra Nath, Hanif Seddiqui, and Masaki Aono. 2014. An efficient and scalable approach for ontology instance matching. Journal of Computers, 9(8).

[Navigli and Ponzetto2012a] Roberto Navigli and Simone Paolo Ponzetto. 2012a. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.

[Navigli and Ponzetto2012b] Roberto Navigli and Simone Paolo Ponzetto. 2012b. Multilingual WSD with just a few lines of code: the BabelNet API. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), Jeju, Korea.

[Otero-Cerdeira et al.2015] Lorena Otero-Cerdeira, Francisco J. Rodríguez-Martínez, and Alma Gómez-Rodríguez. 2015. Ontology matching: A literature review. Expert Systems with Applications, 42(2):949–971.

[Panunzi et al.2014] Alessandro Panunzi, Irene De Felice, Lorenzo Gregori, Stefano Jacoviello, Monica Monachini, Massimo Moneglia, Valeria Quochi, and Irene Russo. 2014. Translating Action Verbs using a Dictionary of Images: the IMAGACT Ontology. In XVI EURALEX International Congress: The User in Focus, pages 1163–1170, Bolzano/Bozen, July 2014. EURALEX.

[Regneri et al.2013] Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal. 2013. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics, 1:25–36.

[Russo et al.2013] Irene Russo, Francesca Frontini, Irene De Felice, Fahad Khan, and Monica Monachini. 2013. Disambiguation of Basic Action Types through Nouns' Telic Qualia. In Roser Saurí, Nicoletta Calzolari, Chu-Ren Huang, Alessandro Lenci, Monica Monachini, and James Pustejovsky, editors, Proceedings of the 6th International Conference on Generative Approaches to the Lexicon. Generative Lexicon and Distributional Semantics, pages 70–75.

[Siemoneit et al.2015] Benjamin Siemoneit, John Philip McCrae, and Philipp Cimiano. 2015. Linking four heterogeneous language resources as linked data. In Proceedings of the 4th Workshop on Linked Data in Linguistics: Resources and Applications, pages 59–63, Beijing, China, July. Association for Computational Linguistics.