Enriching Open Multilingual Wordnets with Morphological Features∗

Stefania Racioppa
German Research Center for Artificial Intelligence
Saarbrücken, Germany
stefania.racioppa@dfki.de

Thierry Declerck
German Research Center for Artificial Intelligence
Saarbrücken, Germany
& Austrian Centre for Digital Humanities, Austrian Academy of Sciences
Vienna, Austria
thierry.declerck@dfki.de

∗ Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

English. In this article, we describe our work on porting Open Multilingual Wordnet resources into the OntoLex-Lemon model, in order to establish an interlinking with corresponding morphological resources, such as the MMorph resource set. For this purpose, the morphological resources were also ported onto OntoLex-Lemon. We show how the “lemmas” contained in the Wordnet resources can be enriched with morphological features using the lexical representation and linking features of OntoLex-Lemon, which support, among others, the formulation of restrictions in the usage of such expressions. Our work will result in an improved lexical resource combining Wordnet senses and full morphological descriptions in a single ontological framework, as specified in the OntoLex-Lemon model.

1 Introduction

WordNets are well-established lexical resources with a wide range of applications. For more than twenty years they have been elaborately set up and maintained by hand, especially the original Princeton WordNet of English (PWN) (Fellbaum, 1998). In recent years, there have been increasing activities in which open WordNets for different languages have been automatically extracted from various resources and enriched with lexical semantics information, building the so-called Open Multilingual Wordnet (OMW) (Bond and Paik, 2012). These WordNets were linked to PWN via shared synset IDs (Bond and Foster, 2013; Bond et al., 2016). The resources in OMW differ in coverage and do not always contain the same amount of information: many of them, unlike the PWN resource, lack definitions (or “glosses”) or example sentences.

The work described in the present article is an extension of previous experiments done with English (Gromann and Declerck, 2019) and more recently with a German lexical semantics resource, as we wanted to consider languages with a complex morphology.¹ In the present article we focus on Romance languages, especially Italian.

Our current work deals primarily with the morphological enrichment of OMW resources for Italian, i.e. “ItalWordNet”.² The first morphological resource we took into consideration for this purpose is an updated version of the MMorph morphological analyser (Petitpierre and Russell, 1995).

As a means of representation we chose OntoLex-Lemon (Cimiano et al., 2016)³, as this model has proven able to represent both classical lexicographic descriptions (McCrae et al., 2017) and lexical semantics networks like WordNet (McCrae et al., 2014).

OntoLex-Lemon is a further development of the “Lexicon Model for Ontologies” (lemon) (McCrae et al., 2012). Following the Guidelines⁴ for mapping Global WordNet formats onto lemon-based RDF⁵, some WordNets have already been mapped onto the former lemon model (McCrae et al., 2014). Our present goal is thus to integrate conceptual descriptions, lemmas and morphological descriptions in the extended ontological framework specified by the OntoLex-Lemon model.⁶

In the next sections, we give some background information on OMW and MMorph. We continue with a section on OntoLex-Lemon, followed by sections that describe how OntoLex-Lemon supports the linking of lemmas in the OMW resources to full morphological descriptions. In this way, morphological descriptions can be associated with the conceptual entries of WordNet.

¹ This work will be published soon in the proceedings of the Global Wordnet Conference 2019.
² See (Pianta et al., 2002; Toral et al., 2010). But we also made similar experiments with French and Spanish.
³ See also https://www.w3.org/2016/05/ontolex/ for more details.
⁴ See https://globalwordnet.github.io/schemas/##rdf.
⁵ RDF stands for “Resource Description Framework”. See https://www.w3.org/RDF/ for more details.
⁶ OntoLex-Lemon is indeed representing an ontology of lexical elements.

2 Open Multilingual WordNet

OMW is an initiative that brings together Wordnets in different languages, linking them to the original Princeton WordNet (PWN). As stated on the web page of OMW, those Wordnets were of different quality, and some of them were in fact extracted from different types of language resources. We are dealing with three OMW WordNet resources.⁷ OMW harmonized these resources and published them in a uniform format, which is displayed just below, showing here a few examples from the Italian resource:

    08388207-n  ita:lemma  nobiltà
    08388207-n  ita:lemma  aristocrazia
    08388207-n  ita:lemma  patriziato
    08388207-n  ita:def 0  l'insieme degli aristocratici
    08388207-n  ita:def 1  l'insieme dei nobili
    ...
    14842992-n  ita:lemma  terra
    14842992-n  ita:lemma  terreno
    14842992-n  ita:lemma  suolo
    14842992-n  ita:def 0  parte superficiale della crosta terrestre sulla quale si sta o si cammina
    14842992-n  ita:exe 0  si piegò con fatica per raccogliere da terra i sacchetti, pronta a salire sull'autobus
    14842992-n  ita:exe 1  l'uomo cominciò a rotolarsi per terra in preda a dolori lancinanti

As the reader can see in the two examples above, OMW resources deliver information on the synset number, together with the part-of-speech of the associated lemma. In some cases, definitions (marked with ita:def) are provided, as well as examples (marked with ita:exe).

This format is used for all languages of the OMW corpus. This eases its mapping to a formal representation supporting the interoperability and interlinking of language resources, such as the OntoLex-Lemon model (see Section 4).

⁷ French, Spanish and Italian, with a focus on the latter. See http://compling.hss.ntu.edu.sg/omw/ for downloading the resources. For more details see also (Bond and Paik, 2012).

3 MMorph

MMorph was originally developed by ISSCO at the University of Geneva in the former MULTEXT project⁸. For our purposes, we used the extended MMorph version developed at DFKI LT Lab (MMorph3). This version includes huge lexical resources for English, French, German, Italian and Spanish. Very generally, the tool relates a word to a morphosyntactic description (MSD) containing freely definable attributes and values. The MMorph lexicon which is used to realize such MSDs consists of a set of lexical entries and structural rules.⁹ For example, the following rule creates in Italian a noun plural by concatenating the noun stem and the gender-specific suffixes:

Listing 1: Rule for noun plural generation in Italian. Note how the rule ensures that the gender doesn’t change in the plural form.

    N.ms: "o" NSuffix[num=sing gen=masc type=oa]
    N.mp: "i" NSuffix[num=plur gen=masc type=oa]
    N.fs: "a" NSuffix[num=sing gen=fem type=oa]
    N.fp: "e" NSuffix[num=plur gen=fem type=oa]

    FlexN: Noun[gen=$1 num=$2 form=surf]
        <- Noun[gen=$1 num=sing form=stem type=$T]
           NSuffix[gen=$1 num=$2 type=$T]

This rule will apply only to the lexical entries (feminine and/or masculine nouns) matching the defined features, e.g.

    Noun[gen=masc num=sing form=stem type=oa]
        "patriziat" = "patriziato"
        "suol" = "suolo"

The morphology is completed by a set of spelling rules to catch the orthographic peculiarities of a specific language (e.g. fung + i = funghi in Italian).

The MMorph lexica can be dumped to full form lists for use in further programs, as can be seen in the following examples:

    "nobiltà" = "nobiltà" Noun[gen=fem num=sing|plur]
    "suoli" = "suolo" Noun[gen=masc num=plur]
    "suolo" = "suolo" Noun[gen=masc num=sing]

The entries above are completed by labelled features for gender and number, but the user can freely define further features, if needed (e.g. clitics for verbal entries or rection of prepositions). Multiple values of a feature are separated by “|”.

Because of their well-structured form, the dumped MMorph lexica are ideally suited for the mapping into the OntoLex-Lemon format.

⁸ See https://www.issco.unige.ch/en/research/projects/MULTEXT.html for more details on the resulting MMorph2.3.4 version.
⁹ See (Petitpierre and Russell, 1995)

4 OntoLex-Lemon

The OntoLex-Lemon model was originally developed with the aim to provide a rich linguistic grounding for ontologies, meaning that the natural language expressions used in the description of ontology elements are equipped with an extensive linguistic description.¹⁰ This rich linguistic grounding includes the representation of morphological and syntactical properties of a lexical entry as well as the syntax-semantics interface, i.e. the meaning of these lexical entries with respect to an ontology or to specialized vocabularies.

The main organizing unit for those linguistic descriptions is the lexical entry, which enables the representation of morphological patterns for each entry (a MWE, a word or an affix). The connection of a lexical entry to an ontological entity is marked mainly by the denotes property or is mediated by instances of the LexicalSense or LexicalConcept classes, as represented in Figure 1, which displays the core module of the model.

Figure 1: The core module of OntoLex-Lemon: Ontology Lexicon Interface. Graphic taken from https://www.w3.org/2016/05/ontolex/.

OntoLex-Lemon is based on and extends the lemon model (McCrae et al., 2012). A major difference is that OntoLex-Lemon includes an explicit way to encode conceptual hierarchies, using the SKOS standard.¹¹ As shown in Figure 1, lexical entries can be linked, via the ontolex:evokes property, to such SKOS concepts, which can represent WordNet synsets. This structure parallels the relation between lexical entries and ontological resources, which is implemented either directly by the ontolex:denotes property or mediated by instances of the ontolex:LexicalSense class.¹² The ontolex:LexicalConcept class seems to be most appropriate to model the “sets of cognitive synonyms (synsets)”¹³ described by Princeton WordNet (PWN), while the ontolex:LexicalSense class is meant to represent the bridge between lexical and ontological entities.

¹⁰ See (McCrae et al., 2012), (Cimiano et al., 2016) and also https://www.w3.org/community/ontolex/wiki/Final_Model_Specification.
¹¹ SKOS stands for “Simple Knowledge Organization System”. SKOS provides “a model for expressing the basic structure and content of concept schemes such as thesauri, classification schemes, subject heading lists, taxonomies, folksonomies, and other similar types of controlled vocabulary” (https://www.w3.org/TR/skos-primer/)
¹² Quoting from Section 3.6 “Lexical Concept” https://www.w3.org/2016/05/ontolex/: “We [...] capture the fact that a certain lexical entry can be used to denote a certain ontological predicate. We capture this by saying that the lexical entry denotes the class or ontology element in question. However, sometimes we would like to express the fact that a certain lexical entry evokes a certain mental concept rather than that it refers to a class with a formal interpretation in some model. Thus, in lemon we introduce the class Lexical Concept that represents a mental abstraction, concept or unit of thought that can be lexicalized by a given collection of senses. A lexical concept is thus a subclass of skos:Concept.”
¹³ Quoted from https://wordnet.princeton.edu/.

5 Mapping the OMW Resources to OntoLex-Lemon

As mentioned above, the format generated by the OMW initiative makes it easy to map the different kinds of information onto more complex representation frameworks. To transform the OMW data into the OntoLex-Lemon representation, a Python script was used. A design decision was to extract only the synset information and to encode the synsets as instances of the LexicalConcept class of OntoLex-Lemon. As some OMW lemmas are present in the MMorph resources, we just link the synsets to those lemmas, which are encoded as instances of the OntoLex-Lemon LexicalEntry class (see next section). We will need to create new instances of the OntoLex-Lemon LexicalEntry class for the OMW lemmas not present in the MMorph resources.

We have now 15553 such LexicalConcept instances for Italian. This is due to the fact that we consider only the subset of ItalWordNet that has been curated by OMW. We also noted that we have fewer instances of LexicalConcept than lines in the original files: a synset index can occur on several lines there, whereas in OntoLex-Lemon each synset index is represented by a single unique URI.

In Listing 2 we show examples of the OntoLex-Lemon encoding of two synsets for Spanish.¹⁴ The lemma associated with both synsets is “cura”. In Section 7, we explain how the synsets are linked to the lemmas, which are differentiated in the OntoLex-Lemon representation, but not in the original OMW file.

Listing 2: The OntoLex-Lemon representation of two Spanish synsets

    :synset_spawn-13491616-n
        rdf:type ontolex:LexicalConcept ;
        skos:inScheme :spawnet .

    :synset_spawn-10470779-n
        rdf:type ontolex:LexicalConcept ;
        skos:inScheme :spawnet .

¹⁴ For the representation of OntoLex-Lemon data, we chose the turtle syntax serialization. More on the turtle syntax: https://www.w3.org/TR/turtle/.

6 Mapping MMorph to OntoLex-Lemon

To transform the MMorph data into OntoLex-Lemon we used a Python script including the rdflib module¹⁵, which supports the generation of RDF graphs in rdf:xml, turtle, or other relevant formats. In Listing 3, we show examples of the resulting data for the lemma “viola” in Italian.

Listing 3: The OntoLex-Lemon entry for viola

    :lex_viola_fem a ontolex:LexicalEntry ;
        lexinfo:partOfSpeech lexinfo:noun ;
        ontolex:canonicalForm :form_viola_f ;
        ontolex:otherForm :form_viola_f_pl .

    :lex_viola_masc a ontolex:LexicalEntry ;
        lexinfo:partOfSpeech lexinfo:noun ;
        ontolex:canonicalForm :form_viola_m .

    :form_viola_f a ontolex:Form ;
        lexinfo:gender lexinfo:feminine ;
        lexinfo:number lexinfo:singular ;
        ontolex:writtenRep "viola"@it .

    :form_viola_f_pl a ontolex:Form ;
        lexinfo:gender lexinfo:feminine ;
        lexinfo:number lexinfo:plural ;
        ontolex:writtenRep "viole"@it .

    :form_viola_m a ontolex:Form ;
        lexinfo:gender lexinfo:masculine ;
        lexinfo:number lexinfo:plural , lexinfo:singular ;
        ontolex:writtenRep "viola"@it .

As the reader can observe, we have two lexical entries for “viola”, as required by the OntoLex-Lemon guidelines, according to which a word with different grammatical genders should have one lexical entry per gender. “Viola” in the feminine is the musical instrument, while in the masculine it means “violet”. This is in fact an important feature for linking synsets to lemmas having distinct genders, as we will exemplify in Section 7.

The transformation of nominal entries from MMorph to the OntoLex-Lemon format resulted in 21085 instances of the class LexicalEntry for Italian. We still need to consider the lemmas of the OMW resources that are not in MMorph. These are mostly multiword entries in OMW.

We will also investigate the use of other lexical resources, but the current use of MMorph was motivated by the fact that it gives us access to the different languages in one and the same format, thus facilitating the uniform mapping into OntoLex-Lemon.

¹⁵ See https://github.com/RDFLib/rdflib for more details.

7 Linking the OMW Resources to the MMorph Resources

We see the use of OntoLex-Lemon for representing WordNets not only as a chance to port information from one format to another (including the possibility to publish WordNets in the Linguistic Linked Open Data cloud¹⁶), but also as an opportunity to extend the coverage of WordNet descriptions to more complex lexical phenomena, beyond lemma and PoS considerations. One case that has been studied in the recent past concerns the meaning that can be specifically associated with English plurals listed in PWN (Gromann and Declerck, 2019). We are interested in applying a similar approach to grammatical gender: we could link a Wordnet synset to a specific gender, as this information is normally not included in the Wordnets, which consider only the part-of-speech of the associated lemmas.

OntoLex-Lemon supports this linking in a straightforward manner. As can be seen in Figure 1, there is a property relating a LexicalEntry to a LexicalConcept, namely evokes, together with its inverse isEvokedBy. We therefore just need to add these properties to the OntoLex-Lemon representations of a synset and of its corresponding entry. In Listing 4 we show such a case, taking again the word “cura” as an example.

Listing 4: Interlinking a synset and an entry for cura

    :synset_spawn-13491616-n
        rdf:type ontolex:LexicalConcept ;
        skos:inScheme :spawnet ;
        ontolex:isEvokedBy :lex_cura_1 .

    :lex_cura_1 a ontolex:LexicalEntry ;
        lexinfo:gender lexinfo:feminine ;
        lexinfo:partOfSpeech lexinfo:noun ;
        ontolex:canonicalForm :form_cura ;
        ontolex:otherForm :form_cura_plural ;
        ontolex:evokes :synset_spawn-13491616-n .

    :synset_spawn-10470779-n
        rdf:type ontolex:LexicalConcept ;
        skos:inScheme :spawnet ;
        ontolex:isEvokedBy :lex_cura_2 .

    :lex_cura_2 a ontolex:LexicalEntry ;
        lexinfo:gender lexinfo:masculine ;
        lexinfo:partOfSpeech lexinfo:noun ;
        ontolex:canonicalForm :form_cura ;
        ontolex:otherForm :form_cura_plural ;
        ontolex:evokes :synset_spawn-10470779-n .

Adding the property evokes and its inverse isEvokedBy to the corresponding elements in the generated OntoLex-Lemon data sets is all that is needed for this morphological enrichment of the original Wordnets. Once the original (different types of) resources have been mapped onto the OntoLex-Lemon model, it is very easy to interlink or even to merge them into a richer representation. An extension of this work consists in describing restrictions on the usage of certain Wordnet concepts, as for example in the Italian case of the noun “bene” versus its plural form “beni”, or English “silk” versus the plural form “silks”, which are associated with different and sometimes not shareable meanings.¹⁷ For this we are making use of a strategy described in an extension to the core module of OntoLex-Lemon, called “Lexicog”,¹⁸ which foresees the description of instances of a class named FormRestriction, so that it is possible to state that a meaning is available only with the use of a specific form, like singular or plural.

¹⁶ See http://linguistic-lod.org/llod-cloud and (Chiarcos et al., 2012)
¹⁷ The reader can see the different meanings associated with those plural words by querying for them in the user interface of PWN: http://wordnetweb.princeton.edu/perl/webwn.
¹⁸ The current state of this “Lexicography” module is available at https://www.w3.org/community/ontolex/wiki/Lexicography.

8 Conclusion

We described our work on porting Open Multilingual Wordnet resources into the OntoLex-Lemon model, in order to establish an interlinking with corresponding morphological resources, such as the MMorph resource set. For this purpose, the morphological resources were also ported onto OntoLex-Lemon. As a result we noticed that this model can be easily used for bridging the WordNet type of lexical resources to a full description of lexical entries, which could possibly lead to an extension of the coverage of WordNets beyond the consideration of lemmas and PoS information.

We documented our interlinking work with the example of the full morphological representation of Italian words, putting them in relation with the corresponding OMW data sets. We also started to investigate the description of usage restrictions, which allows us to state formally that certain Wordnet concepts should be used only in the singular or in the plural form.

As a final goal of our work, we see the interlinked or merged resources in the Linguistic Linked Open Data (LLOD) cloud. We will investigate how our work can be combined with resources present in the LLOD, especially with the BabelNet framework, which is already integrating a huge number of lexical resources, including Princeton WordNet, and encyclopedic data sets (Ehrmann et al., 2014).

Acknowledgments

The presented work has been supported in part by the H2020 project “Prêt-à-LLOD” with Grant Agreement number 825182. Contributions by Thierry Declerck have additionally been supported in part by the H2020 project “ELEXIS” with Grant Agreement number 731015.

References

Francis Bond and Ryan Foster. 2013. Linking and extending an open multilingual wordnet. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 1352–1362, Sofia.

Francis Bond and Kyonghee Paik. 2012. A survey of wordnets and their licenses. Small, 8(4):5.

Francis Bond, Piek Vossen, John P McCrae, and Christiane Fellbaum. 2016. Cili: the collaborative interlingual index. In Proceedings of the Global WordNet Conference, volume 2016.

Christian Chiarcos, Sebastian Nordhoff, and Sebastian Hellmann, editors. 2012. Linked Data in Linguistics - Representing and Connecting Language Data and Language Metadata. Springer.

Philipp Cimiano, John P. McCrae, and Paul Buitelaar. 2016. Lexicon Model for Ontologies: Community Report.

Maud Ehrmann, Francesco Cecconi, Daniele Vannella, John Philip McCrae, Philipp Cimiano, and Roberto Navigli. 2014. Representing multilingual data as linked data: the case of BabelNet 2.0. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), pages 401–408, Reykjavik, Iceland, May. European Languages Resources Association (ELRA).

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

Dagmar Gromann and Thierry Declerck. 2019. Towards the detection and formal representation of semantic shifts in inflectional morphology. In Maria Eskevich, Gerard de Melo, Christian Fäth, John P. McCrae, Paul Buitelaar, Christian Chiarcos, Bettina Klimek, and Milan Dojchinovski, editors, 2nd Conference on Language, Data and Knowledge (LDK), volume 70 of OpenAccess Series in Informatics (OASIcs), pages 21:1–21:15. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 5.

John McCrae, Guadalupe Aguado de Cea, Paul Buitelaar, Philipp Cimiano, Thierry Declerck, Asunción Gómez-Pérez, Jorge Gracia, Laura Hollink, Elena Montiel-Ponsoda, Dennis Spohr, and Tobias Wunner. 2012. Interchanging lexical resources on the Semantic Web. Language Resources and Evaluation, 46(6):701–709.

John P. McCrae, Christiane Fellbaum, and Philipp Cimiano. 2014. Publishing and linking wordnet using lemon and rdf. In Proceedings of the 3rd Workshop on Linked Data in Linguistics.

John P. McCrae, Paul Buitelaar, and Philipp Cimiano. 2017. The OntoLex-Lemon Model: Development and Applications. In Iztok Kosem, Jelena Kallas, Carole Tiberius, Simon Krek, Miloš Jakubíček, and Vít Baisa, editors, Proceedings of eLex 2017, pages 587–597. INT, Trojína and Lexical Computing, Lexical Computing CZ s.r.o., 9.

Dominique Petitpierre and Graham Russell. 1995. MMORPH: The Multext morphology program. Multext deliverable 2.3.1, ISSCO, University of Geneva.

Emanuele Pianta, Luisa Bentivogli, and Christian Girardi. 2002. Multiwordnet: Developing an aligned multilingual database. In Proceedings of the First International Conference on Global WordNet, pages 293–302, Mysore, India.

Antonio Toral, Stefania Bracale, Monica Monachini, and Claudia Soria. 2010. Rejuvenating the italian wordnet: upgrading, standardising, extending. In Proceedings of the 5th International Conference of the Global WordNet Association (GWC-2010), Mumbai.
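
As an illustration of the mapping and linking steps described in Sections 5 to 7, the following Python sketch reads a few OMW rows and a few MMorph full-form entries and links the lemmas found in both. It is a minimal sketch under stated assumptions, not the script used in this work: it assumes tab-separated OMW rows and the full-form dump syntax shown in Section 3, writes Turtle statements as plain strings rather than building an rdflib graph, and relies on hypothetical URI patterns (:lex_<lemma>, :synset_<id>) and on sample data modelled on the excerpts above. It ignores grammatical gender and multiword lemmas, so it would not separate the two entries for “viola” or “cura”.

```python
import re
from collections import defaultdict

# Hypothetical sample inputs modelled on the excerpts in Sections 2 and 3.
# Assumption: OMW rows are tab-separated (synset ID, relation, value).
OMW_LINES = [
    "14842992-n\tita:lemma\tterra",
    "14842992-n\tita:lemma\tsuolo",
    "14842992-n\tita:def\t0\tparte superficiale della crosta terrestre",
]
MMORPH_LINES = [
    '"nobiltà" = "nobiltà" Noun[gen=fem num=sing|plur]',
    '"suoli" = "suolo" Noun[gen=masc num=plur]',
    '"suolo" = "suolo" Noun[gen=masc num=sing]',
]

# One full-form entry: "form" = "lemma" Category[attr=value ...]
FULL_FORM = re.compile(
    r'"(?P<form>[^"]+)"\s*=\s*"(?P<lemma>[^"]+)"\s*(?P<pos>\w+)\[(?P<feats>[^\]]*)\]'
)


def parse_omw(lines):
    """Map each synset ID to its lemmas; definition and example rows are skipped."""
    synsets = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[1].endswith(":lemma"):
            synsets[fields[0]].append(fields[2])
    return dict(synsets)


def parse_mmorph(lines):
    """Map each lemma to its full forms, each paired with a feature dictionary."""
    lexicon = defaultdict(list)
    for line in lines:
        match = FULL_FORM.match(line.strip())
        if not match:
            continue  # ignore lines that do not follow the assumed syntax
        feats = {}
        for pair in match.group("feats").split():
            key, value = pair.split("=", 1)
            feats[key] = value.split("|")  # "|" separates multiple values
        lexicon[match.group("lemma")].append((match.group("form"), feats))
    return dict(lexicon)


def link(synsets, lexicon):
    """Yield one Turtle statement per lemma that occurs in both resources."""
    for synset_id, lemmas in sorted(synsets.items()):
        for lemma in lemmas:
            if lemma in lexicon:
                # Hypothetical URI patterns, chosen for this sketch only.
                yield f":lex_{lemma} ontolex:evokes :synset_{synset_id} ."


synsets = parse_omw(OMW_LINES)
lexicon = parse_mmorph(MMORPH_LINES)
triples = list(link(synsets, lexicon))
for triple in triples:
    print(triple)
```

In this sample only “suolo” occurs in both inputs, so a single statement is produced; “terra” has no full-form entry here and would, as noted in Section 5, require a newly created LexicalEntry.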