
The Lemma Bank of the LiITA Knowledge Base of Interoperable Resources for Italian

Eleonora Litta

eleonoramaria.litta@unicatt.it 0

Marco Passarotti

marco.passarotti@unicatt.it 0

Paolo Brasolin

paolo.brasolin@gmail.com 0

Giovanni Moretti

giovanni.moretti@unicatt.it 0

Francesco Mambrini

francesco.mambrini@unicatt.it 0

Valerio Basile

valerio.basile@unito.it 1

Andrea Di Fabio

andrea.difabio@unito.it 1

Cristina Bosco

cristina.bosco@unito.it 1

0 CIRCSE Research Centre, Università Cattolica del Sacro Cuore, Largo Gemelli 1, 20123 Milano, Italy

1 Università degli Studi di Torino - Dipartimento di Informatica, Corso Svizzera 185, 10149 Torino, Italy

The paper introduces the LiITA Knowledge Base of interoperable linguistic resources for Italian. After describing the principles of the Linked Data paradigm, on which LiITA is grounded, the paper presents the lemma-centred architecture of the Knowledge Base and details its core component, consisting of a large collection of Italian lemmas (called the Lemma Bank) used to interlink distributed lexical and textual resources.

Keywords: Linked Open Data, Linguistic Resources, Italian, Interoperability

1. Introduction

When considering the number of digital linguistic resources, either lexical or textual, Italian is among the richest languages: e.g., at the time of writing, a search on the CLARIN Virtual Language Observatory,1 filtered for the Italian language, returns more than 8 000 results. Like other high-resource languages, Italian is provided with a large set of fundamental resources, including WordNets ([1] and [2]), a few treebanks available from the Universal Dependencies collection2, historical corpora3,4 and reference corpora of written (e.g., CORIS/CODIS [3]) and spoken language (e.g., KIParla [4]).

However, as is the case for many other languages, most linguistic resources for Italian vary in terms of data format, annotation criteria, and/or adopted tagsets. Such variation hinders full interaction between the (meta)data provided by the many available resources, with a negative impact on the empirical study of the language and resource usability. Indeed, different resources may provide different information or use different granularity of information about the same common object, namely words, which appear as occurrences in corpora and as entries in dictionaries or lexicons. Making this wealth of information interact represents one of today's main challenges, to best leverage the huge asset of (meta)data collected over decades of work.

As a consequence, a very active line of research currently focuses on the so-called Linguistic Linked Open Data (LLOD), aiming to define common practices for the representation and publication of linguistic resources according to the principles of the Linked Data paradigm, which underpins the Semantic Web5.

A recently concluded COST Action (Nexus Linguarum6) resulted both in the creation of a large and cohesive scientific community and in the definition of a set of shared vocabularies for linguistic knowledge description. Some of these vocabularies have been widely applied in the LiLa Knowledge Base (KB), which is probably the main LLOD use case currently available. LiLa (Linking Latin) is a KB of Latin linguistic resources made interoperable through their representation and publication according to the Linked Data principles. Thanks to its streamlined and language-independent architecture, LiLa is today a reference model for projects aiming to achieve online interoperability between distributed linguistic resources. Building on the experience of LiLa and reusing its architecture, the LiITA (Linking Italian)7 project has started the creation of a KB of interoperable linguistic resources for Italian published as Linked Data. This paper describes the development of the fundamental component of the LiITA KB, which consists of a collection of Italian lemmas (called the Lemma Bank) that serves as the connection point between word occurrences in the corpora and the corresponding entries in the lexical resources that will be published in the KB.

2. Linguistic Linked Data

Introduced by Tim Berners-Lee et alii [5], the concept of the Semantic Web is based on the assumption that documents published on the World Wide Web are associated with information and metadata structured in such a way as to allow their querying and semantic interpretation not only by humans but also by automated agents.

This structuring is implemented in the form of Linked Data, which are the pillars of the Semantic Web. Unlike a web made of hypertexts, where links are not semantically interpretable, the Semantic Web consists of links between "objects" associated with a unique and persistent identifier (URI: Uniform Resource Identifier). The links between objects are semantically interpretable as they are represented through vocabularies for knowledge description recorded in the form of ontologies.

The Linked Data paradigm is founded on four principles defined by Berners-Lee himself8:

1. Use URIs as "names for things" to identify them uniquely and persistently. The "things" dealt with when handling linguistic (meta)data in Linked Data are linguistic objects, such as occurrences of words in texts, lexical entries in dictionaries, or sets of parts of speech;
2. Use HTTP URIs to allow people (and machines) to look up things on the Web;
3. Use standards such as RDF and SPARQL to provide useful information about what is identified by a URI, for the purpose of representation and retrieval of (meta)data. RDF (Resource Description Framework) [6] is the data model that underlies the Semantic Web. According to this model, information in the Semantic Web is organised and represented in terms of triples, i.e., relationships between a Subject and an Object through a Property. The classes to which Subjects and Objects belong, as well as the semantics of Properties, are established by ontologies shared by the different communities that enrich and use the Semantic Web. SPARQL (SPARQL Protocol And RDF Query Language)9 is a query language for (meta)data represented in RDF;
4. Include links to other URIs to allow people (and machines) to discover more things.

Applying the principles of the Linked Data paradigm to (meta)data derived from linguistic resources and publishing them on the Web offers several benefits [7]. Firstly, as for representation and modelling of (meta)data, RDF is a very versatile model, suitable for representing metadata such as those conveyed by the various levels of annotation available in linguistic resources (morphology, syntax, lemmatisation, etc.). Moreover, the adoption of a common data model (RDF) enables both structural (or syntactic) interoperability, which is the ability of different systems to process exchanged data using shared protocols and formats (such as HTTP and URI), and conceptual (or semantic) interoperability, which is the ability of a system to automatically and semantically interpret the exchanged information using a common set of classes and data categories defined in ontologies and vocabularies [8]. The Italian language is no stranger to this paradigm10,11,12. However, this is the first attempt to create such a resource in the form of a lemma bank for Italian.

3. The LiITA Knowledge Base

This Section introduces the fundamental architecture of the LiITA KB and details its core component, i.e., a collection of canonical citation forms (lemmas) for the Italian language13. The base URI of the resource is http://www.liita.it/data/, a namespace we reserved by buying the domain from a registrar to use also as a URL, e.g., for the project website.

3.1. The Architecture of LiITA

The architecture of the LiITA KB resembles that of the LiLa KB for Latin14, which is based on the assumption that the sources of the (meta)data that the KB makes interoperable are all related to words. These sources are linguistic resources and specifically:

• lexical resources, such as dictionaries or lexicons, which describe the properties of words and consist of lexical entries;
• textual resources, such as corpora and digital libraries, which provide texts and are made of occurrences of words (tokens).

7 http://www.liita.it/
8 https://www.w3.org/DesignIssues/LinkedData
9 https://www.w3.org/TR/rdf-sparql-query/
10 http://hdl.handle.net/20.500.11752/ILC-1007
11 http://hdl.handle.net/20.500.11752/ILC-66
12 http://hdl.handle.net/20.500.11752/ILC-558
13 https://github.com/LiITA-LOD
14 https://lila-erc.eu/data-page/
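To make the triple model more concrete, the following minimal Python sketch represents one token of a textual resource and one entry of a lexical resource as Subject-Property-Object triples and serialises them in N-Triples syntax. It is only an illustration under stated assumptions: the two property URIs are those of the LiLa and OntoLex-Lemon vocabularies cited in Section 3.2, while the lemma, token and entry URIs are hypothetical and do not claim to reflect the actual identifiers published under http://www.liita.it/data/.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: a Subject linked to an Object through a Property."""
    subject: str
    prop: str
    obj: str


# Property URIs from the LiLa and OntoLex-Lemon vocabularies (see Section 3.2).
HAS_LEMMA = "http://lila-erc.eu/ontologies/lila/hasLemma"
CANONICAL_FORM = "http://www.w3.org/ns/lemon/ontolex#canonicalForm"

# Hypothetical identifiers, invented for this illustration only.
LEMMA = "http://www.liita.it/data/id/lemma/EXAMPLE"
TOKEN = "http://example.org/corpus/doc1/token/7"
ENTRY = "http://example.org/lexicon/entry/42"

triples = [
    Triple(TOKEN, HAS_LEMMA, LEMMA),       # a textual resource points to the lemma
    Triple(ENTRY, CANONICAL_FORM, LEMMA),  # a lexical resource points to the same lemma
]


def to_ntriples(statements: list[Triple]) -> str:
    """Serialise URI-only triples in N-Triples syntax, one statement per line."""
    return "\n".join(f"<{t.subject}> <{t.prop}> <{t.obj}> ." for t in statements)


print(to_ntriples(triples))
```

Both statements share the same Object, which is what lets a lemma act as the meeting point between a corpus and a lexicon.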

Lexical entries and word occurrences coming from distributed resources are made interoperable in LiITA by linking them to their respective lemmas. This makes it possible to perform federated searches on the different linguistic resources that LiITA makes interoperable.

For example, one can search for all occurrences (tokens) of the same lemma in multiple textual corpora; or extract from multiple corpora all those tokens that have certain lexical properties provided by one or more lexical resources.
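The two searches just mentioned can be sketched in plain Python. This is a stand-in for illustration, not the way LiITA is queried: in a Linked Data setting such searches would normally be written in SPARQL against the published data. All identifiers below (corpusA:, corpusB:, lexicon:, lemma:, and the lexicon:isIntransitive property) are hypothetical; only the two linking properties are named after those described in Section 3.2.

```python
HAS_LEMMA = "lila:hasLemma"               # token -> lemma
CANONICAL_FORM = "ontolex:canonicalForm"  # lexical entry -> lemma

# Each resource is modelled as a set of (subject, property, object) triples.
corpus_a = {
    ("corpusA:tok1", HAS_LEMMA, "lemma:cadere"),
    ("corpusA:tok2", HAS_LEMMA, "lemma:sopra"),
}
corpus_b = {
    ("corpusB:tok9", HAS_LEMMA, "lemma:cadere"),
}
lexicon = {
    ("lexicon:entry_cadere", CANONICAL_FORM, "lemma:cadere"),
    ("lexicon:entry_cadere", "lexicon:isIntransitive", "true"),
}


def tokens_of(lemma, *corpora):
    """Return, sorted, every token in the given corpora that is linked to `lemma`."""
    return sorted(
        s for corpus in corpora for (s, p, o) in corpus
        if p == HAS_LEMMA and o == lemma
    )


def tokens_with_property(prop, value, lex, *corpora):
    """Return tokens whose lemma is the canonical form of an entry having prop=value."""
    entries = {s for (s, p, o) in lex if p == prop and o == value}
    lemmas = {o for (s, p, o) in lex if p == CANONICAL_FORM and s in entries}
    return sorted(tok for lemma in lemmas for tok in tokens_of(lemma, *corpora))


print(tokens_of("lemma:cadere", corpus_a, corpus_b))
print(tokens_with_property("lexicon:isIntransitive", "true", lexicon, corpus_a, corpus_b))
```

In both functions the lemma is the only join key: neither corpus needs to know about the other, nor about the lexicon.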

Given the central role played by lemmas in the architecture of LiITA, the core component of the KB is a collection of conventional citation forms (lemmas) of Italian words, called the Lemma Bank.

In the LiLa KB, lemmas are described with the help of a custom ontology.15 On the one hand, this ontology provides detailed information on some morphological and linguistic features of the lemmas (e.g. the part of speech, the grammatical gender for nouns and the inflectional class), relying on the OLiA annotation model [9, 151-155]. On the other hand, the LiLa ontology defines classes and properties to model the task of lemmatisation, such as the property lila:hasLemma16, which links corpus tokens to their lemmas. The class lila:Lemma17 is defined as a subclass of ontolex:Form (see Section 3.2), so that the LiLa KB is not a lexical resource in itself, but rather a collection of canonical forms that can be used either to lemmatise texts or to index lexical entries.

3.2. The LiITA Lemma Bank

Data modelling

The Lemma Bank of LiITA consists of a collection of lemmas of the Italian language, i.e., lexical citation forms adopted (more or less conventionally) in linguistic resources. These lemmas are the names of entries in (most) lexical resources and the forms chosen to gather all occurrences of a particular word in (lemmatised) textual resources. As mentioned above, the Lemma Bank plays a fundamental role in the LiITA KB, acting as the connection point between entries in various lexical resources and word occurrences in textual resources.

Following the principles of the Linked Data paradigm, conceptual interoperability among the distributed resources connected in LiITA is achieved by applying a vocabulary for knowledge description commonly used in the world of Linguistic Linked Open Data. In the specific case of the Lemma Bank, this means adopting the vocabulary defined by OntoLex-Lemon [10], one of the most widely used models for representing and publishing lexical resources as Linked Data. Figure 1 shows the OntoLex-Lemon model. In Figure 1, the Classes of OntoLex-Lemon are graphically represented within rectangles. The relationships between Classes are shown as arrows associated with the name of the Property that connects two Classes. The main Class of OntoLex-Lemon is ontolex:LexicalEntry18, understood as the unit of lexicon analysis that gathers one or more forms (ontolex:Form19) and one or more lexical senses (ontolex:LexicalSense20), lexical concepts (ontolex:LexicalConcept21) or entities from ontologies.

Figure 1: The OntoLex-Lemon model.

Lexical senses are lexicalised senses: a sense belongs to exactly one lexical entry. Semantic aspects that can be expressed by multiple words are represented through lexical concepts, which can therefore have more than one lexicalisation. A typical example of a lexical concept is the synset in a resource like WordNet, which groups multiple words related by a conceptual synonymy relationship. Forms can have one or more graphical variants (written representations), represented through the Data Property ontolex:writtenRep22, and possibly one or more phonetic variants (Property ontolex:phoneticRep23). One of these forms, the object of the ontolex:canonicalForm Property24, is the form that is conventionally chosen to represent the entire set of inflected forms of a lexical entry. The Lemma Bank of LiITA is a collection of such forms, modelled as individuals of the Class lila:Lemma25, which is a subclass of ontolex:Form, originally created for the LiLa project, and adopted in the LiITA Lemma Bank accordingly. The lemmas of the LiITA Lemma Bank are unbound by any relationship with a lexical entry, as the Lemma Bank is not a lexical resource consisting of lexical entries but a set of canonical citation forms. This reflects the role of the Lemma Bank in LiITA as a collection of lemmas used to make resources interoperable.

The LiITA Lemma Bank makes textual resources for Italian interoperable through the lila:hasLemma Property26, which links a token in a corpus with its lemma in the Lemma Bank. Lexical resources, on the other hand, are connected to the Lemma Bank through the ontolex:canonicalForm Property, which links a lexical entry in the resource to its corresponding lemma in the Lemma Bank.

By using the Property lila:hasPOS27, each lemma in the Lemma Bank is assigned one part of speech, following the Universal PoS tagset [11].

In the case of words that are assigned multiple PoS tags in lexical resources, multiple lemmas are created in the Lemma Bank. For instance, the word sopra 'over' is usually assigned four PoS: preposition, adverb, adjective and noun. Thus, four distinct lemmas are created in the Lemma Bank with four different PoS represented via the lila:hasPOS Property.

To harmonise different lemmatisation criteria that may be found in linguistic resources, the Lemma Bank of LiITA includes two specific Properties. The symmetric Property lila:lemmaVariant28 connects different forms of the inflectional paradigm of a word that can be used as lemmas. A typical case is that of pluralia tantum, which can be lemmatised either in the plural form or in the singular form. This model allows, for example, for both the lila:Lemma pantaloni and pantalone, which are linked to each other by the lila:lemmaVariant Property.

While lila:lemmaVariant links lemmas that are assigned the same part of speech, the Property lila:hasHypolemma29 (and its inverse Property lila:isHypolemma30) connects lemmas that can be used for the same word but have different parts of speech. This is the case for adjectives used as adverbs, e.g. veloce, which can be interpreted (and lemmatised) either as a form of adjective (hence modelled as a lila:Lemma) or as an adverb (hence modelled as a lila:Hypolemma31, a subclass of lila:Lemma).

Past participles are another kind of hypolemma (e.g. caduto 'fallen'), which in the Lemma Bank are assigned the part of speech Adjective. Participles are modelled as individuals of the lila:Hypolemma Class and are connected to their verbal lemma (cadere 'to fall') through the lila:isHypolemma Property.

Regardless of whether two resources lemmatise participles according to different criteria (namely, one under the participial lemma and the other under the verbal lemma), the two different lemmatisations are harmonised in the Lemma Bank.

Data acquisition

The lemmas and PoS that constitute the Lemma Bank are based on the lexical base of an online version of the dictionary Nuovo De Mauro32, which amounts to about 145 000 entries; out of these, 13 000 multi-word expressions were excluded because they were deemed unnecessary, as lemmatisers usually deal with single tokens. About 94 000 lemmas were derived from the remaining 131 000 entries. The most numerically abundant PoS with which the Lemma Bank was populated are listed in Table 1.

Table 1
56 575        Nouns
19 912        Adjectives
15 885        Verbs
359           Proper Nouns
311           Adverbs
111026 [sic]  Pronouns and Conjunctions (two rows fused in the source)
40            Prepositions
58            Articles

This population process was not an easy task for two main reasons. Firstly, the online version of Nuovo De Mauro is tailored for visualisation: data is mixed with graphical information. Secondly, Nuovo De Mauro stems from one of the greatest efforts in Italian lexicographic history, namely GRADIT (Grande dizionario italiano dell'uso [12]). The resource includes information especially hard to handle computationally: De Mauro and colleagues described for every lemma not only each of its usual lexicographic metadata (meaning, PoS, examples, etc.) but also frequency, semantic domain, grouping of senses, multi-word expressions and more. The extraction of data is in practice hindered by information that must be filtered out because it is not relevant for our purposes of building a lemma bank or is provided in non-homogeneous forms. Therefore, in order to ease this initial work, we decided to preliminarily extract the aforementioned PoS, leaving out a part of the minor lexical categories like acronyms (e.g. NASA, FBI), exclamation marks, or unit symbols (e.g. cm, kg), setting them aside for future developments of LiITA.

For the time being, the PoS categorisation rationale of Nuovo De Mauro was adopted with some in-house adjustments and mapped to the UPOS tagset. The original tagging was that of the Italian grammatical tradition, hence we had to adapt some tags, for example conjunctions. De Mauro's conjunctions did not distinguish between subordinating and coordinating, so we manually aligned each of the dictionary's conjunctions to the UPOS tags. For the rest of De Mauro's PoS, we manually established the correspondence with the UPOS tagset.

4. Conclusion and Future Work

In this paper we presented the first steps towards the publication as LLOD of a collection of canonical citation forms (lemmas) for Italian. This Lemma Bank is the core component of LiITA, a knowledge base of interoperable linguistic resources for Italian inspired by the LiLa knowledge base for Latin. LiITA aims to compensate for the current lack of interoperability between Italian resources, as well as to become the pivot to interlink all the present and future lexicons and corpora for Italian. To this aim, the Lemma Bank is modelled such that it can harmonise different lemmatisation criteria found in lexical and textual resources, following a bottom-up approach rather than a top-down one.

Building a Lemma Bank to make distributed resources interoperable in Linked Data is an open-ended process. As the linking of more and more resources to the KB might require the inclusion of new lemmas, the LiITA Lemma Bank will keep on growing, both through the extraction of lemmas from other lexical sources and in a resource-driven fashion.

Besides extending the Lemma Bank and linking the first resources, the LiITA project will develop online services, following what has been done for LiLa [13]. The process of linking a text or corpus in the KB must be supported by an accessible tool performing automatic lemmatisation, PoS-tagging and linking. Currently, a new Stanza model [14] has been trained combining all the existing Italian treebanks. This model will serve as the foundation for the linkage process of textual resources to be included in the LiITA KB.33 The advanced interrogation of data offered by all the resources interlinked in LiITA will be eased by a graphical interface which will help with the task of writing complex SPARQL queries.

Finally, given its language-independent architecture and the use of common vocabularies for knowledge description, LiITA promises to have a substantial methodological impact on how linguistic resources are published and made interoperable as Linked Data.

Acknowledgments

This contribution is funded by the European Union - Next Generation EU, Mission 4 Component 1 CUP J53D2301727OOO1. The PRIN 2022 PNRR project "LiITA: Interlinking Linguistic Resources for Italian via Linked Data" is carried out jointly by the Università Cattolica del Sacro Cuore, Milano and the Università di Torino.

15 http://lila-erc.eu/ontologies/lila/
16 http://lila-erc.eu/ontologies/lila/hasLemma
17 http://lila-erc.eu/ontologies/lila/Lemma
18 http://www.w3.org/ns/lemon/ontolex#LexicalEntry
19 http://www.w3.org/ns/lemon/ontolex#Form
20 http://www.w3.org/ns/lemon/ontolex#LexicalSense
21 http://www.w3.org/ns/lemon/ontolex#LexicalConcept
22 http://www.w3.org/ns/lemon/ontolex#writtenRep
23 http://www.w3.org/ns/lemon/ontolex#phoneticRep
24 http://www.w3.org/ns/lemon/ontolex#canonicalForm
25 http://lila-erc.eu/ontologies/lila/Lemma
26 http://lila-erc.eu/ontologies/lila/hasLemma
27 http://lila-erc.eu/ontologies/lila/hasPOS
28 http://lila-erc.eu/ontologies/lila/lemmaVariant
29 http://lila-erc.eu/ontologies/lila/hasHypolemma
30 http://lila-erc.eu/ontologies/lila/isHypolemma
31 http://lila-erc.eu/ontologies/lila/Hypolemma
32 https://dizionario.internazionale.it/. PoS tags were converted automatically into the Universal tagset, adopted in the Lemma Bank.
33 The current model's performance is presented in Table 2 in the Appendix. The model can be found at https://github.com/LiITALOD/LiITA

Appendix
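As a supplementary illustration of the harmonisation mechanism described in Section 3.2, the sketch below shows how variant and hypolemma links could let a single search reach tokens that two resources lemmatise differently. It is a simplified model written for this purpose, not the LiITA implementation: lemmas are identified by their written form instead of by URI and PoS, the two dictionaries merely stand in for the lila:lemmaVariant and lila:isHypolemma Properties, and the corpora and token identifiers are hypothetical. The word pairs are the examples given in the paper.

```python
from collections import defaultdict

LEMMA_VARIANT = defaultdict(set)  # stands in for the symmetric lila:lemmaVariant
IS_HYPOLEMMA = {}                 # stands in for lila:isHypolemma (hypolemma -> lemma)


def add_variant(a: str, b: str) -> None:
    """Record a variant link in both directions, since the Property is symmetric."""
    LEMMA_VARIANT[a].add(b)
    LEMMA_VARIANT[b].add(a)


add_variant("pantaloni", "pantalone")  # plurale tantum: plural or singular lemma
IS_HYPOLEMMA["caduto"] = "cadere"      # participle attached to its verbal lemma


def harmonised(lemma: str) -> set[str]:
    """Collect the lemma together with its hypolemma relatives and its variants."""
    group = {lemma}
    if lemma in IS_HYPOLEMMA:
        group.add(IS_HYPOLEMMA[lemma])
    group |= {hypo for hypo, parent in IS_HYPOLEMMA.items() if parent in group}
    for member in list(group):
        group |= LEMMA_VARIANT.get(member, set())
    return group


# Two hypothetical corpora that lemmatise the same participle differently.
corpus_1 = {"c1:tok3": "caduto"}
corpus_2 = {"c2:tok8": "cadere"}


def find_tokens(lemma: str, *corpora: dict[str, str]) -> list[str]:
    """Return, sorted, the tokens whose lemma falls in the harmonised group."""
    wanted = harmonised(lemma)
    return sorted(tok for corpus in corpora for tok, lem in corpus.items() if lem in wanted)


print(find_tokens("cadere", corpus_1, corpus_2))
print(sorted(harmonised("pantalone")))
```

Under these assumptions, searching for either cadere or caduto returns the tokens of both corpora, which is the behaviour the two Properties are meant to enable.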

References

[1] Pianta, Bentivogli, Girardi, Multiwordnet: developing an aligned multilingual database, in: First international conference on global WordNet, 2002, pp. 293-302.

[2] Roventini, Marinelli, Bertagna, ItalWordNet v.2, 2016. URL: http://hdl.handle.net/20.500.11752/ILC-62, ILC-CNR for CLARIN-IT repository hosted at Institute for Computational Linguistics "A. Zampolli", National Research Council, in Pisa.

[3] R. R. Favretti, Tamburini, C. De Santis, Coris/codis: A corpus of written italian based on a defined and a dynamic model, A rainbow of corpora: Corpus linguistics and the languages of the world (2002) 27-38.

[4] Mauri, Ballarè, E. Goria, Cerruti, Suriano, et al., Kiparla corpus: a new resource for spoken italian, in: CEUR Workshop Proceedings, SunSITE Central Europe, 2019, pp. 1-7.

[5] Berners-Lee, Hendler, O. Lassila, The semantic web, Scientific american 284 (2001) 34-43.

[6] E. J. Miller, An introduction to the resource description framework, Journal of library administration 34 (2001) 245-255.

[7] Chiarcos, Moran, P. N. Mendes, Nordhoff, Littauer, Building a linked open data cloud of linguistic resources: Motivations and developments, The People's Web Meets NLP: Collaboratively Constructed Language Resources (2013) 315-348.

[8] Ide, Pustejovsky, What does interoperability mean, anyway? toward an operational definition of interoperability for language technology, in: Proceedings of the Second International Conference on Global Interoperability for Language Resources, Hong Kong, China, 2010.

[9] Cimiano, Chiarcos, J. P. McCrae, Gracia, Linguistic Linked Data: Representation, Generation and Applications, Springer, Cham, 2020. URL: https://www.springer.com/gp/book/9783030302245. doi:10.1007/978-3-030-30225-2.

[10] J. P. McCrae, Bosque-Gil, Gracia, Buitelaar, Cimiano, The ontolex-lemon model: development and applications, in: Proceedings of eLex 2017 conference, 2017, pp. 19-21.

[11] Petrov, D. Das, R. McDonald, A Universal Part-of-Speech Tagset, in: N. Calzolari (Conference Chair), Choukri, Declerck, M. U. Doğan, Maegaard, Mariani, Moreno, Odijk, S. Piperidis (Eds.), Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), European Language Resources Association (ELRA), Istanbul, Turkey, 2012, pp. 2089-2096. URL: http://www.lrec-conf.org/proceedings/lrec2012/pdf/274_Paper.pdf.

[12] T. De Mauro, Grande dizionario italiano dell'uso. Gradit, UTET, 1999.

[13] Passarotti, Mambrini, G. Moretti, The services of the lila knowledge base of interoperable linguistic resources for latin, in: Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024, 2024, pp. 75-83.

[14] Qi, Zhang, Bolton, C. D. Manning, Stanza: A Python natural language processing toolkit for many human languages, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2020. URL: https://nlp.stanford.edu/pubs/qi2020stanza.pdf.