<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Wikibase: Remodeling Use Cases</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David Lindemann</string-name>
          <email>david.lindemann@ehu.eus</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sina Ahmadi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anas Fahad Khan</string-name>
          <email>fahad.khan@ilc.cnr.it</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Mambrini</string-name>
          <email>francesco.mambrini@unicatt.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federica Iurescia</string-name>
          <email>federica.iurescia@unicatt.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Passarotti</string-name>
          <email>marco.passarotti@unicatt.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff3">
          <label>3</label>
          <institution>CNR-ILC</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, George Mason University</institution>
          ,
          <addr-line>Fairfax, VA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>UPV/EHU University of the Basque Country</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Università Cattolica del Sacro Cuore</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <fpage>7</fpage>
      <lpage>20</lpage>
      <abstract>
        <p>Wikibase is the software that powers Wikidata, but it can also be used as a separate installation to suit individual needs. This platform is well suited for creating data archives that can easily interact with the semantic web through the use of open standards; compared with other software solutions, it offers unique features, such as the option to manually edit every single semantic triple in a graphical interface. However, integrating various data models and vocabularies in Wikibase is a challenging task due to particularities of its data model. This study sheds light on modeling datasets in Ontolex-Lemon (the Lexicon Model for Ontologies, one of the predominant ontologies in lexicography) in Wikibase. We discuss some of the major issues that should be taken into account when remodeling Ontolex-Lemon on Wikibase, looking at two use cases dealing with Latin and Kurdish lexical data, respectively. We believe that our approach paves the way for further conversions in the future and toward a set of general guidelines.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Wikibase (https://wikiba.se), an extension of MediaWiki, is the software underlying
Wikidata (https://www.wikidata.org) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a very large knowledge graph maintained by
the community of Wikidata users and technically supported by Wikimedia Deutschland
(https://www.wikimedia.de). In addition to being an ontology of concepts with properties
that relate them to each other and point to typed literal values or to external identifiers,
Wikidata also contains descriptions of lexemes in multiple languages, as summarised in
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].<sup>1</sup>
      </p>
      <p>
        Although the data model for lexical data underlying Wikibase<sup>2</sup> is based on the
Ontolex-Lemon [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ] core classes, i.e. ontolex:LexicalEntry, ontolex:LexicalSense and
ontolex:Form, the fine-grained implementation of the former differs substantially from
the latter. Thus, a thorough review and re-modeling of existing datasets is required.
      </p>
      <p><sup>1</sup> Lexicographical coverage statistics: https://www.wikidata.org/wiki/Wikidata:Lexicographical_coverage.</p>
      <p><sup>2</sup> See https://www.mediawiki.org/wiki/Extension:WikibaseLexeme/Data_Model and https://www.mediawiki.org/wiki/Extension:WikibaseLexeme/RDF_mapping for a technical description.</p>
      <p>In this paper, we describe the data model used in Wikibase for the representation of
lexemes along with two use cases that aim to adapt lexical resources modeled according
to Ontolex-Lemon in a way that they can be uploaded to a Wikibase. Our use cases focus
on data in Latin and Kurdish, respectively. While for the Latin data we have decided to
interact with Wikidata directly, the latter case aims to use a separate Wikibase installation.
Nevertheless, the final goal of the operation in both use cases is to enrich the Wikidata
lexeme collection. Discussing the advantages and implications of each approach, we argue
that the workflow descriptions are helpful for the creation of Wikidata-compatible lexical
datasets.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Lexicographical Data on Wikibase</title>
      <p>The Wikibase software comes with a default backbone ontology, the Wikibase
Ontology,<sup>3</sup> which deploys widely used RDF vocabularies. For instance, it uses the
properties rdfs:label for entity labels and prov:wasDerivedFrom for references, and the classes
ontolex:LexicalEntry for lexemes and schema:Article for Wikipedia articles.
It also describes additional classes and properties. In the RDF representation
produced by Wikibase, these classes and properties appear unchanged, which is why they are called passthrough
properties; in the graphical editing interface, their values have a fixed place
in the page layout.</p>
      <p>On the other hand, any additionally defined ontology concept or any additional property
will be identified by a unique numeral. A Wikibase item with a value for rdfs:label is the
minimum a user has to provide to create a concept. Such concepts appear in the canonical
namespace for the corresponding Wikibase, such as http://www.wikidata.org/entity/ for
Wikidata. The numeral is preceded by a capital letter: item identifiers are preceded by
the letter Q, properties by a P, and as a third category, L-entities describe lexemes.</p>
      <p>The three kinds of Wikibase entities, i.e. Q-, P-, and L-entities, are unique within one
Wikibase instance. They can themselves be further described and linked to equivalent
entities in another Wikibase such as Wikidata, or enriched with external identifiers
pointing to external entities declared as equivalent. These alignments can be used for
federated querying, i.e., a query that involves Wikidata and a custom Wikibase at
the same time. Moreover, once Wikibase properties are mapped to W3C-recommended
vocabularies, the alignments make it possible to export datasets in an RDF representation compatible with the LOD
cloud.</p>
      <p>Assertions made using Wikibase P-properties are called “statements”. In the RDF
representation of the entity data, e.g. ‘lexicography’ on Wikidata (Q184524), statements
are nodes of their own, which allows attaching the property value along with qualifiers, ranks
and references (see also Fig. 2). In the editing GUI, statements appear as a central
section of the entity page, with no property or value preset. A Wikibase property used
in statements, qualifiers, or references is by default restricted to one datatype.<sup>4</sup></p>
      <p><sup>3</sup> The canonical URI of the Wikibase RDF (Resource Description Framework) ontology is http://wikiba.se/ontology; the current version can be found at http://wikiba.se/ontology-1.0.owl.</p>
      <p>
        The modeling of lexemes in Wikibase, described in the following subsections
along with the three core classes and the links between them, deploys the Ontolex-Lemon
model. The pre-defined backbone ontology specifies hardly anything beyond that, leaving much room
for the user to define fine-grained details of the lexicographical data model. The path
taken by the community around Wikidata lexemes can be conveniently explored using the
ORDIA tool [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].<sup>5</sup> In addition, the community maintains user-oriented documentation
with examples for the creation and querying of data.<sup>6</sup>
      </p>
      <sec id="sec-3-1">
        <title>2.1. Lexical Entry</title>
        <p>Each Lexeme entity in Wikibase is identified by a unique Lexeme ID. When users create
a lexeme, one or more lemmata have to be provided, each with a language code that specifies
the language and script of the lemma string (for example, ku-arab for Kurdish
in Arabic script and ku-latn for Kurdish in Latin script).<sup>7</sup> In the RDF representation
of the lexeme, a lemma string is attached to the lexeme entity using a property
called wikibase:lemma, and is additionally referred to by rdfs:label. The lemma is
used to index Wikibase lexemes in ElasticSearch, just as rdfs:label and
skos:altLabel values are used for items and properties. That way, users
can find lexemes through a textual search in the GUI or via the API without further
descriptions.</p>
        <p>The form of the lexeme used as a lemma is typically also described as an ontolex:Form,
to which properties describing morphological or other features are attached. In addition, values for
wikibase:lexicalCategory and dct:language are required; both values must be
items of the same Wikibase instance that describe a part of speech and a natural language,
respectively. Other properties, such as those that describe pronunciation, etymology,
or usage examples, are attached to the lexeme (at lexeme, sense, or form level) using
Wikibase statements, i.e., using P-entities as properties. A lexeme is linked to its senses
using ontolex:sense, and to its forms using ontolex:lexicalForm.</p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Lexical Sense</title>
        <p>A Wikibase sense is identified by a numeral identifier preceded by the identifier of the
corresponding lexeme, a dash, and the letter S. For example, wd:L3257-S1 for the sense
of the English noun apple referring to “a fruit of a tree of the genus Malus”. Lexeme
senses are by default described using textual sense glosses in any language. Those glosses
appear attached to the Sense entity using the skos:definition property; the language
of the gloss text is again specified by a language code. Any further description of the
sense is done using Wikibase statements.</p>
        <p><sup>4</sup> https://www.wikidata.org/wiki/Help:Data_type</p>
        <p><sup>5</sup> https://ordia.toolforge.org</p>
        <p><sup>6</sup> https://www.wikidata.org/wiki/Wikidata:Lexicographical_data</p>
        <p><sup>7</sup> For a list of available Wikimedia language codes, refer to https://www.wikidata.org/wiki/Help:Wikimedia_language_codes/lists/all.</p>
      </sec>
      <sec id="sec-3-3">
        <title>2.3. Lexical Form</title>
        <p>Form URIs are named in the same way as sense URIs, but using the letter F, as
in wd:L3257-F2 for the plural form of ‘apple’ in English. The written representation is
pointed to using ontolex:representation. Items describing grammatical features of a
form are linked to using wikibase:grammaticalFeature. Any further description is made
through custom statements.</p>
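        <p>The identifier scheme described in this section can be illustrated with a minimal Python sketch. The function name parse_entity_id and the field names it returns are hypothetical and not part of Wikibase; the sketch only assumes the pattern stated in the text, i.e. a capital letter followed by a numeral and, for lexemes, an optional dash, the letter S or F, and a further numeral.</p>
        <preformat>
```python
import re

# Pattern assumed from the text: Q, P or L plus a numeral; lexeme IDs may
# carry a suffix such as "-S1" (sense) or "-F2" (form).
ENTITY_ID = re.compile(r"^(?:wd:)?([QPL])([0-9]+)(?:-([SF])([0-9]+))?$")

KINDS = {"Q": "item", "P": "property", "L": "lexeme", "S": "sense", "F": "form"}


def parse_entity_id(entity_id):
    """Return the kind of a Wikibase entity ID and, for senses and forms, its parent lexeme."""
    match = ENTITY_ID.match(entity_id)
    if match is None:
        raise ValueError("not a Wikibase entity ID: " + entity_id)
    letter, number, sub_letter, sub_number = match.groups()
    if sub_letter is None:
        return {"kind": KINDS[letter], "id": letter + number, "parent": None}
    if letter != "L":
        raise ValueError("only lexemes have senses and forms: " + entity_id)
    return {
        "kind": KINDS[sub_letter],
        "id": letter + number + "-" + sub_letter + sub_number,
        "parent": letter + number,
    }


print(parse_entity_id("wd:L3257-S1"))
print(parse_entity_id("wd:L3257-F2"))
print(parse_entity_id("Q184524"))
```
        </preformat>
        <p>Keeping the parent lexeme ID in the result makes it straightforward to group senses and forms under their lexeme when data is exported or compared.</p>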
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Ontolex-Lemon</title>
      <p>The differences between ontolex:LexicalEntry, ontolex:LexicalSense and ontolex:Form as defined in
the OntoLex guidelines and as described above are not large, but it is important
to be aware of them when mapping from OntoLex to Wikibase.</p>
      <p>To start with, in OntoLex each entry must be associated with at least one instance
of ontolex:Form via the property ontolex:lexicalForm, and can be associated with
at most one lemma form via the functional property ontolex:canonicalForm. Each
ontolex:LexicalEntry is effectively associated with a language via the rdf:langString
language tag on its form representations, of which it must have at least one. Form
representations are linked via the ontolex:writtenRep property (defined as a subproperty
of ontolex:representation, which is used on Wikibase for that purpose). That said,
the language does not have to be specified via the dct:language property.
Moreover, there is no requirement for lexical category information to be associated with
individual instances of ontolex:LexicalEntry; where this information is
provided, the lexinfo vocabulary is recommended. OntoLex presupposes no naming conventions
for ontolex:LexicalSense and ontolex:Form URIs.</p>
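      <p>Table 1 shows the mapping between Ontolex-Lemon properties and the Wikibase passthrough properties used by default for
lexemes.</p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Ontolex-Lemon</th><th>Wikibase</th></tr>
          </thead>
          <tbody>
            <tr><td>rdfs:label, ontolex:writtenRep</td><td>wikibase:lemma</td></tr>
            <tr><td>ontolex:sense</td><td>ontolex:sense</td></tr>
            <tr><td>(skos:definition)</td><td>skos:definition</td></tr>
            <tr><td>ontolex:lexicalForm</td><td>ontolex:lexicalForm</td></tr>
            <tr><td>ontolex:writtenRep</td><td>ontolex:representation</td></tr>
            <tr><td>(lexinfo properties)</td><td>wikibase:grammaticalFeature</td></tr>
          </tbody>
        </table>
      </table-wrap>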
      <p>For other properties with no mapping, some particularities have to be considered.
As explained above, an OntoLex entry is linked to one canonical form, the written
representation of which is to be mapped to wikibase:lemma. On the other hand, it
is common in OntoLex datasets that the lemma string is attached to the entry entity
using rdfs:label; the same is always true for Wikibase lexeme RDF representations. A
Wikibase lexeme can have multiple values for wikibase:lemma (see 2.1).</p>
      <p>In OntoLex, definitions are attached to instances of the ontolex:LexicalConcept
class, which are linked to the sense using ontolex:lexicalizedSense; OntoLex does
not specify which property to use for the definition itself. In Wikibase, short sense definitions, called
glosses, are attached directly to the sense using skos:definition. Properties from the
lexinfo vocabulary that describe morphosyntactic features of forms, i.e., sub-properties
of lexinfo:morphosyntacticProperty such as lexinfo:number, are all to be mapped
to wikibase:grammaticalFeature, and their values, on Wikibase, need to be Wikibase
items; as a consequence, lexinfo concepts need to be mapped to Wikibase items.</p>
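      <p>The mapping rules discussed in this section can be summarised in a small Python sketch. It is only an illustration: the names ONTOLEX_TO_WIKIBASE, MORPHOSYNTACTIC and map_property are hypothetical, the set of morphosyntactic properties is reduced to the one example named in the text, and prefixed names are handled as plain strings rather than resolved URIs.</p>
      <preformat>
```python
# Properties that have a passthrough counterpart on Wikibase (see Table 1).
ONTOLEX_TO_WIKIBASE = {
    "ontolex:sense": "ontolex:sense",
    "skos:definition": "skos:definition",
    "ontolex:lexicalForm": "ontolex:lexicalForm",
    "ontolex:writtenRep": "ontolex:representation",
}

# Illustrative subset of the sub-properties of lexinfo:morphosyntacticProperty;
# the text names lexinfo:number as an example.
MORPHOSYNTACTIC = {"lexinfo:number"}


def map_property(prop, on_canonical_form=False):
    """Return the Wikibase predicate for an OntoLex predicate.

    None means that Table 1 lists no passthrough counterpart for it.
    """
    # The lemma string comes from rdfs:label or from the written
    # representation of the canonical form.
    if prop == "rdfs:label" or (prop == "ontolex:writtenRep" and on_canonical_form):
        return "wikibase:lemma"
    if prop in MORPHOSYNTACTIC:
        return "wikibase:grammaticalFeature"
    return ONTOLEX_TO_WIKIBASE.get(prop)


print(map_property("ontolex:writtenRep", on_canonical_form=True))
print(map_property("ontolex:writtenRep"))
print(map_property("lexinfo:number"))
```
      </preformat>
      <p>Note that the values of properties mapped to wikibase:grammaticalFeature still have to be replaced by Wikibase items, which requires a separate alignment of lexinfo concepts.</p>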
    </sec>
    <sec id="sec-5">
      <title>4. Use Case 1: Kurdî Wikibase</title>
      <p>
        As a low-resourced and under-represented language, Kurdish faces many challenges in
language technology due to a paucity of data. To remedy this, Azin and Ahmadi [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
and Ahmadi et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] address the creation of lexicographical resources compatible with
semantic technologies, particularly by relying on Ontolex-Lemon. An example entry from these resources, modeled in
Ontolex-Lemon, is provided in Figure 1. Four resources are freely
available under an open source license for Kurdish varieties.<sup>8</sup> The resources are described
as follows:
      </p>
      <list list-type="bullet">
        <list-item>
          <p>Northern Kurdish (also known as Kurmanji, kmr): Over 4,000 headwords are
provided in Northern Kurdish in the Latin-based script. Headwords are described
with part-of-speech tags, grammatical gender, and glosses based on distinct senses
in Northern Kurdish and English. Usage examples are also provided in some cases.</p>
        </list-item>
        <list-item>
          <p>Central Kurdish (also referred to as Sorani, ckb): Over 5,000 headwords are provided
in Central Kurdish written in the Latin-based script. Unlike in the case of Northern
Kurdish, this script is not much used by Central Kurdish speakers; the Perso-Arabic-based
script is mostly used for this variety. Entries are described with part-of-speech
tags, glosses in English and, sometimes, usage examples. Grammatical gender is
not present in Central Kurdish.</p>
        </list-item>
        <list-item>
          <p>Southern Kurdish (sdh): This resource contains over 11,000 headwords, the highest
number among the selected resources. The headwords are written in both
Perso-Arabic and Latin-based scripts and are described with glosses in Persian and other
varieties of Kurdish. These glosses include words from Kurdish varieties along with
the Laki and Luri languages. That said, the distinction between the varieties is not explicit
in the resource. Therefore, Kurdish glosses in this resource are specified with the
ku code as an umbrella code to refer to all varieties of Kurdish. It should be noted
that Luri (ldd) is a distinct language from Kurdish.</p>
        </list-item>
        <list-item>
          <p>Gorani (also known as Hawrami, hac): This
resource is the smallest one, containing around 1,000 headwords written in the
Latin-based script and described with part-of-speech tags, grammatical gender,
glosses in English and a few usage examples. Like Central Kurdish, this
language is mostly written in the Perso-Arabic-based script of Kurdish.</p>
        </list-item>
      </list>
      <p><sup>8</sup> https://github.com/sinaahmadi/KurdishLexicography</p>
      <sec id="sec-5-1">
        <p>Among the selected resources, those of Southern Kurdish and Gorani are especially
important, as these varieties are more under-resourced than Northern and Central Kurdish.
While the latter two are also available on Wiktionary,<sup>9</sup> which facilitates community support,
Southern Kurdish and Gorani are barely covered by such initiatives.</p>
        <p>Nevertheless, these resources have not been much used by the native speakers’
community, chiefly due, we believe, to a lack of familiarity with LOD. Furthermore, any
static resource compatible with LOD requires a SPARQL endpoint to be queried and
effectively integrated into other applications. This further hinders the interoperability of
the resources, as individual efforts do not necessarily scale to wider usage.</p>
        <p>To remedy this, we create an instance of the Wikibase software where the existing
Kurdish resources are made available. To that end, we convert the resources into a
Wikibase-compatible format, which requires further modeling and modification of the lexical
data. The Wikibase is accessible at https://kurdi.wikibase.cloud. In addition to the
modeling described in §2, we provide information on the modeling and conversion of the
selected resources as follows.</p>
        <preformat># Lexical entry with language, part of speech, gender and one sense
:lex_kmr_7802180323 a ontolex:LexicalEntry, ontolex:Word ;
    ontolex:canonicalForm :form_kmr_7802180323 ;
    dct:language &lt;www.lexvo.org/page/iso639-3/kmr&gt; ;
    rdfs:label "partî"@kmr-latn ;
    lexinfo:partOfSpeech lexinfo:noun ;
    ontolex:sense :kmr_8301494711_sense ;
    lexinfo:gender lexinfo:feminine .

# Canonical (lemma) form
:form_kmr_7802180323 a ontolex:Form ;
    ontolex:writtenRep "partî"@kmr-latn ;
    lexinfo:number lexinfo:singular .

# Sense with an English gloss
:kmr_8301494711_sense a ontolex:LexicalSense ;
    skos:definition "political party"@en .

# Usage example with its English translation
:kmr_8301494711_sense ontolex:usage [
    rdf:value "partiyên kurd yên siyasî"@kmr-latn ;
    rdf:value "Kurdish political parties"@en ] .</preformat>
        <sec id="sec-5-1-1">
          <title>4.1. Data Remodelling</title>
          <p>One of the major ongoing issues in creating Kurdî Wikibase is the lack of the language codes
sdh and hac, for Southern Kurdish and Gorani respectively, on Wikibase. To tackle this,
we use ku as the language code for all the varieties, but point to an item describing the
variety using dct:language. This can be further refined once language codes are added
to Wikibase.<sup>10</sup></p>
          <p>In addition to language codes, we also face specific challenges related to the
lexicographical data of the sources, particularly in orthographic normalization using Latin vs.
Perso-Arabic scripts and spelling variations. This is of importance to LOD technologies
given that duplicated entries, i.e. several entries that describe the same lexeme, should
be avoided. Therefore, we verify and unify scripts among the resources to conform with
the orthographies that are widely used, e.g. ë [Q] for a glottal stop, is replaced with ‘‵’.
Moreover, some of the headwords in the selected resources contained punctuation marks,
which are removed.</p>
          <p>Usage examples on Kurdî Wikibase are attached to a sense and described with their
English translation, while on Wikidata, usage examples are attached to a lexeme and qualified
with their subject sense. On the one hand, our modeling corresponds to the OntoLex
source, where senses point to usage examples via ontolex:usage; on the other, it
made the upload process more convenient, since a usage example attached to a lexeme
cannot be qualified with the URI of its subject sense until that sense has received an
identifier, which does not happen until the entity data is written to the Wikibase. When
transferring to Wikidata, we attach usage examples to lexemes using wd:P5831.</p>
        </sec>
        <sec id="sec-5-1-2">
          <title>4.2. Transfer to Wikibase</title>
          <p>While Wikibase data output is available in RDF and in a JSON format,<sup>11</sup> the format
required for upload is the latter. We have used Python modules for parsing the source
RDF Turtle files<sup>12</sup> and for producing and uploading data in the required JSON
representation.<sup>13</sup> The mapping of source URIs to Wikibase URIs is hardcoded in the script, but
could also be taken from Wikibase, since URI alignments are also represented there.<sup>14</sup></p>
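          <p>As a rough illustration of the JSON representation required for upload, the following Python sketch assembles the data for one lexeme with a single sense. It is not the script used in this work: the function name build_lexeme_json and the item IDs Q1 and Q2 are hypothetical placeholders, and the key layout (including the empty "add" marker for a new sense) is an assumption based on the format documentation cited in footnote 11, against which it should be checked.</p>
          <preformat>
```python
import json


def build_lexeme_json(lemma, lang_code, language_item, category_item, glosses):
    """Assemble the JSON data for one new lexeme with one sense."""
    return {
        "type": "lexeme",
        "lemmas": {lang_code: {"language": lang_code, "value": lemma}},
        "language": language_item,  # item describing the language variety
        "lexicalCategory": category_item,  # item describing the part of speech
        "senses": [
            {
                "add": "",  # assumed marker for a sense that is to be created
                "glosses": {
                    code: {"language": code, "value": text}
                    for code, text in glosses.items()
                },
            }
        ],
    }


# Q1 and Q2 stand in for the language and part-of-speech items of the
# target Wikibase; "ku" is the umbrella language code described in 4.1.
payload = build_lexeme_json("partî", "ku", "Q1", "Q2", {"en": "political party"})
print(json.dumps(payload, ensure_ascii=False, indent=2))
```
          </preformat>
          <p>In practice, a library such as WikibaseIntegrator produces and uploads this representation, so a dictionary like the one above would rarely be built by hand.</p>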
        </sec>
        <sec id="sec-5-1-3">
          <title>4.3. Federation with Wikidata</title>
          <p>Since our focus in this project is on creating a Wikibase for Kurdish with minimal
manual manipulation and verification of the data, we aim to raise awareness in the related
community and encourage contributions to Kurdî Wikibase. This way, the existing data can not only be
checked, but also completed with missing senses and headwords. This is, in fact, a
chance for such under-represented communities to promote their language on their own
Wikibase. Ultimately, this results in cleaner data without creating uncertain or noisy
material on Wikidata. Once the community has gained experience on its own Wikibase, its members
would most probably go on to edit Wikidata when the data is transferred to that global
platform.</p>
          <p><sup>10</sup> We have filed a request for inclusion of these codes in subsequent releases of the Wikibase software.</p>
          <p><sup>11</sup> https://www.mediawiki.org/wiki/Wikibase/DataModel/JSON</p>
          <p><sup>12</sup> RDFLib: https://rdflib.readthedocs.io</p>
          <p><sup>13</sup> WikibaseIntegrator: https://github.com/LeMyst/WikibaseIntegrator</p>
          <p><sup>14</sup> Details at https://kurdi.wikibase.cloud/wiki/Project_Log</p>
          <p>Once curation tasks are done, transferring the Kurdish lexical data from Kurdî Wikibase
to Wikidata will be straightforward, as long as all URIs involved are aligned; that is done using a
Wikibase property of type external identifier (kdb:P1 in our case), which points to the equivalent Wikidata
entity. Some particularities of our Wikibase data have to be taken into
account, e.g. the different locations of usage examples. Only two open issues remain to be solved:
how to model the English translations of usage examples, and the already mentioned
lack of certain language codes. Both issues are already being addressed in Wikidata lexemes
community discussions. Fig. 2 shows a modeling proposal for an example Kurmanji
entry on Wikidata.</p>
          <preformat>wd:L1083983-S1 a ontolex:LexicalSense ;
    skos:definition "political party"@en ;
    wdt:P5137 wd:Q7278 .  # item for this sense: political party</preformat>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Use Case 2: Latin and Wikidata</title>
      <sec id="sec-6-1">
        <title>5.1. The LiLa Project</title>
        <p>Latin is a widely attested and studied language: its written attestations range from the
earliest inscriptions (variously dated from the 7th to the 5th century BCE) to the present day
(as Latin is the official language of the Vatican State). Over more than two
millennia, Latin has been spoken by several peoples in Europe; it outlived the political
domination of Rome over the Mediterranean and part of Central Europe as the main
international language of culture for centuries. This resulted in a corpus of texts that
shows large diatopic, diastratic, and diaphasic variation and covers a wide
range of genres. Notably, Latin, as a member of the Indo-European family and thus historically
related to many widely attested languages of the world, is also the direct ancestor of the
Romance languages, spoken in many countries of Europe and of the world.</p>
        <p>In the last decades, several lexical and textual digital resources have been individually
developed for Latin, often as retro-digitizations of earlier printed sources. Most of these
resources are the product of separate efforts by the research community across multiple
decades and waves of digitization campaigns. While their existence is of great importance,
they do not rely on common vocabularies and ontologies, nor do they provide a standard
query language to access the data. Such a situation undermines their effectiveness as
tools for researchers and the general public.</p>
        <p>
          The LiLa: Linking Latin project<sup>15</sup> was started precisely to address this lack of
interoperability with the help of Semantic Web technologies. The goal of the project is to connect
the Latin lexical and textual resources currently available on the web by describing them
with a common set of vocabularies and a shared data model. A parallel aim is
to foster, by adopting a common model for data representation, interoperability with
resources for the other languages of the Indo-European and of the Romance family.
Particularly in the field of etymology and language contact, LiLa aims at representing
phenomena like inheritance [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and borrowing [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>
          A key process that, for a morphologically rich language like Latin, concerns both lexical
and textual resources is lemmatization. Lexicons are indexed using canonical citation forms
of lexemes; lexical research in Latin corpora is only possible with lemmatized
text. Therefore, the core of LiLa is the LiLa Knowledge Base (LKB),
a collection of Latin word forms that are conventionally used to lemmatize corpora and
index lexica [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The LKB is based on Ontolex-Lemon: lemmas are defined as instances
of the class ontolex:Form and are described according to a series of morphological
features, such as part of speech, using the list of the “upos” of the Universal Dependencies
project [11], and grammatical gender (used for nouns only). Nouns, verbs, pronouns
and adjectives are also classified by inflection type, adopting the traditional classes
used in Latin grammars. Finally, about 47,000 lemmas of Classical Latin also carry
information on the prefixes and suffixes that can be identified in their formation, derived
from the data compiled for the Word Formation Latin lexical resource [12].
        </p>
        <p>The list of lemmas, as well as the morphological analyses of each of them, is derived
from the morphological analyzer Lemlat [13]. Currently, the LKB includes ca. 200,000
lemmas.</p>
        <p>Starting from the LKB, LiLa is now connecting a growing collection of lexical resources
[14] and textual corpora [15, 16]. The former, all described with the Ontolex-Lemon
model, link a series of instances of ontolex:LexicalEntry to the lemmas in the
LKB via the ontolex:canonicalForm property. These lexica include a manually revised
version of the Latin WordNet, a valency lexicon [17], and a Latin-English dictionary [18].</p>
      </sec>
      <sec id="sec-6-2">
        <title>5.2. Data Remodelling</title>
        <p>Before we began our work to integrate information from the LKB, the Wikidata lexeme
collection included 32,183 entries in Latin. Most of them originated from the word
list used in William Whitaker’s WORDS software for morphological analysis of Latin
forms.<sup>16</sup> While there were overlaps with this preexisting collection, the LKB held a
substantial amount of additional material. Our work was split into two subtasks: (1)
aligning the two resources for the preexisting words; (2) setting up a workflow for the
integration of new Wikidata Latin lexemes from the data in the LKB.</p>
        <p>
          The 32,183 preexisting lexemes were distributed across 25 values of wikibase:lexicalCategory,
some of which, like “hapax legomenon” (wd:Q168417) or “letter” (wd:Q9788), had
no correspondence to LiLa’s parts of speech. In particular, phrases, idioms, and other
multi-word expressions did not have any equivalent in the LKB. Prefixes and affixes used
in word formation processes were also among the Wikidata Latin lexemes. Although, as
said, the LKB includes affixes, they are currently not aligned with the OntoLex ontology
[
          <xref ref-type="bibr" rid="ref10">10, 191</xref>
          ]; we therefore decided not to consider them in our first alignment.
        </p>
        <p>The preexisting lexemes were matched both by lemma string (comparing the ontolex:writtenRep
of the LKB lemmas with the lemma string stored as the data property wikibase:lemma)
and by POS (manually mapping Universal Dependencies POS to the corresponding
Wikidata item to use as the value for wikibase:lexicalCategory, whenever possible).
26,379 lexemes (81.97% of the preexisting Latin words in Wikidata) were matched
unambiguously to one lemma in the LKB; 1,356 (4.21%) were ambiguous (matching more than one
lemma in the LKB), while 4,448 lexemes (13.82%) were not found.</p>
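        <p>The matching step can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the procedure actually used: the function name classify_matches and all identifiers in the toy data are invented, and the POS values are assumed to have been mapped to a common vocabulary beforehand. The last lines recompute the percentages from the counts reported above.</p>
        <preformat>
```python
from collections import defaultdict


def classify_matches(wikidata_lexemes, lkb_lemmas):
    """Sort lexemes into unique, ambiguous and unmatched by (lemma string, POS)."""
    index = defaultdict(list)
    for lemma_uri, written_rep, pos in lkb_lemmas:
        index[(written_rep, pos)].append(lemma_uri)
    result = {"unique": {}, "ambiguous": {}, "not_found": []}
    for lexeme_id, lemma, category in wikidata_lexemes:
        candidates = index.get((lemma, category), [])
        if len(candidates) == 1:
            result["unique"][lexeme_id] = candidates[0]
        elif candidates:
            result["ambiguous"][lexeme_id] = candidates
        else:
            result["not_found"].append(lexeme_id)
    return result


# Toy data with invented identifiers; "dico" has two homographic verb lemmas.
lkb = [("lemma-1", "dico", "verb"), ("lemma-2", "dico", "verb"), ("lemma-3", "rosa", "noun")]
wikidata = [("lexeme-1", "rosa", "noun"), ("lexeme-2", "dico", "verb"), ("lexeme-3", "et", "conjunction")]
matches = classify_matches(wikidata, lkb)
print(matches)

# Shares of the reported counts among the 32,183 preexisting lexemes.
counts = {"unique": 26379, "ambiguous": 1356, "not_found": 4448}
total = sum(counts.values())
shares = {name: round(100 * count / total, 2) for name, count in counts.items()}
print(total, shares)
```
        </preformat>
        <p>Ambiguous cases such as the homographs of dico keep all candidate lemmas, so that they can be resolved later with additional information such as the inflection type.</p>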
        <p>Of the latter, more than 1,000 entries could be further matched using simple rules of
alignment for POS and lexical categories (e.g. nouns of Wikidata and proper nouns of
the LKB). In total, we aligned 27,486 entries; these entries in Wikidata are now linked to
the corresponding lemmas in the LKB via the special property “LiLa Linking Latin URI”
(wdt:P11033, see §5.3). The work of semi-automatic disambiguation and matching of
the remaining lexemes is still in progress.</p>
        <p>The lexical database of the Lemlat analyzer, from which, as said, the LKB was originally
derived, aggregated lemmas from three classes of sources: a dictionary for Classical Latin,
an Onomasticon of proper nouns, and a dictionary of Medieval Latin [cf. 13]. In our first
experiment on expanding the Wikidata Latin lexeme inventory, we decided to concentrate
on the lexicon of Classical Latin (which was already covered, though not exhaustively, by
the collection from Whitaker’s WORDS).</p>
        <p>We identified 24,007 additional lemmas belonging to the Classical base of the LKB
that were not present in Wikidata. We proceeded to create the new lexemes, with the
mandatory information of the lemma string (mapped onto the rdfs:label of the LKB
lemma), the lexical category and the grammatical gender for nouns.</p>
        <p>Only a handful (49) of the preexisting Latin lexemes in Wikidata had information on the
inflection type, as a paradigm class (wdt:P5911) or conjugation class (wdt:P5186). These
properties link nouns and verbs to, respectively, the 5 declensions and 4 conjugations of
traditional grammars (e.g. wd:Q3921592 for the Latin first declension of a-stems). As
noted above, LiLa includes comprehensive information on inflection types. We consider these
data particularly relevant for enriching the Latin lexemes, as the classification of the
lexemes into morphological types can potentially be used to support disambiguation
of homographs (e.g. dico “proclaim, dedicate”, 1st conjugation, vs dico “say”, 3rd
conjugation), or automatic form generation.</p>
        <table-wrap>
          <label>Table 2</label>
          <table>
            <thead>
              <tr>
                <th colspan="2">LiLa</th>
                <th colspan="2">Wikidata</th>
              </tr>
              <tr>
                <th>URI</th>
                <th>Label</th>
                <th>URI</th>
                <th>Label</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>lila:n2</td>
                <td>2nd declension (m/f) nouns</td>
                <td>wd:Q3953983</td>
                <td>2nd decl.</td>
              </tr>
              <tr>
                <td>lila:n2e</td>
                <td>2nd decl irregular nouns</td>
                <td>wd:Q3953983</td>
                <td>2nd decl.</td>
              </tr>
              <tr>
                <td>lila:n6</td>
                <td>1st class adj.</td>
                <td>wd:Q3606519</td>
                <td>Adj. of 1st class</td>
              </tr>
              <tr>
                <td>lila:v3r</td>
                <td>3rd conj. verb</td>
                <td>wd:Q54295441</td>
                <td>Latin 3rd conj.</td>
              </tr>
              <tr>
                <td>lila:v3e</td>
                <td>3rd conj. impersonal verb</td>
                <td>wd:Q54295441</td>
                <td>Latin 3rd conj.</td>
              </tr>
              <tr>
                <td>lila:n</td>
                <td>uninflected noun/adj.</td>
                <td>—</td>
                <td>—</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>We proceeded to align the 53 inflection types used in LiLa to the relevant entities
in Wikidata. In 5 cases, we either created new Wikidata items, since a few classes in
LiLa (primarily for uninflected or invariable words) did not have any match in Wikidata,
or added English labels to preexisting entries (which, as in the case of
wd:Q3606519, had only an Italian one). Some examples of this alignment are reported in
Table 2. The classification in the LKB is more fine-grained, as LiLa distinguishes several
sub-classes of the traditional declensions and conjugations; for instance, LiLa includes
subdivisions for the irregular nouns of the 2nd declension and for the impersonal verbs of the 3rd
conjugation, which are mapped to the general Wikidata categories for 2nd declension
and 3rd conjugation respectively (see the examples in Table 2).</p>
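        <p>The many-to-one character of this alignment can be made explicit with a small lookup table. The sketch below encodes only the example rows of Table 2; the function names are hypothetical, and None stands for the row that lists no Wikidata item.</p>
        <preformat>
```python
from collections import defaultdict

# Example rows of Table 2: LiLa inflection type -> Wikidata item.
# None marks the row for which the table lists no Wikidata item.
LILA_TO_WIKIDATA = {
    "lila:n2": "Q3953983",
    "lila:n2e": "Q3953983",
    "lila:n6": "Q3606519",
    "lila:v3r": "Q54295441",
    "lila:v3e": "Q54295441",
    "lila:n": None,
}


def wikidata_class(lila_type):
    """Return the Wikidata item for a LiLa inflection type, or None if unmapped."""
    return LILA_TO_WIKIDATA.get(lila_type)


def merged_types(mapping):
    """Group LiLa types by Wikidata item, keeping items shared by several types."""
    groups = defaultdict(list)
    for lila_type, item in mapping.items():
        if item is not None:
            groups[item].append(lila_type)
    return {item: types for item, types in groups.items() if len(types) > 1}


print(wikidata_class("lila:n2e"))
print(merged_types(LILA_TO_WIKIDATA))
```
</preformat>
        <p>The second call lists the Wikidata items onto which several finer-grained LiLa classes collapse, which is the information lost when moving from the LKB classification to the Wikidata one.</p>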
        <p>The enrichment of the Latin lexemes with the information on inflection taken from
LiLa is currently ongoing.</p>
      </sec>
      <sec id="sec-6-3">
        <title>5.3. Transfer to Wikidata</title>
        <p>Since, in this case, we are interacting directly with Wikidata, we have gone through the
process established by the community both to create a new external identifier property for
linking to the LKB and to request bot permission for batch writing. We have uploaded
the data using Python modules (see §4.2), and have documented the upload process.<sup>17</sup></p>
        <p><sup>17</sup>See https://www.wikidata.org/wiki/Property_talk:P11033.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusions</title>
      <p>The published datasets for the Gorani and Southern Kurdish varieties are unique, as those
varieties are rarely represented on the web. Together with the Sorani and Kurmanji data, they are
now accessible and re-usable as part of the Wikibase ecosystem and, once the curation tasks
discussed above are completed, will be ready for transfer to Wikidata.</p>
      <p>As for Latin, Wikidata now includes 56,202 lexemes, 51,492 of which provide a
link to the LKB via the property wdt:P11033. For these lexemes, a federated query over
the LiLa SPARQL endpoint<sup>18</sup> already gives access to the wealth of information provided
in the LiLa resources. Starting from a lexeme in Wikidata, for instance, it would be
possible to retrieve all the occurrences of the words lemmatized under the connected
lemma in the collection of LiLa corpora.</p>
      <p>Lexemes in Wikidata are typically enriched with their inflected forms (which are
available, for instance, for those obtained from Whitaker’s WORDS) and their senses.
Currently, LiLa does not link the canonical forms in the LKB to any other forms of the
same word; in other words, it will not be possible to use LiLa to retrieve an exhaustive
list of the forms of a lexeme (unless the scope is limited to the forms attested in the
LiLa corpora and lemmatized under any given lemma). Plans to create a lexical resource
linked to LiLa with all possible inflected forms are, however, under development. In the
near future, therefore, it will be possible to use LiLa to enhance the catalog of forms for
the Latin lexemes.</p>
      <p>The repertoire of senses, by contrast, is already available for a limited number of
Latin words. The 53,437 lexical entries from the Lewis and Short Latin-English dictionary
[cf. 18] and the 6,269 entries of the Latin WordNet are all provided with definitions and
senses. For the latter, in particular, links to the synsets in the Princeton WordNet are
available. Since Wikidata concepts are linked to the Princeton WordNet using wd:P8814, we
will use these existing alignments, wherever a Princeton WordNet ID in the Latin WordNet
also appears in Wikidata, to set wd:P5137 relations between Latin lexeme senses
and Wikidata concepts.</p>
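      <p>The planned sense alignment amounts to a join on the Princeton WordNet synset ID. The following sketch illustrates that join under simplifying assumptions: the function name, the input layouts and all identifiers in the toy data are placeholders, and only the two property IDs (P8814 for the synset link, P5137 for the sense-to-concept relation) come from the text.</p>
      <preformat>
```python
def propose_sense_links(latin_senses, items_by_synset):
    """Propose P5137 statements by joining on the Princeton WordNet synset ID.

    latin_senses: iterable of (sense_id, synset_id) taken from the Latin WordNet
    items_by_synset: dict mapping a synset ID to the Wikidata items that carry
        it as a P8814 value
    """
    proposals = []
    for sense_id, synset_id in latin_senses:
        # Senses whose synset has no Wikidata counterpart yield no proposal.
        for item_id in items_by_synset.get(synset_id, []):
            proposals.append((sense_id, "P5137", item_id))
    return proposals


# Placeholder identifiers, for illustration only.
senses = [("sense-1", "synset-A"), ("sense-2", "synset-B")]
items = {"synset-A": ["Q-example-1"]}
print(propose_sense_links(senses, items))
```
</preformat>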
      <p>We are recording all direct alignments between OntoLex and lexinfo URIs and Wikidata
entities defined in the use cases presented in this paper, as well as indirect correspondences
(those that imply some re-modeling), and plan to expand these towards all entities in the
OntoLex and lexinfo ontologies, to provide general guidelines for transferring OntoLex
datasets to Wikidata.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This research was supported by the project “Monumenta Linguae Vasconum 6: avances en
cronologia de la historia y la prehistoria de la lengua vasca” (MINECO,
PID2020-118445GB-I00) and the research group “Hizkuntzalaritza Diakronikoa, Tipologia eta Euskararen
Historia / Diachronic Linguistics, Typology and the History of Basque (DLTB)” (Basque
Government, IT1534-2), and was carried out within the framework of the Nexus
Linguarum COST Action (European Union, CA-18209).</p>
      <p>58 (2020) 177–212. doi:10.4454/ssl.v58i1.277.</p>
      <p>[11] M.-C. de Marneffe, C. D. Manning, J. Nivre, D. Zeman, Universal Dependencies,
Computational Linguistics 47 (2021) 255–308. doi:10.1162/coli_a_00402.</p>
      <p>[12] M. Pellegrini, M. Passarotti, E. Litta, F. Mambrini, G. Moretti, C. Corbetta,
M. Verdelli, Enhancing Derivational Information on Latin Lemmas in the LiLa
Knowledge Base. A Structural and Diachronic Extension, Prague Bulletin of Mathematical
Linguistics 119 (2022) 67–92. URL: http://ufal.mff.cuni.cz/pbml/119/art-pellegrini-et-al.
doi:10.14712/00326585.023.</p>
      <p>[13] M. Passarotti, M. Budassi, E. Litta, P. Ruffolo, The Lemlat 3.0 Package for
Morphological Analysis of Latin, in: G. Bouma, Y. Adesam (Eds.), Proceedings
of the NoDaLiDa 2017 Workshop on Processing Historical Language, volume 133,
Linköping University Electronic Press, Gothenburg, 2017, pp. 24–31.
URL: http://www.ep.liu.se/ecp/article.asp?issue=133&amp;article=006&amp;volume=.</p>
      <p>[14] M. Passarotti, F. Mambrini, Linking Latin: Interoperable lexical resources in the LiLa
project, in: E. Biagetti, C. Zanchi, S. Luraghi (Eds.), Building new resources for
historical linguistics, Pavia University Press, Pavia, 2021, pp. 103–124.</p>
      <p>[15] F. Mambrini, M. Passarotti, G. Moretti, M. Pellegrini, The Index
Thomisticus Treebank as Linked Data in the LiLa Knowledge Base, in: Proceedings
of the Thirteenth Language Resources and Evaluation Conference, European
Language Resources Association, Marseille, France, 2022, pp. 4022–4029. URL:
https://aclanthology.org/2022.lrec-1.428.</p>
      <p>[16] M. Fantoli, M. Passarotti, F. Mambrini, G. Moretti, P. Ruffolo, Linking the LASLA
Corpus in the LiLa Knowledge Base of Interoperable Linguistic Resources for Latin,
in: Proceedings of the 8th Workshop on Linked Data in Linguistics within the 13th
Language Resources and Evaluation Conference, European Language Resources
Association, Marseille, France, 2022, pp. 26–34. URL: https://aclanthology.org/2022.ldl-1.4.</p>
      <p>[17] F. Mambrini, M. Passarotti, E. Litta, G. Moretti, Interlinking Valency Frames
and WordNet Synsets in the LiLa Knowledge Base of Linguistic Resources for
Latin, in: Further with Knowledge Graphs. Studies on the Semantic Web 53,
IOS Press, Amsterdam, 2021. URL: https://ebooks.iospress.nl/doi/10.3233/SSW210032.
doi:10.3233/SSW210032.</p>
      <p>[18] F. Mambrini, E. Litta, M. Passarotti, P. Ruffolo, Linking the Lewis &amp; Short
Dictionary to the LiLa Knowledge Base of Interoperable Linguistic Resources for
Latin, in: E. Fersini, M. Passarotti, V. Patti (Eds.), Proceedings of the Eighth
Italian Conference on Computational Linguistics CliC-it 2021, Accademia University
Press, 2022, pp. 214–220. URL: http://books.openedition.org/aaccademia/10713.
doi:10.4000/books.aaccademia.10713.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Vrandečić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          ,
          <article-title>Wikidata: A Free Collaborative Knowledge Base</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>57</volume>
          (
          <year>2014</year>
          )
          <fpage>78</fpage>
          -
          <lpage>85</lpage>
          . URL: http://cacm.acm.org/magazines/2014/10/178785-wikidata/fulltext.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Nielsen</surname>
          </string-name>
          ,
          <article-title>Lexemes in Wikidata: 2020 status</article-title>
          , in:
          <source>Proceedings of the 7th Workshop on Linked Data in Linguistics (LDL-2020)</source>
          , European Language Resources Association, Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>82</fpage>
          -
          <lpage>86</lpage>
          . URL: https://aclanthology.org/2020.ldl-1.12.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bosque-Gil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gracia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Buitelaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          ,
          <article-title>The OntoLex-Lemon Model: Development and Applications</article-title>
          , in:
          <string-name>
            <given-names>I.</given-names>
            <surname>Kosem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tiberius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jakubíček</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kallas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Krek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Baisa</surname>
          </string-name>
          (Eds.),
          <source>Electronic lexicography in the 21st century: Lexicography from scratch. Proceedings of eLex 2017</source>
          , Lexical Computing CZ s.r.o., Brno,
          <year>2017</year>
          , pp.
          <fpage>587</fpage>
          -
          <lpage>597</lpage>
          . URL: https://elex.link/elex2017/wp-content/uploads/2017/09/paper36.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Declerck</surname>
          </string-name>
          ,
          <article-title>Towards a Linked Lexical Data Cloud based on OntoLex-Lemon</article-title>
          , in:
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chiarcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Declerck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gracia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Klimek</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the LREC 2018 Workshop "6th Workshop on Linked Data in Linguistics LDL-2018"</source>
          , Miyazaki, Japan,
          <year>2018</year>
          . URL: http://lrec-conf.org/workshops/lrec2018/W23/index.html.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Nielsen</surname>
          </string-name>
          ,
          <article-title>Ordia: A Web Application for Wikidata Lexemes</article-title>
          , in: P. Hitzler,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kirrane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hartig</surname>
          </string-name>
          , V. de Boer, M.-E. Vidal,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maleshkova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schlobach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hammar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lasierra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Stadtmüller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hose</surname>
          </string-name>
          , R. Verborgh (Eds.),
          <source>The Semantic Web: ESWC 2019 Satellite Events</source>
          , volume
          <volume>11762</volume>
          , Springer International Publishing, Cham,
          <year>2019</year>
          , pp.
          <fpage>141</fpage>
          -
          <lpage>146</lpage>
          . URL: http://link.springer.com/10.1007/978-3-030-32327-1_28. doi:10.1007/978-3-030-32327-1_28, series Title: Lecture Notes in Computer Science.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Azin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmadi</surname>
          </string-name>
          ,
          <article-title>Creating an Electronic Lexicon for the Under-resourced Southern Varieties of Kurdish Language</article-title>
          ,
          <source>Proceedings of Seventh Biennial Conference on Electronic Lexicography (eLex 2021)</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hassani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Towards electronic lexicography for the Kurdish language</article-title>
          ,
          in:
          <source>Proceedings of the sixth biennial conference on electronic lexicography (eLex)</source>
          , Sintra, Portugal,
          <year>2019</year>
          , pp.
          <fpage>881</fpage>
          -
          <lpage>906</lpage>
          . URL: https://elex.link/elex2019/wp-content/uploads/2019/09/eLex_2019_50.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Mambrini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Passarotti</surname>
          </string-name>
          ,
          <article-title>Representing Etymology in the LiLa Knowledge Base of Linguistic Resources for Latin</article-title>
          ,
          in:
          <source>Proceedings of the Globalex Workshop on Linked Lexicography. LREC 2020 Workshop</source>
          , European Language Resources Association (ELRA), Paris,
          <year>2020</year>
          , pp.
          <fpage>20</fpage>
          -
          <lpage>28</lpage>
          . doi:10.5281/zenodo.3862156.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Franzini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zampedri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Passarotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Mambrini</surname>
          </string-name>
          , G. Moretti,
          <article-title>Graecissâre: Ancient Greek Loanwords in the LiLa Knowledge Base of Linguistic Resources for Latin</article-title>
          ,
          in:
          <source>Proceedings of the Seventh Italian Conference on Computational Linguistics. Bologna, Italy, March 1-3, 2021</source>
          , CEUR-WS.org, Bologna,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . URL: http://ceur-ws.org/Vol-2769/paper_06.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Passarotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Mambrini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Franzini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Cecchini</surname>
          </string-name>
          , E. Litta, G. Moretti,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ruffolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          ,
          <article-title>Interlinking through Lemmas. The Lexical Collection of the LiLa Knowledge Base of Linguistic Resources for Latin</article-title>
          , Studi e Saggi Linguistici
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>