

Linking the Dictionary of Medieval Latin in the Czech Lands to the LiLa Knowledge Base

Federica Gamba

Marco C. Passarotti

Paolo Ruffolo

Charles University, Faculty of Mathematics and Physics, Malostranské náměstí 25, 118 00 Prague, Czechia
Università Cattolica del Sacro Cuore, largo A. Gemelli 1, 20123 Milan, Italy

2023


The paper presents the process of linking the Dictionary of Medieval Latin in the Czech Lands to the LiLa Knowledge Base, which adopts the Linked Data paradigm to make linguistic resources for Latin interoperable. An overview of the Dictionary and of the architecture of the LiLa Knowledge Base is first provided; then, the stages of the process of linking the Dictionary to LiLa's collection of lemmas are described. In conclusion, a query illustrates how interoperability allows for full exploitation of Latin resources.

Keywords: Linked Data, LiLa Knowledge Base, dictionary, Medieval Latin

1. Introduction

Many resources are available for Latin, making it a particularly privileged language among the historical ones. However, most often those resources are scattered, with their sparsity representing a substantial hindrance to the full exploitation of the information they contain. To overcome the sparsity of resources, stored in separate silos, the CIRCSE Research Center in Milan, Italy, started the LiLa - Linking Latin project (2018-2023), which built a Knowledge Base to make all existing textual and lexical resources for Latin interoperable by adopting the four principles of the Linked Open Data (LOD) paradigm [1]: 1) use URIs as names for things; 2) use HTTP URIs so that people can look up those names; 3) when someone looks up a URI, provide useful information; 4) include links to other URIs, so that they can discover more things.

The LiLa Knowledge Base already has wide coverage in terms of interlinked resources. Classical Latin is naturally well-represented, as proved by the LASLA corpus, which includes 130 Classical Latin texts [2], and by the Lewis and Short dictionary [3], whose primary focus is on Classical Latin. Later stages of Latin are found as well in the Knowledge Base; for instance, the Index Thomisticus Treebank [4] comprises texts by Thomas Aquinas (1225–1274), the UDante treebank [5] encompasses Medieval Latin works written by Dante Alighieri, and the Computational Historical Semantics Corpus [6] includes e.g. the Decretum Gratiani, a collection of canon law compiled in the XII century.

However, while the LiLa Knowledge Base already extends over a large temporal range, its spatial coverage is not as wide. So far, no resource from the Eastern European areas where Latin was spoken has been linked. For this reason, we decided to link to LiLa the Dictionary of Medieval Latin in the Czech Lands, a lexical resource that aims at collecting the Latin vocabulary as it emerged in that area during the Middle Ages. The resource encompasses a late variety of Latin (1000-1500 CE), strongly tied to a specific geographical area. These two levels of variability, along the temporal and spatial axes, make it extremely interesting to link such a resource to the Knowledge Base, as we expect it to contribute to enlarging the number of lemmas stored in the large collection of Latin lemmas that represents the core part of the whole architecture of LiLa.

The paper is organised as follows. Section 2 introduces the LiLa Knowledge Base. Section 3 describes the Dictionary. Section 4 outlines the process of linking the Dictionary to LiLa. Section 5 shows the added value of interoperability of Latin resources in LiLa by presenting a query on the interlinked Dictionary.

2. The LiLa Knowledge Base

The LiLa Knowledge Base [7] achieves interoperability between linguistic resources for Latin by adopting a set of ontologies widely used to model linguistic information, as well as Semantic Web and Linked Data standards. Among the former, OLiA is used to model linguistic annotation [8], Ontolex-Lemon for lexical data [9, 10] and POWLA for corpus data [11]. As for the latter, the Resource Description Framework (RDF) [12] is a data model used to describe information in terms of triples, consisting of: (1) a predicate-property that connects (2) a subject (i.e. a resource) with (3) its object (another resource or a literal). Data recorded in the form of RDF triples are queried via the SPARQL query language [13].

The architecture of the LiLa Knowledge Base is highly lexically-based, as it exploits the lemma as the most productive interface between resources and tools. Indeed, its core is the so-called Lemma Bank, a collection of around 200,000 lemmas taken from the database of the morphological analyser LEMLAT [14] and constantly extended. A lila:Lemma³ is a subclass of ontolex:Form⁴, whose individuals are the inflected forms of a lexical item. In particular, the lemma is a form that can be linked to an ontolex:LexicalEntry⁵ via the property ontolex:canonicalForm,⁶ which identifies the form that is canonically used to represent a lexical entry.

To overcome divergent lemmatisation criteria that may be adopted in resources, LiLa exploits three key properties. The symmetric property lila:lemmaVariant⁷ connects different forms of the same lexical item that can be used as lemmas for that item, as with verbs that have both an active and a deponent inflection (e.g., sequo and sequor ‘to follow’). The property ontolex:writtenRep⁸ registers different spellings or graphical variants of one lemma, for instance conditio and condicio ‘condition’. For forms that can be reduced to multiple lemmas, like participles (which can be considered either part of the verbal inflectional paradigm or independent lemmas), a special sub-class of lila:Lemma called lila:Hypolemma⁹ is defined.

3. The Dictionary of Medieval Latin in the Czech Lands

The Dictionary of Medieval Latin in the Czech Lands¹⁰ is a lexical resource developed at the Department of Medieval Lexicography of the Institute of Philosophy of the Czech Academy of Sciences. It aims to collect the vocabulary of Medieval Latin as it was used in the Czech lands from about 1000 CE, when Latin writing began in the area, to 1500 CE. In light of this aim, the Dictionary features three types of entries:

• Vocabulary taken from Classical Latin without any semantic change in the Middle Ages. Only source citations with a translation illustrate its meaning. E.g., labellum ‘small lip’.
• Vocabulary taken from Classical Latin with changes. This type of entry is composed of two parts: first, ancient meanings are listed; then, the + sign introduces Medieval developments (syntactical alternations, new phrases, meanings of the word coined in Medieval times). E.g., falcatus ‘curved’ + ‘shod’.
• Vocabulary that emerged during the Middle Ages. Such entries are marked with an asterisk (*). Square brackets [ ] with etymology and references to other dictionaries including the word follow the heading of the entry. E.g., emicamen ‘splendour, clarity’.

Moreover, the Dictionary relies on a differential method to capture all divergences, at several linguistic layers, of Medieval Latin vocabulary inherited from the ancient era as compared with the Classical norms. Indeed, it records language phenomena not attested in the 8th edition (and later unchanged editions) of Georges’ Latin-German Lexicon [15].

The material the Dictionary is built upon amounts today to ca. 800,000 excerpt sheets, assembled from various sources of Czech provenance (diplomatical, official, belles-lettres, scientific literature, etc.). What is particularly valuable is that not only edited texts served as a source to build the Dictionary, but also several manuscripts and old prints from Czech and foreign libraries were used. The excerption of sources was carried out from 1934, when the project of the Dictionary started, until the 1970s. In 1977 the first fascicle was published, illustrating editorial principles and lists of sources and abbreviations. Overall, the electronic database [16] is built upon, and comprises, the three volumes prepared by Silagiová and colleagues ([17], [18], [19]).

So far, letters A-M are covered, for a total of 48,452 entries. 24,943 out of these are full entries (provided with meanings, definitions, grammatical information, examples), whereas 23,509 are references that point to full entries (see 3.1). Fascicle 24, encompassing entries beginning with N, is currently under preparation.

The Dictionary is accessible through a dedicated website¹¹ and can be downloaded from the LINDAT/CLARIAH-CZ research infrastructure¹² as a compressed set of XML files.

³ https://lila-erc.eu/lodview/ontologies/lila/Lemma.
⁴ http://www.w3.org/ns/lemon/ontolex#Form.
⁵ http://www.w3.org/ns/lemon/ontolex#LexicalEntry.
⁶ http://www.w3.org/ns/lemon/ontolex#canonicalForm.
⁷ http://lila-erc.eu/ontologies/lila/lemmaVariant.
⁸ http://www.w3.org/ns/lemon/ontolex#writtenRep.
⁹ https://lila-erc.eu/lodview/ontologies/lila/Hypolemma.

¹⁰ The Czech title is Slovník středověké latiny v českých zemích; the Latin one is Latinitatis medii aevi lexicon Bohemorum.
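The triple-based modelling recalled in Section 2 can be illustrated with a minimal Python sketch. It represents RDF triples as plain (subject, predicate, object) tuples and stores the symmetric property lila:lemmaVariant in both directions. The property URIs are those given in the footnotes; the lemma identifiers under the EX prefix and the helper names are hypothetical placeholders, not actual Lemma Bank URIs, and an actual setup would presumably rely on an RDF store queried via SPARQL rather than on tuples.

```python
# ONTOLEX and LILA prefixes follow the URIs given in the footnotes.
# EX is a made-up namespace used for illustration only.
ONTOLEX = "http://www.w3.org/ns/lemon/ontolex#"
LILA = "http://lila-erc.eu/ontologies/lila/"
EX = "http://example.org/lemma/"  # hypothetical, not a LiLa URI

triples = set()


def add(subject, predicate, obj):
    """Store one RDF-like triple (subject, predicate, object)."""
    triples.add((subject, predicate, obj))


def add_lemma_variant(lemma_a, lemma_b):
    """lila:lemmaVariant is symmetric, so record both directions."""
    add(lemma_a, LILA + "lemmaVariant", lemma_b)
    add(lemma_b, LILA + "lemmaVariant", lemma_a)


def objects(subject, predicate):
    """Return all objects of a subject/predicate pair, sorted."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


# Active and deponent lemma of the same verb (example from Section 2).
add_lemma_variant(EX + "sequo", EX + "sequor")

# Two spellings of one lemma, registered as written representations.
add(EX + "condicio", ONTOLEX + "writtenRep", "condicio")
add(EX + "condicio", ONTOLEX + "writtenRep", "conditio")

print(objects(EX + "sequor", LILA + "lemmaVariant"))
print(objects(EX + "condicio", ONTOLEX + "writtenRep"))
```

Storing both directions of the symmetric property keeps lookups simple in this toy setting, since either lemma can be used as the starting point.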

3.1. XML Files

We provide a brief overview of the structure of the XML files of the Dictionary, as those data are relevant for the process of modeling information and linking the entries to the LiLa Knowledge Base. The lexical entry for the adjective exquisitus ‘exquisite’ (Figure 1) will serve as an example of the XML files of the resource.

The whole entry is encoded as the value of an entryFree element, which contains a single unstructured entry in any kind of lexical resource, such as a dictionary or lexicon. Core information about the entry is provided through attributes: the lemma is given, together with a numerical unique identifier assigned to it; georges=‘True’ or ‘False’ specifies whether an entry for the same lemma is found or not in Georges’ dictionary. Optionally, hom_nr distinguishes homographs, and type=‘reference’ denotes that the entry is just a reference to a different one; for instance, the dummy entry for geniculor ‘to bend the knee’ is just a reference to its active counterpart geniculo, which, in light of that, is the only full entry of the two (with meanings, grammatical information, etc.). Then, in the orth element the lemma is stated once again as a value; orth includes the attribute type either with value ‘lemma’, if it is a full entry, or with value ‘ref_all’, if it is a reference.

Following the lemma, the gramGrp element encodes grammatical information about the lexical item, roughly corresponding to its Part of Speech (POS) and (possibly) its inflectional category. In the case of exquisitus, the value <gramGrp> 3. </gramGrp> indicates that it is an adjective of the first class, i.e. with three distinct endings for the three genders (exquisitus, -a, -um, respectively for the forms of masculine, feminine and neuter singular nominative).

The sense elements (possibly more than one for the same entry) capture the different meanings of a lexical item. For each sense, a definition def is provided both in Latin and in Czech, with the Czech one corresponding to a translation of the Latin counterpart. Some examples are listed as well, together with their source. The label script. et form. is used to record orthographic and morphological variants (e.g., exequisitus for exquisitus), while the label metr. marks metrical ones.

¹¹ http://lb.ics.cas.cz.
¹² http://hdl.handle.net/11234/1-4792.

4. Linking the Dictionary to LiLa

This section describes the process of linking the Dictionary of Medieval Latin in the Czech Lands to the LiLa Knowledge Base. The coverage of the linking task is not yet complete, as, so far, we have been working only with full entries (i.e., excluding those with type=‘reference’).

As mentioned in Section 2, in LiLa the lemma works as interface between interlinked resources. In light of the pivotal use of lemmas, the core operation at the base of the linking process is to perform a string match between the tuples (lemma, POS) in the resource to be linked and the lemmas and their POS in the LiLa Lemma Bank. The goal is to retrieve the correct lemma in the Lemma Bank corresponding to the lemma/POS used in the entry of the Dictionary.

The string match results in three possible outcomes: a) only one matching lemma/POS is found in the Lemma Bank; b) more than one matching lemma/POS is found, resulting in an ambiguity due to homography; c) no matching lemma/POS is found, as the couple is not present in the Lemma Bank.

The first outcome is overall straightforward and does not raise particular issues. The second one, i.e. multiple matches found, requires disambiguation to be performed. To this aim, grammatical information (inflectional classes) can be exploited, although it does not always guarantee a full resolution of the ambiguity; Subsection 4.1 elaborates on this. The third possibility, i.e. missing matches, represents the most interesting outcome; firstly, because it entails enlarging the Lemma Bank with new canonical forms of citation, and secondly because it allows us to reflect on the peculiar aspects of the variety of the Latin vocabulary represented in the Dictionary, by focusing on those lexical items provided by the Dictionary that result as out-of-vocabulary with respect to the current Lemma Bank of LiLa.

4.1. Aligning Grammatical Information

In order to automatically disambiguate multiple matches, we exploit the grammatical information provided by the Dictionary in the gramGrp element. However, this information is not encoded in a fully standardised way, thus requiring an alignment to be performed. Indeed, we need to define a set of heuristics to align grammatical categories as they are encoded in the Dictionary and the set of tags employed in LiLa, which is based on the Universal POS tagset [20] and expanded with inflectional categories. As an illustration, the word acus ‘needle’ has -us, f. as gramGrp, i.e. the genitive ending and the gender. From that we can generalise and establish a correspondence between the genitive ending in -us together with the gender, as found in the Dictionary, and a NOUN with inflectional class n4¹³ in LiLa.

In most cases, grammatical information provided by the Dictionary is sufficiently fine-grained to provide all elements needed to disambiguate the multiple linking to the Lemma Bank, as it roughly consists of POS and inflection class, like in the case of acus. Yet, sometimes only information corresponding to POS is available. Several substantives are marked just as subst. (e.g., deptar, type of medicinal plant), which makes it non-trivial, if possible at all, to infer an inflectional category.

¹³ n4 corresponds to fourth declension nouns.

4.2. Linking to the Lemma Bank

After aligning the two tagsets, we proceed to link the Dictionary entries to the Lemma Bank. The one-to-one matches, i.e. lemmas in the Dictionary that match with just one lemma in the LiLa Lemma Bank with respect to both lemma and POS, have been considered validated. The following subsections discuss the two other scenarios, namely one-to-many and one-to-zero matches.

4.2.1. One-to-Many

The string match on lemma and POS results in 827 ambiguous matches. Therefore, we add inflectional class as a further constraint; as a result, 303 lemmas are disambiguated automatically, while 445 still remain ambiguous and need to be inspected manually. For instance, for lacertus a correspondence in the Lemma Bank is found with lacertus ‘upper arm’ and lacertus ‘lizard; a seafish’, both NOUNs of the second declension (inflectional class n2). Only a manual check of the meaning can thus identify the correct match.

4.2.2. One-to-Zero

After performing the string match on lemma and POS, no match is found in the LiLa Lemma Bank for 10,278 lemmas. Among those, we automatically handle adverbs, verbs and pluralia tantum to find out whether they could be linked to the Lemma Bank respectively as hypolemmas of an adjective, lemma variants of a corresponding verb with opposite voice (active if deponent and vice versa) or lemma variant of a noun in singular form. By defining a set of heuristics applied automatically, we find that: (a) 92 adverbs can be linked to the adjective they are derived from (e.g., homagialiter - homagialis ‘of homage’); (b) 18 verbs can be linked to their counterpart with opposite voice (e.g., attaedio - attaedior ‘to bore’); (c) 80 plural forms can be linked to their singular equivalent (moscilli - moscillus ‘little habit’).

A closer look at lemmas that remain unmatched (10,088) raises interesting insights, allowing for some linguistic considerations. First, clear evidence of areal contact is provided by forms like bosako, -onis and kamenniko, -onis. As the spelling reveals, those forms are the result of a contact with the language that was spoken in the area at that time, namely Old Czech. Indeed, bosako comes from the Czech form bosák, denoting a monk that by virtue of the rule has to walk barefoot, while kamenniko ‘stonemason’ derives from kameník. Additionally, several lemmas pertain to very specific domains. Consider e.g. ascoa, a sea animal, igenecha, a type of quadruped¹⁴, or cinapus, a species of fish, as an example of vocabulary of fauna. Flora is found as well: e.g., elipurgis, corresponding to Cynoglossum officinale, bulboquilon, ‘mandrake’, and atomana, a herb. Similar forms evidently display the specificity of some domains covered by the Dictionary of Medieval Latin in the Czech Lands.

¹⁴ Possibly the common genet.

4.3. Results

The string match on lemmas and POS tags results in 55.5% one-to-one mappings; for 3.3% of entries more than one possible match was found, while for 41.2% no match was retrieved. The number of lemmas that are not found in the Lemma Bank reflects the nature of the Dictionary, and especially its temporal, geographical and domain specificity. For comparison purposes, consider, for instance, that the process of linking the bilingual Latin-English dictionary by Lewis and Short, which is focused on Classical Latin, resulted in only 9% of unmatched lemmas [21]. The percentage of no-match entries increases to 70% in the case of the Neulateinische Wortliste by Ramminger [22], which covers a time range spanning between 1300 and 1700 and features entries mirroring contemporary changes in the society, e.g. typographus ‘typographer’.

Figure 2 shows an example of an entry of the Dictionary (exquisitus) linked to the LiLa Knowledge Base. The (yellow) node in the center of Figure 2 is the ontolex:LexicalEntry for exquisitus, which is linked via the property lime:entry¹⁵ to the node that represents the entire Dictionary (an individual of the class lime:Lexicon¹⁶) and to the corresponding lemma in the Lemma Bank via the property ontolex:canonicalForm. The lexical entry works as gateway to all information associated to it in the resource. For instance, Figure 2 shows how the two meanings associated to exquisitus in its entry in the Dictionary are modeled. The two definitions provided by the resource (in Latin and in Czech) are linked to the lexical entry as individuals of the class ontolex:LexicalSense¹⁷ via the property ontolex:sense¹⁸. Each sense is the specific lexicalisation of a more general ontolex:LexicalConcept¹⁹, to which the sense is linked via the property ontolex:isLexicalizedSenseOf²⁰. Although not visible in Figure 2, the lemma exquisitus in the Lemma Bank is linked via ontolex:canonicalForm to the entries for exquisitus in several other lexical resources and to its occurrences (tokens) in the textual resources interlinked in LiLa²¹.

¹⁵ http://www.w3.org/ns/lemon/lime#entry.
¹⁶ http://www.w3.org/ns/lemon/lime#Lexicon.
¹⁷ http://www.w3.org/ns/lemon/ontolex#LexicalSense.
¹⁸ http://www.w3.org/ns/lemon/ontolex#sense.
¹⁹ http://www.w3.org/ns/lemon/ontolex#LexicalConcept.
²⁰ http://www.w3.org/ns/lemon/ontolex#isLexicalizedSenseOf.
²¹ For the full list of the resources currently made interoperable through LiLa, see https://lila-erc.eu/data-page/.

5. Querying the Dictionary in LiLa

This Section presents a query to exemplify the added value of interoperability between the resources linked to LiLa²². The query, available within a set of precompiled queries in the SPARQL endpoint of LiLa, retrieves all those lemmas whose entries in the Dictionary include the word natura ‘nature’ in their definition(s) and do not occur also in the Lewis and Short dictionary, and returns the number of their occurrences in the textual corpora linked to LiLa.

The 11 retrieved lemmas²³ occur in 5 corpora, for a total of 132 occurrences, 5 out of which are found in the Computational Historical Semantics corpus²⁴, 104 in the Index Thomisticus Treebank²⁵, 4 in UDante²⁶, 1 in the CIRCSE Latin Library²⁷ (specifically, in Augustine’s Confessiones) and 18 in the LASLA corpus²⁸ [2]. The results of the query confirm once again the specificity of the Dictionary of Medieval Latin in the Czech Lands. Having excluded Classical lemmas that can also be found in the Lewis and Short dictionary, what remains are mostly lemmas that occur in corpora featuring texts of later stages of Latin: for instance, the texts from the Index Thomisticus Treebank and UDante date back respectively to XIII and XIV centuries. The only exception is represented by the LASLA corpus, which includes Classical Latin. Yet, occurrences in LASLA are limited to the lemma mollitia ‘softness, weakness’, which is therefore attested in Classical times as well, while all the other lemmas appear to have originated later.

²² The linguistic resources for Latin linked in LiLa can be queried either via a query graphical interface (https://lila-erc.eu/query/) or through a SPARQL endpoint (https://lila-erc.eu/sparql/).
²³ Accidentalis, bestialitas, connaturalis, connaturalitas, contingentia, eligibilis, finitas, fumositas, leuiathan, materialitas, mollitia.
²⁴ http://lila-erc.eu/data/corpora/CompHistSem/id/corpus.
²⁵ http://lila-erc.eu/data/corpora/ITTB/id/corpus.
²⁶ http://lila-erc.eu/data/corpora/UDante/id/corpus.
²⁷ http://lila-erc.eu/data/corpora/CIRCSELatinLibrary/id/corpus. Collection of Latin texts enhanced with different layers of linguistic annotation.
²⁸ http://lila-erc.eu/data/corpora/Lasla/id/corpus.

6. Conclusions

Linking the Dictionary to the LiLa Knowledge Base not only was a further step towards the full exploitation of linguistic resources for Latin, thanks to their interoperability, but also contributed to improving the degree of linguistic diversity represented in LiLa in three respects that are particularly relevant for Latin as a language that was used for centuries all over Europe: (a) diachronic diversity: the Dictionary collects a portion of the Latin vocabulary that emerged in Medieval times; (b) diatopic diversity: the lexical resource includes items from a specific area, namely the Czech lands; (c) domain-based diversity: quite frequently the entries of the Dictionary belong to very specific domains (e.g., flora and fauna; see Section 4.2.2). The contribution of the lemmas from the Dictionary in enlarging the LiLa Lemma Bank is thus considerable both in terms of quantity and in terms of quality, and highlights the importance of linking to the Knowledge Base also resources that feature non-standard varieties of Latin.

In the near future, we intend to finalise the linking, by disambiguating ambiguous matches and adding missing lemmas to the Lemma Bank, as well as by including referencing lemmas besides full entries (see Section 3.1). We also intend to model citations of attestations, i.e. references to other dictionaries where an entry is found and to sources of examples. Moreover, we plan to link to the Knowledge Base some documents from the same area and period as the Dictionary, such as the Czech Medieval sources from the AHISTO project²⁹. However, these documents are currently available only as raw texts, and would need to be lemmatised before the linking. Given the peculiar nature of their Latin variety, conditioned by the Czech language and rich in local proper names, lemmatisation with the currently available trained models will probably provide low accuracy rates. Once again, this proves the importance of collecting non-standard Latin data (and resources) and investigating to what extent Latin varieties differ.

²⁹ https://nlp.fi.muni.cz/projekty/ahisto/portal.
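As a recap, the linking procedure of Section 4 can be sketched in a few lines of Python: align the gramGrp value to a LiLa-style tag, match the (lemma, POS) pair against the Lemma Bank, and fall back on the inflectional class when several candidates are found. This is an illustration under stated assumptions, not the code used in the project. The function name, the identifiers and the contents of the toy Lemma Bank are hypothetical, and in the alignment table only the -us, f. rule is taken from Section 4.1, while the other rules are assumed by analogy.

```python
from collections import defaultdict

# Hypothetical alignment table: gramGrp value -> (POS, inflectional class).
# Only "-us, f." comes from Section 4.1; the other rules are assumptions.
# None means that no inflectional class can be inferred.
GRAMGRP_TO_LILA = {
    "-us, f.": ("NOUN", "n4"),
    "-i, m.": ("NOUN", "n2"),
    "-i, n.": ("NOUN", "n2"),
    "-onis, m.": ("NOUN", "n3"),
    "subst.": ("NOUN", None),
}

# Toy Lemma Bank: (lemma, POS) -> list of (identifier, inflectional class).
# Identifiers and contents are invented for the example.
LEMMA_BANK = defaultdict(list)
LEMMA_BANK[("acus", "NOUN")] += [("id_1", "n4"), ("id_2", "n3")]
LEMMA_BANK[("lacertus", "NOUN")] += [("id_3", "n2"), ("id_4", "n2")]
LEMMA_BANK[("labellum", "NOUN")] += [("id_5", "n2")]


def link(lemma, gramgrp):
    """Return (outcome, identifiers) for one Dictionary entry."""
    pos, infl_class = GRAMGRP_TO_LILA.get(gramgrp, (None, None))
    candidates = LEMMA_BANK.get((lemma, pos), [])
    if not candidates:
        return "one-to-zero", []
    if len(candidates) == 1:
        return "one-to-one", [candidates[0][0]]
    # Several homographs: use the inflectional class as a further constraint.
    if infl_class is not None:
        narrowed = [c for c in candidates if c[1] == infl_class]
        if len(narrowed) == 1:
            return "disambiguated", [narrowed[0][0]]
    return "one-to-many", [c[0] for c in candidates]


entries = [
    ("labellum", "-i, n."),
    ("acus", "-us, f."),
    ("lacertus", "-i, m."),
    ("bosako", "-onis, m."),
]
for lemma, gramgrp in entries:
    print(lemma, *link(lemma, gramgrp))
```

In this toy setting acus is resolved through its inflectional class, whereas lacertus stays ambiguous because both candidates share class n2, which mirrors the case that Section 4.2.1 leaves to manual inspection.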

Acknowledgments

The “LiLa - Linking Latin” project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme – Grant Agreement No. 769994. This work was partially supported by the Grant No. 20-16819X (LUSyD) of the Czech Science Foundation (GACR). We want to thank Pavel Nývlt for his collaboration in providing the database “The Dictionary of Medieval Latin in Czech Lands”, available via the LINDAT/CLARIAH-CZ Research Infrastructure, supported by the Ministry of Education, Youth, and Sports of the Czech Republic (Project No. LM2018101).

References

ume III (I-M), 1995 to 2016. The electronic version has been created by Jan Ctibor.

[20] S. Petrov, D. Das, R. McDonald, A Universal Part-of-Speech Tagset, in: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), European Language Resources Association (ELRA), Istanbul, Turkey, 2012, pp. 2089–2096. URL: http://www.lrec-conf.org/proceedings/lrec2012/pdf/274_Paper.pdf.

[21] F. Mambrini, E. Litta, M. Passarotti, P. Ruffolo, Linking the Lewis & Short Dictionary to the LiLa Knowledge Base of Interoperable Linguistic Resources for Latin, in: Proceedings of the Eighth Italian Conference on Computational Linguistics (CLiC-it 2021). Milan, Italy, January 26-28, 2022, 2021, pp. 214–220.

[22] F. Iurescia, E. Litta, M. Passarotti, M. Pellegrini, G. Moretti, P. Ruffolo, Linking the Neulateinische Wortliste to the LiLa Knowledge Base of Interoperable Resources for Latin, in: Proceedings of the 7th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 82–87. URL: https://aclanthology.org/2023.latechclfl-1.9.