Enhancing the Latin Morphological Analyser LEMLAT with a Medieval Latin Glossary

Flavio Cecchini*, Marco Passarotti*, Paolo Ruffolo*, Marinella Testori*, Lia Draetta°, Martina Fieromonte°, Annarita Liano°, Costanza Marini°, Giovanni Piantanida°
*Università Cattolica del Sacro Cuore - °Università di Pavia
*Largo Gemelli 1, 20123 Milan, Italy - °Corso Strada Nuova 65, 27100 Pavia, Italy
{flavio.cecchini, marco.passarotti}@unicatt.it

Abstract

English. We present the process of expanding the lexical basis of the Latin morphological analyser LEMLAT with the entries from the Medieval Latin glossary Du Cange. This process is performed semi-automatically by exploiting the morphological properties of lemmas, a previously available word list enhanced with inflectional information, and the contents of the lexical entries of Du Cange.

Italiano. L’articolo descrive il processo di ampliamento della base lessicale dell’analizzatore morfologico per il latino LEMLAT con il glossario di latino medievale Du Cange. Il processo è realizzato semiautomaticamente ricorrendo ad alcune proprietà morfologiche dei lemmi, a un lemmario completo d’informazione flessionale e ai contenuti delle entrate lessicali del Du Cange.

1 Introduction

Latin raises particular challenges for Natural Language Processing (NLP). Given that accuracy rates of stochastic NLP tools heavily depend on the training set on which their models are built, this becomes a particularly problematic issue when Latin is concerned, because Latin texts show an enormous linguistic variety resulting from (a) a wide time span (covering more than two millennia), (b) a large set of genres (ranging from literary to philosophical, historical and documentary texts) and (c) a wide diatopic diversity (spread all over Europe and beyond).

Such complexity impacts NLP to the point that building NLP tools claiming to be suitable for all Latin varieties is an unrealistic task. One practical example comes from an experiment described by Ponti and Passarotti (2016), who show that the performance of a dependency parser trained on Medieval Latin data drops dramatically when the same trained model is applied to texts from the Classical era.

This issue affects all layers of linguistic annotation, including fundamental ones, like lemmatisation and morphological analysis. Today, a handful of morphological analysers are available for Latin, chiefly Words,¹ LEMLAT 3.0,² Morpheus³ (reimplemented in 2013 as Parsley⁴), the PROIEL Latin morphology system⁵ and LatMor.⁶

¹ http://archives.nd.edu/words.html
² www.lemlat3.eu. Binaries and database available at https://github.com/CIRCSE/LEMLAT3.
³ https://github.com/tmallon/morpheus
⁴ https://github.com/goldibex/parsley-core
⁵ https://github.com/mlj/proiel-webapp/tree/master/lib/morphology
⁶ http://cistern.cis.lmu.de

Although LEMLAT, together with LatMor,⁷ has proved to be the best performing morphological analyser for Latin and the one boasting the largest lexical basis, its lexical coverage is still limited to Classical and Late Latin only. First released as a morphological lemmatiser at the end of the 1980s at ILC - CNR in Pisa (Bozzi and Cappelli, 1990; Marinone, 1990, v 1.0), where it was enhanced with morphological features between 2002 and 2005 (Passarotti, 2004, v 2.0), LEMLAT relies on a lexical basis resulting from the collation of three Latin dictionaries (Georges and Georges, 1913–1918; Glare, 1982; Gradenwitz, 1904) for a total of 40 014 lexical entries and 43 432 lemmas, as more than one lemma can be included in one lexical entry. This lexical basis was further enlarged in version 3.0 of LEMLAT by semi-automatically adding most of the Onomasticon (26 415 lemmas out of 28 178) provided by the 5th edition of the Forcellini dictionary (Budassi and Passarotti, 2016).

⁷ For an evaluation of morphological analysers for Latin see (Springmann et al., 2016).

In order to equip LEMLAT to process Latin texts beyond the Classical period, we recently enhanced its lexical basis with the lexical entries from a large reference glossary for Medieval Latin, namely the Glossarium Mediae et Infimae Latinitatis by Du Cange et alii (1883–1887, hereafter DC). This paper details the process performed to include DC in LEMLAT’s lexical basis.

2 Word Form Analysis in LEMLAT

LEMLAT is a lemmatiser and morphological analyser of types (i.e. no contextual disambiguation is performed). Given a word form in input (e.g. coniugae), LEMLAT’s output produces the corresponding lemma(s) (e.g. coniuga ‘wife’) and a number of tags conveying (a) the inflectional paradigm of the lemma(s) (e.g. first declension noun) and (b) the morphological features of the input word form (e.g. feminine singular genitive and dative; feminine plural nominative and vocative).

LEMLAT makes use of a database that includes multiple tables recording the different formative elements (segments) of word forms. The core table is the lexical look-up table, whose basic component is the so-called LES (LExical Segment). The LES is defined as the invariable part of the inflected form (e.g. coniug for coniug-ae). In other words, the LES is the string (or one of the strings) of characters that remains the same in the inflectional paradigm of a lemma; hence, the LES does not necessarily correspond to either the word stem or the root.

LEMLAT includes a LES archive, in which LES are assigned an ID and a number of inflectional features, among which a tag for the gender of the lemma (for nouns only) and a code (called CODLES) for its inflectional category. According to the CODLES, the LES is compatible with the endings (called SF, “Final Segment”) of its inflectional paradigm, which are collected in a separate table in the LEMLAT database. For example, the CODLES for the LES coniug is N1 (first declension nouns) and its gender is F (feminine). The word form coniugae is thus analysed as belonging to the LES coniug, the segment ae being recognised as an ending compatible with a LES with CODLES N1.

3 Adding the Du Cange Glossary

Adding DC to LEMLAT is a challenging task mostly because DC is not a dictionary in the modern sense of the word, but a glossary, i.e. a mere collection of words where information about parts of speech (PoS) and inflectional categories is almost absent, and therefore has to be deduced or reconstructed before an entry can be included in LEMLAT.⁸ In addition, lemmatisation criteria are often inconsistent, even for words belonging to the same class (e.g. verbs are cited either by their present active infinitive or by their first person singular present indicative).

⁸ For this work, we use the digital version of DC provided by the École nationale des chartes (Paris). Source data are available in XML format at http://svn.code.sf.net/p/ducange/code/xml/. The glossary can be accessed online at http://ducange.enc.sorbonne.fr/.

This is partly due to the fact that five different authors contributed to the glossary over a period of two centuries (Géraud, 1839), not always coherently with respect to their predecessors. Nonetheless, it is possible to distinguish some recurring patterns, which can be exploited to automatically include in LEMLAT as many of the 85 999 lemmas in DC as possible, or at least to expedite the manual recording of lexical entries.

3.1 Suffixes and Bon’s Word List

The preliminary step to extend LEMLAT with DC consists in selecting a set of derivational suffixes that are morphologically-unambiguous in terms of PoS and inflectional category, and hence the set of all lemmas displaying these suffixes. These lemmas require no further analysis for entry in LEMLAT. Examples are -itas for feminine imparisyllabic third declension nouns, or -icum for neuter second declension nouns. On the contrary, suffixes like, e.g., -anus or -atus are considered morphologically-ambiguous, as they can belong to different PoS (adjective or noun) and/or different inflectional categories (first or fourth declension). In these cases the corresponding lemmas require manual annotation (see Section 3.2). Approximately 30 000 DC lemmas are retrieved and added to LEMLAT in this way.

To extend the automatic acquisition of DC’s lemmas, we also take advantage of a list of 71 908 Latin lemmas collected by Bruno Bon from various lexicographic sources and corpora.⁹ This list supplies information about inflectional morphology.¹⁰ Of these lemmas, 22 628 are found among those in DC that are not analysed in the preliminary step; and out of these, the 21 805 showing a one-to-one correspondence with lemmas in Bon’s list are added to LEMLAT with no further check.¹¹

⁹ Available at http://glossaria.eu/outils/lemmatisation/ and presented in (Bon, 2011).
¹⁰ Specifically: PoS; genitive endings of nouns; nominative endings of adjectives; infinitive endings of regular verbs and full paradigms of irregular verbs.
¹¹ The remaining lemmas are manually-checked because they correspond to multiple entries in one and/or the other source. For example, the lemma fedus appears once in DC (as a masculine second declension noun, ‘fief’) but three times in Bon’s list: as a masculine second declension noun (but with the different meaning ‘goat’), as a neuter third declension noun (with the genitive federis, ‘alliance’) and as a first class adjective (‘hideous’).

3.2 Definitions and Quotations

Each lexical entry in DC comprises (a) the name of the lemma, (b) usually, a short definition and (c) possibly one or more quotations (taken from explicitly-cited textual sources), where most of the time a form of the lexical entry is capitalised. By making use of all these elements, we automatically assign a PoS and an inflectional category (i.e. a CODLES, in LEMLAT’s terms) to the lemma.

In particular, to assess the PoS of a lemma we follow a principle of “lexical osmosis”, that is, we assume that a lemma’s definition core (see below) will most probably use terms belonging to the same PoS as that lemma. By cross-checking this information with the citation form of the lemma and possibly with its inflected forms in a quotation, we are also able to assign it its inflectional category.

With regard to the definition, we take into consideration only its initial part, maximally up to the first quotation; what comes after are mostly more in-depth discussions of the term, secondary interpretations or later interpolations. More precisely, we focus on the definition’s core, i.e. a short capitalised phrase, enclosed in commas and/or ending with a full-stop, providing a short explanation or paraphrase of the lemma immediately after the lemma itself. Its terms are lemmas in typical quotation form, e.g. the nominative case for nouns. Moreover, the definition’s core makes use of a standardised and Classical variety of Latin lexicon so as to be as clear as possible to the reader. This means that most of the terms in a definition’s core can also be found in the list of lemmas of LEMLAT 3.0. Of the recognised forms, we retain only those that are univocally assigned only one PoS. We ignore a small set of both function and content words often recurring in definitions (e.g. pro ‘for’ and omnis ‘all, every’), and discard as noise a set of very common lexicographical annotations and abbreviations (e.g. Italus or Ital., f. = fortasse, lib., cap.).

With regard to quotations, we only consider the first one as the most significant. Given the lemma’s citation form in DC, we exploit the list of all Latin endings and their agreements with inflectional categories available in LEMLAT’s database to construct all of its a priori possible inflectional paradigms; of these (partly artificial) forms, we retain only those that allow us to unambiguously discriminate a PoS and/or an inflectional category from the others. For example, the entry for mansaticus ‘mansion, house’ illustrates this method:

    MANSATICUS, Mansio, domus. Annal. Bertin. ad ann. 874. tom. 7. Collect. Histor. Franc. pag. 118 : Inde per Attiniacum et consuetos Mansaticos Compendium adiit [. . . ]

Since the definition’s core mansio can only be a noun for LEMLAT, we can conclude that mansaticus is almost surely a noun too, even if the -icus ending tends to be associated with denominal adjectives in Latin. The -us ending tells us that mansaticus can be either a masculine second or fourth declension noun;¹² a first class adjective might theoretically be possible, but is ruled out by the definition’s core mansio. The second declension is confirmed by the ending -os found in the quotation, thus excluding the fourth declension (which should yield -us).

¹² Feminines are so rare in these declensions that we exclude them from the automated analysis.

Thanks to this process, more than 10 000 additional lemmas are automatically included in LEMLAT. This process is applied very carefully, covering only decidedly unambiguous cases, i.e. when content words in the definition’s core are found to belong to only one PoS or to a phrase of a fixed type (e.g. a phrase ending with an infinitive assigns PoS verb to the lemma) and when the inflectional category of the word form possibly found in the quotation can be univocally discriminated. This leads to high precision (1.0), but affects recall (0.18). For the remaining cases we have to resort to manual annotation; this happens most frequently when we correctly identify the PoS and the inflectional category of a lemma, but cannot infer its gender a priori. For instance, approximately 10% of first declension nouns are found to be masculine, and not feminine as expected.

4 Discussion

Not all of the 85 999 lemmas of DC are included in LEMLAT. We exclude from the lexical basis of the DC-enhanced LEMLAT the entries of some 3 000 fixed or idiomatic multi-word expressions and of around 300 adverbs derived either from an adjective (e.g. affectuose ‘tenderly’ from affectuosus ‘tender’) or from a verb (e.g. attendenter ‘watchfully’ from attendere ‘to keep, to watch’). This is because LEMLAT considers derived adverbs as part of the inflectional paradigm of the source adjective or verb.

At the end of the process, 82 556 DC lemmas are added to LEMLAT. Since DC shows a tendency to treat different nuances of the same lemma as distinct entries, the total number of DC distinct lemmas inserted in LEMLAT is 73 131. The lemmas with the highest number of separate entries are forma ‘form’ (17), scala ‘stairs, staircase, ladder’ (15) and status ‘mode, state, position, size’ (15). These are all already attested in Classical Latin, but are also recorded in DC because of their semantic change over time.¹³ This happens frequently; there are, in fact, 10 168 shared lemmas (corresponding to 14 469 entries in DC) in LEMLAT 3.0 and DC, with respect to the name of the lemma, its PoS and inflectional category (and gender, when applicable). Additionally, 1 820 lemmas share the same quotation form in both sources (often incidentally), despite being morphologically different. An example is amo: in DC, it is the third declension noun amo, amonis, a variant of ammo, ammonis (a unit of measure for wine), while in LEMLAT it is the verb amare ‘to love’.

¹³ Indeed, DC does not at all record lemmas already available in Classical Latin, unless they show a different meaning and/or morphology.

The remaining 66 267 lemmas are to be considered lexical innovations of “media et infima Latinitas”. Looking at these Medieval lemmas, we notice some tendencies in the distribution of PoS and inflectional categories. Whereas nouns are the prevalent PoS both in LEMLAT and DC (albeit at very different rates, respectively 52% and 75%), in the former the most attested declension is the third (37% of nouns), while in the latter it is the first and second declensions that dominate (34% and 39% of nouns, against 20% of the third declension), showing a trend towards more transparent lexical items. While similar figures can be observed for verbs, in DC we notice a reduced presence of adjectives (12% against LEMLAT’s 25%), revealing that they represent a less diachronically-productive PoS than nouns and verbs.

5 Evaluation

As conducted for the previous major update of LEMLAT (Passarotti et al., 2017), we evaluate LEMLAT’s coverage of the Latin lexicon against the Thesaurus formarum totius latinitatis (TFTL) by Tombeur (1998), in order to assess the impact of LEMLAT’s acquisition of DC. A primary reference for the study of the Latin lexicon, TFTL is a comprehensive diachronic collection of all Latin word forms as they occur in texts from the archaic period up to the Second Vatican Council (20th century), listing their respective frequencies in the sources from different eras.¹⁴

¹⁴ Archaic Latin (up to IInd c. AD), Patristic Latin (IInd c. AD – AD 735), Medieval Latin (AD 736 – AD 1499) and Modern Latin (AD 1500 – AD 1965), respectively.

Passarotti et alii (2017) report a coverage of 72.254% of TFTL’s forms, corresponding to 98.345% of the 62 922 781 total occurrences in the source texts.¹⁵ This is partly explained by the fact that many forms in TFTL are either extremely rare, include punctuation in their spelling, or are merely sequences of numbers, letters and punctuation marks. When we add DC to LEMLAT, our coverage of TFTL rises by 3.264 percentage points to 75.518%, corresponding to 17 224 newly-recognised forms, whereas the covered occurrences increase to 98.665%.

¹⁵ The statistics in this paper are based on updated, marginally corrected statistics with respect to those presented in Passarotti et alii (2017).

We also perform a coverage evaluation over three Medieval Latin texts of comparable size, available from ALIM, the Archive of Italian Medieval Latinity (Ferrarini, 2017).¹⁶ The texts belong to three different periods and genres; these are: the Codex diplomaticus Cavensis I (documents 33-210), a collection of documentary sources from Southern Italy dating to the 9th century; the Historia Mongalorum, a 13th century report of a journey and diplomatic mission; and the De falso credita et ementita Constantini donatione, a philological treatise dating back to the end of the 15th century.

¹⁶ http://it.alim.unisi.it/

Work (century)                   Tokens   Types   LEMLAT   LEMLAT + DC   Only DC
Codex dipl. Cavensis (IX)        19428    3262    54.1%    59.2%         166 (5.1%)
Historia Mongalorum (XIII)       20360    4649    90.3%    92.2%         87 (1.9%)
De Constantini donatione (XV)    19805    6514    93.9%    94.8%         56 (0.9%)

Table 1: Comparison of the lexical coverage of DC-enhanced LEMLAT of three Medieval texts. The “Only DC” column lists the number of terms to be found exclusively in the added DC vocabulary.

Table 1 shows the improvements in lexical coverage obtained thanks to the enhancement of LEMLAT through DC. The results are in line with those for TFTL. Remarkably, the highest increase in performance is recorded for the least-standardised of the three texts, the Codex diplomaticus, which remains the most demanding for LEMLAT to analyse. This can be explained by the large presence of local names of people and places (e.g. Sichelpertus, Eboli), and especially by the very frequent deviations from the orthographic standard (e.g. abentes for habentes ‘having (pl.)’, ecclesie for ecclesiae ‘of/to the church; churches’); the latter are also the source of many false positives, which LEMLAT does not discriminate from true positives. Names are challenging, too, as can be observed, for example, from the fact that among the 363 unrecognised forms in the Historia Mongalorum, the majority are ethnonyms, toponyms and anthroponyms (e.g. Caracoron ‘Karakorum’, circassos ‘Circassians’, Mengu ‘Möngkh’).

At the same time, LEMLAT is now able to analyse words which, while absent from the vocabulary of Classical Latin, are tied to key, widespread concepts in the Middle Ages. For example, in the Historia Mongalorum the enhanced LEMLAT can now detect terms like orda ‘horde’ (11 occurrences) or protonotarius ‘prothonotary’ (4 occurrences), both important from the 13th century onward in the context of conflicts and diplomatic missions between Western Europe and the Mongol Empire. Interestingly, the source for these lemmas in DC is not the Historia Mongalorum itself, which is an indication of the effective circulation of such words.

6 Conclusion

In this paper we present the rule-based process performed to semi-automatically enhance the Latin morphological analyser LEMLAT with the Du Cange glossary. While dated, such an approach is still necessary if the intent is to minimise the error rate resulting from the automatic PoS-tagging of the glossary’s definitions and quotations. Indeed, unless tuned on an in-domain training set, existing stochastic PoS-taggers for Latin are not yet reliable enough when it comes to processing the complex, raw and “freestyle” definitions of DC.

The ever-growing availability of digitised Latin texts from various eras urges us to build NLP tools capable of automatically analysing such varied sets of linguistic data. In this respect, enhancing the lexical basis of LEMLAT with a Medieval Latin dictionary is a first step towards the development of well-performing tools on diachronic data. Conversely, even if building a tool suitable for different diachronic varieties of Latin were feasible for low-level annotation tasks (like e.g. lemmatisation and morphological analysis), this does not seem to be the case for tasks such as syntactic parsing or word sense disambiguation, for which either highly flexible or highly specialised tools will be needed.

This is an open issue not only for Latin. Indeed, the portability of NLP tools across domains and genres is currently one of the main challenges in NLP. Thanks to its highly diverse corpus, Latin is a perfect case-study language to tackle these problems.

For the future, we plan to expand LEMLAT’s lexical database with all of the graphical variants reported in DC and possibly also with other Medieval Latin thesauri, such as the Dictionary of Medieval Latin from British Sources (Ashdown et al., 2018), so as to improve both its diatopic and diachronic coverage. In general, we aspire to make LEMLAT’s algorithm better able to cope with the most widespread and predictable orthographic variations recorded in Medieval manuscripts and texts.¹⁷

¹⁷ An introduction and an approach to this issue can be found in Kestemont and De Gussem (2017).

References

Richard K Ashdown, David R Howlett, and Ronald E Latham, editors. 2018. Dictionary of Medieval Latin from British Sources. Oxford University Press for the British Academy, Oxford, UK.

Bruno Bon. 2011. OMNIA : outils et méthodes numériques pour l’interrogation et l’analyse des textes médiolatins (3). Bulletin du centre d’études médiévales d’Auxerre BUCEMA, (15). Online at http://journals.openedition.org/cem/12015.

Andrea Bozzi and Giuseppe Cappelli. 1990. A project for Latin lexicography: 2. A Latin morphological analyzer. Computers and the Humanities, 24(5-6):421–426.

Marco Budassi and Marco Passarotti. 2016. Nomen omen. Enhancing the Latin morphological analyser Lemlat with an onomasticon. In Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH), pages 90–94, Berlin, Germany. Association for Computational Linguistics.

Charles du Fresne du Cange, Bénédictins de Saint-Maur, Pierre Carpentier, Louis Henschel, and Léopold Favre. 1883–1887. Glossarium mediae et infimae latinitatis. Niort, France.

Edoardo Ferrarini. 2017. ALIM ieri e oggi. Umanistica Digitale, 1(1). Online at https://umanisticadigitale.unibo.it/article/view/7193.

Karl Ernst Georges and Heinrich Georges. 1913–1918. Ausführliches lateinisch-deutsches Handwörterbuch. Hahn, Hannover, Germany.

Hercule Géraud. 1839. Historique du glossaire de la basse latinité de Du Cange. Bibliothèque de l’École Nationale des Chartes, 1:498–510.

Peter GW Glare. 1982. Oxford Latin dictionary. Clarendon Press. Oxford University Press, Oxford, UK.

Otto Gradenwitz. 1904. Laterculi Vocum Latinarum: voces Latinas et a fronte et a tergo ordinandas. Hirzel, Leipzig, Germany.

Mike Kestemont and Jeroen De Gussem. 2017. Integrated Sequence Tagging for Medieval Latin Using Deep Representation Learning. Journal of Data Mining & Digital Humanities, Special Issue on Computer-Aided Processing of Intertextuality in Ancient Languages, August. Online at https://jdmdh.episciences.org/3835.

Nino Marinone. 1990. A project for Latin lexicography: 1. Automatic lemmatization and word-list. Computers and the Humanities, 24(5-6):417–420.

Marco Passarotti, Marco Budassi, Eleonora Litta, and Paolo Ruffolo. 2017. The Lemlat 3.0 Package for Morphological Analysis of Latin. In Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language, pages 24–31, Gothenburg, Sweden. Northern European association for language technology (NEALT), Linköping University Electronic Press.

Marco Passarotti. 2004. Development and perspectives of the Latin morphological analyser LEMLAT. Linguistica computazionale, XX-XXI:397–414.

Edoardo Maria Ponti and Marco Passarotti. 2016. Differentia compositionem facit. A slower-paced and reliable parser for Latin. In Proceedings of the tenth international Conference on Language Resources and Evaluation (LREC ’16), pages 683–688, Portorož, Slovenia. European Language Resources Association (ELRA).

Uwe Springmann, Helmut Schmid, and Dietmar Najock. 2016. LatMor: A Latin finite-state morphology encoding vowel quantity. Open Linguistics - Topical Issue on Treebanking and Ancient Languages: Current and Prospective Research, 2(1):386–392.

Paul Tombeur. 1998. Thesaurus formarum totius Latinitatis: a Plauto usque ad saeculum XXum. Brepols, Turnhout, Belgium.
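
As an illustration of the word form analysis described in Section 2 and of the quotation-based check described in Section 3.2, the following is a minimal sketch and not LEMLAT's actual implementation. The table layout, the feature labels, the function names and the codes N2 and N4 are assumptions made for this example (only N1 is named in the paper), and the paradigms are deliberately incomplete.

```python
# Illustrative sketch only: the data layout, the codes "N2"/"N4" and the
# feature labels are assumptions, not LEMLAT's actual database schema.

# Toy SF ("Final Segment") table: CODLES -> {ending: morphological features}.
# Each paradigm is a fragment, sufficient for the two examples in the paper.
SF_TABLE = {
    "N1": {  # first declension nouns (code given in the paper)
        "a": ["nom sg", "voc sg", "abl sg"],
        "ae": ["gen sg", "dat sg", "nom pl", "voc pl"],
        "am": ["acc sg"],
        "as": ["acc pl"],
    },
    "N2": {  # second declension nouns in -us (hypothetical code)
        "us": ["nom sg"],
        "i": ["gen sg", "nom pl", "voc pl"],
        "um": ["acc sg"],
        "os": ["acc pl"],
    },
    "N4": {  # fourth declension nouns in -us (hypothetical code)
        "us": ["nom sg", "voc sg", "gen sg", "nom pl", "voc pl", "acc pl"],
        "um": ["acc sg"],
        "ui": ["dat sg"],
    },
}

# Toy LES archive: LES -> (lemma, CODLES, gender).
LES_ARCHIVE = {"coniug": ("coniuga", "N1", "F")}


def analyse(form):
    """Split a word form into LES + SF and return every compatible analysis."""
    analyses = []
    for les, (lemma, codles, gender) in LES_ARCHIVE.items():
        if not form.startswith(les):
            continue
        ending = form[len(les):]
        features = SF_TABLE[codles].get(ending)
        if features:  # the ending is licensed by the LES's CODLES
            analyses.append((lemma, codles, gender, features))
    return analyses


def compatible_categories(citation_form, quoted_form, candidates):
    """Keep the candidate categories whose paradigm can yield the quoted form."""
    kept = []
    for codles in candidates:
        paradigm = SF_TABLE[codles]
        for ending, features in paradigm.items():
            # Derive a tentative LES by stripping a nominative singular ending.
            if "nom sg" not in features or not citation_form.endswith(ending):
                continue
            les = citation_form[: len(citation_form) - len(ending)]
            if quoted_form.startswith(les) and quoted_form[len(les):] in paradigm:
                kept.append(codles)
                break
    return kept


if __name__ == "__main__":
    print(analyse("coniugae"))
    # The form is capitalised in the DC quotation, so it is lower-cased first.
    print(compatible_categories("mansaticus", "Mansaticos".lower(), ["N2", "N4"]))
```

With these toy tables, coniugae is analysed as a form of coniuga (N1, feminine) with the genitive and dative singular and nominative and vocative plural readings given in Section 2, and the quoted form Mansaticos leaves the second declension as the only category compatible with mansaticus, because -os is absent from the fourth declension paradigm.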