-io Nouns through the Ages. Analysing Latin Morphological Productivity with Lemlat

Marco Budassi
Università degli Studi di Pavia
Corso Strada Nuova 65, 27100 Pavia, Italy
marcobudassi@hotmail.it

Eleonora Litta, Marco Passarotti
Università Cattolica del Sacro Cuore
Largo Gemelli 1, 20123 Milan, Italy
e.littamodignani@gmail.com
marco.passarotti@unicatt.it

Abstract

This paper examines the diachronic distribution of one of the richest classes of nouns in Latin, namely those ending in -io. The work combines a morphological analyser for Latin (Lemlat) with a lexical database collecting all word forms that occur in Latin texts from Antiquity to Neo-Latin (TF-CILF).

1 Introduction

The investigation of lexical data of Classical languages through linguistic resources and Natural Language Processing (NLP) tools has witnessed a surge of interest in the past decade. As far as Latin is concerned, several textual and lexical resources, as well as NLP tools, are now being used in lexicographic research.¹ One of the bedrocks of this type of research is the use of morphological analysers, that is, tools that, given an input word form, output its corresponding lemma(s) and morphological features.

First released at the beginning of the 1990s and recently made freely available in its version 3.0 (Passarotti et al., 2017), Lemlat is one of the best performing morphological analysers and lemmatisers for Latin.² Lemlat is currently being enriched with all the lemmas contained in the glossary of Medieval Latin Glossarium mediae et infimae latinitatis, compiled by Charles Du Cange et alii in 1883-1887 (Glorieux, 2010).

One of the first groups of lemmas from Du Cange to be included in the lexical basis of Lemlat was that of 3rd declension nouns ending in -io, one of the most productive affixes in all periods of Latin, up to the Romance languages (Fruyt, 2011). The aim of this study is to perform a diachronic quantitative evaluation of 3rd declension nouns ending in -io. To do so, we first use Lemlat to lemmatise all word forms of such nouns contained in the Thesaurus formarum totius latinitatis a Plauto usque ad saeculum XXum (TF-CILF) (Tombeur, 1998). We then evaluate the results of the lemmatisation in both quantitative and qualitative terms.

¹ See (Bamman and Crane, 2008), (McGillivray and Passarotti, 2009), (McGillivray, 2013) and (Passarotti et al., 2016).
² www.lemlat3.eu. See (Springmann et al., 2016) for a comparative evaluation of the morphological analysers currently available for Latin.

2 Lemlat and Du Cange

Lemlat relies on a lexical basis resulting from the collation of three Classical Latin dictionaries,³ for a total of 40,014 lexical entries and 43,432 lemmas (as more than one lemma can be included in one lexical entry). In the context of the development of Lemlat version 3.0, its lexical basis was further enlarged by semi-automatically adding most of the Onomasticon (26,415 lemmas out of 28,178) provided by the 5th edition of the Forcellini dictionary for Latin (Budassi and Passarotti, 2016). Furthermore, the inflectional information provided by Lemlat has been enhanced with information on derivational morphology taken from the Word Formation Latin (WFL) lexicon (Litta et al., 2016).⁴

However, because it is based on dictionaries for Classical Latin, the lexical basis of Lemlat is not yet large enough to provide wide coverage of the word forms occurring in Late and Medieval Latin texts. For this reason, an upgrade of Lemlat 3.0 with the Medieval Latin lemmas contained in the Du Cange glossary (Glorieux, 2010), made available online by the École nationale des chartes,⁵ is underway.

³ (Georges and Georges, 1913-1918), (Glare, 1982) and (Gradenwitz, 1904).
⁴ Funded by the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 658332-WFL, Word Formation Latin is a derivational morphology resource for Latin that links lemmas on the basis of word formation processes (http://wfl.marginalia.it).
⁵ http://ducange.enc.sorbonne.fr/doc/sources.

3 Nouns Ending in -io

In the Lemlat lexical basis, nouns of the 3rd declension ending in -io (with genitive in -ionis) are mostly feminine. Only 294 out of 3,065 -io nouns in Lemlat are masculine, more than half of which are proper names.⁶ Most frequently, nouns in -io derive from verbs. WFL contains 2,510 deverbal nouns in -io, 87 denominal and 36 deadjectival. There are also underived -io nouns, such as bacrio 'trowel'.

Resulting from one of the main mechanisms of Latin nominalisation (Rosén, 1983), deverbal nouns in -io are generally called processes or verbal nouns. Semantically, they can be either "nomina actionis", referring to the process of the action expressed by the input verb (e.g. aberro 'to wander from the way' > aberratio 'diversion', as the process of wandering from the way), or "nomina rei actae", referring to the result of such a process (e.g. aberratio as the result of wandering from the way).⁷

An investigation of productivity in affixal derivation performed on data extracted from WFL has shown that deverbal nouns in -io are the most numerous formations in Classical Latin (Litta et al., 2017). This strong presence of nouns in -io in the Latin lexicon motivates their choice as the object of this work.

For this study, we have grouped the nouns in -io as follows:

1. Group D: nouns that are only contained in Du Cange (tot. no. 1,416);
2. Group L: nouns that are only contained in Lemlat (tot. no. 2,246);
3. Group L&D: nouns that are contained in both Du Cange and Lemlat (tot. no. 1,494).

Du Cange contains a total of 2,910 nouns ending in -io. One of the characteristics of the Du Cange glossary is that no Classical Latin lemma is included in its lexical basis as such: if the same lemma is contained in both lexical bases, it has undergone a major semantic or morphological change. 1,416 -io nouns out of 2,910 are listed only in Du Cange (Group D), which means that they were absent from the Classical Latin dictionaries used for compiling Lemlat.

Group L contains all those -io nouns whose meaning (or morphology) did not change from Classical Latin onwards, or that were no longer used in Medieval Latin. Such words are therefore exclusive to the Lemlat lexical basis. Even if they were used in Medieval times, they did not undergo a semantic or morphological change, hence they were not included in Du Cange.

Group L&D contains all those -io nouns that are recorded in both Lemlat and Du Cange. These are mostly words that have undergone a semantic change, but there are also words that are spelled differently in Medieval sources (e.g. Med. adsumtio or assumtio for Cl. assumptio 'acquisition'), or that acquired a different inflection in Medieval times (e.g. Cl. beneficium 'kindness', 2nd declension > Med. beneficio, 3rd declension). Because Du Cange treats different meanings in different entries, a number of words also appear more than once (e.g. defensio 'defense' x4, invocatio 'invocation' x2).

⁶ Because at the time of writing there is no implemented distinction between onomastic and non-onomastic lemmas as far as Du Cange is concerned, we have also taken into consideration onomastic data in the Lemlat lexical basis.
⁷ An ample bibliography on -io nouns in Latin is available. See for example (Fruyt, 1995) and (Fruyt, 2011).

4 Methodology

In order to perform a diachronic evaluation of the frequency distribution of these three groups, we have used data extracted from the TF-CILF database (Tombeur, 1998). TF-CILF is a database collecting the vocabulary of the entire Latin world, drawn from (a) ancient Latin literature, (b) the literature of the patristic period, (c) a vast body of Medieval material and (d) collections of Neo-Latin works. Word forms are assigned their number of occurrences in each of these four periods. Lemlat has already been shown to perform very efficiently on the TF-CILF dataset, as it is able to analyse 98.345% of the approximately 63 million textual occurrences of the word forms it contains (Budassi and Passarotti, 2017).

We extracted from TF-CILF a list of those word forms that feature one of the possible inflectional endings of -io nouns (-io, -ionis, -ionem, -ioni etc.), together with data on their frequency of occurrence in the four periods of Latin mentioned above. In total we extracted 25,510 candidate word forms.

We then processed these word forms with both Lemlat 3.0 and an enhanced version of it containing the nouns ending in -io taken from Du Cange. This version of Lemlat was able to analyse 17,775 word forms out of the 25,510 extracted from TF-CILF. Such a low word coverage (69.79%) is consistent with the overall coverage of TF-CILF word forms provided by Lemlat 3.0 (72.25%) (Budassi and Passarotti, 2017). However, the unknown forms are extremely rare in terms of textual occurrences, which makes the textual coverage of Lemlat 3.0 largely reliable. The automatic processing makes it possible not only to match each word form with a lemma, but also to exclude homographs like capio 'to seize' (verb). The resulting output (lemmas + frequency) can be graphically mapped on a temporal axis in order to obtain a complete view of the distribution of -io nouns through the ages.

5 Distribution of -io Nouns in Latin

Table 1 offers an overview of the total number of occurrences by period.⁸ In absolute terms, the vast majority of occurrences of -io nouns are attested in the Middle Ages.

             L          L&D        D
Antiquity    30,282     36,570     1,638
Patres       133,042    255,235    5,740
Medieval     216,220    541,049    14,299
Neo-Latin    19,551     45,145     1,812

Table 1: Absolute frequencies by period.

However, any evaluation of these results is going to be biased by the fact that the datasets for each period are not balanced. The subsets covering the Patristic and the Medieval periods are bigger than that for Classical Latin. The subset for Neo-Latin is considerably smaller than those for the other periods. To give an idea of the difference in size between the four chronological subsets, Table 2 reports the total number of word forms and lemmas in TF-CILF by period.

             Word Forms    Lemmas
Antiquity    5,726,051     229,587
Patres       21,982,097    310,348
Medieval     33,285,740    359,262
Neo-Latin    2,184,025     105,857
Total        63,177,913    554,828

Table 2: Number of word forms and lemmas in TF-CILF by period.

In order to flatten the difference in size between the subsets, relative values need to be used instead of absolute ones. Table 3 displays the distribution of -io nouns in Latin texts in terms of relative frequencies of occurrence by period.

             L         L&D       D
Antiquity    0.528%    0.638%    0.028%
Patres       0.605%    1.161%    0.026%
Medieval     0.649%    1.625%    0.042%
Neo-Latin    0.895%    2.067%    0.082%

Table 3: Relative frequencies by period.

For instance, Table 3 shows that -io nouns that are only contained in Lemlat account for 0.649% of the total number of occurrences in Medieval texts. Those contained in both Lemlat and Du Cange account for 1.625%, and those contained only in Du Cange (hence exclusively Medieval) for just 0.042%. An overview of the diachronic distribution of relative frequencies of occurrence of -io nouns is provided in Figure 1.

Figure 1: Distribution of relative frequencies of occurrence of -io nouns.

Figure 1 clarifies how the presence of -io nouns varies across the chronological phases of Latin. The relative frequency of those -io nouns that were in the lexicon of Classical Latin (Lemlat line) remains fairly constant across the earlier diachronic phases of the language. In Neo-Latin times, however, a sharp increase is registered (from 0.649% to 0.895%). This peak is also observable for Medieval Latin -io nouns (Du Cange line): from a value of 0.042% in the Medieval period, the relative frequency rises to 0.082% in Neo-Latin. Nevertheless, the -io nouns stored in both Lemlat's and Du Cange's lexical bases (most of which underwent some semantic change across the centuries) are the ones that fare best (Lemlat and Du Cange line): they grow steadily from a relative frequency of 0.638% in Antiquity to 2.067% in Neo-Latin.

The odd presence of words from Du Cange in Classical times is due to non-disambiguated homography. This is the case, for instance, of the word dubio, which is analysed by Lemlat both as a form of the first class adjective dubius 'uncertain' (recorded in the original lexical basis of Lemlat, hence left out here) and as the nominative/vocative singular of the -io noun dubio (a type of hooked tool) from the Du Cange lexical basis.

⁸ L stands for Lemlat only, L&D stands for Lemlat and Du Cange, D stands for Du Cange only. 'Antiquity' (i.e. up to the end of the 2nd century AD), 'Patres' (i.e. 3rd century - 735 AD), 'Medieval' (i.e. 736 - 1499 AD) and 'Neo-Latin' (i.e. 1499 AD henceforth) are the chronological parameters adopted by TF-CILF.

6 General Discussion

The distribution of -io nouns reflects Zipf's law (Zipf, 1949), which states that the frequency of any word in a corpus is inversely proportional to its rank in the frequency table. In other words, a few -io nouns are used massively, while a large number of -io nouns are used only a few times.

The most used nouns in -io across all periods are ratio 'reckoning', passio 'passion (of Christ)',⁹ oratio 'speech' and actio 'action'. The most used words in Antiquity are ratio, oratio, legio 'legion' and regio 'region'. The most frequent -io nouns in Patristic and Medieval times can all be found in both Lemlat and Du Cange. In Patristic literature, the most frequent words after ratio are oratio, actio, passio and resurrectio 'resurrection'. In Medieval times, again after ratio, they are passio, oratio, operatio 'activity' and perfectio 'perfection/completion'.

On another note, the high peak in the relative frequency of -io nouns in Neo-Latin texts suggests that these nouns were used more often in more recent times than in earlier periods. This can be explained by looking at the kind of texts included in the corpus. The texts contained in the Neo-Latin subset are mainly scientific and philosophical treatises, judicial texts, and the text of the Second Vatican Council. When these texts were written, Latin was no longer a spoken language, as its place had mainly been taken by Italian and French, two languages that inherited the suffix -io straight from Latin, especially as far as learned vocabulary is concerned.¹⁰ The assumption is that learned Latin texts contained a large number of words resembling those used in learned Italian and French, at least as far as -io nouns are concerned. A look at the most used -io nouns in Neo-Latin texts confirms that ratio was once again the most used, followed by propositio 'statement of facts', actio, notio 'judicial enquiry', definitio 'definition' and cognitio 'examination'. These are also all contained in the L&D group.

⁹ Passio is absent in Antiquity texts.
¹⁰ See (Thornton, 1990), (Thornton, 1991) and (Štichauer, 2015).

7 Conclusions and Future Work

In this paper, we presented a study of the diachronic distribution of Latin nouns ending in -io, carried out by processing word forms from the TF-CILF corpus with the morphological analyser Lemlat. We showed that the -io suffix is very productive across all periods of the Latin language, with a particularly high frequency in both Medieval and Neo-Latin texts. Ratio remains the most used -io noun across the entire diachronic span covered by the corpus used in our work.

One step further in the study of -io nouns would be to establish derivational relationships for each lemma and to verify which of the two lexical groups (Lemlat or Du Cange) the input lemma belongs to. Also, an evaluation of the word forms left unknown after the lemmatisation process should be performed.

Given the wide lexical coverage provided by Lemlat, our work represents a positive example of how much NLP tools can help to investigate diachronic aspects of language. The wide diachronic as well as diatopic span over which Latin texts are spread opens an appealing challenge for research in NLP, which has to address the problem of portability of NLP tools across time, place and genre. In this sense, Latin texts represent a perfect dataset both for developing and for evaluating techniques of domain adaptation of NLP tools.

References

David Bamman and Gregory Crane. 2008. Building a Dynamic Lexicon from a Digital Library. In Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2008, Pittsburgh). ACM: New York.

Marco Budassi and Marco Passarotti. 2016. Nomen Omen. Enhancing the Latin Morphological Analyser Lemlat with an Onomasticon. In Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH), 90-94. Association for Computational Linguistics: Berlin.

Marco Budassi and Marco Passarotti. 2017. The Impact of Unassimilated Loanwords on the Latin Lexicon. A Qualitative and Quantitative Analysis. In Proceedings of DATeCH2017, Göttingen, Germany, June 01-02, 2017, 85-90. DOI: http://dx.doi.org/10.1145/3078081.3078083.

Charles du Fresne Du Cange. 1678-1887. Glossarium Mediae et Infimae Latinitatis, éd. augm., Niort, L. Favre. http://ducange.enc.sorbonne.fr/.

Michèle Fruyt. 1995. L'accusatif et les noms en -tio chez Plaute. De usu, Études de syntaxe latine offertes en hommage à Marius Lavency, 131-141.

Michèle Fruyt. 2011. Word-Formation in Classical Latin. In A Companion to the Latin Language, ed. James Clackson, 157-175. Wiley-Blackwell: Malden, Mass.

Karl Ernst Georges and Heinrich Georges. 1913-1918. Ausführliches Lateinisch-Deutsches Handwörterbuch. Hahn: Hannover.

Peter G.W. Glare. 1982. Oxford Latin Dictionary. Oxford University Press: Oxford.

Thuillier Glorieux. 2010. Pourquoi informatiser un vieux glossaire? Présentation du Du Cange en ligne. ÉLA n°156, octobre-décembre 2009, Klincksieck.

Otto Gradenwitz. 1904. Laterculi Vocum Latinarum. Hirzel: Leipzig.

Eleonora Litta, Marco Passarotti and Chris Culy. 2016. Formatio formosa est. Building a Word Formation Lexicon for Latin. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016), 185-189. aAccademia University Press: Napoli.

Eleonora Litta, Marco Passarotti and Paolo Ruffolo. 2017. Node Formation. Using Networks to Inspect Productivity in Affixal Derivation in Classical Latin. In Proceedings of DATeCH2017, Göttingen, Germany, June 01-02, 2017, 103-108. DOI: http://dx.doi.org/10.1145/3078081.3078092.

Barbara McGillivray. 2013. Methods in Latin Computational Linguistics. Brill: Leiden.

Barbara McGillivray and Marco Passarotti. 2009. The Development of the Index Thomisticus Treebank Valency Lexicon. In Proceedings of LaTeCH-SHELT&R Workshop 2009, Athens, Greece, 43-50. ACL.

Marco Passarotti, Berta González Saavedra and Christophe Onambele. 2016. Latin Vallex. A Treebank-based Semantic Valency Lexicon for Latin. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, 2599-2606.

Marco Passarotti, Marco Budassi, Eleonora Litta and Paolo Ruffolo. 2017. The Lemlat 3.0 Package for Morphological Analysis of Latin. In Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language, 24-31.

Hannah Rosén. 1983. The mechanisms of Latin nominalization and conceptualization in historical view. In ANRW II 29.1, 178-211. De Gruyter: Berlin.

Uwe Springmann, Helmut Schmid and Dietmar Najock. 2016. LatMor: A Latin Finite-State Morphology Encoding Vowel Quantity. In Giuseppe Celano and Gregory Crane (eds.), Treebanking and Ancient Languages: Current and Prospective Research (Topical Issue), Open Linguistics vol. 2, 386-392.

Pavel Štichauer. 2015. From emergent availability to full profitability: The diachronic development of the Italian suffix -zione from the 16th to the 20th century. In Augendre S., Couasnon-Torlois G., Lebon D., Michard C., et al., Proceedings of the Décembrettes 8th International conference on morphology, 319-326. Université de Toulouse: Toulouse.

Anna Maria Thornton. 1990. Sui deverbali italiani in -mento e -zione (I). Archivio glottologico italiano, LXXV/II, 169-207. Le Monnier: Torino.

Anna Maria Thornton. 1991. Sui deverbali italiani in -mento e -zione (II). Archivio glottologico italiano, LXXVI/I, 79-102. Le Monnier: Torino.

Paul Tombeur. 1998. Thesaurus formarum totius latinitatis a Plauto usque ad saeculum XXum. Brepols: Turnhout.

George K. Zipf. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley Press: Cambridge, Mass.
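
The relation between the absolute counts (Table 1), the period sizes (Table 2) and the relative frequencies (Table 3) can be illustrated with a short worked computation. The following is a minimal sketch, not the authors' procedure: it assumes that each Table 3 value is the Table 1 count divided by the Word Forms figure of the same period in Table 2, truncated (not rounded) to three decimal places. Both the choice of denominator and the truncation are inferred from the published figures rather than stated in the text, and all names in the code are illustrative.

```python
# Absolute occurrences of -io nouns by period and group (figures from Table 1).
TABLE_1 = {
    "Antiquity": {"L": 30_282, "L&D": 36_570, "D": 1_638},
    "Patres": {"L": 133_042, "L&D": 255_235, "D": 5_740},
    "Medieval": {"L": 216_220, "L&D": 541_049, "D": 14_299},
    "Neo-Latin": {"L": 19_551, "L&D": 45_145, "D": 1_812},
}

# "Word Forms" column of Table 2. Using it as the denominator is an
# assumption inferred from the numbers, not a statement of the paper.
TABLE_2_WORD_FORMS = {
    "Antiquity": 5_726_051,
    "Patres": 21_982_097,
    "Medieval": 33_285_740,
    "Neo-Latin": 2_184_025,
}


def relative_frequency(count: int, total: int) -> str:
    """Return count/total as a percentage, truncated to three decimals."""
    # Integer arithmetic: thousandths of a percent, floored.
    thousandths = count * 100_000 // total
    return f"{thousandths // 1000}.{thousandths % 1000:03d}%"


def relative_table(counts: dict, totals: dict) -> dict:
    """Apply relative_frequency to every (period, group) cell."""
    return {
        period: {
            group: relative_frequency(n, totals[period])
            for group, n in groups.items()
        }
        for period, groups in counts.items()
    }


TABLE_3 = relative_table(TABLE_1, TABLE_2_WORD_FORMS)

if __name__ == "__main__":
    for period, row in TABLE_3.items():
        print(f"{period:<10}", row)
```

Under these assumptions the sketch reproduces all twelve cells of Table 3. Rounding instead of truncating would change some of them (for example, Antiquity L would become 0.529% rather than 0.528%), which is why truncation is used here.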