LatInfLexi: an Inflected Lexicon of Latin Verbs

Matteo Pellegrini
Università di Bergamo/Pavia
Piazza Rosate, 2, 24129 Bergamo, Italy
matteo.pellegrini@unibg.it

Marco Passarotti
CIRCSE Research Centre, Università Cattolica del Sacro Cuore
Largo Gemelli, 1, 20123 Milan, Italy
marco.passarotti@unicatt.it

Abstract

We present a paradigm-based inflected lexicon of Latin verbs built to provide empirical evidence supporting an entropy-based estimation of the degree of uncertainty in inflectional paradigms. The lexicon contains information on the inflected forms that occupy the 254 morphologically possible paradigm cells of 3,348 verbal lexemes extracted from a frequency lexicon of Latin. The resource also includes annotation of vowel length and the frequency of each form in different epochs.

1 Introduction

In this paper, we describe the construction of LatInfLexi, an inflected lexicon of Latin verbs organized in lexemes¹ and paradigm cells.

¹ The term “lexeme” is used for the abstract theoretical concept normally adopted in morphology and lexicology, while “lemma” refers to the concrete citation form representing an entry in dictionaries. Since we aim at a resource suitable for theoretical inquiries, we use the first term as a label in our resource.

In morphological theory, there is a recent trend towards a more realistic modelling of complex inflectional systems: for instance, Ackerman et al. (2009) and Bonami and Boyé (2014) propose that the analysis should take a full inflected form as a starting point, without assuming any segmentation a priori. What such approaches investigate is not the construction of forms from smaller units like stems and inflectional endings, but rather their predictability given knowledge of other forms. This can be done by using the information-theoretic notion of conditional entropy to estimate the uncertainty in guessing the content of a paradigm cell of a lexeme when another inflected form of the same lexeme is known, weighting the probability of application of each inflectional pattern by its type frequency in real data.

To do so, large-scale inflected lexicons listing all forms of a representative selection of lexemes are needed. Such resources are increasingly being developed for modern languages: see, among others, Zanchetta and Baroni (2005) and Calderone et al. (2017) for Italian, Neme (2013) for Arabic, Bonami et al. (2014) and Hathout et al. (2014) for French. However, to the best of our knowledge, there are no resources of this kind for Latin, although their (semi-)automatic building is made possible by the current availability of several morphological analyzers for Latin, including Words (http://archives.nd.edu/words.html), Lemlat (www.lemlat3.eu), Morpheus (https://github.com/tmallon/morpheus), the PROIEL Latin morphology system (https://github.com/mlj/proiel-webapp/tree/master/lib/morphology) and LatMor (http://cistern.cis.lmu.de). Our resource was created to fill this gap and to enable a quantitative, entropy-based analysis of Latin verb inflection.

2 Design

A distinctive feature of our inflected lexicon is that it is based on lexemes and paradigm cells, rather than on forms. This means that, for each lexeme, every morphologically possible paradigm cell is filled with a form, rather than only those cells whose forms are actually attested in Latin texts. In this respect, our resource is similar to other recently developed inflected lexicons, such as Flexique for French (Bonami et al., 2014).

For each paradigm cell, the following information is provided:

(i) the inflected form that occupies the paradigm cell;
(ii) a unique identifier of the lexeme to which it belongs;
(iii) the set of its morphological features;
(iv) information on the frequency of the form in different epochs.

As for (i), it should be noted that there is never more than one form per paradigm cell. In cases of overabundance (i.e. cells that are filled by more than one form, cf. Thornton, 2012), a choice was made as to which “cell-mate” (Thornton, 2012: 183) should be kept and which one discarded.

On the other hand, in some cases a paradigm cell may be empty, either because it is defective (for instance, the passive cells of intransitive verbs) or because it is not filled by a synthetic form but is expressed analytically, by means of a phrase. In Latin, the latter is the case for the perfective cells of deponent verbs, for which the periphrasis PRF.PTCP² + AUX esse ‘to be’ is used (e.g. PRF.IND.1SG hortātus sum ‘I incited’). In both cases, the cell is marked as #DEF# in the resource. This convention is also adopted in Flexique (Bonami et al., 2014: 2585), and it fits the requirements of the Qumin package for entropy calculations on the predictability of implicative relations between inflected forms (Bonami and Beniamine, 2016; Beniamine, 2017).

² Throughout the paper, we will refer to grammatical features by using the standard abbreviations of the Leipzig Glossing Rules.

As for (ii), the identifier corresponds to the citation form of the lexeme, almost always the first-person singular of the present indicative, following the Latin lexicographical and didactic tradition. A diacritic is added in those rare cases where different verbs have the same citation form (see infra, §3.2).

Regarding (iii), we use the PoS-tags of the Universal Part-of-Speech Tagset by Petrov et al. (2011) and the morphological features used in Universal Dependencies (http://universaldependencies.org/u/feat/index.html).

Lastly, the frequency data in (iv) are taken from Tombeur’s (1998) Thesaurus Formarum Totius Latinitatis (see infra, §3.3).

3 Building the Lexicon

This section details the procedure followed to build the lexicon.

3.1 Selecting the Lexemes

Our first objective is to build an inflected lexicon of Latin featuring all the possible inflected forms of verbs only. To this aim, we include all the verbal entries contained in Delatte et al.’s (1981) Dictionnaire fréquentiel et Index inverse de la langue latine (henceforth DFILL). This yields a total of 3,348 verbs. In rare cases, more than one entry of DFILL corresponds to one and the same lexeme in our resource. This happens because some verbs are lemmatized twice in DFILL. For instance, two different entries appear in DFILL for the verb verso, using as citation form both the first-person singular of the present active indicative verso and the corresponding morphologically passive form versor. This choice is likely to be motivated by the different semantics of the two verbs, with the first one meaning ‘to turn’ and the second one meaning ‘to remain’. However, in such cases our resource gives priority to collecting into one common inflectional paradigm all the forms that can be assigned to the same lexeme based on their morphological relatedness, rather than separating them into paradigms of different lexemes according to semantic criteria. Therefore, our lexicon includes only one lexeme verso, for which both active and passive forms are listed.

3.2 Generating the Forms

In order to fill all of the paradigm cells of the selected lexemes, we exploit the database of Lemlat (Passarotti et al., 2017). For each lexeme, the database of Lemlat contains a list of segments called LES, roughly corresponding to the stems that are used in different subparadigms, each with a corresponding CODLES that provides (among other things) information on the inflectional endings that can be attached to a LES. We make use of this information to generate the relevant forms.

To illustrate the details of the procedure, let us consider the verb rumpo ‘to break’. For this verb, the database of Lemlat features the LESs and CODLESs shown in Table 1.

LES        CODLES
rump       v3r
rumpisse   fe
rup        v7s
rupsit     fe
rupt       n41
rupt       n6p1
ruptur     n6p2

Table 1: the verb rumpo in Lemlat 3.0

The two LESs with CODLES “fe” (“forma eccezionale”, ‘exceptional form’) were discarded, since they are fully irregular forms that are stored as such. As for the other LESs, the one with CODLES “v3r” is used to fill all the cells of the present system, by adding the inflectional endings of the conjugation represented by the CODLES (i.e. the 3rd conjugation). Similarly, the LES with CODLES “v7s” is used to fill the cells of the perfect system. From the remaining LESs, some nominal forms built upon the so-called “third stem” (Aronoff, 1994) can be derived, namely the supine rupt-um and rupt-ū from the LES with CODLES “n41”, the perfect participle rupt-us, -a, -um from the LES with CODLES “n6p1” and the future participle ruptūr-us, -a, -um from the LES with CODLES “n6p2”.

Given this, our first step is to extract information on the LESs and CODLESs of each lexeme. Since Lemlat is a tool built to analyze rather than produce forms, it also contains several LESs occurring only in irregular and/or rare forms. To avoid the risk of overgeneration, we keep only one LES for each CODLES. The choice is based on lexicographical sources, namely Lewis and Short (1879) and Glare (1982). In these dictionaries, at the very beginning of each verbal entry there is a set of four “principal parts” (Bennett, 1908: 55), i.e. exemplary inflected forms from which the whole paradigm of the lexeme can be inferred. We keep only those LESs that correspond to such principal parts, excluding the ones that correspond to more marginal forms that do appear in dictionaries but are given less prominence in the entry. For instance, Lemlat includes two LESs with CODLES “v3r” for the verb dico ‘to say’: “dic” and “deic”. However, in both the lexicographical sources we use, the relevant principal parts are dico and dicere, corresponding to the first LES, while the second one is only mentioned later in the entries as an alternative form. Therefore, the LES selected for our resource is “dic”.

We also use the same dictionaries to manually annotate the vowel length of each LES. This is a necessary enhancement, because in Latin verb inflection there are homographic forms that can be distinguished only by vowel length, for instance PRS.ACT.IND.3SG fugit ‘(s)he flees’ vs. PRF.ACT.IND.3SG fūgit ‘(s)he fled’.

Following this process, we fill all the 254 paradigm cells of each of the 3,348 lexemes. However, because of Lemlat’s design, the same procedure could not be applied to some quite frequent verbs with a highly irregular inflectional paradigm, at least for the cells of the present system, which is where most of the irregularity in the inflectional endings of Latin verbs is found. For the verbs shown in Table 2 and for those derived from them by prefixation (e.g. abeo ‘to go away’ from the verb eo ‘to go’), although it was technically possible to adopt a similar approach by using more than one LES for a CODLES, it proved faster and more practical to manually record the correct forms as such.

Lemma    Meaning
aio      to say
eo       to go
fero     to bring
fio      to become
inquam   to say
malo     to prefer
nolo     not to want
possum   can
sum      to be
volo     to want

Table 2: irregular verbs

To each of the 850,392 generated paradigm cells, a unique lexeme identifier is assigned, which corresponds to the lemma used in Lemlat. In those rare cases where two or more verbs have the same lemma in Lemlat (although they inflect differently), a numeric diacritic is added to make the relevant distinction: for instance, we have volo1 ‘to fly’ and volo2 ‘to want’.

3.3 Frequency Data

Many forms included in the paradigm cells of our lexicon are never attested in Latin texts. In order to make it possible to distinguish between plausible but unattested forms and those that actually occur in texts, we enhance forms with information on their frequency. This information is taken from Tombeur’s (1998) Thesaurus Formarum Totius Latinitatis (henceforth TFTL), where each form is assigned the number of its occurrences in four different epochs, respectively called Antiquitas (from the origins to the end of the 2nd century A.D.), Aetas Patrum (2nd century-735 A.D.), Medium Aeuum (736-1499) and Recentior Latinitas (1500-1965).

By including the frequency of each form in the lexicon, we know how many of the 752,537³ forms recorded in the lexicon are never actually attested. Table 3 reports the relevant data⁴.

³ The 97,855 paradigm cells marked as #DEF# are excluded from this count.

⁴ In total, the TFTL includes 554,828 different forms, corresponding to 62,922,781 occurrences in the reference corpus used by the Thesaurus. Our lexicon contains 165,898 of these unique forms (forms appearing in more than one paradigm cell are counted only once), for a total of 18,261,179 occurrences. This means that our resource covers around 30% of the forms of the TFTL, in terms of both type and token frequency. In addition, it also contains several other forms that are not attested in the TFTL (245,623 unique forms).

TFTL epoch           unattested forms   (%)
Antiquitas           544,395            72.34%
Aetas Patrum         482,324            64.1%
Medium Aeuum         484,421            64.37%
Recentior Latinitas  640,552            85.12%
all epochs           401,690            53.38%

Table 3: not attested forms

It can be observed that a large share of the forms recorded in our lexicon are not attested, even in such a large corpus as the one the TFTL is based on. However, this is not surprising: recent large-scale corpus-based investigations (e.g. Bonami and Beniamine, 2016: 158 ff.) show that in languages with large inflectional paradigms, like the ones of Latin verbs, it is perfectly normal that many plausible forms do not appear, even in very large datasets, and the lexemes for which the full paradigm is attested are very few.

4 Discussion and Future Work

We described the design and building of a lexeme-based inflected lexicon consisting of 850,392 paradigm cells of 3,348 Latin verbs. Our first objective in the near future is to make the resource complete in terms of lexical coverage, including the lexemes of the other PoS. The lexicon is available for download as a .csv file at https://github.com/matteo-pellegrini/LatInfLexi.

We also plan to include phonetic annotation, by giving the IPA transcription of each form, which can be obtained semi-automatically by applying a script provided by the Classical Language Toolkit (Johnson et al., 2014-17) to stems and endings.

Another welcome addition would be to account for cases of overabundance, by allowing more than one form to appear in the same paradigm cell. However, to decide which cell-mates to keep and which ones to discard, their frequency in Latin texts should first be evaluated. In this respect, it has to be noted that the frequencies in the TFTL refer to bare surface forms, with no contextual disambiguation. For instance, the frequency of veniam comprises not only occurrences of both the PRS.ACT.SBJV.1SG and FUT.ACT.IND.1SG of the verb venio ‘to come’, but also of the ACC.SG of the noun venia ‘indulgence’.

To get an idea of the impact of morphological ambiguity on our lexicon, we analyzed all the generated forms with Lemlat (version 3.0). We found that Lemlat outputs only one analysis (i.e. one lemma and one set of morphological features) for only about 23% (170,735) of the 752,537 forms, the remaining 581,802 (about 77%) being ambiguous. This result weakens the reliability of the frequency data provided in the lexicon. Therefore, disambiguation is needed, although this would require very time-consuming work.

However, to tackle the problem of ambiguity, a first useful step is distinguishing between cases like veniam above, which can be analyzed as an inflected form of two different lemmas, and cases where the different analyses only refer to different forms of the same lemma, e.g. laudatis, which appears both in the PRS.ACT.IND.2PL and in the PRF.PTCP.DAT/ABL.PL of laudo ‘to praise’, but cannot be a form of other lemmas. We call these different types ‘exolemmatic’ and ‘endolemmatic’ ambiguity, respectively (cf. Passarotti and Ruffolo, 2004). Cases of exolemmatic ambiguity are clearly more problematic, but they are also much rarer: only 79,490 (about 10%) of the forms in our resource belong to this type. The great majority of ambiguous forms only give rise to endolemmatic ambiguity, as can be observed in Table 4 below, where the relevant data are summarized.

                         n.        %
unambiguous forms        170,735   22.69%
ambiguous forms          581,802   77.31%
only endolemmatic amb.   502,312   66.75%
exolemmatic amb.         79,490    10.56%

Table 4: the impact of ambiguity on frequency data

As far as endolemmatic ambiguity is concerned, although its quantitative impact is far greater, it could be considerably reduced in a principled manner. Indeed, it should be noted that in many cases this kind of ambiguity is due to systematic syncretism. For instance, the cells FUT.ACT.IMP.2SG and FUT.ACT.IMP.3SG are never unambiguously analyzed, because they are always identical for any given verb. Given the full systematicity of this syncretism, which holds for all lexemes, these cells could be considered as only one from a purely morphological point of view. Therefore, the problem of endolemmatic ambiguity could be at least reduced by adopting an approach based on “morphomic paradigms” (Boyé and Schalchli, 2016), where cells that are always syncretic are conflated, rather than on morphosyntactic paradigms. This would be especially helpful in nominal forms like participles and gerundives, where such cases of systematic syncretism are widespread.

Once such ambiguity issues have been resolved, it will also be possible to exploit the frequency data in a more systematic fashion, e.g. to perform diachronic investigations on how the frequency of specific (groups of) forms or paradigm cells changes across the four epochs considered, or to model Latin inflectional morphology in an even more realistic way, by also considering the token frequency of inflected forms, as has been recently proposed by Boyé (2016).

References

Farrell Ackerman, James P. Blevins, and Robert Malouf. 2009. Parts and wholes: Implicative patterns in inflectional paradigms. In James P. Blevins and Juliette Blevins, editors, Analogy in Grammar: Form and Acquisition. Oxford University Press, Oxford: 54–82.

Mark Aronoff. 1994. Morphology by itself: Stems and inflectional classes. MIT Press, Cambridge/London.

Sacha Beniamine. 2017. Un algorithme universel pour l'abstraction automatique d'alternances morphophonologiques. In 24e Conférence sur le Traitement Automatique des Langues Naturelles (TALN).

Charles Edwin Bennett. 1908. New Latin Grammar. Bolchazy-Carducci Publishers.

Olivier Bonami and Sacha Beniamine. 2016. Joint predictiveness in inflectional paradigms. Word Structure 9(2): 156–182.

Olivier Bonami and Gilles Boyé. 2014. De formes en thèmes. In Florence Villoing, Sophie David and Sarah Leroy, editors, Foisonnements morphologiques: Études en hommage à Françoise Kerleroux. Presses universitaires de Paris Ouest, Paris: 17–45.

Olivier Bonami, Gauthier Caron and Clément Plancq. 2014. Construction d’un lexique flexionnel phonétisé libre du français. In Franck Neveu, Peter Blumenthal, Linda Hriba, Annette Gerstenberg, Judith Meinschaefer and Sophie Prévost, editors, Actes du quatrième congrès mondial de linguistique française: 2583–2596.

Gilles Boyé. 2016. Pour une modélisation surfaciste de la flexion. Le cas de la conjugaison du français. In SHS Web of Conferences. Vol. 27. EDP Sciences.

Gilles Boyé and Gauvain Schalchli. 2016. The status of paradigms. In Andrew Hippisley and Gregory Stump, editors, The Cambridge Handbook of Morphology. Cambridge University Press, Cambridge: 206–234.

Basilio Calderone, Matteo Pascoli, Nabil Hathout and Franck Sajous. 2017. Hybrid method for stress prediction applied to GLAFF-IT, a large-scale Italian lexicon. In International Conference on Language, Data and Knowledge. Springer, Cham: 26–41.

Louis Delatte, Étienne Evrard, Suzanne Govaerts and Joseph Denooz. 1981. Dictionnaire fréquentiel et index inverse de la langue latine. L.A.S.L.A, Liège.

Peter G.W. Glare. 1982. Oxford Latin Dictionary. Oxford University Press, Oxford.

Nabil Hathout, Franck Sajous and Basilio Calderone. 2014. GLÀFF, a large versatile French lexicon. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14): 1007–1012.

Kyle P. Johnson et al. 2014-2017. CLTK: The Classical Language Toolkit. DOI 10.5281/zenodo.593336.

Charlton Lewis and Charles Short. 1879. A Latin Dictionary. Clarendon, Oxford.

Alexis Amid Neme. 2013. A fully inflected Arabic verb resource constructed from a lexicon of lemmas by using finite-state transducers. Revue RIST: revue de l’information scientifique et technique 20(2): 7–19.

Marco Passarotti, Marco Budassi, Eleonora Litta and Paolo Ruffolo. 2017. The Lemlat 3.0 Package for Morphological Analysis of Latin. In Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language: 24–31.

Marco Passarotti and Paolo Ruffolo. 2004. L’utilizzo del lemmatizzatore LEMLAT per una sistematizzazione dell’omografia in latino. EUPHROSYNE 32(A): 99–110.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2011. A universal part-of-speech tagset. arXiv:1104.2086.

Anna M. Thornton. 2012. Reduction and maintenance of overabundance. A case study on Italian verb paradigms. Word Structure 5(2): 183–207.

Paul Tombeur. 1998. Thesaurus formarum totius latinitatis a Plauto usque ad saeculum XXum. Brepols, Turnhout.

Eros Zanchetta and Marco Baroni. 2005. Morph-it!: a free corpus-based morphological resource for the Italian language.
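The cell counts given in Sections 3.2 and 3.3 and the percentages in Tables 3 and 4 follow from one another by simple arithmetic. The short Python sketch below re-derives them using only figures stated in the paper; the variable and function names are illustrative and are not part of the resource.

```python
# Figures taken from the paper; names are illustrative only.
CELLS_PER_LEXEME = 254
LEXEMES = 3_348
DEF_CELLS = 97_855  # cells marked #DEF# (footnote 3)

total_cells = CELLS_PER_LEXEME * LEXEMES   # all generated paradigm cells
filled_cells = total_cells - DEF_CELLS     # cells that actually hold a form

# Table 3: forms never attested in the TFTL, per epoch.
unattested = {
    "Antiquitas": 544_395,
    "Aetas Patrum": 482_324,
    "Medium Aeuum": 484_421,
    "Recentior Latinitas": 640_552,
    "all epochs": 401_690,
}

# Table 4: ambiguity of the generated forms according to Lemlat.
unambiguous = 170_735
endolemmatic_only = 502_312
exolemmatic = 79_490


def share(count: int, total: int) -> float:
    """Percentage of `count` over `total`, rounded to two decimals."""
    return round(100 * count / total, 2)


if __name__ == "__main__":
    print(total_cells, filled_cells)
    for epoch, count in unattested.items():
        print(epoch, share(count, filled_cells))
    print(share(unambiguous, filled_cells), share(exolemmatic, filled_cells))
```

The three ambiguity classes partition the filled cells, so their counts sum to the 752,537 forms used as the denominator throughout.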
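The generation step described in Section 3.2 (one stem per CODLES, combined with the endings that the CODLES licenses) can be pictured with a minimal Python sketch for rumpo. The ending tables below are a hand-written assumption covering only the present and perfect indicative active; they are not Lemlat's internal data, and the function name is hypothetical. The perfect stem is written with the long vowel (rūp-), as it would be after the manual vowel-length annotation.

```python
# Hypothetical ending tables for illustration; NOT Lemlat's internal data.
# Only the indicative active of the present and perfect is covered.
ENDINGS = {
    "v3r": {  # present system, 3rd conjugation
        "PRS.ACT.IND.1SG": "ō",
        "PRS.ACT.IND.2SG": "is",
        "PRS.ACT.IND.3SG": "it",
        "PRS.ACT.IND.1PL": "imus",
        "PRS.ACT.IND.2PL": "itis",
        "PRS.ACT.IND.3PL": "unt",
    },
    "v7s": {  # perfect system
        "PRF.ACT.IND.1SG": "ī",
        "PRF.ACT.IND.2SG": "istī",
        "PRF.ACT.IND.3SG": "it",
        "PRF.ACT.IND.1PL": "imus",
        "PRF.ACT.IND.2PL": "istis",
        "PRF.ACT.IND.3PL": "ērunt",
    },
}


def generate_forms(stems: dict[str, str]) -> dict[str, str]:
    """Attach to each stem (LES) the endings licensed by its CODLES."""
    forms = {}
    for codles, stem in stems.items():
        for cell, ending in ENDINGS[codles].items():
            forms[cell] = stem + ending
    return forms


# One LES per CODLES, as selected for the lexicon (cf. Table 1).
rumpo = {"v3r": "rump", "v7s": "rūp"}
forms = generate_forms(rumpo)

if __name__ == "__main__":
    for cell, form in forms.items():
        print(cell, form)
```

Keeping a single stem per CODLES is what prevents overgeneration in this scheme: a second "v3r" stem such as "deic" for dico would produce a parallel set of present-system forms.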
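The entropy-based estimation mentioned in the Introduction can be illustrated on a toy scale. The sketch below computes the conditional entropy of the pattern that relates a known cell to a target cell, given a class of the known form, with probabilities estimated from type frequency. It is a simplified illustration under stated assumptions: the eight verbs, the hand-assigned classes and the pattern labels are invented for the example, and this is not the algorithm implemented in Qumin, which infers patterns and classes automatically.

```python
from collections import Counter, defaultdict
from math import log2


def conditional_entropy(pairs):
    """H(pattern | class) in bits, from one (class, pattern) pair per lexeme."""
    by_class = defaultdict(Counter)
    for cls, pattern in pairs:
        by_class[cls][pattern] += 1
    total = sum(sum(counts.values()) for counts in by_class.values())
    entropy = 0.0
    for counts in by_class.values():
        n = sum(counts.values())
        # Entropy of the pattern distribution inside this class.
        h_class = -sum((k / n) * log2(k / n) for k in counts.values())
        # Weight each class by its share of lexemes (type frequency).
        entropy += (n / total) * h_class
    return entropy


# Toy data: guessing PRS.ACT.IND.2SG from PRS.ACT.IND.1SG.
# Classes and pattern labels are hand-assigned assumptions.
toy = [
    ("-Cō", "ō→ās"),   # amō → amās
    ("-Cō", "ō→ās"),   # laudō → laudās
    ("-Cō", "ō→is"),   # regō → regis
    ("-Cō", "ō→is"),   # rumpō → rumpis
    ("-eō", "eō→ēs"),  # moneō → monēs
    ("-eō", "eō→ēs"),  # habeō → habēs
    ("-iō", "iō→īs"),  # audiō → audīs
    ("-iō", "iō→is"),  # capiō → capis
]

h = conditional_entropy(toy)

if __name__ == "__main__":
    print(f"H(2SG | 1SG) = {h:.2f} bits")
```

In this toy sample a form in -eō fully determines its second person singular, while forms in -Cō and -iō each leave two equally likely options, so the uncertainty averages out to 0.75 bits. A real estimate needs the full lexicon, which is the motivation for the resource.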
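The distinction between exolemmatic and endolemmatic ambiguity drawn in Section 4 amounts to a simple check on the set of lemmas among the analyses of a surface form. A minimal Python sketch follows; representing an analysis as a (lemma, features) pair is an assumption made for illustration and does not reproduce Lemlat's actual output format.

```python
def classify_ambiguity(analyses):
    """Classify a surface form from its list of (lemma, features) analyses."""
    if len(analyses) <= 1:
        return "unambiguous"
    lemmas = {lemma for lemma, _features in analyses}
    # More than one lemma: the form belongs to different lexemes.
    if len(lemmas) > 1:
        return "exolemmatic"
    # Several analyses, all of the same lemma.
    return "endolemmatic"


# The two examples discussed in the text.
veniam = [
    ("venio", "PRS.ACT.SBJV.1SG"),
    ("venio", "FUT.ACT.IND.1SG"),
    ("venia", "ACC.SG"),
]
laudatis = [
    ("laudo", "PRS.ACT.IND.2PL"),
    ("laudo", "PRF.PTCP.DAT/ABL.PL"),
]

if __name__ == "__main__":
    print("veniam:", classify_ambiguity(veniam))
    print("laudatis:", classify_ambiguity(laudatis))
```

A form such as veniam counts as exolemmatic even though two of its three analyses share a lemma, because a single competing lemma is enough to make its TFTL frequency unreliable for the verb.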