LatInfLexi: an Inflected Lexicon of Latin Verbs


Matteo Pellegrini
Università di Bergamo/Pavia
Piazza Rosate, 2 – 24129 Bergamo, Italy
matteo.pellegrini@unibg.it

Marco Passarotti
CIRCSE Research Centre, Università Cattolica del Sacro Cuore
Largo Gemelli, 1 – 20123 Milan, Italy
marco.passarotti@unicatt.it


Abstract

We present a paradigm-based inflected lexicon of Latin verbs built to provide
empirical evidence supporting an entropy-based estimation of the degree of
uncertainty in inflectional paradigms. The lexicon contains information on the
inflected forms that occupy the 254 morphologically possible paradigm cells of
3,348 verbal lexemes extracted from a frequency lexicon of Latin. The resource
also includes annotation of vowel length and the frequency of each form in
different epochs.

1    Introduction

In this paper, we describe the construction of LatInfLexi, an inflected lexicon
of Latin verbs organized in lexemes¹ and paradigm cells.

¹ The term “lexeme” is used for the abstract theoretical concept normally
  adopted in morphology and lexicology, while “lemma” refers to the concrete
  citation form representing an entry in dictionaries. Since we aim at a
  resource suitable for theoretical inquiries, we use the first term as a label
  in our resource.

In morphological theory, there is a recent trend towards a more realistic
modelling of complex inflectional systems: for instance, Ackerman et al. (2009)
and Bonami and Boyé (2014) propose that the analysis should take a full
inflected form as a starting point, without assuming any segmentation a priori.
In such approaches, what is investigated is not the construction of forms from
smaller units like stems and inflectional endings, but rather their
predictability given knowledge of other forms. This can be done by using the
information-theoretic notion of conditional entropy to estimate the uncertainty
in guessing the content of the paradigm cell of a lexeme knowing another
inflected form of the same lexeme, by weighting the probability of application
of each inflectional pattern based on its type frequency in real data.

To do so, large-scale inflected lexicons listing all forms of a representative
selection of lexemes are needed. Such resources are increasingly being
developed for modern languages: see, among others, Zanchetta and Baroni (2005)
and Calderone et al. (2017) for Italian, Neme (2013) for Arabic, Bonami et al.
(2014) and Hathout et al. (2014) for French. However, to the best of our
knowledge, there are no resources of this kind for Latin, although their
(semi-)automatic building is made possible by the current availability of
several morphological analyzers for Latin, including Words
(http://archives.nd.edu/words.html), Lemlat (www.lemlat3.eu), Morpheus
(https://github.com/tmallon/morpheus), the PROIEL Latin morphology system
(https://github.com/mlj/proiel-webapp/tree/master/lib/morphology) and LatMor
(http://cistern.cis.lmu.de). Our resource was created to fill this gap and to
enable a quantitative, entropy-based analysis of Latin verb inflection.

2    Design

A distinctive feature of our inflected lexicon is that it is based on lexemes
and paradigm cells, rather than on forms. This means that, for each lexeme,
every morphologically possible paradigm cell is filled with a form, not only
the cells whose forms are actually attested in Latin texts. In this respect,
our resource is similar to other recently developed inflected lexicons, such as
Flexique for French (Bonami et al., 2014).

For each paradigm cell, the following information is provided:
(i)      the inflected form that occupies the paradigm cell;
(ii)     a univocal identifier of the lexeme to which it belongs;
(iii)    the set of its morphological features;
(iv)     information on the frequency of the form in different epochs.

As for (i), it should be noted that there is never more than one form per
paradigm cell. In cases of overabundance (i.e. cells that are filled by more
than one form, cf. Thornton, 2012), a choice was made to decide which
“cell-mate” (Thornton, 2012: 183) should be kept, and which one discarded.

On the other hand, in some cases a paradigm cell could be empty, either because
it is defective – like for instance the passive cells of intransitive verbs –
or because it is not filled by a synthetic form, but is rather expressed
analytically, by means of a phrase – like for instance, in Latin, the
perfective cells of deponent verbs, for which the periphrasis PRF.PTCP² + AUX
esse ‘to be’ is used (e.g. PRF.IND.1SG hortātus sum ‘I incited’). In both
cases, the cell is marked as #DEF# in the resource. This convention is also
adopted in Flexique (Bonami et al., 2014: 2585), and it fits the requirements
of the Qumin package for entropy calculations on the predictability of
implicative relations between inflected forms (Bonami and Beniamine, 2016;
Beniamine, 2017).

² Throughout the paper, we will refer to grammatical features by using the
  standard abbreviations of the Leipzig Glossing Rules.

As for (ii), the identifier corresponds to the citation form of the lexeme,
almost always the first-person singular of the present indicative, following
the Latin lexicographical and didactic tradition. A diacritic is added in those
rare cases where different verbs have the same citation form (see infra, §3.2).

Regarding (iii), we use the PoS-tags of the Universal Part-of-Speech Tagset by
Petrov et al. (2011) and the morphological features used in Universal
Dependencies (http://universaldependencies.org/u/feat/index.html).

Lastly, the frequency data in (iv) are taken from Tombeur’s (1998) Thesaurus
Formarum Totius Latinitatis (see infra, §3.3).

3    Building the Lexicon

This section details the procedure followed to build the lexicon.

3.1   Selecting the Lexemes

Our first objective is to build an inflected lexicon of Latin featuring all the
possible inflected forms of verbs only. To this aim, we include all the verbal
entries contained in Delatte et al.’s (1981) Dictionnaire fréquentiel et Index
inverse de la langue latine (henceforth DFILL). This yields a total of 3,348
verbs. In rare cases, more than one entry of DFILL corresponds to one and the
same lexeme in our resource. This happens because some verbs are lemmatized
twice in DFILL. For instance, for the verb verso two different entries appear
in DFILL, using as citation form both the first-person singular of the present
active indicative verso and the corresponding morphologically passive form
versor. This choice is likely to be motivated by the different semantics of the
two verbs, with the first one meaning ‘to turn’ and the second one meaning ‘to
remain’. However, in such cases our resource gives priority to collecting into
one common inflectional paradigm all the forms that can be assigned to the same
lexeme based on their morphological relatedness, rather than separating them
into paradigms of different lexemes according to semantic criteria. Therefore,
our lexicon includes only one lexeme verso, for which both active and passive
forms are listed.
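
The cell-based organization described in Section 2 can be illustrated with a
minimal sketch. The field names, the feature labels and the toy rows below are
assumptions made for illustration only and do not reproduce the actual layout
of the distributed .csv file; only the #DEF# marker and the counts (3,348
lexemes, 254 cells per lexeme, 97,855 cells marked as #DEF#) are taken from the
paper.

```python
from dataclasses import dataclass

# Marker used in the resource for defective or periphrastic cells.
DEFECTIVE = "#DEF#"


@dataclass(frozen=True)
class Cell:
    """One paradigm cell (hypothetical record layout, for illustration)."""
    lexeme: str    # univocal lexeme identifier, i.e. the citation form
    features: str  # morphological features, written here as a gloss label
    form: str      # the single form kept for the cell, or DEFECTIVE


def is_filled(cell: Cell) -> bool:
    """A cell counts as filled when it holds a synthetic form."""
    return cell.form != DEFECTIVE


# Toy rows: a deponent verb whose perfect is periphrastic (hortatus sum),
# and the single lexeme 'verso' that collects active and passive forms.
cells = [
    Cell("hortor", "PRS.IND.1SG", "hortor"),
    Cell("hortor", "PRF.IND.1SG", DEFECTIVE),
    Cell("verso", "PRS.ACT.IND.1SG", "verso"),
    Cell("verso", "PRS.PASS.IND.1SG", "versor"),
]
filled = [c for c in cells if is_filled(c)]

# Counts reported in the paper.
N_LEXEMES, N_CELLS_PER_LEXEME, N_DEF = 3_348, 254, 97_855
total_cells = N_LEXEMES * N_CELLS_PER_LEXEME  # all paradigm cells
filled_total = total_cells - N_DEF            # cells holding a form

print(len(filled), total_cells, filled_total)
```

Keeping exactly one record per lexeme and cell mirrors the design choice of
never storing more than one form per cell, so that overabundant cells have to
be resolved before a record is written.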
3.2   Generating the Forms

In order to fill all of the paradigm cells of the selected lexemes, we exploit
the database of Lemlat (Passarotti et al., 2017). For each lexeme, the database
of Lemlat contains a list of segments called LES – roughly corresponding to the
stems that are used in different subparadigms – each with a corresponding
CODLES that provides (among other things) information on the inflectional
endings that can be attached to a LES. We make use of this information to
generate the relevant forms.

To illustrate the details of the procedure, let us consider the verb rumpo ‘to
break’. For this verb, the database of Lemlat features the LESs and CODLESs
shown in Table 1.

              LES         CODLES
              rump        v3r
              rumpisse    fe
              rup         v7s
              rupsit      fe
              rupt        n41
              rupt        n6p1
              ruptur      n6p2

      Table 1: the verb rumpo in Lemlat 3.0

The two LESs with CODLES “fe” (“forma eccezionale”, ‘exceptional form’) were
discarded, since they are full irregular forms that are stored as such. As for
the other LESs, the one with CODLES “v3r” is used to fill all the cells of the
present system, by adding the inflectional endings of the conjugation
represented by the CODLES (i.e. the 3rd conjugation). Similarly, the LES with
CODLES “v7s” is used to fill the cells of the perfect system. From the
remaining LESs, some nominal forms built upon the so-called “third stem”
(Aronoff, 1994) can be derived, namely the supine rupt-um and rupt-ū from the
LES with CODLES “n41”, the perfect participle rupt-us, -a, -um from the LES
with CODLES “n6p1” and the future participle ruptūr-us, -a, -um from the LES
with CODLES “n6p2”.

Given this, our first step is to extract information on the LESs and CODLESs of
each lexeme. Since Lemlat is a tool built to analyze rather than produce forms,
it also contains several LESs occurring only in irregular and/or rare forms. To
avoid the risk of overgeneration, we choose and keep only one LES for each
CODLES. The choice is based on lexicographical sources, namely Lewis and Short
(1879) and Glare (1982). In these dictionaries, at the very beginning of each
verbal entry there is a set of four “principal parts” (Bennett, 1908: 55), i.e.
exemplary inflected forms from which the whole paradigm of the lexeme can be
inferred. We keep only those LESs that correspond to such principal parts,
excluding the ones that correspond to more marginal forms that do appear in
dictionaries but are given less prominence in the entry. For instance, Lemlat
includes two LESs with CODLES “v3r” for the verb dico ‘to say’: “dic” and
“deic”. However, in both the lexicographical sources we use, the relevant
principal parts are dico and dicere, corresponding to the first LES, while the
second one is only mentioned later in the entries as an alternative form.
Therefore, the LES selected for our resource is “dic”.

We also use the same dictionaries to manually annotate the vowel length for
each LES. This is a necessary enhancement, because in Latin verb inflection
there are homographic forms that can be distinguished only on that basis, like
for instance PRS.ACT.IND.3SG fugit ‘(s)he flees’ vs. PRF.ACT.IND.3SG fūgit
‘(s)he fled’.

Following this process, we fill all the 254 paradigm cells of each of the 3,348
lexemes. However, because of Lemlat’s design, for some quite frequent verbs
with a highly irregular inflectional paradigm, it was not possible to apply the
same procedure, at least for the cells of the present system, which is where
most of the irregularity in the inflectional endings of Latin verbs is found.
For the verbs shown in Table 2 and for those derived from them by prefixation
(e.g. abeo ‘to go away’ from the verb eo ‘to go’), although it was technically
possible to adopt a similar approach by using more than one LES for a CODLES,
it proved to be faster and more practical to manually record the correct forms
as such.

              Lemma       Meaning
              aio         to say
              eo          to go
              fero        to bring
              fio         to become
              inquam      to say
              malo        to prefer
              nolo        not to want
              possum      can
              sum         to be
              volo        to want

                Table 2: irregular verbs

To each of the 850,392 generated paradigm cells, a univocal lexeme identifier
is assigned,
which corresponds to the lemma used in Lemlat. In those rare cases where two or
more verbs have the same lemma in Lemlat (although they inflect differently), a
numeric diacritic is added to make the relevant distinction: for instance, we
have volo1 ‘to fly’ and volo2 ‘to want’.

3.3   Frequency Data

Many forms included in the paradigm cells of our lexicon are never attested in
Latin texts. In order to make it possible to distinguish between plausible but
unattested forms and those actually occurring in texts, we enhance forms with
information on their frequency. This information is taken from Tombeur’s (1998)
Thesaurus Formarum Totius Latinitatis (henceforth TFTL), where each form is
assigned the number of its occurrences in four different epochs, respectively
called Antiquitas (from the origins to the end of the 2nd century A.D.), Aetas
Patrum (2nd century-735 A.D.), Medium Aeuum (736-1499) and Recentior Latinitas
(1500-1965).

By including the frequency of each form in the lexicon, we know how many of the
752,537³ forms recorded in the lexicon are never actually attested. Table 3
reports the relevant data⁴.

      TFTL epoch                   unattested forms (%)
      Antiquitas                   544,395 (72.34%)
      Aetas Patrum                 482,324 (64.09%)
      Medium Aeuum                 484,421 (64.37%)
      Recentior Latinitas          640,552 (85.12%)
      all epochs                   401,690 (53.38%)

             Table 3: unattested forms per TFTL epoch

It can be observed that a large share of the forms recorded in our lexicon are
not attested (more than half, even when all epochs are considered together),
despite the size of the corpus the TFTL is based on. However, this is not
surprising: recent large-scale corpus-based investigations (e.g. Bonami and
Beniamine, 2016: 158 ff.) show that in languages with large inflectional
paradigms – like the ones of Latin verbs – it is perfectly normal that many
plausible forms do not appear, even in very large datasets, and the lexemes for
which the full paradigm is attested are very few.

³ The 97,855 paradigm cells marked as #DEF# are excluded from this count.

⁴ In total, the TFTL includes 554,828 different forms, corresponding to
  62,922,781 occurrences in the reference corpus used by the Thesaurus. Our
  lexicon contains 165,898 of these unique forms (forms appearing in more than
  one paradigm cell are counted only once), for a total of 18,261,179
  occurrences. This means that our resource covers around 30% of the forms of
  the TFTL, in terms of both type and token frequency. In addition, it also
  contains several other forms that are not attested in the TFTL (245,623
  unique forms).

4    Discussion and Future Work

We described the design and building of a lexeme-based inflected lexicon
consisting of 850,392 paradigm cells of 3,348 Latin verbs. Our first objective
in the near future is to make the resource complete in terms of lexical
coverage, including the lexemes of the other PoS. The lexicon is available for
download as a .csv file at https://github.com/matteo-pellegrini/LatInfLexi.

We also plan to include phonetic annotation, by giving the IPA transcription of
each form, which can be obtained semi-automatically by applying a script
provided by the Classical Language Toolkit (Johnson et al., 2014-17) to stems
and endings.

Another welcome addition would be to account for cases of overabundance, by
allowing more than one form to appear in the same paradigm cell. However, to
decide which cell-mates to keep and which ones to discard, their frequency in
Latin texts should first be evaluated. In this respect, it has to be noted that
the frequencies in the TFTL refer to bare surface forms, with no contextual
disambiguation. For instance, the frequency of veniam comprises not only
occurrences of both the PRS.ACT.SBJV.1SG and FUT.ACT.IND.1SG of the verb venio
‘to come’, but also of the ACC.SG of the noun venia ‘indulgence’.

To get an idea of the impact of morphological ambiguity on our lexicon, we
analyzed all the generated forms with Lemlat (version 3.0). We found that
Lemlat outputs only one analysis (i.e. one lemma and one set of morphological
features) for only about 23% (170,735) of the 752,537 forms, the remaining
581,802 (about 77%) being ambiguous. This result weakens the reliability of the
frequency data provided in the lexicon. Therefore, disambiguation is needed,
although this would require very time-consuming work.

However, to tackle the problem of ambiguity, a first useful step is
distinguishing between cases like veniam above, which can be analyzed as an
inflected form of two different lemmas, and cases where the different analyses
only refer to different forms of the same lemma, e.g. laudatis, which appears
both in the PRS.ACT.IND.2PL and in
the PRF.PTCP.DAT/ABL.PL of laudo ‘to praise’, but cannot be a form of other
lemmas. We call these different types ‘exolemmatic’ and ‘endolemmatic’
ambiguity, respectively (cf. Passarotti and Ruffolo, 2004). Cases of
exolemmatic ambiguity are clearly more problematic, but they are also much
rarer: only 79,490 (about 10.6%) of the forms in our resource belong to this
type. The great majority of ambiguous forms only give rise to endolemmatic
ambiguity, as can be observed in Table 4 below, where the relevant data are
summarized.

                                    n.         %
      unambiguous forms             170,735    22.69%
      ambiguous forms               581,802    77.31%
       only endolemmatic amb.       502,312    66.75%
       exolemmatic amb.             79,490     10.56%

      Table 4: the impact of ambiguity on frequency data

As far as endolemmatic ambiguity is concerned, although its quantitative impact
is far greater, it could be considerably reduced in a principled manner.
Indeed, it should be noted that in many cases this kind of ambiguity is due to
systematic syncretism. For instance, the cells FUT.ACT.IMP.2SG and
FUT.ACT.IMP.3SG are never unambiguously analyzed, because they are always
identical for any given verb. Given the full systematicity of this syncretism,
which holds for all lexemes, these cells could be considered as only one from a
purely morphological point of view. Therefore, the problem of endolemmatic
ambiguity could be at least reduced by adopting an approach based on “morphomic
paradigms” (Boyé and Schalchli, 2016), where cells that are always syncretic
are conflated, rather than on morphosyntactic paradigms. This would be helpful
especially in nominal forms like participles and gerundives, where such cases
of systematic syncretism are widespread.

Once such ambiguity issues have been resolved, it will also be possible to
exploit the frequency data in a more systematic fashion, e.g. to perform
diachronic investigations on how the frequency of specific (groups of) forms or
paradigm cells changes across the four considered epochs, or to model Latin
inflectional morphology in an even more realistic way, by also considering the
token frequency of inflected forms, as has been recently proposed by Boyé
(2016).

References

Farrell Ackerman, James P. Blevins, and Robert Malouf. 2009. Parts and wholes:
  Implicative patterns in inflectional paradigms. In James P. Blevins and
  Juliette Blevins, editors, Analogy in Grammar: Form and Acquisition. Oxford
  University Press, Oxford: 54–82.
Mark Aronoff. 1994. Morphology by itself: Stems and inflectional classes. MIT
  Press, Cambridge/London.
Sacha Beniamine. 2017. Un algorithme universel pour l'abstraction automatique
  d'alternances morphophonologiques. In 24e Conférence sur le Traitement
  Automatique des Langues Naturelles (TALN).
Charles Edwin Bennett. 1908. New Latin Grammar. Bolchazy-Carducci Publishers.
Olivier Bonami and Sacha Beniamine. 2016. Joint predictiveness in inflectional
  paradigms. Word Structure 9(2): 156–182.
Olivier Bonami and Gilles Boyé. 2014. De formes en thèmes. In Florence
  Villoing, Sophie David and Sarah Leroy, editors, Foisonnements
  morphologiques: Études en hommage à Françoise Kerleroux. Presses
  universitaires de Paris Ouest, Paris: 17–45.
Olivier Bonami, Gauthier Caron and Clément Plancq. 2014. Construction d’un
  lexique flexionnel phonétisé libre du français. In Franck Neveu, Peter
  Blumenthal, Linda Hriba, Annette Gerstenberg, Judith Meinschaefer and Sophie
  Prévost, editors, Actes du quatrième congrès mondial de linguistique
  française: 2583–2596.
Gilles Boyé. 2016. Pour une modélisation surfaciste de la flexion. Le cas de la
  conjugaison du français. In SHS Web of Conferences. Vol. 27. EDP Sciences.
Gilles Boyé and Gauvain Schalchli. 2016. The status of paradigms. In Andrew
  Hippisley and Gregory Stump, editors, The Cambridge Handbook of Morphology.
  Cambridge University Press, Cambridge: 206–234.
Basilio Calderone, Matteo Pascoli, Nabil Hathout and Franck Sajous. 2017.
  Hybrid method for stress prediction applied to GLAFF-IT, a large-scale
  Italian lexicon. In International Conference on Language, Data and
  Knowledge. Springer, Cham: 26–41.
Louis Delatte, Étienne Evrard, Suzanne Govaerts and Joseph Denooz. 1981.
  Dictionnaire fréquentiel et index inverse de la langue latine. L.A.S.L.A.,
  Liège.
Peter G.W. Glare. 1982. Oxford Latin Dictionary. Oxford University Press,
  Oxford.
Nabil Hathout, Franck Sajous and Basilio Calderone.
  2014. GLÀFF, a large versatile French lexicon. In
  Proceedings of the Ninth International Conference
  on Language Resources and Evaluation
  (LREC’14): 1007–1012.
Kyle P. Johnson et al. 2014-2017. CLTK: The Classical Language Toolkit. DOI
  10.5281/zenodo.593336.
Charlton Lewis and Charles Short. 1879. A Latin Dic-
  tionary. Clarendon, Oxford.
Alexis Amid Neme. 2013. A fully inflected Arabic
  verb resource constructed from a lexicon of lem-
  mas by using finite-state transducers. Revue RIST:
  revue de l’information scientifique et technique
  20(2): 7–19.
Marco Passarotti, Marco Budassi, Eleonora Litta and
  Paolo Ruffolo. 2017. The Lemlat 3.0 Package for
  Morphological Analysis of Latin. In Proceedings
  of the NoDaLiDa 2017 Workshop on Processing
  Historical Language: 24–31.
Marco Passarotti and Paolo Ruffolo. 2004. L’utilizzo
  del lemmatizzatore LEMLAT per una sistema-
  tizzazione dell’omografia in latino. EUPHROSYNE
  32(A): 99–110.
Slav Petrov, Dipanjan Das, and Ryan McDonald.
  2011. A universal part-of-speech tagset.
  arXiv:1104.2086.
Anna M. Thornton. 2012. Reduction and maintenance
  of overabundance. A case study on Italian verb
  paradigms. Word Structure 5(2): 183–207.
Paul Tombeur. 1998. Thesaurus formarum totius la-
  tinitatis a Plauto usque ad saeculum XXum.
  Brepols, Turnhout.
Eros Zanchetta and Marco Baroni. 2005. Morph-it!: a
  free corpus-based morphological resource for the
  Italian language.