=Paper=
{{Paper
|id=Vol-2253/paper23
|storemode=property
|title=LatInfLexi: an Inflected Lexicon of Latin Verbs
|pdfUrl=https://ceur-ws.org/Vol-2253/paper23.pdf
|volume=Vol-2253
|authors=Matteo Pellegrini,Marco Passarotti
|dblpUrl=https://dblp.org/rec/conf/clic-it/PellegriniP18
}}
==LatInfLexi: an Inflected Lexicon of Latin Verbs==
Matteo Pellegrini, Università di Bergamo/Pavia, Piazza Rosate, 2 – 24129 Bergamo, Italy, matteo.pellegrini@unibg.it

Marco Passarotti, CIRCSE Research Centre, Università Cattolica del Sacro Cuore, Largo Gemelli, 1 – 20123 Milan, Italy, marco.passarotti@unicatt.it

===Abstract===
We present a paradigm-based inflected lexicon of Latin verbs built to provide empirical evidence supporting an entropy-based estimation of the degree of uncertainty in inflectional paradigms. The lexicon contains information on the inflected forms that occupy the 254 morphologically possible paradigm cells of 3,348 verbal lexemes extracted from a frequency lexicon of Latin. The resource also includes annotation of vowel length and the frequency of each form in different epochs.

===1 Introduction===
In this paper, we describe the construction of LatInfLexi, an inflected lexicon of Latin verbs organized in lexemes<sup>1</sup> and paradigm cells.

<sup>1</sup> The term “lexeme” is used for the abstract theoretical concept normally adopted in morphology and lexicology, while “lemma” refers to the concrete citation form representing an entry in dictionaries. Since we aim at a resource suitable for theoretical inquiries, we use the first term as a label in our resource.

In morphological theory, there is a recent trend towards a more realistic modelling of complex inflectional systems: for instance, Ackerman et al. (2009) and Bonami and Boyé (2014) propose that the analysis should take a full inflected form as a starting point, without assuming any segmentation a priori. In such approaches, what is investigated is not the construction of forms from smaller units like stems and inflectional endings, but rather their predictability given knowledge of other forms. This can be done by using the information-theoretic notion of conditional entropy to estimate the uncertainty in guessing the content of a paradigm cell of a lexeme when another inflected form of the same lexeme is known, weighting the probability of application of each inflectional pattern by its type frequency in real data.

To do so, large-scale inflected lexicons listing all forms of a representative selection of lexemes are needed. Such resources are increasingly being developed for modern languages: see, among others, Zanchetta and Baroni (2005) and Calderone et al. (2017) for Italian, Neme (2013) for Arabic, Bonami et al. (2014) and Hathout et al. (2014) for French. However, to the best of our knowledge, there are no resources of this kind for Latin, although their (semi-)automatic building is made possible by the current availability of several morphological analyzers for Latin, including Words (http://archives.nd.edu/words.html), Lemlat (www.lemlat3.eu), Morpheus (https://github.com/tmallon/morpheus), the PROIEL Latin morphology system (https://github.com/mlj/proiel-webapp/tree/master/lib/morphology) and LatMor (http://cistern.cis.lmu.de). Our resource was created to fill this gap and to enable a quantitative, entropy-based analysis of Latin verb inflection.

===2 Design===
A distinctive feature of our inflected lexicon is that it is based on lexemes and paradigm cells, rather than on forms. This means that, for each lexeme, all the morphologically possible paradigm cells are filled with a form, not only those whose forms are actually attested in Latin texts. In this respect, our resource is similar to other recently developed inflected lexicons, such as Flexique for French (Bonami et al., 2014).

For each paradigm cell, the following information is provided:
* (i) the inflected form that occupies the paradigm cell;
* (ii) a univocal identifier of the lexeme to which it belongs;
* (iii) the set of its morphological features;
* (iv) information on the frequency of the form in different epochs.

As for (i), it should be noted that there is never more than one form per paradigm cell. In cases of overabundance (i.e. cells that are filled by more than one form, cf. Thornton, 2012), a choice was made as to which “cell-mate” (Thornton, 2012: 183) should be kept and which one discarded.

On the other hand, in some cases a paradigm cell can be empty, either because it is defective (for instance, the passive cells of intransitive verbs) or because it is not filled by a synthetic form but is expressed analytically, by means of a phrase (for instance, in Latin, the perfective cells of deponent verbs, for which the periphrasis PRF.PTCP<sup>2</sup> + AUX esse ‘to be’ is used, e.g. PRF.IND.1SG hortātus sum ‘I incited’). In both cases, the cell is marked as #DEF# in the resource. This convention is also adopted in Flexique (Bonami et al., 2014: 2585), and it fits the requirements of the Qumin package for entropy calculations on the predictability of implicative relations between inflected forms (Bonami and Beniamine, 2016; Beniamine, 2017).

<sup>2</sup> Throughout the paper, we refer to grammatical features by using the standard abbreviations of the Leipzig Glossing Rules.

As for (ii), the identifier corresponds to the citation form of the lexeme, almost always the first-person singular of the present indicative, following the Latin lexicographical and didactic tradition. A diacritic is added in those rare cases where different verbs have the same citation form (see §3.2).

Regarding (iii), we use the PoS-tags of the Universal Part-of-Speech Tagset by Petrov et al. (2011) and the morphological features used in Universal Dependencies (http://universaldependencies.org/u/feat/index.html).

Lastly, the frequency data in (iv) are taken from Tombeur’s (1998) Thesaurus Formarum Totius Latinitatis (see §3.3).

===3 Building the Lexicon===
This section details the procedure followed to build the lexicon.

====3.1 Selecting the Lexemes====
Our first objective is to build an inflected lexicon of Latin featuring all the possible inflected forms of verbs only. To this end, we include all the verbal entries contained in Delatte et al.’s (1981) Dictionnaire fréquentiel et Index inverse de la langue latine (henceforth DFILL). This yields a total of 3,348 verbs. In rare cases, more than one entry of DFILL corresponds to one and the same lexeme in our resource. This happens because some verbs are lemmatized twice in DFILL. For instance, for the verb verso two different entries appear in DFILL, using as citation form both the first-person singular of the present active indicative verso and the corresponding morphologically passive form versor. This choice is likely to be motivated by the different semantics of the two verbs, with the first one meaning ‘to turn’ and the second one meaning ‘to remain’. However, in such cases our resource gives priority to collecting into one common inflectional paradigm all the forms that can be assigned to the same lexeme based on their morphological relatedness, rather than separating them into paradigms of different lexemes according to semantic criteria. Therefore, our lexicon includes only one lexeme verso, for which both active and passive forms are listed.
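As a rough illustration of the entropy-based use case mentioned in the Introduction, and of how the #DEF# convention from Section 2 interacts with it, the following minimal sketch estimates the conditional entropy of one paradigm cell given another from type frequencies. It is a deliberate simplification and not the algorithm implemented in Qumin or used by the authors: alternation patterns are assumed to be plain suffix pairs obtained by stripping the longest common prefix, the toy lexicon and its dictionary layout are hypothetical, and vowel length is ignored.

```python
from collections import Counter, defaultdict
from math import log2

DEF = "#DEF#"  # marker for defective or periphrastic cells, as in the resource

# Hypothetical toy data: lexeme -> {cell label: form}. This layout is an
# assumption made for illustration, not the schema of the distributed CSV.
LEXICON = {
    "amo":    {"PRS.ACT.IND.1SG": "amo",    "PRS.ACT.INF": "amare"},
    "laudo":  {"PRS.ACT.IND.1SG": "laudo",  "PRS.ACT.INF": "laudare"},
    "rego":   {"PRS.ACT.IND.1SG": "rego",   "PRS.ACT.INF": "regere"},
    "rumpo":  {"PRS.ACT.IND.1SG": "rumpo",  "PRS.ACT.INF": "rumpere"},
    "moneo":  {"PRS.ACT.IND.1SG": "moneo",  "PRS.ACT.INF": "monere"},
    "audio":  {"PRS.ACT.IND.1SG": "audio",  "PRS.ACT.INF": "audire"},
    "capio":  {"PRS.ACT.IND.1SG": "capio",  "PRS.ACT.INF": "capere"},
    "inquam": {"PRS.ACT.IND.1SG": "inquam", "PRS.ACT.INF": DEF},
}


def pattern(a, b):
    """Suffix alternation left after stripping the longest common prefix."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return (a[i:], b[i:])


def conditional_entropy(lexicon, cell_a, cell_b):
    """Estimate H(cell_b | cell_a) in bits, skipping pairs with a #DEF# cell."""
    pairs = [(forms[cell_a], forms[cell_b]) for forms in lexicon.values()
             if DEF not in (forms[cell_a], forms[cell_b])]
    patterns = {pattern(a, b) for a, b in pairs}

    # Group lexemes by the set of patterns whose left side matches form A:
    # within a group, the shape of form A alone cannot tell the patterns apart.
    by_class = defaultdict(Counter)
    for a, b in pairs:
        applicable = frozenset(p for p in patterns if a.endswith(p[0]))
        by_class[applicable][pattern(a, b)] += 1

    total = len(pairs)
    entropy = 0.0
    for counts in by_class.values():
        n = sum(counts.values())
        h_class = -sum(c / n * log2(c / n) for c in counts.values())
        entropy += n / total * h_class  # weight each class by its type frequency
    return entropy


if __name__ == "__main__":
    h = conditional_entropy(LEXICON, "PRS.ACT.IND.1SG", "PRS.ACT.INF")
    print(f"H(PRS.ACT.INF | PRS.ACT.IND.1SG) = {h:.4f} bits")
```

In this sketch, a lexeme with a #DEF# cell in either position simply contributes no pair, which is one plausible way of handling the marker. A realistic computation would need all 3,348 lexemes, vowel-length-aware forms, and a more careful pattern inference step.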
====3.2 Generating the Forms====
In order to fill all of the paradigm cells of the selected lexemes, we exploit the database of Lemlat (Passarotti et al., 2017). For each lexeme, the database of Lemlat contains a list of segments called LES, roughly corresponding to the stems that are used in different subparadigms, each with a corresponding CODLES that provides (among other things) information on the inflectional endings that can be attached to a LES. We make use of this information to generate the relevant forms.

To illustrate the details of the procedure, let us consider the verb rumpo ‘to break’. For this verb, the database of Lemlat features the LESs and CODLESs shown in Table 1.

{| class="wikitable"
|+ Table 1: the verb rumpo in Lemlat 3.0
! LES !! CODLES
|-
| rump || v3r
|-
| rumpisse || fe
|-
| rup || v7s
|-
| rupsit || fe
|-
| rupt || n41
|-
| rupt || n6p1
|-
| ruptur || n6p2
|}

The two LESs with CODLES “fe” (“forma eccezionale”, ‘exceptional form’) were discarded, since they are fully irregular forms that are stored as such. As for the other LESs, the one with CODLES “v3r” is used to fill all the cells of the present system, by adding the inflectional endings of the conjugation represented by the CODLES (i.e. the 3rd conjugation). Similarly, the LES with CODLES “v7s” is used to fill the cells of the perfect system. From the remaining LESs, some nominal forms built upon the so-called “third stem” (Aronoff, 1994) can be derived, namely the supine rupt-um and rupt-ū from the LES with CODLES “n41”, the perfect participle rupt-us, -a, -um from the LES with CODLES “n6p1” and the future participle ruptūr-us, -a, -um from the LES with CODLES “n6p2”.

Given this, our first step is to extract information on the LESs and CODLESs of each lexeme. Since Lemlat is a tool built to analyze rather than produce forms, it also contains several LESs occurring only in irregular and/or rare forms. To avoid the risk of overgeneration, we keep only one LES for each CODLES. The choice is based on lexicographical sources, namely Lewis and Short (1879) and Glare (1982). In these dictionaries, at the very beginning of each verbal entry there is a set of four “principal parts” (Bennett, 1908: 55), i.e. exemplary inflected forms from which the whole paradigm of the lexeme can be inferred. We keep only those LESs that correspond to such principal parts, excluding the ones that correspond to more marginal forms that do appear in dictionaries but are given less prominence in the entry. For instance, Lemlat includes two LESs with CODLES “v3r” for the verb dico ‘to say’: “dic” and “deic”. However, in both the lexicographical sources we use, the relevant principal parts are dico and dicere, corresponding to the first LES, while the second one is only mentioned later in the entries as an alternative form. Therefore, the LES selected for our resource is “dic”.

We also use the same dictionaries to manually annotate the vowel length for each LES. This is a necessary enhancement, because in Latin verb inflection there are homographic forms that can be distinguished only by vowel length, for instance PRS.ACT.IND.3SG fugit ‘(s)he flees’ vs. PRF.ACT.IND.3SG fūgit ‘(s)he fled’.

Following this process, we fill all the 254 paradigm cells of each of the 3,348 lexemes. However, because of Lemlat’s design, for some quite frequent verbs with a highly irregular inflectional paradigm it was not possible to apply the same procedure, at least for the cells of the present system, which is where most of the irregularity in the inflectional endings of Latin verbs occurs. For the verbs shown in Table 2 and for those derived from them by prefixation (e.g. abeo ‘to go away’ from the verb eo ‘to go’), although it was technically possible to adopt a similar approach by using more than one LES for a CODLES, it proved to be faster and more practical to manually record the correct forms as such.

{| class="wikitable"
|+ Table 2: irregular verbs
! Lemma !! Meaning
|-
| aio || to say
|-
| eo || to go
|-
| fero || to bring
|-
| fio || to become
|-
| inquam || to say
|-
| malo || to prefer
|-
| nolo || not to want
|-
| possum || can
|-
| sum || to be
|-
| volo || to want
|}

To each of the 850,392 generated paradigm cells, a univocal lexeme identifier is assigned,
which corresponds to the lemma used in Lemlat. In those rare cases where two or more verbs have the same lemma in Lemlat (although they inflect differently), a numeric diacritic is added to make the relevant distinction: for instance, we have volo1 ‘to fly’ and volo2 ‘to want’.

====3.3 Frequency Data====
Many forms included in the paradigm cells of our lexicon are never attested in Latin texts. In order to make it possible to distinguish between plausible but unattested forms and those actually occurring in texts, we enhance forms with information on their frequency. This information is taken from Tombeur’s (1998) Thesaurus Formarum Totius Latinitatis (henceforth TFTL), where each form is assigned the number of its occurrences in four different epochs, respectively called Antiquitas (from the origins to the end of the 2nd century A.D.), Aetas Patrum (2nd century to 735 A.D.), Medium Aeuum (736-1499) and Recentior Latinitas (1500-1965).

By including the frequency of each form in the lexicon, we know how many of the 752,537<sup>3</sup> forms recorded in the lexicon are never actually attested. Table 3 reports the relevant data<sup>4</sup>.

{| class="wikitable"
|+ Table 3: unattested forms
! TFTL epoch !! Unattested forms !! %
|-
| Antiquitas || 544,395 || 72.34%
|-
| Aetas Patrum || 482,324 || 64.1%
|-
| Medium Aeuum || 484,421 || 64.37%
|-
| Recentior Latinitas || 640,552 || 85.12%
|-
| all epochs || 401,690 || 53.38%
|}

<sup>3</sup> The 97,855 paradigm cells marked as #DEF# are excluded from this count.

<sup>4</sup> In total, the TFTL includes 554,828 different forms, corresponding to 62,922,781 occurrences in the reference corpus used by the Thesaurus. Our lexicon contains 165,898 of these unique forms (forms appearing in more than one paradigm cell are counted only once), for a total of 18,261,179 occurrences. This means that our resource covers around 30% of the forms of the TFTL, in terms of both type and token frequency. In addition, it also contains several other forms that are not attested in the TFTL (245,623 unique forms).

It can be observed that a significant number of the forms recorded in our lexicon are not attested, even in such a large corpus as the one the TFTL is based on. However, this is not surprising: recent large-scale corpus-based investigations (e.g. Bonami and Beniamine, 2016: 158 ff.) show that in languages with large inflectional paradigms, like those of Latin verbs, it is perfectly normal that many plausible forms do not appear, even in very large datasets, and the lexemes for which the full paradigm is attested are very few.

===4 Discussion and Future Work===
We described the design and building of a lexeme-based inflected lexicon consisting of 850,392 paradigm cells of 3,348 Latin verbs. Our first objective in the near future is to make the resource complete in terms of lexical coverage, including the lexemes of the other PoS. The lexicon is available for download as a .csv file at https://github.com/matteo-pellegrini/LatInfLexi.

We also plan to include phonetic annotation, by giving the IPA transcription of each form, which can be obtained semi-automatically by applying a script provided by the Classical Language Toolkit (Johnson et al., 2014-17) to stems and endings.

Another welcome addition would be to account for cases of overabundance, by allowing more than one form to appear in the same paradigm cell. However, to decide which cell-mates to keep and which ones to discard, their frequency in Latin texts should first be evaluated. In this respect, it has to be noted that the frequencies in the TFTL refer to bare surface forms, with no contextual disambiguation. For instance, the frequency of veniam comprises not only occurrences of both the PRS.ACT.SBJV.1SG and FUT.ACT.IND.1SG of the verb venio ‘to come’, but also of the ACC.SG of the noun venia ‘indulgence’.

To get an idea of the impact of morphological ambiguity on our lexicon, we analyzed all the generated forms with Lemlat (version 3.0). We found that Lemlat outputs only one analysis (i.e. one lemma and one set of morphological features) for only about 23% (170,735) of the 752,537 forms, the remaining 581,802 (about 77%) being ambiguous. This result weakens the reliability of the frequency data provided in the lexicon. Therefore, disambiguation is needed, although this would require very time-consuming work.

However, to tackle the problem of ambiguity, a first useful step is distinguishing between cases like veniam above, which can be analyzed as an inflected form of two different lemmas, and cases where the different analyses only refer to different forms of the same lemma, e.g. laudatis, which appears both in the PRS.ACT.IND.2PL and in
the PRF.PTCP.DAT/ABL.PL of laudo ‘to praise’, but cannot be a form of other lemmas. We call these different types ‘exolemmatic’ and ‘endolemmatic’ ambiguity, respectively (cf. Passarotti and Ruffolo, 2004). Cases of exolemmatic ambiguity are clearly more problematic, but they are also much rarer: only 79,490 (about 10.6%) of the forms in our resource belong to this type. The great majority of ambiguous forms only give rise to endolemmatic ambiguity, as can be observed in Table 4, where the relevant data are summarized.

{| class="wikitable"
|+ Table 4: the impact of ambiguity on frequency data
! !! n. !! %
|-
| unambiguous forms || 170,735 || 22.69%
|-
| ambiguous forms || 581,802 || 77.31%
|-
| only endolemmatic amb. || 502,312 || 66.75%
|-
| exolemmatic amb. || 79,490 || 10.56%
|}

As far as endolemmatic ambiguity is concerned, although its quantitative impact is far greater, it could be considerably reduced in a principled manner. Indeed, it should be noted that in many cases this kind of ambiguity is due to systematic syncretism. For instance, the cells FUT.ACT.IMP.2SG and FUT.ACT.IMP.3SG are never unambiguously analyzed, because they are always identical for the same verb. Given the full systematicity of this syncretism, which holds for all lexemes, these cells could be considered as only one from a purely morphological point of view. Therefore, the problem of endolemmatic ambiguity could be at least reduced by adopting an approach based on “morphomic paradigms” (Boyé and Schalchli, 2016), where cells that are always syncretic are conflated, rather than on morphosyntactic paradigms. This would be especially helpful for nominal forms like participles and gerundives, where such cases of systematic syncretism are widespread.

Once such ambiguity issues have been resolved, it will also be possible to exploit the frequency data in a more systematic fashion, e.g. to perform diachronic investigations on how the frequency of specific (groups of) forms or paradigm cells changes across the four epochs considered, or to model Latin inflectional morphology in an even more realistic way, by also considering the token frequency of inflected forms, as has recently been proposed by Boyé (2016).

===References===
* Farrell Ackerman, James P. Blevins, and Robert Malouf. 2009. Parts and wholes: Implicative patterns in inflectional paradigms. In James P. Blevins and Juliette Blevins, editors, Analogy in Grammar: Form and Acquisition. Oxford University Press, Oxford: 54–82.
* Mark Aronoff. 1994. Morphology by itself: Stems and inflectional classes. MIT Press, Cambridge/London.
* Sacha Beniamine. 2017. Un algorithme universel pour l'abstraction automatique d'alternances morphophonologiques. In 24e Conférence sur le Traitement Automatique des Langues Naturelles (TALN).
* Charles Edwin Bennett. 1908. New Latin Grammar. Bolchazy-Carducci Publishers.
* Olivier Bonami and Sacha Beniamine. 2016. Joint predictiveness in inflectional paradigms. Word Structure 9(2): 156–182.
* Olivier Bonami and Gilles Boyé. 2014. De formes en thèmes. In Florence Villoing, Sophie David and Sarah Leroy, editors, Foisonnements morphologiques: Études en hommage à Françoise Kerleroux. Presses universitaires de Paris Ouest, Paris: 17–45.
* Olivier Bonami, Gauthier Caron and Clément Plancq. 2014. Construction d’un lexique flexionnel phonétisé libre du français. In Franck Neveu, Peter Blumenthal, Linda Hriba, Annette Gerstenberg, Judith Meinschaefer and Sophie Prévost, editors, Actes du quatrième congrès mondial de linguistique française: 2583–2596.
* Gilles Boyé. 2016. Pour une modélisation surfaciste de la flexion. Le cas de la conjugaison du français. In SHS Web of Conferences. Vol. 27. EDP Sciences.
* Gilles Boyé and Gauvain Schalchli. 2016. The status of paradigms. In Andrew Hippisley and Gregory Stump, editors, The Cambridge Handbook of Morphology. Cambridge University Press, Cambridge: 206–234.
* Basilio Calderone, Matteo Pascoli, Nabil Hathout and Franck Sajous. 2017. Hybrid method for stress prediction applied to GLAFF-IT, a large-scale Italian lexicon. In International Conference on Language, Data and Knowledge. Springer, Cham: 26–41.
* Louis Delatte, Étienne Evrard, Suzanne Govaerts and Joseph Denooz. 1981. Dictionnaire fréquentiel et index inverse de la langue latine. L.A.S.L.A, Liege.
* Peter G.W. Glare. 1982. Oxford Latin Dictionary. Oxford University Press, Oxford.
* Nabil Hathout, Franck Sajous and Basilio Calderone. 2014. GLÀFF, a large versatile French lexicon. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14): 1007–1012.
* Kyle P. Johnson et al. 2014-2017. CLTK: The Classical Language Toolkit. DOI 10.5281/zenodo.593336.
* Charlton Lewis and Charles Short. 1879. A Latin Dictionary. Clarendon, Oxford.
* Alexis Amid Neme. 2013. A fully inflected Arabic verb resource constructed from a lexicon of lemmas by using finite-state transducers. Revue RIST: revue de l’information scientifique et technique 20(2): 7–19.
* Marco Passarotti, Marco Budassi, Eleonora Litta and Paolo Ruffolo. 2017. The Lemlat 3.0 Package for Morphological Analysis of Latin. In Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language: 24–31.
* Marco Passarotti and Paolo Ruffolo. 2004. L’utilizzo del lemmatizzatore LEMLAT per una sistematizzazione dell’omografia in latino. EUPHROSYNE 32(A): 99–110.
* Slav Petrov, Dipanjan Das, and Ryan McDonald. 2011. A universal part-of-speech tagset. arXiv:1104.2086.
* Anna M. Thornton. 2012. Reduction and maintenance of overabundance. A case study on Italian verb paradigms. Word Structure 5(2): 183–207.
* Paul Tombeur. 1998. Thesaurus formarum totius latinitatis a Plauto usque ad saeculum XXum. Brepols, Turnhout.
* Eros Zanchetta and Marco Baroni. 2005. Morph-it!: a free corpus-based morphological resource for the Italian language.