=Paper=
{{Paper
|id=Vol-1749/paper32
|storemode=property
|title=Formatio formosa est. Building a Word Formation Lexicon for Latin
|pdfUrl=https://ceur-ws.org/Vol-1749/paper32.pdf
|volume=Vol-1749
|authors=Eleonora Litta,Marco Passarotti,Chris Culy
|dblpUrl=https://dblp.org/rec/conf/clic-it/LittaPC16
}}
==Formatio formosa est. Building a Word Formation Lexicon for Latin==
Eleonora Litta, Marco Passarotti, Chris Culy
CIRCSE Research Centre
Università Cattolica del Sacro Cuore
Largo Gemelli, 1 – 20123 Milan, Italy
{eleonoramaria.litta, marco.passarotti}@unicatt.it,
chrisculy@mac.com
Abstract

English. This paper presents the steps undertaken for building a word formation lexicon for Latin. The types of word formation rules are discussed and the semi-automatic procedure to pair their input and output lexical items is evaluated. An on-line graphical query system to access the lexicon is described as well.

Italiano. Questo articolo presenta le procedure di realizzazione di un lessico morfologico derivazionale per il latino. Sono descritti i tipi di regole di formazione di parola e viene valutata la qualità del sistema semi-automatico di individuazione delle parole in input e in output ad esse. Il sistema grafico d’interrogazione on-line dei dati è altresì presentato.

1 Introduction

In the area of Natural Language Processing (NLP), derivational morphology has long been neglected in comparison with inflectional morphology, which plays a central role in fundamental annotation tasks like PoS tagging. Yet enhancing textual data with derivational morphology tagging promises to provide strong outcomes. First, it organises the lexicon at a higher level than words, by building word-formation-based sets of lexical items sharing a common derivational ancestor. Secondly, derivational morphology acts as a kind of interface between morphology and semantics, since core semantic properties are shared to different extents by words built by a common word formation process.

Lately, some lexical resources for derivational morphology have been made available. Among them are the lexical network for Czech DeriNet (Ševčíková and Žabokrtský, 2014), the derivational lexicon for German DErivBASE (Zeller et al., 2013) and that for Italian derIvaTario (Talamo et al., 2016). Furthermore, stemming is a technique largely used for detecting word formation processes (Goldsmith, 2001), and language-independent NLP tools have been trained to extract derivation information from inflectional lexica (Baranes and Sagot, 2014).

On the Classical languages front, although the resources and NLP tools for Ancient Greek and Latin are now manifold and varied (ranging from digital libraries, treebanks and computational lexica to PoS taggers and parsers), no lexical resource for derivational morphology, in which words are connected by word formation processes, is available yet. The first steps towards building such a word formation lexicon for Latin were made by Passarotti and Mambrini (2012), who described a model for the semi-automatic extraction of word formation rules from the list of lemmas of Lexicon Totius Latinitatis by Forcellini (fifth edition; 1940) and the subsequent pairing of lexical entries and their derivational ancestor(s).

The Word Formation Latin project has received funding from the EU Horizon 2020 Research and Innovation Programme under the Marie Skłodowska-Curie Individual Fellowship to expand on these efforts and create a word formation lexicon (working as an NLP tool as well) for Latin. In this paper, we describe the steps undertaken to build such a lexicon.

The paper is organised as follows. Section 2 presents the lexical basis supporting the lexicon; section 3 details the way the lexicon is built; section 4 describes how to access the data; section 5 concludes the paper and sketches future work.

2 Lemlat

The lexical basis used for building the word formation lexicon is the one provided by the morphological analyser for Latin Lemlat (Passarotti, 2004). Resulting from the collation of three Latin dictionaries (Georges and Georges, 1913-1918; Glare, 1982; Gradenwitz, 1904), it counts 40,014 lexical entries and 43,432 lemmas (as more than one lemma can be included in the same lexical entry). Recently, the lexical basis of Lemlat was further enlarged by adding most of the Onomasticon (26,250 lemmas out of 28,178) provided by Forcellini (1940).

The basic component of the lexical look-up table used by Lemlat to morphologically analyse (and lemmatise) the input wordforms is the so-called les (“LExical Segment”), which roughly corresponds to the invariable part of the inflected forms. In other words, the les is the sequence (or one of the sequences) of characters that remains the same in the inflectional paradigm of a lemma (hence, the les does not necessarily correspond to the word stem). For instance, puell is the les for the lemma puell–a (“girl”).

Lemlat includes a LES archive, in which each LES is assigned a number of inflectional features, among which are a tag for the gender of the lemma (for nouns only) and a code (CODLES) for its inflectional category. For instance, the CODLES for the LES puell is N1 (first declension regular nouns) and its gender is F (feminine).

3 Building the Lexicon

The word formation lexicon is built in two steps. First, word formation rules are detected. Then, they are applied to lexical data.

3.1 Detecting Word Formation Rules

Word formation rules (WFRs) are conceived according to the so-called Item-and-Arrangement model, outlined by Hockett (1954), which considers word forms either as simple morphemes (not derived word forms) or as a concatenation of morphemes (derived word forms). The following conditions on bases and affixes hold: (1) Baudouin’s assumption that both bases and affixes are lexical elements (i.e. they are both morphemes); (2) as a consequence, they exist in the lexicon (Bloomfield’s “lexical morpheme” theory); (3) they are dualistic, i.e. they have both form and meaning (Bloomfield’s “sign-base” morpheme theory). The first two conditions motivate the fact that in our word formation lexicon affixes are recorded with the same status as lexical bases; the third condition concerns the semantic properties of WFRs mentioned in Section 1.

WFRs fall into two main types: (1) derivation and (2) compounding. Derivation rules are further organised into two subcategories: (a) affixal, in turn split into prefixal and suffixal, and (b) conversion, a derivation process that changes the PoS of the input word without affixation.

Compounding and conversion WFRs are automatically detected by considering all the possible combinations of main PoS (verbs, nouns, adjectives), regardless of their actual instantiations in the lexical basis. For instance, there are four possible types of conversion WFRs involving verbs: V-To-N (claudo → clausa; “to close” → “cell”), V-To-A (eligo → elegans; “to pick out” → “accustomed to select, tasteful”), N-To-V (magister → magistro; “master” → “to rule”), A-To-V (celer → celero; “quick” → “to quicken”). Each compounding and conversion WFR type is further specified by the inflectional category of both input and output. For instance, A1-To-V1 is the conversion WFR from first class adjectives to first conjugation verbs.

Affixal WFRs are found both according to previous literature on Latin derivational morphology (Jenks, 1911; Fruyt, 2011; Oniga, 1988) and in semi-automatic fashion. The latter is performed by extracting from the list of lemmas of Lemlat the most frequent sequences of characters occurring on the left (prefixes) and on the right (suffixes) side of lemmas. The PoS for WFR input and output lemmas as well as their inflectional category are manually assigned. Further affixal WFRs are found by comparison with the data. So far, we have detected 167 affixal WFRs: 71 prefixal and 96 suffixal.

We recorded the rules in a table of a MySQL relational database, where each WFR is classified by type and is assigned the required PoS, inflectional category and gender for its input and output.

3.2 Applying Word Formation Rules

Each morphologically derived lemma is assigned a WFR. All those lemmas that share a common (not derived) ancestor belong to the same “morphological family”. For instance, lemmas formatio (“formation”), formo (“to form”) and formosus (“beautiful”, lit. “finely formed”) all belong to the morphological family whose ancestor is the lemma forma (“form”).

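The notion of morphological family can be illustrated with a minimal Python sketch. It is not the project's implementation, which relies on MySQL tables; the dictionary `DERIVED_FROM`, the function names and the rule labels are hypothetical. The individual derivation steps assumed here for formo, formatio and formosus are illustrative guesses, as the text only states that the three lemmas share the ancestor forma.

```python
from collections import defaultdict

# Hypothetical table: derived lemma -> (input lemma, rule label).
# Only the shared ancestor "forma" comes from the text; the single
# derivation steps and the labels are assumptions for illustration.
DERIVED_FROM = {
    "formo": ("forma", "N-To-V conversion"),
    "formatio": ("formo", "V-To-N suffixal"),
    "formosus": ("forma", "N-To-A suffixal"),
}


def ancestor(lemma, derived_from=DERIVED_FROM):
    """Follow input-lemma links until a non-derived lemma is reached."""
    seen = {lemma}
    while lemma in derived_from:
        lemma = derived_from[lemma][0]
        if lemma in seen:  # guard against circular derivations
            raise ValueError(f"circular derivation at {lemma!r}")
        seen.add(lemma)
    return lemma


def families(lemmas, derived_from=DERIVED_FROM):
    """Group lemmas by their non-derived ancestor."""
    groups = defaultdict(set)
    for lemma in lemmas:
        groups[ancestor(lemma, derived_from)].add(lemma)
    return dict(groups)


if __name__ == "__main__":
    print(families(["forma", "formo", "formatio", "formosus"]))
```

Storing a single input lemma per derived lemma is a simplification that fits derivation rules only; compounds, which combine two input lemmas, would need a different structure.
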
Lemmas and WFRs are paired by using a MySQL relational database whose main tables are the LES archive of Lemlat, the list of its lemmas (each assigned its PoS, inflectional category and, for nouns only, gender) and the list of WFRs.

A number of MySQL queries provide the candidate lemmas for each WFR. Some of these queries run on the list of lemmas, while others run on the LES archive. In particular, most candidate lemmas of prefixal WFRs are found by running queries on the list of lemmas, as such rules tend to just add the characters of the prefix to the input lemma, as in the case of accuso → sub+accuso (“to blame” → “to blame somewhat”). Suffixal WFRs, instead, are mostly assigned to their candidate input and output lemmas by running queries on the LES archive, because suffixes attach to LES instead of modifying full lemmas, as in amo → amabilis (“to love” → “lovable”), where suffix –bil– attaches to LES am (plus the thematic vowel –a–, used for first conjugation verbs) instead of full lemma amo. Also, there are suffixal WFRs whose input is the basis of the irregular perfect participle of the input verb, as in duco → ductilis (“to lead” → “that may be led”), where suffix –il– attaches to the basis of the irregular perfect participle of the verb duco (duct). Such irregular bases are recorded explicitly in the LES archive with a specific CODLES.

3.3 State of Affairs and Evaluation

The procedure described above is sufficient neither for detecting nor for applying the WFRs, nor, ultimately, for building the morphological families. Manual checking is largely needed for identifying false results and disambiguating duplicates, as well as for filling the gaps left by the automatic process.

For example, while looking for the candidates of the WFR that forms adjectives from nouns with the addition of the suffix –ax/–acis, two candidate input nouns are found for the adjective fugax (“swift, transitory”): fuga (“flight”) and fugium (rare, scarcely used in place of fuga). Such duplicate results need to be checked and disambiguated manually, as there must be only one input lemma for each output lemma resulting from a WFR of the derivation type, just as there must be only one WFR associated with each derived lemma.

Morphotactically obscure word formation processes, like most compounding WFRs, are examples of gaps in the automatic process of assigning WFRs, which are thus fully manually hard-coded. For instance, the compound lemma matricida (“matricide”) is derived by compounding the input lemmas mater (“mother”) and caedo (“to cut”), thus showing quite an obscure morphotactic configuration.

So far, we have applied 134 WFRs to the data (45 prefixal, 80 suffixal, 6 conversion and 3 compounding), which corresponds to having assigned a WFR to 18,774 lemmas. Evaluation is performed by calculating the precision rate (Van Rijsbergen, 1979) of MySQL queries, i.e. the percentage of correct pairs among the candidate input-output pairs that are automatically assigned to a WFR by a query.

As expected, precision is higher when morphotactic mutations are fewer. Indeed, while precision rates for prefixal rules range between 0.95 and 0.8, as they imply only a few graphical mutations, precision for suffixal rules can vary heavily, ranging from 0.75 to as little as 0.3. The recall of queries, instead, has to be calculated later in the project, as currently we are unable to verify how many derived lemmas are not automatically picked up by queries.

4 Accessing the Data

The word formation lexicon can be accessed on-line through a visualisation query system (http://wfl.marginalia.it). The lexicon can be browsed by WFR, by affix, by input and output PoS, or by lemma. Drop-down menus provide the available options for each selection, for instance the list of affixes and lemmas.

Results are visualised as tree graphs, whose nodes are lemmas and whose edges are WFRs. Trees are interactive. Clicking on a node shows the full derivation tree (“word formation cluster”, which is calculated dynamically) for the lemma reported in that node. For example, figure 1 shows the currently available word formation cluster for the lemma amo. One can see that amabilis derives from amo and is in turn the input for two other derived lemmas: amabilitas (“loveliness”) and inamabilis (“unlovely”).

Clicking on an edge shows the lemmas built by the WFR concerned in that edge. Lemmas are provided both as a derivation graph and as an alphabetical list. For instance, clicking on the edge going from amo to amabilis in figure 1 shows the lemmas built by the derivation WFR that builds second class adjectives (A2) from first conjugation verbs (V1) with suffix –bil–. Figure 2 presents a portion of the derivation graph for this rule.

Figure 1. Word formation cluster for amo.
Figure 2. Derivation graph for a WFR.
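
How a word formation cluster might be computed dynamically from stored input-output pairs can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind the on-line system: the `EDGES` list, the function `cluster` and the rule labels are hypothetical, and only the derivation links among amo, amabilis, amabilitas and inamabilis are taken from the text.

```python
from collections import defaultdict

# Hypothetical edge list: (input lemma, output lemma, rule label).
# The links come from the amo example in the text; the labels of the
# last two rules are assumptions made for illustration.
EDGES = [
    ("amo", "amabilis", "V1-To-A2, suffix -bil-"),
    ("amabilis", "amabilitas", "suffixal (label assumed)"),
    ("amabilis", "inamabilis", "prefixal (label assumed)"),
]


def cluster(lemma, edges=EDGES):
    """Return the full derivation tree containing `lemma` as indented lines."""
    parent = {out: inp for inp, out, _ in edges}
    children = defaultdict(list)
    for inp, out, rule in edges:
        children[inp].append((out, rule))

    # Climb from the clicked lemma to the non-derived root of its tree.
    root, seen = lemma, {lemma}
    while root in parent:
        root = parent[root]
        if root in seen:
            raise ValueError(f"circular derivation at {root!r}")
        seen.add(root)

    # Walk down from the root, one indentation level per derivation step.
    lines = [root]

    def walk(node, depth):
        for out, rule in sorted(children[node]):
            lines.append("  " * depth + f"{out}  [{rule}]")
            walk(out, depth + 1)

    walk(root, 1)
    return lines


if __name__ == "__main__":
    print("\n".join(cluster("inamabilis")))
```

Whichever node of the tree is passed in, the function first climbs to the root, so every lemma of a cluster yields the same tree, in line with the behaviour described for clicking on a node.
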
5 Conclusion and Future Work

The building process of the word formation lexicon for Latin is ongoing. We still have to fully exploit the potential of querying the lexical basis of Lemlat to automatically detect candidates for WFRs. Furthermore, a substantial amount of manual work is needed to pick up the morphotactically obscure formations, like those resulting from compounding.

The word formation lexicon is meant to enhance Lemlat by providing its processing with word formation analysis of input data, thus building a wide lexical resource and NLP tool for Latin morphology, which will be made available through the CLARIN infrastructure (www.clarin.eu).

References

Marion Baranes and Benoît Sagot. 2014. A Language-Independent Approach to Extracting Derivational Relations from an Inflectional Lexicon. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). ELRA, Reykjavik, Iceland, 2793–2799.

Egidio Forcellini. 1940. Lexicon totius latinitatis. Typis Seminarii, Padova.

Michele Fruyt. 2011. Word Formation in Classical Latin. J. Clackson (ed.), A Companion to the Latin Language, Wiley-Blackwell, Chichester/Malden, Mass., 157–175.

Karl Ernst Georges and Heinrich Georges. 1913-1918. Ausführliches lateinisch-deutsches Handwörterbuch. Hahn, Hannover.

Peter G. W. Glare. 1982. Oxford Latin Dictionary. At the Clarendon Press, Oxford.

John Goldsmith. 2001. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2): 153–198.

Otto Gradenwitz. 1904. Laterculi vocum latinarum. Hirzel, Leipzig.

Charles F. Hockett. 1954. Two Models of Grammatical Description. Word, 10: 210–231.

Paul Rockwell Jenks. 1911. A manual of Latin word formation for secondary schools. D. C. Heath & Company, Harvard.

Renato Oniga. 1988. I composti nominali latini: una morfologia generativa (Vol. 29). Pàtron, Bologna.

Marco Passarotti. 2004. Development and perspectives of the Latin morphological analyzer LEMLAT. Linguistica Computazionale, XX-XXI: 397–414.

Marco Passarotti and Francesco Mambrini. 2012. First Steps towards the Semi-automatic Development of a Wordformation-based Lexicon of Latin. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). ELRA, Istanbul, Turkey, 852–859.

Magda Ševčíková and Zdeněk Žabokrtský. 2014. Word-Formation Network for Czech. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). ELRA, Reykjavik, Iceland, 1087–1093.

Luigi Talamo, Chiara Celata and Pier Marco Bertinetto. 2016. DerIvaTario: An annotated lexicon of Italian derivatives. Word Structure, 9(1): 72–102.

Cornelis Joost Van Rijsbergen. 1979. Information retrieval. Butterworths, London, 2nd edition.

Britta D. Zeller, Jan Šnajder and Sebastian Padó. 2013. DErivBase: Inducing and Evaluating a Derivational Morphology Resource for German. Proceedings of the Annual Meeting of the Association for Computational Linguistics. ACL, Sofia, Bulgaria, 1201–1211.