=Paper= {{Paper |id=Vol-1749/paper32 |storemode=property |title=Formatio formosa est. Building a Word Formation Lexicon for Latin |pdfUrl=https://ceur-ws.org/Vol-1749/paper32.pdf |volume=Vol-1749 |authors=Eleonora Litta,Marco Passarotti,Chris Culy |dblpUrl=https://dblp.org/rec/conf/clic-it/LittaPC16 }} ==Formatio formosa est. Building a Word Formation Lexicon for Latin== https://ceur-ws.org/Vol-1749/paper32.pdf
                                       Formatio formosa est.

                   Building a Word Formation Lexicon for Latin

                   Eleonora Litta, Marco Passarotti, Chris Culy
                             CIRCSE Research Centre
                        Università Cattolica del Sacro Cuore
                       Largo Gemelli, 1 – 20123 Milan, Italy
          {eleonoramaria.litta, marco.passarotti}@unicatt.it,
                             chrisculy@mac.com



Abstract

English. This paper presents the steps undertaken for building a word formation lexicon for Latin. The types of word formation rules are discussed and the semi-automatic procedure to pair their input and output lexical items is evaluated. An on-line graphical query system to access the lexicon is described as well.

Italiano. Questo articolo presenta le procedure di realizzazione di un lessico morfologico derivazionale per il latino. Sono descritti i tipi di regole di formazione di parola e viene valutata la qualità del sistema semi-automatico di individuazione delle parole in input e in output ad esse. Il sistema grafico d’interrogazione on-line dei dati è altresì presentato.

1    Introduction

In the area of Natural Language Processing (NLP), derivational morphology has long been neglected compared to inflectional morphology, which plays a central role in fundamental annotation tasks like PoS tagging. Yet enhancing textual data with derivational morphology tagging promises to provide strong outcomes. First, it organises the lexicon at a higher level than words, by building word-formation-based sets of lexical items sharing a common derivational ancestor. Secondly, derivational morphology acts as a kind of interface between morphology and semantics, since core semantic properties are shared to different extents by words built by a common word formation process.

Lately, some lexical resources for derivational morphology have been made available. Among them are the lexical network for Czech DeriNet (Ševčíková and Žabokrtský, 2014), the derivational lexicon for German DErivBASE (Zeller et al., 2013) and that for Italian derIvaTario (Talamo et al., 2016). Furthermore, stemming is a technique largely used for detecting word formation processes (Goldsmith, 2001), and language-independent NLP tools were trained to extract derivation information from inflectional lexica (Baranes and Sagot, 2014).

On the Classical languages front, the resources and NLP tools for Ancient Greek and Latin are now many and varied (ranging from digital libraries, treebanks and computational lexica to PoS taggers and parsers). However, no lexical resource for derivational morphology, in which words are connected by word formation processes, is available yet. The first steps towards building such a word formation lexicon for Latin were made by Passarotti and Mambrini (2012), who described a model for the semi-automatic extraction of word formation rules from the list of lemmas of Lexicon Totius Latinitatis by Forcellini (fifth edition; 1940) and the subsequent pairing of lexical entries and their derivational ancestor(s).

The Word Formation Latin project has received funding from the EU Horizon 2020 Research and Innovation Programme under the Marie Skłodowska-Curie Individual Fellowship to expand on these efforts and create a word formation lexicon (working as an NLP tool as well) for Latin. In this paper, we describe the steps undertaken to build such a lexicon.

The paper is organised as follows. Section 2 presents the lexical basis supporting the lexicon; Section 3 details the way the lexicon is built; Section 4 describes how to access the data; Section 5 concludes the paper and sketches future work.

2    Lemlat

The lexical basis used for building the word formation lexicon is the one provided by the morphological analyser for Latin Lemlat (Passarotti, 2004). Resulting from the collation of three Latin dictionaries (Georges and Georges, 1913-1918; Glare, 1982; Gradenwitz, 1904), it counts 40,014 lexical entries and 43,432 lemmas (as more than one lemma can be included in the same lexical entry). Recently, the lexical basis of Lemlat was further enlarged by adding most of the Onomasticon (26,250 lemmas out of 28,178) provided by Forcellini (1940).

The basic component of the lexical look-up table used by Lemlat to morphologically analyse (and lemmatise) the input wordforms is the so-called les (“LExical Segment”), which roughly corresponds to the invariable part of the inflected forms. In other words, the les is the sequence (or one of the sequences) of characters that remains the same in the inflectional paradigm of a lemma (hence, the les does not necessarily correspond to the word stem). For instance, puell is the les for the lemma puell–a (“girl”).

Lemlat includes a LES archive, in which each LES is assigned a number of inflectional features, among which are a tag for the gender of the lemma (for nouns only) and a code (CODLES) for its inflectional category. For instance, the CODLES for the LES puell is N1 (first declension regular nouns) and its gender is F (feminine).

3    Building the Lexicon

The word formation lexicon is built in two steps. First, word formation rules are detected. Then, they are applied to lexical data.

3.1    Detecting Word Formation Rules

Word formation rules (WFRs) are conceived according to the so-called Item-and-Arrangement model, outlined by Hockett (1954), which considers word forms either as simple morphemes (not derived word forms) or as a concatenation of morphemes (derived word forms). The following conditions on bases and affixes hold: (1) Baudouin’s assumption that both bases and affixes are lexical elements (i.e. they are both morphemes); (2) as a consequence, they exist in the lexicon (Bloomfield’s “lexical morpheme” theory); (3) they are dualistic, i.e. they have both form and meaning (Bloomfield’s “sign-base” morpheme theory). The first two conditions motivate the fact that in our word formation lexicon affixes are recorded with the same status as lexical bases; the third condition concerns the semantic properties of WFRs mentioned in Section 1.

WFRs fall into two main types: (1) derivation and (2) compounding. Derivation rules are further organised into two subcategories: (a) affixal, in turn split into prefixal and suffixal, and (b) conversion, a derivation process that changes the PoS of the input word without affixation.

Compounding and conversion WFRs are automatically detected by considering all the possible combinations of main PoS (verbs, nouns, adjectives), regardless of their actual instantiations in the lexical basis. For instance, there are four possible types of conversion WFRs involving verbs: V-To-N (claudo → clausa; “to close” → “cell”), V-To-A (eligo → elegans; “to pick out” → “accustomed to select, tasteful”), N-To-V (magister → magistro; “master” → “to rule”), A-To-V (celer → celero; “quick” → “to quicken”). Each compounding and conversion WFR type is further specified by the inflectional category of both input and output. For instance, A1-To-V1 is the conversion WFR from first class adjectives to first conjugation verbs.

Affixal WFRs are found both according to previous literature on Latin derivational morphology (Jenks, 1911; Fruyt, 2011; Oniga, 1988) and in semi-automatic fashion. The latter is performed by extracting from the list of lemmas of Lemlat the most frequent sequences of characters occurring on the left (prefixes) and on the right (suffixes) side of lemmas. The PoS for WFR input and output lemmas as well as their inflectional category are manually assigned. Further affixal WFRs are found by comparison with the data. So far, we have detected 167 affixal WFRs: 71 prefixal and 96 suffixal.

We recorded the rules in a table of a MySQL relational database where each WFR is classified by type and is assigned the required PoS, inflectional category and gender for its input and output.

3.2    Applying Word Formation Rules

Each morphologically derived lemma is assigned a WFR. All those lemmas that share a common (not derived) ancestor belong to the same “morphological family”. For instance, lemmas formatio (“formation”), formo (“to form”) and formosus (“beautiful”, lit. “finely formed”) all belong to the morphological family whose ancestor is the lemma forma (“form”).

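The notion of morphological family can be illustrated with a minimal Python sketch. It assumes a hypothetical in-memory mapping from each derived lemma to its input lemma; the lexicon itself stores this information in MySQL tables whose layout is not reproduced here. The names DERIVED_FROM, ancestor and families are ours, and the individual links inside the forma family (for example, formatio taken as derived from formo) are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical derivation links: derived lemma -> input lemma.
# Only the family membership comes from the text; the individual links
# (e.g. formatio <- formo) are assumed for illustration.
DERIVED_FROM = {
    "formo": "forma",
    "formosus": "forma",
    "formatio": "formo",
}


def ancestor(lemma, derived_from):
    """Follow input links until a lemma with no input (not derived) is reached."""
    seen = set()
    while lemma in derived_from:
        if lemma in seen:
            raise ValueError(f"cyclic derivation involving {lemma!r}")
        seen.add(lemma)
        lemma = derived_from[lemma]
    return lemma


def families(derived_from):
    """Group every lemma under its non-derived ancestor."""
    result = defaultdict(set)
    for lemma in set(derived_from) | set(derived_from.values()):
        result[ancestor(lemma, derived_from)].add(lemma)
    return dict(result)


FAMILIES = families(DERIVED_FROM)
print(sorted(FAMILIES["forma"]))
```

The sketch relies on each derived lemma having exactly one input lemma, so it covers derivation only; compounds, which have more than one input, would need a different representation.
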
Lemmas and WFRs are paired by using a MySQL relational database whose main tables are the LES archive of Lemlat, the list of its lemmas (each assigned its PoS, inflectional category and, for nouns only, gender) and the list of WFRs.

A number of MySQL queries provide the candidate lemmas for each WFR. Some of these queries run on the list of lemmas, while others run on the LES archive. In particular, most candidate lemmas of prefixal WFRs are found by running queries on the list of lemmas, as such rules tend to just add the characters of the prefix to the input lemma, as in the case of accuso → sub+accuso (“to blame” → “to blame somewhat”). Suffixal WFRs, instead, are mostly assigned to their candidate input and output lemmas by running queries on the LES archive, because suffixes attach to LES instead of modifying full lemmas, as in amo → amabilis (“to love” → “lovable”), where suffix –bil– attaches to LES am (plus the thematic vowel –a–, used for first conjugation verbs) instead of full lemma amo. Also, there are suffixal WFRs whose input is the basis of the irregular perfect participle of the input verb, as in duco → ductilis (“to lead” → “that may be led”), where suffix –il– attaches to the basis of the irregular perfect participle of the verb duco (duct). Such irregular bases are recorded explicitly in the LES archive with a specific CODLES.

3.3    State of Affairs and Evaluation

The procedure described above is sufficient neither for detecting nor for applying the WFRs and, ultimately, for building the morphological families. Manual checking is largely needed for identifying false results and disambiguating duplication, as well as for filling lacunas resulting from the automatic process.

For example, while looking for the candidates of the WFR that forms adjectives from nouns with the addition of the suffix –ax/–acis, two candidate input nouns are found for the adjective fugax (“swift, transitory”): fuga (“flight”) and fugium (rare, scarcely used in place of fuga). Such duplicate results need to be checked and disambiguated manually, as there must be only one input lemma for each output lemma resulting from a WFR of the derivation type, just like there must be only one WFR associated with each derived lemma.

Morphotactically obscure word formation processes, like most compounding WFRs, are examples of lacunas of the automatic process of assigning WFRs, which are thus fully manually hard-coded. For instance, the compound lemma matricida (“matricide”) is derived by compounding the input lemmas mater (“mother”) and caedo (“to cut”), thus showing quite an obscure morphotactic configuration.

So far, we have applied to data 134 WFRs (45 prefixal, 80 suffixal, 6 conversion and 3 compounding), which corresponds to having assigned a WFR to 18,774 lemmas. Evaluation is performed by calculating the precision rate (Van Rijsbergen, 1979) of MySQL queries, i.e. the percentage of the correct candidate input-output pairs that are automatically assigned to a WFR by a query.

As expected, precision is higher when morphotactic mutations are fewer. Indeed, while precision rates for prefixal rules range between 0.95 and 0.8, as they imply few graphical mutations, precision for suffixal rules can vary heavily, ranging from 0.75 to as little as 0.3. The recall of queries, instead, will be calculated later in the project, as we are currently unable to verify how many derived lemmas are not automatically picked up by queries.

4    Accessing the Data

The word formation lexicon can be accessed on-line through a visualisation query system (http://wfl.marginalia.it). The lexicon can be browsed either by WFR, affix, or input and output PoS or lemma. Drop-down menus provide the available options for each selection, for instance the list of affixes and lemmas.

Results are visualised as tree graphs, whose nodes are lemmas and whose edges are WFRs. Trees are interactive. Clicking on a node shows the full derivation tree (“word formation cluster”, which is calculated dynamically) for the lemma reported in that node. For example, figure 1 shows the currently available word formation cluster for the lemma amo. One can see that amabilis derives from amo and is in turn the input for two other derived lemmas: amabilitas (“loveliness”) and inamabilis (“unlovely”). Clicking on an edge shows the lemmas built by the WFR concerned in that edge. Lemmas are provided both as a derivation graph and as an alphabetical list. For instance, clicking on the edge going from amo to amabilis in figure 1 shows the lemmas built by the derivation WFR that builds second class adjectives (A2) from first conjugation verbs (V1) with suffix –bil–. Figure 2 presents a portion of the derivation graph for this rule.
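
How a word formation cluster might be computed dynamically for a selected node can be sketched as follows. This is an illustrative Python fragment, not the code behind the on-line system: it assumes derivation links are available as (input lemma, output lemma) pairs with a single input per output, and the names LINKS, build_index and cluster_lines are hypothetical. The data are limited to the amo example; in the actual graphs each edge also carries its WFR, which is omitted here.

```python
from collections import defaultdict

# Derivation links taken from the amo example: (input lemma, output lemma).
LINKS = [
    ("amo", "amabilis"),
    ("amabilis", "amabilitas"),
    ("amabilis", "inamabilis"),
]


def build_index(links):
    """Index links in both directions: input -> outputs and output -> input."""
    children = defaultdict(list)
    parent = {}
    for source, target in links:
        children[source].append(target)
        parent[target] = source
    return children, parent


def cluster_lines(lemma, links):
    """Return the indented derivation tree containing `lemma`, one line per node."""
    children, parent = build_index(links)
    root = lemma
    while root in parent:  # climb to the non-derived ancestor
        root = parent[root]
    lines = []

    def walk(node, depth):
        lines.append("  " * depth + node)
        for child in sorted(children[node]):
            walk(child, depth + 1)

    walk(root, 0)
    return lines


print("\n".join(cluster_lines("amabilis", LINKS)))
```

Starting from any node of the tree, the function first climbs to the non-derived ancestor and then lists all its descendants, so selecting amabilis or amo yields the same cluster.
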




                                Figure 1. Word formation cluster for amo.




                                  Figure 2. Derivation graph for a WFR.


5    Conclusion and Future Work

The building process of the word formation lexicon for Latin is ongoing. We still have to fully exploit the potential of querying the lexical basis of Lemlat to automatically detect candidates for WFRs. Furthermore, a substantial amount of manual work is needed to pick up morphotactically obscure formations, like those resulting from compounding.

The word formation lexicon is meant to enhance Lemlat by providing its processing with word formation analysis of input data, thus building a wide lexical resource and NLP tool for Latin morphology, which will be made available through the CLARIN infrastructure (www.clarin.eu).

References

Marion Baranes and Benoît Sagot. 2014. A Language-Independent
  Approach to Extracting Derivational Relations from an
  Inflectional Lexicon. Proceedings of the Ninth International
  Conference on Language Resources and Evaluation (LREC'14).
  ELRA, Reykjavik, Iceland, 2793–2799.
Egidio Forcellini. 1940. Lexicon totius latinitatis.
  Typis Seminarii, Padova.
Michèle Fruyt. 2011. Word Formation in Classical Latin.
  J. Clackson (ed.), A Companion to the Latin Language,
  Wiley-Blackwell, Chichester/Malden, Mass, 157–175.
Karl Ernst Georges and Heinrich Georges. 1913-1918.
  Ausführliches Lateinisch-Deutsches Handwörterbuch.
  Hahn, Hannover.
Peter G. W. Glare. 1982. Oxford Latin Dictionary. At the
  Clarendon Press, Oxford.
John Goldsmith. 2001. Unsupervised learning of the
  morphology of a natural language. Computational
  Linguistics, 27(2): 153–198.
Otto Gradenwitz. 1904. Laterculi vocum latinarum.
  Hirzel, Leipzig.
Charles F. Hockett. 1954. Two Models of Grammatical
  Description. Word, 10: 210–231.
Paul Rockwell Jenks. 1911. A manual of Latin word
  formation for secondary schools. DC Heath &
  Company, Harvard.
Renato Oniga. 1988. I composti nominali latini: una
  morfologia generativa (Vol. 29). Pàtron, Bologna.
Marco Passarotti. 2004. Development and perspectives of
  the Latin morphological analyzer LEMLAT. Linguistica
  Computazionale, XX-XXI: 397–414.
Marco Passarotti and Francesco Mambrini. 2012. First Steps
  towards the Semi-automatic Development of a
  Wordformation-based Lexicon of Latin. Proceedings of the
  Eighth International Conference on Language Resources and
  Evaluation (LREC'12). ELRA, Istanbul, Turkey, 852–859.
Magda Ševčíková and Zdeněk Žabokrtský. 2014.
  Word-Formation Network for Czech. Proceedings
  of the Ninth International Conference on Language
  Resources and Evaluation (LREC'14). ELRA,
  Reykjavik, Iceland, 1087–1093.
Luigi Talamo, Chiara Celata and Pier Marco
  Bertinetto. 2016. DerIvaTario: An annotated
  lexicon of Italian derivatives. Word Structure, 9(1):
  72–102.
Cornelis Joost Van Rijsbergen. 1979. Information
  retrieval. Butterworths, London, 2nd edition.
Britta D. Zeller, Jan Šnajder and Sebastian Padó.
  2013. DErivBase: Inducing and Evaluating a
  Derivational Morphology Resource for German.
  Proceedings of the Annual Meeting of the
  Association for Computational Linguistics. ACL,
  Sofia, Bulgaria, 1201–1211.