May the Goddess of Hope Help Us.
Homonymy in Latin Lexicon and Onomasticon

Marco Passarotti
CIRCSE Research Centre
Università Cattolica del Sacro Cuore
Largo Gemelli, 1 – 20123 Milan, Italy
marco.passarotti@unicatt.it

Marco Budassi
Università di Pavia
Corso Strada Nuova, 65
27100 Pavia, Italy
marcobudassi@hotmail.it


Abstract

English. We present a study on the degree of homonymy between the lexicon of a morphological analyser for Latin and an Onomasticon. To understand the impact of homonymy, we discuss an experiment on four Latin texts of different era and genre.

Italiano. L’articolo presenta uno studio sul grado di omonimia tra il lessico di un analizzatore morfologico per il latino e un Onomasticon. Al fine di comprendere l’impatto dell’omonimia, viene descritto un esperimento condotto su quattro testi latini di diversa epoca e genere.

1 Introduction

Ambiguity affects linguistic analysis at various levels. In particular, homonymy plays a substantial role in the analysis of single words. Indeed, when considered out of context, one and the same word can be assigned different Parts of Speech (PoS), morphological features, lemmas and meanings. Contextual disambiguation is the task of Natural Language Processing (NLP) tools like PoS-taggers, morphological analysers, lemmatisers and word-sense disambiguators.

The problem of ambiguity is particularly relevant for NLP where Named Entity Recognition (NER) is concerned. In order to automatically classify the textual occurrences of (multi)words into categories such as names of persons, locations and organisations, NER faces the specific kind of ambiguity that consists in the homonymy between proper names and other words in the lexicon (Nadeau and Sekine, 2007). For instance, the word mark in English can be a proper name, a noun or a verb. Although such homonymy is often tackled by using the upper/lowercase distinction for the initial letter of words, this solution is neither decisive (as uppercase letters can also be motivated by punctuation) nor always available. The latter is especially true for historical languages, as a large number of texts in such languages come with no upper/lowercase distinction and may follow different editorial criteria.

The recent extension of the lexical basis of Lemlat, a morphological analyser and lemmatiser for Latin, with an Onomasticon (i.e. a list of proper names) makes it possible to evaluate the degree of homonymy of proper names in Latin and, thus, to understand the extent of the disambiguation task (Passarotti and Ruffolo, 2004). To this aim, in this paper we explore the lexical basis of Lemlat as the empirical evidence supporting our analysis of homonymy between names in the Onomasticon and words in the Latin lexicon.

2 Lemlat

Together with Morpheus (Crane, 1991) and Whitaker’s Words, Lemlat (Passarotti, 2004) is one of the most widespread tools available for the automatic analysis of Latin morphology. The original lexical basis of Lemlat (L) results from the collation of three Latin dictionaries (Georges and Georges, 1913-1918; Glare, 1982; Gradenwitz, 1904). It counts 40,014 lexical entries and 43,432 lemmas (as more than one lemma can be included in the same lexical entry). This lexical basis was recently merged with most of the Onomasticon (O) (26,250 lemmas out of 28,178) provided by the 5th edition of Lexicon Totius Latinitatis (Forcellini, 1940) (Budassi and Passarotti, 2016).
Since the large majority of lemmas in O are nouns (19,599 out of 26,250), we will focus on them here, first by comparing their distribution in L and O. Table 1 shows the number of nouns and their percentage (of the total number of nouns) in L and O by inflectional category.

    Infl. Cat.    Lemlat            Onomasticon
    I decl.       5,009 (22.26%)    6,651 (33.94%)
    II decl.      7,466 (33.17%)    7,235 (36.92%)
    III decl.     8,677 (38.54%)    4,464 (22.77%)
    IV decl.        980 (4.35%)        58 (0.29%)
    V decl.         101 (0.45%)         6 (0.03%)
    Uninfl.         278 (1.23%)     1,185 (6.05%)
    TOTAL         22,511            19,599

    Table 1. Nouns in L and O.

While third declension nouns are more frequent in L than in O, the opposite holds for first declension and (to a lesser extent) second declension nouns. The main difference between L and O concerns uninflected nouns, which are far more numerous in O than in L because of the large number of loans recorded in O.

The gender-based distribution of nouns by inflectional category also shows substantial differences between L and O. Among the most relevant is that O includes more first declension masculine nouns than L (1,626 vs. 562). Instead, the number of second declension neuter nouns is larger in L than in O (4,005 vs. 1,523), because O tends to include more proper names of persons than of places, the latter being often assigned the neuter gender. As for the third declension, feminine nouns outnumber masculine ones in L (5,112 vs. 2,590), while the opposite holds in O (2,847 masculine vs. 1,185 feminine).

3 Mining Nominal Homonymy

To categorise nominal homonymy in L and O, we defined three kinds of homonymy: (a) Full Homonymy (FH): words with the same lemma, PoS, inflectional category and gender in L and O; (b) Partial Homonymy (PH): words with the same lemma in L and O, but with different PoS, inflectional category or gender (the last for nouns only); (c) Mixed Homonymy (MH): words with the same lemma in L and O and with more than one PoS, inflectional category or gender, thus resulting partly in FH and partly in PH.

An example of FH in our data is the word spes, which means “hope” in L and “the Goddess of Hope” in O. PH is represented, for instance, by the word augustus, which is an adjective in L (“majestic”) and a noun in O (a cognomen given to Octavius Caesar as emperor). The word spina is a case of MH, being a first declension feminine noun in L (“thorn”) and both a first declension feminine noun (an old town in Aemilia) and a third declension masculine noun with genitive in –anis (a river God) in O, the former thus showing FH and the latter PH.

Table 2 presents the rates of homonymy between L and O by kind and inflectional category. The total number of homonyms is provided as well (column “H”). This corresponds to the number of nouns of an inflectional category that are graphically identical in L and O. For instance, the first row of table 2 shows that there are 556 lemmas recorded as first declension nouns (in L or O) that are identical to a lemma occurring in the other section of the lexical basis of Lemlat. Of these, 383 lemmas show FH, i.e. they share not only the same graphical form but also the same PoS, inflectional category and gender in L and O (column “FH”). Instead, 163 lemmas occur as graphically identical in L and O but differ in at least one among PoS, inflectional category and gender (column “PH”). Finally, 10 lemmas show MH.

    Infl. Cat.    H        FH       PH     MH
    I decl.       556      383      163    10
    II decl.      752      307      389    56
    III decl.     584      334      226    24
    IV decl.      85       9        73     3
    V decl.       6        5        1      0
    Uninfl.       60       0        60     0
    TOTAL         2,043    1,038    912    93

    Table 2. Kinds of homonymy.

Most of the PH instances for first declension lemmas are due to different gender. An example is the first declension noun caligula, which is feminine in L (“a small military boot”) while it is masculine in O (a cognomen). The second declension shows several cases of PoS change, as in the case of severus, which is an adjective in L (“serious”) and a noun in O (a proper name). Instead, a large number of verb-noun changes is found in the third declension. This mostly occurs for imparisyllabic nouns ending in –o, like cato, which is a first conjugation verb in L (“to see”) and a noun in O (a proper name).
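The three kinds of homonymy defined in this section can be restated as a small set comparison. The following Python fragment is only an illustrative sketch and not the way Lemlat stores or compares its entries: the `Analysis` tuple, the `classify` function and the category labels are hypothetical names introduced here. The toy entries come from the examples discussed in this section; the fifth declension of spes is standard Latin grammar rather than a statement of the paper, and the inflectional category given to the adjective augustus is a placeholder.

```python
from typing import NamedTuple, Optional, Set


class Analysis(NamedTuple):
    """One reading of a lemma: PoS, inflectional category, gender."""
    pos: str
    infl_cat: str
    gender: Optional[str]  # None when gender does not apply (e.g. adjectives)


def classify(in_l: Set[Analysis], in_o: Set[Analysis]) -> Optional[str]:
    """Return 'FH', 'PH' or 'MH' for a lemma recorded in both L and O."""
    if not in_l or not in_o:
        return None  # the lemma is not a homonym at all
    shared = in_l & in_o                 # readings identical in L and O
    unshared = (in_l | in_o) - shared    # readings found on one side only
    if shared and not unshared:
        return "FH"  # every reading matches
    if not shared:
        return "PH"  # same lemma, but no reading matches
    return "MH"      # some readings match, others do not


# Toy lexica built from the examples in the text (assumed encoding).
L = {
    "spes": {Analysis("noun", "V", "f")},
    "augustus": {Analysis("adj", "adj-placeholder", None)},
    "spina": {Analysis("noun", "I", "f")},
    "caligula": {Analysis("noun", "I", "f")},
}
O = {
    "spes": {Analysis("noun", "V", "f")},
    "augustus": {Analysis("noun", "II", "m")},
    "spina": {Analysis("noun", "I", "f"), Analysis("noun", "III", "m")},
    "caligula": {Analysis("noun", "I", "m")},
}

for lemma in sorted(L.keys() & O.keys()):
    print(lemma, classify(L[lemma], O[lemma]))
```

Under this reading of the definitions, spes comes out as FH, augustus and caligula as PH, and spina as MH, which matches the classification given in the text.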
PH does not raise any tricky issue for NLP, the task of PoS/morphological taggers being just that of contextually disambiguating PoS and morphological features. Conversely, FH (including the FH-like part of MH) represents a challenging question for NLP. Indeed, if the upper/lowercase distinction is not available in the input data, only context-based semantic properties can disambiguate between candidate lemmas affected by FH. For instance, in the clause “spes est expectatio boni” (“hope is expectation of good”, Cicero, Tusculanae, 4, 37, 80) there is nothing but semantics to help us understand that the word spes is an occurrence of the noun from L instead of the proper name from O. In order to evaluate the extent of homonymy in real texts and to understand how big the impact of FH is, we performed the experiment discussed in the next section.

4 Homonymy in Texts. A Case-study

We ran Lemlat on four Latin texts of similar size and different genre and era.¹ Table 3 shows, for each text, the total number of distinct words (column “Types”) and how many of them are analysed by the original version of Lemlat (column “Lemlat”) and by the one enhanced with the Onomasticon (column “LemlatON”).

    Text    Types    Lemlat    LemlatON    Improv.
    (1)     3,092    2,888     3,039       +151
    (2)     5,057    4,717     5,005       +288
    (3)     3,542    3,357     3,487       +130
    (4)     4,589    4,292     4,537       +245

    Table 3. Results of Lemlat(ON) on four texts.

¹ (1) Caesar, De Bello Gallico, 1 (Classical Lat., prose); (2) Virgil, Aeneid, 1 & 2 (Classical Lat., poetry); (3) Tertullian, Apologeticum (Late Lat., prose); (4) Claudian, De Raptu Proserpinae (Late Lat., poetry). All the texts were downloaded from the Perseus Digital Library (www.perseus.tufts.edu).

Besides the words analysed by LemlatON only (column “Improv.”), there is a certain degree of overlap between Lemlat and LemlatON. The words falling in this ‘grey zone’ are those that are analysed both by Lemlat and by LemlatON, as they are lemmatised both under a lemma from L and under one from O. It is among these words that those affected by homonymy are to be found.

    Text    L/O      H      FH     PH     MH
    (1)     618      405    303    88     14
    (2)     1,207    799    546    186    67
    (3)     686      486    330    120    36
    (4)     1,062    706    469    177    60

    Table 4. Overlapping and homonymy rates.

Column “L/O” in table 4 reports the number of words for each text that are analysed both by Lemlat and by LemlatON. The other columns show the homonymy rates by the kinds described in Section 3. For instance, in the text of Caesar (1) there are 618 words analysed by both versions of Lemlat (L/O). Of these, 405 share the same lemma in at least one analysis (H). This is further detailed: 303 out of 405 show FH, 88 PH and 14 MH. An example of a word analysed by both versions of the tool that does not share the same lemma in any analysis is acie, which is lemmatised under acies (“dagger”) by Lemlat (fifth declension feminine noun) and also under the proper name acius by LemlatON (second declension masculine noun). The word constantia is an example of H: it is lemmatised as a form of both lemmas consto (“to agree”; first conjugation verb) and constantia (“steadiness”; first declension feminine noun) by Lemlat, and also as a form of both proper names constantius (second declension masculine noun) and constantia (first declension feminine noun) by LemlatON. The word constantia is also an example of FH, as the analyses provided by the two versions of Lemlat that share the same lemma also have the same inflectional category and gender. PH is shown by the word crassi, which is assigned the same lemma (crassus) both by Lemlat and by LemlatON, but while it is a first class adjective in the former (“solid”), it is a second declension masculine noun in the latter (a proper name). An example of MH is the word amico, which is lemmatised under the lemma amicus (“friend”) both by Lemlat and by LemlatON. The lemma amicus is both an adjective and a second declension masculine noun in Lemlat, but only the latter analysis is shared with LemlatON, because the lemma amicus in the Onomasticon is recorded only as a noun and not also as an adjective. Thus, when the word amico is assigned PoS noun it shows FH, while when it is assigned PoS adjective it shows PH.

The proportions between the kinds of homonymy remain quite similar for all the texts.
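The proportions between the kinds of homonymy can be recomputed directly from the figures in table 4. The short Python sketch below does so; it is an illustration added for convenience, not part of the experimental pipeline, and the names `TABLE_4` and `proportions` are hypothetical. The counts are copied from table 4 and the text labels follow footnote 1.

```python
# Rows of Table 4: (L/O, H, FH, PH, MH) for each text.
TABLE_4 = {
    "(1) Caesar": (618, 405, 303, 88, 14),
    "(2) Virgil": (1207, 799, 546, 186, 67),
    "(3) Tertullian": (686, 486, 330, 120, 36),
    "(4) Claudian": (1062, 706, 469, 177, 60),
}


def proportions(row):
    """Share of homonyms in L/O and share of each kind among homonyms."""
    l_o, h, fh, ph, mh = row
    # The three kinds partition the homonyms, so they must add up to H.
    assert fh + ph + mh == h
    return {
        "H/L-O": h / l_o,
        "FH/H": fh / h,
        "PH/H": ph / h,
        "MH/H": mh / h,
    }


for text, row in TABLE_4.items():
    shares = proportions(row)
    print(text, {name: f"{value:.1%}" for name, value in shares.items()})
```

With the figures of table 4, homonyms cover roughly two thirds of the L/O words in every text, and FH accounts for about two thirds to three quarters of the homonyms.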
Words affected by H tend to be more than half of L/O; among them the large majority is affected by FH. By comparing columns “FH” and “MH” in table 4 with column “Types” in table 3, one can see that slightly more than 10% of the words of all the texts are affected by FH. This is the percentage of words whose lemmatisation cannot be disambiguated by a PoS tagger, because only semantic features are at work here to choose between candidate lemmas. For instance, if a PoS tagger assigns to one occurrence of the word constantia PoS noun and gender feminine, it cannot disambiguate between the two (fully morphologically identical) lemmas constantia provided by LemlatON.

If we focus on textual occurrences (tokens) instead of distinct words (types), the rates of FH (+MH) range between 8.44% (Caesar) and 13.19% (Virgil), as shown by table 5. This result represents the extent of the impact of FH on the texts that we used in the case-study.

    Text    Tokens    FH+MH
    (1)     8,171     690 (8.44%)
    (2)     10,045    1,325 (13.19%)
    (3)     7,317     668 (9.13%)
    (4)     6,991     797 (11.40%)

    Table 5. Token-based homonymy rates.

Most of the words showing FH can be easily disambiguated (at least, manually) according to peculiarities of single texts. For instance, the word amicitiam (from Caesar’s text) belongs to lemma amicitia both in L (“friendship”) and in O (“the Goddess of Friendship”), thus showing FH. However, it is more likely that the former is the one occurring in Caesar than the latter. Conversely, in the same text the word galli (lemma gallus) is more likely a proper name from O (“Gauls”) than a noun from L (“cock”).

5 Conclusion

We presented a study about the degree of homonymy between the lexical basis of a morphological analyser for Latin and an Onomasticon recently added to the tool. We have shown the impact of nominal homonymy on a number of Latin texts of different era and genre.

Since the analysis of many homonymous words can be disambiguated according to the features of single texts (and authors), in the near future we plan to enrich such words in Lemlat with information about their distribution in a number of manually tagged reference texts.

References

Marco Budassi and Marco Passarotti. 2016. Nomen Omen. Enhancing the Latin Morphological Analyser Lemlat with an Onomasticon. Proceedings of the 10th Workshop on Language Technology for Cultural Heritage, Social Sciences and Humanities (LaTeCH 2016). The Association for Computational Linguistics, Berlin, Germany, 90–94.

Gregory Crane. 1991. Generating and Parsing Classical Greek. Literary and Linguistic Computing, 6(4):243–245.

Egidio Forcellini. 1940. Lexicon Totius Latinitatis / ad Aeg. Forcellini lucubratum, dein a Jos. Furlanetto emendatum et auctum; nunc demum Fr. Corradini et Jos. Perin curantibus emendatius et auctius melioremque in formam redactum adjecto altera quasi parte Onomastico totius latinitatis opera et studio ejusdem Jos. Perin. Typis Seminarii, Padova.

Karl Ernst Georges and Heinrich Georges. 1913-1918. Ausführliches Lateinisch-Deutsches Handwörterbuch. Hahn, Hannover.

Peter G.W. Glare. 1982. Oxford Latin Dictionary. Oxford University Press, Oxford.

Otto Gradenwitz. 1904. Laterculi Vocum Latinarum. Hirzel, Leipzig.

David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26.

Marco Passarotti. 2004. Development and perspectives of the Latin morphological analyser LEMLAT. Linguistica Computazionale, XX-XXI:397–414.

Marco Passarotti and Paolo Ruffolo. 2004. L’utilizzo del lemmatizzatore LEMLAT per una sistematizzazione dell’omografia in latino. Euphrosyne, XXXII:99–110.