May the Goddess of Hope Help Us. Homonymy in Latin Lexicon and Onomasticon

Marco Passarotti
CIRCSE Research Centre
Università Cattolica del Sacro Cuore
Largo Gemelli, 1 – 20123 Milan, Italy
marco.passarotti@unicatt.it

Marco Budassi
Università di Pavia
Corso Strada Nuova, 65
27100 Pavia, Italy
marcobudassi@hotmail.it

Abstract

We present a study on the degree of homonymy between the lexicon of a morphological analyser for Latin and an Onomasticon. To understand the impact of homonymy, we discuss an experiment on four Latin texts of different era and genre.

1 Introduction

Ambiguity affects linguistic analysis at various levels. In particular, homonymy plays a substantial role in the analysis of single words. Indeed, when considered out of context, the same word can be assigned different Parts of Speech (PoS), morphological features, lemmas and meanings. Contextual disambiguation is the task of Natural Language Processing (NLP) tools like PoS-taggers, morphological analysers, lemmatisers and word-sense disambiguators.

The problem of ambiguity is particularly relevant for NLP when Named Entity Recognition (NER) is concerned. In order to automatically classify the textual occurrences of (multi)words into categories such as names of persons, locations and organisations, NER faces the specific kind of ambiguity that consists in the homonymy between proper names and other words in the lexicon (Nadeau and Sekine, 2007). For instance, the word mark in English can be a proper name, a noun or a verb. Although such homonymy is often tackled by using the upper/lowercase distinction for the initial letter of words, this solution is neither decisive (as uppercase letters can also be motivated by punctuation) nor always available. The latter is especially true for historical languages, as a large amount of texts in such languages comes with no upper/lowercase distinction, and capitalisation may follow different editorial criteria.

The recent extension of the lexical basis of the morphological analyser and lemmatiser for Latin Lemlat with an Onomasticon (i.e. a list of proper names) makes it possible to evaluate the degree of homonymy of proper names in Latin and, thus, to understand the extent of the disambiguation task (Passarotti and Ruffolo, 2004). To this aim, in this paper we explore the lexical basis of Lemlat, which provides the empirical evidence supporting our analysis of homonymy between names in the Onomasticon and words in the Latin lexicon.

2 Lemlat

Together with Morpheus (Crane, 1991) and Whitaker’s Words, Lemlat (Passarotti, 2004) is one of the most widespread tools available for the automatic analysis of Latin morphology. The original lexical basis of Lemlat (L) results from the collation of three Latin dictionaries (Georges and Georges, 1913-1918; Glare, 1982; Gradenwitz, 1904). It counts 40,014 lexical entries and 43,432 lemmas (as more than one lemma can be included in the same lexical entry). This lexical basis was recently merged with most of the Onomasticon (O) (26,250 lemmas out of 28,178) provided by the 5th edition of Lexicon Totius Latinitatis (Forcellini, 1940) (Budassi and Passarotti, 2016).

Since the large majority of lemmas in O are nouns (19,599 out of 26,250), we will focus on them here, first by comparing their distribution in L and O. Table 1 shows the number of nouns and their percentage (on the total of nouns) in L and O by inflectional category.

Infl. Cat.   Lemlat            Onomasticon
I decl.      5,009 (22.26%)    6,651 (33.94%)
II decl.     7,466 (33.17%)    7,235 (36.92%)
III decl.    8,677 (38.54%)    4,464 (22.77%)
IV decl.       980 (4.35%)        58 (0.29%)
V decl.        101 (0.45%)         6 (0.03%)
Uninfl.        278 (1.23%)     1,185 (6.05%)
TOTAL       22,511            19,599

Table 1. Nouns in L and O.

While third declension nouns are more frequent in L than in O, the opposite holds for first declension and (to a lesser extent) second declension nouns. The main difference between L and O concerns uninflected nouns, which are far more numerous in O than in L because of the large number of loans recorded in O.

The gender-based distribution of nouns by inflectional category also shows substantial differences between L and O. Among the most relevant is that O includes more first declension masculine nouns than L (1,626 vs. 562). Instead, the number of second declension neuter nouns is larger in L than in O (4,005 vs. 1,523), because O tends to include more proper names of persons than of places, the latter being often assigned the neuter gender. As for third declension, feminine nouns outnumber masculine ones in L (5,112 vs. 2,590), while the opposite holds in O (2,847 masculine vs. 1,185 feminine).

3 Mining Nominal Homonymy

To categorise nominal homonymy in L and O, we defined three kinds of homonymy:

(a) Full Homonymy (FH): words with the same lemma, PoS, inflectional category and gender in L and O;
(b) Partial Homonymy (PH): words with the same lemma in L and O, but with different PoS, inflectional category or gender (the last for nouns only);
(c) Mixed Homonymy (MH): words with the same lemma in L and O and with more than one PoS, inflectional category or gender, thus resulting partly in FH and partly in PH.

An example of FH in our data is the word spes, which means “hope” in L and “the Goddess of Hope” in O. PH is represented, for instance, by the word augustus, which is an adjective in L (“majestic”) and a noun in O (a cognomen given to Octavius Caesar as emperor). The word spina is a case of MH, being a first declension feminine noun in L (“thorn”) and both a first declension feminine noun (an old town in Aemilia) and a third declension masculine noun with genitive in -anis (a river God) in O, the former thus showing FH and the latter PH.

Table 2 presents the rates of homonymy in L and O by each kind per inflectional category. The total number of homonyms is provided as well (column “H”). This corresponds to the number of nouns of an inflectional category that are graphically identical in L and O. For instance, the first row of table 2 shows that there are 556 lemmas recorded as first declension nouns (in L or O) that are identical to a lemma occurring in the other section of the lexical basis of Lemlat. 383 of these lemmas show FH, i.e. they share not only the same graphical form but also the same PoS, inflectional category and gender in L and O (column “FH”). Instead, 163 lemmas occur as graphically identical in L and O but differ in at least one among PoS, inflectional category or gender (column “PH”). Finally, 10 lemmas show MH.

Infl. Cat.   H       FH      PH    MH
I decl.        556     383   163   10
II decl.       752     307   389   56
III decl.      584     334   226   24
IV decl.        85       9    73    3
V decl.          6       5     1    0
Uninfl.         60       0    60    0
TOTAL        2,043   1,038   912   93

Table 2. Kinds of homonymy.

Most of the PH instances for first declension lemmas are due to different gender. An example is the first declension noun caligula, which is feminine in L (“a small military boot”) while it is masculine in O (a cognomen). Second declension shows several cases of PoS change, as in the case of severus, which is an adjective in L (“serious”) and a noun in O (a proper name). Instead, a large number of verb-noun changes holds for third declension. This mostly occurs for imparisyllabic nouns ending in -o, like cato, which is a first conjugation verb in L (“to see”) and a noun in O (a proper name).

PH does not raise any tricky issue for NLP, the task of PoS/morphological taggers being just that of contextually disambiguating PoS and morphological features. Conversely, FH (including the FH-like part of MH) represents a challenging question for NLP. Indeed, if the upper/lowercase distinction is not available in input data, only context-based semantic properties can disambiguate between candidate lemmas affected by FH. For instance, in the clause “spes est expectatio boni” (“hope is expectation of good”, Cicero, Tusculanae, 4, 37, 80) there is nothing but semantics to help us understand that the word spes is an occurrence of the noun from L instead of the proper name from O. In order to evaluate the extent of homonymy in real texts and to understand how large the impact of FH is, we performed the experiment discussed in the next section.

4 Homonymy in Texts. A Case-study

We ran Lemlat on four Latin texts of similar size and different genre and era.¹ Table 3 shows the number of distinct words out of the total (column “Types”) analysed by the original version of Lemlat (column “Lemlat”) and by the one enhanced with the Onomasticon (column “LemlatON”).

¹ (1) Caesar, De Bello Gallico, 1 (Classical Lat., prose); (2) Virgil, Aeneid, 1 & 2 (Classical Lat., poetry); (3) Tertullian, Apologeticum (Late Lat., prose); (4) Claudian, De Raptu Proserpinae (Late Lat., poetry). All the texts were downloaded from the Perseus Digital Library (www.perseus.tufts.edu).

Text   Types   Lemlat   LemlatON   Improv.
(1)    3,092   2,888    3,039      +151
(2)    5,057   4,717    5,005      +288
(3)    3,542   3,357    3,487      +130
(4)    4,589   4,292    4,537      +245

Table 3. Results of Lemlat(ON) on four texts.

Besides the words analysed by LemlatON only (column “Improv.”), there is a certain degree of overlap between Lemlat and LemlatON. The words falling in this ‘grey zone’ are those that are analysed both by Lemlat and by LemlatON, as they are lemmatised both under a lemma from L and under one from O. Among these words, those affected by homonymy are to be found.

Text   L/O     H     FH    PH    MH
(1)      618   405   303    88   14
(2)    1,207   799   546   186   67
(3)      686   486   330   120   36
(4)    1,062   706   469   177   60

Table 4. Overlapping and homonymy rates.

Column “L/O” in table 4 reports the number of words for each text that are analysed both by Lemlat and by LemlatON. The other columns show the homonymy rates by the kinds described in Section 3. For instance, in the text of Caesar (1) there are 618 words analysed by both versions of Lemlat (L/O). 405 of them share the same lemma in at least one analysis (H). This is further detailed: 303 out of 405 show FH, 88 PH and 14 MH.

An example of a word analysed by both versions of the tool that does not share the same lemma in any analysis is acie, which is lemmatised under acies (“dagger”) by Lemlat (fifth declension feminine noun) and also under the proper name acius by LemlatON (second declension masculine noun). The word constantia is an example of H: it is lemmatised as a form of both lemmas consto (“to agree”; first conjugation verb) and constantia (“steadiness”; first declension feminine noun) by Lemlat, and also as a form of both proper names constantius (second declension masculine noun) and constantia (first declension feminine noun) by LemlatON. The word constantia is also an example of FH, as the analyses provided by the two versions of Lemlat that share the same lemma also have the same inflectional category and gender. PH is shown by the word crassi, which is assigned the same lemma (crassus) both by Lemlat and by LemlatON, but while it is a first class adjective in the former (“solid”), it is a second declension masculine noun in the latter (a proper name). An example of MH is the word amico, which is lemmatised under the lemma amicus (“friend”) both by Lemlat and by LemlatON. The lemma amicus is both an adjective and a second declension masculine noun in Lemlat, but only the latter analysis is shared with LemlatON, because the lemma amicus in the Onomasticon is recorded only as a noun and not also as an adjective. Thus, when the word amico is assigned PoS noun it shows FH, while when it is assigned PoS adjective it shows PH.

The proportions between the kinds of homonymy remain quite similar for all the texts. Words affected by H tend to be more than half of L/O; among them the large majority is affected by FH. By comparing columns “FH” and “MH” in table 4 with column “Types” in table 3, one can see that slightly more than 10% of the words of each text are affected by FH. This is the percentage of words whose lemmatisation cannot be disambiguated by a PoS tagger, because only semantic features can choose between the candidate lemmas. For instance, if a PoS tagger assigns to one occurrence of the word constantia PoS noun and gender feminine, it cannot disambiguate between the two (fully morphologically identical) lemmas constantia provided by LemlatON.

If we focus on textual occurrences (tokens) instead of distinct words (types), the rates of FH (+MH) range between 8.44% (Caesar) and 13.19% (Virgil), as shown by table 5. This result represents the extent of the impact of FH on the texts that we used in the case-study.

Text   Tokens   FH+MH
(1)     8,171     690 (8.44%)
(2)    10,045   1,325 (13.19%)
(3)     7,317     668 (9.13%)
(4)     6,991     797 (11.40%)

Table 5. Token-based homonymy rates.

Most of the words showing FH can be easily disambiguated (at least, manually) according to peculiarities of single texts. For instance, the word amicitiam (from Caesar’s text) belongs to lemma amicitia both in L (“friendship”) and in O (“the Goddess of Friendship”), thus showing FH. However, it is more likely that the former is the one occurring in Caesar than the latter. Conversely, in the same text the word galli (lemma gallus) is more likely a proper name from O (“Gauls”) than a noun from L (“cock”).

5 Conclusion

We presented a study on the degree of homonymy between the lexical basis of a morphological analyser for Latin and an Onomasticon recently added to the tool. We have shown the impact of nominal homonymy on a number of Latin texts of different era and genre. Since many homonymous words can be disambiguated according to the features of single texts (and authors), in the near future we plan to enhance such words in Lemlat with information about their distribution in a number of manually tagged reference texts.

References

Marco Budassi and Marco Passarotti. 2016. Nomen Omen. Enhancing the Latin Morphological Analyser Lemlat with an Onomasticon. Proceedings of the 10th Workshop on Language Technology for Cultural Heritage, Social Sciences and Humanities (LaTeCH 2016). The Association for Computational Linguistics, Berlin, Germany, 90–94.

Gregory Crane. 1991. Generating and Parsing Classical Greek. Literary and Linguistic Computing, 6(4):243–245.

Egidio Forcellini. 1940. Lexicon Totius Latinitatis / ad Aeg. Forcellini lucubratum, dein a Jos. Furlanetto emendatum et auctum; nunc demum Fr. Corradini et Jos. Perin curantibus emendatius et auctius melioremque in formam redactum adjecto altera quasi parte Onomastico totius latinitatis opera et studio ejusdem Jos. Perin. Typis Seminarii, Padova.

Karl Ernst Georges and Heinrich Georges. 1913-1918. Ausführliches Lateinisch-Deutsches Handwörterbuch. Hahn, Hannover.

Peter G.W. Glare. 1982. Oxford Latin Dictionary. Oxford University Press, Oxford.

Otto Gradenwitz. 1904. Laterculi Vocum Latinarum. Hirzel, Leipzig.

David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26.

Marco Passarotti. 2004. Development and perspectives of the Latin morphological analyser LEMLAT. Linguistica Computazionale, XX-XXI:397–414.

Marco Passarotti and Paolo Ruffolo. 2004. L’utilizzo del lemmatizzatore LEMLAT per una sistematizzazione dell’omografia in latino. Euphrosyne, XXXII:99–110.
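
As a supplementary illustration, the three kinds of homonymy defined in Section 3 (FH, PH, MH) can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the procedure implemented in Lemlat: the Entry record, the signature and classify_homonymy functions, and the inflectional category labels are all hypothetical names introduced here. The sketch assumes that the entries passed in already share the same graphical lemma, and it compares the sets of (PoS, inflectional category, gender) combinations recorded for that lemma in L and in O.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Entry:
    """One lemma as recorded in L or O (hypothetical structure)."""
    lemma: str
    pos: str                      # e.g. "noun", "adjective", "verb"
    infl_cat: str                 # e.g. "I decl." (labels are assumptions)
    gender: Optional[str] = None  # relevant for nouns only, None otherwise


def signature(entry: Entry) -> tuple:
    """Features that must coincide for two entries to count as FH."""
    return (entry.pos, entry.infl_cat, entry.gender)


def classify_homonymy(l_entries: Iterable[Entry],
                      o_entries: Iterable[Entry]) -> Optional[str]:
    """Classify a lemma form found in L and O as 'FH', 'PH' or 'MH'.

    Assumes all entries share the same graphical lemma.
    Returns None when the lemma is missing from L or from O.
    """
    l_sigs = {signature(e) for e in l_entries}
    o_sigs = {signature(e) for e in o_entries}
    if not l_sigs or not o_sigs:
        return None   # no homonymy between L and O
    if not (l_sigs & o_sigs):
        return "PH"   # same lemma, no analysis in common
    if l_sigs == o_sigs:
        return "FH"   # every analysis is shared
    return "MH"       # partly shared, partly not


# Examples taken from Section 3 (feature labels are assumptions).
spes_l = [Entry("spes", "noun", "V decl.", "f")]
spes_o = [Entry("spes", "noun", "V decl.", "f")]
augustus_l = [Entry("augustus", "adjective", "I class")]
augustus_o = [Entry("augustus", "noun", "II decl.", "m")]
spina_l = [Entry("spina", "noun", "I decl.", "f")]
spina_o = [Entry("spina", "noun", "I decl.", "f"),
           Entry("spina", "noun", "III decl.", "m")]

print(classify_homonymy(spes_l, spes_o))          # FH
print(classify_homonymy(augustus_l, augustus_o))  # PH
print(classify_homonymy(spina_l, spina_o))        # MH
```

Comparing sets of feature combinations, rather than single entries, is a design choice of this sketch: it lets a lemma with several analyses on either side (such as spina or amicus) fall naturally into MH when only some of them coincide.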