
(Stem and Word) Predictability in Italian verb paradigms: An Entropy-Based Study Exploiting the New Resource LeFFI

Matteo Pellegrini
Liceo Statale “Augusto Monti” di Chieri, Italy

Alessandra Teresa Cignarella
Dipartimento di Informatica, Università degli Studi di Torino, Italy
PRHLT Research Center, Universitat Politècnica de València, Spain
cigna@di.unito.it

English. In this paper we present LeFFI, an inflected lexicon of Italian listing all the available wordforms of 2,053 verbs. We then use this resource to perform an entropy-based analysis of the mutual predictability of wordforms within Italian verb paradigms, and compare our findings with those of previous work on stem predictability in Italian verb inflection.

1 Introduction

The pioneering work of Aronoff (1994) has inspired an influential line of research where predictability within inflectional paradigms is modelled by resorting to the notion of morphomic stems, i.e., stems that cannot be considered as bearing any meaning, as they appear in groups of cells that do not share a fixed morphosyntactic content. In this perspective, every lexeme is seen as equipped with a set of indexed stems, which are mutually predictable only for regular lexemes, while for irregular verbs they need to be stored independently. From each of these stems, a fixed set of wordforms can be obtained by adding the appropriate inflectional endings. An analysis relying on these assumptions was proposed by Maiden (1992) and subsequent work (see Maiden (2018) for a recent survey) to account for the patterns of stem allomorphy that are found in the verbal inflection of Romance languages in general. More detailed implementations of these ideas have then been provided for individual languages, among them Italian (Pirrelli and Battista, 2000; Montermini and Boyé, 2012; Montermini and Bonami, 2013). Another possibility that has been explored more recently is to tackle the issue of inflectional predictability in terms of predictions of wordforms from one another, without assuming a given segmentation in stems vs. endings, in a fully word-based, abstractive (Blevins, 2016) approach. Within this framework, Ackerman et al. (2009) propose to estimate the reliability of inflectional predictions by means of the information-theoretic notion of conditional entropy. Building on this work, Bonami and Boyé (2014) outline a procedure that makes it possible to compute entropy values estimating the uncertainty in predicting one cell from another directly from a lexicon of fully inflected wordforms in phonological transcription, using the type frequency of different inflectional patterns to estimate their probability of application. This method has been applied to French by Bonami and Boyé (2014) and to Latin by Pellegrini (2020), and it has been used for typological comparison on a small sample of languages by Beniamine (2018), who also provides a freely available toolkit (Qumin) that performs this computation automatically for any language.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

A similar entropy-based analysis has not yet been proposed for Italian. To perform it with the Qumin toolkit, it is necessary to have an inflected lexicon listing all the wordforms of a representative number of lexemes in phonological transcription, such as Flexique for French (Bonami et al., 2014) or LatInfLexi for Latin (Pellegrini and Passarotti, 2018). Looking for such a resource for Italian, we can see that in most lexicons wordforms are given in orthographic transcription: see e.g. Morph-it! (Zanchetta and Baroni, 2005) and CoLFIS (Bertinetto et al., 2005). On the other hand, PhonItalia (Goslin et al., 2014) provides phonological transcriptions, but does not list all the inflected wordforms of each lexeme. To the best of our knowledge, the only resource providing phonological transcriptions of the full paradigm of lexemes is GLAFF-IT (Calderone et al., 2017), but due to the way in which it was created, it proves to be too noisy to be used as such for entropy computations.

In this paper, we describe the work that was done to obtain a smaller, but cleaner version of GLAFF-IT. We then use this resource to perform an entropy-based analysis of predictability in Italian verb inflection. After briefly describing the methodology, we present our results, comparing them with the findings of previous stem-based analyses.

2 The Resource

To build LeFFI (Lessico delle Forme Flesse dell’Italiano), we first consulted GLAFF-IT, a free machine-readable dictionary based on Wikizionario, the Italian language edition of Wiktionary. It is a morphophonological Italian lexicon containing a total of 485,135 wordforms across verbs, nouns, adjectives and adverbs, in both orthographic and phonological (IPA) transcription. Since our interest in the present research lies only in verbs, in this step a total of 411,770 verbal forms in phonological transcription were extracted from GLAFF-IT, together with the citation form (the infinitive) of the lexeme they belong to, thus resulting in a list of the complete paradigms of 7,552 verbs. To indicate the morphosyntactic properties expressed by each wordform, we use the notation of the Leipzig Glossing Rules (Comrie et al., 2008), both in our resource and in the examples shown in this paper.

Due to the large amount of manual work needed to obtain our resource, for the time being we have decided to focus only on a fraction of this list. So as not to lose quantitatively relevant data, our selection was based on the frequency of lexemes, as reported in the CoLFIS frequency lexicon. We thus cross-referenced the list of 7,552 verbs extracted from GLAFF-IT with the 5,193 verbal lexemes contained in CoLFIS, and kept only those with a frequency higher than 10. The resulting dataset, listing the 53 available non-periphrastic cells of 2,053 verbs, is still large enough to allow for reasonably safe generalizations on Italian verb inflection.
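The selection step described above can be sketched as follows. This is only a minimal illustration, not the script actually used to build LeFFI: the dictionaries `glaffit_paradigms` and `colfis_frequency`, their toy contents, and the function name `select_lexemes` are hypothetical; only the criterion (a lexeme attested in both resources with a CoLFIS frequency higher than 10) comes from the text.

```python
def select_lexemes(glaffit_paradigms, colfis_frequency, min_frequency=10):
    """Keep lexemes found in both resources whose frequency exceeds the cutoff."""
    return {
        lexeme: forms
        for lexeme, forms in glaffit_paradigms.items()
        # A lexeme missing from the frequency lexicon counts as frequency 0.
        if colfis_frequency.get(lexeme, 0) > min_frequency
    }


# Toy data, invented for illustration (not actual GLAFF-IT or CoLFIS values).
glaffit_paradigms = {
    "amare": {"GER": "amˈando", "PRS.IND.2PL": "amˈate"},
    "vedere": {"GER": "vedˈendo", "PRS.IND.2PL": "vedˈete"},
    "raro": {"GER": "rarˈando", "PRS.IND.2PL": "rarˈate"},  # made-up rare verb
}
colfis_frequency = {"amare": 500, "vedere": 900, "raro": 2}

selected = select_lexemes(glaffit_paradigms, colfis_frequency)
print(sorted(selected))
```

Note that the comparison is strict, so a lexeme with a frequency of exactly 10 would be excluded, in line with "higher than 10".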

After these automatic steps, several manual changes were made to obtain the current version of our resource. Firstly, it should be noted that many of the phonological transcriptions provided by GLAFF-IT are obtained automatically from the orthographic form. In some cases, however, it is not possible to infer a precise phonological transcription from orthography alone, because some graphemes can correspond to different phonemes. In such cases, the phonological transcriptions provided by GLAFF-IT are underspecified: for instance, the symbol E is used for the grapheme ⟨e⟩, which can correspond to /e/ or /ɛ/, and similarly O for ⟨o⟩ (/o/ or /ɔ/), S for ⟨s⟩ (/s/ or /z/), Z for ⟨z⟩ (/ts/ or /dz/). While we have manually mapped ⟨s⟩, ⟨z⟩ and a few other marginal ambiguous graphemes to the actual phonemes they correspond to, for ⟨e⟩ and ⟨o⟩ we have decided to keep the same neutralization as in GLAFF-IT. This choice is due to the fact that manually disambiguating all cases to reflect the actual pronunciation in the standard variety of Italian would have been very time-consuming, but it is also justified by the fact that in many varieties (including the northern ones of the authors) these distinctions are not made.

Another systematic correction concerns the placement of stress, which for many wordforms was assigned automatically in GLAFF-IT and sometimes turns out not to be in the right place: for instance, in many third-plural forms, the stress is incorrectly placed on the penultimate (e.g. PRS.IND.3PL /diventˈano/ ‘they become’, /okkupˈano/ ‘they occupy’), while in our resource we move it to the (pre)antepenultimate (e.g. /divˈentano/, /ˈokkupano/). While in other cases it was possible to correct stress position automatically, by moving the stress to the syllable where it is systematically placed (e.g. the antepenultimate in forms like PRET.IND.3PL /fˈeʧero/ ‘they did’), in this case, since there are two alternatives, the changes had to be made semi-automatically, by automatically moving the stress to the antepenultimate and then manually moving it to the preantepenultimate whenever needed.
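The automatic half of this semi-automatic correction could look roughly like the sketch below. It is an illustration under simplifying assumptions, not the procedure actually used for LeFFI: the function name `move_stress` is hypothetical, the stress mark is assumed to be written immediately before the stressed vowel, and every vowel symbol (including the underspecified E and O) is treated as a separate syllable nucleus, with glides assumed to be transcribed as consonants.

```python
STRESS = "ˈ"
VOWELS = set("aeiouEO")  # E and O: underspecified mid vowels (assumption)


def move_stress(form, from_end=3):
    """Place the stress mark before the n-th vowel counted from the end.

    from_end=3 targets the antepenultimate, from_end=4 the preantepenultimate.
    """
    bare = form.replace(STRESS, "")
    vowel_positions = [i for i, seg in enumerate(bare) if seg in VOWELS]
    if len(vowel_positions) < from_end:
        return form  # too few vowels: leave the form untouched
    target = vowel_positions[-from_end]
    return bare[:target] + STRESS + bare[target:]


# Automatic pass: stress moved to the antepenultimate.
print(move_stress("diventˈano"))
print(move_stress("okkupˈano"))
# Where the antepenultimate is still wrong, a second (manually triggered)
# call moves the stress one syllable further to the left.
print(move_stress("okkupˈano", from_end=4))
```

Splitting the work this way mirrors the text: the default call handles the common case, and only forms like /ˈokkupano/ need to be identified by hand.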

For cells containing more than one wordform, we keep only one of the cell-mates. Wherever possible, we have used Thornton’s (2008) description of overabundance in Italian verb inflection to select the less marginal variant (e.g., keeping /dˈevo/ rather than /dˈebbo/ in the PRS.IND.1SG of DOVERE ‘must’).

Several other one-off corrections were made manually to the data of GLAFF-IT, yielding the current version of our resource, which is clean enough to support an entropy-based analysis shedding light on the patterns of interpredictability between wordforms in Italian verb paradigms.

3 The Method

The Qumin toolkit computes implicative entropy values estimating the uncertainty in predicting each paradigm cell assuming knowledge of one (or more than one) wordform, following the procedure described in Beniamine (2018). Here, we illustrate the methodology using the data given in Table 1.

Table 1

lexeme           conj.   GER            PRS.IND.2PL
AMARE ‘love’     1st     /amˈando/      /amˈate/
VEDERE ‘see’     2nd     /vedˈendo/     /vedˈete/
SENTIRE ‘hear’   3rd     /sentˈendo/    /sentˈite/

The first step of the procedure consists in classifying verbs according to the patterns of formal alternation between wordforms, and the phonological context in which such alternations are attested. As is shown in the second column of Table 2, 1st and 2nd conjugation verbs display the same pattern (1), while 3rd conjugation verbs use another pattern (2). The second step is another classification based on the patterns that can potentially be applied to GER to obtain PRS.IND.2PL. As can be seen in the third column of Table 2, verbs of the 2nd and 3rd conjugation are in the same class (B), because patterns 1 and 2 can potentially be applied to a GER ending in /endo/, while only pattern 1 can be applied to 1st conjugation verbs with GER in /ando/. Entropy is then computed for each of the classes of this second classification, weighting the probability of application of different patterns by means of their type frequency in the data, i.e., the number of verbs in which they are attested: here, data from LeFFI are given in the last column of Table 2.

Table 2

lexeme    pattern/context (GER ⇌ PRS.IND.2PL)   applicable patterns   n. verbs
AMARE     1 (_ndo ⇌ _te / V_#)                  A (1)                 1,505
VEDERE    1 (_ndo ⇌ _te / V_#)                  B (1,2)               320
SENTIRE   2 (_endo ⇌ _ite / C_#)                B (1,2)               215

As is shown in Equation 1, there is no uncertainty in class A: given a GER in /ando/, PRS.IND.2PL cannot but be in /ate/. On the other hand, given a GER in /endo/, PRS.IND.2PL can be in /ete/ (applying pattern 1) or in /ite/ (applying pattern 2). As a consequence, there is some uncertainty in this case. The entropy values of the different classes are then weighted, again on the basis of type frequency, and summed into a single entropy value that estimates the overall uncertainty in predicting PRS.IND.2PL from GER in Italian verbs.
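The per-class and overall values discussed here can be recomputed from the type frequencies in Table 2. The sketch below is a hand-rolled illustration of the arithmetic, not a call to the Qumin toolkit: the names `entropy`, `classes`, `class_entropy` and `overall` are hypothetical, and the computation is restricted to the 2,040 verbs covered by the three rows of Table 2, so it is not expected to match exactly the value obtained on the full lexicon.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as type frequencies."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


# Type frequencies from Table 2, grouped by class of applicable patterns.
classes = {
    "A": {"pattern 1": 1505},                    # GER in /ando/
    "B": {"pattern 1": 320, "pattern 2": 215},   # GER in /endo/
}

# Step 1: uncertainty within each class.
class_entropy = {
    name: entropy(list(patterns.values())) for name, patterns in classes.items()
}

# Step 2: weight each class by its share of verbs and sum.
n_verbs = sum(sum(patterns.values()) for patterns in classes.values())
overall = sum(
    (sum(patterns.values()) / n_verbs) * class_entropy[name]
    for name, patterns in classes.items()
)

print(class_entropy)
print(round(overall, 3))
```

With these counts, class A has zero entropy, class B comes out at roughly 0.97 bits, and the weighted sum is roughly 0.25 bits, since only about a quarter of the verbs considered fall into the uncertain class.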

4 Results

Given the data of LeFFI as input, the Qumin toolkit outputs an entropy-based distance matrix over all the cells of Italian verb paradigms. We do not show it here for reasons of space, as it comprises 53 rows and columns, but we use its values to map the paradigm into zones of full interpredictability, where two cells A, B are conflated in the same zone if they can be predicted from one another with no uncertainty, i.e. if H(A|B) = H(B|A) = 0. The outcome of this grouping is given in Table 3.

Table 3

       FUT.IND   PRS.COND   PRS.SBJV   PRS.IND   IPRF.IND   IPRF.SBJV   PRET.IND   IMP
1SG    Z6        Z6         Z2         Z12       Z9         Z1          Z5         -
2SG    Z6        Z6         Z2         Z14       Z9         Z1          Z1         Z3
3SG    Z6        Z6         Z2         Z15       Z9         Z1          Z5         -
1PL    Z6        Z6         Z4         Z4        Z9         Z1          Z1         Z4
2PL    Z6        Z6         Z4         Z10       Z9         Z1          Z1         Z10
3PL    Z6        Z6         Z2         Z13       Z9         Z1          Z5         -

PST.PTCP   M.SG Z7   F.SG Z7   M.PL Z7   F.PL Z7
PRS.PTCP   SG Z11    PL Z11
GER        Z11
INF        Z8

[...] are based on the same stem. For this comparison, we refer to Montermini and Bonami (2013), where the most recent version of the stem-based mapping is provided. In their description, 8 stems are identified, while our word-based mapping is composed of 15 zones. In particular, Z1, Z9, Z10 and Z11 of our mapping correspond to the zones including cells that are based on the same stem S1 in Montermini and Bonami’s (2013) analysis: this is why they are all colored with different shades of red in Table 3. Similarly, our Z2, Z12 and Z13 (different shades of blue) include cells based on Montermini and Bonami’s (2013) S2, and our Z3, Z14 and Z15 (different shades of green) include cells based on their S3. As for the other zones of our mapping, there is a one-to-one correspondence with the stems identified by Montermini and Bonami (2013).

The discrepancies between the two approaches are mostly due to two different reasons: (i) the presence of a few highly irregular verbs (namely ANDARE ‘to go’, AVERE ‘to have’, DARE ‘to give’, DIRE ‘to say’, ESSERE ‘to be’, FARE ‘to do’, SAPERE ‘to know’, and STARE ‘to stay’) that are not accounted for by Montermini and Bonami’s (2013) analysis, but are included in our dataset and, therefore, in our entropy-based analysis; (ii) more systematic opacities of some wordforms, which are poorly informative about the conjugation of lexemes.

As an example of case (i), PRS.IND.2PL and IPRF.IND.3SG can almost always be predicted from one another by replacing the final segments /te/ with /va/, or vice versa: e.g. AMARE (PRS.IND.2PL /amˈate/, IPRF.IND.3SG /amˈava/) and SENTIRE (PRS.IND.2PL /sentˈite/, IPRF.IND.3SG /sentˈiva/). However, this generalization does not hold for a handful of highly irregular verbs, as is exemplified by DIRE ‘say’, with PRS.IND.2PL /dˈite/ but IPRF.IND.3SG /diʧˈeva/. Of course, the picture is different depending on the presence of such irregular verbs in the data. If they are excluded, as in Montermini and Bonami (2013), the two cells can be considered as based on the same stem (S1) and, thus, as being fully interpredictable. If they are included, as happens in our data, the two cells have to be assigned to different zones, since there is some uncertainty in predicting the cells from one another. However, entropy is very low in such cases, thanks to the weighting based on type frequency (see the corresponding values in Table 4). It should be noted that the lexemes that are not considered by Montermini and Bonami (2013) because of their irregularity are among the verbs with the highest token frequency in Italian (all ranking among the first 13 positions in CoLFIS). This makes their exclusion less worrisome, as the irregular formal patterns they display can plausibly be considered as being learned by rote. Nevertheless, our entropy-based picture can be considered as achieving a higher level of granularity in the description.

As an example of case (ii), PRS.IND.2SG and PRS.IND.3SG are in the same zone in Montermini and Bonami (2013), because they are both considered as obtained from S3: in particular, PRS.IND.3SG is identical to S3, while to obtain PRS.IND.2SG the final vowel of S3 has to be replaced by /i/. In both cases, knowing the shape of S3 is sufficient to infer the cell without any uncertainty. However, in our word-based perspective there is uncertainty when guessing PRS.IND.3SG from PRS.IND.2SG: the latter always ends in /i/ (e.g. AMARE /ˈami/, VEDERE /vˈedi/), neutralizing the distinction between verbs of different conjugations and thus making it impossible to discriminate between 1st conjugation verbs with S3 and PRS.IND.3SG in /a/ (e.g. AMARE /ˈama/) and 2nd and 3rd conjugation verbs with S3 and PRS.IND.3SG in /e/ (e.g. VEDERE /vˈede/).

These examples show that our method makes it possible to identify sources of uncertainty that are downplayed in the stem-based picture, either because of their quantitative marginality, as in case (i), or because they are obscured by the use of an abstract stem that is not always inferable from the shape of the single wordform used as predictor, as in case (ii).

However, it should be noted that the possibility of more exhaustive stem spaces accounting for all the formal variation of Italian verb inflection, without excluding highly irregular verbs (thus corresponding to our case (i)), was already acknowledged in the works cited above: see e.g. Pirrelli and Battista (2000, Footnote 16) and Montermini and Bonami (2013, Footnote 9). Indeed, there is a trade-off between the number of zones into which the paradigm is split, on the one hand, and the coverage of the identified zones with respect to the whole lexicon, on the other. In the stem-based mapping, the choice is to keep the number of zones low, at the (minimal) cost of not accounting for a handful of irregular verbs. Conversely, in the word-based mapping that we adopt in the present paper, the higher number of zones is compensated by complete coverage of the lexicon. How many of these zones are actually identified and learned by speakers is an empirical matter that should be tackled by means of psycholinguistic experiments. What is important to keep in mind, however, is that this gap between the two approaches can be filled, either by drawing the stem space in such a way that it also covers irregular verbs, or by reducing the number of zones in the word-based analysis, gradually collapsing zones of interpredictability for increasing values of implicative entropy. For instance, if the criterion for two cells to be assigned to the same zone is for them to be predictable from one another with an implicative entropy value lower than 0.01, rather than 0, then Z3, Z13 and Z15 can be merged into a single zone. If the threshold is set at 0.02, Z1 and Z9 can also be conflated in the same zone, to which Z7 can be added with the threshold set at 0.03.
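The grouping of cells into zones, and its relaxation through a threshold, can be sketched as below. This is an illustrative reconstruction, not the code used to produce Table 3: the function name `interpredictability_zones`, the abstract cell labels and all entropy values are invented for the example. One design assumption should be stressed: zones are built as connected components, so with a non-zero threshold two cells may end up in the same zone through a chain of intermediate cells even if they are not within the threshold of each other.

```python
def interpredictability_zones(cells, entropy, threshold=0.0):
    """Group cells that predict each other (in both directions) into zones.

    entropy[(a, b)] is read as H(a | b). With threshold 0 the criterion is
    H = 0 in both directions; otherwise H must be lower than the threshold.
    """
    def predictable(h):
        return h == 0 if threshold == 0 else h < threshold

    parent = {cell: cell for cell in cells}  # union-find forest

    def find(cell):
        while parent[cell] != cell:
            parent[cell] = parent[parent[cell]]  # path compression
            cell = parent[cell]
        return cell

    for i, a in enumerate(cells):
        for b in cells[i + 1:]:
            if predictable(entropy[(a, b)]) and predictable(entropy[(b, a)]):
                parent[find(a)] = find(b)

    zones = {}
    for cell in cells:
        zones.setdefault(find(cell), []).append(cell)
    return sorted(zones.values())


# Toy matrix with invented values (not taken from LeFFI).
cells = ["A", "B", "C", "D"]
toy_entropy = {(a, b): 0.4 for a in cells for b in cells if a != b}
toy_entropy.update({
    ("A", "B"): 0.0, ("B", "A"): 0.0,      # fully interpredictable
    ("C", "D"): 0.005, ("D", "C"): 0.008,  # nearly interpredictable
})

strict_zones = interpredictability_zones(cells, toy_entropy)
relaxed_zones = interpredictability_zones(cells, toy_entropy, threshold=0.01)
print(strict_zones)
print(relaxed_zones)
```

In the toy data, the strict criterion keeps C and D apart, while a threshold of 0.01 merges them, which is the same kind of effect as the merging of zones described in the text.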

On the other hand, the discrepancy between the two approaches generated by more systematic, but unidirectional opacities such as the one described above in (ii) could be avoided if, in the entropy-based mapping, we decided that having null entropy in one direction would be a sufficient criterion for two cells to be assigned to the same zone, i.e., two cells belong to the same zone if either H(A|B) = 0 or H(B|A) = 0.

5 Conclusions

In this paper, we have presented LeFFI, an inflected lexicon of Italian verbs. We have then exploited it to investigate predictability in Italian verb inflection, using implicative entropy to estimate the uncertainty in predicting wordforms from one another. The results have been used to obtain a mapping of the paradigm into zones of interpredictability, which we have compared to the mapping of stems proposed in previous work, showing that our word-based procedure is capable of capturing aspects that are downplayed, if not ignored, in the stem-based approach.

Besides their theoretical interest, both the resource and the information-theoretic approach potentially have more practical applications, for instance in the field of psycholinguistics. The resource provides a very clean but sufficiently large dataset of forms that can be used as a source of input for fine-grained experiments. In such experiments, it would be possible to test whether the different levels of predictability between cells identified by different values of implicative entropy find a correspondence in the process of acquisition of inflectional morphology by L1 and L2 speakers, i.e., whether the pairs of cells with higher implicative entropy values are indeed the ones on which learners are more uncertain. More generally, our entropy-based evaluation of uncertainty in inflectional predictions can be considered as a measure of (at least one aspect of) morphological complexity, which can also be used in other areas, for instance to assess text readability.

6 Availability of Data and Tools

The data and tools used in this study are freely available online, allowing for an easy replication of the presented results. LeFFI can be found in the following repository: https://github.com/matteo-pellegrini/LeFFI. The Qumin toolkit that was used to automatically perform entropy computations can be freely downloaded at: https://github.com/XachaB/Qumin.

References

Farrell Ackerman, James P. Blevins, and Robert Malouf. 2009. Parts and wholes: Patterns of relatedness in complex morphological systems and why they matter. In James P. Blevins and Juliette Blevins, editors, Analogy in grammar: Form and acquisition, pages 54-82. Oxford University Press, Oxford.

Mark Aronoff. 1994. Morphology by itself: Stems and inflectional classes. MIT Press, Cambridge.

Sacha Beniamine. 2018. Classifications flexionnelles. Étude quantitative des structures de paradigmes. Ph.D. thesis, Université Sorbonne Paris Cité-Université Paris Diderot.

Pier Marco Bertinetto, Cristina Burani, Alessandro Laudanna, Lucia Marconi, Daniela Ratti, Claudia Rolando, and Anna Maria Thornton. 2005. CoLFIS (Corpus e Lessico di Frequenza dell'Italiano Scritto).

James P. Blevins. 2016. Word and paradigm morphology. Oxford University Press, Oxford.

Olivier Bonami and Gilles Boyé. 2014. De formes en thèmes. In Florence Villoing, Sarah Leroy, and Sophie David, editors, Foisonnements morphologiques. Études en hommage à Françoise Kerleroux, pages 17-45. Presses Universitaires de Paris-Ouest, Paris.

Olivier Bonami, Gauthier Caron, and Clément Plancq. 2014. Construction d'un lexique flexionnel phonétisé libre du français. In Congrès Mondial de Linguistique Française - CMLF 2014, volume 8, pages 2583-2596. EDP Sciences.

Basilio Calderone, Matteo Pascoli, Franck Sajous, and Nabil Hathout. 2017. Hybrid method for stress prediction applied to GLAFF-IT, a large-scale Italian lexicon. In International Conference on Language, Data and Knowledge, pages 26-41, Cham. Springer.

Bernard Comrie, Martin Haspelmath, and Balthasar Bickel. 2008. The Leipzig Glossing Rules: Conventions for interlinear morpheme-by-morpheme glosses. Department of Linguistics of the Max Planck Institute for Evolutionary Anthropology and the Department of Linguistics of the University of Leipzig.

Jeremy Goslin, Claudia Galluzzi, and Cristina Romani. 2014. PhonItalia: a phonological lexicon for Italian. Behavior Research Methods, 46(3):872-886.

Martin Maiden. 1992. Irregularity as a determinant of morphological change. Journal of Linguistics, 28(2):285-312.

Martin Maiden. 2018. The Romance verb: Morphomic structure and diachrony. Oxford University Press, Oxford.

Fabio Montermini and Olivier Bonami. 2013. Stem spaces and predictability in verbal inflection. Lingue e linguaggio, 12(2):171-190.

Fabio Montermini and Gilles Boyé. 2012. Stem relations and inflection class assignment in Italian. Word Structure, 5(1):69-87.

Matteo Pellegrini and Marco Passarotti. 2018. LatInfLexi: an inflected lexicon of Latin verbs. In Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018).

Matteo Pellegrini. 2020. Using LatInfLexi for an Entropy-Based Assessment of Predictability in Latin Inflection. In Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages, pages 37-46.

Vito Pirrelli and Marco Battista. 2000. The paradigmatic dimension of stem allomorphy in Italian verb inflection. Italian Journal of Linguistics, 12(2):307-380.

Gregory Stump and Raphael A. Finkel. 2013. Morphological typology: From word to paradigm. Cambridge University Press, Cambridge.

Anna Thornton. 2008. A non-canonical phenomenon in Italian verb morphology: double forms realizing the same cell. Poster presented at OxMorph1, Oxford.

Eros Zanchetta and Marco Baroni. 2005. Morph-it!: A free corpus-based morphological resource for the Italian language. Proceedings of Corpus Linguistics.