How “deep” is learning word inflection?

Franco Alberto Cardillo, Marcello Ferro, Claudia Marzi, Vito Pirrelli
Istituto di Linguistica Computazionale, ILC-CNR, Pisa (Italy)
francoalberto.cardillo@ilc.cnr.it, marcello.ferro@ilc.cnr.it, claudia.marzi@ilc.cnr.it, vito.pirrelli@ilc.cnr.it

Abstract

English. Machine learning offers two basic strategies for morphology induction: lexical segmentation and surface word relation. The first one assumes that words can be segmented into morphemes. Inducing a novel inflected form requires identification of morphemic constituents and a strategy for their recombination. The second approach dispenses with segmentation: lexical representations form part of a network of associatively related inflected forms. Production of a novel form consists in filling in one empty node in the network. Here, we present the results of a recurrent LSTM network that learns to fill in paradigm cells of incomplete verb paradigms. Although the process is not based on morpheme segmentation, the model shows sensitivity to stem selection and stem-ending boundaries.

Italiano. La letteratura offre due strategie di base per l'induzione morfologica. La prima presuppone la segmentazione delle forme lessicali in morfemi e genera parole nuove ricombinando morfemi conosciuti; la seconda si basa sulle relazioni di una forma con le altre forme del suo paradigma, e genera una parola sconosciuta riempiendo una cella vuota del paradigma. In questo articolo, presentiamo i risultati di una rete LSTM ricorrente, capace di imparare a generare nuove forme verbali a partire da forme già note non segmentate. Ciononostante, la rete acquisisce una conoscenza implicita del tema verbale e del confine con la terminazione flessionale.

1 Introduction

Morphological induction can be defined as the task of singling out morphological formatives from fully inflected word forms. These formatives are understood to be part of the morphological lexicon, where they are accessed and retrieved, to be recombined and spelled out in word production. The view requires that a word form be segmented into meaningful morphemes, each contributing a separable piece of morpho-lexical content. Typically, this holds for regularly inflected forms, as with Italian cred-ut-o 'believed' (past participle, from CREDERE), where cred- conveys the lexical meaning, and -ut-o is associated with morpho-syntactic features. A further assumption is that there always exists an underlying base form upon which all other forms are spelled out. In an irregular verb form like Italian appes-o 'hung' (from APPENDERE), however, it soon becomes difficult to separate morpholexical information (the verb stem) from morpho-syntactic information.

A different formulation of the same task assumes that the lexicon consists of fully-inflected word forms and that morphology induction is the result of finding out implicative relations between them. Unknown forms are generated by redundant analogy-based patterns between known forms, along the lines of an analogical proportion such as: rendere 'make' :: reso 'made' = appendere 'hang' :: appeso 'hung'.
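As a rough illustration of this relational strategy (a minimal Python sketch, not a procedure used in this paper), a four-part proportion can be solved mechanically by treating the alternation between the two known forms as a suffix substitution and carrying it over to the new base:

```python
def solve_proportion(a, b, c):
    """Solve a : b = c : x by suffix substitution, e.g.
    rendere : reso = appendere : x  ->  x = appeso."""
    # Longest common prefix of a and b; the remainders form the alternation.
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    a_suf, b_suf = a[i:], b[i:]
    if not c.endswith(a_suf):
        return None  # the proportion does not apply to this base
    return c[:len(c) - len(a_suf)] + b_suf

print(solve_proportion("rendere", "reso", "appendere"))  # -> appeso
```

Real analogy-based models generalise over many such patterns at once; the point here is only that no morpheme segmentation is presupposed.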
Support to this view comes from developmental psychology, where words are understood as the foundational elements of language acquisition, from which early grammar rules emerge epiphenomenally (Tomasello, 2000; Goldberg, 2003). After all, children appear to be extremely sensitive to sub-regularities holding between inflectionally-related forms (Bittner et al., 2003; Colombo et al., 2004; Dąbrowska, 2004; Orsolini and Marslen-Wilson, 1997; Orsolini et al., 1998). Further support is lent by neurobiologically inspired computer models of language, blurring the traditional dichotomy between processing and storage (Elman, 2009; Marzi et al., 2016). In particular, we will consider here the consequences of this view for word inflection by recurrent Long Short-Term Memory (LSTM) networks (Malouf, in press).

2 The cell-filling problem

To understand how word inflection can be conceptualised as a word relation task, it is useful to think of this task as a cell-filling problem (Blevins et al., 2017; Ackerman and Malouf, 2013; Ackerman et al., 2009). Inflected forms are traditionally arranged in so-called paradigms. The full paradigm of CREDERE 'believe' is a labelled set of all its inflected forms: credere, credendo, creduto, credo etc. In most cases, these forms take one and only one cell, defined as a specific combination of tense, mood, person and number features: e.g. crede, PRES_IND, 3, S. In all languages, words happen to follow a Zipfian distribution, with very few high-frequency words and a vast majority of exceedingly rare words (Blevins et al., 2017). As a result, even high-frequency paradigms happen to be attested only partially, and learners must then be able to generalise incomplete paradigmatic knowledge. This is the cell-filling problem: given a set of attested forms in a paradigm, the learner has to guess the other, missing forms in the same paradigm.

The task can be simulated by training a learning model on a number of partial paradigms, which it must then complete by generating the missing forms. Training consists of pairs associating a lemma and a paradigm cell with the corresponding inflected form. A lemma is not a form (e.g. credere), but a symbolic proxy of its lexical content (e.g. CREDERE). Word inflection consists of producing a fully inflected form given a known lemma and an empty paradigm cell.
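For concreteness, the training evidence implied by this description can be pictured as follows. This is purely an illustrative sketch of the data format (a lexeme and a paradigm cell on one side, a delimited symbol sequence on the other); the feature labels and function names are hypothetical, not the authors' actual encoding.

```python
# One training example pairs a lexeme and a paradigm cell with the
# symbol sequence of the attested form, delimited by '<' and '>'.
# None marks a feature left uninstantiated (e.g. person and number in
# the infinitive and gerund cells).
train_pairs = [
    (("CREDERE", ("PRES_IND", "3", "SG")), list("<crede>")),
    (("CREDERE", ("PRES_IND", "1", "SG")), list("<credo>")),
    (("CREDERE", ("INF", None, None)),     list("<credere>")),
    (("CREDERE", ("GERUND", None, None)),  list("<credendo>")),
]

def missing_cells(attested_cells, all_cells):
    """The cell-filling problem: the cells of a paradigm still to be guessed."""
    return [cell for cell in all_cells if cell not in attested_cells]
```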
2.1 Methods and materials

Following Malouf (in press), our LSTM network (Figure 1) is designed to take as input a lemma (e.g. CREDERE), a set of morpho-syntactic features (e.g. PRES_IND, 3, S) and a sequence of symbols (e.g. <crede>)¹, one symbol s_t at a time, and to output a probability distribution over the upcoming symbol s_{t+1} in the sequence: p(s_{t+1} | s_t, CREDERE, PRES_IND, 3, S). To produce the form <crede>, we take the start symbol '<' as s_1, use s_1 to predict s_2, then use the predicted symbol to predict s_3 and so on, until '>' is predicted. Input symbols are encoded as mutually orthogonal one-hot vectors with as many dimensions as the overall number of different symbols used to encode all inflected forms. The morpho-syntactic features of tense, person and number are given different one-hot vectors, whose dimensions equal the number of different values each feature can take.² All input vectors are encoded by trainable dense matrices, whose outputs are concatenated into the projection layer z(t), which is in turn input to a layer of LSTM blocks (Figure 1). The layer takes as input both the information in the current projection z(t) and, through its re-entrant connections, its own state at the previous time step. Recurrent LSTM blocks are known to be able to capture long-distance relations in time series of symbols (Bengio et al., 1994; Hochreiter and Schmidhuber, 1997; Jozefowicz et al., 2015), avoiding classical problems with training gradients of Simple Recurrent Networks (Jordan, 1986; Elman, 1990).

¹ '<' and '>' are respectively the start-of-word and the end-of-word symbols.
² Note that an extra dimension is added when a feature can be left uninstantiated in particular forms, as is the case with person and number features in the infinitive.

Figure 1: The network architecture. The input vector dimension is shown in brackets: lexeme (50), symbol(t) (33), tense/mood (5), person (4), number (3). Trainable dense projection matrices are shown as 1:n, and concatenation as 1:1; the concatenated projection z(t) feeds the LSTM layer, whose output y(t) predicts symbol(t+1).
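The architecture just described can be rendered as a short PyTorch sketch. This is a reconstruction from the description above and from Figure 1, not the authors' code: the class name, the projection sizes and every hyper-parameter other than the 128/256/512 hidden sizes are assumptions.

```python
import torch
import torch.nn as nn

class CellFillingLSTM(nn.Module):
    """Predicts the upcoming symbol, conditioned on lexeme and cell features."""

    def __init__(self, n_symbols, n_lexemes, n_tense_mood, n_person, n_number,
                 d_sym=20, d_lex=20, d_feat=5, hidden_size=256):
        super().__init__()
        # Trainable dense projections of the one-hot inputs (the "1:n"
        # matrices of Figure 1); projection sizes are illustrative guesses.
        self.sym_emb = nn.Embedding(n_symbols, d_sym)
        self.lex_emb = nn.Embedding(n_lexemes, d_lex)
        self.tm_emb = nn.Embedding(n_tense_mood, d_feat)
        self.pers_emb = nn.Embedding(n_person, d_feat)
        self.num_emb = nn.Embedding(n_number, d_feat)
        # The paper reports runs with 128, 256 and 512 LSTM blocks.
        self.lstm = nn.LSTM(d_sym + d_lex + 3 * d_feat, hidden_size,
                            batch_first=True)
        self.out = nn.Linear(hidden_size, n_symbols)

    def forward(self, symbols, lexeme, tense_mood, person, number, state=None):
        # symbols: (batch, T) indices of s_1 .. s_T, starting with '<'.
        T = symbols.size(1)
        # Lexeme and cell features are constant across the word, so they are
        # repeated at every time step and concatenated ("1:1") with the
        # projection of the current symbol, yielding z(t).
        ctx = torch.cat([self.lex_emb(lexeme), self.tm_emb(tense_mood),
                         self.pers_emb(person), self.num_emb(number)], dim=-1)
        z = torch.cat([self.sym_emb(symbols),
                       ctx.unsqueeze(1).expand(-1, T, -1)], dim=-1)
        y, state = self.lstm(z, state)
        # Unnormalised scores for s_{t+1} at every position t; a softmax
        # (or CrossEntropyLoss) turns them into p(s_{t+1} | s_t, lexeme, cell).
        return self.out(y), state
```

At generation time the form is decoded greedily, as described above: starting from '<', the highest-scoring symbol at each step is fed back as the next input until '>' is produced.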
We tested our model on two comparable sets of Italian and German inflected verb forms (Table 1), where paradigms are selected by sampling the fifty highest-frequency paradigms in two reference corpora (Baayen et al., 1995; Lyding et al., 2014). For both languages, a fixed set of cells was chosen from each paradigm: all present indicative forms (n=6), all past tense forms (n=6), infinitive (n=1), past participle (n=1), German present participle/Italian gerund (n=1).³ The two sets are inflectionally complex: they exhibit extensive stem allomorphy and a rich set of affixations, including circumfixation (German ge-mach-t 'made', past participle). Most importantly, the distribution of stem allomorphs is entirely accountable in terms of equivalence classes of cells, forming morphologically heterogeneous, phonologically poorly predictable, but fairly stable sub-paradigms (Pirrelli, 2000). Selection of the contextually appropriate stem allomorph for a given cell thus requires knowledge of the form of the allomorph and of its distribution within the paradigm.

³ The full data set is available at http://www.comphyslab.it/redirect/?id=clic2017_data. Each training form is administered once per epoch, and the number of epochs is a function of a "patience" threshold. Although a uniform distribution is admittedly not realistic, it increases the entropy of the cell-filling problem, to define some sort of upper bound on the complexity of the task.

language   alphabet size   max len   cells   reg / irreg paradigms   forms
German          27            13       15          16 / 34            750
Italian         21            14       15          23 / 27            750

Table 1: The German and Italian datasets.

3 Results and discussion

To meaningfully assess the relative computational difficulty of the cell-filling task, we calculated a simple baseline performance, with 695 forms of our original datasets selected for training, and 55 for testing.⁴ For this purpose, we used the baseline system for Task 1 of the CoNLL-SIGMORPHON-2017 Universal Morphological Reinflection shared task.⁵ The model changes the infinitive into its inflected forms through rewrite rules of increasing specificity: e.g. two Italian forms such as badare 'to look after' and bado 'I look after' stand in a BASE :: PRES_IND_3S relation. The most general rule changing the former into the latter is -are -> -o, but more specific rewrite rules can be extracted from the same pair: -dare -> -do, -adare -> -ado, -badare -> -bado. The algorithm then generates the PRES_IND_3S of, say, diradare 'thin out', by using the rewrite rule with the longest left-hand side matching diradare (namely -adare -> -ado). If there is no matching rule, the base is used as a default output.

⁴ Test forms were selected to constitute a benchmark for evaluation. We made sure that a representative sample of German and Italian irregulars was included for evaluation, provided that they could be generalised on the basis of the training data available.
⁵ https://github.com/sigmorphon/conll2017 (written by Mans Hulden).
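The rule-extraction and rule-application behaviour just described can be restated as a short sketch. This is a simplified re-implementation for illustration only (suffix rules, applied by longest match); the actual CoNLL-SIGMORPHON baseline script referenced in footnote 5 is more elaborate.

```python
from collections import defaultdict

def suffix_rules(base, form):
    """All suffix rewrite rules of increasing specificity for one pair, e.g.
    badare/bado -> (-are, -o), (-dare, -do), (-adare, -ado), (-badare, -bado)."""
    i = 0
    while i < min(len(base), len(form)) and base[i] == form[i]:
        i += 1  # longest common prefix
    return [(base[j:], form[j:]) for j in range(i, -1, -1)]

def build_rulebook(pairs_by_cell):
    """Collect the rewrite rules observed for each paradigm cell."""
    rulebook = defaultdict(set)
    for cell, pairs in pairs_by_cell.items():
        for base, form in pairs:
            rulebook[cell].update(suffix_rules(base, form))
    return rulebook

def inflect(base, cell, rulebook):
    """Apply the matching rule with the longest left-hand side;
    if no rule matches, return the base as a default output."""
    matches = [(lhs, rhs) for lhs, rhs in rulebook[cell] if base.endswith(lhs)]
    if not matches:
        return base
    lhs, rhs = max(matches, key=lambda rule: len(rule[0]))
    return base[:len(base) - len(lhs)] + rhs

rules = build_rulebook({"PRES_IND_3SG": [("badare", "bado")]})
print(inflect("diradare", "PRES_IND_3SG", rules))  # -> dirado
```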
The algorithm proves to be effective for regular forms in both languages (Table 2). However, per-word accuracy drops dramatically on German irregulars (0.23) and on Italian irregulars (0.5). The same table shows accuracy scores on test data obtained by running 128, 256 and 512 LSTM blocks. Each model instance was run 10 times, and overall per-word scores are averaged across repetitions.⁶

⁶ The per-word score is 1 (correct), or 0 (wrong).

German test      all    regs   irregs
CoNLL baseline   0.4    0.81   0.23
128-blocks       0.68   0.79   0.64
256-blocks       0.75   0.89   0.69
512-blocks       0.71   0.84   0.66

Italian test     all    regs   irregs
CoNLL baseline   0.65   0.9    0.5
128-blocks       0.61   0.84   0.47
256-blocks       0.63   0.83   0.51
512-blocks       0.69   0.92   0.54

Table 2: Per-word accuracy in German and Italian. Overall scores for the three word classes are averaged across 10 repetitions of each LSTM type.

The CoNLL baseline is reminiscent of Albright and Hayes' (2003) Minimal Generalization Learner, inferring Italian infinitives from first singular present indicative forms (Albright, 2002). In the present case, however, the inference goes from the infinitive (base) to other paradigm cells. The inference is much weaker in German, where stem allomorphy is more consistently distributed within each paradigm. In the Appendix, Table 3 contains a list of all German forms wrongly produced by the CoNLL baseline, together with the per-word accuracy of our models. Most wrong forms are inflected forms requiring ablaut, which turn out to be over-regularised by the CoNLL baseline (e.g. *stehtet for standet, *beginntet for begannt). It appears that, in German, a purely syntagmatic approach to word production, deriving all inflected forms from an underlying base, has a strong bias towards over-regularisation. Simply put, the orthotactic/phonotactic structure of the German stem is less criterial for stem allomorphy than the Italian one. LSTMs are considerably more robust in this respect. Memory resources allowing, they can keep track of local syntagmatic constraints as well as of more global, paradigmatic constraints, whereby all paradigmatically-related forms contribute to filling in gaps in the same paradigm. For example, knowledge that a paradigm contains a few stem allomorphs is good reason for an LSTM to produce a stem allomorph in other (empty) cells. The more systematic the distribution of stem alternants across the paradigm, the easier it is for the learner to fill in empty cells. German conjugation proves to be paradigmatically well-behaved.

An LSTM recurrent network has no information about the morphological structure of input forms. Due to the predictive nature of the production task and the LSTM re-entrant layer, however, the network develops a left-to-right sensitivity to upcoming symbols, with per-symbol accuracy being a function of the network's confidence about the next output symbol. To assess the correlation between per-symbol accuracy and "perception" of the morphological structure, we used a Linear Mixed Effects (LME) model of how well structural features of German and Italian verb forms interpolate the "average" network accuracy in producing an upcoming symbol (1 for a hit, 0 for a miss), in both training and test. The marginal plots of Figure 2 show that there is a clear structural effect of the distance to the stem-ending boundary of the symbol currently being produced, over and above the length of the input string. Besides, stems and suffixes of regulars exhibit different accuracy slopes compared with stems and suffixes of irregulars. Intuitively, production of an inflected form by an LSTM network is fairly easy at the beginning of the stem, but it soon gets more difficult when approaching the morpheme boundary, particularly with irregulars. Accuracy reaches its minimum value on the first symbol of the inflectional ending, which marks a point of structural discontinuity in an inflected verb form. From that position, accuracy starts increasing again, showing a characteristically V-shaped trend. Clearly, this trend is more apparent with test words (Figure 2, bottom), where stems and endings are recombined in novel ways. The same results hold for German. On the other hand, no evidence of structure sensitivity was found in an LME model of the baseline output for either German or Italian.

Figure 2: Marginal plots of the interaction between distance to morpheme boundary, stem/inflectional ending, inflectional regularity, stem length and suffix length (fixed effects) in an LME model fitting per-symbol accuracy by a 256-block (left) and a 512-block (right) RNN on training (top) and test (bottom) Italian data. Random effects are model repetitions and word forms. (Each panel plots accuracy against distance to boundary, with separate curves for regulars and irregulars.)
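The per-symbol analysis summarised in Figure 2 can be approximated along the following lines. This is a sketch under explicit assumptions: the hits and misses are supposed to have been logged to a table with hypothetical column names, and a single random intercept for word form stands in for the crossed random effects (model repetitions and word forms) of the paper's LME model.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-symbol log: one row per produced symbol, with
#   hit         1 if the symbol was correctly predicted, 0 otherwise
#   dist        signed distance to the stem-ending boundary
#   in_suffix   1 if the symbol belongs to the inflectional ending
#   regular     1 for regular verbs, 0 for irregulars
#   stem_len, suffix_len   lengths of the two constituents
#   word_form   the inflected form the symbol belongs to
df = pd.read_csv("per_symbol_accuracy.csv")

# Fixed effects roughly as in the caption of Figure 2; word form is used
# as a random intercept (the paper additionally crosses in model repetitions).
lme = smf.mixedlm(
    "hit ~ dist * in_suffix * regular + stem_len + suffix_len",
    data=df,
    groups=df["word_form"],
).fit()
print(lme.summary())
```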
“average” network accuracy in producing an up- References Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735– Farrell Ackerman and Robert Malouf. 2013. Morpho- 1780. logical organization: The low conditional entropy con- jecture. Language, 89(3):429–464. Michael Jordan. 1986. Serial order: A parallel dis- tributed processing approach. Technical Report 8604, Farrell Ackerman, James P. Blevins, and Robert Mal- University of California. ouf. 2009. Parts and wholes: Patterns of relatedness in complex morphological systems and why they mat- Rafal Jozefowicz, Wojciech Zaremba, and Ilya ter. In James P. Blevins and Juliette Blevins, editors, Sutskever. 2015. An empirical exploration of recurrent Analogy in Grammar: Form and Acquisition. Oxford network architectures. In Proceedings of the 32nd In- University Press. ternational Conference on Machine Learning (ICML- 15), pages 2342–2350. Adam Albright and Bruce Hayes. 2003. Rules vs. analogy in english past tenses: A computa- Verena Lyding, Egon Stemle, Claudia Borghetti, Marco tional/experimental study. Cognition, 90(2):119–161. Brunello, Sara Castagnoli, Felice Dell’ Orletta, Henrik Dittmann, Alessandro Lenci, and Vito Pirrelli. 2014. Adam Albright. 2002. Islands of reliability for regular The paisá corpus of italian web texts. Proceedings of morphology: Evidence from italian. Language, pages the 9th Web as Corpus Workshop (WaC-9)@ EACL 684–709. 2014, pages 36–43. Association for Computational Harald R. Baayen, P. Piepenbrock, and L. Gulikers, Linguistics. 1995. The CELEX Lexical Database (CD-ROM). Lin- Robert Malouf. in press. Generating morphological guistic Data Consortium, University of Pennsylvania, paradigms with a recurrent neural network. Morphol- Philadelphia, PA. ogy. Harald R. Baayen, Petar Milin, Dusica Filipović Claudia Marzi, Marcello Ferro, Franco Alberto Ðurd̄ević, Peter Hendrix, and Marco Marelli. 2011. Cardillo, and Vito Pirrelli. 2016. Effects of frequency An amorphous model for morphological processing in and regularity in an integrative model of word stor- visual comprehension based on naive discriminative age and processing. Italian Journal of Linguistics, learning. Psychological review, 118(3):438–481. 28(1):79–114. Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Margherita Orsolini and William Marslen-Wilson. 1994. Learning long-term dependencies with gradient 1997. Universals in morphological representation: Ev- descent is difficult. IEEE transactions on neural net- idence from italian. Language and Cognitive Pro- works, 5(2):157–166. cesses, 12(1):1–47. Dagmar Bittner, Wolfgang U. Dressler, and Marianne Margherita Orsolini, Rachele Fanari, and Hugo Kilani-Schoch, editors. 2003. Development of Verb Bowles. 1998. Acquiring regular and irregular inflec- Inflection in First Language Acquisition: a cross- tion in a language with verb classes. Language and linguistic perspective. Mouton de Gruyter, Berlin. cognitive processes, 13(4):425–464. James P. Blevins, Petar Milin, and Michael Ramscar. Vito Pirrelli. 2000. Paradigmi in morfologia. 2017. The zipfian paradigm cell filling problem. In Un approccio interdisciplinare alla flessione verbale Ferenc Kiefer, James P. Blevins, and Huba Bartos, ed- dell’italiano. Istituti Editoriali e Poligrafici Inter- itors, Morphological Paradigms and Functions. Brill, nazionali, Pisa. Leiden. David C Plaut and Laura M Gonnerman. 2000. 
Appendix A. Comparative test results

base form   target      CoNLL baseline   LSTM 128   LSTM 256   LSTM 512
bleiben     bliebt      blieb            0          0          0
dürfen      gedurft     gedürfen         0.2        0.2        0
sein        seiend      seind            0          0          0
müssen      gemusst     gemüssen         0.2        0.1        0.1
bestehen    bestandet   bestehtet        0.5        0.6        0.2
sprechen    spricht     sprecht          0          0.5        0.2
geben       gibt        gebt             0          0.3        0.3
sehen       siehst      sehst            0          0          0.3
tun         tatet       tut              0.3        0          0.3
stehen      standet     stehtet          0.2        0.1        0.4
fahren      fährst      fahrst           0.2        0.6        0.5
finden      fandet      findet           0.8        0.6        0.6
dürfen      darf        dürfe            0.5        0.7        0.7
fahren      fuhrst      fahrtest         0.4        0.4        0.7
beginnen    begannt     beginntet        0.6        0.9        0.8
kommen      kamst       kommst           1          1          0.8
liegen      lagt        liechtet         0.5        0.9        0.8
sehen       saht        sehtet           0.9        0.8        0.8
bringen     brachtet    brinchtet        1          1          0.9
fragen      fragtet     frugt            1          1          0.9
gehen       gingt       gehtet           0.9        1          0.9
haben       hattet      habt             1          1          0.9
nehmen      nahmt       neht             0.9        1          0.9
nennen      nanntet     nenntet          0.8        1          0.9
sagen       sagtet      sugt             1          1          0.9
tragen      trägst      tragst           0.9        0.9        0.9
bitten      baten       bitten           1          1          1
denken      dachtest    denkest          1          1          1
geben       gabst       gebst            1          1          1
scheinen    schienst    scheintest       0.8        1          1
setzen      setztet     setzet           1          1          1
sprechen    sprachst    sprechtest       0.8        1          1
werden      wurdet      werdet           0.9        1          1

Table 3: Comparative results for the 33 German verb forms that are wrongly inflected by the CoNLL baseline (third column). In most cases, forms are over-regularised. Results are ordered by increasing accuracy of the 512-block LSTM model. Accuracy scores are given per word, and averaged across repetitions of each LSTM model in the [0, 1] range: '0' means that the output is wrong in all model repetitions, '1' that it is always correct. The most accurate results are provided by the 256-block LSTM model.