How “deep” is learning word inflection?

Franco Alberto Cardillo, Marcello Ferro, Claudia Marzi, Vito Pirrelli
Istituto di Linguistica Computazionale, ILC-CNR, Pisa (Italy)
francoalberto.cardillo@ilc.cnr.it, marcello.ferro@ilc.cnr.it, claudia.marzi@ilc.cnr.it, vito.pirrelli@ilc.cnr.it

Abstract

English. Machine learning offers two basic strategies for morphology induction: lexical segmentation and surface word relation. The first one assumes that words can be segmented into morphemes. Inducing a novel inflected form requires identification of morphemic constituents and a strategy for their recombination. The second approach dispenses with segmentation: lexical representations form part of a network of associatively related inflected forms. Production of a novel form consists in filling in one empty node in the network. Here, we present the results of a recurrent LSTM network that learns to fill in paradigm cells of incomplete verb paradigms. Although the process is not based on morpheme segmentation, the model shows sensitivity to stem selection and stem-ending boundaries.

Italiano. La letteratura offre due strategie di base per l'induzione morfologica. La prima presuppone la segmentazione delle forme lessicali in morfemi e genera parole nuove ricombinando morfemi conosciuti; la seconda si basa sulle relazioni di una forma con le altre forme del suo paradigma, e genera una parola sconosciuta riempiendo una cella vuota del paradigma. In questo articolo, presentiamo i risultati di una rete LSTM ricorrente, capace di imparare a generare nuove forme verbali a partire da forme già note non segmentate. Ciononostante, la rete acquisisce una conoscenza implicita del tema verbale e del confine con la terminazione flessionale.

1 Introduction

Morphological induction can be defined as the task of singling out morphological formatives from fully inflected word forms. These formatives are understood to be part of the morphological lexicon, where they are accessed and retrieved, to be recombined and spelled out in word production. The view requires that a word form be segmented into meaningful morphemes, each contributing a separable piece of morpho-lexical content. Typically, this holds for regularly inflected forms, as with Italian cred-ut-o 'believed' (past participle, from CREDERE), where cred- conveys the lexical meaning, and -ut-o is associated with morpho-syntactic features. A further assumption is that there always exists an underlying base form upon which all other forms are spelled out. In an irregular verb form like Italian appes-o 'hung' (from APPENDERE), however, it soon becomes difficult to separate morpholexical information (the verb stem) from morpho-syntactic information.

A different formulation of the same task assumes that the lexicon consists of fully-inflected word forms and that morphology induction is the result of finding out implicative relations between them. Unknown forms are generated by redundant analogy-based patterns between known forms, along the lines of an analogical proportion such as: rendere 'make' :: reso 'made' = appendere 'hang' :: appeso 'hung'.
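As a rough illustration of this relational strategy (a minimal Python sketch, not a procedure used in this paper), a four-part proportion can be solved mechanically by treating the alternation between the two known forms as a suffix substitution and carrying it over to the new base:

```python
def solve_proportion(a, b, c):
    """Solve a : b = c : x by suffix substitution, e.g.
    rendere : reso = appendere : x  ->  x = appeso."""
    # Longest common prefix of a and b; the remainders form the alternation.
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    a_suf, b_suf = a[i:], b[i:]
    if not c.endswith(a_suf):
        return None  # the proportion does not apply to this base
    return c[:len(c) - len(a_suf)] + b_suf

print(solve_proportion("rendere", "reso", "appendere"))  # -> appeso
```

Real analogy-based models generalise over many such patterns at once; the point here is only that no morpheme segmentation is presupposed.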
Support to this view comes from developmental psychology, where words are understood as the foundational elements of language acquisition, from which early grammar rules emerge epiphenomenally (Tomasello, 2000; Goldberg, 2003). After all, children appear to be extremely sensitive to sub-regularities holding between inflectionally-related forms (Bittner et al., 2003; Colombo et al., 2004; Dąbrowska, 2004; Orsolini and Marslen-Wilson, 1997; Orsolini et al., 1998). Further support is lent by neurobiologically inspired computer models of language, blurring the traditional dichotomy between processing and storage (Elman, 2009; Marzi et al., 2016). In particular, we will consider here the consequences of this view for word inflection by recurrent Long Short-Term Memory (LSTM) networks (Malouf, in press).

2 The cell-filling problem

To understand how word inflection can be conceptualised as a word relation task, it is useful to think of this task as a cell-filling problem (Blevins et al., 2017; Ackerman and Malouf, 2013; Ackerman et al., 2009). Inflected forms are traditionally arranged in so-called paradigms. The full paradigm of CREDERE 'believe' is a labelled set of all its inflected forms: credere, credendo, creduto, credo etc. In most cases, these forms take one and only one cell, defined as a specific combination of tense, mood, person and number features: e.g. crede, PRES_IND, 3, S. In all languages, words happen to follow a Zipfian distribution, with very few high-frequency words and a vast majority of exceedingly rare words (Blevins et al., 2017). As a result, even high-frequency paradigms happen to be attested only partially, and learners must then be able to generalise incomplete paradigmatic knowledge. This is the cell-filling problem: given a set of attested forms in a paradigm, the learner has to guess the other, missing forms in the same paradigm.

The task can be simulated by training a learning model on a number of partial paradigms, which it must then complete by generating the missing forms. Training consists of pairs associating a lemma and a paradigm cell with the corresponding inflected form. A lemma is not a form (e.g. credere), but a symbolic proxy of its lexical content (e.g. CREDERE). Word inflection consists of producing a fully inflected form given a known lemma and an empty paradigm cell.
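For concreteness, the training evidence implied by this description can be pictured as follows. This is purely an illustrative sketch of the data format (a lexeme and a paradigm cell on one side, a delimited symbol sequence on the other); the feature labels and function names are hypothetical, not the authors' actual encoding.

```python
# One training example pairs a lexeme and a paradigm cell with the
# symbol sequence of the attested form, delimited by '<' and '>'.
# None marks a feature left uninstantiated (e.g. person and number in
# the infinitive and gerund cells).
train_pairs = [
    (("CREDERE", ("PRES_IND", "3", "SG")), list("<crede>")),
    (("CREDERE", ("PRES_IND", "1", "SG")), list("<credo>")),
    (("CREDERE", ("INF", None, None)),     list("<credere>")),
    (("CREDERE", ("GERUND", None, None)),  list("<credendo>")),
]

def missing_cells(attested_cells, all_cells):
    """The cell-filling problem: the cells of a paradigm still to be guessed."""
    return [cell for cell in all_cells if cell not in attested_cells]
```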
2.1 Methods and materials

Following Malouf (in press), our LSTM network (Figure 1) is designed to take as input a lemma (e.g. CREDERE), a set of morpho-syntactic features (e.g. PRES_IND, 3, S) and a sequence of symbols (e.g. <crede>)¹, one symbol s_t at a time, and to output a probability distribution over the upcoming symbol s_{t+1} in the sequence: p(s_{t+1} | s_t, CREDERE, PRES_IND, 3, S). To produce the form <crede>, we take the start symbol '<' as s_1, use s_1 to predict s_2, then use the predicted symbol to predict s_3 and so on, until '>' is predicted. Input symbols are encoded as mutually orthogonal one-hot vectors with as many dimensions as the overall number of different symbols used to encode all inflected forms. The morpho-syntactic features of tense, person and number are given different one-hot vectors, whose dimensions equal the number of different values each feature can take.² All input vectors are encoded by trainable dense matrices, whose outputs are concatenated into the projection layer z(t), which is in turn input to a layer of LSTM blocks (Figure 1). The layer takes as input both the information in the current projection z(t) and, through its re-entrant connections, its own state at the previous time step. Recurrent LSTM blocks are known to be able to capture long-distance relations in time series of symbols (Bengio et al., 1994; Hochreiter and Schmidhuber, 1997; Jozefowicz et al., 2015), avoiding classical problems with training gradients of Simple Recurrent Networks (Jordan, 1986; Elman, 1990).

¹ '<' and '>' are respectively the start-of-word and the end-of-word symbols.
² Note that an extra dimension is added when a feature can be left uninstantiated in particular forms, as is the case with person and number features in the infinitive.

Figure 1: The network architecture. The input vector dimension is shown in brackets: lexeme (50), symbol(t) (33), tense/mood (5), person (4), number (3). Trainable dense projection matrices are shown as 1:n, and concatenation as 1:1; the concatenated projection z(t) feeds the LSTM layer, whose output y(t) predicts symbol(t+1).
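The architecture just described can be rendered as a short PyTorch sketch. This is a reconstruction from the description above and from Figure 1, not the authors' code: the class name, the projection sizes and every hyper-parameter other than the 128/256/512 hidden sizes are assumptions.

```python
import torch
import torch.nn as nn

class CellFillingLSTM(nn.Module):
    """Predicts the upcoming symbol, conditioned on lexeme and cell features."""

    def __init__(self, n_symbols, n_lexemes, n_tense_mood, n_person, n_number,
                 d_sym=20, d_lex=20, d_feat=5, hidden_size=256):
        super().__init__()
        # Trainable dense projections of the one-hot inputs (the "1:n"
        # matrices of Figure 1); projection sizes are illustrative guesses.
        self.sym_emb = nn.Embedding(n_symbols, d_sym)
        self.lex_emb = nn.Embedding(n_lexemes, d_lex)
        self.tm_emb = nn.Embedding(n_tense_mood, d_feat)
        self.pers_emb = nn.Embedding(n_person, d_feat)
        self.num_emb = nn.Embedding(n_number, d_feat)
        # The paper reports runs with 128, 256 and 512 LSTM blocks.
        self.lstm = nn.LSTM(d_sym + d_lex + 3 * d_feat, hidden_size,
                            batch_first=True)
        self.out = nn.Linear(hidden_size, n_symbols)

    def forward(self, symbols, lexeme, tense_mood, person, number, state=None):
        # symbols: (batch, T) indices of s_1 .. s_T, starting with '<'.
        T = symbols.size(1)
        # Lexeme and cell features are constant across the word, so they are
        # repeated at every time step and concatenated ("1:1") with the
        # projection of the current symbol, yielding z(t).
        ctx = torch.cat([self.lex_emb(lexeme), self.tm_emb(tense_mood),
                         self.pers_emb(person), self.num_emb(number)], dim=-1)
        z = torch.cat([self.sym_emb(symbols),
                       ctx.unsqueeze(1).expand(-1, T, -1)], dim=-1)
        y, state = self.lstm(z, state)
        # Unnormalised scores for s_{t+1} at every position t; a softmax
        # (or CrossEntropyLoss) turns them into p(s_{t+1} | s_t, lexeme, cell).
        return self.out(y), state
```

At generation time the form is decoded greedily, as described above: starting from '<', the highest-scoring symbol at each step is fed back as the next input until '>' is produced.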
We tested our model on two comparable sets of Italian and German inflected verb forms (Table 1), where paradigms are selected by sampling the fifty highest-frequency paradigms in two reference corpora (Baayen et al., 1995; Lyding et al., 2014). For both languages, a fixed set of cells was chosen from each paradigm: all present indicative forms (n=6), all past tense forms (n=6), infinitive (n=1), past participle (n=1), German present participle/Italian gerund (n=1).³ The two sets are inflectionally complex: they exhibit extensive stem allomorphy and a rich set of affixations, including circumfixation (German ge-mach-t 'made', past participle). Most importantly, the distribution of stem allomorphs is entirely accountable in terms of equivalence classes of cells, forming morphologically heterogeneous, phonologically poorly predictable, but fairly stable sub-paradigms (Pirrelli, 2000). Selection of the contextually appropriate stem allomorph for a given cell thus requires knowledge of the form of the allomorph and of its distribution within the paradigm.

³ The full data set is available at http://www.comphyslab.it/redirect/?id=clic2017_data. Each training form is administered once per epoch, and the number of epochs is a function of a "patience" threshold. Although a uniform distribution is admittedly not realistic, it increases the entropy of the cell-filling problem, to define some sort of upper bound on the complexity of the task.

language   alphabet size   max len   cells   reg / irreg paradigms   forms
German          27            13       15          16 / 34            750
Italian         21            14       15          23 / 27            750

Table 1: The German and Italian datasets.

3 Results and discussion

To meaningfully assess the relative computational difficulty of the cell-filling task, we calculated a simple baseline performance, with 695 forms of our original datasets selected for training, and 55 for testing.⁴ For this purpose, we used the baseline system for Task 1 of the CoNLL-SIGMORPHON-2017 Universal Morphological Reinflection shared task.⁵ The model changes the infinitive into its inflected forms through rewrite rules of increasing specificity: e.g. two Italian forms such as badare 'to look after' and bado 'I look after' stand in a BASE :: PRES_IND_3S relation. The most general rule changing the former into the latter is -are -> -o, but more specific rewrite rules can be extracted from the same pair: -dare -> -do, -adare -> -ado, -badare -> -bado. The algorithm then generates the PRES_IND_3S of, say, diradare 'thin out', by using the rewrite rule with the longest left-hand side matching diradare (namely -adare -> -ado). If there is no matching rule, the base is used as a default output.

⁴ Test forms were selected to constitute a benchmark for evaluation. We made sure that a representative sample of German and Italian irregulars was included for evaluation, provided that they could be generalised on the basis of the training data available.
⁵ https://github.com/sigmorphon/conll2017 (written by Mans Hulden).
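The rule-extraction and rule-application behaviour just described can be restated as a short sketch. This is a simplified re-implementation for illustration only (suffix rules, applied by longest match); the actual CoNLL-SIGMORPHON baseline script referenced in footnote 5 is more elaborate.

```python
from collections import defaultdict

def suffix_rules(base, form):
    """All suffix rewrite rules of increasing specificity for one pair, e.g.
    badare/bado -> (-are, -o), (-dare, -do), (-adare, -ado), (-badare, -bado)."""
    i = 0
    while i < min(len(base), len(form)) and base[i] == form[i]:
        i += 1  # longest common prefix
    return [(base[j:], form[j:]) for j in range(i, -1, -1)]

def build_rulebook(pairs_by_cell):
    """Collect the rewrite rules observed for each paradigm cell."""
    rulebook = defaultdict(set)
    for cell, pairs in pairs_by_cell.items():
        for base, form in pairs:
            rulebook[cell].update(suffix_rules(base, form))
    return rulebook

def inflect(base, cell, rulebook):
    """Apply the matching rule with the longest left-hand side;
    if no rule matches, return the base as a default output."""
    matches = [(lhs, rhs) for lhs, rhs in rulebook[cell] if base.endswith(lhs)]
    if not matches:
        return base
    lhs, rhs = max(matches, key=lambda rule: len(rule[0]))
    return base[:len(base) - len(lhs)] + rhs

rules = build_rulebook({"PRES_IND_3SG": [("badare", "bado")]})
print(inflect("diradare", "PRES_IND_3SG", rules))  # -> dirado
```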
The algorithm proves to be effective for regular forms in both languages (Table 2). However, per-word accuracy drops dramatically on German irregulars (0.23) and on Italian irregulars (0.5). The same table shows accuracy scores on test data obtained by running 128, 256 and 512 LSTM blocks. Each model instance was run 10 times, and overall per-word scores are averaged across repetitions.⁶

⁶ The per-word score is 1 (correct), or 0 (wrong).

German test      all    regs   irregs
CoNLL baseline   0.4    0.81   0.23
128-blocks       0.68   0.79   0.64
256-blocks       0.75   0.89   0.69
512-blocks       0.71   0.84   0.66

Italian test     all    regs   irregs
CoNLL baseline   0.65   0.9    0.5
128-blocks       0.61   0.84   0.47
256-blocks       0.63   0.83   0.51
512-blocks       0.69   0.92   0.54

Table 2: Per-word accuracy in German and Italian. Overall scores for the three word classes are averaged across 10 repetitions of each LSTM type.

The CoNLL baseline is reminiscent of Albright and Hayes' (2003) Minimal Generalization Learner, inferring Italian infinitives from first singular present indicative forms (Albright, 2002). In the present case, however, the inference goes from the infinitive (base) to other paradigm cells. The inference is much weaker in German, where stem allomorphy is more consistently distributed within each paradigm. In the Appendix, Table 3 contains a list of all German forms wrongly produced by the CoNLL baseline, together with the per-word accuracy of our models. Most wrong forms are inflected forms requiring ablaut, which turn out to be over-regularised by the CoNLL baseline (e.g. *stehtet for standet, *beginntet for begannt). It appears that, in German, a purely syntagmatic approach to word production, deriving all inflected forms from an underlying base, has a strong bias towards over-regularisation. Simply put, the orthotactic/phonotactic structure of the German stem is less criterial for stem allomorphy than the Italian one. LSTMs are considerably more robust in this respect. Memory resources allowing, they can keep track of local syntagmatic constraints as well as of more global, paradigmatic constraints, whereby all paradigmatically-related forms contribute to filling in gaps in the same paradigm. For example, knowledge that a paradigm contains a few stem allomorphs is good reason for an LSTM to produce a stem allomorph in other (empty) cells. The more systematic the distribution of stem alternants across the paradigm, the easier it is for the learner to fill in empty cells. German conjugation proves to be paradigmatically well-behaved.

An LSTM recurrent network has no information about the morphological structure of input forms. Due to the predictive nature of the production task and the LSTM re-entrant layer, however, the network develops a left-to-right sensitivity to upcoming symbols, with per-symbol accuracy being a function of the network's confidence about the next output symbol. To assess the correlation between per-symbol accuracy and "perception" of the morphological structure, we used a Linear Mixed Effects (LME) model of how well structural features of German and Italian verb forms interpolate the "average" network accuracy in producing an upcoming symbol (1 for a hit, 0 for a miss), in both training and test. The marginal plots of Figure 2 show that there is a clear structural effect of the distance to the stem-ending boundary of the symbol currently being produced, over and above the length of the input string. Besides, stems and suffixes of regulars exhibit different accuracy slopes compared with stems and suffixes of irregulars. Intuitively, production of an inflected form by an LSTM network is fairly easy at the beginning of the stem, but it soon gets more difficult when approaching the morpheme boundary, particularly with irregulars. Accuracy reaches its minimum value on the first symbol of the inflectional ending, which marks a point of structural discontinuity in an inflected verb form. From that position, accuracy starts increasing again, showing a characteristically V-shaped trend. Clearly, this trend is more apparent with test words (Figure 2, bottom), where stems and endings are recombined in novel ways. The same results hold for German. On the other hand, no evidence of structure sensitivity was found in an LME model of the baseline output for either German or Italian.

Figure 2: Marginal plots of the interaction between distance to morpheme boundary, stem/inflectional ending, inflectional regularity, stem length and suffix length (fixed effects) in an LME model fitting per-symbol accuracy by a 256-block (left) and a 512-block (right) RNN on training (top) and test (bottom) Italian data. Random effects are model repetitions and word forms. (Each panel plots accuracy against distance to boundary, with separate curves for regulars and irregulars.)
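The per-symbol analysis summarised in Figure 2 can be approximated along the following lines. This is a sketch under explicit assumptions: the hits and misses are supposed to have been logged to a table with hypothetical column names, and a single random intercept for word form stands in for the crossed random effects (model repetitions and word forms) of the paper's LME model.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-symbol log: one row per produced symbol, with
#   hit         1 if the symbol was correctly predicted, 0 otherwise
#   dist        signed distance to the stem-ending boundary
#   in_suffix   1 if the symbol belongs to the inflectional ending
#   regular     1 for regular verbs, 0 for irregulars
#   stem_len, suffix_len   lengths of the two constituents
#   word_form   the inflected form the symbol belongs to
df = pd.read_csv("per_symbol_accuracy.csv")

# Fixed effects roughly as in the caption of Figure 2; word form is used
# as a random intercept (the paper additionally crosses in model repetitions).
lme = smf.mixedlm(
    "hit ~ dist * in_suffix * regular + stem_len + suffix_len",
    data=df,
    groups=df["word_form"],
).fit()
print(lme.summary())
```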
“average” network accuracy in producing an up- References Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735– Farrell Ackerman and Robert Malouf. 2013. Morpho- 1780. logical organization: The low conditional entropy con- jecture. Language, 89(3):429–464. Michael Jordan. 1986. Serial order: A parallel dis- tributed processing approach. Technical Report 8604, Farrell Ackerman, James P. Blevins, and Robert Mal- University of California. ouf. 2009. Parts and wholes: Patterns of relatedness in complex morphological systems and why they mat- Rafal Jozefowicz, Wojciech Zaremba, and Ilya ter. In James P. Blevins and Juliette Blevins, editors, Sutskever. 2015. An empirical exploration of recurrent Analogy in Grammar: Form and Acquisition. Oxford network architectures. In Proceedings of the 32nd In- University Press. ternational Conference on Machine Learning (ICML- 15), pages 2342–2350. Adam Albright and Bruce Hayes. 2003. Rules vs. analogy in english past tenses: A computa- Verena Lyding, Egon Stemle, Claudia Borghetti, Marco tional/experimental study. Cognition, 90(2):119–161. Brunello, Sara Castagnoli, Felice Dell’ Orletta, Henrik Dittmann, Alessandro Lenci, and Vito Pirrelli. 2014. Adam Albright. 2002. Islands of reliability for regular The paisá corpus of italian web texts. Proceedings of morphology: Evidence from italian. Language, pages the 9th Web as Corpus Workshop (WaC-9)@ EACL 684–709. 2014, pages 36–43. Association for Computational Harald R. Baayen, P. Piepenbrock, and L. Gulikers, Linguistics. 1995. The CELEX Lexical Database (CD-ROM). Lin- Robert Malouf. in press. Generating morphological guistic Data Consortium, University of Pennsylvania, paradigms with a recurrent neural network. Morphol- Philadelphia, PA. ogy. Harald R. Baayen, Petar Milin, Dusica Filipović Claudia Marzi, Marcello Ferro, Franco Alberto Ðurd̄ević, Peter Hendrix, and Marco Marelli. 2011. Cardillo, and Vito Pirrelli. 2016. Effects of frequency An amorphous model for morphological processing in and regularity in an integrative model of word stor- visual comprehension based on naive discriminative age and processing. Italian Journal of Linguistics, learning. Psychological review, 118(3):438–481. 28(1):79–114. Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Margherita Orsolini and William Marslen-Wilson. 1994. Learning long-term dependencies with gradient 1997. Universals in morphological representation: Ev- descent is difficult. IEEE transactions on neural net- idence from italian. Language and Cognitive Pro- works, 5(2):157–166. cesses, 12(1):1–47. Dagmar Bittner, Wolfgang U. Dressler, and Marianne Margherita Orsolini, Rachele Fanari, and Hugo Kilani-Schoch, editors. 2003. Development of Verb Bowles. 1998. Acquiring regular and irregular inflec- Inflection in First Language Acquisition: a cross- tion in a language with verb classes. Language and linguistic perspective. Mouton de Gruyter, Berlin. cognitive processes, 13(4):425–464. James P. Blevins, Petar Milin, and Michael Ramscar. Vito Pirrelli. 2000. Paradigmi in morfologia. 2017. The zipfian paradigm cell filling problem. In Un approccio interdisciplinare alla flessione verbale Ferenc Kiefer, James P. Blevins, and Huba Bartos, ed- dell’italiano. Istituti Editoriali e Poligrafici Inter- itors, Morphological Paradigms and Functions. Brill, nazionali, Pisa. Leiden. David C Plaut and Laura M Gonnerman. 2000. 
Appendix A. Comparative test results

base form   target      CoNLL baseline   LSTM 128   LSTM 256   LSTM 512
bleiben     bliebt      blieb            0          0          0
dürfen      gedurft     gedürfen         0.2        0.2        0
sein        seiend      seind            0          0          0
müssen      gemusst     gemüssen         0.2        0.1        0.1
bestehen    bestandet   bestehtet        0.5        0.6        0.2
sprechen    spricht     sprecht          0          0.5        0.2
geben       gibt        gebt             0          0.3        0.3
sehen       siehst      sehst            0          0          0.3
tun         tatet       tut              0.3        0          0.3
stehen      standet     stehtet          0.2        0.1        0.4
fahren      fährst      fahrst           0.2        0.6        0.5
finden      fandet      findet           0.8        0.6        0.6
dürfen      darf        dürfe            0.5        0.7        0.7
fahren      fuhrst      fahrtest         0.4        0.4        0.7
beginnen    begannt     beginntet        0.6        0.9        0.8
kommen      kamst       kommst           1          1          0.8
liegen      lagt        liechtet         0.5        0.9        0.8
sehen       saht        sehtet           0.9        0.8        0.8
bringen     brachtet    brinchtet        1          1          0.9
fragen      fragtet     frugt            1          1          0.9
gehen       gingt       gehtet           0.9        1          0.9
haben       hattet      habt             1          1          0.9
nehmen      nahmt       neht             0.9        1          0.9
nennen      nanntet     nenntet          0.8        1          0.9
sagen       sagtet      sugt             1          1          0.9
tragen      trägst      tragst           0.9        0.9        0.9
bitten      baten       bitten           1          1          1
denken      dachtest    denkest          1          1          1
geben       gabst       gebst            1          1          1
scheinen    schienst    scheintest       0.8        1          1
setzen      setztet     setzet           1          1          1
sprechen    sprachst    sprechtest       0.8        1          1
werden      wurdet      werdet           0.9        1          1

Table 3: Comparative results for the 33 German verb forms that are wrongly inflected by the CoNLL baseline (third column). In most cases, forms are over-regularised. Results are ordered by increasing accuracy of the 512-block LSTM model. Accuracy scores are given per word, and averaged across repetitions of each LSTM model in the [0, 1] range: '0' means that the output is wrong in all model repetitions, '1' that it is always correct. The most accurate results are provided by the 256-block LSTM model.